
UTF-8 decoding problem across multiple reads
Open, Normal, Public


When a UTF-8 character is split across multiple read()s, it is not decoded
correctly. This causes random screen corruption problems.

Execute: echo -ne '\0303\0241\0303'; sleep 1; echo -e '\0251'
Expected output: áé
Actual output: �

Gnome-terminal used to have the same issue a long time ago:

cedric created this task. Dec 12 2013, 7:11 PM
cedric updated the task description.
cedric raised the priority of this task from to Incoming Queue.
cedric added a project: Terminology.
cedric added a subscriber: cedric.
raster triaged this task as Normal priority. Dec 14 2013, 1:09 AM
raster added a subscriber: raster.
egmont added a subscriber: egmont. Dec 18 2013, 1:35 AM

Here's a screenshot of mc; you'll often (about 1 time in 10) get a similar result if you resize mc to slightly larger than the default window size.

I have eina 1.7. I realized that different code is executed with 1.8, so I changed the conditional (termpty.c:225) to use the first branch, and changed eina_unicode_utf8_next_get() to eina_unicode_utf8_get_next() to make it compile. I couldn't figure out whether e…u…u…next_get() and e…u…u…get_next() are the same or not. With that change the bug reported above is gone, but studying the code reveals further issues.

This one's functional:

egmont@foo:~$ echo -ne '\0241abc'
egmont@foo:~$ echo -ne '\0241abcd'

'abc' is swallowed in the first case, whereas the second command's output appears properly.

This one's timing only:

echo -ne '\0241'; sleep 1; echo -ne '\0241'

Expected: A replacement symbol should appear, and one second later another replacement symbol should appear.
Actual: Nothing for a second, and then two replacement symbols.

I'm afraid it wouldn't be hard to find other similar ones.

The UTF-8 decoder is apparently built around a function designed to decode UTF-8 strings whose full length is known up front. Decoding a stream is more complicated: you need to distinguish between an incomplete sequence (invalid on its own, but a valid prefix) and a totally invalid one. A decoder built around a function that cannot provide this information is bound to be faulty: it will either fail to report errors as soon as it could, or behave differently depending on which bytes happen to arrive in a single read.
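
To make the incomplete-versus-invalid distinction concrete, here is a minimal sketch of a decoder that keeps its state between reads. It is written in C only because Terminology is; the names utf8_stream and utf8_stream_feed are made up for illustration and are not part of eina or Terminology, and emitting one U+FFFD per broken sequence is just one possible error policy.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_REPLACEMENT 0xFFFDu

/* Hypothetical decoder state; it survives across read() calls. */
typedef struct {
    uint32_t cp;   /* code point bits collected so far */
    uint32_t min;  /* smallest value valid for this length (rejects overlongs) */
    int need;      /* continuation bytes still expected; 0 means idle */
} utf8_stream;

/* Feed one chunk of bytes. Decoded code points are written to out, which
 * must have room for len + 1 entries. Returns the number written. An
 * incomplete trailing sequence stays pending in *s and emits nothing yet. */
size_t utf8_stream_feed(utf8_stream *s, const unsigned char *buf,
                        size_t len, uint32_t *out)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        unsigned char b = buf[i];

        if (s->need > 0) {
            if ((b & 0xC0) == 0x80) {
                /* Valid continuation byte: accumulate six more bits. */
                s->cp = (s->cp << 6) | (uint32_t)(b & 0x3F);
                if (--s->need == 0) {
                    int bad = s->cp < s->min || s->cp > 0x10FFFFu ||
                              (s->cp >= 0xD800u && s->cp <= 0xDFFFu);
                    out[n++] = bad ? UTF8_REPLACEMENT : s->cp;
                }
                continue;
            }
            /* The pending sequence was cut short by a non-continuation
             * byte, so it is now known to be invalid. Report it, then
             * treat b as the start of something new. */
            out[n++] = UTF8_REPLACEMENT;
            s->need = 0;
        }

        if (b < 0x80) {
            out[n++] = b;
        } else if ((b & 0xE0) == 0xC0) {
            s->cp = b & 0x1F; s->min = 0x80;    s->need = 1;
        } else if ((b & 0xF0) == 0xE0) {
            s->cp = b & 0x0F; s->min = 0x800;   s->need = 2;
        } else if ((b & 0xF8) == 0xF0) {
            s->cp = b & 0x07; s->min = 0x10000; s->need = 3;
        } else {
            /* Stray continuation byte or impossible lead byte: this can
             * never become valid, so report it immediately. */
            out[n++] = UTF8_REPLACEMENT;
        }
    }
    return n;
}
```

Starting from a zero-initialized utf8_stream, feeding the bytes of the original report in two chunks (C3 A1 C3, then A9) yields U+00E1 after the first chunk and U+00E9 after the second. A lone A1 yields U+FFFD right away instead of waiting for more input, and A1 followed by "abc" yields U+FFFD and then all three letters.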

billiob claimed this task. Dec 26 2013, 2:33 PM
raster removed a project: efl. May 19 2014, 1:13 AM
This comment was removed by godfath3r.
billiob removed billiob as the assignee of this task. Jan 29 2015, 3:27 PM
billiob changed the visibility from "All Users" to "Public (No Login Required)". Dec 27 2015, 3:00 PM

I don't see the swallowing you describe:

11:03AM ~ > echo -ne '\0241abc'
11:03AM ~ > echo -ne '\0241abcd'
11:04AM ~ >

Resurrecting this thread years later. The issue is still present with current master, using the libs from Ubuntu Cosmic (e.g. eina-1.20.7).

@raster The "%" sign at the end of the output suggests that you have the standard trick in place that makes sure there's a newline even if the output didn't end in one. Probably this trick changes how the decoder behaves.

Here's a more self-contained example that I think should work for everyone:

egmont@foo:~ $ echo -ne '\0241foo'; echo bar
egmont@foo:~ $ echo -ne '\0241foo'; sleep 1; echo bar
egmont@foo:~ $

Also note that when copy-pasting that inverted question mark, I get the low surrogate U+DCA1 (I've manually replaced them with U+FFFD for filing this comment). I don't think that's correct either; I think the UTF-8 decoder should emit the replacement character rather than surrogates on invalid individual bytes.
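
For what it's worth, the numbers fit a simple escape scheme: the byte in question is octal 0241 = 0xA1, and 0xDC00 + 0xA1 = 0xDCA1. The sketch below contrasts that mapping with the one proposed here. The function names are hypothetical, and the assumption that the decoder adds the byte to 0xDC00 is inferred from the observed value only, not checked against the eina source.

```c
#include <stdint.h>

/* Assumed current behavior (inferred, not confirmed): an undecodable
 * byte is smuggled through as a lone low surrogate. */
uint32_t invalid_byte_as_surrogate(unsigned char b)
{
    return 0xDC00u + b;
}

/* Proposed behavior: every undecodable byte becomes U+FFFD. */
uint32_t invalid_byte_as_replacement(unsigned char b)
{
    (void)b; /* the byte value is deliberately discarded */
    return 0xFFFDu;
}

/* Surrogates are not Unicode scalar values, so they cannot be encoded
 * as well-formed UTF-8 when the text is copied out again. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800u && cp <= 0xDFFFu;
}
```

The practical difference shows up on copy-paste: a value for which is_surrogate() is true has no well-formed UTF-8 encoding, whereas U+FFFD round-trips like any other character.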