Issue 1361: UTF-8 decoder does not conform to Unicode standard re: invalid sequences
Affected Version: hterm

What steps will reproduce the problem?

1. Download and cat this file to the terminal: http://web.mit.edu/keithw/Public/htermbugs/inv3.txt (this is the octet stream of Unicode 6.1, Table 3-8).

What is the expected output? What do you see instead?

The best output is "a���b�c��d", or, if those characters do not display here, "a???b?c??d", where ? is the Unicode replacement character (U+FFFD). This complies with both Unicode 6.1 requirement C10 ("Conformant processes cannot interpret ill-formed code unit sequences.") and the "Best Practices for using U+FFFD" in Unicode 6.1, section 3.9.

When told the encoding is UTF-8, Chrome (the Web browser) complies with C10 but not with the "Best Practice"; gnome-terminal behaves the same way.

In hterm, I see output that complies with neither C10 nor the "Best Practice": "añáÂbc¿d". (This is also different from what we would get if we interpreted the octet sequence as Latin-1.)

Please provide any additional information below.

It's probably not feasible to fall back gracefully from invalid UTF-8 subsequences by decoding them (and only them) as ISO 8859-1, switching back and forth within a line. The Unicode specifications do not permit this kind of thing, for relatively good reasons.
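For reference, the expected result can be reproduced outside a terminal. The following is a minimal Python sketch, not hterm code. It assumes that inv3.txt contains exactly the octets listed in Unicode 6.1, Table 3-8, and that the interpreter is a recent CPython 3, whose built-in UTF-8 decoder with `errors="replace"` is expected to emit one U+FFFD per maximal ill-formed subpart. The names `TABLE_3_8`, `expected`, `decoded` and `strict_rejected` are made up for this illustration.

```python
# Octets from Unicode 6.1, Table 3-8 (assumed to match the contents of inv3.txt).
TABLE_3_8 = bytes.fromhex("61 F1 80 80 E1 80 C2 62 80 63 80 BF 64")

# The "Best Practice" result from the report: a, 3x U+FFFD, b, U+FFFD, c, 2x U+FFFD, d.
expected = "a\ufffd\ufffd\ufffdb\ufffdc\ufffd\ufffdd"

# errors="replace" substitutes U+FFFD for ill-formed input instead of
# interpreting it as characters.
decoded = TABLE_3_8.decode("utf-8", errors="replace")

# Print the code points in escaped form so the result is readable even when
# the display cannot render U+FFFD.
print(decoded.encode("unicode_escape").decode("ascii"))
print("matches best practice:", decoded == expected)

# Strict decoding rejects the stream outright, which is another way to
# avoid interpreting ill-formed sequences (requirement C10).
try:
    TABLE_3_8.decode("utf-8")
    strict_rejected = False
except UnicodeDecodeError:
    strict_rejected = True
print("strict decode rejected input:", strict_rejected)
```

A decoder that prints a different number of replacement characters may still satisfy C10, as long as it never turns the ill-formed octets into ordinary characters; only the count of U+FFFD is governed by the "Best Practice".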
Apr 30, 2012
#1
winst...@gmail.com
May 1, 2012
Wrong project
Status: Invalid