Issue 1361: UTF-8 decoder does not conform to Unicode standard re: invalid sequences
Status:  Invalid
Owner:  ----
Closed:  May 2012

Reported by winst...@gmail.com, Apr 30, 2012
Affected Version: hterm

What steps will reproduce the problem?
1. Download and cat this file to the terminal: http://web.mit.edu/keithw/Public/htermbugs/inv3.txt (this is the octet stream of Unicode 6.1, Table 3-8).

What is the expected output? What do you see instead?
The best output is "a���b�c��d" (or, if those characters did not come through, "a???b?c??d"), where each ? stands for the Unicode replacement character (U+FFFD).

This complies with both Unicode 6.1 requirement C10 ("Conformant processes cannot interpret ill-formed code unit sequences.") as well as the "Best Practices for using U+FFFD" in Unicode 6.1, section 3.9.
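As a rough illustration of the "Best Practice" referred to above, here is a minimal Python sketch of a decoder that emits one U+FFFD per maximal ill-formed subsequence. The function name `decode_utf8_replace` is made up for this example, and the byte string is transcribed from the Unicode table rather than taken from the linked file, so treat both as assumptions.

```python
REPLACEMENT = "\ufffd"

def _second_byte_range(lead):
    # Some lead bytes restrict their first continuation byte; this rules out
    # overlong forms, surrogates, and code points above U+10FFFF.
    if lead == 0xE0:
        return 0xA0, 0xBF
    if lead == 0xED:
        return 0x80, 0x9F
    if lead == 0xF0:
        return 0x90, 0xBF
    if lead == 0xF4:
        return 0x80, 0x8F
    return 0x80, 0xBF

def decode_utf8_replace(data: bytes) -> str:
    """Decode UTF-8, replacing each maximal ill-formed subsequence with U+FFFD."""
    out = []
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # ASCII
            out.append(chr(b))
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need, cp = 1, b & 0x1F
        elif 0xE0 <= b <= 0xEF:
            need, cp = 2, b & 0x0F
        elif 0xF0 <= b <= 0xF4:
            need, cp = 3, b & 0x07
        else:                             # stray continuation or invalid lead byte
            out.append(REPLACEMENT)
            i += 1
            continue
        lo, hi = _second_byte_range(b)
        j, ok = i + 1, True
        for _ in range(need):
            if j < n and lo <= data[j] <= hi:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
                lo, hi = 0x80, 0xBF       # later continuation bytes are unrestricted
            else:
                ok = False                # stop here; the offending byte is re-examined
                break
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j
    return "".join(out)

# Byte sequence assumed to match the test file from the report.
TABLE_3_8 = bytes([0x61, 0xF1, 0x80, 0x80, 0xE1, 0x80, 0xC2,
                   0x62, 0x80, 0x63, 0x80, 0xBF, 0x64])
decoded = decode_utf8_replace(TABLE_3_8)
```

With these bytes, `decoded` should be "a", three U+FFFD, "b", one U+FFFD, "c", two U+FFFD, "d". Recent versions of Python's built-in `bytes.decode("utf-8", errors="replace")` should give the same result.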

When given an encoding of UTF-8, both Chrome (the Web browser) and gnome-terminal comply with C10 but not with the "Best Practice."

In hterm, I see output that doesn't comply with C10 or with the "Best Practice": "añ€€á€Âb€c€¿d".

(This is also different from what we would get if we interpreted the octet sequence as Latin-1.)
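For comparison, a short Python sketch of how the same bytes come out under two single-byte encodings. The byte string is again an assumed transcription of the test file, and the match with Windows-1252 is only an observation about the characters shown in this report, not a statement about what hterm does internally.

```python
# Assumed contents of the test file (bytes from the Unicode table).
TABLE_3_8 = bytes.fromhex("61 f1 80 80 e1 80 c2 62 80 63 80 bf 64")

reported = "añ€€á€Âb€c€¿d"               # output quoted in this report

as_latin1 = TABLE_3_8.decode("latin-1")   # 0x80 maps to the C1 control U+0080
as_cp1252 = TABLE_3_8.decode("cp1252")    # 0x80 maps to the euro sign U+20AC

matches_latin1 = (as_latin1 == reported)
matches_cp1252 = (as_cp1252 == reported)
```

Here `matches_latin1` is false and `matches_cp1252` is true, which is consistent with the remark that the output is not a plain Latin-1 interpretation.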

Please provide any additional information below.
It's probably not feasible to fall back gracefully by decoding invalid UTF-8 subsequences (and only them) as ISO 8859-1, switching back and forth within a line. The Unicode specifications also do not permit this kind of thing, for relatively good reasons.
Apr 30, 2012
#1 winst...@gmail.com
I wish I could delete or close my own bug report! I clicked the wrong "report bug" button. I will refile this against Chromium OS.
May 1, 2012
#2 sop@google.com
Wrong project
Status: Invalid