Issue 761: Incorrect UTF-8 encoding/decoding for non-BMP characters in String related functions
27 people starred this issue and may be notified of changes.
Status:  Fixed
Owner:  l...@chromium.org
Closed:  Sep 2011


Reported by kangxn@gmail.com, Jun 29, 2010
Non-BMP Unicode characters (code points above 0xFFFF) are represented as 4 bytes in UTF-8 and as two uint16_t code units in UTF-16; V8 can't handle them correctly.

Sample case:

Unicode: \U00010412
UTF-8: f0 90 90 92
UTF-16LE: 01 d8 12 dc ('\ud801\udc12')

String::New fails to accept the UTF-8 string, returning an empty string.
And String::WriteUtf8 writes 6 bytes out for the sample character.
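
An editorial JavaScript sketch (not part of the original report) of the surrogate-pair arithmetic behind the sample values:

// How U+10412 maps to the UTF-16 pair \ud801\udc12:
var cp = 0x10412;
var v = cp - 0x10000;          // 0x0412
var hi = 0xD800 + (v >> 10);   // 0xD801 (lead surrogate)
var lo = 0xDC00 + (v & 0x3FF); // 0xDC12 (trail surrogate)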


Jun 29, 2010
Project Member #1 ipo...@chromium.org
(No comment was entered for this change.)
Cc: ipo...@chromium.org
Jun 29, 2010
Project Member #2 ager@chromium.org
(No comment was entered for this change.)
Status: Assigned
Owner: LasseReichsteinHolstNielsen
Jun 30, 2010
#3 lrn%chro...@gtempaccount.com
V8 currently only accepts characters in the BMP as input, using UCS-2 as internal representation (the same representation as JavaScript strings).

As such, the output is correct (the UTF-8 encoding of <U+D801,U+DC12> is six bytes, even though the code points have no meaning on their own).
This is unlikely to change for the standard output functions, since JavaScript strings are inherently UCS-2. If someone uses V8 and knows that a string really contains UTF-16 encoded data, they need to add their own output function that parses the string data, converts it to whatever is needed, and can handle malformed UTF-16 data.
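
Editorial aside: the six bytes arise because each 16-bit code unit is encoded as if it were a stand-alone BMP code point (the CESU-8 style). A minimal JavaScript sketch of that per-unit encoding:

// UTF-8 bytes of a single code point below U+10000 (sketch only).
function utf8BytesOfBmpCodePoint(u) {
  if (u < 0x80)  return [u];
  if (u < 0x800) return [0xC0 | (u >> 6), 0x80 | (u & 0x3F)];
  return [0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)];
}
// U+D801 -> ed a0 81, U+DC12 -> ed b0 92: three bytes each, six in total.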


However, the input is correctly parsed as U+00010412, but it is then silently truncated to 16 bits when building the string. That's not helpful behavior.

There are two things we can do here:
1 - Make it an error to enter characters outside the BMP. That avoids the silent truncation.
2 - Convert the input to UTF-16, using surrogate pairs for non-BMP code points, knowing that it will be treated as UCS-2 internally. 

The latter isn't as dangerous as it seems: all valid BMP-only UTF-8 text is unchanged, characters outside the BMP are handled (without actually being understood), and any input containing UTF-8 encodings of the surrogate range is invalid anyway.
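
Option 2 in miniature (an editorial sketch; a real decoder would also reject overlong forms and other invalid sequences):

// Decode UTF-8 bytes into 16-bit code units, expanding non-BMP
// code points into surrogate pairs.
function utf8ToUtf16Units(bytes) {
  var units = [], i = 0, b, cp, extra;
  while (i < bytes.length) {
    b = bytes[i++];
    if (b < 0x80)      { cp = b;        extra = 0; }
    else if (b < 0xE0) { cp = b & 0x1F; extra = 1; }
    else if (b < 0xF0) { cp = b & 0x0F; extra = 2; }
    else               { cp = b & 0x07; extra = 3; }
    while (extra-- > 0) cp = (cp << 6) | (bytes[i++] & 0x3F);
    if (cp > 0xFFFF) { // non-BMP: split into a surrogate pair
      cp -= 0x10000;
      units.push(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
    } else {
      units.push(cp);
    }
  }
  return units;
}
// utf8ToUtf16Units([0xf0, 0x90, 0x90, 0x92]) -> [0xD801, 0xDC12]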

I'm leaning towards the second option.
Status: Accepted
Labels: Type-FeatureRequest
Jul 1, 2010
#4 lrn%chro...@gtempaccount.com
After closer inspection, I don't see any way we can safely use the second option.

We parse the same input either as UTF-8 or as a String value, and the two must parse exactly the same.
More precisely, the pre-parser parses it as utf-8, and the real parser parses it from a string later on. The pre-parser stores indices into the string for later use, so the number of codepoints in the two representations MUST be the same.
That means that we can't turn one code-point into a surrogate pair as long as we don't parse the string as UTF-16 too.

I.e., the second option requires us to also change any parsing from string values to interpreting the string value as UTF-16, which is a larger change.
Aug 2, 2010
#5 lrn%chro...@gtempaccount.com
V8 will not support characters outside the Basic Multilingual Plane for now.
Characters outside the supported range are converted to U+FFFD (REPLACEMENT CHARACTER) when parsing.
Status: Fixed
Sep 13, 2011
#6 dipesh.s...@gmail.com
We are using WebSocket to send bulk data from a native application to a web browser application written in JavaScript. Our native application sends the bulk data UTF-8 encoded.

The browser-side JavaScript application works fine when the data contains only Basic Multilingual Plane characters. If there is a UTF-8 encoded character outside the Basic Multilingual Plane (a code point that requires surrogates), it is replaced with U+FFFD (REPLACEMENT CHARACTER), so the JavaScript application never knows what string was actually received.

One option is to work around this by sending such code points as UTF-16 surrogate pairs. But our original data is in UTF-8, and the conversion requires scanning the complete string, identifying those characters, and replacing each with a surrogate pair. This replacement is a costly operation and slows down the whole application.

Is there any plan for the browser itself to support non-BMP code points in UTF-8 input, so this costly conversion can be avoided?
Sep 13, 2011
Project Member #7 l...@chromium.org
No plans at the moment, no.
We will never (barring a development in ECMAScript) support surrogate pairs in JavaScript strings. A character with a code in the surrogate-pair range is considered a single stand-alone character from JavaScript's point of view.

I'm reconsidering whether it's possible to convert all incoming UTF-8 into UTF-16 sequences instead of UCS-2 (i.e., convert a non-BMP character into a surrogate pair). 
This will be on input only, and won't make sense outside of comments and String and RegExp literals (since a surrogate code isn't valid anywhere else). 
It's likely to confuse users, since we won't ever interpret the result as UTF-16 anyway. That means that the length of a string literal containing non-BMP characters is different from the number of Unicode characters sent as UTF-8. 

If we can avoid problems with the parsers by always counting non-BMP input as two code-points, then this might be possible, but it's not obvious that it's desirable, except for very specific uses.
As such, no current plans to change anything.
Owner: l...@chromium.org
Sep 13, 2011
#8 jsbell@chromium.org
There has been some discussion in TC39 - at least, on the es-discuss mailing list - about full Unicode support for ECMAScript strings. A strawman proposal is at: http://wiki.ecmascript.org/doku.php?id=strawman:support_full_unicode_in_strings - note that this is NOT accepted for ES6 and there were concerns raised about this particular proposal.

...

FWIW, it is possible to do text processing in JavaScript treating strings as UTF-16 sequences, both manually and with a little help from the browser. For example:

// see http://ecmanaut.blogspot.com/2006/07/encoding-decoding-utf8-in-javascript.html
function encode_utf8( s ) { return unescape( encodeURIComponent( s ) ); }
function decode_utf8( s ) { return decodeURIComponent( escape( s ) ); }

function codes(s) {
  var c = [], i;
  for (i = 0; i < s.length; i += 1) {
    c[i] = s.charCodeAt(i);
  }
  return c.map(function (d) { return d.toString(16); }).join(' ');
}

// from original poster's sample
var utf8str = encode_utf8('\ud801\udc12');
codes(utf8str); // > "f0 90 90 92"
codes(decode_utf8(utf8str)); // > "d801 dc12"

IMHO, the suggestion to convert incoming UTF-8 to UTF-16 instead of UCS-2 matches at least part of the reality on the web. With DOM interop, here's another example:

// from the WebKit inspector
var u = '\uD834\uDD1E'; // U+1D11E MUSICAL SYMBOL G CLEF
document.title = u; // works on my machine

The 16-bit JavaScript string is being interpreted as a UTF-16 sequence somewhere between the script runtime and the display. Converting incoming WebSocket UTF-8 strings to UTF-16 before handing them to JavaScript seems like the right thing to do, so that they can later find their way back out to the DOM for display.

Sep 13, 2011
Project Member #9 jshin@chromium.org
<quote>
I'm reconsidering whether it's possible to convert all incoming UTF-8 into UTF-16 sequences instead of UCS-2 (i.e., convert a non-BMP character into a surrogate pair). 
This will be on input only, and won't make sense outside of comments and String and RegExp literals (since a surrogate code isn't valid anywhere else). 
It's likely to confuse users, since we won't ever interpret the result as UTF-16 anyway. That means that the length of a string literal containing non-BMP characters is different from the number of Unicode characters sent as UTF-8. 
</quote>

I think this is perfectly fine. In particular, it's fine that 'length' will keep counting 2-byte code units instead of Unicode characters.
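
A concrete illustration of that point (editorial sketch):

var s = '\uD834\uDD1E';       // U+1D11E as a surrogate pair
s.length;                     // 2, even though it is one Unicode character
s.charCodeAt(0).toString(16); // "d834"
s.charCodeAt(1).toString(16); // "dd1e"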

 
Sep 13, 2011
Project Member #10 ipo...@chromium.org
(No comment was entered for this change.)
Cc: -ipo...@chromium.org
Sep 13, 2011
#11 markda...@google.com
I agree on converting to UTF-16. That means that data won't be corrupted when displayed to the user. Java works the same way; the storage is UTF-16, and the length is in chars (16-bit units).
Sep 14, 2011
Project Member #12 l...@chromium.org
While it would be convenient to convert UTF-8 to UTF-16 and then treat it as UCS-2, we should still be compatible with other browsers.
Currently we match Safari and IE: a four-byte sequence like F0 80 80 80 (UTF-8 of U+10000) is converted to four U+FFFD characters (probably because the first byte isn't recognized by the decoder, and the following bytes aren't valid UTF-8 sequence starters).
(In comparison, Opera and Firefox read it as one U+FFFD. They obviously decode the UTF-8 correctly, and then convert the one non-BMP character to U+FFFD.)

We should probably keep compatibility for now.
Sep 14, 2011
#13 dipesh.s...@gmail.com
Hi,
   I need to send some data containing a non-BMP character, e.g. (𝍖) U+1D356, UTF-8 = (f0 9d 8d 96), UTF-16 = (\uD834\uDF56), from a native application to a browser JavaScript app through a WebSocket. (Reference: http://graphemica.com/%F0%9D%8D%96)

   I have tried the following methods.

Way 1:

	char str[9];
	char *p = str;
	*p++ = '\\';
	*p++ = 'u';
	*p++ = 0xd8;
	*p++ = 0x34;
	*p++ = '\\';
	*p++ = 'u';
	*p++ = 0xdf;
	*p++ = 0x56;
	*p = '\0';

	I send this str over the WebSocket, but the application is not able to parse it and shows a NaN error.

Way 2:

	sprintf(str, "\\u%x\\u%x", 0xD7C0 + (c >> 10), 0xDC00 | (c & 0x3FF));

	I send this str over the WebSocket, but in the Chrome application it displays \ud834\udf56 literally and cannot render it as a Unicode character.

Please suggest.
   
Sep 15, 2011
Project Member #14 l...@chromium.org
I'm not sure what you are trying to send here. The "\u" suggests that it's part of a string, but in that case the following characters should be ASCII hex digits.

You can't send the character U+1D356 to the V8 JavaScript engine, since it simply doesn't recognize code points outside the BMP.

Since you are running in a browser, the above discussion doesn't apply - that was about the V8 API. When running in the browser, UTF-8 decoding is generally handled by WebKit.

If you want to send the two 16-bit words D834 and DF56, and the browser will be the one interpreting it first, you send the UTF-8 encoding as part of a normal HTML file or JS file. Then it will be expanded into the two surrogate codes before being passed to V8. It only works for valid character encodings (my U+10000 above should be encoded as F0 90 80 80, then it works too).

I haven't checked whether Chrome does something else to characters coming through a web-socket, but I would try the same thing there.

If you are embedding V8 directly, and creating strings through the API, then it's a different matter, because then you use the V8 UTF-8 decoder, which turns any non-BMP character into U+FFFD. That's the one that we might consider changing (if it can be done without breaking the parser/preparser interaction), but it's not a high priority. I'll reopen this feature request.
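
An editorial sketch of the ASCII-hex-digits point above (the JSON framing is an assumption of the example, not something the thread specifies):

var wire = '"\\ud834\\udf56"'; // 14 ASCII characters on the wire
var s = JSON.parse(wire);      // each six-character escape becomes one code unit
s.length;                      // 2: the surrogate pair D834, DF56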
Status: Accepted
Labels: Priority-Low
Sep 15, 2011
#15 testemai...@gmail.com
Thanks for your reply. Actually, I am sending UTF-8 encoded data from a native (C) application to a client inside the Chrome browser, which receives the data from a WebSocket using JavaScript (I believe this uses V8). This data contains non-BMP characters as well.

But due to the limitation of the V8 engine, as I have seen in the Chrome browser, each such character is converted into U+FFFD.

So I have tried sending the non-BMP character as a UTF-16 surrogate pair,
e.g. the character (𝍖) U+1D356: UTF-8 = (f0 9d 8d 96), UTF-16 = (D834 DF56).

Native app:
	char str[5];
	char *p = str;
	*p++ = 0xd8;
	*p++ = 0x34;
	*p++ = 0xdf;
	*p++ = 0x56;
	*p = '\0';

JavaScript in Chrome:
	var ws = new WebSocket('ws://localhost:12345/mySession');
	ws.onmessage = function (evt)
	{
		var reply = evt.data;
		console.log('reply: ' + reply); // empty string received (when sending a non-BMP char as raw UTF-16)
						// replacement char U+FFFD (when sending a non-BMP char as UTF-8)
	};

The native (C) code sends the raw UTF-16 bytes, but the JavaScript application in Chrome receives an empty string.

I am not sure whether this is a problem with V8, WebKit, or Chrome. But either way, the data sent as UTF-16 (surrogate pair) is not received.
Sep 16, 2011
#16 shrivast...@gmail.com
If we send UTF-8 encoded non-BMP characters {example: UTF-8 = (f0 9d 8d 96)} and receive them through the WebSocket, they get replaced by U+FFFD. If we send them in UTF-16 {example: UTF-16 = (D834 DF56)}, we receive an empty string through the WebSocket.

Could someone please help us transfer non-BMP characters through a WebSocket so that they are received properly by the JS application in Chrome?

We have tried every way known to us without luck. The only solution seems to be to encode the data in some other format (the complete string in base64, or just the non-BMP characters as \uxxxx\uxxxx, where the x's are hex digits, as in the string "\ud834\udf56") and decode it in the application, which is inefficient for obvious reasons.
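
The base64 route mentioned above, in miniature (editorial sketch, reusing the escape/decodeURIComponent trick from comment #8):

var bytes = atob('8J2Nlg==');              // base64 of f0 9d 8d 96
var s = decodeURIComponent(escape(bytes)); // decode the byte string as UTF-8
s === '\uD834\uDF56';                      // true: the surrogate pair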
Sep 19, 2011
#17 shrivast...@gmail.com
\uxxxx\uxxxx is the standard JavaScript character escape representation.
It would be best if UTF-8 encoding were supported for all data, including data received through a WebSocket. Ideally the ECMAScript standard itself should support UTF-8 encoded strings; that way we would have uniformity in receiving and rendering. Currently we are completely lost as far as receiving/rendering of non-BMP characters is concerned.
Sep 19, 2011
Project Member #18 l...@chromium.org
The \uxxxx sequence is recognized in ECMAScript string and RegExp literals and in identifiers only. It's always a six-character sequence, and the 'x's must be ASCII hex digits. The above example used four-byte sequences with non-ASCII hex digits, and not obviously inside a string or RegExp literal.

In any case, I agree that we should have conformity in behavior. If non-BMP code points encoded as UTF-8 are treated one way when entering the browser as a script, but differently when entering through a web-socket, then that's a problem. I'd say it's the responsibility of the browser code to do the same thing before passing it on to JavaScript.

I'll see if I can reproduce it locally, and then I'll open a Chromium bug for it (or you can go ahead and do that, since you have an example already). Then we'll see if it should be handled inside WebKit or non-V8 Chromium (like other incoming UTF-8 data), or if it should be delegated to V8 (in which case our UTF-8 decoder needs changing).
Sep 21, 2011
#19 shrivast...@gmail.com
The problem is that even if we send the data in exact JavaScript string format, converting all non-BMP characters to surrogate pairs (as mentioned in the previous post), we are not able to render it. Yet if we assign exactly the same data directly as a string literal, it renders properly!
It is probably a Chromium issue; we will file a bug in Chromium with all the information.
Sep 21, 2011
Project Member #20 l...@chromium.org
I've been reading up on websockets, and splitting into surrogate pairs before sending it from the server is probably not the way to go (it is if you want to inject strings directly into V8 using the UTF-8 API, but that's not what Chromium normally does). The websocket protocol expects valid UTF-8.

If you are sending *from* JavaScript, you'll have to use properly matched surrogate pairs (because all you have is strings as 16-bit number sequences), and the web-socket send method must convert that to valid UTF-8.
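
An editorial sketch of that sending direction (reusing the WebSocket URL from comment #15 as a placeholder):

var ws = new WebSocket('ws://localhost:12345/mySession');
ws.onopen = function () {
  // The string holds a matched surrogate pair; the browser's send()
  // is responsible for emitting it as valid UTF-8 (f0 9d 8d 96).
  ws.send('\uD834\uDF56');
};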
Sep 21, 2011
#21 shrivast...@gmail.com
That is our default implementation: everything is sent UTF-8 encoded from the native server over the WebSocket. The issue is that non-BMP UTF-8 characters get replaced by U+FFFD. That's the whole reason we are trying out different ways of transferring non-BMP characters over the WebSocket; so far we have been unable to figure out an elegant method.
We tried these:
1) UTF-8 encoded data: non-BMP characters get replaced by U+FFFD
2) UTF-16 encoded data: we get an empty string
3) JavaScript string format (surrogate pairs): rendering issue, as mentioned in the previous post

Sep 21, 2011
Project Member #22 jshin@chromium.org
In 3), what rendering issue do you have? Your previous comment about rendering (comment #17) does not say much. 

Anyway, if #1 does not work, it's not likely that it's a v8 issue. I guess we have to look at Chromium's WebSocket implementation. 
Sep 21, 2011
#23 yutak@chromium.org
Hi, I'm a developer of WebSocket.

I couldn't reproduce the issue in my local environment (WebSocket received non-BMP characters correctly).

Please check the following:
- WebSocket only accepts UTF-8 encoded data in text frames. (UTF-16 is not allowed.)
- WebSocket has its own frame format; you need to prepend a frame header to your text data.
- Since Chrome 14, the WebSocket protocol implementation has been changed to conform to the recent protocol draft. You may need to update your server implementation so it conforms to <http://tools.ietf.org/html/draft-ietf-hybi-thewebsocketprotocol-10>.

If you still believe this is a bug, feel free to open a new Chromium bug with detailed information (version number, OS, steps to reproduce, etc.) at <http://new.crbug.com/>.
Sep 22, 2011
#24 shrivast...@gmail.com
Hi yutak,
We are using an older Chromium (11.01) with WebSocket protocol version 0 support, and the issue very much exists there.
We have support for WebSocket protocol version 10 in our native server; we tried it with Chrome 14 (14.0.835.186) and the same issue exists.
But the dev version 15 of Chrome (15.0.874.21) works fine, and we are able to receive all characters properly, including non-BMP. Please note that we are sending all data UTF-8 encoded in text frames.
I don't know whether we need to file a bug for Chrome versions < 15.

Sep 22, 2011
#25 shrivast...@gmail.com

Hi jshin,

Regarding the rendering issue:

We are receiving the string \ud834\udf56 from the WebSocket. When we assign this string to innerHTML to display it, it is rendered as the literal string \ud834\udf56 instead of '𝍖'. Whereas if we assign the same string to innerHTML manually, like this: innerHTML = '\ud834\udf56', then it is rendered properly (𝍖).


Sep 22, 2011
Project Member #26 l...@chromium.org
The string you receive from the websocket, what is its length?
If it's 12, you are receiving the string "\\ud834\\udf56" (i.e., where
the first character is a backslash), and not the string "\ud834\udf56"
(where the first character is the surrogate pair starter U+D834).

Also, do you know which *bytes* are sent by the websocket server?
It should be the four byte UTF-8 encoding: f0 9d 8d 96.

/L
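
The distinction, shown directly (editorial sketch):

'\\ud834\\udf56'.length; // 12: literal backslashes and hex digits
'\ud834\udf56'.length;   // 2: a real surrogate pair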

Sep 23, 2011
#27 shrivast...@gmail.com
You are right: the native library we were using to create JSON objects (json-glib) was replacing each single \ in the string passed to it with a double \\ (!!), and it was getting transmitted as such. We can conclude that there is no rendering issue; the mistake was ours in not analysing all aspects properly.

As said earlier, we were originally sending non-BMP characters as UTF-8 only, but since they were getting replaced by U+FFFD we tried sending surrogate pairs in the format \ud834\udf56 instead.

As already noted, in Chrome v15 our existing implementation of sending everything as UTF-8 over the WebSocket works fine; we don't know where the fix is (V8 or Chrome).
Sep 28, 2011
Project Member #28 l...@chromium.org
(No comment was entered for this change.)
Status: Fixed
Nov 30, 2011
#29 mickael....@gmail.com
I don't see why this issue was closed... I think offering a full UTF-8 codec is still a valid feature request.

I think V8 should adopt a pragmatic approach here. I don't care if JavaScript String.* simply ignores surrogate pairs (conforming to the standard); I just want to be able to get UTF-8 data from and to a socket, a file, or a database, passing through a conversion to a V8 string, without losing data due to those few extra-BMP characters.

If you are interested, I have a patch for V8 that does this.

Regards, M.
Jan 19, 2012
#30 m...@ranney.com
Now that iOS and Android support Unicode 6, there are a lot more users composing strings with characters outside of the BMP.  This is a pretty serious limitation for V8 and JavaScript in general.
Mar 3, 2012
Project Member #31 erik.corry
I'm going to fix the API functions in V8 that accept and produce UTF-8 so that they understand surrogate pairs. I'm not sure whether that will be enough to fix this issue vis-à-vis WebSockets, because I am not sure that Chromium uses those API functions.
Mar 5, 2012
#32 mickael....@gmail.com
Just so you know... I have a version of V8 that understands surrogate pairs (to a point); you might want to give it a peek: https://github.com/polazarus/v8
Mar 12, 2012
#33 erikcorry@google.com
The bleeding edge revision 11007 has fixes to handle surrogate pairs on input and output.  The intended behaviour is:

* 4-byte UTF-8 sequences turn into 2 surrogates in the JS String
* Two 3-byte UTF-8 sequences can also be used to create 2 surrogates in the JS String
* String.fromCharCode(x) takes a single UTF-16 code unit, so you still can't give it numbers above 0xffff
* Most places in JS (RegExp, [], charCodeAt, charAt, etc.) work on UTF-16 code units with no special treatment for surrogates.
* On output to UTF-8, unmatched surrogates map to a 3-byte UTF-8 sequence, and surrogate pairs map to a single 4-byte UTF-8 sequence.
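
A sketch of the intended output mapping (editorial illustration, not the actual V8 source):

// Encode 16-bit code units to UTF-8: matched surrogate pairs become one
// 4-byte sequence, lone surrogates a 3-byte sequence of their own value.
function utf16UnitsToUtf8(units) {
  var out = [], i = 0, u, lo, cp;
  while (i < units.length) {
    u = units[i++];
    if (u >= 0xD800 && u < 0xDC00 && i < units.length &&
        (lo = units[i]) >= 0xDC00 && lo < 0xE000) {
      i++; // matched pair: combine into one code point
      cp = 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
      out.push(0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
               0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F));
    } else if (u < 0x80) {
      out.push(u);
    } else if (u < 0x800) {
      out.push(0xC0 | (u >> 6), 0x80 | (u & 0x3F));
    } else {
      out.push(0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F));
    }
  }
  return out;
}
// [0xD834, 0xDF56] -> f0 9d 8d 96; a lone 0xD834 -> ed a0 b4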