My favorites | Sign in
Project Home Downloads Wiki Issues Source
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 158: SnakeYAML can't load higher-plane unicode characters
3 people starred this issue and may be notified of changes. Back to list
Status:  Fixed
Owner:  py4fun@gmail.com
Closed:  Sep 2012


Sign in to add a comment
 
Project Member Reported by head...@headius.com, Sep 26, 2012
In JRuby, we had an issue reported where YAML could be dumped containing higher-plan unicode characters like HAMBURGER, U+1F354), but could not load them.

Example from JRuby:

system ~/projects/jruby $ jruby -ryaml -e "p YAML.dump(\"\u{1f354}\")"
"--- ! \"šŸ”\"\n"

system ~/projects/jruby $ jruby -ryaml -e "p YAML.load(YAML.dump(\"\u{1f354}\"))"
Psych::SyntaxError: (<unknown>): <reader> unacceptable character '?' (0xD83C) special characters are not allowed
in "<reader>", position 7 at line 0 column 0
         parse at org/jruby/ext/psych/PsychParser.java:197
  parse_stream at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:203
         parse at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:151
          load at /Users/headius/projects/jruby/lib/ruby/1.9/psych.rb:127
        (root) at -e:1

(the blank in the first command's output is a HAMBURGER character)

It seems like perhaps SnakeYAML is not considering the possibility of surrogate characters in the String it parses, and only reads the first half of the HAMBURGER character. It should be checking characters for surrogacy and using Unicode codepoint logic if necessary.

I will see if I can find where in code this needs to happen.
Sep 26, 2012
Project Member #1 head...@headius.com
Ok, I think SnakeYAML is handling the input properly, but not outputting properly.

According to the YAML specs for 1.1 and 1.2:

"On output, a YAML processor must only produce these acceptable characters, and should also escape all non-printable Unicode characters."

In the above case, it is clearly not emitting the escaped version of the HAMBURGER character, and is instead emitting the actual utf-8 sequence into the output. It then blows up attempting to load those characters, since StreamReader properly errors out on non-printable characters in the YAML stream.
Sep 26, 2012
Project Member #2 head...@headius.com
This issue may be a duplicate of https://code.google.com/p/snakeyaml/issues/detail?id=155 which appears to have had a fix in the past month.

Is a 1.11 release pending any time soon? I will test 1.11 build in JRuby now.
Sep 26, 2012
Project Member #3 head...@headius.com
I have tested JRuby with a HEAD build of SnakeYAML and the issue persists. It does not appear the regular expression for catching unprintable characters is catching this case correctly.
Sep 26, 2012
Project Member #4 head...@headius.com
It appears SnakeYAML is missing the unprintable mask for higher-plane characters... specifically the range 0x10000-0x10FFFF shown in the YAML spec for 32-bit character ranges.

This seems like the right fix, but it doesn't appear to match correctly:

diff -r 9126dde66e63 src/main/java/org/yaml/snakeyaml/reader/StreamReader.java
--- a/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java	Wed Sep 05 20:14:52 2012 +0200
+++ b/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java	Wed Sep 26 15:03:43 2012 -0500
@@ -29,8 +29,13 @@
  * Reader: checks if characters are in allowed range, adds '\0' to the end.
  */
 public class StreamReader {
-    public final static Pattern NON_PRINTABLE = Pattern
-            .compile("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD]");
+    public final static Pattern NON_PRINTABLE;
+    static {
+        StringBuilder builder = new StringBuilder("[^\t\n\r\u0020-\u007E\u0085\u00A0-\uD7FF\uE000-\uFFFD")
+                .appendCodePoint(0x10000).append('-').appendCodePoint(0x10FFFF)
+                .append(']');
+        NON_PRINTABLE = Pattern.compile(builder.toString());
+    }
     private String name;
     private final Reader stream;
     private int pointer = 0;
Sep 26, 2012
Project Member #5 head...@headius.com
Note also that SnakeYAML is encoding 32-bit (i.e. higher-plane) characters using two \u sequences rather than a single \U sequence, as stated in the YAML spec:

Escaped 16-bit Unicode character:
[63]	ns-esc-16-bit	::=	ā€œ\ā€ ā€œuā€ ( ns-hex-digit x 4 )	 
Escaped 32-bit Unicode character:
[64]	ns-esc-32-bit	::=	ā€œ\ā€ ā€œUā€ ( ns-hex-digit x 8 )
Sep 26, 2012
Project Member #6 head...@headius.com
Also, SnakeYAML appears to want to encode the entire string when any character matches, when it should just encode the unprintable characters.
Sep 27, 2012
Project Member #7 head...@headius.com
Ok, updates...

* SnakeYAML does encode only the characters that need encoding, via a check in Emitter.writeDoubleQuoted.
* It encodes UTF-16 surrogate pairs as separate 16-bit escape sequences, which is wrong.
* It only encodes when !allowUnicode.

So the following changes are needed:

* Always encode unprintable characters (YAML spec).
* Encode 32-bit characters (UTF-16 surrogate pairs) as \UXXXXXXXX, not as a pair of \uXXXX.

This patch does both: https://gist.github.com/3796362

It appears to work for JRuby purposes, other than some minor behavior changes I need to reconcile with Ruby's YAML tests.
Sep 28, 2012
Project Member #8 head...@headius.com
Updated the gist to include a test that exercises the new code path (or at least, it exercises the case of *valid* Unicode UTF-16 surrogate sequences...invalid sequences will basically come out the same as they were before).

https://gist.github.com/3796362
Sep 29, 2012
Project Member #9 py4fun@gmail.com
The patch has been accepted. Unfortunately the test does not cover the new code.
Please run 'mvn site' and check the coverage report for Emitter.java (target/site/cobertura/index.html). The 'if' statement in line 1238 is never 'false' in tests.
Status: Started
Owner: py4fun@gmail.com
Sep 29, 2012
Project Member #10 head...@headius.com
I have updated the gist with an additional test that covers the other code branch I added.
Sep 29, 2012
Project Member #11 py4fun@gmail.com
It will be delivered in version 1.11
Status: Fixed
Sign in to add a comment

Powered by Google Project Hosting