HTMLEntityCodec destroys 32-bit CJK (Chinese, Japanese and Korean) characters #303

meg23 · 2014-11-13T18:26:37Z

From ri.j...@gmail.com on April 04, 2013 08:36:58

What steps will reproduce the problem?
1. Escape "𡘾𦴩𥻂" with org.owasp.esapi.Encoder#encodeForHTML
2. View the result in a browser

What is the expected output? What do you see instead?
Expected: 𡘾𦴩𥻂
Current: ��

What version of the product are you using? On what operating system?
2.0.1 on Mac OS X 10.8.3

Does this issue affect only a specified browser or set of browsers?
It's the same in Chrome, Firefox and IE.

Please provide any additional information below.
The reason is that these characters lie outside the Basic Multilingual Plane, so they do not fit in a 16-bit Java char/Character; each one is stored as a pair of surrogate chars. Here is some code to illustrate it:

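String s = "𡘾𦴩𥻂";

// Wrong: iterates over UTF-16 chars, so each surrogate half gets its own entity
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
    sb.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(sb); // &#xd845;&#xde3e;..., which a browser renders as ��

// Correct: iterates over code points, so each character gets one entity
sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
    int codePoint = s.codePointAt(i);
    sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
    i += Character.charCount(codePoint); // advance by 1 or 2 chars
}
System.out.println(sb); // &#x2163e;&#x26d29;&#x25ec2;, which a browser renders as 𡘾𦴩𥻂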

Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=297

meg23 · 2014-11-13T18:26:38Z

From julian.r...@googlemail.com on June 14, 2014 00:08:24

Duplicate of issue 294, reported over a year ago. Seems this project is dead.

xeno6696 · 2017-07-19T17:14:55Z

This was missing some info. Apparently the code encodes the string as &#xd845;&#xde3e;&#xd85b;&#xdd29;&#xd857;&#xdec2; (one entity per UTF-16 surrogate half), which is then misrendered in the application.

We're not handling the encoding as the writer expected. The root problem in our case is that we're casting ints to chars as we construct the entity map, like this:

    map.put((char)34, "quot"); /* quotation mark */
    map.put((char)38, "amp");  /* ampersand */
    map.put((char)60, "lt");   /* less-than sign */

I'll experiment with replacing the char cast with an int code point as the map key, but that could require a nasty refactor.
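To make the problem with the cast concrete, here is a minimal sketch of an entity table keyed by int code point rather than char. The class name `CodePointMapSketch`, the `ENTITIES` map and the `encode` method are hypothetical names for illustration only; this is not ESAPI's actual code, and the table holds just the three entries quoted above.

```java
import java.util.HashMap;
import java.util.Map;

public class CodePointMapSketch {
    // Hypothetical lookup table keyed by int code point instead of char.
    static final Map<Integer, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put(34, "quot"); /* quotation mark */
        ENTITIES.put(38, "amp");  /* ampersand */
        ENTITIES.put(60, "lt");   /* less-than sign */
    }

    // Walk the input by code point: use a named entity when one is known,
    // pass printable ASCII through, and emit a hex reference for the rest.
    static String encode(String input) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            String name = ENTITIES.get(cp);
            if (name != null) {
                sb.append('&').append(name).append(';');
            } else if (cp >= 0x20 && cp < 0x7F) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for a surrogate pair
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A char holds 16 bits, so the cast silently drops the high bits of U+2163E.
        System.out.println(Integer.toHexString((char) 0x2163E)); // 163e
        // U+2163E written as its surrogate pair; it comes out as a single entity.
        System.out.println(encode("a<\uD845\uDE3E")); // a&lt;&#x2163e;
    }
}
```

The point of the int key is that a supplementary code point can be looked up and emitted as one unit, which a char key cannot represent at all.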

kwwall · 2017-07-20T01:58:20Z

@xeno6696 Thanks for taking this on.

xeno6696 · 2017-08-06T18:30:17Z

This is also fixed in PR #413

agiannone · 2020-10-14T22:47:06Z

I am experiencing a similar input/output issue when using the PercentCodec.

Input: %E7%B1%B3%E5%9F%94%E7%94%9F%E7%89%A9%E5%A4%9A%E6%A8%A3%E6%80%A7-%E9%AD%9A%E5%A1%98-99849a10bed9
Output: ç±³å??ç??ç?©å¤?æ¨£æ?§-é?å¡?-99849a10bed9

It attempts to decode each percent-encoded byte (2 hex digits) as a character of its own, so E7, then B1, then B3, etc.
For multi-byte UTF-8 characters it would need to decode the whole byte sequence together; here the first character is the 3 bytes E7 B1 B3.
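As a rough illustration of decoding the bytes together, the sketch below collects every percent-decoded byte first and only then interprets the whole run as UTF-8. It is a standalone example using JDK classes only; `PercentDecodeSketch` and `decode` are hypothetical names, and it is not the ESAPI PercentCodec API. It assumes the input is meant to be UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PercentDecodeSketch {
    // Collect the decoded bytes first, then interpret the whole run as UTF-8.
    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '%' && i + 2 < input.length()) {
                int hi = Character.digit(input.charAt(i + 1), 16);
                int lo = Character.digit(input.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write((hi << 4) | lo); // one raw byte per %XX escape
                    i += 2;
                    continue;
                }
            }
            // Literal character (or a malformed escape): copy it through as UTF-8 bytes.
            int cp = input.codePointAt(i);
            byte[] raw = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            bytes.write(raw, 0, raw.length);
            i += Character.charCount(cp) - 1;
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The first three escapes of the reported input form one character.
        System.out.println(decode("%E7%B1%B3").equals("\u7C73")); // true
        // A four-byte sequence decodes to a surrogate pair (U+2163E).
        System.out.println(decode("%F0%A1%98%BE").codePointAt(0) == 0x2163E); // true
    }
}
```

Decoding byte by byte, as in the reported output, treats E7, B1 and B3 as three separate characters, which is where the mojibake comes from.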

This might be related to #300

xeno6696 · 2020-10-15T14:00:51Z

@agiannone So the groundwork to fix this is in place--it just hasn't been done for the percent encoding/decoding. I modified the API so that the preferred default would be an integer-based codec as opposed to a char-based codec, and successfully implemented the pattern on the HTML path of encoding/decoding. The rest...

...was left as an exercise for the reader! :-D

There's an extra possible wrinkle that I didn't want to pursue at the time that relates to percent encoding, specifically that there already exist two paths to encode a URI string. IIRC it's ASCII or UTF-8, where ASCII encoding will destroy strings sent to it that are UTF-8 encoded (non-BMP characters are perfect for this).

It needs to be thought through what the correct behavior here should be. We might not care--if we defaulted to UTF-8 multi-byte encoding we will break with the past, BUT this might be a bogeyman...
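The difference between the two paths can be sketched with the JDK's own `java.net.URLEncoder` and `URLDecoder`, which stand in here for ESAPI's encoders purely as an assumption for illustration. `CharsetPathsSketch` and its helper methods are hypothetical names.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class CharsetPathsSketch {
    // Percent-encode with the named charset (JDK encoder, not ESAPI's).
    static String encode(String input, String charset) {
        try {
            return URLEncoder.encode(input, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Percent-decode, always interpreting the bytes as UTF-8.
    static String decodeUtf8(String input) {
        try {
            return URLDecoder.decode(input, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True when the string comes back unchanged after encode + decode.
    static boolean survives(String input, String charset) {
        return decodeUtf8(encode(input, charset)).equals(input);
    }

    public static void main(String[] args) {
        String s = "\uD845\uDE3E"; // U+2163E, a non-BMP character
        System.out.println(encode(s, "UTF-8"));      // %F0%A1%98%BE
        System.out.println(survives(s, "UTF-8"));    // true
        System.out.println(survives(s, "US-ASCII")); // false
    }
}
```

The ASCII path has no byte sequence for the character, so the original string cannot be recovered, while the UTF-8 path round-trips it.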

agiannone · 2020-10-15T14:07:34Z

@xeno6696 After reading through #300 I gathered it would be an exercise left to the reader :-)

I went ahead and implemented a PercentCodec that extends the AbstractIntegerCodec. Still need to run tests against it though, so not sure it works yet.

agiannone · 2020-10-15T15:50:55Z

@xeno6696 Looks like my tests failed miserably :-) ... but I will get to the bottom of this

xeno6696 · 2020-10-16T04:07:18Z

If you forked it to your GitHub repo, let me know so that I can play along if I have time this weekend.

agiannone · 2020-10-16T14:34:55Z

I've cobbled something together locally (attached).
Any feedback you may have would be greatly appreciated :-)

CustomPercentCodec.txt
CustomPercentCodecShould.txt

xeno6696 · 2020-10-17T21:36:48Z

I'm afraid we can't accept this at the moment. The compiler target for ESAPI is still Java 1.7, so your inclusion of java.xml.bind doesn't compile, and the 1.7 compile target rules out the typical Maven workaround of adding

    <arg>--add-modules</arg>
    <arg>java.xml.bind</arg>

That said, if @kwwall would be willing to allow us to bump support to Java 1.8 as base, we can start down that path, but it'll be a major release fix and not a point-release. If you feel this fix is more urgent, let's try and do it without java.xml.bind.

Once we're there, please make your modifications in the PercentCodec class itself so we can easily ensure that the changes don't break the unit tests in the rest of the library. Then submit a PR and we can wrap it up.

meg23 added the bug, imported, and Priority-Medium labels on Nov 13, 2014

xeno6696 self-assigned this Jul 19, 2017

xeno6696 mentioned this issue Jul 21, 2017

non-BMP characters incorrectly encoded #300

Closed

xeno6696 closed this as completed Aug 6, 2017

kwwall mentioned this issue Oct 15, 2020

PercentCodec Doesn't Handle UTF-8 percent encoding #377

Open
