Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLEntityCodec destroys 32-bit CJK (Chinese, Japanese and Korean) characters #303

Closed
meg23 opened this issue Nov 13, 2014 · 11 comments
Closed

Comments

@meg23
Copy link

meg23 commented Nov 13, 2014

From ri.j...@gmail.com on April 04, 2013 08:36:58

What steps will reproduce the problem? 1. Escape "𡘾𦴩𥻂" with org.owasp.esapi.Encoder#encodeForHTML
2. View the result in a browser What is the expected output? What do you see instead? Expected: 𡘾𦴩𥻂
Current: ������ What version of the product are you using? On what operating system? 2.0.1 on Mac OS X 10.8.3 Does this issue affect only a specified browser or set of browsers? It's the same in Chrome, Firefox and IE. Please provide any additional information below. The reason is that 32-bit characters do not fit in a Java char/Character. Here some code to illustrate it:

String s = "𡘾𦴩𥻂";
// Wrong:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
sb.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(sb); // ������

// Correct:
sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
i += Character.charCount(codePoint);
}
System.out.println(sb); // 𡘾𦴩𥻂

Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=297

@meg23
Copy link
Author

meg23 commented Nov 13, 2014

From julian.r...@googlemail.com on June 14, 2014 00:08:24

Duplicate of issue 294 , reported over a year ago. Seems this project is dead.

@xeno6696
Copy link
Collaborator

This was missing some info. Apparently the code encodes as &#xd845;&#xde3e;&#xd85b;&#xdd29;&#xd857;&#xdec2;

but then is misrendered in the application.

We're not handling the encoding as the writer expected. The root problem in our case is that we're casting ints to chars as we construct the entity map, like this:

                map.put((char)34,	"quot");	/* quotation mark */
		map.put((char)38,	"amp");		/* ampersand */
		map.put((char)60,	"lt");		/* less-than sign */

I'll experiment by just switching the cast into creating a code point but that could require a nasty refactor.

@xeno6696 xeno6696 self-assigned this Jul 19, 2017
@kwwall
Copy link
Contributor

kwwall commented Jul 20, 2017

@xeno6696 Thanks for taking this on.

@xeno6696
Copy link
Collaborator

xeno6696 commented Aug 6, 2017

This is also fixed in PR #413

@xeno6696 xeno6696 closed this as completed Aug 6, 2017
@agiannone
Copy link

agiannone commented Oct 14, 2020

I am experiencing similar input / output issue when using the PercentCodec.

Input: %E7%B1%B3%E5%9F%94%E7%94%9F%E7%89%A9%E5%A4%9A%E6%A8%A3%E6%80%A7-%E9%AD%9A%E5%A1%98-99849a10bed9
Output: ç±³å??ç??ç?©å¤?樣æ?§-é­?å¡?-99849a10bed9

It attempts to decode the character using 2 hex digits for each character, so E7, then B1, then B3 etc.
For the 32-bit characters it would need the next 3 digits together.

This might be related to #300

@xeno6696
Copy link
Collaborator

@agiannone So the groundwork to fix this is in place--it just hasn't been done for the percent encoding/decoding. I modified the API so that the preferred default would be an integer-based codec as opposed to a char-based codec, and successfully implemented the pattern on the HTML path of encoding/decoding. The rest...

...was left as an exercise for the reader! :-D

There's an extra possible wrinkle that I didn't want to pursue at the time that relates to percent encoding, specifically that there already exists two paths to encode a URI string. IIRC it's ASCII or UTF-8, where ASCII encoding will destroy strings sent to it that are UTF-8 encoded (non bmp characters are perfect for this).

It needs to be thought through what the correct behavior here should be. We might not care--if we defaulted to UTF-8 multi-byte encoding we will break with the past, BUT this might be a bogeyman...

@agiannone
Copy link

agiannone commented Oct 15, 2020

@xeno6696 After reading through #300 I gathered it would be an exercise left to the reader :-)

I went ahead and implemented a PercentCodec that extends the AbstractIntegerCodec. Still need to run tests against it though, so not sure it works yet.

@agiannone
Copy link

agiannone commented Oct 15, 2020

@xeno6696 Looks like my tests failed miserably :-) ... but I will get to the bottom of this

@xeno6696
Copy link
Collaborator

If you forked it to your github repo let me know so if I have time this weekend I can play along.

@agiannone
Copy link

agiannone commented Oct 16, 2020

I've cobbled something together locally (attached).
Any feedback you may have would be greatly appreciated :-)

CustomPercentCodec.txt
CustomPercentCodecShould.txt

@xeno6696
Copy link
Collaborator

I'm afraid at the moment we can't accept this: The compiler target for ESAPI is still Java 1.7, and your inclusion of java.xml.bind doesn't compile, and the 1.7 compile target disallows the typical response in maven to add

			            <arg>--add-modules</arg>
			            <arg>java.xml.bind</arg>

That said, if @kwwall would be willing to allow us to bump support to Java 1.8 as base, we can start down that path, but it'll be a major release fix and not a point-release. If you feel this fix is more urgent, let's try and do it without java.xml.bind.

Once we're there, please, make your modifications in the PercentCodec class so we can easily ensure that the changes don't break the unit tests in the rest of the library. Once we're there, submit a PR and then we can wrap it up.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants