New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTMLEntityCodec destroys 32-bit CJK (Chinese, Japanese and Korean) characters #303
Comments
From julian.r...@googlemail.com on June 14, 2014 00:08:24 Duplicate of issue 294 , reported over a year ago. Seems this project is dead. |
This was missing some info. Apparently the code encodes as but then is misrendered in the application. We're not handling the encoding as the writer expected. The root problem in our case is that we're casting ints to chars as we construct the entity map, like this:
I'll experiment by just switching the cast into creating a code point but that could require a nasty refactor. |
@xeno6696 Thanks for taking this on. |
This is also fixed in PR #413 |
I am experiencing similar input / output issue when using the PercentCodec. Input: %E7%B1%B3%E5%9F%94%E7%94%9F%E7%89%A9%E5%A4%9A%E6%A8%A3%E6%80%A7-%E9%AD%9A%E5%A1%98-99849a10bed9 It attempts to decode the character using 2 hex digits for each character, so E7, then B1, then B3 etc. This might be related to #300 |
@agiannone So the groundwork to fix this is in place--it just hasn't been done for the percent encoding/decoding. I modified the API so that the preferred default would be an integer-based codec as opposed to a char-based codec, and successfully implemented the pattern on the HTML path of encoding/decoding. The rest... ...was left as an exercise for the reader! :-D There's an extra possible wrinkle that I didn't want to pursue at the time that relates to percent encoding, specifically that there already exists two paths to encode a URI string. IIRC it's ASCII or UTF-8, where ASCII encoding will destroy strings sent to it that are UTF-8 encoded (non bmp characters are perfect for this). It needs to be thought through what the correct behavior here should be. We might not care--if we defaulted to UTF-8 multi-byte encoding we will break with the past, BUT this might be a bogeyman... |
@xeno6696 Looks like my tests failed miserably :-) ... but I will get to the bottom of this |
If you forked it to your github repo let me know so if I have time this weekend I can play along. |
I've cobbled something together locally (attached). |
I'm afraid at the moment we can't accept this: The compiler target for ESAPI is still Java 1.7, and your inclusion of
That said, if @kwwall would be willing to allow us to bump support to Java 1.8 as base, we can start down that path, but it'll be a major release fix and not a point-release. If you feel this fix is more urgent, let's try and do it without java.xml.bind. Once we're there, please, make your modifications in the PercentCodec class so we can easily ensure that the changes don't break the unit tests in the rest of the library. Once we're there, submit a PR and then we can wrap it up. |
From ri.j...@gmail.com on April 04, 2013 08:36:58
What steps will reproduce the problem? 1. Escape "𡘾𦴩𥻂" with org.owasp.esapi.Encoder#encodeForHTML
2. View the result in a browser What is the expected output? What do you see instead? Expected: 𡘾𦴩𥻂
Current: ������ What version of the product are you using? On what operating system? 2.0.1 on Mac OS X 10.8.3 Does this issue affect only a specified browser or set of browsers? It's the same in Chrome, Firefox and IE. Please provide any additional information below. The reason is that 32-bit characters do not fit in a Java char/Character. Here some code to illustrate it:
String s = "𡘾𦴩𥻂";
// Wrong:
StringBuilder sb = new StringBuilder();
for (int i = 0; i < s.length(); i++) {
sb.append("&#x").append(Integer.toHexString(s.charAt(i))).append(';');
}
System.out.println(sb); // ������
// Correct:
sb = new StringBuilder();
for (int i = 0; i < s.length(); ) {
int codePoint = s.codePointAt(i);
sb.append("&#x").append(Integer.toHexString(codePoint)).append(';');
i += Character.charCount(codePoint);
}
System.out.println(sb); // 𡘾𦴩𥻂
Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=297
The text was updated successfully, but these errors were encountered: