New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
non-BMP characters incorrectly encoded #300
Comments
From julian.r...@googlemail.com on March 04, 2013 05:38:54 test cases:
and
|
Even with the provided test case, this is very cryptic. Okay, so the character in question is 你 What isn't so clear to me, is why the user would expect |
Further in the second test case,
From what I can tell, these entities also represent the same character. Digging up the blog post from this issue #, http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html Looks like @planetlevel had a small argument with this coder, and I'd like to get more context before deciding what to do with this issue. At present, this seems to be a highly esoteric use case where folks should be using Unicode if they want desired results. |
Is #303 a duplicate of this? |
Aaaah... getting into some fun esoterism in code points and BMP. The core problem is in Codec.java:
Because we use chars. |
This one's been entertaining. Adding methods to the core Codec class and will have to override them in the other codecs to make sure they all work. I've solved the immediate problem however! :-) |
My guess is maybe the whole codec / validation / output encoding components
were apparently designed assuming that devs would only be using ASCII or
various extended-ASCII character sets. At least it was probably never
tested with much else.
…-kevin
--
Blog: http://off-the-wall-security.blogspot.com/ | Twitter: @KevinWWall
NSA: All your crypto bit are belong to us.
On Jul 30, 2017 19:11, "Matt Seil" ***@***.***> wrote:
This one's been entertaining. Adding methods to the core Codec class and
will have to override them in the other codecs to make sure they all work.
I've solved the immediate problem however! :-)
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#300 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AB3nmwylzO3-M1hUU9UBuOwhQb08xgtCks5sTQ2RgaJpZM4C6zDo>
.
|
The error was most likely an oversight in assuming UTF-8 as a web standard, when Java mistakenly asserted UTF-16. I never realized the impact of that difference until working this bug. I think I will start working specific issues in their own branches. |
@kwwall -- This one is getting ugly fast. So I started out conservatively, intending on preserving as much legacy functionality as possible:
The kicker is this... that "immune" array of chars... chars are EVERYWHERE in our encoding framework. Several utility classes were created just to help, and this gets complicated VERY quickly. The helper class Ideally... we convert all references of But that might be too big a bite. Do you have a preference? We can stick with the conservative route, as I have started. I've created both an int and a char version of several methods, which will spider into almost all the codec classes... in the process they will each be able to handle non-BMP codePoints. Okay, I think I've talked myself into staying conservative, and growing the encoder API into the right direction. |
Found a nasty little puzzle @jeremiahjstacey
That first declaration for the char... I have two methods, one is Java decides to upcast the variable |
Surely the "immune" arrays being chars isn't a problem. It's trivial to add a new static method to |
Maybe I'm missing something Kevin, but since Java cannot represent non-BMP characters as chars at all, it means that we would have to default all our handling to ints. By default, a programmer could not add a non-BMP character to an immune list because that non-BMP character lacks a value that can be represented by a char. |
Jon Skeet gave me a schooling on overload resolution. @kwwall what would you say for defensive purposes, we add a method with signature Also... when thinking about it, I don't see the problem with leaving the immune chars as |
That's the key. As far as I know, no interpreters use non-BMP for keywords or other significant syntax. |
And if they did, they would be playing in the UTF extensions field, and no languages in common use ought to be using characters in that region. (Considering its typically used for non-standard Chinese and other asian scripts not covered by the UTF spec.) |
Sorry I'm running behind in responding... will try to catch up. @xeno6696 wrote:
Sorry, I was confusing BMP characters with Unicode, which can all be encoded with UTF-16, but I see now that BMP are above U+FFFF wihch means the won't fit into a fixed-width 16-bit character encoding so an int would be required.
Wait a moment; is that even important for encoders? That is specifically, is there any BMP charater that we have to actually NEVER encode? If there are such BMP characters then they should qualify as 'immune' and be included, but otherwise I don't think they should be allowed to do this. |
@planetlevel wrote:
Yeah, I think that would be unusual. My concern would be if non-BMP was allowed allowed as part of identifiers. If so, there's a possibility that it may allow an identifier to be referenced, a function invoked, etc. I would think that allowing non-BMP for identifiers would be more the expected case than keywords but this is way outside my scope of expertise, so there's that. |
@xeno6696 wrote:
Like I mentioned on the issue about SLF4J support, I'm okay with adding methods, but I don't think we should be changing method signatures. (Although, for the record, throwing an unchecked exception like |
The contract implied by |
@xeno6696 Two things... 1) I meant to type 'non-BMP characters' here, not BMP characters. And 2) is the point is that the So I was reacting to your statement of
As in "why does that even matter?" The point is, that we don't allow altering the set of immune char array (and I don't think we want to). If someone really needs to place some special non-BMP characters there, then let them write their own replacement for |
My point was that I think after all of that, we're all in agreement. @jeremiahjstacey you said in the past you found encoding discussions enlightening. Do you have any questions or comments? |
Right, think I'm saying as you are to worry about it. Because the CTOR for |
@kwwall, I completely missed that the immune characters were controlled by the DefaultEncoder. @jeremiahjstacey --ignore that whole conversation. |
Ugh.... updating the decoder is trickier. The For now I'm going to create an Integer version of |
Well, if you are going to need that, you may has well write a more general |
I'm going to have to update all of the codec classes as a result of this change anyway. You know what they say...
|
Actually... the more I think about your idea, Kevin, I like that idea. I don't really want to be on the hook for fixing codecs that might be used by nobody... |
My final problem is in this bit of code, I think I'm missing something so simple that I'd be labelled an idiot.
The rest of my unit tests are failing because this method calls I'm not sure what the best solution would be. I've thought about parameterizing If we could use Java 1.8... obviously this would be great for a That leaves either closures (I think those were introduced in 1.7) or as a last resort... de-abstracting that decode method and... (yuck) copying that code to every @kwwall -- Not sure what you think about Java's closures. I didn't want to reach for them if you were deadset against them. |
Nevermind... Closures are Java 8. So no |
We now correctly handle non-BMP markup. We also made Codec and PushbackSequence interfaces, and have a path moving forward to migrate ALL of our other codecs to also be able to handle non-BMP characters as well. This was a fun journey, gotta thank everyone who got involved or helped me with bugfixes. @kwwall the PushbackSequence idea belongs to you, and @jeremiahjstacey for helping me past a tricky polymorphism bug. (It's always something small... always...) |
Stowing this away for when this issue comes up for JavaScript: https://bugs.chromium.org/p/v8/issues/detail?id=761 |
Issue #300 -- Fixing ESAPI's inability to handle non-BMP codepoints.
From julian.r...@googlemail.com on March 01, 2013 11:36:39
What steps will reproduce the problem? 1. run test case What is the expected output? What do you see instead? The test case encodes a non-BMP character, which internally is represented in java as two chracters, yet needs to be serialized as a single HTML character. What version of the product are you using? On what operating system? 2.0.1, win Please provide any additional information below. Test:
Note: this problem has been mentioned over two years ago in http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html but apparently hasn't been fixed.
Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=294
The text was updated successfully, but these errors were encountered: