
non-BMP characters incorrectly encoded #300

Closed
meg23 opened this issue Nov 13, 2014 · 31 comments

@meg23

meg23 commented Nov 13, 2014

From julian.r...@googlemail.com on March 01, 2013 11:36:39

What steps will reproduce the problem?
1. Run the test case below.

What is the expected output? What do you see instead?
The test case encodes a non-BMP character, which is internally represented in Java as two chars (a surrogate pair), yet needs to be serialized as a single HTML character reference.

What version of the product are you using? On what operating system?
2.0.1, Windows.

Please provide any additional information below.

Test:

public static void main(String[] args) {
    String test = new String (new int[]{0x2f804}, 0, 1);
    System.out.println(test + " " + test.length());
    System.out.println(ESAPI.encoder().encodeForHTML(test));
}
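For reference, a minimal standalone sketch (not part of the original report) of how Java represents U+2F804 internally: one code point stored as two UTF-16 code units, which is why length() reports 2.

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = new String(new int[]{0x2f804}, 0, 1);

            System.out.println(s.length());                       // 2 -- two chars (a surrogate pair)
            System.out.println(s.codePointCount(0, s.length()));  // 1 -- but only one code point

            // The two chars are the high and low surrogate halves of U+2F804.
            System.out.printf("U+%04X U+%04X%n", (int) s.charAt(0), (int) s.charAt(1)); // U+D87E U+DC04
        }
    }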

Note: this problem was mentioned over two years ago in http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html but apparently hasn't been fixed.

Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=294

@meg23
Author

meg23 commented Nov 13, 2014

From julian.r...@googlemail.com on March 04, 2013 05:38:54

test cases:

    public void testHtmlEncodeStrSurrogatePair()
    {
            String inStr = new String (new int[]{0x2f804}, 0, 1);
            String expected = "&#x2f804;";
            String result;

            result = htmlCodec.encode(EMPTY_CHAR_ARRAY, inStr);
            assertEquals(expected, result);
    }

and

    public void testHtmlDecodeHexEntititesSurrogatePair()
    {
            String expected = new String (new int[]{0x2f804}, 0, 1);
            assertEquals( expected, htmlCodec.decode("&#x2f804;") );
            assertEquals( expected, htmlCodec.decode("&#194564;") );
    }

@xeno6696
Collaborator

Even with the provided test case, this is very cryptic.

Okay, so the character in question is 你 (U+2F804).

What isn't so clear to me is why the user would expect 你 instead of 你.

@xeno6696
Collaborator

Further in the second test case,

        assertEquals( expected, htmlCodec.decode("&#194564;") );

From what I can tell, these entities also represent the same character.

Digging up the blog post referenced in this issue:

http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html

Looks like @planetlevel had a small argument with this coder, and I'd like to get more context before deciding what to do with this issue. At present, this seems to be a highly esoteric use case where folks should be using Unicode if they want desired results.

@xeno6696
Collaborator

Is #303 a duplicate of this?

@xeno6696
Collaborator

Aaaah... getting into some fun esoterica around code points and the BMP.

The core problem is in Codec.java:

	/**
	 * Encode a String so that it can be safely used in a specific context.
	 * 
	 * @param immune
	 * @param input
	 * 		the String to encode
	 * @return the encoded String
	 */
	public String encode(char[] immune, String input) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < input.length(); i++) {
			char c = input.charAt(i);
			sb.append(encodeCharacter(immune, c));
		}
		return sb.toString();
	}

Because we use chars.
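A small sketch of what that means in practice (encodeCharacter below is a hypothetical stand-in that just emits a numeric character reference, not ESAPI's actual method): the char-based loop calls it once per surrogate half, while a code-point-based loop calls it once with the real code point.

    public class CharVsCodePointLoop {
        // Hypothetical stand-in for Codec.encodeCharacter, only to show how many
        // times it gets called and with what values.
        static String encodeCharacter(int codePoint) {
            return String.format("&#x%x;", codePoint);
        }

        public static void main(String[] args) {
            String input = new String(new int[]{0x2f804}, 0, 1);

            // char-based loop: two calls, each with an unpaired surrogate.
            StringBuilder broken = new StringBuilder();
            for (int i = 0; i < input.length(); i++) {
                broken.append(encodeCharacter(input.charAt(i)));
            }
            System.out.println(broken); // &#xd87e;&#xdc04;

            // code-point-based loop: one call with the supplementary code point.
            StringBuilder fixed = new StringBuilder();
            for (int offset = 0; offset < input.length(); ) {
                int cp = input.codePointAt(offset);
                fixed.append(encodeCharacter(cp));
                offset += Character.charCount(cp);
            }
            System.out.println(fixed); // &#x2f804;
        }
    }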

@xeno6696
Collaborator

This one's been entertaining. I'm adding methods to the core Codec class and will have to override them in the other codecs to make sure they all work. I've solved the immediate problem, however! :-)

@kwwall
Copy link
Contributor

kwwall commented Jul 30, 2017 via email

@xeno6696
Collaborator

The error was most likely an oversight: assuming UTF-8, the de facto web standard, when Java strings are actually UTF-16 internally.

I never realized the impact of that difference until working on this bug.

I think I will start working on specific issues in their own branches.

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

@kwwall -- This one is getting ugly fast. So I started out conservatively, intending to preserve as much legacy functionality as possible:

	/**
	 * Encode a String so that it can be safely used in a specific context.
	 * 
	 * @param immune
	 * @param input
	 * 		the String to encode
	 * @return the encoded String
	 */
	public String encode(int[] immune, String input) {
		StringBuilder sb = new StringBuilder();
		for(int offset  = 0; offset < input.length(); ){
			final int point = input.codePointAt(offset);
			if(Character.isBmpCodePoint(point)){
				//We can then safely cast this to char and maintain legacy behavior.
				//FIXME:  Being very conservative right now, but the final version of this should dump chars entirely for ints.
				sb.append(encodeCharacter(immune, (char) point));
			}else{
				sb.append(encodeCharacter(immune, point));	
			}
			offset += Character.charCount(point);
		}
		return sb.toString();
	}

The kicker is this... that "immune" array of chars... chars are EVERYWHERE in our encoding framework. Several utility classes were created just to help, and this gets complicated VERY quickly. The helper class CollectionsUtil.java also lacks a single unit test; at any rate, it must change as well. StringUtilities at least has unit tests.

Ideally... we'd convert all references from char to int to support the full range of internationalization implied by using code points.

But that might be too big a bite.

Do you have a preference?

We can stick with the conservative route I've started on: I've created both an int and a char version of several methods, which will spider into almost all of the codec classes... in the process, each will be able to handle non-BMP code points.

Okay, I think I've talked myself into staying conservative and growing the encoder API in the right direction.

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

Found a nasty little puzzle @jeremiahjstacey

	public void testJavaScriptEncodeChar0x100()
	{
		char in = 0x100;
		String inStr = Character.toString(in);
		String expected = "\\u0100";
		String result;

		result = javaScriptCodec.encodeCharacter(EMPTY_CHAR_ARRAY, in);
		// this should be escaped
		assertFalse(inStr.equals(result));
		assertEquals(expected, result);
	}

That first declaration of the char... I have two overloads: one takes (char[], Character) and the other takes (char[], int).

Java decides to widen the variable `in` to an int instead of autoboxing the char into Character as one would expect. I have to explicitly declare the char as a Character in order for the expected overload to be chosen. Did I mention I hate this kind of fuckery?
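A minimal sketch of the overload-resolution behavior being described (made-up class and method names): with (char[], Character) and (char[], int) overloads, a plain char argument selects the int overload, because Java prefers a widening primitive conversion over boxing. Declaring the variable as Character forces the other overload.

    public class OverloadResolutionDemo {
        static String encode(char[] immune, Character c)   { return "Character overload"; }
        static String encode(char[] immune, int codePoint) { return "int overload"; }

        public static void main(String[] args) {
            char in = 0x100;
            Character boxed = in;

            System.out.println(encode(new char[0], in));    // int overload (char widened to int)
            System.out.println(encode(new char[0], boxed)); // Character overload
        }
    }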

@kwwall
Contributor

kwwall commented Aug 2, 2017

Surely the "immune" arrays being chars isn't a problem. It's trivial to add a new static method to CollectionUtils to convert an array of chars to an array of ints and just use that. But aside from that, I'd recommend the conservative approach as well. And if you hate the autoboxing in Java, it doesn't hold a candle to C++'s auto type conversions.
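A sketch of the kind of helper Kevin describes; the method name and its placement in CollectionsUtil are assumptions, not existing API:

    /**
     * Hypothetical helper: widen an array of chars into an array of ints.
     * Each char keeps its UTF-16 code-unit value; note that non-BMP code points
     * still cannot arrive this way, since a single char cannot hold them.
     */
    public static int[] toIntArray(char[] chars) {
        int[] result = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = chars[i]; // implicit char -> int widening
        }
        return result;
    }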

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

Maybe I'm missing something, Kevin, but since Java cannot represent non-BMP characters as chars at all, it means we would have to default all our handling to ints. By default, a programmer could not add a non-BMP character to an immune list, because that non-BMP character lacks a value that can be represented by a char.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

Jon Skeet gave me a schooling on overload resolution.

@kwwall, what would you say to adding, for defensive purposes, a method with signature (char[], char) that throws an IllegalArgumentException informing the caller that they must use either the Character or the int method?

Also... thinking about it, I don't see the problem with leaving the immune chars as char[], because I don't think any programming language uses characters that lie within the non-BMP range.
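A sketch of the defensive overload floated above (assumed shape and message, not committed code):

    /**
     * Guard overload: if a caller passes a bare char, overload resolution would
     * otherwise silently widen it to int. Failing fast points them at the
     * intended overloads instead.
     */
    public String encodeCharacter(char[] immune, char c) {
        throw new IllegalArgumentException(
            "Use encodeCharacter(char[], Character) or encodeCharacter(char[], int) instead.");
    }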

@planetlevel

That's the key. As far as I know, no interpreters use non-BMP for keywords or other significant syntax.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

And if they did, they would be playing in the supplementary planes, and no languages in common use ought to be using characters in that region. (Considering it's typically used for rare Chinese and other Asian scripts not covered in the BMP.)

@kwwall
Contributor

kwwall commented Aug 3, 2017

Sorry I'm running behind in responding... will try to catch up.

@xeno6696 wrote:

Maybe I'm missing something Kevin, but since Java cannot represent non-BMP characters as chars at all, it means that we would have to default all our handling to ints.

Sorry, I was confusing BMP characters with Unicode as a whole. BMP characters can all be encoded in UTF-16 as single code units, but I see now that non-BMP characters are above U+FFFF, which means they won't fit into a fixed-width 16-bit character encoding, so an int would be required.
However, given that, how does one store a non-BMP String? Doesn't String store things as char sequences under the hood? Has JDK 7 adopted support for 32-bit characters now? If not, is that not a problem unto itself, since as soon as you pick off (say) a query parameter using HttpServletRequest.getParameter("paramName") it gets retrieved as a String? So unless the String class in Java 7 and the underlying Java Servlet API 3.0.1 servlet-related classes support non-BMP characters, it seems pretty much like game over before we even get a non-BMP string passed off to an ESAPI encoder to encode.
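For what it's worth, a quick sketch of the storage question: String has always stored UTF-16 code units, so a supplementary character simply occupies two chars (a surrogate pair) and survives intact through any API that returns a String; the code-point methods added in Java 5 recover it losslessly, so it isn't game over, the encoder just has to iterate by code point.

    public class StringStorageDemo {
        public static void main(String[] args) {
            // Round-trip a supplementary character through an ordinary String.
            String s = new StringBuilder().appendCodePoint(0x2f804).toString();

            System.out.println(s.length());                            // 2 (one surrogate pair)
            System.out.println(Integer.toHexString(s.codePointAt(0))); // 2f804 -- nothing lost
        }
    }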

By default, a programmer could not add a non-BMP character to an immune list because that non-BMP character lacks a value that can be represented by a char.

Wait a moment; is that even important for encoders? That is, specifically, is there any BMP character that we actually have to NEVER encode? If there are such BMP characters then they should qualify as 'immune' and be included, but otherwise I don't think they should be allowed to do this.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@planetlevel wrote:

That's the key. As far as I know, no interpreters use non-BMP for keywords or other significant syntax.

Yeah, I think that would be unusual. My concern would be if non-BMP were allowed as part of identifiers. If so, there's a possibility that it may allow an identifier to be referenced, a function invoked, etc. I would think that allowing non-BMP in identifiers would be more the expected case than in keywords, but this is way outside my area of expertise, so there's that.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@xeno6696 wrote:

@kwwall, what would you say to adding, for defensive purposes, a method with signature (char[], char) that throws an IllegalArgumentException informing the caller that they must use either the Character or the int method?

Like I mentioned on the issue about SLF4J support, I'm okay with adding methods, but I don't think we should be changing method signatures. (Although, for the record, throwing an unchecked exception like IllegalArgumentException should be okay, since one always needs to be prepared to handle RuntimeExceptions, even if only via a top-level error handler page defined in web.xml.) If there are existing methods that these new ones are intended to replace, you should also mark the old ones as deprecated and note the intended replacement in the Javadoc.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

Wait a moment; is that even important for encoders? That is, specifically, is there any BMP character that we actually have to NEVER encode? If there are such BMP characters then they should qualify as 'immune' and be included, but otherwise I don't think they should be allowed to do this.

The contract implied by char[] immune is a list of characters that the caller wants to exempt from encoding.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@xeno6696 Two things... 1) I meant to type 'non-BMP characters' here, not BMP characters. And 2) the point is that the char[] immune array is not configurable, so it is only an implicit part of the contract, not an explicit part; e.g., a developer can't use an Encoder interface to change or pass in a different set of immune characters, nor add to it.

So I was reacting to your statement of

By default, a programmer could not add a non-BMP character to an immune list because that non-BMP character lacks a value that can be represented by a char

As in "why does that even matter?" The point is, that we don't allow altering the set of immune char array (and I don't think we want to). If someone really needs to place some special non-BMP characters there, then let them write their own replacement for DefaultEncoder either by extending it, of using it as a pattern. It's easy enough to change from the DefaultEncoder to something else in the ESAPI.properties file. In other words, unless we really have zero confidence that there are some non-BMP characters that should not be encoded, then I don't think it matters. But what I am saying that AFAIK, we will do output encoding on all non-BMP characters, won't we? That is, I'm not aware that should be "immune".

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

My point was that char[] immune implies that the programmer can define their own immunity list, and that, technically speaking, if we switched under the hood to using int, we'd be expected to provide int[] immune. However, in thinking about that, I can find no legitimate reason why we should allow anything in the non-BMP range to be immune from encoding.

I think after all of that, we're all in agreement.

@jeremiahjstacey you said in the past you found encoding discussions enlightening. Do you have any questions or comments?

@kwwall
Contributor

kwwall commented Aug 3, 2017

Right, I think I'm saying the same thing you are: don't worry about it. Because the CTOR for DefaultEncoder is private and you access it through a getInstance() method, even subclassing DefaultEncoder is not strictly possible (in the usual sense of using super()). Also, none of the methods in the Encoder interface references the 'immune' list, so since it is not exposed in any way, I don't even consider it part of the programming contract. If someone wants to add non-BMP characters to the immune list, let them start with a copy of DefaultEncoder and hack to their heart's content, and at that point let them deal with changing the 'immune' array from char[] to int[]. So I think we're on the same page there... don't feel compelled to change it to an int[].

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

@kwwall, I completely missed that the immune characters were controlled by the DefaultEncoder.

@jeremiahjstacey --ignore that whole conversation.

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

Ugh.... updating the decoder is trickier. PushbackString works as a combined unit of iterator and business logic, and unfortunately it isn't directly unit-tested (only indirectly, via the Codec tests and the decoding tests in EncoderTest). Basically we need to convert it to use Integer instead of Character, but that ripples throughout the rest of the decoding methods.

For now I'm going to create an Integer version of PushbackString and write unit tests to validate the basic behavior of both versions; then I'll use it in the HTMLEntityCodec.

@kwwall
Contributor

kwwall commented Aug 4, 2017

Well, if you are going to need that, you may as well write a more general PushbackSequence<T>, where T can be Character, Integer, or anything else that may be useful in the future, and we can eventually ditch PushbackString once we clean up all the existing Codec classes. Just change the CTOR to accept a T[] array rather than a String. I think that would be pretty straightforward.
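A sketch of the shape being suggested (method set assumed from how PushbackString is used in the thread; the interface that actually landed may differ):

    /**
     * Hypothetical generic pushback cursor over a sequence of T
     * (T = Character for the legacy codecs, Integer for code-point-aware ones).
     */
    public interface PushbackSequence<T> {
        boolean hasNext();
        T next();           // consume and return the next element, or null when exhausted
        void pushback(T t); // make t the element returned by the next call to next()
        void mark();        // remember the current position
        void reset();       // rewind to the last mark
    }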

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

I'm going to have to update all of the codec classes as a result of this change anyway. You know what they say...

In for a penny, in for a pound!

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

Actually... the more I think about your idea, Kevin, the more I like it. I don't really want to be on the hook for fixing codecs that might be used by nobody...

@xeno6696
Collaborator

xeno6696 commented Aug 5, 2017

@jeremiahjstacey

My final problem is in this bit of code; I think I'm missing something so simple that I'll be labelled an idiot.

package org.owasp.esapi.codecs;

public abstract class AbstractCharacterCodec extends AbstractCodec<Character> {
	/* (non-Javadoc)
	 * @see org.owasp.esapi.codecs.Codec#decode(java.lang.String)
	 */
	@Override
	public String decode(String input) {
		StringBuilder sb = new StringBuilder();
		PushbackSequence<Character> pbs = new PushbackString(input);
		while (pbs.hasNext()) {
			Character c = decodeCharacter(pbs);
			if (c != null) {
				sb.append(c);
			} else {
				sb.append(pbs.next());
			}
		}
		return sb.toString();
	}
}

The rest of my unit tests are failing because this method calls decodeCharacter(), but I need it to somehow defer to the subclass's method, if one exists.

I'm not sure what the best solution would be. I've thought about parameterizing decode to pass in the subclass, checking whether the method in question exists, executing it if it does, or delegating to the method in the next superclass.

If we could use Java 1.8... this would obviously be a great fit for java.util.function.Function. But we can't make that move yet.

That leaves either closures (I think those were introduced in 1.7) or, as a last resort... de-abstracting that decode method and... (yuck) copying that code into every Character-based codec in our library. I really don't want to do that.

@kwwall -- Not sure what you think about Java's closures. I didn't want to reach for them if you were dead set against them.
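For what it's worth, a sketch of how this kind of deferral can work without closures, assuming decodeCharacter is declared on the generic base type and the PushbackSequence<T> interface discussed above (an illustration only, not necessarily the fix that eventually landed; ExampleCharacterCodec is a made-up subclass name): the existing call in decode() then dispatches virtually to each subclass, and the usual trap is a subclass method that keeps an old concrete parameter type, which overloads rather than overrides and is never reached through the base class.

    public abstract class AbstractCodec<T> {
        // Declared on the generic base, so decode() can defer to the subclass
        // through ordinary virtual dispatch.
        public abstract T decodeCharacter(PushbackSequence<T> input);
    }

    class ExampleCharacterCodec extends AbstractCodec<Character> {
        @Override
        public Character decodeCharacter(PushbackSequence<Character> input) {
            // codec-specific decoding would go here
            return input.hasNext() ? input.next() : null;
        }

        // Pitfall: a signature like the following would be a new overload, not an
        // override, so calls made through AbstractCodec would never reach it:
        // public Character decodeCharacter(PushbackString input) { ... }
    }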

@xeno6696
Collaborator

xeno6696 commented Aug 5, 2017

Never mind... closures are Java 8, so no java.util.function.Function or closures to solve this. I'll still need an assist... all my potential solutions right now are inelegant and clumsy.

@xeno6696
Collaborator

xeno6696 commented Aug 6, 2017

We now correctly handle non-BMP markup. We also made Codec and PushbackSequence into interfaces, and we have a path forward to migrate ALL of our other codecs so they can handle non-BMP characters as well.

This was a fun journey; thanks to everyone who got involved or helped me with bugfixes. @kwwall, the PushbackSequence idea belongs to you, and thanks to @jeremiahjstacey for helping me past a tricky polymorphism bug. (It's always something small... always...)

xeno6696 closed this as completed Aug 6, 2017
@xeno6696
Collaborator

xeno6696 commented Aug 7, 2017

Stowing this away for when this issue comes up for JavaScript: https://bugs.chromium.org/p/v8/issues/detail?id=761
