
non-BMP characters incorrectly encoded #300

Closed
meg23 opened this issue Nov 13, 2014 · 31 comments

@meg23

meg23 commented Nov 13, 2014

From julian.r...@googlemail.com on March 01, 2013 11:36:39

What steps will reproduce the problem?
1. Run the test case below.

What is the expected output? What do you see instead?
The test case encodes a non-BMP character, which is internally represented in Java as two chars (a surrogate pair), yet needs to be serialized as a single HTML character reference.

What version of the product are you using? On what operating system?
2.0.1, Windows.

Please provide any additional information below.

Test:

public static void main(String[] args) {
    String test = new String (new int[]{0x2f804}, 0, 1);
    System.out.println(test + " " + test.length());
    System.out.println(ESAPI.encoder().encodeForHTML(test));
}
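For reference, a minimal standalone sketch (not part of the original report) of how Java represents U+2F804 internally: one code point stored as two UTF-16 code units, which is why length() reports 2.

    public class SurrogateDemo {
        public static void main(String[] args) {
            String s = new String(new int[]{0x2f804}, 0, 1);

            System.out.println(s.length());                       // 2 -- two chars (a surrogate pair)
            System.out.println(s.codePointCount(0, s.length()));  // 1 -- but only one code point

            // The two chars are the high and low surrogate halves of U+2F804.
            System.out.printf("U+%04X U+%04X%n", (int) s.charAt(0), (int) s.charAt(1)); // U+D87E U+DC04
        }
    }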

Note: this problem was mentioned over two years ago in http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html but apparently hasn't been fixed.

Original issue: http://code.google.com/p/owasp-esapi-java/issues/detail?id=294

@meg23
Author

meg23 commented Nov 13, 2014

From julian.r...@googlemail.com on March 04, 2013 05:38:54

test cases:

    public void testHtmlEncodeStrSurrogatePair()
    {
            String inStr = new String (new int[]{0x2f804}, 0, 1);
            String expected = "&#x2f804;";
            String result;

            result = htmlCodec.encode(EMPTY_CHAR_ARRAY, inStr);
            assertEquals(expected, result);
    }

and

    public void testHtmlDecodeHexEntititesSurrogatePair()
    {
            String expected = new String (new int[]{0x2f804}, 0, 1);
            assertEquals( expected, htmlCodec.decode("&#x2f804;") );
            assertEquals( expected, htmlCodec.decode("&#194564;") );
    }

@xeno6696
Collaborator

Even with the provided test case, this is very cryptic.

Okay, so the character in question is 你 (U+2F804).

What isn't so clear to me is why the user would expect 你 instead of 你.

@xeno6696
Collaborator

Further in the second test case,

        assertEquals( expected, htmlCodec.decode("&#194564;") );

From what I can tell, these entities also represent the same character.

Digging up the blog post referenced in this issue:

http://ainthek.blogspot.de/2010/09/orgowaspesapicodecshtmlentitycodecjava.html

Looks like @planetlevel had a small argument with this coder, and I'd like to get more context before deciding what to do with this issue. At present, this seems to be a highly esoteric use case where folks should be using Unicode if they want desired results.

@xeno6696
Collaborator

Is #303 a duplicate of this?

@xeno6696
Collaborator

Aaaah... getting into some fun esoterica around code points and the BMP.

The core problem is in Codec.java:

	/**
	 * Encode a String so that it can be safely used in a specific context.
	 * 
	 * @param immune
	 * @param input
	 * 		the String to encode
	 * @return the encoded String
	 */
	public String encode(char[] immune, String input) {
		StringBuilder sb = new StringBuilder();
		for (int i = 0; i < input.length(); i++) {
			char c = input.charAt(i);
			sb.append(encodeCharacter(immune, c));
		}
		return sb.toString();
	}

Because we use chars.
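A small sketch of what that means in practice (encodeCharacter below is a hypothetical stand-in that just emits a numeric character reference, not ESAPI's actual method): the char-based loop calls it once per surrogate half, while a code-point-based loop calls it once with the real code point.

    public class CharVsCodePointLoop {
        // Hypothetical stand-in for Codec.encodeCharacter, only to show how many
        // times it gets called and with what values.
        static String encodeCharacter(int codePoint) {
            return String.format("&#x%x;", codePoint);
        }

        public static void main(String[] args) {
            String input = new String(new int[]{0x2f804}, 0, 1);

            // char-based loop: two calls, each with an unpaired surrogate.
            StringBuilder broken = new StringBuilder();
            for (int i = 0; i < input.length(); i++) {
                broken.append(encodeCharacter(input.charAt(i)));
            }
            System.out.println(broken); // &#xd87e;&#xdc04;

            // code-point-based loop: one call with the supplementary code point.
            StringBuilder fixed = new StringBuilder();
            for (int offset = 0; offset < input.length(); ) {
                int cp = input.codePointAt(offset);
                fixed.append(encodeCharacter(cp));
                offset += Character.charCount(cp);
            }
            System.out.println(fixed); // &#x2f804;
        }
    }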

@xeno6696
Collaborator

This one's been entertaining. I'm adding methods to the core Codec class and will have to override them in the other codecs to make sure they all work. I've solved the immediate problem, however! :-)

@kwwall
Copy link
Contributor

kwwall commented Jul 30, 2017 via email

@xeno6696
Collaborator

The error was most likely an oversight: assuming UTF-8, the de facto web standard, when Java strings are actually UTF-16 internally.

I never realized the impact of that difference until working on this bug.

I think I will start working on specific issues in their own branches.

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

@kwwall -- This one is getting ugly fast. So I started out conservatively, intending to preserve as much legacy functionality as possible:

	/**
	 * Encode a String so that it can be safely used in a specific context.
	 * 
	 * @param immune
	 * @param input
	 * 		the String to encode
	 * @return the encoded String
	 */
	public String encode(int[] immune, String input) {
		StringBuilder sb = new StringBuilder();
		for(int offset  = 0; offset < input.length(); ){
			final int point = input.codePointAt(offset);
			if(Character.isBmpCodePoint(point)){
				//We can then safely cast this to char and maintain legacy behavior.
				//FIXME:  Being very conservative right now, but the final version of this should dump chars entirely for ints.
				sb.append(encodeCharacter(immune, (char) point));
			}else{
				sb.append(encodeCharacter(immune, point));	
			}
			offset += Character.charCount(point);
		}
		return sb.toString();
	}

The kicker is this... that "immune" array of chars... chars are EVERYWHERE in our encoding framework. Several utility classes were created just to help, and this gets complicated VERY quickly. The helper class CollectionsUtil.java also lacks a single unit test; at any rate, it must change as well. StringUtilities at least has unit tests.

Ideally... we'd convert all references from char to int to support the full range of internationalization implied by using code points.

But that might be too big a bite.

Do you have a preference?

We can stick with the conservative route I've started on: I've created both an int and a char version of several methods, which will spider into almost all of the codec classes... in the process, each will be able to handle non-BMP code points.

Okay, I think I've talked myself into staying conservative and growing the encoder API in the right direction.

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

Found a nasty little puzzle @jeremiahjstacey

	public void testJavaScriptEncodeChar0x100()
	{
		char in = 0x100;
		String inStr = Character.toString(in);
		String expected = "\\u0100";
		String result;

		result = javaScriptCodec.encodeCharacter(EMPTY_CHAR_ARRAY, in);
		// this should be escaped
		assertFalse(inStr.equals(result));
		assertEquals(expected, result);
	}

That first declaration of the char... I have two overloads: one takes (char[], Character) and the other takes (char[], int).

Java decides to widen the variable `in` to an int instead of autoboxing the char into Character as one would expect. I have to explicitly declare the char as a Character in order for the expected overload to be chosen. Did I mention I hate this kind of fuckery?
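A minimal sketch of the overload-resolution behavior being described (made-up class and method names): with (char[], Character) and (char[], int) overloads, a plain char argument selects the int overload, because Java prefers a widening primitive conversion over boxing. Declaring the variable as Character forces the other overload.

    public class OverloadResolutionDemo {
        static String encode(char[] immune, Character c)   { return "Character overload"; }
        static String encode(char[] immune, int codePoint) { return "int overload"; }

        public static void main(String[] args) {
            char in = 0x100;
            Character boxed = in;

            System.out.println(encode(new char[0], in));    // int overload (char widened to int)
            System.out.println(encode(new char[0], boxed)); // Character overload
        }
    }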

@kwwall
Contributor

kwwall commented Aug 2, 2017

Surely the "immune" arrays being chars isn't a problem. It's trivial to add a new static method to CollectionUtils to convert an array of chars to an array of ints and just use that. But aside from that, I'd recommend the conservative approach as well. And if you hate the autoboxing in Java, it doesn't hold a candle to C++'s auto type conversions.
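A sketch of the kind of helper Kevin describes; the method name and its placement in CollectionsUtil are assumptions, not existing API:

    /**
     * Hypothetical helper: widen an array of chars into an array of ints.
     * Each char keeps its UTF-16 code-unit value; note that non-BMP code points
     * still cannot arrive this way, since a single char cannot hold them.
     */
    public static int[] toIntArray(char[] chars) {
        int[] result = new int[chars.length];
        for (int i = 0; i < chars.length; i++) {
            result[i] = chars[i]; // implicit char -> int widening
        }
        return result;
    }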

@xeno6696
Collaborator

xeno6696 commented Aug 2, 2017

Maybe I'm missing something, Kevin, but since Java cannot represent non-BMP characters as chars at all, it means we would have to default all our handling to ints. By default, a programmer could not add a non-BMP character to an immune list, because that non-BMP character lacks a value that can be represented by a char.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

Jon Skeet gave me a schooling on overload resolution.

@kwwall, what would you say to adding, for defensive purposes, a method with signature (char[], char) that throws an IllegalArgumentException informing the caller that they must use either the Character or the int method?

Also... thinking about it, I don't see the problem with leaving the immune chars as char[], because I don't think any programming language uses characters that lie within the non-BMP range.
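A sketch of the defensive overload floated above (assumed shape and message, not committed code):

    /**
     * Guard overload: if a caller passes a bare char, overload resolution would
     * otherwise silently widen it to int. Failing fast points them at the
     * intended overloads instead.
     */
    public String encodeCharacter(char[] immune, char c) {
        throw new IllegalArgumentException(
            "Use encodeCharacter(char[], Character) or encodeCharacter(char[], int) instead.");
    }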

@planetlevel

That's the key. As far as I know, no interpreters use non-BMP for keywords or other significant syntax.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

And if they did, they would be playing in the supplementary planes, and no languages in common use ought to be using characters in that region. (Considering it's typically used for rare Chinese and other Asian scripts not covered in the BMP.)

@kwwall
Contributor

kwwall commented Aug 3, 2017

Sorry I'm running behind in responding... will try to catch up.

@xeno6696 wrote:

Maybe I'm missing something Kevin, but since Java cannot represent non-BMP characters as chars at all, it means that we would have to default all our handling to ints.

Sorry, I was confusing BMP characters with Unicode as a whole. BMP characters can all be encoded in UTF-16 as single code units, but I see now that non-BMP characters are above U+FFFF, which means they won't fit into a fixed-width 16-bit character encoding, so an int would be required.
However, given that, how does one store a non-BMP String? Doesn't String store things as char sequences under the hood? Has JDK 7 adopted support for 32-bit characters now? If not, is that not a problem unto itself, since as soon as you pick off (say) a query parameter using HttpServletRequest.getParameter("paramName") it gets retrieved as a String? So unless the String class in Java 7 and the underlying Java Servlet API 3.0.1 servlet-related classes support non-BMP characters, it seems pretty much like game over before we even get a non-BMP string passed off to an ESAPI encoder to encode.
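For what it's worth, a quick sketch of the storage question: String has always stored UTF-16 code units, so a supplementary character simply occupies two chars (a surrogate pair) and survives intact through any API that returns a String; the code-point methods added in Java 5 recover it losslessly, so it isn't game over, the encoder just has to iterate by code point.

    public class StringStorageDemo {
        public static void main(String[] args) {
            // Round-trip a supplementary character through an ordinary String.
            String s = new StringBuilder().appendCodePoint(0x2f804).toString();

            System.out.println(s.length());                            // 2 (one surrogate pair)
            System.out.println(Integer.toHexString(s.codePointAt(0))); // 2f804 -- nothing lost
        }
    }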

By default, a programmer could not add a non-BMP character to an immune list because that non-BMP character lacks a value that can be represented by a char.

Wait a moment; is that even important for encoders? That is, specifically, is there any BMP character that we actually have to NEVER encode? If there are such BMP characters then they should qualify as 'immune' and be included, but otherwise I don't think they should be allowed to do this.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@planetlevel wrote:

That's the key. As far as I know, no interpreters use non-BMP for keywords or other significant syntax.

Yeah, I think that would be unusual. My concern would be if non-BMP were allowed as part of identifiers. If so, there's a possibility that it may allow an identifier to be referenced, a function invoked, etc. I would think that allowing non-BMP in identifiers would be more the expected case than in keywords, but this is way outside my area of expertise, so there's that.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@xeno6696 wrote:

@kwwall, what would you say to adding, for defensive purposes, a method with signature (char[], char) that throws an IllegalArgumentException informing the caller that they must use either the Character or the int method?

Like I mentioned on the issue about SLF4J support, I'm okay with adding methods, but I don't think we should be changing method signatures. (Although, for the record, throwing an unchecked exception like IllegalArgumentException should be okay, since one always needs to be prepared to handle RuntimeExceptions, even if only via a top-level error handler page defined in web.xml.) If there are existing methods that these new ones are intended to replace, you should also mark the old ones as deprecated and note the intended replacement in the Javadoc.

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

Wait a moment; is that even important for encoders? That is, specifically, is there any BMP character that we actually have to NEVER encode? If there are such BMP characters then they should qualify as 'immune' and be included, but otherwise I don't think they should be allowed to do this.

The contract implied by char[] immune is a list of characters that the caller wants to exempt from encoding.

@kwwall
Contributor

kwwall commented Aug 3, 2017

@xeno6696 Two things... 1) I meant to type 'non-BMP characters' here, not BMP characters. And 2) the point is that the char[] immune array is not configurable, so it is only an implicit part of the contract, not an explicit part; e.g., a developer can't use an Encoder interface to change or pass in a different set of immune characters, nor add to it.

So I was reacting to your statement of

By default, a programmer could not add a non-BMP character to an immune list because that non-BMP character lacks a value that can be represented by a char

As in "why does that even matter?" The point is, that we don't allow altering the set of immune char array (and I don't think we want to). If someone really needs to place some special non-BMP characters there, then let them write their own replacement for DefaultEncoder either by extending it, of using it as a pattern. It's easy enough to change from the DefaultEncoder to something else in the ESAPI.properties file. In other words, unless we really have zero confidence that there are some non-BMP characters that should not be encoded, then I don't think it matters. But what I am saying that AFAIK, we will do output encoding on all non-BMP characters, won't we? That is, I'm not aware that should be "immune".

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

My point was that char[] immune implies that the programmer can define their own immunity list, and that, technically speaking, if we switched under the hood to using int, we'd be expected to provide int[] immune. However, in thinking about that, I can find no legitimate reason why we should allow anything in the non-BMP range to be immune from encoding.

I think after all of that, we're all in agreement.

@jeremiahjstacey you said in the past you found encoding discussions enlightening. Do you have any questions or comments?

@kwwall
Contributor

kwwall commented Aug 3, 2017

Right, I think I'm saying the same thing you are: don't worry about it. Because the CTOR for DefaultEncoder is private and you access it through a getInstance() method, even subclassing DefaultEncoder is not strictly possible (in the usual sense of using super()). Also, none of the methods in the Encoder interface references the 'immune' list, so since it is not exposed in any way, I don't even consider it part of the programming contract. If someone wants to add non-BMP characters to the immune list, let them start with a copy of DefaultEncoder and hack to their heart's content, and at that point let them deal with changing the 'immune' array from char[] to int[]. So I think we're on the same page there... don't feel compelled to change it to an int[].

@xeno6696
Collaborator

xeno6696 commented Aug 3, 2017

@kwwall, I completely missed that the immune characters were controlled by the DefaultEncoder.

@jeremiahjstacey --ignore that whole conversation.

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

Ugh.... updating the decoder is trickier. PushbackString works as a combined unit of iterator and business logic, and unfortunately it isn't directly unit-tested (only indirectly, via the Codec tests and the decoding tests in EncoderTest). Basically we need to convert it to use Integer instead of Character, but that ripples throughout the rest of the decoding methods.

For now I'm going to create an Integer version of PushbackString and write unit tests to validate the basic behavior of both versions; then I'll use it in the HTMLEntityCodec.

@kwwall
Contributor

kwwall commented Aug 4, 2017

Well, if you are going to need that, you may as well write a more general PushbackSequence<T>, where T can be Character, Integer, or anything else that may be useful in the future, and we can eventually ditch PushbackString once we clean up all the existing Codec classes. Just change the CTOR to accept a T[] array rather than a String. I think that would be pretty straightforward.
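A sketch of the shape being suggested (method set assumed from how PushbackString is used in the thread; the interface that actually landed may differ):

    /**
     * Hypothetical generic pushback cursor over a sequence of T
     * (T = Character for the legacy codecs, Integer for code-point-aware ones).
     */
    public interface PushbackSequence<T> {
        boolean hasNext();
        T next();           // consume and return the next element, or null when exhausted
        void pushback(T t); // make t the element returned by the next call to next()
        void mark();        // remember the current position
        void reset();       // rewind to the last mark
    }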

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

I'm going to have to update all of the codec classes as a result of this change anyway. You know what they say...

In for a penny, in for a pound!

@xeno6696
Collaborator

xeno6696 commented Aug 4, 2017

Actually... the more I think about your idea, Kevin, the more I like it. I don't really want to be on the hook for fixing codecs that might be used by nobody...

@xeno6696
Collaborator

xeno6696 commented Aug 5, 2017

@jeremiahjstacey

My final problem is in this bit of code; I think I'm missing something so simple that I'll be labelled an idiot.

package org.owasp.esapi.codecs;

public abstract class AbstractCharacterCodec extends AbstractCodec<Character> {
	/* (non-Javadoc)
	 * @see org.owasp.esapi.codecs.Codec#decode(java.lang.String)
	 */
	@Override
	public String decode(String input) {
		StringBuilder sb = new StringBuilder();
		PushbackSequence<Character> pbs = new PushbackString(input);
		while (pbs.hasNext()) {
			Character c = decodeCharacter(pbs);
			if (c != null) {
				sb.append(c);
			} else {
				sb.append(pbs.next());
			}
		}
		return sb.toString();
	}
}

The rest of my unit tests are failing because this method calls decodeCharacter(), but I need it to somehow defer to the subclass's method, if one exists.

I'm not sure what the best solution would be. I've thought about parameterizing decode to pass in the subclass, checking whether the method in question exists, executing it if it does, or delegating to the method in the next superclass.

If we could use Java 1.8... this would obviously be a great fit for java.util.function.Function. But we can't make that move yet.

That leaves either closures (I think those were introduced in 1.7) or, as a last resort... de-abstracting that decode method and... (yuck) copying that code into every Character-based codec in our library. I really don't want to do that.

@kwwall -- Not sure what you think about Java's closures. I didn't want to reach for them if you were dead set against them.
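For what it's worth, a sketch of how this kind of deferral can work without closures, assuming decodeCharacter is declared on the generic base type and the PushbackSequence<T> interface discussed above (an illustration only, not necessarily the fix that eventually landed; ExampleCharacterCodec is a made-up subclass name): the existing call in decode() then dispatches virtually to each subclass, and the usual trap is a subclass method that keeps an old concrete parameter type, which overloads rather than overrides and is never reached through the base class.

    public abstract class AbstractCodec<T> {
        // Declared on the generic base, so decode() can defer to the subclass
        // through ordinary virtual dispatch.
        public abstract T decodeCharacter(PushbackSequence<T> input);
    }

    class ExampleCharacterCodec extends AbstractCodec<Character> {
        @Override
        public Character decodeCharacter(PushbackSequence<Character> input) {
            // codec-specific decoding would go here
            return input.hasNext() ? input.next() : null;
        }

        // Pitfall: a signature like the following would be a new overload, not an
        // override, so calls made through AbstractCodec would never reach it:
        // public Character decodeCharacter(PushbackString input) { ... }
    }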

@xeno6696
Collaborator

xeno6696 commented Aug 5, 2017

Never mind... closures are Java 8, so no java.util.function.Function or closures to solve this. I'll still need an assist... all my potential solutions right now are inelegant and clumsy.

@xeno6696
Collaborator

xeno6696 commented Aug 6, 2017

We now correctly handle non-BMP markup. We also made Codec and PushbackSequence into interfaces, and we have a path forward to migrate ALL of our other codecs so they can handle non-BMP characters as well.

This was a fun journey; thanks to everyone who got involved or helped me with bugfixes. @kwwall, the PushbackSequence idea belongs to you, and thanks to @jeremiahjstacey for helping me past a tricky polymorphism bug. (It's always something small... always...)

xeno6696 closed this as completed Aug 6, 2017
@xeno6696
Collaborator

xeno6696 commented Aug 7, 2017

Stowing this away for when this issue comes up for JavaScript: https://bugs.chromium.org/p/v8/issues/detail?id=761
