Updated Nov 15, 2008 by pilgrim
Malformed UTF-8: Who said "hello%EE" can't be dangerous


Web applications frequently need to output data derived from user input, possibly enclosed within double quotes (say, inside an HTML attribute). If this data is properly HTML-escaped -- which turns each double quote into &quot; -- it should remain confined within the surrounding double quotes.

However, consider some malicious user input which ends with an incomplete UTF-8 sequence, such as the lone byte 0xDE. In UTF-8, the byte 0xDE must be followed by a valid continuation byte to form a two-byte character. If this invalid UTF-8 string is merged into an HTML template, the 0xDE byte could result in "swallowing" the next character. If the next character is a double quote meant to end an HTML attribute value, then the user input has effectively "broken out" of the quoted attribute value, and further user input could potentially be executed as markup and/or script.
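The swallowing effect can be illustrated with a deliberately lenient decoder. This is only a sketch: naive_utf8_decode is a hypothetical function written for this example (it is not any real browser's or library's decoder), and the <input> template is made up. It trusts the length announced by each lead byte and never checks that the bytes that follow are continuation bytes.

```python
def naive_utf8_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: trusts the length declared by the lead
    byte and consumes that many bytes without checking them first."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b >> 5 == 0b110:
            length = 2          # lead byte of a two-byte sequence (0xC0-0xDF)
        elif b >> 4 == 0b1110:
            length = 3          # lead byte of a three-byte sequence
        elif b >> 3 == 0b11110:
            length = 4          # lead byte of a four-byte sequence
        else:
            length = 1          # ASCII, or a byte that cannot start a sequence
        chunk = data[i:i + length]
        try:
            out.append(chunk.decode("utf-8"))
        except UnicodeDecodeError:
            # The whole chunk is discarded, including any innocent
            # ASCII byte that was consumed along with the bad lead byte.
            out.append("\ufffd")
        i += length
    return "".join(out)


user_input = b"hello\xde"   # ends with a lead byte that has no continuation
template = b'<input value="' + user_input + b'" name="q">'

print(naive_utf8_decode(template))
```

In the decoded result the double quote that was supposed to close the value attribute is gone, so everything after it is read as part of the attribute value. Python's own strict decoder refuses the same bytes with a UnicodeDecodeError instead.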

Solution

As always, paranoid input validation is a major deterrent for such vulnerabilities. In this case, you should check that each piece of user input is valid UTF-8. If there are invalid bytes, remove them or replace them with safe characters, then rerun your input validation routine on the entire input. This is a very important point: you need to ensure that the safe character you substitute for the original malformed UTF-8 byte does not lead to further vulnerabilities. In particular, if you replace invalid bytes with whitespace, you could be opening up additional security problems.
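A minimal sketch of that advice in Python 3, assuming the input arrives as raw bytes. The names is_valid_utf8, strip_invalid_utf8 and clean_input are hypothetical helpers, and validate stands in for whatever validation routine your application already has; str.isalnum is used below only as a placeholder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True only if the strict decoder accepts every byte."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def strip_invalid_utf8(data: bytes) -> bytes:
    """Remove invalid bytes outright rather than substituting a character."""
    return data.decode("utf-8", errors="ignore").encode("utf-8")


def clean_input(data: bytes, validate) -> str:
    """Strip invalid bytes, then run validation again on the whole result."""
    text = strip_invalid_utf8(data).decode("utf-8")
    if not validate(text):
        raise ValueError("input rejected after UTF-8 cleanup")
    return text


print(is_valid_utf8(b"hello\xde"))
print(clean_input(b"hello\xde", str.isalnum))
```

The validator sees the cleaned text as a whole, so anything that only becomes dangerous once the bad bytes are gone is still caught. Rejecting invalid input outright, instead of repairing it, is the simpler option when the application can afford it.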

Here is an example of how that could happen. Assume an input URL like this:

http://www.example.com/search?xss%dfonmouseover=alert%28String.fromCharCode%2888,83,83%29%29%ee&oe=shift-jis&q=a

In the HTML markup of this page, your web application takes this input URL and constructs an output URL by appending start=10 to it, like this:

<a href=INPUTURL&start=10>

Notice three things here:

  1. The %df after the xss in the input URL, which will be URL-decoded into the byte 0xDF, a UTF-8 lead byte with no continuation byte after it
  2. The %ee after the %29%29 in the input URL, which will be URL-decoded into the byte 0xEE, another UTF-8 lead byte with no continuation bytes after it
  3. The lack of double quotes around the output URL in the HTML template

Without any UTF-8 validation, the output would look like this:

<a href=http://www.example.com/search?xssnmouseover=alert(String.fromCharCode(88,83,83))oe=shift-jis&q=a&start=10>

The o after the 0xDF and the & after the 0xEE will both get swallowed, leaving a nonfunctional URL, but not a security hole. But suppose your input validation routine noticed the invalid characters in the input URL and replaced them with whitespace. Now the output URL will look like this:

<a href=http://www.example.com/search?xss onmouseover=alert(String.fromCharCode(88,83,83)) &oe=shift-jis&q=a&start=10>

Oops! Your input validation routine has actually created the XSS opportunity by splitting the URL and outputting the onmouseover string as a separate attribute. When the user moves their cursor over that link, the user-provided script will execute.
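The example can be reproduced with a short Python 3 sketch. The function names are hypothetical, and decode_replacing_with_space only imitates a validator that swaps invalid bytes for spaces: it lets the decoder mark each bad byte with U+FFFD and then turns those markers into spaces, which would also affect any legitimate U+FFFD in the input.

```python
from urllib.parse import unquote_to_bytes

# Query string of the input URL from the example above.
QUERY = ("xss%dfonmouseover=alert%28String.fromCharCode%2888,83,83%29%29%ee"
         "&oe=shift-jis&q=a")


def decode_replacing_with_space(raw: bytes) -> str:
    """Unsafe: imitates a validator that replaces invalid bytes with spaces."""
    return raw.decode("utf-8", errors="replace").replace("\ufffd", " ")


def decode_strict(raw: bytes):
    """Safer: return None for input that is not valid UTF-8."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


raw = unquote_to_bytes(QUERY)   # %df and %ee become the bytes 0xDF and 0xEE

unsafe_query = decode_replacing_with_space(raw)
unsafe_tag = ("<a href=http://www.example.com/search?"
              + unsafe_query + "&start=10>")
print(unsafe_tag)               # onmouseover is now a separate attribute

print(decode_strict(raw))       # the strict path rejects the input
```

With strict decoding the request never reaches the template, so the link is not generated at all.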


Comment by alexkon, Dec 29, 2008

Is the str.decode() method in Python guaranteed to raise a UnicodeDecodeError on any input that is invalid UTF-8? If str.decode() is secure, it could be used like this:

# unsafe_user_input is a plain old str (Python 2), not a unicode string
try:
  safe_unicode = unsafe_user_input.decode('utf-8')
except UnicodeDecodeError:
  # the input is not valid UTF-8; reject it instead of trying to repair it
  safe_unicode = None
