My favorites | Sign in
Project Home Wiki Issues Source
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 425: HTML URL attributes with multi-byte characters may be misinterpreted
1 person starred this issue and may be notified of changes. Back to list
Status:  Fixed
Owner:  jmara...@google.com
Closed:  May 2012


Sign in to add a comment
 
Reported by osch...@gmail.com, Apr 26, 2012
What steps will reproduce the problem?
1. code up a filter that scans a href attributes
2. send an utf-8 document through that filter

What is the expected output? What do you see instead?
Calling DecodedValueOrNull on a href that containins multi-byte characters will return NULL. I probably expected to see an html escaped single byte version of the attribute, but i'm not sure about that :)

What version of the product are you using (please check X-Mod-Pagespeed
header)?

On what operating system?
ubuntu server 11.10
Which version of Apache?
apache traffic server / custom implementation
Which MPM?
none
Please provide any additional information below, especially a URL or an
HTML file that exhibits the problem.

i created a custom filter to rebase documents to a new domain and remove any base tag found while doing that. it's working fine, except when there are multibyte characters found in attribute values that need to be rewritten. Currently, i have a workaround based on rewriting escaped_value() instead of DecodedValueOrNull(). Do i need to reencode the stream to a single byte character set before passing it in to ParseText (possibly creating html escapes by doing that)? Or should this be handled  by pagespeed (it does see the response headers, specifiying character sets and all?)
Apr 26, 2012
Project Member #1 jmara...@google.com
While our codebase does not decode multi-byte attributes well, in general it does not need to (so far).

If you are making a deep copy of DOM or something I'd just work directly with the escaped_value.  In our existing filters, we've found we generally only need to decode attributes that are URLs.  For the most part we have not seen limitations with this so far in our testing, but the new paranoid methodology that gave birth to DecodedValueOrNull is relatively recent, and hasn't made its way into an official mod_pagespeed release yet.

You are correct that we have access to the charset from response-headers & meta-tags.  However we see a lot of non-utf8 html on the web so I think to do a credible job of decoding them we'd need to cover also gb2312, iso8859, and a Russian one as well whose ID I don't recall.  These different charsets all appear to work fine in mod_pagespeed because we're careful not to mess up what we don't understand :)

Since in most of our current work we only need to decode URLs we have not invested in a more thorough char-set decoding infracture.

Can you describe why, in your described new filter, you feel you need to decode all attributes?


See also this old filter, which seems like it might be similar to what your goal is here: https://code.google.com/p/modpagespeed/source/browse/branches/1/src/net/instaweb/rewriter/base_tag_filter.cc

This file is from an earlier SVN branch of our code, and in the current trunk is deleted.  However, for obscure reasons related to internal testing we are likely to revive that filter soon & maybe fix some bugs in it.
Owner: jmara...@google.com
Apr 26, 2012
#2 osch...@gmail.com
that filter sure is similar, but i'd like to get rid of base tags since i think they cause trouble with some older browsers. i'd rather just absolutify all urls against a new base url, and drop the base tag. i also need this so i can host origins on other domains than the original one, and keep configuration as simple as possible (i do a startparse with the new domain, and to page speed, all hrefs/resource urls now are relative to the new domain). 

thinking about this, i probably won't really need decoded attributes for this purpose soon. 

But, another thought on this: is it at all possible, to disallow a resource with multibyte characters in them?
Since allowing/disallowing resources is configurable at the option level, shouldn't those be matched against (src) attribute values in a normalized manner? this might be  a (minor) security issue.

Apr 26, 2012
Project Member #3 jmara...@google.com
RE re-hosting without base-tag: be careful of absolute references in Javascript.  We currently have no visibility into URLs from Javascript.

RE multi-byte characters in src attribute values: we will not consider these to be rewritable resources.  This is because our pattern (which probably needs to be factored a little) is (from cache_extender.cc)
    ResourcePtr input_resource(CreateInputResource(href->DecodedValueOrNull()));
    if (input_resource.get() == NULL) {
      return;
    }
Where CommonFilter::CreateInputResource has:
  ResourcePtr resource;
  if (!input_url.empty()) {
    ...
  }
  return resource;

So any resources with src= or href= attributes that have multi-byte characters will get decoded as NULL and fail this test: we'll leave them alone.


May 10, 2012
Project Member #4 jmara...@google.com
I think this is working properly, at least in trunk.  If you find a specific problem, please re-open.

Status: Fixed
May 23, 2012
Project Member #5 jmara...@google.com
Summary was: multibyte characters fail to decode 

Summary: HTML URL attributes with multi-byte characters may be misinterpreted
Labels: release-note Milestone-v22
Sign in to add a comment

Powered by Google Project Hosting