|Issue 425:||HTML URL attributes with multi-byte characters may be misinterpreted|
|1 person starred this issue and may be notified of changes.||Back to list|
What steps will reproduce the problem? 1. code up a filter that scans a href attributes 2. send an utf-8 document through that filter What is the expected output? What do you see instead? Calling DecodedValueOrNull on a href that containins multi-byte characters will return NULL. I probably expected to see an html escaped single byte version of the attribute, but i'm not sure about that :) What version of the product are you using (please check X-Mod-Pagespeed header)? On what operating system? ubuntu server 11.10 Which version of Apache? apache traffic server / custom implementation Which MPM? none Please provide any additional information below, especially a URL or an HTML file that exhibits the problem. i created a custom filter to rebase documents to a new domain and remove any base tag found while doing that. it's working fine, except when there are multibyte characters found in attribute values that need to be rewritten. Currently, i have a workaround based on rewriting escaped_value() instead of DecodedValueOrNull(). Do i need to reencode the stream to a single byte character set before passing it in to ParseText (possibly creating html escapes by doing that)? Or should this be handled by pagespeed (it does see the response headers, specifiying character sets and all?)
Apr 26, 2012
While our codebase does not decode multi-byte attributes well, in general it does not need to (so far). If you are making a deep copy of DOM or something I'd just work directly with the escaped_value. In our existing filters, we've found we generally only need to decode attributes that are URLs. For the most part we have not seen limitations with this so far in our testing, but the new paranoid methodology that gave birth to DecodedValueOrNull is relatively recent, and hasn't made its way into an official mod_pagespeed release yet. You are correct that we have access to the charset from response-headers & meta-tags. However we see a lot of non-utf8 html on the web so I think to do a credible job of decoding them we'd need to cover also gb2312, iso8859, and a Russian one as well whose ID I don't recall. These different charsets all appear to work fine in mod_pagespeed because we're careful not to mess up what we don't understand :) Since in most of our current work we only need to decode URLs we have not invested in a more thorough char-set decoding infracture. Can you describe why, in your described new filter, you feel you need to decode all attributes? See also this old filter, which seems like it might be similar to what your goal is here: https://code.google.com/p/modpagespeed/source/browse/branches/1/src/net/instaweb/rewriter/base_tag_filter.cc This file is from an earlier SVN branch of our code, and in the current trunk is deleted. However, for obscure reasons related to internal testing we are likely to revive that filter soon & maybe fix some bugs in it.
Apr 26, 2012
that filter sure is similar, but i'd like to get rid of base tags since i think they cause trouble with some older browsers. i'd rather just absolutify all urls against a new base url, and drop the base tag. i also need this so i can host origins on other domains than the original one, and keep configuration as simple as possible (i do a startparse with the new domain, and to page speed, all hrefs/resource urls now are relative to the new domain). thinking about this, i probably won't really need decoded attributes for this purpose soon. But, another thought on this: is it at all possible, to disallow a resource with multibyte characters in them? Since allowing/disallowing resources is configurable at the option level, shouldn't those be matched against (src) attribute values in a normalized manner? this might be a (minor) security issue.
Apr 26, 2012
May 10, 2012
I think this is working properly, at least in trunk. If you find a specific problem, please re-open.
May 23, 2012
Summary was: multibyte characters fail to decode
Summary: HTML URL attributes with multi-byte characters may be misinterpreted
Labels: release-note Milestone-v22
|► Sign in to add a comment|