This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

HTML URL attributes with multi-byte characters may be misinterpreted #425

Closed
GoogleCodeExporter opened this issue Apr 6, 2015 · 5 comments

Comments


What steps will reproduce the problem?
1. Code up a filter that scans the href attributes of <a> tags.
2. Send a UTF-8 document through that filter.

What is the expected output? What do you see instead?
Calling DecodedValueOrNull on an href that contains multi-byte characters will 
return NULL. I probably expected to see an HTML-escaped single-byte version of 
the attribute, but I'm not sure about that :)
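The reported failure mode can be reproduced without mod_pagespeed itself: a decoder that only accepts single-byte (ASCII) attribute values will reject any UTF-8 value containing a multi-byte sequence. A minimal self-contained sketch of such a conservative decoder (the strict-ASCII policy and the standalone function are assumptions for illustration, not mod_pagespeed's actual implementation):

```cpp
#include <optional>
#include <string>

// Mimics a conservative attribute decoder: returns the value only when
// every byte is plain ASCII, and std::nullopt (the analogue of NULL)
// otherwise. Hypothetical helper for illustration only.
std::optional<std::string> DecodedValueOrNull(const std::string& raw) {
  for (unsigned char c : raw) {
    if (c >= 0x80) {
      return std::nullopt;  // UTF-8 lead/continuation byte: refuse to decode
    }
  }
  return raw;
}
```

Under this policy an href such as "/caf\xC3\xA9.html" ("café" in UTF-8) decodes to null, which matches the behavior described above.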

What version of the product are you using (please check X-Mod-Pagespeed
header)?

On what operating system?
ubuntu server 11.10
Which version of Apache?
apache traffic server / custom implementation
Which MPM?
none
Please provide any additional information below, especially a URL or an
HTML file that exhibits the problem.

I created a custom filter to rebase documents to a new domain and remove any 
base tag found while doing that. It's working fine, except when there are 
multi-byte characters in attribute values that need to be rewritten. 
Currently, I have a workaround based on rewriting escaped_value() instead of 
DecodedValueOrNull(). Do I need to re-encode the stream to a single-byte 
character set before passing it in to ParseText (possibly creating HTML escapes 
by doing that)? Or should this be handled by pagespeed (it does see the 
response headers specifying character sets, after all)?

Original issue reported on code.google.com by osch...@gmail.com on 26 Apr 2012 at 8:08


While our codebase does not decode multi-byte attributes well, in general it 
does not need to (so far).

If you are making a deep copy of DOM or something I'd just work directly with 
the escaped_value.  In our existing filters, we've found we generally only need 
to decode attributes that are URLs.  For the most part we have not seen 
limitations with this so far in our testing, but the new paranoid methodology 
that gave birth to DecodedValueOrNull is relatively recent, and hasn't made its 
way into an official mod_pagespeed release yet.

You are correct that we have access to the charset from response headers and 
meta tags. However, we see a lot of non-UTF-8 HTML on the web, so I think to do 
a credible job of decoding we'd also need to cover GB2312, ISO-8859, and a 
Russian one as well whose ID I don't recall. These different charsets all 
appear to work fine in mod_pagespeed because we're careful not to mess up what 
we don't understand :)
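The ambiguity behind this caution is concrete: the same byte means different things in different charsets. Byte 0xE9, for example, is a complete character ('é') in ISO-8859-1, but in UTF-8 it is the lead byte of a three-byte sequence. A small illustrative classifier for UTF-8 lead bytes (not mod_pagespeed code):

```cpp
#include <cstdint>

// Returns the expected length of the UTF-8 sequence starting with `lead`,
// or 0 if `lead` is a continuation byte or invalid as a lead byte.
int Utf8SequenceLength(uint8_t lead) {
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;  // 10xxxxxx continuation byte, or invalid
}
```

A decoder that assumed the wrong charset would either truncate or mangle such bytes, which is why leaving undecipherable values untouched is the safe default.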

Since in most of our current work we only need to decode URLs, we have not 
invested in a more thorough charset-decoding infrastructure.

Can you describe why, in your new filter, you feel you need to decode all 
attributes?


See also this old filter, which seems like it might be similar to your goal 
here: 
http://code.google.com/p/modpagespeed/source/browse/branches/1/src/net/instaweb/rewriter/base_tag_filter.cc

This file is from an earlier SVN branch of our code and is deleted in the 
current trunk. However, for obscure reasons related to internal testing, we are 
likely to revive that filter soon and maybe fix some bugs in it.

Original comment by jmara...@google.com on 26 Apr 2012 at 11:45


That filter sure is similar, but I'd like to get rid of base tags since I think 
they cause trouble with some older browsers. I'd rather just absolutify all 
URLs against a new base URL and drop the base tag. I also need this so I can 
host origins on domains other than the original one and keep configuration as 
simple as possible (I do a StartParse with the new domain, and to pagespeed, 
all hrefs/resource URLs are now relative to the new domain).

Thinking about this, I probably won't really need decoded attributes for this 
purpose soon.

But another thought on this: is it at all possible to disallow a resource with 
multi-byte characters in it?
Since allowing/disallowing resources is configurable at the option level, 
shouldn't those patterns be matched against (src) attribute values in a 
normalized manner? This might be a (minor) security issue.
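The concern can be made concrete: if allow/disallow patterns are matched against raw attribute bytes, a percent-encoded variant of the same path can evade a pattern that its decoded form would match. A sketch of normalizing by percent-decoding before comparison (the helper is hypothetical and not mod_pagespeed's actual option-matching code):

```cpp
#include <string>

// Percent-decodes a URL path so that "/secret" and "/%73ecret" compare
// equal before being checked against allow/disallow patterns.
// Hypothetical normalization helper for illustration only.
std::string PercentDecode(const std::string& in) {
  auto hex = [](char c) -> int {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  };
  std::string out;
  for (size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      int hi = hex(in[i + 1]), lo = hex(in[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));
        i += 2;  // consume the two hex digits
        continue;
      }
    }
    out.push_back(in[i]);  // pass through malformed escapes unchanged
  }
  return out;
}
```

Matching patterns against the normalized form rather than the raw bytes is the usual defense against this class of bypass.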

Original comment by osch...@gmail.com on 26 Apr 2012 at 12:22


RE re-hosting without a base tag: be careful of absolute references in 
JavaScript. We currently have no visibility into URLs from JavaScript.

RE multi-byte characters in src attribute values: we will not consider these to 
be rewritable resources. This is because our pattern (which probably needs to 
be refactored a little) is (from cache_extender.cc):
    ResourcePtr input_resource(CreateInputResource(href->DecodedValueOrNull()));
    if (input_resource.get() == NULL) {
      return;
    }
Where CommonFilter::CreateInputResource has:
  ResourcePtr resource;
  if (!input_url.empty()) {
    ...
  }
  return resource;

So any resources with src= or href= attributes that have multi-byte characters 
will get decoded as NULL and fail this test: we'll leave them alone.
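This "leave it alone" pattern can be sketched as a self-contained analogue (the decode policy and both function names are assumptions for illustration, not the actual mod_pagespeed code quoted above):

```cpp
#include <string>

// Stands in for a conservative attribute decoder: true iff every byte of
// `value` is plain ASCII (illustration only, not mod_pagespeed code).
bool DecodesCleanly(const std::string& value) {
  for (unsigned char c : value) {
    if (c >= 0x80) return false;
  }
  return true;
}

// Sketch of the filter pattern: rewrite only when the URL decodes;
// otherwise return the input untouched, mirroring the NULL-check above.
std::string MaybeRewrite(const std::string& src, const std::string& new_base) {
  if (!DecodesCleanly(src)) return src;  // decoded as NULL: leave it alone
  return new_base + src;
}
```

The safety property is that an undecipherable value passes through byte-for-byte, so the output document is never corrupted by a wrong guess.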


Original comment by jmara...@google.com on 26 Apr 2012 at 2:41


I think this is working properly, at least in trunk.  If you find a specific 
problem, please re-open.

Original comment by jmara...@google.com on 11 May 2012 at 1:50

  • Changed state: Fixed


Summary was: multibyte characters fail to decode 

Original comment by jmara...@google.com on 23 May 2012 at 3:34

  • Changed title: HTML URL attributes with multi-byte characters may be misinterpreted
  • Added labels: Milestone-v22, release-note
