Skip to content
This repository has been archived by the owner on Apr 21, 2023. It is now read-only.

htmlparser breaks ie7 rendering in an edge case with invalid html #470

Open
GoogleCodeExporter opened this issue Apr 6, 2015 · 8 comments

Comments

@GoogleCodeExporter
Copy link

i noticed the html parser slightly mangles the html in the following situation:

input:
<a href="http://foo.com" target="_new"">foo</a>

output:
<a href="http://foo.com" target="_new" ">foo</a>

note the extra space. that breaks ie7 rendering. in my case this caused the 
link to be invisible due to the css involved.

Regards, 

Otto van der Schaaf


Original issue reported on code.google.com by osch...@gmail.com on 18 Jul 2012 at 12:43

@GoogleCodeExporter
Copy link
Author

That's a good corner-case.  While you have ie7 open, can you inspect the DOM 
given your input?  How many attributes does your 'a' element have?

Original comment by jmara...@google.com on 18 Jul 2012 at 12:49

@GoogleCodeExporter
Copy link
Author

First, sorry about the multiple submissions.. i got 500 errors, i assumed my 
submission wasn't processed. 

I only have ietester for simulating ie7, which does not provide the dom 
inspector. so i can't see how may attributes the a element has in ie7, sorry

Original comment by osch...@gmail.com on 18 Jul 2012 at 1:04

@GoogleCodeExporter
Copy link
Author

Can you explain this a little more? This doesn't look like it could possibly be 
correct HTML syntax (with the extra "). Is this an IE hack? Typo? If it's a 
hack, what's the purpose, can you point to a reference.

Original comment by sligocki@google.com on 18 Jul 2012 at 3:11

@GoogleCodeExporter
Copy link
Author

Well, it sure isn't valid html (hence the bug report title :)).
As far as i know this isn't generally used as an ie rendering hack.

I have seen this happen often in the past years in 2 general cases:
- in this case, the html was directly entered by end users through a cms. it is 
a typo this time.
- other times i have seen this happen, is when a bad script is generating html. 
in that case, it's a buggy html generator



I don't know about any numbers about how often this occurs; however, there is a 
potentially big penalty when it occurs in these ways:
- the resulting rendering difference is hard to spot, causing people to spend 
lots of time finding out what is going on.
- the rendering difference will probably go unnoticed for some time (lots of 
people do not test with every browser), potentially crippling stuff (in this 
case, the form where it was encountered did not display it's link to the 
agreement pdf)
- the reason why that occurred is tough to explain to customers. every major 
browser handles the original (invalid) input html OK. At least ie7 seems to 
fail on the (invalid but slightly different) output html, maybe other older 
browsers fail too.
- this might be just an instance of a more general case, i have not thought 
about other invalid tag/value/browser combinations that get an extra space.

Ideally, i think the parser should this invalid html alone.

Regards, Otto van der Schaaf

Original comment by osch...@gmail.com on 19 Jul 2012 at 8:44

@GoogleCodeExporter
Copy link
Author

I understand the problem; we generally do try to leave invalid HTML alone.  But 
we made a design decision about how to do that which is good for some cases and 
not this one.  The decision was that during attribute parsing, we would look 
for "garbage" characters and retain them in our data model.  In particular we 
saw sometimes <a b=c ; d> (or something like that).  We wanted to retain the 
";" and not eliminate it from our data model.

I think we would treat <a b="c" " d> the same way.  The tricky thing about this 
bug is that we don't have (currently) in our data model a way of distinguishing 
these two:
   <a b="c"" d>
   <a b="c" " d>
In either case we'd treat this as a valid tag with three attributes.  I guess 
to replicate this we'll have to add a new 'bool' field to 
HtmlElement::Attribute to indicate whether it is separated from the previous 
attribute by whitespace.  I don't think we want to retain all the 
intra-attribute whitespace as that would make our memory usage worse to handle 
this corner case.

Original comment by jmara...@google.com on 20 Jul 2012 at 1:47

@GoogleCodeExporter
Copy link
Author

Original comment by jmara...@google.com on 20 Jul 2012 at 1:47

  • Changed state: Accepted

@GoogleCodeExporter
Copy link
Author

Original comment by jmara...@google.com on 23 Sep 2013 at 6:40

@GoogleCodeExporter
Copy link
Author

I've confirmed this problem still exists after Maks' recent parser improvement.

Original comment by jmara...@google.com on 23 Sep 2013 at 7:47

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant