RFC 7303 changed how charset detection works #291
Comments
Oof, thanks for this info. That may be worthwhile to implement. Perhaps there's a package that implements this already, too. I'll have to look into this. Thanks!
I might submit a pull request to implement RFC 7303. At a glance, it looks like this would require changes to feedparser/encodings.py and docs/character-encoding.rst. Would you prefer that I just replace the old behavior, or add an option to switch between the two? Scenarios that could start having problems as a result of a wholesale change (or an option that defaulted to the new behavior) seem very unlikely to me, but not strictly impossible.
It seems like RFC 7303 may be specific to content transferred over HTTP. Is that accurate? I have started to transition the codebase to the Is it still necessary to implement RFC 7303?
It's specific to content with a MIME Content-Type header. feedparser does let users specify this header themselves in the Does requests handle format-specific charset detection like for XML? Something somewhere in the pipeline should implement RFC 7303. I agree that feedparser isn't the best place for it, since none of it is specific to Atom or RSS.
feedparser implements charset detection according to RFC 3023, which has been superseded by RFC 7303. RFC 7303 made an incompatible change to charset detection to align with the behavior of real-world software: text/xml is now an alias of application/xml and has the same charset detection behavior, no longer treating an omitted charset parameter as US-ASCII. A byte-order mark also now takes precedence over the charset parameter if present.
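To make the precedence concrete, here is a minimal sketch of the order described above: byte-order mark first, then the charset parameter, then the XML declaration, then UTF-8 as the XML default. It is not feedparser's code. The function name `detect_xml_charset` and both regular expressions are hypothetical, and it assumes only UTF-8 and UTF-16 BOMs need handling and that any XML declaration is in an ASCII-compatible encoding.

```python
import codecs
import re
from typing import Optional

# BOMs handled by this sketch (assumption: UTF-8 and UTF-16 only;
# UTF-32 and EBCDIC sniffing are deliberately left out).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical helper patterns, not taken from feedparser.
_CHARSET_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_XML_DECL_RE = re.compile(
    rb"<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def detect_xml_charset(data: bytes, content_type: Optional[str] = None) -> str:
    """Guess the encoding of an XML document, RFC 7303 style."""
    # 1. A byte-order mark wins over everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Otherwise use the charset parameter, if one was given.
    #    text/xml and application/xml are treated identically.
    if content_type:
        match = _CHARSET_RE.search(content_type)
        if match:
            return match.group(1).lower()

    # 3. Otherwise fall back to the XML declaration's encoding attribute.
    match = _XML_DECL_RE.match(data)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. No information at all: XML defaults to UTF-8, not US-ASCII.
    return "utf-8"
```

Under the old RFC 3023 rules, a `text/xml` response with no charset parameter would have been treated as US-ASCII. In this sketch it falls through to the XML declaration or to UTF-8, and a UTF-8 BOM overrides a conflicting `charset=iso-8859-1`.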