RFC 7303 changed how charset detection works #291

Open
autumnontape opened this issue Nov 2, 2021 · 4 comments


@autumnontape

feedparser implements charset detection according to RFC 3023, which has been superseded by RFC 7303. RFC 7303 made an incompatible change to charset detection to align with the behavior of real-world software: text/xml is now an alias of application/xml and has the same charset detection behavior, no longer treating an omitted charset parameter as US-ASCII. A byte-order mark also now takes precedence over the charset parameter if present.
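The precedence described above can be sketched roughly as follows. This is a minimal illustration, not feedparser's implementation: the function name `sniff_xml_encoding` and its helpers are hypothetical, only UTF-8 and UTF-16 byte-order marks are checked, and the XML declaration is read with a simple regex that assumes an ASCII-compatible encoding.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# Hypothetical helper tables for this sketch; only UTF-8 and UTF-16 BOMs are handled.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Simplified match for the encoding pseudo-attribute of an XML declaration.
_XML_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, content_type: Optional[str] = None) -> str:
    """Guess the encoding of an XML document in the order RFC 7303 describes."""
    # 1. A byte-order mark, if present, wins over everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Otherwise use the charset parameter of the Content-Type header.
    #    text/xml and application/xml are treated identically here.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_param("charset")
        if isinstance(charset, str) and charset:
            return charset.lower()

    # 3. With no BOM and no charset, fall back to the XML declaration,
    #    then to the XML default of UTF-8 (not US-ASCII, as under RFC 3023).
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"
```

For example, a `text/xml` body with no charset parameter and no declared encoding resolves to UTF-8 in this sketch, and a UTF-8 BOM overrides a conflicting `charset=iso-8859-1` parameter.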

@kurtmckee
Owner

Oof, thanks for this info. That may be worthwhile to implement. Perhaps there's already a package that implements this, too. I'll have to look into it. Thanks!

@autumnontape
Author

I might submit a pull request to implement RFC 7303. At a glance, it looks like this would require changes to feedparser/encodings.py and docs/character-encoding.rst. Would you prefer if I just replaced the old behavior or added an option to switch between them? Scenarios that could start having problems as a result of a wholesale change (or an option that defaulted to the new behavior) seem very unlikely but not strictly impossible to me.

@kurtmckee
Owner

It seems like RFC 7303 may be specific to content transferred over HTTP. Is that accurate?

I have started to transition the codebase to the requests package, and would prefer to avoid doing additional detection on top of what that package provides (although local files and binary content fed directly into feedparser would still need some detection).

Is it still necessary to implement RFC 7303?

@autumnontape
Author

It's specific to content with a MIME Content-Type header. feedparser does let users specify this header themselves in the response_headers argument to parse. I don't know if you care about preserving that feature or if there's a way to do it with the requests package.

Does requests handle format-specific charset detection, such as for XML? Something somewhere in the pipeline should implement RFC 7303. I agree that feedparser isn't the best place for it, since none of it is specific to Atom or RSS.

2 participants