RFC 7303 changed how charset detection works #291

autumnontape · 2021-11-02T21:56:08Z

feedparser implements charset detection according to RFC 3023, which has been superseded by RFC 7303. RFC 7303 made an incompatible change to charset detection to align with the behavior of real-world software: text/xml is now an alias of application/xml and has the same charset detection behavior, no longer treating an omitted charset parameter as US-ASCII. A byte-order mark also now takes precedence over the charset parameter if present.
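
For illustration, here is a minimal sketch of the precedence order as I read RFC 7303: a BOM first, then the charset parameter, then the XML encoding declaration, then UTF-8. This is not feedparser's code. The function name `detect_xml_charset` is hypothetical, only UTF-8 and UTF-16 BOMs are checked, and the declaration regex assumes an ASCII-compatible encoding.

```python
import codecs
import re
from email.message import Message
from typing import Optional

# BOMs checked by this sketch (UTF-8 and UTF-16 only).
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)

# Matches the encoding pseudo-attribute of an XML declaration.
# Assumption: the document is in an ASCII-compatible encoding.
_XML_DECL = re.compile(
    rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']"
)


def detect_xml_charset(data: bytes, content_type: Optional[str] = None) -> str:
    """Hypothetical helper: guess the charset of an XML payload."""
    # 1. A byte-order mark wins over everything else.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Otherwise use the charset parameter, if the header has one.
    #    text/xml and application/xml are treated identically.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()
        if charset:
            return charset

    # 3. Otherwise fall back to the XML encoding declaration.
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()

    # 4. No information at all: XML defaults to UTF-8, not US-ASCII.
    return "utf-8"
```

Under RFC 3023 the last case would have been US-ASCII for `text/xml`, and the first case would have been overridden by the charset parameter.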

kurtmckee · 2021-11-08T13:10:04Z

Oof, thanks for this info. That may be worthwhile to implement. Perhaps there's a package that already implements this. I'll have to look into it. Thanks!

autumnontape · 2023-06-14T22:09:23Z

I might submit a pull request to implement RFC 7303. At a glance, it looks like this would require changes to feedparser/encodings.py and docs/character-encoding.rst. Would you prefer that I just replace the old behavior, or add an option to switch between the two? Scenarios that would break as a result of a wholesale change (or of an option that defaults to the new behavior) seem very unlikely to me, but not strictly impossible.

kurtmckee · 2023-06-14T23:03:06Z

It seems like RFC 7303 may be specific to content transferred over HTTP. Is that accurate?

I have started to transition the codebase to the requests package, and would prefer to avoid doing additional detection on top of what that package provides (although local files and binary content fed directly into feedparser would still need some detection).

Is it still necessary to implement RFC 7303?

autumnontape · 2023-06-15T21:17:49Z

It's specific to content with a MIME Content-Type header. feedparser does let users specify this header themselves in the response_headers argument to parse. I don't know if you care about preserving that feature or if there's a way to do it with the requests package.

Does requests handle format-specific charset detection like for XML? Something somewhere in the pipeline should implement RFC 7303. I agree that feedparser isn't the best place for it, since none of it is specific to Atom or RSS.

kurtmckee added the character detection label Nov 8, 2021
