
Plan for upcoming 3x memory improvement PR (and a few others) #296

Open
lemon24 opened this issue Dec 18, 2021 · 8 comments
Labels: character detection · parser (Parser implementations: expat, libxml2, etc.) · performance


lemon24 commented Dec 18, 2021

Hi @kurtmckee,

I'm using feedparser heavily in reader, a feed reader library.

There are 3 issues I'd like to submit PRs for; I've opened this issue to provide a bit of background.

If time allows, please let me know if you see any issues with the proposed approaches. (I will be working on the PRs regardless.)

Thank you, appreciate your work!


Proposed changes follow.

3x memory usage reduction

I managed to reduce memory usage of parse() by ~3x (prototype) and make it 4% faster, by not reading the entire feed in memory, and decoding it on the fly.

I tested it with 150 feeds I use day to day, and the results were identical to those of feedparser 6.0.8.

tl;dr of the change: in parse():

  • use convert_to_utf8() with a small prefix of feed data (optimistically assuming the rest of the file will have the detected encoding)
  • for the StrictFeedParser branch
    • wrap the file in a codecs.getreader() that converts it to a text stream
    • pass the text stream to the SAX parser
  • for the LooseFeedParser branch
    • seek() the file back to the initial position
    • again, wrap it in a codecs.getreader()
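The decoding part of the steps above can be sketched roughly like this (a minimal illustration, assuming the encoding was already detected; the actual change derives it via convert_to_utf8() on a prefix of the feed):

```python
import codecs
import io

# Stand-in for a feed fetched over the network.
binary_file = io.BytesIO("<feed><title>naïve</title></feed>".encode("utf-8"))
detected_encoding = "utf-8"  # assumed result of encoding detection

# codecs.getreader() returns a StreamReader class; instantiating it wraps
# the binary stream as a lazily-decoded text stream, which can be handed
# to the SAX parser without materializing the whole document as a string.
text_stream = codecs.getreader(detected_encoding)(binary_file)
content = text_stream.read()

# If strict parsing fails, a seekable stream can be rewound and
# re-wrapped for the loose parser:
binary_file.seek(0)
```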

For backwards compatibility:

  • I think it's acceptable for convert_to_utf8() to only use a fragment of the feed. However, it should still be possible to enable the old behavior of passing the entire feed to convert_to_utf8(), for the unlikely case where the encoding isn't detected properly; I propose a new argument, parse(..., optimistic_encoding_detection=True).
    • Not having optimistic_encoding_detection would make the code simpler, though. To increase the chances of correctly detecting encoding, we can make the feed fragment used reasonably large (e.g. 100K). Do you think this is an acceptable tradeoff?
  • For the fallback to LooseFeedParser to be possible, the file must be .seekable(). If it's not, we still need to read the entire file in memory, but at least we do it exactly once – from my profiling, convert_to_utf8() consumes most of the memory by having multiple copies of the entire feed (with different encodings).
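The seekable() fallback described above could look roughly like this (function and parameter names are illustrative, not feedparser's actual API):

```python
import io

def read_prefix(file, prefix_size=100 * 1024):
    """Illustrative sketch: read a detection prefix, keeping memory
    bounded when the stream is seekable."""
    prefix = file.read(prefix_size)
    if file.seekable():
        # rewind so the parser can later stream the whole file lazily
        file.seek(0)
        return prefix, file
    # non-seekable: read the rest exactly once and keep it in memory
    return prefix + file.read(), None

prefix, stream = read_prefix(io.BytesIO(b"<rss>...</rss>"), prefix_size=5)
```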

Proposed refactoring (not required, but it would make parse() easier to understand/change going forward):

  • Add a new function parse_file(file, ...), that only takes a stream. parse() would call parse_file() after getting one from _open_resource().
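The split could look roughly like this (a toy sketch; _open_resource()'s real signature and return value differ, and parse_file is a hypothetical name):

```python
import io

def _open_resource(source):
    # stand-in for feedparser's _open_resource(): resolve a URL, path,
    # open file, or byte string to an open binary stream
    if isinstance(source, bytes):
        return io.BytesIO(source)
    return source

def parse_file(file):
    # hypothetical new entry point: all parsing logic only ever sees
    # a stream, which keeps it easier to understand and change
    data = file.read()
    return {"bozo": False, "length": len(data)}

def parse(source):
    # parse() becomes a thin wrapper around parse_file()
    return parse_file(_open_resource(source))

result = parse(b"<rss/>")
```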

Resolves: #209, maybe #287.

Make Atom summary/content not depend on order

#59 already has a pending pull request for this.

I'm not sure if it wasn't merged because it was never a priority, or because more testing is required; if the latter, I can do the needed work, please let me know.

A sharper solution is to side-step the summary/content detection logic for Atom only. My understanding is that the reliance on order was mainly because RSS doesn't really have separate summary and content elements. Atom does, so when we get a summary tag, we can just use it for summary. Prototype (tested with 150 feeds, of which ~55 were Atom).
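The idea can be illustrated with a toy handler (hypothetical names; feedparser's mixin-based handlers look quite different):

```python
class AtomEntryHandler:
    """Toy illustration: Atom has distinct summary and content
    elements, so each can be assigned directly, independent of the
    order in which they appear in the document."""

    def __init__(self):
        self.summary = None
        self.content = None

    def end_element(self, tag, text):
        if tag == "summary":
            self.summary = text
        elif tag == "content":
            self.content = text

entry = AtomEntryHandler()
entry.end_element("content", "<p>full text</p>")
entry.end_element("summary", "a short summary")  # order is irrelevant
```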

Resolves: #59.

defusedxml or lxml can't be used with feedparser

This can kinda be done by appending to feedparser.api.PREFERRED_XML_PARSERS, but that approach has a number of issues:

  • PREFERRED_XML_PARSERS is not documented.
  • PREFERRED_XML_PARSERS is feedparser global state; it wouldn't be polite for reader to change it (unless explicitly instructed by the user).
  • xml.sax.make_parser() falls back to the default SAX parser, which is insecure; when used in the context of untrusted input, it should either work with the specified parser, or fail.
    • This can be worked around by calling xml.sax.make_parser() ahead of time and checking that the returned parser is one of the safe ones (but it's kinda brittle, since it assumes feedparser uses make_parser()).

For reference, here's what using PREFERRED_XML_PARSERS with feedparser looks like now: defusedxml, lxml.
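For illustration, the check-ahead-of-time workaround might look like this (assuming defusedxml.expatreader can serve as a SAX driver module; check defusedxml's documentation before relying on this):

```python
import xml.sax

PREFERRED = ["defusedxml.expatreader"]  # assumed driver module name

# make_parser() silently falls back to the default (insecure) expat
# driver if none of the preferred modules are importable, so inspect
# what was actually returned before trusting it with untrusted input.
parser = xml.sax.make_parser(PREFERRED)
if type(parser).__module__ not in PREFERRED:
    parser = None  # refuse to parse untrusted input insecurely
```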

My solution to this would be to add a new argument parse(..., sax_parser=None); for backwards compatibility, it'd default to the current behavior.

Resolves: #107 (XML parsing is not safe by default, but it is possible to make it safe).


lemon24 commented Dec 20, 2021

Progress report:

In revision 5 of the prototype, I extracted the "join a prefix to a file object" logic into a separate class (with a few tests); this is useful for a few reasons:

  • It keeps parse() almost as simple as it was originally.
  • It prevents the SAX parser from closing the stream (I don't know how to stop it from doing that); if the stream is closed, the (fallback) LooseFeedParser branch can't seek it to its original position.
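A minimal sketch of such a wrapper follows (the class in the actual prototype may differ in name and detail):

```python
import io

class PrefixFileWrapper:
    """Serve `prefix` first, then the underlying file; never close
    the underlying file, so the loose-parser fallback can still
    seek() it back to the start."""

    def __init__(self, prefix, file):
        self.prefix = prefix
        self.file = file
        self.offset = 0

    def read(self, size=-1):
        chunks = []
        if self.offset < len(self.prefix):
            if size < 0:
                chunk = self.prefix[self.offset:]
            else:
                chunk = self.prefix[self.offset:self.offset + size]
                size -= len(chunk)
            self.offset += len(chunk)
            chunks.append(chunk)
        if size != 0:
            chunks.append(self.file.read(size))
        return b"".join(chunks)

    def close(self):
        # deliberately a no-op: the SAX parser closes the stream it is
        # given, but the caller still needs the underlying file open
        pass

wrapped = PrefixFileWrapper(b"<rss>", io.BytesIO(b"</rss>"))
data = wrapped.read()
```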

I'm still thinking of how to split the change into a few refactoring commits and a few feature commits, so it's easier to review.


tadeoos commented Jan 18, 2022

@lemon24 great initiative! These leaks are hurting us too. How far are you? Is there something I can help with or do you have a workaround for these issues?


lemon24 commented Jan 18, 2022

@tadeoos, parse() in the gist from my previous comment works correctly for any seekable() file (that is, to parse a URL you have to download the feed yourself and pass the open file in).

The next step is to integrate the stuff from the gist into feedparser itself (instead of wrapping some internals), without breaking anything, and without making the code significantly harder to maintain going forward.

(Sadly, I haven't had the time/disposition to continue this. It's likely I'll pick it up again in the following month, but I'm not promising anything.)

lemon24 added commits to lemon24/feedparser that referenced this issue (Jan 24 to May 22, 2022)
kurtmckee added a commit that referenced this issue Jun 24, 2022
kurtmckee (Owner) commented:

Thanks for your work on PR #302!


lemon24 commented Jun 24, 2022

@kurtmckee, I want to continue working on the other two issues when I have some time. Do you agree with the solutions proposed in my first comment?

The first one, "defusedxml or lxml can't be used with feedparser", requires a (backwards-compatible) API change:

My solution to this would be to add a new argument parse(..., sax_parser=None); for backwards compatibility, it'd default to the current behavior.

I'm looking for your input because this implies that feedparser will continue to use a SAX parser indefinitely (but IMHO, the existence of PREFERRED_XML_PARSERS already does).


kurtmckee commented Jun 24, 2022

My apologies, I overlooked the additional items!

I'd like to discuss these items in more detail, but the discussion re: SAX parsers and fallbacks has been tearing at me for a long time. I want feedparser to get to a point where it's not trying to use an XML parser and then fallback to a loose parser. To that end, I'm currently writing a handler that uses html.parser.HTMLParser as its backend. It's working well so far, and I'd like to add a handler that uses lxml's HTML parser if it's importable.
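In miniature, that approach might look something like this (purely illustrative; the actual handler being written is far more involved):

```python
from html.parser import HTMLParser

class TitleCollector(HTMLParser):
    """Toy sketch of a feed handler built on html.parser: collect the
    text of <title> elements without any strict-XML requirements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.titles.append(data)

handler = TitleCollector()
# note the unclosed <channel>: html.parser doesn't mind broken markup,
# so there is no need for a separate loose-parser fallback pass
handler.feed("<rss><channel><title>Example Feed</title></rss>")
```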

I think at that point the SAX parsing could be completely ripped out and feedparser could drop its code that supports re-parsing the same document twice in two different modes. I anticipate that this lays the groundwork for feedparser to begin parsing h-feeds.

Do you think that migrating from XML/sgml parsers to HTML/lxml parsers could address the current lack of support for defusedxml?

kurtmckee reopened this Jun 24, 2022

lemon24 commented Jun 25, 2022

Overall, I am +1 on not using an XML parser (if it's not an overwhelming amount of work to do).

From what I understand, most issues with XML come from it being able to do too much (file/network access etc.), and libraries assuming it comes from a trusted source; a feed is pretty much in the opposite situation.

This will likely simplify things for decoding as well:

  • AFAICT, convert_to_utf8() and replace_doctype() both need to change the stream prefix for the benefit of the XML parser
  • if there's no XML parser, the detection logic can be moved into the new parser
  • ... so there's no need to change the stream
  • ... so we can remove PrefixFileWrapper (and maybe StreamFactory too!)

Regarding the need for defusedxml:

  • html.parser.HTMLParser: IMO this sidesteps the issue entirely, since being relatively low level it won't have the nasty behaviors out of the box.
  • lxml XML parser: Based on this table, lxml seems to be less vulnerable than most other XML libraries (only to quadratic blowup and external entity expansion (local file)).
  • lxml HTML parsers (there's more than one): don't know about them, but they might be even less vulnerable (by being lower-level). For reference, lxml does seem to support a feed()-handle_event()-style interface.

So, given your plans, I think it's more than reasonable to not move forward with defusedxml / custom SAX parser support (feel free to close this).


lemon24 commented Jun 25, 2022

While writing the comment above, I realized there may be a number of opportunities presented by a reworking of the parsing backend.

I understand you're operating under a number of constraints: time, backwards compatibility, and, I assume, pure-Python-only dependencies – at least out of the box. Even if none of the following are included, I still think it's useful to list them out, in case they influence your design.

  1. Streaming support: instead of returning a list of entries, yield them one by one.
     • This would further reduce the amount of memory used.
  2. Support Bleach for HTML sanitization.
  3. Use BeautifulSoup for broken feed parsing and/or broken encoding handling.
     • This got my attention while looking through the lxml.html.soupparser docs:

       A very nice feature of BeautifulSoup is its excellent support for encoding detection which can provide better results for real-world HTML pages that do not (correctly) declare their encoding.
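As an illustration of the streaming idea above, a pull parser makes yielding entries one by one quite natural (a toy sketch using the standard library's iterparse, not feedparser's internals):

```python
import io
import xml.etree.ElementTree as ET

def iter_entries(file):
    """Yield entries one by one instead of accumulating a list,
    clearing each element so memory stays bounded."""
    for _event, elem in ET.iterparse(file, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # ignore any namespace prefix
        if tag in ("item", "entry"):
            yield {"title": elem.findtext("title")}
            elem.clear()  # free the parsed subtree as we go

feed = (b"<rss><channel>"
        b"<item><title>First</title></item>"
        b"<item><title>Second</title></item>"
        b"</channel></rss>")
titles = [e["title"] for e in iter_entries(io.BytesIO(feed))]
```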

From what I understand, feedparser is really old, and it needed to do a lot of work for things that aren't strictly related to feeds because there was no better option available at the time (and the packaging story wasn't ideal either). Since then, the library ecosystem has matured quite a bit, and (carefully) farming out some of that work to others might lower the maintenance burden significantly.

I am not suggesting you do any of the above, just bringing them up so they may have an influence on whatever happens next. If you find one of them particularly interesting, I can look into it further when I have some time.

Once again, thank you for all your work on feedparser!
