Debian's UDD feeds freak out feedparser #112

Open
anarcat opened this issue Sep 6, 2017 · 8 comments

@anarcat

anarcat commented Sep 6, 2017

My personal UDD todo list breaks feedparser. If you add the feed as a test case in the "illformed" directory, tox says:

GLOB sdist-make: /home/anarcat/dist/feedparser/setup.py
py27 create: /home/anarcat/dist/feedparser/.tox/py27
py27 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py27 installed: feedparser==5.2.1,pkg-resources==0.0.0
py27 runtests: PYTHONHASHSEED='1353716627'
py27 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py27/bin/python tests/runtests.py'
py35 create: /home/anarcat/dist/feedparser/.tox/py35
py35 inst: /home/anarcat/dist/feedparser/.tox/dist/feedparser-5.2.1.zip
py35 installed: feedparser==5.2.1,pkg-resources==0.0.0,sgmllib3k==1.0.0
py35 runtests: PYTHONHASHSEED='1353716627'
py35 runtests: commands[0] | /home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py
Traceback (most recent call last):
  File "tests/runtests.py", line 835, in <module>
    runtests()
  File "tests/runtests.py", line 789, in runtests
    description, evalString, skipUnless = getDescription(xmlfile, data)
  File "tests/runtests.py", line 740, in getDescription
    raise RuntimeError("can't parse %s" % xmlfile)
RuntimeError: can't parse ./tests/illformed/udd.xml
ERROR: InvocationError: '/home/anarcat/dist/feedparser/.tox/py35/bin/python tests/runtests.py'
_______________________________________________________________________________ summary ________________________________________________________________________________
ERROR:   py27: commands failed
ERROR:   py35: commands failed

the problem seems to be that some entries have no guid field and an empty link field, which breaks feedparser's (reasonable) expectations...
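
One way to check that diagnosis without involving feedparser at all is to scan the raw RSS for items that lack a `guid` or carry an empty `link`. The sketch below uses only the standard library. The sample feed and the `find_suspect_items` helper are made up for illustration; they are not taken from the real UDD output.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample, NOT the actual UDD feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>todo</title>
    <item>
      <title>first task</title>
      <link></link>
    </item>
    <item>
      <title>second task</title>
      <link>https://example.org/2</link>
      <guid>https://example.org/2</guid>
    </item>
  </channel>
</rss>
"""


def find_suspect_items(rss_text):
    """Return (title, problems) for each item with no guid or an empty link."""
    root = ET.fromstring(rss_text)
    suspects = []
    for item in root.iter("item"):
        problems = []
        if item.find("guid") is None:
            problems.append("no guid")
        link = item.find("link")
        # A missing <link> and a blank <link></link> are treated the same.
        if link is None or not (link.text or "").strip():
            problems.append("empty link")
        if problems:
            suspects.append((item.findtext("title", default=""), problems))
    return suspects


if __name__ == "__main__":
    for title, problems in find_suspect_items(SAMPLE_RSS):
        print(title, problems)
```

Pointing the same function at the contents of the downloaded feed should list whichever entries are affected, assuming the feed is well-formed XML.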

@twm
Contributor

twm commented Jan 13, 2018

What behavior do you expect from feedparser in this case? Should the invalid entries be silently ignored? Should feedparser produce entries without a link?

Maybe UDD should be fixed? That feed is not valid.

@anarcat
Author

anarcat commented Jan 15, 2018

it should:

  1. not crash

  2. make an educated guess at a UID

I do this in feed2exec:

        if not item.get('id'):
            item['id'] = item.get('title')

it's just a dumb heuristic, but it works better than crashing on an arbitrary feed.

at the very least, i would want feedparser to be robust (i.e. not crash) on bad content. delivering a non-empty feed is extra...

@twm
Contributor

twm commented Jan 19, 2018

Hmm, that heuristic would work in this particular case but in the wild repeated entry titles are pretty common (e.g., http://www.pusheen.com/rss) so I wouldn't want it built into feedparser except on an opt-in basis. As a feedparser user I'd rather have no ID than a heuristic that I can't fix.

My first inclination for a heuristic would have been to use the item date as a final fall-back, but that doesn't work for this feed either. :-/ So maybe skipping 'id' or making it the empty string is best in this case. Then you can add heuristics on top (e.g., a more robust one would be to hash all the item fields in cases like this).
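
The "hash all the item fields" idea could look roughly like the sketch below. It is a heuristic layered on top of the parser's output, not something feedparser does. `fallback_id` is a hypothetical helper, and using SHA-256 over a sorted JSON dump of the item is an assumption made for the example; the sample entries are invented.

```python
import hashlib
import json


def fallback_id(item):
    """Return the item's own id, or a hash of all its fields when it has none."""
    if item.get("id"):
        return item["id"]
    # Sort keys so the same fields always serialize to the same bytes;
    # default=str covers values that JSON cannot encode natively.
    payload = json.dumps(item, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Invented entries that share a title but differ in another field.
    a = {"title": "same title", "summary": "entry one", "link": ""}
    b = {"title": "same title", "summary": "entry two", "link": ""}
    print(fallback_id(a) != fallback_id(b))
```

Unlike a title-only fallback, two entries with the same title get different IDs as long as any other field differs. The trade-off is that the ID changes whenever the publisher edits any field of an entry, and two fully identical entries still collide.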

@anarcat
Author

anarcat commented Jan 23, 2018

yep, i don't mind rolling my own heuristics here... i guess what i need here is for feedparser to ... er... not crash. :)

@kurtmckee
Owner

@anarcat, are you still seeing this behavior? If so, I'll jump in on this and work to get feedparser to quit crashing.

Re: GUID heuristics, feedparser won't be updated to inject GUIDs, but you're right, feedparser shouldn't be crashing!! =)

@anarcat
Author

anarcat commented May 7, 2018

i still get the same error as originally reported. should i send a PR to get the failing unit test in place?

to reproduce, you simply need to do this:

wget -O tests/illformed/udd.xml 'https://udd.debian.org/dmd/?email1=anarcat%40debian.org&email2=&email3=&packages=&ignpackages=photofloat&nosponsor1=on&format=rss#todo'

and run the test suite.

@kurtmckee
Owner

kurtmckee commented May 7, 2018 via email

@buhtz

buhtz commented Jul 14, 2019

FYI: There is also another problem with Debian-related feeds.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=926074

Please open a bug report in Debian against the tracker.debian.org package and post the link here. Thanks.
