What is your recommended way to convert feedparser's date representation to a datetime object? #321

Closed
slidenerd opened this issue Aug 4, 2022 · 1 comment


slidenerd commented Aug 4, 2022

I think this question belongs here rather than on Stack Overflow because, as the library author, you would be able to answer it best.

Issues I referenced before asking
#212
#51

Problem

  • feedparser returns the published date as a string under published and as a struct_time under published_parsed
  • I cannot store either of these directly in Postgres, because asyncpg requires a datetime for the column

How to reproduce this problem


def md5(text):
    import hashlib
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def fetch():
    import feedparser
    data = feedparser.parse('https://cointelegraph.com/rss')
    return data

async def insert(rows):
    import asyncpg
    async with asyncpg.create_pool(user='postgres', database='postgres') as pool:
        async with pool.acquire() as conn:
            results = await conn.executemany('INSERT INTO test (feed_item_id, pubdate) VALUES($1, $2)', rows)
            print(results)

async def main():
    data = fetch()
    first_entry = data.entries[0]
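    # Fails: published is a str, not a datetime
    await insert([(md5(first_entry.guid), first_entry.published)])
    # Fails: published_parsed is a struct_time, not a datetime
    await insert([(md5(first_entry.guid), first_entry.published_parsed)])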

import asyncio
asyncio.run(main())

Both insert statements above will fail.

What have I found so far?

I found three methods, but each seems to have a limitation.

Method 1

Convert it with strptime

from datetime import datetime
from time import mktime

import feedparser
data = feedparser.parse('https://cointelegraph.com/rss')
pubdate = data.entries[0].published
pubdate_parsed = data.entries[0].published_parsed


>>> pubdate
'Thu, 04 Aug 2022 06:53:42 +0100'

I could do this


>>> method1 = datetime.strptime(pubdate, '%a, %d %b %Y %H:%M:%S %z')
>>> method1
datetime.datetime(2022, 8, 4, 6, 53, 42, tzinfo=datetime.timezone(datetime.timedelta(seconds=3600)))

I am guessing this would raise an error if a feed returns its date in a different format. I am also not sure whether it works when a leap second is added.
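One standard-library alternative to a hand-written format string is email.utils.parsedate_to_datetime, which parses RFC 2822 style dates like the one above. The following is a minimal sketch: the helper name parse_rfc822 is hypothetical, and returning None for unparseable input is only one possible policy. It does not cover feeds that publish dates in other formats, such as ISO 8601.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_rfc822(text: str) -> Optional[datetime]:
    """Parse an RFC 2822 style date string; return None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError for unparseable input,
        # newer ones raise ValueError, so both are caught here.
        return None


dt = parse_rfc822('Thu, 04 Aug 2022 06:53:42 +0100')
bad = parse_rfc822('not a date')
```

Here dt keeps the +01:00 offset from the feed, and bad is None instead of an exception.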

Method 2


>>> datetime.fromtimestamp(mktime(pubdate_parsed))
datetime.datetime(2022, 8, 4, 5, 53, 42)

This seems to lose the timezone information completely, or am I wrong about that? What happens here when DST is in effect?
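For context on the concern above: time.mktime interprets its argument as local time, so the result of Method 2 depends on the timezone and DST rules of the machine running it. calendar.timegm is the standard-library counterpart that interprets the tuple as UTC. A small sketch, using a hand-built struct_time as a stand-in for published_parsed and assuming the tuple really is in UTC:

```python
import calendar
import time
from datetime import datetime, timezone

# Stand-in for entry.published_parsed: 05:53:42 UTC on Thursday, 4 Aug 2022
parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))

# timegm treats the tuple as UTC, so the epoch value does not depend on
# the local timezone or DST setting of the machine
epoch = calendar.timegm(parsed)

# Attach UTC explicitly so the resulting datetime is timezone-aware
aware = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(aware.isoformat())  # 2022-08-04T05:53:42+00:00
```

Because UTC has no DST transitions, this route avoids the ambiguity that local-time conversion can introduce.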

Method 3

Use the third-party library dateutil, as shown in this answer:
https://stackoverflow.com/a/18726020/5371505

Question

  • What is the most robust way to convert the published or published_parsed value that feedparser generates into a datetime object?
  • Can it be done without a third-party library such as dateutil?
  • Is there a native, undocumented way to get a datetime object directly from feedparser that I am not aware of?

Thank you for your time

@mattzque

I'm not the developer, but they do document it here: https://feedparser.readthedocs.io/en/latest/date-parsing.html#advanced-date

Different feed types and versions use wildly different date formats. Universal Feed Parser will attempt to auto-detect the date format used in any date element, and parse it into a standard Python 9-tuple in UTC

So I believe that to create a timezone-aware datetime object, you would do something like:

from calendar import timegm
from datetime import datetime, timezone
# published_parsed is in UTC, so use timegm (mktime would treat it as local time)
datetime.fromtimestamp(timegm(pubdate_parsed), timezone.utc)
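To tie this back to the original insert, here is a hedged sketch of preparing rows for asyncpg without any third-party date library. It assumes the pubdate column is timestamptz (the table definition is not shown in the thread), for which asyncpg expects a timezone-aware datetime. The helper names to_utc_datetime and build_row are hypothetical, and the struct_time is a stand-in for entry.published_parsed.

```python
import hashlib
import time
from datetime import datetime, timezone
from typing import Optional, Tuple


def to_utc_datetime(parsed: Optional[time.struct_time]) -> Optional[datetime]:
    """Build an aware datetime from a UTC struct_time; pass None through."""
    if parsed is None:
        # The entry had no parseable date
        return None
    # The first six fields are year, month, day, hour, minute, second
    return datetime(*parsed[:6], tzinfo=timezone.utc)


def build_row(guid: str, parsed: Optional[time.struct_time]) -> Tuple[str, Optional[datetime]]:
    """Return a (feed_item_id, pubdate) tuple shaped like the rows in the question."""
    feed_item_id = hashlib.md5(guid.encode('utf-8')).hexdigest()
    return (feed_item_id, to_utc_datetime(parsed))


# Stand-in for entry.published_parsed
parsed = time.struct_time((2022, 8, 4, 5, 53, 42, 3, 216, 0))
row = build_row('abc', parsed)
```

A list of such rows could then be passed to the insert function from the question. If the column were a plain timestamp instead, a naive datetime would be needed, so the tzinfo argument would have to be dropped.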
