PublisherEfficiency
How to maximize the efficiency of publishing Atom feeds
PubSubHubbub provides high-throughput event delivery for subscribers by aggregating multiple Atom feeds together using the atom:source element. This lets a single HTTP POST carry many entries to a single subscriber. But what about the Publisher side? One question I've heard from people is, "How can this be efficient if the Hub has to re-pull the feed on every publish event?" This document attempts to answer this question.

HTTP persistent connections and pipelining

By default in HTTP 1.1, TCP connections are reused between requests. For a publisher serving many separate Atom feeds, this allows Hubs to avoid the expense of creating a new TCP connection every time an event happens. Instead, Hubs MAY leave their TCP connections open and reuse them to make HTTP requests for freshly published events.

Pipelining lets a client send a series of HTTP requests over a single TCP connection as a batch, without waiting for each response before sending the next request. This reduces the time spent waiting on round trips and maximizes the data sent per packet, cutting down on the total number of packets sent and received. To this end, both Hubs and Publishers SHOULD support HTTP pipelining.

Feed windowing

Publishers SHOULD properly supply the caching headers Last-Modified and ETag for their Atom feed responses. Hubs SHOULD respect these fields and send the values back on later requests in the If-Modified-Since and If-None-Match headers (receiving a 304 response if no content has changed).

The great thing about ETag headers is that they are opaque and must be reproduced by the requesting client exactly. That means Publishers can encode information in the ETag to represent the state of the requesting client. For example, as a Publisher you only want to supply the Hub with the content that has changed since the Hub last pulled the feed. You can do this by encoding the ID of the newest atom:entry from the feed in the ETag. Then, when the Hub requests again, you can use the ID in the ETag to formulate the query for the newest content, or as a cache key for an already-cached result.
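As a rough illustration of the windowing idea, here is a minimal sketch of the Publisher-side decision. It is not taken from any PubSubHubbub implementation: the in-memory ENTRIES list, the helper names, and the use of increasing integer entry IDs are all assumptions made for the example, and the ETag here is unsigned.

```python
# Hypothetical in-memory feed, ordered oldest to newest.
# Assumption: entry IDs are integers that only ever increase.
ENTRIES = [
    {"id": 1, "title": "first"},
    {"id": 2, "title": "second"},
    {"id": 3, "title": "third"},
]


def make_etag(entry_id):
    """Encode the newest entry ID served as a quoted ETag value."""
    return '"%d"' % entry_id


def parse_etag(header_value):
    """Return the entry ID found in an If-None-Match value, or None."""
    if not header_value:
        return None
    try:
        return int(header_value.strip().strip('"'))
    except ValueError:
        return None  # Unrecognized ETag: treat the client as new.


def serve_feed(if_none_match, entries=ENTRIES):
    """Return (status, etag, entries_to_send) for one feed request."""
    newest_id = entries[-1]["id"] if entries else 0
    last_seen = parse_etag(if_none_match)
    if last_seen is not None and last_seen >= newest_id:
        # Nothing new since the Hub's last pull.
        return 304, make_etag(newest_id), []
    if last_seen is None:
        window = list(entries)  # First pull: send everything.
    else:
        window = [e for e in entries if e["id"] > last_seen]
    return 200, make_etag(newest_id), window
```

A Hub that sends back the ETag it received last time gets only the entries added since then, or a 304 with no body when nothing has changed. A real publisher would likely replace the list comprehension with a database query or cache lookup keyed on the ID.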
The time-line would look like this:
To make this safe, Publishers should be sure to cryptographically sign their ETag header values. The ETag value could then be entry_id:hmac. To do this in Python you would use this code:

import hashlib
import hmac

entry_id = '1234 entry id'
# Python 3 needs bytes and an explicit digest; SHA-256 is one reasonable choice.
signature = hmac.new(b'secret key', entry_id.encode('utf-8'), hashlib.sha256).hexdigest()
header_value = '%s:%s' % (entry_id, signature)

For additional HTTP correctness, publishers SHOULD ensure they serve a Vary: If-None-Match or Vary: If-Modified-Since header in their response to indicate the cacheability of the response.

Atom feed archiving

RFC 5005 defines a convention for having archives of feeds organized in chunks of an arbitrary size based on time. Using this convention, publishers can reduce the size of their "latest items" feed, and thus save bandwidth and CPU when serving the latest contents to Hubs.

Serving feeds from multiple datacenters

One thing many large publishers do is serve their feeds from multiple datacenters at the same time. Often this is done using master-slave replication, but this also applies to master-master replication. Essentially, each feed has a primary datacenter where all writes go. After the write, the data is replicated to all other datacenters in a lazy fashion (eventually consistent). In the world of RSS pings, replication causes a significant issue:
The race condition here shows that the model of sending pings with only the feed URL is flawed. It can only provide low latency for singly-homed services running in one datacenter. For the large-scale systems that many companies run, this can't work reliably. This problem is exacerbated by transparent proxies and edge caches.

PubSubHubbub to the rescue! In the Hubbub model, the Hub can be integrated into the publisher's content management system itself. This has some large benefits that overcome the limitations of traditional pinging:
Are you sure about the Vary header bit?
From section 13.6 of the HTTP RFC: A server SHOULD use the Vary header field to inform a cache of what request-header fields were used to select among multiple representations of a cacheable response subject to server-driven negotiation. The set of header fields named by the Vary field value is known as the "selecting" request-headers.
ETag and Last-Modified are response headers. Do you really mean Vary: If-None-Match and Vary: If-Modified-Since?
That said... not sure how comfortable I am with the windowing thing... seems brittle. I always hit problems trying to get too fancy with caching -- too many proxies doing weird things. But who knows... could work.
What about using RFC 5005 to improve publishing efficiency? i.e. rolling a subscribe feed over into archive feeds on a regular basis.
andrew.wahbe has a good point here. The Vary header makes no sense, unless any HTTP server implements Vary: ETag and Vary: Last-Update.
Thanks for the feedback!
Wouldn't you be better off using a "next message" link rather than encoding information within the ETag? Then it would be easier to add generic caching intermediaries without worry.