My favorites | Sign in
Logo
             
Search
for
Updated Aug 23, 2009 by bslatkin
PublisherEfficiency  
How to maximize the efficiency of publishing Atom feeds

PubSubHubbub provides high-throughput event delivery for subscribers by aggregating multiple Atom feeds together using the atom:source element. This enables a single HTTP POST to contain a lot of data for the single subscriber.

But what about the Publisher side? One question I've heard from people is, "How can this be efficient if the Hub has to re-pull the feed on every publish event?" This document attempts to answer this question.


HTTP persistent connections and pipelining

By default in HTTP 1.1, TCP connections will be reused between requests. For a publisher serving many separate Atom feeds, this allows Hubs to get around the expense of creating a new TCP connection every time an event happens. Instead, Hubs MAY leave their TCP connections open and reuse them to make HTTP requests for freshly published events.

Pipelining allows a single TCP connection to make a series of HTTP requests simultaneously in batch. This reduces the round-trip time of the HTTP requests and maximizes the data sent per packet, cutting down on the number of total packets sent and received. To this end, both Hubs and Publishers SHOULD support HTTP piplining.


Feed windowing

Publishers SHOULD properly supply the caching headers Last-Modified and ETag for their Atom feed responses. Hubs SHOULD respect these fields and supply them properly on requests (receiving a 304 response if no content has changed). The great thing about ETag headers is that they are opaque and must be reproduced by the requesting client exactly. That means Publishers can encode information in the ETag to represent the state of the requesting client.

For example, as a Publisher you only want to supply the Hub with the content that has changed since the Hub last pulled the feed. You can do this by encoding the ID of the newest atom:entry from the feed in the ETag. Then, when the Hub requests again, you can use the ID in the ETag to formulate the query for the newest content, or as a cache key for an already-cached result.

The time-line would look like this:

  1. Publisher adds new entries to feed X, sends Hub a publish event
  2. Hub pulls feed X, gets entries (1,2,3,4) and ETag of 4
  3. (later) Publisher publishes adds to feed X again, sends Hub an event
  4. Hub pulls feed X with If-None-Match = 4; Publisher serves content after 4: (5,6,7)

To make this safe, Publishers should be sure to cryptographically sign their ETag header values. The ETag value could then be entry_id:hmac. To do this in Python you would use this code:

import hmac
entry_id = '1234 entry id'
signature = hmac.new('secret key', entry_id).hexdigest()
header_value = '%s:%s' % (entry_id, signature)

For additional HTTP correctness, publishers SHOULD ensure they serve a Vary: If-None-Match or Vary: If-Modified-Since header in their response to indicate the cachability of the response.


Atom feed archiving

RFC5005 defines a convention for having archives of feeds organized in chunks of an arbitrary size based on time. Using this convention, publishers can reduce the size of their "latest items" feed, and thus save bandwidth and CPU serving the latest contents to Hubs.


Serving feeds from multiple datacenters

One thing many large publishers do is serve their feeds from multiple datacenters at the same time. Often this is done using master-slave replication, but this also applies to master-master replication. Essentially, each feed has a primary datacenter where all writes go. After the write, the data is replicated to all other datacenters in a lazy fashion (eventually consistent).

In the world of RSS pings, replication causes a significant issue:

  1. Publisher adds content to a feed in the master datacenter
  2. Publisher sends a ping to Weblogs.com, etc
  3. Feed consumers pull the feed from the publisher
  4. Due to geographical locality of each feed consumer:
    • Some consumers hit the master datacenter-- they will see the new content
    • Other consumers hit the read-only datacenters that are behind on replication-- they will see no changes!

The race condition here shows that the model of sending pings with only the feed URL is flawed. It can only provide low-latency for for singly-homed services running in one datacenter. For large-scale systems run by many companies out there, this can't work reliably. This problem is exacerbated by transparent proxies and edge caches.

PubSubHubbub to the rescue! In the Hubbub model, the Hub can be integrated into the publisher's content management system itself. This has some large benefits that overcome the limitations of traditional pinging:

  • The Hub can be configured to always pull from the master for each feed
    • This means event notifications may be delivered to subscribers before the feed update is visible to traditional polling feed consumers
  • Multi-homed providers don't have to worry about cache invalidations and situations where replication delays are elevated (due to networking issues, etc)
  • Notification latency can be lower because the feed update doesn't have to propagate to all feed servers before sending out pings
  • Notifications may be distributed even in the case that feed serving is down!


Comment by andrew.wahbe, Aug 17, 2009

Are you sure about the Vary header bit?

From section 13.6 of the HTTP RFC: A server SHOULD use the Vary header field to inform a cache of what request-header fields were used to select among multiple representations of a cacheable response subject to server-driven negotiation. The set of header fields named by the Vary field value is known as the "selecting" request-headers.

ETag and Last-Modified are response headers. Do you really mean Vary: If-None-Match and Vary: If-Modified-Since ?

That said... not sure how comfortable I am with the windowing thing... seems brittle. I always hit problems trying to get too fancy with caching -- too many proxies doing wierd things. But who knows... could work.

Comment by andrew.wahbe, Aug 18, 2009

What about using RFC 5005 to improve publishing efficiency? i.e. rolling a subscribe feed over into archive feeds on a regular basis.

Comment by jaderd, Aug 22, 2009

andrew.wahbe has a good point here. The Vary header makes no sense, unless any Http Server implements Vary: Etag and Vary: Last-Update

Comment by bslatkin, Aug 23, 2009

Thanks for the feedback!

Comment by patriot1burke, Sep 25, 2009

Wouldnt you be better off using a "next message" link rather than encoding information within the ETag? Then it would be easier to add generic caching intermediaries without worry.


Sign in to add a comment
Hosted by Google Code