Updated Sep 27, 2009 by bslatkin
Comparison of PubSubHubbub to light-pinging protocols


People want a comparison of the concrete differences between fat pinging (PubSubHubbub, XMPP pubsub) and light pinging (rssCloud, XML-RPC pings, changes.xml, SUP, SLAP). This document aims to construct and convey an evaluation of these protocols that's easy to understand.

The core difference is how new information from feeds is delivered from a publisher to a subscriber:

  • Light pings: Send the URL of the feed that has updated to the subscriber.
  • Fat pings: Send the updated content of the feed to the subscriber.
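
As a rough illustration of the two notification styles, here is a minimal Python sketch. The class and function names (`LightPing`, `FatPing`, `fetches_needed`) are hypothetical and come from no protocol specification; they only model what each kind of ping carries and what the subscriber has to do next.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LightPing:
    """Hypothetical model: carries only the URL of the feed that changed."""
    feed_url: str


@dataclass
class FatPing:
    """Hypothetical model: carries the updated entries themselves."""
    feed_url: str
    new_entries: List[str]


def fetches_needed(ping) -> int:
    """Extra HTTP fetches a subscriber makes after receiving the ping."""
    # A light ping only says "something changed", so the subscriber must
    # go and fetch the feed; a fat ping already contains the new content.
    return 1 if isinstance(ping, LightPing) else 0
```

With a light ping the subscriber still has to poll the feed once per notification; with a fat ping it can process the content immediately.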

There are also further criteria to consider for each protocol. Green is good, red is bad, yellow is so-so.

| Consideration | XML-RPC ping | changes.xml | SUP | SLAP | XMPP pubsub | rssCloud | PubSubHubbub |
|---|---|---|---|---|---|---|---|
| Transport | HTTP | HTTP | HTTP/HTTPS | UDP | TCP/XMPP | HTTP | HTTP/HTTPS |
| Distribution style | Ping/Poll | Polling | Polling | Ping/Poll | Push | Ping/Poll | Push |
| Latency | Low | High | High | Low | Minimum possible | Low | Minimum possible |
| Thundering herd | Yes | Yes | Yes | Yes | No | Yes | No |
| Spamable (no topics) | Yes | Yes | No | No | No | No | No |
| DoSes Publishers | Preventable | No | No | Preventable | Preventable | Preventable | Preventable |
| DoS Relay attacks | Yes | No | No | No | No | Yes | No |
| $5/month host subscriber | No | No | No | No | No | Maybe | Yes |
| Message format | XML schema | XML schema | JSON | Binary packet | Complex XMPP | XML schema | Original RSS or Atom content |
| Secure notifications | No | No | Somewhat | No | Yes | No | Yes |
| Publisher complexity | XML-RPC client | XML-RPC client | SUP IDs | UDP send | XMPP send | XML-RPC/REST ping | REST ping |
| Subscriber complexity | Crawl pipeline | Crawl pipeline | Crawl pipeline | Crawl pipeline | XMPP client | Crawl pipeline | Simple webapp |



The rest of this document will compare light and fat pinging by these metrics:


Latency

To simplify this explanation, latency is represented as network "hops": the time it takes on average for data to propagate between two Internet nodes.

Naively, light pings and fat pings look the same:

There are four network hops required to deliver new content to a subscriber.

However, this leaves light pinging hubs open to relay denial of service attacks, so the Hub must verify there is new content:

Often publishers will be combined with their own hub (for better integration with their application, better statistics gathering, optimizations) yielding:

With popular sites, a feed will be served from multiple datacenters:


Bandwidth

Assume an average feed is 100KB consisting of fifty 2KB posts.

Take the case of a single new item with 2KB of data and 100 subscribers to the feed:
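
The arithmetic for this naive case can be sketched in a few lines of Python. The feed size, item size, and subscriber count come from the text; the traffic model is an assumption made for illustration: with light pings every subscriber re-fetches the whole feed, while with fat pings the hub fetches the feed once and pushes only the new item. The size of the pings themselves is ignored.

```python
FEED_KB = 100      # whole feed: fifty 2KB posts (from the text)
NEW_ITEM_KB = 2    # the single new item (from the text)
SUBSCRIBERS = 100  # subscribers to the feed (from the text)


def light_ping_kb(subscribers=SUBSCRIBERS, feed_kb=FEED_KB):
    """Assumed model: each subscriber re-fetches the entire feed."""
    return subscribers * feed_kb


def fat_ping_kb(subscribers=SUBSCRIBERS, feed_kb=FEED_KB, item_kb=NEW_ITEM_KB):
    """Assumed model: the hub fetches the feed once, then pushes the new item."""
    return feed_kb + subscribers * item_kb


print(light_ping_kb())  # 10000 (KB)
print(fat_ping_kb())    # 300 (KB)
```

Under these assumptions the naive light-ping case moves roughly 10MB to deliver 2KB of new content, against about 300KB for fat pings.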

To prevent denial of service attacks, light pinging Hubs must verify there is new content:

  • Result:
    • Even worse than naive light pinging case.
    • Same bandwidth overhead as naive light pings.
    • Fat pinging is 33% faster than light pinging.

Light-ping advocates suggest that the Hub should re-serve only the new content on behalf of the publisher:

  • Result:
    • Same bandwidth as fat pings.
    • Still 100x as many incoming HTTP requests as fat pings.
    • 33% more latency than naive fat pings.
    • 66% more latency than combined publisher/hub fat pings.
    • Trust/security model for proxied feed on behalf of publisher unclear.

CPU Usage

Assume parsing a feed takes 10ms per item on average. For this section, assume an average feed has 25 items, so parsing a whole feed takes 250ms.

Take the case of a single new item being sent to 100 subscribers to the feed with naive pings:

However, the Hub must verify the feed is new to make this safe:

  • Result:
    • Light pings consume 25.25 CPU seconds: 250ms by the hub plus 250ms for each of the 100 consumers.
    • Even worse than the naive case.
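
The 25.25 CPU-second figure can be reproduced with a short Python sketch. The per-item parse time, feed length, and subscriber count come from the text. The fat-ping function is an assumption added only for comparison: it supposes the hub parses the whole feed once and each subscriber parses only the one new item.

```python
PARSE_MS_PER_ITEM = 10  # from the text
ITEMS_PER_FEED = 25     # from the text
SUBSCRIBERS = 100       # from the text


def verified_light_ping_cpu_seconds():
    """Hub parses the feed to verify it; every subscriber parses it again."""
    feed_parse_ms = PARSE_MS_PER_ITEM * ITEMS_PER_FEED  # 250ms per full parse
    total_ms = feed_parse_ms + SUBSCRIBERS * feed_parse_ms
    return total_ms / 1000


def fat_ping_cpu_seconds(new_items=1):
    """Assumed model: hub parses the feed once; subscribers parse new items only."""
    feed_parse_ms = PARSE_MS_PER_ITEM * ITEMS_PER_FEED
    total_ms = feed_parse_ms + SUBSCRIBERS * new_items * PARSE_MS_PER_ITEM
    return total_ms / 1000


print(verified_light_ping_cpu_seconds())  # 25.25
print(fat_ping_cpu_seconds())             # 1.25
```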

And when, for light pings, Hubs re-serve only the new content on behalf of the publisher:

  • Result:
    • Same CPU usage as fat pings.
    • Still 100x as many incoming HTTP requests as fat pings.
    • 33% more latency than naive fat pings.
    • 66% more latency than combined publisher/hub fat pings.
    • Trust/security model for proxied feed on behalf of publisher unclear.

Publisher complexity

Assume the publisher tells hubs the feed URLs.
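
For PubSubHubbub, telling the hub about an updated feed URL amounts to one form-encoded POST. The sketch below builds such a request without sending it. The `hub.mode=publish` and `hub.url` parameters follow the PubSubHubbub publish notification as I understand it; the function name and both URLs are hypothetical, so check the parameters against the spec version you target.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_publish_ping(hub_url: str, feed_url: str) -> Request:
    """Build (but do not send) a publish notification for one feed URL."""
    # Form-encoded body naming the feed (topic) that has new content.
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical hub and feed URLs, for illustration only.
ping = build_publish_ping("http://hub.example.com/", "http://example.com/feed.xml")
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out so the sketch runs without network access.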


Subscriber complexity


Hub complexity

Assume that Hubs must verify that the original feed has changed; otherwise they are just an open relay for DoS attacks.
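
One way a hub might perform this check is to fetch the feed when pinged and compare a digest of the body against the last one it saw. This is a minimal sketch under that assumption, not how any particular hub is implemented; the `VerifyingHub` class and the injected `fetch` callable are hypothetical.

```python
import hashlib
from typing import Callable, Dict


class VerifyingHub:
    """Relays a ping only when the feed body actually changed."""

    def __init__(self, fetch: Callable[[str], bytes]):
        self._fetch = fetch                     # returns the feed body for a URL
        self._last_digest: Dict[str, str] = {}  # feed URL -> digest last seen

    def handle_ping(self, feed_url: str) -> bool:
        """Return True if subscribers should be notified."""
        digest = hashlib.sha256(self._fetch(feed_url)).hexdigest()
        if self._last_digest.get(feed_url) == digest:
            return False  # unchanged content: drop the ping instead of relaying
        self._last_digest[feed_url] = digest
        return True
```

A forged or repeated ping then costs the hub one fetch of the feed, but it is not amplified into notifications to every subscriber.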


Comment by mterenzio, Sep 15, 2009

Good stuff.

Please correct me if I'm wrong, but while your representation is accurate, it may give the wrong impression to the less technical, as if anyone else is reading this stuff ;)

What I mean to say is that the subscriber here is most likely not an end-user client but a server whose job is to propagate the new content to end users. In other words, both protocols are server to server.

Therefore, the savings you speak of are true but it's not necessarily orders of magnitude. It may be in some cases, I'll grant you that.

But in many cases, the cloud and/or the hub is notifying thousands of servers of an update. Those thousands then inform the millions of end-users. It's a federated system like XMPP or email, not peer-to-peer.

When that is put into consideration, the extra complexity may outweigh the savings.

Now, I will grant you that anyone implementing either one of the protocols will probably have the requisite technical chops to handle the extra overhead of processing a fat ping, so maybe my point is moot.

On the other hand, RSS is extensible and there is nothing saying that adding a fat ping can't be put into a namespace. Then it would be just a matter of whether a developer wanted to support it or not. Their choice, and not imposed by the protocol from the onset.

Great work you are doing here.

Comment by bslatkin, Sep 16, 2009

Matt: Thanks for the feedback. Some responses:

1. I'd claim that the last mile is not significant. Topologies (application, network, datacenter) and architectures (long-polling) optimize it away. Think of how Akamai edge caching works. The real cost is federating server-to-server communication and getting data to local nodes.

2. You mentioned "the extra overhead of processing a fat ping." You're missing something crucial here. Processing fat pings requires less effort for both publishers and subscribers (i.e., the ones who matter); they offload all the hard work to hubs. That's the point: complexity in the center.

3. PubSubHubbub supports RSS right now: http://code.google.com/p/pubsubhubbub/wiki/RssFeeds

Comment by mterenzio, Sep 16, 2009

Everything you say is true to a large degree. My only rebuttals might be:

1. Some (not me) might criticize your claim that long-polling optimizes a certain problem away. It is not yet in widespread use (that's a relative statement) and to scale it well brings complexity back to the edge. Not everyone is running Tornado yet. ;)

2. Others might say, if you wanted complexity at the center you already had XMPP. Not an XMPP joke, but if you can implement PSHB, couldn't you rig up XMPP PubSub??

3. Less work on Publishers and Subscribers. Well, Brett, it's more work for me as a subscriber right now, since I don't have code in place to process and aggregate these atomized and multifeed updates. I have code that reads RSS and Atom documents as a whole. But once I do implement reassembling partial documents, I see your point.

Comment by joshfraz, Sep 16, 2009

The most compelling reason I've heard why publishers don't like fat-pings is that it will likely cause their compete.com scores to drop.

Comment by mcdtracy, Sep 26, 2009

If everyone converts to a realtime network, then the new measurement of success is the number of followers (subscribers). When the number of realtime followers is very high, the "Thundering Herd" problem emerges if the Hub doesn't offload the Publisher's website.

The only effective way to build a distributed "twitter-like" web is to ensure equal access via large-scale services.

Small clouds don't need excessive optimizations... but a realtime twitter killer has to think about removing every potential source of latency between the Pub and the Sub.

How about that Denial of Service box? There won't even be a Fail Whale to know the Hub is in trouble... just a lack of realtime news. Some say that rssCloud would just fall back to normal RSS... if you ask for news from every site you subscribe to, they will give you their latest posts. But that's just not realtime, is it?

We want it real and timely. News measured in seconds... new news.

Comment by mason.lee, Sep 28, 2009

Great analysis!

For context, I would just add that there is no one right way to do pub/sub architectures. The case for light pings is in those architectures where the subscriber should decide whether it wants to grab the additional data or not.

Optimal pings will contain the least amount of information necessary for the subscriber to be able to make that decision (weighed, of course, against the additional costs of using "light" pings shown in this doc). In rare pub/sub cases, the near emptiness of rssCloud's pings ("I updated something, but not telling you what") might actually suffice. RSS/Atom, however, does not seem to be one of those cases. RSS/Atom itself is already very close to the very header-type info you'd want to include in the "I updated" pings to let the subscriber decide if the actual content should be pulled down. This gets to the old question of how much of an article one should include in RSS. If large articles are embedded into the feeds, with pictures and ad attachments, one might be better off with rssCloud style minimalist pings than full pings. You have to take this into account when calculating the bandwidth tradeoff. Of course, a data-hungry aggregator with many end-users as the subscriber will always want to pull down everything and will always prefer fat pings. A more evenly distributed system, on the other hand, might be more selective about what goes where.

"Fat pings of light RSS/Atom" is probably the most generally optimal solution.

Comment by james.abley, Nov 03, 2009

What's "Complex XMPP"? XMPP transporting Atom entries doesn't feel very complex, for example.

