JavaMapperForHadoopProgrammers  
Updated Jun 3, 2010 by frewst...@gmail.com

Getting Started with App Engine Mappers for Hadoop Programmers

Intended Audience: Developers with some experience using Apache Hadoop MapReduce who are interested in either writing mappers on App Engine or porting their existing applications to App Engine.

The Horrible, Unspeakable (but True and About to be Spoken!) Secret: The mapper interface on App Engine is just a framework on top of the App Engine Task Queue API. For the architectural reasons behind this, see our I/O talk.

What does this mean for you? There are certain changes that we had to make to the Hadoop interfaces to support this underlying task queue infrastructure. We tried to keep the changes as minimal as possible, so while you’ll have to change a few things, we hope that modifications will be fairly quick and painless.

The modifications required:

You must use Hadoop 0.20: We believe that most frameworks already have support for the new API, so we’re targeting it exclusively. If you need support for the pre-0.20 API, please let us know via the issue tracker and we’ll prioritize it according to demand.

RecordReader must be SerializationFactory compatible: In the same way that InputSplit is serialized between the JobTracker and the workers in normal Hadoop operation, we need to serialize RecordReaders between tasks. Note that InputSplit isn’t re-serialized, so any state that changes as more records are read must be stored as part of the serialized RecordReader.
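As a rough sketch of what this requirement implies, the reader below carries its read position inside its own serialized form, so a later task can resume where the previous one stopped. CountingLineReader and WritableLike are hypothetical stand-ins invented for this illustration; WritableLike only mirrors the write(DataOutput)/readFields(DataInput) shape of Hadoop's Writable so that the example runs on the JDK alone. A real reader would presumably extend RecordReader and use a serialization that SerializationFactory recognizes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical reader over an in-memory "split"; not a real Hadoop RecordReader.
class CountingLineReader implements WritableLike {
    private String[] lines = new String[0]; // stands in for the split's data
    private int next = 0;                    // mutable state: index of the next record

    CountingLineReader() {}                  // no-arg constructor used when deserializing

    CountingLineReader(String[] lines) { this.lines = lines.clone(); }

    boolean hasNext() { return next < lines.length; }

    String nextRecord() { return lines[next++]; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(lines.length);
        for (String line : lines) out.writeUTF(line);
        out.writeInt(next); // the position must travel with the reader
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        lines = new String[in.readInt()];
        for (int i = 0; i < lines.length; i++) lines[i] = in.readUTF();
        next = in.readInt();
    }

    // Round-trips a reader through bytes, imitating a hand-off between two tasks.
    static CountingLineReader handOff(CountingLineReader reader) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            reader.write(new DataOutputStream(bytes));
            CountingLineReader restored = new CountingLineReader();
            restored.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return restored;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        CountingLineReader first = new CountingLineReader(new String[] {"a", "b", "c"});
        first.nextRecord();                          // the first task consumes one record
        CountingLineReader second = handOff(first);  // state is serialized for the next task
        System.out.println(second.nextRecord());     // the next task resumes after it
    }
}

// Hypothetical stand-in with the same method shape as Hadoop's Writable.
interface WritableLike {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}
```

If the position were kept anywhere other than the serialized fields, the restored reader would start again from the first record.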

Your Mapper must inherit from AppEngineMapper: The main restriction this imposes is that you can’t override run(), since that method has to be in scope for the entire time the InputSplit is being processed, and that processing may be spread over multiple tasks. Additionally, AppEngineMapper adds two methods for once-per-task operations: taskSetup() and taskCleanup(). They are analogous to setup() and cleanup(), but they run once per task rather than once per InputSplit.
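The difference between the per-split and per-task hooks can be illustrated with a small simulation. Everything in it is a hypothetical stand-in rather than the library's API: SketchMapper uses simplified, argument-free hook signatures, a "task" is modelled as a fixed number of records, and the ordering of setup() relative to taskSetup() within the first task is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: every type here is a hypothetical stand-in, not the library's API.
class LifecycleSketch {
    // Simplified stand-in for a mapper base class with per-split and per-task hooks.
    static abstract class SketchMapper {
        void setup() {}        // once per input split
        void taskSetup() {}    // once per task
        abstract void map(String record);
        void taskCleanup() {}  // once per task
        void cleanup() {}      // once per input split
    }

    // Records each lifecycle call so the ordering is visible.
    static class LoggingMapper extends SketchMapper {
        final List<String> calls = new ArrayList<>();
        @Override void setup() { calls.add("setup"); }
        @Override void taskSetup() { calls.add("taskSetup"); }
        @Override void map(String record) { calls.add("map:" + record); }
        @Override void taskCleanup() { calls.add("taskCleanup"); }
        @Override void cleanup() { calls.add("cleanup"); }
    }

    // Processes one split as a series of tasks of at most recordsPerTask records each.
    static void runSplit(SketchMapper mapper, List<String> split, int recordsPerTask) {
        if (recordsPerTask <= 0) {
            throw new IllegalArgumentException("recordsPerTask must be positive");
        }
        for (int start = 0; start < split.size(); start += recordsPerTask) {
            int end = Math.min(start + recordsPerTask, split.size());
            if (start == 0) mapper.setup();            // first task of the split
            mapper.taskSetup();
            for (String record : split.subList(start, end)) mapper.map(record);
            mapper.taskCleanup();
            if (end == split.size()) mapper.cleanup(); // last task of the split
        }
    }

    public static void main(String[] args) {
        LoggingMapper mapper = new LoggingMapper();
        runSplit(mapper, Arrays.asList("r1", "r2", "r3"), 2);
        System.out.println(mapper.calls);
    }
}
```

With three records and two records per simulated task, the task hooks fire twice while setup() and cleanup() each fire once. Resources that cannot survive a task boundary, such as open connections, would therefore belong in the per-task hooks.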

The job creation interface is somewhat different: Since the Hadoop job creation system expects the controller to block for the duration of the job, we can’t use it wholesale. Currently, the suggested way to start a job is to construct a Configuration with the relevant parameters, serialize it to XML via writeXml(), and then send it as the configuration POST parameter to {location of your mapreduce handler}/start. We’re actively looking into integrating with frameworks that start jobs, such as Pig and Cascading, as we add support for the full MapReduce process.
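The POST step might look roughly like the following sketch, which uses only the JDK. StartJobSketch, formBody, and postStart are names made up for this example, the sample XML and its example.key property are placeholders, and the handler base URL is whatever path your mapreduce handler is mapped to. Authentication and any other request requirements of the handler are not covered here.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of Hadoop or the App Engine mapper library.
class StartJobSketch {
    // Builds a form body carrying the job XML as the "configuration" parameter.
    static String formBody(String configurationXml) {
        try {
            return "configuration=" + URLEncoder.encode(configurationXml, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always available", e);
        }
    }

    // POSTs the body to <handlerBase>/start and returns the HTTP status code.
    static int postStart(String handlerBase, String configurationXml) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(handlerBase + "/start").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(formBody(configurationXml).getBytes(StandardCharsets.UTF_8));
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) {
        // With Hadoop on the classpath, the XML would come from the Configuration itself:
        //   ByteArrayOutputStream buf = new ByteArrayOutputStream();
        //   conf.writeXml(buf);
        // A placeholder document is used here so the sketch runs without Hadoop.
        String xml = "<configuration><property><name>example.key</name>"
                + "<value>example value</value></property></configuration>";
        System.out.println(formBody(xml)); // no request is sent in this demo
    }
}
```

Because the request returns as soon as the job is enqueued, the caller does not block for the duration of the job the way a Hadoop controller would.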

Other than that, your Hadoop mapper should work unchanged.

Comment by skygam...@gmail.com, Sep 21, 2011

Hi, I am about to migrate code from Hadoop to Google App Engine.

I noticed that this article was last updated on Jun 3, 2010, and at that time Google App Engine only supplied the App Engine Mapper Library.

Since Google App Engine now has a full version of MapReduce, which was announced at Google I/O 2011, I am wondering whether there are any big differences or anything special I need to pay attention to.

Thank you very much.

Comment by googea...@gmail.com, Nov 5, 2011

Does GAE now support reducers?

