You can also use pipelines to chain together MapReduce jobs.
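Chaining means feeding one job's output into the next job's mapper. Here is a minimal plain-Python sketch of the idea, not the library's actual pipeline API: the `map_reduce` helper and the purchase data are invented for illustration, and the real library runs each stage over sharded workers.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run one map/reduce pass: map emits (key, value) pairs, reduce folds each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

purchases = [("alice", 5), ("bob", 3), ("alice", 7)]

# Job 1: total spend per user.
totals = map_reduce(
    purchases,
    map_fn=lambda rec: [(rec[0], rec[1])],
    reduce_fn=lambda user, amounts: sum(amounts),
)

# Job 2, chained onto job 1's output: bucket users into spend tiers.
tiers = map_reduce(
    totals.items(),
    map_fn=lambda item: [("high" if item[1] >= 10 else "low", item[0])],
    reduce_fn=lambda tier, users: sorted(users),
)
```

The chaining is just the `totals.items()` handoff: the second job treats the first job's output as its input records, which is exactly what a pipeline automates.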
It is also possible to add your own input sources.
Check the WhatsNew page for recent changes in the library.
Python users should check out the new demo app documentation for details on how to use the MapReduce API - examples included.
Mapper documentation is available in the Getting Started documents for Python and Java. If you have experience with Hadoop, you may also be interested in our transition guide for Hadoop programmers.
Watch and Learn
The demo app also provides a nice frontend that analysts and non-programmers can use to run their jobs:
The MapReduce API also comes with a UI that shows how far into each step of your computation you are. In the top screenshot, we're a little way into our MapReduce job, and in the bottom screenshot, we've finished a MapReduce job.
Finally, users can download the results of their MapReduce jobs. Here are the results of our WordCount job, listing how many times each word in the input set appears:
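WordCount boils down to a mapper that emits a `(word, 1)` pair for each word and a reducer that sums the ones per word. A self-contained sketch in plain Python, not the library's actual mapper API:

```python
from collections import defaultdict

lines = ["to be or not to be", "not to be"]

# Map phase: each line emits (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# Shuffle/reduce phase: group by word and sum the counts.
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one
```

The downloadable results file is just this `counts` mapping serialized one word per line.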
Chris Bunch chats about the "MapReduce Made Easy" app. This app is a revamped version of the one used in Mike Aizatsky's talk, and easily lets non-programmers upload their data and run MapReduce jobs on it. This is a short video that should whet your appetite for the MapReduce API.
Want to run the app you saw in the screenshots and video above? Go for it! We've revamped the sample Python app that comes with the MapReduce API to make it easier to use - check out the source here and let us know what you think about it!
Like the MapReduce library? Think it could be better? It’s all Apache 2.0 licensed. Check out our Subversion repository and feel free to post patches on the issue tracker.
Iterate over line-oriented blob and datastore data out of the box. An extensible framework for adding input readers for your own data formats is included.
Automatic sharding for faster execution. Use as many workers as you need to get your results faster.
Processing rate limiting. Don’t worry about running over quota. Slow down your mapper and space it out over days. Need your results now? Turn it up all the way and get up to 300 entities/second/worker!
Status pages. Always know what jobs you’re running and how they’re doing.
Aggregated counters. Keep statistics along the way and do simple rollup reports.
Parametrized, reusable mappers. Let non-programmers run their own mapper jobs using parameters and validation that you configure.
Batched datastore operations. Automatically batches datastore puts so you don’t have to.
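The batched-puts idea is a buffer that collects writes and flushes them to the datastore in groups. A conceptual sketch, assuming a generic `backend_put` callable and an illustrative batch size; this is not the library's actual mutation-pool API:

```python
class MutationPool:
    """Buffer entities and flush them to the backend in batches."""

    def __init__(self, backend_put, max_batch=500):
        self.backend_put = backend_put  # called with a list of entities
        self.max_batch = max_batch
        self.buffer = []

    def put(self, entity):
        # Queue the write; flush automatically once the buffer fills.
        self.buffer.append(entity)
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self):
        # Send any buffered entities as one batched put.
        if self.buffer:
            self.backend_put(self.buffer)
            self.buffer = []
```

A mapper just calls `pool.put(entity)` for each write; the framework flushes the pool at slice boundaries so no entity is left buffered.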
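The parametrized-mapper feature above means a developer configures validation once, and non-programmers supply only parameter values. A hedged sketch of the pattern in plain Python; `validate_params` and `make_mapper` are hypothetical names, not the library's API:

```python
def validate_params(params):
    """Check user-supplied parameters before the job starts."""
    limit = int(params.get("limit", 100))  # raises ValueError on non-numeric input
    if limit <= 0:
        raise ValueError("limit must be positive")
    return {"limit": limit}

def make_mapper(params):
    """Build a mapper closed over the validated parameters."""
    limit = params["limit"]

    def mapper(entity):
        # Only emit entities whose size is within the configured limit.
        if entity["size"] <= limit:
            yield (entity["name"], entity["size"])

    return mapper
```

Because validation runs up front, a bad parameter fails fast at submission time instead of partway through a long job.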