Why do we develop the HaLoop project?

The growing demand for large-scale data mining and data analysis applications has led both industry and academia to design new types of highly scalable data-intensive computing platforms. MapReduce and Dryad are two popular platforms in which the dataflow takes the form of a directed acyclic graph of operators. However, these new platforms do not have built-in support for iterative programs, which arise naturally in many applications including data mining, web ranking, graph processing, model fitting, and so on.


What is HaLoop?

Simply speaking, HaLoop = Ha, Loop:-) HaLoop is a modified version of the Hadoop MapReduce framework, designed to serve these iterative applications. HaLoop not only extends MapReduce with programming support for iterative applications, but also dramatically improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms. We evaluated HaLoop on real queries and real datasets and found that, on average, HaLoop reduces query runtimes by a factor of 1.85 compared with Hadoop, and shuffles only 4% of the data between mappers and reducers that Hadoop does.

In short, HaLoop has the following features: 1) it provides caching options for loop-invariant data access, 2) it lets users reuse major building blocks from their applications' Hadoop implementations, and 3) it has intra-job fault-tolerance mechanisms similar to Hadoop's. HaLoop is also backward-compatible with Hadoop jobs.
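
To illustrate why caching loop-invariant data reduces shuffling, here is a minimal toy sketch in plain Python. It is not HaLoop's API and does not use Hadoop; the function `run_iterations`, its parameters, and the record-counting model of "shuffling" are all assumptions made for illustration. Each iteration combines a loop-invariant edge list with loop-variant ranks, and the sketch simply counts how many records would have to be moved with and without a cache.

```python
from collections import defaultdict


def run_iterations(edges, ranks, iterations, cache_invariant):
    """Toy iterative rank propagation (hypothetical, not HaLoop code).

    Returns (final_ranks, records_shuffled), where records_shuffled is a
    simple count of records that would be re-sent in each iteration.
    """
    shuffled = 0
    cached_adjacency = None  # stands in for a cache of loop-invariant input

    for _ in range(iterations):
        if cached_adjacency is None or not cache_invariant:
            # "Shuffle" the loop-invariant edge list to group edges by source.
            adjacency = defaultdict(list)
            for src, dst in edges:
                adjacency[src].append(dst)
            shuffled += len(edges)
            if cache_invariant:
                cached_adjacency = adjacency
        else:
            # Reuse the grouped edges from the first iteration.
            adjacency = cached_adjacency

        # The loop-variant ranks must be shuffled on every iteration.
        shuffled += len(ranks)

        new_ranks = defaultdict(float)
        for node, rank in ranks.items():
            targets = adjacency.get(node, [])
            for dst in targets:
                new_ranks[dst] += rank / len(targets)
        ranks = {node: new_ranks.get(node, 0.0) for node in ranks}

    return ranks, shuffled


edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
start = {"a": 1.0, "b": 1.0, "c": 1.0}

plain_ranks, plain_shuffled = run_iterations(edges, start, 5, cache_invariant=False)
cached_ranks, cached_shuffled = run_iterations(edges, start, 5, cache_invariant=True)

print(plain_shuffled, cached_shuffled)  # prints "35 19"
```

Both runs compute the same ranks; only the number of records moved differs. In this toy model the saving grows with the number of iterations and with the size of the invariant data relative to the variant data, which is the intuition behind the caching feature listed above. The real system's caches and scheduler are described in the paper below.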

Note that at this stage, HaLoop is only a prototype rather than a production system. We are working to make it more robust and stable.

Get started:

HaLoop publications

HaLoop: Efficient Iterative Data Processing on Large Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, Michael D. Ernst. In VLDB'10: The 36th International Conference on Very Large Data Bases, Singapore, 24-30 September, 2010.

  • The main paper describing loop-aware task scheduling and caching mechanisms in HaLoop, as well as experiments that compare the performance of iterative applications on HaLoop and Hadoop.
  • The application programming interface has since been revised to be more natural and intuitive than the one described in the paper. As a result, major building blocks from standard Hadoop implementations of iterative applications can be reused.

Contact

Yingyi Bu (buyingyi@gmail.com)

Sponsor

This project is supported by the National Science Foundation Grant No. 0844572.
