My favorites | Sign in
Project Home Downloads Wiki Issues Source
Search
for
SectorFileSystem  
Running Hadoop MapReduce over the Sector Distributed File System
Featured, Phase-Design
Updated Oct 23, 2009 by collinbe...@gmail.com

We created an interface to Sector which uses the Java Native Interface to allow applications written in Java to access the Sector file system (Sector is written in C++). To illustrate using this JNI component to access Sector from a Java application, and to provide an example of interoperability between cloud systems, we created an implementation of the Hadoop file system abstraction which allows Sector to be used as the backing file store for Hadoop MapReduce applications.

Introduction

Hadoop/HDFS and Sector/Sphere are software frameworks and distributed file systems designed to allow users to store and process extremely large data sets. Although these are open systems the programming interfaces are incompatible. This makes inter-operation difficult – applications targeted for one system cannot be run with a competing system.

Hadoop does however offer the ability to access other file systems, allowing them to be used as the backing store for MapReduce applications. This requires creating a custom implementation of the Hadoop file system abstraction – for example, there is an implementation that allows for Amazon S3 to be used as the backing store for Hadoop. The MapReduce processing still runs within the Hadoop framework, with the alternative file system implementation providing the bridge between MapReduce and the backing file system. We created an implementation of the Hadoop file system abstraction for the Sector file system. This allows the same Hadoop MapReduce application to be run against data in either the Hadoop distributed file system or the Sector file system.

Implementation

The interface between Hadoop MapReduce and the Sector file system is implemented by three primary sets of components:

  1. The interface to Sector is implemented using the Java Native Interface (JNI). The Java Native Interface is a framework that allows Java code to call native applications written in languages such as C or C++. Sector is implemented in C++, so this layer is necessary to allow Java code to access Sector.
  2. Data reads and writes between the systems use Java NIO (New I/O) which uses direct-mapped byte buffers.
  3. On the Hadoop side, access to the Sector JNI bridge goes through a custom implementation of the Hadoop File System interface.

Components 1 and 2 make up the JNI component which has been contributed to Sector. Note that since we are utilizing Hadoop MapReduce as the application framework, Sphere (Sector's Programming Framework) is not utilized as part of the interface described in this document. The figure below describes a high level view of the software architecture which implements the interface between Hadoop and Sector.

The following provides a description of each component in the diagram.

  • Sector – The Sector master and slave servers. This is the primary Sector infrastructure which controls access to data stored in the Sector file system.
  • Sector Client – The client API, implemented in C++, which allows client applications to access Sector data. This API provides standard file system operations such as opening files, reading and writing to files, retrieving directory listings, etc.
  • SectorJNIClient – This is the JNI component contributed to Sector. It is a custom implemented component that provides access to the Sector Client API. This component uses the Java Native Interface (JNI) to allow Java applications to access the Sector C++ client. This component has two primary pieces:
    1. SectorJniClient.cpp – The C++ code which acts as the translation layer between the Sector C++ client and Java. This code is compiled into a shared object.
    2. SectorJniClient.java - A Java class which loads the C++ shared object and provides an interface between Java clients and the C++ layer. Note that this class mainly acts as a bridge to the C++ layer which manages all access to the Sector client API, translation between C++ objects and Java objects, etc.
  • Hadoop MapReduce – The Hadoop implementation of MapReduce. This is the infrastructure provided by Hadoop to implement and manage MapReduce processing.
  • Hadoop FileSystem – The Hadoop abstraction that defines the interface between Hadoop applications and an underlying file system. This interface must be implemented for each files system that Hadoop will access. Hadoop includes several standard implementations of File System such as LocalFileSystem and DistributedFileSystem, which access the local file system and Hadoop distributed file system respectively.
  • Sector Filesystem – A custom implementation of the Hadoop File System interface which provides access to Sector through the SectorJniClient.java class associated with the event.

Testing

We tested Hadoop MapReduce over Sector using a stylized analytic computation. We chose the Malstone A-10 benchmark for clouds designed for data intensive computing. Although we did not perform formal benchmarks, using Malstone A-10 for the tests provided a consistent, repeatable test that should have stressed the components.

The data for Malstone A-10 is generated by Malgen, an open source software package for generating site-entity log files. Malgen generates large, distributed data sets suitable for testing and benchmarking software designed to support data intensive computing. See http://code.google.com/p/malgen.

Each record is 100 bytes, with fixed-width fields and has the following pipe-delimited format:

Event ID | Timestamp | Site ID | Compromise Flag | Entity ID

Code

The code used for testing is a simple computation we refer to as Malstone A. Here is a description of the computation using pseudo-code:

for record in read( data )
    ( site, compromised_indicator ) = parse( record )
    group by site

for each site
    total_compromised, total_seen = 0
    total_compromised += compromised_indicator 
    total_seen += 1

    statistic[site] = total_compromised / total_seen

Configuration

The following Hadoop MapReduce configuration was used for testing:

  • Hadoop version 0.18.3 was used.
  • Sector version 1.22 was used.
  • The Hadoop heap size was set to 6000 MB.
  • Replication for each file system was set to 1.
  • The number of reducers was 47 while the number of mappers was left unspecified, allowing Hadoop to determine the number of mappers based on the input format.

Overhead / Issues

We saw an unexpected increase in the time required to run the job from both Hadoop using HDFS and the same job written as a Sphere UDF (User Defined Function) running over Sector.

The following are possible reasons for performance differences between Hadoop MapReduce with HDFS and Sector.

  • JNI overhead – although the Java Native Interface is considered the most efficient way to access C++ code from Java, it is possible that there is a performance penalty involved in going through a JNI layer to access Sector.
  • I/O overhead running against a non-Hadoop file system – there appears to be a great deal of file transfers between the distributed and local files systems. It is not clear if these file operations are greater than Hadoop running against HDFS, but it is possible this is a potential source of processing slow down.
  • Configuration – We ran the MapReduce jobs over HDFS and Sector using identical configurations. This was done for a sense of 'fairness', but there is no reason to assume that either of the setups would not run better under a more targeted configuration.
  • Block Size - Sector does not block files as HDFS does and there is some difficultly in trying to gracefully handle the issue.

Locality

When we began developing this File System implementation, there was poor support for data locality exposed by Sector implementation . A feature used by Hadoop MapReduce to improve processing efficiency is to move processing to the same rack/nodes where the data is located. The Hadoop file system abstraction provides a method to be implemented which can get and use the data locality information from the underlying data store.

A concept of data locality that is accessible from Sector over the JNI component has been released in version 1.22. We may not be taking full advantage of this due to differences in the systems and the newness of the feature.

Further Work

In addition to taking better advantage of locality, we need to create a methodology for determining an optimal configuration based on the system and the size of the data set. As stated above, both systems were tested with the same configuration. When looking for a more optimal configuration, we should focus on:

  1. The heap size made available to the JVM running Hadoop.
  2. The size of the in-memory filsystem instance in MB
  3. The number of server threads for the datanode.
  4. The default block size for new and stored files.
  5. The number of parallel transfers run by reduce during the copy (shuffle) phase

Sign in to add a comment
Powered by Google Project Hosting