|
SectorFileSystem
We created an interface to Sector which uses the Java Native Interface to allow applications written in Java to access the Sector file system (Sector is written in C++). To illustrate using this JNI component to access Sector from a Java application, and to provide an example of interoperability between cloud systems, we created an implementation of the Hadoop file system abstraction which allows Sector to be used as the backing file store for Hadoop MapReduce applications. IntroductionHadoop/HDFS and Sector/Sphere are software frameworks and distributed file systems designed to allow users to store and process extremely large data sets. Although these are open systems the programming interfaces are incompatible. This makes inter-operation difficult – applications targeted for one system cannot be run with a competing system. Hadoop does however offer the ability to access other file systems, allowing them to be used as the backing store for MapReduce applications. This requires creating a custom implementation of the Hadoop file system abstraction – for example, there is an implementation that allows for Amazon S3 to be used as the backing store for Hadoop. The MapReduce processing still runs within the Hadoop framework, with the alternative file system implementation providing the bridge between MapReduce and the backing file system. We created an implementation of the Hadoop file system abstraction for the Sector file system. This allows the same Hadoop MapReduce application to be run against data in either the Hadoop distributed file system or the Sector file system. ImplementationThe interface between Hadoop MapReduce and the Sector file system is implemented by three primary sets of components:
Components 1 and 2 make up the JNI component which has been contributed to Sector. Note that since we are utilizing Hadoop MapReduce as the application framework, Sphere (Sector's Programming Framework) is not utilized as part of the interface described in this document. The figure below describes a high level view of the software architecture which implements the interface between Hadoop and Sector.
The following provides a description of each component in the diagram.
TestingWe tested Hadoop MapReduce over Sector using a stylized analytic computation. We chose the Malstone A-10 benchmark for clouds designed for data intensive computing. Although we did not perform formal benchmarks, using Malstone A-10 for the tests provided a consistent, repeatable test that should have stressed the components. The data for Malstone A-10 is generated by Malgen, an open source software package for generating site-entity log files. Malgen generates large, distributed data sets suitable for testing and benchmarking software designed to support data intensive computing. See http://code.google.com/p/malgen. Each record is 100 bytes, with fixed-width fields and has the following pipe-delimited format: Event ID | Timestamp | Site ID | Compromise Flag | Entity ID CodeThe code used for testing is a simple computation we refer to as Malstone A. Here is a description of the computation using pseudo-code: for record in read( data )
( site, compromised_indicator ) = parse( record )
group by site
for each site
total_compromised, total_seen = 0
total_compromised += compromised_indicator
total_seen += 1
statistic[site] = total_compromised / total_seenConfigurationThe following Hadoop MapReduce configuration was used for testing:
Overhead / IssuesWe saw an unexpected increase in the time required to run the job from both Hadoop using HDFS and the same job written as a Sphere UDF (User Defined Function) running over Sector. The following are possible reasons for performance differences between Hadoop MapReduce with HDFS and Sector.
LocalityWhen we began developing this File System implementation, there was poor support for data locality exposed by Sector implementation . A feature used by Hadoop MapReduce to improve processing efficiency is to move processing to the same rack/nodes where the data is located. The Hadoop file system abstraction provides a method to be implemented which can get and use the data locality information from the underlying data store. A concept of data locality that is accessible from Sector over the JNI component has been released in version 1.22. We may not be taking full advantage of this due to differences in the systems and the newness of the feature. Further WorkIn addition to taking better advantage of locality, we need to create a methodology for determining an optimal configuration based on the system and the size of the data set. As stated above, both systems were tested with the same configuration. When looking for a more optimal configuration, we should focus on:
|