PySphere
Embedded Python for Sphere

Introduction
We introduce a Python embedding into Sphere called PySphere that exposes the Sphere MapReduce framework to pure Python code. This allows arbitrary Python MapReduce scripts to take advantage of Sphere's MapReduce capabilities. To demonstrate how PySphere can be used, we present an illustrative implementation of the algorithm used in the MalStone A-10 benchmark.
Normally a Sphere application is implemented using pure C++. With PySphere the work of reading and writing data to and from the Sector file system is handled by generic C++ code written with the Sphere framework, but the processing of the data is performed by custom functions written in Python.
PySphere is composed of two primary components:

Interoperability

Language
PySphere (and the related project PySector, which exposes native Sector file system commands to Python) makes Sector / Sphere programming accessible to clients written in Python. Previously, Sector / Sphere could only be accessed from C++ code. It is our intention to make a Sector cloud accessible from multiple programming languages. Note: The Sector JNI project has also been contributed to Sector; it makes Sector, but not Sphere, available to Java clients.

Cloud
We do not know of any large data clouds that use Python as the primary computation language. However, thanks to interoperability efforts, many are accessible from Python. For example, Hadoop, which is a Java application, has Pipes and Streaming interfaces. Pipes allows C++ code to run against Hadoop. Streaming opens up Hadoop's MapReduce to any programming language that can read and write standard I/O, including Python.
Below is pseudo-code illustrating one way the map step for MalStone A can be realized. Versions for PySphere and Hadoop's Streaming API are then given and compared. Aside from differences due to the latter using standard I/O instead of function calls, the code is very similar.

Map
    for record in read( data )
        ( site, compromised_indicator ) = parse( record , '|')
        group by site

PySphere map function:
    #!/usr/bin/env python
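    def parse(line):
        # Split a pipe-delimited MalStone record into its fields.
        return line.split('|')

    def map(line, sep='\t'):
        # Return "site<sep>compromised indicator" for one input record.
        data = parse(line)
        return data[2] + sep + data[3]

    if __name__ == "__main__":
        # Standalone smoke test only (an assumption): PySphere calls map(line) itself.
        import sys
        for line in sys.stdin:
            print(map(line.rstrip('\n')))

Slight modification to run against Hadoop's Streaming API:
    #!/usr/bin/env python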
    import sys

    def read_input(file):
        # Yield the fields of each pipe-delimited record read from the stream.
        for line in file:
            yield line.rstrip('\n').split('|')

    def map(sep='\t'):
        # Write "site<sep>compromised indicator" to standard output for every record.
        data = read_input(sys.stdin)
        for record in data:
            print('%s%s%s' % (record[2], sep, record[3]))

    if __name__ == "__main__":
        map()

All of the functions, including the native versions, are compared on the page MalStoneAFunctions.

Tests and Test Results
The MalStone A-10 benchmark was used to evaluate the performance of PySphere. Tests were first run with a pure C++ Sphere MapReduce implementation of MalStone A and then with a PySphere version, and the resulting times were compared. The Open Cloud Consortium Testbed was used for all tests.
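
To make the computation being timed more concrete, the following is a minimal sketch of a per-site aggregation that could follow the map step shown earlier. It is not the PySphere or C++ reduce code used in the tests. The function name `reduce_by_site` is hypothetical, the input is assumed to be the tab-separated "site, compromised indicator" lines produced by the map functions, and an indicator of "1" is assumed to mean compromised. It is also a simplification: it reports the fraction of flagged records per site and ignores any entity or timestamp handling the actual benchmark may require.

```python
from collections import defaultdict


def reduce_by_site(lines, sep="\t"):
    """Aggregate map output lines of the form '<site><sep><indicator>'.

    Returns a dict mapping each site to the fraction of its records
    whose indicator is "1" (assumed here to mean "compromised").
    """
    visits = defaultdict(int)
    compromised = defaultdict(int)
    for line in lines:
        site, indicator = line.rstrip("\n").split(sep)
        visits[site] += 1
        if indicator.strip() == "1":
            compromised[site] += 1
    # Every site in `visits` has at least one record, so no division by zero.
    return {site: compromised[site] / visits[site] for site in visits}


if __name__ == "__main__":
    sample = ["siteA\t1", "siteA\t0", "siteB\t0"]
    for site, ratio in sorted(reduce_by_site(sample).items()):
        print(site, ratio)
```

Grouping by site before this step is what lets each site's ratio be computed from a single stream of records.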
The hardware configuration was:
The Sector master was run on the hardware master node and 20 Sector slaves were used, one per slave node.

Limitations and Next Steps
The following are some limitations of the current version of PySphere:
These results were obtained with a partition function that created a small number of map output files, which appears to have led to poor performance in the reduce stage. Data from the map stage was written to fewer files than there were physical nodes, and this imbalance caused some nodes to run hot while others were under-utilized.
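
The imbalance described above can be illustrated with a small sketch. It is not the partition function used in the tests: `partition`, `bucket_loads`, and `idle_nodes` are hypothetical names, the CRC-based hashing is a stand-in, and the model assumes that each map output file is processed by exactly one node in the reduce stage.

```python
import zlib
from collections import Counter


def partition(key, num_buckets):
    """Hypothetical partition function: map a site ID to an output bucket."""
    # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def bucket_loads(keys, num_buckets):
    """Count how many keys land in each bucket, in bucket order."""
    loads = Counter(partition(k, num_buckets) for k in keys)
    return [loads.get(b, 0) for b in range(num_buckets)]


def idle_nodes(num_buckets, num_nodes):
    """Nodes with no file to process, assuming one node per output file."""
    return max(0, num_nodes - num_buckets)


if __name__ == "__main__":
    sites = ["site%d" % i for i in range(1000)]
    print(bucket_loads(sites, 8))
    print(idle_nodes(8, 20))
```

Under these assumptions, 8 buckets spread over the 20 slave nodes used in the tests would leave 12 nodes without reduce work, so choosing at least as many buckets as nodes is one way to avoid the effect.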
Work to open Sphere up to standard I/O streams is underway, but it is not expected to replace PySphere completely. Some applications are not well suited to standard I/O, and the overhead of that approach is still unknown.
The next steps for PySphere are to