
malgen
MalGen is a set of scripts that generate large, distributed data sets suitable for testing and benchmarking software designed to perform parallel processing on large data sets. The data sets can be thought of as site-entity log files. After an initial seeding, data generation can be initiated from a single central node and run concurrently on multiple remote nodes of the cluster.
The generated data follows certain statistical distributions, which we believe present a usable model for such logs.
There are two intended uses for MalGen:
1. To generate a large, possibly distributed, data set for use with analytics.
2. To generate data for use with benchmarking algorithms or applications.
With the first use, records are generated probabilistically and extra records may be produced so that the entire data set follows the specified distribution. With the second use, strict adherence to the distribution is not necessary as the user is more interested in generating exactly the specified number of records.
Release v0.9 exposes a command-line switch that toggles between following the distribution and generating exactly the number of records specified. When the distribution is followed, the number of records generated is probabilistic, so there is no way to determine in advance exactly how many records each generated file will contain. When the exact number of records is generated, the data may be somewhat less suitable for statistical analysis.
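The trade-off between the two modes can be illustrated with a minimal sketch. This is not MalGen's code: the power-law site popularity, the function names, and the rounding scheme are all assumptions chosen only to show why one mode yields a variable record count and the other an exact one.

```python
import random

def site_weights(num_sites, exponent=1.0):
    """Hypothetical popularity model: site k gets weight proportional to 1 / k**exponent."""
    raw = [1.0 / (k ** exponent) for k in range(1, num_sites + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def generate_following_distribution(target, weights, rng):
    """Draw each site's visit count on its own, so the total is only close to target."""
    records = []
    for site_id, w in enumerate(weights):
        expected = target * w
        whole = int(expected)
        # Stochastic rounding: keep the fractional visit with matching probability.
        count = whole + (1 if rng.random() < expected - whole else 0)
        records.extend([site_id] * count)
    return records

def generate_exact(target, weights, rng):
    """Sample exactly `target` site ids; per-site counts fluctuate around the model."""
    return rng.choices(range(len(weights)), weights=weights, k=target)

rng = random.Random(0)
weights = site_weights(50)
loose = generate_following_distribution(1000, weights, rng)
exact = generate_exact(1000, weights, rng)
print(len(loose), len(exact))
```

In this sketch the first function can return slightly more or fewer than 1000 records, while the second always returns exactly 1000.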
View MalGen_vX.X_Overview.pdf in the distribution, or download it separately, for more details, including information on using the scripts.
MalStone
2009-06-18. v0.8.2 has just been released.
MalStone is a stylized benchmark for data-intensive computing that uses records generated by MalGen.
The MalStone A-10 and B-10 benchmarks each consist of 10 billion records, with all timestamps falling within a one-year period. The MalStone A-10 benchmark computes a ratio for each site w as follows: aggregate all entities that visited the site at any time, and compute the percent of visits for which the entity became compromised at any time subsequent to the visit.
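The per-site ratio described above can be sketched in a few lines. The record layout here is an assumption made for illustration: visits are `(site, entity, visit_time)` tuples and `compromised_at` maps an entity to the time it became compromised. Neither is the actual MalGen record format, and a real run over 10 billion records would need a distributed implementation.

```python
from collections import defaultdict

def malstone_a(visits, compromised_at):
    """Per site: percent of visits whose entity was compromised after the visit."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for site, entity, visit_time in visits:
        totals[site] += 1
        when = compromised_at.get(entity)  # None if the entity was never compromised
        if when is not None and when > visit_time:
            hits[site] += 1
    return {site: 100.0 * hits[site] / totals[site] for site in totals}

# Hypothetical data: entity e1 becomes compromised at time 4.
visits = [("w1", "e1", 1), ("w1", "e2", 2), ("w2", "e1", 5), ("w2", "e3", 3)]
compromised_at = {"e1": 4}
ratios_a = malstone_a(visits, compromised_at)
print(ratios_a)
```

Only the visit of e1 to w1 precedes its compromise, so w1 scores 50 percent and w2 scores 0 percent in this toy data.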
MalStone B-10 is similar, except that the ratio is computed for each week d: for each site w, and for all entities that visited the site at week d or earlier, compute the percent of visits for which the entity became compromised at any time between the visit and the end of week d.
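A naive sketch of the weekly variant follows, under stated assumptions: timestamps are day numbers, week d covers days `7*d` through `7*d + 6`, visits are `(site, entity, visit_day)` tuples, and `compromised_at` maps an entity to its compromise day. These conventions are hypothetical, and the loop rescans all visits for every week, which is fine for illustration but not for benchmark-scale data.

```python
def malstone_b(visits, compromised_at, num_weeks, days_per_week=7):
    """For each (site, week d): percent of visits made by the end of week d
    whose entity was compromised between the visit and the end of week d."""
    result = {}
    for d in range(num_weeks):
        week_end = (d + 1) * days_per_week  # first day after week d
        totals, hits = {}, {}
        for site, entity, day in visits:
            if day >= week_end:
                continue  # visit happened after week d
            totals[site] = totals.get(site, 0) + 1
            when = compromised_at.get(entity)
            if when is not None and day < when < week_end:
                hits[site] = hits.get(site, 0) + 1
        for site, total in totals.items():
            result[(site, d)] = 100.0 * hits.get(site, 0) / total
    return result

# Hypothetical data: entity e1 becomes compromised on day 10 (week 1).
visits_b = [("w1", "e1", 1), ("w1", "e2", 9), ("w2", "e1", 3)]
ratios_b = malstone_b(visits_b, {"e1": 10}, num_weeks=2)
print(ratios_b)
```

In week 0 no compromise has happened yet, so both sites score 0 percent. By the end of week 1, w1 has one compromised visit out of two and w2 has one out of one.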
The MalStone package is available on the Downloads tab.
Sample run of MalGen data generation
| Compromised Stage | | | | Uncompromised Stage | | |
|:------------------|:--|:--|:--|:--------------------|:--|:--|
| Num Records | RAM | Duration | | Num Records | RAM | Duration |
| 100 M | 16 GB | 60 min | | 100 M | 4 GB | 54 min |
| 500 M | 16 GB | 190 min | | 500 M | 4 GB | 275 min |
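To compare the runs, the durations can be converted to throughput. The short script below only does arithmetic on the figures in the table. It assumes "M" means million records, and the stage labels are just strings used for printing.

```python
# Figures copied from the sample-run table; "M" is assumed to mean million.
RUNS = [
    ("compromised", 100e6, 60),
    ("compromised", 500e6, 190),
    ("uncompromised", 100e6, 54),
    ("uncompromised", 500e6, 275),
]

def records_per_second(num_records, minutes):
    """Average generation rate over the whole run."""
    return num_records / (minutes * 60)

for stage, n, minutes in RUNS:
    rate = records_per_second(n, minutes)
    print(f"{stage:>13} {n:>13,.0f} records: {rate:,.0f} records/s")
```

The rates work out to tens of thousands of records per second in every run. The uncompromised stage scales roughly linearly from 100 M to 500 M records, while the compromised stage takes about 3.2 times as long for 5 times the records.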
Project Information
- License: GNU GPL v2
- 20 stars
- svn-based source control
Labels:
Python
SyntheticData
Benchmarking
StatisticalDistribution
malgen
OpenCloudConsortium
OCC
SimulatedData