Qizmt on EC2 Tutorial
Overview
In this tutorial we will attempt to use the Qizmt-on-EC2 wizard (GPLv3) to automate the rental of EC2 instances linked to an EC2 account, generate a set of random words and execute a Qizmt Mapreduce job which will count the number of occurrences of each word. This tutorial also covers cluster diagnostics and using Qizmt Mapreduce IDE/Debugger on EC2.
The actual time that it takes to complete this tutorial varies and there are no guarantees that any software described in this tutorial will work properly or function as expected.
Results from Qizmt on EC2 word-count test 12/21/2010
| Execution Time | 8hrs 11min | | Per virtual core | 2.5 EC2 Compute Units |
|:---------------|:-----------|:-------|:-----------------|:----------------------|
| Instance setup | ~1hr | | MR.DFS size | ~7.14TB (TiB) |
| Gen inputs time | ~1hr 14min | | Disk IO per instance | ~160Mb/s (observed) |
| Input doc size | ~1TB (TiB) | | Total disk IO | ~3.1Gb/s (observed) |
| Output doc size | ~2MB | | | |
| | | | | |
| Replication level used | 1 | | MR Algorithm | Grouped |
| Instances rented | 20 | | Reduce | Partial Reduce (reduce applied both before and after exchange phase) |
| Instance type rented | High-CPU XL | | Logic | Map all words as keys with value as 1, in partial reduce produce intermediate word counts, in reduce add up total count for each word. |
| Total virtual cores | 160 | | | |
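The Logic row of the table can be illustrated with a small, single-process sketch. This is plain Python, not Qizmt job code, and every function name in it is hypothetical; it only mirrors the described phases (map each word to 1, partial reduce before the exchange, final reduce after it).

```python
from collections import Counter, defaultdict


def map_words(text):
    """Map phase: emit (word, 1) for every word in one input split."""
    for word in text.split():
        yield word, 1


def partial_reduce(pairs):
    """Partial reduce: sum counts locally, before the exchange phase."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts


def exchange(partials, num_reducers):
    """Exchange phase: route every word to exactly one reducer by key hash."""
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for counts in partials:
        for word, n in counts.items():
            buckets[hash(word) % num_reducers][word].append(n)
    return buckets


def reduce_counts(bucket):
    """Final reduce: add up the intermediate counts for each word."""
    return {word: sum(ns) for word, ns in bucket.items()}


def word_count(splits, num_reducers=2):
    """Run map, partial reduce, exchange and reduce over all splits."""
    partials = [partial_reduce(map_words(s)) for s in splits]
    totals = {}
    for bucket in exchange(partials, num_reducers):
        totals.update(reduce_counts(bucket))
    return totals


if __name__ == "__main__":
    splits = ["apple pear apple", "pear plum apple"]
    print(sorted(word_count(splits).items()))
    # prints [('apple', 3), ('pear', 2), ('plum', 1)]
```

The partial reduce step shrinks each split to one pair per distinct word before anything is exchanged, which is why the ~1TB input in the table can end in a ~2MB output.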
Preparing your EC2 account on Amazon Web Services
- Read and make sure that you understand this entire tutorial prior to performing any of its described actions. Also, because this tutorial describes how to automate the rental of EC2 instances, you should review the code for Qizmt-on-EC2 (GPLv3) wizard via SVN at http://code.google.com/p/qizmt/
- Before resuming this tutorial, make sure that you fully understand all current EC2 rental costs, concepts and terminology and that you have fully reviewed the Qizmt-on-EC2 wizard code that you will be using to automate the rental of EC2 instances. Also be sure that you have read and agree to the GPLv3 license under which all code on http://code.google.com/p/qizmt/ is released.
- Sign up for a free Amazon Web Services Account http://aws.amazon.com/.
You can use your existing Amazon account login information, or create a new account.
- Sign up for EC2 through the Amazon Elastic Compute Cloud (EC2) section under the Products tab
- Go to the Security Credentials under the Account tab to start getting your X.509 certificate.
- Go to the Access Credentials and then X.509 Certificates tab
- Create a new certificate.
- Download the certificate to your computer. The file has pattern
cert-*.pem
- Download the private key to your computer. The file has pattern
pk-*.pem
If you are attempting to use an existing certificate, you must already have this private key file, because it cannot be downloaded again (if you do not have it, create and use a new certificate).
- Enter the AWS Management Console to start creating and downloading an EC2 KeyPair. A link can be found at the top of the page. Note that this is not the same key as in the previous steps.
- Click the Key Pairs link under NETWORKING & SECURITY in the left Navigation menu under the EC2 tab.
- If you do not have a key pair, create one with a name such as mykeypair1.
- Download the private key file to your computer. The file will be named mykeypair1.pem (for a key pair named mykeypair1).
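After these steps, three files should be on your computer: the X.509 certificate, its private key, and the KeyPair private key. A minimal sketch for checking that they are all present is shown here. The file-name patterns come from the steps above; the function names and the folder layout are assumptions made for illustration.

```python
from fnmatch import fnmatch
from pathlib import Path


def classify_credentials(file_names, keypair_name="mykeypair1"):
    """Map each expected credential kind to the matching file name, or None."""
    found = {"certificate": None, "private_key": None, "keypair_key": None}
    for name in file_names:
        if fnmatch(name, "cert-*.pem"):        # X.509 certificate
            found["certificate"] = name
        elif fnmatch(name, "pk-*.pem"):        # X.509 private key
            found["private_key"] = name
        elif name == f"{keypair_name}.pem":    # EC2 KeyPair private key
            found["keypair_key"] = name
    return found


def missing_credentials(folder, keypair_name="mykeypair1"):
    """List the credential kinds that have no matching file in `folder`."""
    names = [p.name for p in Path(folder).iterdir() if p.is_file()]
    found = classify_credentials(names, keypair_name)
    return [kind for kind, name in found.items() if name is None]
```

Calling `missing_credentials` on the download folder returns an empty list when all three files are in place.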
Get EC2 Command-line Tools
- Ensure that Java is installed on your computer.
- Download ec2-api-tools.zip file from http://aws.amazon.com/developertools/351?_encoding=UTF8&jiveRedirect=1 to your computer if you do not have it already.
- Un-zip to your computer.
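If you want to run the command-line tools by hand rather than through the wizard, they conventionally read their settings from the environment variables JAVA_HOME, EC2_HOME, EC2_PRIVATE_KEY and EC2_CERT. The sketch below builds such an environment. It assumes that convention holds for the version you downloaded (check the documentation shipped in the zip), and the helper name is hypothetical.

```python
import os


def ec2_tools_env(java_home, ec2_home, private_key, cert, base_env=None):
    """Return a copy of the environment with the EC2 tool variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["JAVA_HOME"] = java_home          # where Java is installed
    env["EC2_HOME"] = ec2_home            # folder extracted from the tools zip
    env["EC2_PRIVATE_KEY"] = private_key  # the pk-*.pem file
    env["EC2_CERT"] = cert                # the cert-*.pem file
    # Put the tools' bin folder first on PATH so the ec2-* commands are found.
    bin_dir = os.path.join(ec2_home, "bin")
    env["PATH"] = bin_dir + os.pathsep + env.get("PATH", "")
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when invoking one of the tools.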
Qizmt on EC2 Wizard
- Download Qizmt EC2 setup wizard from http://code.google.com/p/qizmt/downloads/list
The code for this wizard is available on http://code.google.com/p/qizmt/ and notes on the AMI setup are in the appendix of this document.
- Run the “Qizmt on EC2” wizard and fill out the Setup tab:
- Java home folder is where Java is installed.
- EC2 command-line tools home folder is the directory extracted from the tools zip file.
- EC2 X.509 certificate file was created during EC2 sign-up and matches pattern
cert-*.pem
- EC2 private key file was created during EC2 sign-up and matches pattern
pk-*.pem
- Click Next to go to the EC2 Machines tab:
- Select the ami-76bb4a1f - MySpace Qizmt x86_64 instance-store - Server2003r2-x86_64-Win-v1.07 AMI ID. AMI ID is the Amazon image ID with the operating system. A custom built one for MySpace Qizmt must be used for this tutorial.
- Select the c1.xlarge - x86_64 - High-CPU/Extra-Large Instance type. For this wizard you can select any of the 64-bit instance types. For information on instance types and rental costs see http://aws.amazon.com/ec2/instance-types/ and http://aws.amazon.com/ec2/pricing/
- Select the KeyPair created in EC2 in prior steps of this tutorial. KeyPair name is the name of the key pair, such as the mykeypair1 example used earlier in this tutorial. The drop down button can be used to fetch your available KeyPair names from your EC2 account.
- Set KeyPair private key file to the private key file downloaded in prior steps of this tutorial. KeyPair private key file is the file downloaded after creating the key pair and matches file name mykeypair1.pem (for a key pair named mykeypair1).
- Leave the Availability Zone option blank. Availability Zone is where the machines are located. This can be left at the default or can be changed to another availability zone. Note that availability zones may have different pricing; see http://aws.amazon.com/ec2/ for information.
- Leave Security Groups set to “default”. Security Groups can list the security groups you wish to use, such as to restrict access. Note that this should be done carefully, because a restrictive group can prevent Qizmt from operating.
- Click Next to go to the Qizmt Cluster tab.
- Enter the number of machines to rent for this Qizmt cluster. This is the number of EC2 instances that will be powered on to create this cluster. Note that each machine counts as an EC2 instance and has additional pricing; see http://aws.amazon.com/ec2/ for information.
- Set an administrator password and retype it.
Reminder: By clicking on “Start Qizmt Cluster” you will be attempting to automate the rental of multiple EC2 instances. The code for this wizard is available on http://code.google.com/p/qizmt/ and it is your responsibility to review and re-build the code for the Qizmt-on-EC2 wizard, as there is never a guarantee that it, any other code on http://code.google.com/p/qizmt/, or the Qizmt EC2 AMIs will function properly.
- Click the Start Qizmt Cluster button, click OK to confirm, and wait for your cluster to be rented. It can take over 40 minutes for EC2 to allocate the Qizmt instances. While waiting, you should log into EC2 and confirm that your instances are launching.
- When the Qizmt cluster is ready, the Qizmt-on-EC2 wizard will automatically launch a remote desktop session. Log in using user name Administrator and the password you entered. Leave the Qizmt-on-EC2 wizard open while you are still using the cluster.
- Open a Command Prompt in the Remote Desktop connected to one of the EC2 machines.
Issuing Qizmt commands from one of the machines in the EC2 Qizmt cluster will operate on the cluster as a whole.
- Type
Qizmt dir
to view the files contained in MR.DFS (Qizmt’s Map Reduce Distributed File System). There are no files yet. Free disk space may be different.
- Type
Qizmt ps
to view the Qizmt jobs and other Qizmt processes currently running on the cluster. The only thing currently running is the Qizmt ps just invoked. Total and free memory may be different. Number of processes and machines may be different.
- Type
Qizmt examples
to generate some example jobs into MR.DFS. Qizmt dir can be used again to list these files.
- Type
Qizmt edit Qizmt-WordCountByPartialReduce.xml
to view the built-in word count example.
- Press Cancel and then type
Qizmt edit MyJob.xml
(or instead of MyJob.xml, type another job file name you wish). This opens a new jobs editor with a template file that can be edited.
- Add breakpoints by clicking on the margin.
- Press F5 to begin debugging.
This allows the job to be debugged directly on the cluster on your current machine to test the logic in the environment in which it will run.
- Close the jobs editor, save if desired.
- In the Command Prompt type
Qizmt exec Qizmt-WordCountByPartialReduce.xml
This will execute the jobs in the file specified. In this case it is a built-in example. The job will run across the machines in the cluster and the output will be listed in MR.DFS
- Type
Qizmt exec Qizmt-LargeWordCount.xml 1TB
to start a terabyte word count test, designed for 20 high-CPU 8-core EC2 instances. This test may take around 8 hours to complete. If you have fewer machines or less powerful machines, you can change the 1TB to a smaller byte size. The input data will be generated first, followed by a map reduce job to perform word counts.
Qizmt perfmon diskio
can be called in another Command Prompt window to display the Disk I/O performance of the machines in the cluster while the job is running. Note that this can take several minutes to complete due to the job using resources.
Qizmt perfmon cputime
can be used to get the CPU usage of the machines in the cluster.
- Type
Qizmt perfmon availablememory
to get the available memory of the machines in the cluster.
- Type
Qizmt head Qizmt-LargeWordCount-WordCounts.txt 30
to view the first 30 lines of the word count output file once the job has completed successfully.
- Type
qizmt get Qizmt-LargeWordCount-WordCounts.txt \\<host>\d$\<dir>\wordcounts.txt
to get a copy of the resulting word counts out of MR.DFS.
- When you are done using the cluster,
- click on Terminate Cluster,
- or if you no longer have the “Qizmt on EC2” wizard open:
- log into the AWS Management Console,
- click the Instances link under INSTANCES in the left Navigation menu for EC2,
- place a check mark next to all the instances you wish to terminate,
- using the Instance Actions drop down choose Terminate.
- Log into EC2 and confirm that all of your instances have been terminated.
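Because every rented instance accrues charges until it is terminated, it can be useful to estimate the rental cost of a run. The sketch below assumes billing per started instance-hour (partial hours rounded up) and takes the hourly rate as a parameter. The rate used in the example is a placeholder, not a real price; see http://aws.amazon.com/ec2/pricing/ for current figures.

```python
import math


def estimate_rental_cost(instances, hours, hourly_rate_usd):
    """Estimate cost as instances x billable hours x hourly rate.

    Assumes each started hour is billed in full, so partial hours round up.
    """
    billable_hours = math.ceil(hours)
    return instances * billable_hours * hourly_rate_usd


if __name__ == "__main__":
    # 20 instances for 8hrs 11min at a placeholder rate of 0.50 per hour.
    print(estimate_rental_cost(20, 8 + 11 / 60, 0.50))  # prints 90.0
```

Here 8hrs 11min rounds up to 9 billable hours, giving 180 instance-hours in total.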
Appendix
AMI setup notes for Qizmt on EC2
This is for preparing to create an AMI (Amazon EC2 OS image) with Qizmt and QizmtEC2Service preinstalled.
EC2 base images: Instance Store
ami-dd20c3b4 x86_64 windows "ec2-public-windows-images/Server2003r2-x86_64-Win-v1.07.manifest.xml"
ami-df20c3b6 i386 windows "ec2-public-windows-images/Server2003r2-i386-Win-v1.07.manifest.xml"
Log into EC2 instance via Remote Desktop using auto-generated password from EC2 (through elasticfox, etc). (In Remote Desktop’s options, enable ‘Local devices and resources’ for Drives to allow copying files into EC2)
Set Network Location: Home (Not on 2003)
Change network and sharing settings in Network and Sharing Center: (Not on 2003)
- Turn On "File sharing"
- Turn Off "Password protected sharing"
Disable Windows Firewall. (Only if already enabled)
Apply Qizmt registry changes.
Set console defaults. Put "Command Prompt" shortcut on desktop.
Folder settings: (explorer: alt, tools, folder options, view)
- don't hide file extensions
- show hidden files
Install Qizmt at C:\Qizmt with the .\Administrator account. If it is already installed, ensure it is pointing to C:\Qizmt:
sc config "DistributedObjects" binPath= "C:\Qizmt\MySpace.DataMining.DistributedObjects.exe"
Note: Qizmt must have been built with #define LOGON_MACHINES in Surrogate.cs
Update system environment variables to include D:\Qizmt before C:\Qizmt. Note: D:\Qizmt won’t necessarily exist yet.
Copy 2 exe's to C:\QizmtEC2Service:
QizmtEC2Service.exe
QizmtEC2ServiceInit.exe
Run:
sc create "QizmtEC2Service" binPath= "C:\QizmtEC2Service\QizmtEC2Service.exe" start= auto
then start it:
net start "QizmtEC2Service"
Delete C:\QizmtEC2Service\QizmtEC2Service-status.txt (after looking at it).
Check for C:\QizmtEC2Service\QizmtEC2Service-errors.txt (ignore and delete if error 404).
Restore hosts file at %SystemRoot%\system32\drivers\etc\hosts with C:\QizmtEC2Service\hosts.old (restore default, if found) (can potentially skip restoring hosts file; shouldn't hurt anything) (note: if running service or init twice, hosts.old won't be the default)
C:\Program Files\Amazon\Ec2ConfigService\Settings\config.xml
- or C:\Program Files (x86)\Amazon\Ec2ConfigSetup\config.xml
- Plugin Ec2SetPassword: <State>Enabled</State>
C:\Program Files\Amazon\Ec2ConfigService\Logs\Ec2ConfigLog.txt
- or C:\Program Files (x86)\Amazon\Ec2ConfigSetup\Ec2ConfigLog.txt
- replace contents with "Preparing for Qizmt" followed by a newline
Delete \Qizmt\logon.dat (if exists):
del c:\logon.dat
del d:\logon.dat
Stop the DistributedObjects service and set it to Manual start.
Delete in C:\Qizmt:
del harddrive_history.txt *.xlib *.ylib zfoil* *.tmp service-stoplog.txt
del slave.dat
del dfs.xml execlog.txt errors.txt jid.dat
Delete any temp files (e.g. qizmt msi). Empty recycle bin.
S3:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?creating-an-ami-s3.html
ec2-bundle-instance <instance_id> -b <bucket_name> -p <bundle_name> -o <access_key_id> -w <secret_access_key>
such as: ec2-bundle-instance i-2afaa947 -b qizmt2 -p "MySpaceQizmt2_x86_64" -o AKIADQKE4SARGYLE -w eW91dHViZS5jb20vd2F0Y2g/dj1SU3NKMTlzeTNKSQ==
(-o and -w are in account credentials; -p is the bundle name you choose)
ec2-describe-bundle-tasks
ec2-register <your-s3-bucket>/image.manifest.xml -n image_name
such as: ec2-register "qizmt2/MySpaceQizmt2_x86_64.manifest.xml" -n "MySpaceQizmt2_x86_64"
succeeded with: IMAGE ami-baea1cd3
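The two S3 commands above share the bucket and bundle names, so their argument lists can be built from one set of values. This is only a sketch: the function names are hypothetical, it uses just the flags shown in the commands above, and it assumes the manifest is stored as `<bucket>/<bundle_name>.manifest.xml`, as in the example.

```python
def bundle_instance_args(instance_id, bucket, bundle_name,
                         access_key_id, secret_access_key):
    """Build the ec2-bundle-instance argument list shown in the notes above."""
    return [
        "ec2-bundle-instance", instance_id,
        "-b", bucket,             # S3 bucket that receives the bundle
        "-p", bundle_name,        # bundle name (prefix of the manifest file)
        "-o", access_key_id,      # from account credentials
        "-w", secret_access_key,  # from account credentials
    ]


def register_args(bucket, bundle_name):
    """Build the ec2-register argument list for a finished bundle."""
    manifest = f"{bucket}/{bundle_name}.manifest.xml"
    return ["ec2-register", manifest, "-n", bundle_name]


if __name__ == "__main__":
    print(register_args("qizmt2", "MySpaceQizmt2_x86_64"))
```

Keeping the arguments in a list, rather than one command string, avoids quoting problems if the list is later passed to `subprocess.run`, and lets the two credentials be read from wherever you store them instead of being typed on the command line.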
EBS:
http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/index.html?creating-an-ami-ebs.html
ec2-create-image -n <your_image_name> <instance_id>
such as: ec2-create-image -n "MySpaceQizmt1" i-90f3effb
succeeded with: IMAGE ami-b8c42cd1
View at: https://console.aws.amazon.com/ec2/home#c=EC2&s=Images