How to Set Up a Hadoop Cluster with Mongo Support on EC2

In the previous post I described how to set up Hadoop on your local machine and make it work with MongoDB. That’s good enough for development and testing, but if you want to crunch any serious numbers you have to run Hadoop as a cluster. So let’s figure out how to run a Hadoop cluster with MongoDB support on Amazon EC2.

This is a step-by-step guide that shows you how to:

  • Create your own AMI with custom settings (Hadoop and mongo-hadoop installed)
  • Launch a Hadoop cluster on EC2
  • Add more nodes to the cluster
  • Run the jobs

So let’s hack.

Setting up an AWS account

1. Create an Amazon AWS account if you don’t already have one.

2. Download the Amazon EC2 command-line tools to your local machine. Unpack them, for example, into the ~/ec2-api-tools folder.

3. In your home folder edit .bash_profile (or .bashrc if you’re not using OS X) and add the following lines:

export EC2_HOME=~/ec2-api-tools

export PATH=$PATH:$EC2_HOME/bin
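To check that the tools are picked up from your new PATH, open a fresh terminal (or run source ~/.bash_profile) and ask them for their version. This is just a sanity check and needs no AWS credentials, though the tools do expect JAVA_HOME to point at a Java installation:

ec2-version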

4. Create an X.509 certificate and a private key in your AWS Management Console. Check out this doc to see how to do that. It’s a pretty straightforward process; in the end you should download two .pem files. Put them in the ~/.ec2 folder. Those are your keys to access Amazon AWS services.

After you’re done with this, edit your .bash_profile and add the following lines:

export EC2_PRIVATE_KEY=~/.ec2/name_of_your_private_key_file.pem

export EC2_CERT=~/.ec2/name_of_your_certificate_file.pem
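With the credentials in place, you can verify that the tools can actually talk to AWS. A minimal read-only check:

ec2-describe-regions

If the certificate and private key are picked up correctly, this prints the list of available EC2 regions instead of an authorization error.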

5. Create a private SSH key pair

To be able to log in to your instances you need to create a separate keypair. From the command line run:

ec2-add-keypair gsg-keypair

This will output a private key. Save it to a file named id_rsa-gsg-keypair and put it in the ~/.ec2 folder with the rest of your key files. Then change the permissions:

chmod 600 ~/.ec2/id_rsa-gsg-keypair

And add the key to the authentication agent:

ssh-add ~/.ec2/id_rsa-gsg-keypair
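To double-check that the keypair is registered on the Amazon side, list your keypairs (read-only):

ec2-describe-keypairs

You should see gsg-keypair in the output together with its fingerprint.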

6. Create a rule to allow logging in to your Amazon instances

Go to the AWS Management Console, click the EC2 tab, then choose Network and Security from the menu in the left sidebar. Add a new rule, choose “SSH rule” from the list, and apply the changes.
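If you prefer to stay in the terminal, the same SSH rule can be added with the command-line tools. A sketch, assuming you want to open port 22 in the default security group to the whole internet (you can narrow -s to your own IP range):

ec2-authorize default -p 22 -s 0.0.0.0/0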

Hadoop settings

Luckily, Hadoop has built-in tools to help you deploy a cluster on EC2.

1. Set your Amazon security credentials

Go to your hadoop folder and edit the file src/contrib/ec2/bin/hadoop-ec2-env.sh.

Fill out the following lines:

Your Amazon Account Number

AWS_ACCOUNT_ID=""

Your Amazon AWS access key

AWS_ACCESS_KEY_ID=""

Your Amazon AWS secret access key

AWS_SECRET_ACCESS_KEY=""

You can find your Amazon account number by logging into your AWS account; it’s in the top-right corner below your name. To get the access and secret keys, go to My Account -> Security Credentials and click the “Access keys” tab.
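With made-up values (yours will obviously differ; these are the well-known fake keys from the AWS docs), the filled-out block looks like this:

AWS_ACCOUNT_ID="1234-5678-9012"
AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"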

2. Set the private key

In the same file, hadoop-ec2-env.sh, find this line:

KEY_NAME=gsg-keypair

The scripts look for a private key file named id_rsa-<KEY_NAME> in your key folder, so if you saved your key as id_rsa-gsg-keypair, leave this field unchanged.

3. Set the Hadoop version

In the line HADOOP_VERSION= set your Hadoop version. If you followed the previous post to set up Hadoop, it’s gonna be:

HADOOP_VERSION=1.0.0

4. Set the S3 bucket

This bucket will be used to store the Amazon image you are going to create later. First you have to create the bucket: go to the AWS Management Console, click the S3 tab, and click the “Create a bucket” button. Choose a name, for example ‘my-hadoop-images’, and save the bucket. Then edit hadoop-ec2-env.sh and update the following line:

S3_BUCKET=my-hadoop-images

5. Choose the Amazon instance type

AWS provides a number of instance types, depending on your requirements. You can read more about them later. Let’s stick with m1.small for now. Uncomment the following line:

INSTANCE_TYPE="m1.small"

6. Edit Java settings

We keep editing the same hadoop-ec2-env.sh file. Set the Java version to:

JAVA_VERSION=1.6.0_30

Then find the block after # SUPPORTED_ARCHITECTURES = [i386, x86_64] and edit the JAVA_BINARY_URL values.

For i386:

JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u30-b12/jdk-6u30-linux-i586.bin

For x86_64 (large and xlarge instance types):

JAVA_BINARY_URL=http://download.oracle.com/otn-pub/java/jdk/6u30-b12/jdk-6u30-linux-x64.bin

If by the time you’re reading this post those URLs no longer work, or if you want another Java version, you can find download URLs here.

Create an Amazon Machine Image (AMI)

Additionally, our image should have the mongo-hadoop driver, the mongo Java driver, and our MapReduce jobs installed. First you need to upload your mongo-hadoop core jar and MapReduce jobs somewhere accessible from the internet. The easiest way is to upload those files to the same S3 bucket you created earlier for Hadoop images.

Where do you get these files? In the previous post I showed how to compile mongo-hadoop. Your mongo-hadoop core jar lives in $mongo_hadoop_dir/core/target/ and is called something like mongo-hadoop-core.jar. Upload it to your S3 bucket, and its URL will look something like http://my-hadoop-images.s3.amazonaws.com/mongo-hadoop-core.jar.

Your MapReduce jobs are also packaged as a jar file. You can create and compile your own MapReduce jobs, but for now let’s use the treasury_yield example we compiled in the previous post. This file should be in $your_mongo_hadoop_dir/examples/treasury_yield/target/ and is called something like mongo-hadoop-treasury_yield.jar. Upload this file to your S3 bucket too.
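Before baking the image, it’s worth checking that the uploaded jars are actually reachable over plain HTTP, because the image-creation script will fetch them with wget. A quick check (substitute your own bucket and file names):

curl -I http://my-hadoop-images.s3.amazonaws.com/mongo-hadoop-core.jar

You should get a 200 OK back; a 403 means the file isn’t public, which you can fix in the S3 console.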

When that’s done, edit the create-hadoop-image-remote file in your $hadoop_dir/src/contrib/ec2/bin/image/ folder. After the # Configure Hadoop block, insert the following:

# Install mongo-hadoop
cd /usr/local/hadoop-$HADOOP_VERSION/lib

# Download the mongo Java driver
wget -nv https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar

# Download the mongo-hadoop core jar
wget -nv http://your_url_here/mongo-hadoop-core.jar

# Download the MapReduce jobs jar
wget -nv http://your_url_here/mongo-hadoop-treasury_yield.jar
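A half-baked image with missing jars is painful to debug, so you could also make the script fail loudly if any download breaks. A small sketch to append right after the wget lines (the filenames are the ones used above; adjust them if yours differ):

# Abort the image build if any of the jars failed to download
for jar in mongo-2.7.3.jar mongo-hadoop-core.jar mongo-hadoop-treasury_yield.jar; do
    [ -s "$jar" ] || { echo "Download failed: $jar" >&2; exit 1; }
done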

That will download the files we need while the image is being configured.

Create an image

Finally, to create the image, run the following command from your hadoop folder:

bin/hadoop-ec2 create-image

If everything runs well, this should bundle a new image and upload it to your S3 bucket. Now you are ready to run the cluster.
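You can confirm the image was registered by listing the AMIs you own (read-only):

ec2-describe-images -o self

The new image should show up with its manifest location pointing at your S3 bucket.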

Launch the cluster

Phew… once the long setup process is behind you, launching a new Hadoop cluster is very easy. The format of the command is this:

bin/hadoop-ec2 launch-cluster <name_of_the_cluster> <number_of_slaves>

So to run a Hadoop cluster named “mongohadoop” consisting of one master and one slave node, you would run the following command:

bin/hadoop-ec2 launch-cluster mongohadoop 1
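Behind the scenes this boots one master and one slave instance. You can watch them come up with:

ec2-describe-instances

Both instances should eventually reach the running state.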

In the output you’ll see the hostname of your master instance. Give it a few minutes to start, and then you can open

http://<your_master_hostname>:50030

in your browser, and you should see the Hadoop administration page. The number of nodes shown should be 1 (it counts only slave nodes).

Later, if you want to add, say, two more slaves, you would do it like this:

bin/hadoop-ec2 launch-slaves mongohadoop 2

To log in to your master instance:

bin/hadoop-ec2 login mongohadoop
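Once you’re logged in on the master, a quick way to check that the slave has actually joined HDFS is the built-in cluster report. From the Hadoop folder on the master (/usr/local/hadoop-1.0.0 in our image):

bin/hadoop dfsadmin -report

It should show one live datanode.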

Run the jobs

To run a job, you have to log in to the master instance and then use the command you learned in the previous post:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf path_to_jobs_config/mongo-treasury_yield.xml

For your own MapReduce jobs you would, of course, change com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig and mongo-treasury_yield.xml accordingly.

If you want to run the treasury_yield example on your Hadoop cluster, you have to do a few more things:

  1. Have a MongoDB server running and accessible from the internet.
  2. Import yield_historical_in.json there, as you did for your local MongoDB in the previous post.
  3. Upload mongo-treasury_yield.xml to the Hadoop master instance and put it, for example, in the /usr/local/hadoop-1.0.0/ folder.
  4. Edit mongo-treasury_yield.xml and change mongo.input.uri and mongo.output.uri: put the hostname of your database instead of 127.0.0.1 and add a login and password if needed (see the sketch after the command below).
  5. Then run the job:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf /usr/local/hadoop-1.0.0/mongo-treasury_yield.xml
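For reference, after step 4 the two edited properties in mongo-treasury_yield.xml should look roughly like this. This is a sketch: db.example.com, user, and password are placeholders, and the database/collection names should stay whatever your copy of the file already uses:

<property>
  <name>mongo.input.uri</name>
  <value>mongodb://user:password@db.example.com/mongo_hadoop.yield_historical.in</value>
</property>
<property>
  <name>mongo.output.uri</name>
  <value>mongodb://user:password@db.example.com/mongo_hadoop.yield_historical.out</value>
</property>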

On your Hadoop administration page, verify that the job is running.

Done! Now you have a Hadoop cluster running on EC2 and you know how to make it work with MongoDB. To write your own mappers and reducers, check out how the treasury_yield example is built, change it however you want, recompile, and upload it to your Hadoop cluster (or create an image that includes the file).
