Hadoop 1.0 + MongoDB: the Beginning

I’ve been playing with Hadoop and MongoDB for a couple of months and noticed that there’s lack of information describing how to actually make them work together. In the next few posts I’ll try to cover this starting from the basics.

In this post I’ll explain how to setup hadoop with mongo-hadoop library on your localhost. I did it on OS X but the same approach should work for Linux as well.

Installing Hadoop

1. Download hadoop 1.0

wget http://apache.mirrors.pair.com//hadoop/common/hadoop-1.0.0/hadoop-1.0.0.tar.gz

2. Unpack it somewhere in your home directory:

tar xzvf hadoop-1.0.0.tar.gz

3. Set JAVA_HOME

cd hadoop-1.0.0

vim conf/hadoop-env.sh

And for OS X add the following line:

export JAVA_HOME=$(/usr/libexec/java_home)

If you are using Linux change it to the actual path to your Java binary.

4. Edit config files

conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
     <name>dfs.replication</name>
     <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>

5. Setup the permissions

In order to make hadoop work, user which will run the jobs should be able to ssh to localhost. For OS X go to the Settings –> Sharing and check “Remote login” box. You can specify the name of the user. Also if you don’t want to enter the password each time you start hadoop or run the jobs you have to add your ssh key to ~/.ssh/authorized_keys.

If you don’t have a ssh key you can generate it with the following command:

ssh-keygen -t dsa -P -f ~/.ssh/id_dsa_for_hadoop

And then:

cat ~/.ssh/id_dsa_for_hadoop.pub >> ~/.ssh/authorized_keys

6. Format namenode

In your hadoop home directory run:

bin/hadoop namenode -format

7. Start hadoop

bin/start-all.sh

If everything went well this should start bunch of stuff like namenode, jobtracker, secondarynamenode, datanode and tasktracker.

Verify everything works by opening http://localhost:50030 in your browser.

Installing mongo-hadoop library

Now this is a fun and tricky part. This library is fairly young and sometimes requires some additional hacking.

1. Download mongo-hadoop

Go to your hadoop home folder and run:

git clone https://github.com/mongodb/mongo-hadoop

2. Compile it

In mongo-hadoop folder use maven to compile the package:

mvn -U package

3. Copy compiled files to hadoop-1.0.0/lib

cp core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar ../lib

cp examples/treasury_yield/target/mongo-hadoop-treasury_yield-example-1.0-SNAPSHOT.jar ../lib

4. Download mongo java driver

Download https://github.com/downloads/mongodb/mongo-java-driver/mongo-2.7.3.jar and put it in hadoop-1.0.0/lib folder.

For now use only 2.7.3 version of java driver since mongo-hadoop was compiled against it.

5. Restart hadoop

bin/stop-all.sh

bin/start-all.sh

Running examples

I assume you already have mongo server up and running on your localhost.

Let’s verify that our system works by using MongoTreasuryYield example from mongo-hadoop package.

1. Import initial data

In mongo-hadoop folder run:

mongoimport --db demo --collection yield_historical.in --type json --file examples/treasury_yield/src/main/resources/yield_historical_in.json

Make sure path to json file is correct. Verify data was imported from mongo console:

use demo

db.yield_historical.in.count()

You should get 5193.

2. Run the example!

From hadoop folder run this long command:

bin/hadoop jar mongo-hadoop/core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar com.mongodb.hadoop.examples.treasury.TreasuryYieldXMLConfig -conf mongo-hadoop/examples/treasury_yield/src/main/resources/mongo-treasury_yield.xml

Now open http://localhost:50030 in your browser and you should see a running job there. After it’s done verify you got the results in mongo console:

use demo

db.yield_historical.in.find()

Done!

Now your local hadoop system is ready for the further hacking. In the next post I’ll cover how to launch your own hadoop cluster on Amazon EC2.

Links

Launching Hadoop on EC2
Hbase/Hadoop on OS X
mongo-hadoop library

Installing Hadoop

Installing mongo-hadoop library

Running examples

Links

Comments