Something Special Making it EVEN better.

20Feb/100

Hadoop Installation Quick Start

We've just finished setting up a small 15-node Hadoop cluster at work and more and more things keep coming to my mind that could be solved or dealt with easily in Hadoop. However, relying on a production cluster to mess around with just isn't worth all the hassle (amongst other things). So with all of the various ways to actually get up and running with Hadoop, what's the easiest or best choice? Here are some options that are available and some comments on the ones that I've played around with the last couple of months.

Apache distro

  • Hadoop 0.20.1
  • ~38MB
  • tar.gz package
  • latest release
  • no extras included

Yahoo! distro

  • Hadoop 0.20.0
  • includes Yahoo! patches
  • often a few releases behind the official distro
  • source release only from GitHub
  • straightforward (albeit a bit of a pain)
  • looks the same from the outside as the official Hadoop distro once compiled

Yahoo! training VM

  • Hadoop 0.18.0 (note that this is older than the plain Yahoo! distro)
  • ~370MB
  • an all-in-one VMWare VM: Ubuntu 8.04, Java 6 U7
  • small download
  • works on first boot
  • accompanying tutorial is pretty good

Cloudera distro

  • Hadoop 0.20.0
  • CDH2
  • install from yum or apt
  • requires a Debian or RedHat distro
  • a bit fussy, I sadly couldn't get the Debian distro installed based on their brief instructions

Cloudera training VM

  • Hadoop 0.20.0
  • CDH2
  • ~1.3GB
  • an all-in-one VMWare VM
  • VM version 0.3.3 failed to decompress after two separate downloads (from HTTP and BitTorrent)

Cloudera Amazon AWS

  • Hadoop 0.20.0
  • CDH2
  • instant cluster on Amazon EC2 based on a Cloudera Amazon EC2 machine image (AMI)
  • costs a few US$, but good if you have anything larger than small jobs to play around with
  • not tested - didn't waste my time based on the experience I had with the other two Cloudera distros

Short Term

To start experimenting as quickly as possible, the Yahoo! training VM ended up being the fastest. Small download, worked on the first boot up. The documentation on the Yahoo! Hadoop tutorial is also very good.

Long Term

I went with the straight Apache distro on a locally built Ubuntu Server 9.10 VM. Why? I like to start from scratch always and learn from all of the various nitty-gritty steps with configuration, installing Pig, Cascading, etc. It's like starting to code with a text editor vs an IDE. Somehow you get the feeling of the way things work a lot better. Besides, I'm a control freak ;)

Notes

I didn't use VMWare Player or Fusion for any of the VMs. I prefer using Sun VirtualBox as it's free, very stable, runs on any OS including OS X and capable is of reading VMWare disk images (vmdk) as well.

  • Print
  • Google Bookmarks
  • Facebook
  • Tumblr
  • Twitter
  • LinkedIn
  • StumbleUpon
  • Digg
  • del.icio.us
  • Reddit
  • Slashdot
Comments (0) Trackbacks (0)

No comments yet.


Leave a comment


No trackbacks yet.