Hadoop Installation Quick Start
We've just finished setting up a small 15-node Hadoop cluster at work, and more and more things keep coming to mind that could be solved or dealt with easily in Hadoop. However, messing around on a production cluster just isn't worth the hassle (amongst other things). So with all of the various ways to get up and running with Hadoop, what's the easiest or best choice? Here are the options that are available, with some comments on the ones I've played around with over the last couple of months.
- Apache distro
  - Hadoop 0.20.1
  - ~38MB
  - tar.gz package
  - latest release
  - no extras included
- Yahoo! distro
  - Hadoop 0.20.0
  - includes Yahoo! patches
  - often a few releases behind the official distro
  - source release only, from GitHub
  - building it is straightforward (albeit a bit of a pain)
  - looks the same from the outside as the official Hadoop distro once compiled
- Yahoo! training VM
  - Hadoop 0.18.0 (note that this is older than the plain Yahoo! distro)
  - ~370MB
  - an all-in-one VMware VM: Ubuntu 8.04, Java 6 U7
  - small download
  - works on first boot
  - accompanying tutorial is pretty good
- Cloudera distro (packages)
  - Hadoop 0.20.0
  - CDH2
  - install from yum or apt
  - requires a Debian or RedHat distro
  - a bit fussy; sadly, I couldn't get the Debian distro installed based on their brief instructions
- Cloudera VM
  - Hadoop 0.20.0
  - CDH2
  - ~1.3GB
  - an all-in-one VMware VM
  - VM version 0.3.3 failed to decompress after two separate downloads (from HTTP and BitTorrent)
- Cloudera on Amazon EC2
  - Hadoop 0.20.0
  - CDH2
  - instant cluster on Amazon EC2 based on a Cloudera Amazon EC2 machine image (AMI)
  - costs a few US$, but good if you have anything larger than small jobs to play around with
  - not tested; I didn't want to spend the time after the experience I had with the other two Cloudera distros
Short Term
To start experimenting as quickly as possible, the Yahoo! training VM ended up being the fastest: it's a small download and it worked on the first boot. The documentation in the Yahoo! Hadoop tutorial is also very good.
Long Term
I went with the straight Apache distro on a locally built Ubuntu Server 9.10 VM. Why? I always like to start from scratch and learn from all of the various nitty-gritty steps: configuration, installing Pig, Cascading, etc. It's like starting to code with a text editor vs an IDE. Somehow you get a much better feel for the way things work. Besides, I'm a control freak.
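To give a flavour of that nitty-gritty configuration, here is a minimal Python sketch that renders the three site files a single-node (pseudo-distributed) Hadoop 0.20 setup typically needs. The property names are the ones used by the 0.20 line; the localhost ports and the helper names `render_site_xml` and `write_site_files` are assumptions made for illustration, not the exact setup described here.

```python
from pathlib import Path
from xml.sax.saxutils import escape

# Assumed values for a single-node (pseudo-distributed) setup; adjust to taste.
SITE_FILES = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def render_site_xml(properties):
    """Render a dict of properties as a Hadoop *-site.xml document."""
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines += [
            "  <property>",
            f"    <name>{escape(name)}</name>",
            f"    <value>{escape(value)}</value>",
            "  </property>",
        ]
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


def write_site_files(conf_dir):
    """Write every file in SITE_FILES into conf_dir and return the paths."""
    conf_dir = Path(conf_dir)
    conf_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, properties in SITE_FILES.items():
        path = conf_dir / filename
        path.write_text(render_site_xml(properties))
        written.append(path)
    return written


if __name__ == "__main__":
    # Print the files instead of writing them, so nothing is overwritten.
    for filename, properties in SITE_FILES.items():
        print(f"--- {filename} ---")
        print(render_site_xml(properties))
```

Pointing `write_site_files` at the `conf` directory of an unpacked Hadoop tarball would overwrite the site files there, so it is worth trying it on a scratch directory first.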
Notes
I didn't use VMware Player or Fusion for any of the VMs. I prefer using Sun VirtualBox as it's free, very stable, runs on any OS including OS X, and is capable of reading VMware disk images (vmdk) as well.
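Attaching one of the downloaded vmdk images to a fresh VirtualBox VM can also be scripted. The sketch below only builds and prints the `VBoxManage` commands rather than running them. The VM name, controller name, OS type and disk path are hypothetical, and the subcommands and flags are those of recent `VBoxManage` releases, so older versions may differ.

```python
import shlex


def vbox_commands(vm_name, vmdk_path, ostype="Ubuntu", controller="SATA"):
    """Return VBoxManage commands that create a VM and attach an existing vmdk."""
    return [
        # Create and register an empty VM.
        ["VBoxManage", "createvm", "--name", vm_name,
         "--ostype", ostype, "--register"],
        # Add a SATA storage controller to it.
        ["VBoxManage", "storagectl", vm_name,
         "--name", controller, "--add", "sata"],
        # Attach the existing VMware disk image as the first hard disk.
        ["VBoxManage", "storageattach", vm_name, "--storagectl", controller,
         "--port", "0", "--device", "0", "--type", "hdd", "--medium", vmdk_path],
    ]


if __name__ == "__main__":
    # Hypothetical names; substitute the real VM name and vmdk path.
    for cmd in vbox_commands("hadoop-vm", "hadoop-training.vmdk"):
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to check them against the installed VirtualBox version before pasting them into a shell.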