Context
- What: Hadoop Administration
- Need: Hortonworks Sandbox or similar
The problem with learning Hadoop and its administration, even with huge online community support, is twofold: nobody spoon-feeds you the minute details, and you need infrastructure to practice on. Hortonworks has a cool product called the Sandbox for poor graduate students like myself who won't/can't invest money to learn something new; it solves the infrastructure problem to some extent. Here I will do my bit to provide the spoon-feeding that I did not get.
My Experience
- Step 1: Get a lot of RAM (it is never sufficient), especially if you want to try a Hadoop cluster.
4GB of RAM per datanode is the minimum for even the smallest of clusters.
I have a 16GB desktop, and it is miserable at hosting a 3-node VM cluster.
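A rough back-of-the-envelope for my setup (the overhead figure below is my own guess, not a measured number):

    3 node VMs x 4 GB           = 12 GB
    host OS, browser, etc.      ~  3 GB
    ------------------------------------
    total                       ~ 15 GB  -> essentially no headroom on a 16GB desktop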
- Step 2: Download and install the Sandbox together with VM hosting software, depending on what you do not already have.
Though this is straightforward and runs pretty smoothly, I often see some errors after installing the Sandbox:
1. ZooKeeper always throws a connection refused exception, from sandbox.hortonworks.com/10.0.2.15 to sandbox.hortonworks.com:8020
2. sometimes the datanode complains of not being able to talk to localhost
I am sure this has to do with the configuration, but then why isn't the working configuration the default? A few checks that helped me narrow such errors down are sketched below.
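When I hit these, a couple of quick checks from inside the sandbox (as root) usually told me whether it was a hostname or port problem; the hostname and port here are the ones from the error messages above, and telnet is only a convenient stand-in if it happens to be installed:

    # does the sandbox hostname resolve to something sensible?
    cat /etc/hosts
    hostname -f

    # is anything actually listening on the NameNode RPC port 8020?
    netstat -tlnp | grep 8020

    # try the connection the same way the failing service would
    telnet sandbox.hortonworks.com 8020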
- Step 3: Know how to use the sandbox VM over ssh
This might be for dummies, but ssh-ing to the sandbox via 127.0.0.1 was not obvious to me, especially the need to use port 2222 (a hurdle only for the very first time, obviously).
A lot can be done from the terminal! But one needs to know where to go and what commands to use, for example:
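For reference, the login that worked for me; the VM hosting software forwards the guest's ssh port 22 to 2222 on the host, and the default root password is the root/hadoop pair listed in the table under Step 4:

    # from the host machine
    ssh root@127.0.0.1 -p 2222

    # once inside, the usual HDFS commands are available, e.g.
    su - hdfs
    hdfs dfs -ls /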
- Step 4: Know how to use the sandbox VM from a web browser
A list of ports to remember:
IP        | Port | Default ID/pwd | Use    |
127.0.0.1 | 8000 | hue/1111       | Hue    |
127.0.0.1 | 8080 | admin/admin    | Ambari |
127.0.0.1 | 8888 | root/hadoop    | HDP    |
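Before opening the browser, a quick way to confirm that the port forwarding is actually working is to ask each port for any HTTP response at all (curl is assumed to be available on the host):

    curl -I http://127.0.0.1:8000    # Hue
    curl -I http://127.0.0.1:8080    # Ambari
    curl -I http://127.0.0.1:8888    # HDP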
- HACK 1 (faulty results): Theoretically, a sys-admin uses Flume to monitor a particular eventlog and identifies attacks through log analysis. I modified the eventlog (a small bash script to remove lines from IPs pertaining to some countries, sketched below) so that Flume overwrites the doctored records in HCatalog. The analysis table (built by a Hive script) then differs from what the original data would produce, and any BI reports (MS Excel) generated from this data are inaccurate as well.
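The script itself was nothing fancy; a minimal sketch of the idea, where the log path and the IP prefixes are placeholders and the real values depend on what your Flume source watches and which countries you want to hide:

    #!/bin/bash
    # drop lines whose leading source IP falls in a couple of (made-up) ranges
    LOG=/var/log/eventlog.txt
    grep -vE '^(203\.0\.113\.|198\.51\.100\.)' "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"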
- HACK 2 (data loss): I overwrote the in-progress editlog with another, older version, and as expected that led to huge problems with further use of my HDFS. So I had to reset my namenode with hadoop namenode -format (as the hdfs user) and then restart the cluster with service startup_script restart (as root):
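The recovery sequence, roughly; note that the format step wipes the existing HDFS metadata, which is only acceptable on a throwaway sandbox:

    # as root on the sandbox
    su - hdfs -c "hadoop namenode -format"
    service startup_script restart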
Learning
- It takes a lot of trial and error.
- Hadoop targets large distributed systems, which implies complex infrastructure. You cannot expect to sit in front of a basic desktop and learn it all.
- The job market for Hadoop and big data skills is phenomenal.