Context
- What: Hadoop Administration
- Need: Hortonworks Sandbox or similar
The problem with learning Hadoop and its administration, even with huge online community support, is twofold: nobody spoon-feeds you the minute details, and you need infrastructure to practice on. Hortonworks has a cool product called the Sandbox for poor graduate students like myself who won't/can't invest money to learn something new; it solves the infrastructure problem to some extent. Here I will do my bit to provide the spoon-feeding that I did not get.
My Experience
- Step 1: Get a lot of RAM (it is never sufficient), especially if you want to try a Hadoop cluster.
4GB of RAM per datanode is the minimum for even the smallest of clusters.
I have a 16GB desktop, and it is miserable at hosting a 3-node VM cluster.
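A rough back-of-the-envelope for my setup (the overhead figure below is my own guess, not a measured number):

    3 node VMs x 4 GB           = 12 GB
    host OS, browser, etc.      ~  3 GB
    ------------------------------------
    total                       ~ 15 GB  -> essentially no headroom on a 16GB desktop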
- Step 2: Download and install the Sandbox together with VM hosting software, depending on what you do not already have.
Though this is straightforward and runs pretty smoothly, I often see some errors after installing the Sandbox:
1. ZooKeeper always throws a connection refused exception, from sandbox.hortonworks.com/10.0.2.15 to sandbox.hortonworks.com:8020
2. sometimes the datanode complains of not being able to talk to localhost
I am sure this has to do with the configuration, but then why isn't the working configuration the default? A few checks that helped me narrow such errors down are sketched below.
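When I hit these, a couple of quick checks from inside the sandbox (as root) usually told me whether it was a hostname or port problem; the hostname and port here are the ones from the error messages above, and telnet is only a convenient stand-in if it happens to be installed:

    # does the sandbox hostname resolve to something sensible?
    cat /etc/hosts
    hostname -f

    # is anything actually listening on the NameNode RPC port 8020?
    netstat -tlnp | grep 8020

    # try the connection the same way the failing service would
    telnet sandbox.hortonworks.com 8020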
- Step 3: Know how to use the sandbox VM over ssh
This might be for dummies, but ssh-ing to the sandbox via 127.0.0.1 was not obvious to me, especially the need to use port 2222 (a hurdle only for the very first time, obviously).
A lot can be done from the terminal! But one needs to know where to go and what commands to use, for example:
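For reference, the login that worked for me; the VM hosting software forwards the guest's ssh port 22 to 2222 on the host, and the default root password is the root/hadoop pair listed in the table under Step 4:

    # from the host machine
    ssh root@127.0.0.1 -p 2222

    # once inside, the usual HDFS commands are available, e.g.
    su - hdfs
    hdfs dfs -ls /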
- Step 4: Know how to use the sandbox VM from a web browser
A list of ports to remember:
IP        | Port | Default ID/pwd | Use    |
127.0.0.1 | 8000 | hue/1111       | Hue    |
127.0.0.1 | 8080 | admin/admin    | Ambari |
127.0.0.1 | 8888 | root/hadoop    | HDP    |
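Before opening the browser, a quick way to confirm that the port forwarding is actually working is to ask each port for any HTTP response at all (curl is assumed to be available on the host):

    curl -I http://127.0.0.1:8000    # Hue
    curl -I http://127.0.0.1:8080    # Ambari
    curl -I http://127.0.0.1:8888    # HDP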
- HACK 1 (faulty results): Theoretically, a sys-admin uses Flume to monitor a particular eventlog and identifies attacks through log analysis. I modified the eventlog (a small bash script to remove lines from IPs pertaining to some countries, sketched below) so that Flume overwrites the doctored records in HCatalog. The analysis table (built by a Hive script) then differs from what the original data would produce, and any BI reports (MS Excel) generated from this data are inaccurate as well.
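The script itself was nothing fancy; a minimal sketch of the idea, where the log path and the IP prefixes are placeholders and the real values depend on what your Flume source watches and which countries you want to hide:

    #!/bin/bash
    # drop lines whose leading source IP falls in a couple of (made-up) ranges
    LOG=/var/log/eventlog.txt
    grep -vE '^(203\.0\.113\.|198\.51\.100\.)' "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"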
- HACK 2 (data loss): I overwrote the in-progress editlog with another, older version, and as expected that led to huge problems with further use of my HDFS. So I had to reset my namenode with hadoop namenode -format (as the hdfs user) and then restart the cluster with service startup_script restart (as root):
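The recovery sequence, roughly; note that the format step wipes the existing HDFS metadata, which is only acceptable on a throwaway sandbox:

    # as root on the sandbox
    su - hdfs -c "hadoop namenode -format"
    service startup_script restart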
Learning
- It takes a lot of trial and error.
- Hadoop targets large distributed systems, which implies complex infrastructure. You cannot expect to sit in front of a basic desktop and learn it all.
- The job market for Hadoop and big data skills is phenomenal.