Saturday, January 17, 2015

Building a Yarn cluster using Docker.io containers

In my previous post, we went trough the basic steps on building a basic standalone Docker.io container image.

Now, let's explore a more advanced scenario, building an Apache Hadoop Yarn Cluster similar to the topology described below:


Using Docker containers is proving to be a very viable and lightweight way build/simulate a local Yarn Cluster, compared with using heavy VMs.

See below all the steps you need to get started and build your own Yarn cluster in your desktop.

Dockerfile - The recipe for building the Docker.io Image

While building an Yarn cluster image, we have to take care of the few main things :

  • Configure passwordless ssh across all cluster containers.
  • Download, install and configura Java.
  • Download, install and configure Apache Yarn:
    • Configure Namenode and Datanode connectivity.
    • Enable dynamic Datanodes to connect to Namenode.
  • Configure Network:
    • Network connectivity.
    • Expose Yarn ports required by Administration UI and Node communication.
Below is a sample docker file that will handle most of the items above, with exception of some network connectivity, which is going to be handled during container initialization.

.......

USER root

# install dev tools
RUN yum install -y curl which tar sudo openssh-server openssh-clients rsync
RUN yum update -y libselinux

# passwordless ssh
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

# java
RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/7u71-b14/jdk-7u71-linux-x64.rpm' -H 'Cookie: oraclelicense=accept-securebackup-cookie'
RUN rpm -i jdk-7u71-linux-x64.rpm
RUN rm jdk-7u71-linux-x64.rpm

ENV JAVA_HOME /usr/java/default
ENV PATH $PATH:$JAVA_HOME/bin

# hadoop
RUN curl -s http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-2.6.0 hadoop

ENV HADOOP_PREFIX /usr/local/hadoop
ENV HADOOP_COMMON_HOME /usr/local/hadoop
ENV HADOOP_HDFS_HOME /usr/local/hadoop
ENV HADOOP_MAPRED_HOME /usr/local/hadoop
ENV HADOOP_YARN_HOME /usr/local/hadoop
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop

RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/java/default\nexport HADOOP_PREFIX=/usr/local/hadoop\nexport HADOOP_HOME=/usr/local/hadoop\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

RUN mkdir $HADOOP_PREFIX/input
RUN cp $HADOOP_PREFIX/etc/hadoop/*.xml $HADOOP_PREFIX/input

# pseudo distributed
ADD core-site.xml $HADOOP_PREFIX/etc/hadoop/core-site.xml
#RUN sed s/HOSTNAME/localhost/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml

ADD mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_PREFIX/etc/hadoop/yarn-site.xml

RUN $HADOOP_PREFIX/bin/hdfs namenode -format

# fixing the libhadoop.so like a boss
RUN rm  /usr/local/hadoop/lib/native/*
RUN curl -Ls http://dl.bintray.com/sequenceiq/sequenceiq-bin/hadoop-native-64-2.6.0.tar | tar -x -C /usr/local/hadoop/lib/native/

ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config

ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh

ENV BOOTSTRAP /etc/bootstrap.sh

# workingaround docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh

# fix the 254 error code
RUN sed  -i "/^[^#]*UsePAM/ s/.*/#&/"  /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config
RUN echo "Port 2122" >> /etc/ssh/sshd_config

CMD ["/etc/bootstrap.sh", "-d"]

EXPOSE 50020 50090 50070 50010 50075 8031 8032 8033 8040 8042 49707 22 8088 8030


DYI - Building the Docker.io image

sudo docker build  -t yarn-cluster .
Getting Started - Launching Yarn nodes

In order to simplify what process to start when launching a NameNode/NodeManager versus a DataNode, a boostrap shell script is used and it supports a --namenode and --datanode parameter which is used in conjunction with the docker run command to launch the Yarn node.

When launching the NameNode/NodeManager, there is also a need to map the ports used by the Yarn UI administration applications so it can be accessed ouside of the containers.

Below is the command to launch a NameNode/NodeManager node. Note that we use the -p to map the ports, and then we use bootstrap.sh --namenode to start the proper Yarn services.

sudo docker run -i -t -p 8088:8088 -p 50070:50070 -p 50075:50075 --name namenode -h namenode yarn-cluster /etc/bootstrap.sh -bash -namenode


Now that the master node is up and running, let's add some DataNodes to our cluster. A peculiarity of launching the DataNodes is that they need to be aware of the NameNode location, and for this, docker enable containers be linked, which will cause the local /etc/hosts to be updated with the address of the linked container.

Below is the command to launch a DataNode node. Note how the --link parameter links the DataNode container to the NameNode container, and also how the boostrap.sh --datanode now receives a different parameter to properly start only Yarn DataNode related services.

sudo docker run -i -t --link namenode:namenode --workdir /usr/local/hadoop yarn-cluster /etc/bootstrap.sh -bash -datanode


After launching a few images, the DataNode administration ui will then look like the one below :
Conclusion

Using Docker.io containers is a very good and lightweight option to build a Hadoop Yarn cluster, but in order to get it to the next level, there are few other items that need to be thought trough and solved, like a few described below :

  • Managing machine resources available for each container : cpu, memory, etc.
  • Strategy for non-transient persistent data.
  • Hack aware data replication, when in container environment.
  • etc.


Also, note that all the source code used to build this Yarn Cluster is also available in the github repository: docker-yarn-cluster.

No comments:

Post a Comment