Luciano Resende: Building a Yarn cluster using Docker.io containers

In my previous post, we went through the basic steps of building a standalone Docker.io container image.

Now, let's explore a more advanced scenario, building an Apache Hadoop Yarn Cluster similar to the topology described below:

Using Docker containers is proving to be a viable and lightweight way to build or simulate a local Yarn cluster, compared with using heavy VMs.

Below are all the steps you need to get started and build your own Yarn cluster on your desktop.

Dockerfile - The recipe for building the Docker.io Image

While building a Yarn cluster image, we have to take care of a few main things:

Configure passwordless ssh across all cluster containers.
Download, install and configure Java.
Download, install and configure Apache Yarn:
- Configure Namenode and Datanode connectivity.
- Enable dynamic Datanodes to connect to Namenode.
Configure Network:
- Network connectivity.
- Expose Yarn ports required by Administration UI and Node communication.
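
For the NameNode and DataNode connectivity items, most of the work usually ends up in core-site.xml and hdfs-site.xml. The sketch below generates minimal versions of both files. It is not a copy of the files in the repository: the hostname "namenode" and port 9000 are assumptions, and turning off dfs.namenode.datanode.registration.ip-hostname-check is one possible way to let DataNodes register when the NameNode cannot resolve their addresses back to hostnames.

```shell
#!/bin/sh
# Sketch: generate minimal core-site.xml and hdfs-site.xml files.
# The hostname "namenode" and port 9000 are assumptions for illustration.

write_core_site() {
  # $1 = output file, $2 = hostname of the NameNode container
  cat > "$1" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://$2:9000</value>
  </property>
</configuration>
EOF
}

write_hdfs_site() {
  # $1 = output file
  # Disabling the hostname check lets a DataNode register even when the
  # NameNode cannot map the DataNode's IP address back to a hostname.
  cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
    <value>false</value>
  </property>
</configuration>
EOF
}

# Write both files into OUT_DIR (defaults to the current directory).
out_dir="${OUT_DIR:-.}"
write_core_site "$out_dir/core-site.xml" namenode
write_hdfs_site "$out_dir/hdfs-site.xml"
```

Running the script next to the Dockerfile leaves the two files where the ADD instructions expect them.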

Below is a sample Dockerfile that handles most of the items above, with the exception of some network connectivity, which is handled during container initialization.


.......

USER root

# install dev tools
RUN yum install -y curl which tar sudo openssh-server openssh-clients rsync
RUN yum update -y libselinux

# passwordless ssh
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys

# java
RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/7u71-b14/jdk-7u71-linux-x64.rpm' -H 'Cookie: oraclelicense=accept-securebackup-cookie'
RUN rpm -i jdk-7u71-linux-x64.rpm
RUN rm jdk-7u71-linux-x64.rpm

ENV JAVA_HOME /usr/java/default
ENV PATH $PATH:$JAVA_HOME/bin

# hadoop
RUN curl -s http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-2.6.0 hadoop

ENV HADOOP_PREFIX /usr/local/hadoop
ENV HADOOP_COMMON_HOME /usr/local/hadoop
ENV HADOOP_HDFS_HOME /usr/local/hadoop
ENV HADOOP_MAPRED_HOME /usr/local/hadoop
ENV HADOOP_YARN_HOME /usr/local/hadoop
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop

RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/java/default\nexport HADOOP_PREFIX=/usr/local/hadoop\nexport HADOOP_HOME=/usr/local/hadoop\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

RUN mkdir $HADOOP_PREFIX/input
RUN cp $HADOOP_PREFIX/etc/hadoop/*.xml $HADOOP_PREFIX/input

# pseudo distributed
ADD core-site.xml $HADOOP_PREFIX/etc/hadoop/core-site.xml
#RUN sed s/HOSTNAME/localhost/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml

ADD mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_PREFIX/etc/hadoop/yarn-site.xml

RUN $HADOOP_PREFIX/bin/hdfs namenode -format

# fixing the libhadoop.so like a boss
RUN rm  /usr/local/hadoop/lib/native/*
RUN curl -Ls http://dl.bintray.com/sequenceiq/sequenceiq-bin/hadoop-native-64-2.6.0.tar | tar -x -C /usr/local/hadoop/lib/native/

ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config

ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh

ENV BOOTSTRAP /etc/bootstrap.sh

# working around docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh

# fix the 254 error code
RUN sed  -i "/^[^#]*UsePAM/ s/.*/#&/"  /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config
RUN echo "Port 2122" >> /etc/ssh/sshd_config

CMD ["/etc/bootstrap.sh", "-d"]

EXPOSE 50020 50090 50070 50010 50075 8031 8032 8033 8040 8042 49707 22 8088 8030
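
The Dockerfile adds an ssh_config file that is not listed in this post. Since sshd is moved to port 2122 above, the ssh client inside the containers presumably has to use the same port. Here is a minimal sketch of generating such a file; the option values are assumptions, not the contents of the file in the repository.

```shell
#!/bin/sh
# Sketch only: writes a hypothetical ssh_config for the image.
# The Dockerfile copies this file to /root/.ssh/config.

write_ssh_config() {
  # $1 = output file
  # Host key checking is relaxed because containers are recreated often
  # and get fresh addresses; Port matches the "Port 2122" line that the
  # Dockerfile appends to sshd_config.
  cat > "$1" <<'EOF'
Host *
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  LogLevel quiet
  Port 2122
EOF
}

# Write the file into the build context (current directory by default).
write_ssh_config "${SSH_CONFIG_OUT:-./ssh_config}"
```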

DIY - Building the Docker.io image

sudo docker build  -t yarn-cluster .

Getting Started - Launching Yarn nodes

To simplify choosing which processes to start when launching a NameNode/NodeManager versus a DataNode, a bootstrap shell script is used. It supports a -namenode and a -datanode parameter, which is passed through the docker run command when launching the Yarn node.
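
To make the idea concrete, here is a hypothetical sketch of the role dispatch such a script could perform. It is not the bootstrap.sh from the repository: it only prints the commands instead of starting anything, and the mapping of daemons to roles is an assumption (the master is assumed to run the HDFS NameNode plus the YARN ResourceManager, the daemon that serves the UI on port 8088).

```shell
#!/bin/sh
# Hypothetical sketch of role dispatch in a bootstrap script.
# It prints the daemon start commands rather than running them.
: "${HADOOP_PREFIX:=/usr/local/hadoop}"

# Print the start commands for a role flag (-namenode or -datanode).
# Which daemons belong to which role is an assumption.
commands_for_role() {
  case "$1" in
    -namenode)
      echo "$HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode"
      echo "$HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager"
      ;;
    -datanode)
      echo "$HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode"
      echo "$HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager"
      ;;
    *)
      echo "unknown role: $1" >&2
      return 1
      ;;
  esac
}

# Dry run: show what the master role would start.
commands_for_role -namenode
```

Keeping the role selection in one function means the same image can serve both node types, and only the argument given to docker run changes.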

When launching the NameNode/NodeManager, we also need to map the ports used by the Yarn administration UIs so they can be accessed from outside the containers.

Below is the command to launch a NameNode/NodeManager node. Note that we use the -p option to map the ports, and then pass -namenode to bootstrap.sh to start the proper Yarn services.

sudo docker run -i -t -p 8088:8088 -p 50070:50070 -p 50075:50075 --name namenode -h namenode yarn-cluster /etc/bootstrap.sh -bash -namenode
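
Once the container is up, the three mapped ports should answer on the Docker host. The sketch below lists the matching URLs and includes an optional probe; 8088, 50070 and 50075 are the Hadoop 2.x default ports for the ResourceManager, NameNode and DataNode web UIs, while the function names and the use of curl are only illustrative.

```shell
#!/bin/sh
# Sketch: list the web UIs published by the -p mappings of the master node.

ui_urls() {
  # $1 = host where Docker publishes the ports (defaults to localhost)
  host="${1:-localhost}"
  echo "ResourceManager http://$host:8088/"
  echo "NameNode http://$host:50070/"
  echo "DataNode http://$host:50075/"
}

# Probe each UI and print "up" or "down" per service.
# Requires curl on the host; not called automatically.
probe_uis() {
  ui_urls "$1" | while read -r name url; do
    if curl -s -o /dev/null --max-time 5 "$url"; then
      echo "$name up ($url)"
    else
      echo "$name down ($url)"
    fi
  done
}

ui_urls localhost
```

Call probe_uis localhost after the services have had a few seconds to start.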

Now that the master node is up and running, let's add some DataNodes to our cluster. A peculiarity of launching the DataNodes is that they need to know the NameNode's location. For this, Docker allows containers to be linked, which causes the local /etc/hosts to be updated with the address of the linked container.

Below is the command to launch a DataNode. Note how the --link parameter links the DataNode container to the NameNode container, and how bootstrap.sh now receives a different parameter, -datanode, to start only the DataNode-related services.

sudo docker run -i -t --link namenode:namenode --workdir /usr/local/hadoop yarn-cluster /etc/bootstrap.sh -bash -datanode
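
To add several DataNodes at once, the same command can be generated in a loop. The sketch below only prints the commands. It assumes detached containers (docker run -d instead of -i -t), that bootstrap.sh accepts "-d -datanode" the same way it accepts "-bash -datanode", and container names of the form datanodeN; all three are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: print the docker commands for a number of DataNode containers.
# Assumptions: detached mode, "-d -datanode" as bootstrap arguments,
# and hypothetical container names datanode1, datanode2, ...

datanode_commands() {
  # $1 = number of DataNode containers
  count="$1"
  i=1
  while [ "$i" -le "$count" ]; do
    echo "sudo docker run -d --link namenode:namenode --name datanode$i -h datanode$i yarn-cluster /etc/bootstrap.sh -d -datanode"
    i=$((i + 1))
  done
}

# Dry run: print the commands for three DataNodes.
datanode_commands 3
```

Piping the output to sh (datanode_commands 3 | sh) runs the commands. Afterwards, sudo docker exec datanode1 grep namenode /etc/hosts should show the entry that --link added.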

After launching a few containers, the DataNode administration UI will look like the one below:

Conclusion

Using Docker.io containers is a good, lightweight option for building a Hadoop Yarn cluster, but to take it to the next level, a few other items need to be thought through and solved, such as the ones below:

Managing the machine resources available to each container: CPU, memory, etc.
Strategy for non-transient persistent data.
Rack-aware data replication in a container environment.
etc.
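
For the first two items, Docker already offers some building blocks: memory and CPU-share limits on docker run, and host volumes for data that should outlive a container. The sketch below only prints such a command. The limit values, the container name, the host directory and the "-d -datanode" bootstrap arguments are illustrative assumptions, and the container-side path assumes hdfs-site.xml leaves dfs.datanode.data.dir at its Hadoop default under /tmp/hadoop-root.

```shell
#!/bin/sh
# Sketch: print a docker run command for a resource-limited DataNode
# whose HDFS block data is kept on a host volume.
# All argument values used below are illustrative assumptions.

limited_datanode_command() {
  # $1 = container name, $2 = memory limit, $3 = CPU shares,
  # $4 = host directory that will hold the DataNode data
  name="$1"
  mem="$2"
  shares="$3"
  host_dir="$4"
  # Assumed Hadoop default data dir when running as root.
  data_dir="/tmp/hadoop-root/dfs/data"
  echo "sudo docker run -d -m $mem --cpu-shares $shares -v $host_dir:$data_dir --link namenode:namenode --name $name -h $name yarn-cluster /etc/bootstrap.sh -d -datanode"
}

# Dry run: a DataNode limited to 2 GB of memory and 512 CPU shares.
limited_datanode_command datanode1 2g 512 /data/datanode1
```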

Also, note that all the source code used to build this Yarn cluster is available in the GitHub repository: docker-yarn-cluster.

Luciano Resende

Saturday, January 17, 2015
