In my previous post, we went through the basic steps of building a basic standalone Docker.io container image.
Now, let's explore a more advanced scenario: building an Apache Hadoop Yarn cluster, with one NameNode/ResourceManager container serving multiple DataNode containers.
Using Docker containers is proving to be a very viable and lightweight way to build/simulate a local Yarn cluster, compared with using heavy VMs.
Below are all the steps you need to get started and build your own Yarn cluster on your desktop.
Dockerfile - The recipe for building the Docker.io Image
While building a Yarn cluster image, we have to take care of a few main things:
- Configure passwordless ssh across all cluster containers.
- Download, install and configure Java.
- Download, install and configure Apache Yarn:
  - Configure Namenode and Datanode connectivity.
  - Enable dynamically added Datanodes to connect to the Namenode.
- Configure the network:
  - Network connectivity.
  - Expose the Yarn ports required by the administration UIs and node communication.
Below is a sample Dockerfile that will handle most of the items above, with the exception of some network connectivity, which is going to be handled during container initialization.
.......
USER root
# install dev tools
RUN yum install -y curl which tar sudo openssh-server openssh-clients rsync
RUN yum update -y libselinux
# passwordless ssh
RUN ssh-keygen -q -N "" -t dsa -f /etc/ssh/ssh_host_dsa_key
RUN ssh-keygen -q -N "" -t rsa -f /etc/ssh/ssh_host_rsa_key
RUN ssh-keygen -q -N "" -t rsa -f /root/.ssh/id_rsa
RUN cp /root/.ssh/id_rsa.pub /root/.ssh/authorized_keys
# java
RUN curl -LO 'http://download.oracle.com/otn-pub/java/jdk/7u71-b14/jdk-7u71-linux-x64.rpm' -H 'Cookie: oraclelicense=accept-securebackup-cookie'
RUN rpm -i jdk-7u71-linux-x64.rpm
RUN rm jdk-7u71-linux-x64.rpm
ENV JAVA_HOME /usr/java/default
ENV PATH $PATH:$JAVA_HOME/bin
# hadoop
RUN curl -s http://www.eu.apache.org/dist/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz | tar -xz -C /usr/local/
RUN cd /usr/local && ln -s ./hadoop-2.6.0 hadoop
ENV HADOOP_PREFIX /usr/local/hadoop
ENV HADOOP_COMMON_HOME /usr/local/hadoop
ENV HADOOP_HDFS_HOME /usr/local/hadoop
ENV HADOOP_MAPRED_HOME /usr/local/hadoop
ENV HADOOP_YARN_HOME /usr/local/hadoop
ENV HADOOP_CONF_DIR /usr/local/hadoop/etc/hadoop
ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
RUN sed -i '/^export JAVA_HOME/ s:.*:export JAVA_HOME=/usr/java/default\nexport HADOOP_PREFIX=/usr/local/hadoop\nexport HADOOP_HOME=/usr/local/hadoop\n:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN sed -i '/^export HADOOP_CONF_DIR/ s:.*:export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop/:' $HADOOP_PREFIX/etc/hadoop/hadoop-env.sh
RUN mkdir $HADOOP_PREFIX/input
RUN cp $HADOOP_PREFIX/etc/hadoop/*.xml $HADOOP_PREFIX/input
# pseudo distributed
ADD core-site.xml $HADOOP_PREFIX/etc/hadoop/core-site.xml
#RUN sed s/HOSTNAME/localhost/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
ADD hdfs-site.xml $HADOOP_PREFIX/etc/hadoop/hdfs-site.xml
ADD mapred-site.xml $HADOOP_PREFIX/etc/hadoop/mapred-site.xml
ADD yarn-site.xml $HADOOP_PREFIX/etc/hadoop/yarn-site.xml
RUN $HADOOP_PREFIX/bin/hdfs namenode -format
# fixing the libhadoop.so like a boss: replace the bundled 32-bit native libs with 64-bit builds
RUN rm /usr/local/hadoop/lib/native/*
RUN curl -Ls http://dl.bintray.com/sequenceiq/sequenceiq-bin/hadoop-native-64-2.6.0.tar | tar -x -C /usr/local/hadoop/lib/native/
ADD ssh_config /root/.ssh/config
RUN chmod 600 /root/.ssh/config
RUN chown root:root /root/.ssh/config
ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
ENV BOOTSTRAP /etc/bootstrap.sh
# working around a docker.io build error
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
RUN chmod +x /usr/local/hadoop/etc/hadoop/*-env.sh
RUN ls -la /usr/local/hadoop/etc/hadoop/*-env.sh
# fix the 254 error code: disable PAM in sshd and move ssh to port 2122
RUN sed -i "/^[^#]*UsePAM/ s/.*/#&/" /etc/ssh/sshd_config
RUN echo "UsePAM no" >> /etc/ssh/sshd_config
RUN echo "Port 2122" >> /etc/ssh/sshd_config
CMD ["/etc/bootstrap.sh", "-d"]
EXPOSE 50020 50090 50070 50010 50075 8031 8032 8033 8040 8042 49707 22 8088 8030
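The Dockerfile above references a few files that are ADDed from the build context but not listed in the post. To give an idea of what they contain, here is a minimal sketch of the ssh_config client configuration (the Port 2122 value matches the sshd_config change above; the other settings are common choices for passwordless ssh between short-lived containers, and the actual file in the repository may differ):

Host *
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  LogLevel quiet
  Port 2122

Similarly, a minimal core-site.xml sketch would point HDFS clients at the master container. The namenode hostname works because of the container linking shown later; the 9000 port is just the common default and is an assumption here:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:9000</value>
  </property>
</configuration>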
DIY - Building the Docker.io image
sudo docker build -t yarn-cluster .
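Once the build completes, the new image should show up in the local image list:

sudo docker images | grep yarn-cluster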
Getting Started - Launching Yarn nodes
In order to simplify which processes to start when launching the NameNode/ResourceManager versus a DataNode, a bootstrap shell script is used; it supports
-namenode and
-datanode parameters, which are used in conjunction with the docker run command to launch each Yarn node. A sketch of the script is shown below.
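The bootstrap script itself is not listed in the post (the real one is in the github repository linked at the end), but the idea is: start sshd, then start only the Hadoop daemons that match the requested role. The daemon invocations below are illustrative assumptions based on the standard Hadoop 2.6 sbin scripts, not the author's exact script:

#!/bin/bash
# /etc/bootstrap.sh - illustrative sketch; the real script may differ

HADOOP_PREFIX=/usr/local/hadoop

# sshd is needed for inter-container connectivity
/usr/sbin/sshd

for arg in "$@"; do
  case "$arg" in
    -namenode)
      # master role: HDFS NameNode + YARN ResourceManager
      $HADOOP_PREFIX/sbin/hadoop-daemon.sh start namenode
      $HADOOP_PREFIX/sbin/yarn-daemon.sh start resourcemanager
      ;;
    -datanode)
      # worker role: HDFS DataNode + YARN NodeManager
      $HADOOP_PREFIX/sbin/hadoop-daemon.sh start datanode
      $HADOOP_PREFIX/sbin/yarn-daemon.sh start nodemanager
      ;;
    -bash)
      # interactive mode: drop into a shell at the end
      INTERACTIVE=true
      ;;
    -d)
      # detached mode: keep the container alive
      DETACHED=true
      ;;
  esac
done

if [ "$INTERACTIVE" = "true" ]; then
  /bin/bash
elif [ "$DETACHED" = "true" ]; then
  while true; do sleep 1000; done
fi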
When launching the NameNode/ResourceManager, there is also a need to map the ports used by the Yarn administration UIs so they can be accessed outside of the containers.
Below is the command to launch the NameNode/ResourceManager node. Note that we use -p to map the ports, and then we use
bootstrap.sh -namenode to start the proper Yarn services.
sudo docker run -i -t -p 8088:8088 -p 50070:50070 -p 50075:50075 --name namenode -h namenode yarn-cluster /etc/bootstrap.sh -bash -namenode
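With these mappings in place, the ResourceManager UI becomes reachable from the host at http://localhost:8088 and the NameNode UI at http://localhost:50070; a quick smoke test from the host:

curl -sI http://localhost:8088
curl -sI http://localhost:50070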
Now that the master node is up and running, let's add some DataNodes to our cluster. A peculiarity of launching the DataNodes is that they need to be aware of the NameNode location; for this, Docker enables containers to be linked, which causes the container's local
/etc/hosts to be updated with the address of the linked container.
Below is the command to launch a DataNode. Note how the
--link parameter links the DataNode container to the NameNode container, and also how
bootstrap.sh -datanode now receives a different parameter to properly start only the Yarn DataNode related services.
sudo docker run -i -t --link namenode:namenode --workdir /usr/local/hadoop yarn-cluster /etc/bootstrap.sh -bash -datanode
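To double-check that the new DataNode actually registered with the NameNode, one option (assuming a Docker version recent enough to support docker exec) is to run an HDFS report from inside the namenode container; live DataNodes show up in the output:

sudo docker exec namenode /usr/local/hadoop/bin/hdfs dfsadmin -report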
After launching a few containers, the DataNode administration UI will then look like the one below:
Conclusion
Using Docker.io containers is a very good and lightweight option for building a Hadoop Yarn cluster, but in order to take it to the next level, there are a few other items that need to be thought through and solved, such as the ones described below:
- Managing the machine resources available to each container: CPU, memory, etc. (see the sketch after this list).
- A strategy for persistent, non-transient data.
- Rack-aware data replication when running in a container environment.
- etc.
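On the resource management point, docker run already exposes some basic knobs. A hypothetical DataNode launch capped at 2g of memory and a reduced CPU share would look like the command below; the 2g and 512 values are arbitrary examples, and memory limits require cgroup memory support on the host:

sudo docker run -i -t -m 2g -c 512 --link namenode:namenode --workdir /usr/local/hadoop yarn-cluster /etc/bootstrap.sh -bash -datanode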
Also, note that all the source code used to build this Yarn cluster is available in the github repository:
docker-yarn-cluster.