Automating Hadoop using Ansible

Vinodha kumara
6 min read · Mar 10, 2021


Hello guys, I'm back with another article. In this one, you will see how we can automate the setup of a Hadoop cluster using the Linux automation tool Ansible.

What is Hadoop?

Apache Hadoop is a collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model.

What is a Hadoop Cluster?

A Hadoop cluster is a collection of computers, known as nodes, that are networked together to perform parallel computations on big data sets.

What is a NameNode?

The NameNode is the centerpiece of an HDFS file system. It keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. The NameNode is a single point of failure for the HDFS cluster.

What is a DataNode?

DataNodes are the slave nodes in HDFS. Unlike the NameNode, a DataNode runs on commodity hardware, that is, an inexpensive system that is not of high quality or high availability. The DataNode is a block server that stores the data blocks in its local file system.

What is a Client Node?

The final part of the system is the client nodes, which are responsible for loading the data and fetching the results. Master nodes are responsible for storing data in HDFS and overseeing key operations, such as running parallel computations on the data using MapReduce.

What is Ansible?

Ansible is a software tool that provides simple but powerful automation for cross-platform computer support. It is primarily intended for IT professionals, who use it for application deployment, updates on workstations and servers, cloud provisioning, configuration management, intra-service orchestration, and nearly anything a systems administrator does on a weekly or daily basis. Ansible doesn’t depend on agent software and has no additional security infrastructure, so it’s easy to deploy.

How does Ansible work?

In Ansible, there are two categories of machines: the control node, which manages the other systems, and the managed nodes, which are controlled by the control node.

One of the great things about Ansible is that it is idempotent, and the target nodes need no agent software. Ansible works by connecting to nodes (clients, servers, etc.) on a network and then sending a small program, called an Ansible module, to each node. Ansible executes these modules over SSH and removes them when finished. The only requirement for this interaction is that your Ansible control node has login access to the managed nodes. SSH keys are the most common way to provide access, but other forms of authentication are also supported.
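For example, once SSH access is in place, you can verify connectivity to every node in your inventory with a simple ad-hoc command that runs Ansible's built-in ping module:

ansible all -m ping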

Ansible playbooks

An Ansible playbook is a blueprint of automation tasks. Ansible playbooks are executed on a set, group, or classification of hosts, which together make up an Ansible inventory.
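The playbooks below target three host groups: Name, Data, and Client. A minimal sketch of a matching inventory file is shown here; the NameNode IP comes from the variables defined below, while the DataNode and client IPs are placeholders for illustration:

# Sample inventory (IPs other than the NameNode's are illustrative)
[Name]
192.168.43.202

[Data]
192.168.43.203
192.168.43.204

[Client]
192.168.43.205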

Variables in our cluster (vars.yml)

namenodeIP: "192.168.43.202"
port: "9091"

Ansible configuration file
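The original post shows this file as a screenshot. Here is a minimal sketch of an equivalent ansible.cfg, assuming the inventory above is saved at /root/ip.txt and that we connect as root (adjust both to your setup):

[defaults]
# Path to the inventory file (assumed location)
inventory = /root/ip.txt
# Skip SSH host-key prompts on first connection
host_key_checking = false
remote_user = root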

Namenode Playbook

- hosts: Name
  gather_facts: false
  vars_files:
    - vars.yml
  vars:
    node: "name"
    directory: "namenode"

  tasks:
    # Create the directory that will hold the NameNode metadata
    - file:
        path: "/{{ directory }}"
        state: directory
    - debug:
        msg: "{{ namenodeIP }}"
    # Copy the Hadoop and JDK installers to the target node
    - name: "Copy Hadoop file"
      copy:
        src: "/root/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/"
    - name: "Copy JDK file"
      copy:
        src: "/root/jdk-8u171-linux-x64.rpm"
        dest: "/root/"
    # Check whether Hadoop and the JDK are already installed
    - name: "Check whether Hadoop is installed"
      package:
        name: "hadoop"
        state: present
      register: x
      # A failure here means the package is not installed
      ignore_errors: yes
    - name: "Check whether JDK is installed"
      package:
        name: "java"
        state: present
      register: y
      ignore_errors: yes
    # Install Hadoop and the JDK from the copied RPMs
    # ("is failed" is used because rc is only set when the check fails)
    - name: "Install Hadoop"
      command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
      when: x is failed
    - name: "Install JDK"
      command: "rpm -ivh jdk-8u171-linux-x64.rpm"
      when: y is failed
    # Deploy the core-site and hdfs-site configuration templates
    - name: "Copy core file"
      template:
        src: "core-site.xml.j2"
        dest: "/etc/hadoop/core-site.xml"
    - name: "Copy HDFS file"
      template:
        src: "hdfs-site.xml.j2"
        dest: "/etc/hadoop/hdfs-site.xml"
    # Open the NameNode port permanently in firewalld
    - name: "Open port {{ port }} permanently in firewalld"
      firewalld:
        port: "{{ port }}/tcp"
        state: enabled
        permanent: yes
        immediate: yes
    # Format the NameNode metadata (asks for confirmation first)
    - pause:
        prompt: "Do you want to format the NameNode metadata? (yes/no)"
      register: format
    # shell (not command) is required so the pipe is interpreted
    - name: "Format Namenode"
      shell: "echo Y | hadoop namenode -format"
      when: format.user_input | bool
    - debug:
        var: format.user_input
    # Start the NameNode daemon
    - name: "Start Namenode"
      command: "hadoop-daemon.sh start namenode"
      register: start_namenode
    - debug:
        var: start_namenode
    # Verify the running Java processes
    - name: "JPS"
      command: "jps"
      register: jps_n
    - debug:
        var: jps_n
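The playbook deploys two Jinja2 templates that the post does not list. Here is a plausible sketch of their contents, reconstructed from the variables used above (fs.default.name, dfs.name.dir, and dfs.data.dir are the standard Hadoop 1.x properties, but the exact templates are an assumption):

core-site.xml.j2:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://{{ namenodeIP }}:{{ port }}</value>
  </property>
</configuration>

hdfs-site.xml.j2 (renders dfs.name.dir on the NameNode and dfs.data.dir on DataNodes via the node and directory variables):

<configuration>
  <property>
    <name>dfs.{{ node }}.dir</name>
    <value>/{{ directory }}</value>
  </property>
</configuration>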

The command to run the namenode playbook is given below.

ansible-playbook hadoopNamenode.yml

Running Namenode Playbook

Datanode Playbook

- hosts: Data
  gather_facts: false
  vars_files:
    - vars.yml
  vars:
    node: "data"
    directory: "datanode"

  tasks:
    # Create the directory that will hold the DataNode blocks
    - file:
        path: "/{{ directory }}"
        state: directory
    - debug:
        msg: "{{ namenodeIP }}"
    # Copy the Hadoop and JDK installers to the target node
    - name: "Copy Hadoop file"
      copy:
        src: "/root/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/"
    - name: "Copy JDK file"
      copy:
        src: "/root/jdk-8u171-linux-x64.rpm"
        dest: "/root/"
    # Check whether Hadoop and the JDK are already installed
    - name: "Check whether Hadoop is installed"
      package:
        name: "hadoop"
        state: present
      register: x
      # A failure here means the package is not installed
      ignore_errors: yes
    - name: "Check whether JDK is installed"
      package:
        name: "java"
        state: present
      register: y
      ignore_errors: yes
    # Install Hadoop and the JDK from the copied RPMs
    - name: "Install Hadoop"
      command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
      when: x is failed
    - name: "Install JDK"
      command: "rpm -ivh jdk-8u171-linux-x64.rpm"
      when: y is failed
    # Deploy the core-site and hdfs-site configuration templates
    - name: "Configure core file"
      template:
        src: "core-site.xml.j2"
        dest: "/etc/hadoop/core-site.xml"
    - name: "Configure HDFS file"
      template:
        src: "hdfs-site.xml.j2"
        dest: "/etc/hadoop/hdfs-site.xml"
    # Start the DataNode daemon
    - name: "Start DataNode"
      command: "hadoop-daemon.sh start datanode"
      register: start_datanode
    - debug:
        var: start_datanode
    # Verify the running Java processes
    - name: "JPS"
      command: "jps"
      register: jps
    - debug:
        var: jps

The command to run the Datanode playbook is given below.

ansible-playbook hadoopDatanode.yml

Running Datanode Playbook
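After the play finishes, you can confirm from the NameNode that the DataNodes have registered with the cluster, using Hadoop's standard admin report command:

hadoop dfsadmin -report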

Client Playbook

- hosts: Client
  gather_facts: false
  vars_files:
    - vars.yml
  vars:
    blocks: "1024"
    replicas: "3"

  tasks:
    - debug:
        msg: "{{ namenodeIP }}"
    # Copy the Hadoop and JDK installers to the target node
    - name: "Copy Hadoop file"
      copy:
        src: "/root/hadoop-1.2.1-1.x86_64.rpm"
        dest: "/root/"
    - name: "Copy JDK file"
      copy:
        src: "/root/jdk-8u171-linux-x64.rpm"
        dest: "/root/"
    # Check whether Hadoop and the JDK are already installed
    - name: "Check whether Hadoop exists"
      package:
        name: "hadoop"
        state: present
      register: x
      # A failure here means the package is not installed
      ignore_errors: yes
    - name: "Check whether JDK exists"
      package:
        name: "java"
        state: present
      register: y
      ignore_errors: yes
    # Install Hadoop and the JDK from the copied RPMs
    - name: "Install Hadoop"
      command: "rpm -ivh hadoop-1.2.1-1.x86_64.rpm --force"
      when: x is failed
    - name: "Install JDK"
      command: "rpm -ivh jdk-8u171-linux-x64.rpm"
      when: y is failed
    # Deploy the core-site and hdfs-site configuration templates
    - name: "Configure core file"
      template:
        src: "core-site.xml.j2"
        dest: "/etc/hadoop/core-site.xml"
    - name: "Configure HDFS file"
      template:
        src: "hdfs-site-client.xml.j2"
        dest: "/etc/hadoop/hdfs-site.xml"
    # Create a sample file to upload
    - name: "Create a file"
      file:
        path: "/new.txt"
        state: touch

    - copy:
        content: "Hi this is Client file"
        dest: "/client.txt"
    - name: "Check whether the file already exists in HDFS"
      command: "hadoop fs -ls /client.txt"
      register: fileCheck
      ignore_errors: yes
    # Upload the file into the HDFS cluster if it is not there yet
    - name: "Upload file into the HDFS cluster"
      command: "hadoop fs -put /client.txt /"
      register: data
      when: fileCheck.rc > 0
    - debug:
        var: data
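The client playbook references a third template, hdfs-site-client.xml.j2, which is also not shown in the post. Given the blocks and replicas variables defined in the play, here is a plausible sketch (dfs.replication and dfs.block.size are standard Hadoop 1.x properties; note that dfs.block.size is in bytes, so 1024 is tiny and only suitable for a demo):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>{{ replicas }}</value>
  </property>
  <property>
    <name>dfs.block.size</name>
    <value>{{ blocks }}</value>
  </property>
</configuration>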

The command to run the client node playbook is given below.

ansible-playbook hadoopClientnode.yml

Running Clientnode Playbook

You can see here that the file has been uploaded.
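To double-check from any node in the cluster, you can list the HDFS root and read the file back:

hadoop fs -ls /
hadoop fs -cat /client.txt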
