The combination of Big Data and Cloud Computing is one of the biggest trends today, and Hadoop has become almost synonymous with Big Data. So, it's the right time to understand how to enable Apache Hadoop in Cloud Computing.
Hadoop has laid the foundation for many other big data technologies and tools, and it supports a diverse range of workload types when industries need virtually unlimited scalability for large data. However, with the exponential growth of big data, storage cost and maintainability become prime concerns for companies working within a budget.
Are you a fresher who wants to start a career in Big Data Hadoop? Read our previous blog on how to start Learning Hadoop for Beginners.
Hence, running Apache Hadoop in Cloud Computing is a trend that today's industry follows. Big data processing on cloud platforms is especially effective when an enterprise wants to save cost by accelerating analytics jobs for faster results and then bringing the cluster down.
In this blog, we will discuss how to enable Apache Hadoop on a cloud computing instance.
Key considerations for Enabling Apache Hadoop in Cloud Computing Environment
Before you configure an Apache Hadoop environment in the cloud, the following points should be considered:
- Security in the public cloud is a concern for any Apache Hadoop cloud deployment. Since Hadoop itself provides very limited security, every enterprise must evaluate the security criteria before moving Hadoop cluster data to the cloud.
- The main purpose of Apache Hadoop is data analysis. Hence, Apache Hadoop in cloud computing deployment must support the tools associated with the Hadoop ecosystem, especially analytics and data visualization tools.
- Data transmission in the cloud is chargeable. Hence, the place from which data is loaded into the cloud is an important factor: the cost differs depending on whether the data comes from an internal system outside the cloud or is already in the cloud.
How to Configure Apache Hadoop Environment in the Cloud?
To start with the Apache Hadoop cloud configuration, you must have a Linux platform installed in the cloud. Here we will discuss how to configure a single-node Apache Hadoop cluster in the cloud in pseudo-distributed mode. We have considered AWS EC2 as the cloud environment.
Pre-requisites to configure Apache Hadoop environment in the cloud
- An AWS account in an active state
- Available private and public keys for the EC2 instance
- A running Linux instance
- PuTTY installed and set up on your local machine
Want to enhance your enterprise capabilities? Start using Big Data and Cloud Computing together; they make a perfect combination.
Next, to enable Apache Hadoop in a Cloud Computing environment, the following steps need to be executed in two phases.
Phase 1: Connect to the EC2 instance using PuTTY
Step 1: Generate the private key for PuTTy
As you will connect to the EC2 instance with PuTTY to configure the Apache Hadoop environment, you need a private key in a format PuTTY supports. AWS provides the private key in .pem format; using a tool like PuTTYgen, you can convert the .pem file into the .ppk format that PuTTY uses. Once the PuTTY private key is generated, you can connect to the EC2 instance over SSH.
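For reference, the same conversion can also be done from a shell with the puttygen command-line tool (part of the putty-tools package on most Linux distributions); the file names here are just placeholders:

```bash
# Convert the AWS-issued .pem key into PuTTY's .ppk format
# (my-aws-key.pem / my-aws-key.ppk are placeholder file names)
puttygen my-aws-key.pem -o my-aws-key.ppk
```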
Step 2: Start a PuTTY session for the EC2 instance
Once you start your PuTTY session to connect to EC2, you need to authenticate the connection first. To do so, in the Category panel expand Connection and then SSH, select Auth, browse for the .ppk file, and open it.
Step 3: Provide the permission
If this is the first time you are connecting to the instance, PuTTY will ask you to confirm the connection. Provide the login name as ec2-user. Once you press Enter, your session will start.
Phase 2: Configuring Apache Hadoop in cloud computing
Before you proceed to configure the Apache Hadoop environment on the EC2 instance, make sure you have downloaded the following software beforehand:
- Java Package
- Hadoop Package
Next, follow the below steps:
Step 1: Create a Hadoop User on EC2 instance
You need to add a new Hadoop user to your EC2 instance. To do so, you need root access, which can be obtained using the following command:
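On Amazon Linux the default ec2-user already has sudo rights, so a typical way to switch to root is:

```bash
# Switch from the default ec2-user to the root user
sudo su -
```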
Once you have root access, you can create the new user using the command below:
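For example, using whizlabs as the user name (as in the rest of this article):

```bash
# Create the new Hadoop user
useradd whizlabs
```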
Now add a password for the new user:
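For example:

```bash
# Set a password for the whizlabs user (you will be prompted to enter it twice)
passwd whizlabs
```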
Next, to make the new user a sudo user, type visudo and add an entry for the user whizlabs below the line "Allow root to run any commands anywhere":
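The entry in the sudoers file typically ends up looking like this (the first two lines already exist; only the whizlabs line is added):

```
## Allow root to run any commands anywhere
root        ALL=(ALL)       ALL
whizlabs    ALL=(ALL)       ALL
```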
Step 2: Exit from the root
Go back to the ec2-user and then log in as the new user by entering the commands below:
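A typical sequence looks like this:

```bash
# Leave the root shell (back to ec2-user), then log in as the Hadoop user
exit
su - whizlabs
```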
Step 3: Transfer the Hadoop and Java archives to the EC2 instance
Java should be installed before you install Hadoop on the EC2 instance. Hence, you need to copy the downloaded Hadoop and Java archives from your Windows machine to the EC2 instance using a file transfer tool like WinSCP or FileZilla. To copy the files through these tools:
- Launch the tool.
- Enter the hostname and username of the EC2 instance and the port number; the default port number here is 22.
- Expand the SSH category and authenticate using the PuTTY private key (.ppk) file.
- Once you are logged in to the EC2 instance, locate the Hadoop and Java archives on your local system and transfer them.
- Once the files are available on the instance, use the ls command to confirm they are there.
- You need to install Hadoop and Java as the new user you created for Hadoop. Hence, copy the files over to the whizlabs user with commands like the ones shown below.
- Next, extract the archives using the commands below.
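A sketch of the copy and extract steps, assuming the archives were uploaded to the ec2-user home directory; the file names and version numbers below are placeholders, and if you downloaded .zip archives, use unzip instead of tar:

```bash
# Copy the uploaded archives from ec2-user's home to the whizlabs user's home
# (file names and versions are placeholders)
sudo cp /home/ec2-user/hadoop-2.7.3.tar.gz /home/ec2-user/jdk-8u171-linux-x64.tar.gz /home/whizlabs/
sudo chown whizlabs:whizlabs /home/whizlabs/hadoop-2.7.3.tar.gz /home/whizlabs/jdk-8u171-linux-x64.tar.gz

# As the whizlabs user, extract both archives
cd /home/whizlabs
tar -xzf hadoop-2.7.3.tar.gz
tar -xzf jdk-8u171-linux-x64.tar.gz
```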
Step 4: Configure Apache Hadoop environment
To set the necessary environment variables for Hadoop and Java, you need to update the .bashrc file on the Linux instance. From /home/whizlabs, which is the home directory here, type the command below:
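A minimal sketch of what that looks like; the exact paths and version numbers depend on where you extracted Java and Hadoop, so treat the ones below as placeholders:

```bash
# Open the environment file for editing
vi ~/.bashrc

# Add entries like these at the end of .bashrc (placeholder paths)
export JAVA_HOME=/home/whizlabs/jdk1.8.0_171
export HADOOP_HOME=/home/whizlabs/hadoop-2.7.3
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
```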
To make the environment variables take effect, use the command:
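For example:

```bash
# Reload .bashrc so the new variables take effect in the current session
source ~/.bashrc
```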
Step 5: Create NameNode and DataNode directories
Use the commands below to create the NameNode and DataNode storage locations:
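The directory locations are up to you; the paths below are placeholders, and they only need to match what you later put in hdfs-site.xml:

```bash
# Create local storage directories for the NameNode and DataNode (placeholder paths)
mkdir -p /home/whizlabs/hdfs/namenode
mkdir -p /home/whizlabs/hdfs/datanode
```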
Step 6: Modify the directory permissions to 755
Use the below commands
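For example, assuming the placeholder paths from Step 5:

```bash
# Give the HDFS storage directories 755 permissions
chmod -R 755 /home/whizlabs/hdfs
```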
Step 7: Change the directory location to the Hadoop installation directory
Use the below command
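For example, if Hadoop was extracted into the whizlabs home directory (the directory name depends on the version you downloaded):

```bash
# Move into the Hadoop installation directory (placeholder path)
cd /home/whizlabs/hadoop-2.7.3
```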
Step 8: Add the Hadoop home and Java home paths in hadoop-env.sh
Use the below command to open hadoop-env.sh file
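Assuming a Hadoop 2.x layout and the placeholder paths used earlier, this looks roughly like:

```bash
# Open hadoop-env.sh from inside the Hadoop installation directory
vi etc/hadoop/hadoop-env.sh

# Set these entries inside the file (placeholder paths)
export JAVA_HOME=/home/whizlabs/jdk1.8.0_171
export HADOOP_HOME=/home/whizlabs/hadoop-2.7.3
```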
[Update Java classpath and version as per your installed version].
Step 9: Update Hadoop configuration details
Add the configuration properties in the below-mentioned files:
core-site.xml file
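A minimal single-node example; the port (9000) is a common default and can be changed:

```xml
<!-- core-site.xml: point the default filesystem at the local NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```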
hdfs-site.xml file
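A minimal example for a single-node cluster; the directory paths must match the ones created in Step 5 (placeholders here):

```xml
<!-- hdfs-site.xml: single replica plus the NameNode/DataNode directories from Step 5 -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/whizlabs/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/whizlabs/hdfs/datanode</value>
  </property>
</configuration>
```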
yarn-site.xml file
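A minimal example that enables the MapReduce shuffle service on the NodeManager:

```xml
<!-- yarn-site.xml: enable the MapReduce shuffle auxiliary service -->
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```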
Step 10: Change the configuration properties of mapred-site.xml
Using the command below, copy the content of the mapred-site.xml template into mapred-site.xml:
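Assuming you are still in the Hadoop installation directory, the copy typically looks like this:

```bash
# Create mapred-site.xml from the bundled template (Hadoop 2.x layout)
cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
```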
Add the below property in the file:
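For a single-node setup the property is typically:

```xml
<!-- mapred-site.xml: run MapReduce jobs on the YARN framework -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```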
Step 11: ssh key generation for the Hadoop user
Follow the steps below for ssh key generation:
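A common way to set up passwordless SSH for the local daemons looks like this:

```bash
# Generate a passwordless RSA key pair for the whizlabs user
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa

# Authorize the key for local logins and lock down its permissions
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys

# Verify that ssh to localhost no longer asks for a password
ssh localhost
```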
Step 12: Format the NameNode
Before you start the daemons, you need to change to the Hadoop installation directory and format the NameNode using the commands below:
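For example, with the placeholder installation path used earlier:

```bash
# Move to the Hadoop installation directory and format HDFS
cd /home/whizlabs/hadoop-2.7.3
bin/hdfs namenode -format
```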
Step 13: Start all the daemons in Hadoop
To bring Hadoop up into a running state, you need to start the daemons in the order below:
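One common order, assuming the Hadoop 2.x daemon scripts and that you are still in the Hadoop installation directory:

```bash
# Start the HDFS daemons first
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemon.sh start datanode

# Then start the YARN daemons
sbin/yarn-daemon.sh start resourcemanager
sbin/yarn-daemon.sh start nodemanager

# Confirm that all daemons are running
jps
```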
And you are all set with Apache Hadoop in the cloud.
Benefits of Apache Hadoop in Cloud Computing
The primary benefit of Cloud Computing is that it enables business agility for enterprises, data scientists, and developers with virtually unlimited scalability. There are no upfront hardware costs, and it stays cost-effective because it follows a pay-as-you-go model.
Apache Hadoop in cloud computing offers many benefits, as follows:
1. The scale of data analytics needs
With the growing data analysis requirements within enterprises, expanding the capacity of Hadoop clusters is the need of the hour. While setting up the hardware for this could take weeks or months, deployment in the cloud takes only a few days. As a result, overall data processing scales fast and meets business needs.
2. Lower cost for innovation
Configuring an Apache Hadoop environment in the cloud requires a low capital investment with no upfront hardware cost. Apache Hadoop in cloud computing is an especially good fit for startups doing big data analytics.
3. Pay specific to your need
The Apache Hadoop cloud is suitable for use cases where the requirement is to spin up a cluster, run the job to get the results, and then stop or shut down the system. This flexible spending model of the cloud means you only pay for the compute you use, as you go.
4. Use the optimum infrastructure for the job
Not all big data processing needs the same hardware infrastructure in terms of memory, I/O, compute resources, and so on. In cloud computing, you can select the instance type that provides the optimal infrastructure for your solution.
5. Using cloud data as the source
Nowadays, with the growing use of IoT and the cloud, most enterprises prefer to store data in the cloud, making it the primary source of data. As big data is all about large volumes of data, running Apache Hadoop in cloud computing makes a lot of sense.
6. Makes your operations simple
Cloud computing can provision various types of Apache Hadoop clusters with different characteristics and configurations, each suitable for a particular set of jobs. This lowers the administrative burden on an enterprise of managing multiple clusters or implementing multi-tenant policies.
Final Verdict
To conclude, setting up Apache Hadoop in cloud computing requires hands-on skills in both Apache Hadoop and AWS. You must be comfortable with and aware of both technologies and the techniques involved.
Whizlabs offers the best level of theoretical and hands-on knowledge through its Hadoop certification training and Cloud Computing Certification training. Browse through our courses and build up your technical ground with us!
Have any questions? Just mention them in the comment section below or submit here; we'll be happy to respond.