
Sunday, 1 September 2019

Setting up code and data for AWS EMR

This post continues my earlier post on configuring the AWS CLI. Please read that post first:
https://blog.bigdatawithjasvant.com/2019/09/configuring-aws-cli.html

In this post I am going to show you how to set up code and data for running a Map-Reduce program on an AWS EMR cluster. I am going to use a sample word count program. The focus here is on running the program on EMR, not on how to write a Map-Reduce program. So let's get started.


Step 1: Download Word count program

Download the word count program from 
and extract it into a folder called workspace. I am using D:\workspace in my example.

Now start Eclipse with D:\workspace as the workspace.

Click on Import project in Eclipse.

Select Existing Maven Project as the project type.

Select the project folder.

Build the project.

The JAR file will be created in the target folder.
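
If you prefer the command line to Eclipse, the same build can probably be done with Maven directly. The following is only a sketch: it assumes Maven is installed and on the PATH, that the extracted project has a pom.xml in its root, and that you are in a POSIX shell (for example Git Bash on Windows). The helper name build_jar and the project path are placeholders of mine, not part of the sample project.

```shell
#!/bin/sh
# Sketch: build the Map-Reduce JAR with Maven instead of Eclipse.
# Assumption: mvn is on the PATH and the project folder contains a pom.xml.

# build_jar PROJECT_DIR (hypothetical helper)
build_jar() {
    project_dir="$1"
    # "clean package" compiles the code and writes the JAR under target/.
    mvn -f "$project_dir/pom.xml" clean package || return 1
    # Show whatever JAR files the build produced.
    ls "$project_dir"/target/*.jar
}

# Usage (the path is a placeholder):
#   build_jar /path/to/workspace/project
```

Either way, the JAR you upload later is the one that ends up in the target folder.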

Step 2: Download data file to AWS instance

Download the https://www.bigdatawithjasvant.com/blogdata/00/0000/data/bigtext.tar.gz file to the AWS instance and extract it using the following commands.

curl -O https://www.bigdatawithjasvant.com/blogdata/00/0000/data/bigtext.tar.gz
tar -xvzf bigtext.tar.gz


Run the ./generate.sh command to generate the bigtext.dat file. The file is created by repeating complet_work.txt 700 times.
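
The contents of generate.sh are not shown in this post, so the following is only a sketch of how a large file can be built by repeating a smaller one a fixed number of times. The function name repeat_file is hypothetical, and the real script may work differently.

```shell
#!/bin/sh
# Sketch: build a large file by concatenating a source file N times.
# Assumption: this only mirrors what generate.sh is described as doing.

# repeat_file SOURCE COUNT DEST (hypothetical helper)
repeat_file() {
    src="$1"
    count="$2"
    dest="$3"
    : > "$dest"                 # start with an empty output file
    i=0
    while [ "$i" -lt "$count" ]; do
        cat "$src" >> "$dest"   # append one more copy of the source
        i=$((i + 1))
    done
}

# Usage, with the file names from this post:
#   repeat_file complet_work.txt 700 bigtext.dat
```

Repeating the text this way gives the word count job a large input without needing a large download.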

Step 3: Grant permission on S3 buckets to AWS CLI user

Open IAM in the AWS console and click on Users.

Select the user to edit.

Click on the Add permissions button.

Click on Add existing policies directly.

Select the S3 permissions and add them to the user.

Click the Add permissions button.
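
The same grant can also be made from the command line by someone with IAM administrator rights. This is a sketch under assumptions: the post does not name the exact policy selected in the console, so the AWS managed policy AmazonS3FullAccess is my guess, and the helper name grant_s3_access and the user name are placeholders.

```shell
#!/bin/sh
# Sketch: attach an S3 managed policy to an IAM user from the command line.
# Assumptions: the caller has IAM administrator rights, and AmazonS3FullAccess
# is the policy wanted; the post does not name the exact policy it selected.

S3_POLICY_ARN="arn:aws:iam::aws:policy/AmazonS3FullAccess"

# grant_s3_access USER_NAME (hypothetical helper)
grant_s3_access() {
    user_name="$1"
    aws iam attach-user-policy \
        --user-name "$user_name" \
        --policy-arn "$S3_POLICY_ARN" || return 1
    # List the attached policies so the change can be checked.
    aws iam list-attached-user-policies --user-name "$user_name"
}

# Usage (the user name is a placeholder):
#   grant_s3_access my-cli-user
```

A narrower policy limited to one bucket would be safer for anything beyond a test account.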


Step 4: Upload jar file and input data to S3

Create a new bucket where you can upload the test data and the JAR file. The bucket name must be unique across all of AWS, so you will need to choose your own name. I used the following command:
aws s3 mb s3://bigdatawithjasvant-emr

Copy the bigtext.dat file to the S3 bucket using the following command:

aws s3 cp bigtext.dat  s3://bigdatawithjasvant-emr/bigtext/
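
If you expect to rerun these commands, the bucket creation and upload can be wrapped so that the bucket is only created when it does not exist yet. This is a minimal sketch: the helper name upload_input is hypothetical, and the bucket name is passed in as an argument because yours will differ from mine.

```shell
#!/bin/sh
# Sketch: create the bucket only if needed, upload the data, and list it.

# upload_input BUCKET FILE (hypothetical helper)
upload_input() {
    bucket="$1"
    file="$2"
    # head-bucket exits non-zero when the bucket is missing or not accessible.
    if ! aws s3api head-bucket --bucket "$bucket" 2>/dev/null; then
        aws s3 mb "s3://$bucket" || return 1
    fi
    aws s3 cp "$file" "s3://$bucket/bigtext/" || return 1
    # Confirm that the object arrived.
    aws s3 ls "s3://$bucket/bigtext/"
}

# Usage (replace the bucket name with your own):
#   upload_input bigdatawithjasvant-emr bigtext.dat
```

The final listing should show bigtext.dat under the bigtext/ prefix.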


Now log in to the AWS console, go to the S3 service, and click on the bucket.

Create a logs folder in the S3 bucket.

Create a jar folder in the S3 bucket.

Click on the jar folder.

Click on the Upload button, then click on select file in the upload dialog box.

Select the JAR file to upload.

Click on the Upload button to upload the JAR file containing the Map-Reduce program.

Now you can see the uploaded JAR file in the S3 bucket.
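
The console steps above can also be sketched with the AWS CLI. S3 has no real folders: a folder in the console is just a key prefix, and an empty object whose key ends in "/" makes it visible there. The helper name upload_jar and the JAR file name in the usage line are placeholders, since the actual file name depends on the project's build.

```shell
#!/bin/sh
# Sketch: command-line equivalent of the console steps above.
# A "folder" in S3 is a key prefix; a zero-byte object whose key ends
# in "/" makes an empty folder show up in the console.

# upload_jar BUCKET JAR_PATH (hypothetical helper)
upload_jar() {
    bucket="$1"
    jar_path="$2"
    # Create the empty logs/ placeholder.
    aws s3api put-object --bucket "$bucket" --key "logs/" || return 1
    # Uploading to jar/ creates that prefix implicitly.
    aws s3 cp "$jar_path" "s3://$bucket/jar/" || return 1
    # List everything in the bucket to check the result.
    aws s3 ls "s3://$bucket/" --recursive
}

# Usage (the JAR file name is a placeholder):
#   upload_jar bigdatawithjasvant-emr target/your-program.jar
```

The recursive listing should show the logs/ placeholder, the JAR under jar/, and the data file under bigtext/.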

Now we are all set to execute the Map-Reduce program on the EMR cluster. The code and input data are both available in the S3 bucket.
