In this post we will learn how to run a Map-Reduce program on an AWS EMR cluster. This post is a continuation of my earlier post:
https://blog.bigdatawithjasvant.com/2019/09/setting-up-machine-code-and-data-for.html
Please read it before continuing.
The video for this post is available below:
To run a job on AWS EMR, both the code and the input data need to be in S3 storage. There should also be a directory where log files will be stored. The output directory should not exist, because the job is going to create it. If the output directory already exists, the Map-Reduce program will fail.
We put the jar file containing the code for the Map-Reduce program in S3 storage.
The input for the Map-Reduce program is also placed in S3 storage.
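Since an existing output directory makes the job fail, it can be worth checking for it before launching the cluster. The following is only a minimal sketch, assuming a boto3-style S3 client is passed in; the helper names `split_s3_uri` and `output_dir_exists` are made up for this illustration and are not part of the original setup.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def output_dir_exists(s3_client, output_uri):
    """Return True if at least one object already lives under the output prefix.

    s3_client is assumed to behave like boto3.client("s3").
    """
    bucket, prefix = split_s3_uri(output_uri)
    if prefix:
        # The trailing slash keeps 'output' from also matching 'output-old/...'.
        prefix = prefix.rstrip("/") + "/"
    response = s3_client.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
    return response.get("KeyCount", 0) > 0
```

With boto3 installed and AWS credentials configured, `output_dir_exists(boto3.client("s3"), "s3://bigdatawithjasvant-emr/output")` should return `False` before the job is submitted.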
Now open the EMR service console in AWS.
Now create a new cluster. Provide a name for the cluster, and provide a directory where the log files for the cluster will be placed. These files can be used later to analyze the reason for a failure. To add a step to the cluster, choose "Step type" as "Custom Jar" and click the "Configure" button to specify the JAR file which contains our Map-Reduce program and the arguments for the program.
In the pop-up dialog box, select the jar file which contains the code. In the arguments box, provide the following arguments.
com.wordcount.WordCountHashPartitioner
s3://bigdatawithjasvant-emr/bigtext
s3://bigdatawithjasvant-emr/output
The first line contains the main class name. The second line contains the location of the input files. The third line contains the location of the output directory. The output directory should not exist already; if it exists, the job will fail. Select "Action on failure" as "Terminate the cluster" because we don't want to keep the cluster running if our Map-Reduce program fails.
Now we select the number of nodes in the cluster. I will be using 5 m5.xlarge nodes for the cluster. One will be used as the master node and the other 4 as core nodes of the cluster.
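The same cluster and step could also be described in code instead of through the console. The sketch below only builds the request that boto3's EMR `run_job_flow` call accepts; it is not what was used in this post. The cluster name, the jar and log locations, the release label, and the default role names are assumptions added for illustration.

```python
def build_cluster_request(name, log_uri, jar_uri, step_args,
                          instance_type="m5.xlarge", core_nodes=4,
                          release_label="emr-5.27.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,
        # Assumption: the post does not say which EMR release was used.
        "ReleaseLabel": release_label,
        "Instances": {
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": instance_type, "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": instance_type, "InstanceCount": core_nodes},
            ],
            # Shut the cluster down once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "WordCount",
            # Same choice as "Terminate the cluster" in the console.
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": jar_uri, "Args": list(step_args)},
        }],
        # Assumption: the default EMR roles already exist in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    name="wordcount-cluster",                             # hypothetical name
    log_uri="s3://bigdatawithjasvant-emr/logs/",          # hypothetical log directory
    jar_uri="s3://bigdatawithjasvant-emr/wordcount.jar",  # hypothetical jar location
    step_args=[
        "com.wordcount.WordCountHashPartitioner",  # main class
        "s3://bigdatawithjasvant-emr/bigtext",     # input location
        "s3://bigdatawithjasvant-emr/output",      # output location, must not exist
    ],
)
```

With boto3 installed, `boto3.client("emr").run_job_flow(**request)` would submit this request and return a response containing the new cluster's `JobFlowId`.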
Once we click the "Create cluster" button, the cluster will be created and started. It will take around 3 minutes for the cluster to come up. You can keep clicking the refresh button to update the status.
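Instead of clicking refresh, the cluster state can be polled from a script. This is a rough sketch, assuming a boto3-style EMR client and a cluster id; the function name, the polling interval, and the retry limit are arbitrary choices made for this illustration.

```python
import time

# States in which the cluster is either up or finished, so polling can stop.
SETTLED_STATES = {"RUNNING", "WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_for_cluster(emr_client, cluster_id, poll_seconds=30, max_polls=40,
                     sleep=time.sleep):
    """Poll describe_cluster until the cluster leaves its start-up states.

    emr_client is assumed to behave like boto3.client("emr").
    """
    for _ in range(max_polls):
        response = emr_client.describe_cluster(ClusterId=cluster_id)
        state = response["Cluster"]["Status"]["State"]
        if state in SETTLED_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError(f"cluster {cluster_id} did not settle after {max_polls} polls")
```

boto3 also ships a ready-made waiter for this, `emr_client.get_waiter("cluster_running")`, which may be simpler if no custom handling is needed.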
Once the cluster is in the running state, we can see the status of the steps which we configured in the cluster.
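The step status shown in the console can also be read through the API. The helper below is a small sketch, assuming a boto3-style EMR client; `step_states` is a name made up for this example, and it ignores pagination, which should be fine for a cluster with a single step.

```python
def step_states(emr_client, cluster_id):
    """Return (step name, state) pairs for the steps of a cluster.

    emr_client is assumed to behave like boto3.client("emr").
    States include PENDING, RUNNING, COMPLETED, CANCELLED and FAILED.
    """
    response = emr_client.list_steps(ClusterId=cluster_id)
    return [(step["Name"], step["Status"]["State"]) for step in response["Steps"]]
```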
Once our step is in the running state, we can click "View jobs" to see the jobs running as part of that step.
To see the details of the tasks of our job, we need to click on the "View tasks" link.
You can see the total number of tasks, as well as the completed, running and pending tasks, on the tasks page. As you can see, 13 tasks are currently running. There are 4 cores in each of the 4 core instances, for a total of 16 cores and 64 GB of RAM across the core instances. I don't know the size of one execution slot in EMR in terms of RAM and CPU. The number of running tasks peaked at 13, which hints that the total number of slots was close to 13.
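The numbers above at least allow a rough ceiling on the size of one slot. The sketch below just divides the core-node resources by the observed peak of 13 tasks. It assumes all slots are the same size and that the peak was limited by resources rather than by the number of input splits, neither of which was verified here.

```python
def per_slot_ceiling(total_cores, total_ram_gb, peak_tasks):
    """Rough upper bound on cores and RAM per task, assuming equal-sized slots."""
    return total_cores / peak_tasks, total_ram_gb / peak_tasks


# 4 core instances with 4 cores and 16 GB each, 13 tasks running at the peak.
cores_per_slot, ram_per_slot = per_slot_ceiling(16, 64, 13)
print(f"at most {cores_per_slot:.2f} cores and {ram_per_slot:.2f} GB per running task")
```

The real slot size is likely smaller than this ceiling, since part of each node's memory and CPU is not handed out to tasks.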
You can look at the console logs of the job from the web interface.
You can also look at the controller logs using the web console.
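The same logs are also copied to the log directory in S3 that was configured for the cluster. As far as I know they end up under a per-cluster, per-step prefix as gzip files, but treat the exact layout in this sketch as an assumption and check it against your own bucket. The cluster id and step id used below are placeholders.

```python
def step_log_uris(log_uri, cluster_id, step_id):
    """Build the S3 locations where step logs are expected to appear.

    Assumed layout: <log_uri>/<cluster_id>/steps/<step_id>/<name>.gz
    """
    base = f"{log_uri.rstrip('/')}/{cluster_id}/steps/{step_id}"
    names = ("controller", "syslog", "stderr", "stdout")
    return {name: f"{base}/{name}.gz" for name in names}


# Hypothetical log directory and placeholder ids, for illustration only.
uris = step_log_uris("s3://bigdatawithjasvant-emr/logs/", "j-EXAMPLE", "s-EXAMPLE")
print(uris["controller"])
```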
Once the Map-Reduce job step is complete, the cluster will shut down automatically. In my case it took around 5 minutes for my program to run. Around another 5 minutes were taken by starting up and shutting down.
AWS EMR is a good option if you have a lot of data and want to process it quickly. You can launch a cluster with up to 20 instances and tear it down once it is no longer needed. In our case the cluster was up for about 10 minutes, so we were billed for 10 minutes on each of the 5 instances, plus the EMR usage fee for 10 minutes on each of the 5 instances. That adds up to 50 instance-minutes of instance charges and 50 instance-minutes of EMR usage fee. In return, if the same work would have taken around 50 minutes on a single node, you saved about 40 minutes by running your program on a multi-node cluster.
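The billing arithmetic above can be written down as a short calculation. This is only a back-of-the-envelope sketch. It assumes billing is proportional to the minutes each instance was up, and that a single node would need the full 50 instance-minutes to do the same work, which is an optimistic simplification.

```python
def billing_estimate(instances, wall_clock_minutes):
    """Estimate billed instance-minutes and time saved versus a single node."""
    instance_minutes = instances * wall_clock_minutes
    return {
        "instance_minutes": instance_minutes,   # charged at the instance rate
        "emr_fee_minutes": instance_minutes,    # charged again at the EMR usage rate
        # Assumes one node would need all instance-minutes back to back.
        "minutes_saved": instance_minutes - wall_clock_minutes,
    }


estimate = billing_estimate(instances=5, wall_clock_minutes=10)
print(estimate)
```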