For running any job on AWS EMR, the code and input data need to be in S3 storage. There should also be a directory where log files will be stored. The output directory should not already exist, because the Map-Reduce program is going to create it; if the output directory already exists, the program will fail.
We put the JAR file containing the code for the Map-Reduce program in S3 storage.
The input for the Map-Reduce program is also placed in S3 storage.
Now open the EMR service console in AWS.
Now create a new cluster. Provide a name for the cluster and a directory where log files for the cluster will be placed; these files can be used later for analyzing the reason for a failure. For adding steps to the cluster, choose "Step type" as "Custom Jar" and click on the "Configure" button to specify the JAR file which contains our Map-Reduce program and the arguments for the program.
In the pop-up dialog box, select the JAR file which contains the code. In the arguments box, provide the following arguments.
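For example, the arguments could look like this (a hypothetical illustration; the class name and bucket paths are assumptions, not the actual values from this post):

    com.example.mapreduce.WordCount
    s3://my-emr-bucket/input/
    s3://my-emr-bucket/output/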
The first line contains the main class name. The second line contains the location of the input files. The third line contains the location of the output directory, which should not already exist; if the output directory exists, the job will fail. Select "Action on failure" as "Terminate the cluster", because we don't want to keep the cluster running if our Map-Reduce program fails.
Now we select the number of nodes in the cluster. I will be using 5 m5.xlarge nodes for the cluster: one will be used as the master node and 4 as core nodes of the cluster.
Once we click on the "Create cluster" button, the cluster will be created and started. It will take around 3 minutes for the cluster to come up. You can keep clicking the refresh button to update the status.
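The same cluster can also be launched from the AWS CLI. Here is a minimal sketch, assuming hypothetical bucket names, the release label, and the same example class as above (an illustration, not the exact command used in this post):

    aws emr create-cluster \
      --name "wordcount-cluster" \
      --release-label emr-5.20.0 \
      --applications Name=Hadoop \
      --instance-type m5.xlarge \
      --instance-count 5 \
      --use-default-roles \
      --log-uri s3://my-emr-bucket/logs/ \
      --auto-terminate \
      --steps Type=CUSTOM_JAR,Name=wordcount,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://my-emr-bucket/wordcount.jar,Args=[com.example.mapreduce.WordCount,s3://my-emr-bucket/input/,s3://my-emr-bucket/output/]

The --auto-terminate flag shuts the cluster down once all steps finish, which matches the behaviour described below.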
Once the cluster is in the running state, we can see the status of the steps which we configured in the cluster.
Once our step is in the running state, we can click "View jobs" to see the jobs running as part of that step.
To see details of the tasks of our job, we need to click on the "View tasks" link.
You can see the total number of tasks, as well as the completed, running, and pending tasks, on the tasks page. As you can see, 13 tasks are currently running. There are 4 cores in each of the 4 core instances, for a total of 16 cores and 64GB RAM across the core instances. I don't know the size of one execution slot in EMR in terms of RAM and CPU, but the number of running tasks peaked at 13, which hints that the total number of slots was close to 13.
You can look at the console logs of the job from the web interface.
You can also look at the controller logs using the web console.
Once the Map-Reduce job step is complete, the cluster will shut down automatically. In my case it took around 5 minutes for my program to run, and around 5 more minutes were taken in starting up and shutting down.
AWS EMR is a good option if you have a lot of data and want to process it quickly. You can launch a cluster with up to 20 instances and tear it down once it is not needed. In our case we were billed for 10 minutes for each of the 5 instances, plus 10 minutes of the EMR usage fee for each of the 5 instances; that is, 50 instance-minutes plus 50 minutes of EMR usage fee. But you saved 40 minutes of time by running your program on a multi-node cluster.
The Cloudera QuickStart VM comes with a 64GB disk. This size is not suitable for storing big data files in the VM. This post shows how you can increase the disk size for a Cloudera virtual machine running in VirtualBox.
First you need to change the size of the disk in VirtualBox's "Virtual Media Manager". Please shut down your virtual machine before resizing your virtual disk.
Launch the "Virtual Media Manager".
Change the size of your disk from 64GB to 200GB and click the Apply button.
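The resize can also be done from the command line with VBoxManage. Note that VBoxManage can only resize VDI and VHD images, so if your Cloudera disk is a VMDK you have to clone it to VDI first. A sketch, with assumed file names:

    # clone the VMDK disk to the resizable VDI format (file names are assumptions)
    VBoxManage clonemedium disk cloudera-quickstart-vm-5.13.0-0-virtualbox-disk1.vmdk cloudera-disk1.vdi --format VDI
    # grow the cloned disk to 200GB (the size is given in MB)
    VBoxManage modifymedium disk cloudera-disk1.vdi --resize 204800

After cloning you would attach the new VDI to the VM in place of the original VMDK.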
Now start the machine. Once the virtual machine is launched, look at the block devices present in the virtual machine by running the lsblk command. Notice that the size of drive sda is now 200GB but the size of dm-0 is still 55GB. The partition dm-0 is mounted as the root partition. We need to resize this partition and the filesystem under it.
We will be using the fdisk command for resizing the partition. Please take a backup of your data before doing this step; any mistake can make your virtual disk unusable and you may lose all your data. Run the fdisk command on /dev/sda and print the partition table.
Now delete partition number 2. We will create this partition again with a larger size.
Now create partition 2 again with a larger size. We use all the free space on the disk while creating the new partition.
Now write the partition table to the disk. After writing the partition table, you need to restart the virtual machine so the kernel picks up the new layout. The whole fdisk session is sketched below.
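Putting the fdisk steps together, the interactive session looks roughly like this (a sketch; the exact prompts vary with the fdisk version, and the new partition 2 must start at the same sector as the old one):

    sudo fdisk /dev/sda
    # at the fdisk prompt:
    #   p          print the partition table and note the start of /dev/sda2
    #   d, 2       delete partition number 2
    #   n, p, 2    create a new primary partition number 2
    #              accept the default first sector only if it matches the old start
    #              accept the default last sector to use all remaining space
    #   t, 2, 8e   set the type of partition 2 to Linux LVM
    #   w          write the partition table to disk and exit
    sudo reboot    # restart so the kernel picks up the new partition table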
After restarting the virtual machine we need to resize the LVM physical volume which hosts the root partition. You can resize the physical volume using the sudo lvm pvresize /dev/sda2 command.
After resizing the physical volume we need to resize the logical volume using the sudo lvresize -l +100%FREE /dev/vg_quickstart/lv_root command. This command grows the logical volume by all of the free space in the volume group. Now you can see that LSize is 191.50G.
Now we need to extend the filesystem present in the lv_root logical volume. The mounted filesystem can be resized using the sudo resize2fs /dev/mapper/vg_quickstart-lv_root command. The complete sequence is shown below.
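For reference, the complete LVM and filesystem resize sequence is (a sketch, using the default vg_quickstart volume group from the Cloudera image):

    # grow the LVM physical volume to fill the resized /dev/sda2 partition
    sudo lvm pvresize /dev/sda2
    # extend the logical volume over all free space in the volume group
    sudo lvresize -l +100%FREE /dev/vg_quickstart/lv_root
    # grow the ext4 filesystem to fill the logical volume; works while mounted
    sudo resize2fs /dev/mapper/vg_quickstart-lv_root
    # verify the new size of the root filesystem
    df -h /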
Now the size of the root partition is 189GB. You can now store large files in this virtual disk and use this virtual machine for processing large files.
For setting up a big data machine you need a 64-bit Linux machine. For that I am using a virtual machine created using Oracle VirtualBox. I am using a 9-year-old AMD Athlon(tm) II X2 240 processor (2800 MHz, 2 cores, 2 logical processors) with 8GB RAM. I am finding that the CPU is the bottleneck here, not the RAM. I allocated 4GB RAM to the virtual machine. With this virtual machine you can run sample programs, but heavy processing can't be supported. For processing big files with a lot of data and processing I will be using virtual machines in Google Cloud and Amazon Web Services. It turns out to be cheaper to use the cloud rather than upgrading your desktop by spending 50,000 rupees; on the cloud you may spend only 5,000 rupees in a year.
You can download VirtualBox from https://www.virtualbox.org/ . You need to enable virtualization in your BIOS for creating a 64-bit virtual machine. Please refer to the following YouTube video for the detailed process: https://www.youtube.com/watch?v=tv0WPJSWBQo . If you are using a different BIOS, then search on Google for how to enable virtualization in your BIOS.
After enabling virtualization in the BIOS and installing VirtualBox from https://www.virtualbox.org/ , download the pre-built virtual machine with Hadoop installed from https://www.cloudera.com/downloads/quickstart_vms/5-13.html . On this page you need to provide some details about yourself like email address, name, company, etc. After that it will download a zip file with one hard disk image and a configuration file. Expand that zip file into a folder like D:\virtualbox\cloudera-quickstart-vm-5.13.0-0-virtualbox .
Start VirtualBox and click on "Import Appliance".
Select the VM configuration file from the directory where you extracted the Cloudera VM image.
Change the MAC address policy to "Generate new MAC Addresses for all network adapters" and change the name of the virtual machine if you want.
Click on the Import button and the import will start.
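The import can also be scripted with VBoxManage; a minimal sketch, assuming the .ovf file name from the extracted zip and a VM name of your choice:

    # import the appliance and give the VM a name (both names are assumptions)
    VBoxManage import cloudera-quickstart-vm-5.13.0-0-virtualbox.ovf \
      --vsys 0 --vmname "Cloudera-Quickstart"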
Once the VM is imported you need to change the network connection settings for our VM. I need one bridged adapter so that any PC on my local network can connect to my VM, and a host-only adapter for local communication from my Windows host machine to the virtual machine. Don't change the MAC addresses; accept whatever is provided by default.
Increase the CPUs allocated to the virtual machine so that it can consume 100% of the host CPU if required. This will increase the performance of the virtual machine. Also enable "extended VT-x/AMD-V".
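These settings can also be applied from the command line with VBoxManage; a sketch, where the VM name and the adapter names are assumptions for your setup:

    # two CPUs and hardware virtualization for the VM (VM name is an assumption)
    VBoxManage modifyvm "Cloudera-Quickstart" --cpus 2 --hwvirtex on
    # adapter 1 bridged to the host NIC, adapter 2 host-only (adapter names are assumptions)
    VBoxManage modifyvm "Cloudera-Quickstart" --nic1 bridged --bridgeadapter1 "Ethernet"
    VBoxManage modifyvm "Cloudera-Quickstart" --nic2 hostonly --hostonlyadapter2 "VirtualBox Host-Only Ethernet Adapter"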
Click OK on the dialog box. Now you can start your machine.
The VM will start in another window. The GUI will open without a password prompt, and you can use the virtual machine right away. The password for the user cloudera is cloudera, and the root password is also cloudera.
Please don't enable Cloudera Manager, because it will take a lot of resources in terms of memory and CPU. We will run Cloudera Manager on a cloud instance.