
Sunday, 1 September 2019

Configuring AWS CLI

In this post I will explain the steps to configure the AWS CLI on a CentOS virtual machine running in AWS. I have chosen CentOS instead of Amazon Linux because it is free and it is also available for your local PC and for other cloud providers like Google Cloud.

First you need to create an AWS instance by clicking the Launch Instance button on the AWS EC2 dashboard.

Choose AWS Marketplace on the Choose AMI screen.


Search for CentOS in the marketplace and select the CentOS 7 AMI.


Select t2.micro as the instance type. This instance type is free tier eligible, which means that using it is free for the first year after registration.


Next you need to select a subnet for the instance. I already have a subnet created, so I am using that. Make sure that the Auto-assign Public IP setting is enabled. If you don't enable it, your instance will not get a public IP address and you will not be able to connect to it.


Next you need to add storage to the instance. Up to 30GB of General Purpose storage is free for the first year after registration, so I am using 30GB of storage space.


Now add a Name tag to the instance so that you can identify it.



Next you need to attach a security group to the instance. I am using an existing security group. A security group is used to configure incoming and outgoing network connection permissions. I am using it to allow incoming connections to the AWS instance only from my PC, so that no one can break into it. If you allow incoming connections from the whole internet you will see constant automated attacks (in my case mostly from China and Russia), which consume a lot of bandwidth that you get charged for. So make sure that you allow incoming connections from your PC only.


Next you need to create a public/private key pair. The public key will be installed on the machine and the private key will be downloaded to your PC. Using the private key you will be able to log in to the AWS instance without a password; authentication will be based on the private key present on your PC. Keep copies of the private key in multiple locations so you can recover it in case of a disk failure or any other problem with your PC. Without the private key you can't log in to your AWS instance. Download the key to your local PC and launch the instance.


Once the instance has started, click on the security group to configure incoming connection permissions on the AWS instance. Security group permissions apply to all instances in the security group; currently we have only one instance in it.


Click on Edit Inbound rules.


In the inbound rules, remove all rules and add only one rule which allows incoming connections on all ports from your PC. You can click on the combo box in the Source column and select "My IP" to pick your public IP. Once the rule is set it will allow incoming connections only from your IP address. With some ISPs the public IP address keeps changing, so a connection attempt with putty will fail if your public IP address is different from the one it was when you set the inbound rules. To fix this problem you need to allow all IP addresses which your ISP can assign to you. The addresses assigned to you are not random; they share a common prefix, and you can add that prefix to the inbound rules to allow every address with that prefix. For example, if you add 122.177.95.0/24 as the source, it will allow all IP addresses in the range 122.177.95.0 to 122.177.95.255.
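If you are not sure which public IP address your ISP has currently assigned to you, a quick way to check from any shell is to ask AWS's own check-ip service (any "what is my IP" website works just as well):

curl https://checkip.amazonaws.com

Comparing the address it prints over a few days shows whether your ISP keeps the same address or only the same prefix.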


Now you need to convert the private key downloaded earlier into a format recognized by putty, because putty can't use the downloaded key directly. Run PUTTYGEN.EXE and click on the Load button.

Select the private key file downloaded earlier. In my case the private key file is named jasvant2.pem


Once the key file is imported successfully you need to save it in putty's format by clicking the Save private key button.

When you click the Save private key button, it shows a warning that the private key is not protected with a passphrase. If a passphrase is not provided, anybody who gets hold of the key file can use it to connect to the AWS instance. If a passphrase is provided, you will be prompted for it every time you use the key file to connect to the server. I chose not to provide a passphrase. Click on the OK button to dismiss the warning message.


Provide a file name for the private key in putty's private key format and click on the Save button.
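As a side note, if you prefer to do this conversion from a command line instead of the PuTTYgen GUI, the puttygen command-line tool (packaged as putty-tools on most Linux distributions) performs the same conversion; the file names below are the ones used in this post:

puttygen jasvant2.pem -o jasvant2.ppk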


Now run PUTTY.EXE and enter the public IP address of the running AWS instance as the host name.


Now click on Auth and select the private key file in Putty's private key format.


Select the private key.


Once the private key is selected and the Open button is pressed, putty shows a warning with the server's host key and asks you to confirm that you accept it. Click on the Yes button.


Provide centos as the username to connect to the AWS instance.
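If you are connecting from a Linux or macOS machine instead of Windows, you can skip the putty key conversion entirely and use the downloaded .pem key directly with ssh. A minimal sketch, with the instance's public IP left as a placeholder:

chmod 400 jasvant2.pem
ssh -i jasvant2.pem centos@<instance-public-ip>

The chmod is needed because ssh refuses to use a private key file that is readable by other users.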


Now follow the instructions provided on the web page
https://docs.aws.amazon.com/cli/latest/userguide/install-linux.html

Run the python --version command to check which python version is installed on the AWS instance.


Run curl -O https://bootstrap.pypa.io/get-pip.py to download the pip installer, which is required for installing the AWS CLI.

Run python get-pip.py --user to install pip for your user.

Add an export command at the end of your ~/.bash_profile file.
export PATH=~/.local/bin:$PATH
Reload the profile into your current session to put those changes into effect.
source ~/.bash_profile
Use pip to install the AWS CLI.
pip install awscli --upgrade --user
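Once the install finishes, confirm that the CLI is on your PATH by asking for its version; the exact version numbers printed will of course differ from machine to machine.

aws --version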


Now open the AWS console and click on IAM.

Click on the Add user button.


Provide a user name and select programmatic access.


Add a permission boundary. I am setting it to a custom policy which allows access to all S3 operations.

AWS gives a warning that the user has no permissions attached to it. That is a mistake on my part and I will correct it in my next post by adding permissions to the user before performing any S3 operations. Click on the Create user button.


Now download the .csv file with the credential details to be used for accessing the AWS APIs.


Note down the region ID for the region which you will be using with the AWS CLI. For Mumbai it is ap-south-1.


Now open the credentials.csv file in notepad++ and copy the Access Key ID and Secret Access Key. You will need them for configuring the AWS CLI.


Now run the aws configure command to configure the credentials for accessing the AWS APIs with the aws command. A sketch of the prompts is shown below.
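The aws configure command asks for the two values from credentials.csv plus a default region and output format; the key values below are placeholders.

aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: ap-south-1
Default output format [None]: json

A command such as aws s3 ls is the usual smoke test, but remember that this user has no permissions attached yet, so S3 calls will only succeed after permissions are added in the next post.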

In my next post I will use the AWS CLI to access S3 buckets and run a MapReduce program on an EMR cluster.

Tuesday, 27 August 2019

Problem of overcommit in Linux

Yesterday I was trying to bring up my application server, which loads a lot of data into memory; for that purpose I had given the JVM 110GB of heap space. After starting the application server I would start the data loading job, and some time later, in the middle of data loading, the application server would exit without any error message. After checking the server logs in /var/log/messages I found that my application server was being killed by the kernel because the system was running low on memory.

It is wrong on the kernel's part to kill applications that are not at fault. If the kernel is running low on memory it should deny requests by applications to allocate more memory, but instead it was killing processes.

The problem lies with the overcommit feature of Linux. In my opinion this feature should never have been developed; I don't see a valid use case for it. It relies on the assumption that a large part (around 30%) of the memory allocated by applications is never written to. Whenever an application requests memory, the kernel is supposed to zero out a page and map it into the process address space. Instead of providing a new memory page every time, the kernel uses a single read-only ZERO_PAGE and maps it into every application requesting an allocation. When an application starts writing to an allocated page, that generates a page fault, which the kernel handles by allocating a real page and replacing the ZERO_PAGE in the application's address space. So if an application never writes to an allocated page, the kernel never needs to back it with a real page. That is fine if it is done only to reduce the overhead of zeroing out pages: the kernel should still treat every page handed out to applications as allocated, because applications can start writing to that memory at any time.

But with overcommit, the kernel hands out more pages to applications than it has in its free pool. When applications start writing to that memory and the kernel is unable to find free pages to replace the ZERO_PAGE with, it kills an application. It cannot fail the request at that point, because it is not a memory allocation call but a memory write instruction of the processor, which cannot return a failure.
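You can see how much memory the kernel has promised to applications versus how much it is willing to back by looking at the overcommit setting and the commit counters in /proc:

cat /proc/sys/vm/overcommit_memory
grep -i commit /proc/meminfo

The first command prints 0 (heuristic overcommit, the default), 1 (always overcommit) or 2 (never overcommit). In the second, Committed_AS is the memory already promised to applications and CommitLimit is the maximum the kernel will allow in mode 2.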

I disabled the overcommit feature of the kernel by setting vm.overcommit_memory=2 using the following command:
sysctl vm.overcommit_memory=2
To make this change permanent across reboots you need to add the following line to /etc/sysctl.conf:
vm.overcommit_memory=2
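One detail of mode 2 worth knowing: the kernel then refuses allocations beyond CommitLimit, which is swap space plus vm.overcommit_ratio percent of physical RAM (50 by default). On a machine with little or no swap you may also want to raise the ratio, for example:

sysctl vm.overcommit_ratio=90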
You can find more analysis of this on the following page:
http://engineering.pivotal.io/post/virtual_memory_settings_in_linux_-_the_problem_with_overcommit/

Wednesday, 24 July 2019

LEADER_NOT_AVAILABLE Kafka error

Recently I installed Kafka on an Amazon AWS EC2 instance and tried to post messages to it from my Windows machine using the public IP address of the EC2 instance, but it failed with the following error.

23:06:01.622 [kafka-producer-network-thread | producer-1] DEBUG org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Sending metadata request (type=MetadataRequest, topics=kafka_core) to node 54.165.214.180:9092 (id: -1 rack: null)
23:06:01.956 [kafka-producer-network-thread | producer-1] WARN org.apache.kafka.clients.NetworkClient - [Producer clientId=producer-1] Error while fetching metadata with correlation id 70 : {kafka_core=LEADER_NOT_AVAILABLE}

I had modified the following line in Kafka's server.properties configuration file to use the public IP address of the EC2 instance.

advertised.listeners=PLAINTEXT://54.165.214.180:9092

But it was still not working. I searched the internet and everyone was saying that advertised.listeners needs to be the same as the host IP address. For me that was not an option, because on EC2 instances the public and private IP addresses are different, and I needed to connect from outside, so advertised.listeners had to be the public IP address. I struggled for a week, and yesterday I found that it works if I use the public hostname instead of the public IP address.

advertised.listeners=PLAINTEXT://ec2-54-165-214-180.compute-1.amazonaws.com:9092

The above line in Kafka's server.properties configuration file made it work.
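For reference, a common setup for an EC2 broker that must accept connections from outside is to bind the listener to all interfaces and advertise the public hostname; the hostname below is the one from this post:

listeners=PLAINTEXT://0.0.0.0:9092
advertised.listeners=PLAINTEXT://ec2-54-165-214-180.compute-1.amazonaws.com:9092

The EC2 public DNS name resolves to the private IP address from inside the VPC and to the public IP address from outside, so both the broker itself and external clients can reach the advertised address, which is likely why the hostname works where the raw public IP did not.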

Sunday, 7 April 2019

Installing spark 2 in cloudera VM

To install Spark 2 in the Cloudera VM we first need to enable Cloudera Manager in the VM. Since I am running the Cloudera VM with only 4GB of RAM, I need the --force option while setting up cloudera-manager; without it the setup exits because it requires at least 8 GB of RAM. The command for enabling Cloudera Manager is:

sudo /home/cloudera/cloudera-manager --force --express

The above command sets up Cloudera Manager on my VirtualBox VM. The IP address of my virtual machine is 192.168.1.4.

Now I open the Cloudera Manager user interface by typing 192.168.1.4:7180 in the browser. The username is cloudera and the password is cloudera for logging into Cloudera Manager.

Download the latest JDK 8 from the Oracle website https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html and look for the .tar.gz packaging for linux-x64.

Download it and put it in the cloudera home directory on the Cloudera VM. I downloaded it on my Windows machine and later used the Filezilla client to transfer it to the Cloudera VM. Filezilla is really useful for transferring files to and from the Cloudera VM. For connecting you can use the host-only or bridged IP address of the Cloudera machine; you can get the IP address of the VM by running the "ip addr" command in the Cloudera VM.
You can download the Filezilla client from https://filezilla-project.org/download.php?type=client
Open the Filezilla site manager and add a site for the Cloudera VM using the IP address of the Cloudera VM shown in the previous screenshot.
Now transfer the downloaded JDK to the Cloudera VM using the Filezilla client. Remember the password for the cloudera user is cloudera.


Extract the JDK tar file into the /usr/java directory from a terminal in the Cloudera VM.

Now modify JAVA_HOME in /etc/profile to point to the JDK 8 we just installed, using the "sudo vi /etc/profile" command.
Now you need to modify /etc/default/cloudera-scm-server using the sudo vi /etc/default/cloudera-scm-server command and add JAVA_HOME to that file as well. Example lines for both files are sketched below.
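As a sketch of these steps, assuming the extracted directory ends up named /usr/java/jdk1.8.0_202 (the exact name depends on the JDK build you downloaded):

sudo mkdir -p /usr/java
sudo tar -xzf jdk-8u202-linux-x64.tar.gz -C /usr/java

Then add to /etc/profile:
export JAVA_HOME=/usr/java/jdk1.8.0_202
export PATH=$JAVA_HOME/bin:$PATH

And add to /etc/default/cloudera-scm-server:
export JAVA_HOME=/usr/java/jdk1.8.0_202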

Now you need to restart the Cloudera Manager server and agent using the following commands:
sudo service cloudera-scm-agent stop
sudo service cloudera-scm-server stop
sudo service cloudera-scm-server start
sudo service cloudera-scm-agent start

Now confirm that the new JDK 8 is picked up by Cloudera Manager by logging into Cloudera Manager and looking at the Support > About dialog box.

Now we can proceed to install Spark 2 on the Cloudera VM. We are going to follow https://www.cloudera.com/documentation/spark2/latest/topics/spark2_installing.html for installing Spark 2.
Following those instructions, I downloaded the Spark 2 CSD jar, copied it into the default CSD directory, and changed its ownership and permissions as per the requirements by running the following commands:

sudo cp SPARK2_ON_YARN-2.3.0.cloudera4.jar /opt/cloudera/csd/
sudo chmod 644 /opt/cloudera/csd/SPARK2_ON_YARN-2.3.0.cloudera4.jar 
sudo chown cloudera-scm:cloudera-scm /opt/cloudera/csd/SPARK2_ON_YARN-2.3.0.cloudera4.jar

Now change the parcel configuration to disable parcel relation validation.


Now you need to restart the Cloudera Manager services using the following commands:

sudo service cloudera-scm-agent stop
sudo service cloudera-scm-server stop
sudo service cloudera-scm-server start
sudo service cloudera-scm-agent start

Once Cloudera Manager has restarted, check for the Spark 2 parcel and click on the Download button to download Spark 2 to the Cloudera VM.


After the download is complete click on the Distribute button to distribute it to the cluster.

Once distribution is done click on the Activate button to activate the parcel.
Once activation is done add the Spark 2 service to the cluster.






After the service was added, my first run of Spark 2 failed. The error says that we need to install the CDH parcel.


Now we need to install the CDH 5 parcel in Cloudera Manager. Click on the Download button. The parcel is 1.7 GB in size so the download will take some time.


After downloading CDH 5, click on the Distribute button.

Click on the Activate button once the CDH parcel is distributed.

Restart the cluster after the CDH 5 parcel is activated.

The Spark 2 service failed to start with an error saying Java 8 is required, so we need to set the Java home directory for all hosts.





After setting the Java home directory to Java 8, restart the Spark 2 service. It should start successfully.


Saturday, 6 April 2019

Extending cloudera quickstart VM disk size

The Cloudera quickstart VM comes with a 64GB disk. That is not enough for storing big data files in the VM. This post shows how you can increase the disk size for a Cloudera virtual machine running in VirtualBox.

First you need to change the size of the disk in VirtualBox's "Virtual Media Manager". Please shut down your virtual machine before resizing the virtual disk.

Launch the "Virtual Media Manager".
Change the size of the disk from 64GB to 200GB and click the Apply button.


Now start the machine. Once the virtual machine has booted, look at the block devices present in it by running the lsblk command. Notice that the size of drive sda is now 200GB but the size of dm-0 is still 55GB. The dm-0 device is mounted as the root partition; we need to resize this partition and the filesystem on it.

We will be using the fdisk command to resize the partition. Please take a backup of your data before doing this step; any mistake can make your virtual disk unusable and you may lose all your data. Run fdisk on /dev/sda and print the partition table.

Now delete partition number 2. We will create this partition again with a larger size.

Now create partition 2 again with a larger size, using all the free space on the disk for the new partition.


Now write the partition table to the disk. After writing the partition table you need to restart the virtual machine. The keystrokes for the whole fdisk session are sketched below.
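Here is a sketch of the fdisk session described above. The defaults fdisk offers for the first and last sector are exactly what we want, because the new partition must start where the old sda2 started and end at the end of the disk (prompts can differ slightly between fdisk versions):

sudo fdisk /dev/sda
p          (print the partition table and note where sda2 starts)
d          (delete a partition)
2          (partition number 2)
n          (create a new partition)
p          (primary)
2          (partition number 2)
<Enter>    (accept the default first sector, the same as the old start of sda2)
<Enter>    (accept the default last sector, the end of the disk)
t          (optionally set the partition type back)
2
8e         (Linux LVM)
w          (write the table and exit)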

After restarting the virtual machine we need to resize the LVM physical volume which hosts the root partition. You can resize the physical volume using the sudo lvm pvresize /dev/sda2 command.

After resizing the physical volume we need to resize the logical volume using the sudo lvresize -l +100%FREE /dev/vg_quickstart/lv_root command. This command grows the logical volume by all the remaining free space in the volume group. Now you can see that LSize is 191.50G.

Now we need to extend the filesystem on the lv_root logical volume. The mounted filesystem can be resized using the sudo resize2fs /dev/mapper/vg_quickstart-lv_root command.
The size of the root partition is now 189GB, so you can store large files on this virtual disk and use the virtual machine for processing them.