
Tuesday, 14 April 2020

The ACL permission error while exporting CloudWatch logs to S3

Yesterday I struggled for more than 6 hours to export CloudWatch logs to an S3 bucket. I was getting the following error:
The ACL permission for the selected bucket is not correct. The Amazon S3 bucket must reside in the same region as the log data that you want to export. Learn more.


I tried following all the steps mentioned in the link, but it still did not work. Later on I found the mistake; it is an interesting one, so I am writing it up in my blog so that you don't make the same mistake.

The linked documentation says that you need to set the following policy on the S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
      {
          "Action": "s3:GetBucketAcl",
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::my-exported-logs",
          "Principal": { "Service": "logs.us-west-2.amazonaws.com" }
      },
      {
          "Action": "s3:PutObject" ,
          "Effect": "Allow",
          "Resource": "arn:aws:s3:::my-exported-logs/random-string/*",
          "Condition": { "StringEquals": { "s3:x-amz-acl": "bucket-owner-full-control" } },
          "Principal": { "Service": "logs.us-west-2.amazonaws.com" }
      }
    ]
}

Here my-exported-logs is the bucket name and needs to be replaced with your bucket name, and us-west-2 needs to be replaced with your region code (for Mumbai it is ap-south-1).

The page says that random-string can be replaced with any random string, which makes you believe that this string is not important, but that is wrong. It is the most important string for exporting logs to S3. The random string which you use in the bucket policy needs to be provided as the S3 bucket prefix while exporting logs to the S3 bucket. If you don't provide the S3 bucket prefix, or provide a different prefix, then you get the ACL error, because the policy grants s3:PutObject permission only on the random-string directory, so if we try to put the logs in some other directory it will fail. The following export configuration works.
The only difference between the working and the non-working dialog box is random-string being provided as the S3 bucket prefix. I learned it the hard way by wasting 6 hours.
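
If you do the export with the AWS CLI instead of the console, the same rule applies: the --destination-prefix must be exactly the random-string used in the bucket policy. A rough sketch (the log group name and time range here are placeholders):

aws logs create-export-task \
    --task-name "my-export-task" \
    --log-group-name "my-log-group" \
    --from 1586649600000 \
    --to 1586736000000 \
    --destination "my-exported-logs" \
    --destination-prefix "random-string"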

Wednesday, 5 February 2020

Kubernetes Error: Error response from daemon: invalid mode

Recently I was trying to mount a persistent volume in a Kubernetes pod. I faced the error "Error: Error response from daemon: invalid mode". After struggling for a long time and searching the internet I was able to solve this problem, so here I am writing down the solution. I was trying to follow the tutorial https://kubernetes.io/docs/tasks/configure-pod-container/configure-persistent-volume-storage/ . The only difference was that I was trying it on Windows rather than Linux.

I tried with the following persistance_volume.yaml file:
 
apiVersion: v1
kind: PersistentVolume
metadata:
  name: task-pv-volume
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "c:/jasvant/kubernetes/persistence_volume/data"

My directory on Windows was present at "C:\jasvant\kubernetes\persistence_volume\data", so I replaced all '\' with '/'. But it did not work and the pod container failed to start.

C:\Users\jasvant\kubernetes>kubectl get pod
NAME                                  READY   STATUS                 RESTARTS   AGE
hello-minikube-6fb6cb79cc-8drt9       1/1     Running                6          130d
kubernetes-bootcamp-dd569fc9c-bt74m   1/1     Running                5          96d
kubernetes-bootcamp-dd569fc9c-fl64v   1/1     Running                5          96d
task-pv-pod                           0/1     CreateContainerError   0          62s


The container failed to start with CreateContainerError. When I described the pod I got the following error:

C:\Users\jasvant\kubernetes>kubectl describe pod  task-pv-pod
Name:         task-pv-pod
Namespace:    default
Priority:     0
Node:         minikube/10.0.2.15
Start Time:   Wed, 05 Feb 2020 14:27:59 +0530
Labels:       
Annotations:  
Status:       Pending
IP:           172.17.0.7
Containers:
  task-pv-container:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CreateContainerError
    Ready:          False
    Restart Count:  0
    Environment:    
    Mounts:
      /usr/share/nginx/html from task-pv-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-scccw (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  task-pv-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  task-pv-claim
    ReadOnly:   false
  default-token-scccw:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-scccw
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                 From               Message
  ----     ------     ----                ----               -------
  Normal   Scheduled  2m14s               default-scheduler  Successfully assigned default/task-pv-pod to minikube
  Normal   Pulled     14s (x8 over 2m8s)  kubelet, minikube  Successfully pulled image "nginx"
  Warning  Failed     14s (x8 over 2m8s)  kubelet, minikube  Error: Error response from daemon: invalid mode: /usr/share/nginx/html
  Normal   Pulling    4s (x9 over 2m13s)  kubelet, minikube  Pulling image "nginx"


This error message fails to give any hint about the actual problem. The root cause is that hostPath.path needs to be a path on the VM created by Minikube, not a directory path on the Windows machine, so C: was not accepted by Minikube. It is, however, possible to mount a Windows directory onto the VM and use that directory as the persistent volume storage directory. By default, the 'C:\Users' directory from the Windows machine is mounted at /c/Users in the VM. So "/c/Users/jasvant/kubernetes/persistence_volume/data" will work for me, but if you create a directory outside of 'C:\Users' then it will not work. Please refer to https://minikube.sigs.k8s.io/docs/tasks/mount/ for more information.
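
In other words, the only change needed in persistance_volume.yaml is the hostPath section, which must use the path as seen inside the Minikube VM:

  hostPath:
    path: "/c/Users/jasvant/kubernetes/persistence_volume/data"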

If you want to log in to the VM created by Minikube for running the Kubernetes cluster, you can use the "minikube ssh" command on your Windows machine. It will open an SSH session to the VM and you can browse the file system there.
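
For example, after the fix you can verify from your Windows machine that the directory is visible inside the VM:

minikube ssh
# now inside the Minikube VM
ls /c/Users/jasvant/kubernetes/persistence_volume/data
exit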




Monday, 9 September 2019

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

I got the following error while trying to run Spark Streaming code for one of my assignments.

java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

When I searched the internet I found that it is a known bug which is fixed in the latest version. The link to the bug is: https://jira.apache.org/jira/browse/SPARK-19185
It is fixed in version 2.4.0.

I used the following pom.xml file to fix this problem.
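
The relevant part is simply pinning the Spark artifacts to 2.4.0 or later. A minimal sketch of the dependencies section, assuming Scala 2.11 and the Kafka 0.10 integration, looks like this:

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
    <version>2.4.0</version>
  </dependency>
</dependencies>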

Running Map-Reduce program on AWS EMR

In this post we will learn how to run a Map-Reduce program on an AWS EMR cluster. This post is a continuation of my earlier post:
https://blog.bigdatawithjasvant.com/2019/09/setting-up-machine-code-and-data-for.html
Please read it before continuing.

The video for this post is available below:

For running any job on AWS EMR, the code and input data need to be in S3 storage. There should be a directory where log files will be stored. The output directory should not exist, because the job is going to create it. If the output directory already exists then the Map-Reduce program will fail.
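
For example, with the bucket used later in this post, you can make sure the output path is clear before starting the job:

# should list nothing; if it does, remove the old output first
aws s3 ls s3://bigdatawithjasvant-emr/output/
aws s3 rm s3://bigdatawithjasvant-emr/output --recursive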



We put the jar file containing the code for the Map-Reduce program in S3 storage.


The input for the Map-Reduce program is also placed in S3 storage.


Now open the EMR service console in AWS.


Now create a new cluster. Provide a name for the cluster and a directory where log files for the cluster will be placed; these files can be used later for analyzing the reason for a failure. For adding steps to the cluster, choose "Step type" as "Custom JAR" and click on the "Configure" button to specify the JAR file which contains our Map-Reduce program and the arguments for the program.



In the pop-up dialog box select the jar file which contains the code. In the arguments box provide the following arguments.


com.wordcount.WordCountHashPartitioner

s3://bigdatawithjasvant-emr/bigtext

s3://bigdatawithjasvant-emr/output


The first line contains the main class name. The second line contains the location of the input files. The third line contains the location of the output directory. The output directory should not already exist; if it does, the job will fail. Select "Action on failure" as "Terminate the cluster" because we don't want to keep the cluster running if our Map-Reduce program fails.


Now we select the number of nodes in the cluster. I will be using 5 m5.xlarge nodes for the cluster: one will be used as the master node and 4 as core nodes of the cluster.



Once we click on the "Create cluster" button, the cluster will be created and started. It will take around 3 minutes for the cluster to come up. You can keep clicking the refresh button to update the status.
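
If you prefer scripting over clicking through the console, the same cluster and step can be created in one AWS CLI call. This is only a sketch: the cluster name, release label and jar path here are my assumptions, so replace them with whatever you actually used.

aws emr create-cluster \
    --name "wordcount-cluster" \
    --release-label emr-5.29.0 \
    --applications Name=Hadoop \
    --use-default-roles \
    --instance-type m5.xlarge \
    --instance-count 5 \
    --log-uri s3://bigdatawithjasvant-emr/logs/ \
    --auto-terminate \
    --steps Type=CUSTOM_JAR,Name=WordCount,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://bigdatawithjasvant-emr/jar/wordcount.jar,Args=[com.wordcount.WordCountHashPartitioner,s3://bigdatawithjasvant-emr/bigtext,s3://bigdatawithjasvant-emr/output]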


Once the cluster is in the running state, we can see the status of the steps which we configured for the cluster.



Once our step is in the running state we can click "View jobs" to see the jobs running as part of that step.


To see the details of the tasks of our job we need to click on the "View tasks" link.


You can see the total number of tasks, and how many are completed, running and pending, on the tasks page. As you can see, 13 tasks are currently running. There are 4 cores in each of the 4 core instances, for a total of 16 cores and 64 GB RAM in the core instances. I am not sure about the size of one execution slot in EMR in terms of RAM and CPU, but the number of running tasks peaked at 13, which hints that the total number of slots was close to 13.

You can look at the console logs of the job from the web interface.


You can also look at the controller logs using the web console.


Once the Map-Reduce job step is complete, the cluster shuts down automatically. In my case it took around 5 minutes for my program to run, and around 5 minutes were taken in starting up and shutting down.

AWS EMR is a good option if you have a lot of data and want to process it quickly. You can launch a cluster with up to 20 instances and tear it down once it is no longer needed. In our case we were billed for 10 minutes for each of the 5 instances, plus the EMR usage fee for 10 minutes for each of the 5 instances. In total you are billed for 50 instance-minutes plus 50 minutes of EMR usage fee, but you save around 40 minutes of time by running your program on a multi-node cluster.

Sunday, 1 September 2019

Setting up code and data for AWS EMR

This post is a continuation of my earlier post. Please read that post before reading this one. The link to that post is:
https://blog.bigdatawithjasvant.com/2019/09/configuring-aws-cli.html

In this post I am going to tell you how to set up the code and data for running a Map-Reduce program on an AWS EMR cluster. I am going to use a sample word count program to run on the EMR cluster. In this post I am concentrating on running the program on EMR and not on how to write a Map-Reduce program. So let's get started.

Video:


Step 1: Download Word count program

Download word count program from 
and extract it in a folder called workspace. I am using D:\workspace in my example

Now start Eclipse with D:\workspace as the workspace.

Click on Import project in  Eclipse.

Select the project type as Existing Maven Project.


Select project folder.

Build the project.

The JAR file will be created in the target folder.
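
If you prefer the command line over Eclipse, running the standard Maven build from the project folder produces the same jar in the target folder:

mvn clean package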

Step 2: Download data file to AWS instance

Download the https://www.bigdatawithjasvant.com/blogdata/00/0000/data/bigtext.tar.gz file to the AWS instance using the following commands.

curl -O https://www.bigdatawithjasvant.com/blogdata/00/0000/data/bigtext.tar.gz
tar -xvzf bigtext.tar.gz


Run the ./generate.sh command to generate the bigtext.dat file. It will be created by repeating complet_work.txt 700 times.
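
The generate.sh script ships with the archive; conceptually it just concatenates the same file 700 times, something along these lines:

#!/bin/bash
# create bigtext.dat by repeating complet_work.txt 700 times
rm -f bigtext.dat
for i in $(seq 1 700); do
    cat complet_work.txt >> bigtext.dat
done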

Step 3: Grant permissions on S3 buckets to the AWS CLI user

Open IAM in the AWS console and click on Users.

Select the user to edit.

Click on the Add permissions button.

Click on Attach existing policies directly.


Select the S3 permissions and add them to the user.


Click the Add permissions button.
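
The same grant can also be done from the command line, for example by attaching the managed S3 full-access policy (the user name here is a placeholder):

aws iam attach-user-policy \
    --user-name your-cli-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess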


Step 4: Upload jar file and input data to S3

Create a new bucket for yourself where you can upload the test data and jar file. The bucket name needs to be unique across AWS. I have used the following command:
aws s3 mb s3://bigdatawithjasvant-emr

Copy the bigtext.dat file to the S3 bucket using the following command:

aws s3 cp bigtext.dat  s3://bigdatawithjasvant-emr/bigtext/


Now log in to the AWS console, go to the S3 service and click on the bucket.

Create a logs folder in the S3 bucket.

Create a jar folder in the S3 bucket.

Click on the jar folder.

Click on the Upload button and then click on select file in the upload dialog box.


Select the jar file to upload.

Click on the Upload button to upload the jar file containing the Map-Reduce program.

Now you can see the uploaded jar file in the S3 bucket.
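
The console upload can also be done with the same aws s3 cp command we used for the data; the jar file name below is just an assumption for whatever Maven produced in your target folder:

aws s3 cp target/wordcount.jar s3://bigdatawithjasvant-emr/jar/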

Now we are all set to execute the Map-Reduce program on the EMR cluster. The code and input data are already available in the S3 bucket.