Hadoop Pi Example in Yarn

+2 votes
394 views

How does the Pi example determine the number of mappers? I thought the only way to determine the number of mappers is via the number of file splits in the input. So, for instance, if the input size is 100 MB and the file split size is 20 MB, I would expect 100/20 = 5 map tasks.

posted Dec 18, 2013 by Meenal Mishra


1 Answer

+1 vote

A map task is created for each input split in your dataset. By default, an input split corresponds to one HDFS block: if a file consists of 1 HDFS block, 1 map task is started; if a file consists of N blocks, N map tasks are started for that file (assuming default settings).
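The counting rule is just a ceiling division over the block size; a minimal sketch (not Hadoop code, only the arithmetic):

```java
// Sketch: how the number of map tasks falls out of the input splits.
// With default settings: one split per HDFS block, one map task per split.
public class SplitCount {
    // Number of map tasks for a file of fileSize units with the given split size.
    static long mapTasks(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        // The example from the question: 100 MB input, 20 MB splits
        System.out.println(SplitCount.mapTasks(100, 20)); // prints 5
    }
}
```

A 101 MB file with the same split size would get 6 map tasks, since the last 1 MB still occupies its own split.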

PiEstimator generates its own input files. When you submit a PiEstimator job, you specify how many map tasks you want to run. Before submitting the job to the cluster, it generates that number of input files in HDFS, and one map task is started per file. Interestingly, each file contains only a single line.

You can see the code here: http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/org.apache.hadoop/hadoop-examples/0.20.2-cdh3u1/org/apache/hadoop/examples/PiEstimator.java#PiEstimator
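Stripped of the Hadoop machinery, the Monte Carlo idea inside PiEstimator looks roughly like this. This is only a sketch: the real class splits the sampling across map tasks exactly as described above, but the method names here are made up for illustration:

```java
import java.util.Random;

// Sketch of the Monte Carlo estimate behind PiEstimator: each "map task"
// samples points in the unit square and counts how many land inside the
// quarter circle; the "reduce" step turns the combined ratio into pi.
public class PiSketch {
    // One "map task": draw n points with the given seed, count hits inside.
    static long sample(long n, long seed) {
        Random r = new Random(seed);
        long inside = 0;
        for (long i = 0; i < n; i++) {
            double x = r.nextDouble(), y = r.nextDouble();
            if (x * x + y * y <= 1.0) inside++;
        }
        return inside;
    }

    // The "reduce" step: combine hit counts from all tasks into an estimate.
    static double estimate(long totalInside, long totalSamples) {
        return 4.0 * totalInside / totalSamples;
    }

    public static void main(String[] args) {
        long perTask = 100_000, tasks = 5, inside = 0;
        for (long t = 0; t < tasks; t++) inside += sample(perTask, t);
        System.out.println(estimate(inside, perTask * tasks));
    }
}
```

This is why the number of map tasks is a job parameter rather than a property of the input data: more tasks simply means more samples drawn in parallel.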

answer Dec 18, 2013 by Sheetal Chauhan
Similar Questions
+1 vote

It is very interesting that Hadoop can work with Docker, so I am doing some trials with the patch from YARN-1964.

I applied the patch yarn-1964-branch-2.2.0-docker.patch from JIRA YARN-1964 on branch 2.2, and I am going to install a Hadoop cluster using the newly generated tarball that includes the patch.

Then I think I can use DockerContainerExecutor, but I do not know much about its usage and have the following questions:

  1. After installation, what are the detailed configuration steps to adopt DockerContainerExecutor?

  2. How can I verify whether an MR task is really launched in a Docker container rather than a plain YARN container?

  3. Which Hadoop branch will officially include Docker support?
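For question 1, a hedged sketch of the direction the configuration takes. The property names below are assumptions taken from the DockerContainerExecutor as later documented upstream; the patched 2.2 branch may use different names, so verify against the patch itself:

```xml
<!-- yarn-site.xml (property names assumed from the later upstream
     DockerContainerExecutor; verify against the YARN-1964 patch) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.DockerContainerExecutor</value>
</property>
<property>
  <!-- path to the docker binary on each NodeManager host -->
  <name>yarn.nodemanager.docker-container-executor.exec-name</name>
  <value>/usr/bin/docker</value>
</property>
```

For question 2, one simple check is to run `docker ps` on a NodeManager host while the job is running and see whether containers for the task appear.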

+1 vote

We are currently facing a frustrating Hadoop Streaming memory problem. Our setup:

  • our compute nodes have about 7 GB of RAM
  • Hadoop Streaming starts a bash script which uses about 4 GB of RAM
  • therefore it is only possible to run one task per node

Out of the box, each Hadoop instance starts about 7 containers with default settings. Each Hadoop task forks a bash script that needs about 4 GB of RAM; the first fork works, but all subsequent forks fail because they run out of memory. So what we are looking for is a way to limit the number of containers to one. What we found on the internet:

  • set yarn.scheduler.maximum-allocation-mb and mapreduce.map.memory.mb to values such that there is at most one container. This means mapreduce.map.memory.mb must be more than half of the maximum memory (otherwise there will be multiple containers).

Done right, this gives us one container per node, but it produces a new problem: since our Java process is now using at least half of the maximum memory, the child (bash) process we fork inherits the parent's memory footprint, and since the parent was using more than half of total memory, we run out of memory again. If we lower the map memory, Hadoop allocates 2 containers per node, which run out of memory too.

Since this problem is a blocker in our current project, we are evaluating adapting the source code as a last resort. Any ideas on this are very much welcome.
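Not an answer from the thread, but a sketch of the configuration direction described above, with illustrative values for a 7 GB node (the exact numbers are assumptions to tune). The idea is to request one large container but keep the JVM heap small, so the forked bash script gets the rest; relaxing the virtual-memory check is a commonly suggested companion, since fork() of a large JVM briefly doubles its virtual size:

```xml
<!-- yarn-site.xml: give YARN a budget that fits only one container
     (illustrative value for a ~7 GB node) -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>4608</value>
</property>
<!-- fork() of a big JVM can trip YARN's virtual-memory check even when
     physical RAM is fine; disabling it is a common workaround -->
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<!-- mapred-site.xml: the map task requests the whole budget, but the
     JVM heap stays small so most of the memory is left for the fork -->
<property>
  <name>mapreduce.map.memory.mb</name>
  <value>4608</value>
</property>
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx512m</value>
</property>
```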

+2 votes

I am using containerLaunchContext.setCommands() to add the different commands I want to run in the container, but only the first command gets executed. Is there something else I need to do?

List<String> commands = new ArrayList<>();
commands.add(cmd1);
commands.add(cmd2);

I can see that only cmd1 gets executed.
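One workaround commonly suggested for this situation (not confirmed in this thread) is to hand the container a single shell command that chains the steps, rather than relying on each list entry running in sequence; here `cmd1` and `cmd2` stand in for the real commands:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: collapse several container commands into one shell line so the
// container runs them all; "&&" stops the chain if an earlier step fails.
public class ChainCommands {
    static String chain(List<String> commands) {
        return String.join(" && ", commands);
    }

    public static void main(String[] args) {
        List<String> commands = Arrays.asList("cmd1", "cmd2");
        // This single string would be the only entry passed to setCommands()
        System.out.println(chain(commands)); // prints: cmd1 && cmd2
    }
}
```

Using `;` instead of `&&` would run later commands even when an earlier one fails.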

+3 votes

A few questions about the new Hadoop release regarding YARN:

  1. Does YARN need to run on the same machines that host the HDFS services, or can HDFS be remote from the YARN cluster? Is this done by placing the remote HDFS cluster's configuration files (core-site.xml and hdfs-site.xml) on the YARN cluster's machines?

  2. According to http://www.i-programmer.info/news/197-data-mining/6518-hadoop-2-introduces-yarn.html, Hadoop 2.2.0 supports Microsoft Windows. How do you (or can you) configure YARN for secure container isolation on Windows? It seems that ContainerExecutor and DefaultContainerExecutor can detect and run on Windows, but the secure LinuxContainerExecutor is for *nix systems, so is there anything in place for maximum security like the LCE?

  3. If the answer to 1 is yes, is it possible to have a mixed cluster of Linux and Windows machines running YARN and working together?
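If the approach in question 1 works, it would amount to little more than pointing fs.defaultFS on the YARN machines at the remote NameNode; a sketch, where the hostname and port are placeholders:

```xml
<!-- core-site.xml on the YARN cluster's machines; "remote-nn:8020"
     is a placeholder for the remote NameNode address -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://remote-nn:8020</value>
</property>
```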

...