How to limit the number of containers requested by a Pig script?

+1 vote
699 views

I would like to know how I can limit the number of concurrent containers requested (and used, of course) by my Pig script. I do not want to do this through a YARN queue configuration or anything similar; I want to limit it on a per-job basis, ideally by setting the number inside the Pig script itself. Can I do this?

posted Oct 21, 2014 by anonymous


1 Answer

+1 vote

As far as I understand, you cannot directly control the number of mappers. The number of reducers you can control via the PARALLEL keyword. The number of containers on a node is determined by cluster-level settings such as yarn.nodemanager.resource.memory-mb, which is set on the cluster.

The following properties, however, can be overridden from your script and set to different values:
mapreduce.map.memory.mb and
mapreduce.reduce.memory.mb
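
For illustration, here is a minimal Pig sketch that sets those memory properties and caps the reducer count; the relation names, paths, and memory values are only examples, not the original poster's job:

-- request 2 GB containers for map tasks and 4 GB for reduce tasks
-- (example values; tune for your cluster)
SET mapreduce.map.memory.mb '2048';
SET mapreduce.reduce.memory.mb '4096';

-- limit the number of reducers (and thus reduce containers) to 5
SET default_parallel 5;

logs    = LOAD 'input/logs' AS (user:chararray, bytes:long);
grouped = GROUP logs BY user PARALLEL 5;  -- PARALLEL can also be set per operator
totals  = FOREACH grouped GENERATE group, SUM(logs.bytes);
STORE totals INTO 'output/totals';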

answer Oct 21, 2014 by Vijay Shukla
Are you saying that we can't change the number of mappers per job through the script? Because if we invoke the job through the command line or code, I think we can; there is the property mapreduce.job.maps.
What I understand so far is that in Pig you cannot decide how many mappers will run; that is determined by an optimization based on the number of input files, block sizes, and so on. What you can control is the number of reducers, via the PARALLEL directive. You can certainly SET mapreduce.job.maps, but I am not sure what effect it will have. That is what I remember from the documentation.
Hope this helps
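
To illustrate the point in the comment above: mapreduce.job.maps can be set from a Pig script, but the actual mapper count is normally derived from the input splits, so the value may only act as a hint or be ignored entirely:

-- hint at the number of map tasks (often ignored; input splits usually decide)
SET mapreduce.job.maps 10;

-- the reducer count, by contrast, is honored
SET default_parallel 4;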
Similar Questions
+1 vote

After upgrading to Hadoop 2 (YARN), I found that mapred.jobtracker.taskScheduler.maxRunningTasksPerJob no longer works. Is that right?

One workaround is to use a queue to limit it, but that is not easy to control from the job submitter.

Is there any way to limit the number of concurrently running mappers per job? Any documentation or pointers?
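
As a rough sketch of the queue workaround mentioned above: the submission queue can be chosen per job, for example from a Pig script. The queue name below is hypothetical and must already exist with a capacity limit configured by the cluster admin:

-- send this job to a capacity-limited queue (queue name is hypothetical)
SET mapreduce.job.queuename 'small_jobs';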

+4 votes

I am hitting the following issue; it is still open and there are no suggested workarounds:

DFSClient#DFSInputStream#blockSeekTo may leak socket connection.

https://issues.apache.org/jira/browse/HDFS-5493

Does anyone know of a workaround?

0 votes

My cluster runs HDFS 2.2 stable (2 HA namenodes, 10 datanodes). I ran the command bin/hdfs dfs -setrep -R 2 / (to change the replication factor from 1 to 2).

I found that HDFS is indeed replicating the under-replicated blocks, but it works very slowly, at about one block per second.

I have about 400,000 under-replicated blocks, so it will take about four more days. Is there any way to speed it up?

...