How to set mapreduce.input.fileinputformat.split.maxsize for a specific job?

1 Answer

You can either pass them as command-line arguments using the -D option, assuming your job implements the standard Tool interface:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/util/Tool.html

Or you can set them in code, using the various 'set' methods to set key/value pairs on the configuration object.

...
Job job = Job.getInstance(getConf());
job.setJarByClass(MyJob.class);

job.getConfiguration().set("<property-name>",<value>);
....

Docs for Configuration class:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html

This will work as long as the property is not marked final in the cluster's configuration files.

answer May 16, 2015 by anonymous
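
For the -D route in the answer above, the generic options go after the jar (and main class, if any) and before the job's own arguments, for example: hadoop jar example.jar -D mapreduce.input.fileinputformat.split.maxsize=102400 inputpath outputpath. The sketch below is only a simplified, hypothetical stand-in written in plain Java (the class GenericOptionsSketch is not part of Hadoop) to show the idea: "-D key=value" pairs are pulled out into configuration properties, and the remaining arguments are what the job's run() method would receive. In a real driver, this parsing is done for you when the Tool is launched through ToolRunner.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in, NOT Hadoop code: mimics how "-D key=value" pairs are
// separated from the job's own arguments before the job sees them.
final class GenericOptionsSketch {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> remainingArgs = new ArrayList<>();

    GenericOptionsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            boolean isPair = args[i].equals("-D")
                    && i + 1 < args.length
                    && args[i + 1].indexOf('=') > 0;
            if (isPair) {
                String pair = args[i + 1];
                int eq = pair.indexOf('=');
                // Values are kept as strings, like entries in a configuration file.
                properties.put(pair.substring(0, eq), pair.substring(eq + 1));
                i++; // skip the key=value token that was just consumed
            } else {
                remainingArgs.add(args[i]);
            }
        }
    }

    public static void main(String[] args) {
        GenericOptionsSketch parsed = new GenericOptionsSketch(new String[] {
            "-D", "mapreduce.input.fileinputformat.split.maxsize=102400",
            "inputpath", "outputpath"
        });
        System.out.println("properties = " + parsed.properties);
        System.out.println("job args   = " + parsed.remainingArgs);
    }
}
```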

Thanks. Is this the correct way to write it?
conf.set("mapreduce.input.fileinputformat.split.maxsize", "102400");
or
job.getConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", "102400");

I think another way is:
FileInputFormat.setMaxInputSplitSize(null, 102400);

Is this all right? Do both serve the same purpose, or do they do something different?

commented May 16, 2015 by Sudhakar Singh

What do you think is the type of the property value that you are trying to write? Is it a string? Or numeric? Have you checked the documentation of the Configuration class that I sent earlier?

There are multiple setXXX methods depending on the type of the property value being set:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/conf/Configuration.html#setLong(java.lang.String, long)

For the other case below, why are you passing null as the job object (first parameter)?
FileInputFormat.setMaxInputSplitSize(null, 102400);
Check out the documentation here:
http://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#setMaxInputSplitSize(org.apache.hadoop.mapreduce.Job, long)

Lastly,
conf.set("mapreduce.input.fileinputformat.split.maxsize", "102400");
VS.
job.getConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", "102400");
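is just a matter of how you are referencing the configuration object: either through its own reference or through a chained call from the job object. That is a programming-style decision and has no bearing on the result, provided conf is modified before the Job is created from it; Job.getInstance(conf) takes a copy of the configuration, so later changes to conf are not seen by the job.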

commented May 17, 2015 by anonymous

Thanks a lot.
Sorry, the null was typed unintentionally; it was meant to be the job object. So I can use one of the following:
conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 102400);
or
FileInputFormat.setMaxInputSplitSize(job, 102400);

commented May 18, 2015 by Sudhakar Singh
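
To see what a value such as 102400 does in practice, it may help to look at how the split size is derived. As far as I know, the new-API FileInputFormat picks the split size as max(minSize, min(maxSize, blockSize)). The sketch below reproduces that rule in plain Java; the class SplitSizeSketch and its helper names are hypothetical, and the 128 MiB block size, the minimum size of 1 and the 1 GiB file are assumptions chosen only for illustration. The split count is a rough ceiling and ignores the small slack Hadoop allows on the last split.

```java
// Hypothetical helper, not part of Hadoop: reproduces the rule
// splitSize = max(minSize, min(maxSize, blockSize)) in plain Java.
final class SplitSizeSketch {

    static long computeSplitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Rough number of splits for one file (ceiling division); ignores the
    // slack that Hadoop tolerates on the final split.
    static long approximateSplitCount(long fileLength, long splitSize) {
        return (fileLength + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;    // assumed HDFS block size
        long fileLength = 1024L * 1024 * 1024;  // assumed 1 GiB input file
        long minSize = 1L;                      // assumed minimum split size
        long maxSize = 102400L;                 // value used in the thread above

        long splitSize = computeSplitSize(minSize, maxSize, blockSize);
        System.out.println("split size  = " + splitSize + " bytes");
        System.out.println("split count ~ " + approximateSplitCount(fileLength, splitSize));
    }
}
```

Under these assumptions a 100 KB maximum turns a single 1 GiB file into roughly ten thousand map tasks, so the value is worth choosing with the input size in mind.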

Similar Questions

+3 votes

How to find execution time of a MapReduce job?

Date date; long start, end; // for recording start and end time of job
date = new Date(); start = date.getTime(); // starting timer

job.waitForCompletion(true);

date = new Date(); end = date.getTime(); //end timer
log.info("Total Time (in milliseconds) = "+ (end-start));
log.info("Total Time (in seconds) = "+ (end-start)*0.001F);

I am not sure this is the correct way to measure it. Is there any other method or API to find the execution time of a MapReduce job?
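
One possible variation on the timing code above, sketched in plain Java: System.nanoTime() is a monotonic clock, so the measurement is not affected if the system time is adjusted while the job runs. The class JobTimerSketch and the Runnable stand-in are hypothetical; in a real driver the timed section would wrap the job.waitForCompletion(true) call and handle its checked exceptions. Note that this measures wall-clock time as seen by the client, including submission and polling overhead.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: times any blocking piece of work with a monotonic clock.
final class JobTimerSketch {

    // Runs the task and returns the elapsed wall-clock time in milliseconds.
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in for the blocking job.waitForCompletion(true) call.
        long elapsed = timeMillis(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("Total time (in milliseconds) = " + elapsed);
        System.out.println("Total time (in seconds) = " + elapsed / 1000.0);
    }
}
```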

+1 vote

How to stop a mapreduce job from terminal running on Hadoop Cluster?

To run a job we use the command
$ hadoop jar example.jar inputpath outputpath
If a job is taking too long and we want to stop it midway, which command should be used? Or is there any other way to do that?

+2 votes

How to find min, max and mean of wordcount from text file in hadoop mapreduce?

public class MaxMinReducer extends Reducer {
int max_sum=0; 
int mean=0;
int count=0;
Text max_occured_key=new Text();
Text mean_key=new Text("Mean : ");
Text count_key=new Text("Count : ");
int min_sum=Integer.MAX_VALUE; 
Text min_occured_key=new Text();

 public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
       int sum = 0;           

       for (IntWritable value : values) {
             sum += value.get();
             count++;
       }

       if(sum < min_sum)
          {
              min_sum= sum;
              min_occured_key.set(key);        
          }     


       if(sum > max_sum) {
           max_sum = sum;
           max_occured_key.set(key);
       }          

       mean=max_sum+min_sum/count;
  }

 @Override
 protected void cleanup(Context context) throws IOException, InterruptedException {
       context.write(max_occured_key, new IntWritable(max_sum));   
       context.write(min_occured_key, new IntWritable(min_sum));   
       context.write(mean_key , new IntWritable(mean));   
       context.write(count_key , new IntWritable(count));   
 }
}

Here I am writing the minimum, maximum and mean of the word count.

My input file:

high low medium high low high low large small medium

Expected output is:

high - 3------maximum

low - 3--------maximum

large - 1------minimum

small - 1------minimum

But I am not getting the above output. Can anyone please help me?
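
A few things in the reducer above appear to work against the expected output: it remembers only one key per extreme, so ties such as high/low are lost; max_sum+min_sum/count is not a mean; and the class extends the raw Reducer type, so the reduce method shown may not override the framework's reduce at all. The sketch below is plain Java without Hadoop, with hypothetical names (WordCountStatsSketch and its helpers), and only illustrates the intended logic on the input from the question: count the words, then report every word that shares the maximum or minimum count, and compute the mean as total occurrences divided by distinct words. It assumes non-empty input.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free illustration of min/max/mean over word counts.
final class WordCountStatsSketch {

    // Counts whitespace-separated words; TreeMap keeps keys sorted like reducer input.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns every word whose count equals the target, so ties are kept.
    static List<String> wordsWithCount(Map<String, Integer> counts, int target) {
        List<String> words = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() == target) {
                words.add(e.getKey());
            }
        }
        return words;
    }

    // Mean occurrences per distinct word.
    static double mean(Map<String, Integer> counts) {
        long total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return (double) total / counts.size();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts =
            countWords("high low medium high low high low large small medium");
        int max = Collections.max(counts.values());
        int min = Collections.min(counts.values());
        System.out.println("maximum " + max + ": " + wordsWithCount(counts, max));
        System.out.println("minimum " + min + ": " + wordsWithCount(counts, min));
        System.out.println("mean: " + mean(counts));
    }
}
```

In a reducer, the same idea would mean collecting the per-key sums in reduce() and emitting the tied keys and the mean from cleanup(), as the original code already tries to do.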

+1 vote

Can we run a MapReduce job from the Eclipse IDE on a fully distributed Hadoop cluster?

A MapReduce job can be run as a jar file from the terminal or directly from the Eclipse IDE. When a job is run as a jar file from the terminal, it uses multiple JVMs and all the resources of the cluster. Does the same thing happen when we run it from the IDE? I have run a job both ways, and it takes less time from the IDE than as a jar file from the terminal.

0 votes

How to get info about which data in hdfs or file system that a MapReduce job visits?

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this information, perhaps enabled through some configuration? If it does, what should I configure?
