Hadoop: Does WholeFileInputFormat take the entire input file as input, or each record (input split) as a whole?

0 votes
663 views
posted Jun 28, 2014 by Amit Parthsarthi


2 Answers

+1 vote

It takes the entire file as input; otherwise it wouldn't be any different from the normal line/record-based input formats.

answer Jun 28, 2014 by Bob Wise
+1 vote

It takes the entire file as input. This input format class overrides the isSplitable method to return false. That method determines whether a file can be split into multiple chunks, so returning false means each file becomes a single input split handled by one mapper.
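To make the effect of the splittability check described above concrete, here is a minimal, self-contained sketch in plain Java. It is not Hadoop's actual code: the class SplitSketch and the method computeSplits are hypothetical names invented for illustration, and the model ignores details such as block locations and the tolerance Hadoop applies to the last chunk. It only shows the idea that a non-splittable file yields exactly one split covering the whole file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of how a file-based input format
// turns one file into input splits. Not part of the Hadoop API.
public class SplitSketch {

    // Returns splits as {start, length} pairs.
    // 'splittable' plays the role of the value returned by isSplitable.
    public static List<long[]> computeSplits(long fileLength, long splitSize, boolean splittable) {
        if (fileLength < 0 || splitSize <= 0) {
            throw new IllegalArgumentException("fileLength must be >= 0 and splitSize > 0");
        }
        List<long[]> splits = new ArrayList<>();
        if (!splittable || fileLength == 0) {
            // Whole-file case: a single split spanning the entire file.
            splits.add(new long[] {0L, fileLength});
            return splits;
        }
        // Splittable case: cut the file into consecutive chunks of at most splitSize bytes.
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new long[] {start, Math.min(splitSize, fileLength - start)});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long fileLength = 300 * mb; // assumed example file size
        long splitSize = 128 * mb;  // assumed example split size

        System.out.println("splittable:     " + computeSplits(fileLength, splitSize, true).size() + " splits");
        System.out.println("not splittable: " + computeSplits(fileLength, splitSize, false).size() + " split");
    }
}
```

With the assumed 300 MB file and 128 MB split size, the splittable case produces three chunks, while the non-splittable case produces one split of the full 300 MB, which is the behaviour the answers describe for WholeFileInputFormat.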

answer Jun 28, 2014 by Meenal Mishra
Similar Questions
+2 votes

I am trying to run Nutch 2.2.1 on a two-node Hadoop cluster. The cluster is running fine and I have successfully added the input and output directories to HDFS. But when I run

$HADOOP_HOME/bin/hadoop jar /nutch/apache-nutch-2.2.1.job org.apache.nutch.crawl.Crawler urls -dir crawl -depth 3 -topN 5

I am getting something like:

INFO input.FileInputFormat: Total input paths to process : 0

which, as I understand it, means that Hadoop cannot locate the input files. The job then ends with a null pointer exception.

Can someone help me out?

+4 votes

My requirement is a typical data warehouse and ETL requirement. I need to accomplish the following:

1) Insert transaction records daily into a Hive table or an HDFS file. This table or file is not big (approximately 10 records per day), and I don't want to partition it.

A few articles mention that we need to load the data into a staging table in Hive and then insert it as shown below:

insert overwrite table finaltable select * from staging;

I don't understand this logic. How should I populate the staging table daily?

+1 vote

In the XML configuration files of Hadoop 2.x, "mapreduce.input.fileinputformat.split.minsize" is listed and can be set, but how do I set "mapreduce.input.fileinputformat.split.maxsize" in the XML file? I need to set it in my MapReduce code.

...