How to partition a file into smaller pieces for performing KNN in Hadoop MapReduce

+3 votes
608 views

In a KNN-like algorithm we need to load the model data into a cache in order to predict the records.

Here is the example for KNN.

So if the model is a large file, say 1 or 2 GB, will we be able to load it into the Distributed Cache?

One way is to split/partition the model result into several files, perform the distance calculation for all the records in each file, then find the minimum distance and the most frequently occurring class label, and predict the outcome.

How can we partition the file and perform the operation on these partitions?

i.e. 1st record -> partition1, partition2, ...; 2nd record -> partition1, partition2, ...

This is what came to my mind. Is there any other way? Any pointers would help me.
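Roughly, this is the shape I have in mind (only a rough sketch, not working code: the cache file name queries.txt, the comma-separated record layout, the Euclidean distance, and k = 5 are all assumptions for illustration). Each map split acts as one model partition, the smaller query set rides in the Distributed Cache, and the reducer merges the partial results per query record:

import java.io.*;
import java.util.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

// Each mapper streams one split of the big model file and scores it against the
// (small) query records that were shipped via the Distributed Cache.
class KnnPartitionMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final List<String[]> queries = new ArrayList<>();   // [id, f1, f2, ...]

    @Override
    protected void setup(Context context) throws IOException {
        // Assumes "queries.txt" was added with job.addCacheFile(...) in the driver.
        try (BufferedReader br = new BufferedReader(new FileReader("queries.txt"))) {
            String line;
            while ((line = br.readLine()) != null) {
                queries.add(line.split(","));
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] model = value.toString().split(",");             // [label, f1, f2, ...]
        for (String[] q : queries) {
            double dist = 0;
            for (int i = 1; i < model.length; i++) {              // squared Euclidean distance
                double d = Double.parseDouble(q[i]) - Double.parseDouble(model[i]);
                dist += d * d;
            }
            // key = query id, value = "distance,label" from this model partition
            context.write(new Text(q[0]), new Text(dist + "," + model[0]));
        }
    }
}

// The reducer sees, per query record, the candidates from every model partition,
// keeps only the k nearest, and majority-votes the class label.
class KnnPartitionReducer extends Reducer<Text, Text, Text, Text> {
    private static final int K = 5;                               // assumed k

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Max-heap on distance: the head is always the worst of the current K candidates.
        PriorityQueue<String[]> nearest = new PriorityQueue<>(
                (a, b) -> Double.compare(Double.parseDouble(b[0]), Double.parseDouble(a[0])));
        for (Text v : values) {
            nearest.offer(v.toString().split(","));               // ["distance", "label"]
            if (nearest.size() > K) {
                nearest.poll();                                   // drop the farthest candidate
            }
        }
        // Majority vote over the labels of the K nearest neighbours.
        Map<String, Integer> votes = new HashMap<>();
        for (String[] n : nearest) {
            votes.merge(n[1], 1, Integer::sum);
        }
        String predicted = Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
        context.write(key, new Text(predicted));
    }
}

Since the reducer never holds more than k candidates per query record, combining the partial results stays cheap no matter how many model partitions there are.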

posted Jan 15, 2015 by Deepak Dasgupta

Have you considered implementing it with something like Spark? That could be much easier than raw MapReduce.
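Something along these lines, for example: Spark lets you broadcast the model once to every executor and then map over the test records (just a sketch, assuming the model fits in driver/executor memory; classify() is a placeholder for the actual distance/voting logic):

import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SparkKnnSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("knn-sketch"));

        // Load the model once and broadcast it: one read-only copy per executor.
        List<String> model = sc.textFile(args[0]).collect();
        Broadcast<List<String>> modelBc = sc.broadcast(model);

        // Score every test record against the broadcast model.
        JavaRDD<String> predictions = sc.textFile(args[1])
                .map(record -> classify(record, modelBc.value()));

        predictions.saveAsTextFile(args[2]);
        sc.stop();
    }

    // Placeholder: compute distances to every model row and majority-vote the k nearest labels.
    private static String classify(String record, List<String> model) {
        return "label";
    }
}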

Similar Questions
+2 votes
public class MaxMinReducer extends Reducer {
    int max_sum = 0;
    int mean = 0;
    int count = 0;
    Text max_occured_key = new Text();
    Text mean_key = new Text("Mean : ");
    Text count_key = new Text("Count : ");
    int min_sum = Integer.MAX_VALUE;
    Text min_occured_key = new Text();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;

        // Total the occurrences of the current key.
        for (IntWritable value : values) {
            sum += value.get();
            count++;
        }

        // Track the key with the smallest total seen so far.
        if (sum < min_sum) {
            min_sum = sum;
            min_occured_key.set(key);
        }

        // Track the key with the largest total seen so far.
        if (sum > max_sum) {
            max_sum = sum;
            max_occured_key.set(key);
        }

        mean = max_sum + min_sum / count;
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        context.write(max_occured_key, new IntWritable(max_sum));
        context.write(min_occured_key, new IntWritable(min_sum));
        context.write(mean_key, new IntWritable(mean));
        context.write(count_key, new IntWritable(count));
    }
}

Here I am writing the minimum, maximum, and mean of the word counts.

My input file:

high low medium high low high low large small medium

Expected output is:

high - 3 (maximum)
low - 3 (maximum)
large - 1 (minimum)
small - 1 (minimum)

But I am not getting the above output. Can anyone please help me?
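For comparison, here is a rough sketch (class and field names are made up, not the original code) of one way to report every key that ties for the maximum or minimum, since a single max_occured_key/min_occured_key can only hold one of them:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Illustrative only: tracks every key that ties for the maximum or minimum count.
public class TiedMaxMinReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private int maxSum = 0;
    private int minSum = Integer.MAX_VALUE;
    private final List<String> maxKeys = new ArrayList<>();
    private final List<String> minKeys = new ArrayList<>();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        // A new maximum clears the list; a tie just adds to it.
        if (sum > maxSum) { maxSum = sum; maxKeys.clear(); }
        if (sum == maxSum) { maxKeys.add(key.toString()); }
        // Same for the minimum.
        if (sum < minSum) { minSum = sum; minKeys.clear(); }
        if (sum == minSum) { minKeys.add(key.toString()); }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        for (String k : maxKeys) context.write(new Text(k + " (maximum)"), new IntWritable(maxSum));
        for (String k : minKeys) context.write(new Text(k + " (minimum)"), new IntWritable(minSum));
    }
}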

+1 vote

To run a job we use the command:
$ hadoop jar example.jar inputpath outputpath
If the job takes too long and we want to stop it in the middle, which command should be used? Or is there any other way to do that?
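For reference, one way is to kill the job by its id (the id is printed when the job is submitted, or can be listed first; the exact command varies with the Hadoop version):

$ hadoop job -list
$ hadoop job -kill <job_id>

or, on YARN:

$ yarn application -kill <application_id>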

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log, but when I submit a MapReduce job, I can't see the input file location either in the Hadoop logs or in the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this information through some configuration? If it does, what should I configure?
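Not an audit log, but one possible workaround from inside the job itself (a sketch using standard MapReduce APIs; the class name and the logging are illustrative): the file backing the current input split and the job's configured input directories can be read in the mapper.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Logs which file and which configured input directories this map task is reading.
public class InputPathLoggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Path of the file backing this task's input split (FileInputFormat-based jobs).
        String splitFile = ((FileSplit) context.getInputSplit()).getPath().toString();
        // Input directories as configured on the job.
        String inputDirs = context.getConfiguration().get("mapreduce.input.fileinputformat.inputdir");
        System.err.println("split file = " + splitFile + ", job input = " + inputDirs);
    }
}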

+3 votes

I am looking into the YARN MapReduce internals to try to understand how reduce tasks know which partition of the map output they should read, even when they re-execute after a crash.

I am also looking at the MapReduce source code. Is there any class I should look at to try to understand this question?
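As background (a simplified view, not a pointer to the exact internals classes): the map side assigns every (key, value) pair a partition number via the Partitioner, and reduce task i then fetches partition i from every map output during the shuffle. The default behaviour looks roughly like this:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same idea as the default HashPartitioner: the key decides the partition number,
// and that number identifies which reduce task will pull the pair during the shuffle.
public class SketchPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}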

...