How does Hadoop 2.x perform better than Hadoop 1.x?

The setup consists of Hadoop 1.0.1 and HBase 0.94.x. Loading 10 TB of raw data into HDFS and then into HBase takes a considerable amount of time (using the Hadoop shell command copyFromLocal for HDFS and a Pig script to load HBase).

My first question is whether moving to Hadoop 2.x will improve performance. If yes, please provide relevant links or docs that explain how this is achieved.
My second question: I do not need my data sorted while loading it into HBase, so how can I disable sorting at the Mapper and at the Reducer?

Any suggestions?

posted Dec 6, 2013 by Seema Siddique

Similar Questions

+2 votes

How to migrate Apache Hadoop 1.x HDFS to Apache Hadoop 2.x HDFS

Apache Hadoop includes HDFS Federation.
Does anyone know how to migrate Apache Hadoop 1.x HDFS to Apache Hadoop 2.x HDFS?

I am getting the following error:

$ bin/hdfs start namenode --config $HADOOP_CONF_DIR -upgrade -clusterId 
Error: Could not find or load main class start

+1 vote

Are there any benchmarks or 'performance heuristics' for Hadoop?

Are there any benchmarks or 'performance heuristics' for Hadoop? Is it possible to say something like 'You can process X lines of GZipped log file on a medium AWS server in Y minutes'? I would like to get an idea of what kind of workflow is possible.

+2 votes

Estimating the time of a Hadoop job?

Currently I'm developing an application that will ingest logs on the order of 70-80 GB of data per day and then do some analysis on them.

The infrastructure I have is a 4-node cluster (all nodes on virtual machines); each node has 4 GB of RAM.

But when I run the dataset (a sample dataset at this point) of about 30 GB, it takes about 3 hours to process all of it.

I would like to know whether it is normal for this kind of infrastructure to take this amount of time.
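
For a rough sense of scale, the figures quoted in this question can be turned into a throughput estimate. The sketch below is only back-of-the-envelope arithmetic, not a benchmark: it assumes processing time scales linearly with input size and that the load is spread evenly over the 4 nodes, neither of which is stated in the question. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope estimate from the numbers in the question.
# Assumptions (not from the question): time scales linearly with input
# size, and work is spread evenly across all nodes.

SAMPLE_GB = 30.0     # sample dataset size quoted in the question
SAMPLE_HOURS = 3.0   # observed processing time for that sample
NODES = 4            # cluster size quoted in the question


def throughput_gb_per_hour(size_gb, hours):
    """Observed cluster-wide throughput in GB per hour."""
    return size_gb / hours


def estimated_hours(size_gb, gb_per_hour):
    """Projected processing time at a fixed throughput."""
    return size_gb / gb_per_hour


cluster_rate = throughput_gb_per_hour(SAMPLE_GB, SAMPLE_HOURS)

# Convert GB/hour to MB/second, then divide across the nodes.
per_node_mb_s = cluster_rate * 1024 / 3600 / NODES

# Projected time for the expected daily volume of 70-80 GB.
daily_low = estimated_hours(70.0, cluster_rate)
daily_high = estimated_hours(80.0, cluster_rate)

print(f"cluster throughput: {cluster_rate:.1f} GB/hour")
print(f"per-node throughput: {per_node_mb_s:.2f} MB/s")
print(f"projected daily run: {daily_low:.1f} to {daily_high:.1f} hours")
```

Under those assumptions the daily 70-80 GB load would take a large part of a working day at the observed rate, and the per-node figure works out to well under 1 MB/s, which gives a concrete number to compare against whatever the disks and network on the VMs can sustain.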

+4 votes

Hadoop upgrading/migrating downtime from Apache Hadoop 1.x to 2.x

I want to know whether, while upgrading/migrating from Apache Hadoop 1.x to 2.x (MRv2/YARN) in a production cluster of several nodes, there is any *ANTICIPATED DOWNTIME* that one needs to be aware of.

+1 vote

What configuration parameters cause a Hadoop 2.x job to run on the cluster?

Assume I have a machine on the same network as a Hadoop 2 cluster but separate from it.

My understanding is that by setting certain elements of the config file or local XML files to point to the cluster, I can launch a job without having to log into the cluster, move my jar to HDFS, and start the job from the cluster's Hadoop machine.

Does this work? What parameters do I need to set? Where is the jar file? What issues would I see if the machine is running Windows with Cygwin installed?

...
