Are there any benchmarks or 'performance heuristics' for Hadoop?

+1 vote
563 views

Are there any benchmarks or 'performance heuristics' for Hadoop? Is it possible to say something like 'You can process X lines of a gzipped log file on a medium AWS server in Y minutes'? I would like to get an
idea of what kind of workflow is possible.

posted Feb 24, 2014 by Dewang Chaudhary

3 Answers

+1 vote

The TeraSort benchmark is probably the most common. Its mappers and reducers are essentially identity functions, so you exercise only the framework's shuffle and merge-sort machinery.
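
If you want a rough baseline, here is a minimal sketch of driving TeraGen and TeraSort from Python and timing them; the examples-jar path, row count, and HDFS directories are assumptions you will need to adjust for your own distribution.

    import subprocess
    import time

    # Path to the MapReduce examples jar -- an assumption; CDH, HDP, and a
    # plain Apache tarball all place it differently.
    EXAMPLES_JAR = "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
    ROWS = 10_000_000                       # TeraGen rows are 100 bytes each, so ~1 GB
    GEN_DIR = "/benchmarks/terasort-input"
    OUT_DIR = "/benchmarks/terasort-output"

    def run_timed(args):
        """Run a hadoop command and return its wall-clock time in seconds."""
        start = time.time()
        subprocess.run(args, check=True)
        return time.time() - start

    # Generate synthetic input, then sort it with the framework's shuffle and
    # merge sort; the two elapsed times are your baseline numbers.
    gen_secs = run_timed(["hadoop", "jar", EXAMPLES_JAR, "teragen", str(ROWS), GEN_DIR])
    sort_secs = run_timed(["hadoop", "jar", EXAMPLES_JAR, "terasort", GEN_DIR, OUT_DIR])
    print(f"teragen: {gen_secs:.0f}s, terasort: {sort_secs:.0f}s")

Rerun it after each configuration change to see how the sort time moves.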

answer Feb 24, 2014 by Amit Parthsarthi
+1 vote

If you want to do profiling on your Hadoop cluster, the Starfish project might be interesting. You can find more info at http://www.cs.duke.edu/starfish/

answer Feb 26, 2014 by Jagan Mishra
0 votes

http://sortbenchmark.org/

It doesn't cover just Hadoop, but the methodology may give you an idea of what you're looking for.
There are too many variables to pin down a "general" average. Every job runs differently on every cluster: the machines can be heterogeneous builds with heterogeneous per-machine configs, the cluster has configs that may or may not override the machine-level ones, and on top of that the job submitter can specify runtime parameters...

Things like the type of data being processed affect the amount of disk I/O, network traffic, etc. required, which in turn depend on the underlying hardware components...

Throwing more nodes at a problem will usually make it faster, but how much faster depends...
The best way to read your cluster is to establish a benchmark operation that models your expected use case (or one of them), then adjust things on the cluster and see what tips the time, spill, network traffic, etc. one way or the other.
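
As a sketch of that approach (assuming a hypothetical job jar, main class, and HDFS paths), you could wrap a representative job in a small timing harness and rerun it after each configuration change:

    import statistics
    import subprocess
    import time

    # Placeholders -- substitute a job and input that model your real workload.
    JOB_CMD = ["hadoop", "jar", "my-analytics-job.jar", "com.example.LogAnalysis",
               "/data/sample-logs", "/tmp/bench-out"]
    RUNS = 3

    timings = []
    for i in range(RUNS):
        # Remove the previous output directory so the job can run again
        # (on Hadoop 1.x use "hadoop fs -rmr" instead).
        subprocess.run(["hdfs", "dfs", "-rm", "-r", "/tmp/bench-out"], check=False)
        start = time.time()
        subprocess.run(JOB_CMD, check=True)
        timings.append(time.time() - start)
        print(f"run {i + 1}: {timings[-1]:.0f}s")

    # Compare the median before and after each change (reducer count,
    # io.sort.mb, compression, extra nodes, ...) and watch the job counters
    # for spilled records and shuffle bytes as well.
    print(f"median: {statistics.median(timings):.0f}s over {RUNS} runs")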

answer Feb 25, 2014 by Luv Kumar
Similar Questions
+2 votes

Currently I'm developing an application that would ingest logs on the order of 70-80 GB of data per day and then do some analysis on them.

The infrastructure that I have is a 4-node cluster (all nodes on virtual machines), and each node has 4 GB of RAM.

But when I run a dataset of about 30 GB (a sample dataset at this point), it takes about 3 hours to process all of it.

I would like to know whether it is normal for this kind of infrastructure to take this amount of time.
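
For a rough sense of what those numbers imply (taking the 30 GB, 3 hours, and 4 nodes from the question at face value), a back-of-envelope throughput calculation looks like this:

    # Back-of-envelope throughput implied by the numbers in the question.
    data_gb = 30      # sample dataset size
    hours = 3         # observed runtime
    nodes = 4         # VMs in the cluster

    total_mb_per_s = data_gb * 1024 / (hours * 3600)
    per_node_mb_per_s = total_mb_per_s / nodes
    print(f"aggregate: {total_mb_per_s:.1f} MB/s, per node: {per_node_mb_per_s:.1f} MB/s")

    # Roughly 2.8 MB/s aggregate (~0.7 MB/s per node), which is well below what
    # a single disk can stream sequentially, so the limit is more likely the
    # 4 GB VMs or the job itself than the raw data volume.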

+2 votes

The setup consists of Hadoop 1.0.1 and HBase 0.94.x. Loading raw data into HDFS and then into HBase consumes a good amount of time for 10 TB of raw data (using the Hadoop shell's copyFromLocal and a Pig script to load HBase).

  1. Will moving to Hadoop 2.x improve performance? If yes, please provide relevant links or docs that explain how this is achieved.

  2. My second question: I do not need my data sorted while loading into HBase, so in what ways can I disable the sort at the mapper and at the reducer?

Any suggestions?

+1 vote

I'm trying to implement security on my Hadoop data. I'm using Cloudera Hadoop and am looking for the following:

  1. Role-based authorization and authentication

  2. Encryption of data residing in HDFS

I have looked into Kerberos, but it doesn't provide encryption for data already residing in HDFS. Are there any other security tools I can go for? Has anyone implemented the above two security features in Cloudera Hadoop?

...