Hadoop: What's the best way to check the compression codec that an HDFS file was written with?

+2 votes
891 views

We use both Gzip and Snappy compression, so I want a way to determine how a specific file is compressed. The closest I found is GETCODEC, but that relies on the file name suffix ... which doesn't exist, since reducers typically don't add a suffix to the filenames they create.
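For example, the suffix-based lookup I mean looks roughly like this (a sketch using CompressionCodecFactory; the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class SuffixLookup {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // Works only when the file name carries a known suffix.
        CompressionCodec withSuffix = factory.getCodec(new Path("/data/file.gz"));
        System.out.println(withSuffix);   // prints a GzipCodec instance

        // Typical reducer output has no suffix, so the lookup gives nothing back.
        CompressionCodec noSuffix = factory.getCodec(new Path("/output/part-00000"));
        System.out.println(noSuffix);     // null
    }
}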

posted Dec 4, 2013 by Luv Kumar


1 Answer

+1 vote

If you're looking for header/contents-based inspection, you could download the file and run the Linux utility 'file' on it, which should tell you the format.

I don't know about Snappy, but Gzip files can be identified simply by the magic sequence in their header bytes.
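If you'd rather do that check without pulling the whole file down, something along these lines should work (a minimal sketch against the Hadoop FileSystem API; the class name and argument handling are illustrative, not from this thread):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GzipMagicCheck {
    public static void main(String[] args) throws IOException {
        // Path to the HDFS file to inspect, e.g. /output/part-00000
        Path file = new Path(args[0]);

        Configuration conf = new Configuration();
        FileSystem fs = file.getFileSystem(conf);

        // Read just the first two bytes of the file.
        byte[] header = new byte[2];
        try (FSDataInputStream in = fs.open(file)) {
            in.readFully(0, header);
        }

        // Gzip streams always start with the magic bytes 0x1f 0x8b.
        boolean isGzip = (header[0] & 0xff) == 0x1f && (header[1] & 0xff) == 0x8b;
        System.out.println(file + (isGzip ? " looks like gzip" : " is not gzip"));
    }
}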

If it's sequence files you are looking to analyse, a simple way is to read the first few hundred bytes, which should contain the codec class name. Programmatically you can use
https://hadoop.apache.org/docs/r1.0.4/api/org/apache/hadoop/io/SequenceFile.Reader.html#getCompressionCodec() for sequence files.
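A short driver along these lines should print the codec class for a sequence file (a sketch using the Hadoop 1.x SequenceFile.Reader constructor shown in that Javadoc; the class name and argument handling are illustrative):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;

public class SequenceFileCodecCheck {
    public static void main(String[] args) throws IOException {
        Path file = new Path(args[0]);   // e.g. /output/part-00000

        Configuration conf = new Configuration();
        FileSystem fs = file.getFileSystem(conf);

        // Opening the reader parses the header; it throws an IOException
        // if the file is not actually a sequence file.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
        try {
            CompressionCodec codec = reader.getCompressionCodec();
            if (codec == null) {
                System.out.println(file + " is not compressed");
            } else {
                // e.g. org.apache.hadoop.io.compress.GzipCodec or SnappyCodec
                System.out.println(file + " uses " + codec.getClass().getName());
            }
        } finally {
            reader.close();
        }
    }
}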

answered Dec 5, 2013 by Majula Joshi
Similar Questions
+3 votes

I am trying to access a Hadoop 1 installation via the Hadoop 2.2.0 command-line tools. I am wondering if this is possible at all.

From Hadoop 1 I get:

$ hadoop fs -ls hdfs://127.0.0.1:9000/
Found 2 items
drwxr-xr-x - cs supergroup 0 2014-02-01 08:18 /tmp
drwxr-xr-x - cs supergroup 0 2014-02-01 08:19 /user

From Hadoop 2.2.0 I get:

$ hadoop fs -ls hdfs://127.0.0.1:9000/
ls: Failed on local exception: java.io.EOFException; Host Details : 
local host is: "i7/127.0.1.1"; destination host is: "localhost":9000;

I have tried to find this information via a web search, but so far without success.

0 votes

The reason behind this is that I want a custom user who can create anything on the entire HDFS file system (/).
I tried a couple of links, but none of them were useful. Is there any way I can do that by adding/modifying some property tags?

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log. But when I submit a MapReduce job, I can't see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or log that contains this information, and if so, what should I configure?

...