Can I open multiple files on hdfs and write data to them in parallel and then close them at the end?

+1 vote
481 views
Can I open multiple files on hdfs and write data to them in parallel and then close them at the end?
posted May 12, 2014 by Luv Kumar
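The pattern being asked about can be illustrated with the FileSystem API: a single client may keep several output streams open at once, provided each file has only one writer. Below is a minimal sketch (my own illustration, not from the post), assuming a NameNode at hdfs://namenode:8020; the paths, thread count, and data are made up.

```java
// Minimal sketch: one writer thread per file, each file gets its own
// FSDataOutputStream, and every stream is closed when its thread finishes.
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelHdfsWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical NameNode address
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        List<Thread> writers = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            final Path path = new Path("/tmp/parallel-demo/part-" + i);
            Thread t = new Thread(() -> {
                // Each thread opens, writes, and closes its own file.
                try (FSDataOutputStream out = fs.create(path, true)) {
                    out.writeBytes("data for " + path + "\n");
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            writers.add(t);
            t.start();
        }
        for (Thread t : writers) {
            t.join(); // wait until every file has been written and closed
        }
        fs.close();
    }
}
```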


Similar Questions
0 votes

My cluster runs HDFS 2.2 stable (2 HA NameNodes, 10 DataNodes). I ran bin/hdfs dfs -setrep -R 2 / to change the replication factor from 1 to 2.

I found that HDFS is replicating the under-replicated blocks, but it works very slowly, at about 1 block per second.

I have about 400000 under-replicated blocks, so it will take about 4 more days. Is there any way to speed it up?
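For reference, re-replication speed is throttled on the NameNode side; in Hadoop 2.x the knobs usually tuned for this are dfs.namenode.replication.work.multiplier.per.iteration and dfs.namenode.replication.max-streams in hdfs-site.xml (a NameNode restart is needed to pick up changes). The sketch below only reads the current values through the client Configuration API; verify the key names against your Hadoop version before relying on them.

```java
// Illustrative sketch only: print the throttling settings that govern how fast
// the NameNode schedules re-replication work. Key names are from Hadoop 2.x and
// should be verified against your version; tuning them means editing
// hdfs-site.xml on the NameNode and restarting it.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class ReplicationThrottleCheck {
    public static void main(String[] args) {
        Configuration conf = new HdfsConfiguration(); // loads hdfs-site.xml from the classpath
        String[] keys = {
            "dfs.namenode.replication.work.multiplier.per.iteration", // blocks scheduled per DN per heartbeat
            "dfs.namenode.replication.max-streams",                   // soft limit on replication streams per DN
            "dfs.namenode.replication.max-streams-hard-limit"         // hard limit per DN
        };
        for (String key : keys) {
            System.out.println(key + " = " + conf.get(key));
        }
    }
}
```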

+3 votes

According to the code, the current implementation of HDFS supports only one block placement policy, which is BlockPlacementPolicyDefault by default. The default policy is sufficient for most circumstances, but in some special cases it does not work well.

For example, on a shared cluster we may want to erasure-encode all the files under certain directories, so the files under those directories need a new placement policy, while other files continue to use the default one. This is why HDFS needs to support multiple placement policies.

One straightforward idea is to keep the default placement policy configured as the default, and let users specify a customized placement policy through extended attributes (xattrs). When HDFS chooses replica targets, it would first check for a customized placement policy and, if none is specified, fall back to the default one. Any thoughts?
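A rough sketch of what this proposal could look like from the client side is given below. The xattr name and the policy-selection helper are hypothetical (today's NameNode does not consult such an attribute); the sketch only illustrates the tag-and-fall-back idea described above.

```java
// Sketch of the idea only: a directory is tagged with the desired placement
// policy via an extended attribute, and anything untagged falls back to the
// default policy. The xattr key and choosePolicy() helper are hypothetical.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PlacementPolicyTagging {
    // hypothetical xattr key used to mark a directory
    static final String POLICY_XATTR = "user.block.placement.policy";

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path ecDir = new Path("/data/erasure-coded");
        fs.mkdirs(ecDir);

        // Tag the directory; files created under it would use the custom policy.
        fs.setXAttr(ecDir, POLICY_XATTR, "erasure-coding".getBytes("UTF-8"));

        System.out.println(choosePolicy(fs, new Path(ecDir, "file1")));
        System.out.println(choosePolicy(fs, new Path("/data/plain/file2")));
    }

    // Hypothetical selection logic: read the xattr from the parent directory
    // if present, otherwise fall back to the default policy name.
    static String choosePolicy(FileSystem fs, Path file) throws Exception {
        try {
            byte[] value = fs.getXAttr(file.getParent(), POLICY_XATTR);
            if (value != null) {
                return new String(value, "UTF-8");
            }
        } catch (Exception e) {
            // xattr not set (or directory missing) -> use the default
        }
        return "BlockPlacementPolicyDefault";
    }
}
```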

0 votes

I was trying to implement a Hadoop/Spark audit tool, but I ran into a problem: I can't get the input file location and file name. I can get the username, IP address, time, and user command from hdfs-audit.log, but when I submit a MapReduce job, I can't see the input file location in either the Hadoop logs or the Hadoop ResourceManager.

Does Hadoop have an API or a log that contains this information, perhaps through some configuration? If it does, what should I configure?
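One hedged way to recover the input location for MapReduce jobs that use FileInputFormat is to read the job's own submitted configuration, where the input directories are stored under mapreduce.input.fileinputformat.inputdir. The sketch below is an illustration of that idea, not an existing audit API; the job ID is a placeholder.

```java
// Illustrative sketch: fetch a submitted job's configuration from the cluster
// and read the FileInputFormat input directories from it. Assumes the job
// uses FileInputFormat; the job ID below is a placeholder.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobID;

public class JobInputPathProbe {
    public static void main(String[] args) throws Exception {
        Cluster cluster = new Cluster(new Configuration());
        Job job = cluster.getJob(JobID.forName("job_1400000000000_0001")); // placeholder job ID
        if (job != null) {
            // Only meaningful for jobs that use FileInputFormat.
            String inputDirs = job.getConfiguration().get("mapreduce.input.fileinputformat.inputdir");
            System.out.println("input paths: " + inputDirs);
        }
        cluster.close();
    }
}
```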

+1 vote

Can anyone please explain what we mean by STREAMING DATA ACCESS in HDFS?

Data is usually copied to HDFS, where it is split into blocks distributed across DataNodes.
Say, for example, I have an input file of 10240 MB (10 GB) and a block size of 64 MB. Then there will be 160 blocks.

These blocks are distributed across the DataNodes. The mappers then read data from these DataNodes with the DATA LOCALITY feature in mind (i.e. blocks local to a DataNode are read by the map tasks running on that DataNode).

Can you please point out where "streaming data access in HDFS" comes into the picture here?
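As I understand it, "streaming data access" refers to HDFS being designed for workloads that read a file sequentially from start to finish (as each map task does with its local blocks), rather than for low-latency random access. A minimal client-side sketch of such a sequential read is shown below; the NameNode address and path are placeholders.

```java
// Minimal sketch of what "streaming data access" looks like from the client
// side: the file is read sequentially from start to end, not seeked randomly.
// The NameNode address and path are placeholders.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SequentialRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        long total = 0;
        try (FSDataInputStream in = fs.open(new Path("/data/input/big-file"))) {
            byte[] buffer = new byte[64 * 1024];
            int read;
            // Keep reading forward until EOF -- the whole file is consumed as a stream.
            while ((read = in.read(buffer)) != -1) {
                total += read;
            }
        }
        System.out.println("streamed " + total + " bytes sequentially");
        fs.close();
    }
}
```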

...