combining files and no dupes

0 votes
413 views

Having looked at "man join", I wasn't sure of its use here.

The number of files is unknown; the only constant is the .list extension.
(For testing purposes I'm only using two.)

cat *.list >> output.joined | sort -u

How can I test whether output.joined is indeed the two lists combined,
with dupes removed?

posted May 14, 2013 by anonymous

2 Answers

0 votes

Give

sort output.joined | uniq --all-repeated

a try. If the output is empty, there are no dupes!
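
The duplicate check can be sketched roughly as follows. The sample files a.list and b.list and the scratch directory are made up for illustration, and the sketch assumes output.joined is produced with "sort -u *.list"; it uses "uniq -d", which prints one copy of each repeated line.

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not from the original post).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'apple\nbanana\n' > a.list
printf 'banana\ncherry\n' > b.list

# Combine every .list file and drop duplicate lines.
sort -u *.list > output.joined

# uniq only compares adjacent lines, so sort first.
# An empty result means no line occurs more than once.
dupes=$(sort output.joined | uniq -d)

# For contrast: the raw concatenation does contain a repeated line.
raw_dupes=$(cat *.list | sort | uniq -d)

if [ -z "$dupes" ]; then
    echo "no dupes in output.joined"
else
    echo "duplicated lines:"
    echo "$dupes"
fi
echo "repeated in the raw input: $raw_dupes"
```

With this sample data the check should report no dupes for output.joined, while the raw concatenation shows "banana" as repeated.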

answer May 14, 2013 by anonymous
0 votes

1) You probably want '>' rather than '>>' if you're only running this once. It makes no difference on a first run, but '>>' appends, so running the command again would add every line to the file a second time.

2) Since you're sending the output of 'cat' to a file, the pipe won't get any input, so you're sorting nothing. If you actually want to capture the output in a file, you can use 'tee':

sort *.list | tee output | sort -u

or just run the two commands separately:

cat *.list > output
sort -u < output

If not, then "cat *.list|sort -u" is enough.
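
One way to test the result, as a minimal sketch: build the de-duplicated list a second time with a different tool and compare the two. The sample files, the scratch directory, and the file name "expected" are assumptions for illustration only. The awk one-liner keeps the first occurrence of each line, so it does not rely on "sort -u".

```shell
#!/bin/sh
# Hypothetical sample data in a scratch directory (not from the original post).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'pear\napple\napple\n' > a.list
printf 'fig\npear\n' > b.list

# Keep the raw concatenation in 'output' and the
# de-duplicated, sorted result in 'output.joined'.
cat *.list | tee output | sort -u > output.joined

# Independent cross-check: awk drops repeated lines, then sort orders them.
awk '!seen[$0]++' *.list | sort > expected

# cmp -s is silent and only sets the exit status.
if cmp -s expected output.joined; then
    result="match"
else
    result="differ"
fi
echo "$result"
```

If the two files match, output.joined holds every distinct line from the inputs exactly once. Comparing "wc -l < output" with "wc -l < output.joined" additionally shows how many duplicate lines were dropped.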

answer May 14, 2013 by anonymous
Similar Questions
+1 vote

I have a roughly 5 GB file where each row is a key, value pair. I would like to use this as a "hashmap" against another large set of files. From searching around, one way to do it would be to turn it into a dbm like DBD and put it into a distributed cache. Another is joining the data. A third is putting it into HBase and using it for lookup.

I'm more familiar with the first approach, so it seems simpler to me. However, I have read that using a distributed cache for files beyond a few megabytes is not recommended because the file is replicated across all the data nodes. This doesn't seem that bad to me because I just pay this overhead once at the beginning of the job, and then each node gets a copy locally, right? If I were to go with a join, would it not increase the workload (more entries) and create the same network congestion issue? And wouldn't going with HBase mean making it a bottleneck?

What are the advantages and disadvantages of going for one solution over the others? What if, for example, that "hashmap" needs to come from, say, a 40 GB file? How would my options change? At which point would each option make sense?

+1 vote

I am having some difficulty using join to display the array elements. Here is the code snippet:

[code]
use strict;
use warnings;
my @fruits = qw/apple mango orange banana guava/;
#print '[', join '][', @fruits;
#print ']';
print '[', join '][', @fruits, ']';
[/code]

[output]
      [apple][mango][orange][banana][guava][]
[/output]

How can I eliminate the last empty square brackets [] from the output using a single print statement? The version with two print statements is shown commented out (the lines starting with #) in the code snippet above. Any help is greatly appreciated.

+1 vote

I am trying to profile an application on the PowerPC architecture.
While profiling, the following error occurred when running the opreport command:

#opcontrol --start
#./exec
#opcontrol --stop
#opcontrol --dump
#opreport

opreport error : no sample files found.try using opcontrol --dump

...