This is a blog about me, Seinfeld, okonomiyaki, Japanese toilet seats, and other things of interest

Saturday, April 28, 2007

Set theory using the shell

What if you have two files, each with unique lines, and you want to figure out which lines exist in only one file, or which lines the two files have in common? Maybe the first thing that comes to mind is to write a program in, say, Java or Perl, but the standard shell tools can be just as efficient, and no coding is necessary.

To find the lines that are in either file (union):

sort file1 file2 | uniq > union.txt

To find the lines that are in both files (intersection):

sort file1 file2 | uniq -d > intersection.txt

To find the lines that are in file1 but not file2:

sort file1 file2 file2 | uniq -u > only_in_file1.txt
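
To see the three commands side by side, here is a small sketch run against two sample files. The file contents and the scratch directory are made up for illustration; the commands themselves are the ones shown above.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two sample files, each with unique lines (contents are illustrative).
printf '%s\n' apple banana cherry > file1
printf '%s\n' banana cherry date > file2

# Union: every line that appears in either file, listed once.
sort file1 file2 | uniq > union.txt

# Intersection: lines that appear twice after merging, i.e. in both files.
sort file1 file2 | uniq -d > intersection.txt

# Difference: file2 is listed twice, so its lines can never be unique.
sort file1 file2 file2 | uniq -u > only_in_file1.txt
```

With these inputs, union.txt should hold apple, banana, cherry and date, intersection.txt should hold banana and cherry, and only_in_file1.txt should hold apple.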

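uniq -d prints only lines that occur more than once and uniq -u prints only lines that occur exactly once, while sort sorts its input file(s). Listing file2 twice in the last command guarantees that every line of file2 occurs at least twice, so only the lines exclusive to file1 survive uniq -u. In my experience this is at least as fast as a program written in Python or Java with input files consisting of millions of lines.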
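
All of this relies on each file having unique lines. If that is not guaranteed, one possible safeguard is to deduplicate each file with sort -u first. The sketch below assumes that approach; the file names file1.uniq and file2.uniq and the sample contents are hypothetical.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# file1 repeats "apple", which would fool uniq -d into reporting it as shared.
printf '%s\n' apple apple banana > file1
printf '%s\n' banana cherry > file2

# Deduplicate each file first; sort -u keeps one copy of every line.
sort -u file1 > file1.uniq
sort -u file2 > file2.uniq

# The same pipelines as before, now run on the deduplicated copies.
sort file1.uniq file2.uniq | uniq -d > intersection.txt
sort file1.uniq file2.uniq file2.uniq | uniq -u > only_in_file1.txt
```

Here intersection.txt should contain only banana and only_in_file1.txt only apple, whereas running uniq -d on the raw files would also have reported apple as common to both.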

2 Comments:

Blogger RoboGeek said...

Ahhh, Grasshopper, you have understood the true depth of the shell and are now a master yourself.
Beware, the power you now have is lethal, and may seem anti-social to all but the greatest of dweebs.

3:49 AM

 
Blogger Pádraig Brady said...

Set theory using the shell

5:03 PM

 
