Tuesday 3 January 2012

Compare two lists 101

This is probably Programming 101, and should be (in analog form) Philology 101, but the combination of the two somehow seems to fall through the cracks. Also, there was a question about it today on the Digital Medievalist mailing list.

So, the problem for today is: we want to compare two lists of words.

(Let's say we have made one list by sorting all words from a text alphabetically, and then discarding all multiple occurrences -- save the first one, of course.)
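Such a list can be produced with the standard text tools alone. What follows is only a minimal sketch, not necessarily how the list above was made: the file names and the sample text are invented for illustration, and folding everything to lower case is my own assumption (without it, "Arma" and "arma" would count as two different words).

```shell
# Hypothetical sample text; in practice this would be your own file.
workdir=$(mktemp -d)
printf 'Arma virumque cano, Troiae qui primus ab oris\nItaliam fato profugus; arma cano.\n' > "$workdir/text.txt"

# 1. tr -cs: turn every run of non-letters into one newline (one word per line)
# 2. tr upper/lower: fold to lower case (an assumption, see above)
# 3. sort -u: sort alphabetically and keep a single copy of each word
tr -cs '[:alpha:]' '\n' < "$workdir/text.txt" \
  | tr '[:upper:]' '[:lower:]' \
  | sort -u > "$workdir/list1"

cat "$workdir/list1"
```

Be aware that tr may not treat accented or other non-ASCII letters as letters, so a text with diacritics probably needs a different first step.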

Does position in a list matter, asks the programmer. No, answers the philologist -- we only want to know whether a word from list A also occurs in list B. Do the rows have multiple fields which should be compared, asks the programmer. No, there is only one field, one word, answers the philologist, thinking about finding a word from the text in a dictionary. Our wishes are modest. (Later, of course, we'll want to find where exactly the common words appear in documents A and B.)

If you use Excel, there seem to be some recipes at The Spreadsheet Page and elsewhere. There is also a Perl module List::Compare (if you are a brave philologist, and have a book or two on Perl handy, you can learn much from the problem). Finally, if you are an eccentric philologist and use Linux, there are standard text manipulation tools for Linux. Yes, here is where we realized that philology is much like programming: both are all about texts.

Surprisingly, the main problem in using all these tools (at least for me) turned out to be how to send a list to the tool, how to loop through all elements in a list, etc. Programming kindergarten, I guess -- but philologists don't usually have to think about how to turn the pages or how to scan lines of text, much less to issue instructions such as "now lift the hand... spread the thumb and index finger... catch the page edge lightly... lift again..." (I know, even programmers don't do it anymore these days; Dada used to do it when she was studying electrical engineering.)
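For the curious, here is roughly what "turning the pages by hand" looks like in Bash. It is a sketch under assumptions: the two small lists are invented, and the variable names (workdir, word, common) are mine. The loop reads list1 line by line and asks grep, once per word, whether that exact word is also a line of list2.

```shell
# Two hypothetical word lists, one word per line.
workdir=$(mktemp -d)
printf 'arma\ncano\nvirum\n' > "$workdir/list1"
printf 'cano\nfato\narma\n' > "$workdir/list2"

common=""
# Read list1 one line at a time; IFS= and -r keep each line exactly as written.
while IFS= read -r word; do
  # -F: treat the word as a plain string, -x: the whole line must match,
  # -q: print nothing, only report success or failure
  if grep -Fxq -- "$word" "$workdir/list2"; then
    common="$common$word "
    echo "$word"
  fi
done < "$workdir/list1"
```

This starts grep once for every word, so it is slow for long lists; its only merit is that each step is visible.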

So how do I actually compare two lists? Here is one of my Bash scripts:

egrep -f list1 list2 > resultlist

As you can see, it takes great wisdom and sophistication.

egrep, today the same as grep -E, is a utility which finds strings in a file. With the -f option, it reads its queries from a file (one query per line); here that file is list1. list2 is the file to be searched (it does not actually have to be sorted -- it can simply be the original text; list1 does not have to be sorted either, by the way). Each query is read as a pattern and may match anywhere in a line, so the query cat also finds a line containing category. The "greater than" sign sends the output to a file (called resultlist); without it, the results would just fly across our screen.
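If both files really are word lists (one word per line) and only exact matches should count, a stricter variant may be closer to what the philologist wants. The sketch below is one possibility, not a correction of the script above: the sample lists and the output file names (loose, strict, both, only1) are invented. It puts the loose match next to grep -Fx, which wants the whole line to be equal, and comm, which compares two sorted lists.

```shell
# Two hypothetical lists, already sorted, one word per line.
workdir=$(mktemp -d)
printf 'ab\narma\ncat\n' > "$workdir/list1"
printf 'arma\ncategory\nfato\n' > "$workdir/list2"

# Loose match: every line of list1 is a pattern that may match anywhere in a line.
grep -E -f "$workdir/list1" "$workdir/list2" > "$workdir/loose"

# Strict match: -F takes each line of list1 literally, -x wants the whole line equal.
grep -Fx -f "$workdir/list1" "$workdir/list2" > "$workdir/strict"

# comm needs both inputs sorted; -12 keeps the lines common to both files,
# -23 keeps the lines found only in the first file.
comm -12 "$workdir/list1" "$workdir/list2" > "$workdir/both"
comm -23 "$workdir/list1" "$workdir/list2" > "$workdir/only1"

cat "$workdir/loose" "$workdir/strict" "$workdir/both" "$workdir/only1"
```

The strict variant no longer works when list2 is the original running text, because there a line is rarely a single word; for that case the loose match, or grep's -w option for whole words, is the better fit.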

And basically, that's all there is to it. Try not to be frustrated if something goes wrong, look for recipes and explanations on the internet, and remember that you cannot (hopefully) break anything in your computer by experimenting with this kind of command.
