Showing posts with label lists. Show all posts

Tuesday, 3 January 2012

Compare two lists 101

This is probably Programming 101, and should be (in analog form) Philology 101, but the combination of the two somehow seems to fall through the cracks. Also, there was a question about it today on the Digital Medievalist mailing list.

So, the problem for today is: we want to compare two lists of words.

(Let's say we have made one list by sorting all words from a text alphabetically, and then discarding all multiple occurrences -- save the first one, of course.)
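
A minimal sketch of how such a list could be made with standard command-line tools. The file names (text.txt, list1) and the sample sentence are only placeholders, and lowercasing the words is my own assumption, added so that "Arma" and "arma" count as one word. Note too that some versions of tr handle only plain ASCII letters, so accented characters may need extra care.

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Hypothetical sample text; in practice this would be your own file
printf 'Arma virumque cano, Troiae qui primus ab oris\narma gravi numero\n' > text.txt

# 1. tr -cs: turn every run of non-letters into a single newline (one word per line)
# 2. tr: lowercase everything (an assumption, see above)
# 3. sort -u: sort alphabetically and keep only one copy of each word
tr -cs '[:alpha:]' '\n' < text.txt | tr '[:upper:]' '[:lower:]' | sort -u > list1

cat list1
```

The resulting list1 has one word per line, which is the shape the comparison tools expect.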

Does position in a list matter, asks the programmer. No, answers the philologist -- we only want to know whether a word from list A also occurs in list B. Do the rows have multiple fields which should be compared, asks the programmer. No, there is only one field, one word, answers the philologist, thinking about finding a word from the text in a dictionary. Our wishes are modest. (Later, of course, we'll want to find where exactly the common words appear in documents A and B.)

If you use Excel, there seem to be some recipes at The Spreadsheet Page and elsewhere. There is also a Perl module List::Compare (if you are a brave philologist, and have a book or two on Perl handy, you can learn much from the problem). Finally, if you are an eccentric philologist and use Linux, there are standard text manipulation tools for Linux. Yes, here is where we realized that philology is much like programming: both are all about texts.

Surprisingly, the main problem in using all these tools (at least for me) turned out to be how to send a list to the tool, how to loop through all elements in a list, etc. Programming kindergarten, I guess -- but philologists don't usually have to think about how to turn the pages or how to scan lines of text, much less to issue instructions such as "now lift the hand... spread the thumb and index finger... catch the page edge lightly... lift again..." (I know, even programmers don't do it anymore these days; Dada used to do it when she was studying electrical engineering.)

So how do I actually compare two lists? Here is one of my Bash scripts:

egrep -f list1 list2 > resultlist

As you can see, it takes great wisdom and sophistication.

egrep, today equivalent to grep -E, is a utility which finds words (strings) in a file. With the -f option, it reads a list of queries from a file (where every line is a query). The file list2 is the one to be searched (actually it does not have to be sorted -- it can simply be the original text; list1 also does not have to be sorted, by the way). Note that every query is matched anywhere in a line, so a short word will also be found inside longer ones. The "greater than" sign is a command to send output to a file (called resultlist); without it, results would just fly across our screen.
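
If only exact matches should count, a stricter variant may be worth trying. The sketch below is not the script I actually use. It is an illustration with made-up sample lists, showing what the standard grep options -F (take each query literally, not as a regular expression) and -x (match only when the whole line is identical) change. The names resultlist_loose and resultlist_exact are placeholders.

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Made-up sample lists, one word per line
printf 'art\npetri\nthomas\n' > list1
printf 'bartholomeo\npetri\nthomas\nungarie\n' > list2

# As in the original command: each line of list1 is a pattern,
# and it may match anywhere inside a line of list2
grep -E -f list1 list2 > resultlist_loose    # "art" also finds "bartholomeo"

# -F: treat the lines of list1 as plain strings
# -x: accept a match only if it covers the whole line
grep -F -x -f list1 list2 > resultlist_exact # only "petri" and "thomas"

cat resultlist_exact
```

When list2 is running text rather than a word list, -w (match whole words) in place of -x is probably the closer fit.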

And basically, that's all there is to it. Try not to be frustrated if something goes wrong, look for recipes and explanations on the internet, and remember that you cannot (hopefully) break anything in your computer by experimenting with these kinds of commands.

Monday, 2 January 2012

A list of names

Once we start thinking about lists and tinkering with them (and I've been doing this for a long time now), it turns out that another interesting list to compile would be a list of names from a text. Then, if we cross-reference two texts, we can look for names which occur in both.
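
One possible way to do the cross-referencing, assuming the names from each text are already in a file, one per line. This is a sketch with invented sample lists (names_a, names_b and their contents are placeholders, not the real data), using the standard comm utility, which expects both inputs to be sorted.

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Invented name lists standing in for two texts
printf 'thomas\npetri\nalberti\n' > names_a
printf 'petri\ncarolus\nthomas\n' > names_b

# comm needs sorted input; -u also drops repeated names
sort -u names_a > names_a.sorted
sort -u names_b > names_b.sorted

# -12 hides columns 1 and 2, leaving only the lines found in both files
comm -12 names_a.sorted names_b.sorted > common_names

# -23 hides columns 2 and 3, leaving the names found only in the first file
comm -23 names_a.sorted names_b.sorted > only_in_a

cat common_names
```

Unlike the grep approach, comm compares whole lines only, and it can also report what is unique to either list.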

Here is such a list of names common to both F. de Diversi's description of Dubrovnik (1440) and A. Crijevic Tubero's history of his times (1520). As you'll see, the list is more than just words. Every item is a link to a search in the CroALa collection -- not just to texts by Diversi and Tubero, but to all currently included texts (this could, of course, be fine-tuned).

  1. albertI
  2. albertUs
  3. alemanUs
  4. alexandrIE
  5. andreE
  6. aUstrIE
  7. bartholomeo
  8. blasII
  9. boemIE
  10. bosnenses
  11. carolUs
  12. chrIstIanorUm
  13. constantInopolIs
  14. contareno
  15. cremonensem
  16. dalmatIa
  17. dalmatIE
  18. epIdaUrI
  19. epIdaUrII
  20. epIdaUrUm
  21. francIscI
  22. francorUm
  23. hUngarIE
  24. IllyrIco
  25. ItalI
  26. ItalIcIs
  27. ItalIco
  28. IUlII
  29. lacromE
  30. laUrentII
  31. leonardUs
  32. marIE
  33. marIam
  34. martInI
  35. medIolanI
  36. mIchEl
  37. mIchElIs
  38. neapolI
  39. neapolItanE
  40. neapolItanam
  41. nIcolaI
  42. nIcolao
  43. nIcolaUm
  44. petrI
  45. petrUm
  46. posonIo
  47. rhacUsanE
  48. rhacUsanIs
  49. salomonIs
  50. sIcIlIE
  51. sIgIsmUndI
  52. sIgIsmUndo
  53. sIgIsmUndUm
  54. thomas
  55. UngarIE


The words look funny because Philologic, the open-source text engine which searches and serves CroALa texts, uses special uppercase characters to find orthographical variants. "UngarIE" will find Vngariae and Ungarię and Ungarie and Vngarye (if there is such a form).
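
To get a feeling for the idea, here is a rough imitation with sed and grep. It is emphatically not how Philologic works internally. The mapping (U for u or v, I for i or y, E for ae, ę or e) is my guess from the single "UngarIE" example above, and the file names forms and variants are placeholders.

```shell
# Work in a scratch directory so nothing of yours gets overwritten
cd "$(mktemp -d)"

# Sample word forms to search, one per line
printf 'Vngariae\nUngarię\nUngarie\nVngarye\nHungarie\nBosnenses\n' > forms

# Hypothetical mapping, inferred only from the "UngarIE" example:
# each special uppercase letter becomes a group of alternatives
pattern=$(printf '%s\n' 'UngarIE' |
  sed -e 's/U/(u|v)/g' -e 's/I/(i|y)/g' -e 's/E/(ae|ę|e)/g')

echo "$pattern"   # the search pattern built from "UngarIE"

# -E: extended regular expressions, -i: ignore case, -x: whole line must match
grep -E -i -x "$pattern" forms > variants

cat variants
```

With this sample, the four Ungarie spellings are kept, while "Hungarie" and "Bosnenses" are left out.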