Sunday 8 January 2012

Relations are interesting

Relations are interesting. A single fact is not interesting. If it remains single, that is. We have to relate something to it to get it going.
A single table is moderately interesting. It invites us at first to build summaries and reports (discarding individual facts in the process; this is, by the way, deeply unsatisfying), and then to compare facts in proximity to each other.
Relations of multiple tables are very interesting.
Beyond that -- wait, what lies beyond that?
So many relations, branches and joins between tables that human beings cannot hold it all in their heads, I guess.

Saturday 7 January 2012

In the neighbourhood of Dubrovnik

One of the many nice capabilities of the PhiloLogic text search system is the collocation table. It shows which words occur most often in the proximity of our search string.

So we gave PhiloLogic an interesting problem. There are many Latin names for the city of Dubrovnik, and even more ways to write these. We wanted to find all of them with a single search, and to see the words which co-occur with all these names.

The search string was:
epI[dt]aUr.*|rh?a[gc]Us.*|dUbr.*. 

(capital U's and I's to find u and v, i and y)

The result is [here], nicely shortened by Google Shortener into goo.gl/PofAI.

What do we learn from the search? That Dubrovnik is an urbs and a ciuitas, that it has principatus and nobiles and senatus. Not much surprise here.

The interesting move is to compare Dubrovnik with Split (Spalatum: [X]). It can be seen at once that there ecclesia and archiepiscopus feature much more prominently.

And so on, until the map is complete.

Thursday 5 January 2012

Rare and Medium

This wintry afternoon I followed in the footsteps of William Whitaker, the author of the WORDS Latin dictionary. The program contains a list of Latin words with very precise lexicographic descriptions -- data on period, area of application, frequency etc. The last part interested me most.

Whitaker, about whom I know almost nothing, but would like to know more (he seems to be outside academia) [1], was very modest and careful in his claims, repeatedly warning users of the program that its philological expertise is limited, that he relied on other authorities and sources, that the program is intended just to be a reading help, not a research tool. Nevertheless, he has produced, I believe, the most informative freely available digital reference work on Latin usage. I'd like to see a review of his work in some scholarly journal; I think he has deserved it.

Anyway, in the documentation on word frequencies Whitaker says:

FREQ guessed from the relative number of citations given by sources need not be valid, but seems to work. (...)

type FREQUENCY_TYPE is ( -- For dictionary entries
X, -- -- Unknown or unspecified
A, -- very freq -- Very frequent, in all Elementary Latin books, top 1000+ words
B, -- frequent -- Frequent, next 2000+ words
C, -- common -- For Dictionary, in top 10,000 words
D, -- lesser -- For Dictionary, in top 20,000 words
E, -- uncommon -- 2 or 3 citations
F, -- very rare -- Having only single citation in OLD or L+S
I, -- inscription -- Only citation is inscription
M, -- graffiti -- Presently not much used
N -- Pliny -- Things that appear only in Pliny Natural History
);

(Of course, Whitaker knows about Diederich's work -- he is the one who OCR'd Diederich's 1939 thesis and put it online.)

So, we're pleased to report that the Profile of Croatian Neo-Latin Project converted Whitaker's DICTPAGE.RAW to a MySQL table, and learned the following about how Whitaker's ten frequency categories are distributed among the 39,225 lemmata in his wordlist:

  1. X (Unknown or unspecified): 0

  2. A (very freq): 2134

  3. B (frequent): 2747

  4. C (common): 5113

  5. D (lesser): 8365

  6. E (uncommon): 11193

  7. F (very rare): 7974

  8. I (inscription): 430

  9. M (graffiti): 0

  10. N (Pliny): 1269

  11. Total: 39225


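Now we have something to compare. It is interesting to note that the largest single category is E (uncommon), and that the four rarest categories (E, F, I and N) together hold 20,866 of the 39,225 lemmata, more than half.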
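To compare the categories it may help to turn the counts into shares of the total. The following is only a sketch: it works from the counts listed above, not from the MySQL table itself, and the function name category_shares is made up for illustration.

```shell
# category_shares: read "CODE COUNT" lines and print each code with its share of the total.
category_shares() {
    LC_ALL=C awk '
        { code[NR] = $1; count[NR] = $2; total += $2 }   # remember every line, add up the total
        END {
            for (i = 1; i <= NR; i++)
                printf "%s %.1f%%\n", code[i], 100 * count[i] / total
        }'
}

# The counts reported above (the empty categories X and M are left out).
printf '%s\n' 'A 2134' 'B 2747' 'C 5113' 'D 8365' 'E 11193' 'F 7974' 'I 430' 'N 1269' |
    category_shares
```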

[Further reading.] There is a recent publication, Joseph Denooz, Nouveau lexique fréquentiel de latin. Alpha-Omega. Reihe A Bd 258. Hildesheim/Zürich/New York: Georg Olms Verlag, 2010. Pp. ix, 453. ISBN 9783487144733. €148.00. (reviewed recently on BMCR, with a crucial question: "A dictionary such as this is a tool: so what can this one be used for?").

[1] A sad update. Thinking about possible reasons for William Whitaker's absence from the internet, I consulted the obituaries, and found the following:

Colonel William A. Whitaker (USAF-Retired) passed away on Tuesday, December 14, 2010. While at DARPA, he worked on the computer language ADA. In retirement, he created the Latin-English translation software program, "Whitaker Words". (...)
Published in Midland Reporter-Telegram on December 21, 2010
Source here.


Τάνδε κατ' εὔδενδρον στείβων δρίος εἴρυσα χειρὶ
πτώσσουσαν βρομίας οἰνάδος ἐν πετάλοις,
ὄφρα μοι εὐερκεῖ καναχὰν δόμῳ ἔνδοθι θείη,
τερπνὰ δι' ἀγλώσσου φθεγγομένα στόματος.

Requiescat in pace.

Quantification

This morning we had to compile some numbers on the Croatiae auctores Latini collection. Here they are (also on the CroALa developer's blog):

  • 143 TEI XML files (including, alas, some duplicates)

  • 437,218 words

  • 29,637,450 characters

  • 16,465.25 Textkarten

  • 1029 Druckbogen


The last two, strange categories belong to the German printing tradition, which was influential in the Croatian printing industry; we translated these terms (Textkarte = kartica teksta, Druckbogen = tiskarski arak), and still use them in text accounting.
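The two figures can be reproduced from the character count. A small sketch, under an assumption that is not spelled out in this post: one Textkarte is taken to be 1800 characters, and one Druckbogen 16 Textkarten. Both factors are inferred from the numbers above (they fit them), and the function name text_accounting is invented for the example.

```shell
# text_accounting CHARS: convert a character count into Textkarten and Druckbogen.
# Assumed factors: 1800 characters per Textkarte, 16 Textkarten per Druckbogen.
text_accounting() {
    LC_ALL=C awk -v chars="$1" 'BEGIN {
        cards  = chars / 1800      # Textkarten (kartice teksta)
        sheets = cards / 16        # Druckbogen (tiskarski arci)
        printf "Textkarten: %.2f\nDruckbogen: %.0f\n", cards, sheets
    }'
}

text_accounting 29637450
```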

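[Technical note.] The numbers were produced by the Linux wc command (cf. recipe) on all XML files currently in CroALa, also available on its Sourceforge page. The Linux one-liner for calculating the number of characters and words in multiple XML files was simple (wc reports lines, words and bytes for each file, and awk subtracts the line count from the byte count, so that line breaks are not counted as characters):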

wc *.xml | awk '{print $3-$1}'
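If only the grand totals are wanted, a variant along the following lines might do. It is a sketch, not the command actually used for the numbers above; the function name count_text is made up. Note that wc counts bytes by default, so a letter such as ę counts more than once in UTF-8 files.

```shell
# count_text FILE...: print the total number of words and of characters
# (bytes minus line breaks) in all the files given.
count_text() {
    # cat joins the files first, so that wc prints one line of totals
    cat -- "$@" | wc |
        awk '{ print "words:", $2; print "characters:", $3 - $1 }'
}

# Usage, in a directory with the XML files:
# count_text *.xml
```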

Wednesday 4 January 2012

Crepitantia tympana

Do you think that "crepitantia tympana" is a striking Latin expression? Well, so did a lot of people before you, as a Google search tells us (you can compare this with a Google Book Search on the same expression, if you like). A lot of "Dictionaria poetica", "Epithetorum opera absolutissima", and "Aeraria poetica" there. The expression, first used in neo-Latin poetry (attributed to Strozza the younger, it seems, though Boiardo also used it), became commonplace by 1594.

Of all Croatian writers currently in CroALa, only Ludovik Paskalić thought so, but his verse ended up anthologized in Carmina illustrium poetarum Italorum (late, in 1720).

We're interested in the expression because Macario Muzio used it in his De triumpho Christi poema (Venice 1499), describing heavenly music:

264 Summa potestates ducebant agmina et altae
265 Virtutes celsique throni; tum classica sancto
266 Ore procul clangore pio pia signa canebant
267 Victoris uexilla dei lectasque cohortes
268 In superum coetus longo testantia cantu.
269 At circum propiore sono recinente citatis
270 Ad citharam leuibus digitis plectroque uolanti
271 Innumera certante lyra discrimina mille,
272 Mille fides ictis uario modulamine chordis
273 Et totidem surgens ad sacras barbitus odas
274 Aedebat duplicesque manus agitantia naula
275 Cymbalaque et pulsis resonabat bracthea palmis
276 Tinnitus tremula crispans ad carmina dextra.
277 Nec minus obliquas iungebat consona uoces
278 Plurima compactis respondens tibia cannis
279 Et sistra et grato crepitantia tympana bombo
280 Sambucae et molles numeri quos temperat unda
281 Hydraulis, caeleste melos referente monaulo;
282 Diuinos iunxere modos rhythmosque sonantes
283 Acta dei; tales modulans symphonia cantus
284 Laeta triumphantes reuocabat in ethera diuos.

Marko Marulić, who read Muzio's book carefully, later adapted the line in his Dauidias (c. 1502-1510):

2.209 Hinc uictor Dauid, turba comitante suorum,
2.210 Ibat ouans, uiridi redimitus tempora lauro
2.211 Et biiugo uectus curru. Peana canebat
2.212 Pone sequens miles pulsataque tympana bombos
2.213 Ędebant et silua sonum per inane uolutum
2.214 Atque cauis haustum rimis referebat eundem.

The ancient model is Catullus 64:

Cui Thyades passim lymphata mente furebant
euhoe bacchantes, euhoe capita inflectentes.
harum pars tecta quatiebant cuspide thyrsos,
pars e divulso iactabant membra iuvenco,
pars sese tortis serpentibus incingebant,
pars obscura cavis celebrabant orgia cistis,
orgia quae frustra cupiunt audire profani,
plangebant aliae proceris tympana palmis
aut tereti tenuis tinnitus aere ciebant,
multis raucisonos efflabant cornua bombos
barbaraque horribili stridebat tibia cantu.

Tuesday 3 January 2012

Compare two lists 101

This is probably Programming 101, and should be (in analog form) Philology 101, but the combination of the two somehow seems to fall through. Also, there was a question about it today on the Digital Medievalist mailing list.

So, the problem for today is: we want to compare two lists of words.

(Let's say we have made one list by sorting all words from a text alphabetically, and then discarding all multiple occurrences -- save the first one, of course.)
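Such a list can be made with standard Linux text tools. What follows is a minimal sketch, not the script actually used here: the function name make_wordlist and the file name text.txt are made up, and the sketch assumes that words consist of unaccented ASCII letters only (a letter such as ę would be treated as a word separator).

```shell
# make_wordlist FILE: print every distinct word of FILE, one per line, sorted.
make_wordlist() {
    LC_ALL=C tr -cs '[:alpha:]' '\n' < "$1" |   # every run of non-letters becomes a line break
        LC_ALL=C tr '[:upper:]' '[:lower:]' |    # ignore capitalisation
        sed '/^$/d' |                            # drop the empty line left by a leading non-letter
        LC_ALL=C sort -u                         # sort, keeping each word only once
}

# Usage (text.txt is a hypothetical file name):
# make_wordlist text.txt > list1
```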

Does position in a list matter, asks the programmer. No, answers the philologist -- we only want to know whether a word from list A also occurs in list B. Do the rows have multiple fields which should be compared, asks the programmer. No, there is only one field, one word, answers the philologist, thinking about finding a word from the text in a dictionary. Our wishes are modest. (Later, of course, we'll want to find where exactly the common words appear in documents A and B.)

If you use Excel, there seem to be some recipes at The Spreadsheet Page and elsewhere. There is also a Perl module List::Compare (if you are a brave philologist, and have a book or two on Perl handy, you can learn much from the problem). Finally, if you are an eccentric philologist and use Linux, there are standard text manipulation tools for Linux. Yes, here is where we realized that philology is much like programming: both are all about texts.

Surprisingly, the main problem in using all these tools (at least for me) turned out to be how to send a list to the tool, how to loop through all elements in a list, etc. Programming kindergarten, I guess -- but philologists don't usually have to think about how to turn the pages or how to scan lines of text, much less to issue instructions such as "now lift the hand... spread the thumb and index finger... catch the page edge lightly... lift again..." (I know, even programmers don't do it anymore these days; Dada used to do it when she was studying electrical engineering.)

So how do I actually compare two lists? Here is one of my Bash scripts:

egrep -f list1 list2 > resultlist

As you can see, it takes great wisdom and sophistication.

egrep (today equivalent to grep -E) is a utility which finds words (strings) in a file. With the -f option, it reads a list of queries from a file (where every line is a query). list2 is the file which should be searched (actually it does not have to be sorted -- it can simply be the original text; list1 also does not have to be sorted, by the way). The "greater than" sign is a command to send the output to a file (called resultlist); without it, the results would just fly across our screen.
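One caveat: egrep treats every line of list1 as a regular expression and reports every line of list2 which contains a match, so "arma" in list1 will also find "armaque" in list2. When both files are plain word lists (one word per line), a stricter comparison may be wanted. Here is a minimal sketch using the standard grep options -F, -x and -f; the function name common_words is invented for the example.

```shell
# common_words LIST_A LIST_B: print the words which occur in both lists.
common_words() {
    # -F: every line of LIST_A is a fixed string, not a regular expression
    # -x: report only lines of LIST_B which match as a whole
    # -f: read the search strings from a file
    grep -Fxf "$1" "$2" | LC_ALL=C sort -u
}

# Usage:
# common_words list1 list2 > resultlist
```

This variant no longer works when list2 is the original running text, because -x compares whole lines.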

And basically, that's all there is to it. Try not to be frustrated if something goes wrong, look for recipes and explanations on the internet, and remember that you cannot (hopefully) break anything in your computer by experimenting with this kind of command.

Monday 2 January 2012

A list of names

Once we start thinking about lists and tinkering with them (and I've been doing this for a long time now), it turns out that another interesting list to compile would be a list of names from a text. Then, if we cross-reference two texts, we can look for names which occur in both.

Here is such a list for names common both to F. de Diversi's description of Dubrovnik (1440), and A. Crijevic Tubero's history of his times (1520). As you'll see, the list is more than just words. Every item is a link to a search in the CroALa collection -- not just to texts by Diversi and Tubero, but to all currently included texts (this could, of course, be fine-tuned).

  1. albertI
  2. albertUs
  3. alemanUs
  4. alexandrIE
  5. andreE
  6. aUstrIE
  7. bartholomeo
  8. blasII
  9. boemIE
  10. bosnenses
  11. carolUs
  12. chrIstIanorUm
  13. constantInopolIs
  14. contareno
  15. cremonensem
  16. dalmatIa
  17. dalmatIE
  18. epIdaUrI
  19. epIdaUrII
  20. epIdaUrUm
  21. francIscI
  22. francorUm
  23. hUngarIE
  24. IllyrIco
  25. ItalI
  26. ItalIcIs
  27. ItalIco
  28. IUlII
  29. lacromE
  30. laUrentII
  31. leonardUs
  32. marIE
  33. marIam
  34. martInI
  35. medIolanI
  36. mIchEl
  37. mIchElIs
  38. neapolI
  39. neapolItanE
  40. neapolItanam
  41. nIcolaI
  42. nIcolao
  43. nIcolaUm
  44. petrI
  45. petrUm
  46. posonIo
  47. rhacUsanE
  48. rhacUsanIs
  49. salomonIs
  50. sIcIlIE
  51. sIgIsmUndI
  52. sIgIsmUndo
  53. sIgIsmUndUm
  54. thomas
  55. UngarIE


The words look funny because PhiloLogic, the open-source text engine which searches and serves CroALa texts, uses special uppercase characters to find orthographical variants. "UngarIE" will find Vngariae and Ungarię and Ungarie and Vngarye (if there is such a form).
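The idea behind the uppercase characters can be imitated with ordinary regular expressions. The sketch below is not PhiloLogic's own code and says nothing about how PhiloLogic works internally; it only covers the three substitutions mentioned on this blog (U for u or v, I for i or y, E for e, ae or ę), and the function names expand_variants and find_variants are made up.

```shell
# expand_variants PATTERN: turn a pattern such as UngarIE into an extended regular expression.
expand_variants() {
    printf '%s\n' "$1" |
        sed -e 's/U/[uv]/g' -e 's/I/[iy]/g' -e 's/E/(ae|ę|e)/g'
}

# find_variants PATTERN FILE: print the lines of FILE which consist of a variant of PATTERN.
find_variants() {
    # -E: extended regular expressions, -i: ignore capitalisation, -x: whole lines only
    grep -Eix -e "$(expand_variants "$1")" "$2"
}

# Usage (wordlist.txt is a hypothetical file with one word per line):
# find_variants UngarIE wordlist.txt
```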

Looking at neo-Latin

Problem: we want to research strange words in neo-Latin texts.

Of course, that depends on what we consider to be "strange". This can mean:
a. Latin words which are rare or non-existent in classical Latin (mutatis mutandis, the language in which Romans wrote until c. AD 500)
b. words which are strange to us
c. words which were strange to authors or their public
d. words which are in a neo-Latin text, but are not Latin

Let us here consider case a. It turns out to have several sub-problems of its own:
a.1. words which don't exist in Latin of the Romans (see the Neulateinische Wortliste by J. Ramminger)
a.2. words which are rare in all periods of Latin
a.3. words which are rare in Latin of the Romans, but frequent in later Latin (e. g. medieval), or in some Latin idioms (e. g. ecclesiastical Latin)
a.4. words which are rare in some genres, but frequent in others

Basically, there are two approaches to our research. We can start from our texts, examining their words and looking for them in different wordlists. Or we can start from lists, and see if our texts contain some of their words.

It all comes down to comparing lists. The longer the better.

But we need special lists. Here's a list (sic) of them:
a. a list of words in our text
b. a list of lemmata of words from our text
c. a list of words which are rare in classical Latin
d. a list of neo-Latin words
(e. and a list of frequent Latin words would also come in handy)

We also need some tools. If we want to go from a text to the lists, we'll need:
a. something to list all words in our text
b. a Latin parser (we feed it a form, and get back the lemma)
c. a way to communicate our words to the parser (and get back the results)
d. something which can compare two lists
e. something which can write out the results
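Leaving the parser aside (steps b and c depend on which parser we choose), the remaining steps can be sketched with standard Linux tools. This is an illustration under assumptions, not a finished tool: the function name rare_candidates and the file names are made up, the reference list is assumed to hold one lowercase word per line, and only unaccented ASCII letters count as parts of a word.

```shell
# rare_candidates TEXT REFERENCE: print the word forms of TEXT which are absent from REFERENCE.
rare_candidates() {
    ref_sorted=$(mktemp)
    LC_ALL=C sort -u "$2" > "$ref_sorted"          # comm needs sorted input
    LC_ALL=C tr -cs '[:alpha:]' '\n' < "$1" |      # step a: one word per line
        LC_ALL=C tr '[:upper:]' '[:lower:]' |
        sed '/^$/d' |
        LC_ALL=C sort -u |
        LC_ALL=C comm -23 - "$ref_sorted"          # step d: keep lines found only in the text
    rm -f "$ref_sorted"
}

# Step e, writing out the results (both file names are hypothetical):
# rare_candidates text.txt frequent_words.txt > candidates.txt
```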

If we go from a list to the text, we'll need:
a. a Latin stemmer (we'll look only for stems, and disregard the endings)
b. a regular expressions tool (to find a complete word, given a stem)
c. tools for comparing and storing the results, as above
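The second route can be sketched as well. The stemmer itself is not included; the sketch assumes that we already have a file of stems (one per line, plain letters only) and that our grep supports the -o and -w options, as GNU grep does. The function name find_by_stems and the file names are made up.

```shell
# find_by_stems STEMS TEXT: print the distinct words of TEXT which begin with one of the stems.
find_by_stems() {
    patterns=$(mktemp)
    # turn every stem into "stem followed by any ending"; skip empty lines
    sed -e '/^$/d' -e 's/$/[[:alpha:]]*/' "$1" > "$patterns"
    # -o: print only the matched word, -w: match whole words only, -i: ignore capitalisation
    grep -Eiow -f "$patterns" "$2" | LC_ALL=C sort -u
    rm -f "$patterns"
}

# Usage (both file names are hypothetical):
# find_by_stems stems.txt text.txt > resultlist
```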