Sunday, 30 December 2012

Morphological JSON with Perl

Learning Perl, aka "the Llama book", makes a terrific didactical point in footnote 8 on page 6:
If you're going to use a programming language for only a few minutes each week or month, you'd prefer one that is easier to learn, since you'll have forgotten nearly all of it from one use to the next. Perl is for people who are programmers for at least twenty minutes a day.

Basically, nulla dies sine linea. The daily twenty minutes today took about three or four hours, but I ended up with a Perl version of what I had already done in JavaScript: a script that iterates over any list of JSON results from the Latin Morphology Service, decides whether a word sent to it has been recognized or not, and then whether the lemmatization is ambiguous or not.

The rizai — getting through all arrays of hashes and hashes of hashes — have been pikrai indeed (the crucial piece of information was shared by this post at Stack Overflow); dereferencing still appears to me as consecutio temporum must look to a programmer; hashes were my Scylla and arrays my Charybdis, but the ship is still sailing, more or less.

The script is here (thanks to DokuWiki).

All this wasn't done as a pure exercise (I'm not such a conscientious student). The Morphology Service JSON holds a lot more than a lemma; in fact, it provides a wealth of information, most of what people interested in natural language processing of Greek and Latin usually lack (and scholars of other languages have). You need to stem a word? You need to identify which part of speech it is? It's all there somewhere, nested deep in JSON.

Naturally, you ask why I should bother. Are we not trained to use dictionaries, don't we have enough grammatical knowledge? Of course we do; we can read Greek and Latin much better than computers. But there are limits to how much we can read, or analyse. Giving the text the care and the gusto it requires (the Greek and Latin we have today were not written to be read quickly), I need from one to ten minutes for a page, and enough time for reflexion and rumination afterwards. Grammatical analysis progresses even more slowly. The computer, on the other hand, doesn't care for rumination; it gets the JSON for 2000+ words of a neo-Latin text back from the Morphology Service in approximately the time that I need to write this post.

And then we have a chance to learn from computers' mistakes.

Which words were recognized, which are ambiguous, which are unknown to the service? What is the proportion between the three groups? Which words are unambiguously identified, and not inflected? We'll store the uninflected words somewhere, because we don't need to stem them (much); we'll store the unambiguously recognized words, because we won't need to lemmatize them in other texts; from the set of unrecognized words we'll be building an index nominum et locorum, an index verborum rariorum, and a list of common words which Morphology Service should add to its database. Furthermore, a list of lemmata allows us to begin exploring lexical variety in a text, or in a set of texts.
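
The three groups and their proportions can be tallied in a few lines. What follows is only a minimal sketch in JavaScript (not the Perl script mentioned above), assuming the response shape shown in the sample responses of the 27 December post: a missing "Body" means an unknown word, an array means an ambiguous one, a single object a recognized one. The function names classify and tally are hypothetical.

```javascript
// Classify one parsed Morphology Service response.
// Assumption: response.RDF.Annotation.Body is absent (unknown word),
// an array (ambiguous) or a single object (recognized).
function classify(response) {
  const body = response.RDF.Annotation.Body;
  if (body === undefined || body === null) return "unknown";
  return Array.isArray(body) ? "ambiguous" : "recognized";
}

// Count the three groups and compute each group's share of the total.
function tally(responses) {
  const counts = { recognized: 0, ambiguous: 0, unknown: 0 };
  for (const response of responses) counts[classify(response)] += 1;
  const total = responses.length;
  const share = {};
  for (const key of Object.keys(counts)) {
    share[key] = total === 0 ? 0 : counts[key] / total;
  }
  return { counts, share, total };
}
```

Calling tally on the array of parsed responses returns the raw counts and the fraction of the whole that each group makes up.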

Mind you, the basis for much of this is being put together while I write this. All I had to do to make it happen was learn some code. It almost didn't hurt. Much.

Saturday, 29 December 2012

Structuring the Mercurius Croaticus

Mercurius Croaticus is currently a set of TEI XML files containing bibliographical records. These files will be served and made accessible via the BaseX XML database, once we decide on how to present the records; our working premise is that, if researchers want to discover something really interesting, they'll be willing and ready to learn XQuery (not least because it's a powerful tool for more than one research project). And Mercurius Croaticus will help them learn.

To get a clear idea of what exactly is in the bibliographic collection, and to avoid confusion, we organised the files in three sets of folders, following the FRBR concept. There is a folder for authors (auctores), a folder for works (opera), and one for "manifestations" (it is interesting that a term for the third category is not readily available in any language I know); "manifestations" have two subfolders: manuscripts (MS) and printed books (typis edita). Obviously, the "internet" subfolder could also be included.

The folder auctores contains our starting prosopography — a set of 244 personal records for Croatian Latin authors included in the Leksikon hrvatskih pisaca (A Lexicon of Croatian Writers, Zagreb 2000) — and the "additions" file, containing currently 70 more neo-Latin authors of Croatian origin.

The main part of opera is also culled from the Leksikon hrvatskih pisaca (1784 items are listed), and there are 18 additional items as well.

The manifestations/MS subfolder has one special collection, with excerpts from Paul Oskar Kristeller's Iter Italicum (no, we don't have the money to subscribe to Brill's internet edition); it was excerpted by Darko Novaković, who kindly lent his notes to Mercurius for TEI XML conversion. And there are records collected from other sources, more or less obiter.

Finally, there is the manifestations/typis edita subfolder, where the basis is the bibliography made by Šime Jurić (1915–2004) in the late 1960s: Iugoslaviae scriptores Latini recentioris aetatis (that is, its "Pars I. Opera scriptorum Latinorum natione Croatarum usque ad annum MDCCCXLVIII typis edita. Bibliographiae fundamenta. t. 1. Index alphabeticus. t. 2. Index systematicus. Additamentum I."), later improved by the Croatian National and University Library, and encoded by the Croatiae auctores Latini project. This collection contains 5867 bibliographic records on printed Latin publications with works by Croatian authors. Jurić's bibliography breaks off with the year 1850. Mercurius Croaticus should urgently supplement it with data on later publications, all the way up to the present.

Thursday, 27 December 2012

Saturnalia with Perseus Latin JSON

A two-day Christmas project: learn how to use Latin lemmata in JSON format, as provided by the Morphological Analysis Service available on an instance of the Bamboo Services Platform hosted by the University of California, Berkeley at http://services-qa.projectbamboo.org/bsp/morphologyservice (used by Perseus for Latin and announced by Bridget Almas on November 1, 2012).

The fanfare: see the results here: [X].

Caveat. If you're a programmer, the following may seem extremely silly, because it is a description of how a non-programmer solved a programmer's task.

The task

What we wanted to do. There is a file of saved JSON responses, produced by sending a list of Latin words (from a Croatian neo-Latin text) to the Morphological Analysis Service, and appending the responses. We wanted to produce a simple table containing a form sent to the service and the lemma received in response. This has already been achieved locally and indirectly, processing the responses file with some Perl and Linux CLI regex tools to produce an HTML page. But now I wanted to learn how to compute JSON as JSON. Solving the problem also taught me the structure of the Morphological Analysis Service response.

The responses file contains lines of JSON objects:

{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abduxerunt:morpheus","hasTarget":{"Description":{"about":"urn:word:abduxerunt"}},"hasBody":{"resource":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT","Body":{"about":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abduco"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abdu_x","suff":"e_runt"},"pofs":{"order":1,"$":"verb"},"mood":"indicative","num":"plural","pers":"3rd","tense":"perfect","voice":"active","stemtype":"perfstem"}}}}}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abscente:morpheus","hasTarget":{"Description":{"about":"urn:word:abscente"}},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT"}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abstineant:morpheus","hasTarget":{"Description":{"about":"urn:word:abstineant"}},"hasBody":{"resource":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:29\nGMT","Body":{"about":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abstineo"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abs:tin","suff":"eant"},"pofs":{"order":1,"$":"verb"},"mood":"subjunctive","num":"plural","pers":"3rd","tense":"present","voice":"active","stemtype":"conj2","morph":"comp_only"}}}}}}}

The JSON

"Lines of JSON objects" is not valid JSON, as you can see if you copy the lines above and paste them here: jsonlint.com. Why the error? A JSON document must be a single value, so all the objects have to be contained in one JSON array. Also, for some reason "RDF" wasn't accepted as a field name (key). So we transformed the file locally, introducing "Verba" as the key of the top-level array, like this:

perl -0777 -wpl -e "s/\n/,\n/g;" jsonfilename | perl -wpl -e 's/"RDF"/"rdf"/g;' | perl -0777 -wpl -e 's/^{/{"Verba":\[{/;' | perl -0777 -wpl -e 's/,\n*$/\n\]}\n/;' > resultfilename.json
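
The same wrapping can also be done by parsing each line rather than editing the text with regular expressions. This is only a sketch, assuming one complete JSON object per line; the function name wrapJsonLines is hypothetical, while "Verba" and the renaming of "RDF" to "rdf" follow the pipeline above.

```javascript
// Turn text with one JSON object per line into a single object
// holding all of them in a top-level "Verba" array.
function wrapJsonLines(text) {
  const verba = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip empty lines
    .map((line) => {
      const obj = JSON.parse(line); // throws if a line is not valid JSON
      // Mirror the renaming of the "RDF" key to "rdf".
      if (Object.prototype.hasOwnProperty.call(obj, "RDF")) {
        obj.rdf = obj.RDF;
        delete obj.RDF;
      }
      return obj;
    });
  return { Verba: verba };
}
```

Passing the result to JSON.stringify gives a document that a validator should accept, because it is now a single JSON value.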

To process the JSON, we used JavaScript (amazed by the d3 JavaScript library). D3 documentation and examples showed how to read in a JSON file — and then I realized that the Morphology Service JSON is pretty complicated:

  • it is (sometimes deeply) nested
  • its objects contain arrays
  • an unsuccessfully lemmatized word won't have any "Body" object; a lemmatized word will have a single object there; for an ambiguously lemmatized word, "Body" will be an array of such objects

The loops

So, lots of loops here. First the script had to loop through the array:

var infoLength = data.Verba.length;
for (var infoIndex = 0; infoIndex < infoLength; infoIndex++) {
  // ...
}

Then it had to check whether a JSON object contains a lemmatized word, that is, whether it has a "Body" object. The a1 variable will hold the form sent to the service, while a2 will hold the lemmata (or the information that the word wasn't lemmatized successfully):

var a1 = data.Verba[infoIndex].rdf.Annotation.hasTarget.Description.about;
// testing for existence of lemma
var verbdata = data.Verba[infoIndex].rdf.Annotation.Body;
if (verbdata) {
  // ...
} else {
  var a2 = "FORMA NON RECOGNITA";
}

And then the "Body" object had to be tested for array; in case it isn't an array, JSON would be traversed all the way to the "$" key (containing the dictionary entry for the lemma):

if (Object.prototype.toString.call(verbdata) === '[object Array]') {
  // Iterate the array and do stuff ...
} else {
  var link2 = perseus + data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
  var a2 = data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
}

Now, in case that the verbdata variable contains an array, the array had to be iterated over, and a list — actually, a new array — had to be built from its values:

var a2 = [];
for (var bodyIndex = 0; bodyIndex < verbdata.length; bodyIndex++) {
  a2.push(verbdata[bodyIndex].rest.entry.dict.hdwd.$);
}

Finally, we used a small routine to populate a table with resulting forms / lemmata pairs:

var columns = [a1, a2];
var table = d3.select("#container")
  .append("tr")
  .selectAll("td")
  .data(columns)
  .enter()
  .append("td")
  .text(function(column) { return column; });

Lots of trial-and-error (Firebug was a great help, and Stack Overflow an even greater one) need not be dwelt on. Just one limitation puzzles me: the JSON file contains responses for more than 2000 words, but my version of Firefox throws an error after ca. 720 objects read; either something should be optimized, or a paging system introduced. And, of course, seeing all 2000+ forms/lemmata pairs at once is neither necessary nor useful; the only thing we need is an option to sort out the unrecognized forms. This was added by the sorttable.js script.
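
Filtering and paging could look roughly like the sketch below. It is not what the page actually does (sorting there is left to sorttable.js), and nothing here is known to cure the Firefox error; it only assumes the data.Verba array described above. The function names unrecognized and pageOf are hypothetical.

```javascript
// Keep only the entries the service did not recognize,
// i.e. those without a "Body" object.
function unrecognized(verba) {
  return verba.filter((entry) => !entry.rdf.Annotation.Body);
}

// Return one page of entries; pages are counted from 0.
function pageOf(entries, page, size) {
  return entries.slice(page * size, (page + 1) * size);
}
```

With these, pageOf(unrecognized(data.Verba), 0, 100) would yield the first hundred unrecognized entries, which can then be handed to the table routine.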

Once again, the page with our JavaScript can be seen in action here: [X].

Wednesday, 19 December 2012

One-line concordance in Linux command line

A recipe for creating a "concordance" (actually, a list of forms from a text with frequencies added) using just the command line, skipping programs such as AntConc (which is great, nice and illuminating, but sometimes I just need to prepare a list quickly). It can be done with the following Bash one-liner:

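tr '[:punct:]' ' ' < filename1 | tr '[:upper:]' '[:lower:]' | tr -s '[:blank:]' '\n' | sort | uniq -c | sed 's/ \{1,\}/","/g' | sed 's/^",//g' | sed 's/$/"/g' > filename2.csv

(filename1 is the input file, filename2.csv the output in CSV format. The third tr puts each word on its own line, so that sort and uniq -c count words rather than whole lines.)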
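
For comparison, here is the same recipe as a minimal JavaScript sketch. The function names concordance and toCsv are hypothetical, and the punctuation test (anything that is not a letter, digit or whitespace) only roughly corresponds to the [:punct:] class used by tr.

```javascript
// Build a list of [form, frequency] pairs from a text.
function concordance(text) {
  const counts = new Map();
  const forms = text
    .toLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, " ") // punctuation becomes a space
    .split(/\s+/)
    .filter((form) => form.length > 0);
  for (const form of forms) {
    counts.set(form, (counts.get(form) || 0) + 1);
  }
  // Sort alphabetically by form, much as sort | uniq -c would.
  return [...counts.entries()].sort((a, b) =>
    a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0
  );
}

// Write the pairs as CSV lines: "frequency","form"
function toCsv(rows) {
  return rows.map(([form, n]) => `"${n}","${form}"`).join("\n");
}
```

toCsv(concordance(text)) produces lines in the same frequency-first layout as the one-liner.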

Recently there was a discussion on the HUMANIST list about whether "bash scripting is a worthwhile approach to "tool" development in the Digital Humanities". People tended to reply no: either learn a "real" language (Python was recommended), or develop a GUI ("like most people, humanists have a strong distaste for the commandline"); I think it was 3:1 against the command line.

Obviously, I disagree. Using bash helped me cross the boundary between user and "programmer"; on the command line one just slides from one region into another. Without a formal education: you have a problem, you look for a solution (discovering gratefully that you stand on the shoulders of many colleagues), and bam! it's solved.

I think that digital humanists in general should adopt this kind of sliding (from users to programmers as well as from "classical" to "avant-garde" scholars) as their MO.

Asterisks

* * *

Today I had to mark up several Latin poems which used the device above -- three asterisks, even placed as an asterism -- to mark breaks between thematic units.

How to mark a set of asterisks, a typographical asterism, in TEI XML? A TEI-L discussion from 2007 helped, and I decided to use the space element ("indicates the location of a significant space in the copy text").

Friday, 7 December 2012

Profiling cultural literacy of Croatian Latin writers

A paper to be presented at the Latin, National Identity and the Language Question in Central Europe conference, organised by the Ludwig Boltzmann Institute for Neo-Latin Studies in Innsbruck (12--15 Dec 2012) will apply E. D. Hirsch's concept of cultural literacy --- which is actually German Allgemeinbildung or Bildungsgut (see it at work in these books), and "opća kultura" in Croatian --- to intellectual horizons of Croatian neo-Latin writers, as represented in the Croatiae auctores Latini collection.

Judging from the programme, the conference offers a chance to present digital research to an audience working mostly in "traditional" ways. This is quite an opportunity; too often digital humanities get separated in a room of their own, where they don't get in anybody's way. So, the challenge is to persuade colleagues that a large-scale search of CroALa using e. g. common terms from CAMENA TERMINI can lead to something interesting.

Update, post-conference, 19/12/2012:

My presentation on cultural literacy is here (note to self: never again try to do a presentation in a browser — outdated browser versions always turn up in crucial moments in crucial spots).

The paper itself is here.

Monday, 16 April 2012

Preparing to launch the Mercurius Croaticus




Mercurius Croaticus, a prosopographical and bibliographical collection of data on Croatian Latin writers, their works, editions, and manuscripts, is nearing its launching point.

The starting dataset comprises information on:

  • 269 authors

  • 1808 works

  • 5867 printed editions

  • 58 manuscripts (more to be added soon)

  • digitised copies, where available

Wednesday, 1 February 2012

Beneša's trigrams

Here is an experimental page for researching trigrams in the De morte Christi, a neo-Latin epic by Damianus Benessa (Damjan Beneša).

What we did:
1. using a concordance program (our reliable AntConc), we found trigrams in Beneša's Latin text, which we obtained courtesy of our colleague Vlado Rezar, Beneša's modern editor

2. we reformatted the trigrams slightly, using tr and sed, to make use of the excellent PhiloLogic crapser function (it is hard not to laugh thinking about this function, because in Croatian "serem" means "I crap")

3. using curl and a simple bash script, we sent the trigrams to CroALa

4. using sed again, we filtered out the successful hits, i. e. those which produced results

5. with some more sed, the hits were turned into searches on the page linked to at the beginning: [X]. There you'll find the trigram which produced the hit, the link to a saved search, and a report on the number of occurrences found in CroALa.

The most interesting findings for us are occurrences from Marko Marulić and Jakov Bunić, close contemporaries of Beneša; Marulić and Bunić also wrote Biblical epic poems in Latin (and Marulić's epic remained in manuscript until the 1950s).

The useful sed snippet which produces the regex line, and a line immediately before it, is here:
sed -n '/Your search found/{x;1!p;g;$N;p;};h' ben-filename


(Adapted from that goldmine, the Sed one-liners.)
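
The idea behind the snippet (keep the matching line and the line just before it) can be sketched in JavaScript as follows. This is not a translation of the sed program, only an illustration of the same filtering; the function name matchesWithPreviousLine is hypothetical.

```javascript
// Return every line matching `pattern`, together with the line before it.
// `pattern` should be a RegExp without the "g" flag, so that test()
// does not carry state from one line to the next.
function matchesWithPreviousLine(text, pattern) {
  const lines = text.split("\n");
  const out = [];
  lines.forEach((line, i) => {
    if (pattern.test(line)) {
      out.push({ previous: i > 0 ? lines[i - 1] : null, match: line });
    }
  });
  return out;
}
```

Called with /Your search found/ on a saved results file, it returns pairs in which previous would hold the line preceding each report.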

Thinking about PND

An important part of our research is finding the Personennamendatei (PND) number of Croatian Latin authors and adding the number to our personal data record of the author. So far, 83 authors (of 241 from our experimental set) have been connected with their PND-Nrs.

Now we're looking into the ways Wikipedia (at least, German Wikipedia) uses the PND to uniquely identify persons and connect data on them. The BEACON format seems a nice start for a small catalogue like ours. And, of course, it would be nice if Croatian Wikipedia decided to adopt something similar to the PND scheme.

Sunday, 8 January 2012

Relations are interesting

Relations are interesting. A single fact is not interesting. If it remains single, that is. We have to relate something to it to get it going.
A single table is moderately interesting. It invites us at first to build resumes and reports (discarding individual facts in the process; this is, by the way, deeply unsatisfying), and then to compare facts in proximity to each other.
Relations of multiple tables are very interesting.
Beyond that -- wait, what lies beyond that?
So many relations, branches and joins between tables that human beings cannot hold it all in their heads, I guess.

Saturday, 7 January 2012

In the neighbourhood of Dubrovnik

One of the many nice capabilities of the PhiloLogic text search system is the collocation table. It shows which words occur most often in the proximity of our search string.

So we gave PhiloLogic an interesting problem. There are many Latin names for the city of Dubrovnik, and even more ways to write these. We wanted to find all of them with a single search, and to see the words which co-occur with all these names.

The search string was:
epI[dt]aUr.*|rh?a[gc]Us.*|dUbr.*

(capital U's and I's to find u and v, i and y)

The result is [here], nicely shortened by Google Shortener into goo.gl/PofAI.

What do we learn from the search? That Dubrovnik is an urbs and a ciuitas, that it has principatus and nobiles and senatus. Not much surprise here.

The interesting move is to compare Dubrovnik with Split (Spalatum: [X]). It can be seen at once that there ecclesia and archiepiscopus feature much more prominently.

And so on, until the map is complete.

Thursday, 5 January 2012

Rare and Medium

This wintry afternoon I followed in the footsteps of William Whitaker, the author of the WORDS Latin dictionary. The program contains a list of Latin words with very precise lexicographic descriptions -- data on period, area of application, frequency etc. The last part interested me most.

Whitaker, about whom I know almost nothing, but would like to know more (he seems to be outside academia) [1], was very modest and careful in his claims, repeatedly warning users of the program that its philological expertise is limited, that he relied on other authorities and sources, that the program is intended just to be a reading help, not a research tool. Nevertheless, he has produced, I believe, the most informative freely available digital reference work on Latin usage. I'd like to see a review of his work in some scholarly journal; I think he has deserved it.

Anyway, in the documentation on word frequencies Whitaker says:

FREQ guessed from the relative number of citations given by sources need not be valid, but seems to work. (...)

type FREQUENCY_TYPE is ( -- For dictionary entries
X, -- -- Unknown or unspecified
A, -- very freq -- Very frequent, in all Elementary Latin books, top 1000+ words
B, -- frequent -- Frequent, next 2000+ words
C, -- common -- For Dictionary, in top 10,000 words
D, -- lesser -- For Dictionary, in top 20,000 words
E, -- uncommon -- 2 or 3 citations
F, -- very rare -- Having only single citation in OLD or L+S
I, -- inscription -- Only citation is inscription
M, -- graffiti -- Presently not much used
N -- Pliny -- Things that appear only in Pliny Natural History
);

(Of course, Whitaker knows about Diederich's work -- he is the one who OCR'd Diederich's 1939 thesis and put it online.)

So, we're pleased to report that the Profile of Croatian Neo-Latin Project converted Whitaker's DICTPAGE.RAW to a MySQL table, and learned the following about how Whitaker's ten frequency categories are distributed among the 39,225 lemmata in his wordlist:

  1. X (Unknown or unspecified): 0

  2. A (very freq): 2134

  3. B (frequent): 2747

  4. C (common): 5113

  5. D (lesser): 8365

  6. E (uncommon): 11193

  7. F (very rare): 7974

  8. I (inscription): 430

  9. M (graffiti): 0

  10. N (Pliny): 1269

  11. Total: 39225


Now we have something to compare. It is interesting to note that the largest single category is E (uncommon), with 11193 lemmata, or about 29% of the list.
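
To see each category's share at a glance, the counts listed above can be turned into percentages. A small sketch; the names whitakerCounts and shares are hypothetical, and the numbers are simply copied from the list.

```javascript
// Counts per frequency code, copied from the list above.
const whitakerCounts = {
  X: 0, A: 2134, B: 2747, C: 5113, D: 8365,
  E: 11193, F: 7974, I: 430, M: 0, N: 1269,
};

// Compute the total and each code's percentage, rounded to one decimal.
function shares(counts) {
  const total = Object.values(counts).reduce((sum, n) => sum + n, 0);
  const result = {};
  for (const [code, n] of Object.entries(counts)) {
    result[code] = { count: n, percent: Math.round((n / total) * 1000) / 10 };
  }
  return { total, result };
}
```

shares(whitakerCounts).total reproduces the total of 39225 given in the list.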

[Further reading.] There is a recent publication, Joseph Denooz, Nouveau lexique fréquentiel de latin. Alpha-Omega. Reihe A Bd 258. Hildesheim/Zürich/New York: Georg Olms Verlag, 2010. Pp. ix, 453. ISBN 9783487144733. €148.00. (reviewed recently on BMCR, with a crucial question: "A dictionary such as this is a tool: so what can this one be used for?").

[1] A sad update. Thinking about possible reasons for William Whitaker's absence from the internet, I consulted the obituaries, and found the following:

Colonel William A. Whitaker (USAF-Retired) passed away on Tuesday, December 14, 2010. While at DARPA, he worked on the computer language ADA. In retirement, he created the Latin-English translation software program, "Whitaker Words". (...)
Published in Midland Reporter-Telegram on December 21, 2010
Source here.


Τάνδε κατ' εὔδενδρον στείβων δρίος εἴρυσα χειρὶ
πτώσσουσαν βρομίας οἰνάδος ἐν πετάλοις,
ὄφρα μοι εὐερκεῖ καναχὰν δόμῳ ἔνδοθι θείη,
τερπνὰ δι' ἀγλώσσου φθεγγομένα στόματος.

Requiescat in pace.

Quantification

This morning we had to compile some numbers on the Croatiae auctores Latini collection. Here they are (also on the CroALa developer's blog):

  • 143 TEI XML files (including, alas, some duplicates)

  • 437.218 words

  • 29.637.450 characters

  • 16.465,25 Textkarten

  • 1029 Druckbogen


The last two strange categories belong to the German printing tradition, which was influential in the Croatian printing industry; we translated these terms (Textkarte = kartica teksta, Druckbogen = tiskarski arak), and still use them in text accounting.

[Technical note.] The numbers were produced by the Linux wc command (cf. recipe) on all XML files currently in CroALa, also available on its Sourceforge page. The Linux one-liner for calculating the number of characters and words in multiple XML files was simple:

wc *.xml | awk '{print $3-$1}'
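
The step from characters to Textkarten and Druckbogen is plain division. The sketch below assumes 1800 characters per Textkarte and 16 Textkarten per Druckbogen; neither ratio is stated in this post, both are inferred from the figures above (29.637.450 / 1800 = 16.465,25). The function name textAccounting is hypothetical.

```javascript
// Assumptions inferred from the figures in this post, not stated in it.
const CHARS_PER_TEXTKARTE = 1800;
const TEXTKARTEN_PER_DRUCKBOGEN = 16;

// Convert a character count into Textkarten and Druckbogen.
function textAccounting(characters) {
  const textkarten = characters / CHARS_PER_TEXTKARTE;
  const druckbogen = textkarten / TEXTKARTEN_PER_DRUCKBOGEN;
  return { textkarten, druckbogen };
}
```

For the character count reported above, the Druckbogen value comes out slightly over 1029, which matches the rounded figure in the list.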

Wednesday, 4 January 2012

Crepitantia tympana

Do you think that "crepitantia tympana" is a striking Latin expression? Well, so did a lot of people before you, as a Google search tells us (you can compare this with a Google Book Search on the same expression, if you like). A lot of "Dictionaria poetica", "Epithetorum opera absolutissima", and "Aeraria poetica" there. The expression, first used in neo-Latin poetry (attributed to Strozza the younger, it seems, though Boiardo also used it), became commonplace by 1594.

Of all Croatian writers currently in the CroALa, only Ludovik Paskalić thought so, but his verse ended up anthologized in Carmina illustrium poetarum Italorum (late, in 1720).

We're interested in the expression because Macario Muzio used it in his De triumpho Christi poema (Venice 1499), describing heavenly music:

264 Summa potestates ducebant agmina et altae
265 Virtutes celsique throni; tum classica sancto
266 Ore procul clangore pio pia signa canebant
267 Victoris uexilla dei lectasque cohortes
268 In superum coetus longo testantia cantu.
269 At circum propiore sono recinente citatis
270 Ad citharam leuibus digitis plectroque uolanti
271 Innumera certante lyra discrimina mille,
272 Mille fides ictis uario modulamine chordis
273 Et totidem surgens ad sacras barbitus odas
274 Aedebat duplicesque manus agitantia naula
275 Cymbalaque et pulsis resonabat bracthea palmis
276 Tinnitus tremula crispans ad carmina dextra.
277 Nec minus obliquas iungebat consona uoces
278 Plurima compactis respondens tibia cannis
279 Et sistra et grato crepitantia tympana bombo
280 Sambucae et molles numeri quos temperat unda
281 Hydraulis, caeleste melos referente monaulo;
282 Diuinos iunxere modos rhythmosque sonantes
283 Acta dei; tales modulans symphonia cantus
284 Laeta triumphantes reuocabat in ethera diuos.

Marko Marulić, who read Muzio's book carefully, later adapted the line in his Dauidias (c. 1502-1510):

2.209 Hinc uictor Dauid, turba comitante suorum,
2.210 Ibat ouans, uiridi redimitus tempora lauro
2.211 Et biiugo uectus curru. Peana canebat
2.212 Pone sequens miles pulsataque tympana bombos
2.213 Ędebant et silua sonum per inane uolutum
2.214 Atque cauis haustum rimis referebat eundem.

The ancient model is Catullus 64:

Cui Thyades passim lymphata mente furebant
euhoe bacchantes, euhoe capita inflectentes.
harum pars tecta quatiebant cuspide thyrsos,
pars e divulso iactabant membra iuvenco,
pars sese tortis serpentibus incingebant,
pars obscura cavis celebrabant orgia cistis,
orgia quae frustra cupiunt audire profani,
plangebant aliae proceris tympana palmis
aut tereti tenuis tinnitus aere ciebant,
multis raucisonos efflabant cornua bombos
barbaraque horribili stridebat tibia cantu.

Tuesday, 3 January 2012

Compare two lists 101

This is probably Programming 101, and should be (in analog form) Philology 101, but the combination of the two seems somehow to fall through. Also, there was a question about it today on the Digital Medievalist mailing list.

So, the problem for today is: we want to compare two lists of words.

(Let's say we have made one list by sorting all words from a text alphabetically, and then discarding all multiple occurrences -- save the first one, of course.)

Does position in a list matter, asks the programmer. No, answers the philologist -- we only want to know whether a word from list A also occurs in list B. Do the rows have multiple fields which should be compared, asks the programmer. No, there is only one field, one word, answers the philologist, thinking about finding a word from the text in a dictionary. Our wishes are modest. (Later, of course, we'll want to find where exactly common words appear in documents A and B.)

If you use Excel, there seem to be some recipes at The Spreadsheet Page and elsewhere. There is also a Perl module List::Compare (if you are a brave philologist, and have a book or two on Perl handy, you can learn much from the problem). Finally, if you are an eccentric philologist and use Linux, there are standard text manipulation tools for Linux. Yes, here is where we realized that philology is much like programming: both are all about texts.

Surprisingly, the main problem in using all these tools (at least for me) turned out to be how to send a list to the tool, how to loop through all elements in a list, etc. Programming kindergarten, I guess -- but philologists don't usually have to think about how to turn the pages or how to scan lines of text, much less to issue instructions such as "now lift the hand... spread the thumb and index finger... catch the page edge lightly... lift again..." (I know, even programmers don't do it anymore these days; Dada used to do it when she was studying electrical engineering.)

So how do I actually compare two lists? Here is one of my Bash scripts:

egrep -f list1 list2 > resultlist

As you can see, it takes great wisdom and sophistication.

egrep, today equivalent to grep -E, is a utility which finds words (strings) in a file. With the -f option, it reads a list of queries from a file (where every line is a query). list2 is the file which should be searched (actually it does not have to be sorted -- it can simply be the original text; list1 also does not have to be sorted, by the way). Each query matches anywhere in a line, so "ama" will also find "amabat"; add the -w option to match whole words only. The "greater than" sign is a command to send output to a file (called resultlist); without it, results would just fly across our screen.
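
For readers who prefer to see the comparison spelled out, here is a minimal JavaScript sketch of the same idea. Unlike egrep -f, it compares whole words exactly instead of treating each line of list1 as a pattern; the function names commonWords and onlyInFirst are hypothetical.

```javascript
// Words of listA that also occur in listB (exact match, order of listA kept).
function commonWords(listA, listB) {
  const inB = new Set(listB);
  return listA.filter((word) => inB.has(word));
}

// Words of listA that do not occur in listB.
function onlyInFirst(listA, listB) {
  const inB = new Set(listB);
  return listA.filter((word) => !inB.has(word));
}
```

Neither list needs to be sorted, since looking a word up in a Set does not depend on its position.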

And basically, that's all there is to it. Try not to be frustrated if something goes wrong, look for recipes and explanations on the internet, and remember that you cannot (hopefully) break anything in your computer by experimenting with these kinds of commands.

Monday, 2 January 2012

A list of names

Once we start thinking about lists and tinkering with them (and I've been doing this for a long time now), it turns out that another interesting list to compile would be a list of names from a text. Then, if we cross-reference two texts, we can look for names which occur in both.

Here is such a list for names common both to F. de Diversi's description of Dubrovnik (1440), and A. Crijevic Tubero's history of his times (1520). As you'll see, the list is more than just words. Every item is a link to a search in the CroALa collection -- not just to texts by Diversi and Tubero, but to all currently included texts (this could, of course, be fine-tuned).

  1. albertI
  2. albertUs
  3. alemanUs
  4. alexandrIE
  5. andreE
  6. aUstrIE
  7. bartholomeo
  8. blasII
  9. boemIE
  10. bosnenses
  11. carolUs
  12. chrIstIanorUm
  13. constantInopolIs
  14. contareno
  15. cremonensem
  16. dalmatIa
  17. dalmatIE
  18. epIdaUrI
  19. epIdaUrII
  20. epIdaUrUm
  21. francIscI
  22. francorUm
  23. hUngarIE
  24. IllyrIco
  25. ItalI
  26. ItalIcIs
  27. ItalIco
  28. IUlII
  29. lacromE
  30. laUrentII
  31. leonardUs
  32. marIE
  33. marIam
  34. martInI
  35. medIolanI
  36. mIchEl
  37. mIchElIs
  38. neapolI
  39. neapolItanE
  40. neapolItanam
  41. nIcolaI
  42. nIcolao
  43. nIcolaUm
  44. petrI
  45. petrUm
  46. posonIo
  47. rhacUsanE
  48. rhacUsanIs
  49. salomonIs
  50. sIcIlIE
  51. sIgIsmUndI
  52. sIgIsmUndo
  53. sIgIsmUndUm
  54. thomas
  55. UngarIE


The words look funny because PhiloLogic, the open-source text engine which searches and serves CroALa texts, uses special uppercase characters to find orthographical variants. "UngarIE" will find Vngariae and Ungarię and Ungarie and Vngarye (if there is such a form).
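
How such an uppercase character might expand can be illustrated with a regular expression. The mapping below is hypothetical, guessed only from the example above (U for u or v, I for i or y, E for ae, ę or e); the actual PhiloLogic rules may well differ. The names VARIANTS and variantRegex are made up for this sketch.

```javascript
// Hypothetical expansions, guessed from the "UngarIE" example only.
const VARIANTS = {
  U: "[uv]",
  I: "[iy]",
  E: "(?:ae|ę|e)",
};

// Build a case-insensitive regular expression for one search form.
function variantRegex(form) {
  const source = [...form]
    .map((ch) =>
      // Expand the special capitals; escape everything else literally.
      VARIANTS[ch] || ch.replace(/[.*+?^${}()|[\]\\]/g, "\\$&")
    )
    .join("");
  return new RegExp("^" + source + "$", "i");
}
```

Under these assumptions, variantRegex("UngarIE") accepts all four spellings mentioned above and rejects, for example, Hungarie.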

Looking at neo-Latin

Problem: we want to research strange words in neo-Latin texts.

Of course, that depends on what we consider to be "strange". This can mean:
a. Latin words which are rare or non-existent in classical Latin (mutatis mutandis, the language in which Romans wrote until c. 500 AD)
b. words which are strange to us
c. words which were strange to authors or their public
d. words which are in a neo-Latin text, but are not Latin

Let us here consider case a. It turns out to have several sub-problems of its own:
a.1. words which don't exist in Latin of the Romans (see the Neulateinische Wortliste by J. Ramminger)
a.2. words which are rare in all periods of Latin
a.3. words which are rare in Latin of the Romans, but frequent in later Latin (e. g. medieval), or in some Latin idioms (e. g. ecclesiastical Latin)
a.4. words which are rare in some genres, but frequent in others

Basically, there are two approaches to our research. We can start from our texts, examining their words and looking for them in different wordlists. Or we can start from lists, and see if our texts contain some of their words.

It all comes down to comparing lists. The longer the better.

But we need special lists. Here's a list (sic) of them:
a. a list of words in our text
b. a list of lemmata of words from our text
c. a list of words which are rare in classical Latin
d. a list of neo-Latin words
(e. and a list of frequent Latin words would also come in handy)

We also need some tools. If we want to go from a text to the lists, we'll need:
a. something to list all words in our text
b. a Latin parser (we feed it a form, and get back the lemma)
c. a way to communicate our words to the parser (and get back the results)
d. something which can compare two lists
e. something which can write out the results

If we go from a list to the text, we'll need:
a. a Latin stemmer (we'll look only for stems, and disregard the endings)
b. a regular expressions tool (to find a complete word, given a stem)
c. tools for comparing and storing the results, as above
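
The second step of the list-to-text route (finding complete words, given a stem) can be sketched with a regular expression. This is only an illustration: the stem is supplied by hand rather than by a real Latin stemmer, and the function name wordsWithStem is hypothetical.

```javascript
// Find all complete words in `text` that begin with `stem`.
function wordsWithStem(text, stem) {
  // Escape any regex metacharacters in the stem.
  const escaped = stem.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  // The stem must start a word (no letter before it);
  // the ending is whatever run of letters follows.
  const pattern = new RegExp("(?<!\\p{L})" + escaped + "\\p{L}*", "giu");
  return text.match(pattern) || [];
}
```

The matches could then be fed to the list comparison and stored, as in the first route.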