Thursday 3 January 2013

Text and its links

One of the things that can be done with a TEI XML text is transforming it into other formats, e.g. into HTML. (Another thing that can be done is including the text in a collection like CroALa, of course.)

An HTML edition can have links of its own. These links can be encoded in the XML, so that they "come alive" after the HTML transformation. What is needed is a good idea of what to link to.

There are two natural targets. If a text is a transcription of a source, such as a manuscript, and if the source is available on the internet, we can link to page images. And if the text contains quotations of or allusions to other texts (and those texts are available on the internet), we can link to these texts, or hypotexts, as Gérard Genette would call them.
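To make the idea of links "coming alive" concrete, here is a minimal sketch in Python. It is not the XSL transformation used for the editions; it only assumes that a link is encoded as a TEI `ref` element with a `target` attribute, and the sample paragraph and the example.org URL are invented for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # the TEI P5 namespace


def tei_refs_to_html(tei_fragment: str) -> str:
    """Rename TEI <p> and <ref target="..."> to HTML <p> and <a href="...">."""
    root = ET.fromstring(tei_fragment)
    for el in root.iter():
        if el.tag == f"{{{TEI_NS}}}ref":
            el.tag = "a"
            target = el.attrib.pop("target", None)
            if target is not None:
                el.set("href", target)
        elif el.tag == f"{{{TEI_NS}}}p":
            el.tag = "p"
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample: a paragraph pointing at an image of a page.
sample = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">See '
    '<ref target="https://example.org/page-001.jpg">the page image</ref>.</p>'
)
html = tei_refs_to_html(sample)
print(html)
```

A real edition has many more element types than `p` and `ref`, which is why a full stylesheet set is the more practical tool; the sketch only shows where the link information travels.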

This isn't quite as simple as it seems. What if the hypotext is the Bible, with many books, chapters and verses, and our text refers to a precise location in it? Over the centuries, philology has developed special techniques for just such references, and these techniques are migrating to the internet. Slowly, though; perhaps not surprisingly, the Bible (with services such as bib.ly) is again the first to apply them.

Linking to images and linking to sources are features of our working editions of Andrija Dudić's (Andreas Dudithius, 1533-1589) Latin translation of Dionysius of Halicarnassus' essay on Thucydides, and of a Latin letter written in 1418 by Juraj Jurjević (Georgius de Georgiis, Zadar c. 1400) to Giovanni Battista Bevilacqua.

The edition of Dudić's text refers to local images taken from the archive.org digital facsimile of a 1586 Frankfurt edition. The edition of Jurjević refers both to images of the manuscript (Munich, BSB, Clm 5350) and to one passage in Isaiah. We used a Vulgate edition prepared by the Perseus Project, because Perseus uses stable citation URIs (as developed by Canonical Text Services) for referring to segments of its texts.
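What makes such a citation stable is its regular shape. A CTS URN has the form `urn:cts:NAMESPACE:WORK:PASSAGE`, where the work part is a dotted hierarchy (text group, work, version) and the passage part is a dotted reference such as chapter and verse. The sketch below takes such a URN apart in Python; the URN in the example is made up for illustration and is not the actual Perseus identifier of the Vulgate.

```python
from typing import NamedTuple, Optional


class CtsUrn(NamedTuple):
    namespace: str
    work: str                # dotted hierarchy: textgroup.work.version
    passage: Optional[str]   # e.g. chapter.verse; None if a whole work is meant


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a CTS URN into namespace, work and (optional) passage."""
    parts = urn.split(":")
    if len(parts) not in (4, 5) or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN: {urn!r}")
    passage = parts[4] if len(parts) == 5 and parts[4] else None
    return CtsUrn(parts[2], parts[3], passage)


# Hypothetical URN, invented for this example.
urn = parse_cts_urn("urn:cts:exampleNs:group.work.edition:1.2")
print(urn.work.split("."))
print(urn.passage)
```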

Transforming the TEI XML to HTML required a slight modification to their set of XSL stylesheets. Technical information about this (written mostly for myself, as I keep forgetting it) is here (on the klafil dokuwiki).

Tuesday 1 January 2013

Filtering Latin words

For anyone speaking Croatian or one of a host of related languages, "filter" means first and foremost "cigarette filter". There is even a legendary song from the 1980s built around it.

However, in profiling Croatian Latin, filters are, more prosaically, ways to save time and resources. Once we have a sufficient set of lemmatized Latin words, we can avoid sending these words to the Morphology Service again.

Not one, but three filters are needed. From the list of forms contained in a Latin text (any that we intend to include in CroALa), we first filter out all forms that were previously lemmatized unambiguously. From the remaining set, we filter out what was previously recognized, but ambiguously. Finally, a filter is applied for words previously encountered, but not recognized by the Morphology Service.
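The three filters amount to successive set differences. Here is a minimal sketch in Python (not the Bash, Perl and MySQL mix actually used); the function name, the group names and the sample word lists are all invented for illustration.

```python
def apply_filters(forms, unambiguous, ambiguous, unrecognized):
    """Apply the three filters in order; return the hits of each and the rest."""
    remaining = set(forms)
    report = {}
    filters = (
        ("unambiguous", unambiguous),
        ("ambiguous", ambiguous),
        ("unrecognized", unrecognized),
    )
    for name, known in filters:
        hits = remaining & set(known)  # forms this filter already knows
        report[name] = hits
        remaining -= hits              # only the rest goes to the next filter
    report["to_send"] = remaining      # what still has to go to the service
    return report


# Made-up sample data.
forms = ["et", "amor", "canis", "xyzzy", "novum"]
report = apply_filters(
    forms,
    unambiguous={"et", "amor"},
    ambiguous={"canis"},
    unrecognized={"xyzzy"},
)
print(sorted(report["to_send"]))
```

Because every form lands in exactly one group, the sizes of the four groups always add up to the number of distinct forms in the text.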

What is left is ready for sending to the Morphology Service. The resulting JSON will again be sifted into three groups: the lemmatized words, the ambiguously lemmatized, and the unrecognized.

For example, a letter by Juraj Jurjević, a little-known nobleman from Zadar interned in Venice in 1418 (Zadar was definitively subjugated by Venice in 1409), consists of 755 words in 536 different forms. The filters separate these forms into 173 previously recognized, 95 previously ambiguously recognized, and 268 remaining (now I see that we could also have applied the filter for previously unrecognized words, but we didn't do it today).

So 268 Latin forms travelled across the globe to be processed by the Morphology Service on the first day of 2013. Of these forms, 180 were unambiguously lemmatized; there were 139 ambiguous identifications; and 29 forms were listed as forma non recognita. The total (180 + 139 + 29 = 348) exceeds 268, of course, because of the ambiguously identified forms: each of their lemmata gets a row of its own.
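Why the rows outnumber the forms can be seen in a small sketch of the sifting step in Python. The JSON layout here (one object per form and lemma pair, with `null` for an unrecognized form) is an assumption made for illustration; the real output of the Morphology Service may well be structured differently.

```python
import json
from collections import defaultdict


def sift(response_json):
    """Group (form, lemma) rows by form and sort the forms into three groups."""
    rows = json.loads(response_json)
    lemmata = defaultdict(set)
    for row in rows:
        found = lemmata[row["form"]]   # creates an entry even if no lemma follows
        if row["lemma"] is not None:
            found.add(row["lemma"])
    groups = {"unambiguous": [], "ambiguous": [], "unrecognized": []}
    for form, found in sorted(lemmata.items()):
        if not found:
            groups["unrecognized"].append(form)
        elif len(found) == 1:
            groups["unambiguous"].append(form)
        else:
            groups["ambiguous"].append(form)
    return rows, groups


# Invented sample response: three forms, four rows.
sample = json.dumps([
    {"form": "amor", "lemma": "amor"},
    {"form": "canis", "lemma": "canis"},
    {"form": "canis", "lemma": "cano"},
    {"form": "xyzzy", "lemma": None},
])
rows, groups = sift(sample)
print(len(rows), groups)
```

Here three forms produce four rows, because the single ambiguous form contributes one row per candidate lemma.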

Tomorrow I'll write up how all this was accomplished programmatically, in a mix of Bash, Perl, and MySQL.