Monday, 23 December 2013

Ancient Greek as a Unicode Character Class

In the oXygen XML editor, when we want to search for any Greek character (using Perl-style character classes), we can do it with this regular expression:
\p{IsGreek}
Simple, great — and it took me half an hour to find. Is this programming or philology?
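As an aside, environments without \p{IsGreek} (Python's built-in re module, for instance) can approximate the Greek script with explicit Unicode block ranges. A minimal sketch, using the Greek and Coptic block plus the Greek Extended block (an approximation of the Greek script, not an exact equivalent):

```python
import re

# oXygen (Java regex) supports \p{IsGreek} directly; Python's built-in
# re module does not, but Greek can be approximated by its main blocks:
# Greek and Coptic (U+0370-03FF) and Greek Extended (U+1F00-1FFF, the
# polytonic accents and breathings).
GREEK = re.compile(r'[\u0370-\u03FF\u1F00-\u1FFF]')

print(bool(GREEK.search('λόγος')))     # monotonic Greek word
print(bool(GREEK.search('logos')))     # Latin transliteration
print(bool(GREEK.search('ἀλεγίζει')))  # polytonic Greek
```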

Saturday, 14 December 2013

Elephas culicem non curat

Letter LXXXVI of Phalaris, Greek text ed. Hercher 1873, published by Didot (also at Heml Lace):
πσ. Ἱέρωνι. Πόλλα λέγειν ἔχων καὶ κατὰ σοῦ καὶ περὶ ἧς κατ' ἐμοῦ πεφλυάρηκας ἐν Λεοντίνοις δημοκοπίας οὐδὲν ἐρῶ περισσότερον πλὴν ὅτι κώνωπος ἐλέφας Ἰνδὸς οὐκ ἀλεγίζει.
Latin translation by Aretinus, Venice 1492/1500? (at BSB Digital):
Hieroni. Qvom multa de te: et de concione quam contra me ad Leontinos stulte habuisti dicere possim: nolo tamen superfluis uti uerbis: nisi quod culicem elephas Indus non curat.
And here is the translation of Joannes Daniel van Lennep, Groningen 1777 (at archive.org, p. 119):
XXIX. HIERONI. Cum multa habeam dicere et in te, et de concionibus, quas effutiuisti apud Leontinos, discordi plenas popularitate, nihil dicam amplius, nisi elephantem Indicum non curare culicem.

Friday, 23 August 2013

Querying CTS edition of Osmanides through Fuseki

Once we have a CTS instance containing an XML edition, e.g. of Vlaho Getaldić's Osmanides, up and running, we can query it through the Fuseki server with SPARQL queries such as this one:

select ?x where {?x ?v """Fortis commisit, victorque evasit ab hoste. 165"""}

This means: find a (URN) citation for the line containing this text.

The result is:

<urn:cts:croALa:croalaget003.croalaget001.izdleipzig:165>

This was a working query. Now, here is a meaningful query:

select ?s ?p ?o where { <urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.249> ?p ?o .}

Meaning, more or less: return the RDF predicate and object for every triple whose subject is the named URN. This would be book 10, line 249 of the Osmanides.

A useful introduction to SPARQL can also be found in the Beginner's guide to RDF: 6. Querying with SPARQL. For example, a query using the CTS namespace:

PREFIX cts: <http://www.homermultitext.org/cts/rdf/>
select ?s ?p ?o where {
    <urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.250> cts:hasTextContent ?o .
}
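Such queries can also be sent to Fuseki programmatically over HTTP. A minimal sketch, assuming a local Fuseki instance with a dataset named osmanides (the dataset name and the helper functions are my own assumptions, not part of the CroALa setup); the parsing step is exercised here on a canned response in the standard SPARQL 1.1 JSON results shape:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical dataset name; Fuseki serves SPARQL queries at
# http://<host>:3030/<dataset>/query
ENDPOINT = 'http://localhost:3030/osmanides/query'

def build_request(query):
    """Build a GET request for a Fuseki SPARQL endpoint, asking for
    the SPARQL 1.1 JSON results format."""
    url = ENDPOINT + '?' + urllib.parse.urlencode({'query': query})
    return urllib.request.Request(
        url, headers={'Accept': 'application/sparql-results+json'})

def text_content(results_json):
    """Pull the values bound to ?o out of a SPARQL JSON result set."""
    data = json.loads(results_json)
    return [row['o']['value'] for row in data['results']['bindings']]

query = ('PREFIX cts: <http://www.homermultitext.org/cts/rdf/> '
         'select ?o where { '
         '<urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.250> '
         'cts:hasTextContent ?o .}')

req = build_request(query)
# urllib.request.urlopen(req) would return the JSON results when a
# Fuseki instance is actually running; here we only parse a canned
# response of the same shape.
canned = '{"results": {"bindings": [{"o": {"value": "line text here"}}]}}'
print(text_content(canned))
```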

Thursday, 3 January 2013

Text and its links

One of the things that can be done with a TEI XML text is transforming it into other formats, e.g. into HTML. (Another thing that can be done is including the text in a collection like CroALa, of course.)

An HTML edition can have links of its own. These links can be encoded in the XML, so that they "come alive" after the HTML transformation. What is needed is a clear idea of what to link to.

There are two natural targets. If a text is a transcription of a source, such as a manuscript, and the source is present on the internet, we can link to page images. And if the text contains quotations of or allusions to other texts (and those texts are present on the internet), we can link to these texts — or hypotexts, as Gérard Genette would call them.

This isn't quite as simple as it seems. What if the hypotext is the Bible, with its many books, chapters, and verses, and our text refers to a precise location in it? Over the centuries, philology developed special techniques for just such acts of reference, and these techniques are migrating to the internet. Slowly, though; perhaps not surprisingly, the Bible — with services such as bib.ly — is again the first to apply them.

Linking to images and linking to sources are features of our working editions of Andrija Dudić's (Andreas Dudithius, 1533-1589) Latin translation of Dionysius of Halicarnassus' essay on Thucydides, and of a Latin letter written in 1418 by Juraj Jurjević (Georgius de Georgiis, Zadar c. 1400) to Giovanni Battista Bevilacqua.

The edition of Dudić's text refers to local images taken from the archive.org digital facsimile of a 1586 Frankfurt edition. The edition of Jurjević's letter refers both to images of the manuscript (Munich, BSB, Clm 5350) and to (one) passage in Isaiah. We used a Vulgate edition prepared by the Perseus Project, because Perseus uses stable citation URIs (as developed by the Canonical Text Services) for referring to segments of its texts.
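Such CTS URNs are easy to take apart programmatically, which is part of what makes them good link targets. A minimal sketch of splitting one into namespace, work hierarchy, and passage (the field names are my own labels, not an official API), using a CroALa citation as the example:

```python
# A CTS URN has the shape
#   urn:cts:<namespace>:<work hierarchy>:<passage>
def parse_cts_urn(urn):
    parts = urn.split(':')
    if parts[:2] != ['urn', 'cts'] or len(parts) < 4:
        raise ValueError('not a CTS URN: ' + urn)
    namespace, work = parts[2], parts[3]
    passage = parts[4] if len(parts) > 4 else None
    return {'namespace': namespace,
            'work': work.split('.'),       # textgroup, work, version
            'passage': passage}            # e.g. book.line

urn = 'urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.249'
print(parse_cts_urn(urn))
```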

Transforming the TEI XML to HTML required slight modifications to their set of XSL stylesheets. Technical information about this (written mostly for myself, as I forget it again and again) is here (on the klafil dokuwiki).

Tuesday, 1 January 2013

Filtering Latin words

For anyone speaking Croatian or a host of related languages, "filter" means first and foremost "cigarette filter". There is even a legendary song from the 1980s built around it.

However, in profiling Croatian Latin, filters are, more prosaically, ways to save time and resources. Once we have a sufficiently large set of lemmatized Latin words, we can avoid sending those words to the Morphology Service again.

Not one but three filters are needed. From the list of forms contained in a Latin text (any text we intend to include in CroALa), the first filter removes all forms previously lemmatized unambiguously. From the remaining set, the second removes forms previously recognized, but ambiguously. Finally, the third removes words previously encountered but not recognized by the Morphology Service.

What is left is ready to be sent to the Morphology Service. The resulting JSON will again be sifted into three groups: the unambiguously lemmatized words, the ambiguously lemmatized, and the unrecognized.
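The three filters are, at bottom, set subtraction. A minimal sketch in Python, with toy data standing in for the database of earlier Morphology Service results (all names and sample words here are my own, for illustration only):

```python
# Sets of forms already seen in earlier Morphology Service runs.
lemmatized = {'et', 'in', 'non'}      # lemmatized unambiguously before
ambiguous = {'cum', 'quam'}           # recognized before, but ambiguously
unrecognized = {'zadra'}              # seen before, not recognized

def filter_forms(forms):
    """Split a text's forms into what is already known and what still
    has to be sent to the Morphology Service."""
    forms = set(forms)
    remaining = forms - lemmatized - ambiguous - unrecognized
    return {
        'already_lemmatized': forms & lemmatized,
        'already_ambiguous': forms & ambiguous,
        'already_unrecognized': forms & unrecognized,
        'to_send': remaining,
    }

result = filter_forms(['et', 'cum', 'zadra', 'elephas', 'culicem'])
print(sorted(result['to_send']))  # → ['culicem', 'elephas']
```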

For example: a letter by Juraj Jurjević, a little-known nobleman from Zadar interned in Venice in 1418 (Zadar was definitively subjugated by Venice in 1409), consists of 755 words in 536 different forms. The filters separate these forms into 173 previously recognized, 95 previously ambiguously recognized, and 268 remaining (now I see that we could also have applied the filter for previously unrecognized words, but we didn't do it today).

So 268 Latin forms travelled across the globe to be processed by the Morphology Service on the first day of 2013. Of these, 180 were unambiguously lemmatized, 139 identifications were ambiguous, and 29 forms were listed as forma non recognita. The total exceeds 268, of course, because of the ambiguously identified forms — each of their lemmata gets a row of its own.

Tomorrow I'll write up how all this was accomplished programmatically, in a mix of Bash, Perl, and MySQL.