Collecting, connecting, and sharing data on Croatian neo-Latin literature
Monday, 23 December 2013
Ancient Greek as a Unicode Character Class
\p{IsGreek}
Simple, great — and it took me half an hour to find. Is this programming or philology?
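For context, `\p{IsGreek}` is the Unicode property syntax of Perl and Java regexes. Python's standard `re` module has no `\p{…}` classes, but as a sketch the same character class can be approximated with explicit ranges for the Greek and Greek Extended blocks (the latter holds the polytonic, accented forms):

```python
import re

# \p{IsGreek} is unavailable in Python's stdlib re; approximate Ancient
# (polytonic) Greek with the Greek and Greek Extended Unicode blocks.
GREEK = re.compile(r'[\u0370-\u03FF\u1F00-\u1FFF]+')

text = "elephas culicem non curat: κώνωπος ἐλέφας Ἰνδὸς οὐκ ἀλεγίζει"
print(GREEK.findall(text))
# → ['κώνωπος', 'ἐλέφας', 'Ἰνδὸς', 'οὐκ', 'ἀλεγίζει']
```

This assumes the text is in NFC (precomposed) form; decomposed accents would need the combining-diacritics range as well.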
Saturday, 14 December 2013
Elephas culicem non curat
πσ. Ἱέρωνι. Πόλλα λέγειν ἔχων καὶ κατὰ σοῦ καὶ περὶ ἧς κατ' ἐμοῦ πεφλυάρηκας ἐν Λεοντίνοις δημοκοπίας οὐδὲν ἐρῶ περισσότερον πλὴν ὅτι κώνωπος ἐλέφας Ἰνδὸς οὐκ ἀλεγίζει.
Aretinus translated it into Latin, Venice 1492/1500? (at BSB Digital):
Hieroni. Qvom multa de te: et de concione quam contra me ad Leontinos stulte habuisti dicere possim: nolo tamen superfluis uti uerbis: nisi quod culicem elephas Indus non curat.
And thus Joannes Daniel van Lennep translated it, Groningen 1777 (at archive.org, p. 119):
XXIX. HIERONI. Cum multa habeam dicere et in te, et de concionibus, quas effutiuisti apud Leontinos, discordi plenas popularitate, nihil dicam amplius, nisi elephantem Indicum non curare culicem.
Friday, 23 August 2013
Querying CTS edition of Osmanides through Fuseki
select ?x
where {?x ?v """
This means: find a (URN) citation for the line containing this text.
The result is:
<urn:cts:croALa:croalaget003.croalaget001.izdleipzig:165>
This was a working query. Now, here is a meaningful query:
select ?s ?p ?o
where { <urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.249> ?p ?o .}
Meaning, more or less: return every RDF predicate and object where the RDF subject is the URN given. This would be book 10, line 249 of the Osmanides.
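Such a URN is itself structured: protocol, namespace, then a dotted work identifier (text group, work, version), then the passage. A toy Python splitter (not a full CTS URN parser; the field labels are my own) shows the anatomy:

```python
def parse_cts_urn(urn):
    """Split a CTS URN into its components (a sketch, not a full parser)."""
    _, _, namespace, work, passage = urn.split(":")
    textgroup, work_id, version = work.split(".")
    return {"namespace": namespace, "textgroup": textgroup,
            "work": work_id, "version": version, "passage": passage}

print(parse_cts_urn("urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.249"))
# → {'namespace': 'croALa', 'textgroup': 'croalaget003',
#    'work': 'croalaget001', 'version': 'izdleipzig', 'passage': '10.249'}
```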
A useful introduction to SPARQL can also be found in the Beginner's guide to RDF, part 6: Querying with SPARQL. For example, on using prefixes in queries (here, with the CTS namespace):
PREFIX cts: <http://www.homermultitext.org/cts/rdf/>
select ?s ?p ?o
where { <urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.250> cts:hasTextContent ?o .}
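To run such a query programmatically, it can be POSTed to Fuseki's SPARQL endpoint over HTTP. A sketch with Python's stdlib urllib; the endpoint URL and the dataset name "croala" are assumptions to be adjusted to the actual installation:

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed local Fuseki endpoint; the dataset name "croala" is made up.
ENDPOINT = "http://localhost:3030/croala/sparql"

def text_content_query(urn):
    """Build a SPARQL query for the text content of one CTS URN."""
    return ("PREFIX cts: <http://www.homermultitext.org/cts/rdf/>\n"
            "select ?o\n"
            "where { <%s> cts:hasTextContent ?o . }" % urn)

query = text_content_query("urn:cts:croALa:croalaget003.croalaget001.izdleipzig:10.250")
req = Request(ENDPOINT,
              data=urlencode({"query": query}).encode(),
              headers={"Accept": "application/sparql-results+json"})
# answer = urlopen(req).read()   # needs a running Fuseki instance
print(query)
```

With the JSON results format requested, the answer can then be read with the standard json module.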
Thursday, 3 January 2013
Text and its links
An HTML edition can have links of its own. These links can be encoded in the XML, so that they "come alive" after the HTML transformation. What is needed is a good idea of what to link to.
There are two natural locations. If a text is a transcription of a source, such as a manuscript, and if the source is present on the internet, we can link to page images. And, if a text contains quotations of or allusions to other texts (and those texts are present on the internet), we can link to these texts — or hypotexts, as Gérard Genette would call them.
This isn't quite as simple as it seems. What if the hypotext is the Bible, with its many books, chapters, and verses, and our text refers to a precise location in it? Over the centuries, philology developed special techniques just for such acts of reference, and these techniques are now migrating to the internet. Slowly, though; perhaps not surprisingly, the Bible — with services such as bib.ly — is again the first to apply them.
Linking to images and linking to sources are features of our working editions of Andrija Dudić's (Andreas Dudithius, 1533-1589) Latin translation of Dionysius of Halicarnassus' essay on Thucydides, and of a Latin letter written in 1418 by Juraj Jurjević (Georgius de Georgiis, Zadar c. 1400) to Giovanni Battista Bevilacqua.
The edition of Dudić's text refers to local images taken from the archive.org digital facsimile of a 1586 Frankfurt edition. The edition of Jurjević's letter refers both to images of the manuscript (Munich, BSB, Clm 5350) and to (one) passage in Isaiah. We used a Vulgate edition prepared by the Perseus Project, because Perseus uses stable citation URIs (as developed by Canonical Text Services) for referring to segments of its texts.
Transforming the TEI XML to HTML required a slight modification to their set of XSL stylesheets. Technical information about this (written mostly for myself, as I keep forgetting it) is here (on the klafil dokuwiki).
Tuesday, 1 January 2013
Filtering Latin words
For anyone speaking Croatian or a host of related languages, "filter" means first and foremost "cigarette filter". There is even a legendary song from the 1980s built around it.
However, in profiling Croatian Latin, filters are, more prosaically, a way to save time and resources. Once we have a sufficiently large set of lemmatized Latin words, we can avoid sending these words to the Morphology Service again.
Not one, but three filters are needed. From the list of forms contained in a Latin text (any text we intend to include in CroALa), the first filter removes all previously unambiguously lemmatized forms. From the remaining set, the second filters out what was previously recognized, but ambiguously. Finally, the third filter removes words previously encountered, but not recognized by the Morphology Service.
What is left is ready for sending to the Morphology Service. The resulting JSON will again be sifted into three groups: the lemmatized words, the ambiguously lemmatized, and the unrecognized.
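The three filters can be sketched with Python sets. A minimal sketch, under the assumption that the previously processed forms are kept in three stores (here hard-coded toy sets; in practice they would be loaded from a database), since subtracting their union is equivalent to applying the three filters in sequence:

```python
# Toy sets of forms already processed in earlier runs (contents made up;
# in practice these would be loaded from persistent storage).
lemmatized_before = {"cum", "et", "in"}
ambiguous_before = {"quas"}
unrecognized_before = {"effutiuisti"}

def filter_forms(forms):
    """Keep only forms never seen before; only these need the Morphology Service."""
    seen = lemmatized_before | ambiguous_before | unrecognized_before
    return sorted(set(forms) - seen)

forms = ["cum", "quas", "effutiuisti", "elephantem", "culicem", "cum"]
print(filter_forms(forms))  # → ['culicem', 'elephantem']
```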
For example: a letter by Juraj Jurjević, a little-known nobleman from Zadar interned in Venice in 1418 (Zadar was definitively subjugated by Venice in 1409), consists of 755 words in 536 different forms. The filters separate these forms into 173 previously recognized, 95 previously ambiguously recognized, and 268 remaining (now I see that we could also have applied the filter for previously unrecognized words, but we didn't do that today).
So 268 Latin forms travelled across the globe to be processed by the Morphology Service on the first day of 2013. Of these forms, 180 were unambiguously lemmatized, 29 were listed as forma non recognita, and the rest yielded 139 ambiguous identifications. The totals exceed 268, of course, because of the ambiguously identified forms — each of their lemmata gets a row of its own.
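The sifting of the returned JSON into the three groups can be sketched as follows. The shape of the response is an assumption (the actual Morphology Service output differs); the point is only the three-way split on the number of candidate lemmata per form:

```python
import json

# Hypothetical shape of a Morphology Service response: one entry per
# form, each with a list of candidate lemmata (field names made up).
response = json.loads("""[
  {"form": "culicem",     "lemmata": ["culex"]},
  {"form": "dicere",      "lemmata": ["dico", "dicis"]},
  {"form": "effutiuisti", "lemmata": []}
]""")

groups = {"lemmatized": [], "ambiguous": [], "unrecognized": []}
for entry in response:
    n = len(entry["lemmata"])
    key = "unrecognized" if n == 0 else ("lemmatized" if n == 1 else "ambiguous")
    groups[key].append(entry["form"])

print(groups)
# → {'lemmatized': ['culicem'], 'ambiguous': ['dicere'],
#    'unrecognized': ['effutiuisti']}
```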
Tomorrow I'll write up how all this was accomplished programmatically, in a mix of Bash, Perl, and MySQL.