Showing posts with label croala. Show all posts

Thursday, 3 January 2013

Text and its links

One of the things that can be done with a TEI XML text is transforming it into other formats, e.g. into HTML. (Another thing that can be done is including the text in a collection like CroALa, of course.)

An HTML edition can have links of its own. These links can be encoded in XML, so that they "come alive" after the HTML transformation. What is needed is a good idea of what to link to.

There are two natural locations. If a text is a transcription of a source, such as a manuscript, and if the source is present on the internet, we can link to page images. And, if the text contains quotations of or allusions to other texts (and the texts are present on the internet), we can link to these texts, or hypotexts, as Gérard Genette would call them.

This isn't quite as simple as it seems. What if the hypotext is the Bible, with many books, chapters and verses, and our text refers to a precise location in it? Over the centuries, philology developed special techniques just for such acts of reference, and the techniques are migrating to the internet. Slowly, though; perhaps not surprisingly, the Bible (with services such as bib.ly) is again the first to apply them.

Linking to images and linking to sources are features of our working editions of Andrija Dudić's (Andreas Dudithius, 1533-1589) Latin translation of Dionysius of Halicarnassus' essay on Thucydides, and of a Latin letter written in 1418 by Juraj Jurjević (Georgius de Georgiis, Zadar c. 1400) to Giovanni Battista Bevilacqua.

The edition of Dudić's text refers to local images taken from the archive.org digital facsimile of a 1586 Frankfurt edition. The edition of Jurjević refers both to images of the manuscript (Munich, BSB, Clm 5350) and to (one) passage in Isaiah. We used a Vulgate edition prepared by the Perseus Project, because Perseus uses stable Citation URIs (as developed by Canonical Text Services) for referring to segments of their texts.
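
To make the idea of links "coming alive" a bit more concrete, here is a minimal Python sketch, not the XSL stylesheets we actually used. It assumes that page images are recorded in the facs attribute of TEI pb elements and that quotations carry a ref element whose target is a CTS URN; the file name, the URN and the resolver address are placeholders invented for the illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical fragment: the image path and the URN are placeholders,
# not values taken from the Dudić or Jurjević editions.
SAMPLE = f"""<p xmlns="{TEI_NS}"><pb n="1r" facs="images/page-1r.jpg"/>Some text with
<ref target="urn:cts:example:group.work.version:1.1">a quotation</ref>.</p>"""


def collect_links(tei_fragment, resolver_base):
    """Return (label, href) pairs for page images and hypotext references."""
    root = ET.fromstring(tei_fragment)
    links = []
    for el in root.iter():
        if el.tag == f"{{{TEI_NS}}}pb" and el.get("facs"):
            # A page break with a facsimile becomes a link to the image.
            links.append(("page " + el.get("n", "?"), el.get("facs")))
        elif el.tag == f"{{{TEI_NS}}}ref" and el.get("target"):
            target = el.get("target")
            if target.startswith("urn:cts:"):
                # A CTS URN is turned into a web address by prefixing a resolver.
                target = resolver_base + target
            links.append(("".join(el.itertext()).strip(), target))
    return links


if __name__ == "__main__":
    # "https://resolver.example/" is an assumed, non-existent resolver address.
    for label, href in collect_links(SAMPLE, "https://resolver.example/"):
        print(f'<a href="{href}">{label}</a>')
```

The point of the sketch is only that both kinds of link live in the XML as plain attributes, so any transformation can turn them into HTML anchors.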

Transforming the TEI XML to HTML required a slight modification of their set of XSL stylesheets. Technical information about this (written mostly for myself, as I forget it again and again) is here (on klafil dokuwiki).

Tuesday, 1 January 2013

Filtering Latin words

For anyone speaking Croatian or a host of related languages, "filter" means first and foremost "cigarette filter". There is even a legendary song from the 1980s built around it.

However, in profiling Croatian Latin, filters are, more prosaically, ways to save time and resources. Once we have a sufficient set of lemmatized Latin words, we can avoid sending these words to the Morphology Service again.

Not one, but three filters are needed. From a list of forms contained in a Latin text (any that we intend to include in CroALa), we first filter out all forms that were previously lemmatized unambiguously. From the remaining set, we filter out what was previously recognized, but ambiguously. Finally, a filter is applied to words previously encountered, but not recognized by the Morphology Service.
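
The three filters amount to successive set differences. The following Python sketch shows the logic under that reading; it is not the Bash, Perl and MySQL code actually used, and the function name and the toy word lists are made up for the illustration.

```python
def apply_filters(forms, unambiguous, ambiguous, unrecognized):
    """Split the forms of a text into four groups.

    The three sets hold forms seen in earlier runs: lemmatized
    unambiguously, lemmatized ambiguously, and not recognized at all.
    Whatever survives all three filters still has to be sent out.
    """
    remaining = set(forms)

    known = remaining & unambiguous            # filter 1
    remaining -= known

    known_ambiguous = remaining & ambiguous    # filter 2
    remaining -= known_ambiguous

    known_unrecognized = remaining & unrecognized  # filter 3
    remaining -= known_unrecognized

    return known, known_ambiguous, known_unrecognized, remaining


if __name__ == "__main__":
    # Toy data, invented for the example.
    groups = apply_filters(
        ["et", "cum", "regis", "zadar", "novum"],
        unambiguous={"et"},
        ambiguous={"cum", "regis"},
        unrecognized={"zadar"},
    )
    for name, group in zip(("known", "ambiguous", "unrecognized", "to send"), groups):
        print(name, sorted(group))
```

Because each filter works only on what the previous one left over, a form can never land in two groups, even if the stored lists happen to overlap.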

What is left is ready for sending to the Morphology Service. The resulting JSON will again be sifted into three groups: the lemmatized words, the ambiguously lemmatized, and the unrecognized.

E.g., a letter by Juraj Jurjević, a little-known nobleman from Zadar interned in Venice in 1418 (Zadar was definitively subjugated by Venice in 1409), consists of 755 words in 536 different forms. The filters separate these forms into 173 previously recognized, 95 previously ambiguously recognized, and 268 remaining (now I see that we could have applied the filter for previously unrecognized words, but we didn't do it today).

So 268 Latin forms travelled across the globe to be processed by the Morphology Service on the first day of 2013. Of these forms, 180 were unambiguously lemmatized; there were 139 ambiguous identifications; and 29 forms were listed as forma non recognita. The total score exceeds 268, of course, because of ambiguously identified forms — each of their lemmata gets a row of its own.
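
The bookkeeping can be checked in a few lines of Python. The figures are the ones quoted above; the number of distinct ambiguous forms is not reported, so the sketch derives it on the assumption that every form sent falls into exactly one of the three groups.

```python
# Figures quoted in the post.
forms_in_letter = 536
previously_unambiguous = 173
previously_ambiguous = 95

# What the first two filters leave over.
sent = forms_in_letter - previously_unambiguous - previously_ambiguous

rows_unambiguous = 180   # one row per form
rows_ambiguous = 139     # one row per candidate lemma
rows_unrecognized = 29   # one row per form
total_rows = rows_unambiguous + rows_ambiguous + rows_unrecognized

# Assumption: the three groups are disjoint and cover all forms sent.
ambiguous_forms = sent - rows_unambiguous - rows_unrecognized
lemmata_per_ambiguous_form = rows_ambiguous / ambiguous_forms

if __name__ == "__main__":
    print("forms sent:", sent)
    print("rows returned:", total_rows)
    print("distinct ambiguous forms (derived):", ambiguous_forms)
    print("average lemmata per ambiguous form:", round(lemmata_per_ambiguous_form, 2))
```

Under that assumption the surplus of rows over forms is entirely due to the ambiguous forms, each of which contributes more than two rows on average.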

Tomorrow I'll write up how all this was accomplished programmatically, in a mix of Bash, Perl, and MySQL.

Saturday, 29 December 2012

Structuring the Mercurius Croaticus

Mercurius Croaticus is currently a set of TEI XML files containing bibliographical records. These files will be served and made accessible via the BaseX XML database, once we decide on how to present the records; our working premise is that, if researchers want to discover something really interesting, they'll be willing and ready to learn XQuery (not least because it's a powerful tool for more than one research project). And Mercurius Croaticus will help them learn.

To get a clear idea of what exactly is in the bibliographic collection, and to avoid confusion, we organised the files in three sets of folders, following the FRBR concept. There is a folder for authors (auctores), a folder for works (opera), and one for "manifestations" (it is interesting that a term for the third category is not readily available in any language I know); "manifestations" has two subfolders: manuscripts (MS) and printed books (typis edita). Obviously, an "internet" subfolder could also be included.
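
For orientation, the layout can be written down as a short Python sketch that creates the folders under any root directory. The folder names are the ones mentioned above; the helper function and the use of a temporary directory are only for illustration.

```python
from pathlib import Path
import tempfile

# Folder names as given in the post; "manifestations" has two subfolders.
LAYOUT = [
    "auctores",
    "opera",
    "manifestations/MS",
    "manifestations/typis edita",
]


def create_layout(root):
    """Create the folder skeleton under root and return the created paths."""
    root = Path(root)
    created = []
    for relative in LAYOUT:
        folder = root / relative
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for folder in create_layout(tmp):
            print(folder.relative_to(tmp))
```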

The folder auctores contains our starting prosopography, a set of 244 personal records for Croatian Latin authors included in the Leksikon hrvatskih pisaca (A Lexicon of Croatian Writers, Zagreb 2000), and the "additions" file, currently containing 70 more neo-Latin authors of Croatian origin.

The main part of opera is also culled from the Leksikon hrvatskih pisaca (1784 items are listed), and there are 18 additional items as well.

The manifestations/MS subfolder has one special collection, with excerpts from Paul Oskar Kristeller's Iter Italicum (no, we don't have the money to subscribe to Brill's internet edition); it was excerpted by Darko Novaković, who kindly lent his notes to Mercurius for TEI XML conversion. And there are records collected from other sources, more or less obiter.

Finally, there is the manifestations/typis edita subfolder, where the basis is the bibliography made by Šime Jurić (1915–2004) in the late 1960s: Iugoslaviae scriptores Latini recentioris aetatis (that is, its "Pars I. Opera scriptorum Latinorum natione Croatarum usque ad annum MDCCCXLVIII typis edita. Bibliographiae fundamenta. t. 1. Index alphabeticus. t. 2. Index systematicus. Additamentum I."), later improved by the Croatian National and University Library, and encoded by the Croatiae auctores Latini project. This collection contains 5867 bibliographic records on printed Latin publications with works by Croatian authors. Jurić's bibliography breaks off with the year 1850. Mercurius Croaticus should urgently supplement it with data on later publications, all the way up to the present.

Friday, 7 December 2012

Profiling cultural literacy of Croatian Latin writers

A paper to be presented at the Latin, National Identity and the Language Question in Central Europe conference, organised by the Ludwig Boltzmann Institute for Neo-Latin Studies in Innsbruck (12-15 Dec 2012), will apply E. D. Hirsch's concept of cultural literacy (which is actually the German Allgemeinbildung or Bildungsgut, see it at work in these books, and "opća kultura" in Croatian) to the intellectual horizons of Croatian neo-Latin writers, as represented in the Croatiae auctores Latini collection.

Judging from the programme, the conference offers a chance to present digital research to an audience working mostly in "traditional" ways. This is quite an opportunity; too often digital humanities get separated in a room of their own, where they don't get in anybody's way. So, the challenge is to persuade colleagues that a large-scale search of CroALa using e. g. common terms from CAMENA TERMINI can lead to something interesting.

Update, post-conference, 19/12/2012:

My presentation on cultural literacy is here (note to self: never again try to do a presentation in a browser — outdated browser versions always turn up in crucial moments in crucial spots).

The paper itself is here.

Thursday, 5 January 2012

Quantification

This morning we had to compile some numbers on the Croatiae auctores Latini collection. Here they are (also on the CroALa developer's blog):

  • 143 TEI XML files (including, alas, some duplicates)

  • 437,218 words

  • 29,637,450 characters

  • 16,465.25 Textkarten

  • 1029 Druckbogen


The last two, rather strange, categories belong to the German printing tradition, which was influential in the Croatian printing industry; we translated these terms (Textkarte = kartica teksta, Druckbogen = tiskarski arak), and still use them in text accounting.
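
The conversion itself is plain division. The sketch below assumes 1800 characters per Textkarte and 16 Textkarten per Druckbogen; the post does not state these ratios, but they reproduce the figures in the list exactly.

```python
# Assumed conversion ratios (not stated in the post, but consistent with its figures).
CHARS_PER_TEXTKARTE = 1800
TEXTKARTEN_PER_DRUCKBOGEN = 16

characters = 29_637_450  # character count reported for the collection

textkarten = characters / CHARS_PER_TEXTKARTE
druckbogen = textkarten / TEXTKARTEN_PER_DRUCKBOGEN

if __name__ == "__main__":
    print(f"{textkarten:.2f} Textkarten")
    print(f"{druckbogen:.0f} Druckbogen")
```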

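[Technical note.] The numbers were produced by running the Linux wc command (cf. recipe) on all XML files currently in CroALa, which are also available on its Sourceforge page. The one-liner for counting characters in multiple XML files was simple: for each file, and for the total, it subtracts the line count (wc's first column) from the byte count (third column), so that newlines are not counted. The word count is wc's second column. Note that wc counts bytes by default, which equal characters only in single-byte text, and that the XML markup is counted along with the text.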

wc *.xml | awk '{print $3-$1}'
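
For those who prefer to see the same arithmetic spelled out, here is a rough Python equivalent of the one-liner. It is a sketch that mimics wc's default columns (lines, words, bytes); the function names are made up.

```python
from pathlib import Path


def wc(path):
    """Return (lines, words, bytes) for one file, like wc's default output."""
    data = Path(path).read_bytes()
    lines = data.count(b"\n")   # wc counts newline characters
    words = len(data.split())   # runs of non-whitespace bytes
    return lines, words, len(data)


def chars_without_newlines(paths):
    """Sum of (bytes - lines) over all files, i.e. awk's $3-$1 for the total."""
    total = 0
    for path in paths:
        lines, _words, nbytes = wc(path)
        total += nbytes - lines
    return total


if __name__ == "__main__":
    xml_files = sorted(Path(".").glob("*.xml"))
    print("words:", sum(wc(p)[1] for p in xml_files))
    print("characters (bytes minus newlines):", chars_without_newlines(xml_files))
```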

Monday, 2 January 2012

A list of names

Once we start thinking about lists and tinkering with them (and I've been doing this for a long time now), it turns out that another interesting list to compile would be a list of names from a text. Then, if we cross-reference two texts, we can look for names which occur in both.
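
The cross-referencing is a set intersection. In the Python sketch below a "name" is naively taken to be any capitalized word, which is an assumption made for the illustration and not necessarily how our list was produced; the two sample sentences are invented.

```python
import re


def capitalized_words(text):
    """Return the lowercased set of words that start with an uppercase letter."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters, Unicode-aware
    return {w.lower() for w in words if w[0].isupper()}


def common_names(text_a, text_b):
    """Names (capitalized words) that occur in both texts, sorted."""
    return sorted(capitalized_words(text_a) & capitalized_words(text_b))


if __name__ == "__main__":
    # Invented sample sentences, not quotations from Diversi or Tubero.
    a = "Sigismundus rex Ungarie Petrum misit."
    b = "Legati Petrum ad regem Ungarie duxerunt."
    print(common_names(a, b))
```

Sentence-initial words are capitalized too, so a real list needs some weeding; the intersection removes many of them, since a false positive has to occur capitalized in both texts to survive.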

Here is such a list of names common to both F. de Diversi's description of Dubrovnik (1440) and A. Crijević Tubero's history of his times (1520). As you'll see, the list is more than just words. Every item is a link to a search in the CroALa collection, not just in texts by Diversi and Tubero, but in all currently included texts (this could, of course, be fine-tuned).

  1. albertI
  2. albertUs
  3. alemanUs
  4. alexandrIE
  5. andreE
  6. aUstrIE
  7. bartholomeo
  8. blasII
  9. boemIE
  10. bosnenses
  11. carolUs
  12. chrIstIanorUm
  13. constantInopolIs
  14. contareno
  15. cremonensem
  16. dalmatIa
  17. dalmatIE
  18. epIdaUrI
  19. epIdaUrII
  20. epIdaUrUm
  21. francIscI
  22. francorUm
  23. hUngarIE
  24. IllyrIco
  25. ItalI
  26. ItalIcIs
  27. ItalIco
  28. IUlII
  29. lacromE
  30. laUrentII
  31. leonardUs
  32. marIE
  33. marIam
  34. martInI
  35. medIolanI
  36. mIchEl
  37. mIchElIs
  38. neapolI
  39. neapolItanE
  40. neapolItanam
  41. nIcolaI
  42. nIcolao
  43. nIcolaUm
  44. petrI
  45. petrUm
  46. posonIo
  47. rhacUsanE
  48. rhacUsanIs
  49. salomonIs
  50. sIcIlIE
  51. sIgIsmUndI
  52. sIgIsmUndo
  53. sIgIsmUndUm
  54. thomas
  55. UngarIE


The words look funny because Philologic, the open-source text engine which searches and serves CroALa texts, uses special uppercase characters to find orthographical variants. "UngarIE" will find Vngariae and Ungarię and Ungarie and Vngarye (if there is such a form).
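
The effect can be imitated with a regular expression. The Python sketch below is not Philologic's implementation; the variant table is an assumption reconstructed from the "UngarIE" example alone (U for u/v, I for i/y, E for e/ae/ę).

```python
import re

# Assumed variant classes, inferred from the example above, not Philologic's own table.
VARIANTS = {
    "U": "[uv]",
    "I": "[iy]",
    "E": "(?:ae|ę|e)",
}


def variant_pattern(key):
    """Turn a search key such as 'UngarIE' into a case-insensitive regex."""
    parts = [VARIANTS.get(ch, re.escape(ch)) for ch in key]
    return re.compile("".join(parts), re.IGNORECASE)


if __name__ == "__main__":
    pattern = variant_pattern("UngarIE")
    for form in ["Vngariae", "Ungarię", "Ungarie", "Vngarye", "Ungaria"]:
        print(form, bool(pattern.fullmatch(form)))
```

Lowercase letters in the key match only themselves, so a key without special uppercase characters, like "thomas", finds just that spelling (in either case).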