Tuesday 1 November 2011

The more, the merrier

Today's experiments with lemmatizing a neo-Latin word list (from Ludovik Crijević Tuberon's Commentarii) seem to show that repeated passes through the Archimedes Project lemmatizer give better results.

Perhaps the lemmatizer has some kind of limit; Tuberon's word list had ca. 20,000 forms.

Anyway, now we have a Bash script that can do any number of passes. And the final list of "strange" (i.e. not lemmatized, because not recognized) words is here: [X].
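The idea of repeated passes can be sketched roughly as follows. This is a Python illustration, not the Bash script itself. The `lemmatize` callable and the `toy_lemmatizer` stand-in are hypothetical; the toy simply imitates a service that looks at only part of its input, which is one guess at why extra passes might help. Nothing here reflects the actual interface of the Archimedes Project lemmatizer.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical interface: a lemmatizer takes a list of forms and returns
# a dict mapping each form it recognized to its lemma.
Lemmatizer = Callable[[List[str]], Dict[str, str]]


def lemmatize_in_passes(
    forms: Iterable[str], lemmatize: Lemmatizer, max_passes: int = 5
) -> Tuple[Dict[str, str], List[str]]:
    """Feed unrecognized forms back to the lemmatizer, pass after pass.

    Stops when nothing is pending, when a pass recognizes nothing new,
    or after max_passes passes.
    """
    recognized: Dict[str, str] = {}
    pending = list(dict.fromkeys(forms))  # drop duplicates, keep order
    for _ in range(max_passes):
        if not pending:
            break
        found = lemmatize(pending)
        if not found:
            break  # no progress in this pass
        recognized.update(found)
        pending = [form for form in pending if form not in found]
    return recognized, pending


def toy_lemmatizer(forms: List[str]) -> Dict[str, str]:
    """Stand-in (an assumption): only looks at the first two forms given."""
    lexicon = {"rosam": "rosa", "amat": "amo", "regis": "rex"}
    return {form: lexicon[form] for form in forms[:2] if form in lexicon}


lemmas, strange = lemmatize_in_passes(
    ["rosam", "amat", "regis", "xyzzy"], toy_lemmatizer
)
print(lemmas)   # recognized forms with their lemmas
print(strange)  # the "strange" forms left over after the last pass
```

Whatever remains in `strange` after the last pass corresponds to the list of unrecognized words mentioned above.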

Sunday 30 October 2011

Working on a Work entry

Here's what an entry for a work of a Croatian Latin author (in this case, Beninj Albertini) looks like when it's encoded in TEI XML, courtesy of Tamara Runjak, librarian at HAZU (the Croatian Academy of Sciences and Arts):

<bibl xml:id="alb-c01-1825">
<author><ref target="alb03">Albertini, Beninj</ref></author>
<title lang="la">Elegia in Arcadum coetu
anno jubilaei MDCCCXXV recitata </title>
<date when="1825">1825</date>
<note type="metrum">elegijski distih</note>
<note type="genus">latinski stihovi</note>
<note type="genus">prigodnice</note>
<relatedItem><ref target="alb1825car">Romae 1825</ref></relatedItem>
</bibl>

The whole entry has an identifier in its "bibl" element. The "ref" element in "author" points to the entry in our List of Authors (where all authors' names and known data will be listed).
The date is the creation date (when we know it).
Notes typed "metrum" give the poem's metre (currently this information is in Croatian); others, with type "genus", give genres at different levels. Finally, the "relatedItem" element points to the List of Manifestations (i.e. editions, manuscripts, etc.).
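To show how these pieces can be picked apart by a program, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It assumes the entry carries no TEI namespace declaration, exactly as in the snippet above. The function name `read_work` and the dictionary keys are made up for illustration.

```python
import xml.etree.ElementTree as ET

# xml:id lives in the built-in XML namespace; ElementTree expands it like this.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ENTRY = """<bibl xml:id="alb-c01-1825">
<author><ref target="alb03">Albertini, Beninj</ref></author>
<title lang="la">Elegia in Arcadum coetu
anno jubilaei MDCCCXXV recitata </title>
<date when="1825">1825</date>
<note type="metrum">elegijski distih</note>
<note type="genus">latinski stihovi</note>
<note type="genus">prigodnice</note>
<relatedItem><ref target="alb1825car">Romae 1825</ref></relatedItem>
</bibl>"""


def read_work(xml_text):
    """Return the identifiers and typed notes of one bibl entry as a dict."""
    bibl = ET.fromstring(xml_text)
    notes = {}
    for note in bibl.findall("note"):
        # Group note texts by their type attribute ("metrum", "genus", ...).
        notes.setdefault(note.get("type"), []).append(note.text)
    return {
        "id": bibl.get(XML_ID),
        "author_ref": bibl.find("author/ref").get("target"),
        # Collapse the line break and trailing space inside the title.
        "title": " ".join(bibl.findtext("title").split()),
        "date": bibl.find("date").get("when"),
        "metrum": notes.get("metrum", []),
        "genus": notes.get("genus", []),
        "manifestation_ref": bibl.find("relatedItem/ref").get("target"),
    }


work = read_work(ENTRY)
for key, value in work.items():
    print(key, "=", value)
```

The two `ref` targets are what tie a work to the List of Authors and the List of Manifestations, so they are returned as plain identifiers rather than as display text.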

Oh, and by the way, here is "just the text" from this entry, reminding us how much implied information a cultured human being can understand:


Albertini, Beninj
Elegia in Arcadum coetu anno jubilaei MDCCCXXV recitata
1825
elegijski distih
latinski stihovi
prigodnice
Romae 1825

Saturday 29 October 2011

Places in bibliography

One of the two components for research into Croatian neo-Latin will be a three-part bibliography: a FRBR-inspired list of works, authors, and manifestations. It is based on TEI XML-encoded lists, such as the one searchable here: [X].

Yesterday's discussion with our programmer, Krešimir Šojat, reminded us that the only way to make placenames in our lists useful is to provide them with some kind of geographically unambiguous identifier. Šojat proposed linking to OpenStreetMap, which does look nice, though it is a modern map, not a historical one.

Another possibility that we'll look into is the data compiled by the Pleiades project (here is their data on Split / Spalatum). These are ancient places, but for people writing in Latin, modern places still carry their ancient names.

For XML encoding, the TEI element geo will be used.
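As a rough idea of what such an encoding might look like, here is a small Python sketch that builds a place entry with a geo element. The wrapping in place, placeName and location follows general TEI practice and is an assumption, not a decision recorded here. The identifier "split" and the helper `make_place` are invented for illustration. The coordinates are approximate values for modern Split, not taken from Pleiades or OpenStreetMap.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def make_place(place_id, names, lat, lon):
    """Build a place element holding names by language and one geo point."""
    place = ET.Element("place", {XML_NS + "id": place_id})
    for lang, name in names.items():
        place_name = ET.SubElement(place, "placeName", {XML_NS + "lang": lang})
        place_name.text = name
    location = ET.SubElement(place, "location")
    geo = ET.SubElement(location, "geo")
    # Latitude first, then longitude, separated by whitespace.
    geo.text = f"{lat} {lon}"
    return place


# Approximate, illustrative coordinates (assumption, not project data).
split = make_place("split", {"la": "Spalatum", "hr": "Split"}, 43.51, 16.44)
xml_out = ET.tostring(split, encoding="unicode")
print(xml_out)
```

Keeping the Latin and the modern name side by side in one entry would let a placename like "Spalatum" in a bibliography record resolve to the same point on a modern map.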

(By the way, geotagging is a discipline almost as fascinating as prosopography and bibliography. Too bad I was such a lousy student of geography in high school.)