Sunday 19 January 2014

XQuery for VIAF id numbers

Our bibliography has (too many) authors for whom we need VIAF numbers. Using the VIAF API (as described), we get the numbers from the database in several steps.

  1. find distinct values of unassigned authors in the bibliography
  2. turn these values into an XML sequence
  3. use the sequence to query VIAF

For the last phase, we use this XQuery (and BaseX GUI):

declare namespace ns2="http://viaf.org/viaf/terms#";
(: address to which we are sending the queries;
   inside an XQuery string literal, & has to be written as &amp; :)
let $url :=
("http://www.viaf.org/viaf/search?query=local.personalNames+all+%22REPLACE_URN%22&amp;maximumRecords=1&amp;sortKeys=holdingscount&amp;httpAccept=text/xml")
(: our sequence :)
let $rijeci :=
<a>
<n>Adelmann von Adelmannsfelden, Konrad</n>
<n>Aegidii, Guillermus</n>
<n>Alexander, Natalis</n>
<n>Algerus</n>
<!-- many more -->
<n>Zoller, Martin</n>
</a>

(: process the sequence in batches of 100; this run takes items 301 to 400 :)
for $r in $rijeci/n[401>position() and position()>=301]

let $qrijeci := replace($r, " ", "+")
let $parsed := (doc(replace($url,'REPLACE_URN',$qrijeci)))

return element author {
element ref {
attribute type { "viaf" } ,
attribute target { data($parsed//ns2:VIAFCluster/ns2:viafID) } ,
data($r) }
}
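
The URL-building step of the query above can also be sketched outside XQuery. The Python fragment below is an illustration only, not the script used for the bibliography: it fills the same VIAF search template, but percent-encodes the whole name (so commas and accented letters are escaped as well as spaces) and cuts the name list into batches of 100. The helper names viaf_url and batches are made up for this example, and nothing is sent to VIAF.

```python
from urllib.parse import quote_plus

# Search template taken from the XQuery above; {name} stands in for REPLACE_URN.
VIAF_TEMPLATE = (
    "http://www.viaf.org/viaf/search"
    "?query=local.personalNames+all+%22{name}%22"
    "&maximumRecords=1&sortKeys=holdingscount&httpAccept=text/xml"
)


def viaf_url(name):
    """Return the search URL for one author name (hypothetical helper)."""
    # quote_plus turns spaces into "+" and escapes commas, accents and the like.
    return VIAF_TEMPLATE.format(name=quote_plus(name))


def batches(items, size=100):
    """Yield consecutive slices of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


names = [
    "Adelmann von Adelmannsfelden, Konrad",
    "Aegidii, Guillermus",
    "Zoller, Martin",
]
for batch in batches(names):
    for name in batch:
        print(viaf_url(name))
```

Whether VIAF treats an escaped comma exactly like a literal one was not checked here; the sketch only shows how the request addresses are assembled.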

Sunday 12 January 2014

Sound of music

A testimony of Stefano Fieschi da Soncino on young singers from Dubrovnik in August 1441, entered in a somewhat inappropriate place: in the official notebook Diversa Cancellarie. Published by Konstantin Jireček in 1897.

... Darüber hat sich ein merkwürdiges Zeugniss erhalten, kalligraphisch eingetragen im Buche »Diversa Cancellarie« 1440 f. 158, unter der Zeichnung einer Krone, zum 19. August 1441:

Commemoratio suauitatis cantus dominorum camerariorum. Cum ego Stefanus Flischus Soncinensis, cancellarius Raguseus, ultrascriptam dominorum camerariorum pacti et sortis conuentionem describerem (d. h.: in eo tempore vendemiarum, in quo unus saltem ipsorum qualibet die se debet Magnifico domino Rectori presentare, soll jeder es für 10 Tage thun), tunc ipsi Ser Nicolaus Pauli de Goze, Ser Marinus Junii de Cruce atque Ser Volcius Blasii de Babalio ita me coram suauissime cecinerunt, ut mihi audire visum fuit amenissimam quandam celestem armoniam. Ego vero, qui tum scribendo, tum etiam tota die complures libros peruoluendo aliquantisper defessus fueram, maxima profecto illius amenissimi cantus suauitate oblectatus sum. Et quamquam ipsi domini camerarii hoc mihi inuere videbantur, quod ipsi, magna eorum in me beniuolentia commoti, illam tam diuinam armonie suauitatem me coram egissent, illud tamen non preterit, quin ipsi venustissima egregiarum amicarum suarum forma compulsi et suauium ipsarum morum diligentissime memores, tunc temporis tam diuinitus cecinerunt. Magnas tamen ipsis gratias ago, qui tam suauissima amenitate me oblectarint, sed maiores habeo illis prestantibus eorum amasiis, que ipsos ad illos cantus impulerunt. Valeant ergo insignes et pulcerrime domicelle, quarum amor, mores et nobilitas tantam vim habent, ut tam prestantium iuuenum mentes ad se alicere valuerunt et eorum voluntates, in quancunque partem velint, faciliter impellere possunt! Valeant etiam ipsi domini camerarii, qui tametsi magna dignitatis auctoritate prediti sunt, non tamen tanta dulcedine ullo modo me carere passi sunt ! Valeat denique hec magnifica atque florentissima ciuitas Ragusea, que iuuenes tam insignes tamque prestantes procreauit, qui splendidissimum sue rei publice decus et ornamentum existunt! 
Qui cum etate maturiori creuerint, tunc huius alme ciuitatis statum non solum illesum optime conseruabunt, verum etiam acuratissima eorum prudentia mirandum in modum amplificabunt, ad quam comoditatis gratiam utinam illos deus preseruare dignetur. Ex cancellaria celeberrime urbis Ragusee 14 Kal. Septembres, tunc celorum constellationibus dulcissimam eorum amenitatem in ipsos dominos camerarios diuinitus influentibus.

Thursday 9 January 2014

An SQLite SQL for homonyms

In an SQLite database there is a table of parsed (Latin) word-forms, like this:
tokenID|token|code|lemma|type
1266165|rutilantis|v--pppafa-|rutilo||newmorph
1266166|rutilantis|v--pppama-|rutilo||newmorph
1266167|rutilantis|v--sppafg-|rutilo||newmorph
1266168|rutilantis|v--sppamg-|rutilo||newmorph
1266169|rutilantis|v--sppang-|rutilo||newmorph
We are interested in cases where contents of the token field are the same, but code is different. Code holds grammatical information; the first letter is a shorthand for part of speech (v = verb in the example above). For the moment, we can retrieve this for a specific word, using the following SQL query:
select distinct substr(code, 1, 1) as code1
  from Lexicon
  where token like 'verum';
Grouping on two fields (a recipe found on Stack Overflow) seems promising:
select token , code, count(*) 
 from Lexicon 
 group by token collate nocase, code 
 having (count(*)>1);
And I think this would be the final query:
select distinct token , c 
from (select token , substr(code, 1, 1) as c 
      from Lexicon 
      group by token collate nocase, code 
      having (count(*)>1) limit 30);

Now off to check it on all 1,257,854 rows in the Lexicon table.
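
The idea can be tried on a small scale with an in-memory database. The sketch below is not the query that produced the results reported in this post: it assumes that what we finally want is every token that occurs with more than one part-of-speech letter, and it asks SQLite for that directly with count(distinct ...). The rutilantis rows come from the sample above; the verum rows and their codes are invented for the illustration, and the function name pos_homonyms is hypothetical. Only the token and code columns are filled, since the query needs nothing else.

```python
import sqlite3

# (token, code) pairs; the "verum" codes are invented sample data.
ROWS = [
    ("rutilantis", "v--pppafa-"),
    ("rutilantis", "v--pppama-"),
    ("rutilantis", "v--sppafg-"),
    ("verum", "a-s---nn-"),
    ("verum", "n-s---nn-"),
    ("verum", "d--------"),
]

# One row per token that has two or more distinct first letters in `code`.
QUERY = """
select token, group_concat(distinct substr(code, 1, 1)) as pos
from Lexicon
group by token collate nocase
having count(distinct substr(code, 1, 1)) > 1
"""


def pos_homonyms(conn):
    """Map each such token to the sorted list of its POS letters."""
    return {token: sorted(pos.split(",")) for token, pos in conn.execute(QUERY)}


conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Lexicon (tokenID integer primary key, token text, code text)"
)
conn.executemany("insert into Lexicon (token, code) values (?, ?)", ROWS)
print(pos_homonyms(conn))
```

Unlike the grouping on token and code with count(*) > 1, this variant does not require a token and code pair to occur more than once in the table.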

It took a bit of post-processing with a bash command (had to swap POS and Wordform fields):

sort hom.csv -t, -k1 \
| sed 's/\([^,]*\),\(.\)/\2,\1/g' - \
| uniq -D -s 2 > hom-pos.csv
The results are now publicly available as a Google Fusion table, all 19,350 rows of them: homonyms (and homographs) differing by part of speech, found in a real digital corpus of Latin texts.
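
The shell pipeline above can be read as three steps: sort the token,POS lines, swap the two fields, and keep only the lines whose token turns up more than once. A rough Python equivalent is sketched below, assuming that hom.csv holds one token,POS pair per line with no duplicate lines; the function name and the sample lines are made up for the illustration, and the output order may differ from what sort produces under a given locale.

```python
from collections import defaultdict


def keep_repeated_tokens(lines):
    """Return "POS,token" lines for tokens listed with more than one POS."""
    by_token = defaultdict(list)
    for line in lines:
        token, pos = line.strip().split(",", 1)
        by_token[token].append(pos)
    out = []
    for token in sorted(by_token):
        tags = by_token[token]
        if len(tags) > 1:  # like `uniq -D`: keep every line of a repeated token
            out.extend(f"{pos},{token}" for pos in sorted(tags))
    return out


sample = ["verum,n", "vivis,a", "verum,a", "vult,v", "vivis,v", "verum,d"]
print("\n".join(keep_repeated_tokens(sample)))
```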

Saturday 4 January 2014

Finding homonyms in a (Latin) treebank

Homonyms and homographs (H & H) in a language are a good thing to master — a lot of confusion goes away once we have understood the differences, and our grasp of the language is significantly improved.

We don't want to run away from the H & H; we want to tackle them at full speed. That means that, when we read a text, we can pick out a list of phrases containing H & H, read them, and see which meaning is used where.

To do this, we need a list of homonyms and homographs, and the phrases in which they occur.

Our task can be done in many ways, including purely "manual" ones. But we would like to use the Perseus Latin Treebank files to find more homonyms and homographs, and to extract phrases from these treebank files as well.

Finding H & H turns out to be a computationally demanding task. On some 50,000+ words my computer chokes and never finishes; 10,000 words is still too much for it. But sets of some 2,500 words are about right and don't take all night to finish.

To an XML file like this:

<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgo" postag="d--------"/>
<w form="vulgus" postag="n-s---mn-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnerat" postag="v3spia---"/>
<w form="vulneratum" postag="t-srppma-"/>
<w form="vulnere" postag="n-s---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulnus" postag="n-s---nn-"/>
<w form="vulpes" postag="n-p---fn-"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vultis" postag="v2ppia---"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-s---mn-"/>
</uniq>

We apply the following XQuery (using BaseX, in my case):

element uniq
{ let $a := //*:w
for $l in distinct-values($a/@form),
$f in distinct-values($a[@form=$l]/@postag)
return

element w {
attribute form { $l },
attribute postag { $f }
}
}

Result:

<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgus" postag="n-s---mn-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnerat" postag="v3spia---"/>
<w form="vulneratum" postag="t-srppma-"/>
<w form="vulnere" postag="n-s---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulnus" postag="n-s---nn-"/>
<w form="vulpes" postag="n-p---fn-"/>
<w form="vult" postag="v3spia---"/>
<w form="vultis" postag="v2ppia---"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-s---mn-"/>
</uniq>
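
The nested loop in the query above goes back to the full word list for every distinct form, which may be part of the reason why larger sets take so long (an assumption; it was not measured). For comparison, here is a minimal Python sketch that collects the distinct form and postag pairs in a single pass with a set. It is not a drop-in replacement for the XQuery: the function name is made up, the sample is a shortened version of the input above, and the w elements are assumed to be in no namespace.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgo" postag="d--------"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-s---mn-"/>
</uniq>"""


def distinct_pairs(xml_text):
    """Return the sorted, de-duplicated (form, postag) pairs of all w elements."""
    root = ET.fromstring(xml_text)
    # A set keeps each pair once; every w element is visited a single time.
    pairs = {(w.get("form"), w.get("postag")) for w in root.iter("w")}
    return sorted(pairs)


for form, postag in distinct_pairs(SAMPLE):
    print(f'<w form="{form}" postag="{postag}"/>')
```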

The most interesting cases are those in which the @postag attribute begins with a different value for the same @form, e.g.:

<w form="vivis" postag="a-p---mb-"/>
<w form="vivis" postag="a-p---md-"/>
<w form="vivis" postag="n-p---mb-"/>
<w form="vivis" postag="v2spia---"/>

Then we look for "vivis", for example in the Croatiae auctores Latini text collection.
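
Forms that carry more than one part-of-speech letter can be picked out in a few lines of Python. The function below is a hypothetical helper: it takes (form, postag) pairs, keeps the first letter of each postag, and reports the forms that occur with more than one such letter. The vivis pairs are the ones listed above; two vultus pairs from the earlier result are added as a contrast.

```python
from collections import defaultdict


def pos_ambiguous(pairs):
    """Map each form to its sorted POS letters, if it has more than one."""
    letters = defaultdict(set)
    for form, postag in pairs:
        letters[form].add(postag[0])  # first character = part of speech
    return {form: sorted(s) for form, s in letters.items() if len(s) > 1}


pairs = [
    ("vivis", "a-p---mb-"),
    ("vivis", "a-p---md-"),
    ("vivis", "n-p---mb-"),
    ("vivis", "v2spia---"),
    ("vultus", "n-p---ma-"),
    ("vultus", "n-s---mg-"),
]
print(pos_ambiguous(pairs))
```

Here vultus drops out because both of its tags begin with n, while vivis is kept as adjective, noun and verb.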

How to write a BaseX XQuery with RESTXQ

Caution: technical stuff. Over the holidays we managed to set up a BaseX XML database instance as a web application on several machines. But how do we execute an XQuery there? A simple approach is to use the REST interface that BaseX already provides. However, the BaseX team seems more interested in RESTXQ, "a set of XQuery 3.0 Annotations and a small set of functions to enable XQuery to provide RESTful services, thus enabling Web Application development in XQuery" (from the unofficial RESTXQ draft). BaseX supports RESTXQ very well, but the existing documentation is somewhat sparse for a non-programmer like me. Conspicuously absent is an example of a "standard" XQuery search directed at a database (or, in XQuery parlance, a collection). One is provided here.

The task. A BaseX war instance is deployed on a Jetty server (on my machine, which runs Debian Mint, in /var/lib/jetty8/webapps), accessible at the address http://localhost:8080/BaseX772. A database collection crobib was created and populated with several TEI XML files (with FRBR-structured bibliographical data on Croatian Latin authors, works, and manifestations). We want to execute the following query over the internet; it returns the text inside the tei:persName child of every tei:person element that is the eleventh tei:person within its parent:

declare namespace tei = "http://www.tei-c.org/ns/1.0";
for $i in collection("crobib")//tei:person[11]
return element p { $i/tei:persName//text() }
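
One detail of this query is easy to misread: in //tei:person[11] the position is counted among the tei:person children of each parent, so the query can return one name per list of persons, not a single name for the whole collection. The Python sketch below illustrates the same behaviour on a made-up miniature document, using position 2 instead of 11; ElementTree's limited XPath support treats the positional predicate in the same per-parent way. The element content and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented miniature document with two lists of persons.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <listPerson>
    <person><persName>A</persName></person>
    <person><persName>B</persName></person>
    <person><persName>C</persName></person>
  </listPerson>
  <listPerson>
    <person><persName>D</persName></person>
    <person><persName><surname>E</surname></persName></person>
  </listPerson>
</TEI>"""


def nth_person_names(xml_text, n):
    """Return persName text of every tei:person that is n-th within its parent."""
    root = ET.fromstring(xml_text)
    names = []
    for person in root.findall(f".//tei:person[{n}]", NS):
        pers_name = person.find("tei:persName", NS)
        if pers_name is not None:
            # itertext() gathers all descendant text, like persName//text()
            names.append("".join(pers_name.itertext()))
    return names


print(nth_person_names(SAMPLE, 2))
```

With two lists in the sample, the second person of each list is returned, so the result has two names.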

The solution. An .xq script should be written and placed (in our case) under the root of the BaseX war archive. If all goes well, it is found and read by Jetty and BaseX when the server is restarted. This is the script (cbxq.xq). Note how the resulting sequence has to be wrapped in a div element:

import module namespace rest = "http://exquery.org/ns/restxq";
declare namespace page = 'http://basex.org/examples/web-page';
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare %rest:GET %rest:POST %rest:path("person")
function page:person() {
element div {
for $i in collection("crobib")//tei:person[11]
return ( element p { $i/tei:persName//text() } )
}
};
return

The script is requested over the following address: http://localhost:8080/BaseX772/person.

Going further. We want to search not just for the eleventh tei:person element, but for any position we choose. The position becomes a variable holding an integer, and its value is passed as a query parameter in the request URL. The script now looks like this:

import module namespace rest = "http://exquery.org/ns/restxq";
declare namespace page = "http://basex.org/examples/web-page";
declare namespace tei = "http://www.tei-c.org/ns/1.0";
declare %rest:GET %rest:POST %rest:path("person")
%rest:query-param("var", "{$var}")
function page:person($var as xs:integer) {
element div {
for $i in collection("crobib")//tei:person[$var]
return ( element p { $i/tei:persName//text() } )
}
};
return

We had to declare the query parameter var with %rest:query-param("var", "{$var}") and to make the function page:person expect it: function page:person($var as xs:integer).

The script is requested with a call such as this (querying the hundredth tei:person): http://localhost:8080/BaseX772/person?var=100.
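
From the client side, such a call can be sketched in Python as follows. The base address is the one used in this post; the function names are made up, and fetch_person assumes that the server is running and that the response is UTF-8 text. The query string is built with urlencode so that the parameter is escaped properly.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8080/BaseX772"  # address of the local BaseX instance


def person_url(position):
    """Build the request URL for the tei:person at the given position."""
    query = urlencode({"var": int(position)})
    return f"{BASE}/person?{query}"


def fetch_person(position, timeout=10):
    """Send the GET request and return the response body as text.

    Assumes the server is up and answers in UTF-8.
    """
    with urlopen(person_url(position), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_person(100) against a live server.
print(person_url(100))
```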