Sunday, 30 December 2012

Morphological JSON with Perl

Learning Perl, aka "the Llama book", makes a terrific didactical point in footnote 8 on page 6:
If you're going to use a programming language for only a few minutes each week or month, you'd prefer one that is easier to learn, since you'll have forgotten nearly all of it from one use to the next. Perl is for people who are programmers for at least twenty minutes a day.

Basically, nulla dies sine linea. The daily twenty minutes today took about three or four hours, but I ended up with a Perl version of what I already did in JavaScript: a script that iterates over any list of JSON results from the Latin Morphology Service, decides whether a word sent to it has been recognized or not, and then whether the lemmatization is ambiguous or not.

The rizai — getting through all arrays of hashes and hashes of hashes — have been pikrai indeed (the crucial piece of information was shared by this post at Stack Overflow); dereferencing still appears to me as consecutio temporum must look to a programmer; hashes were my Scylla and arrays my Charybdis, but the ship is still sailing, more or less.

The script is here (thanks to DokuWiki).

All this wasn't done as a pure exercise (I'm not such a conscientious student). The Morphology Service JSON holds a lot more than a lemma; in fact, it provides a wealth of information, most of what people interested in natural language processing of Greek and Latin usually lack (and scholars of other languages have). You need to stem a word? You need to identify which part of speech it is? It's all there somewhere, nested deep in JSON.
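To make "nested deep in JSON" concrete, here is a minimal JavaScript sketch (JavaScript rather than Perl, only to match the examples in the earlier post) that pulls lemma, part of speech, stem and suffix out of one response. The key names are those of the sample responses quoted in the 27 December post; the function name describeForm is hypothetical, and the sketch assumes a single "Body" object with a single "infl" object, which real responses may not always satisfy.

```javascript
// Hypothetical helper: summarize one Morphology Service response.
// Assumes "Body" is a single object holding one "infl" object, as in the
// sample responses; real responses may be richer than this.
function describeForm(response) {
  var annotation = response.RDF.Annotation;
  var form = annotation.hasTarget.Description.about.replace("urn:word:", "");
  var body = annotation.Body;
  if (!body || Array.isArray(body)) {
    // Unrecognized or ambiguous word: nothing unambiguous to report here.
    return { form: form, lemma: null };
  }
  var entry = body.rest.entry;
  return {
    form: form,
    lemma: entry.dict.hdwd.$,
    partOfSpeech: entry.dict.pofs.$,
    stem: entry.infl.term.stem,
    suffix: entry.infl.term.suff
  };
}
```

Called on the response for abduxerunt, this would report the lemma abduco, the part of speech verb, and the stem and suffix exactly as the service spells them.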

Naturally, you ask why I should bother. Are we not trained to use dictionaries, don't we have enough grammatical knowledge? Of course we do; we can read Greek and Latin much better than computers. But there are limits to how much we can read, or analyse. Giving the text the care and the gusto it requires (the Greek and Latin we have today were not written to be read quickly), I need from one to ten minutes for a page, and enough time for reflexion and rumination afterwards. Grammatical analysis progresses even more slowly. The computer, on the other hand, doesn't care for rumination; it gets the JSON for 2000+ words of a neo-Latin text back from the Morphology Service in approximately the time that I need to write this post.

And then we have a chance to learn from computers' mistakes.

Which words were recognized, which are ambiguous, which are unknown to the service? What is the proportion between the three groups? Which words are unambiguously identified, and not inflected? We'll store the uninflected words somewhere, because we don't need to stem them (much); we'll store the unambiguously recognized words, because we won't need to lemmatize them in other texts; from the set of unrecognized words we'll be building an index nominum et locorum, an index verborum rariorum, and a list of common words which Morphology Service should add to its database. Furthermore, a list of lemmata allows us to begin exploring lexical variety in a text, or in a set of texts.
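The first two questions can be sketched in a few lines of JavaScript. The sketch assumes the convention described in the 27 December post: no "Body" means the word is unknown to the service, a "Body" array means an ambiguous lemmatization, and a single "Body" object means an unambiguous one. The function names are hypothetical.

```javascript
// Hypothetical helpers: sort responses into the three groups and count them.
// Assumes: no Body = unrecognized, Body array = ambiguous, Body object = unambiguous.
function classifyResponse(response) {
  var body = response.RDF.Annotation.Body;
  if (!body) {
    return "unrecognized";
  }
  return Array.isArray(body) ? "ambiguous" : "unambiguous";
}

function tallyResponses(responses) {
  var counts = { unrecognized: 0, ambiguous: 0, unambiguous: 0 };
  responses.forEach(function (response) {
    counts[classifyResponse(response)] += 1;
  });
  return counts;
}
```

Dividing each count by the number of responses gives the proportion between the three groups.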

Mind you, the basis for much of this is being put together while I write this. All I had to do to make it happen was learn some code. It almost didn't hurt. Much.

Saturday, 29 December 2012

Structuring the Mercurius Croaticus

Mercurius Croaticus is currently a set of TEI XML files containing bibliographical records. These files will be served and made accessible via the BaseX XML database, once we decide on how to present the records; our working premise is that, if researchers want to discover something really interesting, they'll be willing and ready to learn XQuery (not least because it's a powerful tool for more than one research project). And Mercurius Croaticus will help them learn.

To get a clear idea of what exactly is in the bibliographic collection, and to avoid confusion, we organised the files in three sets of folders, following the FRBR concept. There is a folder for authors (auctores), a folder for works (opera), and one for "manifestations" (it is interesting that a term for the third category is not readily available in any language I know); "manifestations" have two subfolders: manuscripts (MS) and printed books (typis edita). Obviously, the "internet" subfolder could also be included.

The folder auctores contains our starting prosopography — a set of 244 personal records for Croatian Latin authors included in the Leksikon hrvatskih pisaca (A Lexicon of Croatian Writers, Zagreb 2000) — and the "additions" file, containing currently 70 more neo-Latin authors of Croatian origin.

The main part of opera is also culled from the Leksikon hrvatskih pisaca (there are 1784 items listed), and there are 18 additional items as well.

The manifestations/MS subfolder has one special collection, with excerpts from Paul Oskar Kristeller's Iter Italicum (no, we don't have the money to subscribe to Brill's internet edition); it was excerpted by Darko Novaković, who kindly lent his notes to Mercurius for TEI XML conversion. And there are records collected from other sources, more or less obiter.

Finally, there is the manifestations/typis edita subfolder, where the basis is the bibliography made by Šime Jurić (1915–2004) in the late 1960s: Iugoslaviae scriptores Latini recentioris aetatis (that is, its "Pars I. Opera scriptorum Latinorum natione Croatarum usque ad annum MDCCCXLVIII typis edita. Bibliographiae fundamenta. t. 1. Index alphabeticus. t. 2. Index systematicus. Additamentum I."), later improved by the Croatian National and University Library, and encoded by the Croatiae auctores Latini project. This collection contains 5867 bibliographic records on printed Latin publications with works by Croatian authors. Jurić's bibliography breaks off with the year 1850. Mercurius Croaticus should urgently supplement it with data on later publications, all the way up to the present.

Thursday, 27 December 2012

Saturnalia with Perseus Latin JSON

A two-day Christmas project: learn how to use Latin lemmata in JSON format, as provided by the Morphological Analysis Service available on an instance of the Bamboo Services Platform hosted by the University of California, Berkeley at http://services-qa.projectbamboo.org/bsp/morphologyservice (used by Perseus for Latin and announced by Bridget Almas on November 1, 2012).

The fanfare: see the results here: [X].

Caveat. If you're a programmer, the following may seem extremely silly, because it is a description of how a non-programmer solved a programmer's task.

The task

What we wanted to do. There is a file of saved JSON responses, produced by sending a list of Latin words (from a Croatian neo-Latin text) to the Morphological Analysis Service, and appending responses. We wanted to produce a simple table containing a form sent to the service and the lemma gotten in response. This has already been achieved locally and indirectly, processing the responses file with some Perl and Linux CLI regex tools to produce an HTML page. But now I wanted to learn how to compute JSON as JSON. Solving the problem also taught me the structure of a Morphological Analysis Service response.

The responses file contains lines of JSON objects:

{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abduxerunt:morpheus","hasTarget":{"Description":{"about":"urn:word:abduxerunt"}},"hasBody":{"resource":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT","Body":{"about":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abduco"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abdu_x","suff":"e_runt"},"pofs":{"order":1,"$":"verb"},"mood":"indicative","num":"plural","pers":"3rd","tense":"perfect","voice":"active","stemtype":"perfstem"}}}}}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abscente:morpheus","hasTarget":{"Description":{"about":"urn:word:abscente"}},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT"}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abstineant:morpheus","hasTarget":{"Description":{"about":"urn:word:abstineant"}},"hasBody":{"resource":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:29\nGMT","Body":{"about":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abstineo"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abs:tin","suff":"eant"},"pofs":{"order":1,"$":"verb"},"mood":"subjunctive","num":"plural","pers":"3rd","tense":"present","voice":"active","stemtype":"conj2","morph":"comp_only"}}}}}}}

The JSON

"Lines of JSON objects" is not valid JSON, as you can see if you copy the lines above and paste them here: jsonlint.com. Why the error? A JSON document has to be a single top-level value, so all the objects have to be contained in one JSON array. Also, for some reason "RDF" wasn't accepted as field name (key). So we transformed the file locally, introducing "Verba" as the key of the top-level array, like this:

perl -0777 -wpl -e "s/\n/,\n/g;" jsonfilename |
  perl -wpl -e 's/"RDF"/"rdf"/g;' |
  perl -0777 -wpl -e 's/^{/{"Verba":\[{/;' |
  perl -0777 -wpl -e 's/,\n*$/\n\]}\n/;' > resultfilename.json
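For comparison, the same wrapping can be sketched in JavaScript by parsing each line instead of editing the text. This is only an illustration, not what was actually run; the function name wrapResponses is hypothetical, and the sketch assumes exactly one JSON object per line. It renames "RDF" to "rdf" so that the result matches the keys used in the rest of this post.

```javascript
// Hypothetical helper: turn "lines of JSON objects" into one valid JSON document.
// Assumes one complete JSON object per non-empty line of the input text.
function wrapResponses(text) {
  var verba = text
    .split("\n")
    .filter(function (line) { return line.trim() !== ""; })
    .map(function (line) {
      var parsed = JSON.parse(line);
      // Rename the top key, as the Perl pipeline does.
      return { rdf: parsed.RDF };
    });
  return JSON.stringify({ Verba: verba });
}
```

A side effect of parsing line by line is that a malformed line makes JSON.parse throw at once, instead of producing a file that fails later in the browser.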

To process the JSON, we used JavaScript (amazed by the d3 JavaScript library). D3 documentation and examples showed how to read in a JSON file — and then I realized that the Morphology Service JSON is pretty complicated:

  • it is (sometimes deeply) nested
  • its objects contain arrays
  • an unsuccessfully lemmatized word won't have any "Body" object; a lemmatized word will have a single object there; for an ambiguously lemmatized word, "Body" will be an array

The loops

So, lots of loops here. First the script had to loop through the array:

var infoLength = data.Verba.length;
for (var infoIndex = 0; infoIndex < infoLength; infoIndex++) {
  // ...
}

Then it had to check whether a JSON object contains a lemmatized word, that is, whether it has a "Body" object. The a1 variable will hold the form sent to the service, while a2 will hold the lemmata (or the information that the word wasn't lemmatized successfully):

var a1 = data.Verba[infoIndex].rdf.Annotation.hasTarget.Description.about;
// testing for existence of lemma
var verbdata = data.Verba[infoIndex].rdf.Annotation.Body;
if (verbdata) {
  // ...
} else {
  var a2 = "FORMA NON RECOGNITA";
}

And then the "Body" object had to be tested for array; in case it isn't an array, JSON would be traversed all the way to the "$" key (containing the dictionary entry for the lemma):

if (Object.prototype.toString.call(verbdata) === '[object Array]') {
  // Iterate the array and do stuff ...
} else {
  var link2 = perseus + data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
  var a2 = data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
}

Now, in case that the verbdata variable contains an array, the array had to be iterated over, and a list — actually, a new array — had to be built from its values:

var a2 = [];
for (var bodyIndex = 0; bodyIndex < verbdata.length; bodyIndex++) {
  a2.push(verbdata[bodyIndex].rest.entry.dict.hdwd.$);
}
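The three cases above (no "Body", one "Body", an array of them) can also be folded into a single function. What follows is a sketch, not the script as it was written: the name formAndLemmata is hypothetical, Array.isArray stands in for the Object.prototype.toString test, and joining several lemmata with a comma is just one possible choice.

```javascript
// Hypothetical helper: return [form, lemmata] for one item of data.Verba.
// Treats a single Body as a one-element array, so one code path serves both cases.
function formAndLemmata(item) {
  var annotation = item.rdf.Annotation;
  var form = annotation.hasTarget.Description.about;
  var body = annotation.Body;
  if (!body) {
    return [form, "FORMA NON RECOGNITA"];
  }
  var bodies = Array.isArray(body) ? body : [body];
  var lemmata = bodies.map(function (b) {
    return b.rest.entry.dict.hdwd.$;
  });
  return [form, lemmata.join(", ")];
}
```

Wrapping the single object in an array first means the traversal down to the "$" key is written only once.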

Finally, we used a small routine to populate a table with resulting forms / lemmata pairs:

var columns = [a1, a2];
var table = d3.select("#container")
  .append("tr")
  .selectAll("td")
  .data(columns)
  .enter()
  .append("td")
  .text(function(column) { return column; });

Lots of trial-and-error (Firebug was a great help, and Stack Overflow an even greater one) need not be dwelt on. Just one limitation puzzles me: the JSON file contains responses for more than 2000 words; my version of Firefox throws an error after ca. 720 objects read, so either something should be optimized, or a paging system introduced. And, of course, seeing all 2000+ forms/lemmata pairs at once is neither necessary nor useful; the only thing we need is an option to sort out the unrecognized forms. This was added by the sorttable.js script.
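A paging system could start from something as small as the sketch below. It does not explain the Firefox error; it only shows how the form/lemma pairs (arrays like the columns variable above) might be cut into pages or reduced to the unrecognized forms before any table row is drawn. Both function names are hypothetical.

```javascript
// Hypothetical helpers working on an array of [form, lemma] pairs.
// pageNumber counts from 1; the last page may hold fewer than pageSize rows.
function pageOf(rows, pageNumber, pageSize) {
  var start = (pageNumber - 1) * pageSize;
  return rows.slice(start, start + pageSize);
}

// Keep only the pairs whose lemma slot carries the "not recognized" marker.
function unrecognizedOnly(rows) {
  return rows.filter(function (row) {
    return row[1] === "FORMA NON RECOGNITA";
  });
}
```

With 2000+ pairs and a page size of 100, only one hundred rows at a time would have to be handed to d3.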

Once again, the page with our JavaScript can be seen in action here: [X].

Wednesday, 19 December 2012

One-line concordance in Linux command line

A recipe for creating a "concordance" (actually, a list of forms from a text with frequencies added) using just the command line, skipping programs such as AntConc (which is great, nice and illuminating, but sometimes I just need to prepare a list quickly). It can be done with the following Bash one-liner:

tr '[:punct:]' ' ' < filename1 | tr '[:upper:]' '[:lower:]' | tr '[:blank:]' '\n' | sort | uniq -c | sed 's/ \{1,\}/","/g' | sed 's/^",//g' | sed 's/$/"/g' > filename2.csv

(filename1 is the input file, filename2.csv the output in CSV format.)
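For readers who want to follow the pipeline step by step, here is a rough JavaScript equivalent: punctuation to spaces, lower case, one word per item, count, and output as quoted CSV rows. It is a sketch under assumptions: the function name concordance is hypothetical, only ASCII punctuation is handled, and empty items are dropped, which the shell pipeline does not do.

```javascript
// Hypothetical helper mirroring the shell pipeline on a string of text.
// Returns rows such as "2","arma" sorted by word (plain code-unit order).
function concordance(text) {
  var counts = Object.create(null); // no inherited keys such as "constructor"
  text
    .replace(/[!-\/:-@\[-\x60{-~]/g, " ") // ASCII punctuation to spaces
    .toLowerCase()
    .split(/\s+/)
    .forEach(function (word) {
      if (word !== "") {
        counts[word] = (counts[word] || 0) + 1;
      }
    });
  return Object.keys(counts).sort().map(function (word) {
    return '"' + counts[word] + '","' + word + '"';
  });
}
```

Joining the returned rows with newlines gives the same kind of two-column CSV that the one-liner writes to filename2.csv.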

Recently there was a discussion on the HUMANIST list whether "bash scripting is a worthwhile approach to 'tool' development in the Digital Humanities". People tended to reply no, either learn a "real" language (Python was recommended), or develop a GUI ("like most people, humanists have a strong distaste for the commandline"); I think it was 3:1 against the command line.

Obviously, I disagree. Using bash helped me cross the boundary between user and "programmer"; on the command line one just slides from one region into another. Without a formal education: you have a problem, you look for a solution (discovering gratefully that you stand on the shoulders of many colleagues), and bam! it's solved.

I think that digital humanists in general should adopt this kind of sliding (from users to programmers as well as from "classical" to "avant-garde" scholars) as their MO.

Asterisks

* * *

Today I had to mark up several Latin poems which used the device above -- three asterisks, even placed as an asterism -- to mark breaks between thematic units.

How to mark a set of asterisks, a typographical asterism, in TEI XML? A TEI-L discussion from 2007 helped, and I decided to use the space element ("indicates the location of a significant space in the copy text").

Friday, 7 December 2012

Profiling cultural literacy of Croatian Latin writers

A paper to be presented at the Latin, National Identity and the Language Question in Central Europe conference, organised by the Ludwig Boltzmann Institute for Neo-Latin Studies in Innsbruck (12–15 Dec 2012), will apply E. D. Hirsch's concept of cultural literacy, which is actually German Allgemeinbildung or Bildungsgut (see it at work in these books) and "opća kultura" in Croatian, to the intellectual horizons of Croatian neo-Latin writers, as represented in the Croatiae auctores Latini collection.

Judging from the programme, the conference offers a chance to present digital research to an audience working mostly in "traditional" ways. This is quite an opportunity; too often digital humanities get separated in a room of their own, where they don't get in anybody's way. So, the challenge is to persuade colleagues that a large-scale search of CroALa using e. g. common terms from CAMENA TERMINI can lead to something interesting.

Update, post-conference, 19/12/2012:

My presentation on cultural literacy is here (note to self: never again try to do a presentation in a browser — outdated browser versions always turn up in crucial moments in crucial spots).

The paper itself is here.