Sunday, 30 December 2012

Morphological JSON with Perl

Learning Perl, aka "the Llama book", makes a terrific didactical point in footnote 8 on page 6:
If you're going to use a programming language for only a few minutes each week or month, you'd prefer one that is easier to learn, since you'll have forgotten nearly all of it from one use to the next. Perl is for people who are programmers for at least twenty minutes a day.

Basically, nulla dies sine linea. The daily twenty minutes today took about three or four hours, but I ended up with Perl version of what I already did in JavaScript: a script that iterates over any list of JSON results from Latin Morphology Service, decides whether a word sent to it has been recognized or not, and then whether the lemmatization is ambiguous or not.

The rizai — getting through all arrays of hashes and hashes of hashes — have been pikrai indeed (the crucial piece of information was shared by this post at Stack Overflow); dereferencing still appears to me as consecutio temporum must look to a programmer; hashes were my Scylla and arrays my Charybdis, but the ship is still sailing, more or less.

The script is here (thanks to DokuWiki).

All this wasn't done as pure exercise (I'm not such a conscientious student). The Morphology Service JSON holds lot more then a lemma, in fact it provides a wealth of information — most of what people interested in natural language processing of Greek and Latin usually lack (and scholars of other languages have). You need to stem a word? You need to identify which part of speech it is? It's all there somewhere, nested deep in JSON.

Naturally, you ask why should I bother. Are we not trained to use dictionaries, don't we have enough grammatical knowledge? Of course we do; we can read Greek and Latin much better than computers. But there are limits to how much we can read, or analyse. Giving the text the care and the gusto it requires — Greek and Latin we have today were not written to be read quickly ‐ I need from one to ten minutes for a page, and enough time for reflexion and rumination afterwards. Grammatical analysis progresses even slower. The computer, on the other hand, doesn't care for rumination; it gets back from Morphology Service JSON for 2000+ words of a neo-Latin text approximately in the time that I need to write this post.

And then we have a chance to learn from computers' mistakes.

Which words were recognized, which are ambiguous, which are unknown to the service? What is the proportion between the three groups? Which words are unambiguously identified, and not inflected? We'll store the uninflected words somewhere, because we don't need to stem them (much); we'll store the unambiguously recognized words, because we won't need to lemmatize them in other texts; from the set of unrecognized words we'll be building an index nominum et locorum, an index verborum rariorum, and a list of common words which Morphology Service should add to its database. Furthermore, a list of lemmata allows us to begin exploring lexical variety in a text, or in a set of texts.

Mind you, the basis for much of this is being put together while I write this. All I had to do to make it happen was learn some code. It almost didn't hurt. Much.

No comments:

Post a Comment