Showing posts with label software. Show all posts

Sunday, 30 December 2012

Morphological JSON with Perl

Learning Perl, aka "the Llama book", makes a terrific didactical point in footnote 8 on page 6:
If you're going to use a programming language for only a few minutes each week or month, you'd prefer one that is easier to learn, since you'll have forgotten nearly all of it from one use to the next. Perl is for people who are programmers for at least twenty minutes a day.

Basically, nulla dies sine linea. The daily twenty minutes today took about three or four hours, but I ended up with a Perl version of what I already did in JavaScript: a script that iterates over any list of JSON results from the Latin Morphology Service, decides whether a word sent to it has been recognized or not, and then whether the lemmatization is ambiguous or not.

The rizai — getting through all arrays of hashes and hashes of hashes — have been pikrai indeed (the crucial piece of information was shared by this post at Stack Overflow); dereferencing still appears to me as consecutio temporum must look to a programmer; hashes were my Scylla and arrays my Charybdis, but the ship is still sailing, more or less.

The script is here (thanks to DokuWiki).
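For those who'd rather see the idea than the Perl, here is a minimal sketch in Python of the same decision (recognized or not, ambiguous or not). It is not the script linked above. The JSON shape is a simplified one I made up for illustration: a flat list of records, each with a "word" and a list of "analyses" carrying a "lemma". The real service nests its data much deeper, and none of these field names should be taken as its actual format.

```python
import json

# Hypothetical, simplified response shape (NOT the real Morphology Service
# format): a list of records, one per word sent to the service.
SAMPLE = """
[
  {"word": "urbem",   "analyses": [{"lemma": "urbs",  "pos": "noun"}]},
  {"word": "canes",   "analyses": [{"lemma": "canis", "pos": "noun"},
                                   {"lemma": "cano",  "pos": "verb"}]},
  {"word": "Ragusii", "analyses": []}
]
"""


def classify(record):
    """Return 'unknown', 'unambiguous' or 'ambiguous' for one record."""
    # Count distinct lemmata, not analyses: a single lemma may come back
    # with several inflectional readings and still be unambiguous.
    lemmata = {a["lemma"] for a in record.get("analyses", [])}
    if not lemmata:
        return "unknown"
    return "unambiguous" if len(lemmata) == 1 else "ambiguous"


def classify_all(json_text):
    """Map each word in a JSON list of records to its class."""
    return {r["word"]: classify(r) for r in json.loads(json_text)}


if __name__ == "__main__":
    for word, label in classify_all(SAMPLE).items():
        print(word, label)
```

Comparing sets of lemmata rather than counting analyses is a design choice of this sketch; a stricter notion of ambiguity (same lemma, different case or tense) would need a different test.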

All this wasn't done as a pure exercise (I'm not such a conscientious student). The Morphology Service JSON holds a lot more than a lemma; in fact it provides a wealth of information, most of what people interested in natural language processing of Greek and Latin usually lack (and scholars of other languages have). You need to stem a word? You need to identify which part of speech it is? It's all there somewhere, nested deep in JSON.

Naturally, you ask why I should bother. Are we not trained to use dictionaries, don't we have enough grammatical knowledge? Of course we do; we can read Greek and Latin much better than computers. But there are limits to how much we can read, or analyse. To give the text the care and the gusto it requires (the Greek and Latin we have today were not written to be read quickly), I need from one to ten minutes for a page, and enough time for reflexion and rumination afterwards. Grammatical analysis progresses even more slowly. The computer, on the other hand, doesn't care for rumination; it gets JSON back from the Morphology Service for 2000+ words of a neo-Latin text in approximately the time that I need to write this post.

And then we have a chance to learn from computers' mistakes.

Which words were recognized, which are ambiguous, which are unknown to the service? What is the proportion between the three groups? Which words are unambiguously identified, and not inflected? We'll store the uninflected words somewhere, because we don't need to stem them (much); we'll store the unambiguously recognized words, because we won't need to lemmatize them in other texts; from the set of unrecognized words we'll be building an index nominum et locorum, an index verborum rariorum, and a list of common words which the Morphology Service should add to its database. Furthermore, a list of lemmata allows us to begin exploring lexical variety in a text, or in a set of texts.
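The bookkeeping described here can be sketched in a few lines of Python. Everything in it is an assumption made for illustration: the input is a made-up list of (word, class, inflected?) triples such as a classification pass might produce, and the bucket names are mine, not part of any existing script.

```python
from collections import Counter

# Hypothetical results of an earlier classification pass; the values are
# invented. The third field is None when the service returned nothing,
# since inflection cannot be judged for an unrecognized word.
RESULTS = [
    ("et", "unambiguous", False),
    ("urbem", "unambiguous", True),
    ("canes", "ambiguous", True),
    ("Ragusii", "unknown", None),
    ("sed", "unambiguous", False),
]


def proportions(results):
    """Share of each class (unambiguous / ambiguous / unknown) in the text."""
    counts = Counter(label for _, label, _ in results)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def buckets(results):
    """Sort words into the lists worth keeping for later texts."""
    return {
        # no stemming needed for these
        "uninflected": sorted(
            w for w, label, infl in results
            if label == "unambiguous" and infl is False
        ),
        # no need to lemmatize these again
        "lemmatized": sorted(w for w, label, _ in results if label == "unambiguous"),
        # raw material for the indices of names, places and rare words
        "unrecognized": sorted(w for w, label, _ in results if label == "unknown"),
    }


if __name__ == "__main__":
    print(proportions(RESULTS))
    print(buckets(RESULTS))
```

Splitting the unrecognized words further into names, places and rare words is left out on purpose; that part still needs a human reader.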

Mind you, the basis for much of this is being put together while I write this. All I had to do to make it happen was learn some code. It almost didn't hurt. Much.

Thursday, 5 January 2012

Rare and Medium

This wintry afternoon I followed in the footsteps of William Whitaker, the author of the WORDS Latin dictionary. The program contains a list of Latin words with very precise lexicographic descriptions -- data on period, area of application, frequency etc. The last part interested me most.

Whitaker, about whom I know almost nothing but would like to know more (he seems to be outside academia) [1], was very modest and careful in his claims, repeatedly warning users of the program that its philological expertise is limited, that he relied on other authorities and sources, that the program is intended just to be a reading help, not a research tool. Nevertheless, he has produced, I believe, the most informative freely available digital reference work on Latin usage. I'd like to see a review of his work in some scholarly journal; I think he has deserved it.

Anyway, in the documentation on word frequencies Whitaker says:

FREQ guessed from the relative number of citations given by sources need not be valid, but seems to work. (...)

type FREQUENCY_TYPE is ( -- For dictionary entries
X, -- -- Unknown or unspecified
A, -- very freq -- Very frequent, in all Elementary Latin books, top 1000+ words
B, -- frequent -- Frequent, next 2000+ words
C, -- common -- For Dictionary, in top 10,000 words
D, -- lesser -- For Dictionary, in top 20,000 words
E, -- uncommon -- 2 or 3 citations
F, -- very rare -- Having only single citation in OLD or L+S
I, -- inscription -- Only citation is inscription
M, -- graffiti -- Presently not much used
N -- Pliny -- Things that appear only in Pliny Natural History
);

(Of course, Whitaker knows about Diederich's work -- he is the one who OCR'd Diederich's 1939 thesis and put it online.)

So, we're pleased to report that the Profile of Croatian Neo-Latin Project converted Whitaker's DICTPAGE.RAW to a MySQL table, and learned the following about how Whitaker's ten frequency categories are distributed among the 39,225 lemmata in his wordlist:

  1. X (Unknown or unspecified): 0
  2. A (very freq): 2134
  3. B (frequent): 2747
  4. C (common): 5113
  5. D (lesser): 8365
  6. E (uncommon): 11193
  7. F (very rare): 7974
  8. I (inscription): 430
  9. M (graffiti): 0
  10. N (Pliny): 1269

  Total: 39225


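Now we have something to compare. It is interesting to note that the largest single category is E (uncommon): 11,193 lemmata, or about 29% of the list. Together with the still rarer F, I and N, the uncommon words make up just over half of it.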
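To turn the raw counts into something comparable, a small Python sketch that computes each category's share of the wordlist. The counts are the ones listed above; the function name and the grouping of E, F, I and N as "uncommon or rarer" are my own choices for illustration.

```python
# Counts per frequency code, exactly as listed in this post.
FREQ_COUNTS = {
    "X": 0, "A": 2134, "B": 2747, "C": 5113, "D": 8365,
    "E": 11193, "F": 7974, "I": 430, "M": 0, "N": 1269,
}


def shares(counts):
    """Return the total and each category's percentage of that total."""
    total = sum(counts.values())
    return total, {code: 100 * n / total for code, n in counts.items()}


if __name__ == "__main__":
    total, pct = shares(FREQ_COUNTS)
    for code, n in FREQ_COUNTS.items():
        print(f"{code}: {n:6d}  {pct[code]:5.1f}%")
    print("total:", total)
    # Illustrative grouping (an assumption of this sketch, not Whitaker's).
    rarer = sum(pct[c] for c in ("E", "F", "I", "N"))
    print(f"E+F+I+N: {rarer:.1f}%")
```

The same shares could of course be had straight from the MySQL table with a grouped count; the arithmetic is the point here, not the storage.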

[Further reading.] There is a recent publication, Joseph Denooz, Nouveau lexique fréquentiel de latin. Alpha-Omega. Reihe A Bd 258. Hildesheim/Zürich/New York: Georg Olms Verlag, 2010. Pp. ix, 453. ISBN 9783487144733. €148.00. (reviewed recently on BMCR, with a crucial question: "A dictionary such as this is a tool: so what can this one be used for?").

[1] A sad update. Thinking about possible reasons for William Whitaker's absence from the internet, I consulted the obituaries, and found the following:

Colonel William A. Whitaker (USAF-Retired) passed away on Tuesday, December 14, 2010. While at DARPA, he worked on the computer language ADA. In retirement, he created the Latin-English translation software program, "Whitaker Words". (...)
Published in Midland Reporter-Telegram on December 21, 2010
Source here.


Τάνδε κατ' εὔδενδρον στείβων δρίος εἴρυσα χειρὶ
πτώσσουσαν βρομίας οἰνάδος ἐν πετάλοις,
ὄφρα μοι εὐερκεῖ καναχὰν δόμῳ ἔνδοθι θείη,
τερπνὰ δι' ἀγλώσσου φθεγγομένα στόματος.

Requiescat in pace.