
Thursday, 27 December 2012

Saturnalia with Perseus Latin JSON

A two-day Christmas project: learning how to use Latin lemmata in JSON format, as provided by the Morphological Analysis Service available on an instance of the Bamboo Services Platform hosted by the University of California, Berkeley at http://services-qa.projectbamboo.org/bsp/morphologyservice (used by Perseus for Latin and announced by Bridget Almas on November 1, 2012).

The fanfare: see the results here: [X].

Caveat. If you're a programmer, the following may seem extremely silly, because it is a description of how a non-programmer solved a programmer's task.

The task

What we wanted to do. We have a file of saved JSON responses, produced by sending a list of Latin words (from a Croatian neo-Latin text) to the Morphological Analysis Service and appending the responses. We wanted to produce a simple table pairing each form sent to the service with the lemma returned in response. This had already been achieved locally and indirectly, by processing the responses file with some Perl and Linux CLI regex tools to produce an HTML page. But now I wanted to learn how to process JSON as JSON. Solving the problem also taught me the structure of a Morphological Analysis Service response.

The responses file contains lines of JSON objects:

{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abduxerunt:morpheus","hasTarget":{"Description":{"about":"urn:word:abduxerunt"}},"hasBody":{"resource":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT","Body":{"about":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abduco"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abdu_x","suff":"e_runt"},"pofs":{"order":1,"$":"verb"},"mood":"indicative","num":"plural","pers":"3rd","tense":"perfect","voice":"active","stemtype":"perfstem"}}}}}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abscente:morpheus","hasTarget":{"Description":{"about":"urn:word:abscente"}},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT"}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abstineant:morpheus","hasTarget":{"Description":{"about":"urn:word:abstineant"}},"hasBody":{"resource":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:29\nGMT","Body":{"about":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abstineo"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abs:tin","suff":"eant"},"pofs":{"order":1,"$":"verb"},"mood":"subjunctive","num":"plural","pers":"3rd","tense":"present","voice":"active","stemtype":"conj2","morph":"comp_only"}}}}}}}

The JSON

"Lines of JSON objects" is not valid JSON, as you can see if you copy the lines above and paste them at jsonlint.com. Why the error? All the objects have to be contained in a JSON array. Also, for some reason "RDF" wasn't accepted as a field name (key). So we transformed the file locally, introducing "Verba" as the key of a top-level array, like this:

perl -0777 -wpl -e "s/\n/,\n/g;" jsonfilename | perl -wpl -e 's/"RDF"/"rdf"/g;' | perl -0777 -wpl -e 's/^{/{"Verba":\[{/;' | perl -0777 -wpl -e 's/,\n*$/\n\]}\n/;' > resultfilename.json
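The same transformation can be sketched in JavaScript (for Node.js, say), under the assumption that the input really is one JSON object per line; wrapResponses is a hypothetical name, not part of our actual toolchain:

```javascript
// Read one JSON object per line, rename the "RDF" key to "rdf",
// and wrap everything in a top-level "Verba" array -- the same job
// as the Perl pipeline above.
function wrapResponses(text) {
  var lines = text.split('\n').filter(function (line) {
    return line.trim().length > 0;       // skip empty lines
  });
  var objects = lines.map(function (line) {
    var parsed = JSON.parse(line);
    return { rdf: parsed.RDF };          // lower-case the key
  });
  return { Verba: objects };
}
```

Reading the file in and writing the result back out (with fs.readFileSync and JSON.stringify) is left out; the point is only the shape of the transformation.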

To process the JSON, we used JavaScript (being amazed by the d3 JavaScript library). The D3 documentation and examples showed how to read in a JSON file; and then I realized that the Morphology Service JSON is pretty complicated:

  • it is (sometimes deeply) nested
  • its objects contain arrays
  • an unsuccessfully lemmatized word won't have any "Body" object; a lemmatized word will have a single object there; for an ambiguously lemmatized word, "Body" will be an array of objects

The loops

So, lots of loops here. First the script had to loop through the array:

var infoLength = data.Verba.length;
for (var infoIndex = 0; infoIndex < infoLength; infoIndex++) {
  // ...
}

Then it had to check whether a JSON object contains a lemmatized word, i.e. whether it has a "Body" object. The a1 variable will hold the form sent to the service, while a2 will hold the lemmata (or the information that the word wasn't lemmatized successfully):

// the form sent to the service
var a1 = data.Verba[infoIndex].rdf.Annotation.hasTarget.Description.about;
// testing for existence of a lemma
var verbdata = data.Verba[infoIndex].rdf.Annotation.Body;
if (verbdata) {
  // ...
} else {
  var a2 = "FORMA NON RECOGNITA";
}

And then the "Body" object had to be tested to see whether it is an array; if it isn't, the JSON would be traversed all the way down to the "$" key (containing the dictionary entry for the lemma):

if (Object.prototype.toString.call(verbdata) === '[object Array]') {
  // Iterate the array and do stuff ...
} else {
  // perseus is a base-URL string defined elsewhere in the script
  var link2 = perseus + data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
  var a2 = data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
}

Now, if the verbdata variable contains an array, the array has to be iterated over, and a list (actually, a new array) built from its values:

var a2 = [];
for (var bodyIndex = 0; bodyIndex < verbdata.length; bodyIndex++) {
  a2.push(verbdata[bodyIndex].rest.entry.dict.hdwd.$);
}
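Put together, the three cases (no "Body", a single "Body" object, an array of them) can be folded into one helper; formaLemma is a hypothetical name, not from the script described here, and it takes one element of the Verba array and returns a [form, lemmata] pair:

```javascript
// Extract the form and its lemma(ta) from one Morphology Service
// response object, covering all three shapes of "Body".
function formaLemma(verbum) {
  var a1 = verbum.rdf.Annotation.hasTarget.Description.about;
  var verbdata = verbum.rdf.Annotation.Body;
  var a2;
  if (!verbdata) {
    // no "Body": the word wasn't lemmatized
    a2 = "FORMA NON RECOGNITA";
  } else if (Object.prototype.toString.call(verbdata) === '[object Array]') {
    // ambiguous lemmatization: collect every headword
    a2 = [];
    for (var i = 0; i < verbdata.length; i++) {
      a2.push(verbdata[i].rest.entry.dict.hdwd.$);
    }
  } else {
    // single lemma
    a2 = verbdata.rest.entry.dict.hdwd.$;
  }
  return [a1, a2];
}
```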

Finally, we used a small routine to populate a table with the resulting form/lemma pairs:

var columns = [a1, a2];
var table = d3.select("#container")
    .append("tr")
    .selectAll("td")
    .data(columns)
    .enter()
    .append("td")
    .text(function(column) { return column; });

Lots of trial and error (Firebug was a great help, and Stack Overflow an even greater one) need not be dwelt on. Just one limitation puzzles me: the JSON file contains responses for more than 2000 words, but my version of Firefox throws an error after ca. 720 objects read; either something should be optimized, or a paging system introduced. And, of course, seeing all 2000+ form/lemma pairs at once is neither necessary nor useful; the only thing we need is an option to sort out the unrecognized forms. This was added by the sorttable.js script.

Once again, the page with our JavaScript can be seen in action here: [X].

Thursday, 5 January 2012

Rare and Medium

This wintry afternoon I followed in the footsteps of William Whitaker, the author of the WORDS Latin dictionary. The program contains a list of Latin words with very precise lexicographic descriptions -- data on period, area of application, frequency, etc. The last part interested me most.

Whitaker, about whom I know almost nothing, though I'd like to know more (he seems to be outside academia) [1], was very modest and careful in his claims, repeatedly warning users of the program that its philological expertise is limited, that he relied on other authorities and sources, and that the program is intended to be just a reading help, not a research tool. Nevertheless, he produced, I believe, the most informative freely available digital reference work on Latin usage. I'd like to see a review of his work in some scholarly journal; I think he has deserved it.

Anyway, in the documentation on word frequencies Whitaker says:

FREQ guessed from the relative number of citations given by sources need not be valid, but seems to work. (...)

type FREQUENCY_TYPE is ( -- For dictionary entries
X, -- -- Unknown or unspecified
A, -- very freq -- Very frequent, in all Elementary Latin books, top 1000+ words
B, -- frequent -- Frequent, next 2000+ words
C, -- common -- For Dictionary, in top 10,000 words
D, -- lesser -- For Dictionary, in top 20,000 words
E, -- uncommon -- 2 or 3 citations
F, -- very rare -- Having only single citation in OLD or L+S
I, -- inscription -- Only citation is inscription
M, -- graffiti -- Presently not much used
N -- Pliny -- Things that appear only in Pliny Natural History
);

(Of course, Whitaker knows about Diederich's work -- he is the one who OCR'd Diederich's 1939 thesis and put it online.)

So, we're pleased to report that the Profile of Croatian Neo-Latin Project converted Whitaker's DICTPAGE.RAW to a MySQL table, and learned the following about how Whitaker's ten frequency categories are distributed among the 39,225 lemmata in his wordlist:

  1. X (Unknown or unspecified): 0

  2. A (very freq): 2134

  3. B (frequent): 2747

  4. C (common): 5113

  5. D (lesser): 8365

  6. E (uncommon): 11193

  7. F (very rare): 7974

  8. I (inscription): 430

  9. M (graffiti): 0

  10. N (Pliny): 1269

  11. Total: 39225
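As a sanity check, the counts above can be tallied in a couple of lines; frequencies, total and percent are illustrative names, and the numbers are those reported above:

```javascript
// Whitaker's frequency bands with the lemma counts reported above.
var frequencies = {
  X: 0, A: 2134, B: 2747, C: 5113, D: 8365,
  E: 11193, F: 7974, I: 430, M: 0, N: 1269
};

// Sum all bands.
function total(freqs) {
  return Object.keys(freqs).reduce(function (sum, key) {
    return sum + freqs[key];
  }, 0);
}

// Share of one band in the whole list, as a percentage.
function percent(freqs, band) {
  return 100 * freqs[band] / total(freqs);
}

// total(frequencies) → 39225
// percent(frequencies, 'E') → ca. 28.5 (the uncommon band dominates)
```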


Now we have something to compare. It is interesting to note that most words are uncommon.

[Further reading.] There is a recent publication, Joseph Denooz, Nouveau lexique fréquentiel de latin. Alpha-Omega. Reihe A Bd 258. Hildesheim/Zürich/New York: Georg Olms Verlag, 2010. Pp. ix, 453. ISBN 9783487144733. €148.00. (reviewed recently on BMCR, with a crucial question: "A dictionary such as this is a tool: so what can this one be used for?").

[1] A sad update. Thinking about possible reasons for William Whitaker's absence from the internet, I consulted the obituaries, and found the following:

Colonel William A. Whitaker (USAF-Retired) passed away on Tuesday, December 14, 2010. While at DARPA, he worked on the computer language ADA. In retirement, he created the Latin-English translation software program, "Whitaker Words". (...)
Published in Midland Reporter-Telegram on December 21, 2010
Source here.


Τάνδε κατ' εὔδενδρον στείβων δρίος εἴρυσα χειρὶ
πτώσσουσαν βρομίας οἰνάδος ἐν πετάλοις,
ὄφρα μοι εὐερκεῖ καναχὰν δόμῳ ἔνδοθι θείη,
τερπνὰ δι' ἀγλώσσου φθεγγομένα στόματος.

Requiescat in pace.

Quantification

This morning we had to compile some numbers on the Croatiae auctores Latini collection. Here they are (also on the CroALa developer's blog):

  • 143 TEI XML files (including, alas, some duplicates)

  • 437.218 words

  • 29.637.450 characters

  • 16.465,25 Textkarten

  • 1029 Druckbogen


The last two, rather exotic, categories belong to the German printing tradition, which was influential in the Croatian printing industry; we translated these terms (Textkarte = kartica teksta, Druckbogen = tiskarski arak), and we still use them in text accounting.

[Technical note.] Numbers were produced by the Linux wc command (cf. recipe) on all XML files currently in CroALa, also available on its Sourceforge page. The Linux one-liner for calculating the number of characters in multiple XML files was simple (wc reports lines, words, and characters; subtracting the line count removes the newline characters):

wc *.xml | awk '{print $3-$1}'
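Judging from the figures above, one Textkarte corresponds to 1,800 characters and one Druckbogen to 16 Textkarten; assuming those ratios (they are inferred from the numbers reported, not stated anywhere), the conversion is simple arithmetic:

```javascript
// Convert a raw character count into the two traditional units,
// assuming 1 Textkarte = 1,800 characters and
// 1 Druckbogen = 16 Textkarten.
function textkarten(characters) {
  return characters / 1800;
}

function druckbogen(characters) {
  return textkarten(characters) / 16;
}

// For the CroALa count reported above:
// textkarten(29637450) → 16465.25
// druckbogen(29637450) → ca. 1029
```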

Tuesday, 3 January 2012

Compare two lists 101

This is probably Programming 101, and should be (in analog form) Philology 101, but the combination of the two seems somehow to fall through the cracks. Also, there was a question about it today on the Digital Medievalist mailing list.

So, the problem for today is: we want to compare two lists of words.

(Let's say we have made one list by sorting all words from a text alphabetically, and then discarding all multiple occurrences -- save the first one, of course.)

Does position in a list matter, asks the programmer. No, answers the philologist -- we only want to know whether a word from list A occurs also in list B. Do the rows have multiple fields which should be compared, asks the programmer. No, there is only one field, one word, answers the philologist, thinking about finding a word from the text in a dictionary. Our wishes are modest. (Later, of course, we'll want to find where exactly the common words appear in documents A and B.)

If you use Excel, there seem to be some recipes at The Spreadsheet Page and elsewhere. There is also a Perl module List::Compare (if you are a brave philologist, and have a book or two on Perl handy, you can learn much from the problem). Finally, if you are an eccentric philologist and use Linux, there are standard text manipulation tools for Linux. Yes, here is where we realized that philology is much like programming: both are all about texts.

Surprisingly, the main problem in using all these tools (at least for me) turned out to be how to send a list to the tool, how to loop through all elements in a list, etc. Programming kindergarten, I guess -- but philologists don't usually have to think about how to turn the pages or how to scan lines of text, much less to issue instructions such as "now lift the hand... spread the thumb and index finger... catch the page edge lightly... lift again..." (I know, even programmers don't do it anymore either these days; Dada used to do it when she was studying electrical engineering.)

So how do I actually compare two lists? Here is one of my Bash scripts:

egrep -f list1 list2 > resultlist

As you can see, it takes great wisdom and sophistication.

egrep, today the same as grep, is a utility which finds words (strings) in a file. With the -f option, it reads a list of patterns from a file (where every line is a query). list2 is the file to be searched (it does not actually have to be sorted -- it can simply be the original text; list1 does not have to be sorted either, by the way). The "greater than" sign sends the output to a file (called resultlist); without it, the results would just fly across our screen.
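For comparison, the same operation can be sketched in JavaScript (intersectLists is an invented name); note that, unlike egrep, which also matches substrings, this version matches whole words only:

```javascript
// Keep every word from listA that also occurs in listB --
// the JavaScript counterpart of egrep -f list1 list2.
function intersectLists(listA, listB) {
  var inB = {};
  for (var i = 0; i < listB.length; i++) {
    inB[listB[i]] = true;      // index list B for fast lookup
  }
  return listA.filter(function (word) {
    return inB.hasOwnProperty(word);
  });
}

// intersectLists(['amo', 'video', 'lego'], ['video', 'lego', 'curro'])
//   → ['video', 'lego']
```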

And basically, that's all there is to it. Try not to be frustrated if something goes wrong, look for recipes and explanations on the internet, and remember that you (hopefully) cannot break anything in your computer by experimenting with this kind of command.

Monday, 2 January 2012

Looking at neo-Latin

Problem: we want to research strange words in neo-Latin texts.

Of course, that depends on what we consider to be "strange". This can mean:
a. Latin words which are rare or non-existent in classical Latin (mutatis mutandis, the language in which the Romans wrote until c. 500 AD)
b. words which are strange to us
c. words which were strange to authors or their public
d. words which are in a neo-Latin text, but are not Latin

Let us here consider case a. It turns out to have several sub-problems of its own:
a.1. words which don't exist in Latin of the Romans (see the Neulateinische Wortliste by J. Ramminger)
a.2. words which are rare in all periods of Latin
a.3. words which are rare in Latin of the Romans, but frequent in later Latin (e. g. medieval), or in some Latin idioms (e. g. ecclesiastical Latin)
a.4. words which are rare in some genres, but frequent in others

Basically, there are two approaches to our research. We can start from our texts, examining their words and looking for them in different wordlists. Or we can start from lists, and see if our texts contain some of their words.

It all comes down to comparing lists. The longer the better.

But we need special lists. Here's a list (sic) of them:
a. a list of words in our text
b. a list of lemmata of words from our text
c. a list of words which are rare in classical Latin
d. a list of neo-Latin words
(e. and a list of frequent Latin words would also come in handy)

We also need some tools. If we want to go from a text to the lists, we'll need:
a. something to list all words in our text
b. a Latin parser (we feed it a form, and get back the lemma)
c. a way to communicate our words to the parser (and get back the results)
d. something which can compare two lists
e. something which can write out the results

If we go from a list to the text, we'll need:
a. a Latin stemmer (we'll look only for stems, and disregard the endings)
b. a regular expressions tool (to find a complete word, given a stem)
c. tools for comparing and storing the results, as above
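The stemmer-plus-regex idea from the list above can be sketched like this; findByStems and the sample stems are invented for illustration (real stems would need escaping if they contained regex metacharacters):

```javascript
// Given a list of stems, build a regular expression which finds
// complete words beginning with any of them -- points a. and b. above.
function findByStems(stems, text) {
  var pattern = new RegExp('\\b(?:' + stems.join('|') + ')\\w*', 'g');
  return text.match(pattern) || [];
}

// findByStems(['abdu', 'abstine'], 'abduxerunt milites, abstineant ceteri')
//   → ['abduxerunt', 'abstineant']
```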

Tuesday, 1 November 2011

The more, the merrier

Today's experiments with lemmatizing a neo-Latin word list (from Ludovik Crijević Tuberon's Commentarii) seem to show that repeated passes through the Archimedes Project lemmatizer give better results.

Perhaps the lemmatizer has some kind of limit; Tuberon's word list had ca. 20,000 forms.

Anyway, now we have a Bash script that can do any number of passes. And the final list of "strange" (i.e. not lemmatized, because not recognized) words is here: [X].
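The idea behind the repeated passes can be sketched as follows; multiPass and lemmatize are illustrative names (a real lemmatize would wrap a call to the Archimedes Project service), not the contents of our Bash script:

```javascript
// lemmatize takes a list of forms and returns an object mapping the
// recognized forms to lemmata. multiPass keeps resubmitting the
// leftover forms until a pass yields nothing new, or the pass limit
// is reached.
function multiPass(forms, lemmatize, maxPasses) {
  var lemmata = {};
  var remaining = forms;
  for (var pass = 0; pass < maxPasses && remaining.length > 0; pass++) {
    var result = lemmatize(remaining);
    var still = [];
    remaining.forEach(function (form) {
      if (result.hasOwnProperty(form)) {
        lemmata[form] = result[form];   // recognized in this pass
      } else {
        still.push(form);               // try again next pass
      }
    });
    if (still.length === remaining.length) break; // nothing new, stop
    remaining = still;
  }
  return { lemmata: lemmata, unrecognized: remaining };
}
```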