Thursday 27 December 2012

Saturnalia with Perseus Latin JSON

A two-day Christmas project: learn how to use Latin lemmata in JSON format, as provided by the Morphological Analysis Service available on an instance of the Bamboo Services Platform hosted by the University of California, Berkeley at http://services-qa.projectbamboo.org/bsp/morphologyservice (used by Perseus for Latin and announced by Bridget Almas on November 1, 2012).

The fanfare: see the results here: [X].

Caveat. If you're a programmer, the following may seem extremely silly, because it is a description of how a non-programmer solved a programmer's task.

The task

What we wanted to do. There is a file of saved JSON responses, produced by sending a list of Latin words (from a Croatian neo-Latin text) to the Morphological Analysis Service and appending the responses. We wanted to produce a simple table containing each form sent to the service and the lemma received in response. This had already been achieved locally and indirectly, by processing the responses file with some Perl and Linux CLI regex tools to produce an HTML page. But now I wanted to learn how to process JSON as JSON. Solving the problem also taught me the structure of a Morphological Analysis Service response.

The responses file contains lines of JSON objects:

{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abduxerunt:morpheus","hasTarget":{"Description":{"about":"urn:word:abduxerunt"}},"hasBody":{"resource":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT","Body":{"about":"urn:uuid:58f0bfcf-0180-4596-92d7-e88eaccffa8b","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abduco"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abdu_x","suff":"e_runt"},"pofs":{"order":1,"$":"verb"},"mood":"indicative","num":"plural","pers":"3rd","tense":"perfect","voice":"active","stemtype":"perfstem"}}}}}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abscente:morpheus","hasTarget":{"Description":{"about":"urn:word:abscente"}},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:28\nGMT"}}}
{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abstineant:morpheus","hasTarget":{"Description":{"about":"urn:word:abstineant"}},"hasBody":{"resource":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52"},"title":null,"creator":{"Agent":{"about":"org.perseus:tools:morpheus.v1"}},"created":"26\nDec\n2012\n12:01:29\nGMT","Body":{"about":"urn:uuid:566cf4ec-2a8c-452f-a02f-5e0cecf32f52","type":{"resource":"cnt:ContentAsXML"},"rest":{"entry":{"uri":null,"dict":{"hdwd":{"lang":"lat","$":"abstineo"},"pofs":{"order":1,"$":"verb"}},"infl":{"term":{"lang":"lat","stem":"abs:tin","suff":"eant"},"pofs":{"order":1,"$":"verb"},"mood":"subjunctive","num":"plural","pers":"3rd","tense":"present","voice":"active","stemtype":"conj2","morph":"comp_only"}}}}}}}

The JSON

"Lines of JSON objects" are not valid JSON, as you can see if you copy the lines above and paste them into jsonlint.com. Why the error? A JSON document must consist of a single top-level value, so the objects have to be wrapped in a JSON array. Also, for some reason "RDF" wasn't accepted as a field name (key). So we transformed the file locally, wrapping the objects in an array under a new top-level key, "Verba", like this:

perl -0777 -wpl -e "s/\n/,\n/g;" jsonfilename | perl -wpl -e 's/"RDF"/"rdf"/g;' | perl -0777 -wpl -e 's/^{/{"Verba":\[{/;' | perl -0777 -wpl -e 's/,\n*$/\n\]}\n/;' > resultfilename.json
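The same wrapping can also be sketched in JavaScript itself, which avoids regex surgery on the raw text. This is only an illustration, not the script used for the page: the function name wrapResponses is hypothetical, and it assumes that every non-empty line of the responses file holds exactly one complete JSON object.

```javascript
// Hypothetical helper (not part of the original page): turns newline-delimited
// JSON responses into one object with a top-level "Verba" array.
// Assumption: each non-empty line is one complete JSON object.
function wrapResponses(text) {
  const records = text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const parsed = JSON.parse(line);
      // Mirror the Perl step that renames the "RDF" key to "rdf".
      return { rdf: parsed.RDF };
    });
  return { Verba: records };
}

// Two shortened stand-ins for lines of the responses file.
const sampleLines = [
  '{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abduxerunt:morpheus"}}}',
  '{"RDF":{"Annotation":{"about":"urn:TuftsMorphologyService:abscente:morpheus"}}}'
].join("\n");

const wrapped = wrapResponses(sampleLines);
console.log(wrapped.Verba.length); // 2
```

Because each line is parsed separately, a malformed response fails at JSON.parse for that one line instead of silently corrupting the whole file.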

To process the JSON, we used JavaScript (I was amazed by the d3 JavaScript library). The D3 documentation and examples showed how to read in a JSON file, and then I realized that the Morphology Service JSON is pretty complicated:

  • it is (sometimes deeply) nested
  • its objects contain arrays
  • an unsuccessfully lemmatized word won't have any "Body" object; a lemmatized word will have a single object there; for an ambiguously lemmatized word, "Body" will be an array
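The three shapes of "Body" can be told apart with a small check. The sketch below is illustrative only: the function name bodyKind and its return labels are made up for this example, and the records are assumed to have already been transformed so that the top key is "rdf".

```javascript
// Hypothetical classifier for the three "Body" shapes described above.
// Assumption: record is one element of the transformed "Verba" array.
function bodyKind(record) {
  const body = record.rdf.Annotation.Body;
  if (body === undefined || body === null) {
    return "unrecognized"; // no "Body": the service found no lemma
  }
  // An array means several competing analyses; otherwise a single object.
  return Array.isArray(body) ? "ambiguous" : "single";
}

// Shortened records, keeping only the parts the check looks at.
const noBody = { rdf: { Annotation: {} } };
const oneBody = { rdf: { Annotation: { Body: { rest: {} } } } };
const manyBodies = { rdf: { Annotation: { Body: [{ rest: {} }, { rest: {} }] } } };

console.log(bodyKind(noBody), bodyKind(oneBody), bodyKind(manyBodies));
```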

The loops

So, lots of loops here. First the script had to loop through the array:

var infoLength = data.Verba.length;
for (var infoIndex = 0; infoIndex < infoLength; infoIndex++) {
  // ...
}

Then it had to check whether a JSON object contains a lemmatized word, that is, whether it has a "Body" object. The a1 variable holds the form sent to the service, while a2 holds the lemmata (or a note that the word wasn't lemmatized successfully):

var a1 = data.Verba[infoIndex].rdf.Annotation.hasTarget.Description.about;
// testing for existence of lemma
var verbdata = data.Verba[infoIndex].rdf.Annotation.Body;
if (verbdata) {
  // ...
} else {
  var a2 = "FORMA NON RECOGNITA";
}

And then the "Body" value had to be tested for being an array; if it isn't an array, the JSON is traversed all the way to the "$" key (containing the dictionary entry for the lemma):

if (Object.prototype.toString.call(verbdata) === '[object Array]') {
  // Iterate the array and do stuff ...
} else {
  var link2 = perseus + data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
  var a2 = data.Verba[infoIndex].rdf.Annotation.Body.rest.entry.dict.hdwd.$;
}

If the verbdata variable does contain an array, the array has to be iterated over, and a list (actually, a new array) built from its values:

var a2 = [];
for (var bodyIndex = 0; bodyIndex < verbdata.length; bodyIndex++) {
  a2.push(verbdata[bodyIndex].rest.entry.dict.hdwd.$);
}
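The branches described so far can be folded into one function by treating a single "Body" as an array of length one. This is a sketch, not the code running on the page: the name formAndLemmata is hypothetical, and it assumes (as the snippets above do) that every body has exactly one entry with a dict.hdwd.$ value.

```javascript
// Hypothetical helper combining the checks above into one step.
// Assumption: each body holds a single entry with dict.hdwd.$ (the lemma).
function formAndLemmata(record) {
  const annotation = record.rdf.Annotation;
  const form = annotation.hasTarget.Description.about;
  const body = annotation.Body;
  if (!body) {
    // Same placeholder the page uses for unrecognized forms.
    return { form: form, lemmata: ["FORMA NON RECOGNITA"] };
  }
  // Normalize: a single object becomes a one-element array.
  const bodies = Array.isArray(body) ? body : [body];
  const lemmata = bodies.map((b) => b.rest.entry.dict.hdwd.$);
  return { form: form, lemmata: lemmata };
}

// Shortened record modelled on the "abduxerunt" response shown earlier.
const abduxerunt = {
  rdf: {
    Annotation: {
      hasTarget: { Description: { about: "urn:word:abduxerunt" } },
      Body: { rest: { entry: { dict: { hdwd: { lang: "lat", $: "abduco" } } } } }
    }
  }
};

console.log(formAndLemmata(abduxerunt).lemmata); // [ 'abduco' ]
```

Returning an array in all three cases means the table code never has to ask what kind of value it received.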

Finally, we used a small routine to populate a table with resulting forms / lemmata pairs:

var columns = [a1, a2];
var table = d3.select("#container")
  .append("tr")
  .selectAll("td")
  .data(columns)
  .enter()
  .append("td")
  .text(function(column) { return column; });

Lots of trial and error (Firebug was a great help, and Stack Overflow an even greater one) need not be dwelt on. Just one limitation puzzles me: the JSON file contains responses for more than 2000 words, but my version of Firefox throws an error after ca. 720 objects have been read; either something should be optimized, or a paging system introduced. And, of course, seeing all 2000+ form/lemma pairs at once is neither necessary nor useful; the only thing we need is an option to sort out the unrecognized forms. This was added with the sorttable.js script.
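A paging system of the kind mentioned above might start from something like the following. It is a rough sketch and has not been tried against the real file; the function names and the row shape ({ form, lemma }) are invented for the example, and it says nothing about why Firefox stops at around 720 objects.

```javascript
// Hypothetical row shape: { form: "...", lemma: "..." }.
const UNRECOGNIZED = "FORMA NON RECOGNITA"; // placeholder used on the page

// Keep only the forms the service could not lemmatize.
function unrecognizedOnly(rows) {
  return rows.filter((row) => row.lemma === UNRECOGNIZED);
}

// Return one page of rows; pageIndex starts at 0.
function pageOf(rows, pageIndex, pageSize) {
  const start = pageIndex * pageSize;
  return rows.slice(start, start + pageSize);
}

const rows = [
  { form: "abduxerunt", lemma: "abduco" },
  { form: "abscente", lemma: UNRECOGNIZED },
  { form: "abstineant", lemma: "abstineo" }
];

console.log(unrecognizedOnly(rows).map((row) => row.form)); // [ 'abscente' ]
console.log(pageOf(rows, 1, 2).map((row) => row.form)); // [ 'abstineant' ]
```

Only the rows of the current page would then be handed to the table routine, so the browser never has to build all 2000+ rows at once.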

Once again, the page with our JavaScript can be seen in action here: [X].
