Tuesday 1 January 2013

Filtering Latin words

For anyone speaking Croatian or a host of related languages, "filter" means first and foremost "cigarette filter". There is even a legendary song from the 1980s built around it.

In profiling Croatian Latin, however, filters are, more prosaically, ways to save time and resources. Once we have a sufficient set of lemmatized Latin words, we can avoid sending these words to the Morphology Service again.

Not one, but three filters are needed. We start from the list of forms contained in a Latin text (any text that we intend to include in CroALa). The first filter removes all forms that were previously lemmatized unambiguously. From the remaining set, the second filter removes what was previously recognized, but ambiguously. Finally, the third filter removes forms that were previously encountered, but not recognized by the Morphology Service.
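The cascade of three filters can be sketched with plain set operations. This is only an illustration in Python, not the Bash, Perl, and MySQL mix actually used here; the function name `apply_filters`, the three lookup sets, and the sample words are all hypothetical.

```python
def apply_filters(forms, unambiguous, ambiguous, unrecognized):
    """Pass the distinct forms of a text through three successive filters.

    Returns four sets: forms already lemmatized unambiguously, forms already
    lemmatized ambiguously, forms already seen but not recognized, and the
    remainder, which still has to be sent to the Morphology Service.
    """
    remaining = set(forms)  # work on distinct forms, not running words

    known = remaining & unambiguous          # filter 1
    remaining -= known

    known_ambiguous = remaining & ambiguous  # filter 2
    remaining -= known_ambiguous

    known_unrecognized = remaining & unrecognized  # filter 3
    remaining -= known_unrecognized

    return known, known_ambiguous, known_unrecognized, remaining


# Hypothetical sample data, standing in for the stored word lists.
sample_forms = ["et", "cum", "jurjevich", "carcere", "et"]
sample_result = apply_filters(
    sample_forms,
    unambiguous={"et", "civitas"},
    ambiguous={"cum"},
    unrecognized={"jurjevich"},
)
to_send = sorted(sample_result[3])
```

In this toy run only "carcere" is left in `to_send`; everything else was caught by one of the three filters.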

What is left is ready to be sent to the Morphology Service. The resulting JSON will again be sifted into three groups: the lemmatized words, the ambiguously lemmatized, and the unrecognized.

For example, a letter by Juraj Jurjević, a little-known nobleman from Zadar interned in Venice in 1418 (Zadar was definitively subjugated by Venice in 1409), consists of 755 words in 536 different forms. The filters separate these forms into 173 previously recognized, 95 previously ambiguously recognized, and 268 remaining (now I see that we could also have applied the filter for previously unrecognized words, but we didn't do it today).

So 268 Latin forms travelled across the globe to be processed by the Morphology Service on the first day of 2013. Of these forms, 180 were unambiguously lemmatized; there were 139 ambiguous identifications; and 29 forms were listed as forma non recognita. The total exceeds 268, of course, because of the ambiguously identified forms: each of their lemmata gets a row of its own.
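Why the rows outnumber the forms can be shown with a small sketch. It is written in Python for illustration only, and it assumes that the service's answer has already been reduced to a mapping from each form to its list of lemmata; that mapping, the names `sift` and `count_rows`, and the sample analyses are hypothetical, not the actual JSON returned by the Morphology Service.

```python
def sift(results):
    """Sort analysed forms into three groups by the number of lemmata found."""
    lemmatized, ambiguous, unrecognized = {}, {}, []
    for form, lemmata in results.items():
        if not lemmata:
            unrecognized.append(form)       # forma non recognita
        elif len(lemmata) == 1:
            lemmatized[form] = lemmata[0]   # exactly one lemma
        else:
            ambiguous[form] = list(lemmata) # two or more candidate lemmata
    return lemmatized, ambiguous, unrecognized


def count_rows(lemmatized, ambiguous, unrecognized):
    """One row per unambiguous form, per candidate lemma, per unrecognized form."""
    return (len(lemmatized)
            + sum(len(lemmata) for lemmata in ambiguous.values())
            + len(unrecognized))


# Hypothetical analyses of four forms.
sample = {
    "amor": ["amor"],
    "canis": ["canis", "cano", "canus"],
    "est": ["sum", "edo"],
    "jurjevich": [],
}
groups = sift(sample)
rows = count_rows(*groups)  # four forms, but 1 + 3 + 2 + 1 rows

# The figures from the letter: 180 + 139 + 29 rows for 268 forms.
letter_rows = 180 + 139 + 29
# Since 180 and 29 count forms, 268 - 180 - 29 forms would be left
# to account for the 139 ambiguous rows.
letter_ambiguous_forms = 268 - 180 - 29
```

With the letter's figures this gives 348 rows for 268 forms, and it would leave 59 ambiguous forms sharing the 139 ambiguous rows.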

Tomorrow I'll write up how all this was accomplished programmatically, in a mix of Bash, Perl, and MySQL.
