Tuesday 1 November 2011

The more, the merrier

Today's experiments with lemmatizing a neo-Latin word list (from Ludovik Crijević Tuberon's Commentarii) seem to show that repeated passes through the Archimedes Project lemmatizer give better results than a single pass.

Perhaps the lemmatizer has some kind of limit on how much it can handle in one pass; Tuberon's word list had ca. 20,000 forms.

Anyway, now we have a Bash script that can do any number of passes. And the final list of "strange" words (i.e. not lemmatized, because not recognized) is here: [X].
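
For the curious, here is a minimal sketch of what such a multi-pass loop might look like. It is not the actual script: lemmatize_pass is a hypothetical stand-in for the real call to the lemmatizer, and it only simulates a size limit (it looks at the first two lines of its input and "recognizes" forms ending in -us). The idea is simply to feed the leftover forms back in until a pass recognizes nothing new.

```shell
#!/usr/bin/env bash

# ASSUMPTION: hypothetical stand-in for the real lemmatizer call.
# Reads word forms (one per line) on stdin and prints the forms it
# did NOT recognize. It fakes a size limit by examining only the
# first 2 lines, and "recognizes" any of those ending in -us.
lemmatize_pass() {
  awk 'NR <= 2 && /us$/ { next } { print }'
}

# multi_pass PASSES FILE
# Runs up to PASSES passes over the forms in FILE, feeding each pass
# only the forms left unrecognized by the previous one, and prints
# the forms that are still unrecognized at the end.
multi_pass() {
  local passes=$1 input=$2
  local todo next i
  todo=$(mktemp)
  next=$(mktemp)
  cp "$input" "$todo"
  for ((i = 1; i <= passes; i++)); do
    lemmatize_pass < "$todo" > "$next"
    # Stop early when a pass recognizes nothing new.
    if (( $(wc -l < "$next") == $(wc -l < "$todo") )); then
      break
    fi
    mv "$next" "$todo"
  done
  cat "$todo"
  rm -f "$todo" "$next"
}
```

Called as `multi_pass 10 wordlist.txt` (the file name is just an example), this prints whatever is still "strange" after at most ten passes. Stopping as soon as a pass makes no progress avoids querying the lemmatizer again for forms it will never recognize.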