Thursday, 9 January 2014

An SQLite SQL for homonyms

In an SQLite database there is a table of parsed (Latin) word-forms, like this:
tokenID|token|code|lemma|type
1266165|rutilantis|v--pppafa-|rutilo||newmorph
1266166|rutilantis|v--pppama-|rutilo||newmorph
1266167|rutilantis|v--sppafg-|rutilo||newmorph
1266168|rutilantis|v--sppamg-|rutilo||newmorph
1266169|rutilantis|v--sppang-|rutilo||newmorph
We are interested in cases where contents of the token field are the same, but code is different. Code holds grammatical information; the first letter is a shorthand for part of speech (v = verb in the example above). For the moment, we can retrieve this for a specific word, using the following SQL query:
select distinct code1 from (
   select substr(code, 1, 1) as code1 from (
     select code from Lexicon where token like "verum")
      );
Grouping on two fields (a recipe found on Stack Overflow) seems promising:
select token , code, count(*) 
 from Lexicon 
 group by token collate nocase, code 
 having (count(*)>1);
And I think this would be the final query:
select distinct token , c 
from (select token , substr(code, 1, 1) as c 
      from Lexicon 
      group by token collate nocase, code 
      having (count(*)>1) limit 30);

Now off to check it on all 1,257,854 rows in the Lexicon table.

It took a bit of post-processing with a bash command (had to swap POS and Wordform fields):

sort hom.csv -t, -k1 \
| sed 's/\([^,]*\),\(.\)/\2,\1/g' - \
| uniq -D -s 2 > hom-pos.csv
The results are now publicly available as a Google Fusion table, all 19,350 rows of them: homonyms (and homographs) differing by part of speech, found in a real digital corpus of Latin texts.

No comments:

Post a Comment