We are interested in cases where contents of the token field are the same, but code is different. Code holds grammatical information; the first letter is a shorthand for part of speech (v = verb in the example above). For the moment, we can retrieve this for a specific word, using the following SQL query:tokenID|token|code|lemma|type 1266165|rutilantis|v--pppafa-|rutilo||newmorph 1266166|rutilantis|v--pppama-|rutilo||newmorph 1266167|rutilantis|v--sppafg-|rutilo||newmorph 1266168|rutilantis|v--sppamg-|rutilo||newmorph 1266169|rutilantis|v--sppang-|rutilo||newmorph
select distinct code1 from ( select substr(code, 1, 1) as code1 from ( select code from Lexicon where token like "verum") );Grouping on two fields (a recipe found on Stack Overflow) seems promising:
select token , code, count(*) from Lexicon group by token collate nocase, code having (count(*)>1);And I think this would be the final query:
select distinct token , c from (select token , substr(code, 1, 1) as c from Lexicon group by token collate nocase, code having (count(*)>1) limit 30);
Now off to check it on all 1,257,854 rows in the Lexicon table.
It took a bit of post-processing with a bash command (had to swap POS and Wordform fields):
sort hom.csv -t, -k1 \ | sed 's/\([^,]*\),\(.\)/\2,\1/g' - \ | uniq -D -s 2 > hom-pos.csvThe results are now publicly available as a Google Fusion table, all 19,350 rows of them: homonyms (and homographs) differing by part of speech, found in a real digital corpus of Latin texts.
No comments:
Post a Comment