Saturday 4 January 2014

Finding homonyms in a (Latin) treebank

Homonyms and homographs (H & H) are worth mastering in any language: much confusion disappears once we understand the differences, and our grasp of the language improves significantly.

We don't want to run away from H & H; we want to tackle them head-on. When we read a text, we can pull out a list of phrases containing H & H, read them, and see which meaning is used where.

To do this, we need:

Our task can be done in many ways, including purely "manual" ones. But we would like to use the Perseus Latin Treebank files to find more homonyms and homographs, and to extract phrases from these treebank files as well.

Finding H & H turns out to be computationally demanding. With 50,000+ words my computer chokes and never finishes, and 10,000 words is still too much for it. Sets of about 2,500 words are about right: they don't take all night to finish.

To an XML file like this:

<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgo" postag="d--------"/>
<w form="vulgus" postag="n-s---mn-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnerat" postag="v3spia---"/>
<w form="vulneratum" postag="t-srppma-"/>
<w form="vulnere" postag="n-s---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulnus" postag="n-s---nn-"/>
<w form="vulpes" postag="n-p---fn-"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vult" postag="v3spia---"/>
<w form="vultis" postag="v2ppia---"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-s---mn-"/>
</uniq>

We apply the following XQuery (using BaseX, in my case):

(: Emit each distinct (form, postag) pair once, wrapped in <uniq>. :)
element uniq {
  let $a := //*:w
  for $l in distinct-values($a/@form),
      $f in distinct-values($a[@form = $l]/@postag)
  return
    element w {
      attribute form { $l },
      attribute postag { $f }
    }
}

Result:

<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgus" postag="n-s---mn-"/>
<w form="vulnera" postag="n-p---na-"/>
<w form="vulnerat" postag="v3spia---"/>
<w form="vulneratum" postag="t-srppma-"/>
<w form="vulnere" postag="n-s---nb-"/>
<w form="vulneribus" postag="n-p---nb-"/>
<w form="vulnus" postag="n-s---nn-"/>
<w form="vulpes" postag="n-p---fn-"/>
<w form="vult" postag="v3spia---"/>
<w form="vultis" postag="v2ppia---"/>
<w form="vultu" postag="n-s---mb-"/>
<w form="vultum" postag="n-s---ma-"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-s---mn-"/>
</uniq>
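
The query above rescans every w element for each distinct form, which may be part of why larger sets take so long. As a rough sketch of the same de-duplication done in a single pass, here is a small Python version (outside BaseX, standard library only). The function names are my own, the sample lines are copied from the file above, and it assumes the w elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# A few lines copied from the input file shown earlier in this post.
SAMPLE = """<uniq>
<w form="vulgo" postag="d--------"/>
<w form="vulgo" postag="d--------"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
<w form="vultus" postag="n-p---ma-"/>
</uniq>"""


def unique_pairs(xml_text):
    """Return sorted, de-duplicated (form, postag) pairs from the <w> elements."""
    root = ET.fromstring(xml_text)
    # A set keeps each (form, postag) pair once; one pass over the input.
    # Assumption: <w> elements are not in an XML namespace.
    pairs = {(w.get("form"), w.get("postag")) for w in root.iter("w")}
    return sorted(pairs)


def to_xml(pairs):
    """Serialize the pairs back into a <uniq> element (hypothetical helper)."""
    root = ET.Element("uniq")
    for form, postag in pairs:
        ET.SubElement(root, "w", form=form, postag=postag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(unique_pairs(SAMPLE)))
```

Whether this is actually faster on the full treebank is something I have not measured; it only shows that the result does not require nested loops.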

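The most interesting cases are those in which the @postag attribute begins with a different character for the same @form, that is, the same form is tagged as different parts of speech (the first position of the postag encodes the part of speech), e.g.: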

<w form="vivis" postag="a-p---mb-"/>
<w form="vivis" postag="a-p---md-"/>
<w form="vivis" postag="n-p---mb-"/>
<w form="vivis" postag="v2spia---"/>
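
Such cases can be picked out automatically from the de-duplicated list. The following Python sketch groups the entries by form and keeps only the forms whose postag values start with more than one distinct character. The function name is hypothetical, the sample lines are copied from the examples in this post, and the w elements are again assumed to have no namespace.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Lines copied from the examples in this post.
SAMPLE = """<uniq>
<w form="vivis" postag="a-p---mb-"/>
<w form="vivis" postag="a-p---md-"/>
<w form="vivis" postag="n-p---mb-"/>
<w form="vivis" postag="v2spia---"/>
<w form="vultus" postag="n-p---ma-"/>
<w form="vultus" postag="n-s---mg-"/>
</uniq>"""


def forms_with_mixed_pos(xml_text):
    """Map each form to the sorted list of its postag initials,
    keeping only forms that have more than one distinct initial."""
    initials = defaultdict(set)
    for w in ET.fromstring(xml_text).iter("w"):
        # The first character of the postag is the part of speech.
        initials[w.get("form")].add(w.get("postag")[:1])
    return {
        form: sorted(chars)
        for form, chars in initials.items()
        if len(chars) > 1
    }


if __name__ == "__main__":
    print(forms_with_mixed_pos(SAMPLE))
```

In this sample "vultus" is dropped because all of its tags begin with n, while "vivis" is kept because it is tagged as adjective, noun and verb.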

Then we look for "vivis", e.g. in the Croatiae auctores Latini text collection.
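
If the texts are available as plain text, the lookup itself can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name is hypothetical, the sentence splitting is naive, and the sample sentences are made up for the example, not taken from any collection.

```python
import re


def sentences_with_form(text, form):
    """Return the sentences of `text` that contain `form` as a whole word."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # \b keeps "vivis" from matching inside longer words such as "convivis".
    pattern = re.compile(r"\b" + re.escape(form) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]


# Made-up sample sentences, for illustration only.
SAMPLE_TEXT = "Tu vivis in urbe. Convivis vinum dedit. Pro vivis et mortuis oramus."

if __name__ == "__main__":
    for sentence in sentences_with_form(SAMPLE_TEXT, "vivis"):
        print(sentence)
```

The matching is by surface form only, so deciding which meaning of "vivis" is used in each sentence is still left to the reader.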
