Thursday 5 January 2012

Quantification

This morning we had to compile some numbers on the Croatiae auctores Latini collection. Here they are (also on the CroALa developer's blog):

  • 143 TEI XML files (including, alas, some duplicates)

  • 437.218 words

  • 29.637.450 characters

  • 16.465,25 Textkarten

  • 1029 Druckbogen


The last two, somewhat unusual, categories come from the German printing tradition, which was influential in the Croatian printing industry; we translated the terms (Textkarte = kartica teksta, Druckbogen = tiskarski arak) and still use them in text accounting.
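As a rough check on the last two figures, here is a minimal sketch of the conversion. The ratios are assumptions inferred from the numbers in the list (they are not stated in this post): 1800 characters per Textkarte and 16 Textkarten per Druckbogen. The function name text_units is made up for illustration.

```shell
# Hypothetical helper: convert a character count into Textkarten and Druckbogen.
# Assumed ratios (inferred from the figures above, not given in the post):
#   1 Textkarte  = 1800 characters
#   1 Druckbogen = 16 Textkarten
text_units() {
    # LC_ALL=C keeps the decimal point independent of the user's locale
    LC_ALL=C awk -v chars="$1" 'BEGIN {
        textkarten = chars / 1800
        druckbogen = int(textkarten / 16)   # whole sheets only
        printf "%.2f Textkarten, %d Druckbogen\n", textkarten, druckbogen
    }'
}

text_units 29637450
```

With the character count from the list, this reproduces the two figures given there, which suggests the assumed ratios are the ones we used.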

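[Technical note.] The numbers were produced by running the Linux wc command (cf. recipe) on all XML files currently in CroALa, which are also available on its Sourceforge page. The one-liner for counting characters in multiple XML files is simple. wc reports lines, words, and bytes for each file (the word count is its second column), and the awk step subtracts the line count from the byte count, so that newline characters are not counted: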

wc *.xml | awk '{print $3-$1}'
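The one-liner prints one number per file plus a total. If only the grand totals are wanted, a variant along the following lines may be handy. This is a sketch, not the command we actually ran; the function name count_xml and its directory argument are made up for illustration.

```shell
# Hypothetical helper: print "<words> <characters>" summed over all XML files
# in a directory (default: the current one).
count_xml() {
    dir=${1:-.}
    # cat merges the files, so wc prints a single line: lines words bytes.
    # $2 is the word count; $3 - $1 is bytes minus newlines, as in the one-liner.
    cat "$dir"/*.xml | wc | awk '{print $2, $3 - $1}'
}

# Usage (in a directory containing the XML files):
#   count_xml .
```

Two caveats apply to both versions: the counts include the XML markup, not just the text, and wc counts bytes by default, so any multi-byte characters are counted more than once (wc -m counts characters instead).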
