<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extraction and Analysis of Proper Nouns in Slovak Texts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Radovan Garabík</string-name>
          <email>garabik@kassiopeia.juls.savba.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Radoslav Brída</string-name>
          <email>brida@korpus.sk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Copyright © by the paper's authors. Copying permitted for private and academic purposes.</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>In Vito Pirrelli, Claudia Marzi, Marcello Ferro (eds.): Word Structure and Word Usage. Proceedings of the NetWordS Final Conference</institution>
          ,
          <addr-line>Pisa, March 30-April 1, 2015, published at http://ceur-ws.org</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
      </contrib-group>
      <fpage>140</fpage>
      <lpage>143</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Recognition of unknown named entities in inflected
languages faces several specific problems. First
and foremost, the entities themselves are
inflected<sup>1</sup>
        <xref ref-type="bibr" rid="ref1">(Dvonč et al., 1966)</xref>,
        which makes it hard both to identify word forms as belonging to the same
lexeme and to find the correct
lemma. In this article we analyse the distribution
of word forms for proper nouns in Slovak and
describe an algorithm for their automatic extraction
and lemmatisation.
      </p>
      <p>The task of lemmatisation and morphological
annotation of inflectional (and more specifically, Slavic)
languages is reasonably well researched and developed
(Hajič, 2004). We cannot expect a
morphological database (data relating lemmata to inflected
word forms and their grammatical tags) to cover all
or almost all the words present in a corpus,
especially proper names, which keep appearing depending
on who or what has become a hot topic in the mass
media. A well-tuned guesser can therefore improve the
accuracy of lemmatisation and tagging.</p>
      <p><sup>1</sup> E.g. for the lemma Galileo, the genitive would be Galilea, the
dative Galileovi, etc.</p>
      <p>Common sense says that named entities (proper
names in particular) behave differently from
common nouns. In information-theoretic
terms, the information about whether
a word is a proper name is not independent of the
information about its morphological paradigm.
We can therefore use the information about proper
names to decrease the entropy of inflections, which
helps the guesser choose between
the possible lemmata and morphological tags.</p>
      <p>We denote the Levenshtein distance (Levenshtein,
1965) between two words l and w by ρ(l, w). Since
a typical Slovak noun has up to 12 different word
forms (two numbers, six cases; the vocative is
rare), and the inflection is mostly realised by
changing the suffix and by root vowel alternation, we can
expect the distance between a lemma and its
word forms to be not only bounded from above, but
also to have a regular distribution (roughly speaking,
the less typical the suffix length, the less likely
such a word form is to appear).</p>
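      <p>The distance ρ(l, w) can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch, not the authors' implementation; the function name levenshtein is our own, and the example uses the forms of Galileo quoted in the footnote.</p>
      <preformat>
```python
def levenshtein(a: str, b: str) -> int:
    """Return the Levenshtein distance between strings a and b."""
    # previous[j] is the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # replace or keep
        previous = current
    return previous[-1]


for form in ("Galileo", "Galilea", "Galileovi"):
    print(form, levenshtein("Galileo", form))
# Galileo 0
# Galilea 1
# Galileovi 2
```
      </preformat>
      <p>Only one row of the distance matrix is kept at a time, which is sufficient when just the distance, and not the sequence of operations, is needed.</p>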
      <p>
        We used the morphological database of the Slovak
language
        <xref ref-type="bibr" rid="ref2">(Garabík and Šimková, 2012; Karčová,
2008; Garabík, 2007)</xref>
        , which contains (at the time
of writing) complete morphological information
for 35 009 nouns (lemmata), of which 1031 are
proper nouns. We randomly divided the database
into two parts, a training set and an evaluation
set, ensuring that about 90% of both the common and the
proper nouns are present in the training set. The
evaluation set contained 101 lemmata and 694 unique
word forms for proper nouns.
      </p>
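      <p>[Figure 1: Distribution of the Levenshtein distance ρ(lemma, word) between lemmata and their word forms, for common nouns (top panel) and proper nouns (bottom panel). Horizontal axis: ρ(lemma, word), from 0 to 7; vertical axis: normalised count (N/n), from 0 to 1.]</p>
      <p>[Table 1: Excerpt from the lexicographically ordered list of words, with the number given for each word: ... Toska 10, Toskala 33, Toskalu 28, Toskánske 11.]</p>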
      <p>Fig. 1 displays the distribution of the distance between lemma and word form for known
common (top) and proper (bottom) nouns, summed and
normalised over all the nouns in the training
set. Vertical error bars display the standard
deviation for the given distance of the word form from
the lemma. From the graphs we draw several
conclusions: proper nouns are “less inflected”,
a higher proportion of their word forms is identical to the basic form (lemma),
and the maximum distance is ρ = 7 for common
nouns (nouns with greater distance are those with
very irregular declension, e.g. človek → ľudia
“human/humans”) and ρ = 5 for proper nouns.
The distributions of common and proper nouns in the
evaluation set match those in the training set, so
there appears to be a systematic difference between
common and proper nouns. However,
categorising individual nouns using these differences between
the distributions is not reliable.</p>
    </sec>
    <sec id="sec-2">
      <title>Extracting Candidates</title>
      <p>Our algorithm extracts plausible candidates for
proper nouns (words beginning with a capital
letter but not at the beginning of a sentence, subject
to some additional filters) and, for each
candidate, considers the set of words with ρ ≤ 5. Done naively, this
would require calculating the Levenshtein distance
between all pairs of words in the set, with
complexity O(n<sup>2</sup>), which is unacceptable for
corpus-sized inputs. Unfortunately, although the Levenshtein
distance is a metric, it cannot be used to make an
ordered set out of a list of words (in particular, it
cannot be used to define an ordering binary
relation ≤).</p>
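      <p>The candidate extraction step can be sketched as follows. This is an illustration under stated assumptions: the text does not specify the additional filters, so the only filter here is a hypothetical one that keeps alphabetic tokens; the function name extract_candidates, the pre-tokenised input format and the two sample sentences are likewise our own.</p>
      <preformat>
```python
from collections import Counter


def extract_candidates(sentences):
    """Count capitalised tokens that are not sentence-initial.

    sentences is a list of sentences, each a list of token strings.
    """
    counts = Counter()
    for tokens in sentences:
        for token in tokens[1:]:  # skip the sentence-initial token
            # Hypothetical filter: alphabetic tokens with an initial capital.
            if token.isalpha() and token[0].isupper():
                counts[token] += 1
    return counts


# Invented sample sentences, for illustration only.
sentences = [
    ["Včera", "som", "videl", "Galilea", "."],
    ["Knihu", "dal", "Galileovi", "a", "Galilea", "pozdravil", "."],
]
print(extract_candidates(sentences))
```
      </preformat>
      <p>The resulting frequency list is the input to the interval search and lemma selection steps.</p>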
      <p>However, a trick can be applied: in a
lexicographically ordered list of words (see Table 1) we
need to look only at some interval around the word,
since word forms beyond the interval are very
unlikely to belong to the same lexeme. The
complexity is then O(Cn), where C is the (constant) size of
the interval. This means that for some nouns
not all word forms will be covered, especially for
the shorter ones, where there is a higher
probability that many unrelated words fall within the
interval. Empirically, we estimated a reasonable
interval width to be 2000 words; increasing it
above this number does not improve the accuracy
any further, and the speed is acceptable. It should be
noted that this interval is not the width of a concordance context. It is an interval in the
lexicographically ordered set of proper noun candidates
extracted from a given text, e.g. from a novel if we
want to extract the whole inflectional paradigms of
(new, unknown) proper nouns from the novel, or
indeed from the whole corpus if we aim to augment
a morphological database.</p>
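      <p>The interval lookup can be sketched with a binary search in the sorted candidate list. The sketch assumes plain code-point ordering and an interval centred on the word; the text specifies neither the collation nor whether the interval is symmetric, and the function name window is our own. The sample words are those of Table 1 plus two invented neighbours.</p>
      <preformat>
```python
from bisect import bisect_left


def window(sorted_words, word, width=2000):
    """Return the part of sorted_words lying within width // 2 entries of word."""
    centre = bisect_left(sorted_words, word)
    half = width // 2
    start = max(0, centre - half)
    return sorted_words[start:centre + half + 1]


# "Praha" and "Viedeň" are invented neighbours, for illustration only.
words = sorted(["Praha", "Toska", "Toskala", "Toskalu", "Toskánske", "Viedeň"])
print(window(words, "Toskala", width=2))
# ['Toska', 'Toskala', 'Toskalu']
```
      </preformat>
      <p>Each lookup costs a binary search plus a slice of constant size, so processing all n candidates stays linear in n up to the constant interval size C.</p>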
      <p>We formally describe a Levenshtein edit
operation as e = (o, i<sub>s</sub>, i<sub>d</sub>), a triple of the operation type o,
the position i<sub>s</sub> in the source string s and the position i<sub>d</sub>
in the destination string d, where the operation type o
is one of replace, insert or delete. For replace or
insert, the replacement/new character is taken from
the destination string d.</p>
      <p>A sequence of edit operations q = (e<sub>1</sub>, e<sub>2</sub>, e<sub>3</sub>, ...),
together with the destination string d, when applied
to a string s ∈ S, defines a mapping f<sub>q,d</sub> : S<sub>q,d</sub> → S,
where S<sub>q,d</sub> and S are sets of strings.<sup>2</sup></p>
      <p>If we denote by t a morphological tag for a given
word form w, then for a lexeme with a lemma l a
tuple (w, t) unambiguously refers to one inflected
word form and its grammatical categories. We can
then construct a sequence of edit operations leading
from l to w, denoted by q(l, t).</p>
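      <p>A sequence q(l, t) can be derived by backtracking through the Levenshtein distance matrix and then applied to another lemma. The sketch below is one possible reading, not the authors' code. In particular, aligning the positions at the end of the word, so that suffix operations transfer between lemmata of different lengths, is our assumption, as are the function names edit_ops and apply_ops and the example lemmata Picasso and Marco.</p>
      <preformat>
```python
def edit_ops(src, dst):
    """Return edit operations (o, i_s, i_d) that turn src into dst."""
    n, m = len(src), len(dst)
    # dist[i][j] is the distance between src[:i] and dst[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if src[i - 1] == dst[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1, dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrack from the bottom-right corner
        if i > 0 and j > 0 and src[i - 1] == dst[j - 1]:
            i, j = i - 1, j - 1  # characters match, no operation
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif j > 0 and dist[i][j] == dist[i][j - 1] + 1:
            ops.append(("insert", i, j - 1))
            j -= 1
        else:
            ops.append(("delete", i - 1, j))
            i -= 1
    ops.reverse()
    return ops


def apply_ops(ops, src, dst, s):
    """Apply ops (derived for src to dst) to s; None if they do not fit.

    Assumption: source positions are shifted so that the ends of src
    and s coincide.
    """
    offset = len(s) - len(src)
    chars = list(s)
    for op, i_s, i_d in reversed(ops):  # right to left keeps positions valid
        pos = i_s + offset
        if 0 > pos:
            return None  # the operation is not applicable to s
        if op == "insert":
            chars.insert(pos, dst[i_d])
        elif op == "replace":
            chars[pos] = dst[i_d]
        else:
            del chars[pos]
    return "".join(chars)


q = edit_ops("Galileo", "Galileovi")
print(q)                                                # [('insert', 7, 7), ('insert', 7, 8)]
print(apply_ops(q, "Galileo", "Galileovi", "Picasso"))  # Picassovi
```
      </preformat>
      <p>The None return value corresponds to source strings outside S<sub>q,d</sub>, i.e. strings to which the operations cannot be applied.</p>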
      <p>For each proper noun from the training set, we
precompute the functions f<sub>q(l,t),l</sub> (this can be
improved by dividing the nouns into categories based
on their declension rules and using only one noun
from each category), to get the sequence of
operations leading from the lemma to the tuple (w, t)
of the word form and morphological tag. Then, for
each extracted word, we apply the functions f<sub>q(l,t),l</sub>
to every word from the abovementioned interval,
and the word with the greatest coverage (the sum of the
frequencies of the generated word forms within the
interval) is declared the lemma of the extracted word.
Of course, this maximum can be attained by more
than one word, especially if the lexeme is
incomplete. We assume that at least the most common
inflectional paradigms (used for proper nouns) are
present in the training set.</p>
      <p><sup>2</sup> It is not possible to define the function f for every source
string, since some of the operations might not be applicable to
the given strings.</p>
      <p>We used the algorithm to extract proper nouns
from the Slovak National Corpus, version
prim-6.1-public-all<sup>3</sup>, of size 829 million tokens, and
evaluated the results on the proper nouns from the
evaluation set. The percentage of correctly
automatically assigned lemmata is shown in Table 2:
79.2% of the word forms were assigned
a unique lemma which was also the correct one,
while 18.9% were assigned a unique but
incorrect lemma<sup>4</sup>.</p>
      <p>[Figure 2: Histogram of precision and recall on word forms for proper nouns. Horizontal axis: precision, recall, from 1 to 0; vertical axis: frequency, from 0 to 70.]</p>
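      <p>The lemma selection by greatest coverage can be sketched as follows. To keep the example short, the precomputed functions are replaced by simple suffix rewrite rules (strip, add), which is a simplification of the edit-operation mappings. The rules are hypothetical, the numbers from Table 1 are assumed to be frequencies, and the function names are our own.</p>
      <preformat>
```python
def generate(lemma, rules):
    """Generate word forms of lemma with (strip, add) suffix rules."""
    forms = set()
    for strip, add in rules:
        if lemma.endswith(strip):
            forms.add(lemma[:len(lemma) - len(strip)] + add)
    return forms


def best_lemmata(freqs, rules):
    """Return the greatest coverage and the words in the interval attaining it."""
    coverage = {}
    for word in freqs:
        # Coverage: summed frequency of the generated forms found in the interval.
        coverage[word] = sum(freqs.get(f, 0) for f in generate(word, rules))
    top = max(coverage.values())
    return top, sorted(w for w, c in coverage.items() if c == top)


# Hypothetical rules standing in for one learned paradigm.
rules = [("", ""), ("a", "u"), ("a", "ovi")]
# Words of Table 1, with the numbers read as frequencies (an assumption).
freqs = {"Toska": 10, "Toskala": 33, "Toskalu": 28, "Toskánske": 11}
print(best_lemmata(freqs, rules))
# (61, ['Toskala'])
```
      </preformat>
      <p>The function returns a list because, as noted in the text, the maximum can be attained by more than one word.</p>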
      <p><sup>3</sup> http://korpus.juls.savba.sk/res.html</p>
      <p><sup>4</sup> For 13 word forms (2.5%) the correct lemma was not
present in the interval of 2000 words.</p>
      <p>Figure 2 displays the precision and recall on
word forms for proper nouns from the evaluation set (i.e. how much of
the lexeme has been extracted; the numbers are not
weighted by the frequency of word forms in the
corpus). We note that about
70 lexemes<sup>5</sup> have precision ≈ 1, about 40
lexemes have recall ≈ 1, and about 50 lexemes have
0.9 ≳ recall ≳ 0.6, while only a small number
of lexemes have lower precision. The lower recall
is caused by insufficient data coverage: not all
the word forms were present in the analysed
corpus. The precision we obtained is excellent and the
accuracy of automatic lemma assignment is good.</p>
      <p><sup>5</sup> Since the number of proper nouns in our evaluation set
was 101, these numbers are fortuitously almost identical to
percentages.</p>
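      <p>Per-lexeme precision and recall on word forms, unweighted by corpus frequency as described above, can be computed as in this sketch. The function name and the sample sets of forms are illustrative assumptions.</p>
      <preformat>
```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted word forms against gold forms."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted.intersection(gold))
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical lexeme: four known forms, three of them extracted.
gold = {"Galileo", "Galilea", "Galileovi", "Galileom"}
extracted = {"Galileo", "Galilea", "Galileovi"}
print(precision_recall(extracted, gold))
# (1.0, 0.75)
```
      </preformat>
      <p>In this example every extracted form is correct, so precision is 1, while recall drops because one form is not attested, which mirrors the effect of insufficient data coverage.</p>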
    </sec>
    <sec id="sec-3">
      <title>Augmenting Morphological Database</title>
      <p>The process described above was used to increase
the number of proper nouns in the Slovak
morphological database. We used the candidates extracted
from the prim-6.1-public-all corpus with at least 100 occurrences (counting all possible
word forms derived from a given lemma). We
calculate the coverage of word forms for one lemma
as r = C(w, t)/C(g), where C(w, t) is the
number of generated tuples of word forms and their
corresponding morphological tags, and C(g) is the
number of grammatical categories (usually 7 or 14: 7
cases including the vocative, and one or two
grammatical numbers, with many proper nouns present
only in the singular).</p>
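      <p>As a worked instance of r = C(w, t)/C(g), consider a lemma attested only in the singular with six of the seven cases. The tuples and tag labels below are invented for illustration and do not follow the actual Slovak tagset; the function name coverage is our own.</p>
      <preformat>
```python
def coverage(generated, n_categories):
    """Return r = C(w, t) / C(g) for one lemma."""
    return len(set(generated)) / n_categories


# Hypothetical (word form, tag) tuples with corpus evidence.
attested = [
    ("Galileo", "nom.sg"), ("Galilea", "gen.sg"), ("Galileovi", "dat.sg"),
    ("Galilea", "acc.sg"), ("Galileovi", "loc.sg"), ("Galileom", "ins.sg"),
]
r = coverage(attested, n_categories=7)  # 7 cases, singular only
print(round(r, 2))
# 0.86
```
      </preformat>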
      <p>After removing generated word forms with no
corpus evidence, the average coverage of word
forms per lemma is r = 0.84 ± 0.23, i.e. 84%
of word forms are present in the corpus, where 0.23 is the
standard deviation of the coverage. The generated word
forms still contain a lot of noise, so we also
removed those word forms whose contribution to
the number of occurrences of the given lemma was less
than 1% (it is rare for a grammatical case to have
such a low percentage compared to other cases).
After this, the coverage changed to r = 0.75 ± 0.24,
where again 0.24 is the standard deviation.
We then manually proofread, corrected
and filled in the word forms for the several
hundred most frequent lemmata. After adding these
words to the morphological database, we iterated
the process, re-training the algorithm and
generating another list of less frequent proper nouns.</p>
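      <p>The 1% noise filter can be sketched as below. The function name drop_rare_forms and the counts are hypothetical; the threshold follows the text.</p>
      <preformat>
```python
def drop_rare_forms(form_counts, threshold=0.01):
    """Keep word forms contributing at least threshold of the lemma's occurrences."""
    total = sum(form_counts.values())
    return {form: count for form, count in form_counts.items()
            if count >= threshold * total}


# Hypothetical occurrence counts of the generated forms of one lemma.
counts = {"Galileo": 500, "Galilea": 300, "Galileovi": 195, "Galilei": 5}
print(drop_rare_forms(counts))
# {'Galileo': 500, 'Galilea': 300, 'Galileovi': 195}
```
      </preformat>
      <p>Here the last form accounts for 0.5% of the 1000 occurrences and is removed, while the other forms are kept.</p>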
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>The method has been used to improve the
coverage of proper nouns in the Slovak morphological
database and is used as part of a morphological
guesser, providing candidate lemmata and
morphological tags for unknown proper nouns, within
the morphosyntactic analysis and part-of-speech
tagging of the Slovak National Corpus.<sup>6</sup></p>
      <p><sup>6</sup> http://korpus.juls.savba.sk</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Dvonč et al.1966]
          <string-name>
            <given-names>Ladislav</given-names>
            <surname>Dvonč</surname>
          </string-name>
          , Gejza Horák, František Miko, Jozef Mistrík, Ján Oravec, Jozef Ružička, and Milan Urbančok.
          <year>1966</year>
          .
          <article-title>Morfológia slovenského jazyka</article-title>
          .
          <source>Vydavateľstvo SAV</source>
          , Bratislava, Slovakia,
          <source>1st edition</source>
          . 895 p.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>[Garabík and Šimková2012] Radovan Garabík and Mária Šimková</source>
          .
          <year>2012</year>
          .
          <article-title>Slovak Morphosyntactic Tagset</article-title>
          .
          <source>Journal of Language Modelling</source>
          ,
          <volume>0</volume>
          (
          <issue>1</issue>
          ):
          <fpage>41</fpage>
          -
          <lpage>63</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Garabík2007]
          <string-name>
            <given-names>Radovan</given-names>
            <surname>Garabík</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Slovak morphology analyzer based on Levenshtein edit operations</article-title>
          . In M. Laclavík,
          <string-name>
            <surname>I. Budinská</surname>
          </string-name>
          , and L. Hluchý, editors,
          <source>Proceedings of the WIKT'06 conference</source>
          , pages
          <fpage>2</fpage>
          -
          <lpage>5</lpage>
          , Bratislava. Institute of Informatics SAS.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Hajič2004] Jan Hajič</source>
          .
          <year>2004</year>
          .
          <article-title>Disambiguation of Rich Inflection (Computational Morphology of Czech)</article-title>
          . Karolinum, Charles University Press, Prague, Czech Republic.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Karčová2008] Agáta Karčová</source>
          .
          <year>2008</year>
          .
          <article-title>Príprava a uskutočňovanie projektu morfologického analyzátora</article-title>
          .
          <source>In Anna Gálisová and Alexandra Chomová</source>
          , editors,
          <source>Varia. 15</source>
          .
          <article-title>Zborník materiálov z XV. kolokvia mladých jazykovedcov</article-title>
          , pages
          <fpage>286</fpage>
          -
          <lpage>292</lpage>
          , Bratislava. Slovenská jazykovedná spoločnosť
          <article-title>pri SAV - Katedra slovenského jazyka a literatúry FHV UMB v Banskej Bystrici</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>