J. Yaghob (Ed.): ITAT 2015 pp. 66–72
Charles University in Prague, Prague, 2015



                       Resource-Light Acquisition of Inflectional Paradigms

                                                   Radoslav Klíč1 and Jirka Hana2
                                      1   Geneea Analytics, Velkopřevorské nám. 1, 118 00 Praha 1
                                                     radoslav.klic@gmail.com
                                           2 MFF UK, Malostranské nám. 25, 118 00 Praha 1

                                                       jirka.hana@gmail.com

Abstract: This paper presents a resource-light acquisition of morphological paradigms and a lexicon for fusional languages. It builds upon Paramor [10], an unsupervised system, by extending it: (1) to accept a small seed of manually provided word inflections with marked morpheme boundaries; (2) to handle basic allomorphic changes, acquiring the rules from the seed and/or from previously acquired paradigms. The algorithm has been tested on Czech and Slovene tagged corpora and has shown an increased F-measure in comparison with the Paramor baseline.

1 Introduction

Morphological analysis is used in many computer applications ranging from web search to machine translation. As Hajič [6] shows, for languages with high inflection, a morphological analyzer is an essential part of a successful tagger.

Modern morphological analysers based on supervised machine learning and/or hand-written rules achieve very high accuracy. However, the standard way to create them for a particular language requires a substantial amount of time, money and linguistic expertise. For example, the Czech analyzer by [7] uses a manually created lexicon with 300,000+ entries. As a result, most of the world's languages and dialects have no realistic prospect of morphological analyzers created in this way.

Various techniques have been suggested to overcome this problem, including unsupervised methods acquiring morphological information from an unannotated corpus. While completely unsupervised systems are scientifically interesting, shedding light on areas such as child language acquisition or general learnability, for many practical applications their precision is still too low. They also completely ignore linguistic knowledge accumulated over several millennia, often failing to discover rules that can be found in basic grammar books.

Lightly-supervised systems aim to improve upon the accuracy of unsupervised systems by using a limited amount of resources. One such system for fusional languages is described in this paper.

Using a reference grammar, it is relatively easy to provide information about inflectional endings, possibly organized into paradigms. In some languages, an analyzer built on such information would have an acceptable accuracy (e.g., in English, most words ending in ed are past/passive verbs, and most words ending in est are superlative adjectives). However, in many languages, the number of homonymous endings is simply too high for such a system to be useful. For example, the ending a has about 19 different meanings in Czech [4].

Thus our goal is to discover inflectional paradigms, each with a list of words declining according to it; in other words, we discover a list of paradigms and a lexicon. But we do not attempt to assign morphological categories to any of the forms. For example, given an English corpus, the program should discover that talk, talks, talking, talked are forms of the same word, and that work, push, pull, miss, ... decline according to the same pattern. However, it will not label talked as a past tense, or even as a verb.

This kind of shallow morphological analysis has applications in information retrieval (IR), for example in search engines. For most queries, users are not interested only in the particular word forms they entered but also in their inflected forms. In highly inflectional languages, such as Czech, dealing with morphology in IR is a necessity. Moreover, it can also serve as a basis for a standard morphological analyzer after labeling endings with morphological tags and adding information about closed-class/irregular words.

As the basis of our system, we chose Paramor [10], an algorithm for unsupervised induction of inflectional paradigms and morphemic segmentation. We extended it to handle basic phonological/graphemic alternations and to accept seeding paradigm-lexicon information.

The rest of this paper is organized as follows: First, we discuss related work on unsupervised and semi-supervised learning. Then follows a section about the baseline Paramor model. After that, we motivate and describe our extensions to it. Finally, we report the results of experiments on Czech and Slovene.

2 Previous Work

Perhaps the best known unsupervised morphological analysers are Goldsmith's Linguistica [5] and the Morfessor [1, 2, 3] family of algorithms.

Goldsmith uses a minimum description length (MDL; [14]) approach to find the morphology model which allows the most compact corpus representation.
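The compactness intuition behind the MDL approach can be illustrated with a deliberately simplified, character-count sketch (this toy cost model and all names are ours, not Goldsmith's actual bit-level formulation): listing every word letter by letter costs more than storing a few stems plus a shared signature of suffixes.

```python
def dl_listing(words):
    """Naive model: spell out every word letter by letter."""
    return sum(len(w) for w in words)

def dl_signature(stems, suffixes):
    """Factored model: store each stem and each suffix once;
    the words are the stem x suffix combinations."""
    return sum(len(t) for t in stems) + sum(len(s) for s in suffixes)

# 12 word forms generated by 3 stems and the signature (0, -s, -ed, -ing):
words = [t + s for t in ("talk", "walk", "work") for s in ("", "s", "ed", "ing")]
print(dl_listing(words))                                        # 66 characters
print(dl_signature({"talk", "walk", "work"}, {"", "s", "ed", "ing"}))  # 18 characters
```

The factored model is cheaper exactly when many stems share the same suffix set, which is why minimizing description length favours paradigm-like structure.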


His Linguistica software returns a set of signatures, which roughly correspond to paradigms.

Unlike Linguistica, Morfessor splits words into morphemes in a hierarchical fashion. This makes it more suitable for agglutinative languages, such as Finnish or Turkish, with a large number of morphemes per word. A probabilistic model is used to tag each morph as a prefix, suffix or stem. Kohonen et al. [9] improve the results of Morfessor by providing a small set (1000+ for English, 100+ for Finnish) of correctly segmented words. While the precision slightly drops, the recall is significantly improved for both languages. Tepper and Xia [17] use handwritten rewrite rules to improve Morfessor's performance by recognising allomorphic variations.

The approaches of Yarowsky and Wicentowski [18] and Schone and Jurafsky [15] aim at combining different information sources (e.g., corpus frequencies, edit distance similarity, or context similarity) to obtain a better analysis, especially for irregular inflection.

A system requiring significantly more human supervision is presented by Oflazer et al. [13]. This system takes a manually entered paradigm specification as input and generates a finite-state analyser. The user is then presented with words in the corpus which are not accepted by the analyser but are close to an accepted form. The user may then adjust the specification, and the analyser is iteratively improved.

Feldman and Hana [8, 4] build a system which relies on a manually specified list of paradigms, basic phonology and closed-class words, and uses a raw corpus to automatically acquire a lexicon. For each form, all hypothetical lexical entries consistent with the information about the endings are created. Then competing entries are compared and only those supported by the highest number of forms are retained. Most of the remaining entries are still non-existent; however, in the majority of cases, they licence the same inflections as the correct entries, differing only in rare inflections.

3 Paramor

Our approach builds upon Paramor [10, 11, 12], another unsupervised approach to the discovery of inflectional paradigms.

Due to data sparsity, not all inflections of a word are found in a corpus. Therefore Paramor does not attempt to reconstruct full paradigms, but instead works with partial paradigms, called schemes. A scheme contains a set of c(andidate)-suffixes and a set of c(andidate)-stems inflecting according to this scheme. The corpus must contain the concatenation of every c-stem with every c-suffix in the same scheme. Thus, a scheme is uniquely defined by its c-suffix set. Several schemes might correspond to a single morphological paradigm, because different stems belonging to the paradigm occur in the corpus with different sets of inflections.

The algorithm to acquire schemes has several steps:

1. Initialization: It first considers all possible segmentations of forms into candidate stems and endings.

2. Bottom-up Search: It builds schemes by adding endings that share a large number of associated stems.

3. Scheme clustering: Similar schemes (as measured by cosine similarity) are merged.

4. Pruning: Schemes proposing frequent morpheme boundaries not consistent with boundaries proposed by a character entropy measure are discarded.

Paramor works with types, not tokens; thus it does not use any information about the frequency or context of forms. Below, we describe some of the steps in more detail.

3.1 Bottom-up Search

In this phase, Paramor performs a bottom-up search of the scheme lattice. It starts with schemes containing exactly one c-suffix. For each of them, Paramor ascends the lattice, adding one c-suffix at a time until a stopping criterion is met. The c-suffix selected for addition is the one with the highest c-stem ratio. (Adding a c-suffix to a scheme reduces the number of stems, and the suffix reducing it the least is selected; the c-stem ratio is the ratio between the number of stems in the candidate higher-level scheme and in the current scheme.) When the highest possible c-stem ratio falls under 0.25, the search stops. It is possible to reach the same scheme from multiple searches. For example, a search starting from the scheme (-s) can continue by adding (-ing) and end by adding (-ed), thus creating the scheme (-s, -ing, -ed). Another search starting from (-ed) can continue by adding (-s) and then (-ing), creating a redundant scheme. Such duplicates are discarded.

3.2 Scheme Clustering

The resulting schemes are then subjected to agglomerative bottom-up clustering to group together schemes which partially cover the same linguistic paradigm. For example, if the first phase generated the schemes (-s, -ing) and (-ing, -ed), the clustering phase should put them in the same scheme cluster. To determine the proximity of two scheme clusters, the sets of words generated by the clusters are compared by cosine similarity.1 A scheme cluster generates the set of words which is the union of the sets generated by the schemes it contains (not a Cartesian product of all stems and suffixes throughout the schemes). In order to be merged, clusters must satisfy some conditions, e.g. for any two suffixes in the cluster, there must be a stem in the cluster which can combine with both of them.

1 proximity(X,Y) = |X ∩ Y| / √(|X| · |Y|)
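Our reading of the search step and of the proximity measure can be sketched as follows (a toy illustration, not Paramor's implementation; the corpus, the function names and the brute-force stem enumeration are ours, and the null suffix is written as the empty string):

```python
from math import sqrt

def stems_of(suffixes, corpus):
    """C-stems that combine with every c-suffix in the set: a scheme
    requires all stem+suffix concatenations to occur in the corpus."""
    cands = {w[:i] for w in corpus for i in range(1, len(w) + 1)}
    return {t for t in cands if all(t + s in corpus for s in suffixes)}

def grow(scheme, all_suffixes, corpus, threshold=0.25):
    """One step of the bottom-up search: add the c-suffix with the highest
    c-stem ratio, unless that ratio falls below the stopping threshold."""
    base = stems_of(scheme, corpus)
    if not base:
        return scheme
    best, best_ratio = None, 0.0
    for s in all_suffixes - scheme:
        ratio = len(stems_of(scheme | {s}, corpus)) / len(base)
        if ratio > best_ratio:
            best, best_ratio = s, ratio
    return scheme | {best} if best_ratio >= threshold else scheme

def proximity(x, y):
    """Cosine similarity of two generated word sets: |X∩Y| / sqrt(|X||Y|)."""
    return len(x & y) / sqrt(len(x) * len(y)) if x and y else 0.0

corpus = {"talk", "talks", "talking", "work", "works", "worked"}
# Starting from the scheme (-s), the null suffix has c-stem ratio 1.0
# (both 'talk' and 'work' occur bare), so it is the suffix added:
print(sorted(grow({"s"}, {"", "s", "ing", "ed"}, corpus)))  # ['', 's']
print(proximity({"talks", "talking"}, {"talking", "talked"}))  # 0.5
```

Note that adding -ing or -ed instead would drop the c-stem ratio to 0.5, because only one of the two stems combines with each of them in this corpus.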


                                                                4.2 Scheme Seeding
           Corpus
                                                                The manual seed contains a simple list of inflected words
                                                                with marked morpheme boundary. A simple example in
 Paramor
                                                                English would be:
      Stem-suffix map            Induction of
         deep stems            stem-change rules                     talk+0, talk+s, talk+ed, talk+ing
                                                                     stop+0, stop+s, stopp+ed, stopp+ing
                                                                     chat+0, chat+s, chatt+ed, chatt+ing
      Bottom-up search
        more starting                                           This can be written in an abbreviated form as:
           schemes
                             Manual
                              seed                                   talk, stop/stopp, chat/chatt + 0, s / ed, ing
      Scheme clustering
      keep some clusters
                                                                The data are used to enhance Paramor’s accuracy in dis-
       from discarding                                          covering the correct schemes and scheme clusters in the
                                                                following way:

                                                                 1. In the bottom-up search, Paramor starts with single-
        Word clusters                                               suffix schemes. We added a 2-suffix scheme to the
                                                                    starting scheme set for every suffix pair from the
Figure 1: Altered Paramor’s pipeline (our alterations are           manual data belonging to the same inflection. Note
in dashed boxes and outside the Paramor box).                       that we cannot simply add a scheme containing all
                                                                    the suffixes of the whole paradigm as many of the
                                                                    forms will not be present in the corpus.
3.3    Pruning
                                                                 2. Scheme clusters containing suffixes similar to some
After the clustering phase, there are still too many clusters       of the manually entered suffix sets are protected from
remaining and pruning is necessary. In the first pruning            the second phase of the cluster pruning. More pre-
step, clusters which generate only small number of words            cisely, a cluster is protected if at least half of its
are discarded. Then clusters modelling morpheme bound-              schemes share at least two suffixes with a particular
aries inconsistent with letter entropy are dropped.                 manual suffix set.


                                                                4.3 Allomorphy
4     Our Approach
                                                                Many morphemes have several contextually dependent
                                                                realizations, so-called allomorphs due to phonological/-
4.1    Overview
                                                                graphemic changes or irregularities. For example, con-
                                                                sider the declension of the Czech word matka ‘mother’ in
We have modified the individual steps in Paramor’s
                                                                Table 1. It exhibits stem-final consonant change (palatali-
pipeline in order to use (1) a manually provided seed of
                                                                sation of k to c) triggered by the dative and local singular
inflected words divided into stems and suffixes; and (2) to
                                                                ending, and epenthesis (insertion of -e-) in the bare stem
take into account basic allomorphy of stems. Figure 1
                                                                genitive plural.
shows phases of Paramor on the left with dashed boxes
representing our alterations.
                                                                              Case     Singular     Plural
   In the bottom-up search phase and the scheme clus-                         nom      matk+a       matk+y
ter filtering phase, we use manually provided examples                        gen      matk+y       matek+0
of valid suffixes and their grouping to sub-paradigms to                      dat      matc+e       matk+ám
steer Paramor towards creating more adequate schemes                          acc      matk+u       matk+y
and scheme clusters. The data may also contain allomor-                       voc      matk+o       matk+y
phic stems, which we use to induce simple stem rewrite                        loc      matc+e       matk+ách
rules. Using these rules, some of the allomorphic stems in                    inst     matk+ou      matk+ami
the corpus can be discovered and used to find more com-
plete schemes.
   Note that the Paramor algorithm is based on several          Table 1: Declension of the word matka “mother”. Chang-
heuristics with many parameters whose values were set ex-       ing part of the stem is in bold.
perimentally. We used the same settings. Moreover, when
we applied similar heuristics in our modifications, we used       Paramor ignores allomorphy completely (and so do Lin-
analogical parameter values.                                    guistica and Morfessor). There are at least two reasons


to handle allomorphy. First, linguistically, it makes more sense to analyze winning as win+ing than as winn+ing or win+ning. For many applications, such as information retrieval, it is helpful to know that two morphs are variants of the same morpheme. Second, ignoring allomorphy makes the data appear more complicated and noisier than they actually are. Thus, the process of learning morpheme boundaries or paradigms is harder and less successful.

This latter problem might manifest itself in Paramor's bottom-up search phase: a linguistically correct suffix triggering a stem change might be discarded, because Paramor does not consider stem allomorphs to be variants of the same stem, and the c-stem ratio may drop significantly. Furthermore, incorrect c-suffixes may be selected.

For example, suppose there are 5 English verbs in the corpus: talk, hop, stop, knit, chat, together with their -s forms (talks, hops, stops, knits, chats) and -ing forms (talking, hopping, stopping, knitting, chatting). Let us assume we already have a scheme {0, s} with 5 stems. Unfortunately, a simple ing suffix (without stem-final consonant doubling) combines with only one of the 5 stems; therefore adding ing to the scheme would decrease the number of its stems to 1, leaving only talk in the scheme.

However, for most languages the full specification of rules constraining allomorphy is not available, or at least is not precise enough. Therefore, we automatically induce a limited number of simple rules from the seed examples and/or from the scheme clusters obtained from the previous run of the algorithm. Such rules both over- and undergenerate, but nevertheless they do improve the accuracy of the whole system. For languages where formally specified allomorphic rules are available, they can be used directly along the lines of Tepper and Xia [17, 16]. For now, we consider only stem-final changes, namely vowel epenthesis (e.g., matk-a – matek-0) and alternation of the final consonant (e.g., matk-a – matc-e). The extension to other processes such as root vowel change (e.g., English foot – feet) is quite straightforward, but we leave it for future work.

Stem change rule induction and application. Formally, the process can be described as follows. From every pair of stem allomorphs in the manual input, sδ1, sδ2, where s is their longest common initial substring,2 with suffix sets f1, f2, we generate a rule *δ1 → *δ2 / (f1, f2) and also a reverse rule *δ2 → *δ1 / (f2, f1). The notation *δ1 → *δ2 / (f1, f2) means "transform a stem xδ1 into xδ2 if the following conditions hold:"

1. xδ2 is a c-stem present in the corpus.

2. The c-suffix set f1x (from the corpus) of the c-stem xδ1 contains at least one of the suffixes from f1 and contains no suffix from f2.

3. The c-suffix set f2x of the c-stem xδ2 contains at least one of the suffixes from f2 and contains no suffix from f1.

Induced rules are applied after the initialisation phase. So-called deep stems are generated from the c-stems. A deep stem is defined as a set of surface stems.

To obtain the deep stem for a c-stem t, the operation of expansion is applied. Expansion works as a breadth-first search using a queue initialised with t and keeping track of the set D of already generated variants. While the queue is not empty, the first member is removed and its variants are found by application of all the rules. (The result of applying a rule is non-empty only if the rule is applicable and its right-hand side is present in the corpus.) Variants which have not been generated so far are added to the back of the queue and to D. When the queue is emptied, D becomes the deep stem associated with t and with all other members of D.

The bottom-up search and all the following phases of the Paramor algorithm then use the deep stems instead of the surface ones.

Stem change rule induction from scheme clusters. In addition to deriving allomorphic rules from the manual seed, we also use a heuristic for detecting stem allomorphy in the scheme clusters obtained from the previous run of the algorithm. Stem allomorphy increases the sparsity problem and might prevent Paramor from finding some paradigms. However, if the stem changes are systematic and frequent, Paramor does create the appropriate scheme clusters; it just considers the changing part of the stem to be a part of the suffix.

As an example, consider again the declension of the Czech word matka 'mother' in Table 1. Paramor's scheme cluster with the suffixes ce, ek, ka, kami, kou, ku, ky, kách, kám has correctly discovered 9 of the paradigm's 10 suffixes,3 but fused them together with parts of the stem. The presence of such a scheme cluster in the result is a hint that there may be a c/k alternation and epenthesis in the language.

The first phase of the algorithm for deciding whether a scheme cluster with a c-suffix set f is interesting in this respect is the following:

1. If f contains a c-suffix without a consonant, return false.

2. Let cc be the count of unique initial consonants found in the c-suffixes in f.

3. If cc > 2, return false. (The morpheme boundary was probably incorrectly shifted to the left.)

4. If cc = 1 and f does not contain any c-suffix starting with a vowel, return false. (No final consonant change, no epenthesis.)

5. Return true.

2 Should δ1 or δ2 be 0, one final character is removed from s and prepended to δ1 and δ2.
3 Except for the vocative singular, which is rarely used.

If a scheme cluster passes this test, each of its stems' subparadigms is examined. The subparadigm for a stem s consists of s and fs – all the c-suffixes from f with which s forms


a word in the corpus. For example, let’s have a stem s =              We use the following terminology in this section:
mat with fs = {ce, ek, ka, ku, ky}. Now, the morpheme              a word group is a set of words returned by our system,
boundary is shifted so that it is immediately to the right         a word paradigm is a set of words from the corpus sharing
from the first consonant of the original c-suffixes. In our        the same lemma. Both word groups and word paradigms
example, we get 3 stem variants: matk + a, u, y, matc              are divisions of corpus into disjoint sets of words. An au-
+ e, matek + 0. To reduce falsely detected phonological            toseed is a seed generated by the heuristic described in
changes, we check each stem variant’s suffix set whether           Section 4.3.
it contains at least one of the c-suffixes that Paramor has           Since Paramor only produces schemes and scheme clus-
already discovered in other scheme clusters. If the condi-         ters, we need an additional step to obtain word groups.
tion holds, rules the with same syntax as the manual data          We generated the word groups by bottom-up clustering
are created. For example, matk / matc / matek + a, u, y            of words using the paradigm distance which is designed
/ e / 0. All generated rules are gathered in a file and can        to group together words generated by similar sets of
be used in the same way as the manual seed or just for the         scheme clusters. To compute paradigm distance for two
induction of phonological rules.                                   words w1 , w2 , we find the set of all scheme clusters which
                                                                   generate w1 and compute cosine similarity to the analogi-
                                                                   cal set for w2 4 . In the simplest case, two forms of a lemma
5     Experiments and Results                                      will be generated just by one scheme cluster and there-
                                                                   fore get distance 1. For a more complicated example, let’s
We tested our approach on Czech and Slovene lemma-                 take two Czech words: otrávení “poisoned masc. anim.
tised corpora. For Czech, we used two differently sized            nom. pl.” and otrávený “poisoned masc. anim. nom.
subsets of the PDT 1 corpus. The first, marked as cz1,             sg.”. The first one was generated by scheme clusters 33
contains 11k types belonging to 6k lemmas. The sec-                and 41, both with otráv as a stem. The second word was
ond, cz2, has 27k types and 13k lemmas and is a su-                generated by scheme cluster 41 with otráv as a stem and
perset of cz1. The purpose of having two Czech cor-                by scheme cluster 45 with otráven as stem. That means
pora was to observe the effect of data size on performance of the algorithm. The Slovene corpus si is a subset of the jos100k corpus V2.0 (http://nl.ijs.si/jos/jos100k-en.html) with 27k types and 15.5k lemmas.
   The manual seed consisted of inflections of 18 lemmas for Czech and of 9 lemmas for Slovene. In both cases, examples of nouns, adjectives and verbs were provided; they were obtained from a basic grammar overview. For Czech, we also added information about the only two inflectional prefixes (the negative prefix ne and the superlative prefix nej). The decision which prefixes to consider inflectional and which not is to a certain degree arbitrary (e.g., it can be argued that ne is a clitic rather than a prefix), so it makes sense to provide such information manually. (Prefixes were implemented by a special form of the stem transformation rules introduced in Section 4.3, which create deep stems consisting of a stem with and without the given prefix.)

5.1   Evaluation Method

We evaluated the experiments only on types at least 6 characters long, which Paramor uses for learning. That means 8.5k types and 4.5k lemmas for cz1, 21k types and 10k lemmas for cz2, and 21k types and 12k lemmas for si.
   Since the corpora we used do not have morpheme boundaries marked, we could not use the same evaluation method as the authors of Paramor and Morfessor – measuring the precision and recall of placing morpheme boundaries. On the other hand, the corpora are lemmatised, so we can evaluate whether the types grouped into paradigms by the algorithm correspond to sets of types belonging to the same lemma.
   … that only scheme cluster 41 generates both words and their paradigm distance is 1/√(2×2) = 0.5.
   (4) We also have to check whether w1 and w2 have the same stem, so, in fact, we are comparing sets of pairs ⟨scheme cluster, c-stem⟩, to make sure only words sharing c-stems are grouped together.
   Precision and recall of the word groups can be computed in the following way: To compute precision, start with p = 0; for each word group, find the word paradigm with the largest intersection and add the intersection size to p; Precision = p / total number of words. To compute recall, start with r = 0; for each word paradigm, find the word group with the largest intersection and add the intersection size to r; Recall = r / total number of words. F1 is the standard balanced F-score.

5.2   Results

Results of the experiments are presented in Tables 2–4. We used the following experiment settings:

   1. no seed – the baseline, Paramor was run without any seeding

   2. man. seed – manual seed was used

   3. autoseed – autoseed was used for induction of the stem change rules

   4. both seeds – Paramor run with the manual seed, stem change rules were induced from the manual seed and the autoseed

   5. seed + pref. – manual seed was used together with additional rules for the two Czech inflectional prefixes, otherwise same as 2.
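The group-vs-lemma scoring described above can be sketched in a few lines. This is an illustrative reimplementation, not the authors' code; the function name evaluate_groups and the set-based representation of groups are our assumptions:

```python
def evaluate_groups(word_groups, lemma_groups):
    """Score predicted word groups against gold lemma groups.

    word_groups:  sets of types grouped together by the algorithm
    lemma_groups: sets of types sharing a lemma (from the lemmatised corpus)
    Both are assumed to cover the same set of word types.
    """
    total = sum(len(g) for g in lemma_groups)  # total number of words

    # Precision: each predicted group is credited with its largest
    # intersection with any gold lemma group.
    p = sum(max(len(g & l) for l in lemma_groups) for g in word_groups)
    # Recall: symmetrically, each gold group is credited with its
    # largest intersection with any predicted group.
    r = sum(max(len(l & g) for g in word_groups) for l in lemma_groups)

    precision, recall = p / total, r / total
    f1 = 2 * precision * recall / (precision + recall)  # balanced F-score
    return precision, recall, f1
```

For instance, wrongly grouping dogged with dog and dogs costs precision (the oversized group only partially matches its best lemma set) but not recall.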
   6. both seeds + pref. – manual seed was used together with additional rules for the two Czech inflectional prefixes, otherwise same as 4.

      Experiment              Precision   Recall      F1
      no seed                     97.87    84.61   90.76
      man. seed                   97.96    87.52   92.44
      autoseed                    98.19    84.58   90.88
      both seeds                  97.96    87.52   92.44
      seed + pref.                97.84    89.40   93.43
      both seeds + pref.          97.84    89.40   93.43

             Table 2: Results for the cz1 corpus.

      Experiment              Precision   Recall      F1
      no seed                     97.36    87.02   91.90
      man. seed                   97.04    89.30   93.01
      autoseed                    97.30    87.72   92.26
      both seeds                  96.78    89.30   92.89
      seed + pref.                96.68    92.35   94.46
      both seeds + pref.          96.31    92.49   94.36

             Table 3: Results for the cz2 corpus.

      Experiment              Precision   Recall      F1
      no seed                     95.70    93.00   94.33
      man. seed                   95.62    94.44   95.02
      autoseed                    95.69    93.13   94.40
      both seeds                  95.56    94.76   95.16

             Table 4: Results for the si corpus.

   As can be seen from the results, the extra manual information indeed helps the accuracy of clustering words belonging to the same paradigm. What the numbers do not show is that more of the morpheme boundaries make linguistic sense, because basic stem allomorphy is accounted for.


6   Conclusion

We have shown that providing a small amount of easily obtainable information can improve the results of a purely unsupervised system. In the near future, we plan to model a wider range of allomorphic alternations, try larger (but still easy to obtain) seeds, and test the results on more languages.


References

 [1] Creutz, M., Lagus, K.: Unsupervised discovery of morphemes. In: Proceedings of the ACL-02 Workshop on Morphological and Phonological Learning, Vol. 6, MPL ’02, 21–30, Stroudsburg, PA, USA, 2002, Association for Computational Linguistics
 [2] Creutz, M., Lagus, K.: Inducing the morphological lexicon of a natural language from unannotated text. In: Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR’05), 106–113, Espoo, Finland, 2005
 [3] Creutz, M., Lagus, K.: Unsupervised models for morpheme segmentation and morphology learning. ACM Trans. Speech Lang. Process. 4(3) (February 2007), 1–34
 [4] Feldman, A., Hana, J.: A resource-light approach to morpho-syntactic tagging. Rodopi, Amsterdam/New York, NY, 2010
 [5] Goldsmith, J. A.: Unsupervised learning of the morphology of a natural language. Computational Linguistics 27(2) (2001), 153–198
 [6] Hajič, J.: Morphological tagging: data vs. dictionaries. In: Proceedings of the ANLP-NAACL Conference, 94–101, Seattle, Washington, USA, 2000
 [7] Hajič, J.: Disambiguation of rich inflection: computational morphology of Czech. Karolinum, Charles University Press, Praha, 2004
 [8] Hana, J., Feldman, A., Brew, C.: A resource-light approach to Russian morphology: tagging Russian using Czech resources. In: Lin, D., Wu, D. (eds.), Proceedings of EMNLP 2004, 222–229, Barcelona, Spain, July 2004, Association for Computational Linguistics
 [9] Kohonen, O., Virpioja, S., Lagus, K.: Semi-supervised learning of concatenative morphology. In: Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, SIGMORPHON ’10, 78–86, Stroudsburg, PA, USA, 2010, Association for Computational Linguistics
[10] Monson, C.: ParaMor: from paradigm structure to natural language morphology induction. PhD thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2009
[11] Monson, C., Carbonell, J., Lavie, A., Levin, L.: ParaMor: minimally supervised induction of paradigm structure and morphological analysis. In: Proceedings of the Ninth Meeting of the ACL Special Interest Group in Computational Morphology and Phonology, 117–125, Prague, Czech Republic, June 2007, Association for Computational Linguistics
[12] Monson, C., Carbonell, J. G., Lavie, A., Levin, L. S.: ParaMor: finding paradigms across morphology. In: Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19–21, 2007, Revised Selected Papers, 900–907, 2007
[13] Oflazer, K., Nirenburg, S., McShane, M.: Bootstrapping morphological analyzers by combining human elicitation and machine learning. Computational Linguistics 27(1) (2001), 59–85
[14] Rissanen, J.: Stochastic complexity in statistical inquiry. World Scientific Publishing Co, Singapore, 1989
[15] Schone, P., Jurafsky, D.: Knowledge-free induction of inflectional morphologies. In: Proceedings of the North American Chapter of the Association for Computational Linguistics, 183–191, 2001
[16] Tepper, M., Xia, F.: A hybrid approach to the induction of underlying morphology. In: Proceedings of the Third International Joint Conference on Natural Language Processing (IJCNLP-2008), Hyderabad, India, January 7–12, 17–24, 2008
[17] Tepper, M., Xia, F.: Inducing morphemes using light knowledge. ACM Trans. Asian Lang. Inf. Process. 9(3) (March 2010), 1–38
[18] Yarowsky, D., Wicentowski, R.: Minimally supervised morphological analysis by multimodal alignment. In: Proceedings of the 38th Meeting of the Association for Computational Linguistics, 207–216, 2000