<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unsupervised Morpheme Analysis Evaluation by a Comparison to a Linguistic Gold Standard - Morpho Challenge 2007</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikko Kurimo</string-name>
          <email>Mikko.Kurimo@tkk.fi</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Creutz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matti Varjokallio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Adaptive Informatics Research Centre, Helsinki University of Technology</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2007</year>
      </pub-date>
      <abstract>
        <p>This paper presents the evaluation of Morpho Challenge Competition 1 (linguistic gold standard). Competition 2 (information retrieval) is described in a companion paper. In Morpho Challenge 2007, the objective was to design statistical machine learning algorithms that discover which morphemes (smallest individually meaningful units of language) words consist of. Ideally, these are basic vocabulary units suitable for different tasks, such as text understanding, machine translation, information retrieval, and statistical language modeling. The choice of a meaningful evaluation for the submitted morpheme analyses was not straightforward, because in unsupervised morpheme analysis the morphemes can have arbitrary names. Two complementary evaluations were developed: Competition 1: The proposed morpheme analyses were compared to a linguistic morpheme analysis gold standard by matching the morpheme-sharing word pairs. Competition 2: Information retrieval (IR) experiments were performed, where the words in the documents and queries were replaced by their proposed morpheme representations and the search was based on morphemes instead of words. Data sets for Competition 1 were provided for four languages: Finnish, German, English, and Turkish, and the participants were encouraged to apply their algorithms to all of them. The results show significant variation between the methods and languages, but the best methods seem to be useful in all tested languages and match quite well with the linguistic gold standard. The Morpho Challenge was part of the EU Network of Excellence PASCAL Challenge Program and was organized in collaboration with CLEF.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The scientific objectives of the Morpho Challenge 2007 were: to learn of the phenomena underlying
word construction in natural languages, to advance machine learning methodology, and to discover
approaches suitable for a wide range of languages. The suitability for a wide range of languages is
becoming increasingly important, because language technology methods need to be extended quickly
and as automatically as possible to new languages that have limited prior resources. That is
why learning the morpheme analysis directly from large text corpora using unsupervised machine
learning algorithms is such an attractive approach and a very relevant research topic today.</p>
      <p>
        Morpho Challenge 2007 is a follow-up to our previous Morpho Challenge 2005 (Unsupervised
Segmentation of Words into Morphemes) [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]. In Morpho Challenge 2005 the focus was on the
segmentation of data into units that are useful for statistical modeling. The specific task for the
competition was to design an unsupervised statistical machine learning algorithm that segments
words into the smallest meaning-bearing units of language, morphemes. In addition to comparing
the obtained morphemes to a linguistic ”gold standard”, their usefulness was evaluated by using
them for training statistical language models for speech recognition.
      </p>
      <p>In Morpho Challenge 2007 a more general focus was chosen: not only to segment words into
smaller units, but also to perform morpheme analysis of the word forms in the data. For instance,
the English words ”boot, boots, foot, feet” might obtain the analyses ”boot, boot + plural, foot,
foot + plural”, respectively. In linguistics, the concept of morpheme does not necessarily directly
correspond to a particular word segment but to an abstract class. For some languages there exist
carefully constructed linguistic tools for this kind of analysis, although not for many; using
statistical machine learning methods we may still discover interesting alternatives that may rival
even the most carefully designed linguistic morphologies.</p>
      <p>
        The problem of learning the morphemes directly from large text corpora using an
unsupervised machine learning algorithm is clearly a difficult one. First, the words should somehow be
segmented into meaningful parts, and then these parts should be clustered into the abstract classes
of morphemes that would be useful for modeling. It is also challenging to learn to generalize
the analysis to rare words, because even the largest text corpora are very sparse: a significant
portion of the words may occur only once. Many important words, for example proper names
and their inflections or some forms of long compound words, may also not exist in the training
material at all, and their analysis is often even more challenging. However, successful
morpheme analysis, in addition to providing a set of basic vocabulary units for modeling, can benefit
many important tasks in language technology. The additional information included in
the units can provide support for building more sophisticated language models, for example, in
speech recognition [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], machine translation [8], and information retrieval [10].
      </p>
      <p>
        The problem of how to arrange a meaningful evaluation of the unsupervised morpheme analysis
algorithms is not straightforward, because in unsupervised morpheme analysis the morphemes can
be given arbitrary names, which are not likely to directly correspond to the linguistic morpheme
definitions. In this challenge we solved this by developing two complementary evaluations, one
including a comparison to linguistic morpheme analysis gold standard, and another including a
practical real-world application where morpheme analysis might be used. In the first evaluation,
called Competition 1, the proposed morpheme analyses were compared to a linguistic gold standard
by counting the matching morpheme-sharing word pairs. In this way we did not
have to try to match the names of the morphemes directly, but only to measure whether the proposed
algorithm can find the correct word pairs that share common morphemes. The second evaluation,
called Competition 2 involved performing information retrieval (IR) experiments using the data
of the state-of-the-art CLEF evaluation, where the words in the documents and queries were replaced
by their proposed morpheme representations and the search was based on morphemes instead of
words. This paper presents Competition 1; Competition 2 is described in a companion
paper [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>Task</title>
      <p>The Morpho Challenge 2007 task was to return an unsupervised morpheme analysis of every
word form contained in a long word list supplied by the organizers for each test language. The
participants were pointed to corpora in which the words occur, so that the algorithms may utilize
information about word context.</p>
      <p>In Morpho Challenge 2005 the morphological segmentation evaluations were performed for
three languages: Finnish, English, and Turkish. Now a data set and evaluation were also provided for
one new language, German. To achieve the goal of designing language-independent methods,
the participants were encouraged to submit results in all these languages. In keeping with the theme
of unsupervised machine learning, the participants were required to describe any supervision or
parameter optimization steps that were taken in the algorithms. The participants did not need to
worry about which names to use for the morphemes they discovered, because the evaluation was
performed solely by the F-measure of the matching accuracy of morpheme-sharing word pairs.</p>
    </sec>
    <sec id="sec-3">
      <title>Data sets</title>
      <p>The first and foremost type of data file was the word list. The words had been extracted from
a text corpus, and each word in the list was preceded by its frequency in that corpus. For
instance, a subset of the supplied English word list looked like this:
1 barefoot’s
2 barefooted
6699 feet
653 flies
2939 flying
1782 foot
64 footprints</p>
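      <p>As an illustration, such a frequency-prefixed word list can be read into a mapping with a few lines of code (a sketch; the function name is ours, not part of the challenge tools):</p>

```python
def parse_word_list(lines):
    # Each line has the form "<frequency> <word>";
    # returns a dictionary mapping each word to its corpus frequency.
    freqs = {}
    for line in lines:
        count, word = line.split(None, 1)
        freqs[word.strip()] = int(count)
    return freqs

sample = ["1 barefoot's", "6699 feet", "1782 foot"]
counts = parse_word_list(sample)
```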
      <p>The result files that the participants were asked to return were lists containing exactly the same
words as in the input, with a morpheme analysis provided for each word. A submission for the above
English words might have looked like this:
barefoot's BARE FOOT +GEN
barefooted BARE FOOT +PAST
feet FOOT +PL
flies FLY_N +PL, FLY_V +3SG
flying FLY_V +PCP1
foot FOOT
footprints FOOT PRINT +PL</p>
      <p>The order in which the morpheme labels appeared after the word forms does not matter; e.g.,
”FOOT +PL” is equivalent to ”+PL FOOT”. As the learning is unsupervised, the labels are
arbitrary: e.g., instead of using ”FOOT” one might use ”morpheme784” and instead of ”+PL”
one might use ”morpheme2”. However, intuitive labels are preferable, because it becomes easier
for anyone to get an idea of the quality of the result by looking at it.</p>
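      <p>Because only the grouping of words by shared morphemes matters, any consistent relabeling of the morphemes yields exactly the same morpheme-sharing word pairs. A small sketch of this invariance (the helper name is ours, not part of the evaluation script):</p>

```python
from collections import defaultdict

def morpheme_sharing_pairs(analyses):
    # Index each morpheme label by the set of words containing it; the
    # evaluation only sees which words are grouped together, not the
    # labels themselves, so labels can be arbitrary.
    index = defaultdict(set)
    for word, labels in analyses.items():
        for label in labels:
            index[label].add(word)
    return {frozenset((a, b)) for words in index.values()
            for a in words for b in words if a != b}

linguistic = {"foot": {"FOOT"}, "feet": {"FOOT", "+PL"}}
arbitrary = {"foot": {"morpheme784"}, "feet": {"morpheme784", "morpheme2"}}
```

Relabeling "FOOT" as "morpheme784" and "+PL" as "morpheme2" leaves the set of morpheme-sharing word pairs unchanged.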
      <p>If a word has several interpretations, all interpretations can be supplied: e.g., the word ”flies”
may be the plural form of the noun ”fly” (insect) or the third person singular present tense form
of the verb ”to fly”. Thus the analysis could be given as: ”FLY_N +PL, FLY_V +3SG”. The existence
of alternative analyses makes the task challenging, and it was left to the participants to decide
how much effort they put into this aspect of the task. In English, for instance, in order to get a
perfect score, it would be necessary to distinguish the different functions of the ending ”-s” (plural
or person ending) as well as the different parts-of-speech of the stem ”fly” (noun or verb). As
the results will be evaluated against reference analyses (our so-called gold standard), the guiding
principles used when constructing the gold standard will be explained in Section 4.</p>
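      <p>A result line with alternative analyses can be parsed into label sets as follows (a sketch; the function name is ours, and sets are used because label order carries no meaning):</p>

```python
def parse_result_line(line):
    # "word analysis1, analysis2, ..." -> (word, list of label sets).
    # Alternative analyses are separated by commas; labels within an
    # analysis are separated by whitespace.
    word, _, rest = line.partition(" ")
    analyses = [frozenset(alt.split()) for alt in rest.split(",")]
    return word, analyses

word, analyses = parse_result_line("flies FLY_N +PL, FLY_V +3SG")
```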
      <p>The text corpora from which the word lists were collected were obtained from the Wortschatz
collection1 at the University of Leipzig (Germany). We used the plain text files (sentences.txt for each
language); the corpus sizes are 3 million sentences for English, Finnish, and German, and 1 million
sentences for Turkish. For English, Finnish, and Turkish we used preliminary corpora, which had
not yet been released publicly at the Wortschatz site. The corpora were specially preprocessed for
the Morpho Challenge (tokenized, lower-cased, some conversion of character encodings).</p>
    </sec>
    <sec id="sec-4">
      <title>Gold standard morpheme analyses</title>
      <p>The gold standard morpheme analyses are the correct grammatical morpheme analyses that were
used as the reference in the evaluation. They were prepared in
exactly the same format as that of the result file the participants were asked to submit. Because
there are multiple correct analyses for some words, the alternative analyses are separated by
commas. See Table 1 for examples.</p>
      <p>
        The English and German gold standards are based on the CELEX database2. The Finnish
gold standard is based on the two-level morphology analyzer FINTWOL from Lingsoft3, Inc.
The Turkish gold-standard analyses have been obtained from a morphological parser developed at
Bogazici University4 [
        <xref ref-type="bibr" rid="ref2">2, 5</xref>
        ]; it is based on Oflazer’s finite-state machines, with a number of changes.
      </p>
      <p>The morphological analyses are morpheme analyses. This means that only grammatical
categories that are realized as morphemes are included. For instance, for none of the languages is
there a singular morpheme for nouns or a present-tense morpheme for verbs, because these grammatical
categories do not alter or add anything to the word form. This is in contrast to, e.g., the plural
form of a noun (house vs. house+s), or the past tense of verbs (help vs. help+ed, come vs. came).</p>
      <p>The morpheme labels that correspond to inflectional (and sometimes also derivational) affixes
have been marked with an initial plus sign (e.g., +PL, +PAST). This is due to a feature of the
evaluation script: in addition to the overall performance statistics, evaluation measures are also
computed separately for the labels starting with a plus sign and those without an initial plus sign.
It is thus possible to make an approximate assessment of how accurately affixes are analyzed vs.
non-affixes (mostly stems).</p>
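      <p>The affix convention makes this split trivial to compute; a minimal sketch (the function name is ours, mirroring the separate affix/non-affix statistics of the evaluation script):</p>

```python
def split_affixes(labels):
    # Labels with an initial '+' are affixes (e.g. +PL, +PAST); the rest
    # are treated as non-affixes (mostly stems).
    affixes = {label for label in labels if label.startswith("+")}
    return affixes, set(labels) - affixes

affixes, stems = split_affixes(["foot_N", "+PL"])
```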
      <p>The morpheme labels that have not been marked as affixes (no initial plus sign) are typically
stems. These labels consist of an intuitive string, usually followed by an underscore character (_)
and a part-of-speech tag, e.g., ”baby_N”, ”sit_V”. In many cases, especially in English, the same
morpheme can function as different parts-of-speech; e.g., the English word ”force” can be a noun
or a verb. In the majority of these cases, however, if there is only a difference in syntax (and not
in meaning), the morpheme has been labeled as either a noun or a verb throughout. For instance,
the ”original” part-of-speech of ”force” is a noun, and consequently both noun and verb inflections
of ”force” contain the morpheme ”force_N”:
force force_N
force’s force_N GEN
forced force_N +PAST
forces force_N +3SG, force_N +PL
forcing force_N +PCP1
Thus, there is not really a need for the participants’ algorithms to distinguish between different
meanings or syntactic roles of the discovered stem morphemes. However, in some rare cases, if the
meanings of the different parts-of-speech do differ clearly, there are two variants, e.g., ”train_N”
(vehicle), ”train_V” (to teach), ”fly_N” (insect), ”fly_V” (to move through the air). But again, if
there are ambiguous meanings within the same part-of-speech, these are not marked in any way,
e.g., ”fan_N” (device for producing a current of air) vs. ”fan_N” (admirer). This notation is a
consequence of using CELEX and FINTWOL as the sources for our gold standards. We could
have removed the part-of-speech tags, but we decided to leave them there, since they carry useful
information without making the task significantly more difficult. There are no part-of-speech tags
in the Turkish gold standard.
1http://corpora.informatik.uni-leipzig.de/
2http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14
3http://www.lingsoft.fi/
4http://www.boun.edu.tr/index eng.html</p>
    </sec>
    <sec id="sec-5">
      <title>Participants and their submissions</title>
      <p>By the deadline in May 2007, 6 research groups had submitted the analysis results
obtained by their algorithms. A total of 12 different algorithms were submitted, and 8 of them were run
on all four test languages. All the submitted algorithms are listed in Table 2. In
general, the submissions were all interesting; all of them met the exact specifications given and
could be properly evaluated.</p>
      <p>
        In addition to the competitors’ 12 morpheme analysis algorithms, we evaluated a public
baseline method called “Morfessor Categories-MAP” (or here just “Morfessor MAP” or “Morfessor”,
for short) developed by the organizers [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Naturally, Morfessor competed outside the main
competition and its results were included only as a reference.
      </p>
      <p>Tables 3 - 6 show an example analysis and some statistics for each submission, including the
average number of alternative analyses per word, the average number of morphemes per
analysis, and the total number of morpheme types. The total numbers of word types were 2,206,719
(Finnish), 617,298 (Turkish), 1,266,159 (German), and 384,903 (English). The Turkish word list
was extracted from 1 million sentences, the other lists from 3 million sentences per
language. In these word lists, gold standard analyses were available for 650,169 (Finnish), 214,818
(Turkish), 125,641 (German), and 63,225 (English) words.</p>
      <p>
        The algorithms by Bernhard, Bordag, and Pitler were the same as or improved versions of those in the
previous Morpho Challenge [
        <xref ref-type="bibr" rid="ref5">7</xref>
        ]. Monson and Zeman were new participants who also provided
several alternative analyses for most words. The most distinctive approach was McNamee’s algorithm,
which did not attempt to provide a real morpheme analysis, but mainly to find a representative
substring for each word type that would be likely to perform well in the IR evaluation (our
Competition 2 [
        <xref ref-type="bibr" rid="ref4">6</xref>
        ]). Noteworthy in Tables 3 - 6 is also that the size of the morpheme lexicon varies a
lot between the algorithms.
      </p>
      <p>
        For each language, the morpheme analyses proposed by the participants’ algorithms were compared
against the linguistic gold standard. Since the task at hand involves unsupervised learning, it
cannot be expected that the algorithm comes up with morpheme labels that exactly correspond
to the ones designed by linguists. That is, no direct comparison will take place between labels as
such (the labels in the proposed analyses vs. labels in the gold standard). What can be expected,
however, is that two word forms that contain the same morpheme according to the participants’
algorithm also have a morpheme in common according to the gold standard. For instance, in
the English gold standard, the words ”foot” and ”feet” both contain the morpheme ”foot N”. It
is thus desirable that also the participants’ algorithm discovers a morpheme that occurs in both
these word forms (be it called ”FOOT”, ”morpheme784”, ”foot” or something else).
      </p>
      <p>In practice, the evaluation took place by randomly sampling a large number of word pairs,
such that both words in the pair have at least one morpheme in common. The exact constitution
of this set of word pairs was not revealed to the participants. In the evaluation, word frequency
played no role. Thus, all word pairs were equally important, whether they were frequent or rare.
The size of the randomly chosen set of word pairs varied depending on the size of the word
lists and gold standard described in the previous section: 200,000 (Finnish), 50,000 (Turkish), 50,000
(German), and 10,000 (English) word pairs.</p>
      <p>As the evaluation measure, we applied F-measure, which is the harmonic mean of Precision
and Recall:</p>
      <p>F-measure = 2/(1/Precision + 1/Recall) . (1)</p>
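      <p>Equation (1) can be written directly as a function (a sketch; precision and recall are given as fractions):</p>

```python
def f_measure(precision, recall):
    # Harmonic mean of precision and recall.
    return 2.0 / (1.0 / precision + 1.0 / recall)
```

For example, precision 0.5 and recall 1.0 give an F-measure of 2/3.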
      <p>Precision is calculated as follows: A number of word forms are randomly sampled
from the result file provided by the participants; for each morpheme in these words, another word
containing the same morpheme is chosen at random from the result file (if such a word
exists). We thus obtain a number of word pairs such that in each pair at least one morpheme is
shared between the words in the pair. These pairs are compared to the gold standard; a point
is given for each word pair that really has a morpheme in common according to the gold standard.
The total number of points is then divided by the total number of word pairs.</p>
      <p>For instance, assume that the proposed analysis of the English word ”abyss” is: ”abys +s”.
Two word pairs are formed: Say that ”abyss” happens to share the morpheme ”abys” with the
word ”abysses”; we thus obtain the word pair ”abyss - abysses”. Also assume that ”abyss” shares
the morpheme ”+s” with the word ”mountains”; this produces the pair ”abyss - mountains”. Now,
according to the gold standard the correct analyses of these words are: ”abyss_N”, ”abyss_N +PL”,
”mountain_N +PL”, respectively. The pair ”abyss - abysses” is correct (common morpheme:
”abyss_N”), but the pair ”abyss - mountains” is incorrect (no morpheme in common). Precision
here is thus 1/2 = 50%.</p>
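      <p>The worked example can be replayed in a few lines (the data structures and helper name are ours; each word maps to a list of alternative analyses, each a set of labels):</p>

```python
# Hypothetical gold-standard analyses mirroring the example above.
gold = {
    "abyss": [{"abyss_N"}],
    "abysses": [{"abyss_N", "+PL"}],
    "mountains": [{"mountain_N", "+PL"}],
}

def shares_morpheme(w1, w2, analyses):
    # A pair scores a point if any analysis of one word shares a
    # label with any analysis of the other.
    return any(a & b for a in analyses[w1] for b in analyses[w2])

pairs = [("abyss", "abysses"), ("abyss", "mountains")]
points = sum(shares_morpheme(a, b, gold) for a, b in pairs)
precision = points / len(pairs)
```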
      <p>Recall is calculated analogously to precision: A number of word forms are randomly sampled
from the gold standard file; for each morpheme in these words, another word containing the same
morpheme is chosen at random from the gold standard (if such a word exists). The word pairs
are then compared to the analyses provided by the participants; a point is given for each sampled
word pair that has a morpheme in common also in the analyses proposed by the participants’
algorithm. The total number of points is then divided by the total number of sampled word pairs.</p>
      <p>For words that have several alternative analyses, as well as for word pairs that have more than
one morpheme in common, the points are normalized in order not to give
these words considerably more weight in the evaluation than ”less complex” words. The words are
normalized by the number of alternative analyses and the word pairs by the number of matching
morphemes. Details of the evaluation can be studied directly in the evaluation script5 that was
provided before the competition to let the participants evaluate their morpheme analyses against
the gold standard samples provided in the Morpho Challenge.</p>
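      <p>As a minimal illustration of this normalization (the function name is ours; the evaluation script itself should be consulted for the exact bookkeeping), a point is split evenly over a word’s alternative analyses and over the matching morphemes in a pair:</p>

```python
def normalized_point(n_alternative_analyses, n_matching_morphemes):
    # A word with several alternative analyses splits its weight evenly
    # across them, and a pair sharing several morphemes splits its weight
    # across the matches, so "complex" words do not dominate the score.
    return (1.0 / n_alternative_analyses) * (1.0 / n_matching_morphemes)

# A pair drawn from a word with 2 alternative analyses, sharing
# 2 morphemes, contributes a quarter point rather than several points.
point = normalized_point(2, 2)
```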
    </sec>
    <sec id="sec-6">
      <title>Results</title>
      <p>The precision, recall and F-measure percentages obtained in the evaluation for all the test
languages are shown in Tables 7 - 10. The reference results that are given below each table were:
5The evaluation script can be downloaded from http://www.cis.hut.fi/morphochallenge2007/</p>
      <sec id="sec-6-1">
        <title>METHOD</title>
        <p>Bernhard 2, Bernhard 1, Bordag 5a, Bordag 5, Zeman, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP</p>
      </sec>
      <sec id="sec-6-2">
        <title>METHOD</title>
        <p>Zeman, Bordag 5a, Bordag 5, Bernhard 2, Bernhard 1, McNamee 3, McNamee 4, McNamee 5, Morfessor MAP, Tepper</p>
        <p>• Morfessor Categories-MAP: The same Morfessor Categories-MAP as described in Morpho
Challenge 2005 [4] was used for the unsupervised morpheme analysis. Each morpheme was
also automatically labeled as a prefix, stem, or suffix by the algorithm.
• Tepper: A hybrid method developed by Michael Tepper [9] was used to improve the
morpheme analysis obtained by our Morfessor Categories-MAP.</p>
        <p>For the Finnish task the winner (measured by F-measure) was the algorithm “Bernhard 2”. It
did not reach a particularly high precision, but its recall and F-measure were clearly superior.
It was also the only algorithm that beat the “Morfessor MAP” reference.</p>
        <p>
          For the Turkish task the competition was much tighter. The winner was “Zeman”, but “Bordag
5a” and “Bordag 5” were very close. The “Morfessor MAP” and “Tepper” reference methods were
clearly better than any of the competitors, but all the algorithms (except “Tepper”) seem to
have had problems with the Turkish task, because the scores were lower than for the other languages.
This is interesting, because in the morpheme segmentation task (Competition 1) of the previous
Morpho Challenge [
          <xref ref-type="bibr" rid="ref5">7</xref>
          ] the corresponding Turkish task was not more difficult than the others.
        </p>
        <p>The “Monson Paramor-Morfessor” algorithm reached the highest score in the German task, but
“Bernhard 2”, which again had the highest recall as in Finnish, was quite close. “Bordag 5a”
and “Bordag 5” were not far behind, either, and managed to beat the “Morfessor MAP” reference.</p>
        <p>For English, the “Bernhard 2” and “Bernhard 1” algorithms were the clear winners, but
“Pitler”, “Monson Paramor-Morfessor”, and “Monson ParaMor” were also able to beat the
“Morfessor MAP” reference, and some even the “Tepper” reference.
The significance of the differences in F-measure was analyzed for all algorithm pairs in all
evaluations. The analysis was performed by splitting the data into several partitions and computing
the results for each partition separately. The statistical significance of the differences between the
participants’ algorithms was computed by Wilcoxon’s signed-rank test for comparison of the
results in the independent partitions. The results show that almost all differences were statistically
significant; only the following pairs were not:
• In Finnish (Table 7):
• In Turkish (Table 8): “Zeman” and “Bordag 5a”, “Bordag 5a” and “Bordag”
• In German (Table 9): “Monson Morfessor” and “Bernhard 1”
• In English (Table 10): “Bernhard 2” and “Bernhard 1”, “Monson Paramor-Morfessor” and
“Monson ParaMor”, “Monson Morfessor” and “Zeman”, “Bordag 5a” and “Bordag”
This result was not surprising, since the random word-pair samples were quite large and all the
result pairs that were not significantly different gave very similar F-measures (less than 0.5
percentage units apart).</p>
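        <p>The partition-based test can be sketched as follows; this is a minimal pure-Python version of the Wilcoxon signed-rank statistic with a normal approximation (a statistics library would be used in practice), assuming no zero differences and no tied magnitudes. The per-partition F-measure values below are made up for illustration.</p>

```python
import math

def wilcoxon_signed_rank(xs, ys):
    # Signed-rank statistic W+ over paired per-partition F-measures,
    # plus its normal-approximation z-score. Assumes no zero differences
    # and no tied magnitudes.
    diffs = [x - y for x, y in zip(xs, ys)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] > 0)
    n = len(diffs)
    mean = n * (n + 1) / 4.0
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return w_plus, (w_plus - mean) / sd

# Hypothetical F-measures of two methods over five independent partitions.
w, z = wilcoxon_signed_rank([0.61, 0.62, 0.60, 0.63, 0.64],
                            [0.58, 0.60, 0.61, 0.57, 0.55])
```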
        <p>
          By looking at the precision and recall results we see that the “McNamee 5” algorithm, which had
clearly the highest precision in all languages, suffered from a very low recall and was thus not
competitive in F-measure. However, McNamee’s algorithms were not real attempts to provide a
good morpheme analysis, but mainly to find a representative substring for each word type that
would be likely to perform well in the IR evaluation (our Competition 2 [
          <xref ref-type="bibr" rid="ref4">6</xref>
          ]). This is in line with
our assumption that the precision evaluation could be closer to the IR task, because it measures
the portion of matches from a chosen word to other words that agree with the grammatical analysis.
This is related to what the most basic form of IR also does: to look for matches between the
query word and the words in each document. The recall, however, may not be as relevant to
IR, because it measures the portion of grammatically matching morphemes that are found by the
algorithm. By looking at the Gold Standards (Table 1) we see that many of the grammatical
morphemes (such as +PL and +PAST) are very common and may not be very relevant in IR; an
algorithm like “McNamee 5” would probably ignore them.
        </p>
        <p>Future work in unsupervised morpheme analysis should further develop the clustering of
contextually similar units into morphemes that would match better with the grammatical
morphemes and thus improve the recall. Most of the submitted algorithms probably did not take
advantage of the provided possibility to utilize the sentence context for analyzing the words and finding the
morphemes. Although this may not be as important for success in IR as improving the precision,
it may provide useful additional information for some keywords.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusions</title>
      <p>The objective of Morpho Challenge 2007 was to design a statistical machine learning algorithm
that discovers which morphemes (smallest individually meaningful units of language) words consist
of. Ideally, these are basic vocabulary units suitable for different tasks, such as text
understanding, machine translation, information retrieval, and statistical language modeling. The current
challenge was a successful follow-up to our previous Morpho Challenge 2005 (Unsupervised
Segmentation of Words into Morphemes). This time the task was more general in that, instead of
looking for an explicit segmentation of words, the focus was on the morpheme analysis of the word
forms in the data.</p>
      <p>The scientific goals of this challenge were to learn of the phenomena underlying word
construction in natural languages, to discover approaches suitable for a wide range of languages and to
advance machine learning methodology. The analysis and evaluation of the submitted machine
learning algorithm for unsupervised morpheme analysis showed that these goals were quite nicely
met. There were several novel unsupervised methods that achieved good results in several test
languages, both with respect to finding meaningful morphemes and useful units for information
retrieval.</p>
      <p>12 different morpheme analysis algorithms from 6 research groups were submitted and evaluated.
The evaluations included 4 different languages: Finnish, Turkish, German, and English. The
algorithms and results were presented at the Morpho Challenge Workshop, arranged in connection
with the CLEF 2007 Workshop, September 19-21, 2007. Morpho Challenge 2007 was part of
the EU Network of Excellence PASCAL Challenge Program and organized in collaboration with
CLEF.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank all the participants for their submissions and enthusiasm. We owe great thanks as
well to the organizers of the PASCAL Challenge Program and CLEF who helped us organize
this challenge and the challenge workshop. Especially, we would like to thank Carol Peters from
CLEF for helping us to get Morpho Challenge in CLEF 2007 and organize a great workshop
there. We are most grateful to the University of Leipzig for making the training data resources
available to the Challenge, and in particular we thank Stefan Bordag for his kind assistance. We
are indebted to Ebru Arisoy for making the Turkish gold standard available to us. We also thank
Krista Lagus for comments on the manuscript. Our work was supported by the Academy of
Finland in the projects Adaptive Informatics and New adaptive and learning methods in speech
recognition. This work was supported in part by the IST Programme of the European Community,
under the PASCAL Network of Excellence, IST-2002-506778. This publication only reflects the
authors’ views. We acknowledge that access rights to data and other materials are restricted due
to other commitments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Jeff A.</given-names>
            <surname>Bilmes</surname>
          </string-name>
          and
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Kirchhoff</surname>
          </string-name>
          .
          <article-title>Factored language models and generalized parallel backoff</article-title>
          .
          <source>In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL)</source>
          , pages
          <fpage>4</fpage>
          -
          <lpage>6</lpage>
          , Edmonton, Canada,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ozlem</given-names>
            <surname>Cetinoglu</surname>
          </string-name>
          .
          <article-title>Prolog based natural language processing infrastructure for Turkish</article-title>
          . M.Sc. thesis, Bogazici University, Istanbul, Turkey,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mathias</given-names>
            <surname>Creutz</surname>
          </string-name>
          and
          <string-name>
            <given-names>Krista</given-names>
            <surname>Lagus</surname>
          </string-name>
          .
          <article-title>Inducing the morphological lexicon of a natural language from unannotated text</article-title>
          .
          <source>In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR'05)</source>
          , pages
          <fpage>106</fpage>
          -
          <lpage>113</lpage>
          , Espoo, Finland,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Mikko</given-names>
            <surname>Kurimo</surname>
          </string-name>
          , Mathias Creutz, and
          <string-name>
            <given-names>Ville</given-names>
            <surname>Turunen</surname>
          </string-name>
          .
          <article-title>Unsupervised morpheme analysis evaluation by IR experiments - Morpho Challenge 2007</article-title>
          .
          <source>In Working Notes for the CLEF 2007 Workshop</source>
          , Budapest, Hungary,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Mikko</given-names>
            <surname>Kurimo</surname>
          </string-name>
          , Mathias Creutz, Matti Varjokallio, Ebru Arisoy, and
          <string-name>
            <given-names>Murat</given-names>
            <surname>Saraclar</surname>
          </string-name>
          .
          <article-title>Unsupervised segmentation of words into morphemes - Challenge 2005: an introduction and evaluation report</article-title>
          .
          <source>In PASCAL Challenge Workshop on Unsupervised segmentation of words into morphemes</source>
          , Venice, Italy,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>