                                        Probabilistic ParaMor
                            Christian Monson, Kristy Hollingshead, and Brian Roark
                                   Center for Spoken Language Understanding
                                      Oregon Health & Science University
                                             monsonc@ohsu.edu


                                                    Abstract

       The ParaMor algorithm for unsupervised morphology induction, which competed in the 2007 and
       2008 Morpho Challenge competitions, does not assign a numeric score to its segmentation deci-
       sions. Scoring each character boundary in each word with the likelihood that it falls at a true mor-
       pheme boundary would allow ParaMor to adjust the confidence level at which the algorithm pro-
       poses segmentations. A sliding threshold on segmentation confidence would, in turn, permit a
       tradeoff between precision and recall that could optimize F1 or other metrics of interest. Our sub-
       mission to Morpho Challenge 2009 enriches ParaMor with segmentation confidences by training
       an off-the-shelf statistical natural language tagger to mimic ParaMor’s morphological segmenta-
       tions. For a given word, the tagger’s probabilistic confidence that ParaMor would propose the
       character, c, as the first character of a new morpheme serves as the numeric score of the candidate
       morpheme boundary that immediately precedes c. We have trained a ParaMor tagger mimic over a
       development data set of 500,000 unique Hungarian word types. By adjusting the threshold above
       which the ParaMor mimic proposes morpheme boundaries, we improve ParaMor’s F1 score for
       Hungarian by 5.9% absolute, from 41.4% to 47.3%. Moreover, by training a probabilistic tagger to
       emulate the segmentations of a second unsupervised morphology induction system, Morfessor, we
       are able to combine ParaMor’s segmentation decisions with Morfessor’s to form a single joint
       segmentation of each word. Our joint ParaMor-Morfessor tagger mimic enhances F1 performance
       on our Hungarian development set by a further 3.4% absolute, ultimately achieving an F1 score of
       50.7%.



Categories and Subject Descriptors

I.2 [Artificial Intelligence]: I.2.7 Natural Language Processing

General Terms

Experimentation

Keywords

Natural language morphology, Unsupervised learning, Morphology induction, Statistical Mimic



1. Introduction

    Unsupervised morphology induction is the task of learning the morphological analyses of the words of an
unknown natural language from nothing more than a raw corpus of unannotated text. Analyzing words down to
the morpheme level has helped a range of natural language processing tasks including machine translation
(Oflazer and El-Kahlout, 2007), information retrieval (Kurimo and Turunen, 2008), and speech recognition
(Creutz, 2006). But building a morphological analysis system by hand can take person-months of time—hence
the need for automatic methods for morphology induction.
    A wide variety of approaches to unsupervised morphology induction have been proposed in recent years.
Techniques inspired by Zellig Harris’ early work (Harris, 1955) measure the probabilities of word-internal char-
acter transitions to identify likely morpheme boundaries (Bernhard, 2008). Other systems rely on the minimum
description length principle to pick out a set of highly descriptive morphemes (Goldsmith, 2001; Creutz, 2006).
Recent work on unsupervised morphology induction for Semitic languages has focused on estimating robust
statistical models of morphology (Snyder and Barzilay, 2008; Poon et al., 2009). And this paper extends a mor-
phological induction system called ParaMor that leverages morphological paradigms as the inherent structure of
natural language morphology (Monson, 2009).

1.1. ParaMor

    The ParaMor algorithm is a linguistically motivated unsupervised morphology induction system. By counting
how frequently word-final strings occur on shared word-initial strings in a list of unannotated words, ParaMor auto-
matically builds sets of suffixes that model the paradigm structure found in inflectional morphology. For exam-
ple, from one corpus of Spanish newswire, ParaMor discovers that each member of a set of 41 word-final strings
that includes a, aba, aban, acion, aciones, ación, ada, adas, ado, ador, … attaches to a set of candidate stems
that covers the forms aboy, celebr, desarroll, and genera. Although ignorant of syntactic and lexical features,
ParaMor has discovered the set of verbal suffixes that attach to Spanish ar verbs.
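    As a rough illustration of this counting step, the following Python sketch (ours, and far simpler than ParaMor's
actual paradigm search) splits every word at every position into a candidate stem and a word-final string, and
groups the word-final strings that attach to a shared stem:

    from collections import defaultdict

    def candidate_suffix_sets(words):
        # Group candidate suffixes (word-final strings) by the candidate
        # stems (word-initial strings) they attach to.
        suffixes_by_stem = defaultdict(set)
        for word in words:
            for i in range(1, len(word)):        # every non-trivial split point
                stem, suffix = word[:i], word[i:]
                suffixes_by_stem[stem].add(suffix)
        return suffixes_by_stem

    # Toy example: two Spanish-like stems share the word-final strings a, aba, ado.
    words = ["celebra", "celebraba", "celebrado",
             "desarrolla", "desarrollaba", "desarrollado"]
    sets_by_stem = candidate_suffix_sets(words)
    print(sets_by_stem["celebr"])      # {'a', 'aba', 'ado'}
    print(sets_by_stem["desarroll"])   # {'a', 'aba', 'ado'}

ParaMor's actual algorithm searches over and merges such candidate suffix sets to recover full paradigms like the
41-string ar-verb set above; the sketch shows only the raw stem and suffix counting from which that search begins.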
     The ParaMor algorithm competed in both the 2007 and 2008 Morpho Challenge Competitions, both solo and
in a joint submission with a second unsupervised morphology induction system, Morfessor. Setting aside the joint
ParaMor-Morfessor system, the solo ParaMor system placed first in the Turkish Linguistic competition of Mor-
pho Challenge 2008, at 46.5% F1, and second in English, with an F1 score of 52.5%. Meanwhile the joint Pa-
raMor-Morfessor system placed first overall in the 2008 Linguistic competitions for German, Finnish, Turkish,
and Arabic.
    ParaMor’s successes are particularly remarkable given that ParaMor is a rule-based system incapable of
measuring the confidence of the morphological segmentations it proposes. Without a confidence measure on
individual segmentation decisions, it is impossible to increase or decrease the number of segmentation points
that ParaMor proposes so as to optimize ParaMor’s precision-recall performance for a given task. A tradeoff
between precision and recall is inherent in any classification task, including morphological segmentation. If sys-
tem A proposes segmenting a corpus at a superset of the character boundaries at which system B proposes seg-
mentations, some of system A’s additional proposed boundaries will match true morpheme boundaries but others
will not—increasing recall but decreasing precision.
    The ability to trade off precision against recall is clearly relevant for a morphological analysis system. A
morphological analysis system embedded in a patent search application, for example, should likely favor stem
recall over precision: seeking to return as many documents that potentially match a query as possible. On the
other hand, a morphology system that enhances the language model of a text-input system for cell-phones should
likely focus on precise and accurate input—lowering the number of incorrect word proposals that the user must
correct by hand. And a system designed to perform well in the linguistic competition of Morpho Challenge
should balance precision and recall of morphemes. Morpho Challenge is evaluated by F1, the harmonic mean of
precision and recall, and the harmonic mean imposes a strict penalty when the gap between the precision and
recall scores is large. Consequently, F1 is nearly always maximized when precision and recall are balanced. Our
submission to Morpho Challenge 2009 imbues ParaMor’s segmentation decisions with probabilistic confidence
scores by enlisting the help of a natural language tagger.

2. Probabilistic ParaMor

    At each character, c, in each word, a morphology segmentation algorithm, such as ParaMor, makes a binary
decision either to place or not to place a morpheme boundary before c. We view the morphology segmentation
task as a labeling problem akin to part-of-speech tagging. In part-of-speech tagging, each word in a sequence
must be labeled with its part of speech: V, N, Adj, etc. In morphological segmentation, each successive
character in a word must be labeled as beginning a new morpheme or as continuing the current one.
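    As a small illustration, suppose a segmenter analyzed the English word quickly as quick +ly (a segmentation
used here purely for exposition). Under a simple begin/continue labeling, the characters would be tagged as follows:

    # Hypothetical segmentation of "quickly" as quick +ly: each character is
    # labeled B if it begins a morpheme and I if it continues the current one.
    word   = list("quickly")            #  q    u    i    c    k    l    y
    labels = ["B", "I", "I", "I", "I",  # 'q' begins the stem "quick"
              "B", "I"]                 # 'l' begins the suffix "ly"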
    There are two advantages to reformulating segmentation as a tagging problem. First, taggers are a proven and
well-understood natural language processing technique that has been adapted to a variety of problems beyond
part-of-speech labeling. Taggers have been used for named entity recognition (Tjong Kim Sang, 2002) and
NP-chunking (Tjong Kim Sang and Buchholz, 2000); to flag words on the periphery of a parse constituent
(Roark and Hollingshead, 2009); as well as to segment written Chinese into words (Xue, 2003)—a task closely
related to morphology segmentation. Second, standard natural language taggers are statistical models that can
output a probability distribution over all possible labels for each tagged item. Thus, a statistical tagger trained to
label morpheme boundaries outputs a probabilistic confidence score that any particular character begins a mor-
pheme.
    Just one problem remains: Statistical taggers are supervised induction methods while the Morpho Challenge
competitions explicitly forbid supervised induction. Where unsupervised induction methods learn from unla-
beled examples, i.e. unadorned words for morphology induction, supervised methods require labeled training
data. In the case of morphological segmentation, labeled training data would consist of a set of words with each
character labeled as the start of or as the continuation of a morpheme.
    While data labeled with the truth is forbidden in Morpho Challenge, we can construct artificial training data
from the segmented output of an unsupervised morphology induction algorithm such as ParaMor; and then use
the artificially labeled data to train a statistical tagger to mimic the segmentations that the unsupervised method
produces. The probabilistic segmentation scores that the tagger mimic assigns to each character can then serve as
a numeric confidence for the original unsupervised method.

2.1. Training a Tagger Mimic

    Using the unsupervised morphology induction algorithm ParaMor as a source of labeled data, we trained a
finite-state tagger (Hollingshead et al., 2005) to identify, for each character, c, in a given word, whether or not
ParaMor would place a morpheme boundary immediately before c. Additionally we had our tagger learn whether
each proposed boundary began a stem or an affix. Thus we trained a statistical model to “tag” each character as
beginning a new stem morpheme, beginning a new suffix morpheme, or as not occurring at the left edge of a
morpheme. The feature set used in the tagger consisted of the surrounding sequences of characters and mor-
pheme tags. The character sequences are represented by character n-grams up to three characters on either side
of the current character. Thus in a word like “quickly”, the character-features for tagging the letter ‘c’ would be:
‘quic’, ‘uic’, ‘ic’, ‘c’, ‘ck’, ‘ckl’, and ‘ckly’. The morpheme-tag features are represented as unigram, bigram, and
trigram morpheme-tags (i.e., tags from the current and two previous characters).
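    A minimal sketch of this feature extraction is given below; the function and the feature-name prefixes are
ours, and for simplicity the tag features shown cover only the two previously assigned tags:

    def character_features(word, i, prev_tags):
        # Character n-grams up to three characters to the left and right of
        # position i, plus n-gram features over the preceding morpheme tags.
        feats = []
        for k in (3, 2, 1):                       # left-context n-grams
            if i - k >= 0:
                feats.append("L:" + word[i - k:i + 1])
        feats.append("C:" + word[i])              # the character itself
        for k in (1, 2, 3):                       # right-context n-grams
            if i + k < len(word):
                feats.append("R:" + word[i:i + k + 1])
        feats.append("T1:" + prev_tags[-1])                 # tag unigram
        feats.append("T2:" + "_".join(prev_tags[-2:]))      # tag bigram
        return feats

    # For the letter 'c' in "quickly" this produces the character n-grams
    # 'quic', 'uic', 'ic', 'c', 'ck', 'ckl', and 'ckly', as in the example above.
    print(character_features("quickly", 3, ["<s>", "I"]))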
    We used the averaged perceptron algorithm, as presented in Collins (2002), to train the tagger. During train-
ing, the decoding process is performed using a Viterbi search with a second-order Markov assumption. At test-
time, we use the forward-backward algorithm, again with a second-order Markov assumption, to output the per-
ceptron-score of each morphological tag for each character in the word. The main benefit of decoding in this
manner is that, by normalizing the scores at each character (using a softmax, due to the log-linear modeling), we
can extract the probability of each tag at each character rather than just the single perceptron-preferred solution
for the entire word.
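    The per-position normalization can be illustrated in isolation. Given the perceptron scores of the three
morpheme tags at one character (invented numbers standing in for the forward-backward scores described above),
a softmax turns them into a probability distribution:

    import math

    def tag_probabilities(scores):
        # Softmax-normalize per-tag scores into a probability distribution.
        m = max(scores.values())                       # subtract max for stability
        exps = {tag: math.exp(s - m) for tag, s in scores.items()}
        z = sum(exps.values())
        return {tag: e / z for tag, e in exps.items()}

    # Hypothetical scores at one character: begin-stem, begin-suffix, or neither.
    probs = tag_probabilities({"B-stem": 1.2, "B-suffix": 2.5, "I": 0.3})
    p_boundary = probs["B-stem"] + probs["B-suffix"]   # P(a morpheme starts here)
    print({t: round(p, 3) for t, p in probs.items()}, round(p_boundary, 3))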

2.2. Efficacy of the ParaMor Tagger Mimic

    Our finite-state tagger scores each character, c, in each word with the likelihood that ParaMor would
treat c as the first character in a new morpheme. Consider the segmentation that results from placing morpheme
boundaries before each character that is tagged as the start of a new morpheme, stem or affix, with a probability
greater than 0.5. This baseline mimic segmentation, although trained to emulate ParaMor’s segmentations, will
not be fully identical to ParaMor’s original segmentation of a set of words. Figure 1 summarizes our tagging
accuracy at emulating segmentations for the five languages and six data sets of the Linguistic competition of
Morpho Challenge 2009. Tagging accuracy is calculated over all tagged characters, averaging over the held-out
test-folds during 10-fold cross-validation. For all the test languages and scenarios, our tagger successfully emu-
lates ParaMor at an accuracy above 93%, with particularly strong accuracy for German, 96.6%, and English
97.6%.
                 English     German     Finnish     Turkish     Arabic -V     Arabic +V
 Linguistic       97.6%       96.6%      93.5%       93.6%       93.3%         93.7%

Figure 1: The tagging accuracy of our finite-state tagger at mimicking ParaMor’s morphological seg-
   mentations over the data from the Linguistic competition of Morpho Challenge 2009.



    The mimic tagger’s departures from the original ParaMor segmentation may either hurt or improve the seg-
mentation quality. On the one hand, when the mimic tagger deviates from the ParaMor segmentation, the mimic
may be capturing some real generalization of morphological structure that is hidden in the statistical distribution
of ParaMor’s original segmentation. On the other hand, a disagreement between the original and the mimic Pa-
raMor segmentations may simply be a failure of the tagger to model the irregularities inherent in natural lan-
guage morphology.
    To evaluate the effectiveness of using a tagger to mimic ParaMor’s segmentations, we performed a develop-
ment evaluation over a Hungarian dataset. We used Hunmorph (Trón et al., 2005), a hand-built Hungarian mor-
phological analyzer, to produce a morphological answer key containing 500,000 unique Hungarian word types
from the Hunglish corpus (Varga et al., 2009). Our Hungarian ParaMor tagger mimic actually outperforms Pa-
raMor’s original segmentations at F1: Where the original ParaMor attained an F1 of 41.4%, the ParaMor tagger
mimic improved F1 to 42.7% with the help of a slightly higher recall.

2.3. Optimizing F1

    Having retained ParaMor’s underlying performance quality by training a natural language tagger to mimic
ParaMor’s segmentations, we next seek to increase the tagger-mimic’s F1 further by leveraging the probabilistic
scores that the tagger mimic assigns to each segmentation decision. As discussed in section 1.1, numeric scores
for each segmentation decision are the key to trading off precision for recall, reducing the gap between precision
and recall, and raising F1.
    To optimize F1, we proceed as follows:
   1.   For each character, c, that does not begin a word, record the tagger mimic’s prob-
        ability that c begins a morpheme.
   2.   Sort this list of probabilities smallest to largest and count the number of probability
        scores that are larger than 0.5, assigning k to be this count.
   3.   For a given positive factor, α, consult the list of sorted probabilities to identify the
        probability score, S, above which αk of the probabilities lie.
   4.   Segment at all characters which receive a probabilistic segmentation score above S.

In prose, k is the number of word-internal morpheme boundaries that the default ParaMor mimic proposes. Trading
off recall against precision amounts to adjusting the number of morpheme boundaries that the ParaMor mimic
proposes. To increase or decrease that number by a factor α, we move the probability threshold from 0.5 to the
value that permits exactly αk segmentations.
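    The procedure can be written compactly as follows; the function is our own sketch and assumes that the
per-character boundary probabilities have already been collected from the tagger mimic:

    def boundary_threshold(probabilities, alpha):
        # Steps 1-4: find the score S above which alpha * k of the boundary
        # probabilities lie, where k is the count of probabilities above 0.5.
        ranked = sorted(probabilities, reverse=True)     # largest first
        k = sum(1 for p in ranked if p > 0.5)            # default boundary count
        n = min(int(round(alpha * k)), len(ranked))      # boundaries to allow
        return ranked[n] if n < len(ranked) else 0.0     # S sits just below the n-th score

    # Step 4: segment at every character whose probability exceeds S.
    probs = [0.9, 0.8, 0.6, 0.45, 0.4, 0.2]
    S = boundary_threshold(probs, alpha=4/3)             # k = 3, so allow 4 boundaries
    print(S, [p > S for p in probs])                     # 0.4 [True, True, True, True, False, False]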
    Figure 2 plots the precision, recall, and F1 of the ParaMor tagger mimic as the number of word-internal mor-
pheme boundaries varies between one half and four times the baseline k number of word-internal boundaries. As
Figure 2 shows, adjusting α allows for a smooth tradeoff between precision and recall. F1 reaches its maximum
value of 47.5% at α = 4/3. As expected, 4/3 is near the location where recall overtakes precision. The improve-
ment in F1 for the ParaMor tagger mimic of 4.8% is statistically significant at the 95% confidence level if we as-
sume that F1 is normally distributed.
[Figure 2 is a plot: Precision, Recall, and F1 (%) of the ParaMor tagger mimic on the vertical axis,
against α on a logarithmic horizontal axis running from 0.50 to 4.00.]

Figure 2: Precision, Recall, and F1 of the ParaMor tagger mimic as α moves between 0.5 and 4.0.
   When α is 1, F1 lies at 42.7%. But by increasing the number of morpheme boundaries that the tag-
   ger mimic proposes by a third, to α = 4/3, the gap between precision and recall decreases and F1
   rises by 4.8% absolute to reach 47.5%. The error bars are 95% confidence intervals on each F1
   value, assuming the F1 measurements are normally distributed.




2.4. ParaMor in Morpho Challenge 2009

    Our ParaMor tagger mimic competed in all the language scenarios of Morpho Challenge 2009. For all lan-
guages of the Linguistic, Information Retrieval, and Machine Translation competitions of Morpho Challenge we
set α at 4/3, the setting which produced the highest F1 on our Hungarian development set. Figure 3 contains the
2009 linguistic competition’s precision, recall, and F1 scores for both the original ParaMor, which competed in
Morpho Challenge 2008, and for the ParaMor tagger mimic on the non-Arabic languages. Most likely due to the
small size of the Arabic data sets, all versions of ParaMor suffered from extraordinarily low recall in both the
vowelized and unvoweled Arabic scenarios. But in the four non-Arabic languages of the Morpho Challenge
2009 Linguistic competition, the gap between precision and recall is smaller for the ParaMor Mimic than it is for
the 2008 ParaMor system. And in all languages but English, the reduced precision-recall gap results in a higher
F1 score.
                English              German              Finnish              Turkish
             P      R     F1      P      R     F1      P      R     F1      P      R     F1
 Original  63.3   52.0   57.1   57.0   42.1   48.4   50.0   37.6   42.9   57.4   45.8   50.9
    Mimic  53.1   59.0   55.9   50.8   47.7   49.2   47.2   40.5   43.6   49.5   54.8   52.0

Figure 3: The precision, recall, and F1 of the original ParaMor, which competed in Morpho Challenge
   2008, and the ParaMor tagger mimic in the non-Arabic languages of the Linguistic competition of
   Morpho Challenge 2009. In all languages but English, the mimic system improves on the original
   ParaMor’s F1 score.


    The increase in F1 for German, Finnish, and Turkish is more modest than the Hungarian results had led us to
hope: about one percentage point in each case. Two reasons likely limited the improvements in F1. First, perfor-
mance rose by a smaller amount for the Challenge test languages than it did for our Hungarian development set
because we were explicitly tuning our α parameter to Hungarian. Second, it may be atypical that the
tagger mimic outperformed the baseline ParaMor system on the Hungarian data. Time and resource constraints
forced us to train the tagger mimics over subsets of the full Morpho Challenge data, anecdotally lowering tag-
ging mimic accuracy by about a percentage point. To ascertain the quality of the tagger mimics across the range
of Morpho Challenge languages, we plan to ask the Morpho Challenge 2009 Committee to evaluate the baseline
ParaMor tagger mimic systems with α set to 1.

3. Joining ParaMor with Morfessor

    In addition to the ParaMor tagger mimic system, we submitted two systems to Morpho Challenge 2009
which join segmentations derived from ParaMor with segmentations obtained from the freely available unsuper-
vised morphology induction system Morfessor (Creutz, 2006); see section 3.1. Our joint ParaMor-Morfessor
systems differ substantially from the ParaMor-Morfessor systems that the lead author submitted in the 2007 and
2008 Challenges. In particular, both joint systems submitted to the 2009 Challenge combine the ParaMor and the
Morfessor segmentations of each word into a single analysis of that word.

3.1. Morfessor

     In brief, the unsupervised morphology induction system Morfessor Categories-MAP searches for a segmenta-
tion of a corpus that maximizes the corpus probability score according to a specific generative probability model.
The Morfessor system then further refines the morphological segmentations it proposes by restricting morpheme
sequences with a Hidden Markov Model (HMM) which permits only (prefix* stem suffix*)+ sequences.
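     The morphotactic restriction can be pictured as a regular expression over per-morpheme category labels; the
snippet below is only a toy rendering of the (prefix* stem suffix*)+ pattern, not Morfessor's actual HMM:

    import re

    # P = prefix, S = stem, F = suffix; one letter per morpheme in a word.
    morphotactics = re.compile(r"^(P*SF*)+$")
    print(bool(morphotactics.match("PSFF")))   # prefix stem suffix suffix: allowed
    print(bool(morphotactics.match("SFS")))    # stem suffix stem (a compound): allowed
    print(bool(morphotactics.match("FS")))     # a suffix before any stem: rejected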
     The Morfessor system serves as a strong baseline system in Morpho Challenge. In the 2008 Challenge, Mor-
fessor placed first in the Arabic Linguistic competition at 34.0% F1, and second in Turkish at 38.5%. Both Mor-
fessor’s underlying recursive search strategy and its HMM structure are designed to handle agglutinative mor-
phology where a single word consists of several morphemes in sequence. In general, Morfessor attains a higher
precision than recall. In the 2008 Morpho Challenge, Morfessor’s lowest precision score in the linguistic compe-
tition was 67.2%, for German, while the highest recall score it achieved was 36.8%, also for German. Although,
Morfessor’s precision scores for all the other languages of the 2008 linguistic competition lie above 70%, Mor-
fessor’s more balanced precision and recall scores for German lead to Morfessor’s highest F1 score for any lan-
guage.
3.2. Two Methods for System Combination

3.2.1. Union

   The first of our two joint submissions to Morpho Challenge 2009 forms a single morphological segmentation
from the disparate segmentations proposed by the ParaMor and Morfessor systems by segmenting each word at
every location that either ParaMor or Morfessor suggests. Hence, this submission is the union of all segmentation
points that either ParaMor or Morfessor proposes. As an example union segmentation, take the English word
polymers’ from the Linguistic competition of this year’s Challenge. ParaMor segments polymers’ as
polym +er +s’, Morfessor as polymer +s +’, while the union analysis is polym +er +s +’.
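    A small sketch of the union combination follows; each analysis is reduced to the set of character offsets at
which a morpheme begins (the helper functions are ours, while the ‘+’ notation follows the examples above):

    def boundaries(segmentation):
        # Character offsets at which a morpheme starts, given an analysis
        # written with '+' introducing each non-initial morpheme.
        offsets, pos = set(), 0
        for morph in segmentation.replace("+", " ").split():
            if pos > 0:
                offsets.add(pos)
            pos += len(morph)
        return offsets

    def union_segment(word, seg_a, seg_b):
        # Segment the word at every boundary proposed by either system.
        cuts = sorted(boundaries(seg_a) | boundaries(seg_b))
        pieces, last = [], 0
        for c in cuts + [len(word)]:
            pieces.append(word[last:c])
            last = c
        return " +".join(pieces)

    # The example from the text: ParaMor and Morfessor analyses of polymers'.
    print(union_segment("polymers'", "polym +er +s'", "polymer +s +'"))
    # -> polym +er +s +'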

3.2.2. A Joint ParaMor-Morfessor Mimic

    The second of our joint ParaMor-Morfessor submissions builds on the idea of tagger mimics that was pro-
posed in section 2. While Morfessor itself has a statistical model that internally scores individual morphological
segmentations with probabilities, the final segmentations that Morfessor proposes are not by default annotated
with confidences. Hence, we followed the procedure outlined in section 2 to train a natural language tagger to
mimic Morfessor’s morphological analyses. It is encouraging that our technique for inducing a probabilistic
model through a mimic tagger immediately extends from a non-statistical system like ParaMor to the black-box
scenario for Morfessor.
    With mimic taggers for both ParaMor and Morfessor in hand we then joined, for each character, c, in each
word, the tag probabilities from the ParaMor mimic with the corresponding probabilities from the Morfessor
mimic. We weighted the probability scores from the ParaMor mimic and the Morfessor mimic equally. To obtain
the final morphological segmentation of each word, our joint ParaMor-Morfessor mimic followed the methodol-
ogy described in section 2.3 of optimizing F1 against our Hungarian development set, with one caveat. Because
we weighted the probabilities of ParaMor and Morfessor equally, any segmentation point that is strongly sug-
gested by only one of the two systems receives an adjusted probability score just less than 0.5. Hence, we moved
the baseline probability threshold from 0.5 to 0.49. With this single adjustment, the α factor that maximized
Hungarian F1 was 10/9, an 11% increase in the number of proposed morpheme boundaries.
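    A brief sketch of the combination makes the role of the 0.49 threshold concrete; the probabilities below are
invented, but the equal weighting and the threshold follow the description above:

    def joint_boundary_probs(paramor_probs, morfessor_probs, w=0.5):
        # Average the two mimics' boundary probabilities character by character.
        return [w * p + (1.0 - w) * m
                for p, m in zip(paramor_probs, morfessor_probs)]

    # Hypothetical boundary probabilities for the word-internal characters of one word.
    paramor   = [0.95, 0.10, 0.98]   # the third boundary is backed only by ParaMor
    morfessor = [0.90, 0.05, 0.01]
    joint = joint_boundary_probs(paramor, morfessor)
    print([round(p, 3) for p in joint])       # [0.925, 0.075, 0.495]
    print([p > 0.49 for p in joint])          # the ParaMor-only boundary just clears 0.49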

4. The Performance of our Joint ParaMor-Morfessor Systems

    Figure 4 summarizes the precision, recall, and F1 performance of three joint ParaMor-Morfessor systems over
the datasets for the non-Arabic languages from the Linguistic competition of Morpho Challenge 2009. The first
two rows of Figure 4 give performance numbers for the Union and Tagger Mimic systems which we submitted
this year. The third row lists the performance numbers from the joint ParaMor-Morfessor system that was sub-
mitted by the lead author to Morpho Challenge 2008.


                English              German              Finnish              Turkish
             P      R     F1      P      R     F1      P      R     F1      P      R     F1
    Union  55.7   62.3   58.8   52.3   60.3   56.1   47.9   51.0   49.4   47.3   60.0   52.9
    Mimic  54.8   60.2   57.4   51.1   57.8   54.2   51.8   45.4   48.4   48.1   60.4   53.5
     2008  70.1*  67.4*  68.7*  64.1   61.5   62.8   65.2   50.4   56.9   66.8   58.0   62.1

Figure 4: The precision, recall, and F1 of three joint ParaMor-Morfessor morphological segmentation
   systems over the non-Arabic languages of the Linguistic competition of Morpho Challenge 2009.

   *The best result, reported here, was from Morpho Challenge 2007.

    Although the union and tagger mimic joint systems do outperform the solo ParaMor mimic at F1 (compare
Figure 3), it was disappointing that the simple union system outscored the ParaMor-Morfessor tagger mimic in
three of the four relevant language scenarios. Particularly surprising is that the recall of the joint tagger mimic
falls below the recall of the union system in every language but Turkish. With an α factor above 1, the joint tag-
ger mimic is proposing all the segmentation points that either the ParaMor mimic or the Morfessor mimic hy-
pothesize—effectively the union of the mimic systems. And yet recall is below the raw union. We tentatively
conclude that the cumulative failure of the ParaMor mimic to emulate the original ParaMor segmentations on the
one hand, and the Morfessor mimic to emulate Morfessor on the other, drags down the recall (and precision) of
the joint mimic.
    Figure 4 also highlights the relative success of the 2008 joint ParaMor-Morfessor system. In particular, the
precision scores of the 2008 joint system are significantly above the precision scores of the joint systems we
submitted to the 2009 Challenge. The 2008 (and 2007) joint system did not form a single unified segmentation
for each word, but instead simply proposed the ParaMor analysis of each word alongside the Morfessor analy-
sis—as if each word were ambiguous between a ParaMor and a Morfessor analysis. The evaluation procedure of
Morpho Challenge performs a non-trivial averaging procedure over alternative segmentations of a word. We
believe it is a shortcoming of the Morpho Challenge evaluation procedure that it allows inflated precision scores
when disparate systems’ outputs are proposed as ‘alternative’ analyses.

5. The Next Steps

    Having now built a sound methodology for assigning confidence scores to ParaMor’s rule-based morpho-
logical segmentations, we will return our attention to improving ParaMor’s modeling of morphological structure.
Currently, the ParaMor algorithm cannot hypothesize that two distinct surface strings are simply allomorphs of
the same underlying morpheme. Hence, ParaMor is unable, for example, to conflate the surface forms of Turkish
and Finnish morphemes which differ only in their harmonized vowel. ParaMor is similarly limited in the types of
morphological operations that it can analyze. Currently, ParaMor searches only for suffixes. Broadening
ParaMor to analyze prefixation, infixation, and perhaps the templatic morphology of Semitic languages will
surely boost ParaMor’s recall scores.

Acknowledgements

   This research was supported in part by NSF Grant #IIS-0811745 and DOD/NGIA grant #HM1582-08-1-
0038. Any opinions, findings, conclusions or recommendations expressed in this publication are those of the
authors and do not necessarily reflect the views of the NSF or DOD.

References

Bernhard, Delphine. 2008. Simple Morpheme Labeling in Unsupervised Morpheme Analysis. Lecture Notes in
   Computer Science: 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Revised Selected
   Papers, Budapest, Hungary, Springer, 5152/2008, 873-880.
Collins, Michael. 2002. Discriminative training methods for hidden Markov models: Theory and experiments
   with perceptron algorithms. Proceedings of the Conference on Empirical Methods in Natural Language
   Processing (EMNLP).
Creutz, Mathias. 2006. Induction of the Morphology of Natural Language: Unsupervised Morpheme Segmenta-
   tion with Application to Automatic Speech Recognition. Ph.D. Thesis, Computer and Information Science,
   Report D13, Helsinki University of Technology, Espoo, Finland.
Goldsmith, John. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Lin-
   guistics, 27.2, 153-198.
Harris, Zellig. 1955. From Phoneme to Morpheme. Language, 31.2, 190-222, Reprinted in Harris (1970).
Harris, Zellig. 1970. Papers in Structural and Transformational Linguistics. D. Reidel, Dordrecht.
Hollingshead, Kristy, Seeger Fisher, and Brian Roark. 2005. Comparing and combining finite-state and context-
   free parsers. Proceedings of the Human Language Technology Conference and the Conference on Empirical
   Methods in Natural Language Processing (HLT/EMNLP), Vancouver, BC, Canada.
Kurimo, Mikko and Ville Turunen. 2008. Unsupervised Morpheme Analysis Evaluation by IR Experiments –
   Morpho Challenge 2008. Working Notes for the CLEF 2008 Workshop.
Monson, Christian. 2009. ParaMor: From Paradigm Structure to Natural Language Morphology Induction.
   Ph.D. Thesis, Language Technologies Institute, School of Computer Science, Carnegie Mellon University,
   Pittsburgh, PA, USA.
Oflazer, Kemal and İlknur Durgar El-Kahlout. 2007. Exploring Different Representational Units in English-to-
   Turkish Statistical Machine Translation. Statistical Machine Translation Workshop at ACL.
Poon, Hoifung, Colin Cherry, and Kristina Toutanova. 2009. Unsupervised Morphological Segmentation with
   Log-Linear Models. Human Language Technologies: The 2009 Annual Conference of the North American
   Chapter of the ACL.
Roark, Brian and Kristy Hollingshead. 2009. Linear Complexity Context-Free Parsing Pipelines via Chart Con-
   straints. Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the
   ACL.
Snyder, Benjamin and Regina Barzilay. 2008. Unsupervised Multilingual Learning for Morphological Segmen-
   tation. Proceedings of ACL-08: HLT.
Tjong Kim Sang, Eric F. 2002. Introduction to the CoNLL-2002 Shared Task: Language-Independent Named
   Entity Recognition. Proceedings of CoNLL-2002.
Tjong Kim Sang, Eric F. and Sabine Buchholz. 2000. Introduction to the CoNLL-2000 Shared Task: Chunking.
   Proceedings of the 4th Conference on Computational Natural Language Learning (CoNLL).
Trón, Viktor, György Gyepesi, Péter Halácsy, András Kornai, László Németh, and Dániel Varga. 2005. Hun-
   morph: Open Source Word Analysis. Proceedings of the ACL Workshop on Software.
Varga, Dániel, Péter Halácsy, András Kornai, László Németh, Viktor Trón, Tamás Váradi, Bálint Sass, Gergő
   Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy. 2009. Hunglish corpus.
   Accessed on August 18, 2009.
Xue, Nianwen. 2003. Chinese word segmentation as character tagging. Computational Linguistics and Chinese
   Language Processing, 8.1, 29-47.