<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards unsupervised induction of morphophonemic rules</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Erwin Chan University of Pennsylvania</institution>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The task we are investigating is unsupervised learning of natural language morphology
for inflectional languages. The target morphological grammar consists of a lexicon of
morphological base forms and transforms. A base form represents all inflections of
a lexeme, and all base forms of the same POS category share the same fine-grained
morphosyntactic type. Transforms are morphophonemic rewrite rules that convert
base forms to derived forms; each transform applies only to a specific
set of base forms.</p>
      <p>We have developed a greedy algorithm to induce such a grammar. At each
iteration, the algorithm hypothesizes suffixal transforms that convert between base
and derived forms of lexemes. It then chooses the transform that maximizes vocabulary
coverage while minimizing the number of conflicts, which arise when a word previously
found to be a derived form is proposed as a base form. After base forms and transforms
have been learned, a distributional clustering step assigns the base forms to POS classes.
In future work, the transforms will be converted to generalized rewrite rules by inducing
phonological characteristics common to the base forms.</p>
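      <p>The greedy selection step described above can be illustrated with a minimal sketch.
It is not the authors' implementation. It assumes that a transform is a pair of suffixes
(s1, s2) linking a base form stem + s1 to a derived form stem + s2, that coverage is counted
over word types, and that the score is the number of newly covered words minus the number of
conflicts. The names candidate_transforms, greedy_induce and max_suffix are hypothetical.</p>
      <preformatted><![CDATA[
```python
from collections import defaultdict


def candidate_transforms(vocab, max_suffix=3):
    """Collect hypothetical suffix transforms (s1, s2) and the word pairs they link."""
    suffixes_by_stem = defaultdict(set)
    for word in vocab:
        # Split each word into a non-empty stem and a suffix of up to max_suffix letters.
        for k in range(min(max_suffix, len(word) - 1) + 1):
            cut = len(word) - k
            suffixes_by_stem[word[:cut]].add(word[cut:])
    pairs = defaultdict(set)
    for stem, suffixes in suffixes_by_stem.items():
        for s1 in suffixes:
            for s2 in suffixes:
                if s1 != s2:
                    # Read (s1, s2) as: base = stem + s1, derived = stem + s2.
                    pairs[(s1, s2)].add((stem + s1, stem + s2))
    return pairs


def greedy_induce(vocab, iterations=2, max_suffix=3):
    """Greedily pick transforms by (newly covered words) minus (conflicts).

    The scoring function is an assumption made for illustration only.
    """
    candidates = candidate_transforms(vocab, max_suffix)
    bases, derived, chosen = set(), set(), []
    for _ in range(iterations):
        covered = bases | derived
        best, best_score = None, 0
        # Sorted iteration makes tie-breaking deterministic; it has no linguistic meaning.
        for transform in sorted(candidates):
            if transform in chosen:
                continue
            pairs = candidates[transform]
            new_words = {w for pair in pairs for w in pair} - covered
            # Conflict: the proposed base was already analysed as a derived form.
            conflicts = sum(1 for pair in pairs if pair[0] in derived)
            score = len(new_words) - conflicts
            if score > best_score:
                best, best_score = transform, score
        if best is None:
            break
        chosen.append(best)
        for base, form in candidates[best]:
            # Skip pairs that would contradict earlier decisions.
            if base not in derived and form not in bases:
                bases.add(base)
                derived.add(form)
    return bases, derived, chosen


if __name__ == "__main__":
    toy_vocab = {"walk", "walks", "walked", "talk", "talks", "talked",
                 "jump", "jumps", "jumped"}
    print(greedy_induce(toy_vocab))
```
]]></preformatted>
      <p>On the toy vocabulary in the sketch, the conflict penalty is what keeps a word such as
"walked", once analysed as a derived form, from being reused as the base of a later transform.</p>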
      <p>We have tested this algorithm on a version of the Penn Treebank annotated for
inflectional morphology. The algorithm achieves 71.7% recall and 92.9% precision on
inflectional relations in which both the base form and the derived form occur in the
corpus. We are currently testing the algorithm on other languages, and will present
results on the Morpho Challenge gold standards.</p>
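      <p>As a hedged illustration of how precision and recall over inflectional relations
might be computed, the sketch below treats a relation as a (base, derived) pair and scores
only gold pairs whose two forms both occur in the corpus vocabulary. The function name
relation_scores and the toy data are hypothetical; the exact evaluation procedure is not
specified here.</p>
      <preformatted><![CDATA[
```python
def relation_scores(predicted, gold, vocab):
    """Return (precision, recall) over (base, derived) pairs.

    Assumption: gold pairs count only if both forms occur in the corpus vocabulary.
    """
    gold_in_corpus = {(b, d) for b, d in gold if b in vocab and d in vocab}
    correct = predicted & gold_in_corpus
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold_in_corpus) if gold_in_corpus else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data for illustration only; these are not results from the experiments.
    toy_vocab = {"walk", "walks", "walked", "sing", "sang", "new", "news"}
    toy_gold = {("walk", "walks"), ("walk", "walked"),
                ("sing", "sang"), ("sing", "sings")}
    toy_predicted = {("walk", "walks"), ("walk", "walked"), ("new", "news")}
    print(relation_scores(toy_predicted, toy_gold, toy_vocab))
```
]]></preformatted>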
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>