<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>MorphoClass - Recognition and Morphological Classification of Unknown Words for German</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Preslav Nakov</string-name>
          <email>preslav@rocketmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Sofia University “St. Kliment Ohridski”</institution>
          ,
          <addr-line>Sofia</addr-line>
          ,
          <country country="BG">Bulgaria</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>A system for recognition and morphological classification of unknown words for German is described and evaluated. It takes raw text as input and outputs a list of the unknown nouns together with a hypothesis about their possible morphological class and stem. MorphoClass exploits global information (ending-guessing rules, maximum likelihood estimations, word frequency statistics), morphological properties (compounding, inflection, affixes) and external knowledge (lexicons, German grammar information etc.).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>SYSTEM OVERVIEW</title>
      <p>
        The MorphoClass system accepts raw text as input and produces a
list of unknown words together with hypotheses about their stem
and morphological class. We define the stem as the common part
shared by all inflected forms of the base, while the morphological
class describes both the gender of the word and the inflexion rules
it follows when it changes by case and number. Our
morphological classes follow those used in the DBR-MAT
project, a German-Bulgarian-Romanian machine translation project
(see [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]), and given in the Bulgarisch-Deutsch Wörterbuch (see [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]).
      </p>
      <p>MorphoClass solves the problem as a sequence of subtasks:
identification of unknown words, identification of nouns,
recognition and grouping of the inflected forms of the same word,
compound splitting, morphological analysis, proposal of a stem for
each group of inflected forms and, finally, production of a
hypothesis about the possible morphological class of each group
of words.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        Koskenniemi proposes a language-independent model for
both morphological analysis and generation called two-level
morphology and based on finite-state automata. It is
implemented in the KIMMO system (see [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]). Finkler and
Neumann follow a different approach using n-ary tries in the
MORPHIX system (see [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]). Lorenz developed Deutsche
Malaga-Morphologie as a system for the automatic word
form recognition for German based on Left-Associative
Grammar (see [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]). Kupiec uses pre-specified suffixes and
then learns statistically the POS predictions for unknown
word guessing (see [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]). The XEROX tagger comes with a
list of built-in ending-guessing rules (see [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]). Brill builds
more linguistically motivated rules by means of a tagged
corpus and a lexicon (see [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]). He looks not only at the
affixes but optionally also checks their POS class in a
lexicon. Mikheev proposes a similar approach that estimates
the rule predictions from raw text (see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). Daciuk uses
finite state transducers.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Unknown word tokens and types identification</title>
      <p>MorphoClass focuses on the identification and
morphological classification of nouns with unknown
stems. The first step is to process the text and
derive a list of the word types. We exploit the fact that German
nouns are always capitalised, regardless of their position in
the sentence. Capitalisation is discarded when deriving
the list, but it is still taken into account: for each word type we
collect three statistics, namely total frequency,
capitalised frequency and start-of-sentence frequency. These
are used to determine whether a certain word type could be an
(unknown) noun.</p>
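      <p>The three per-type statistics can be collected along the following lines. This is only a minimal sketch: the naive sentence splitter, the function names collect_stats and could_be_noun, and the decision rule itself are assumptions made for illustration and are not taken from the MorphoClass implementation.</p>
      <preformat preformat-type="code">
```python
import re
from collections import defaultdict


def collect_stats(text):
    """Return {lowercased word type: [total, capitalised, sentence_start]}."""
    stats = defaultdict(lambda: [0, 0, 0])
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    for sentence in re.split(r"[.!?]\s+", text):
        tokens = re.findall(r"[^\W\d_]+", sentence)  # runs of letters only
        for position, token in enumerate(tokens):
            entry = stats[token.lower()]  # capitalisation is discarded here
            entry[0] += 1                 # total frequency
            if token[0].isupper():
                entry[1] += 1             # capitalised frequency
            if position == 0:
                entry[2] += 1             # start-of-sentence frequency
    return dict(stats)


def could_be_noun(total, capitalised, sentence_start):
    """Hypothetical filter: the type is always capitalised, and at least one
    capitalised occurrence is not explained by sentence-initial position."""
    return capitalised == total and capitalised > sentence_start


text = "Der Hund bellt. Hund und Katze schlafen. Die Katze sieht den Hund."
stats = collect_stats(text)
nouns = sorted(w for w, counts in stats.items() if could_be_noun(*counts))
print(nouns)
```
      </preformat>
      <p>On this toy input only hund and katze pass the filter, because der and die are capitalised solely at the start of a sentence.</p>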
    </sec>
    <sec id="sec-4">
      <title>All possible stems generation</title>
      <p>We go through the word types and generate all the stems that
could be obtained by reversing every acceptable German inflexion
for the word type, taking into account the umlauts and the ß
alternations. For example, for the word Lehrerinnen the following
stems are generated (by removing -nen, -en, -n and ∅): Lehrerin,
Lehrerinn, Lehrerinne, Lehrerinnen. We do not impose any
limitations when generating a stem except that it must be
nonempty. The purpose of the stem generation process is both to
identify all the acceptable stems and to group the inflected forms of
the same word together.</p>
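      <p>A minimal sketch of stem generation and grouping is given below. The ENDINGS tuple is a hypothetical subset of German noun endings chosen for illustration, and the reversal of umlauts and ß alternations is left out, so this shows the idea rather than the full rule set used by the system.</p>
      <preformat preformat-type="code">
```python
from collections import defaultdict

# Hypothetical subset of noun endings; "" stands for the zero ending.
ENDINGS = ("nen", "en", "er", "es", "e", "n", "s", "")


def candidate_stems(word):
    """Strip every matching ending, keeping only non-empty stems."""
    stems = set()
    for ending in ENDINGS:
        if word.endswith(ending):
            stem = word[: len(word) - len(ending)]
            if stem:
                stems.add(stem)
    return stems


def group_by_stem(word_types):
    """Map each candidate stem to the set of word types that produced it."""
    groups = defaultdict(set)
    for word in word_types:
        for stem in candidate_stems(word):
            groups[stem].add(word)
    return dict(groups)


print(sorted(candidate_stems("Lehrerinnen")))
groups = group_by_stem(["Lehrerin", "Lehrerinnen"])
print(sorted(groups["Lehrerin"]))
```
      </preformat>
      <p>For Lehrerinnen the sketch yields the four stems listed in the text, and the candidate stem Lehrerin groups the forms Lehrerin and Lehrerinnen together.</p>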
    </sec>
    <sec id="sec-5">
      <title>Stem coverage checking and refinement</title>
      <p>We go through the stems and for each one we check whether there
exists a morphological class that could generate all the word forms it covers.
If at least one is found, we accept the current coverage;
otherwise we try to refine it in order to make it acceptable. It is
possible that a stem is generated by a set of words that it cannot
cover together. At this stage we are not
interested in whether the stem is really correct, but
only in whether it is compatible with all the word forms it covers
taken together.</p>
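      <p>The compatibility check can be illustrated as follows. The three entries in CLASSES are hypothetical toy classes described only by the endings they allow; the actual morphological classes also encode gender and umlaut behaviour, and the refinement step is not shown.</p>
      <preformat preformat-type="code">
```python
# Hypothetical toy classes: class label mapped to the endings it can attach.
CLASSES = {
    "f_nen": {"", "nen"},                 # e.g. Lehrerin, Lehrerinnen
    "m_s_e": {"", "s", "es", "e", "en"},  # e.g. Tag, Tages, Tage, Tagen
    "f_n": {"", "n"},                     # e.g. Katze, Katzen
}


def compatible_classes(stem, word_forms):
    """Return the classes able to generate every word form from the stem."""
    endings = set()
    for form in word_forms:
        if not form.startswith(stem):
            return set()  # umlaut alternations are ignored in this sketch
        endings.add(form[len(stem):])
    return {name for name, allowed in CLASSES.items()
            if endings.issubset(allowed)}


print(compatible_classes("Lehrerin", {"Lehrerin", "Lehrerinnen"}))
print(compatible_classes("Tag", {"Tages", "Tagnen"}))
```
      </preformat>
      <p>The first call finds one compatible class, so the coverage is accepted. The second call returns an empty set, which is the situation in which the coverage would have to be refined.</p>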
    </sec>
    <sec id="sec-6">
      <title>Morphological stem analysis</title>
      <p>
        Each stem generated in the previous step is analysed
morphologically in order to obtain additional information
that could impose useful constraints on the subsequent analysis. The
morphological analysis is both lexicon-based and
suffix-based. First, for each stem we check whether it is
present in our stem lexicon, which we built using the free lexicon of
the Morphy system (see [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]). If so, we reject it, since an unknown
word cannot have a known stem: all the words that the known stems
could generate are already known. Second, we check whether the
stem could be a compound by trying to split it in such a way that all its
parts are found in the lexicon. In case of success we know its
morphological class: it is determined by the last word the
compound is made of. Third, we try to guess the class by looking at
the stem ending. We implemented Mikheev-like ending-guessing
rules (see [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]). We selected a confidence level of 90% and considered
endings up to 7 characters long that are preceded by at least 3
characters and whose frequency is at least 10. We trained the
model over 8.5 MB of raw text and obtained 1789 rules.
      </p>
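      <p>The compound check described above can be sketched as a recursive split against a lexicon. The toy LEXICON, its class labels and the min_part threshold are assumptions for illustration; linking elements such as -s- or -n- between the parts are not handled here.</p>
      <preformat preformat-type="code">
```python
# Hypothetical toy lexicon: known stem mapped to a morphological class label.
LEXICON = {"haus": "n1", "tür": "f1", "schlüssel": "m1", "garten": "m2"}


def split_compound(stem, min_part=3):
    """Return one split of the stem into two or more lexicon entries, or None."""
    stem = stem.lower()

    def parts_from(start):
        if start == len(stem):
            return []  # the whole stem has been consumed
        for end in range(start + min_part, len(stem) + 1):
            if stem[start:end] in LEXICON:
                rest = parts_from(end)
                if rest is not None:
                    return [stem[start:end]] + rest
        return None    # no lexicon entry leads to a complete split

    parts = parts_from(0)
    return parts if parts and len(parts) >= 2 else None


def compound_class(stem):
    """The last component determines the class of the whole compound."""
    parts = split_compound(stem)
    return LEXICON[parts[-1]] if parts else None


print(split_compound("Haustürschlüssel"))
print(compound_class("Haustürschlüssel"))
print(compound_class("Hausboot"))
```
      </preformat>
      <p>With this lexicon Haustürschlüssel splits into haus, tür and schlüssel and inherits the class of schlüssel, while Hausboot cannot be split and would be passed on to the ending-guessing rules.</p>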
    </sec>
    <sec id="sec-7">
      <title>Word types clusterisation (stem coverage)</title>
      <p>After the stem refinement step we are sure that each stem is
compatible with the word types it is supposed to cover and that
there exists at least one morphological class that could generate
them all given the stem. The morphological analysis then provided
additional information about the stems. We thus have a complex structure, which
we can think of as a bipartite graph whose vertices are
either stems or word types and in which each edge links a stem to a word
type it is supposed to cover. Our goal is to select some of the stems,
thus producing a stem coverage of the word types. We select
the stems in such a way that:</p>
      <p>Each word is covered by exactly one stem.</p>
      <p>Each stem covers as many word types as possible.</p>
      <p>When the sets of covered word types are equal, the stem with the more
reliable morphological information is selected. We prefer stems
recognised as compounds, then those analysed using
ending-guessing rules and then all the rest.</p>
      <p>All else being equal, a longer stem is preferred.</p>
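      <p>One simple way to approximate these criteria is a greedy pass over the stems, ranked by number of covered word types, then by reliability of the analysis, then by length. The sketch below is an assumed heuristic, not the selection procedure of the system itself; the names select_stems and RELIABILITY are hypothetical.</p>
      <preformat preformat-type="code">
```python
# Reliability ranks, higher is better: compound, then ending rule, then other.
RELIABILITY = {"compound": 2, "ending": 1, "other": 0}


def select_stems(coverage, analysis):
    """Greedily choose stems so that each word type gets exactly one stem.

    coverage: {stem: set of word types it can cover}
    analysis: {stem: "compound", "ending" or "other"}
    """
    ranked = sorted(
        coverage,
        key=lambda s: (len(coverage[s]), RELIABILITY[analysis[s]], len(s)),
        reverse=True,
    )
    assigned = {}  # word type mapped to the chosen stem
    for stem in ranked:
        # Simplification: take a stem only if none of its word types is taken.
        if coverage[stem].isdisjoint(assigned):
            for word in coverage[stem]:
                assigned[word] = stem
    return assigned


coverage = {
    "Lehrerin": {"Lehrerin", "Lehrerinnen"},
    "Lehrerinn": {"Lehrerinnen"},
    "Lehreri": {"Lehrerin"},
}
analysis = {"Lehrerin": "ending", "Lehrerinn": "other", "Lehreri": "other"}
print(select_stems(coverage, analysis))
```
      </preformat>
      <p>Here the stem Lehrerin is chosen for both word types because it covers the most forms. A greedy pass of this kind can leave some word types unassigned, so it should be read as an approximation of the criteria only.</p>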
    </sec>
    <sec id="sec-8">
      <title>EVALUATION</title>
      <p>The MorphoClass system was evaluated on an 85 KB
German literary text: Erzählungen by Franz Kafka. We found
3510 different word forms: 862 known nouns, 2155 known
non-nouns and 493 unknown nouns. The evaluation was
performed manually on a quarter of the stems. We considered
120 stems and classified them into the following categories (counts
in parentheses):</p>
      <p>SET (12) — A set of classes has been assigned rather than a
single one.</p>
      <p>PART (7) — MorphoClass discovered a correct class but not
all the correct classes.</p>
      <p>WRONG (18) — MorphoClass assigned a single class but it
was wrong.</p>
      <p>YES (72) — MorphoClass assigned a single class and it was
the only correct one.</p>
      <p>SKIP (11) — The stem has been skipped. We did so for the
proper nouns, incorrect stems etc.</p>
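      <p>We evaluated the system in terms of precision and coverage.
Coverage is the proportion of the stems whose
morphological class has been found (a single class rather than a set
has been assigned), while precision shows
how correct that class was. Partially correct answers are scored in
three different ways: if
a stem belongs to k (k≥2) classes and MorphoClass found one of
them (it finds exactly one), then precision1 counts it as a failure
(adds 0), precision2 counts it as a partial success (adds
scaled_PART = 1/k) and precision3 counts it as a full success (adds 1).
In the formula for precision2 the sum runs over the PART stems.
        <disp-formula><tex-math>\mathrm{precision1} = \frac{\mathrm{YES}}{\mathrm{YES} + \mathrm{WRONG} + \mathrm{PART}}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{precision2} = \frac{\mathrm{YES} + \sum_{\mathrm{PART}} 1/k}{\mathrm{YES} + \mathrm{WRONG} + \mathrm{PART}}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{precision3} = \frac{\mathrm{YES} + \mathrm{PART}}{\mathrm{YES} + \mathrm{WRONG} + \mathrm{PART}}</tex-math></disp-formula>
        <disp-formula><tex-math>\mathrm{coverage} = \frac{\mathrm{YES} + \mathrm{WRONG} + \mathrm{PART}}{\mathrm{YES} + \mathrm{WRONG} + \mathrm{PART} + \mathrm{SET}}</tex-math></disp-formula>
      </p>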
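      <p>As a worked check, the measures can be computed directly from the category counts listed above (YES = 72, WRONG = 18, PART = 7, SET = 12). The function name evaluate and its parameters are hypothetical. The scaled sum needed for precision2 depends on the individual k values, which are not listed, so it is left as an optional argument.</p>
      <preformat preformat-type="code">
```python
def evaluate(yes, wrong, part, set_, scaled_part=None):
    """Compute coverage and the precision variants from category counts.

    scaled_part is the sum of 1/k over the PART stems; it is optional
    because the individual k values are needed to compute it.
    """
    decided = yes + wrong + part  # stems that received a single class
    result = {
        "coverage": decided / (decided + set_),
        "precision1": yes / decided,
        "precision3": (yes + part) / decided,
    }
    if scaled_part is not None:
        result["precision2"] = (yes + scaled_part) / decided
    return result


# Category counts from the evaluation; SKIP stems are excluded.
scores = evaluate(yes=72, wrong=18, part=7, set_=12)
for name, value in scores.items():
    print(f"{name}: {100 * value:.2f}%")
```
      </preformat>
      <p>The script prints a coverage of 88.99%, a precision1 of 74.23% and a precision3 of 81.44%, which agrees with the overall figures reported for the system.</p>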
      <p>The MorphoClass system performs the morphological analysis
using both compound splitting and ending-guessing
rules. These are run as a cascade: the ending-guessing rules are
applied only if the compound splitting rules fail. Not surprisingly,
the compound splitting rules gave a high precision of 93.62% (no
partial matching: all the rules considered predicted just one class
even when more than one splitting was possible) with a coverage of
43.12%. These results give an idea of how often compound
nouns occur in German. Another 45.87% of the stems were
covered by the ending-guessing rules. Their precision was much
lower: 56% for precision1 and 70% for precision3. This gave us an
overall system coverage of 88.99% and a precision of 74.23%,
76.08% and 81.44% for precision1, precision2 and precision3
respectively (see Table 1).</p>
    </sec>
    <sec id="sec-9">
      <title>CONCLUSIONS AND FUTURE WORK</title>
      <p>We use very simple rules only, without exploiting any context
information and most of the unknown nouns’ stems have just one
(possibly inflected) noun form. A similar approach could be
applied to other inflectional languages and to other important
open-class POS such as adjectives, verbs and adverbs. Obviously, this
will not be straightforward but most of the steps could be applied
with almost no changes.</p>
    </sec>
    <sec id="sec-10">
      <title>ACKNOWLEDGEMENTS</title>
      <p>I am very grateful to Prof. Galia Angelova, Prof. Walther von Hahn
and Ingo Schröder for their valuable suggestions and discussions.
Special thanks to Prof. Galia Angelova for her strong support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>von Hahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Angelova</surname>
          </string-name>
          .
          <article-title>Combining Terminology, Lexical Semantics and Knowledge Representation in Machine Aided Translation</article-title>
          . In: TKE'96: Terminology and Knowledge Engineering.
          <source>Proceedings of the Conference "Terminology and Knowledge Engineering"</source>
          ,
          <year>August 1996</year>
          , Vienna, Austria. pp.
          <fpage>304</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E.</given-names>
            <surname>Dietmar</surname>
          </string-name>
          and
          <string-name>
            <given-names>H.</given-names>
            <surname>Walter</surname>
          </string-name>
          .
          <article-title>Bulgarisch-Deutsch Wörterbuch</article-title>
          . VEB Verlag Enzyklopädie Leipzig,
          <year>1987</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Koskenniemi</surname>
          </string-name>
          .
          <article-title>Two-level model for morphological analysis</article-title>
          .
          In
          <source>IJCAI 1983</source>
          , pp.
          <fpage>683</fpage>
          -
          <lpage>685</lpage>
          , Karlsruhe,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Finkler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neumann</surname>
          </string-name>
          .
          <article-title>MORPHIX. A Fast Realization of a Classification-Based Approach to Morphology</article-title>
          . In: Trost, H. (ed.):
          <source>4. Österreichische Artificial-Intelligence-Tagung. Wiener Workshop - Wissensbasierte Sprachverarbeitung. Proceedings</source>
          . Berlin etc. pp.
          <fpage>11</fpage>
          -
          <lpage>19</lpage>
          , Springer,
          <year>1988</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          .
          <article-title>Automatische Wortformenerkennung für das Deutsche im Rahmen von Malaga</article-title>
          . Magisterarbeit.
          <institution>Friedrich-Alexander-Universität Erlangen-Nürnberg</institution>
          , Abteilung für Computerlinguistik.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kupiec</surname>
          </string-name>
          .
          <article-title>Robust part-of-speech tagging using a hidden Markov model</article-title>
          .
          <source>Computer Speech and Language</source>
          ,
          <volume>6</volume>
          (
          <issue>3</issue>
          ), pp.
          <fpage>225</fpage>
          -
          <lpage>242</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Cutting</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kupiec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pedersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sibun</surname>
          </string-name>
          .
          <article-title>A practical part-of-speech tagger</article-title>
          .
          <source>Proceedings of the Third Conference on Applied Natural Language Processing (ANLP-92)</source>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>140</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>E.</given-names>
            <surname>Brill</surname>
          </string-name>
          .
          <article-title>Transformation-based error-driven learning and natural language processing: a case study in part-of-speech tagging</article-title>
          .
          In
          <source>Computational Linguistics</source>
          ,
          <volume>21</volume>
          (
          <issue>4</issue>
          ):
          <fpage>543</fpage>
          -
          <lpage>565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikheev</surname>
          </string-name>
          .
          <article-title>Automatic Rule Induction for Unknown Word Guessing</article-title>
          .
          In
          <source>Computational Linguistics</source>
          vol
          <volume>23</volume>
          (
          <issue>3</issue>
          ), ACL
          <year>1997</year>
          . pp.
          <fpage>405</fpage>
          -
          <lpage>423</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lezius</surname>
          </string-name>
          .
          <article-title>Morphy - German Morphology, Part-of-Speech Tagging and Applications</article-title>
          . In Ulrich Heid; Stefan Evert; Egbert Lehmann and Christian Rohrer, editors,
          <source>Proceedings of the 9th EURALEX International Congress</source>
          , pp.
          <fpage>619</fpage>
          -
          <lpage>623</lpage>
          , Stuttgart, Germany.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>