<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LOHAI: Providing a baseline for KOS based automatic indexing</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Kai Eckert</string-name>
          <email>eckert@bib.uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim University Library Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2011</year>
      </pub-date>
      <fpage>121</fpage>
      <lpage>130</lpage>
      <abstract>
        <p>Automatic KOS-based indexing (i.e. indexing based on a restricted, controlled vocabulary, a thesaurus or a classification) can play an important role in closing the gap between the intellectually indexed, high-quality publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications, like web publications, simple processes that do not rely on manually created training data are needed. With this contribution, we propose a straightforward linguistic indexer that can be used as a basis for one's own developments and for experiments and analyses to explore one's own documents and KOSs; it uses state-of-the-art information retrieval techniques and hence forms a suitable baseline for evaluations. Finally, it is free and open source.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Intellectual indexing of publications based on a Knowledge Organization System (KOS), such as controlled vocabularies, thesauri or classifications, is still performed to ensure high accuracy in information retrieval. Even with the ability to search the electronic full text, the resolution of synonyms and homonyms by the introduction of a controlled vocabulary, functioning as a common language between creators and searchers of the indexed content, is very important. To close the gap between the subset of publications that are traditionally indexed intellectually (books in libraries, but also selected journal articles in mostly commercial databases) and the mass of unindexed publications, automatic indexing approaches are widely introduced.</p>
      <p>
        The German National Library, for instance, decided that web publications, while being collected, will not be indexed intellectually, but only by means of automatic processes and search engine technology [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A recent workshop focused on automatic indexing, called PETRUS, showed that there are mainly two types of approaches: linguistic and statistical ones. While there are smooth transitions between both, linguistic approaches use techniques from natural language processing (NLP) to process the texts and extract meaningful concepts, whereas statistical approaches use machine learning techniques to assign concepts based on a manually created training set.
      </p>
      <p>Based on the discussions and contributions of the workshop, there is currently a preference for statistical approaches, although the reported quality of the results varies. One mentioned problem was the bias that is introduced by the training set. For example, for recent news articles, the indexer learned that the occurrence of "nuclear power plant" should lead to "Japan" as a concept to be assigned. More general is the observation that the indexing quality relies on the homogeneity of the documents to be indexed. If they vary greatly regarding content, style or even length, the quality of the indexing result is affected.</p>
      <p>In the Semantic Web, heterogeneous contents are the rule rather than the exception. A lot of different KOSs are widely used to describe all kinds of resources, but for real semantic interoperability, we have to be able to match the concepts between different KOSs or to quickly assign concepts of a KOS to a new resource. Albeit with inferior quality, such a bridging is needed to connect all kinds of resources, especially in the area of libraries, archives and museums.</p>
      <p>
        With Maui [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], there exists a statistical indexer that incorporates a lot of NLP techniques, but to the best of our knowledge there is no free and open source implementation of a strictly linguistic indexer that can be used without any training data on arbitrary documents. Especially for the evaluation of "real" automatic indexers, such a simple implementation is useful. There are a lot of additional scenarios where this indexer can be used, be it for experimental services or whenever more sophisticated approaches are just not needed. And of course, as a reasonable baseline, more sophisticated approaches have to outperform it in the first place.
      </p>
      <p>In this paper, we present LOHAI, a strictly linguistic indexer that incorporates the techniques that are state of the art in information retrieval. The development of LOHAI is led by the following motivational thoughts:
Simplicity over quality: While every single step could be improved or replaced by a more sophisticated technique that is already developed and published somewhere, we tried to develop everything as simply as possible. Everything should be easy to use, easy to understand and easy to improve if needed.</p>
      <p>Knowledge-poor and without any training: To be usable for arbitrary KOSs and documents, the indexer cannot rely on any additional knowledge sources; the KOS itself, of course, can and will be used. The indexer must not employ a training step, as there are many settings where no preindexed documents are available and the creation of a training set would be too cumbersome for the user.</p>
      <p>With these prerequisites in mind, we compose the indexer as a pipeline with several components, as illustrated in Figure 1.</p>
      <sec id="sec-1-1">
        <title>2 E.g. by means of SKOS, http://www.w3.org/2004/02/skos/.</title>
        <p>
          3 LOHAI is pronounced like "Low-High" and stands for LOw HAnging fruits Automatic Indexer, which gives a brief summary of the development process.
1. Part-of-speech tagging: We use the Stanford Log-linear Part-Of-Speech Tagger, as described by Toutanova et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Part-of-speech (POS) tagging simply means the identification of nouns, verbs, adjectives and other word types in a text. To avoid wrong concept assignments, like the assignment of the concept "need" (as a noun in the sense of requirements) whenever the verb "to need" is used, we only consider nouns (NN, NNP, NNPS, NNS), adjectives (JJ, JJR, JJS), foreign words (FW) and unknown words (untagged).
2. Tokenization: The tokenization splits the text into single terms and is performed together with the POS tagging. The result is a list of terms that are further investigated for proper concept assignments. The tokenization step also includes a cleaning of the terms, where everything is truncated that is not a letter, a hyphen or a space. Note that numbers are truncated, too, as they usually carry no meaning and are generally highly ambiguous. In some domains, such as history or chemistry, this would not be desired.
3. Stemming: Finally, the single terms are stemmed, i.e. they are reduced to their stem. That way, identical terms can be matched even if they appear in different grammatical forms, like "banks" and "bank". We use the English (Porter2) stemming algorithm of the Snowball stemmer [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
4. KOS preparation: This is only performed once per KOS. All concept labels are stemmed by means of the same stemmer that is employed on the document texts. An index is created that maps the single stems to the corresponding concepts. Additionally, an index of stemmed label parts is created that is used for the identification of compound terms. For instance, "insurance market" would be stemmed to "insur market", mapping to the corresponding concept; additionally, both stem parts are indexed and mapped to the stemmed compound term.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>4 http://nlp.stanford.edu/software/tagger.shtml</title>
        <p>
          5 Tag definitions according to the Penn Treebank tag set [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>The preprocessing uses only freely available standard approaches. The POS tagging and the stemming are language dependent; both algorithms employed are implemented for various languages, including English and German. We assume that both the KOS and the documents are in the same language and that only one language is used within a document, so that the appropriate implementations can be used. If the KOS is multilingual and the documents use different languages, an additional language detection step has to be employed.</p>
        <p>After the preprocessing steps, the actual concept assignment and weighting takes place, as described in the next section.</p>
      </sec>
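      <p>The KOS preparation step (4) can be sketched as follows. This is a minimal illustration, not LOHAI's actual code: the Snowball (Porter2) stemmer is replaced by a trivial suffix-stripping placeholder, and all class and method names are assumptions made for this sketch.</p>
      <preformat><![CDATA[
```java
import java.util.*;

// Sketch of the KOS preparation step: concept labels are stemmed and
// indexed, and the stem parts of compound labels are indexed separately
// so that compound terms can be recognized in a text later on.
public class KosIndex {
    // Placeholder for the Snowball (Porter2) stemmer used by LOHAI.
    static String stem(String term) {
        return term.toLowerCase().replaceAll("(ance|ment|s)$", "");
    }

    final Map<String, Set<String>> stemToConcepts = new HashMap<>();
    final Map<String, Set<String>> partToCompounds = new HashMap<>();

    void addLabel(String concept, String label) {
        String[] parts = label.split("\\s+");
        StringBuilder stemmed = new StringBuilder();
        for (String part : parts) {
            if (stemmed.length() > 0) stemmed.append(' ');
            stemmed.append(stem(part));
        }
        String stemmedLabel = stemmed.toString();
        // map the stemmed label to the concept it belongs to
        stemToConcepts.computeIfAbsent(stemmedLabel, k -> new TreeSet<>()).add(concept);
        if (parts.length > 1) {
            // index each stem part of a compound label, mapping back to the compound
            for (String part : parts) {
                partToCompounds.computeIfAbsent(stem(part), k -> new TreeSet<>()).add(stemmedLabel);
            }
        }
    }
}
```
]]></preformat>
      <p>With this sketch, the label "Insurance market" is reduced to "insur market", and both "insur" and "market" point back to that compound, mirroring the example in step 4.</p>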
    </sec>
    <sec id="sec-2">
      <title>Concept assignment with compound term detection</title>
      <p>The general assignment strategy is a pure string-based matching: If a stem that belongs to a concept in the KOS appears in the stems extracted from the text, the concept is assigned. In this step, we consider every concept a matching concept that contains a label that has the same stem, or that contains the same stem in the case of a compound term. Under this assumption, we have to deal with three possibilities that would consequently lead to wrong assignments:
1. A stem could belong to several concepts, including compound term concepts, e.g. "insur", which belongs, among others, to "insurance" and "insurance market".
2. A stem could belong to several concepts that have different labels with the same stem (overstemming), like "nationalism", "nationality" and "nation".
3. A stem could belong to several concepts that have the same labels with the same stem (homonyms), like "bank" (the financial institution) and "bank" (a raised portion of seabed or sloping ground along the edge of a stream, river, or lake).</p>
      <p>Approaches to handle the latter two variants are described in the next section; the first variant is dealt with directly in the assignment phase: The basic assumption is that we want to assign the most specific concept, i.e. in the above example, we would like to assign "insurance market", but not "market".</p>
      <p>We implement this as follows: Whenever a stem is recognized as a potential part of a compound term, the stem is temporarily stored in a list. When a stem is found that cannot be part of a compound, the list is analyzed for contained concepts. In this step, the algorithm simply checks, for every starting stem, whether any chain of stems corresponds to a compound concept. The algorithm starts with the longest possible chain and stops if a compound is found, thus avoiding the assignment of additional concepts contained in the compound. With this approach, the algorithm generally has a linear runtime with respect to the words contained in the text. Only the parts that potentially contain compounds have to be further analysed, with a runtime of O(n²), where n denotes the number of words within such a part.</p>
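      <p>The longest-chain-first scan described above can be sketched as follows, assuming the buffered stems and the set of stemmed compound labels from the KOS preparation step are already available; the class and method names are illustrative, not LOHAI's actual API.</p>
      <preformat><![CDATA[
```java
import java.util.*;

// Sketch of compound term detection: for a buffered run of stems that
// could form compounds, try the longest chain first for every start
// position and stop as soon as a compound concept matches, so that
// concepts contained in the compound are not assigned additionally.
public class CompoundMatcher {
    final Set<String> compoundLabels; // stemmed compound labels from the KOS

    CompoundMatcher(Set<String> compoundLabels) {
        this.compoundLabels = compoundLabels;
    }

    List<String> match(List<String> stems) {
        List<String> assigned = new ArrayList<>();
        int i = 0;
        while (i < stems.size()) {
            boolean found = false;
            // longest possible chain first, down to chains of length 2
            for (int len = stems.size() - i; len >= 2 && !found; len--) {
                String chain = String.join(" ", stems.subList(i, i + len));
                if (compoundLabels.contains(chain)) {
                    assigned.add(chain);
                    i += len; // skip past the matched compound
                    found = true;
                }
            }
            if (!found) i++; // no compound starts at this position
        }
        return assigned;
    }
}
```
]]></preformat>
      <p>For the buffered stems "insur market", the matcher assigns the compound "insur market" and never the single stem "market" on its own, as intended.</p>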
    </sec>
    <sec id="sec-3">
      <title>Unstemming and word-sense disambiguation</title>
      <p>Whenever one or more stems could be assigned to more than one concept, we would like to identify a single concept as the correct one in the given context. This task is generally denoted as word-sense disambiguation (WSD). We use two different approaches for WSD, the first being a specific check that tackles the problem of overstemming mentioned above. If this step is not able to disambiguate the potential concepts, the actual WSD is performed.
Unstemming. Overstemming, the reduction of two different terms to the same stem, leads to ambiguous stems that have to be disambiguated during indexing. Consequently, we first unstem the stem, i.e. we go back to the original, unstemmed form of the term, as found in the text. If the unstemmed term corresponds directly to an unstemmed label of a concept, we assign this concept. If there is only one such concept, we finish the WSD step. Otherwise, we continue with the actual WSD, as described in the following.</p>
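      <p>The unstemming check can be sketched as follows: the surface form found in the text is compared against the unstemmed labels of all candidate concepts, and the ambiguity is resolved directly when exactly one candidate matches. All names here are illustrative assumptions, not LOHAI's actual API.</p>
      <preformat><![CDATA[
```java
import java.util.*;

// Sketch of the unstemming check: if a stem is ambiguous because of
// overstemming, go back to the surface form found in the text and keep
// only candidate concepts whose unstemmed label matches it exactly.
public class Unstemmer {
    // Returns the single matching concept, or null if the ambiguity
    // remains and the actual WSD has to take over.
    static String resolve(String surfaceForm, Map<String, String> candidateLabels) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String> e : candidateLabels.entrySet()) {
            if (e.getValue().equalsIgnoreCase(surfaceForm)) {
                matches.add(e.getKey()); // candidate label equals the text form
            }
        }
        return matches.size() == 1 ? matches.get(0) : null;
    }
}
```
]]></preformat>
      <p>For the surface form "nationalism" and the candidates "Nation", "Nationalism" and "Nationality" (all stemmed to "nation" by an overstemming stemmer), only the second label matches, so the check resolves the stem without invoking the Jaccard-based WSD.</p>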
      <p>KOS based word-sense disambiguation. Word-sense disambiguation is a broad field in the area of natural language processing. Leaving the technical issues of overstemming aside, it generally consists of the task of determining the correct sense of a word that appears in a particular context. The variety of possible senses is often based on some background knowledge, like a thesaurus or another type of KOS. As Manning and Schuetze [
        <xref ref-type="bibr" rid="ref2">2</xref>
        , pp. 229 f.] pointed out, this can be unsatisfactory from a scientific or philosophical point of view, as the definitions in the background knowledge are often quite arbitrary and possibly not sufficient to describe the actual sense of a word in a given context. However, our goal is not the perfect assignment of a sense to a word; our goal is the assignment of the best fitting concept in the KOS.</p>
      <p>
        WSD approaches can be divided into supervised and unsupervised approaches, and additionally into knowledge-rich and knowledge-poor approaches [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In our setting, we clearly need an approach that is unsupervised, as it has to work without any previously tagged texts, and knowledge-rich, as we clearly have a KOS at hand and of course want to use it to improve the disambiguation quality.
      </p>
      <p>
        A supervised, knowledge-rich approach would be the adaptive thesaurus-based disambiguation, as presented by Yarowsky [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], where a Bayes classifier is trained on a large document corpus and thus probabilities for the occurrence of specific words in the context of a specific sense are determined.
      </p>
      <p>
        Yarowsky [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] also proposed an (almost) unsupervised approach that makes use of two assumptions:
One sense per collocation. We assume that words collocated with the word to be disambiguated are unique for the correct sense and would not be collocated with the word for other senses. This basically is the rationale for using the context of a word, usually a window of words before and after the word in question, for disambiguation.
      </p>
      <p>One sense per discourse. We assume that only one sense for a given word
is used throughout a whole document. With this assumption, we can make
use of any occurrence of the word in the text and thus get a more stable
disambiguation result.</p>
      <p>
        Both assumptions have been examined and verified [
        <xref ref-type="bibr" rid="ref1 ref12">1, 12</xref>
        ]. However, Yarowsky's approach is not completely unsupervised, as a small set of pretagged senses is needed as seed; we therefore only make use of the two assumptions, but use a much simpler approach: word-sense disambiguation based on a Jaccard comparison (cf. Ramakrishnan et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]).
      </p>
      <p>For this comparison, we define two sets of words: W as the context of an occurrence of the ambiguous word w, and C as the context of a candidate concept c, respectively. We then compute the Jaccard measure as follows:
Jaccard(W, C) = |W ∩ C| / |W ∪ C|   (1)</p>
      <p>Based on the assumption "one sense per discourse", we assign to each occurrence of w the concept c that was most often assigned in the document, i.e. that received the highest Jaccard value in most cases. If only abstracts are available for indexing, this procedure can be further simplified by just taking the whole abstract as the context for each occurrence of w, which leads to the direct assignment of the concept c with the highest Jaccard value.</p>
      <p>As the context of an ambiguous word w, we either define a window of 100 words before and after the word or just use the whole document in the case of short texts like abstracts. The context of a concept c is defined as the union of all labels of the concept, its direct child concepts, its parent concepts and the direct children of the parent concepts, i.e. its siblings. Other definitions are of course possible, for example weighting words and labels depending on the distance to the word or concept, but for our purpose as part of a simple baseline indexer, our approach is sufficient.</p>
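      <p>The Jaccard comparison of formula (1) and the selection of the best fitting candidate can be written compactly; in this minimal sketch the contexts are plain word sets, and all class and variable names are illustrative assumptions rather than LOHAI's actual API.</p>
      <preformat><![CDATA[
```java
import java.util.*;

// Sketch of the Jaccard-based word-sense disambiguation: the context W
// of an ambiguous word is compared to the context C of every candidate
// concept, and the candidate with the highest Jaccard value is chosen.
public class JaccardWsd {
    // Jaccard(W, C) = |W intersect C| / |W union C|
    static double jaccard(Set<String> w, Set<String> c) {
        Set<String> inter = new HashSet<>(w);
        inter.retainAll(c);
        Set<String> union = new HashSet<>(w);
        union.addAll(c);
        return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
    }

    // Pick the candidate concept whose context best matches the word context.
    static String disambiguate(Set<String> wordContext, Map<String, Set<String>> candidates) {
        String best = null;
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : candidates.entrySet()) {
            double score = jaccard(wordContext, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }
}
```
]]></preformat>
      <p>For a context such as {money, loan, interest}, the candidate "bank" in the financial sense (whose concept context contains, say, "money" and "loan") wins over the riverbank sense, illustrating the "one sense per collocation" rationale.</p>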
    </sec>
    <sec id="sec-4">
      <title>Weighting</title>
      <p>The last step in the indexing pipeline is the weighting of the assigned concepts. As the baseline indexer so far assigns every concept that can be identified by an occurring word, the weighting of these concepts is vitally important to determine which concepts are important and descriptive for the given text and which concepts are only marginally touched. It is also desirable to give concepts a higher weight when they are not used in the majority of documents, because concepts that do appear in most documents usually only denote common terms and are not important for the indexing result.</p>
      <p>The common approach for this kind of weighting is tf-idf, which is based on the term frequency tf_{c,d} of a term (in our case a concept c) in a given document d, and on the document frequency df_c of a concept c, i.e. the number of documents in which the concept appears:
w(c, d) = tf_{c,d} · log(D / df_c)   (2)
Here, D denotes the total number of documents in the indexed corpus. The last factor is called the inverse document frequency (idf), as the overall weight becomes smaller the higher df_c is.</p>
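      <p>Formula (2) corresponds to the standard tf-idf computation, which can be sketched as follows; the representation of documents as concept lists and all names are assumptions made for this illustration.</p>
      <preformat><![CDATA[
```java
import java.util.*;

// Sketch of the tf-idf weighting of assigned concepts: each document is
// represented as a list of assigned concepts; df counts the documents a
// concept appears in, and w(c, d) = tf(c, d) * log(D / df(c)).
public class ConceptWeighting {
    static double weight(String concept, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(concept::equals).count();   // tf(c, d)
        long df = corpus.stream().filter(d -> d.contains(concept)).count(); // df(c)
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);        // tf * log(D / df)
    }
}
```
]]></preformat>
      <p>A concept that appears in every document of the corpus receives the weight 0, since log(D/D) = 0; this is exactly the desired demotion of common terms described above.</p>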
    </sec>
    <sec id="sec-5">
      <title>Results</title>
      <p>To show the weaknesses and strengths of LOHAI, we investigate some of the indexing results. For our experiments, we used the German STW Thesaurus for Economics. A concept in the STW consists of preferred and alternative labels, both in English and in German. For example, there is the concept "Migration theory" with the alternative labels "Economics of migration" and "Theory of migration".</p>
      <p>Figure 2 shows an example abstract that we indexed. LOHAI produces the output shown in Figure 3. Additionally, we listed the concepts intellectually assigned by a librarian. It can easily be seen that the characteristics of the results are quite different. But if one takes the weighting into account, it can be seen that there are no wrong assignments with a weight above 0.3. Below that threshold, there are especially common terms that form a concept in the thesaurus and that are either not helpful or wrongly assigned, such as "Exchange". "Government", for example, seems to be correct, but is rather a coincidence, as it is assigned due to the verb "govern" in the text, an indication of a mistake during the POS tagging. On the other hand, the very abstract concepts that are assigned by the librarian (besides "Theory") are not found by LOHAI, as the terms do not directly appear in the text in some form.</p>
      <p>All in all, the results are very promising, even with a relatively simple approach like ours. Most assignments are correct, even if a human indexer would not assign all of them. The indexing quality correlates with the employed weighting; especially, assignments with lower rank often contain more common concepts that sometimes are just wrong. Many of these mistakes could be avoided if the thesaurus were more precise about homonyms and provided additional information to disambiguate them when necessary. The indexer could be further improved, e.g. common concepts should not be assigned if more specific concepts further down the tree are found in the text (like "Law" and "Contract Law" above). On the other hand, we wanted to keep it simple. Such adaptations and improvements are easy to implement, if they are needed. A quantitative evaluation of the results by comparing them to a manually created gold standard is still missing, but former experiments [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] with a comparable indexer showed that such results are nevertheless not very meaningful due to the very different characteristics of such an automatic approach and a trained librarian.</p>
      <sec id="sec-5-1">
        <title>6 Standard Thesaurus Wirtschaft, http://zbw.eu/stw/versions/latest/about.en.html. The thesaurus is published and maintained by the German National Library of Economics (Deutsche Zentralbibliothek fuer Wirtschaftswissenschaften, ZBW).</title>
        <p>Title: Contractarianism: Wistful Thinking
Authors: Hardin, Russell
Abstract: The contract metaphor in political and moral theory is misguided. It is a poor metaphor both descriptively and normatively, but here I address its normative problems. Normatively, contractarianism is supposed to give justifications for political institutions and for moral rules, just as contracting in the law is supposed to give justification for claims of obligation based on consent or agreement. This metaphorical association fails for several reasons. First, actual contracts generally govern prisoner's dilemma, or exchange, relations; the so-called social contract governs these and more diverse interactions as well. Second, agreement, which is the moral basis of contractarianism, is not right-making per se. Third, a contract in law gives information on what are the interests of the parties; a hypothetical social contract requires such knowledge, it does not reveal it. Hence, much of contemporary contractarian theory is perversely rationalist at its base because it requires prior, rational derivation of interests or other values. Finally, contractarian moral theory has the further disadvantage that, unlike contract in the law, its agreements cannot be connected to relevant motivations to abide by them.
Journal: Constitutional Political Economy, 1 (2) 1990: 35-52</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>To the best of our knowledge, there is no free indexer available that does not require any data preparation step or the creation of some training data. With LOHAI, we developed such an indexer by just using the standard approaches in natural language processing and information retrieval for the single steps in the indexing pipeline. Each and every step could be improved by employing new and more sophisticated approaches, but we intentionally restricted ourselves to the well-understood approaches that are state of the art in information retrieval and natural language processing. All in all, the indexer consists of about 500 lines of code in Java, not counting the POS tagger and the Snowball stemmer. We showed that the indexer performs quite well and, maybe most importantly, does not behave like a black box: every assignment is easily understandable. We expect that the indexer would even be usable in serious indexing projects. LOHAI is already successfully employed in SEMTINEL, a thesaurus evaluation framework, where it is used to quickly process large document sets for a given thesaurus to determine its concept coverage. LOHAI is not a stupid indexer, it is a baseline indexer. It is free, open source and available at https://github.com/kaiec/LOHAI.</p>
      <sec id="sec-6-1">
        <title>7 http://www.semtinel.org</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Gale, W.A., Church, K.W., Yarowsky, D.: One sense per discourse. In: Proceedings of the Workshop on Speech and Natural Language, pp. 233-237. HLT '91, Association for Computational Linguistics, Stroudsburg, PA, USA (1992), http://dx.doi.org/10.3115/1075527.1075579</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>2. Manning, C.D., Schuetze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>3. Maynard, D., Dasiopoulou, S., Costache, S., Eckert, K., Stuckenschmidt, H., Dzbor, M., Handschuh, S.: D1.2.2.1.3 Benchmarking of annotation tools. Tech. rep., Knowledge Web Project (2007)</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>4. Medelyan, O.: Human-competitive automatic topic indexing. Ph.D. thesis, University of Waikato (2009)</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 1-69 (2009)</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>6. Porter, M.: Snowball: A language for stemming algorithms. Published online (2001), http://www.snowball.tartarus.org/texts/introduction.html</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>7. Ramakrishnan, G., Prithviraj, B., Bhattacharyya, P.: A gloss-centered algorithm for disambiguation. In: SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. ACM, New York, NY, USA (2004)</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>8. Santorini, B.: Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Tech. rep., University of Pennsylvania (1990), http://repository.upenn.edu/cis_reports/570/</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>9. Schwens, U., Wiechmann, B.: Netzpublikationen in der Deutschen Nationalbibliothek. Dialog mit Bibliotheken 1(1), 10-13 (2009)</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>10. Toutanova, K., Klein, D., Manning, C., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003, pp. 252-259 (2003)</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>11. Yarowsky, D.: Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In: Proceedings of the 14th Conference on Computational Linguistics - Volume 2, pp. 454-460. COLING '92, Association for Computational Linguistics, Stroudsburg, PA, USA (1992), http://dx.doi.org/10.3115/992133.992140</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>12. Yarowsky, D.: One sense per collocation. In: Proceedings of the Workshop on Human Language Technology, pp. 266-271. HLT '93, Association for Computational Linguistics, Stroudsburg, PA, USA (1993), http://dx.doi.org/10.3115/1075671.1075731</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>13. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics, pp. 189-196. ACL '95, Association for Computational Linguistics, Stroudsburg, PA, USA (1995), http://dx.doi.org/10.3115/981658.981684</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>