Proceedings of the 1st International Workshop on Semantic Digital Archives (SDA 2011)

LOHAI: Providing a baseline for KOS based automatic indexing

Kai Eckert
University of Mannheim, University Library
Mannheim, Germany
eckert@bib.uni-mannheim.de

Abstract. Automatic KOS based indexing – i.e., indexing based on a restricted, controlled vocabulary, a thesaurus or a classification – can play an important role in closing the gap between the intellectually indexed, high-quality publications and the mass of unindexed publications. Especially for unknown, heterogeneous publications such as web publications, simple processes that do not rely on manually created training data are needed. With this contribution, we propose a straightforward linguistic indexer that can be used as a basis for one's own developments and for experiments and analyses exploring one's own documents and KOSs; it uses state-of-the-art information retrieval techniques and hence forms a suitable baseline for evaluations. Finally, it is free and open source.

1 Introduction

Intellectual indexing of publications based on a Knowledge Organization System (KOS) – such as controlled vocabularies, thesauri or classifications – is still performed to ensure high accuracy in information retrieval. Even with the availability of electronic full-text search, the resolution of synonyms and homonyms by the introduction of a controlled vocabulary – functioning as a common language between creators and searchers of the indexed content – remains very important. To close the gap between the subset of publications that are traditionally indexed intellectually – books in libraries, but also selected journal articles in mostly commercial databases – and the remaining mass of publications, automatic indexing approaches are widely introduced. The German National Library, for instance, decided that web publications, while being collected, will not be indexed intellectually, but only by means of automatic processes and search engine technology [9].

A recent workshop on automatic indexing, called PETRUS1, showed that there are mainly two types of approaches: linguistic and statistical. While there are smooth transitions between the two, linguistic approaches use techniques from natural language processing (NLP) to process the texts and extract meaningful concepts, whereas statistical approaches use machine learning techniques to assign concepts based on a manually created training set.

1 http://www.d-nb.de/wir/projekte/workshop_petrus.htm

Based on the discussions and contributions of the workshop, there is currently a preference for statistical approaches, although the reported quality of the results varies. One problem mentioned was the bias introduced by the training set: for recent news articles, for example, the indexer learned that the occurrence of "nuclear power plant" should lead to "Japan" as a concept to be assigned. More general is the observation that the indexing quality depends on the homogeneity of the documents to be indexed. If they vary greatly in content, style or even length, the quality of the indexing result suffers.

In the semantic web, heterogeneous contents are the rule rather than the exception. A lot of different KOSs are widely used to describe all kinds of resources2, but for real semantic interoperability, we have to be able to match the concepts between different KOSs or to quickly assign concepts of a KOS to a new resource.

2 E.g. by means of SKOS, http://www.w3.org/2004/02/skos/.
Albeit with inferior quality, such a bridging is needed to connect all kinds of resources, especially in the area of libraries, archives and museums. With Maui [4], there exists a statistical indexer that incorporates a lot of NLP techniques, but to the best of our knowledge there is no free and open source implementation of a strictly linguistic indexer that can be used without any training data on arbitrary documents. Especially for the evaluation of "real" automatic indexers, such a simple implementation is useful. There are many additional scenarios where this indexer can be used, be it for experimental services or whenever more sophisticated approaches are just not needed. And of course, as a reasonable baseline, more sophisticated approaches have to outperform it in the first place.

In this paper, we present LOHAI3, a strictly linguistic indexer that incorporates mainly those techniques that are state of the art in information retrieval. The development of LOHAI is led by the following motivational thoughts:

Simplicity over quality: While every single step could be improved or replaced by a more sophisticated technique that has already been developed and published somewhere, we tried to keep everything as simple as possible. Everything should be easy to use, easy to understand and easy to improve if needed.

Knowledge-poor and without any training: To be usable for arbitrary KOSs and documents, the indexer cannot rely on any additional knowledge sources; the KOS itself, of course, can and will be used. The indexer must not employ a training step, as there are many settings where no preindexed documents are available and the creation of a training set would be too cumbersome for the user.

With these prerequisites in mind, we compose the indexer as a pipeline with several components, as illustrated in Figure 1.

3 LOHAI is pronounced like Low-High and means something like LOw HAnging fruits Automatic Indexer, which gives a brief summary of the development process.

Fig. 1: The indexing pipeline

2 Preprocessing

The preprocessing consists of several steps:

1. Part-of-speech tagging: We use the Stanford Log-linear Part-Of-Speech Tagger4, as described by Toutanova et al. [10]. Part-of-speech (POS) tagging simply means the identification of nouns, verbs, adjectives and other word types in a text. To avoid wrong concept assignments, like the assignment of the concept "need" (as a noun, in the sense of requirements) whenever the verb "to need" is used, we only consider nouns (NN5, NNP, NNPS, NNS), adjectives (JJ, JJR, JJS), foreign words (FW) and unknown words (untagged).

2. Tokenization: Tokenization splits the text into single terms and is performed together with the POS tagging. The result is a list of terms that are further investigated for proper concept assignments. The tokenization step also includes a cleaning of the terms, where everything that is not a letter, a hyphen or a space is truncated. Note that numbers are truncated, too, as they usually carry no meaning and are generally highly ambiguous. In some domains this would not be desired; consider for example history or chemistry.

3. Stemming: Finally, the single terms are stemmed, i.e. reduced to their stem. That way, identical terms can be matched even if they appear in different grammatical forms, such as "banks" and "bank". We use the English (Porter2) stemming algorithm of the Snowball stemmer [6].

4. KOS preparation: This is performed only once per KOS. All concept labels are stemmed by means of the same stemmer that is applied to the document texts. An index is created that maps the single stems to the corresponding concepts. Additionally, an index of stemmed label parts is created that is used for the identification of compound terms – for instance, "insurance market" would be stemmed to "insur market", mapping to the corresponding concept; additionally, both stem parts are indexed and mapped to the stemmed compound term. A minimal sketch of this index is given at the end of this section.

4 http://nlp.stanford.edu/software/tagger.shtml
5 Tag definitions according to the Penn Treebank tag set [8].

The preprocessing uses only freely available standard approaches. The POS tagging and the stemming are language dependent; both algorithms employed are implemented for various languages, including English and German. We assume that both the KOS and the documents are in the same language and that only one language is used per document, so that the appropriate implementations can be used. If the KOS is multilingual and the documents use different languages, an additional language detection step has to be employed. After the preprocessing steps, the actual concept assignment and weighting takes place, as described in the next section.
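To make the KOS preparation step more concrete, the following sketch shows how such a stem index could be built in Java. It is a minimal illustration under simplifying assumptions, not the LOHAI code itself: the Stemmer interface and the use of plain string identifiers for concepts are hypothetical placeholders (a real implementation would delegate to the Snowball stemmer mentioned above), and the document-side POS filtering and token cleaning are omitted.

import java.util.*;

/** Minimal sketch of the KOS preparation step (hypothetical types and names). */
public class KosIndex {

    /** Placeholder for the Snowball (Porter2) stemmer used on documents and labels alike. */
    public interface Stemmer {
        String stem(String term);
    }

    private final Stemmer stemmer;
    // stemmed label -> identifiers of concepts carrying this label
    private final Map<String, Set<String>> stemToConcepts = new HashMap<>();
    // single stem -> stemmed compound labels it is part of (e.g. "insur" -> {"insur market"})
    private final Map<String, Set<String>> partToCompounds = new HashMap<>();

    public KosIndex(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    /** Index one (preferred or alternative) label of a concept. */
    public void addLabel(String conceptId, String label) {
        // stem every word of the label and rejoin, e.g. "insurance market" -> "insur market"
        String[] words = label.trim().toLowerCase().split("\\s+");
        StringBuilder sb = new StringBuilder();
        for (String w : words) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(stemmer.stem(w));
        }
        String stemmedLabel = sb.toString();
        stemToConcepts.computeIfAbsent(stemmedLabel, k -> new HashSet<>()).add(conceptId);
        if (words.length > 1) {
            // remember each part so compound candidates can be detected during assignment
            for (String w : words) {
                partToCompounds.computeIfAbsent(stemmer.stem(w), k -> new HashSet<>())
                               .add(stemmedLabel);
            }
        }
    }

    /** Concepts whose stemmed label equals the given stem or stem chain. */
    public Set<String> conceptsFor(String stemmedLabel) {
        return stemToConcepts.getOrDefault(stemmedLabel, Collections.emptySet());
    }

    /** True if the stem occurs as part of at least one compound label. */
    public boolean isCompoundPart(String stem) {
        return partToCompounds.containsKey(stem);
    }
}

During indexing, conceptsFor can be consulted with single stems as well as with rejoined stem chains, and isCompoundPart can drive the compound detection described in the next section.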
3 Concept assignment with compound term detection

The general assignment strategy is purely string based: if a stem that belongs to a concept in the KOS appears among the stems extracted from the text, the concept is assigned. In this step, we consider every concept a matching concept that has a label with the same stem, or that contains the same stem in the case of a compound term. Under this assumption, we have to deal with three possibilities that would otherwise lead to wrong assignments:

1. A stem could belong to several concepts, including compound term concepts, e.g. "insur", which belongs among others to "insurance" and "insurance market".

2. A stem could belong to several concepts that have different labels with the same stem (overstemming), like "nationalism", "nationality" and "nation".

3. A stem could belong to several concepts that have the same labels with the same stem (homonyms), like "bank" (the financial institution) and "bank" (a raised portion of seabed or sloping ground along the edge of a stream, river, or lake).

Approaches to handle the latter two cases are described in the next section; the first case is dealt with directly in the assignment phase. The basic assumption is that we want to assign the most specific concept, i.e. in the above example we would like to assign "insurance market", but not "market". We implement this as follows: whenever a stem is recognized as a potential part of a compound term, the stem is temporarily stored in a list. When a stem is found that cannot be part of a compound, the list is analyzed for contained concepts. In this step, the algorithm simply checks, for every starting stem, whether any chain of stems corresponds to a compound concept. The algorithm starts with the longest possible chain and stops as soon as a compound is found, thus avoiding the assignment of additional concepts contained in the compound. With this approach, the algorithm generally has a linear runtime with respect to the number of words in the text. Only the parts that potentially contain compounds have to be analysed further, with a runtime of O(n²), where n denotes the number of words within such a part.
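The following sketch illustrates this longest-match strategy on a buffered chain of stems. It is a simplified reading of the description above, not the LOHAI implementation: it assumes the hypothetical KosIndex from the preprocessing sketch, and the way the outer loop advances past a matched compound is one possible interpretation of the "most specific concept" rule.

import java.util.*;

/** Sketch of compound term detection (longest chain first), based on the hypothetical KosIndex. */
public class CompoundMatcher {

    private final KosIndex index;

    public CompoundMatcher(KosIndex index) {
        this.index = index;
    }

    /**
     * Given a buffer of consecutive stems that are all potential compound parts,
     * return the matched concept identifiers. For every starting position, the
     * longest chain is tried first; once a compound matches, shorter sub-chains
     * are skipped so that concepts contained in the compound (e.g. "market" in
     * "insur market") are not assigned additionally.
     */
    public Set<String> matchBuffer(List<String> stems) {
        Set<String> assigned = new HashSet<>();
        int i = 0;
        while (i < stems.size()) {
            int matchedLength = 1; // fall back to a single stem if no chain matches
            for (int end = stems.size(); end > i; end--) {
                String chain = String.join(" ", stems.subList(i, end));
                Set<String> concepts = index.conceptsFor(chain);
                if (!concepts.isEmpty()) {
                    assigned.addAll(concepts);
                    matchedLength = end - i;
                    break; // longest match found, stop checking shorter chains
                }
            }
            i += matchedLength; // continue after the matched chain
        }
        return assigned;
    }
}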
4 Unstemming and word-sense disambiguation

Whenever one or more stems could be assigned to more than one concept, we would like to identify a single concept as the correct one in the given context. This task is generally denoted as word-sense disambiguation (WSD). We use two different approaches for WSD, the first being a specific check that tackles the problem of overstemming mentioned above. If this step is not able to disambiguate the potential concepts, the actual WSD is performed.

Unstemming. Overstemming – the reduction of two different terms to the same stem – leads to ambiguous stems that have to be disambiguated during indexing. Consequently, we first unstem the stem, i.e. we go back to the original, unstemmed form of the term as found in the text. If the unstemmed term corresponds directly to an unstemmed label of a concept, we assign this concept. If there is only one such concept, we finish the WSD step. Otherwise, we continue with the actual WSD, as described in the following.

KOS based word-sense disambiguation. Word-sense disambiguation is a broad field in the area of natural language processing. Leaving the technical issue of overstemming aside, it generally consists of the task of determining the correct sense of a word that appears in a particular context. The variety of possible senses is often based on some background knowledge, like a thesaurus or another type of KOS. As Manning and Schuetze [2, pp. 229 f.] pointed out, this can be unsatisfactory from a scientific or philosophical point of view, as the definitions in the background knowledge are often quite arbitrary and possibly not sufficient to describe the actual sense of a word in a given context. However, our goal is not the perfect assignment of a sense to a word; our goal is the assignment of the best fitting concept in the KOS.

WSD approaches can be divided into supervised and unsupervised approaches, and additionally into knowledge-rich and knowledge-poor approaches [5]. In our setting, we clearly need an approach that is unsupervised – as it has to work without any previously tagged texts – and knowledge-rich – as we clearly have a KOS at hand and of course want to use it to improve the disambiguation quality.

A supervised, knowledge-rich approach would be the adaptive thesaurus-based disambiguation presented by Yarowsky [11], where a Bayes classifier is trained on a large document corpus and thus probabilities for the occurrence of specific words in the context of a specific sense are determined. Yarowsky [13] also proposed an (almost) unsupervised approach that makes use of two assumptions:

One sense per collocation. We assume that words collocated with the word to be disambiguated are unique for the correct sense and would not be collocated with the word for other senses. This is basically the rationale for using the context of a word – usually a window of words before and after the word in question – for disambiguation.

One sense per discourse. We assume that only one sense of a given word is used throughout a whole document. With this assumption, we can make use of any occurrence of the word in the text and thus get a more stable disambiguation result.

Both assumptions have been examined and verified [1, 12]. However, as Yarowsky's approach is not completely unsupervised – a small set of pretagged senses is needed as seed – we only make use of the two assumptions, but use a much simpler approach: word-sense disambiguation based on a Jaccard comparison (cf. Ramakrishnan et al. [7]). For this comparison, we define two sets of words: W as the context of an occurrence of the ambiguous word w, and C as the context of a candidate concept c, respectively. We then compute the Jaccard measure as follows:

    Jaccard(W, C) = |W ∩ C| / |W ∪ C|    (1)

Based on the assumption "one sense per discourse", we assign to each occurrence of w the concept c that was assigned most often in the document, i.e. the concept that received the highest Jaccard value for most occurrences. If only abstracts are available for indexing, this procedure can be further simplified by just taking the whole abstract as the context of each occurrence of w, which leads to the direct assignment of the concept c with the highest Jaccard value.

As the context of an ambiguous word w, we either define a window of 100 words before and after the word or just use the whole document in the case of short texts like abstracts. The context of a concept c is defined as the union of all labels of the concept, its direct child concepts, its parent concepts and the direct children of the parent concepts, i.e. its siblings. Other definitions are of course possible, for example weighting words and labels depending on their distance to the word or concept, but for our purpose as part of a simple baseline indexer, our approach is sufficient.
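A minimal Jaccard-based disambiguation step could look like the following sketch. The method names and the way contexts are passed in are assumptions made for illustration, not the LOHAI API; the contexts are assumed to be pre-stemmed sets of words built as described above.

import java.util.*;

/** Sketch of the Jaccard-based word-sense disambiguation (hypothetical API). */
public class JaccardWsd {

    /** Jaccard(W, C) = |W ∩ C| / |W ∪ C| on two sets of (stemmed) words. */
    public static double jaccard(Set<String> wordContext, Set<String> conceptContext) {
        if (wordContext.isEmpty() && conceptContext.isEmpty()) return 0.0;
        Set<String> intersection = new HashSet<>(wordContext);
        intersection.retainAll(conceptContext);
        Set<String> union = new HashSet<>(wordContext);
        union.addAll(conceptContext);
        return (double) intersection.size() / union.size();
    }

    /**
     * Pick the best candidate concept for one occurrence of an ambiguous word.
     * conceptContexts maps each candidate concept id to its context, i.e. the
     * union of its own labels and the labels of parents, children and siblings.
     */
    public static String disambiguate(Set<String> wordContext,
                                      Map<String, Set<String>> conceptContexts) {
        String best = null; // stays null if there are no candidates
        double bestScore = -1.0;
        for (Map.Entry<String, Set<String>> e : conceptContexts.entrySet()) {
            double score = jaccard(wordContext, e.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = e.getKey();
            }
        }
        return best;
    }

    /** "One sense per discourse": take the concept chosen for most occurrences. */
    public static String disambiguateDocument(List<Set<String>> occurrenceContexts,
                                              Map<String, Set<String>> conceptContexts) {
        if (occurrenceContexts.isEmpty()) return null;
        Map<String, Integer> votes = new HashMap<>();
        for (Set<String> ctx : occurrenceContexts) {
            votes.merge(disambiguate(ctx, conceptContexts), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }
}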
5 Weighting

The last step in the indexing pipeline is the weighting of the assigned concepts. As the baseline indexer so far assigns every concept that can be identified by an occurring word, the weighting of these concepts is vitally important for determining which concepts are important and descriptive for the given text and which concepts are only touched marginally. It is also desirable to give concepts a higher weight when they are not used in the majority of documents, because concepts that occur in most documents usually only denote common terms and are not important for the indexing result.

The common approach for this kind of weighting is tf-idf, which is based on the term frequency tf_{c,d} of a term (in our case a concept c) in a given document d and on the document frequency df_c of a concept c, i.e. the number of documents in which the concept appears:

    w(c, d) = tf_{c,d} · log(D / df_c)    (2)

D denotes the total number of documents in the indexed corpus. The last factor is called the inverse document frequency (idf), as the overall weight becomes smaller the higher df_c is.
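As a worked illustration, the following sketch computes this weight for the concepts assigned to one document. It is a minimal sketch under stated assumptions: the corpus statistics (document frequencies and corpus size) are assumed to be available from a prior pass over the collection, and the fallback for unseen concepts is an illustrative choice, not prescribed by LOHAI.

import java.util.*;

/** Sketch of the tf-idf style concept weighting, w(c, d) = tf(c,d) * log(D / df(c)). */
public class ConceptWeighting {

    private final Map<String, Integer> documentFrequency; // concept id -> df(c)
    private final int corpusSize;                         // D

    public ConceptWeighting(Map<String, Integer> documentFrequency, int corpusSize) {
        this.documentFrequency = documentFrequency;
        this.corpusSize = corpusSize;
    }

    /** Weight all concepts of one document, given how often each concept was assigned in it. */
    public Map<String, Double> weight(Map<String, Integer> conceptFrequencies) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : conceptFrequencies.entrySet()) {
            // df defaults to 1 for concepts never seen in the corpus, to avoid division by zero
            int df = documentFrequency.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) corpusSize / df);
            weights.put(e.getKey(), e.getValue() * idf);
        }
        return weights;
    }
}

With the natural logarithm, a concept assigned once in a document and occurring in 10 of 1000 corpus documents would get a weight of about ln(100) ≈ 4.6, whereas a concept occurring in 900 of those documents would get only about ln(1000/900) ≈ 0.1 per occurrence; the concrete weights reported below may of course be scaled differently or use another logarithm base.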
6 Results

To show the weaknesses and strengths of LOHAI, we investigate some of the indexing results. For our experiments, we used the German STW Thesaurus for Economics6. A concept in the STW consists of preferred and alternative labels, both in English and in German. For example, there is the concept "Migration theory" with the alternative labels "Economics of migration" and "Theory of migration".

6 Standard Thesaurus Wirtschaft, http://zbw.eu/stw/versions/latest/about.en.html. The thesaurus is published and maintained by the German National Library of Economics (Deutsche Zentralbibliothek für Wirtschaftswissenschaften, ZBW).

Figure 2 shows an example abstract that we indexed. LOHAI produces the output shown in Figure 3; additionally, we list the concepts intellectually assigned by a librarian. It can easily be seen that the characteristics of the results are quite different. But if one takes the weighting into account, it can be seen that there are no wrong assignments with a weight above 0.3. Below that threshold, there are in particular common terms that form a concept in the thesaurus and that are either not helpful or wrongly assigned, such as "Exchange". "Government", for example, seems to be correct, but is rather a coincidence, as it is assigned due to the verb "govern" in the text – an indication of a mistake during the POS tagging. On the other hand, the very abstract concepts that are assigned by the librarian (besides "Theory") are not found by LOHAI, as the corresponding terms do not appear in the text in any form.

All in all, the results are very promising, even with a relatively simple approach like ours. Most assignments are correct, even if a human indexer would not assign all of them. The indexing quality correlates with the employed weighting; in particular, assignments with a lower rank more often contain common concepts that sometimes are just wrong. Many of these mistakes could be avoided if the thesaurus were more precise about homonyms and provided additional information to disambiguate them where necessary. The indexer could be further improved, e.g. common concepts should not be assigned if more specific concepts further down the tree are found in the text (like "Law" and "Contract Law" above). On the other hand, we wanted to keep it simple. Such adaptions and improvements are easy to implement if they are needed. A quantitative evaluation of the results by comparison to a manually created gold standard is still missing, but former experiments [3] with a comparable indexer showed that such results are nevertheless not very meaningful due to the very different characteristics of such an automatic approach and a trained librarian.

Title: Contractarianism: Wistful Thinking
Authors: Hardin, Russell
Abstract: The contract metaphor in political and moral theory is misguided. It is a poor metaphor both descriptively and normatively, but here I address its normative problems. Normatively, contractarianism is supposed to give justifications for political institutions and for moral rules, just as contracting in the law is supposed to give justification for claims of obligation based on consent or agreement. This metaphorical association fails for several reasons. First, actual contracts generally govern prisoner's dilemma, or exchange, relations; the so-called social contract governs these and more diverse interactions as well. Second, agreement, which is the moral basis of contractarianism, is not right-making per se. Third, a contract in law gives information on what are the interests of the parties; a hypothetical social contract requires such knowledge, it does not reveal it. Hence, much of contemporary contractarian theory is perversely rationalist at its base because it requires prior, rational derivation of interests or other values. Finally, contractarian moral theory has the further disadvantage that, unlike contract in the law, its agreements cannot be connected to relevant motivations to abide by them.
Journal: Constitutional Political Economy, 1 (2) 1990: 35-52

Fig. 2: Example of a document abstract used for annotation
(a) Intellectual indexing: Constitutional economics; Influence of government; Ethics; Theory

(b) LOHAI: Contract Law (1.21); Contract (0.76); Social contract (0.64); Law (0.51); Politics (0.37); Prisoner's dilemma (0.34); Theory (0.32); Rationalism (0.24); Association (0.23); Exchange (0.20); Knowledge (0.19); Government (0.16); Information (0.12)

Fig. 3: Intellectual indexing vs. LOHAI

7 Conclusion

To the best of our knowledge, there is no free indexer available that does not require a data preparation step or the creation of some training data. With LOHAI, we developed such an indexer by simply using the standard approaches of natural language processing and information retrieval for the single steps of the indexing pipeline. Each and every step could be improved by employing newer and more sophisticated approaches, but we intentionally restricted ourselves to the well-understood approaches that are state of the art in information retrieval and natural language processing. All in all, the indexer consists of about 500 lines of Java code, not counting the POS tagger and the Snowball stemmer. We showed that the indexer performs quite well and – maybe most importantly – does not behave like a black box: every assignment is easily understandable. We expect that the indexer would even be usable in serious indexing projects. LOHAI is already successfully employed in SEMTINEL7, a thesaurus evaluation framework, where it is used to quickly process large document sets for a given thesaurus in order to determine the thesaurus's concept coverage. LOHAI is not a stupid indexer, it is a baseline indexer. It is free, open source and available at https://github.com/kaiec/LOHAI.

7 http://www.semtinel.org

References

1. Gale, W.A., Church, K.W., Yarowsky, D.: One sense per discourse. In: Proceedings of the workshop on Speech and Natural Language. pp. 233–237. HLT '91, Association for Computational Linguistics, Stroudsburg, PA, USA (1992), http://dx.doi.org/10.3115/1075527.1075579
2. Manning, C.D., Schuetze, H.: Foundations of Statistical Natural Language Processing. MIT Press (1999)
3. Maynard, D., Dasiopoulou, S., Costache, S., Eckert, K., Stuckenschmidt, H., Dzbor, M., Handschuh, S.: D1.2.2.1.3 Benchmarking of annotation tools. Tech. rep., Knowledge Web Project (2007)
4. Medelyan, O.: Human-competitive automatic topic indexing. Ph.D. thesis, University of Waikato (2009)
5. Navigli, R.: Word sense disambiguation: A survey. ACM Comput. Surv. 41(2), 1–69 (2009)
6. Porter, M.: Snowball: A language for stemming algorithms. Published online (2001), http://www.snowball.tartarus.org/texts/introduction.html
7. Ramakrishnan, G., Prithviraj, B., Bhattacharyya, P.: A gloss-centered algorithm for disambiguation. In: SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. ACM, New York, NY, USA (2004)
8. Santorini, B.: Part-of-speech tagging guidelines for the Penn Treebank Project (3rd revision). Tech. rep., University of Pennsylvania (1990), http://repository.upenn.edu/cis_reports/570/
9. Schwens, U., Wiechmann, B.: Netzpublikationen in der Deutschen Nationalbibliothek. Dialog mit Bibliotheken 1(1), 10–13 (2009)
10. Toutanova, K., Klein, D., Manning, C., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003. pp. 252–259 (2003)
11. Yarowsky, D.: Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. In: Proceedings of the 14th conference on Computational linguistics - Volume 2. pp. 454–460. COLING '92, Association for Computational Linguistics, Stroudsburg, PA, USA (1992), http://dx.doi.org/10.3115/992133.992140
12. Yarowsky, D.: One sense per collocation. In: Proceedings of the workshop on Human Language Technology. pp. 266–271. HLT '93, Association for Computational Linguistics, Stroudsburg, PA, USA (1993), http://dx.doi.org/10.3115/1075671.1075731
13. Yarowsky, D.: Unsupervised word sense disambiguation rivaling supervised methods. In: Proceedings of the 33rd annual meeting on Association for Computational Linguistics. pp. 189–196. ACL '95, Association for Computational Linguistics, Stroudsburg, PA, USA (1995), http://dx.doi.org/10.3115/981658.981684