Alicante at CLEF 2009 Robust-WSD Task

Javi Fernández, Rubén Izquierdo, and José M. Gómez
University of Alicante, Department of Software and Computing Systems
San Vicente del Raspeig Road, 03690 Alicante, Spain
javier.fernandez@eltallerdigital.com, ruben@dlsi.ua.es, jmgomez@ua.es
http://www.dlsi.ua.es

Abstract. In this paper we explore the use of semantic classes in an Information Retrieval system in order to improve the results of the Robust-WSD task at CLEF 2009. We use two ontologies of semantic classes (WordNet Domains and Basic Level Concepts) to re-rank the retrieved documents and obtain better recall and precision. In addition, we implement a novel method to weight the expanded terms that takes into account the weights of the original query terms and their WordNet relations to the new ones, which has been shown to improve the results.

1 Introduction

The two main goals of last year's edition of the Robust-WSD task were to measure the robustness of retrieval systems (good, stable performance over all queries) and to test the benefits of Word Sense Disambiguation (WSD) in this kind of system. For our participation in the second edition, we decided to use as a starting point a system already implemented and evaluated in last year's edition, the one of Universidad Complutense de Madrid [10], due to its good results, its availability, and the possibility of easily adjusting the code to our objectives. Our main strategy consists of exploring the benefits of semantic classes in Information Retrieval (IR) systems. Moreover, we propose a novel and flexible way of weighting terms for query expansion based on WordNet relations.

WSD can be defined as the task of assigning the correct sense to words depending on the context in which they appear. It is a challenging task and, despite the long time it has been studied, the results of state-of-the-art WSD systems are still far from being useful in other Natural Language Processing applications, as shown in recent international evaluations [15, 12]. Generally, supervised systems obtain better results than unsupervised ones in those evaluations. The annotated corpora commonly used in supervised approaches are tagged manually by lexicographers with word senses taken from a particular lexical semantic resource, most commonly WordNet [4]. This resource has been widely criticized for being a sense repository that often provides sense distinctions that are too fine-grained. On the one hand, such fine-grained senses are not useful for higher level applications like Machine Translation or Question Answering. On the other hand, many word sense distinctions seem too subtle to be captured by automatic systems with the current small volumes of word-sense annotated examples. This could be one reason for the poor results of current WSD systems.

A possible solution that has been explored is the use of semantic classes instead of word senses. The task of WSD then consists of assigning the proper semantic class to each ambiguous word, instead of its word sense. The use of semantic classes has several advantages. Firstly, they provide richer and more useful information than word senses. For example, for IR it could be more informative to know that the word church belongs to the semantic class building than to know that its correct sense is sense 1.
Secondly, the average polysemy of texts is decreased with the use of semantic classes, since they can group several senses of a given word in the same class; the classification task is therefore simplified. Finally, the amount of training data for each classifier is increased, because semantic classes can group together senses of different words, senses of the same word, and even senses of words of different morphological categories. As a consequence, the number of examples available to train each classifier grows in semantic class approaches, and the problem of the lack of data is alleviated.

In [5] the authors empirically explored, on the supervised WSD task, the performance of the different levels of abstraction provided by WordNet Domains [8], SUMO labels [9], the Lexicographer Files of WordNet [4] and Basic Level Concepts [6]. They referred to this approach as class-based WSD, since the classifiers were created at the class level instead of at the sense level. As mentioned above, class-based WSD clusters senses of different words into the same explicit and comprehensive grouping. Only those cases belonging to the same semantic class are grouped to train the classifier. For example, the coarser word grouping obtained in [14] has only one remaining sense for "church". Using a set of Basic Level Concepts [6], the three senses of "church" are still represented by faith.n#3, building.n#1 and religious_ceremony.n#1.

We are convinced that IR can take advantage of word sense disambiguation from a semantic class point of view, instead of from the traditional word sense point of view. Since the data of the robust ad hoc IR task has been automatically processed by two WSD systems and the word sense information is already available, we did not run any class-based WSD system over the data.

This paper is organized as follows: the next section describes the architecture of our system. In Section 3 we discuss the results of this system at the CLEF 2009 Robust-WSD Task. Finally, in Section 4 we draw conclusions and outline future work.

2 Description of the System

The system architecture is shown in Figure 1.

Fig. 1. Architecture of the system

As can be seen in Figure 1, the user query is first pre-parsed to obtain a set of terms without stopwords or special symbols. Next, a ranked list of relevant documents is retrieved using the Lucene search engine (http://lucene.apache.org). From the retrieved documents, the initial query, the relations of the external resource WordNet and state-of-the-art query expansion methods, an expanded query is obtained. The terms of this new query are weighted taking into account the weights of the original query words, their WordNet relations to the new terms, the weight assigned by the WSD system to each sense, and the weight returned by the expansion method. Once we obtain this new list of weighted terms, we search again, this time using the expanded query instead of the original one, in order to retrieve a new ranked list of documents. Finally, we employ the semantic class information from two semantic resources (WordNet Domains and Basic Level Concepts) to obtain a re-ranked document list as the result. The following sections present each of these processes in more detail.
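To make the data flow explicit, the following minimal Python sketch mirrors Figure 1. Every callable is a placeholder for one of the modules named above (pre-parsing, BM25 search, Bo1 expansion, semantic re-ranking); none of these signatures comes from the actual implementation, which is built on a modified Lucene.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch of the two-pass retrieval flow of Figure 1.
# Each callable stands in for one real module of the system.

ScoredDocs = List[Tuple[str, float]]        # (document id, score) pairs
WeightedTerms = List[Tuple[str, float]]     # (term, weight) pairs

def two_pass_retrieval(
    query: str,
    preparse: Callable[[str], List[str]],
    search: Callable[[WeightedTerms], ScoredDocs],
    expand: Callable[[List[str], ScoredDocs], WeightedTerms],
    rerank: Callable[[str, ScoredDocs], ScoredDocs],
) -> ScoredDocs:
    terms = preparse(query)                    # drop stopwords and special symbols
    first = search([(t, 1.0) for t in terms])  # first BM25 retrieval
    expanded = expand(terms, first)            # Bo1 expansion + WordNet weighting (Sect. 2.1)
    second = search(expanded)                  # second retrieval with the expanded query
    return rerank(query, second)               # semantic class re-ranking (Sect. 2.3)
```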
2.1 Search engine and query expansion

As mentioned previously, the search engine we employ is the one provided by the Universidad Complutense de Madrid. Its implementation is a modified version of Lucene which uses the BM25 probabilistic model [13] for document retrieval. They have also implemented two state-of-the-art query expansion methods: Kullback-Leibler divergence [2, 3] (an information-theoretic approach) and the Bo1 model [7, 11] (based on Divergence From Randomness [1]). We selected the Bo1 model since this approach obtains the best results. We also decided to use the same constant values as they used in the last CLEF Robust-WSD edition, in order to compare the effectiveness of our semantic class methods.

As can be seen in Figure 1, we perform two search processes. For the first retrieval process, query terms are lemmatized and stemmed in order to increase the system recall. The first Search module takes these terms as input and returns a list of relevant documents using the BM25 probabilistic model. Then, in the Query expansion module, we expand the original query, obtaining new terms by means of the Bo1 model.

Although [10] proposed a method for weighting the expanded query terms based on WordNet, we preferred to use our own, since their method does not use all senses of each term but only the highest-weighted one. We decided to use all senses returned by the WSD system in order to improve recall. The system searches for each expanded term among the synonyms, hypernyms and hyponyms of the original terms in WordNet, up to a certain level or distance. For example, if the distance is 2, we search for an expanded term among the hypernym and hyponym synsets of the original terms, but also among the hypernyms of the hypernyms and the hyponyms of the hyponyms. The distance constant sets the number of jumps allowed through the WordNet relations from the synsets of the query terms. We use all senses supplied by the WSD system for each query term, and take into account the score given by this system to each sense when calculating the weight of the expanded terms. Therefore, this distance factor is calculated by the following equation:

$weight(synset_{i,d}) = weight(synset_{i,d-1}) \cdot \alpha^{d}$    (1)

We define $synset_{i,1}$ as a given WordNet synset and $synset_{i,d}$ as another WordNet synset related to $synset_{i,1}$ at a distance of d jumps (taking into account only hypernym and hyponym relations). Thus, $weight(synset_{i,d})$ is the weight of $synset_{i,d}$, and $weight(synset_{i,1})$ is the score given by the WSD system to $synset_{i,1}$. $\alpha$ is a constant whose value is between 0 and 1, and d is the distance from $synset_{i,d}$ to $synset_{i,1}$.

Once the previous synset weight is calculated, we combine it with the weight assigned by the Bo1 expansion method in order to calculate the final term weight, using the following equation:

$weight(term_t) = \dfrac{weight(synset_{i,d}) + weight'(term_t)}{2}$    (2)

where $weight(term_t)$ is the weight of the expanded term t, which is grouped in the WordNet synset $synset_{i,d}$, and $weight'(term_t)$ is the weight assigned by Bo1 to the term t.

Using these equations, we give importance to those expanded terms closely related to the original query terms and, in addition, we include the score given by the WSD system for each query term in the final term weight. Thus, we include all senses of a term in the search, giving more importance to those terms which are related to more likely senses and are closer to the original query terms.
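As an illustration of equations 1 and 2, the following Python sketch propagates WSD sense scores through WordNet and averages them with the Bo1 weights. It assumes a hypothetical wordnet_neighbors accessor returning the hypernym and hyponym synsets of a synset (e.g. a wrapper around NLTK's WordNet API); since the paper does not specify how weights from competing paths are combined, the sketch keeps the maximum.

```python
# Illustrative reading of Eqs. 1-2: WSD scores decay by alpha**d at
# distance d along hypernym/hyponym chains. Names are placeholders.

def propagate(sense_scores, wordnet_neighbors, alpha=0.8, max_d=2):
    """sense_scores: {synset: WSD score} for the original query terms (d = 1).
    Returns weight(synset_{i,d}) for every synset within max_d jumps."""
    weights = dict(sense_scores)
    frontier = dict(sense_scores)
    for d in range(2, max_d + 1):
        reached = {}
        for syn, w in frontier.items():
            for nb in wordnet_neighbors(syn):
                cand = w * alpha ** d               # Eq. 1
                if cand > reached.get(nb, 0.0):
                    reached[nb] = cand
        for syn, w in reached.items():
            weights[syn] = max(weights.get(syn, 0.0), w)  # assumption: best path wins
        frontier = reached
    return weights

def final_term_weight(synset_weight, bo1_weight):
    return (synset_weight + bo1_weight) / 2.0       # Eq. 2

# Toy usage with a two-synset "WordNet":
toy = {"church.n.01": ["building.n.01"], "building.n.01": ["church.n.01"]}
print(propagate({"church.n.01": 0.7}, lambda s: toy.get(s, [])))
```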
2.2 Semantic classes

Our approach consists of mapping the assigned word senses to semantic classes, specifically to WordNet Domains labels and Basic Level Concepts.

WordNet Domains [8] is a hierarchy of 165 Domain Labels used to label all the WordNet synsets. The information brought by Domain Labels is complementary to what is already in WordNet. First, a Domain Label can include synsets of different syntactic categories: for instance, medicine groups together senses from nouns, such as doctor or hospital, and from verbs, such as to operate. Second, a Domain Label may also contain senses from different WordNet sub-hierarchies. For example, sport contains senses such as athlete, deriving from life form; game equipment, from physical object; sport, from act; and playing field, from location.

Basic Level Concepts (BLC) [6] are a set of concepts that result from the compromise between two conflicting principles of characterization:

– represent as many concepts as possible;
– represent as many features as possible.

As a result, Basic Level Concepts typically occur in the middle of hierarchies and have fewer than the maximum number of relations. The authors developed a method for the automatic selection of BLC from WordNet. They use a very simple method for deriving a small set of appropriate meanings using basic structural properties of WordNet. The approach considers:

– the total number of relations of each synset, or just its hyponymy relations;
– discarding those BLC that do not represent at least a minimum number of synsets;
– optionally, the frequency of the synsets (summing up the frequencies of the senses provided by WordNet).

The process of automatic selection of BLC follows a bottom-up approach using the chain of hypernym relations. For each synset in WordNet, the process selects as its Basic Level Concept the first local maximum according to the relative number of relations. For synsets with multiple hypernyms, the path containing the local maximum with the higher number of relations is selected. Usually, this process ends with a number of fake Basic Level Concepts, that is, synsets with no descendants (or only a few) that are nevertheless the first local maximum according to the number of relations considered. Thus, the process concludes by checking whether the number of concepts subsumed by the preliminary list of BLC is higher than a certain threshold. For those BLC not representing enough concepts according to the threshold, the process selects the next local maximum following the hypernym hierarchy. Depending on the type of relations counted and the threshold established, different sets of BLC can easily be obtained for each WordNet version. For our work, we selected the set of BLC built using all kinds of relations and a threshold of 20 as the minimum number of synsets that each BLC must subsume.

We now explain how documents and queries are represented with the semantic classes of the words they contain. In the task data, each ambiguous word is annotated with its possible senses, each with a certain probability. Starting from this information, we create a domain vector for a query or a document, containing all its semantic class information. The domain vector is a vector in which each element represents a WordNet Domain or a Basic Level Concept together with its associated weight. Note that there are 165 Domain Labels in WordNet Domains and 558 Basic Level Concepts for nouns. The vector is built as follows: each word is annotated with several senses and their associated probabilities; each word sense is mapped to its semantic class, and the element of the vector corresponding to that class is increased by the probability associated with the word sense. After processing all terms, we obtain a domain vector representing the semantic information of the document or query. Finally, to compare two documents, or a document and a query, and obtain their similarity in terms of semantic content, we use the cosine of the angle defined by the two domain vectors.
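A minimal sketch of the domain vector construction and cosine comparison just described, assuming a sense_to_class mapping from word senses to WordNet Domains or BLC labels and task-style (sense, probability) annotations; all names and data below are illustrative, not taken from the task resources.

```python
import math
from collections import defaultdict

def domain_vector(annotated_words, sense_to_class):
    vec = defaultdict(float)
    for senses in annotated_words:        # one (sense, prob) list per word
        for sense, prob in senses:
            label = sense_to_class.get(sense)
            if label is not None:
                vec[label] += prob        # accumulate the sense probability
    return dict(vec)

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Example: "church" with three WSD senses mapped to BLC-like labels.
doc = [[("church.n.1", 0.6), ("church.n.2", 0.3), ("church.n.3", 0.1)]]
classes = {"church.n.1": "building.n#1", "church.n.2": "faith.n#3",
           "church.n.3": "religious_ceremony.n#1"}
print(domain_vector(doc, classes))
# {'building.n#1': 0.6, 'faith.n#3': 0.3, 'religious_ceremony.n#1': 0.1}
```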
2.3 Integration of semantic classes in the robust ad hoc task

Once the final list of documents for the expanded query is retrieved, the Semantic re-ranking module rearranges this list taking into account both the similarity returned by the BM25 probabilistic model and the similarity calculated by the semantic class system. To do this, we studied the following five equations:

$semsim_1(i,j) = sim_{ij} \cdot sem_{ij}$    (3)

$semsim_2(i,j) = \begin{cases} simmax_i + sim_{ij} & \text{if } sem_{ij} > h \\ sim_{ij} & \text{otherwise} \end{cases}$    (4)

$semsim_3(i,j) = \begin{cases} (simmax_i + sim_{ij}) \cdot sem_{ij} & \text{if } sem_{ij} > h \\ sim_{ij} & \text{otherwise} \end{cases}$    (5)

$semsim_4(i,j) = \begin{cases} simmax_i + sim_{ij} \cdot sem_{ij} & \text{if } sem_{ij} > h \\ sim_{ij} & \text{otherwise} \end{cases}$    (6)

$semsim_5(i,j) = simmax_i \cdot sem_{ij} + sim_{ij}$    (7)

where $semsim_x(i,j)$ is the final similarity between query i and document j using method x, $sim_{ij}$ is the similarity of query i with respect to document j returned by the search engine, $sem_{ij}$ is the same similarity as returned by the semantic class system, $simmax_i$ is the greatest similarity value returned by the search engine for query i, and h is a constant which determines an empirically defined semantic similarity threshold.

As both similarity values ($sim_{ij}$ and $sem_{ij}$) are normalized, our first approach was equation 3. In this equation we simply multiply both values (the similarity returned by the search engine and the similarity obtained by the semantic class system) to obtain a new similarity value between query i and document j. With this equation, both values have the same importance.

Equation 4 was designed to place documents with a certain level of semantic relation to the query above the others. The greatest similarity is added to the similarity obtained by the search engine if, and only if, the semantic class similarity between the document and the query is greater than a threshold h; otherwise only the BM25 similarity is taken into account.

The next equation (5) is based on the previous one, except that the semantic similarity multiplies the sum of the BM25 similarity for document j and the greatest similarity. The main idea of this equation is to allow some of the less semantically related documents which exceed the threshold to be overtaken by those which do not exceed it.

Equation 6 was used to give more relevance to documents with high semantic similarity, while still incorporating the semantic class score into the final similarity value.

Finally, equation 7 multiplies the greatest similarity value by the document's semantic similarity and then adds this result to the search engine's document similarity. This equation tries to improve the search engine similarity value, using the semantic similarity as a reduction factor of the maximum similarity value.
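The five integration functions translate directly into code. The following sketch is an illustrative rendering of equations 3 to 7 together with a hypothetical rerank driver that is not part of the described system; variable names follow the notation above.

```python
# sim is the BM25 similarity sim_ij, sem the semantic (cosine) similarity
# sem_ij, simmax the best BM25 score for the query, h the threshold.

def semsim1(sim, sem, simmax, h):
    return sim * sem                                   # Eq. 3

def semsim2(sim, sem, simmax, h):
    return simmax + sim if sem > h else sim            # Eq. 4

def semsim3(sim, sem, simmax, h):
    return (simmax + sim) * sem if sem > h else sim    # Eq. 5

def semsim4(sim, sem, simmax, h):
    return simmax + sim * sem if sem > h else sim      # Eq. 6

def semsim5(sim, sem, simmax, h):
    return simmax * sem + sim                          # Eq. 7

def rerank(scored_docs, sem_scores, fn=semsim2, h=0.8):
    """Re-rank (doc, bm25_score) pairs with one of the functions above."""
    simmax = max(score for _, score in scored_docs)
    rescored = [(doc, fn(score, sem_scores.get(doc, 0.0), simmax, h))
                for doc, score in scored_docs]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)
```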
3 Evaluation

In this section we report the results of each of our proposals.

For the evaluation of the Expanded query terms weighter, we had to set the values of two variables: α and d (distance). In order to find the best values for these variables, we experimented with several settings. Table 1 presents two of the best results of those experiments.

Table 1. Evaluation of the Expanded query terms weighter module

                                      MAP    GMAP   R-Prec  P@5    P@10
BM25 + Bo1 (Baseline)                .3737  .1294  .3585   .4475  .3825
BM25 + Bo1 + WD (α = 0.8, d = 1)     .3706  .1423  .3624   .4500  .3750
BM25 + Bo1 + WD (α = 0.92, d = 6)    .3738  .1400  .3655   .4513  .3775

With α = 0.8 and d = 1 we improve the baseline GMAP by 9.97%. With α = 0.92 and d = 6 we improve both the baseline MAP, by 0.02%, and the baseline GMAP, by 8.19%. The results of BM25+Bo1+WD (α = 0.92, d = 6) correspond to our CLEF experiment named ali02wsd.

The results of our second proposal, the Semantic re-ranking module, depend on the function used to integrate the document weights given by the semantic classes and by the search engine. In addition, for some of the integration functions, the variable h (threshold) has to be set too. We experimented with those functions and different values for the threshold. For each integration function we obtained its best threshold value (when needed), as can be seen in Tables 2 and 3.

Table 2. Evaluation of the Semantic re-ranking module with WND and different integration functions

                                      MAP    GMAP   R-Prec  P@5    P@10
BM25 + Bo1 + WND + RR1               .3752  .1298  .3638   .4462  .3862
BM25 + Bo1 + WND + RR2 (h = 0.2)     .3737  .1294  .3585   .4475  .3825
BM25 + Bo1 + WND + RR3 (h = 1.0)     .3737  .1294  .3585   .4475  .3825
BM25 + Bo1 + WND + RR4 (h = 0.5)     .3752  .1298  .3638   .4462  .3862
BM25 + Bo1 + WND + RR5               .3746  .1296  .3592   .4463  .3856

As can be seen in Table 2, some of the integration functions produce identical results. In the case of RR2, RR3 and RR5, this occurs because the best results they can reach are the same as the baseline results; these integration functions do not improve the system. In the case of RR1 and RR4, their mathematical functions are the same, except that the second one can be affected by the threshold. Both of them obtain the best results with the WND model for all measures.

Table 3. Evaluation of the Semantic re-ranking module with BLC20 and different integration functions

                                      MAP    GMAP   R-Prec  P@5    P@10
BM25 + Bo1 + BLC20 + RR1             .3533  .1231  .3375   .4337  .3619
BM25 + Bo1 + BLC20 + RR2 (h = 0.8)   .3776  .1317  .3609   .4437  .3806
BM25 + Bo1 + BLC20 + RR3 (h = 0.8)   .3776  .1317  .3609   .4437  .3806
BM25 + Bo1 + BLC20 + RR4 (h = 0.8)   .3776  .1317  .3609   .4437  .3806
BM25 + Bo1 + BLC20 + RR5             .3375  .1170  .3229   .4213  .3625

Table 3 presents the results of the integration with the BLC20 model. The best results are obtained by the RR2, RR3 and RR4 integration functions, which are the functions that use a threshold. Note that the threshold giving the best results is the same for all of them; we can thus deduce that the threshold h = 0.8 is the best one for determining whether a semantic class is relevant or not.

For our final comparison, we chose the best integration function for each model (WND, BLC20), as shown in Table 4 and Figure 2. The results of BM25+Bo1+WND correspond to our CLEF experiment named ali01wnd.

Fig. 2. Results of MAP and GMAP for models WND and BLC20 and each integration function

Table 4. Evaluation of the Semantic re-ranking module

                          MAP    GMAP   R-Prec  P@5    P@10
BM25 + Bo1 (Baseline)    .3737  .1294  .3585   .4475  .3825
BM25 + Bo1 + WND         .3752  .1298  .3638   .4462  .3862
BM25 + Bo1 + BLC20       .3776  .1317  .3609   .4437  .3806

The integration of the semantic classes into the search engine improves the baseline results. With WND we improve the baseline MAP by 0.4% and the baseline GMAP by 0.31%.
With BLC20 we improve the baseline MAP by 0.64% and the baseline GMAP by 1.77%.

4 Conclusions

The results of the experiments with our two proposals demonstrate improvements over the initial IR system.

Regarding the Expanded query terms weighter module, we experimented with the weights of the terms in a probabilistic IR system. We applied a smoothing function, based on WordNet distance, to the weights given by the IR system. The experiments show GMAP improvements of nearly 10%, but no significant MAP improvements.

As future work we propose to continue the experiments on this module. For the weight propagation function (equation 1), the search for the best values of α and d can be made more exhaustive, finding better values for these variables. Moreover, new WordNet relations (not only hyponymy and hypernymy) can be explored in order to improve recall. New weight propagation functions can even be proposed to better exploit the concept of distance in WordNet.

Regarding our second proposal, the Semantic re-ranking module, we have integrated semantic classes into an IR system. We carried out this integration by recalculating the weights of the retrieved documents depending on the similarity between the semantic classes of each document and those of the query. The results of the experiments reveal that semantic class resources can be effectively integrated into IR systems. This module can also be taken further: we only used five simple functions to integrate the search engine and semantic class weights, and more functions can be studied with the purpose of finding the best way to exploit the available semantic class resources.

Acknowledgements

This paper has been partially supported by the following projects: "Question Answering Learning technologies in a multiLingual and Multimodal Environment (QALL-ME)" (FP6 IST-033860), "Intelligent, Interactive and Multilingual Text Mining based on Human Language Technologies (TEXT-MESS)" (TIN2006-15265-C06-01) and "Virtual Observatory for Technology Transfer". We also want to thank the support given by "El Taller Digital".

References

1. Gianni Amati and Cornelis Joost Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, 2002.
2. Claudio Carpineto, Renato de Mori, Giovanni Romano, and Brigitte Bigi. An information-theoretic approach to automatic query expansion. ACM Trans. Inf. Syst., 19(1):1–27, 2001.
3. Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, New York, NY, USA, 1991.
4. C. Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998.
5. R. Izquierdo, A. Suárez, and G. Rigau. An empirical study on class-based word sense disambiguation. In EACL, pages 389–397. Association for Computational Linguistics, 2009.
6. R. Izquierdo, A. Suárez, and G. Rigau. Exploring the automatic selection of basic level concepts. In Galia Angelova et al., editor, International Conference on Recent Advances in Natural Language Processing, pages 298–302, Borovets, Bulgaria, 2007.
7. Craig Macdonald, Ben He, Vassilis Plachouras, and Iadh Ounis. University of Glasgow at TREC 2005: Experiments in terabyte and enterprise tracks with Terrier. In TREC, 2005.
8. B. Magnini and G. Cavaglià. Integrating subject field codes into WordNet. In Proceedings of LREC, Athens, Greece, 2000.
9. I. Niles and A. Pease.
Towards a standard upper ontology. In Chris Welty and Barry Smith, editors, Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19, 2001.
10. José R. Pérez-Agüera and Hugo Zaragoza. UCM-Y!R at CLEF 2008 Robust and WSD tasks. In CLEF, 2008.
11. Vassilis Plachouras, Ben He, and Iadh Ounis. University of Glasgow at TREC 2004: Experiments in web, robust, and terabyte tracks with Terrier. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST), 2004.
12. Sameer Pradhan, Edward Loper, Dmitriy Dligach, and Martha Palmer. SemEval-2007 task 17: English lexical sample, SRL and all words. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 87–92, Prague, Czech Republic, June 2007. Association for Computational Linguistics.
13. S. E. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 232–241, New York, NY, USA, 1994. Springer-Verlag New York, Inc.
14. R. Snow, S. Prakash, D. Jurafsky, and A. Ng. Learning to merge word senses. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 1005–1014, 2007.
15. B. Snyder and M. Palmer. The English all-words task. In Rada Mihalcea and Phil Edmonds, editors, Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pages 41–43, Barcelona, Spain, July 2004. Association for Computational Linguistics.