<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enriched Page Rank for Multilingual Word Sense Disambiguation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Diego De Cao</string-name>
          <email>decao@info.uniroma2.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Basili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Luciani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Mesiano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Riccardo Rossi</string-name>
          <email>ricc.rossi@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Computer Science, University of Roma Tor Vergata</institution>
          ,
          <addr-line>Roma</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Word Sense Disambiguation (WSD) is a hard challenge, especially for languages other than English. Porting supervised models, which are the state of the art for English, to other languages is too expensive. Unsupervised or semi-supervised WSD models are therefore much more applicable to different languages. Graph-based methods have recently been applied to linguistic knowledge bases, including for unsupervised WSD. Although the achievable accuracy is rather high, the quality of the involved resources is de facto a crucial success factor. This paper presents an adaptation of the PageRank algorithm proposed for WSD that uses distributional information. This solution aims to preserve, for a foreign language, the accuracy achievable for English. An experimental analysis for Italian on standard benchmarks is presented to support our hypothesis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Lexical ambiguity is a fundamental aspect of natural language. Word Sense
Disambiguation (WSD) investigates methods to automatically determine the intended sense
of a word in a given context according to a predefined set of sense definitions, provided
by a semantic lexicon. Intuitively, WSD can be usefully exploited in a variety of NLP
and Information Retrieval tasks such as ad hoc retrieval [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] or Question Answering
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. However, controversial results have often been obtained, for example in the study
on text classification reported in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The impact of WSD on IR tasks is still an open
issue and a large-scale assessment is needed. For this reason, unsupervised approaches to
inductive WSD are appealing.
      </p>
      <p>
        More recently, graph-based methods for knowledge-based WSD
have gained much attention in the NLP community [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5–7</xref>
        ]. In these methods, a graph
representation of senses (nodes) and relations (edges) is first built. In [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], a
comparative analysis of different graph-based models built on the PageRank model [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] over two
well-known WSD benchmarks is reported, with special emphasis on the resulting
computational efficiency. In particular, a variant called Personalized
PageRank (PPR) is proposed in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. This variant trades off the amount
of lexical information employed against the overall efficiency. In short, following the
ideas of Topic-sensitive PageRank [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], PPR suggests that a proper initialization
of the teleporting vector p suitably captures the context information needed to drive the
random surfer PageRank model over the graph to converge towards the proper senses
in fewer steps. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] we presented a model that extends PPR through distributional
evidence, improving the overall PPR performance on English. In this paper
we discuss the applicability of this extension of the PPR algorithm to Italian.
      </p>
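      <p>The role of the teleporting vector can be illustrated with a minimal sketch of PageRank by power iteration in which the teleport mass is concentrated on context-related nodes. This is only an illustration under stated assumptions, not the implementation of [7] or [10]: the function name, the toy sense identifiers, the damping value 0.85 and the handling of dangling nodes are all hypothetical choices.</p>
      <preformat><![CDATA[
```python
def personalized_pagerank(graph, teleport, damping=0.85, iterations=25):
    """Power iteration for PageRank with a non-uniform teleport vector.

    graph: dict mapping each node to the list of nodes it links to
           (every target must also be a key of the dict).
    teleport: dict mapping nodes to teleport mass; normalised to sum to 1.
    damping and iterations are illustrative defaults (assumptions).
    """
    nodes = list(graph)
    total = sum(teleport.get(n, 0.0) for n in nodes)
    p = {n: teleport.get(n, 0.0) / total for n in nodes}
    rank = dict(p)  # start from the teleport distribution
    for _ in range(iterations):
        # Teleport step: mass (1 - damping) is redistributed according to p.
        new_rank = {n: (1.0 - damping) * p[n] for n in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                # Walk step: spread the damped rank evenly over out-links.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: return its mass through the teleport vector.
                for n in nodes:
                    new_rank[n] += damping * rank[node] * p[n]
        rank = new_rank
    return rank


# Toy sense graph with hypothetical identifiers: two senses of "bank".
toy_graph = {
    "bank#finance": ["money#1"],
    "bank#river": ["water#1"],
    "money#1": ["bank#finance"],
    "water#1": ["bank#river"],
}
# The context contains "money", so all teleport mass goes to its sense.
ranks = personalized_pagerank(toy_graph, {"money#1": 1.0})
best = max(["bank#finance", "bank#river"], key=ranks.get)
print(best)
```
]]></preformat>
      <p>In this toy graph the rank mass never reaches the river-related component, so the financial sense is selected. A uniform teleport vector would instead give both senses the same score.</p>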
      <p>The key idea is to exploit an externally acquired semantic space to expand the
incoming sentence <inline-formula><tex-math>\sigma</tex-math></inline-formula> into a set of novel terms, different from but semantically related to the
words in <inline-formula><tex-math>\sigma</tex-math></inline-formula>. In analogy with topic-driven PageRank, the use of these words as a seed for
the iterative algorithm is expected to amplify the effect of local information (i.e. <inline-formula><tex-math>\sigma</tex-math></inline-formula>) on
the recursive propagation across the lexical network: the interplay of the global
information provided by the whole lexical network with the local information characterizing
the initialization lexicon is expected to maximize their independent effect.</p>
      <p>
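        More formally, let <inline-formula><tex-math>W_k := U_k S_k</tex-math></inline-formula> be the matrix that represents the
lexicon in the k-dimensional Latent Semantic Analysis (LSA) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] space. Given an input
sentence <inline-formula><tex-math>\sigma</tex-math></inline-formula>, a vector representation <inline-formula><tex-math>\vec{w}_i</tex-math></inline-formula> for each term <inline-formula><tex-math>w_i</tex-math></inline-formula> in <inline-formula><tex-math>\sigma</tex-math></inline-formula> is made available. The
corresponding representation of the sentence can thus be computed as the linear
combination, weighted by the original tf-idf scores, of the corresponding <inline-formula><tex-math>\vec{w}_i</tex-math></inline-formula>: this always provides
a unique representation <inline-formula><tex-math>\vec{\sigma}</tex-math></inline-formula> for the sentence. <inline-formula><tex-math>\vec{\sigma}</tex-math></inline-formula> locates the sentence in the LSA space,
and the set of terms that are semantically related to the sentence can easily be found
in its neighborhood. A lower bound can be imposed on the cosine similarity scores
over the vocabulary to compute the lexical expansion of <inline-formula><tex-math>\sigma</tex-math></inline-formula>, i.e. the set of terms that are
sufficiently similar to <inline-formula><tex-math>\vec{\sigma}</tex-math></inline-formula> in the k-dimensional space. Let D be the vocabulary of all terms;
we define the lexical expansion <inline-formula><tex-math>T(\sigma) \subset D</tex-math></inline-formula> of <inline-formula><tex-math>\vec{\sigma}</tex-math></inline-formula> as follows: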
      </p>
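      <p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>T(\sigma) = \{ w_j \in D : \mathrm{sim}(\vec{w}_j, \vec{\sigma}) &gt; \tau \}</tex-math>
        </disp-formula>
        where <inline-formula><tex-math>\tau</tex-math></inline-formula> represents a real-valued threshold in the interval [0, 1). In order to improve
precision, it is also possible to impose a limit on the cardinality of <inline-formula><tex-math>T(\sigma)</tex-math></inline-formula> and discard the terms
characterized by lower similarity scores.</p>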
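      <p>The expansion step can be sketched as follows, assuming the LSA term vectors are already available as plain lists of floats. The function names, the two-dimensional toy vectors and the tf-idf weights below are hypothetical and only serve to illustrate the thresholding on cosine similarity and the optional limit on the number of expansion terms.</p>
      <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_vector(sentence_weights, term_vectors):
    """Linear combination of term vectors weighted by tf-idf scores."""
    dim = len(next(iter(term_vectors.values())))
    vec = [0.0] * dim
    for term, weight in sentence_weights.items():
        if term in term_vectors:
            for i, x in enumerate(term_vectors[term]):
                vec[i] += weight * x
    return vec


def lexical_expansion(sentence_weights, term_vectors, tau, max_terms=None):
    """Terms whose similarity with the sentence vector exceeds tau.

    Terms are returned by decreasing similarity; max_terms optionally
    caps the cardinality of the expansion.
    """
    sigma = sentence_vector(sentence_weights, term_vectors)
    scored = [(cosine(vec, sigma), term) for term, vec in term_vectors.items()]
    kept = sorted(((s, t) for s, t in scored if s > tau), reverse=True)
    if max_terms is not None:
        kept = kept[:max_terms]
    return [term for _, term in kept]


# Hypothetical 2-dimensional "LSA" vectors and tf-idf weights.
term_vectors = {
    "banca": [1.0, 0.0],
    "denaro": [0.9, 0.1],
    "fiume": [0.0, 1.0],
    "prestito": [0.8, 0.3],
}
sentence_weights = {"banca": 2.0, "denaro": 1.0}
expanded = lexical_expansion(sentence_weights, term_vectors, tau=0.5)
print(expanded)
```
]]></preformat>
      <p>Following the set definition literally, the sketch does not remove the words of the sentence itself from the expansion; whether they should be filtered out is left as a design choice.</p>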
      <p>Finally, the later steps of the PPR method remain unchanged, and PageRank
is run over the corresponding graph.</p>
    </sec>
    <sec id="sec-eval">
      <title>2 Empirical Evaluation</title>
      <p>
The evaluation of the proposed model focused on the applicability of the
Extended PPR to the Italian language. This is also done comparatively with the
state of the art of unsupervised systems over a consolidated benchmark for Italian,
Evalita 2007. For the distributional approach, the Italian Web as
Corpus<sup>1</sup> (about 1800K web pages) is used, with about 150k words. The corpus is
processed with the TreeTagger<sup>2</sup> to extract the part of speech of every word. Then a
dimensionality reduction factor of k = 100 is adopted to build the LSA space. For the
Italian language the ItalWordNet [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] resource is adopted. Two different approaches
have been employed for the process of Word Sense Disambiguation:
      </p>
      <list list-type="bullet">
        <list-item>
          <p>Sentence-based approach: the process of LSA expansion and disambiguation is
performed for every single sentence of the dataset.</p>
        </list-item>
        <list-item>
          <p>Document-based approach: the process of LSA expansion and disambiguation is
performed for every document of the dataset. In this approach we used a
”one sense per discourse” policy.</p>
        </list-item>
      </list>
      <p><sup>1</sup> http://wacky.sslmit.unibo.it/</p>
      <p><sup>2</sup> http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/</p>
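      <p>One possible reading of the "one sense per discourse" policy is sketched below: after a single ranking pass over a document, every occurrence of a lemma receives the same top-ranked sense. The function name, the data layout and the toy scores are hypothetical, and the actual policy used in the experiments may differ in its details.</p>
      <preformat><![CDATA[
```python
def assign_document_senses(tokens, candidates, ranks):
    """Give every occurrence of a lemma the same top-ranked sense.

    tokens: lemmas of the document, in order of occurrence.
    candidates: dict mapping each lemma to its candidate senses.
    ranks: dict mapping each sense to a score from one document-level run.
    Lemmas without candidate senses are labelled None.
    """
    best = {}
    for lemma in set(tokens):
        senses = candidates.get(lemma, [])
        if senses:
            best[lemma] = max(senses, key=lambda s: ranks.get(s, 0.0))
    return [(lemma, best.get(lemma)) for lemma in tokens]


# Hypothetical document with a repeated lemma and made-up sense scores.
tokens = ["banca", "fiume", "banca"]
candidates = {
    "banca": ["banca#finance", "banca#river"],
    "fiume": ["fiume#1"],
}
ranks = {"banca#finance": 0.1, "banca#river": 0.3, "fiume#1": 0.4}
labelled = assign_document_senses(tokens, candidates, ranks)
print(labelled)
```
]]></preformat>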
      <p>The Evalita ’07 all-words task<sup>3</sup> consists of about 4700 words to disambiguate. Due to the
new configuration of the distributional space obtained from the Italian corpus,
the damping factor, the number of iterations and the number of words for the LSA
expansion have been re-estimated. Table 1 reports the results for different parameter settings
of the sentence-based and document-based approaches over the Evalita ’07 test data. For each
parameter setting, the columns of the table show Precision, Recall and F-Measure for PPR and
PPRw2w respectively.</p>
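      <p>For reference, the three measures can be computed as in the sketch below, which assumes the usual all-words scoring convention: precision over the attempted instances, recall over all instances, and F-Measure as their harmonic mean. The function name and the counts in the example are hypothetical and are not results from the paper.</p>
      <preformat><![CDATA[
```python
def precision_recall_f1(correct, attempted, total):
    """Precision, recall and F1 from instance counts.

    correct: instances tagged with the right sense.
    attempted: instances for which the system returned a sense.
    total: all instances in the gold standard.
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts, only to show the arithmetic.
p, r, f = precision_recall_f1(correct=50, attempted=80, total=100)
print(p, r, f)
```
]]></preformat>
      <p>When a system answers every instance, attempted equals total and the three values coincide.</p>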
      <p>Table 1 (sentence-based panel). Columns: Parameters (WN, LSA, Iter.); PPR (Prec, Rec, F1); w2w (Prec, Rec, F1).</p>
      <p>
        We adopted fixed limits for the LSA expansion, testing values from 20 up to 1000 terms.
The good scores obtained in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] suggested that a number of
iterations lower than 30 is in general enough to reach good accuracy levels: 25 iterations,
instead of 30, have been judged adequate. Finally, on average, the lexical
items in the expanded sentence <inline-formula><tex-math>T(\sigma)</tex-math></inline-formula> include about 40% nouns, 30% verbs, 20%
adjectives and 10% adverbs. As a confirmation of the outcome in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], the
word-by-word model achieves better results. Interestingly, on almost every type of graph and
for every approach (sentence or word oriented) the LSA-based method outperforms the
original UKB.
      </p>
      <p><sup>3</sup> http://evalita.fbk.eu/</p>
      <p>Table 2 reports the Precision, Recall and F1 scores of the different systems as obtained
over the Evalita ’07 test data. The best F1 scores between any pair are emphasized in
bold, to allow a comparative assessment of the results. The results confirm that the topical
information provided by the LSA expansion of the sentence is beneficial for a better
use of the lexical graph. An even more interesting outcome is that the improvement
brought by the proposed LSA method to the sentence-oriented model (i.e. the standard
PPR method of [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]) is higher, so that the performance of the
PPRw2w model is no longer strikingly better than that of the PPR one. As shown in Table
2, our method outperforms the JIGSAW system by more than 9.87% in F-Measure.
Moreover, the good accuracy reachable by the document-based approach is also very
interesting given its higher time efficiency with respect to the sentence-based
one. As a matter of fact, with the document-based approach the system was run only sixteen
times, instead of the hundreds of times needed when the sentence-based approach is employed.
Furthermore, the lower execution times suggest the applicability of the system to different
Information Retrieval scenarios, such as Question Answering or Cross-Language
Information Retrieval (CLIR). CLIR is a challenging task, and the existence of aligned
lexical databases, such as MultiWordnet<sup>4</sup>, which is aligned to the English WordNet,
opens an interesting perspective of using word senses as anchors to search across different
languages.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Krovetz</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Homonymy and polysemy in information retrieval</article-title>
          .
          <source>In: Proceedings of the 35th ACL '09</source>
          . (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
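          2.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>S.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seo</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rim</surname>
            ,
            <given-names>H.C.</given-names>
          </string-name>
          :
          <article-title>Information retrieval using word senses: root sense tagging approach</article-title>
          .
          <source>In: SIGIR 2004</source>
          . (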
          <year>2004</year>
          )
          <fpage>258</fpage>
          -
          <lpage>265</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Beale</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lavoie</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>McShane</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nirenburg</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Korelsky</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Question answering using ontological semantics</article-title>
          .
          <source>In: TextMean '04</source>
          , ACL
          (
          <year>2004</year>
          )
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Moschitti</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Complex linguistic features for text classification: A comprehensive study</article-title>
          .
          <source>In: Proc. of the European Conf. on IR, ECIR</source>
          , New York, USA (
          <year>2004</year>
          )
          <fpage>181</fpage>
          -
          <lpage>196</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Sinha</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mihalcea</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Unsupervised graph-based word sense disambiguation using measures of word semantic similarity</article-title>
          .
          <source>In: IEEE ICSC 2007</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Navigli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lapata</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Graph connectivity measures for unsupervised word sense disambiguation</article-title>
          .
          <source>In: Proceedings of IJCAI'07</source>
          . (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Agirre</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soroa</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Personalizing pagerank for word sense disambiguation</article-title>
          .
          <source>In: Proceedings of the 12th conference of EACL '09</source>
          , Athens, Greece (March 30 - April 3,
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Brin</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Page</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>The anatomy of a large-scale hypertextual web search engine</article-title>
          .
          <source>Computer Networks and ISDN Systems</source>
          <volume>30</volume>
          (
          <issue>1-7</issue>
          ) (
          <year>1998</year>
          )
          <fpage>107</fpage>
          -
          <lpage>117</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Haveliwala</surname>
            ,
            <given-names>T.H.</given-names>
          </string-name>
          :
          <article-title>Topic-sensitive pagerank</article-title>
          .
          <source>In: Proc. of 11th Int. Conf. on World Wide Web</source>
          , New York, USA, ACM (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>De Cao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Basili</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luciani</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mesiano</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Robust and efficient page rank for word sense disambiguation</article-title>
          . In ACL, ed.:
          <source>Proceeding of TextGraphs-5</source>
          . (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Landauer</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge</article-title>
          .
          <source>Psychological Review</source>
          <volume>104</volume>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Roventini</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alonge</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bertagna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calzolari</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girardi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Magnini</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Marinelli</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Speranza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zampolli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ItalWordNet: Building a large semantic database for the automatic treatment of Italian</article-title>
          .
          <source>In: Linguistica Computazionale</source>
          , IEPI, Pisa-Roma. (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>