<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Word Sense Disambiguation and Distributional Models for Cross-Language Information Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cataldo Musto</string-name>
          <email>cataldomusto@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedelucio Narducci</string-name>
          <email>narducci@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>basilepp@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lops</string-name>
          <email>lops@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <email>degemmis@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>semeraro@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science - University of Bari "Aldo Moro"</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we deal with the problem of providing users with cross-language recommendations by comparing two different content-based techniques: the first one relies on a knowledge-based word sense disambiguation algorithm that uses MultiWordNet as sense inventory, while the second is based on the so-called distributional hypothesis and exploits a dimensionality reduction technique called Random Indexing in order to build language-independent user profiles. This paper summarizes the results already presented at the AI*IA 2011 conference [1].</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-language Information Filtering</kwd>
        <kwd>Word Sense Disambiguation</kwd>
        <kwd>Distributional Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays the amount of information we have to deal with is usually greater than the amount of information we can process in an effective way. In this context, Information Filtering (IF) systems are rapidly emerging, since they can adapt their behavior to individual users by learning their preferences and progressively removing non-relevant content. Specifically, the content-based filtering approach analyzes a set of documents (usually textual descriptions of items) and builds a model of user interests based on the features (usually keywords) that describe the items previously rated as relevant by an individual user. One relevant problem of content-based approaches is their strict dependence on the user language, since the information already stored in the user profile cannot be exploited to provide suggestions for items whose description is provided in other languages. In this paper we investigate whether it is possible to represent user profiles so as to create a mapping between preferences expressed in different languages. Specifically, we compare two approaches: the first one exploits a Word Sense Disambiguation (WSD) technique based on MultiWordNet, while the second one is based on distributional models. The latter assumes that in every language each term often co-occurs with the same other terms (expressed in different languages, of course); thus, by representing a content-based user profile in terms of the co-occurrences of its terms, user preferences become inherently independent of the language. The paper is organized as follows. Section 2 analyzes related work in the area of cross-language filtering and retrieval. An overview of the approaches is provided in Section 3. Experiments carried out in a movie recommendation scenario are described in Section 4. Conclusions and future work are drawn in the last section.</p>
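To make the content-based scheme concrete, a minimal sketch (illustrative only: real systems use term weighting such as TF-IDF and a learned model rather than raw counts) of building a keyword profile from relevant items and scoring a candidate description by cosine similarity:

```python
from collections import Counter
from math import sqrt

def build_profile(liked_docs):
    """Aggregate keyword counts over documents the user rated as relevant."""
    profile = Counter()
    for doc in liked_docs:
        profile.update(doc.lower().split())
    return profile

def cosine(profile, doc_text):
    """Cosine similarity between a keyword profile and a candidate description."""
    doc = Counter(doc_text.lower().split())
    dot = sum(profile[t] * doc[t] for t in doc)
    norm = sqrt(sum(v * v for v in profile.values())) * sqrt(sum(v * v for v in doc.values()))
    return dot / norm if norm else 0.0

liked = ["space adventure epic", "epic fantasy adventure"]
profile = build_profile(liked)
# An item sharing profile keywords scores higher than an unrelated one.
print(cosine(profile, "fantasy adventure saga") > cosine(profile, "romantic comedy"))
```

The language dependence discussed above is visible here: the profile is a bag of surface keywords, so it cannot match an Italian description of the same movie.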
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The Multilingual Information Filtering task at CLEF 2009 (http://www.clef-campaign.org/2009.html) introduced the issues related to cross-language representation in the area of Information Filtering. The use of distributional models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in the area of monolingual and multilingual Information Filtering is a relatively new topic. Recently, research on semantic vector space models has gained more and more attention: the Semantic Vectors (SV) package (http://code.google.com/p/semanticvectors/) implements a Random Indexing algorithm and defines a negation operator based on quantum logic. Some initial investigation of the effectiveness of SV for retrieval and filtering tasks is reported in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Description of the approaches</title>
      <p>
        Learning profiles through MultiWordNet. In this approach we can imagine a general architecture composed of three main components: the Content Analyzer obtains a language-independent document representation by using a Word Sense Disambiguation algorithm based on MultiWordNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Similarly to WordNet, the basic building block of MultiWordNet is the synset (SYNonym SET), a structure containing sets of words with synonymous meanings, which represents a specific meaning of a word. In MultiWordNet, for example, the Italian WordNet is aligned with the English one, so by processing textual descriptions of items in both languages, a language-independent representation in terms of MultiWordNet synsets is obtained. The generation of the cross-language user profile is performed by the Profile Learner, using a naïve Bayes text classifier, since each document has to be classified as interesting or not with respect to the user preferences. Finally, the Recommender exploits the cross-language user profiles to suggest relevant items by matching concepts contained in the semantic profile against those contained in the disambiguated documents.
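A minimal sketch of the Profile Learner step, assuming documents have already been disambiguated into MultiWordNet synset identifiers (the IDs and training data below are invented placeholders): a multinomial naïve Bayes classifier with Laplace smoothing labels each document as interesting ("pos") or not ("neg"); because descriptions in both languages reduce to the same synset vocabulary, one classifier serves English and Italian items alike.

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (synset_id_list, label) pairs, label in {'pos', 'neg'}."""
    counts = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for synsets, label in docs:
        counts[label].update(synsets)
        n_docs[label] += 1
    return counts, n_docs

def classify(counts, n_docs, synsets):
    """Return the most probable label under a multinomial naive Bayes model."""
    vocab = set(counts["pos"]) | set(counts["neg"])
    best, best_score = None, float("-inf")
    for label in ("pos", "neg"):
        total = sum(counts[label].values())
        score = log(n_docs[label] / sum(n_docs.values()))  # class prior
        for s in synsets:
            # Laplace smoothing so an unseen synset does not zero the score
            score += log((counts[label][s] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented synset IDs standing in for MultiWordNet identifiers.
train_set = [(["syn#100", "syn#200"], "pos"), (["syn#300"], "neg")]
counts, n_docs = train(train_set)
print(classify(counts, n_docs, ["syn#100"]))  # 'pos'
```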
Distributional Models. The second strategy used to represent item content in a semantic space relies on the distributional approach, which represents documents as vectors in a high-dimensional space, such as WordSpace [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The core idea behind WordSpace is that words and concepts (and documents, as well) are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space (the geometric metaphor of meaning). Therefore, semantic similarity between documents can be represented as proximity in an n-dimensional space. Since these techniques are expected to efficiently handle high-dimensional vectors, a common choice is to adopt dimensionality reduction, which allows for representing high-dimensional data in a lower-dimensional space without losing information. Random Indexing (RI) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] targets the problem of dimensionality reduction by removing the need for matrix decomposition or factorization, since it is based on the concept of Random Projection: the idea is that randomly chosen high-dimensional vectors are "nearly orthogonal". This yields a result comparable to that of orthogonalization methods, while saving computational resources. Given two corpora (one for language L1 and another for L2) we build two monolingual spaces SL1 and SL2 that share the same random base, following the procedure introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since both spaces share the same random base, it is possible to compare elements belonging to different spaces: for example, we can compute how similar a user profile in SL1 is to an item in SL2 (or vice versa). This property is used to provide recommendations.
      </p>
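The shared-random-base idea can be sketched as follows (an illustrative toy, not the actual procedure of [3]; the concepts and alignment are invented): each aligned concept gets one sparse ternary index vector, a context vector is the sum of the index vectors of the concepts it co-occurs with, and because both spaces reuse the same base, vectors built from either language are directly comparable with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO = 300, 10  # reduced dimension, nonzeros per index vector

def index_vector():
    """Sparse ternary random vector; such vectors are nearly orthogonal."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

# One shared random base, keyed by aligned concept (e.g. translation pairs).
base = {c: index_vector() for c in ["dog", "cat", "car"]}

def context_vector(cooccurring_concepts):
    """Sum the index vectors of the concepts a term co-occurs with."""
    return sum(base[c] for c in cooccurring_concepts)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cane" (IT) and "dog" (EN) share contexts, so their vectors align across
# the two spaces, while an unrelated concept stays far away.
v_en = context_vector(["dog", "cat"])   # built from the English corpus
v_it = context_vector(["dog", "cat"])   # built from the Italian corpus, same base
v_other = context_vector(["car"])
print(cosine(v_en, v_it) > cosine(v_en, v_other))  # True
```

The near-orthogonality of the random index vectors is what lets this scheme stand in for an explicit matrix factorization at a fraction of the cost.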
    </sec>
    <sec id="sec-4">
      <title>Experimental evaluation</title>
      <p>
        The goal of the experimental evaluation was to measure the predictive accuracy of both content-based multilingual recommendation approaches. We compared the language-independent user profiles represented through MultiWordNet synsets with the approaches based on the distributional hypothesis (W-SV) and Random Indexing (W-RI), already presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The experimental work was performed on a subset of the MovieLens dataset (http://www.grouplens.org) containing 40,717 ratings provided by 613 different users on 520 movies. The content information for each movie was crawled from both the English and Italian versions of Wikipedia. User profiles are learned by analyzing the ratings stored in the MovieLens dataset, while the effectiveness of the recommendation approaches has been evaluated by means of Precision@n (n = 5, 10). We designed four different experiments: in Exp#1 and Exp#2 we learned user profiles on movies with English (respectively, Italian) descriptions and recommended movies with Italian (respectively, English) descriptions, and we compared their accuracy with the classical monolingual baselines calculated in Exp#3 and Exp#4. Results of the experiments are reported in Table 1, averaged over all the users.</p>
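Precision@n, as used here, is simply the fraction of the top-n recommended items that the user rated as relevant; a minimal sketch (item IDs invented):

```python
def precision_at_n(ranked_items, relevant, n):
    """Fraction of the top-n ranked items that appear in the relevant set."""
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n

ranked = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
relevant = {"m1", "m3", "m4", "m8"}
print(precision_at_n(ranked, relevant, 5))   # 0.6  (3 of the top 5 are relevant)
print(precision_at_n(ranked, relevant, 10))  # 0.4  (4 of the top 10 are relevant)
```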
      <p>In general, the main outcome of the experimental session is that the strategy implemented for providing cross-language recommendations is quite effective for both approaches. Specifically, the approach based on the Bayesian classifier gained the best results in Precision@5. This means that the model has a higher capacity to rank the best items at the top of the recommendation list. On the other side, the absence of linguistic pre-processing is one of the strongest points of the approaches based on the distributional model, and the results gained by the W-SV and W-RI models in Precision@10 further underline the effectiveness of this model. In conclusion, both approaches gained good results. Even though in most of the experiments the cross-language recommendation approaches get worse results w.r.t. the monolingual ones, the difference in predictive accuracy does not appear statistically significant. In general, the Bayesian approach fits better in scenarios where the number of items to be represented is not too high, which can justify the application of the pre-processing steps required for building the MultiWordNet synset representation, while the distributional models, thanks to their simplicity and effectiveness, fit better in scenarios where real-time recommendations need to be provided.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper compared two approaches for providing cross-language recommendations. The key idea is to provide a bridge among different languages by exploiting a language-independent representation of documents and user profiles based on word meanings. Experiments were carried out in a movie recommendation scenario, and the main outcome is that the accuracy of cross-language recommendations is comparable to that of classical (monolingual) content-based recommendations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          , M. de Gemmis, and G. Semeraro, "
          <article-title>Cross-language information filtering: Word sense disambiguation vs. distributional models,"</article-title>
          <source>in AI*IA</source>
          ,
          <year>2011</year>
          , pp.
          <volume>250</volume>
          –
          <fpage>261</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          , "
          <article-title>The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces,"</article-title>
          <source>Ph.D. dissertation</source>
          , Stockholm University, Department of Linguistics,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          , "
          <article-title>Enhanced vector space models for content-based recommender systems,"</article-title>
          <source>in Proceedings of the fourth ACM conference on Recommender systems, ser. RecSys '10</source>
          . New York, NY, USA: ACM,
          <year>2010</year>
          , pp.
          <volume>361</volume>
          –
          <fpage>364</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Girardi</surname>
          </string-name>
          , "
          <article-title>MultiwordNet: developing an aligned multilingual database,"</article-title>
          <source>in Proc. of the 1st Int. WordNet Conference</source>
          , Mysore, India,
          <year>2002</year>
          , pp.
          <volume>293</volume>
          –
          <fpage>302</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>