<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Comparing Word Sense Disambiguation and Distributional Models for Cross-Language Information Filtering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Cataldo Musto</string-name>
          <email>cataldomusto@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fedelucio Narducci</string-name>
          <email>narducci@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>basilepp@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pasquale Lops</string-name>
          <email>lops@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco de Gemmis</string-name>
          <email>degemmis@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <email>semeraro@di.uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science - University of Bari "Aldo Moro"</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we deal with the problem of providing users with cross-language recommendations by comparing two different content-based techniques: the first one relies on a knowledge-based word sense disambiguation algorithm that uses MultiWordNet as sense inventory, while the second is based on the so-called distributional hypothesis and exploits a dimensionality reduction technique called Random Indexing in order to build language-independent user profiles. This paper summarizes the results already presented at the AI*IA 2011 conference [1].</p>
      </abstract>
      <kwd-group>
        <kwd>Cross-language Information Filtering</kwd>
        <kwd>Word Sense Disambiguation</kwd>
        <kwd>Distributional Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays the amount of information we have to deal with is usually greater than the amount of information we can process in an effective way. In this context, Information Filtering (IF) systems are rapidly emerging, since they can adapt their behavior to individual users by learning their preferences and progressively removing non-relevant content. Specifically, the content-based filtering approach analyzes a set of documents (usually textual descriptions of items) and builds a model of user interests based on the features (usually keywords) that describe the items previously rated as relevant by an individual user. One relevant problem of content-based approaches is their strict dependence on the user language, since the information already stored in the user profile cannot be exploited to provide suggestions for items whose description is provided in other languages. In this paper we investigate whether it is possible to represent user profiles so as to create a mapping between preferences expressed in different languages. Specifically, we compare two approaches: the first one exploits a Word Sense Disambiguation (WSD) technique based on MultiWordNet, while the second one is based on distributional models. The latter assumes that in every language each term often co-occurs with the same other terms (expressed in different languages, of course); thus, by representing a content-based user profile in terms of the co-occurrences of its terms, user preferences become inherently independent of the language. The paper is organized as follows. Section 2 analyzes related work in the area of cross-language filtering and retrieval. An overview of the approaches is provided in Section 3. Experiments carried out in a movie recommendation scenario are described in Section 4. Conclusions and future work are drawn in the last section.</p>
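To make the content-based scheme concrete, a minimal sketch (illustrative only: real systems use term weighting such as TF-IDF and a learned model rather than raw counts) of building a keyword profile from relevant items and scoring a candidate description by cosine similarity:

```python
from collections import Counter
from math import sqrt

def build_profile(liked_docs):
    """Aggregate keyword counts over documents the user rated as relevant."""
    profile = Counter()
    for doc in liked_docs:
        profile.update(doc.lower().split())
    return profile

def cosine(profile, doc_text):
    """Cosine similarity between a keyword profile and a candidate description."""
    doc = Counter(doc_text.lower().split())
    dot = sum(profile[t] * doc[t] for t in doc)
    norm = sqrt(sum(v * v for v in profile.values())) * sqrt(sum(v * v for v in doc.values()))
    return dot / norm if norm else 0.0

liked = ["space adventure epic", "epic fantasy adventure"]
profile = build_profile(liked)
# An item sharing profile keywords scores higher than an unrelated one.
print(cosine(profile, "fantasy adventure saga") > cosine(profile, "romantic comedy"))
```

The language dependence discussed above is visible here: the profile is a bag of surface keywords, so it cannot match an Italian description of the same movie.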
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        The Multilingual Information Filtering task at CLEF 2009 (http://www.clef-campaign.org/2009.html) introduced the issues related to cross-language representation in the area of Information Filtering. The use of distributional models [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in the area of monolingual and multilingual Information Filtering is a relatively new topic. Recently, research on semantic vector space models has gained more and more attention: the Semantic Vectors (SV) package (http://code.google.com/p/semanticvectors/) implements a Random Indexing algorithm and defines a negation operator based on quantum logic. Some initial investigation of the effectiveness of SV for retrieval and filtering tasks is reported in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Description of the approaches</title>
      <p>
        Learning profiles through MultiWordNet. In this approach we can imagine a general architecture composed of three main components: the Content Analyzer obtains a language-independent document representation by using a Word Sense Disambiguation algorithm based on MultiWordNet [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Similarly to WordNet, the basic building block of MultiWordNet is the synset (SYNonym SET), a structure containing sets of words with synonymous meanings, which represents a specific meaning of a word. In MultiWordNet, for example, the Italian WordNet is aligned with the English one, so by processing textual descriptions of items in both languages, a language-independent representation in terms of MultiWordNet synsets is obtained. The generation of the cross-language user profile is performed by the Profile Learner, using a naïve Bayes text classifier, since each document has to be classified as interesting or not with respect to the user preferences. Finally, the Recommender exploits the cross-language user profiles to suggest relevant items by matching concepts contained in the semantic profile against those contained in the disambiguated documents.
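A minimal sketch of the Profile Learner step, assuming documents have already been disambiguated into MultiWordNet synset identifiers (the IDs and training data below are invented placeholders): a multinomial naïve Bayes classifier with Laplace smoothing labels each document as interesting ("pos") or not ("neg"); because descriptions in both languages reduce to the same synset vocabulary, one classifier serves English and Italian items alike.

```python
from collections import Counter
from math import log

def train(docs):
    """docs: list of (synset_id_list, label) pairs, label in {'pos', 'neg'}."""
    counts = {"pos": Counter(), "neg": Counter()}
    n_docs = Counter()
    for synsets, label in docs:
        counts[label].update(synsets)
        n_docs[label] += 1
    return counts, n_docs

def classify(counts, n_docs, synsets):
    """Return the most probable label under a multinomial naive Bayes model."""
    vocab = set(counts["pos"]) | set(counts["neg"])
    best, best_score = None, float("-inf")
    for label in ("pos", "neg"):
        total = sum(counts[label].values())
        score = log(n_docs[label] / sum(n_docs.values()))  # class prior
        for s in synsets:
            # Laplace smoothing so an unseen synset does not zero the score
            score += log((counts[label][s] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

# Invented synset IDs standing in for MultiWordNet identifiers.
train_set = [(["syn#100", "syn#200"], "pos"), (["syn#300"], "neg")]
counts, n_docs = train(train_set)
print(classify(counts, n_docs, ["syn#100"]))  # 'pos'
```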
Distributional Models. The second strategy used to represent item content in a semantic space relies on the distributional approach, which represents documents as vectors in a high-dimensional space, such as WordSpace [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The core idea behind WordSpace is that words and concepts (and documents, as well) are represented by points in a mathematical space, and this representation is learned from text in such a way that concepts with similar or related meanings are near to one another in that space (the geometric metaphor of meaning). Therefore, semantic similarity between documents can be represented as proximity in an n-dimensional space. Since these techniques are expected to efficiently handle high-dimensional vectors, a common choice is to adopt dimensionality reduction, which allows for representing high-dimensional data in a lower-dimensional space without losing information. Random Indexing (RI) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] targets the problem of dimensionality reduction by removing the need for matrix decomposition or factorization, since it is based on the concept of Random Projection: the idea is that randomly chosen high-dimensional vectors are "nearly orthogonal". This yields a result comparable to that of orthogonalization methods, while saving computational resources. Given two corpora (one for language L1 and another for L2) we build two monolingual spaces SL1 and SL2 that share the same random base, following the procedure introduced in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Since both spaces share the same random base, it is possible to compare elements belonging to different spaces: for example, we can compute how similar a user profile in SL1 is to an item in SL2 (or vice versa). This property is used to provide recommendations.
      </p>
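The shared-random-base idea can be sketched as follows (an illustrative toy, not the actual procedure of [3]; the concepts and alignment are invented): each aligned concept gets one sparse ternary index vector, a context vector is the sum of the index vectors of the concepts it co-occurs with, and because both spaces reuse the same base, vectors built from either language are directly comparable with cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, NONZERO = 300, 10  # reduced dimension, nonzeros per index vector

def index_vector():
    """Sparse ternary random vector; such vectors are nearly orthogonal."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, NONZERO, replace=False)
    v[pos] = rng.choice([-1.0, 1.0], NONZERO)
    return v

# One shared random base, keyed by aligned concept (e.g. translation pairs).
base = {c: index_vector() for c in ["dog", "cat", "car"]}

def context_vector(cooccurring_concepts):
    """Sum the index vectors of the concepts a term co-occurs with."""
    return sum(base[c] for c in cooccurring_concepts)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cane" (IT) and "dog" (EN) share contexts, so their vectors align across
# the two spaces, while an unrelated concept stays far away.
v_en = context_vector(["dog", "cat"])   # built from the English corpus
v_it = context_vector(["dog", "cat"])   # built from the Italian corpus, same base
v_other = context_vector(["car"])
print(cosine(v_en, v_it) > cosine(v_en, v_other))  # True
```

The near-orthogonality of the random index vectors is what lets this scheme stand in for an explicit matrix factorization at a fraction of the cost.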
    </sec>
    <sec id="sec-4">
      <title>Experimental evaluation</title>
      <p>
        The goal of the experimental evaluation was to measure the predictive accuracy of both content-based multilingual recommendation approaches. We compared the language-independent user profiles represented through MultiWordNet synsets with the approaches based on the distributional hypothesis (W-SV) and Random Indexing (W-RI), already presented in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>The experimental work was performed on a subset of the MovieLens dataset (http://www.grouplens.org) containing 40,717 ratings provided by 613 different users on 520 movies. The content information for each movie was crawled from both the English and Italian versions of Wikipedia. User profiles are learned by analyzing the ratings stored in the MovieLens dataset, while the effectiveness of the recommendation approaches has been evaluated by means of Precision@n (n = 5, 10). We designed four different experiments: in Exp#1 and Exp#2 we learned user profiles on movies with English (respectively, Italian) descriptions and recommended movies with Italian (respectively, English) descriptions, and we compared their accuracy with the classical monolingual baselines calculated in Exp#3 and Exp#4. Results of the experiments are reported in Table 1, averaged over all the users.</p>
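Precision@n, as used here, is simply the fraction of the top-n recommended items that the user rated as relevant; a minimal sketch (item IDs invented):

```python
def precision_at_n(ranked_items, relevant, n):
    """Fraction of the top-n ranked items that appear in the relevant set."""
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n

ranked = ["m1", "m2", "m3", "m4", "m5", "m6", "m7", "m8", "m9", "m10"]
relevant = {"m1", "m3", "m4", "m8"}
print(precision_at_n(ranked, relevant, 5))   # 0.6  (3 of the top 5 are relevant)
print(precision_at_n(ranked, relevant, 10))  # 0.4  (4 of the top 10 are relevant)
```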
      <p>In general, the main outcome of the experimental session is that the strategy implemented for providing cross-language recommendations is quite effective for both approaches. Specifically, the approach based on the Bayesian classifier gained the best results in Precision@5. This means that the model has a higher capacity to rank the best items at the top of the recommendation list. On the other side, the absence of linguistic pre-processing is one of the strongest points of the approaches based on the distributional model, and the results gained by the W-SV and W-RI models in Precision@10 further underline the effectiveness of this model. In conclusion, both approaches gained good results. Even though in most of the experiments the cross-language recommendation approaches get worse results w.r.t. the monolingual ones, the difference in predictive accuracy does not appear statistically significant. In general, the Bayesian approach fits better in scenarios where the number of items to be represented is not too high, which can justify the application of the pre-processing steps required for building the MultiWordNet synset representation, while the distributional models, thanks to their simplicity and effectiveness, fit better in scenarios where real-time recommendations need to be provided.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>This paper compared two approaches for providing cross-language recommendations. The key idea is to provide a bridge among different languages by exploiting a language-independent representation of documents and user profiles based on word meanings. Experiments were carried out in a movie recommendation scenario, and the main outcome is that the accuracy of cross-language recommendations is comparable to that of classical (monolingual) content-based recommendations.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Narducci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lops</surname>
          </string-name>
          , M. de Gemmis, and G. Semeraro, "
          <article-title>Cross-language information filtering: Word sense disambiguation vs. distributional models,"</article-title>
          <source>in AI*IA</source>
          ,
          <year>2011</year>
          , pp.
          <volume>250</volume>
          –
          <fpage>261</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>M.</given-names>
            <surname>Sahlgren</surname>
          </string-name>
          , "
          <article-title>The Word-Space Model: Using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces,"</article-title>
          <source>Ph.D. dissertation</source>
          , Stockholm University, Department of Linguistics,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>C.</given-names>
            <surname>Musto</surname>
          </string-name>
          , "
          <article-title>Enhanced vector space models for content-based recommender systems,"</article-title>
          <source>in Proceedings of the fourth ACM conference on Recommender systems, ser. RecSys '10</source>
          . New York, NY, USA: ACM,
          <year>2010</year>
          , pp.
          <volume>361</volume>
          –
          <fpage>364</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>E.</given-names>
            <surname>Pianta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bentivogli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Girardi</surname>
          </string-name>
          , "
          <article-title>MultiwordNet: developing an aligned multilingual database,"</article-title>
          <source>in Proc. of the 1st Int. WordNet Conference</source>
          , Mysore, India,
          <year>2002</year>
          , pp.
          <volume>293</volume>
          –
          <fpage>302</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>