=Paper=
{{Paper
|id=None
|storemode=property
|title=
	      Random Indexing for Content-Based Recommender Systems
	  
|pdfUrl=https://ceur-ws.org/Vol-704/17.pdf
|volume=Vol-704
|dblpUrl=https://dblp.org/rec/conf/iir/MustoLGS11
}}
==
	      Random Indexing for Content-Based Recommender Systems
	  ==
<pdf width="1500px">https://ceur-ws.org/Vol-704/17.pdf</pdf>
<pre>
          Random Indexing for Content-based
               Recommender Systems

    Cataldo Musto, Pasquale Lops, Marco de Gemmis, Giovanni Semeraro

                       Department of Computer Science
                     University of Bari “Aldo Moro”, Italy
             {cataldomusto,lops,degemmis,semeraro}@di.uniba.it


      Abstract. The use of Vector Space Models (VSM) in the area of In-
      formation Retrieval is an established practice, thanks to its very clean
      and solid formalism that allows us to easily represent objects in a vector
      space and to perform calculations on them. The goal of this work is to
      investigate the impact of VSM on Recommender Systems (RS) perfor-
      mance. Specifically, we will introduce two approaches: the first is based
      on a dimensionality reduction technique called Random Indexing, while
      the second extends the previous one by integrating a negation operator
      implemented in the Semantic Vectors open-source package. The results
      of the experimental evaluation confirmed the predictive accuracy of the
      model. This work summarizes the results already presented in the RecSys
      2010 Doctoral Consortium.


1   Introduction
Recommender Systems (RS) are emerging as one of the most useful tools for
helping users effectively manage the surplus of information they have to
deal with. The goal of these systems is to gather information about a target
user and to exploit it in order to find the most relevant items for her.
Although the models underlying Information Filtering (IF) present strong
analogies with those of Information Retrieval (IR), the impact of IR-based
models in the area of IF has not yet been properly investigated. Since 1975,
the VSM [1] has emerged as one of the most effective approaches in the area
of IR, although it suffers from two important problems: the high
dimensionality of the vector space and the inability to manage negative
preferences. The main idea behind this work is to investigate the impact of
IR-based models on the area of IF by comparing their performance with that
of other content-based filtering models. We introduced the definition of
“enhanced vector space models” (eVSM) to describe models able to overcome
the classical VSM problems. Specifically, we exploited Random Indexing, an
incremental technique for dimensionality reduction, and a negation operator
based on quantum mechanics to model negative user preferences. This paper is
organized as follows: related work is described in Section 2, while Section 3
focuses on the description of both filtering models. The results of the
experimental evaluation are described in Section 4. Finally, future
directions of this research are sketched in Section 5.
2     Related Work

Many dimensionality reduction approaches, such as Latent Semantic Analysis
(LSA), have been proposed in order to improve the effectiveness and the
scalability of VSM. More recently, effective techniques for dimensionality
reduction such as Random Indexing (RI) [2] have emerged. The Semantic Vectors
(SV) package [3] extends the RI technique by introducing a negation operator
based on quantum mechanics.


3     eVSM for Content-based Recommender Systems

In our opinion, a VSM can be called enhanced if the whole vector space is built
in an incremental way and it is able to capture both the semantics of documents
and the information coming from negative evidence.
     In our approach we tackled the first two issues through the introduction of
RI, while the last one is managed by exploiting SV. RI is an efficient, scalable
and incremental technique for dimensionality reduction. Following this approach,
we can represent terms and documents as points in a vector space with a
considerable reduction of the features that describe them. RI is based on the
so-called distributional hypothesis, according to which “words that occur in the
same contexts tend to have similar meanings”. RI builds the “meaning” of a
term (its position in the vector space) in an incremental way, according to the
other terms it co-occurs with. Further details about the dimensionality reduction
process are contained in [4].
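
     As an illustration only, the incremental construction can be sketched in a
few lines of Python. The sketch assumes that each document acts as a context
with a sparse ternary index vector, that a term vector is the sum of the index
vectors of the documents the term occurs in, and that a document vector is the
sum of its term vectors. The function names, the dimension and the number of
non-zero entries are hypothetical choices, not the settings used in [4].

```python
import random
from collections import defaultdict


def index_vector(dim, nonzeros, rng):
    """Sparse ternary random vector: a few +1/-1 entries, the rest 0."""
    vec = [0] * dim
    for pos in rng.sample(range(dim), nonzeros):
        vec[pos] = rng.choice((-1, 1))
    return vec


def random_indexing(docs, dim=100, nonzeros=6, seed=0):
    """Return (term_vectors, doc_vectors) for a list of tokenized documents.

    Assumption: every document is one context. Each term vector accumulates
    the index vectors of the documents in which the term occurs.
    """
    rng = random.Random(seed)
    term_vectors = defaultdict(lambda: [0] * dim)
    for tokens in docs:
        ctx = index_vector(dim, nonzeros, rng)  # one random vector per context
        for term in tokens:
            vec = term_vectors[term]
            for k in range(dim):
                vec[k] += ctx[k]
    # A document is represented as the sum of the vectors of its terms.
    doc_vectors = []
    for tokens in docs:
        vec = [0] * dim
        for term in tokens:
            for k in range(dim):
                vec[k] += term_vectors[term][k]
        doc_vectors.append(vec)
    return dict(term_vectors), doc_vectors


# Toy corpus (invented for the example).
docs = [["space", "alien", "ship"], ["alien", "ship", "war"], ["love", "wedding"]]
terms, doc_vecs = random_indexing(docs)
```

In this toy corpus "alien" and "ship" occur in exactly the same documents, so
they receive the same vector, which is the distributional hypothesis in its
simplest form. Adding a new document only requires drawing one more index
vector and updating the vectors of its terms.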
     Through RI we can build low-dimensional vector spaces that maintain the
original expressivity of the model because, as stated by Johnson and
Lindenstrauss in their lemma [5], the distances between points in the space are
approximately preserved. However, these spaces still inherit a classic problem
of VSM: the information coming from negative evidence is not managed. In order
to tackle this issue we exploited the Semantic Vectors package¹, which
introduces a negation operator based on quantum mechanics. While in SV it is
used for retrieval tasks (i.e., to define queries that contain negative terms,
such as A not B), in our recommendation model two vectors are inferred, one
for positive preferences and one for negative ones. Specifically, the negation
operator is used to identify the subspace that contains the items as close as
possible to the positive preference vector and as far as possible from the
negative one.
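
     For reference, the orthogonal negation introduced in [3] for two vectors a
and b removes from a its component along b. How the SV package applies the
operator internally is not detailed here, so the formula below should be read
as the single-vector case from [3] rather than as a description of the package:

```latex
a \;\mathrm{NOT}\; b \;=\; a - \frac{a \cdot b}{b \cdot b}\, b ,
\qquad \text{so that} \qquad (a \;\mathrm{NOT}\; b) \cdot b = 0 .
```

The resulting vector is orthogonal to b, hence its cosine similarity with b is
zero.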
     To sum up, the main idea behind our filtering models is to build a vector
space in which both the items to be filtered and the user profiles are
represented as points. Next, calculations based on similarity measures between
vectors allow us to obtain the set of the most relevant items for the target
user, that is to say, the points in the space that are nearest to her profile.

¹ http://code.google.com/p/semanticvectors/
3.1   Random Indexing (RI) and Weighted RI (W-RI) Models

These approaches are based on the assumption that the information coming
from the items a user liked in the past can be a reliable source of information
to build accurate user profiles. Therefore, let d_1, d_2, ..., d_n ∈ D be a set
of already rated items, and r(u, d_i) the rating given by the user u to the
item d_i. We define I_u as the set of the items for user u whose rating is over
a fixed threshold. Intuitively, the user profile simply consists of the terms
occurring in the documents she liked in the past. Formally, let |I_u| be the
cardinality of the set I_u and let d_i be the vector space representation of
the document d_i; we can define the user profile p_u as follows:

                    p_u = \sum_{i=1}^{|I_u|} d_i                            (1)

    The main drawback of the RI model is that the user profile is built without
taking into account the ratings provided by the target user for the items she
liked. The second model, called Weighted Random Indexing (W-RI), enriches the
previous one by simply associating with each document vector, before combining
it, a weight equal to the rating provided by the user for that document.
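
    The two profile definitions, together with the cosine similarity used to
compare a profile with an item, can be sketched as follows. This is a minimal
example under stated assumptions: the function names are hypothetical, the item
vectors and ratings are invented toy data, and the W-RI weight is applied as a
scalar multiplication of each document vector by its rating.

```python
import math


def build_profile(item_vectors, ratings, threshold, weighted=False):
    """Sum the vectors of the items rated above `threshold` (RI profile).

    With weighted=True each vector is first multiplied by its rating (W-RI).
    """
    dim = len(next(iter(item_vectors.values())))
    profile = [0.0] * dim
    for item, rating in ratings.items():
        if rating > threshold:
            weight = rating if weighted else 1.0
            for k, value in enumerate(item_vectors[item]):
                profile[k] += weight * value
    return profile


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy data: two-dimensional item vectors and ratings on a 1-5 scale.
item_vectors = {"m1": [1.0, 0.0], "m2": [0.0, 1.0], "m3": [1.0, 1.0], "m4": [0.9, 0.1]}
ratings = {"m1": 5, "m2": 4, "m3": 1}  # m4 is unrated and is the candidate item

p_ri = build_profile(item_vectors, ratings, threshold=3)
p_wri = build_profile(item_vectors, ratings, threshold=3, weighted=True)
```

In this toy case the unrated item m4 resembles m1, the item with the highest
rating, so it is closer to the W-RI profile than to the unweighted RI profile.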


3.2   Semantic Vectors (SV) and Weighted SV (W-SV) Models

In the SV filtering model two user profile vectors are inferred, one for positive
preferences and one for negative ones. The set of positive items I_u^+ and the
positive user profile vector p_u^+ are identical to the set of positive items I_u
and the user profile p_u in RI, while the set of negative items, denoted by
I_u^-, is defined as the set of the items whose rating is under the threshold.
The negative user profile vector, denoted by p_u^-, is built by summing the
vector space representations of the items in I_u^-. Given the profile vectors
p_u^+ and p_u^-, we can instantiate the vector p_u^+ NOT p_u^-, which is
exploited to find the items represented in the vector space that contain as many
features as possible from the documents in I_u^+ and as few features as possible
from those in I_u^-. Like RI, the SV model has its weighted counterpart, called
W-SV. This model shares the same idea and the same weighting scheme as the W-RI
model, with the only difference that, when the negative profile is built from
I_u^-, the items with a lower rating are given higher weights in order to
exclude as much as possible the features disliked by the user.
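
The construction of the SV and W-SV profiles can be sketched as follows. The
sketch assumes that NOT is the orthogonal negation of [3] applied to the two
profile vectors, and that the W-SV weight of a disliked item is
max_rating + 1 - rating. The text only states that lower ratings receive higher
weights, so this particular formula, like the function names and the toy data,
is a hypothetical choice made for illustration.

```python
def negate(positive, negative):
    """Orthogonal negation: remove from `positive` its component along `negative`."""
    denom = sum(n * n for n in negative)
    if denom == 0:
        return list(positive)  # nothing to remove
    scale = sum(p * n for p, n in zip(positive, negative)) / denom
    return [p - scale * n for p, n in zip(positive, negative)]


def sv_profile(item_vectors, ratings, threshold, weighted=False, max_rating=5):
    """Build the positive and negative profiles and return positive NOT negative."""
    dim = len(next(iter(item_vectors.values())))
    pos, neg = [0.0] * dim, [0.0] * dim
    for item, rating in ratings.items():
        if rating > threshold:
            target, weight = pos, (rating if weighted else 1.0)
        elif rating < threshold:
            # Assumed W-SV weighting: the lower the rating, the larger the weight.
            target, weight = neg, ((max_rating + 1 - rating) if weighted else 1.0)
        else:
            continue  # items rated exactly at the threshold belong to neither set
        for k, value in enumerate(item_vectors[item]):
            target[k] += weight * value
    return negate(pos, neg)


# Toy data: m3 is disliked and shares its only feature with the liked item m2.
item_vectors = {"m1": [1.0, 0.0, 0.0], "m2": [1.0, 1.0, 0.0], "m3": [0.0, 1.0, 0.0]}
ratings = {"m1": 5, "m2": 4, "m3": 1}

p_sv = sv_profile(item_vectors, ratings, threshold=3)
p_wsv = sv_profile(item_vectors, ratings, threshold=3, weighted=True)
```

In both variants the feature carried by the disliked item m3 is removed from the
final profile, while the feature shared by the liked items is kept.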


4     Experimental Evaluation

The goal of the experimental evaluation was to measure the effectiveness of the
RI and SV models, as well as their weighted variants W-RI and W-SV, in terms
of predictive accuracy. Furthermore, we compared the behavior of these novel
approaches with the Bayesian filtering algorithm described in [6]. The
experimental session was carried out on a subset of the 100k MovieLens dataset.
By exploiting a simple cosine similarity measure we ranked the items with
respect to the user profile, assuming the nearest ones to be the most relevant.
The metric used to evaluate the effectiveness of the approaches was the Average
Precision@n. The results of the experimental evaluation are presented in
Table 1. We considered the results of the Bayesian classifier as the baseline
for our experiments, since this is the method currently implemented in our
recommender system. As shown in Table 1, the W-SV model obtained the best
results for every value of n, although its margin over the baseline is always
below half a point. A thorough description of the experimental session is
contained in [4].

        Table 1. Results of Average Precision@n on 100k MovieLens dataset

                     Metric     RI     W-RI   SV     W-SV   Bayes
                     AV-P@1     85.93  86.33  85.97  86.78  86.39
                     AV-P@5     85.75  86.10  85.99  86.16  85.83
                     AV-P@10    85.45  85.76  85.76  85.85  85.75


5   Conclusions and Future Directions
In this work we presented the first results of an initial investigation into
the impact of eVSM, such as the RI-based and SV-based ones, on Content-based
Recommender Systems. The main outcome of the experimental evaluation was
that these novel filtering models show an accuracy comparable to the one
obtained by other content-based filtering techniques such as Bayesian-based
RSs. Furthermore, the introduction of a negation operator, a totally novel
aspect for VSM, lets us manage the information about the disliked items and
their features. The results obtained with the W-SV model represent a promising
starting point for further investigations in this area.

References
1. G. Salton, A. Wong, and C. S. Yang, “A vector space model for automatic indexing,”
   Commun. ACM, vol. 18, no. 11, pp. 613–620, 1975.
2. M. Sahlgren, “An introduction to random indexing,” in Methods and Applications
   of Semantic Indexing Workshop, TKE 2005, 2005.
3. D. Widdows, “Orthogonal negation in vector spaces for modelling word-meanings
   and document retrieval,” in ACL, 2003, pp. 136–143.
4. C. Musto, “Enhanced vector space models for content-based recommender systems,”
   in Proceedings of the fourth ACM conference on Recommender systems, 2010,
   pp. 361–364.
5. W. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a
   Hilbert space,” Contemporary Mathematics, 1984.
6. P. Lops, M. de Gemmis, G. Semeraro, C. Musto, F. Narducci, and M. Bux, “A
   semantic content-based recommender system integrating folksonomies for person-
   alized access,” in Web Personalization in Intelligent Environment, G. Castellano,
   L. C. Jain, and A. M. Fanelli, Eds. Springer (Berlin), 2009, pp. 27–47.

</pre>