<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Recommender System Based on Random Walks and Text Retrieval Approaches</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Max Chevalier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Taoufiq Dkaki</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Damien Dudognon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Josiane Mothe</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institut de Recherche en Informatique de Toulouse, Université de Toulouse</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <fpage>95</fpage>
      <lpage>101</lpage>
      <abstract>
        <p>This paper presents the approaches IRIT developed for the VLNetChallenge regarding recommender systems in the context of video lectures. The first task aims at recommending newly acquired lectures after a user has viewed an “old” lecture. We use random walk algorithms on a graph composed of author, category, event, and lecture nodes and their relationships. The second task aims at recommending 10 lectures given three lectures extracted from a sequence of lectures. We use the categories associated with lectures in addition to the lecture pairs (lectures viewed in the same session).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        To begin with, we loaded the CSV data provided to participants into a PostgreSQL
database [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. For each lecture, we extracted its associated categories, events and authors.
      </p>
      <p>
        We also indexed lectures using the Solr search engine [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. We used as content the name,
description and slide_titles fields of each lecture. Indexing is based on a “bag of words” approach.
To build the Solr index, the stopwords were not removed and we did not use any stemming
heuristic such as the Porter stemmer [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Avoiding these pre-processing steps allows us to store all the
lectures in the same index, regardless of their language. The retrieval model used in Solr
combines Boolean Model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and Vector Space Model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. The documents are first selected by
Boolean Model and then are scored using Vector Space Model. The scoring function
implemented in Solr is derived from the VSM score, based on the Cosine similarity [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
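      <p>As an illustration, the two-stage retrieval described above (Boolean selection followed by Vector Space scoring) can be sketched as follows; the toy corpus and the simplified tf-idf weighting are our own assumptions, not Solr's exact scoring formula.</p>
      <preformat>
```python
import math
from collections import Counter

def tfidf(docs):
    # Simple tf-idf weighting (an assumption, not Lucene's exact formula).
    n, df = len(docs), Counter()
    toks = [d.lower().split() for d in docs]
    for t in toks:
        df.update(set(t))
    return [{w: c * math.log(1 + n / df[w]) for w, c in Counter(t).items()}
            for t in toks]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query, docs):
    vecs = tfidf(docs)
    q = {w: 1.0 for w in query.lower().split()}
    # Stage 1: Boolean selection -- keep documents sharing a query term.
    hits = [i for i, v in enumerate(vecs) if set(q) & set(v)]
    # Stage 2: rank the selected documents by cosine similarity.
    return sorted(hits, key=lambda i: cosine(q, vecs[i]), reverse=True)

docs = ["random walks on graphs",
        "graph based recommendation",
        "cooking pasta at home"]
print(search("graph walks", docs))
```
      </preformat>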
      <p>Solr was used in both tasks. In the cold start task, Solr was used to build two matrices that
reflect content-based lecture similarities. For the first one, we used the MoreLikeThis module from
Solr to compute the similarity between each pair of lectures. For a given document,
MoreLikeThis generates a query from the representative terms of the document. These
terms are selected according to several parameters: their length, their frequency in
the document and their frequency in the overall collection. The second matrix was built
differently: for each lecture, we computed its similarity with all the other lectures, using its
title as a query; recent lectures were favored.</p>
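      <p>A minimal sketch of the MoreLikeThis-style term selection described above; the thresholds (min_len, min_tf, max_df_ratio) are hypothetical parameters for illustration, not Solr's actual defaults.</p>
      <preformat>
```python
from collections import Counter

def representative_terms(doc_tokens, collection_df, n_docs,
                         min_len=3, min_tf=2, max_df_ratio=0.5, top_k=5):
    # Hypothetical thresholds mimicking MoreLikeThis-style term selection:
    # keep terms that are long enough, frequent in the document, and not
    # too frequent across the whole collection; rank by in-document tf.
    tf = Counter(doc_tokens)
    keep = [t for t in tf
            if len(t) >= min_len
            and tf[t] >= min_tf
            and max_df_ratio * n_docs >= collection_df.get(t, 0)]
    return sorted(keep, key=lambda t: tf[t], reverse=True)[:top_k]

df = {"random": 1, "walks": 1, "the": 10, "of": 10}  # toy document frequencies
doc = "random random walks walks walks the the the of".split()
print(representative_terms(doc, df, n_docs=10))
```
      </preformat>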
      <p>In the pooled sequences task, Solr was used to retrieve the lectures most similar to a given
lecture.</p>
    </sec>
    <sec id="sec-2">
      <title>Cold start task</title>
      <p>
        The cold start task aims at predicting “which of the newly acquired lectures at the site should
be recommended after viewing some of the 'older' lectures” [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>To complete this task, we first built a graph from the data in which nodes and relationships
are typed. In addition, we weighted some of the relationships. Then we applied two random-walk
models to compute document similarities and predict which new lectures should be
recommended. Section 3.1 explains how the graph is built and section 3.2 explains how it is used.</p>
      <sec id="sec-2-1">
        <title>Graph generation</title>
        <p>From the data, we built a graph G={N,R} where N is a set of nodes and R a set of
relationships between nodes.</p>
        <p>The set of nodes N is defined as N={A, C, E, L} where:
- A is a set of author nodes,
- C is a set of category nodes,
- E is a set of event nodes, and
- L is a set of lecture nodes.</p>
        <p>The set of relationships R is defined as
R={CR(ci,cj), ER(ei,ej), AR(li,aj), DR(li,cj), PR(li,ej), LR(li,lj)} where:
- CR is a relationship defined between two categories.</p>
        <p>CR(ci,cj) = 1 if categories ci and cj have a hierarchical relationship in the
database;</p>
        <p>= 0 otherwise.</p>
        <p>- ER is a relationship between two events. As for CR, ER(ei,ej) is either 0 or 1, based on
the hierarchical relationship defined between events ei and ej using the parent_id attribute.
- AR is a relationship between a lecture and an author.
- DR is a relationship between a lecture and a category.
- PR is a relationship between a lecture and an event.</p>
        <p>For those three relationships, when the items are associated in the data set, the relationship is
weighted 1; 0 otherwise.</p>
        <p>- LR is a relationship between two lectures. We defined two types of LR relationships: they
can either be content-based similarities or be deduced from pairs of lectures. Lecture pairs were
provided to participants; the deduced LR_P relationships were weighted considering the
frequency of each pair and the number of views associated with its related lectures. Lecture
similarities were calculated as described in section 2 and led to weighted LR_S relationships.
LR_P and LR_S relationships were fused using a linear combination:
LR(li,lj) = β · LR_P(li,lj) + γ · LR_S(li,lj)</p>
        <p>where li and lj are two lectures. In the experiments, β = 1.5 and γ = 0.05. These values were
obtained through manual tuning.</p>
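        <p>The fusion above can be sketched as follows; the mapping of β to LR_P and γ to LR_S follows the relative weights reported for these relationships, and the dictionary representation of the weight matrices is our assumption.</p>
        <preformat>
```python
def fuse_lr(lr_p, lr_s, beta=1.5, gamma=0.05):
    # LR(li,lj) = beta * LR_P(li,lj) + gamma * LR_S(li,lj)
    # (our reading of the paper's linear combination; beta and gamma are
    # the manually tuned values reported in the text).
    pairs = set(lr_p) | set(lr_s)
    return {p: beta * lr_p.get(p, 0.0) + gamma * lr_s.get(p, 0.0)
            for p in pairs}

lr_p = {("l1", "l2"): 2.0}                      # pair-based weights (toy)
lr_s = {("l1", "l2"): 10.0, ("l1", "l3"): 4.0}  # content-based weights (toy)
fused = fuse_lr(lr_p, lr_s)
print(fused)
```
        </preformat>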
        <p>Finally, each type of relationship receives a relative weight. For example, AR(li,aj)
receives a relative weight of 3 between li and aj if the lecture and the author are linked. Figure 1
depicts the various types of relationships that link nodes.</p>
        <p>[Figure 1: Relationship types and their relative weights. CR (Is Parent of, between categories): 0.1; DR (Describes, lecture to category): 0.35; LR_S (Is Similar to): 0.05; AR (Is Author of): 3; ER (Is Parent of, between events): 0.1; PR (Is Part of): 0.15; LR_P (Is Viewed with): 1.5]</p>
        <p>Katz-based algorithm. The similarity matrix K is computed as:
K = αA + α²A² + ⋯ + αⁿAⁿ + ⋯ = (I − αA)⁻¹ − I (1)
where A is the adjacency matrix, I the identity matrix and α a constant.</p>
        <p>A is the adjacency matrix generated from the complete graph (rows and columns of the matrix
are the nodes of the various types) and thus represents direct links between the graph’s nodes. Aⁿ
represents the indirect links through paths of length n. Both direct and indirect links are taken
into account, but a coefficient of attenuation is used: αⁿ represents the attenuation in importance
of the links of length n. K exists provided that the attenuation coefficient α is less than the
inverse of the spectral radius of A. In our experiments, we used α = 0.05. This value should have
been tuned, but we did not do so for lack of time.</p>
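        <p>A minimal sketch of this Katz-type similarity K = (I − αA)⁻¹ − I, assuming NumPy; the toy adjacency matrix is our own illustration, not the challenge graph.</p>
        <preformat>
```python
import numpy as np

def katz_similarity(A, alpha=0.05):
    # K = alpha*A + alpha^2*A^2 + ... = (I - alpha*A)^(-1) - I   (eq. 1)
    # The series converges when alpha stays below 1/spectral_radius(A).
    radius = max(abs(np.linalg.eigvals(A)))
    assert 1.0 > alpha * radius, "alpha too large for the series to converge"
    I = np.eye(A.shape[0])
    return np.linalg.inv(I - alpha * A) - I

# Toy 3-node path graph (an assumption for illustration).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
K = katz_similarity(A)
```
        </preformat>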
        <p>
          Random-forest based algorithm (RFA). In RFA, the similarity matrix S between the nodes
of a graph G is based on relative forest accessibility. Let F(G) = F be the set of all spanning
forests of graph G. A spanning forest is any subgraph of G that is cycle-free and includes every
vertex of G. For any two nodes i and j of G, Fij denotes the subset of F where i and j belong to
the same tree. The relative forest accessibility of i and j is defined as sij = ε(Fij)/ε(F), where ε is the
weight function defined in [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. For unweighted graphs, ε(Fij)/ε(F) = |Fij|/|F|.
[
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] demonstrates that (I + L)⁻¹ exists for any undirected weighted graph and that:
        </p>
        <p>S = (I + L)⁻¹ (2)
where L is the Laplacian matrix derived from the adjacency matrix A generated from the complete
graph G (see section 3.1).</p>
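        <p>Equation (2) translates directly to code; again a sketch assuming NumPy, with the same toy path graph as an assumption.</p>
        <preformat>
```python
import numpy as np

def rfa_similarity(A):
    # Relative forest accessibility matrix: S = (I + L)^(-1)   (eq. 2)
    # where L = D - A is the Laplacian of the weighted adjacency matrix A.
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.inv(np.eye(A.shape[0]) + L)

# Toy 3-node path graph (our assumption for illustration).
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
S = rfa_similarity(A)
```
        </preformat>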
        <p>
          RFA which can be seen as a rough Laplacian regularization is closely related to the similarity
measure associated to the pseudo-inverse of graphs Laplacian L+(see [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] for more details). L+ is
a valid kernel that preserves the Euclidian commute time distance in graphs. We did not
experiment the similarity measure based on L+ in the context of VLNetChallenge for lack of time to
solve a technical problem.
3.3
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Implementation issues</title>
        <p>All experiments were conducted on Linux computers with a 2.9 GHz Intel Core2 Duo
processor P9700 and 6 GB of RAM.</p>
        <p>
          The graphs we handled in the context of the VLNetChallenge contain around 15,000 nodes. The
approaches we explored therefore rely on inverting matrices (O(n³)) of size 15,000 × 15,000.
Our attempt to use Scilab [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] (with the memory stack set to the maximum) was unfruitful and
ended with a stack overflow error after more than 20 hours of running time. After shifting to
ATLAS [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], the Automatically Tuned Linear Algebra Software, the running time was about 20
minutes.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Results</title>
        <p>When considering the preliminary results on the training collection (based on 20% of the final
collection), our method obtained scores from 0.1434 to 0.22465, depending both on the random-walk
method used and on the weights used for the relationships. The best results were obtained
for RFA using the weights presented in Figure 1. These weights were obtained
through a rough manual tuning that used the entire training collection.</p>
        <p>When considering the final collection, our method ranked 9th out of 58 submissions with
neither nil results nor errors. We obtained a score of 0.24044, while the best result is 0.35857. Interestingly,
when considering the approaches that performed better than ours, we can see that their results decreased
from the preliminary evaluation to the final one. One hypothesis could be that those approaches
overfitted the test data.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Pooled sequences</title>
      <p>
        In this task participants “are asked to recommend a ranked list of ten lectures that should be
recommended after viewing a set of three lectures” [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ].
      </p>
      <p>
        To complete this task, we followed an empirical approach based on our knowledge, mainly
acquired in the Information Retrieval field. This knowledge has been transposed and adapted to
recommender systems. We tried two approaches that are related to the work we presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
one was based on lecture content only; the second one considered the categories associated with
lectures and lecture pair frequency.
      </p>
      <sec id="sec-3-1">
        <title>Content-based approach</title>
        <p>
          In this approach, we considered the lecture content only. We used the Solr search engine [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] as
explained in section 2. For each lecture of a given triplet, we searched for the 50 most similar
documents. Then we fused the three retrieved document lists using the CombSum function [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], which
sums each document’s individual scores.
        </p>
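        <p>The CombSum fusion described above can be sketched as follows; the score dictionaries and toy values are our assumptions.</p>
        <preformat>
```python
def comb_sum(result_lists):
    # CombSum: a document's fused score is the sum of its individual
    # scores across the result lists (a missing document scores 0).
    fused = {}
    for results in result_lists:
        for doc, score in results.items():
            fused[doc] = fused.get(doc, 0.0) + score
    return sorted(fused, key=fused.get, reverse=True)

lists = [{"a": 0.9, "b": 0.5},   # results for lecture 1 of the triplet
         {"b": 0.6, "c": 0.4},   # results for lecture 2
         {"c": 0.1}]             # results for lecture 3
print(comb_sum(lists))
```
        </preformat>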
        <p>When applied to the training collection, the results were slightly above 0.04. Indeed, when
analyzing the learning data set, we identified that users read lectures related to various topics to
complete their knowledge. This variety of topics cannot be captured by a standard content
similarity-based measure. For this reason, we did not continue with this content-only approach.
Building on the work done in the cold start task, we decided to study in particular lecture pair
frequency (the importance of LR_P in section 3.1) and categories.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Category-based approach</title>
        <p>Rather than considering the lecture content only, we concentrated on the categories of the
lectures. The first issue to solve was that many lectures were not associated with any
category. We first associated those lectures with a category by considering the hierarchy of
events. Once all lectures are associated with a category, we consider the lectures that have
been visited together with one of the lectures of the target triplet and that belong to close
categories in the category hierarchy.</p>
        <p>Association of categories to lectures. Some of the lectures are not associated with any
category; for those lectures, we applied two algorithms. First, for any lecture that is not in
categories_lectures, we browsed the lecture-event hierarchy bottom-up and associated
the current lecture with the category or categories associated with the closest event (considering the
hierarchy). When no such parent exists, we associated the category (or categories) of the
most similar lectures or events, based on content or description.</p>
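        <p>The bottom-up association can be sketched as follows; the three lookup tables are hypothetical representations of the database relations, not the challenge schema.</p>
        <preformat>
```python
def assign_categories(lecture, lecture_event, event_parent, event_cats):
    # Walk the event hierarchy bottom-up from the lecture's event and
    # return the categories of the closest ancestor event that has any.
    # (The shape of the three lookup tables is our assumption.)
    event = lecture_event.get(lecture)
    while event is not None:
        cats = event_cats.get(event)
        if cats:
            return cats
        event = event_parent.get(event)
    return []  # no categorised ancestor: fall back to content similarity

lecture_event = {"l1": "e3"}                    # lecture -> its event
event_parent = {"e3": "e2", "e2": "e1"}         # event -> parent event
event_cats = {"e1": ["computer_science"]}       # event -> categories
print(assign_categories("l1", lecture_event, event_parent, event_cats))
```
        </preformat>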
        <p>Frequency of lecture pairs. For each lecture li of the current triplet, we search for the 100
lectures most often viewed with it. We then calculate the score of each retrieved lecture lj (3):
its frequency times the similarity between its category and that of li.
Indeed, this category similarity allows the system to identify recommendations in close
sibling categories. In that way, we emphasize the selection of information in close categories in
order to simulate the user behavior we extracted from the training data
set analysis.</p>
        <p>Score(lj) = Frequency(lj) × similarity(category(lj), category(li)) (3)</p>
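        <p>Equation (3) can be sketched as follows; the candidate table and the category-similarity lookup are hypothetical structures (the paper derives its own category distance from earlier work [1]).</p>
        <preformat>
```python
def score_candidates(li_category, candidates, cat_sim, top_n=10):
    # Score(lj) = Frequency(lj) * similarity(category(lj), category(li)),
    # as in eq. (3). cat_sim is a hypothetical category-similarity lookup.
    scored = {lj: freq * cat_sim.get((cat, li_category), 0.0)
              for lj, (freq, cat) in candidates.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]

# Candidates: lecture -> (co-viewing frequency, category); toy data.
candidates = {"l1": (10, "ml"), "l2": (50, "cooking"), "l3": (5, "stats")}
cat_sim = {("ml", "ml"): 1.0, ("stats", "ml"): 0.8, ("cooking", "ml"): 0.1}
print(score_candidates("ml", candidates, cat_sim))
```
        </preformat>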
        <p>
          When a lecture has more than one category, we use the most general one. This treatment
is repeated for the three lectures of the triplet and the three lists are fused using CombSum. The
distance between categories is inspired by our previous work detailed in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>We then ranked the retrieved lectures by decreasing score. The recommendations are the top
10 lectures. With this method, it sometimes occurred that we obtained fewer than 10 recommendations. In
those cases, we add lectures to the recommended list as follows.</p>
        <p>Completing the recommended list. When fewer than 10 lectures are recommended using the
previous method, we complete the list by considering the lecture content rather than the lecture
visits. For each lecture, we search for the 10 most similar lectures; for each of these, we search
for the 100 lectures most often viewed with it and calculate the scores of the
lectures using the same method as previously. When this process still fails to complete the list, the list is
completed with the most visited lectures, based on the frequency of lecture views.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Results</title>
        <p>Considering the training set, our method obtained scores from 0.04453 to 0.18725
(depending on the approach used).</p>
        <p>Regarding the complete set, we ranked 12th with a score of 0.18943, the best score
being 0.62415.</p>
        <p>The results we obtained show that lecture visits have a great importance, more than the
content itself.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we describe the methods we developed for the two tasks defined in the
VLNetChallenge. With regard to the cold start task, our method was not over-trained. We tried various
values for the different parameters; a more systematic tuning could help improve the results.
With regard to the pooled sequences task, we identified that a content-only approach is not
sufficient. Furthermore, we think that categories could have been exploited more. For example, for a
given triplet, we could have kept only those retrieved lectures that share a category with a lecture
of the triplet.</p>
      <p>In the two tasks, we also identified the importance of the frequency of lecture pairs. As a
conclusion, we expect that combining various dimensions in recommender systems can improve
recommendation quality.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>G.</given-names>
            <surname>Cabanac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chrisment</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Julien</surname>
          </string-name>
          ,
          <article-title>An Original Usage-based Metrics for Building a Unified View of Corporate Documents</article-title>
          ,
          <source>International Conference on Database and Expert Systems Applications (DEXA)</source>
          ,
          <source>Springer-Verlag, LNCS 4653</source>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>212</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Candillier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chevalier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dudognon</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <article-title>Diversity in Recommender Systems: bridging the gap between users and systems</article-title>
          , Centrics, to appear
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>P. Y.</given-names>
            <surname>Chebotarev</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Shamis</surname>
          </string-name>
          .
          <article-title>The matrix-forest theorem and measuring relations in small social groups</article-title>
          .
          <source>Automation and Remote Control</source>
          , Vol.
          <volume>58</volume>
          , N°9, pp.
          <fpage>1505</fpage>
          -
          <lpage>1514</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>P. Y.</given-names>
            <surname>Chebotarev</surname>
          </string-name>
          and
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Shamis</surname>
          </string-name>
          ,
          <article-title>On proximity measures for graph vertices</article-title>
          ,
          <source>Automation and Remote Control</source>
          , Vol.
          <volume>59</volume>
          , N°
          <volume>10</volume>
          , pp.
          <fpage>1443</fpage>
          -
          <lpage>1459</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>F.</given-names>
            <surname>Fouss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pirotte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Renders</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Saerens</surname>
          </string-name>
          ,
          <article-title>Random-walk computation of similarities between nodes of a graph, with application to collaborative recommendation</article-title>
          ,
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>L.</given-names>
            <surname>Katz</surname>
          </string-name>
          .
          <article-title>A new status index derived from sociometric analysis</article-title>
          .
          <source>Psychometrika</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ), pp
          <fpage>39</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>1953</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>F. W.</given-names>
            <surname>Lancaster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. G.</given-names>
            <surname>Fayen</surname>
          </string-name>
          ,
          <source>Information Retrieval On-Line</source>
          , Melville Publishing Co., Los Angeles, California,
          <year>1973</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Lee</surname>
          </string-name>
          .
          <article-title>Analyses of multiple evidence combination</article-title>
          ,
          <source>Proceeding of SIGIR'97</source>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>276</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          ,
          <article-title>An algorithm for suffix stripping</article-title>
          ,
          <source>Program</source>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          and
          <string-name>
            <given-names>M. J.</given-names>
            <surname>McGill</surname>
          </string-name>
          ,
          <source>Introduction to Modern Information Retrieval</source>
          , McGraw-Hill
          ,
          <year>1983</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>A vector space model for automatic indexing</article-title>
          ,
          <source>Communications of the ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ), pp.
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. http://tunedit.org/challenge/VLNetChallenge/task_1,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. http://tunedit.org/challenge/VLNetChallenge/task_2,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. http://wiki.apache.org/solr/,
          <year>June 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15. http://www.postgresql.org/,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16. http://www.scilab.org,
          <year>August 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17. http://math-atlas.sourceforge.net,
          <year>August 2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>