<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Book Recommendation Using Information Retrieval Methods and Graph Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chahinez Benkoussas</string-name>
          <email>chahinez.benkoussas@lsis.org</email>
          <email>chahinez.benkoussas@openedition.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anaïs Ollagnier</string-name>
          <email>anais.ollagnier@lsis.org</email>
          <email>anais.ollagnier@openedition.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrice Bellot</string-name>
          <email>patrice.bellot@lsis.org</email>
          <email>patrice.bellot@openedition.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix-Marseille Université</institution>
          ,
          <addr-line>CNRS, CLEO OpenEdition UMS 3287, 13451, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Aix-Marseille Université</institution>
          ,
          <addr-line>CNRS, LSIS UMR 7296, 13397, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we present our contribution in INEX 2015 Social Book Search Track. This track aims to exploit social information (users reviews, ratings, etc. . . ) from LibraryThing and Amazon collections. We used traditional information retrieval models, namely, InL2 and the Sequential Dependence Model (SDM) and tested their combination. We integrated tools from natural language processing (NLP) and approaches based on graph analysis to improve the recommendation performances.</p>
      </abstract>
      <kwd-group>
        <kwd>InL2</kwd>
        <kwd>language model</kwd>
        <kwd>recommender system</kwd>
        <kwd>graph analysis</kwd>
        <kwd>natural language processing</kwd>
        <kwd>dependence analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        The Social Book Search (SBS) Track [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] was introduced by INEX in 2010 with
the purpose of evaluate approaches for supporting users in searching collections
of books based on book metadata and associated user-generated content. As in
2014, the track 2015 includes two tasks: the suggestion task and the interactive
task. As part of our work, we oriented on the suggestion task, which suggests a
list of the most relevant books according to the request provided by the user.
      </p>
      <p>SBS task builds on corpus of topics that consist of a set of 208 complex
queries expressed in natural language made by users of LibraryThing3 forums.
And a collection of 2.8 million of real books extracted from Amazon4 pages
extended by social metadata.</p>
      <p>In our contribution at SBS task, we tested several approaches. The first
consists of combining the output of retrieval systems. Secondly, we performed topic
representation by query expansion and recommend books using traditional
retrieval methodology. Finally, We tested a new approach for document retrieval</p>
      <sec id="sec-1-1">
        <title>3 https://www.librarything.com/</title>
      </sec>
      <sec id="sec-1-2">
        <title>4 http://www.amazon.com/</title>
        <p>based on graph analysis and exploit the PageRank algorithm for ranking
documents with respect to a user’s query. In the absence of manually-created
hyperlinks, we use social information to create the Directed Graph of Documents
(DGD) and argue that it can be treated in the same manner as hyperlink graphs.</p>
        <p>We submitted 6 runs in which we used the reviews, the ratings attributed to
books by Amazon users and PageRank for reranking.</p>
        <p>The rest of this paper is organized as follows. The following section describes
our retrieval frameworks. In section 3, we describe the submitted runs. Finally,
we present the obtained results in section 4.
2</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Retrieval Model</title>
      <p>This section presents brief description of retrieval models used for book
recommendation and our new approach based on graph modeling.
2.1</p>
      <p>
        InL2
We used InL2 model implemented in Terrier5. InL2 is DFR-based model
(Divergence From Randomness). The DFR models are based on this idea: "The more
the divergence of the within-document term-frequency from its frequency within
the collection, the more the information carried by the word t in the document
d" [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. InL2 signies Inverse Document Frequency model with Laplace after-effect
and normalization 2.
2.2
      </p>
      <p>
        Sequential Dependence Model
We used Metzler and Croft’s Markov Random Field (MRF) model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to
integrate multi word phrases in the query. Specially, we use the SDM (Sequential
Dependence Model), which is a special case of MRF. SDM builds upon this idea
by considering combinations of query terms with proximity constraints which
are: single term features (standard unigram language model features, fT ), exact
phrase features (words appearing in sequence, fO) and unordered window
features (require words to be close together, but not necessarily in an exact sequence
order, fU ).
      </p>
      <p>Finally, documents are ranked according to the following scoring function:
SDM (Q; D) = T</p>
      <p>X fT (q; D)+
q2Q</p>
      <p>jQj 1
+ O X fO(qi; qi + 1; D)
i=1
jQj 1
+ U X fU (qi; qi + 1; D)
i=1</p>
      <sec id="sec-2-1">
        <title>5 http://terrier.org/</title>
        <p>
          Where feature weights are set based on the author’s recommendation ( T =
0:85, O = 0:1, U = 0:05) in [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. fT , fO and fU are the log maximum likelihood
estimates of query terms in document D, computed over the target collection
using a Dirichlet smoothing. We applied this model to the queries using Indri6
Query Language7.
2.3
        </p>
        <p>
          Combination of Retrieval Systems outputs
Combining the output of many search system, in contrast to using just a single
retrieval technique, can improve the retrieval effectiveness as shown by Belkin
in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. He combined the results of probabilistic and vector space models. In
our work, we combined the results of InL2 model and SDM model. The retrieval
models use different weighting schemes therefore we should normalize the scores.
We used the maximum and minimum scores according to Lee’s formula [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] as
followed:
normalizedScore =
oldScore
maxScore
minScore
minScore
        </p>
        <p>
          We have shown in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] that InL2 and SDM models have different levels of
retrieval effectiveness, thus it is necessary to weight individual model scores
depending on their overall performance. We used an interpolation parameter ( )
varied to get the best interpolation that provides better retrieval effectiveness.
2.4
        </p>
        <p>
          Query Expansion With Dependence Analysis
Due to the nature of the proposed queries, namely, complex and long queries
expressed in natural language, we used a dependency parser to extract bigrams of
words syntactically dependent. We used for this purpose Stanford tool
Stanford Dependencies8 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. We performed this analysis on fields "title",
"mediated_query" and "narrative" of each topic and then extended the query with
resulting dependencies. We established two types of query expansions: first, with
all dependencies and second, with some selected dependencies that we have
retained during frequency study. Among selected dependencies we have identified
the names (8,4% dependencies present in queries) and dependencies composed
of prepositions defined, in linguistic, as words that establish a logical connection
between two words. Among the prepositions that we considered relevant we can
identify those based on of (4.63% dependencies), to (1.27% dependencies) and
about (1.12% dependencies). Figure 1 shows an example of a query after analysis.
6 http://www.lemurproject.org/indri/
7 http://www.lemurproject.org/lemur/IndriQueryLanguage.php
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>8 http://nlp.stanford.edu/software/stanford-dependencies.shtml</title>
        <p>We tested a new approach of book recommendation based on graphs. This
approach combines the results of retrieval methodology and graph analysis. To
construct the graph, we exploited a special type of similarity based on
several factors. This similarity is provided by Amazon and corresponds to “Similar
Products” given generally for each book. Amazon suggests books in “Similar
Products” field according to there similarity to book. The degree of similarity
depends of users’ social information like: clicks or purchases and content-based
information like book attributes (book description, book title, etc.). The exact
formula used by Amazon to combine social and content based information to
compute similarity is proprietary.</p>
        <p>To perform data modeling into Directed Graph of Documents (DGD), we
extracted the “Similar Products” links between documents. Once used it to enrich
results from the retrieval models, in the same spirit as pseudo-relevance-feedback.</p>
        <p>Nodes are connected with directed links, given nodes fA; Bg 2 S , if A
points to B, B is suggested as Similar Product to A. The DGD network contains
1:645:355 nodes (89:86% of nodes are in the collection and the rest does not
belong to it) and 6:582:258 relationships.</p>
        <p>Each node in the DGD represents document (Amazon description of book),
and has set of properties:
– ID: book’s ISBN;
– content : book description that include many other properties (title, product
description, author(s), users’ tags, content of reviews, etc.)
– M eanRating : average of ratings attributed to the book
– P R : value of book PageRank</p>
        <p>In this section we use some fixed notations. The collection of documents is
denoted by C. In C, each document d has a unique ID. The set of topics is
denoted by T , the set Dinit C refers to the documents returned by the initial
retrieval model. StartingN ode indicates document from Dinit which is used as
input to graph processing algorithms in the DGD. The set of documents present
in the graph is denoted by S. Dti indicates the documents retrieved for topic
ti 2 T .</p>
        <p>Algorithm 1 takes as inputs: Dinit returned list of documents for each topic
by the retrieval techniques described in Section 3, DGD network and
parameter which is the number of the top selected StartingN ode from Dinit
denoted by DStartingNodes. We fixed to 100 (10% of the returned list for each
topic). The algorithm returns list of recommendations for each topic denoted by
“Dfinal”. It processes topic by topic, and extracts the list of all neighbors for
each StartingN ode. It performs mutual Shortest Paths computation between all
selected StartingN ode in DGD. The two lists (neighbors and nodes in computed
Shortest Paths) are concatenated after that all duplicated nodes are deleted. The
set of documents in returned list is denoted by “ Dgraph”. After that, documents
in Dgraph are added to the initial list of documents (all duplications are deleted),
a new final list of retrieved documents is obtained, “ Dfinal” and reranked using
different reranking schemes described in the next section.</p>
        <p>Algorithm 1 Retrieving based on DGD feedback
1: Dinit Retrieving Documents for each ti 2 T
2: for each Dti 2 Dinit do
3: DStartingNodes first documents 2 Dti
4: for each StartingN ode in DStartingNodes do
5: Dgraph Dgraph + neighbors(StartingN ode; DGD)
6: DSP nodes all D 2 ShortestP ath(StartingN ode; DStartingNodes; DGD)
7: Dgraph Dgraph + DSP nodes
8: Delete all duplications from Dgraph
9: Dfinal Dfinal + (Dti + Dgraph)
10: Delete all duplications from Dfinal
11: Rerank Dfinal
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Runs</title>
      <p>We submitted 6 runs for the SBS Task. We used 2 frameworks that implement
retrieval models : Indri9 and Terrier (TERabyte RetrIEveR). Porter stemmer
and Bayesian smoothing with Dirichlet priors (Dirichlet prior = 1500) are
used in our experiments. We performed a preprocessing step to convert Inex
SBS corpus into Trec Collection Format10, we consider that the content of all
tags in each XML file is important for indexing; therefore we take the whole XML
file as one document identified by its ISBN. Thus, we just need two tags instead
of all tags in XML, the ISBN and the whole content (named text) following this
format:</p>
      <sec id="sec-3-1">
        <title>9 http://www.lemurproject.org/ 10 http://lab.hypotheses.org/1129</title>
        <p>Inex SBS corpus is composed of 2.8 million documents, distributed in 1100
folders, we generate for each folder only one Trec formatted file which contains
all xml files in this folder. In fact this processing is necessary for improving the
execution time of Terrier indexing process.</p>
        <p>We described previously our approach based on DGD. We used NetworkX11
tool of Python to handle the graph processing. In all runs, the index is built on
all fields in the book xml files</p>
        <p>INL2_fulldep This run is based on InL2 model, for each topic we use
mediated_query, group, narrative tags and we extended the topics by all
syntactic dependencies.</p>
        <p>INL2_SelectDep This run is based on InL2 model, we extended the topics
by selected dependencies presented in section 2.4.</p>
        <p>INL2_fdep_SDM We used first, the extended topics with syntactic
dependencies and performed retrieving using INL2. Secondly, we concatenate
the title and mediated_query fields and then perform retrieving with SDM
model. Scores of each retrieval system are normalized and combined using
interpolation parameter = 0:8.</p>
        <p>INL2_SDM_Graph In this run we used the result of INL2 and SDM
outputs combination. We selected the 100 top returned documents as
starting points in the DGD. Then we performed neighbors computing for each
starting point and collected all nodes in Shortest paths between the starting
points. We integrated the resulting nodes in the first returned list of INL2
and SDM combination. To rerank the final list, we weighted the combined
score of retrieval systems (INL2 and SDM) with PageRank value for each
book.</p>
        <p>INL2_fdep_Graph In this run we used the extended topics with syntactic
dependencies and INL2 retrieval model. We selected the 100 top returned
documents as starting points in the DGD. Then we performed neighbors
computing for each starting point and collected all nodes in Shortest paths
between the starting points. We integrated the resulting nodes in the first
returned list by INL2. To rerank the final list, we weighted the score of
retrieval system (INL2) with PageRank value for each book.
11 https://networkx.github.io/</p>
        <p>INL2_Gph_SimJac In this run we used the extended topics with
syntactic dependencies and INL2 retrieval model. We selected the 100 top returned
documents as starting points in the DGD. Then we performed neighbors
computing for each starting point and collected all nodes in Shortest paths
between the starting points. We integrated the resulting nodes in the first
returned list by INL2. To rerank the final list, we weighted the score of
retrieval system (INL2) with value of Jaccard similarity computed between
“title + description” of book and “title + narrative” of topic.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>In this paper we presented our contribution for the INEX 2014 Social Book
Search Track. In the 6 submitted runs, we tested 2 retrieval models (SDM for
MRF and InL2 for DFR) and their combination, and used social links present
in “SimilarProducts” field to construct a Directed Graph of Document (DGD)
and use it to enrich recommendation list returned by IR systems. We performed
also dependences analysis for query expansion. We showed that combining INL2
and SDM models increases retrieval effectiveness, and integrating result of graph
analysis in retrieval process enhances the performances.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Nicholas</surname>
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Belkin</surname>
            ,
            <given-names>Paul B.</given-names>
          </string-name>
          <string-name>
            <surname>Kantor</surname>
          </string-name>
          , Edward A.
          <string-name>
            <surname>Fox</surname>
          </string-name>
          , and
          <string-name>
            <surname>Joseph</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Shaw.</surname>
          </string-name>
          <article-title>Combining the evidence of multiple query representations for information retrieval</article-title>
          .
          <source>Inf</source>
          . Process. Manage.,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <fpage>431</fpage>
          -
          <lpage>448</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Chahinez</given-names>
            <surname>Benkoussas</surname>
          </string-name>
          , Hussam Hamdan, Shereen Albitar, Anaïs Ollagnier, and
          <string-name>
            <given-names>Patrice</given-names>
            <surname>Bellot</surname>
          </string-name>
          .
          <article-title>Collaborative filtering for book recommandation</article-title>
          .
          <source>In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18</source>
          ,
          <year>2014</year>
          ., pages
          <fpage>501</fpage>
          -
          <lpage>507</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Ludovic</given-names>
            <surname>Bonnefoy</surname>
          </string-name>
          , Romain Deveaud, and
          <string-name>
            <given-names>Patrice</given-names>
            <surname>Bellot</surname>
          </string-name>
          .
          <article-title>Do social information help book search</article-title>
          ? In Pamela Forner, Jussi Karlgren, and
          <string-name>
            <surname>Christa</surname>
          </string-name>
          Womser-Hacker, editors,
          <source>CLEF (Online Working Notes/Labs/Workshop)</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Marie-Catherine de Marneffe</surname>
          </string-name>
          , Bill MacCartney, and
          <string-name>
            <surname>Christopher</surname>
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Manning</surname>
          </string-name>
          .
          <article-title>Generating typed dependency parses from phrase structure trees</article-title>
          .
          <source>In LREC</source>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Gabriella</given-names>
            <surname>Kazai</surname>
          </string-name>
          , Marijn Koolen, Jaap Kamps, Antoine Doucet, and
          <string-name>
            <given-names>Monica</given-names>
            <surname>Landoni</surname>
          </string-name>
          .
          <article-title>Overview of the inex 2010 book track: Scaling up the evaluation using crowdsourcing</article-title>
          .
          <source>In Shlomo Geva</source>
          , Jaap Kamps, Ralf Schenkel, and Andrew Trotman, editors,
          <source>INEX</source>
          , volume
          <volume>6932</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>98</fpage>
          -
          <lpage>117</lpage>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Joon</given-names>
            <surname>Ho</surname>
          </string-name>
          <article-title>Lee</article-title>
          .
          <article-title>Combining multiple evidence from different properties of weighting schemes</article-title>
          .
          <source>In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95</source>
          , pages
          <fpage>180</fpage>
          -
          <lpage>188</lpage>
          , New York, NY, USA,
          <year>1995</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Donald</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <article-title>A markov random field model for term dependencies</article-title>
          . In Ricardo A.
          <string-name>
            <surname>Baeza-Yates</surname>
          </string-name>
          , Nivio Ziviani, Gary Marchionini, Alistair Moffat, and John Tait, editors,
          <source>SIGIR</source>
          , pages
          <fpage>472</fpage>
          -
          <lpage>479</lpage>
          . ACM,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stephen</surname>
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>C. J. van Rijsbergen</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>and Martin F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <article-title>Probabilistic models of indexing and searching</article-title>
          .
          <source>In SIGIR</source>
          , pages
          <fpage>35</fpage>
          -
          <lpage>56</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>