Cross-Document Search Engine For Book Recommendation

Chahinez Benkoussas
Aix-Marseille Université, CNRS, LSIS UMR 7296, 13397 Marseille, France
Aix-Marseille Université, CNRS, CLEO OpenEdition UMS 3287, 13451 Marseille, France
chahinez.benkoussas@lsis.org
chahinez.benkoussas@openedition.org

Patrice Bellot
Aix-Marseille Université, CNRS, LSIS UMR 7296, 13397 Marseille, France
Aix-Marseille Université, CNRS, CLEO OpenEdition UMS 3287, 13451 Marseille, France
patrice.bellot@lsis.org
patrice.bellot@openedition.org

ABSTRACT
A new combination of multiple Information Retrieval approaches is proposed for book recommendation based on complex users' queries. We used different theoretical retrieval models, probabilistic ones such as InL2 (a Divergence From Randomness model) and language models, and tested their interpolated combination. We also considered the application of a graph-based algorithm in a new retrieval approach over a network of related documents built from social links. We call a Directed Graph of Documents (DGD) the network constructed from documents and the social information attached to each of them. Specifically, this work tackles the problem of book recommendation in the context of CLEF Labs, more precisely the Social Book Search track. We established a specific search strategy after separating the query set into two genres, "Analogue" and "Non-Analogue", following an analysis of users' needs. A series of reranking experiments demonstrates that combining retrieval models and exploiting linked documents during retrieval yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to new environments.

Keywords
Document retrieval, InL2, language model, book recommendation, PageRank, graph modeling, Social Book Search.

CBRecSys 2015, September 20, 2015, Vienna, Austria.
Copyright remains with the authors and/or original copyright holders.

1. INTRODUCTION
There has been much work, both in industry and academia, on developing new approaches to improve the performance of retrieval and recommendation systems over the last decade. The aim is to help users deal with information overload and to provide recommendations for books, restaurants or movies. Some vendors, such as Amazon, have incorporated recommendation capabilities into their commerce services.

Existing document retrieval approaches need to be improved to satisfy users' information needs. Most systems use classic information retrieval models, such as language models or probabilistic models. Language models have been applied with a high degree of success in information retrieval applications [29-31]. They were first introduced to IR by Ponte and Croft [27], who proposed a method for scoring documents, called query likelihood, that works in two steps: estimate a language model for each document, then rank documents according to the likelihood scores produced by the estimated language models. The Markov Random Field model, proposed by Metzler and Croft [19], considers query term proximity in documents by estimating term dependencies in the context of the language modeling approach. Alternatively, the Divergence From Randomness model, proposed by Amati and Van Rijsbergen [2], measures the global informativeness of a term in the document collection. It is based on the idea that "the more the term occurrences diverge from random throughout the collection, the more informative the term is" [28]. One limitation of such models is that the distance between query terms in documents is not considered.

Users' queries differ in the type of need they express. In book recommendation, we identified two genres of queries, "Analogue" and "Non-Analogue", which we describe in the following sections. In this paper, the first proposed approach combines probabilistic and language models to improve retrieval performance, and we show that the two models perform much better together in the context of book recommendation.

In recent years, an important innovation in information retrieval has been the exploitation of relationships between documents, e.g. Google's PageRank [25]. It has been successful in Web environments, where the relationships are provided by hyperlinks between documents. We present a new approach for linking documents into a graph structure that is then used in the retrieval process. In this approach, we exploit the PageRank algorithm to rank documents with respect to users' queries. In the absence of manually created hyperlinks, we use social information to create a Directed Graph of Documents (DGD) and argue that it can be treated in the same manner as hyperlink graphs. Our experiments show that incorporating graph analysis algorithms into document retrieval improves performance in terms of standard ranked retrieval metrics.

Our work focuses on search in the book recommendation domain, in the context of the CLEF Labs Social Book Search track. We tested our approaches on a collection containing Amazon/LibraryThing book descriptions and a set of queries, called topics, extracted from the LibraryThing discussion forums.

2. RELATED WORK
This work is first related to the area of document retrieval models, more specifically language models and probabilistic models. Unigram language models are the ones most often used for ad hoc Information Retrieval, but several researchers have explored the use of language modeling for capturing higher-order dependencies between terms. Bouchard and Nie [8] showed significant improvements in retrieval effectiveness with a new statistical language model for the query, based on completing the query with terms from the user's domain of interest, reordering the retrieval results, or expanding the query using lexical relations extracted from the user's domain of interest.

Divergence From Randomness (DFR) is one of several probabilistic models that we have used in our work. Abolhassani and Fuhr [1] investigated several possibilities for applying Amati's DFR model [2] to content-only search in XML documents.

There has been increasing use of techniques based on graphs constructed from implicit relationships between documents. Kurland and Lee [14] performed structural reranking based on centrality measures in a graph of documents generated from relationships induced by language models. In [16], Lin demonstrates the possibility of exploiting document networks defined by automatically generated content-similarity links for document retrieval in the absence of explicit hyperlinks. He integrates PageRank scores with the standard retrieval score and shows a significant improvement in ranked retrieval performance. His work focused on search in the biomedical domain, in the context of the PubMed search engine. Perhaps the main contrast with our work is that our links are not induced by generation probabilities or linguistic items.

3. INEX SOCIAL BOOK SEARCH TRACK AND TEST COLLECTION
The Social Book Search (SBS) task (http://social-book-search.humanities.uva.nl/) aims to evaluate the value of professional and user-generated metadata for book search on the Web. The main goal is to exploit search techniques to deal with complex information needs and complex information sources that include user profiles, personal catalogs, and book descriptions.

The SBS task provides a collection of 2.8 million book descriptions crawled by the University of Duisburg-Essen from Amazon (http://www.amazon.com/) [4] and enriched with content from LibraryThing (http://www.librarything.com/), an online service that helps people catalog their books easily. Books are stored in XML files and identified by their ISBN. They contain information such as title information, the Dewey Decimal Classification (DDC) code (for 61% of the books), category, and the Amazon product description. Amazon records also contain social information generated by users, such as tags, reviews and ratings (see Figure 1). For each book, Amazon suggests a set of "Similar Products", the result of a similarity computed from content information and user behavior (purchases, likes, reviews, etc.) [13].

Figure 1: Example of a book from the Amazon/LibraryThing collection in XML format

The SBS task also provides a set of queries, called topics, in which users describe what they are looking for (books of a particular genre, books by particular authors, books similar to those already read, etc.). These requests for recommendations are natural expressions of information needs over a large collection of online book records. The topics are crawled from the LibraryThing discussion forums.

The topic set consists of 680 topics in 2014. Each topic has a narrative description of the information need, along with other fields, as illustrated in Figure 2.
Figure 2: Example of a topic, composed of multiple fields describing the user's need(s)

4. RETRIEVAL MODELS
This section describes the retrieval models we used for book recommendation, and their combination.

4.1 InL2 of Divergence From Randomness
We used InL2, the Inverse document frequency model with Laplace after-effect and normalization 2. This model has been used with success in several works [3, 6, 10, 26]. InL2 is a DFR-based model (Divergence From Randomness) built on the geometric distribution and the Laplace law of succession.

4.2 Sequential Dependence Model of Markov Random Field
Language models are widely used in document retrieval for book recommendation [6, 7]. Metzler and Croft proposed the Markov Random Field (MRF) model [18, 20], which integrates multi-word phrases into the query. Specifically, we used the Sequential Dependence Model (SDM), a special case of MRF in which the co-occurrence of query terms is taken into consideration. SDM builds on this idea by considering combinations of query terms with proximity constraints: single term features (standard unigram language model features, fT), exact phrase features (words appearing in sequence, fO) and unordered window features (words required to be close together, but not necessarily in an exact sequence order, fU).

Finally, documents are ranked according to the following scoring function:

    SDM(Q, D) = λT Σ_{q∈Q} fT(q, D)
              + λO Σ_{i=1}^{|Q|−1} fO(qi, qi+1, D)
              + λU Σ_{i=1}^{|Q|−1} fU(qi, qi+1, D)

where the feature weights are set following the authors' recommendation in [7] (λT = 0.85, λO = 0.1, λU = 0.05). fT, fO and fU are the log maximum likelihood estimates of query terms in document D, computed over the target collection using Dirichlet smoothing. We applied this model to the queries using the Indri Query Language (http://www.lemurproject.org/indri/, http://www.lemurproject.org/lemur/IndriQueryLanguage.php).
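Concretely, an SDM query for a two-term topic such as "french cooking" can be written in the Indri Query Language as follows; the λ weights mirror the values above, while the window operators (#1 for exact phrases, #uw8 for unordered windows of size 8) are common SDM settings assumed here for illustration rather than reported in the paper:

    #weight( 0.85 #combine( french cooking )
             0.10 #combine( #1(french cooking) )
             0.05 #combine( #uw8(french cooking) ) )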
4.3 Combining Search Systems
Combining the output of several search systems, in contrast to using just a single one, improves retrieval effectiveness, as shown in [5], where Belkin combined the results of probabilistic and vector space models. On the basis of this approach, we combined the probabilistic model InL2 with the language model SDM. This combination takes into account both the informativeness of query terms and their dependencies in the document collection. Each retrieval model uses a different weighting scheme, so the scores must be normalized. We used the maximum and minimum scores according to Lee's formula [15]:

    normalizedScore = (oldScore − minScore) / (maxScore − minScore)

It has been shown in [6] that the InL2 and SDM models have different levels of retrieval effectiveness; it is therefore necessary to weight the individual model scores according to their overall performance. We used an interpolation parameter (α) that we varied to improve retrieval effectiveness.
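A minimal Python sketch of this fusion step, assuming each run is a dict mapping document IDs to raw scores (the run data below is invented for illustration; α = 0.8 is the value retained in Section 6.2):

    def normalize(scores):
        # Lee's min-max normalization over one topic's score list.
        lo, hi = min(scores.values()), max(scores.values())
        return {d: (s - lo) / (hi - lo) if hi > lo else 0.0
                for d, s in scores.items()}

    def interpolate(inl2_run, sdm_run, alpha=0.8):
        inl2, sdm = normalize(inl2_run), normalize(sdm_run)
        # Documents missing from one run contribute 0 from that run.
        return {d: alpha * inl2.get(d, 0.0) + (1 - alpha) * sdm.get(d, 0.0)
                for d in set(inl2) | set(sdm)}

    fused = interpolate({"isbn1": 12.3, "isbn2": 10.1},
                        {"isbn1": -4.2, "isbn2": -4.9})
    ranking = sorted(fused, key=fused.get, reverse=True)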
5. GRAPH MODELING
In [17], the author exploited networks defined by automatically generated content-similarity links for document retrieval. We analyzed our documents to find a new way of linking them. In our case, we exploited a special type of similarity based on several factors. This similarity is provided by Amazon and corresponds to the "Similar Products" generally suggested for each book. The degree of similarity depends on social information (such as the numbers of clicks or purchases) and on content-based information such as book attributes (book description, book title, etc.). The exact formula used by Amazon to combine social and content-based information into a similarity is proprietary. The idea behind this linking method is that documents connected by this type of similarity are more likely to belong to the same context than documents that are not connected.

To model the data as a DGD, we extracted the "Similar Products" links between documents in order to construct the graph structure, and then used it to enrich the results returned by the retrieval models, in the same spirit as pseudo-relevance feedback. Each node in the DGD represents a document (the Amazon description of a book) and has a set of properties:

• ID: the book's ISBN
• content: the book description, which includes many other properties (title, product description, author(s), users' tags, content of reviews, etc.)
• MeanRating: the average of the ratings attributed to the book
• PR: the book's PageRank

Edges in the DGD are directed and correspond to Amazon similarity: given nodes {A, B} ∈ S, if A points to B, then B is suggested as a Similar Product of A. Figure 3 shows an example of a DGD network of documents. The DGD network contains 1,645,355 nodes (89.86% of the nodes are within the collection, the rest are outside it) and 6,582,258 edges.

Figure 3: Example of a Directed Graph of Documents
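As an illustration, a graph of this kind can be assembled with NetworkX, the Python library we use in Section 6.1; the book records below are simplified stand-ins for the Amazon/LibraryThing XML described in Section 3, not the paper's actual parsing code:

    import networkx as nx

    # Hypothetical, simplified book records.
    books = [
        {"isbn": "0001", "mean_rating": 4.2, "similar": ["0002", "0003"]},
        {"isbn": "0002", "mean_rating": 3.8, "similar": ["0001"]},
        {"isbn": "0003", "mean_rating": 4.5, "similar": []},
    ]

    dgd = nx.DiGraph()
    for b in books:
        dgd.add_node(b["isbn"], MeanRating=b["mean_rating"])
    for b in books:
        for sim in b["similar"]:
            # Edge A -> B: B is suggested as a "Similar Product" of A.
            dgd.add_edge(b["isbn"], sim)

    # PR property: PageRank of each node (damping factor 0.85, see Section 6.2).
    for isbn, pr in nx.pagerank(dgd, alpha=0.85).items():
        dgd.nodes[isbn]["PR"] = pr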
Figure 4 shows the general architecture of our document retrieval system, with its two-level document search. In this system, the IR Engine finds all relevant documents for the user's query. Then, the Graph Search module selects the resulting documents returned by the Graph Analysis module. The Graph Structured Data is a network constructed using the Social Information Matrix and enriched by the Compute PageRank module. The Social Information Matrix is built by two modules, "Ratings" and "Similar Products" Extraction, from the Data Collection that contains the book descriptions in XML format. The Scoring/Ranking module combines the scores of documents produced by the IR Engine and Graph Analysis modules and reranks them.

Figure 4: Architecture of the document retrieval approach based on a graph of documents

In this section, the collection of documents is denoted by C. In C, each document d has a unique ID. The set of queries, called topics, is denoted by T, and the set Dinit ⊂ C refers to the documents returned by the initial retrieval model. StartingNode identifies a document from Dinit used as input to the graph processing algorithms in the DGD. The set of documents present in the graph is denoted by S, and Dti denotes the documents retrieved for topic ti ∈ T.

5.1 Our Approach
The DGD network contains useful information about documents that can be exploited for document retrieval. Our approach relies first on the results of a traditional retrieval engine, and then on the DGD network to find new documents. The underlying assumption is that the suggestions given by Amazon can be relevant to the user's query.

Algorithm 1 takes as input Dinit, the list of documents returned for each topic by the retrieval techniques described in Section 4, the DGD network, and a parameter β, the number of top-ranked StartingNode documents selected from Dinit, denoted DStartingNodes. We fixed β at 100 (10% of the returned list for each topic). The algorithm returns a list of recommendations for each topic, denoted Dfinal. It processes topic by topic and extracts the list of all neighbors of each StartingNode. It then performs mutual Shortest Path computations between all selected StartingNodes in the DGD. The two lists (neighbors and nodes on the computed Shortest Paths) are concatenated, after which all duplicated nodes are deleted; the resulting set of documents is denoted Dgraph. A second concatenation is performed between the initial list of documents and Dgraph (all duplicates are deleted), giving the final list of retrieved documents, Dfinal, which is reranked using different reranking schemes.

Algorithm 1 Retrieving based on DGD feedback
 1: Dinit ← retrieve documents for each ti ∈ T
 2: for each Dti ∈ Dinit do
 3:   DStartingNodes ← first β documents ∈ Dti
 4:   for each StartingNode in DStartingNodes do
 5:     Dgraph ← Dgraph + neighbors(StartingNode, DGD)
 6:     DSPnodes ← all D ∈ ShortestPath(StartingNode, DStartingNodes, DGD)
 7:     Dgraph ← Dgraph + DSPnodes
 8:     Delete all duplicates from Dgraph
 9:   Dfinal ← Dfinal + (Dti + Dgraph)
10: Delete all duplicates from Dfinal
11: Rerank Dfinal
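A possible NetworkX realization of Algorithm 1 is sketched below, reusing the dgd graph from the previous sketch; d_init maps topic IDs to ranked document lists, "mutual" Shortest Paths are interpreted here as paths between every ordered pair of starting nodes, and unreachable pairs are simply skipped:

    import itertools
    import networkx as nx

    def dgd_feedback(d_init, dgd, beta=100):
        d_final = {}
        for topic, ranked_docs in d_init.items():
            starting = [d for d in ranked_docs if d in dgd][:beta]
            d_graph = []
            for node in starting:
                # Step 5: direct "Similar Products" neighbors.
                d_graph.extend(dgd.neighbors(node))
            # Step 6: nodes on shortest paths between starting nodes.
            for a, b in itertools.permutations(starting, 2):
                try:
                    d_graph.extend(nx.shortest_path(dgd, a, b))
                except nx.NetworkXNoPath:
                    continue
            # Steps 8-10: concatenate and deduplicate, keeping first occurrence.
            d_final[topic] = list(dict.fromkeys(ranked_docs + d_graph))
        return d_final  # step 11 (reranking) is applied afterwards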


Figure 5 shows an illustration of the document retrieval approach based on DGD feedback.

Figure 5: Book retrieval approach based on DGD feedback. Numbers on the arrows refer to the instructions of Algorithm 1

6. EXPERIMENTS AND RESULTS
In this section, we describe the experimental setup used in our experiments and present the different reranking schemes used in the approaches defined above. We then discuss the results achieved using the InL2 retrieval model, its combination with the SDM model, and the retrieval system proposed in our approach, which uses the DGD network.

6.1 Experiments setup


For our experiments, we used different tools that implement the retrieval models and handle the graph processing. First, we used Terrier (TERabyte RetrIEveR, http://terrier.org/), an Information Retrieval framework developed at the University of Glasgow [21-23]. Terrier is a modular platform for the rapid development of large-scale IR applications, providing both indexing and retrieval functionalities. It is based on the DFR framework, and we used it to deploy the InL2 model described in Section 4.1. Further information about Terrier can be found at http://ir.dcs.gla.ac.uk/terrier.
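As an indication of this setup, selecting the InL2 weighting model in Terrier's batch retrieval amounted to a property file along the following lines; the property names are assumed from Terrier 3.x-era documentation rather than taken from the paper, and should be checked against the version used:

    # terrier.properties (sketch; property names assumed, not reported in the paper)
    trec.model=InL2        # DFR weighting model applied at retrieval time
    trec.topics=topics.xml # placeholder topic file name
    trec.results=results   # placeholder output directory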
A preprocessing step converted the INEX SBS corpus into the TREC Collection Format (http://lab.hypotheses.org/1129). We considered the content of all the tags in each XML file to be important for indexing, so each whole XML file was transformed into one document identified by its ISBN. Thus, instead of all the XML tags, we only needed two: the ISBN and the whole content (named text).

Secondly, Indri (http://www.lemurproject.org/indri/), the Lemur toolkit for language modeling and information retrieval, was used to implement the language model (SDM) described in Section 4.2. Indri is a framework that provides state-of-the-art text search methods and a rich structured query language for large collections (up to 50 million documents). It is part of the Lemur project, developed by researchers from UMass and Carnegie Mellon University. We used the Porter stemmer and performed Bayesian smoothing with Dirichlet priors (µ = 1500).
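For concreteness, these settings correspond to Indri parameter files of the following shape; the paths are placeholders, and the SDM weighting itself is expressed in the query string, as in the example in Section 4.2:

    <!-- IndriBuildIndex: Porter stemming applied at indexing time -->
    <parameters>
      <index>/path/to/index</index>
      <stemmer><name>porter</name></stemmer>
    </parameters>

    <!-- IndriRunQuery: Dirichlet smoothing (mu = 1500), 1000 results per topic -->
    <parameters>
      <index>/path/to/index</index>
      <count>1000</count>
      <rule>method:dirichlet,mu:1500</rule>
    </parameters>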
In Section 5.1, we described our approach based on the DGD, which includes graph processing. We used the NetworkX (https://networkx.github.io/) Python library to perform the shortest path computations, the neighborhood extraction and the PageRank calculation.

To evaluate the results of the retrieval systems, several measures are used for the SBS task: normalized Discounted Cumulative Gain (nDCG), one of the most popular measures in IR [11]; Mean Average Precision (MAP), the mean of the average precisions over a set of queries; and two further measures, Reciprocal Rank (Recip Rank) and Precision at rank 10 (P@10).
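These measures can be computed with the standard trec_eval tool from a TREC-format run file and the task qrels; the file names below are placeholders, and ndcg_cut.10 is assumed here as the nDCG@10 variant reported in Table 1:

    trec_eval -m map -m recip_rank -m P.10 -m ndcg_cut.10 qrels.sbs2014 run.trec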
6.2 Reranking Schemes
Two approaches were proposed. The first one (see Section 4.3) merges the results of two different information retrieval models, the language model (SDM) and the DFR model (InL2). For each topic ti, each model returns 1000 documents, and each retrieved document has an associated score. The linear combination method uses the following formula to calculate the final score of each document d retrieved by the SDM and InL2 models:

    Sfinal(d, ti) = α · SInL2(d, ti) + (1 − α) · SSDM(d, ti)

where SInL2(d, ti) and SSDM(d, ti) are normalized scores, and α is the interpolation parameter, set to 0.8 after several tests on the 2014 topics.

The second approach (described in Section 5.1) uses the DGD constructed from the "Similar Products" information. The document set returned by the retrieval model is fused with the documents in the neighbor set and in the Shortest Path results. We tested several reranking methods that combine the retrieval model scores with other scores based on social information. For each document in the resulting list, we calculated the following scores:
• PageRank, computed using the NetworkX tool. PageRank is a well-known algorithm that exploits the link structure of a graph to score the importance of its nodes. It has usually been applied to hyperlink graphs such as the Web [24]. The PageRank values are given by the following formula:

      PR(A) = (1 − d) + d · (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  where document A has documents T1...Tn pointing to it (i.e., documents that list A among their Similar Products), d is a damping factor set between 0 and 1 (0.85 in our case), and C(A) is defined as the number of links going out of page A.

• Likeliness, computed from information generated by users (reviews and ratings). It is based on the idea that the more reviews and good ratings a book has, the more interesting it is (it may not be a good or popular book, but it is a book with a high impact):

      Likeliness(D) = log(#reviews(D)) × (Σ_{r∈RD} r) / #reviews(D)

  where #reviews(D) is the number of reviews attributed to D and RD is the set of ratings given in the reviews of D.
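The Likeliness score and the weighted combination discussed next can be sketched in Python as follows; the multiplicative scheme shown is a hypothetical instance of "weighting the retrieval model scores with the social scores", with normalization following the score/max formula given in the next paragraph:

    import math

    def likeliness(ratings):
        # ratings: rating values extracted from a book's reviews.
        n = len(ratings)
        if n == 0:
            return 0.0
        return math.log(n) * (sum(ratings) / n)

    def rerank(retrieval_scores, social_scores):
        # Hypothetical scheme: weight each retrieval score by the social
        # score (PageRank or Likeliness) normalized by its maximum value.
        max_social = max(social_scores.values(), default=0.0) or 1.0
        return sorted(retrieval_scores,
                      key=lambda d: retrieval_scores[d]
                                    * (social_scores.get(d, 0.0) / max_social),
                      reverse=True)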

The computed scores were normalized using the formula normalizedScore = oldScore / maxScore. After that, to combine the results of the retrieval systems with each of the normalized scores, an intuitive solution is to weight the retrieval model scores by the previously described scores (normalized PageRank and Likeliness). However, this can favor documents with high PageRank or Likeliness scores even when their content is much less related to the topics.

6.3 Results
We used the two topic sets provided by the INEX SBS task in 2014 (680 topics). The systems retrieve 1000 documents per topic. We assessed the narrative field of each topic and automatically classified the topic set into two genres: Analogue topics (261), in which users cite books they have already read (generally titles and authors) in order to obtain similar books, and "Non-Analogue" topics (356), in which users describe their needs by defining a theme, a field of interest, an event, etc., without citing other books. Note that 63 topics were ignored because of their ambiguity.

To evaluate the IR methodologies described in Sections 4.3 and 5, we performed retrieval for each topic genre individually. The experimental results, which describe the performance of the different retrieval systems on the Amazon/LibraryThing document collection, are shown in Table 1.

As illustrated in Table 1, the system that combines the probabilistic model InL2 and the language model SDM (InL2 SDM) achieves a significant improvement over the InL2 model (the baseline) on each topic set, but the improvement is largest on the Non-Analogue topic set, where the content of the queries is more explicit than in the other topic set. This improvement is mainly due to the increase in the number of relevant documents retrieved by both systems.

The results of the run InL2 DGD PR on the Analogue topic set confirm that exploiting linked documents and reranking with PageRank improves performance significantly; in contrast, it lowers the baseline performance on the Non-Analogue topic set. This can be explained by the fact that Analogue topics contain examples of books (Figure 6), which lets the graph be used to extract the similar connected books.

Figure 6: Examples of narratives in Analogue topics

Using Likeliness scores (in InL2 DGD MnRtg) to rerank the retrieved documents significantly decreases the baseline effectiveness on both topic sets. This means that the ratings given by users do not provide any improvement for reranking.

Figure 7: Histograms comparing the numbers of topics with improved, deteriorated and unchanged results using the proposed approaches, for the MAP measure (baseline: InL2)

Figure 7 compares the number of topics with improved, deteriorated and unchanged results between the baseline (InL2) and the proposed retrieval systems in terms of the MAP measure. The proposed systems based on the DGD graph provide the highest number of improved topics, compared with the combination of IR systems. More precisely, using PageRank to rerank documents produces better results in terms of improved topics. These results prove the positive impact of the linked structure on document retrieval systems for book recommendation.

The depicted results confirm that we are starting from a competitive baseline, suggesting that the improvements contributed by combining the outputs of retrieval systems with social link analysis are indeed meaningful.
Table 1: Experimental results. The runs are ranked according to nDCG@10. (∗) denotes significance according to the Wilcoxon test [9]; all tests are two-sided, with α = 0.05.

                             Analogue topics                                             Non-Analogue topics
Run           nDCG@10        Recip Rank      MAP            P@10           nDCG@10        Recip Rank      MAP            P@10
InL2          0.1099         0.267           0.072          0.078          0.138          0.207           0.117          0.0579
InL2 SDM      0.1115 (+1%∗)  0.271 (+1%∗)    0.073 (+0.6%)  0.079 (+1%∗)   0.147 (+6%∗)   0.222 (+7%∗)    0.124 (+5%∗)   0.0630 (+8%∗)
InL2 DGD PR   0.1111 (+1%∗)  0.277 (+3%∗)    0.068 (−5%∗)   0.082 (+12%)   0.127 (−7%∗)   0.206 (−0.6%∗)  0.102 (−12%∗)  0.0570 (−1%∗)
InL2 DGD LK   0.1043 (−5%)   0.275 (+2%)     0.064 (−11%∗)  0.082 (+5%)    0.130 (−5%)    0.214 (+3%∗)    0.100 (−14%∗)  0.0676 (+16%)

7. HUMANITIES AND SOCIAL SCIENCES COLLECTION: GRAPH MODELING AND RECOMMENDATION
We tested the proposed recommendation approach based on linked documents on the Revues.org (http://www.revues.org/) collection. Revues.org is one of the four platforms of the OpenEdition (http://www.openedition.org) portal dedicated to electronic resources in the humanities and social sciences (books, journals, research blogs, and academic announcements). Revues.org was founded in 1999 and today hosts over 400 online journals, i.e. 149,000 articles, proceedings and editorials.

We built a network of documents from the ASp journal (http://www.openedition.org/6457), which publishes research articles, publication listings and reviews related to the field of English for Specific Purposes (ESP), for both teaching and research. The network contains 500 documents and 833 relationships, which represent bibliographic citations. Each relationship is constructed using BILBO [12], a reference parsing tool trained on annotated corpora of Digital Humanities articles from the OpenEdition Revues.org platform. BILBO automatically annotates the bibliographic references in the bibliography section of each document and obtains the corresponding DOI (Digital Object Identifier) via the CrossRef (http://www.crossref.org/) API, if such an identifier exists.

Each node in the citation network has a set of properties: an ID, which is its URL; a type (article, editorial, book review, etc.); and the number of readers' clicks, which we call popularity. The recommender system applied to this network takes as input a user query, generally a small set of short keywords, and performs a retrieval step using the Solr (http://lucene.apache.org/solr/) search engine. The system extends the returned results with documents from the citation network by using graph algorithms (neighborhood search and the shortest path algorithm), as described in Section 5.1. After that, we rerank the documents according to the popularity property of each document.
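Assuming the dgd_feedback helper from the sketch in Section 5.1, this pipeline can be outlined as follows; the function and property names are illustrative, not the system's actual code:

    def recommend(topic_id, solr_results, citation_graph, beta=10):
        # Expand the Solr result list via the citation network (Section 5.1),
        # then rerank by the "popularity" node property (readers' clicks).
        expanded = dgd_feedback({topic_id: solr_results}, citation_graph, beta)
        return sorted(expanded[topic_id],
                      key=lambda doc: citation_graph.nodes[doc].get("popularity", 0),
                      reverse=True)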
We tested the system manually on a small set of user queries and found that, for most queries, the results were satisfying.

8. CONCLUSION AND FUTURE WORK
In this paper, we proposed and evaluated approaches for document retrieval in the context of book recommendation. We used the test collection of the CLEF Labs Social Book Search track and the 2014 topics, divided into the two classes "Analogue" and "Non-Analogue".

We presented a first approach that combines the outputs of a probabilistic model (InL2) and a language model (SDM) using linear interpolation after normalizing the scores of each retrieval system. We have shown a significant improvement over the baseline results using this combination.

A novel approach was then proposed, based on the Directed Graph of Documents (DGD) constructed from social relationships. It exploits the link structure to enrich the document list returned by a traditional retrieval model (InL2). We performed reranking using the PageRank and Likeliness scores of each retrieved document.

In the future, we would like to construct an evaluation corpus from the Revues.org collection and develop an evaluation process similar to that of the INEX SBS task. Another interesting extension of our work would be to use learning-to-rank techniques to automatically adjust the re-ranking parameters.

9. ACKNOWLEDGMENT
This work was supported by the French program Investissements d'Avenir FSN and the French Région PACA under the projects InterTextes and Agoraweb.
10. REFERENCES
 [1] M. Abolhassani and N. Fuhr. Applying the divergence from randomness approach for content-only search in XML documents. pages 409–419, 2004.
 [2] G. Amati and C. J. Van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, Oct. 2002.
 [3] G. Amati and C. J. van Rijsbergen. Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst., 20(4):357–389, October 2002.
 [4] T. Beckers, N. Fuhr, N. Pharo, R. Nordlie, and K. N. Fachry. Overview and results of the INEX 2009 interactive track. In Research and Advanced Technology for Digital Libraries, 14th European Conference, ECDL 2010, Glasgow, UK, September 6-10, 2010, Proceedings, pages 409–412, 2010.
 [5] N. J. Belkin, P. B. Kantor, E. A. Fox, and J. A. Shaw. Combining the evidence of multiple query representations for information retrieval. Inf. Process. Manage., 31(3):431–448, 1995.
 [6] C. Benkoussas, H. Hamdan, S. Albitar, A. Ollagnier, and P. Bellot. Collaborative filtering for book recommandation. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, pages 501–507, 2014.
 [7] L. Bonnefoy, R. Deveaud, and P. Bellot. Do social information help book search? In P. Forner, J. Karlgren, and C. Womser-Hacker, editors, CLEF (Online Working Notes/Labs/Workshop), 2012.
 [8] H. Bouchard and J.-Y. Nie. Modèles de langue appliqués à la recherche d'information contextuelle. In CORIA, pages 213–224. Université de Lyon, 2006.
 [9] W. B. Croft. Organizing and searching large files of document descriptions. PhD thesis, Cambridge University, 1978.
[10] R. Guillén. GIR with language modeling and DFR using Terrier. In C. Peters, T. Deselaers, N. Ferro, J. Gonzalo, G. Jones, M. Kurimo, T. Mandl, A. Peñas, and V. Petras, editors, Evaluating Systems for Multilingual and Multimodal Information Access, volume 5706 of Lecture Notes in Computer Science, pages 822–829. Springer Berlin Heidelberg, 2009.
[11] K. Järvelin and J. Kekäläinen. IR evaluation methods for retrieving highly relevant documents. In E. Yannakoudakis, N. Belkin, P. Ingwersen, and M.-K. Leong, editors, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2000), pages 41–48, New York, NY, USA, 2000. ACM.
[12] Y.-M. Kim, P. Bellot, E. Faath, and M. Dacos. Automatic annotation of bibliographical references in digital humanities books, articles and blogs. In G. Kazai, C. Eickhoff, and P. Brusilovsky, editors, BooksOnline, pages 41–48. ACM, 2011.
[13] M. Koolen, T. Bogers, J. Kamps, G. Kazai, and M. Preminger. Overview of the INEX 2014 social book search track. In Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, pages 462–479, 2014.
[14] O. Kurland and L. Lee. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR, pages 306–313, 2005.
[15] J. H. Lee. Combining multiple evidence from different properties of weighting schemes. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '95, pages 180–188, New York, NY, USA, 1995. ACM.
[16] J. Lin. PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval. BMC Bioinformatics, 9(1), 2008.
[17] J. Lin. PageRank without hyperlinks: Reranking with PubMed related article networks for biomedical text retrieval. BMC Bioinformatics, 9(1), 2008.
[18] D. Metzler and W. B. Croft. Combining the language model and inference network approaches to retrieval. Inf. Process. Manage., 40(5):735–750, 2004.
[19] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 472–479, New York, NY, USA, 2005. ACM.
[20] D. Metzler and W. B. Croft. A Markov random field model for term dependencies. In R. A. Baeza-Yates, N. Ziviani, G. Marchionini, A. Moffat, and J. Tait, editors, SIGIR, pages 472–479. ACM, 2005.
[21] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A high performance and scalable information retrieval platform. In Proceedings of the ACM SIGIR'06 Workshop on Open Source Information Retrieval (OSIR 2006), 2006.
[22] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and D. Johnson. Terrier information retrieval platform. In Proceedings of the 27th European Conference on IR Research (ECIR 2005), volume 3408 of Lecture Notes in Computer Science, pages 517–519. Springer, 2005.
[23] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: a search engine for advanced retrieval on the Web. Novatica/UPGRADE Special Issue on Web Information Access, 2007.
[24] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. In Proceedings of the 7th International World Wide Web Conference, pages 161–172, Brisbane, Australia, 1998.
[25] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical Report 1999-66, Stanford InfoLab, November 1999. Previous number: SIDL-WP-1999-0120.
[26] V. Plachouras, B. He, and I. Ounis. University of Glasgow at TREC 2004: Experiments in Web, robust, and terabyte tracks with Terrier. In E. M. Voorhees and L. P. Buckland, editors, TREC, volume Special Publication 500-261. National Institute of Standards and Technology (NIST), 2004.
[27] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proc. SIGIR, 1998.
[28] S. E. Robertson, C. J. van Rijsbergen, and M. F. Porter. Probabilistic models of indexing and searching. In SIGIR, pages 35–56, 1980.
[29] F. Song and W. Croft. A general language model for information retrieval. In Proceedings of the SIGIR Conference on Information Retrieval, 1999.
[30] T. Tao, X. Wang, Q. Mei, and C. Zhai. Language model information retrieval with document expansion. In R. C. Moore, J. A. Bilmes, J. Chu-Carroll, and M. Sanderson, editors, HLT-NAACL. The Association for Computational Linguistics, 2006.
[31] C. Zhai. Statistical Language Models for Information Retrieval. Synthesis Lectures on Human Language Technologies. Morgan and Claypool Publishers, 2008.