<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LaHC at INEX 2014: Social Book Search Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Meriem Hafsi</string-name>
          <email>meriem.hafsi@etu.univ-st-etienne.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mathias Géry</string-name>
          <email>mathias.gery@univ-st-etienne.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Beigbeder</string-name>
          <email>michel.beigbeder@emse.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>École Nationale Supérieure des Mines de Saint-Étienne</institution>
          ,
          <addr-line>158 cours Fauriel, F-42023 Saint-Étienne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université de Lyon</institution>
          ,
          <addr-line>F-42023 Saint-Étienne, France, CNRS, UMR 5516</addr-line>
          ,
          <institution>Laboratoire Hubert Curien</institution>
          ,
          <addr-line>F-42000 Saint-Étienne</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>514</fpage>
      <lpage>520</lpage>
      <abstract>
        <p>In this article, we describe our participation in the INEX 2014 Social Book Search track. We present different approaches exploiting social information such as reviews, tags and ratings that users assign to books. We optimize our models using the INEX Social Book Search 2013 collection and test them on the INEX 2014 Social Book Search track.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Information Retrieval</kwd>
        <kwd>Recommendation</kwd>
        <kwd>Structured Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>In this article, we present the different approaches used in our participation in the INEX
2014 Social Book Search (SBS) track. The idea is to exploit user generated content such
as reviews and ratings to recommend books. We have also used an approach based
on the similarity between users. For the experiments, we have used both the INEX SBS
2013 and 2014 collections. Our goal is to improve the information retrieval process by
optimizing our models with the INEX SBS 2013 collection. In the following section, we
present the INEX SBS 2014 collection and data. Then, we present the models optimized
on the INEX SBS 2013 collection. Finally, we detail our official runs and the results obtained.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Collection and Data</title>
      <p>The collection contains 2.8 million book descriptions from Amazon, composed of 64
XML fields. Among these fields, we distinguish:
– Metadata: &lt;book&gt;, &lt;isbn&gt;, &lt;title&gt;, &lt;authorid&gt;, etc.
– Social information: &lt;review&gt;, &lt;summary&gt;, &lt;tags&gt;, &lt;rating&gt;, etc.</p>
      <p>Preprocessing</p>
      <p>User profiles are provided by LibraryThing (LT) in a text file containing 93,976
anonymous users. These profiles do not contain personal information: they contain
only the personal catalog of each user. Each user catalog is presented as a set of
rows, where each row represents the user's review of one book, with a rating and
possibly some tags.</p>
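Such a catalog file can be loaded into per-user structures in a few lines. This is a minimal sketch; the tab-separated row layout (user id, book id, rating, comma-separated tags) is an assumed, illustrative format, not the actual INEX file layout.

```python
# Sketch: loading LT user profiles into per-user catalogs. The tab-separated
# row layout (user id, book id, rating, comma-separated tags) is an assumed,
# illustrative format; the actual INEX file layout may differ.
from collections import defaultdict

def parse_profiles(lines):
    """Build {user_id: [(book_id, rating, tags), ...]} from catalog rows."""
    catalogs = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        user_id, book_id, rating = parts[0], parts[1], int(parts[2])
        tags = parts[3].split(",") if len(parts) > 3 and parts[3] else []
        catalogs[user_id].append((book_id, rating, tags))
    return catalogs

rows = [
    "u1\t0451526538\t8\tclassic,adventure",
    "u1\t0140449132\t5\t",
    "u2\t0451526538\t9\tfavorites",
]
catalogs = parse_profiles(rows)  # catalogs["u1"] holds two catalog entries
```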
      <p>There are 680 topics in the INEX 2014 SBS track, where each topic contains five fields:
&lt;title&gt;, &lt;query&gt;, &lt;narrative&gt;, the &lt;group&gt; where the topic was posted,
and the personal catalog &lt;catalog&gt; of the anonymous user who wrote the topic.</p>
      <p>
        In our experiments, we use both collections, INEX SBS 2013 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and INEX SBS
2014. Both collections use the same set of documents (book descriptions). We use the INEX
SBS 2013 collection because its relevance judgments were available, which allowed us to
optimize our system before the actual submission of our 2014 runs. The
difference between the two collections lies in the topics and the user profiles. In 2014, we
have only the personal catalog of each user, whereas in 2013 we have complete user profiles.
      </p>
      <sec id="sec-2a">
        <title>3 Retrieval models and their optimization with INEX SBS 2013</title>
        <p>The preprocessing step eliminates the fields we do not need. Each book description can
contain one &lt;reviews&gt; field, which contains one or several reviews in &lt;review&gt;
fields. Each &lt;review&gt; field is composed of &lt;summary&gt;, &lt;content&gt; and &lt;tags&gt;
fields. In the &lt;tags&gt; field, we find some tags &lt;tag&gt;. After this preprocessing step,
the collection contains the following fields:
– &lt;docno&gt;: the &lt;isbn&gt; field of the book.
– &lt;title&gt;: the &lt;title&gt; field of the book.
– &lt;summary&gt;: concatenation of the &lt;summary&gt; fields.
– &lt;content&gt;: concatenation of the &lt;content&gt; fields.
– &lt;tags&gt;: concatenation of the &lt;tag&gt; fields. The new &lt;tags&gt; field contains
as many copies of the &lt;tag&gt; field content as indicated by the count attribute.
For example, &lt;tag count="3"&gt;moon&lt;/tag&gt; will be written: &lt;tags&gt;moon
moon moon&lt;/tags&gt;.</p>
        <sec id="sec-2a-2">
          <title>3.2 Indexing and querying</title>
          <p>We use the Terrier 3.6 search engine (http://terrier.org/), which can index large XML
collections. We use the default stop-word list of Terrier and the Porter stemmer. Then, we
create five book description indexes as follows:
– Index-Title: only the &lt;title&gt; field of each book description, so no social
information is indexed.
– Index-Summary: the &lt;summary&gt; field only.
– Index-Content: the &lt;content&gt; field only.
– Index-Tags: the &lt;tags&gt; field only.
– Index-All-Fields: the concatenation of all the fields: &lt;title&gt;, &lt;summary&gt;,
&lt;content&gt; and &lt;tags&gt;.</p>
        </sec>
      </sec>
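The count-based &lt;tags&gt; expansion described above can be sketched as follows; expand_tags is a hypothetical helper name, not code from the paper.

```python
# Sketch of the count-based tag expansion described above: each tag is
# repeated "count" times so that its term frequency in the indexed tags
# field reflects how many users applied it.
def expand_tags(tag_counts):
    """[(tag, count), ...] -> flat text for the concatenated tags field."""
    return " ".join(" ".join([tag] * count) for tag, count in tag_counts)

print(expand_tags([("moon", 3), ("space", 1)]))  # moon moon moon space
```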
      <sec id="sec-2-1">
        <p>We build four sets of queries:</p>
        <p>– Topic-Title: Only the &lt;title&gt; field of each topic.
– Topic-Query: Only the &lt;query&gt; field.
– Topic-Title-Query: &lt;title&gt; and &lt;query&gt; fields.</p>
        <p>– Topic-All-Fields: &lt;title&gt;, &lt;query&gt; and &lt;narrative&gt; fields.</p>
      </sec>
      <sec id="sec-2b">
        <title>3.3 Content-Based Retrieval</title>
        <p>
          First, we combine each of the four query sets with each of the five document indexes,
using the BM25 model [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We select one of the combinations and then we optimize
the weight of each field in the BM25F model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. To evaluate the runs of the 20
combinations with BM25 and the run with BM25F, we test on 52 topics manually selected
among the 386 topics of INEX SBS 2013. The selected topics are those in which the
information need is based on the actual book content and not only on its usage. We
evaluate our results with nDCG@1000 (shortened to nDCG in this paper) using
trec_eval version 9.0 (http://trec.nist.gov/trec_eval/).
Evaluation results of these experiments are shown in Table 1.
The results show that indexing user generated content (&lt;summary&gt;, &lt;content&gt;
and &lt;tags&gt;) improves the search results. We notice that the field-based BM25 weighting
function (BM25F) obtains better results (0.2132) than the classical BM25 weighting
function (0.1504), while both index the same information. We chose to focus our
experiments on topics composed only of the &lt;title&gt; and &lt;query&gt; fields. Thus,
BM25F has been used with only one set of queries, even though the best results are
obtained with queries including the &lt;narrative&gt; field. In the sequel, we will consider
the BM25F Index-All-Fields combination, which is the most promising one.
        </p>
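For reference, nDCG@k as used above can be computed like this. This is one common formulation; trec_eval's implementation may differ in details such as gain handling.

```python
# One common formulation of nDCG@k, the measure used above; trec_eval's
# implementation may differ in details such as gain handling.
import math

def dcg(gains, k):
    """Discounted cumulative gain over the first k relevance grades."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg(ranked_gains, k=1000):
    """DCG of the ranking divided by the DCG of the ideal reordering."""
    ideal = dcg(sorted(ranked_gains, reverse=True), k)
    return dcg(ranked_gains, k) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # a perfectly ordered ranking scores 1.0
```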
        <sec id="sec-2-1-1">
          <title>3.4 Social Re-Ranking</title>
          <p>Our goal is to experiment with re-ranking methods in order to improve the
content-based search results obtained in the previous section. We propose four models using
book ratings (ScoreAmazonRatings, ScoreLTRatingsPop and ScoreLTRatingsRep) and the
similarity between users (ScoreUsersSimilarity).</p>
          <p>Amazon Books Rating based approach (ScoreAmazonRatings): Some books were
commented on and rated by Amazon users. There are 14,042,020 ratings in the collection,
ranging from 0 to 5, with 5 indicating the maximum rating. These ratings are distributed as
shown in Table 2.</p>
          <p>We compute a score AmazonRating(d) for each book d using the m user ratings of d,
as presented in equation 1. We define the score ScoreAmazonRatings(d, q) of a book d
for a query q by a linear combination of the BM25F and AmazonRating(d) scores (cf.
equation 2):
ScoreAmazonRatings(d, q) = α1 BM25F(d, q) + (1 − α1) AmazonRating(d)
(2)
where α1 is a free parameter of our model.</p>
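The linear combination of equation 2 amounts to a simple re-ranking step. A minimal sketch, assuming both score components have already been normalized to comparable ranges (a detail the paper does not specify):

```python
# Minimal sketch of the linear re-ranking in equation 2: the final score
# mixes the content-based BM25F score with a query-independent rating score.
# It assumes both components were normalized to comparable ranges, a detail
# the paper does not specify.
def rerank(bm25f_scores, rating_scores, alpha=0.5):
    """Sort documents by alpha * BM25F(d, q) + (1 - alpha) * rating(d)."""
    combined = {
        doc: alpha * bm25f_scores[doc] + (1 - alpha) * rating_scores.get(doc, 0.0)
        for doc in bm25f_scores
    }
    return sorted(combined, key=combined.get, reverse=True)

bm25f = {"d1": 0.9, "d2": 0.8, "d3": 0.1}
ratings = {"d2": 0.9, "d3": 0.2}
print(rerank(bm25f, ratings))  # d2 overtakes d1 thanks to its high rating
```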
          <p>LibraryThing Books Rating based approaches (ScoreLTRatingsPop/Rep): The
LibraryThing ratings range from 0 to 10. We introduce two concepts:
– Popularity: Pop(d) is based on the number of times the book has been added to
a catalog. The more the book is reviewed, the higher its popularity.
– Reputation: Rep(d) is based on the number of times the book received a rating
greater than 6. Thus, the more the book is highly rated, the higher its reputation.</p>
        </sec>
        <sec id="sec-2-1-2">
          <p>Pop(d) and Rep(d) are obtained as follows:
Pop(d) = 0 if (m = 0); ln(m) if (m ≥ 1)
(3)
Rep(d) = 0 if (l = 0); ln(l) if (l ≥ 1)
(4)
where m is the number of ratings and l is the number of ratings higher than 6.</p>
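Equations 3 and 4 can be sketched directly, assuming a book's LibraryThing ratings are available as a list of values on the 0-10 scale:

```python
# Sketch of equations 3 and 4, assuming a book's LibraryThing ratings are
# available as a list of values on the 0-10 scale.
import math

def pop(ratings):
    """Pop(d): ln(m) for m >= 1 ratings, else 0."""
    m = len(ratings)
    return math.log(m) if m >= 1 else 0.0

def rep(ratings):
    """Rep(d): ln(l) for l >= 1 ratings greater than 6, else 0."""
    l = sum(1 for r in ratings if r > 6)
    return math.log(l) if l >= 1 else 0.0
```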
          <p>Then, we define the scores ScoreLTRatingsPop(d, q) and ScoreLTRatingsRep(d, q):
ScoreLTRatingsPop(d, q) = α2 BM25F(d, q) + (1 − α2) Pop(d)
(5)</p>
          <p>ScoreLTRatingsRep(d, q) = α3 BM25F(d, q) + (1 − α3) Rep(d)
(6)
Users Similarity based approach (ScoreUsersSimilarity): This approach is based
on the similarity between users. The idea is that users who read and liked the same books
in the past are likely to like the same things in the future. So, we recommend to
the user who submitted a topic the books reviewed by similar users. For each book d, we
calculate the score Sim(d, q) according to the similarity in the following manner:
Sim(d, q) = max of UsersSim(uq, ui) over ui ∈ Reviewers(d) if Reviewers(d) ≠ ∅; 0 otherwise
(7)
with:
– uq: the user who submitted topic q.
– Reviewers(d): the users who have reviewed d.
– UsersSim(ui, uj): the similarity between the catalogs of users ui and uj,
represented by two binary vectors (a component is set to one if the rating of the book is
higher than 6).</p>
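A sketch of equation 7. The paper does not name the exact similarity used between the binary catalog vectors, so cosine similarity is assumed here for illustration:

```python
# Sketch of equation 7. Catalogs become binary "liked" vectors (1 when the
# rating is higher than 6); the similarity between those vectors is not
# named in the paper, so cosine similarity is assumed here.
import math

def users_sim(cat_a, cat_b):
    """Cosine similarity between two {book: rating} catalogs as binary vectors."""
    liked_a = {b for b, r in cat_a.items() if r > 6}
    liked_b = {b for b, r in cat_b.items() if r > 6}
    if not liked_a or not liked_b:
        return 0.0
    return len(liked_a & liked_b) / math.sqrt(len(liked_a) * len(liked_b))

def sim(reviewers, topic_user_catalog, catalogs):
    """Sim(d, q): best similarity over the reviewers of d, 0 if none."""
    if not reviewers:
        return 0.0
    return max(users_sim(topic_user_catalog, catalogs[u]) for u in reviewers)
```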
          <p>The final score of each book is computed as follows:</p>
          <p>ScoreUsersSimilarity(d, q) = α4 BM25F(d, q) + (1 − α4) Sim(d, q)
(8)</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>3.5 Results on INEX SBS 2013</title>
          <p>The results of three of our social information retrieval models, presented in Table 3,
show that two of the three models slightly improve the results compared to the BM25F
model. The free parameters α1, α2 and α3 have been optimized to 0.5, 0.75,
and 0.75, respectively. Note that our fourth model has not been optimized because the
INEX SBS 2013 collection does not have a large number of user profiles.</p>
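The optimization of the free parameters could be as simple as a grid search on the 2013 topics. A sketch with an illustrative objective; evaluate() stands in for a scored run of the system and is purely hypothetical:

```python
# A grid search is one simple way the free parameters could have been tuned
# on the 2013 collection; evaluate() stands in for a run of the system
# scored with nDCG, and is purely illustrative.
def tune_alpha(evaluate, step=0.05):
    """Return the alpha in [0, 1] (on a fixed grid) maximizing evaluate(alpha)."""
    grid = [round(i * step, 10) for i in range(int(round(1 / step)) + 1)]
    return max(grid, key=evaluate)

# Toy objective peaking at 0.75, mirroring the optimized values reported:
print(tune_alpha(lambda a: -(a - 0.75) ** 2))  # 0.75
```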
        </sec>
        <sec id="sec-2c">
          <title>4 Experiments and results on INEX SBS 2014 Track</title>
          <p>For our participation in the INEX SBS 2014 track, we built six runs by applying the models
that we optimized on the INEX SBS 2013 collection, together with the model ScoreUsersSimilarity.
These runs are summarized in Table 4 and their results are shown in Table 5. With
the INEX 2014 SBS track official measure (nDCG@10), our six runs are ranked as
shown in Table 5. Their rank/nDCG curves are presented in Figure 1. Our best run is the
one that exploits the &lt;narrative&gt; field of the topic and uses the BM25F model. Table 5
also displays the results of a post-INEX baseline obtained with the BM25 model
queried with the &lt;title&gt; and &lt;query&gt; fields of the topic. The baseline is the
sole run which does not use the &lt;review&gt; and &lt;tag&gt; fields, and its results are
much worse. This was also observed with the SBS 2013 collection. There is a slight
improvement in nDCG@10 and MRR when taking into account the user generated
content. As in the 2013 results, taking into account the &lt;narrative&gt; topic field
improves the results because the user information needs are sometimes better expressed there.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5 Conclusion</title>
      <p>In this paper, we described our participation in the INEX 2014 SBS track. We tested
different approaches using social information: indexing the book reviews and tags,
querying with all topic fields, and using scores based on book ratings and on the similarity
between users. These approaches give interesting results, except the approach based on the
similarity between users. This is probably due to the fact that we recommend to the user the
whole list of books appreciated by similar users. It would have been interesting to
filter this list with another kind of social information. This approach could also be
improved by optimizing its parameters.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Koolen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Preminger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Doucet</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Overview of the INEX 2013 social book search track</article-title>
          . In:
          <article-title>CLEF 2013 Evaluation Labs</article-title>
          and Workshop, Online Working Notes (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hancock-Beaulieu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gatford</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Payne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Okapi at TREC'4</article-title>
          .
          <source>In: The Fourth Text REtrieval Conference (TREC'4)</source>
          . pp.
          <fpage>73</fpage>
          -
          <lpage>96</lpage>
          . TREC-
          <volume>4</volume>
          (
          <year>1996</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaragoza</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Taylor</surname>
          </string-name>
          , M.:
          <article-title>Simple BM25 extension to multiple weighted fields</article-title>
          .
          <source>In: Conference on Information and Knowledge Management</source>
          . pp.
          <fpage>42</fpage>
          -
          <lpage>49</lpage>
          . CIKM'04, ACM, New York, NY, USA (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>