<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CERIST at INEX 2015: Social Book Search Track</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Messaoud CHAA</string-name>
          <email>mchaa@cerist.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Omar NOUALI</string-name>
          <email>onouali@cerist.dz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Center on Scientific and Technical Information</institution>
          ,
          <addr-line>05 rue des 03 frères Aissou, Ben Aknoun, Alger, 16030</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Université Abderrahmane Mira Béjaïa</institution>
          ,
          <addr-line>Rue Targa Ouzemour, Béjaïa 6000</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our participation in the INEX 2015 Social Book Search (SBS) Suggestion Track. In our experiments we exploited only the tags that users assigned to books on LibraryThing (LT). We investigated the impact of the weight of each topic term in the retrieval model using two methods. In the first method, we used the TF-IQF formula to assign a weight to each term of the topic. In the second method, we used the Rocchio algorithm to expand the query and to compute the weights of the tags assigned to the example books mentioned in the book search request. The parameters of our models were tuned on the INEX 2014 topics and tested on the INEX 2015 Social Book Search track.</p>
      </abstract>
      <kwd-group>
        <kwd>Social Book Search</kwd>
        <kwd>TF-IQF</kwd>
        <kwd>Tag-Based</kwd>
        <kwd>Rocchio Algorithm</kwd>
        <kwd>Query Expansion</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The emergence of Web 2.0 and social web applications has completely changed the
way information is published, shared, and found on the web. This shift has led
researchers in the information retrieval field to look for other techniques and tools to
help users find the information most relevant to their needs. This is the goal of
the Social Book Search Track [1].</p>
      <p>To reach this goal, INEX SBS has provided since 2011 a collection of 2.8
million records containing both professional metadata from Amazon and
user-generated content, namely social metadata from LibraryThing (LT). In addition, it has
provided a large set of 93,976 anonymous user profiles from LT with over 33
million cataloguing transactions.</p>
      <p>A set of topics extracted from the LT forums has also been made available
to evaluate the systems submitted by participants in the SBS task. Each topic contains
several fields that describe the user's need: title, group, mediated query, narrative and the
personal catalogue of the topic starter. This year the topics have been enriched with an
examples field, which lists all the example books mentioned in the search request. Because
a topic has several representations, understanding the user's information
need and determining the importance of each term in the topic is a very difficult
task.</p>
      <p>In this paper, we try to tackle this problem through two contributions. Firstly, we
introduce the tf-iqf function, which assigns a high weight to terms that are
significant to the topic (high term frequency in the given topic) and a low weight to those
appearing in many different topics. Secondly, to better represent the topic, we
add further terms by expanding the original query using the Rocchio technique [2]. The
example books mentioned in the search request are used as relevance feedback
documents in this technique.</p>
      <p>The rest of the paper is organized as follows. In Section 2, we describe the
data processing. In Section 3, we present our approach, focusing on the retrieval
function and on the weighting functions of the two methods above. Section 4 reports and
discusses the results of our experiments. Finally, we conclude in
Section 5 with an outlook on future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Data processing and indexing</title>
      <p>In this section, we describe the data processing and indexing techniques. Several
studies in social information retrieval show that using social tags as index terms can improve the quality of
search results. To investigate the impact of
social tags on SBS, all our experiments use
only the user profiles file<sup>2</sup> provided by the INEX SBS track, which contains over 33
million cataloguing transactions. Each transaction is represented by a row with
five columns: the user, the book, the month in which the user added that
book, the rating, and the set of tags assigned by this user to this book. These columns
are laid out in the user profiles file as follows:</p>
      <preformat>&lt;user_id&gt; &lt;book_id&gt; &lt;add_date&gt; &lt;user_rating&gt; &lt;user_tags&gt;</preformat>
      <p>The two columns book_id and user_tags are used to extract, for each book, all the tags
that users have assigned to it. Before the index is created, the Porter stemmer [3] is
used to reduce all tags to their stems. Once all tags have been extracted and
processed, the data is indexed in the following two relational tables,
implemented in the Postgres<sup>3</sup> database management system:</p>
      <list list-type="bullet">
        <list-item><p>BOOKS(id_book, id_tag, tf): contains, for each book id_book, each tag id_tag that
users have assigned to this book, together with tf (the number of times LT users have tagged the
book id_book with the tag id_tag).</p></list-item>
        <list-item><p>TAGS(id_tag, tag, idf): contains the stemmed tag for each id_tag, together with idf (the logarithm of
the ratio of the number of books in the collection to the number of books tagged with
the given tag).</p></list-item>
      </list>
      <p><sup>2</sup> http://cleverdon.hum.uva.nl/sbs/profiles/sbs15.profiles.gz</p>
      <p><sup>3</sup> http://www.postgresql.org/</p>
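      <p>As an illustration only, the following Python sketch shows one way the content of the BOOKS and TAGS tables could be derived from the profile rows. It is not the Postgres implementation used in our experiments: the function name build_index, the in-memory dictionaries, the rows given as already-split tuples, and the lower-casing that stands in for the Porter stemmer are all assumptions made for this example. The number of books in the collection is assumed to be the number of books that occur in the profile rows.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter, defaultdict


def build_index(rows, stem=str.lower):
    """Build in-memory equivalents of the BOOKS and TAGS tables.

    rows: iterable of (user_id, book_id, add_date, user_rating, user_tags)
          tuples, where user_tags is a list of tag strings (assumed layout).
    stem: tag normaliser; a real run would plug a Porter stemmer in here.
    """
    books = defaultdict(Counter)  # book_id -> {stemmed tag: tf}
    for _user, book_id, _date, _rating, tags in rows:
        for tag in tags:
            books[book_id][stem(tag)] += 1

    # Document frequency: number of distinct books carrying each tag.
    df = Counter()
    for tag_counts in books.values():
        df.update(tag_counts.keys())

    # idf as described for the TAGS table: log(number of books / df).
    n_books = len(books)
    idf = {tag: math.log(n_books / n) for tag, n in df.items()}
    return dict(books), idf


# Toy rows (hypothetical data, not taken from the LT profiles).
rows = [
    ("u1", "b1", "2010-01", 8, ["Fantasy", "dragons"]),
    ("u2", "b1", "2011-05", 9, ["fantasy"]),
    ("u2", "b2", "2011-06", 7, ["history"]),
]
books, idf = build_index(rows)
```
]]></preformat>
      <p>In this toy input, book b1 is tagged twice with the same stem, so its tf for that tag is 2, while every tag occurs in one of the two books and therefore receives the same idf.</p>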
    </sec>
    <sec id="sec-3">
      <title>Our approach</title>
      <p>To illustrate our approach, we first present in this section the scoring function used to
measure the similarity between a query and each book in the collection. Then, we
describe the two techniques used to weight query terms and to expand the original
query.</p>
      <sec id="sec-3-1">
        <title>Scoring Function</title>
        <p>In our approach, we consider a query Q as a set of weighted terms issued by the topic
starter to describe his or her need. Each document (book) D of the collection is represented
by a vector in which each dimension holds the number of times D has been
tagged with the corresponding tag t. To compute the score S(D,Q) of a document D with
respect to a query Q, we use BM15, a simplified version of the
Okapi BM25 retrieval function [4]. We chose BM15 because it applies no length
normalization, and the number of tags assigned to a book cannot be considered a length.</p>
        <disp-formula id="eq1">
          <label>(1)</label>
          <tex-math>S(D,Q) = \sum_{t \in Q} \frac{(k_1 + 1)\, w(t,D)}{k_1 + w(t,D)} \cdot idf(t) \cdot \frac{(k_3 + 1)\, w(t,Q)}{k_3 + w(t,Q)}</tex-math>
        </disp-formula>
        <p>where w(t,D) and w(t,Q) are the weights of term t in the document D and in
the query Q, respectively, k1 and k3 are free parameters, and idf(t) is the inverse document frequency,
calculated as follows:</p>
        <disp-formula id="eq2">
          <label>(2)</label>
          <tex-math>idf(t) = \log \frac{|D| - df(t) + 0.5}{df(t) + 0.5}</tex-math>
        </disp-formula>
        <p>where df(t) is the number of documents that are tagged with t, and |D| is the total
number of documents in the collection.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Query terms weighting</title>
        <p>The topics of the INEX SBS track, which are derived from the LT forums, contain several
fields, namely title, group, mediated query and narrative. In our approach we
use all the terms of all these fields, but we assign each term of the topic a weight
using the tf-iqf formula, which is analogous to tf-idf for documents [5]. Each topic
is therefore represented by a weighted vector whose values are the
weights of its terms, calculated as follows:</p>
        <disp-formula id="eq3">
          <label>(3)</label>
          <tex-math>w(t,q) = tf(t,q) \cdot iqf(t)</tex-math>
        </disp-formula>
        <p>where w(t,q) is the weight of term t in the topic q (it plays the role of w(t,Q) in Equation (1)),
tf(t,q) is the frequency of term t in the topic q,
and iqf(t) is the inverse query frequency, calculated as follows:</p>
        <disp-formula id="eq4">
          <label>(4)</label>
          <tex-math>iqf(t) = \log \frac{|Q| - qf(t) + 0.5}{qf(t) + 0.5}</tex-math>
        </disp-formula>
        <p>where qf(t) is the number of topics that contain t, and |Q| is the total number of
topics in the collection (the 680 topics from INEX 2014 are used).</p>
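        <p>The tf-iqf weighting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for our runs: the function names iqf and tf_iqf_weights are hypothetical, each topic is assumed to be available as a set of already stemmed terms, and the inverse query frequency is assumed to take the smoothed form log((|Q| - qf + 0.5) / (qf + 0.5)).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def iqf(term, topics):
    """Inverse query frequency of a term over a list of topic term sets."""
    qf = sum(1 for topic in topics if term in topic)
    return math.log((len(topics) - qf + 0.5) / (qf + 0.5))


def tf_iqf_weights(topic_terms, topics):
    """Weight each distinct term of one topic by tf(t, q) * iqf(t)."""
    tf = Counter(topic_terms)
    return {t: n * iqf(t, topics) for t, n in tf.items()}


# Toy topic collection (hypothetical terms, five topics).
topics = [
    {"book", "dragon", "fantasy"},
    {"book", "history", "rome"},
    {"book", "cooking"},
    {"dragon", "history"},
    {"poetry"},
]
weights = tf_iqf_weights(["dragon", "dragon", "book", "fantasy"], topics)
```
]]></preformat>
        <p>Under this smoothed form, a term that occurs in more than half of the topics (here, the term book) receives a negative weight, whereas a term that is frequent in the topic but rare across topics receives a high one.</p>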
      </sec>
      <sec id="sec-3-3">
        <title>Query expansion</title>
        <p>This year the topics of INEX SBS have been enriched with an examples field, which
lists all the example books mentioned in the search request, with information on whether the
user has read the book and on his or her sentiment about it (positive, negative
or neutral). To exploit this field, we adopted query expansion,
which improves search results by automatically adding terms to the user's
original query. Rocchio relevance feedback is one of the most popular methods
for this task. We apply it in the following steps:</p>
        <list list-type="order">
          <list-item><p>For each example book in the topic, rank all the tags assigned to this book
according to the tf-idf function.</p></list-item>
          <list-item><p>Select the top-k tags for each book.</p></list-item>
          <list-item><p>Apply the function below to construct the new query.</p></list-item>
        </list>
        <disp-formula id="eq5">
          <label>(5)</label>
          <tex-math>Q_{new} = \alpha \cdot Q_{orig} + \frac{\beta}{P} \sum_{b \in Pos} \vec{b} + \frac{\gamma}{NT} \sum_{b \in Neu} \vec{b} - \frac{\delta}{N} \sum_{b \in Neg} \vec{b}</tex-math>
        </disp-formula>
        <p>where Q<sub>orig</sub> and Q<sub>new</sub> are the original and the new query vectors, respectively, and b
denotes the weighted tag vector of an example book. Pos, Neu and Neg are the sets of positive, neutral and
negative example books, and P, NT and N are, respectively,
the numbers of books in these sets. The parameter α controls
the importance of the terms of the original query, whereas β, γ and δ
weight the tags of the positive, neutral and negative example books in the final query. The latter parameters thus take into
account the sentiment of the topic starter about each example book. Note
that the information on whether the topic starter has read the example book
is not used in this technique.</p>
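        <p>The three steps above can be sketched as follows. This is an illustrative Python version under stated assumptions rather than our actual implementation: the helper names top_k_tags and rocchio_expand, the dictionary-based vectors and the sentiment labels are hypothetical, the symbols beta, gamma and delta are assumed to name the weights of the positive, neutral and negative example books, and the default parameter values are the ones reported as best in Section 4.1.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import defaultdict


def top_k_tags(tfidf, k=10):
    """Keep the k tags with the highest tf-idf and give each a weight of 1."""
    ranked = sorted(tfidf.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return {tag: 1.0 for tag, _ in ranked}


def rocchio_expand(q_orig, examples, alpha=0.4, beta=1.0, gamma=0.8, delta=0.5):
    """Combine the original query with the tag vectors of the example books.

    q_orig:   {term: weight} for the original topic terms.
    examples: list of (sentiment, {tag: weight}) pairs, where sentiment is
              "positive", "neutral" or "negative".
    """
    coeff = {"positive": beta, "neutral": gamma, "negative": -delta}
    counts = defaultdict(int)
    for sentiment, _ in examples:
        counts[sentiment] += 1

    q_new = defaultdict(float)
    for term, w in q_orig.items():
        q_new[term] += alpha * w
    for sentiment, tags in examples:
        # Each sentiment group is averaged over its number of books.
        scale = coeff[sentiment] / counts[sentiment]
        for tag, w in tags.items():
            q_new[tag] += scale * w
    return dict(q_new)


# Toy topic with two positive and one negative example book (hypothetical).
expanded = rocchio_expand(
    {"dragon": 2.0},
    [
        ("positive", {"dragon": 1.0, "fantasy": 1.0}),
        ("positive", {"fantasy": 1.0}),
        ("negative", {"romance": 1.0}),
    ],
)
selected = top_k_tags({"a": 3.0, "b": 1.0, "c": 2.0}, k=2)
```
]]></preformat>
        <p>In this toy case the tag shared by both positive books ends up with weight 1.0, while the tag of the negative book receives a negative weight, so books carrying it are pushed down the ranking.</p>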
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments &amp; Results</title>
      <p>To test and validate our approach, we ran several experiments with different
representations of the query. We used the topics and the relevance judgments of INEX
SBS 2014<sup>4</sup> to train our approach and to optimize the parameters of the different
functions used.</p>
      <sec id="sec-4-1">
        <title>Training &amp; optimizing from SBS 2014</title>
        <p>To study the impact of term weighting on retrieval performance, we
compared two weighting schemes. In the first, the weight of a term is its
frequency of occurrence in the topic fields. In the second, the weight of a term is
calculated by the tf-iqf function described in Section 3.2. We then calculated the score of each
book in the index using the retrieval function. Note that all terms of the
topic fields are used to represent the query.</p>
        <p><sup>4</sup> http://social-book-search.humanities.uva.nl/data/judgements/inex14sbs_V2.qrels</p>
        <p>We optimized the parameters of BM15 using the 2014 topics (k3 was set to 1000
and k1 was tuned to 5). Table 1 summarizes the results of the two weighting
methods on the 680 topics and relevance judgments of SBS 2014.</p>
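        <p>For concreteness, the BM15 scoring with these parameter values can be sketched in Python as below. This is a hedged illustration, not the code behind our runs: the function names idf and bm15_score and the dictionary inputs are hypothetical, and the formula is assumed to be the usual Okapi form without length normalization, with a smoothed idf of log((N - df + 0.5) / (df + 0.5)).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def idf(df, n_docs):
    """Smoothed inverse document frequency (assumed Okapi form)."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


def bm15_score(doc_tf, query_w, df, n_docs, k1=5.0, k3=1000.0):
    """Score one book against a weighted query, without length normalisation.

    doc_tf:  {tag: number of times the book was tagged with it}
    query_w: {term: query weight (raw frequency or tf-iqf)}
    df:      {tag: number of books tagged with it}
    """
    score = 0.0
    for term, wq in query_w.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # terms absent from the book contribute nothing
        doc_part = (k1 + 1) * tf / (k1 + tf)
        query_part = (k3 + 1) * wq / (k3 + wq)
        score += doc_part * idf(df[term], n_docs) * query_part
    return score


# Toy example (hypothetical counts): one query term, two books.
df = {"fantasy": 10}
match = bm15_score({"fantasy": 5}, {"fantasy": 1.0}, df, n_docs=1000)
miss = bm15_score({"history": 3}, {"fantasy": 1.0}, df, n_docs=1000)
```
]]></preformat>
        <p>With k3 as large as 1000, the query-side factor stays close to the query weight itself, so the choice between raw frequency and tf-iqf weights carries through almost linearly to the score.</p>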
        <p>To investigate the query expansion technique, and since the examples field was not
present in the topics of SBS 2014, we had to run this part of our approach on
the 208 topics of 2015 only. The evaluation is based on the relevance judgments
of SBS 2014. First, to assess the impact of the example books
on retrieval performance, we selected for each of them the top-10 tags ranked by
tf-idf. The values of the example book vector were set to 1 to give all tags the same
importance. Equation (5) was used to compute and weight the final query vector.
For this, we fixed α = 0 (no topic terms), β = 1 and δ = 0.5 (the weights of the positive and
negative example books) while varying γ (the weight of the neutral example books) from 0 to
1 in steps of 0.2. The best value found was γ = 0.8. We then combined the
original topic terms (top-10, top-20 and top-30 ranked by tf-iqf) with the top-10 tags of the
example books. The best parameters found above (β = 1, γ = 0.8, δ = 0.5) were
kept while α was varied from 0 to 1 in steps of 0.2. The best results were obtained when
we used the top-20 terms of the topic with the top-10 tags of the example books
and α = 0.4. The results of the different topic representations are shown in Table 2.</p>
        <p>In the last stage, to avoid returning books that already exist in the
catalogue of the topic starter, we removed all these books from the ranked list. The table
below shows that the results improve when using this technique.</p>
        <p>From Table 1, we note that using the tf-iqf
function to weight the topic terms gives better results than using the raw frequency of
the term: nDCG@10 increases from 0.065 to 0.101.
From Table 2, we notice that using query expansion to add further terms
to the original query also improves the results. This technique increases
nDCG@10 from 0.094 to 0.113 when we use the tags of the example books only, and
from 0.113 to 0.137 when we combine the tags of the example books with the
original query terms.</p>
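        <p>The catalogue filtering step mentioned above (removing the books that the topic starter already has) amounts to a simple post-processing of the ranked list. A minimal Python sketch follows; the function name remove_catalogued and the cut-off k are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def remove_catalogued(ranked_books, catalogue, k=1000):
    """Drop books already in the topic starter's catalogue, keeping rank order."""
    owned = set(catalogue)
    return [book for book in ranked_books if book not in owned][:k]


# Toy ranked list and catalogue (hypothetical book identifiers).
filtered = remove_catalogued(["b3", "b1", "b7", "b2"], ["b1", "b2"])
```
]]></preformat>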
        <p>It is also worth mentioning that using the topics of INEX 2014 as the training
set and the topics of INEX 2015 as the test set, which are almost the same, can
overfit the learned model parameters. To avoid this overfitting, it would have been
better to use n-fold cross-validation.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we described our participation in the INEX 2015 Social Book Search
track. Our approach investigates query term weighting techniques to
select the most significant terms of the topic. We applied two methods: the tf-iqf
function to weight the terms of the topic, and the Rocchio technique to expand and
reweight the query terms. Both methods gave interesting results, especially the
query expansion method. Although we used the user profiles file in our
experiments, we exploited only the tags assigned by users to books. In future work,
it would be interesting to use other information from this file, such as ratings, the personal
catalogues of each user and the similarity between users, to experiment with collaborative
filtering and recommender systems in order to improve the results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bellot</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bogers</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Geva</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huurdeman</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kamps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koolen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moriceau</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Preminger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>SanJuan</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schenkel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Skov</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tannier</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Walsh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          (
          <year>2014</year>
          ).
          <article-title>Overview of INEX 2014</article-title>
          . In
          <source>Information Access Evaluation. Multilinguality, Multimodality, and Interaction</source>
          (pp.
          <fpage>212</fpage>
          -
          <lpage>228</lpage>
          ). Springer International Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Rocchio</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          : Relevance Feedback in Information Retrieval. Prentice Hall, Englewood Cliffs, New Jersey (
          <year>1971</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Porter</surname>
            ,
            <given-names>M. F.</given-names>
          </string-name>
          (
          <year>1980</year>
          ).
          <article-title>An algorithm for suffix stripping</article-title>
          .
          <source>Program: Electronic Library and Information Systems</source>
          ,
          <volume>14</volume>
          (
          <issue>3</issue>
          ),
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Walker</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jones</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hancock-Beaulieu</surname>
            ,
            <given-names>M. M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Gatford</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>1995</year>
          ).
          <article-title>Okapi at TREC-3</article-title>
          . NIST SPECIAL PUBLICATION SP,
          <volume>109</volume>
          -
          <fpage>109</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Salton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wong</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>C. S.</given-names>
          </string-name>
          (
          <year>1975</year>
          ).
          <article-title>A vector space model for automatic indexing</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>