<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information Retrieval Baselines for the ResPubliQA Task∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joaquín Pérez-Iglesias</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Guillermo Garrido</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Álvaro Rodrigo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lourdes Araujo</string-name>
        </contrib>
        <email>{joaquin.perez, ggarrido, alvarory, lurdes, anselmo}@lsi.uned.es</email>
      </contrib-group>
      <abstract>
        <p>This paper describes the baselines proposed for the ResPubliQA 2009 task. These baselines are based purely on information retrieval techniques. Selecting a retrieval model that fits the specific characteristics of the supplied data is considered a core part of the task. An inadequate retrieval function would return a subset of paragraphs in which the answer might not appear, so any later technique applied to detect the answer within that subset of candidate paragraphs would fail. To check how well a pure information retrieval approach retrieves the right paragraph, two baselines are proposed. Both use the Okapi BM25 [3] ranking function, with and without stemming. The main aim was to test how well a pure information retrieval system can perform on this task.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>∗ This paper corrects a minor mistake found in the results of the unstemmed baseline. The previous results
were slightly lower than those presented here.</p>
      <p>1 http://langtech.jrc.it/JRC-Acquis.html</p>
      <p>
        Related work can be found in previous international competitions and workshops such as
NTCIR-7 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the TREC Genomics Track 2006 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and 2007, and in the IR4QA workshops held within
COLING 2008 and SIGIR 2004.
      </p>
      <p>These works present different techniques for applying
classical retrieval models to the selection of paragraphs or snippets within a document. In general,
these techniques adapt the model to the characteristics of the data rather than using it as is.</p>
      <p>The proposed baselines can be considered the first phase of a typical pipeline architecture
for question answering: a first selection of the paragraphs considered relevant to the
proposed question. The focus is therefore on obtaining a first set of paragraphs ordered
by their relevance to the question. The precision in terms of retrieving a correct
answer to the question within the top k paragraphs bounds, in some sense, the overall quality
of the full system. To retrieve the most relevant paragraphs, the full collection has been
indexed by paragraph, removing stopwords and applying a stemmer when one is available.
Both processes are language-specific.</p>
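      <p>As a minimal sketch of this indexing step, the following Python fragment builds a paragraph-level inverted index with stopword removal and an optional stemming hook. The tokeniser, the toy stoplist, the stem parameter and the function names are assumptions made for illustration; this is not the Lucene-based implementation used for the baselines.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter, defaultdict

# Toy English stoplist (assumption); the baselines use full per-language stoplists.
STOPWORDS = {"the", "of", "a", "in", "to", "is", "and"}


def analyse(text, stopwords=STOPWORDS, stem=None):
    """Lowercase, tokenise, drop stopwords and optionally stem each token."""
    tokens = [t for t in re.findall(r"\w+", text.lower()) if t not in stopwords]
    return [stem(t) for t in tokens] if stem else tokens


def build_index(paragraphs, stem=None):
    """Index each paragraph as its own document.

    Returns (postings, lengths): postings maps a term to {paragraph id: frequency},
    lengths maps a paragraph id to its number of indexed tokens.
    """
    postings = defaultdict(dict)
    lengths = {}
    for pid, text in enumerate(paragraphs):
        counts = Counter(analyse(text, stem=stem))
        lengths[pid] = sum(counts.values())
        for term, freq in counts.items():
            postings[term][pid] = freq
    return postings, lengths


# Hypothetical paragraphs, for illustration only.
paragraphs = [
    "The Council adopted the regulation in 2004.",
    "The regulation applies to imports of steel.",
]
postings, lengths = build_index(paragraphs)
print(dict(postings["regulation"]))
print(lengths)
```
]]></preformat>
      <p>Passing a language-specific stemming function as stem would give the stemmed variant of the index; leaving it unset gives the unstemmed one.</p>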
    </sec>
    <sec id="sec-2">
      <title>Retrieval Model</title>
      <p>The selection of an adequate retrieval model is a key part of this task, as only the returned
paragraphs in this phase will be analysed to check if any of them contains the answer for the
proposed question. In general, retrieval models are built around three basic statistics from the
data: frequency of terms in a document; frequency of a term in the collection, where document
frequency (DF) or collection frequency (CF) can be applied; and document length.</p>
      <p>The ideal ranking function for this task should be adaptable enough to fit the specific
characteristics of the data in use. For the ResPubliQA 2009 task the documents are actually paragraphs
with an average length of ten terms, in which the frequency of a question term
will hardly ever exceed one. Given these characteristics, a good candidate paragraph is one that
contains the maximum number of question terms (excluding
stopwords) and has a length similar to the average; the ranking should avoid giving too much weight to term
frequency within the paragraph.</p>
      <p>
        The classic Vector Space Model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] is not an adequate option for this task, since
this model typically normalises the weight assigned to a document by the document length<sup>2</sup>.
As a result, the shortest paragraphs that contain at least one question term would
obtain the highest scores. Moreover, the typical saturation of term frequency used in this model, with the
logarithm or the square root, gives too much weight to the term frequency. This
can be seen in equation (1), where frequency is saturated with the square root and normalisation
is carried out by dividing by the square root of the length.
      </p>
      <p>$$ R(q,d) = \sum_{t \in q} \frac{\sqrt{freq_{t,d}} \cdot idf_t}{\sqrt{length(d)}} \qquad (1) $$</p>
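      <p>The following Python sketch evaluates equation (1) on two toy paragraphs to illustrate the length effect described above. The idf values, the paragraphs and the function name are hypothetical and chosen only to make the arithmetic easy to follow.</p>
      <preformat><![CDATA[
```python
import math


def vsm_score(query_terms, doc_terms, idf):
    """Equation (1): sum over query terms of sqrt(freq) * idf / sqrt(length)."""
    length = len(doc_terms)
    if length == 0:
        return 0.0
    return sum(
        math.sqrt(doc_terms.count(t)) * idf.get(t, 0.0) / math.sqrt(length)
        for t in query_terms
    )


# Hypothetical idf values and paragraphs, for illustration only.
idf = {"steel": 2.0, "imports": 2.0}
query = ["steel", "imports"]
short_par = ["steel"]                              # one question term, length 1
long_par = ["steel", "imports"] + ["filler"] * 14  # both question terms, length 16

print(vsm_score(query, short_par, idf))
print(vsm_score(query, long_par, idf))
```
]]></preformat>
      <p>In this toy case the one-term paragraph that matches a single question term outscores the longer paragraph that matches both, which is the behaviour the text argues against for this task.</p>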
      <p>
        A more adequate ranking function for this task is BM25 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. In this ranking function the effect
of term frequency and document length on the final score of a document can be controlled by
setting two parameters (b, k1). The effect of these parameters on the
ResPubliQA data is explained below.
      </p>
      <p>The parameter b defines the length normalisation applied, which is computed as in the next
equation:</p>
      <p>$$ B = (1 - b) + b \cdot \frac{dl}{avdl} $$</p>
      <p>where b ∈ [0, 1], dl is the document length, and avdl is the average document length
within the full collection. Setting b to 0 disables normalisation, so
the document length does not affect the final score (B = 1). Setting b to 1
applies full normalisation, B = dl / avdl.</p>
      <p><sup>2</sup> As is done in the Lucene framework.</p>
      <p>$$ tf = \frac{freq_{t,d}}{B} $$</p>
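      <p>As a worked example of the normalisation, assume b = 0.6 (the value reported in the Experimentation Settings section), an average paragraph length avdl = 10 (the average length mentioned above), and a hypothetical paragraph of length dl = 20 in which a question term occurs once:</p>
      <preformat><![CDATA[
```latex
\begin{align*}
B  &= (1 - 0.6) + 0.6 \cdot \frac{20}{10} = 0.4 + 1.2 = 1.6 \\
tf &= \frac{freq_{t,d}}{B} = \frac{1}{1.6} = 0.625
\end{align*}
```
]]></preformat>
      <p>For a paragraph of exactly average length (dl = 10) the same values give B = 1 and tf = 1, so the raw frequency is left unchanged.</p>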
      <p>Once the normalisation factor has been calculated, it is applied to the term frequency. The final
score is then computed by applying a term frequency saturation governed by the parameter k1, which lets us
control the effect of frequency on the final score, as can be seen in the next equation:</p>
      <p>$$ R(q,d) = \sum_{t \in q} \frac{tf}{tf + k_1} \cdot idf_t \qquad (2) $$</p>
      <p>where ∞ &gt; k1 &gt; 0. Finally, the IDF (Inverse Document Frequency) is typically computed as follows:</p>
      <p>$$ idf_t = \frac{N - df_t + 0.5}{df_t + 0.5} $$</p>
      <p>
        where N is the total number of documents in the collection and df_t is the number of documents
in the collection that contain t. An implementation of the BM25 ranking function over Lucene
was developed for this work<sup>3</sup>. The details of this implementation can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The final
expression for the BM25 ranking function can be written as follows:
      </p>
      <p>$$ R(q,d) = \sum_{t \in q} \frac{freq_{t,d}}{k_1 \left( (1 - b) + b \cdot \frac{dl}{avdl} \right) + freq_{t,d}} \cdot idf_t $$</p>
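      <p>A minimal Python sketch of the final BM25 expression follows. The function name, argument names and toy collection statistics are assumptions for illustration; this is not the Lucene implementation referenced in the text. The defaults b = 0.6 and k1 = 0.1 are the values reported in the Experimentation Settings section. The IDF follows the ratio given in the text; many BM25 formulations additionally take its logarithm.</p>
      <preformat><![CDATA[
```python
def bm25_score(query_terms, doc_terms, df, n_docs, avdl, b=0.6, k1=0.1):
    """Sum over query terms of freq / (k1 * B + freq) * idf."""
    dl = len(doc_terms)
    norm = (1 - b) + b * dl / avdl  # length normalisation factor B
    score = 0.0
    for term in dict.fromkeys(query_terms):  # each query term once, in order
        freq = doc_terms.count(term)
        if freq == 0 or term not in df:
            continue
        idf = (n_docs - df[term] + 0.5) / (df[term] + 0.5)  # ratio as in the text
        score += freq / (k1 * norm + freq) * idf
    return score


# Hypothetical collection statistics, for illustration only.
n_docs, avdl = 1000, 10
df = {"steel": 10, "imports": 50}
query = ["steel", "imports"]
average_par = ["steel", "imports"] + ["filler"] * 8  # length 10
long_par = ["steel", "imports"] + ["filler"] * 18    # length 20

print(bm25_score(query, average_par, df, n_docs, avdl))
print(bm25_score(query, long_par, df, n_docs, avdl))
```
]]></preformat>
      <p>Calling the function with b=0.0 gives both paragraphs the same score, since B = 1 regardless of length.</p>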
    </sec>
    <sec id="sec-3">
      <title>Experimentation Settings</title>
      <p>To test the precision of our retrieval system we propose two baselines. In
both baselines the paragraph selected to answer the question is the one that appears first
in the ranking obtained after the retrieval phase. The only difference between the baselines is whether
a stemming pre-process is carried out. Stemming for
each language<sup>4</sup> uses the Snowball implementation available at http://snowball.tartarus.org/.
The resources used for each language can be downloaded from the following
sites:</p>
      <list list-type="order">
        <list-item>
          <p>Bulgarian</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/bulgarianST.txt</p></list-item>
            <list-item><p>Stemmer: Not available</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>German</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/germanST.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/german/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>English</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/englishST.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/english/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>French</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/frenchST.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/french/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>Italian</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/italianST.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/italian/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>Portuguese</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/portugueseST2.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/portuguese/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>Romanian</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/roumanianST.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/romanian/stemmer.html</p></list-item>
          </list>
        </list-item>
        <list-item>
          <p>Spanish</p>
          <list list-type="bullet">
            <list-item><p>Stoplist: http://members.unine.ch/jacques.savoy/clef/spanishSmart.txt</p></list-item>
            <list-item><p>Stemmer: http://snowball.tartarus.org/algorithms/spanish/stemmer.html</p></list-item>
          </list>
        </list-item>
      </list>
      <p><sup>3</sup> http://nlp.uned.es/~jperezi/Lucene-BM25/</p>
      <p><sup>4</sup> Except Bulgarian, for which no stemmer could be found.</p>
      <p>The parameter values are the same for both baselines and have been fixed as follows:</p>
      <list list-type="order">
        <list-item><p>b: 0.6. Those paragraphs with a length over the average will obtain a slightly higher score.</p></list-item>
        <list-item><p>k1: 0.1. The effect of term frequency on the final score will be minimised.</p></list-item>
      </list>
      <p>Both parameters were fixed to these values after a training phase with the English
development set supplied by the organisation. The results obtained are shown and described in detail
in the next section.</p>
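      <p>To illustrate why a small k1 minimises the effect of term frequency, the sketch below evaluates the saturation term tf / (tf + k1) from the Retrieval Model section for k1 = 0.1 and, purely for comparison, for a hypothetical larger value k1 = 1.2 that is not used in this work.</p>
      <preformat><![CDATA[
```python
def saturation(tf, k1):
    """Term frequency saturation tf / (tf + k1)."""
    return tf / (tf + k1)


# k1 = 0.1 is the value fixed for the baselines; 1.2 is a comparison value (assumption).
for k1 in (0.1, 1.2):
    print(k1, [round(saturation(tf, k1), 3) for tf in (1, 2, 3)])
```
]]></preformat>
      <p>With k1 = 0.1 the term already contributes about 0.91 of its maximum weight at a frequency of one, so further occurrences add very little; with the larger comparison value the contribution keeps growing noticeably with frequency.</p>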
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <p>The results obtained by both baselines are shown in Table 1, together with
the average paragraph length and the best result obtained for each language.
Empty cells indicate that no run was submitted for that language. For
both baselines the table gives the number of correct paragraphs returned for the 500 questions, with
the average P@1 per question in brackets, that is, the number of questions answered
correctly divided by the total number of questions.</p>
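      <p>The P@1 measure described above can be sketched in a few lines of Python. The question and paragraph identifiers and the gold standard below are hypothetical, and the function name is an assumption for illustration.</p>
      <preformat><![CDATA[
```python
def precision_at_1(top_paragraph, gold):
    """Number of questions answered correctly divided by the number of questions."""
    correct = sum(
        1 for question, pid in top_paragraph.items()
        if pid in gold.get(question, set())
    )
    return correct / len(top_paragraph)


# Hypothetical run (question id to top-ranked paragraph id) and gold standard.
top_paragraph = {"q1": "p12", "q2": "p7", "q3": "p3", "q4": "p44"}
gold = {"q1": {"p12"}, "q2": {"p9"}, "q3": {"p3", "p5"}, "q4": {"p40"}}

print(precision_at_1(top_paragraph, gold))
```
]]></preformat>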
      <p>[Table 1. Results of both baselines, average paragraph length and best result for each language: Bulgarian, English, French, German, Italian, Portuguese, Romanian and Spanish.]</p>
      <p>Some preliminary conclusions can be drawn from these results. First, it is clear that
the best results were obtained for English (0.53). This is easily explained
by the fact that the parameters were fixed using the English development set. Tuning
the parameters specifically for each language is expected to leave room for improvement.</p>
      <p>As Table 1 shows, the best results are generally achieved
with stemming.</p>
      <p>Moreover, for languages with more lexical variability (Spanish, French)
the improvement obtained with stemming is clearly larger than for lexically simpler
languages such as English, where less variation appears.</p>
      <p>The lowest performance is observed for Bulgarian and German; we believe
this is due to the higher complexity of these languages. Performance for these
languages could be increased by using a lemmatiser instead of a stemmer.</p>
      <p>Finally, no correlation was found between performance and average paragraph
length for each language.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgements</title>
      <p>This work has been partially supported by the Spanish Ministry of Science and Innovation
within the project QEAVis-Catiex (TIN2007-67581-C02-01), the TrebleCLEF Coordination
Action, within FP7 of the European Commission, Theme ICT-1-4-1 Digital Libraries and
Technology Enhanced Learning (Contract 215231), the Regional Government of Madrid under the
Research Network MAVIR (S-0505/TIC-0267), the Education Council of the Regional Government
of Madrid and the European Social Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name><given-names>William R.</given-names> <surname>Hersh</surname></string-name>,
          <string-name><given-names>Aaron M.</given-names> <surname>Cohen</surname></string-name>,
          <string-name><given-names>Phoebe M.</given-names> <surname>Roberts</surname></string-name>, and
          <string-name><given-names>Hari Krishna</given-names> <surname>Rekapalli</surname></string-name>.
          <article-title>TREC 2006 Genomics Track Overview</article-title>.
          In Ellen M. Voorhees and Lori P. Buckland, editors,
          <source>TREC</source>, volume Special Publication 500-272. National Institute of Standards and Technology (NIST),
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name><given-names>Joaquín</given-names> <surname>Pérez-Iglesias</surname></string-name>,
          <string-name><given-names>José R.</given-names> <surname>Pérez-Agüera</surname></string-name>,
          <string-name><given-names>Víctor</given-names> <surname>Fresno</surname></string-name>, and
          <string-name><given-names>Yuval Z.</given-names> <surname>Feinstein</surname></string-name>.
          <article-title>Integrating the Probabilistic Models BM25/BM25F into Lucene</article-title>.
          <source>CoRR</source>, abs/0911.5046,
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>Stephen E.</given-names> <surname>Robertson</surname></string-name> and
          <string-name><given-names>Steve</given-names> <surname>Walker</surname></string-name>.
          <article-title>Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval</article-title>.
          In W. Bruce Croft and C. J. van Rijsbergen, editors,
          <source>SIGIR</source>, pages
          <fpage>232</fpage>-<lpage>241</lpage>. ACM/Springer,
          <year>1994</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
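        <mixed-citation>
          [4]
          <string-name><given-names>Tetsuya</given-names> <surname>Sakai</surname></string-name>,
          <string-name><given-names>Noriko</given-names> <surname>Kando</surname></string-name>,
          <string-name><given-names>Chuan-Jie</given-names> <surname>Lin</surname></string-name>,
          <string-name><given-names>Teruko</given-names> <surname>Mitamura</surname></string-name>,
          <string-name><given-names>Hideki</given-names> <surname>Shima</surname></string-name>,
          <string-name><given-names>Donghong</given-names> <surname>Ji</surname></string-name>,
          <string-name><given-names>Kuang-Hua</given-names> <surname>Chen</surname></string-name>, and
          <string-name><given-names>Eric</given-names> <surname>Nyberg</surname></string-name>.
          <source>Overview of the NTCIR-7 ACLIA IR4QA Task</source>.
          <year>2008</year>.
        </mixed-citation>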
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Salton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Yang</surname>
          </string-name>
          .
          <article-title>A Vector Space Model for Automatic Indexing</article-title>
          .
          <source>Commun. ACM</source>
          ,
          <volume>18</volume>
          (
          <issue>11</issue>
          ):
          <fpage>613</fpage>
          -
          <lpage>620</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>