<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Claudia Hauf</string-name>
          <email>c.hauf@tudelft.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Delft University of Technology</institution>
          , Delft,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <fpage>44</fpage>
      <lpage>46</lpage>
      <abstract>
        <p>The Lemur Project was set up in 2000 by the Center for Intelligent Information Retrieval at UMass Amherst. It is one of the longest lasting open-source projects in the information retrieval (IR) research community. Among the released tools is Indri, a popular search engine that was designed for language-modeling based approaches to IR. For OSIRRC 2019 we dockerized Indri and added support for the Robust04, Core18 and GOV2 test collections.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>OVERVIEW</title>
      <p>
        As part of the Lemur Project (https://lemurproject.org/), a number of tools have been
developed, most notably Galago, Indri [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and RankLib. Indri has been, and still is, a widely popular research search engine
implemented in C++ that allows for the efficient development
and evaluation of novel language-modeling-based approaches to IR.
In addition, Indri offers a query language that provides support for
constraints based on proximity, document fields, syntax matches,
and so on.
      </p>
      <p>Here we describe the implementation of the Indri Docker
image for the OSIRRC 2019 challenge, the incorporated baselines, the
results, and issues observed along the way.</p>
    </sec>
    <sec id="sec-2">
      <title>DOCKER IMAGE DESIGN</title>
      <p>The design of our Docker image is tied to the jig (https://github.com/osirrc/jig), a toolkit
developed specifically for OSIRRC 2019, which provides a number
of “hooks” (such as index and search) that are particular to the
workflow of search systems.</p>
    </sec>
    <sec id="sec-3">
      <title>Dockerfile</title>
      <p>The Dockerfile builds an image based on Ubuntu 16.04. Apart from
Indri v5.13 itself, a number of additional software packages are
installed, such as nodejs (one of the scripts to prepare the Core18
collection is a Node.js script) and python (to interact with the jig).</p>
    </sec>
    <sec id="sec-3-2">
      <title>index</title>
      <p>This hook indexes the corpora mounted by the jig, making use
of Indri’s IndriBuildIndex. We support three corpora (Core18,
GOV2, and Robust04), each of which requires different preprocessing
steps:</p>
      <p>Robust04: The original Robust04 corpus is .z compressed, a
      <p>pression format Indri does not support. And thus, we first
&lt;desc&gt; Description:
Identify organizations that participate in international
criminal activity, the activity, and, if possible,
collaborating organizations and the countries involved.
&lt;narr&gt; Narrative:
A relevant document must as a minimum identify the
organization and the type of illegal activity (e.g.,
Columbian cartel exporting cocaine). Vague references to
international drug trade without identification of the
organization(s) involved would not be relevant.
&lt;/top&gt;
need to uncompress the corpus and filter out undesired
folders such as cr (as Indri does not support excluding
particular subfolders from indexing) before starting the indexing
process.</p>
      <p>Core18 The corpus is provided in JSON format and first needs to</p>
      <p>be converted to a document format Indri supports.</p>
      <p>GOV2 Among the three corpora only GOV2 is well suited for Indri,</p>
      <p>it can be indexed without any further preprocessing.</p>
      <p>The created indices are stemmed (Krovetz) and have stopwords
removed. For the latter, we relied on the Lemur Project stopword list
(http://www.lemurproject.org/stopwords/stoplist.dft), which contains
418 stopwords.</p>
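      <p>IndriBuildIndex is driven by an XML parameter file. The sketch below,
with placeholder paths and a placeholder stopword-file location, writes a
parameter file matching the settings above (Krovetz stemming, the Lemur
stopword list) and invokes IndriBuildIndex on it.</p>
      <preformat>import subprocess

# Placeholder locations; the real ones depend on the jig's mounts.
CORPUS_PATH = "/work/robust04-prepared"
INDEX_PATH = "/work/indexes/robust04"
STOPWORD_FILE = "/work/stoplist.dft"  # the 418-word Lemur stopword list

stopwords = open(STOPWORD_FILE).read().split()
stopper = "\n".join(f"    &lt;word&gt;{w}&lt;/word&gt;" for w in stopwords)

# IndriBuildIndex reads all of its configuration from an XML parameter file.
params = f"""&lt;parameters&gt;
  &lt;index&gt;{INDEX_PATH}&lt;/index&gt;
  &lt;corpus&gt;
    &lt;path&gt;{CORPUS_PATH}&lt;/path&gt;
    &lt;class&gt;trectext&lt;/class&gt;
  &lt;/corpus&gt;
  &lt;stemmer&gt;&lt;name&gt;krovetz&lt;/name&gt;&lt;/stemmer&gt;
  &lt;stopper&gt;
{stopper}
  &lt;/stopper&gt;
&lt;/parameters&gt;
"""

with open("/work/build_params.xml", "w") as f:
    f.write(params)

subprocess.run(["IndriBuildIndex", "/work/build_params.xml"], check=True)</preformat>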
    </sec>
    <sec id="sec-3-3">
      <title>search</title>
      <p>This hook is responsible for creating a retrieval run.</p>
      <p>Topic files. In a preprocessing step, the TREC topic files (an
example topic of Robust04 is shown in Figure 1) have to be reformatted,
as Indri requires topic files to adhere to a particular format.</p>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption><p>An example TREC topic of the Robust04 collection (excerpt): description and narrative of topic 301.</p></caption>
        <preformat>&lt;desc&gt; Description:
Identify organizations that participate in international
criminal activity, the activity, and, if possible,
collaborating organizations and the countries involved.
&lt;narr&gt; Narrative:
A relevant document must as a minimum identify the
organization and the type of illegal activity (e.g.,
Columbian cartel exporting cocaine). Vague references to
international drug trade without identification of the
organization(s) involved would not be relevant.
&lt;/top&gt;</preformat>
      </fig>
      <p>Next to reformatting, special characters (punctuation marks, etc.)
have to be removed. Indri does not provide specific tooling for
this step, and one either has to investigate how exactly Indri deals
with special characters during the indexing phase (thus matching
the processing of special characters in order to achieve optimal
retrieval effectiveness) or rely on very restrictive filtering (removing
anything but alphanumeric characters). We opted for the latter. In
contrast, stemming does not have to be applied, as Indri applies
the same stemming to each query as specified in the index manifest
(created during the indexing phase).</p>
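      <p>A minimal sketch of the topic reformatting, assuming title-only topics
and the restrictive filtering described above (the function names and the
exact output layout are ours; cf. Figure 2 for the query format):</p>
      <preformat>import re

def clean(text):
    # Restrictive filtering: keep only alphanumeric characters and spaces.
    return re.sub(r"[^A-Za-z0-9 ]", " ", text).strip()

def to_indri_queries(topics):
    # topics: list of (number, title) pairs parsed from the TREC topic file.
    lines = ["&lt;parameters&gt;"]
    for number, title in topics:
        lines.append("&lt;query&gt;")
        lines.append(f"&lt;number&gt;{number}&lt;/number&gt;")
        lines.append(f"&lt;text&gt;#combine( {clean(title)} )&lt;/text&gt;")
        lines.append("&lt;/query&gt;")
    lines.append("&lt;/parameters&gt;")
    return "\n".join(lines)

print(to_indri_queries([("301", "International Organized Crime")]))</preformat>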
      <p>Only standard stopword removal is applied to the topics; this
means that in the TREC description and TREC narrative, phrases
such as Identify ... or A relevant document ... (cf. Figure 1) remain in
the final query after preprocessing.</p>
      <fig id="fig2">
        <label>Figure 2</label>
        <caption><p>The four query variants generated for topic 301: language modeling (LM), sequential dependence (SD), description-only (DESC), and BM25.</p></caption>
        <preformat>&lt;query&gt;
&lt;number&gt;301-LM&lt;/number&gt;
&lt;text&gt;#combine( international organized crime )&lt;/text&gt;
&lt;/query&gt;
&lt;query&gt;
&lt;number&gt;301-SD&lt;/number&gt;
&lt;text&gt;#weight( 0.9 #combine(international organized crime)
0.05 #combine(#1(international organized) #1(organized crime) )
0.05 #combine(#uw8(international organized) #uw8(organized crime)) )&lt;/text&gt;
&lt;/query&gt;
&lt;query&gt;
&lt;number&gt;301-DESC&lt;/number&gt;
&lt;text&gt;#combine( identify organizations that participate in international criminal activity the activity and
if possible collaborating organizations and the countries involved )&lt;/text&gt;
&lt;/query&gt;
&lt;query&gt;
&lt;number&gt;301-BM25&lt;/number&gt;
&lt;text&gt;international organized crime&lt;/text&gt;
&lt;/query&gt;</preformat>
      </fig>
      <p>Moreover, different retrieval methods require differently
formatted topic files (e.g., the BM25 retrieval model does not support
complex queries, cf. Figure 2). Depending on the topic type (e.g.,
TREC title-only topics, description-only topics, title+description
topics), different queries are created.</p>
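      <p>The SD variant can be derived mechanically from the title terms. A
minimal sketch (the helper name and its defaults are ours; the weights and
operators mirror Figure 2):</p>
      <preformat>def sd_query(terms, w_t=0.9, w_o=0.05, w_u=0.05):
    # Sequential dependence query as in Figure 2: unigrams, exact
    # bigrams (#1), and unordered windows of size 8 (#uw8).
    unigrams = " ".join(terms)
    bigrams = " ".join(f"#1({a} {b})" for a, b in zip(terms, terms[1:]))
    windows = " ".join(f"#uw8({a} {b})" for a, b in zip(terms, terms[1:]))
    return (f"#weight( {w_t} #combine({unigrams}) "
            f"{w_o} #combine({bigrams}) "
            f"{w_u} #combine({windows}) )")

print(sd_query(["international", "organized", "crime"]))</preformat>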
      <p>Retrieval rules. The jig provides an option --opts which allows
extra options to be passed to the search hook. We use it, among
others, to specify (i) the retrieval rule (all retrieval methods documented
at https://lemurproject.org/doxygen/lemur/html/IndriRunQuery.html are
supported), (ii) whether to include pseudo-relevance feedback
(PRF, use_prf="1"), and (iii) whether to use the sequential
dependence (SD, sd="1") model. The hyperparameters for both PRF
and SD are fixed. Specifically, for PRF we use 50 feedback documents,
25 feedback terms, and equal weighting of the original and expanded
query model. The SD weights are set to 0.9 for the original query,
0.05 for bigrams, and 0.05 for unordered windows. These settings
are based on prior work; a better approach would be to employ
hyperparameter tuning.</p>
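      <p>To make the mapping from --opts to Indri concrete, the following
sketch shows how such options can be translated into an IndriRunQuery
invocation. The -rule, -count, -trecFormat, -fbDocs, -fbTerms, and
-fbOrigWeight parameters are standard IndriRunQuery options; the helper
itself and the file paths are our assumptions.</p>
      <preformat>import subprocess

def run_indri(index, query_file, rule, use_prf=False, out_file="outfile"):
    # Translate the jig's --opts values into an IndriRunQuery call.
    cmd = [
        "IndriRunQuery", query_file,
        f"-index={index}",
        f"-rule={rule}",      # e.g. "method:dirichlet,mu:1000"
        "-count=1000",        # retrieve 1000 documents per topic
        "-trecFormat=true",   # emit TREC run-file format
    ]
    if use_prf:
        # Fixed PRF hyperparameters: 50 feedback documents, 25 feedback
        # terms, equal weight for original and expanded query model.
        cmd += ["-fbDocs=50", "-fbTerms=25", "-fbOrigWeight=0.5"]
    with open(out_file, "w") as out:
        subprocess.run(cmd, stdout=out, check=True)

run_indri("/work/indexes/robust04", "/work/queries.xml",
          "method:dirichlet,mu:1000", use_prf=True)</preformat>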
    </sec>
    <sec id="sec-4">
      <title>RESULTS</title>
      <p>
        Table 1 showcases the use of the optional parameter of the jig’s
search hook to set the retrieval rules. We report the retrieval
effectiveness in MAP. When comparing our results to those reported
in prior works using Indri and (at least) Robust04 [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref5">1–3, 5</xref>
        ], we
observe similar trends, though with smaller absolute effectiveness
differences: SD and PRF are both more effective than the vanilla
language modeling approach, and their combination performs best.
BM25 performs somewhat worse than expected, an outcome we
argue is due to our lack of hyperparameter tuning. The biggest
differences can be found in the results we report for queries solely
derived from the TREC topic descriptions (instead of a combination
of title and description): our results are significantly worse than
the title-only baseline, which we attribute to a lack of “cleaning up”
those descriptions (i.e., removing phrases like Relevant documents
include).
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Retrieval effectiveness (MAP) for the supported --opts configurations of the search hook.</p></caption>
        <table>
          <thead>
            <tr>
              <th>--opts</th>
              <th>Robust04</th>
              <th>GOV2</th>
              <th>Core18</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>out_file_name="outfile" rule="method:dirichlet,mu:1000" topic_type="title"</td><td>0.2499</td><td>0.2800</td><td>0.2332</td></tr>
            <tr><td>out_file_name="outfile" rule="method:dirichlet,mu:1000" topic_type="title" sd="1"</td><td>0.2547</td><td>0.2904</td><td>0.2428</td></tr>
            <tr><td>out_file_name="outfile" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1"</td><td>0.2812</td><td>0.3033</td><td>0.2800</td></tr>
            <tr><td>out_file_name="outfile" rule="method:dirichlet,mu:1000" topic_type="title" use_prf="1" sd="1"</td><td>0.2855</td><td>0.3104</td><td>0.2816</td></tr>
            <tr><td>out_file_name="outfile" rule="okapi,k1:1.2,b:0.75" topic_type="title+desc"</td><td>0.2702</td><td>0.2705</td><td>0.2457</td></tr>
            <tr><td>out_file_name="outfile" rule="method:dirichlet,mu:1000" topic_type="desc"</td><td>0.2023</td><td>0.1336</td><td>0.1674</td></tr>
          </tbody>
        </table>
      </table-wrap>
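      <p>The MAP values in Table 1 are computed with trec_eval. A minimal
sketch of the evaluation step, with hypothetical file locations:</p>
      <preformat>import subprocess

qrels = "/work/qrels.robust04.txt"  # relevance judgments for the collection
run = "/work/outfile"               # run file produced by the search hook

# trec_eval -m map prints a single line: "map  all  0.xxxx".
result = subprocess.run(["trec_eval", "-m", "map", qrels, run],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())</preformat>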
    </sec>
    <sec id="sec-5">
      <title>CONCLUSIONS</title>
      <p>Creating the Docker image for Indri was more work than
anticipated. One unexpected problem turned out to be the sourcing of
the original corpora (instead of the processed versions suited for Indri
that had been “passed down” from researcher to researcher within
our lab). In addition, for almost every corpus/topic set combination
a different preprocessing script had to be written, which turned into
a lengthy process as (i) Indri tends to fail silently (e.g., a failure to
process a query with special characters is only flagged when
running trec_eval, as the exception is simply written to the result
file) and (ii) debugging a Docker image is not trivial.</p>
      <p>In the next step, we will implement automatic hyperparameter
tuning.</p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This research has been supported by NWO project SearchX (639.022.722).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Michael</given-names>
            <surname>Bendersky</surname>
          </string-name>
          , Donald Metzler, and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Effective query formulation with multiple information sources</article-title>
          .
          <source>In Proceedings of the fifth ACM international conference on Web search and data mining. ACM</source>
          ,
          <fpage>443</fpage>
          -
          <lpage>452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Zhuyun</given-names>
            <surname>Dai</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Deeper Text Understanding for IR with Contextual Neural Language Modeling</article-title>
          . arXiv preprint arXiv:1905.09217 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Van</given-names>
            <surname>Dang</surname>
          </string-name>
          and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Query reformulation using anchor text</article-title>
          .
          <source>In Proceedings of the third ACM international conference on Web search and data mining. ACM</source>
          ,
          <fpage>41</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Strohman</surname>
          </string-name>
          , Donald Metzler, Howard Turtle, and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Indri: A language model-based search engine for complex queries (extended version)</article-title>
          .
          <source>CIIR Technical Report.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Guoqing</given-names>
            <surname>Zheng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jamie</given-names>
            <surname>Callan</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Learning to reweight terms with distributed representations</article-title>
          .
          <source>In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval. ACM</source>
          ,
          <fpage>575</fpage>
          -
          <lpage>584</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>