<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>Proceedings of the Open-Source IR Replicability Challenge (OSIRRC 2019)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The OldDog Docker Image for OSIRRC at SIGIR 2019</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chris Kamphuis</string-name>
          <email>ckamphuis@cs.ru.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjen P. de Vries</string-name>
          <email>arjen@acm.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <addr-line>Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>2409</volume>
      <fpage>47</fpage>
      <lpage>49</lpage>
      <abstract>
        <p>Modern column-store databases are perfectly suited for carrying out IR experiments, but they are not widely used in IR research. A plausible explanation is that setting up a database system and populating it with the documents to be ranked poses enough of a hurdle to never get started on this route. We took up the OSIRRC challenge to produce an easily replicable experimental setup for running IR experiments on a modern database architecture. OldDog, named after a short SIGIR paper proposing the use of column stores for IR experiments, implements standard IR ranking using BM25 as SQL queries issued to MonetDB. This provides a baseline system on par with custom IR implementations and a perfect starting point for exploring more advanced integrations of IR and databases. Reflecting on our experience in OSIRRC 2019, we found a much larger effectiveness penalty than anticipated in the prior work for using the conjunctive variant of BM25 (requiring all query terms to occur). Simplifying the SQL query to rank documents using the disjunctive variant (the usual IR ranking approach) results in longer runtimes but higher effectiveness. The interaction between query optimizations for efficiency and the resulting differences in effectiveness remains a research topic with many open questions.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>CCS CONCEPTS</title>
      <p>• Information systems → Search engine indexing; Evaluation
of retrieval results.</p>
      <p>KEYWORDS: information retrieval, replicability, column store</p>
    </sec>
    <sec id="sec-2">
      <title>OVERVIEW</title>
      <p>
        OldDog is a software project to replicate and extend the
database approach to information retrieval presented by
Mühleisen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The authors proposed that IR researchers use
column-store relational databases for their retrieval
experiments. Specifically, researchers store their
document representations in such a database, and ranking
functions are then expressed as SQL queries. This allows for easy
comparison of different ranking functions. IR researchers
need only focus on the retrieval methodology, while the
database takes care of efficiently retrieving the documents.
      </p>
      <p>
        OldDog represents the data using the schema proposed by
Mühleisen et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. An extra ‘collection identifier’ column
has been added to include the original collection identifiers.
The original paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] produced the database tables to
represent ‘postings’ using a custom program running on Hadoop.
Instead, we rely on the Anserini toolsuite [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] to create a
Lucene (https://lucene.apache.org) index. Anserini takes care of
standard document pre-processing.
      </p>
      <p>
        Like [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], OldDog uses the column-store database MonetDB
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] for query processing. Term and document information
is extracted from the Lucene index, stored as CSV files
representing the columns in the database, and loaded into
MonetDB using a standard COPY INTO command. (The intermediate
step of exporting and importing CSV files is not strictly
necessary, but it simplifies the pipeline and is robust to failures.)
      </p>
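      <p>For illustration, loading one of the exported CSV files into
MonetDB could look as follows; the file path and delimiters are
assumptions, while COPY INTO itself is standard MonetDB syntax:</p>
      <p>-- hypothetical load of the term postings (path and delimiters are assumptions)
COPY INTO terms FROM '/work/csv/terms.csv' USING DELIMITERS ',', '\n', '"';</p>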
      <p>After initialisation, document ranking is performed by
issuing SQL queries that specify the retrieval model.
Interactive querying can support the researcher with additional
examples.</p>
    </sec>
    <sec id="sec-3">
      <title>TECHNICAL DETAILS</title>
      <sec id="sec-3-1">
        <title>Supported Collections:</title>
        <p>robust04, core18</p>
      </sec>
      <sec id="sec-3-2">
        <title>Supported Hooks:</title>
        <p>init, index, search, interact</p>
        <p>The OldDog Docker image itself consists of bash/Python
‘hooks’ that wrap the underlying OldDog, Anserini and
MonetDB commands.</p>
        <p>Anserini builds a Lucene index; we are happy end-users
of the utilities it provides to index common test collections.
The code has been tested for Robust04 and Core18 in our
first release; extending to other collections readily supported
by Anserini should be trivial.</p>
        <p>OldDog provides the Java code to convert the Lucene
index created by Anserini into CSV files that are subsequently
loaded into the MonetDB database. OldDog further contains
the Python code necessary to call the Java code that pre-processes
the topics (calling the corresponding Anserini code, to
guarantee that topics are processed exactly the same way as the
documents) and to issue SQL queries to the MonetDB database
to rank the collection.</p>
        <p>Apart from the required init, index and search hooks,
OldDog supports the interact hook to spawn an SQL shell
(or, ‘a SQL shell’, pronouncing SQL as sequel like database folk do)
that allows the user to query the database interactively.</p>
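        <p>As a sketch of such a session, one could inspect basic
collection statistics directly from the shell, assuming the
document table carries a length column as in the schema
sketched below (output omitted):</p>
        <p>-- number of indexed documents and their average length
SELECT COUNT(*) AS num_docs, AVG(len) AS avg_len FROM docs;</p>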
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Schema</title>
      <p>To illustrate the database schema, consider an example
document doc1 with contents ‘I put on my shoes after I put on
my socks.’ Indexing this document results in Tables 1, 2 and 3:</p>
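      <p>
        A minimal SQL sketch of this schema (table and column names
follow [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], plus the added collection identifier; the exact types are
assumptions), with illustrative rows for doc1 assuming stopword
removal and stemming:
      </p>
      <p>-- schema sketch; names follow [2], types are assumptions
CREATE TABLE docs  (id INTEGER, collection_id VARCHAR(64), len INTEGER);
CREATE TABLE dict  (termid INTEGER, term VARCHAR(128), df INTEGER);
CREATE TABLE terms (termid INTEGER, docid INTEGER, count INTEGER);
-- illustrative rows after indexing doc1, e.g.:
--   dict:  (0, 'put', 1), (1, 'shoe', 1), (2, 'sock', 1), ...
--   terms: (0, 1, 2), (1, 1, 1), (2, 1, 1)
--   docs:  (1, 'doc1', 9)  -- length after analysis (illustrative)</p>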
    </sec>
    <sec id="sec-5">
      <title>Retrieval Model</title>
      <p>
        OldDog implements the BM25 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] ranking formula. The values
for k1 and b are fixed to 1.2 and 0.75, respectively. The
original paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] uses a conjunctive variant of this formula
that only produces results for documents in which all query
terms appear; this yields lower effectiveness scores, but
speeds up query processing, leading to better runtime
performance.
      </p>
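      <p>
        For reference, this BM25 variant is commonly written as
follows (the exact IDF formulation in OldDog may differ
slightly; N is the collection size, df_t the document frequency
of term t, tf_{t,D} its frequency in document D, |D| the
document length and avgdl the average document length):
      </p>
      <p>score(D, Q) = \sum_{t \in Q} \log\frac{N - df_t + 0.5}{df_t + 0.5}
\cdot \frac{tf_{t,D}\,(k_1 + 1)}{tf_{t,D} + k_1\,(1 - b + b\,|D|/avgdl)},
\quad k_1 = 1.2,\; b = 0.75</p>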
      <p>OldDog implements disjunctive query processing as well,
included after noticing a surprisingly large difference in
effectiveness when compared to other systems applied to Robust04.
In the disjunctive variant, documents are considered when
they contain at least one of the query terms. As expected,
runtimes for an evaluation increase when using this strategy.</p>
      <p>Table 4 summarises effectiveness scores for both methods
on two test collections, Robust04 and Core18.</p>
      <p>Listing 1 shows the conjunctive BM25 SQL query for Robust04
topic 301: International Organized Crime. The disjunctive
variant simply omits the HAVING clause.</p>
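      <p>
        A sketch of this query in the style of [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] follows; the stemmed query terms, the Robust04 collection
size (528,155 documents) and the average document length
constant are assumptions:
      </p>
      <p>-- Listing 1 (sketch): conjunctive BM25, k1 = 1.2, b = 0.75
WITH qterms AS (
  SELECT termid, docid, count FROM terms
   WHERE termid IN (SELECT termid FROM dict
                     WHERE term IN ('intern', 'organ', 'crime'))),
subscores AS (
  SELECT docs.id, len, term_tf.termid, term_tf.tf, df,
         LOG((528155 - df + 0.5) / (df + 0.5)) *
          (term_tf.tf * (1.2 + 1) /
           (term_tf.tf + 1.2 * (1 - 0.75 + 0.75 * (len / 254.0)))) AS subscore
    FROM (SELECT termid, docid, count AS tf FROM qterms) AS term_tf
    JOIN (SELECT docid FROM qterms
           GROUP BY docid
          HAVING COUNT(DISTINCT termid) = 3) AS cdocs -- conjunctive: all terms
      ON term_tf.docid = cdocs.docid
    JOIN docs ON term_tf.docid = docs.id
    JOIN dict ON term_tf.termid = dict.termid)
SELECT scores.id, score
  FROM (SELECT id, SUM(subscore) AS score
          FROM subscores GROUP BY id) AS scores
 ORDER BY score DESC
 LIMIT 1000;</p>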
      <p>Overall, we look back at an excellent learning experience
taking part in the OSIRRC challenge. The setup using Docker
containers worked out very well during coding, by multiple
people on different machines. Automated builds on Docker
Hub and the integration with Zenodo complete a fully
replicable experimental setup.</p>
      <p>The standardised API for running evaluations provided by
the ‘jig’ made it easy to learn from other groups; and mixing
version management using git (multiple branches) with
locally building Docker containers with different tags allowed
progress in parallel when implementing different extensions
of the initial code (in our case, including disjunctive query
processing and adding Core18).</p>
      <p>Recording the evaluation outcomes at release time let us
catch a bug that would have been easily overlooked without
such a setup: after including Core18, a minor bug introduced
in the code to parse topic files led to slightly different scores
on Robust04, which we could easily detect and fix (one topic
was missing) thanks to the structured approach of recording
progress. (We may also conclude that a unified topic format
for all TREC collections would be a useful improvement to
avoid errors in experiments carried out on these test collections.)</p>
      <p>Let us conclude the paper by discussing a few advantages
of database-backed IR experiments. Using the interact hook,
it is possible to issue SQL queries directly to the database.
This is useful if one wants to try different kinds of ranking
functions, or just to investigate the content of the database.
We show some examples of queries on the Robust04 test
collection.</p>
      <p>The three most frequent terms are easily extracted from
the dict table:</p>
      <p>SELECT * FROM dict ORDER BY df DESC LIMIT 3;
+--------+-------+--------+
| termid | term  |   df   |
+========+=======+========+
| 541834 | from  | 355901 |
| 563475 | ha    | 320097 |
| 894136 | which | 302365 |
+--------+-------+--------+</p>
      <p>As expected, the term distribution is skewed with a very
long tail; consider for example the number of distinct terms
that occur only once:</p>
      <p>SELECT COUNT(*) AS terms FROM dict WHERE df = 1;
+--------+
| terms  |
+========+
| 516956 |
+--------+</p>
      <p>Apart from applying a brief static stopword list to all
pre-processing (defined in StandardAnalyzer.STOP_WORDS_SET),
Anserini ‘stops’ the query expansions in its RM3 module by
filtering on document frequency, thresholded at 10% of the
number of documents in the collection.</p>
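      <p>The terms such a 10% threshold would remove can be listed
directly in SQL (a sketch against the schema above):</p>
      <p>SELECT term, df FROM dict
 WHERE df > (SELECT 0.1 * COUNT(*) FROM docs)
 ORDER BY df DESC;</p>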
      <p>Having such a collection-dependent stoplist would be an
interesting option in the initial run as well, so let us use the
interactive mode to investigate the effect on query effectiveness
of applying this df filter to the initial run.</p>
      <p>We can easily evaluate the effect of removing the terms with
high document frequency, e.g. by modifying the dictionary
table as follows:</p>
      <p>ALTER TABLE dict RENAME TO odict;
CREATE TABLE dict AS
SELECT * FROM odict WHERE df &lt;=
 (SELECT 0.1 * COUNT(*) FROM docs);</p>
      <p>We find the effectiveness scores shown in Table 5 for
disjunctive BM25. Performance drops for both MAP and early
precision, suggesting that filtering query term presence based
on document count is not a good idea, and should be
limited to pseudo-relevance feedback (not yet implemented in
OldDog).</p>
      <p>Table 5 (fragment):
+--------+--------+--------+
|        |  MAP   |  P@30  |
+========+========+========+
| Core18 | 0.1907 | 0.2693 |
+--------+--------+--------+</p>
      <p>
        A natural next step is to include the ‘qrel’ files in the
database, to explore more easily the relevant documents that
are (not) retrieved by specific test queries.
      </p>
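      <p>A minimal sketch of how that could look; the qrels table and
the per-topic results table are hypothetical, as neither is part of
the current image:</p>
      <p>-- hypothetical: TREC qrels are (topic, iteration, docno, relevance)
CREATE TABLE qrels (topic INTEGER, docno VARCHAR(64), rel INTEGER);
-- relevant documents for topic 301 that a stored result set missed
SELECT q.docno
  FROM qrels q
 WHERE q.topic = 301 AND q.rel > 0
   AND q.docno NOT IN (SELECT collection_id
                         FROM results WHERE topic = 301);</p>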
      <p>
        We conclude that we could successfully apply the methods
from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and have learned that conjunctive query processing
for BM25 degrades retrieval effectiveness more than we
expected a priori. The Docker image produced for the workshop
is a perfect starting point for the exploration of IR on relational
databases, building on the standard pre-processing and
test collection code in the Anserini project. Of course, we
should extend the retrieval model beyond plain BM25 to
obtain more interesting results from an IR perspective.
Interactively querying the database representation of the collection,
especially after including relevance assessments, seems like a
promising avenue to pursue. Finally, we found that the ‘jig’
setup not only allows for easy replication of the software, it
also serves as a tool for supporting continuous integration.
      </p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is part of the research program Commit2Data
with project number 628.011.001 (SQIREL-GRAPHS), which
is (partly) financed by the Netherlands Organisation for
Scientific Research (NWO).</p>
      <p>We also want to thank Ryan Clancy and Jimmy Lin for
the excellent support with the ‘jig’ framework.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Boncz</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Monet: A next-generation DBMS kernel for query-intensive applications</article-title>
          . Ph.D. Dissertation, Universiteit van Amsterdam.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Hannes</given-names>
            <surname>Mühleisen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thaer</given-names>
            <surname>Samar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Arjen</given-names>
            <surname>de Vries</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Old dogs are great at new tricks: Column stores for IR prototyping</article-title>
          .
          <source>Proceedings of the 37th International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          . ACM,
          <fpage>863</fpage>
          -
          <lpage>866</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Stephen E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Steve</given-names>
            <surname>Walker</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval</article-title>
          .
          <source>Proceedings of the 17th International ACM SIGIR Conference on Research &amp; Development in Information Retrieval</source>
          . Springer,
          <fpage>232</fpage>
          -
          <lpage>241</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Peilin</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Hui</given-names>
            <surname>Fang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Anserini: Enabling the use of Lucene for information retrieval research</article-title>
          .
          <source>Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          . ACM,
          <fpage>1253</fpage>
          -
          <lpage>1256</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>