The OldDog Docker Image for OSIRRC at SIGIR 2019

Chris Kamphuis (ckamphuis@cs.ru.nl) and Arjen P. de Vries (arjen@acm.org)
Radboud University, Nijmegen, The Netherlands

ABSTRACT

Modern column-store databases are well suited for carrying out IR experiments, yet they are not widely used in IR research. A plausible explanation is that setting up a database system and populating it with the documents to be ranked poses enough of a hurdle that researchers never get started on this route.

We took up the OSIRRC challenge to produce an easily replicable experimental setup for running IR experiments on a modern database architecture. OldDog, named after a short SIGIR paper proposing the use of column stores for IR experiments, implements standard BM25 ranking as SQL queries issued to MonetDB. This provides a baseline system on par with custom IR implementations, and a perfect starting point for the exploration of more advanced integrations of IR and databases.

Reflecting on our experience in OSIRRC 2019, we found a much larger effectiveness penalty than anticipated in the prior work for using the conjunctive variant of BM25 (which requires all query terms to occur in a document). Simplifying the SQL query to rank documents using the disjunctive variant (the usual IR ranking approach) results in longer runtimes but higher effectiveness. The interaction between query optimizations for efficiency and the resulting differences in effectiveness remains a research topic with many open questions.

CCS CONCEPTS

• Information systems → Search engine indexing; Evaluation of retrieval results.

KEYWORDS

information retrieval, replicability, column store

Image Source: https://github.com/osirrc/olddog-docker
Docker Hub: https://hub.docker.com/r/osirrc2019/olddog
DOI: https://doi.org/10.5281/zenodo.3255060

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 OVERVIEW

OldDog is a software project to replicate and extend the database approach to information retrieval presented in Mühleisen et al. [2]. The authors proposed that IR researchers use column-store relational databases for their retrieval experiments. Specifically, researchers should store their document representation in such a database; the ranking function can then be expressed as SQL queries. This allows for easy comparison of different ranking functions: IR researchers need only focus on the retrieval methodology, while the database takes care of efficiently retrieving the documents.

OldDog represents the data using the schema proposed by Mühleisen et al. [2]. An extra 'collection identifier' column has been added to include the original collection identifiers. The original paper [2] produced the database tables that represent 'postings' using a custom program running on Hadoop. Instead, we rely on the Anserini toolsuite [4] to create a Lucene index (https://lucene.apache.org, last accessed June 26th, 2019). Anserini takes care of standard document pre-processing.

Like [2], OldDog uses the column-store database MonetDB [1] for query processing. Term and document information is extracted from the Lucene index, stored as CSV files representing the columns in the database, and loaded into MonetDB using a standard COPY INTO command. (The intermediate step of exporting and importing CSV files is not strictly necessary, but it simplifies the pipeline and is robust to failure.)
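To make the loading step concrete, the following is a minimal sketch of the table definitions and bulk-load statements. The column names follow the schema described in Section 2.1; the column types and file paths are our assumptions for illustration, not the exact DDL shipped in the image:

-- Tables follow the dict/terms/docs schema of Section 2.1;
-- the types chosen here are assumptions, not the image's exact DDL.
CREATE TABLE dict  (termid INT, term VARCHAR(128), df INT);
CREATE TABLE terms (termid INT, docid INT, count INT);
CREATE TABLE docs  (collection_id VARCHAR(64), id INT, len INT);

-- Bulk-load the CSV files exported from the Lucene index.
-- File paths are hypothetical; the three arguments to DELIMITERS
-- are MonetDB's field, record, and quote separators.
COPY INTO dict  FROM '/data/dict.csv'  USING DELIMITERS ',', '\n', '"';
COPY INTO terms FROM '/data/terms.csv' USING DELIMITERS ',', '\n', '"';
COPY INTO docs  FROM '/data/docs.csv'  USING DELIMITERS ',', '\n', '"';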
After initialisation, document ranking is performed by issuing SQL queries that specify the retrieval model. Interactive querying can support the researcher with additional examples.

2 TECHNICAL DETAILS

Supported Collections: robust04, core18
Supported Hooks: init, index, search, interact

The OldDog docker image itself consists of bash/python 'hooks' that wrap the underlying OldDog, Anserini and MonetDB commands.

Anserini builds a Lucene index; we are happy end-users of the utilities it provides to index common test collections. The code has been tested for Robust04 and Core18 in our first release; extending to other collections readily supported by Anserini should be trivial.

OldDog provides the Java code to convert the Lucene index created by Anserini into CSV files that are subsequently loaded into the MonetDB database. OldDog further contains the Python code necessary to call the Java code that pre-processes the topics (calling the corresponding Anserini code, to guarantee that topics are processed exactly the same way as the documents) and to issue SQL queries to the MonetDB database to rank the collection.

Apart from the required init, index and search hooks, OldDog supports the interact hook to spawn an SQL shell (or 'a SQL shell', pronouncing SQL as 'sequel' like database folk do) that allows the user to query the database interactively.

2.1 Schema

Consider an example document doc1 with contents "I put on my shoes after I put on my socks." to illustrate the database schema. Indexing the document results in tables 1, 2 and 3:

Table 1: dict

termid  term   df
1       put    2
2       shoes  1
3       after  1
4       socks  1

Table 2: terms

termid  docid  count
1       1      2
2       1      1
3       1      1
4       1      1

Table 3: docs

collection_id  id  len
doc1           1   5

2.2 Retrieval Model

OldDog implements the BM25 [3] ranking formula. The values for k1 and b are fixed to 1.2 and 0.75, respectively. The original paper [2] uses a conjunctive variant of this formula, which only produces results for documents in which all query terms appear; this yields lower effectiveness scores, but speeds up query processing, leading to better runtime performance.

OldDog implements disjunctive query processing as well, included after noticing a surprisingly large difference in effectiveness when compared to other systems applied to Robust04. In the disjunctive variant, documents are considered when they contain at least one of the query terms. As expected, runtimes for an evaluation increase when using this strategy. Table 4 summarises effectiveness scores for both methods on two test collections, Robust04 and Core18.

Table 4: Effectiveness scores

                   Robust04        Core18
                   MAP     P@30    MAP     P@30
Conjunctive BM25   0.1736  0.2526  0.1802  0.3167
Disjunctive BM25   0.2434  0.2985  0.2381  0.3313

Listing 1 shows the conjunctive BM25 SQL query for Robust04 topic 301, "International Organized Crime". The disjunctive variant simply omits the HAVING clause.

/* For all topic terms */
WITH qterms AS (
  SELECT termid, docid, count FROM terms
  WHERE termid IN (591020, 720333, 462570)),
/* Calculate the BM25 subscores */
subscores AS (
  SELECT docs.collection_id, docs.id, len,
         term_tf.termid, term_tf.tf, df,
         (log((528030 - df + 0.5) / (df + 0.5)) *
          ((term_tf.tf * (1.2 + 1) /
            (term_tf.tf + 1.2 * (1 - 0.75 + 0.75 * (len / 188.33))))))
         AS subscore
  /* Calculate BM25 components */
  FROM (SELECT termid, docid, count AS tf FROM qterms) AS term_tf
  JOIN (SELECT docid FROM qterms
        GROUP BY docid HAVING COUNT(DISTINCT termid) = 3) AS cdocs
    ON term_tf.docid = cdocs.docid
  JOIN docs ON term_tf.docid = docs.id
  JOIN dict ON term_tf.termid = dict.termid)
/* Aggregate over the topic terms */
SELECT scores.collection_id, score
FROM (SELECT collection_id, SUM(subscore) AS score
      FROM subscores GROUP BY collection_id) AS scores
JOIN docs ON scores.collection_id = docs.collection_id
ORDER BY score DESC;

Listing 1: Conjunctive BM25
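For completeness, here is a sketch of how the document-selection subquery changes in the disjunctive variant. As stated above, it differs from Listing 1 only in dropping the HAVING clause, so every document containing at least one query term is scored; the rest of the query is unchanged:

/* Disjunctive document selection: keep any document that
   contains at least one query term (no HAVING clause). */
JOIN (SELECT docid FROM qterms
      GROUP BY docid) AS cdocs
  ON term_tf.docid = cdocs.docid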
3 OSIRRC EXPERIENCE

Overall, we look back at an excellent learning experience taking part in the OSIRRC challenge. The setup using Docker containers worked out very well during coding by multiple people on different machines. Automated builds on Docker Hub and the integration with Zenodo complete a fully replicable experimental setup.

The standardised API for running evaluations provided by the 'jig' made it easy to learn from other groups; and mixing version management using git (multiple branches) with building Docker containers locally with different tags allowed progress in parallel when implementing different extensions of the initial code (in our case, including disjunctive query processing and adding Core18).

Recording the evaluation outcomes at release time let us catch a bug that would have been easily overlooked without such a setup: after including Core18, a minor bug introduced in the code to parse topic files led to slightly different scores on Robust04, which we could easily detect and fix (one topic was missing) thanks to the structured approach of recording progress. (We may also conclude that a unified topic format for all TREC collections would be a useful improvement to avoid errors in experiments carried out on these test collections.)

4 INTERACT EXAMPLES

Let us conclude the paper by discussing a few advantages of database-backed IR experiments. Using the interact hook, it is possible to issue SQL queries directly to the database. This is useful if one wants to try different kinds of ranking functions, or simply to investigate the content of the database. We show some examples of queries on the Robust04 test collection.

The three most frequently occurring terms are easily extracted from the dict table:

SELECT * FROM dict ORDER BY df DESC LIMIT 3;

+--------+-------+--------+
| termid | term  | df     |
+========+=======+========+
| 541834 | from  | 355901 |
| 563475 | ha    | 320097 |
| 894136 | which | 302365 |
+--------+-------+--------+

As expected, the term distribution is skewed with a very long tail; consider for example the number of distinct terms that occur only once:

SELECT COUNT(*) AS terms FROM dict WHERE df = 1;

+--------+
| terms  |
+========+
| 516956 |
+--------+

Apart from applying a brief static stopword list to all pre-processing (defined in StandardAnalyzer.STOP_WORDS_SET), Anserini 'stops' the query expansions in its RM3 module by filtering on document frequency, thresholded at 10% of the number of documents in the collection.

Having such a collection-dependent stoplist would be an interesting option in the initial run as well, so let us use the interactive mode to investigate the effect of applying this df filter to the initial run. We can easily evaluate the effect of removing the terms with high document frequency, e.g. by modifying the dictionary table as follows:

ALTER TABLE dict RENAME TO odict;
CREATE TABLE dict AS
  SELECT * FROM odict WHERE df <=
    (SELECT 0.1 * COUNT(*) FROM docs);

We find the effectiveness scores shown in table 5 for disjunctive BM25. Performance drops for both MAP and early precision, suggesting that filtering query term presence based on document count is not a good idea, and should be limited to pseudo relevance feedback (not yet implemented in OldDog).

Table 5: Effectiveness scores after high df term removal

                   Robust04        Core18
                   MAP     P@30    MAP     P@30
Disjunctive BM25   0.2285  0.2727  0.1907  0.2693

A natural next step is to include the 'qrel' files in the database, to explore more easily the relevant documents that are (not) retrieved by specific test queries; we sketch this idea below.
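As an illustration of that next step, consider the following minimal sketch under our own assumptions: the qrels table definition, the file path, and the run table holding a ranking are hypothetical and not part of the current OldDog image. The query lists relevant documents for a topic that a ranking missed:

-- Hypothetical table for TREC relevance judgments ('qrel' files);
-- columns mirror the standard qrel format: topic, (unused) iteration,
-- document identifier, relevance grade.
CREATE TABLE qrels (topic INT, iter INT, collection_id VARCHAR(64), rel INT);
COPY INTO qrels FROM '/data/qrels.robust04.csv' USING DELIMITERS ',', '\n', '"';

-- Relevant documents for topic 301 missing from a (hypothetical)
-- run table that holds the ranking produced for that topic.
SELECT q.collection_id
FROM qrels AS q
WHERE q.topic = 301 AND q.rel > 0
  AND q.collection_id NOT IN
      (SELECT r.collection_id FROM run AS r WHERE r.topic = 301);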
5 CONCLUSION

We conclude that we could successfully apply the methods from [2], and have learned that conjunctive query processing for BM25 degrades retrieval effectiveness more than we expected a priori. The Docker image produced for the workshop is a perfect starting point for the exploration of IR on relational databases, where we build on standard pre-processing and test collection code in the Anserini project. Of course, we should extend the retrieval model beyond plain BM25 to obtain more interesting results from an IR perspective. Interactively querying the database representation of the collection, especially after including relevance assessments, seems like a promising avenue to pursue. Finally, we found that the 'jig' setup not only allows for easy replication of the software, it also serves as a tool for supporting continuous integration.

ACKNOWLEDGMENTS

This work is part of the research program Commit2Data with project number 628.011.001 (SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO). We also want to thank Ryan Clancy and Jimmy Lin for the excellent support with the 'jig' framework.

REFERENCES

[1] Peter Boncz. 2002. Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications. Ph.D. Dissertation. Universiteit van Amsterdam.
[2] Hannes Mühleisen, Thaer Samar, Jimmy Lin, and Arjen de Vries. 2014. Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 863–866.
[3] Stephen E. Robertson and Steve Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In Proceedings of the 17th International ACM SIGIR Conference on Research & Development in Information Retrieval. Springer, 232–241.
[4] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1253–1256.