The OldDog Docker Image for OSIRRC at SIGIR 2019

Chris Kamphuis (ckamphuis@cs.ru.nl) and Arjen P. de Vries (arjen@acm.org)
Radboud University, Nijmegen, The Netherlands

ABSTRACT

Modern column-store databases are well suited for carrying out IR experiments, yet they are not widely used in IR research. A plausible explanation is that setting up a database system and populating it with the documents to be ranked poses enough of a hurdle that researchers never get started on this route.

We took up the OSIRRC challenge to produce an easily replicable experimental setup for running IR experiments on a modern database architecture. OldDog, named after a short SIGIR paper proposing the use of column stores for IR experiments, implements standard BM25 ranking as SQL queries issued to MonetDB. This provides a baseline system on par with custom IR implementations, and a perfect starting point for the exploration of more advanced integrations of IR and databases.

Reflecting on our experience in OSIRRC 2019, we found a much larger effectiveness penalty than anticipated in the prior work for using the conjunctive variant of BM25 (which requires all query terms to occur in a document). Simplifying the SQL query to rank documents using the disjunctive variant (the usual IR ranking approach) results in longer runtimes but higher effectiveness. The interaction between query optimizations for efficiency and the resulting differences in effectiveness remains a research topic with many open questions.

CCS CONCEPTS

• Information systems → Search engine indexing; Evaluation of retrieval results.

KEYWORDS

information retrieval, replicability, column store

Image Source: https://github.com/osirrc/olddog-docker
Docker Hub: https://hub.docker.com/r/osirrc2019/olddog
DOI: https://doi.org/10.5281/zenodo.3255060

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 OVERVIEW

OldDog is a software project to replicate and extend the database approach to information retrieval presented in Mühleisen et al. [2]. The authors proposed that IR researchers use column-store relational databases for their retrieval experiments. Specifically, researchers should store their document representation in such a database; the ranking function can then be expressed as SQL queries. This allows for easy comparison of different ranking functions: IR researchers need only focus on the retrieval methodology, while the database takes care of efficiently retrieving the documents.

OldDog represents the data using the schema proposed by Mühleisen et al. [2]. An extra 'collection identifier' column has been added to include the original collection identifiers. The original paper [2] produced the database tables that represent 'postings' using a custom program running on Hadoop. Instead, we rely on the Anserini toolsuite [4] to create a Lucene index (https://lucene.apache.org, last accessed June 26th, 2019). Anserini takes care of standard document pre-processing.

Like [2], OldDog uses the column-store database MonetDB [1] for query processing. Term and document information is extracted from the Lucene index, stored as CSV files representing the columns in the database, and loaded into MonetDB using a standard COPY INTO command. (The intermediate step of exporting and importing CSV files is not strictly necessary, but it simplifies the pipeline and is robust to failure.)
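To make the loading step concrete, the following is a minimal sketch of the table definitions and bulk-load statements. The column names follow the schema described in Section 2.1; the column types and file paths are our assumptions for illustration, not the exact DDL shipped in the image:

-- Tables follow the dict/terms/docs schema of Section 2.1;
-- the types chosen here are assumptions, not the image's exact DDL.
CREATE TABLE dict  (termid INT, term VARCHAR(128), df INT);
CREATE TABLE terms (termid INT, docid INT, count INT);
CREATE TABLE docs  (collection_id VARCHAR(64), id INT, len INT);

-- Bulk-load the CSV files exported from the Lucene index.
-- File paths are hypothetical; the three arguments to DELIMITERS
-- are MonetDB's field, record, and quote separators.
COPY INTO dict  FROM '/data/dict.csv'  USING DELIMITERS ',', '\n', '"';
COPY INTO terms FROM '/data/terms.csv' USING DELIMITERS ',', '\n', '"';
COPY INTO docs  FROM '/data/docs.csv'  USING DELIMITERS ',', '\n', '"';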
After initialisation, document ranking is performed by issuing SQL queries that specify the retrieval model. Interactive querying can support the researcher with additional examples.

2 TECHNICAL DETAILS

Supported Collections: robust04, core18
Supported Hooks: init, index, search, interact

The OldDog docker image itself consists of bash/python 'hooks' that wrap the underlying OldDog, Anserini and MonetDB commands.

Anserini builds a Lucene index; we are happy end-users of the utilities it provides to index common test collections. The code has been tested for Robust04 and Core18 in our first release; extending to other collections readily supported by Anserini should be trivial.

OldDog provides the Java code to convert the Lucene index created by Anserini into CSV files that are subsequently loaded into the MonetDB database. OldDog further contains the Python code necessary to call the Java code that pre-processes the topics (calling the corresponding Anserini code, to guarantee that topics are processed exactly the same way as the documents) and to issue SQL queries to the MonetDB database to rank the collection.

Apart from the required init, index and search hooks, OldDog supports the interact hook to spawn an SQL shell (or 'a SQL shell', pronouncing SQL as 'sequel' like database folk do) that allows the user to query the database interactively.

2.1 Schema

Consider an example document doc1 with contents "I put on my shoes after I put on my socks." to illustrate the database schema. Indexing the document results in tables 1, 2 and 3:

Table 1: dict

termid  term   df
1       put    2
2       shoes  1
3       after  1
4       socks  1

Table 2: terms

termid  docid  count
1       1      2
2       1      1
3       1      1
4       1      1

Table 3: docs

collection_id  id  len
doc1           1   5

2.2 Retrieval Model

OldDog implements the BM25 [3] ranking formula. The values for k1 and b are fixed to 1.2 and 0.75, respectively. The original paper [2] uses a conjunctive variant of this formula, which only produces results for documents in which all query terms appear; this yields lower effectiveness scores, but speeds up query processing, leading to better runtime performance.

OldDog implements disjunctive query processing as well, included after noticing a surprisingly large difference in effectiveness when compared to other systems applied to Robust04. In the disjunctive variant, documents are considered when they contain at least one of the query terms. As expected, runtimes for an evaluation increase when using this strategy. Table 4 summarises effectiveness scores for both methods on two test collections, Robust04 and Core18.

Table 4: Effectiveness scores

                   Robust04        Core18
                   MAP     P@30    MAP     P@30
Conjunctive BM25   0.1736  0.2526  0.1802  0.3167
Disjunctive BM25   0.2434  0.2985  0.2381  0.3313

Listing 1 shows the conjunctive BM25 SQL query for Robust04 topic 301, "International Organized Crime". The disjunctive variant simply omits the HAVING clause.

/* For all topic terms */
WITH qterms AS (
  SELECT termid, docid, count FROM terms
  WHERE termid IN (591020, 720333, 462570)),
/* Calculate the BM25 subscores */
subscores AS (
  SELECT docs.collection_id, docs.id, len,
         term_tf.termid, term_tf.tf, df,
         (log((528030 - df + 0.5) / (df + 0.5)) *
          ((term_tf.tf * (1.2 + 1) /
            (term_tf.tf + 1.2 * (1 - 0.75 + 0.75 * (len / 188.33))))))
         AS subscore
  /* Calculate BM25 components */
  FROM (SELECT termid, docid, count AS tf FROM qterms) AS term_tf
  JOIN (SELECT docid FROM qterms
        GROUP BY docid HAVING COUNT(DISTINCT termid) = 3) AS cdocs
    ON term_tf.docid = cdocs.docid
  JOIN docs ON term_tf.docid = docs.id
  JOIN dict ON term_tf.termid = dict.termid)
/* Aggregate over the topic terms */
SELECT scores.collection_id, score
FROM (SELECT collection_id, SUM(subscore) AS score
      FROM subscores GROUP BY collection_id) AS scores
JOIN docs ON scores.collection_id = docs.collection_id
ORDER BY score DESC;

Listing 1: Conjunctive BM25
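For completeness, here is a sketch of how the document-selection subquery changes in the disjunctive variant. As stated above, it differs from Listing 1 only in dropping the HAVING clause, so every document containing at least one query term is scored; the rest of the query is unchanged:

/* Disjunctive document selection: keep any document that
   contains at least one query term (no HAVING clause). */
JOIN (SELECT docid FROM qterms
      GROUP BY docid) AS cdocs
  ON term_tf.docid = cdocs.docid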
3 OSIRRC EXPERIENCE

Overall, we look back at an excellent learning experience taking part in the OSIRRC challenge. The setup using Docker containers worked out very well during coding by multiple people on different machines. Automated builds on Docker Hub and the integration with Zenodo complete a fully replicable experimental setup.

The standardised API for running evaluations provided by the 'jig' made it easy to learn from other groups; and mixing version management using git (multiple branches) with building Docker containers locally with different tags allowed progress in parallel when implementing different extensions of the initial code (in our case, including disjunctive query processing and adding Core18).

Recording the evaluation outcomes at release time let us catch a bug that would have been easily overlooked without such a setup: after including Core18, a minor bug introduced in the code to parse topic files led to slightly different scores on Robust04, which we could easily detect and fix (one topic was missing) thanks to the structured approach of recording progress. (We may also conclude that a unified topic format for all TREC collections would be a useful improvement to avoid errors in experiments carried out on these test collections.)

4 INTERACT EXAMPLES

Let us conclude the paper by discussing a few advantages of database-backed IR experiments. Using the interact hook, it is possible to issue SQL queries directly to the database. This is useful if one wants to try different kinds of ranking functions, or simply to investigate the content of the database. We show some examples of queries on the Robust04 test collection.

The three most frequently occurring terms are easily extracted from the dict table:

SELECT * FROM dict ORDER BY df DESC LIMIT 3;

+--------+-------+--------+
| termid | term  | df     |
+========+=======+========+
| 541834 | from  | 355901 |
| 563475 | ha    | 320097 |
| 894136 | which | 302365 |
+--------+-------+--------+

As expected, the term distribution is skewed with a very long tail; consider for example the number of distinct terms that occur only once:

SELECT COUNT(*) AS terms FROM dict WHERE df = 1;

+--------+
| terms  |
+========+
| 516956 |
+--------+

Apart from applying a brief static stopword list to all pre-processing (defined in StandardAnalyzer.STOP_WORDS_SET), Anserini 'stops' the query expansions in its RM3 module by filtering on document frequency, thresholded at 10% of the number of documents in the collection.

Having such a collection-dependent stoplist would be an interesting option in the initial run as well, so let us use the interactive mode to investigate the effect of applying this df filter to the initial run. We can easily evaluate the effect of removing the terms with high document frequency, e.g. by modifying the dictionary table as follows:

ALTER TABLE dict RENAME TO odict;
CREATE TABLE dict AS
  SELECT * FROM odict WHERE df <=
    (SELECT 0.1 * COUNT(*) FROM docs);

We find the effectiveness scores shown in table 5 for disjunctive BM25. Performance drops for both MAP and early precision, suggesting that filtering query term presence based on document count is not a good idea, and should be limited to pseudo relevance feedback (not yet implemented in OldDog).

Table 5: Effectiveness scores after high df term removal

                   Robust04        Core18
                   MAP     P@30    MAP     P@30
Disjunctive BM25   0.2285  0.2727  0.1907  0.2693

A natural next step is to include the 'qrel' files in the database, to explore more easily the relevant documents that are (not) retrieved by specific test queries; we sketch this idea below.
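As an illustration of that next step, consider the following minimal sketch under our own assumptions: the qrels table definition, the file path, and the run table holding a ranking are hypothetical and not part of the current OldDog image. The query lists relevant documents for a topic that a ranking missed:

-- Hypothetical table for TREC relevance judgments ('qrel' files);
-- columns mirror the standard qrel format: topic, (unused) iteration,
-- document identifier, relevance grade.
CREATE TABLE qrels (topic INT, iter INT, collection_id VARCHAR(64), rel INT);
COPY INTO qrels FROM '/data/qrels.robust04.csv' USING DELIMITERS ',', '\n', '"';

-- Relevant documents for topic 301 missing from a (hypothetical)
-- run table that holds the ranking produced for that topic.
SELECT q.collection_id
FROM qrels AS q
WHERE q.topic = 301 AND q.rel > 0
  AND q.collection_id NOT IN
      (SELECT r.collection_id FROM run AS r WHERE r.topic = 301);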
5 CONCLUSION

We conclude that we could successfully apply the methods from [2], and have learned that conjunctive query processing for BM25 degrades retrieval effectiveness more than we expected a priori. The Docker image produced for the workshop is a perfect starting point for the exploration of IR on relational databases, where we build on standard pre-processing and test collection code in the Anserini project. Of course, we should extend the retrieval model beyond plain BM25 to obtain more interesting results from an IR perspective. Interactively querying the database representation of the collection, especially after including relevance assessments, seems like a promising avenue to pursue. Finally, we found that the 'jig' setup not only allows for easy replication of the software, it also serves as a tool for supporting continuous integration.

ACKNOWLEDGMENTS

This work is part of the research program Commit2Data with project number 628.011.001 (SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO). We also want to thank Ryan Clancy and Jimmy Lin for the excellent support with the 'jig' framework.

REFERENCES

[1] Peter Boncz. 2002. Monet: A Next-Generation DBMS Kernel for Query-Intensive Applications. Ph.D. Dissertation. Universiteit van Amsterdam.
[2] Hannes Mühleisen, Thaer Samar, Jimmy Lin, and Arjen de Vries. 2014. Old Dogs Are Great at New Tricks: Column Stores for IR Prototyping. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval. ACM, 863–866.
[3] Stephen E. Robertson and Steve Walker. 1994. Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval. In Proceedings of the 17th International ACM SIGIR Conference on Research & Development in Information Retrieval. Springer, 232–241.
[4] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1253–1256.