GeeseDB: A Python Graph Engine for Exploration and Search

Chris Kamphuis¹, Arjen P. de Vries¹
¹ Radboud University, Toernooiveld 212, Nijmegen, The Netherlands
chris@cs.ru.nl (C. Kamphuis); arjen@cs.ru.nl (A. P. de Vries)

DESIRES 2021, 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15-18, 2021, Padua, Italy.
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract

GeeseDB is a Python toolkit for solving information retrieval research problems that leverage graphs as data structures. It aims to simplify information retrieval research by allowing researchers to easily formulate graph queries through a graph query language. GeeseDB is built on top of DuckDB, an embedded column-store relational database designed for analytical workloads, and is available as an easy-to-install Python package. In only a few lines of code, users can create a first-stage retrieval ranking using BM25. Queries read and write NumPy arrays and Pandas DataFrames at zero or negligible data transformation cost (depending on the base datatype). Therefore, the results of a first-stage ranker expressed in GeeseDB can be used in later stages of the ranking process, bringing the full power of Python machine learning libraries to bear with minimal overhead. Also, because data representation and data processing are strictly separated, GeeseDB forms an ideal basis for reproducible IR research.

Keywords

Open-Source Search Engine, Information Retrieval, Graph Databases

1. Introduction

In recent years there has been a lot of exciting new information retrieval research that makes use of non-text data to improve the effectiveness of search systems. Consider for example dense representations for retrieval [1, 2, 3], knowledge graphs that leverage entity information [4, 5, 6], and non-textual learning-to-rank features [7, 8]. All of these research directions have improved the effectiveness of search systems by making use of more diverse data. Despite the fact that search systems consider increasingly diverse sources of data, the usage of this data is often implemented by coupling separate systems together: first-stage retrieval is typically carried out with different software than the later retrieval stages where these novel reranking techniques are applied. In our view, researchers would benefit from a system in which the retrieval stages are more tightly integrated, that facilitates exploring how non-content data can be used for ranking, and that serves the data in a format suitable for reranking with e.g. transformers or tree-based methods.

In order to fulfill these needs we propose GeeseDB (https://github.com/informagi/geesedb), a prototype Python toolkit for information retrieval that leverages graphs as data structures, allowing metadata and graphs to be easily included in the ranking pipeline. The toolkit is designed to set up first-stage retrieval quickly, and to make it easy for researchers to explore new ranking models. In short, GeeseDB aims to provide the following functionalities:

• GeeseDB is an easy-to-install, self-contained Python package available through pip, with as few dependencies as possible. It ships topics and relevance judgements for several standard IR collections out-of-the-box, allowing researchers to start developing new ranking models quickly.
• First-stage (sparse) retrieval is directly supported. In only a few lines of code it is possible to load documents and create a first-stage ranking.
• Data is served in a usable format for later retrieval stages. GeeseDB can run queries directly over Pandas DataFrames, enabling efficient data transfer to sequential reranking algorithms.
• Data exploration is supported by querying data with SQL but, more interestingly, also with a graph query language, making it easier to explore new research avenues. The prototype supports a subset of Cypher [9], the graph query language originally proposed for Neo4j, similar to the property graph database model query language described by Angles [10].
GeeseDB began as a project after we identified the opportunities for graph queries to improve reproducible IR [11] at the Open-Source IR Replicability Challenge (OSIRRC) SIGIR workshop [12]. Prior work had observed many BM25 implementations [13, 14] that produced wildly varying effectiveness scores, and the systems participating in this workshop likewise reported varying BM25 effectiveness. Is this really a problem? Several valid reasons could explain these differences: document pre-processing, parameter tuning, or even the interpretation of the theory used to arrive at the exact ranking formula. When such scores are used as a baseline, however, the effectiveness gain of a novel method can be exaggerated by the (coincidental) choice of a baseline implementation with low effectiveness. Indeed, Yang et al. [15] showed empirically that comparison against weak baselines is a real problem, one that can obfuscate the real gain in effectiveness.

A method introduced into the community to simplify the comparison between open-source search systems is the Common Index File Format (CIFF) [16]. CIFF is a binary data exchange format that search systems can use to share their index structures. This way, researchers ensure that exactly the same pre-processing has been applied when comparing different systems. Experiments in [16] show that differences in (BM25) effectiveness scores between implementations do decrease when their indexes are exchanged using CIFF. GeeseDB therefore adopts the CIFF index format to exchange data with other systems.
A second approach to improving the reproducibility of IR research results has been adopted less widely. By using a database system, the way data is stored and the plans by which that data is processed are explicitly separated. This makes it easier to inspect differences between ranking formulas. From that perspective, it may not be so surprising that the only two systems that produced exactly the same BM25 effectiveness scores in the studies mentioned above were the two relational database systems used to rank documents, even though their execution engines were completely different and implemented by different teams. Likewise, the work by Kamphuis et al. [17], which used a shared database back-end for a series of retrieval experiments testing a number of previously proposed 'improvements' of BM25, demonstrated that the differences between variants turn out insignificant once everything but the ranking formula is held fixed.

Given these findings, we fully subscribe to the position that the declarative specification of ranking in a database query language offers the potential to improve reproducibility in IR research. SQL queries that express ranking functions more complex than the default combination of term frequency and document frequency can, however, easily become tedious to write, elaborate, and error-prone. As the way forward, GeeseDB therefore introduces the property graph data model with a graph query language, to express IR retrieval models in a more compact manner. We show in this work that this is especially useful when introducing representations of documents and queries that include information beyond just text.

2. Design

At the core of GeeseDB lies the full-text search design presented by Mühleisen et al. [14]. In that work, a column-store database for IR prototyping is proposed, which uses the database schema described in Figure 1, consisting of three database tables: one for all term information, one for all document information, and one that contains the information on how terms relate to documents (the information found in the posting lists of an inverted index). Using these three tables, they show that BM25 can be expressed concisely as a SQL query, with latencies on par with custom-built IR engines. GeeseDB uses exactly the same relational schema for full-text search.

Figure 1: Database schema by Mühleisen et al. [14] for full-text search in relational databases. Documents (PK doc_id int, length int, collection_id varchar) and Terms (PK term_id int, string varchar, doc_frequency int) are connected through the join table Term-Document (FK doc_id int, FK term_id int, term_frequency int); all columns NOT NULL.

Instead of seeing the document data and term data as tables that relate to each other through a many-to-many join table, it is also possible to consider this schema as a bipartite graph. In this graph both documents and terms are nodes, connected to each other through edges: if a term occurs in a document, there exists an edge between that term and that document. GeeseDB uses the data model of property graphs: labeled multigraphs in which both edges and nodes can carry property-value pairs. The database schema of Figure 1 then translates to the property graph schema shown in Figure 2.

Figure 2: Graph schema representing the bipartite document-term graph. Document nodes carry collection_id (varchar) and length (int), term nodes carry string (varchar) and document frequency (int), and the connecting edges carry tf (int).

A small example of a graph represented by this schema is shown in Figure 3. Document nodes contain document-specific information (the document length and the collection identifier), term nodes contain information relevant to the term (the term string and the term's document frequency), and the edges between document and term nodes carry term frequency information (how often the term occurs in the document represented by the node it connects).

Figure 3: Example term-document graph that maps to the relational database schema. Three documents (collection_ids "a", "b" and "c", with lengths 2, 2 and 5) connect to three terms ("dog" with document frequency 1, "cat" with document frequency 3, "music" with document frequency 2) through edges labeled with term frequencies (e.g. "music" occurs three times in document "c").

If one also wants to store position data, this graph can easily be changed into a graph where the edges store the position of a term: if a term appears multiple times in a document, the property graph model allows multiple edges between the two nodes. The graph schema described in Figure 2 maps one-to-one to the relational database schema described in Figure 1: nodes are represented by normal relational tables for specific data units (terms, documents), while edges are represented by many-to-many join tables. So, even though we think of the data as graphs, in the backend they are represented as relational tables. When using GeeseDB for search we expect at least the document-term graph to be present; of course, new node types can be introduced in order to explore new search strategies.
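To make this mapping concrete, the following sketch (ours, not GeeseDB code) materializes the toy graph of Figure 3 as the three tables of Figure 1, using the DuckDB Python API that GeeseDB builds on (see Section 2.1). Table and column names follow Figure 1; the tables GeeseDB itself generates use shorter names (term_dict, term_doc, tf, df, len; cf. Figure 11).

import duckdb

# A minimal sketch: the toy graph of Figure 3 stored as the three
# relational tables of Figure 1 (names are assumptions from Figure 1).
con = duckdb.connect()  # in-memory database

con.execute("CREATE TABLE docs (doc_id INT, length INT, collection_id VARCHAR)")
con.execute("CREATE TABLE terms (term_id INT, string VARCHAR, doc_frequency INT)")
# The edge set: a many-to-many join table carrying the edge property tf.
con.execute("CREATE TABLE term_doc (doc_id INT, term_id INT, term_frequency INT)")

con.execute("INSERT INTO docs VALUES (1, 2, 'a'), (2, 2, 'b'), (3, 5, 'c')")
con.execute("INSERT INTO terms VALUES (1, 'dog', 1), (2, 'cat', 3), (3, 'music', 2)")
con.execute("""
    INSERT INTO term_doc VALUES
    (1, 1, 1), (1, 2, 1),   -- doc 'a': dog x1, cat x1
    (2, 2, 1), (2, 3, 1),   -- doc 'b': cat x1, music x1
    (3, 2, 2), (3, 3, 3)    -- doc 'c': cat x2, music x3
""")

# Traversing document-term edges is simply a join over the edge table:
print(con.execute("""
    SELECT t.string, td.term_frequency
    FROM docs d
    JOIN term_doc td ON td.doc_id = d.doc_id
    JOIN terms t ON t.term_id = td.term_id
    WHERE d.collection_id = 'c'
""").fetchall())  # [('cat', 2), ('music', 3)]

Viewing the join table as an edge set is exactly the perspective that the graph query language of Section 2.2 exploits.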
2.1. Backend

GeeseDB is built on top of DuckDB [18], an in-process SQL OLAP (analytics-optimized) database management system. DuckDB is designed to support analytical query workloads, meaning that it specifically aims to process complex, long-running queries that access a significant portion of the data; conditions matching the case of IR research. DuckDB has a client Python API that can be installed using pip, after which it can be used directly. DuckDB also provides a separate API built around NumPy and Pandas, exposing NumPy/Pandas views over the same underlying data representation without incurring data transfer (usually referred to as "zero-copy" reading). Pandas DataFrames can be registered as virtual tables, allowing the data present in a DataFrame to be queried directly. GeeseDB inherits all these functionalities from DuckDB.

As DuckDB is a SQL database management system, we can execute analytical SQL queries on the tables that contain our data, including the BM25 rankings described by Mühleisen et al. [14]. By default, the BM25 implementation provided with GeeseDB implements the disjunctive variant of BM25, instead of the conjunctive variant they used. Although the conjunctive variant of BM25 can be computed more quickly, we chose the disjunctive variant as it is more commonly used by IR researchers, and the differences between the effectiveness scores are noticeable on smaller collections. For now we only support the original formulation of BM25 by Robertson et al. [19]; adding other versions of BM25 [17] is, however, trivial.
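To illustrate what such a SQL-expressed ranking looks like, the sketch below phrases a disjunctive BM25 (original Robertson et al. [19] form, with the common defaults k1 = 1.2 and b = 0.75) over the toy tables built in the sketch of Section 2. This is our approximation in the spirit of Mühleisen et al. [14], not the query GeeseDB ships, which may differ in table names and parameter handling.

# Sketch of disjunctive BM25 as SQL, reusing 'con' and the toy tables
# from the previous sketch. Per query term t and document d:
#   idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))
# with idf(t) = ln((N - df + 0.5) / (df + 0.5)).
bm25_sql = """
WITH query_terms AS (
    SELECT term_id, doc_frequency AS df
    FROM terms
    WHERE string IN ('dog', 'music')        -- the tokenized query
), stats AS (
    SELECT COUNT(*) AS n_docs, AVG(length) AS avg_len FROM docs
)
SELECT d.collection_id,
       SUM(LN((s.n_docs - qt.df + 0.5) / (qt.df + 0.5))
           * td.term_frequency * (1.2 + 1)
           / (td.term_frequency
              + 1.2 * (1 - 0.75 + 0.75 * d.length / s.avg_len))) AS bm25
FROM term_doc AS td
JOIN query_terms AS qt ON qt.term_id = td.term_id
JOIN docs AS d ON d.doc_id = td.doc_id
CROSS JOIN stats AS s
GROUP BY d.collection_id
ORDER BY bm25 DESC
"""
print(con.execute(bm25_sql).fetchdf())

Note that on a three-document toy collection the IDF of frequent terms goes negative; on realistically sized collections the original formulation behaves as expected.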
2.2. Graph Query Language

What distinguishes GeeseDB from alternatives, whether database-backed (OldDog [20]) or native systems (Anserini [21], Terrier [22]), is its graph query language, based on Cypher [9]. Systems like Elastic (https://www.elastic.co/what-is/elasticsearch-graph, accessed 19-08-2021) and Solr (https://solr.apache.org/guide/6_6/graph-traversal.html, accessed 19-08-2021) do support querying graphs, but not declaratively. For now, GeeseDB implements Cypher's basic graph pattern matching queries for retrieving data. An example of a graph query supported by GeeseDB is presented in Figure 4; it finds all documents written by the same authors as those who wrote the document with collection_id "96ab542e".

Figure 4: An example Cypher query that finds all documents written by the same author(s) as the document with collection_id "96ab542e".

MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id

For comparison, Figure 5 shows the same query expressed in SQL; it is much more complex than the Cypher version, because the join conditions have to be made explicit. Connecting the "docs" table with the "authors" table requires two joins, and reconnecting to the "docs" table introduces two more.

Figure 5: SQL query that corresponds to the graph query of Figure 4.

SELECT DISTINCT d2.collection_id
FROM docs AS d2
JOIN doc_author AS da2 ON (d2.collection_id = da2.doc)
JOIN authors AS a2 ON (da2.author = a2.author)
JOIN doc_author AS da3 ON (a2.author = da3.author)
JOIN docs AS d ON (d.collection_id = da3.doc)
WHERE d.collection_id = '96ab542e'

At the moment of writing, GeeseDB supports the following Cypher keywords: MATCH, RETURN, WHERE, AND, DISTINCT, ORDER BY, SKIP, and LIMIT. Instead of filtering data with WHERE, it is also possible to filter through graph pattern matching in the MATCH clause itself, as shown in Figure 6; this query returns the length of document "96ab542e". We plan to support the remaining Cypher keywords in the future, as well as directed edges. Everything that is not yet directly supported by our implementation can of course still be expressed in SQL, which is fully supported. (GeeseDB supports graph queries by translating them to their corresponding SQL queries; both nodes and edges are, after all, just tables in the backend.)

Figure 6: Graph query that returns the length of the document with collection_id "96ab542e".

MATCH (d:docs {collection_id: "96ab542e"})
RETURN d.len

In order to know how to join nodes to each other when no edge information is provided, GeeseDB stores information on the graph schema. This way GeeseDB knows how nodes relate to each other, and through which edges. GeeseDB has a module for updating the graph schema, allowing researchers to easily set up the graph they want represented in the database.
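Because graph queries are executed by translation to SQL, the translation itself can be inspected from Python. Below is a minimal sketch using the Translator and get_connection helpers that also appear in Figure 12; the printed SQL should roughly correspond to the query of Figure 5.

from geesedb.connection import get_connection
from geesedb.interpreter import Translator

db_path = '/path/to/database'
translator = Translator(db_path)

# The Cypher query of Figure 4, translated to SQL using the stored
# graph schema.
c_query = """
MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id
"""
sql_query = translator.translate(c_query)
print(sql_query)  # inspect the generated joins

cursor = get_connection(db_path).cursor
cursor.execute(sql_query)
candidates = cursor.fetchall()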
3. Usage

GeeseDB comes as an easy-to-install Python package that can be installed using pip, the standard package installer for Python:

$ pip install geesedb==0.0.1

We can start using GeeseDB right after installing it. All examples shown in this paper were run on version v0.0.1 of GeeseDB. However, as GeeseDB is actively being developed, we advise readers to use the latest version, which is installed when no package version is specified. It is also possible to install the latest commit directly from GitHub (https://github.com/informagi/GeeseDB#package-installation).

As an example, we will show how to use GeeseDB for the background linking task of the TREC News Track [23]. The goal of this task is: given a news story, find other news articles that provide important context or background information. These articles can then be recommended to readers to help them understand the context in which the news story takes place. The collection used for this task is the Washington Post V3 collection (https://trec.nist.gov/data/wapost/) released for the 2020 edition of TREC. It contains 671,945 news articles published by the Washington Post between 2012 and 2020, and 50 topics with relevance assessments (topics correspond to the collection identifiers of documents for which relevant background reading has to be found). The articles in this collection contain useful metadata; in particular, we will use authorship information. We extracted 25,703 unique article authors, where multiple authors may have co-written a news article. We also annotated the documents with entity information obtained using the Radboud Entity Linker [24]; in total, 31,622,419 references to 541,729 unique entities were found. An edge between an entity node and a document node contains mention and location information, as well as the ner_tag found by the linker's entity recognition module (the entity linker can assign different tags to the same entity). The annotated data will be made publicly available. Figure 7 illustrates the data schema that we use for the background linking task.

Figure 7: Example property graph for the TREC News Track's background linking task. The node types are authors, entities, terms and documents; edges connect document nodes to the other node types. Both edges and nodes can have properties (following the property graph model). Multiple edges may exist between one entity node and one document node, as one entity can be linked multiple times in one document. (In the example: authors "Chris" and "Arjen"; documents "abc" with length 3 and "def" with length 2; terms "dog", "cat" and "music" with document frequencies 1, 2 and 1; an entity "dog" with document frequency 1, mentioned as "dog" at start position 0 with length 1 and ner_tag "misc".)

3.1. Indexing and Search

To start, a database containing at least the document and term information needs to be created. Figure 8 shows how this data can be loaded from CSV files.

Figure 8: Load text data from the Washington Post collection, formatted as CSV files in the format described by Mühleisen et al. [14].

from geesedb.index import FullTextFromCSV

index = FullTextFromCSV(
    database='/path/to/database',
    docs_file='/path/to/docs.csv',
    term_dict_file='/path/to/term_dict.csv',
    term_doc_file='/path/to/term_doc.csv'
)
index.load_data()

Instead of loading the data from CSV files, it is also possible to load the text data directly using the CIFF data exchange format [16]; GeeseDB also provides functionality to create the CSV files used here from a CIFF file. Authorship information and entity links can be loaded similarly. Processing Cypher queries additionally depends on schema information that needs to be loaded; we provide a supporting class (called metadata) for this, and the schema data used in this paper will be made available via GitHub.
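For readers wondering what these CSV files contain: they mirror the three tables of Figure 1. The sketch below writes the toy graph of Figure 3 in that spirit; the column order and the absence of a header row are assumptions on our part, so consult the GeeseDB repository for the authoritative format.

# Hypothetical CSV layout, one file per table of Figure 1. Column order
# and header conventions are assumptions; see the GeeseDB repository.
from pathlib import Path

Path('docs.csv').write_text(
    '1,2,a\n'        # doc_id, len, collection_id
    '2,2,b\n'
    '3,5,c\n')
Path('term_dict.csv').write_text(
    '1,dog,1\n'      # term_id, string, df
    '2,cat,3\n'
    '3,music,2\n')
Path('term_doc.csv').write_text(
    '1,1,1\n'        # term_id, doc_id, tf
    '2,1,1\n'
    '2,2,1\n'
    '3,2,1\n'
    '2,3,2\n'
    '3,3,3\n')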
After loading the data we can quickly create a BM25 ranking for ad hoc search in the Washington Post collection, as shown in Figure 9.

Figure 9: Example of how to create a BM25 ranking for the query "obama and trump" that returns the top 10 documents.

from geesedb.search import Searcher

searcher = Searcher(
    database='/path/to/database',
    n=10
)
topic = 'obama and trump'
hits = searcher.search_topic(topic)

For the background linking task, however, we do not have regular topics; we only have the collection identifiers of the documents for which relevant background reading has to be found. In order to search, queries that represent our information need have to be constructed. A common approach is to use the top-k TF-IDF terms of the source article. These can easily be found using the Cypher statement shown in Figure 10. Instead of Cypher it is also possible to use SQL, as shown in Figure 11; this example again illustrates that the Cypher query is the more elegant of the two.

Figure 10: Prepared Cypher statement that finds the top-5 TF-IDF terms in a document.

MATCH (d:docs {collection_id: ?})-[]-(t:term_dict)
RETURN string
ORDER BY tf * log(671945/df) DESC
LIMIT 5

Figure 11: Prepared SQL statement that finds the top-5 TF-IDF terms in a document.

SELECT term_dict.string
FROM term_dict
JOIN term_doc ON (term_dict.term_id = term_doc.term_id)
JOIN docs ON (docs.doc_id = term_doc.doc_id)
WHERE docs.collection_id = ?
ORDER BY term_doc.tf * log(671945/term_dict.df) DESC
LIMIT 5;

Using the terms found with Cypher, we can construct queries to pass to the searcher and create a BM25 ranking. The code that generates the rankings for all topics is presented in Figure 12; as the figure shows, only a limited number of lines of Python code are needed.

Figure 12: Create a BM25 ranking for all background linking topics using the top-5 TF-IDF terms. Note that in this case a processed topic file is used that only contains the topic identifier and the topic article id; the topic file in this format is provided on our GitHub.

from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator

db_path = '/path/to/database'

searcher = Searcher(
    database=db_path,
    n=1000
)

translator = Translator(db_path)
c_query = """cypher TFIDF query"""  # the Cypher query of Figure 10
query = translator.translate(c_query)

cursor = get_connection(db_path).cursor
topics = get_topics_backgroundlinking(
    '/path/to/topics'
)
for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    # one term per row; join the top-5 terms into a query string
    topic = ' '.join(e[0] for e in cursor.fetchall())
    hits = searcher.search_topic(topic)

Note that the collection size (671,945) is hardcoded in the query, as version v0.0.1 does not support aggregation yet. From this point it is trivial to write the contents of hits to a runfile and evaluate it using trec_eval.
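To make that last step concrete: assuming, as Figure 14 suggests, that hits is a Pandas DataFrame whose collection_id column is ordered by decreasing BM25 score, the loop of Figure 12 can be extended with a few lines such as the following sketch (the descending 1000 - rank value is a stand-in score, so that no score column needs to be assumed):

# Append one TREC-format runfile line per hit:
#   topic_no Q0 doc_id rank score run_tag
# Assumes 'hits' preserves ranking order and has a collection_id column
# (cf. Figure 14); everything else here is illustrative.
with open('bm25.run', 'a') as run_file:
    for rank, doc_id in enumerate(hits.collection_id, start=1):
        run_file.write(
            f'{topic_no} Q0 {doc_id} {rank} {1000 - rank} geesedb\n')

The resulting file can then be scored with trec_eval against the task's relevance judgements.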
Instead of "just" ranking with BM25, it is straightforward to adapt the ranking using e.g. the available metadata. In the case of background linking, it makes sense to consider authorship information when recommending articles that might be suitable as background reading. As journalists often specialize in certain news topics (e.g. politics, foreign affairs, tech), the stories they write often share context; and when journalists collaborate on stories, they tend to do so on topics they specialize in as well. As authorship information is available to us, we can use the information whether an article was written by the authors of the topic article, or by someone they have collaborated with in the past. The articles written by this group of people can easily be found using the graph query shown in Figure 13.

Figure 13: Cypher query to find documents written by co-authors of the authors of the topic article.

MATCH (d:docs)-[]-(:authors)-[]-(:docs)-[]-(:authors)-[]-(d2:docs {collection_id: ?})
RETURN DISTINCT d.collection_id

Depending on the number of documents found by this query, different rescoring strategies can be chosen. If the set of documents written by the authors or their co-authors is large, it may be reasonable to consider only these documents; if the set is small, a score boost might be more appropriate. Figure 14 shows an example that only considers the documents found with the query of Figure 13 as background reading candidates; in this particular case we require that more than 2000 documents are found before filtering.

Figure 14: Find the documents written by all authors that collaborated with the authors of the topic article; if more than 2000 such documents are found, only consider these documents as background reading candidates.

# imports and first lines are the same as in Figure 12
author_c_query = """cypher authorship query"""  # the query of Figure 13
author_query = translator.translate(author_c_query)

for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    topic = ' '.join(e[0] for e in cursor.fetchall())
    hits = searcher.search_topic(topic)
    cursor.execute(author_query, [collection_id])
    docs_authors = {
        e[0] for e in cursor.fetchall()
    }
    if len(docs_authors) > 2000:
        hits = hits[hits.collection_id.isin(docs_authors)]
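The score-boost alternative mentioned above could look like the following sketch, replacing the final condition in the loop of Figure 14. It assumes hits carries a numeric score column next to collection_id, which is an assumption on our side (Figure 14 only shows the collection_id column):

# Sketch: boost instead of filter when the co-author set is small.
# Assumes a numeric 'score' column on hits (an assumption of this sketch).
if len(docs_authors) > 2000:
    hits = hits[hits.collection_id.isin(docs_authors)]
else:
    boost = hits.collection_id.isin(docs_authors)
    hits.loc[boost, 'score'] *= 1.1  # small, tunable boost factor
    hits = hits.sort_values('score', ascending=False)

The boost factor of 1.1 is arbitrary here; in practice it would be tuned on held-out topics.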
To give another example, the graph query language is also useful when considering entities. When journalists write news articles, the articles relate to events concerning e.g. people, organisations, or countries. In other words, at the basis of news articles lie the entities, as they are often the subject of the news. So, instead of using the most informative terms of a news article, it can be useful to consider the entities identified in the article. Important entities tend to be mentioned in the beginning of a news article [25]; Figure 15 shows the Cypher query that retrieves the mention text of the first five mentioned entities.

Figure 15: Retrieve the first five entities mentioned in the topic article, returning the terms used to mention each entity.

MATCH (d:docs {collection_id: ?})-[]-(e:entities)
RETURN mention
ORDER BY start
LIMIT 5

Before we can search using the text of these first five entity mentions, the text needs to be processed. The term data loaded into GeeseDB was already processed, as it was loaded from CSV files built from a CIFF file created from an Anserini [21] (Lucene) index. Anserini has an easy-to-use Python extension, Pyserini [26], that can be used to tokenize the mention text in the same way as the documents were tokenized. Figure 16 shows the Python code that extracts the mentions, processes them into a usable query for GeeseDB, and then creates a BM25 ranking with this query.

Figure 16: Create a BM25 ranking for all background linking topics using the mention text of the first five linked entities in the source article.

from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator
from pyserini.analysis import Analyzer, get_lucene_analyzer

db_path = '/path/to/database'

searcher = Searcher(
    database=db_path,
    n=1000
)
analyzer = Analyzer(get_lucene_analyzer())

translator = Translator(db_path)
c_query = """cypher entity query"""  # the query of Figure 15
query = translator.translate(c_query)

cursor = get_connection(db_path).cursor
topics = get_topics_backgroundlinking(
    '/path/to/topics'
)
for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    topic = ' '.join([e[0] for e in cursor.fetchall()])
    topic = ' '.join(analyzer.analyze(topic))
    hits = searcher.search_topic(topic)

In summary, GeeseDB allows researchers to index and search data with only a few lines of Python code. It can be used to explore new IR research ideas through both SQL and the Cypher graph query language. As GeeseDB can query directly on top of Pandas DataFrames, no data transfer is needed, making the framework ideal for setting up the data for other Python reranking pipelines (e.g. it is trivial to store learning-to-rank features in the database and use them directly).

4. Future Work

As the current GeeseDB version is still an early prototype, many future improvements are envisioned. We have identified four improvements we want to pursue as a priority:

• We have implemented the graph query language Cypher only partially; in the near future, we would like to support it fully. For now, the graph query language can only be used to query data, but ideally it could also be used to load or update data. Of course this is already possible through the SQL backend, but that should only be necessary when extending the backend for new use-cases.
• As the goal of GeeseDB is to serve as an IR toolkit, we would like to extend GeeseDB with functionalities that make the package easier to use for IR researchers. A few obvious extensions would be IR dataset support, native document processing, and implementations of popular first-stage rankers.
• This version of GeeseDB lacks extensive benchmarking. We plan to release benchmarks on popular IR datasets, including instructions on how to reproduce them.
• In recent years, dense graph representations have become popular. We would like to add functionality to analyse such dense representations for graphs managed in GeeseDB.

Eventually, we would like to extend the query language with proper support for defining ranking over graphs. (Currently, the ranking function is hidden in the 'searcher' module.)

5. Conclusion

In this work we have described our prototype implementation of GeeseDB, and how we envision graph databases can be used for information retrieval research. GeeseDB is still in active development, and we are open to additional contributions from the community.

Acknowledgments

This work is part of the research program Commit2Data with project number 628.011.001 (SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).

References

[1] L. Gao, Z. Dai, T. Chen, Z. Fan, B. Van Durme, J. Callan, Complement lexical retrieval model with semantic residual embeddings, in: Advances in Information Retrieval, ECIR '21, Springer International Publishing, Cham, 2021, pp. 146-160.
[2] Y. Luan, J. Eisenstein, K. Toutanova, M. Collins, Sparse, dense, and attentional representations for text retrieval, Transactions of the Association for Computational Linguistics 9 (2021) 329-345.
[3] S. Lin, J. Yang, J. Lin, Distilling dense representations for ranking using tightly-coupled teachers, CoRR abs/2010.11386 (2020). URL: https://arxiv.org/abs/2010.11386.
[4] F. Hasibi, K. Balog, S. E. Bratsberg, Exploiting entity linking in queries for entity retrieval, in: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 209-218. doi:10.1145/2970398.2970406.
[5] K. Balog, Entity-Oriented Search, Springer Nature, Cham, Switzerland, 2018.
[6] J. Dalton, L. Dietz, J. Allan, Entity query feature expansion using knowledge base links, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 365-374. doi:10.1145/2600428.2609628.
[7] R. Deveaud, M.-D. Albakour, C. Macdonald, I. Ounis, On the importance of venue-dependent features for learning to rank contextual suggestions, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 1827-1830. doi:10.1145/2661829.2661956.
[8] C. Macdonald, R. L. Santos, I. Ounis, On the usefulness of query features for learning to rank, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 2559-2562. doi:10.1145/2396761.2398691.
[9] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, A. Taylor, Cypher: An evolving query language for property graphs, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1433-1445. doi:10.1145/3183713.3190657.
[10] R. Angles, The property graph database model, in: Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management, AMW '18, CEUR-WS.org, Aachen, 2018.
[11] C. Kamphuis, A. P. de Vries, Reproducible IR needs an (IR) (graph) query language, in: Proceedings of the Open-Source IR Replicability Challenge co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, CEUR-WS.org, Aachen, 2019, pp. 17-20. URL: http://ceur-ws.org/Vol-2409/position03.pdf.
[12] R. Clancy, N. Ferro, C. Hauff, J. Lin, T. Sakai, Z. Z. Wu, The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1432-1434. doi:10.1145/3331184.3331647.
[13] J. Arguello, F. Diaz, J. Lin, A. Trotman, SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR), in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1147-1148. doi:10.1145/2766462.2767858.
[14] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: Column stores for IR prototyping, in: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 863-866. doi:10.1145/2600428.2609460.
[15] W. Yang, K. Lu, P. Yang, J. Lin, Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1129-1132. doi:10.1145/3331184.3331340.
[16] J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, A. de Vries, Supporting interoperability between open-source search engines with the common index file format, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2149-2152. doi:10.1145/3397271.3401404.
[17] C. Kamphuis, A. P. de Vries, L. Boytsov, J. Lin, Which BM25 do you mean? A large-scale reproducibility study of scoring variants, in: Advances in Information Retrieval, ECIR '20, Springer International Publishing, Cham, 2020, pp. 28-34.
[18] M. Raasveldt, H. Mühleisen, DuckDB: An embeddable analytical database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1981-1984. doi:10.1145/3299869.3320212.
[19] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., Okapi at TREC-3, NIST Special Publication SP 109 (1995) 109.
[20] C. Kamphuis, A. P. de Vries, The OldDog Docker image for OSIRRC at SIGIR 2019, in: Proceedings of the Open-Source IR Replicability Challenge co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, CEUR-WS.org, Aachen, 2019, pp. 47-49. URL: http://ceur-ws.org/Vol-2409/docker07.pdf.
[21] P. Yang, H. Fang, J. Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 1253-1256. doi:10.1145/3077136.3080721.
[22] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, D. Johnson, Terrier information retrieval platform, in: D. E. Losada, J. M. Fernández-Luna (Eds.), Advances in Information Retrieval, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 517-519.
[23] I. Soboroff, S. Huang, D. Harman, TREC 2018 news track overview, in: Proceedings of The Twenty-Seventh Text REtrieval Conference, TREC '18, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA, 2018.
[24] J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, A. P. de Vries, REL: An entity linker standing on the shoulders of giants, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2197-2200. doi:10.1145/3397271.3401416.
[25] C. Kamphuis, F. Hasibi, A. P. de Vries, T. Crijns, Radboud University at TREC 2019, in: Proceedings of The Twenty-Eighth Text REtrieval Conference, TREC '19, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA, 2019.
[26] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: An easy-to-use Python toolkit to support replicable IR research with sparse and dense representations, arXiv preprint arXiv:2102.10073 (2021).