<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
<journal-title>International Conference on Design of Experimental Search &amp; Information REtrieval Systems, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>GeeseDB: A Python Graph Engine for Exploration and Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chris Kamphuis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjen P. de Vries</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Radboud University</institution>
          ,
          <addr-line>Toernooiveld 212, Nijmegen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>1</volume>
      <fpage>5</fpage>
      <lpage>18</lpage>
      <abstract>
<p>GeeseDB is a Python toolkit for solving information retrieval research problems that leverage graphs as data structures. It aims to simplify information retrieval research by allowing researchers to easily formulate graph queries through a graph query language. GeeseDB is built on top of DuckDB, an embedded column-store relational database designed for analytical workloads. GeeseDB is available as an easy-to-install Python package. In only a few lines of code users can create a first-stage retrieval ranking using BM25. Queries read and write NumPy arrays and Pandas DataFrames at zero or negligible data transformation cost (depending on the base datatype). Therefore, results of a first-stage ranker expressed in GeeseDB can be used at various stages in the ranking process, enabling all the power of Python machine learning libraries with minimal overhead. Also, because data representation and processing are strictly separated, GeeseDB forms an ideal basis for reproducible IR research.</p>
      </abstract>
      <kwd-group>
<kwd>Open-Source Search Engine</kwd>
        <kwd>Information Retrieval</kwd>
        <kwd>Graph Databases</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>In recent years there has been a lot of exciting new information retrieval research that makes use of non-text data to improve the effectiveness of search systems. Consider for example dense representations for retrieval [<xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>], knowledge graphs to leverage entity information [<xref ref-type="bibr" rid="ref4">4, 5, 6</xref>], and non-textual learning-to-rank features [7, 8]. All these research directions have improved the effectiveness of search systems by making use of more diverse data. Despite the fact that search systems consider more diverse sources of data, the usage of this data is often implemented through a coupled architecture. In particular, first-stage retrieval is often carried out with different software than the later retrieval stages where these novel reranking techniques tend to be used. In our view, researchers could benefit from a system in which the retrieval stages are more tightly integrated, that facilitates the exploration of how to use non-content data for ranking, and that serves the data in a format suitable for reranking with e.g. transformers or tree-based methods.</p>
      <p>To fulfill these needs we propose GeeseDB1, a prototype Python toolkit for information retrieval that leverages graphs as data structures, allowing metadata and graphs to be easily included in the ranking pipeline. The toolkit is designed to quickly set up first-stage retrieval, and to make it easy for researchers to explore new ranking models. In short, GeeseDB aims to provide the following functionalities:</p>
      <list list-type="bullet">
        <list-item><p>GeeseDB is an easy-to-install, self-contained Python package available through pip install, with as few dependencies as possible. It contains topics and relevance judgements for several standard IR collections out-of-the-box, allowing researchers to quickly start developing new ranking models.</p></list-item>
        <list-item><p>First-stage (sparse) retrieval is directly supported. In only a few lines of code it is possible to load documents and create a first-stage ranking.</p></list-item>
        <list-item><p>Data is served in a usable format for later retrieval stages. GeeseDB allows queries to run directly on Pandas data frames for efficient data transfer to sequential reranking algorithms.</p></list-item>
        <list-item><p>Data exploration is supported through querying data with SQL, but more interestingly, also using a graph query language, making the exploration of new research avenues easier. This prototype supports a subset of the graph query language Cypher [9], originally proposed for Neo4j and similar to the property graph database model query language described by Angles [10].</p></list-item>
      </list>
      <p>GeeseDB began as a project after identifying the opportunities for graph queries to improve reproducible IR [11] at the Open-Source IR Replicability Challenge SIGIR workshop [12]. Prior work observed many BM25 implementations [13, 14] that resulted in wildly varying effectiveness scores, and the variety of systems participating in this workshop also found varying BM25 effectiveness scores between them. Is this really a problem? Several valid reasons could explain these differences in effectiveness: document pre-processing, parameter tuning, or even the interpretation of the theory to arrive at the exact ranking formula to be used. When using these scores as a baseline, however, the effectiveness gain of novel methods could be exaggerated due to the (coincidental) choice of a baseline implementation that gives low effectiveness. Indeed, Yang et al. [15] showed empirically that the comparison against weak baselines is a real problem that can obfuscate the real gain in effectiveness.</p>
      <p>A method introduced into the community to help simplify the comparison between open-source search systems has been the Common Index File Format (CIFF) [16]. CIFF is a binary data exchange format that can be used by search systems to share their index structures. This way, researchers ensure that the exact same pre-processing has been applied when comparing different systems to each other. Experiments in [16] show how differences in (BM25) effectiveness scores between different implementations do decrease when their indexes are exchanged using CIFF. GeeseDB therefore adopts the CIFF index format to exchange data between systems.</p>
      <p>A second approach to improve the reproducibility of IR research results has been adopted less widely. By making use of a database system, the way data is stored and the plans for how this data is processed are explicitly separated. This enables easier inspection of differences between ranking formulas. In that perspective, it may not be so surprising that the only two systems that produced the exact same effectiveness scores for their BM25 rankings in the studies mentioned above were the two relational database systems used to rank documents, even though their execution engines were completely different and implemented by different teams. Also, the work by Kamphuis et al. [17], which used a shared database back-end for a series of retrieval experiments testing a number of previously proposed 'improvements' of BM25, demonstrated that the differences between variants turn out insignificant once everything but the ranking formula is fixed.</p>
      <p>Given these findings, we fully subscribe to the position that the declarative specification of ranking in a database query language offers the potential to improve reproducibility in IR research. SQL queries that express more complex ranking functions than the default combination of term frequency and document frequency can, however, easily become overly tedious, elaborate and error-prone to write. As the way forward, GeeseDB therefore introduces the property graph data model with a graph query language to express IR retrieval models in a more compact manner. We show in this work that this is especially useful when introducing representations of documents and queries that include information beyond just text.</p>
    </sec>
    <sec id="sec-design">
      <title>2. Design</title>
      <p>At the core of GeeseDB lies the full text search design presented by Mühleisen et al. [14]. In this work, a column-store database for IR prototyping is proposed, which uses the database schema described in Figure 1, consisting of three database tables: one for all term information, one for all document information, and one that contains the information on how terms relate to documents (the information that is found in a posting list of an inverted index).</p>
      <p>Figure 1: Database schema by Mühleisen et al. [14] for full text search in relational databases. The Documents table holds doc_id (PK), length, and collection_id; the Terms table holds term_id (PK), the term string, and doc_frequency; the Term Document join table holds doc_id (FK), term_id (FK), and term_frequency; all columns are NOT NULL.</p>
      <p>Using these three tables they show that BM25 can be easily expressed as a SQL query, with latencies that are on par with custom-built IR engines. In GeeseDB we use the exact same relational schema for full text search. Instead of seeing the document data and term data as tables that relate to each other through a many-to-many join table, it is also possible to consider this schema as a bipartite graph. In this graph both documents and terms are considered as nodes, connected to each other through edges. If a term occurs in a document, there exists an edge between that term and document. GeeseDB uses the data model of property graphs: labeled multigraphs where both edges and nodes can have property-value pairs. The database schema described in Figure 1 would then translate to the property graph schema shown in Figure 2.</p>
      <p>Figure 2: Graph schema representing the bipartite document-term graph. Document nodes carry length (int) and collection_id (varchar), term nodes carry string (varchar) and document_frequency (int), and the edges between them carry tf (int).</p>
      <p>A small example of a graph represented by this schema is shown in Figure 3: document nodes contain document-specific information (i.e. the document length and the collection identifier), term nodes contain information relevant to the term (i.e. the term string and the term's document frequency), and the edges between document and term nodes contain term frequency information (i.e. how often the term is mentioned in the document represented by the respective nodes it connects).</p>
      <p>Figure 3: Example instance of this graph with two documents ("a" and "b") and the terms "dog", "cat" and "music", where the edges store term frequencies.</p>
      <p>If one wants to also store position data, this graph can easily be changed to a graph where the edges store the position of a term. If a term appears multiple times in a document, the property graph model allows multiple edges to exist between the two nodes. The graph schema described in Figure 2 maps one-to-one to the relational database schema described in Figure 1: nodes are represented by normal relational tables for specific data units (terms, documents), while edges are represented by many-to-many join tables. So, even though we think of the data as graphs, in the backend they are represented as relational tables. When using GeeseDB for search we expect at least the document-term graph to be present; of course, new node types can be introduced in order to explore new search strategies.</p>
      <sec id="sec-1-2">
        <title>2.1. Backend</title>
        <p>GeeseDB is built on top of DuckDB [18], an in-process SQL OLAP (analytics optimized) database management system. DuckDB is designed to support analytical query workloads, meaning that it specifically aims to process complex long-running queries in which a significant portion of the data is accessed, conditions matching the case of IR research. DuckDB has a client Python API which can be installed using pip, after which it can be used directly. DuckDB has a separate API built around both NumPy and Pandas, providing NumPy/Pandas views over the same underlying data representation without incurring data transfer (usually referred to as "zero-copy" reading). Pandas DataFrames can be registered as virtual tables, allowing queries to run directly on the data present in Pandas DataFrames. GeeseDB inherits all these functionalities from DuckDB.</p>
        <p>As DuckDB is a SQL database management system, we can execute analytical SQL queries on the tables that contain our data, including the BM25 rankings described by Mühleisen et al. [14]. By default, the BM25 implementation provided with GeeseDB implements the disjunctive variant of BM25, instead of the conjunctive variant they used. Although the conjunctive variant of BM25 can be calculated more quickly, we chose the disjunctive variant as it is more commonly used by IR researchers, and the differences between effectiveness scores are noticeable on smaller collections. For now we only support the original formulation of BM25 by Robertson et al. [19]; however, adding support for other versions of BM25 [17] is trivial.</p>
      </sec>
      <sec id="sec-1-1">
        <title>2.2. Graph Query Language</title>
        <p>As an example of the graph query language, consider the following Cypher query:
<preformat>MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id</preformat>
This query finds all documents written by the same authors as those who wrote document "96ab542e". For comparison, Figure 5 illustrates the same query represented in SQL; it is much more complex than the Cypher version, due to the join conditions that have to be made explicit. In order to connect the "docs" table with the "authors" table two joins are needed, and reconnecting the "docs" table again introduces two more joins.</p>
        <p>Figure 5: SQL version of the author query:
<preformat>SELECT DISTINCT d2.collection_id
FROM docs AS d2
JOIN doc_author AS da2 ON (d2.collection_id = da2.doc)
JOIN authors AS a2 ON (da2.author = a2.author)
JOIN doc_author AS da3 ON (a2.author = da3.author)
JOIN docs AS d ON (d.collection_id = da3.doc)
WHERE d.collection_id = '96ab542e'</preformat></p>
        <p>At the moment of writing, GeeseDB supports the following Cypher keywords: MATCH, RETURN, WHERE, AND, DISTINCT, ORDER BY, SKIP, and LIMIT. Instead of using WHERE to filter data, it is also possible to use graph matching with the keyword MATCH, as shown in Figure 6; that query returns the length of document "96ab542e". We plan to support the other keywords of Cypher in the future, as well as directed edges. Everything that is not yet directly supported by our implementation can of course still be expressed in SQL, which is fully supported4. In order to know how to join nodes to each other when no edge information has been provided, GeeseDB stores information on the schema. This way GeeseDB knows how nodes relate to each other and through which edges. GeeseDB has a module for updating the graph schema, allowing researchers to easily set up the graph they want represented in the database.</p>
        <p>2: https://www.elastic.co/what-is/elasticsearch-graph (accessed 19-08-2021)</p>
        <p>3: https://solr.apache.org/guide/6_6/graph-traversal.html (accessed 19-08-2021)</p>
      </sec>
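<p>To make the three-table design concrete, the following is a minimal sketch (our own illustration, not GeeseDB code) of the disjunctive BM25 ranking expressed as a single SQL query over the schema of Figure 1. For portability the sketch uses Python's built-in sqlite3 instead of DuckDB; the toy data and the parameter values for k1 and b are our own choices.</p>
<preformat>
```python
import math
import sqlite3

# Toy instance of the three-table schema of Figure 1 (names are ours).
conn = sqlite3.connect(":memory:")
conn.create_function("LOG", 1, math.log)  # sqlite3 has no natural log by default
conn.executescript("""
    CREATE TABLE docs     (doc_id INTEGER, collection_id TEXT, len INTEGER);
    CREATE TABLE terms    (term_id INTEGER, string TEXT, df INTEGER);
    CREATE TABLE term_doc (term_id INTEGER, doc_id INTEGER, tf INTEGER);
""")
conn.executemany("INSERT INTO docs VALUES (?, ?, ?)",
                 [(1, "a", 2), (2, "b", 2), (3, "c", 3)])
conn.executemany("INSERT INTO terms VALUES (?, ?, ?)",
                 [(1, "dog", 1), (2, "cat", 1), (3, "music", 1)])
conn.executemany("INSERT INTO term_doc VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 2), (3, 3, 1)])

# Disjunctive BM25: any document matching at least one query term is scored.
query = """
    SELECT d.collection_id,
           SUM(LOG((:N - t.df + 0.5) / (t.df + 0.5))
               * td.tf * (:k1 + 1)
               / (td.tf + :k1 * (1 - :b + :b * d.len / :avgdl))) AS score
    FROM terms t
    JOIN term_doc td ON t.term_id = td.term_id
    JOIN docs d      ON td.doc_id = d.doc_id
    WHERE t.string IN ('dog', 'cat')
    GROUP BY d.collection_id
    ORDER BY score DESC
"""
params = {"N": 3, "avgdl": 7 / 3, "k1": 0.9, "b": 0.4}
ranking = [row[0] for row in conn.execute(query, params)]
print(ranking)  # doc "b" (tf=2) outranks doc "a"; "c" matches no query term
```
</preformat>
<p>The same statement runs unchanged on DuckDB, which additionally ships a built-in ln() function, so the create_function workaround is only needed for sqlite3.</p>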
    </sec>
    <sec id="sec-2">
      <title>3. Usage</title>
      <sec id="sec-2-1">
<p>GeeseDB comes as an easy-to-install Python package that can be installed using pip, the standard package installer for Python:</p>
        <p><preformat>$ pip install geesedb==0.0.1</preformat></p>
      </sec>
      <sec id="sec-2-2">
<p>We can start using GeeseDB after installing it. All examples we show in this paper were run on version v0.0.1 of GeeseDB. However, as GeeseDB is actively being developed, we advise readers to use the latest version, which is installed when no package version is specified. It is also possible to install the latest commit directly from GitHub5.</p>
        <p>
          As an example, we will show how to use GeeseDB for
the background linking task of the TREC News Track [
          <xref ref-type="bibr" rid="ref7">23</xref>
          ].
        </p>
      </sec>
<sec id="sec-2-3">
        <p>
          The goal of this task is: given a news story, find other news articles that can provide important context or background information. These articles can then be recommended to the reader to help them understand the context in which these news articles take place. The collection used for this task is the Washington Post V3 collection6 released for the 2020 edition of TREC. It contains 671,945 news articles published by the Washington Post between 2012 and 2020, and 50 topics with relevance assessments (topics correspond to collection identifiers of documents for which relevant data has to be found). The articles in this collection contain useful metadata; in particular, we will use authorship information. We extracted 25,703 unique article authors, where it is possible that multiple authors co-wrote a news article. We also annotate documents with entity information, obtained using the Radboud Entity Linker [<xref ref-type="bibr" rid="ref8">24</xref>]. In total 31,622,419 references to 541,729 unique entities were found. An edge between entity and document nodes contains mention and location information, as well as the ner_tag found by the linker's entity recognition module (the entity linker can assign different tags to the same entity).7 Figure 7 illustrates the data schema that we use for the background linking task.
        </p>
        <p>Figure: example graph for the background linking task, with author nodes ("Arjen", "Chris"), document nodes ("abc" with length 3, "def" with length 2), term nodes ("dog", "cat", "music" with document frequencies) and an entity node ("dog"); term edges carry tf values, and entity edges carry start, len, mention and ner_tag properties.</p>
        <p>4: GeeseDB supports the graph queries by translating them to their corresponding SQL queries; both nodes and edges are, after all, just tables in the backend.</p>
        <p>5: https://github.com/informagi/GeeseDB#package-installation</p>
        <p>6: https://trec.nist.gov/data/wapost/</p>
        <p>7: The annotated data will be made publicly available.</p>
      </sec>
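<p>GeeseDB executes graph queries by translating them to SQL over the node and edge tables. A very reduced sketch of that idea (our own illustration, not GeeseDB's actual Translator module) turns a linear MATCH pattern into a JOIN chain using stored schema metadata; all names below are hypothetical:</p>
<preformat>
```python
# Hypothetical sketch of graph-to-SQL translation (names are ours).
# The schema metadata records, for each ordered pair of node tables,
# the many-to-many edge table and the join keys on both sides.
SCHEMA = {
    ("docs", "authors"): ("doc_author", "collection_id", "doc", "author", "author"),
    ("authors", "docs"): ("doc_author", "author", "author", "doc", "collection_id"),
}

def translate(path, where, return_expr):
    """Translate a linear MATCH path such as docs-authors-docs into a JOIN chain."""
    aliases = [f"t{i}" for i in range(len(path))]
    sql = [f"SELECT DISTINCT {return_expr.format(*aliases)}",
           f"FROM {path[0]} AS {aliases[0]}"]
    for i in range(len(path) - 1):
        edge, left_key, edge_left, edge_right, right_key = SCHEMA[(path[i], path[i + 1])]
        e = f"e{i}"
        sql.append(f"JOIN {edge} AS {e} ON ({aliases[i]}.{left_key} = {e}.{edge_left})")
        sql.append(f"JOIN {path[i + 1]} AS {aliases[i + 1]} "
                   f"ON ({e}.{edge_right} = {aliases[i + 1]}.{right_key})")
    sql.append(f"WHERE {where.format(*aliases)}")
    return "\n".join(sql)

# MATCH (d:docs)-[]-(:authors)-[]-(d2:docs) WHERE d.collection_id = '96ab542e'
print(translate(["docs", "authors", "docs"],
                "{0}.collection_id = '96ab542e'",
                "{2}.collection_id"))
```
</preformat>
<p>For the three-node pattern this produces the four joins of the SQL query in Figure 5, which is exactly the blow-up the graph query language hides from the user.</p>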
      <sec id="sec-2-5">
<p>Instead of loading the data from CSV files it is also possible to load the text data directly using the CIFF format for data exchange [16]. GeeseDB also has functionality to create the CSV files used here from the CIFF format. Authorship information and entity links can be loaded similarly. Processing Cypher queries depends on schema information that needs to be loaded as well. We have a supporting class (called metadata) for this, and the schema data used in this paper will be available via GitHub. After loading the data we can quickly create a BM25 ranking for ad hoc search in the Washington Post collection, as shown in Figure 9.</p>
      </sec>
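<p>The CSV files mentioned above follow the three-table schema of Figure 1. As a minimal illustration (our own sketch; GeeseDB's actual CSV conventions, such as headers and separators, may differ), the three files for a toy collection can be produced with the standard library alone:</p>
<preformat>
```python
import csv
import io
from collections import Counter

# Toy collection: collection_id mapped to already-tokenized document text.
docs = {"abc": "dog cat cat", "def": "cat music"}

term_ids, postings, doc_rows = {}, [], []
for doc_id, (collection_id, text) in enumerate(docs.items()):
    tokens = text.split()
    doc_rows.append((doc_id, len(tokens), collection_id))
    for term, tf in Counter(tokens).items():
        term_id = term_ids.setdefault(term, len(term_ids))
        postings.append((term_id, doc_id, tf))

# Document frequency: in how many documents each term occurs.
df = Counter(term_id for term_id, _, _ in postings)
term_rows = [(tid, term, df[tid]) for term, tid in term_ids.items()]

def to_csv(rows):
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()

# docs.csv:     doc_id, length, collection_id   (one row per document)
# terms.csv:    term_id, string, doc_frequency  (one row per unique term)
# term_doc.csv: term_id, doc_id, term_frequency (one row per posting)
print(to_csv(doc_rows), to_csv(term_rows), to_csv(postings), sep="")
```
</preformat>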
      <sec id="sec-2-6">
<p>Using the terms found with Cypher, we can construct queries that we can pass to the searcher, and create a BM25 ranking. The code that generates the rankings for all topics is presented in Figure 12. As you can see, only a limited number of lines of Python code is needed to create rankings. Note that the collection size is hardcoded, as version v0.0.1 does not support aggregation yet. From this point it is trivial to write the content of hits to a runfile and evaluate using trec_eval.</p>
        <p>Figure 9: Example of how to create a BM25 ranking for the query "obama and trump" that returns the top 10 documents:
<preformat>from geesedb.search import Searcher

searcher = Searcher(
    database='/path/to/database',
    n=10)
topic = 'obama and trump'
hits = searcher.search_topic(topic)</preformat></p>
        <p>For the background linking task, however, we do not have regular topics; we only have the collection identifiers of the documents we need to find relevant background reading for. In order to search for relevant background reading, queries that represent our information need have to be constructed. A common approach is to use the top-k TF-IDF terms of the source article. These can easily be found using the Cypher statement shown in Figure 10. Instead of using Cypher it would also be possible to use SQL, as shown in Figure 11; however, this example shows again that the Cypher query is more elegant.</p>
        <p>Instead of "just" ranking with BM25, using e.g. the metadata to adapt the ranking is straightforward. In the case of background linking, it makes sense to consider authorship information when recommending articles that might be suitable as background reading. As journalists are often specialized in certain news topics (e.g. politics, foreign affairs, tech), the stories they write often share context. Also, when journalists collaborate on stories, they write together on topics they specialize in as well. As authorship information is available to us, we can use the information whether an article is written by the authors of the topic article, or by someone they have collaborated with in the past. Finding the articles written by this group of people can easily be done using a graph query; the query that finds these articles is shown in Figure 13.</p>
        <p>Depending on the number of documents found by this query, different rescoring strategies can be decided upon. If the set of documents written by the authors or their co-authors is large, perhaps it is possible to only consider these documents; but if the set is small, a score boost might be more appropriate. Figure 14 shows an example of how to only consider documents found with the query in Figure 13; in this particular case we ensure that at least 2000 documents are found before filtering.</p>
        <p>Figure 12 (excerpt): imports used when generating rankings for all topics:
<preformat>from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator</preformat></p>
        <p>To give another example, the graph query language is also useful when considering entities. When journalists write news articles, the articles relate to events concerning e.g. people, organisations, or countries. In other words, at the basis of news articles lie the entities, as they are often the subject of news. So, instead of using the most informative terms in a news article, it could be useful to consider the entities identified in the article instead. Important entities tend to be mentioned in the beginning of a news article [<xref ref-type="bibr" rid="ref9">25</xref>]; Figure 15 shows the Cypher query to retrieve the text mentions of the first five mentioned entities.</p>
        <p>Before it is possible to search using the text describing the first five entity mentions, the text needs to be processed. The term data loaded in GeeseDB was already processed, as it was loaded from CSV files built from a CIFF file created from an Anserini [<xref ref-type="bibr" rid="ref5">21</xref>] (Lucene) index. Anserini has an easy-to-use Python extension, Pyserini [<xref ref-type="bibr" rid="ref10">26</xref>], that can be used to tokenize the text in the same way as the documents were tokenized. Figure 16 shows the Python code where we extract the mentions, process them such that they become a usable query for GeeseDB, and then create a BM25 ranking with this query.</p>
        <p>Figure 16 (excerpt): imports used when ranking with entity mentions:
<preformat>from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator
from pyserini.analysis import Analyzer, get_lucene_analyzer</preformat></p>
        <p>In summary, GeeseDB allows researchers to index and search data with only a few lines of Python code. It can be used to explore new IR research ideas through both SQL and the Cypher graph query language. As GeeseDB can query directly on top of Pandas DataFrames, no data transfer has to be done, making this framework ideal for setting up the data for other Python reranking pipelines (i.e. it is trivial to store learning-to-rank features in the […]).</p>
      </sec>
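<p>The query-construction step described above (take the top-k TF-IDF terms of the source article) can be sketched in plain Python. The toy statistics and the helper below are our own stand-ins for what Figure 10 retrieves with Cypher:</p>
<preformat>
```python
import math

# Toy statistics for one source article: term mapped to (tf in article, df in collection).
N = 1000  # collection size (hardcoded, as in GeeseDB v0.0.1)
article_terms = {
    "senate": (8, 120), "vote": (5, 300), "budget": (4, 90),
    "the": (30, 990), "said": (12, 800),
}

def top_tfidf_terms(terms, n, k):
    """Return the k terms with the highest tf * log(n / df) weight."""
    weighted = sorted(terms,
                      key=lambda t: terms[t][0] * math.log(n / terms[t][1]),
                      reverse=True)
    return weighted[:k]

# The resulting string can be passed to the searcher as a regular topic.
query = " ".join(top_tfidf_terms(article_terms, N, 3))
print(query)
```
</preformat>
<p>Note how frequent but uninformative terms ("the", "said") drop out despite their high term frequency, which is exactly why top-TF-IDF terms make reasonable pseudo-queries for background linking.</p>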
    </sec>
    <sec id="sec-3">
      <title>4. Future Work</title>
      <sec id="sec-3-1">
<p>As the current GeeseDB version is still an early prototype, many future improvements have been envisioned. We have identified four improvements we want to pursue as a priority:</p>
        <list list-type="bullet">
          <list-item><p>We have implemented the graph query language Cypher only partially; in the near future, we would like to support it fully. For now, it is only possible to use the graph query language to query data, but ideally it could also be used to load or update data. Of course it is already possible to do this through the SQL backend, but this should only be necessary when extending the backend support for new use-cases.</p></list-item>
          <list-item><p>As the goal of GeeseDB is to serve as an IR toolkit, we would like to extend GeeseDB with functionalities that make the package easy to use for IR researchers. A few obvious extensions would be IR dataset support, native document processing, and implementations of popular first-stage rankers.</p></list-item>
          <list-item><p>This version of GeeseDB lacks extensive benchmarking. We plan to release benchmarks on popular IR datasets, including instructions on how to reproduce these benchmarks.</p></list-item>
          <list-item><p>In recent years, dense graph representations have become popular. We would like to add functionality to analyse these dense representations for graphs managed in GeeseDB.</p></list-item>
        </list>
      </sec>
      <sec id="sec-3-2">
        <p>Eventually, we would like to extend the query language with proper support for defining ranking over graphs. (Currently, the ranking function is hidden in the 'searcher' module.)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusion</title>
      <sec id="sec-4-1">
<p>In this work we have described our prototype implementation of GeeseDB, and how we envision that graph databases can be used for information retrieval research. GeeseDB is still in active development, and we are open to additional contributions from the community.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
<p>This work is part of the research program Commit2Data with project number 628.011.001 (SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).</p>
<p>[5] K. Balog, Entity-oriented search, Springer Nature, Gewerbestrasse 11, 6330 Cham, Switzerland, 2018.</p>
        <p>[6] J. Dalton, L. Dietz, J. Allan, Entity query feature expansion using knowledge base links, in: Proceedings of the 37th International ACM SIGIR Conference on Research &amp; Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, p. 365–374. URL: https://doi.org/10.1145/2600428.2609628. doi:10.1145/2600428.2609628.</p>
        <p>[7] R. Deveaud, M.-D. Albakour, C. Macdonald, I. Ounis, On the importance of venue-dependent features for learning to rank contextual suggestions, in: Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM '14, Association for Computing Machinery, New York, NY, USA, 2014, p. 1827–1830. URL: https://doi.org/10.1145/2661829.2661956. doi:10.1145/2661829.2661956.</p>
        <p>[8] C. Macdonald, R. L. Santos, I. Ounis, On the usefulness of query features for learning to rank, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, Association for Computing Machinery, New York, NY, USA, 2012, p. 2559–2562. URL: https://doi.org/10.1145/2396761.2398691. doi:10.1145/2396761.2398691.</p>
        <p>[9] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, A. Taylor, Cypher: An evolving query language for property graphs, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, Association for Computing Machinery, New York, NY, USA, 2018, p. 1433–1445. URL: https://doi.org/10.1145/3183713.3190657. doi:10.1145/3183713.3190657.</p>
        <p>[10] R. Angles, The property graph database model, in: Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management, AMW '18, CEUR-WS.org, Aachen, 2018.</p>
        <p>[11] C. Kamphuis, A. P. de Vries, Reproducible IR needs an (IR) (graph) query language, in: Proceedings of the Open-Source IR Replicability Challenge co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, CEUR-WS.org, Aachen, 2019, pp. 17–20. URL: http://ceur-ws.org/Vol-2409/position03.pdf.</p>
        <p>[12] R. Clancy, N. Ferro, C. Hauff, J. Lin, T. Sakai, Z. Z. Wu, The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 1432–1434. URL: https://doi.org/10.1145/3331184.3331647. doi:10.1145/3331184.3331647.</p>
        <p>[13] J. Arguello, F. Diaz, J. Lin, A. Trotman, SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR), in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, Association for Computing Machinery, New York, NY, USA, 2015, p. 1147–1148. URL: https://doi.org/10.1145/2766462.2767858. doi:10.1145/2766462.2767858.</p>
        <p>[14] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: Column stores for IR prototyping, in: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, p. 863–866. URL: https://doi.org/10.1145/2600428.2609460. doi:10.1145/2600428.2609460.</p>
        <p>[15] W. Yang, K. Lu, P. Yang, J. Lin, Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 1129–1132. URL: https://doi.org/10.1145/3331184.3331340. doi:10.1145/3331184.3331340.</p>
        <p>[16] J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, A. de Vries, Supporting interoperability between open-source search engines with the common index file format, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, p. 2149–2152. URL: https://doi.org/10.1145/3397271.3401404. doi:10.1145/3397271.3401404.</p>
        <p>[17] C. Kamphuis, A. P. de Vries, L. Boytsov, J. Lin, Which BM25 do you mean? A large-scale reproducibility study of scoring variants, in: Advances in Information Retrieval, ECIR '20, Springer International Publishing, Cham, 2020, pp. 28–34.</p>
        <p>[18] M. Raasveldt, H. Mühleisen, DuckDB: An embeddable analytical database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, Association for Computing Machinery, New York, NY, USA, 2019, p. 1981–1984. URL: https://doi.org/10.1145/3299869.3320212. doi:10.1145/3299869.3320212.</p>
        <p>[19] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., Okapi at TREC-3, NIST Special Publication SP 109 (1995) 109.</p>
        <p>[20] C. Kamphuis, A. P. de Vries, The olddog docker […]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Van Durme</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Complement lexical retrieval model with semantic residual embeddings</article-title>
          ,
          <source>in: Advances in Information Retrieval, ECIR '21</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Eisenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <article-title>Sparse, dense, and attentional representations for text retrieval</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics 9</source>
          (
          <year>2021</year>
          )
          <fpage>329</fpage>
          -
          <lpage>345</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Distilling dense representations for ranking using tightly-coupled teachers</article-title>
          ,
          <source>CoRR abs/2010.11386</source>
          (
          <year>2020</year>
          ). URL: https://arxiv.org/abs/2010.11386. arXiv:2010.11386.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Bratsberg</surname>
          </string-name>
          ,
          <article-title>Exploiting entity linking in queries for entity retrieval</article-title>
          ,
          <source>in: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval</source>
          , ICTIR '16, Association for Computing Machinery, New York, NY, USA,
          <year>2016</year>
          , p.
          <fpage>209</fpage>
          -
          <lpage>218</lpage>
          . URL: https://doi.org/10.1145/2970398.2970406. doi:10.1145/2970398.2970406.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Anserini: Enabling the use of lucene for information retrieval research</article-title>
          ,
          <source>in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '17, Association for Computing Machinery, New York, NY, USA,
          <year>2017</year>
          , p.
          <fpage>1253</fpage>
          -
          <lpage>1256</lpage>
          . URL: https://doi.org/10.1145/3077136.3080721. doi:10.1145/3077136.3080721.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Amati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Terrier information retrieval platform</article-title>
          , in:
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Fernández-Luna</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2005</year>
          , pp.
          <fpage>517</fpage>
          -
          <lpage>519</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>I.</given-names>
            <surname>Soboroff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          ,
          <article-title>Trec 2018 news track overview</article-title>
          ,
          <source>in: Proceedings of The Twenty-Seventh Text REtrieval Conference</source>
          , TREC '18, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>van Hulst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dercksen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <article-title>Rel: An entity linker standing on the shoulders of giants</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>2197</fpage>
          -
          <lpage>2200</lpage>
          . URL: https://doi.org/10.1145/3397271.3401416. doi:10.1145/3397271.3401416.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>C.</given-names>
            <surname>Kamphuis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>de Vries</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Crijns</surname>
          </string-name>
          ,
          <article-title>Radboud university at trec 2019</article-title>
          ,
          <source>in: Proceedings of The Twenty-Eighth Text REtrieval Conference</source>
          , TREC '19, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pradeep</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>Pyserini: An easy-to-use python toolkit to support replicable ir research with sparse and dense representations</article-title>
          ,
          <source>arXiv preprint arXiv:2102.10073</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>