GeeseDB: A Python Graph Engine for Exploration and Search

Chris Kamphuis¹, Arjen P. de Vries¹
¹ Radboud University, Toernooiveld 212, Nijmegen, The Netherlands
chris@cs.ru.nl (C. Kamphuis); arjen@cs.ru.nl (A. P. de Vries)

DESIRES 2021, 2nd International Conference on Design of Experimental Search & Information REtrieval Systems, September 15-18, 2021, Padua, Italy.
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org, ISSN 1613-0073).

Abstract

GeeseDB is a Python toolkit for solving information retrieval research problems that leverage graphs as data structures. It aims to simplify information retrieval research by allowing researchers to easily formulate graph queries through a graph query language. GeeseDB is built on top of DuckDB, an embedded column-store relational database designed for analytical workloads, and is available as an easy-to-install Python package. In only a few lines of code, users can create a first-stage retrieval ranking using BM25. Queries read and write NumPy arrays and Pandas DataFrames at zero or negligible data transformation cost (depending on the base datatype). Therefore, the results of a first-stage ranker expressed in GeeseDB can be used in later stages of the ranking process, bringing the full power of Python machine learning libraries to bear with minimal overhead. Also, because data representation and data processing are strictly separated, GeeseDB forms an ideal basis for reproducible IR research.

Keywords

Open-Source Search Engine, Information Retrieval, Graph Databases

1. Introduction

In recent years there has been a lot of exciting new information retrieval research that makes use of non-text data to improve the effectiveness of search systems. Consider for example dense representations for retrieval [1, 2, 3], knowledge graphs that leverage entity information [4, 5, 6], and non-textual learning-to-rank features [7, 8]. All of these research directions have improved the effectiveness of search systems by making use of more diverse data. Despite the fact that search systems consider increasingly diverse sources of data, the usage of this data is often implemented by coupling separate systems together: first-stage retrieval is typically carried out with different software than the later retrieval stages where these novel reranking techniques are applied. In our view, researchers would benefit from a system in which the retrieval stages are more tightly integrated, that facilitates exploring how non-content data can be used for ranking, and that serves the data in a format suitable for reranking with e.g. transformers or tree-based methods.

In order to fulfill these needs we propose GeeseDB (https://github.com/informagi/geesedb), a prototype Python toolkit for information retrieval that leverages graphs as data structures, allowing metadata and graphs to be easily included in the ranking pipeline. The toolkit is designed to set up first-stage retrieval quickly, and to make it easy for researchers to explore new ranking models. In short, GeeseDB aims to provide the following functionalities:

• GeeseDB is an easy-to-install, self-contained Python package available through pip, with as few dependencies as possible. It ships topics and relevance judgements for several standard IR collections out-of-the-box, allowing researchers to start developing new ranking models quickly.
• First-stage (sparse) retrieval is directly supported. In only a few lines of code it is possible to load documents and create a first-stage ranking.
• Data is served in a usable format for later retrieval stages. GeeseDB can run queries directly over Pandas DataFrames, enabling efficient data transfer to sequential reranking algorithms.
• Data exploration is supported by querying data with SQL but, more interestingly, also with a graph query language, making it easier to explore new research avenues. The prototype supports a subset of Cypher [9], the graph query language originally proposed for Neo4j, similar to the property graph database model query language described by Angles [10].
GeeseDB began as a project after we identified the opportunities for graph queries to improve reproducible IR [11] at the Open-Source IR Replicability Challenge (OSIRRC) SIGIR workshop [12]. Prior work had observed many BM25 implementations [13, 14] that produced wildly varying effectiveness scores, and the systems participating in this workshop likewise reported varying BM25 effectiveness. Is this really a problem? Several valid reasons could explain these differences: document pre-processing, parameter tuning, or even the interpretation of the theory used to arrive at the exact ranking formula. When such scores are used as a baseline, however, the effectiveness gain of a novel method can be exaggerated by the (coincidental) choice of a baseline implementation with low effectiveness. Indeed, Yang et al. [15] showed empirically that comparison against weak baselines is a real problem, one that can obfuscate the real gain in effectiveness.

A method introduced into the community to simplify the comparison between open-source search systems is the Common Index File Format (CIFF) [16]. CIFF is a binary data exchange format that search systems can use to share their index structures. This way, researchers ensure that exactly the same pre-processing has been applied when comparing different systems. Experiments in [16] show that differences in (BM25) effectiveness scores between implementations do decrease when their indexes are exchanged using CIFF. GeeseDB therefore adopts the CIFF index format to exchange data with other systems.
A second approach to improving the reproducibility of IR research results has been adopted less widely. By using a database system, the way data is stored and the plans by which that data is processed are explicitly separated. This makes it easier to inspect differences between ranking formulas. From that perspective, it may not be so surprising that the only two systems that produced exactly the same BM25 effectiveness scores in the studies mentioned above were the two relational database systems used to rank documents, even though their execution engines were completely different and implemented by different teams. Likewise, the work by Kamphuis et al. [17], which used a shared database back-end for a series of retrieval experiments testing a number of previously proposed 'improvements' of BM25, demonstrated that the differences between variants turn out insignificant once everything but the ranking formula is held fixed.

Given these findings, we fully subscribe to the position that the declarative specification of ranking in a database query language offers the potential to improve reproducibility in IR research. SQL queries that express ranking functions more complex than the default combination of term frequency and document frequency can, however, easily become tedious to write, elaborate, and error-prone. As the way forward, GeeseDB therefore introduces the property graph data model with a graph query language, to express IR retrieval models in a more compact manner. We show in this work that this is especially useful when introducing representations of documents and queries that include information beyond just text.

2. Design

At the core of GeeseDB lies the full-text search design presented by Mühleisen et al. [14]. In that work, a column-store database for IR prototyping is proposed, which uses the database schema described in Figure 1, consisting of three database tables: one for all term information, one for all document information, and one that contains the information on how terms relate to documents (the information found in the posting lists of an inverted index). Using these three tables, they show that BM25 can be expressed concisely as a SQL query, with latencies on par with custom-built IR engines. GeeseDB uses exactly the same relational schema for full-text search.

Figure 1: Database schema by Mühleisen et al. [14] for full-text search in relational databases. Documents (PK doc_id int, length int, collection_id varchar) and Terms (PK term_id int, string varchar, doc_frequency int) are connected through the join table Term-Document (FK doc_id int, FK term_id int, term_frequency int); all columns NOT NULL.

Instead of seeing the document data and term data as tables that relate to each other through a many-to-many join table, it is also possible to consider this schema as a bipartite graph. In this graph both documents and terms are nodes, connected to each other through edges: if a term occurs in a document, there exists an edge between that term and that document. GeeseDB uses the data model of property graphs: labeled multigraphs in which both edges and nodes can carry property-value pairs. The database schema of Figure 1 then translates to the property graph schema shown in Figure 2.

Figure 2: Graph schema representing the bipartite document-term graph. Document nodes carry collection_id (varchar) and length (int), term nodes carry string (varchar) and document frequency (int), and the connecting edges carry tf (int).

A small example of a graph represented by this schema is shown in Figure 3. Document nodes contain document-specific information (the document length and the collection identifier), term nodes contain information relevant to the term (the term string and the term's document frequency), and the edges between document and term nodes carry term frequency information (how often the term occurs in the document represented by the node it connects).

Figure 3: Example term-document graph that maps to the relational database schema. Three documents (collection_ids "a", "b" and "c", with lengths 2, 2 and 5) connect to three terms ("dog" with document frequency 1, "cat" with document frequency 3, "music" with document frequency 2) through edges labeled with term frequencies (e.g. "music" occurs three times in document "c").

If one also wants to store position data, this graph can easily be changed into a graph where the edges store the position of a term: if a term appears multiple times in a document, the property graph model allows multiple edges between the two nodes. The graph schema described in Figure 2 maps one-to-one to the relational database schema described in Figure 1: nodes are represented by normal relational tables for specific data units (terms, documents), while edges are represented by many-to-many join tables. So, even though we think of the data as graphs, in the backend they are represented as relational tables. When using GeeseDB for search we expect at least the document-term graph to be present; of course, new node types can be introduced in order to explore new search strategies.
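To make this mapping concrete, the following sketch (ours, not GeeseDB code) materializes the toy graph of Figure 3 as the three tables of Figure 1, using the DuckDB Python API that GeeseDB builds on (see Section 2.1). Table and column names follow Figure 1; the tables GeeseDB itself generates use shorter names (term_dict, term_doc, tf, df, len; cf. Figure 11).

import duckdb

# A minimal sketch: the toy graph of Figure 3 stored as the three
# relational tables of Figure 1 (names are assumptions from Figure 1).
con = duckdb.connect()  # in-memory database

con.execute("CREATE TABLE docs (doc_id INT, length INT, collection_id VARCHAR)")
con.execute("CREATE TABLE terms (term_id INT, string VARCHAR, doc_frequency INT)")
# The edge set: a many-to-many join table carrying the edge property tf.
con.execute("CREATE TABLE term_doc (doc_id INT, term_id INT, term_frequency INT)")

con.execute("INSERT INTO docs VALUES (1, 2, 'a'), (2, 2, 'b'), (3, 5, 'c')")
con.execute("INSERT INTO terms VALUES (1, 'dog', 1), (2, 'cat', 3), (3, 'music', 2)")
con.execute("""
    INSERT INTO term_doc VALUES
    (1, 1, 1), (1, 2, 1),   -- doc 'a': dog x1, cat x1
    (2, 2, 1), (2, 3, 1),   -- doc 'b': cat x1, music x1
    (3, 2, 2), (3, 3, 3)    -- doc 'c': cat x2, music x3
""")

# Traversing document-term edges is simply a join over the edge table:
print(con.execute("""
    SELECT t.string, td.term_frequency
    FROM docs d
    JOIN term_doc td ON td.doc_id = d.doc_id
    JOIN terms t ON t.term_id = td.term_id
    WHERE d.collection_id = 'c'
""").fetchall())  # [('cat', 2), ('music', 3)]

Viewing the join table as an edge set is exactly the perspective that the graph query language of Section 2.2 exploits.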
2.1. Backend

GeeseDB is built on top of DuckDB [18], an in-process SQL OLAP (analytics-optimized) database management system. DuckDB is designed to support analytical query workloads, meaning that it specifically aims to process complex, long-running queries that access a significant portion of the data; conditions matching the case of IR research. DuckDB has a client Python API that can be installed using pip, after which it can be used directly. DuckDB also provides a separate API built around NumPy and Pandas, exposing NumPy/Pandas views over the same underlying data representation without incurring data transfer (usually referred to as "zero-copy" reading). Pandas DataFrames can be registered as virtual tables, allowing the data present in a DataFrame to be queried directly. GeeseDB inherits all these functionalities from DuckDB.

As DuckDB is a SQL database management system, we can execute analytical SQL queries on the tables that contain our data, including the BM25 rankings described by Mühleisen et al. [14]. By default, the BM25 implementation provided with GeeseDB implements the disjunctive variant of BM25, instead of the conjunctive variant they used. Although the conjunctive variant of BM25 can be computed more quickly, we chose the disjunctive variant as it is more commonly used by IR researchers, and the differences between the effectiveness scores are noticeable on smaller collections. For now we only support the original formulation of BM25 by Robertson et al. [19]; adding other versions of BM25 [17] is, however, trivial.
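To illustrate what such a SQL-expressed ranking looks like, the sketch below phrases a disjunctive BM25 (original Robertson et al. [19] form, with the common defaults k1 = 1.2 and b = 0.75) over the toy tables built in the sketch of Section 2. This is our approximation in the spirit of Mühleisen et al. [14], not the query GeeseDB ships, which may differ in table names and parameter handling.

# Sketch of disjunctive BM25 as SQL, reusing 'con' and the toy tables
# from the previous sketch. Per query term t and document d:
#   idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |d| / avgdl))
# with idf(t) = ln((N - df + 0.5) / (df + 0.5)).
bm25_sql = """
WITH query_terms AS (
    SELECT term_id, doc_frequency AS df
    FROM terms
    WHERE string IN ('dog', 'music')        -- the tokenized query
), stats AS (
    SELECT COUNT(*) AS n_docs, AVG(length) AS avg_len FROM docs
)
SELECT d.collection_id,
       SUM(LN((s.n_docs - qt.df + 0.5) / (qt.df + 0.5))
           * td.term_frequency * (1.2 + 1)
           / (td.term_frequency
              + 1.2 * (1 - 0.75 + 0.75 * d.length / s.avg_len))) AS bm25
FROM term_doc AS td
JOIN query_terms AS qt ON qt.term_id = td.term_id
JOIN docs AS d ON d.doc_id = td.doc_id
CROSS JOIN stats AS s
GROUP BY d.collection_id
ORDER BY bm25 DESC
"""
print(con.execute(bm25_sql).fetchdf())

Note that on a three-document toy collection the IDF of frequent terms goes negative; on realistically sized collections the original formulation behaves as expected.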
2.2. Graph Query Language

What distinguishes GeeseDB from alternatives, whether database-backed (OldDog [20]) or native systems (Anserini [21], Terrier [22]), is its graph query language, based on Cypher [9]. Systems like Elastic (https://www.elastic.co/what-is/elasticsearch-graph, accessed 19-08-2021) and Solr (https://solr.apache.org/guide/6_6/graph-traversal.html, accessed 19-08-2021) do support querying graphs, but not declaratively. For now, GeeseDB implements Cypher's basic graph pattern matching queries for retrieving data. An example of a graph query supported by GeeseDB is presented in Figure 4; it finds all documents written by the same authors as those who wrote the document with collection_id "96ab542e".

Figure 4: An example Cypher query that finds all documents written by the same author(s) as the document with collection_id "96ab542e".

MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id

For comparison, Figure 5 shows the same query expressed in SQL; it is much more complex than the Cypher version, because the join conditions have to be made explicit. Connecting the "docs" table with the "authors" table requires two joins, and reconnecting to the "docs" table introduces two more.

Figure 5: SQL query that corresponds to the graph query of Figure 4.

SELECT DISTINCT d2.collection_id
FROM docs AS d2
JOIN doc_author AS da2 ON (d2.collection_id = da2.doc)
JOIN authors AS a2 ON (da2.author = a2.author)
JOIN doc_author AS da3 ON (a2.author = da3.author)
JOIN docs AS d ON (d.collection_id = da3.doc)
WHERE d.collection_id = '96ab542e'

At the moment of writing, GeeseDB supports the following Cypher keywords: MATCH, RETURN, WHERE, AND, DISTINCT, ORDER BY, SKIP, and LIMIT. Instead of filtering data with WHERE, it is also possible to filter through graph pattern matching in the MATCH clause itself, as shown in Figure 6; this query returns the length of document "96ab542e". We plan to support the remaining Cypher keywords in the future, as well as directed edges. Everything that is not yet directly supported by our implementation can of course still be expressed in SQL, which is fully supported. (GeeseDB supports graph queries by translating them to their corresponding SQL queries; both nodes and edges are, after all, just tables in the backend.)

Figure 6: Graph query that returns the length of the document with collection_id "96ab542e".

MATCH (d:docs {collection_id: "96ab542e"})
RETURN d.len

In order to know how to join nodes to each other when no edge information is provided, GeeseDB stores information on the graph schema. This way GeeseDB knows how nodes relate to each other, and through which edges. GeeseDB has a module for updating the graph schema, allowing researchers to easily set up the graph they want represented in the database.
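Because graph queries are executed by translation to SQL, the translation itself can be inspected from Python. Below is a minimal sketch using the Translator and get_connection helpers that also appear in Figure 12; the printed SQL should roughly correspond to the query of Figure 5.

from geesedb.connection import get_connection
from geesedb.interpreter import Translator

db_path = '/path/to/database'
translator = Translator(db_path)

# The Cypher query of Figure 4, translated to SQL using the stored
# graph schema.
c_query = """
MATCH (d:docs)-[]-(:authors)-[]-(d2:docs)
WHERE d.collection_id = "96ab542e"
RETURN DISTINCT d2.collection_id
"""
sql_query = translator.translate(c_query)
print(sql_query)  # inspect the generated joins

cursor = get_connection(db_path).cursor
cursor.execute(sql_query)
candidates = cursor.fetchall()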
3. Usage

GeeseDB comes as an easy-to-install Python package that can be installed using pip, the standard package installer for Python:

$ pip install geesedb==0.0.1

We can start using GeeseDB right after installing it. All examples shown in this paper were run on version v0.0.1 of GeeseDB. However, as GeeseDB is actively being developed, we advise readers to use the latest version, which is installed when no package version is specified. It is also possible to install the latest commit directly from GitHub (https://github.com/informagi/GeeseDB#package-installation).

As an example, we will show how to use GeeseDB for the background linking task of the TREC News Track [23]. The goal of this task is: given a news story, find other news articles that provide important context or background information. These articles can then be recommended to readers to help them understand the context in which the news story takes place. The collection used for this task is the Washington Post V3 collection (https://trec.nist.gov/data/wapost/) released for the 2020 edition of TREC. It contains 671,945 news articles published by the Washington Post between 2012 and 2020, and 50 topics with relevance assessments (topics correspond to the collection identifiers of documents for which relevant background reading has to be found). The articles in this collection contain useful metadata; in particular, we will use authorship information. We extracted 25,703 unique article authors, where multiple authors may have co-written a news article. We also annotated the documents with entity information obtained using the Radboud Entity Linker [24]; in total, 31,622,419 references to 541,729 unique entities were found. An edge between an entity node and a document node contains mention and location information, as well as the ner_tag found by the linker's entity recognition module (the entity linker can assign different tags to the same entity). The annotated data will be made publicly available. Figure 7 illustrates the data schema that we use for the background linking task.

Figure 7: Example property graph for the TREC News Track's background linking task. The node types are authors, entities, terms and documents; edges connect document nodes to the other node types. Both edges and nodes can have properties (following the property graph model). Multiple edges may exist between one entity node and one document node, as one entity can be linked multiple times in one document. (In the example: authors "Chris" and "Arjen"; documents "abc" with length 3 and "def" with length 2; terms "dog", "cat" and "music" with document frequencies 1, 2 and 1; an entity "dog" with document frequency 1, mentioned as "dog" at start position 0 with length 1 and ner_tag "misc".)

3.1. Indexing and Search

To start, a database containing at least the document and term information needs to be created. Figure 8 shows how this data can be loaded from CSV files.

Figure 8: Load text data from the Washington Post collection, formatted as CSV files in the format described by Mühleisen et al. [14].

from geesedb.index import FullTextFromCSV

index = FullTextFromCSV(
    database='/path/to/database',
    docs_file='/path/to/docs.csv',
    term_dict_file='/path/to/term_dict.csv',
    term_doc_file='/path/to/term_doc.csv'
)
index.load_data()

Instead of loading the data from CSV files, it is also possible to load the text data directly using the CIFF data exchange format [16]; GeeseDB also provides functionality to create the CSV files used here from a CIFF file. Authorship information and entity links can be loaded similarly. Processing Cypher queries additionally depends on schema information that needs to be loaded; we provide a supporting class (called metadata) for this, and the schema data used in this paper will be made available via GitHub.
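For readers wondering what these CSV files contain: they mirror the three tables of Figure 1. The sketch below writes the toy graph of Figure 3 in that spirit; the column order and the absence of a header row are assumptions on our part, so consult the GeeseDB repository for the authoritative format.

# Hypothetical CSV layout, one file per table of Figure 1. Column order
# and header conventions are assumptions; see the GeeseDB repository.
from pathlib import Path

Path('docs.csv').write_text(
    '1,2,a\n'        # doc_id, len, collection_id
    '2,2,b\n'
    '3,5,c\n')
Path('term_dict.csv').write_text(
    '1,dog,1\n'      # term_id, string, df
    '2,cat,3\n'
    '3,music,2\n')
Path('term_doc.csv').write_text(
    '1,1,1\n'        # term_id, doc_id, tf
    '2,1,1\n'
    '2,2,1\n'
    '3,2,1\n'
    '2,3,2\n'
    '3,3,3\n')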
After loading the data we can quickly create a BM25 ranking for ad hoc search in the Washington Post collection, as shown in Figure 9.

Figure 9: Example of how to create a BM25 ranking for the query "obama and trump" that returns the top 10 documents.

from geesedb.search import Searcher

searcher = Searcher(
    database='/path/to/database',
    n=10
)
topic = 'obama and trump'
hits = searcher.search_topic(topic)

For the background linking task, however, we do not have regular topics; we only have the collection identifiers of the documents for which relevant background reading has to be found. In order to search, queries that represent our information need have to be constructed. A common approach is to use the top-k TF-IDF terms of the source article. These can easily be found using the Cypher statement shown in Figure 10. Instead of Cypher it is also possible to use SQL, as shown in Figure 11; this example again illustrates that the Cypher query is the more elegant of the two.

Figure 10: Prepared Cypher statement that finds the top-5 TF-IDF terms in a document.

MATCH (d:docs {collection_id: ?})-[]-(t:term_dict)
RETURN string
ORDER BY tf * log(671945/df) DESC
LIMIT 5

Figure 11: Prepared SQL statement that finds the top-5 TF-IDF terms in a document.

SELECT term_dict.string
FROM term_dict
JOIN term_doc ON (term_dict.term_id = term_doc.term_id)
JOIN docs ON (docs.doc_id = term_doc.doc_id)
WHERE docs.collection_id = ?
ORDER BY term_doc.tf * log(671945/term_dict.df) DESC
LIMIT 5;

Using the terms found with Cypher, we can construct queries to pass to the searcher and create a BM25 ranking. The code that generates the rankings for all topics is presented in Figure 12; as the figure shows, only a limited number of lines of Python code are needed.

Figure 12: Create a BM25 ranking for all background linking topics using the top-5 TF-IDF terms. Note that in this case a processed topic file is used that only contains the topic identifier and the topic article id; the topic file in this format is provided on our GitHub.

from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator

db_path = '/path/to/database'

searcher = Searcher(
    database=db_path,
    n=1000
)

translator = Translator(db_path)
c_query = """cypher TFIDF query"""  # the Cypher query of Figure 10
query = translator.translate(c_query)

cursor = get_connection(db_path).cursor
topics = get_topics_backgroundlinking(
    '/path/to/topics'
)
for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    # one term per row; join the top-5 terms into a query string
    topic = ' '.join(e[0] for e in cursor.fetchall())
    hits = searcher.search_topic(topic)

Note that the collection size (671,945) is hardcoded in the query, as version v0.0.1 does not support aggregation yet. From this point it is trivial to write the contents of hits to a runfile and evaluate it using trec_eval.
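To make that last step concrete: assuming, as Figure 14 suggests, that hits is a Pandas DataFrame whose collection_id column is ordered by decreasing BM25 score, the loop of Figure 12 can be extended with a few lines such as the following sketch (the descending 1000 - rank value is a stand-in score, so that no score column needs to be assumed):

# Append one TREC-format runfile line per hit:
#   topic_no Q0 doc_id rank score run_tag
# Assumes 'hits' preserves ranking order and has a collection_id column
# (cf. Figure 14); everything else here is illustrative.
with open('bm25.run', 'a') as run_file:
    for rank, doc_id in enumerate(hits.collection_id, start=1):
        run_file.write(
            f'{topic_no} Q0 {doc_id} {rank} {1000 - rank} geesedb\n')

The resulting file can then be scored with trec_eval against the task's relevance judgements.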
Instead of "just" ranking with BM25, it is straightforward to adapt the ranking using e.g. the available metadata. In the case of background linking, it makes sense to consider authorship information when recommending articles that might be suitable as background reading. As journalists often specialize in certain news topics (e.g. politics, foreign affairs, tech), the stories they write often share context; and when journalists collaborate on stories, they tend to do so on topics they specialize in as well. As authorship information is available to us, we can use the information whether an article was written by the authors of the topic article, or by someone they have collaborated with in the past. The articles written by this group of people can easily be found using the graph query shown in Figure 13.

Figure 13: Cypher query to find documents written by co-authors of the authors of the topic article.

MATCH (d:docs)-[]-(:authors)-[]-(:docs)-[]-(:authors)-[]-(d2:docs {collection_id: ?})
RETURN DISTINCT d.collection_id

Depending on the number of documents found by this query, different rescoring strategies can be chosen. If the set of documents written by the authors or their co-authors is large, it may be reasonable to consider only these documents; if the set is small, a score boost might be more appropriate. Figure 14 shows an example that only considers the documents found with the query of Figure 13 as background reading candidates; in this particular case we require that more than 2000 documents are found before filtering.

Figure 14: Find the documents written by all authors that collaborated with the authors of the topic article; if more than 2000 such documents are found, only consider these documents as background reading candidates.

# imports and first lines are the same as in Figure 12
author_c_query = """cypher authorship query"""  # the query of Figure 13
author_query = translator.translate(author_c_query)

for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    topic = ' '.join(e[0] for e in cursor.fetchall())
    hits = searcher.search_topic(topic)
    cursor.execute(author_query, [collection_id])
    docs_authors = {
        e[0] for e in cursor.fetchall()
    }
    if len(docs_authors) > 2000:
        hits = hits[hits.collection_id.isin(docs_authors)]
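The score-boost alternative mentioned above could look like the following sketch, replacing the final condition in the loop of Figure 14. It assumes hits carries a numeric score column next to collection_id, which is an assumption on our side (Figure 14 only shows the collection_id column):

# Sketch: boost instead of filter when the co-author set is small.
# Assumes a numeric 'score' column on hits (an assumption of this sketch).
if len(docs_authors) > 2000:
    hits = hits[hits.collection_id.isin(docs_authors)]
else:
    boost = hits.collection_id.isin(docs_authors)
    hits.loc[boost, 'score'] *= 1.1  # small, tunable boost factor
    hits = hits.sort_values('score', ascending=False)

The boost factor of 1.1 is arbitrary here; in practice it would be tuned on held-out topics.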
To give another example, the graph query language is also useful when considering entities. When journalists write news articles, the articles relate to events concerning e.g. people, organisations, or countries. In other words, at the basis of news articles lie the entities, as they are often the subject of the news. So, instead of using the most informative terms of a news article, it can be useful to consider the entities identified in the article. Important entities tend to be mentioned in the beginning of a news article [25]; Figure 15 shows the Cypher query that retrieves the mention text of the first five mentioned entities.

Figure 15: Retrieve the first five entities mentioned in the topic article, returning the terms used to mention each entity.

MATCH (d:docs {collection_id: ?})-[]-(e:entities)
RETURN mention
ORDER BY start
LIMIT 5

Before we can search using the text of these first five entity mentions, the text needs to be processed. The term data loaded into GeeseDB was already processed, as it was loaded from CSV files built from a CIFF file created from an Anserini [21] (Lucene) index. Anserini has an easy-to-use Python extension, Pyserini [26], that can be used to tokenize the mention text in the same way as the documents were tokenized. Figure 16 shows the Python code that extracts the mentions, processes them into a usable query for GeeseDB, and then creates a BM25 ranking with this query.

Figure 16: Create a BM25 ranking for all background linking topics using the mention text of the first five linked entities in the source article.

from geesedb.search import Searcher
from geesedb.connection import get_connection
from geesedb.resources import get_topics_backgroundlinking
from geesedb.interpreter import Translator
from pyserini.analysis import Analyzer, get_lucene_analyzer

db_path = '/path/to/database'

searcher = Searcher(
    database=db_path,
    n=1000
)
analyzer = Analyzer(get_lucene_analyzer())

translator = Translator(db_path)
c_query = """cypher entity query"""  # the query of Figure 15
query = translator.translate(c_query)

cursor = get_connection(db_path).cursor
topics = get_topics_backgroundlinking(
    '/path/to/topics'
)
for topic_no, collection_id in topics:
    cursor.execute(query, [collection_id])
    topic = ' '.join([e[0] for e in cursor.fetchall()])
    topic = ' '.join(analyzer.analyze(topic))
    hits = searcher.search_topic(topic)

In summary, GeeseDB allows researchers to index and search data with only a few lines of Python code. It can be used to explore new IR research ideas through both SQL and the Cypher graph query language. As GeeseDB can query directly on top of Pandas DataFrames, no data transfer is needed, making the framework ideal for setting up the data for other Python reranking pipelines (e.g. it is trivial to store learning-to-rank features in the database and use them directly).

4. Future Work

As the current GeeseDB version is still an early prototype, many future improvements are envisioned. We have identified four improvements we want to pursue as a priority:

• We have implemented the graph query language Cypher only partially; in the near future, we would like to support it fully. For now, the graph query language can only be used to query data, but ideally it could also be used to load or update data. Of course this is already possible through the SQL backend, but that should only be necessary when extending the backend for new use-cases.
• As the goal of GeeseDB is to serve as an IR toolkit, we would like to extend GeeseDB with functionalities that make the package easier to use for IR researchers. A few obvious extensions would be IR dataset support, native document processing, and implementations of popular first-stage rankers.
• This version of GeeseDB lacks extensive benchmarking. We plan to release benchmarks on popular IR datasets, including instructions on how to reproduce them.
• In recent years, dense graph representations have become popular. We would like to add functionality to analyse such dense representations for graphs managed in GeeseDB.

Eventually, we would like to extend the query language with proper support for defining ranking over graphs. (Currently, the ranking function is hidden in the 'searcher' module.)

5. Conclusion

In this work we have described our prototype implementation of GeeseDB, and how we envision graph databases can be used for information retrieval research. GeeseDB is still in active development, and we are open to additional contributions from the community.

Acknowledgments

This work is part of the research program Commit2Data with project number 628.011.001 (SQIREL-GRAPHS), which is (partly) financed by the Netherlands Organisation for Scientific Research (NWO).

References

[1] L. Gao, Z. Dai, T. Chen, Z. Fan, B. Van Durme, J. Callan, Complement lexical retrieval model with semantic residual embeddings, in: Advances in Information Retrieval, ECIR '21, Springer International Publishing, Cham, 2021, pp. 146-160.
[2] Y. Luan, J. Eisenstein, K. Toutanova, M. Collins, Sparse, dense, and attentional representations for text retrieval, Transactions of the Association for Computational Linguistics 9 (2021) 329-345.
[3] S. Lin, J. Yang, J. Lin, Distilling dense representations for ranking using tightly-coupled teachers, CoRR abs/2010.11386 (2020). URL: https://arxiv.org/abs/2010.11386.
[4] F. Hasibi, K. Balog, S. E. Bratsberg, Exploiting entity linking in queries for entity retrieval, in: Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 209-218. doi:10.1145/2970398.2970406.
[5] K. Balog, Entity-Oriented Search, Springer Nature, Cham, Switzerland, 2018.
[6] J. Dalton, L. Dietz, J. Allan, Entity query feature expansion using knowledge base links, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 365-374. doi:10.1145/2600428.2609628.
[7] R. Deveaud, M.-D. Albakour, C. Macdonald, I. Ounis, On the importance of venue-dependent features for learning to rank contextual suggestions, in: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management, CIKM '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 1827-1830. doi:10.1145/2661829.2661956.
[8] C. Macdonald, R. L. Santos, I. Ounis, On the usefulness of query features for learning to rank, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM '12, Association for Computing Machinery, New York, NY, USA, 2012, pp. 2559-2562. doi:10.1145/2396761.2398691.
[9] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, A. Taylor, Cypher: An evolving query language for property graphs, in: Proceedings of the 2018 International Conference on Management of Data, SIGMOD '18, Association for Computing Machinery, New York, NY, USA, 2018, pp. 1433-1445. doi:10.1145/3183713.3190657.
[10] R. Angles, The property graph database model, in: Proceedings of the 12th Alberto Mendelzon International Workshop on Foundations of Data Management, AMW '18, CEUR-WS.org, Aachen, 2018.
[11] C. Kamphuis, A. P. de Vries, Reproducible IR needs an (IR) (graph) query language, in: Proceedings of the Open-Source IR Replicability Challenge co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, CEUR-WS.org, Aachen, 2019, pp. 17-20. URL: http://ceur-ws.org/Vol-2409/position03.pdf.
[12] R. Clancy, N. Ferro, C. Hauff, J. Lin, T. Sakai, Z. Z. Wu, The SIGIR 2019 open-source IR replicability challenge (OSIRRC 2019), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1432-1434. doi:10.1145/3331184.3331647.
[13] J. Arguello, F. Diaz, J. Lin, A. Trotman, SIGIR 2015 workshop on reproducibility, inexplicability, and generalizability of results (RIGOR), in: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '15, Association for Computing Machinery, New York, NY, USA, 2015, pp. 1147-1148. doi:10.1145/2766462.2767858.
[14] H. Mühleisen, T. Samar, J. Lin, A. de Vries, Old dogs are great at new tricks: Column stores for IR prototyping, in: Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 863-866. doi:10.1145/2600428.2609460.
[15] W. Yang, K. Lu, P. Yang, J. Lin, Critically examining the "neural hype": Weak baselines and the additivity of effectiveness gains from neural ranking models, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1129-1132. doi:10.1145/3331184.3331340.
[16] J. Lin, J. Mackenzie, C. Kamphuis, C. Macdonald, A. Mallia, M. Siedlaczek, A. Trotman, A. de Vries, Supporting interoperability between open-source search engines with the common index file format, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2149-2152. doi:10.1145/3397271.3401404.
[17] C. Kamphuis, A. P. de Vries, L. Boytsov, J. Lin, Which BM25 do you mean? A large-scale reproducibility study of scoring variants, in: Advances in Information Retrieval, ECIR '20, Springer International Publishing, Cham, 2020, pp. 28-34.
[18] M. Raasveldt, H. Mühleisen, DuckDB: An embeddable analytical database, in: Proceedings of the 2019 International Conference on Management of Data, SIGMOD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1981-1984. doi:10.1145/3299869.3320212.
[19] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, et al., Okapi at TREC-3, NIST Special Publication SP 109 (1995) 109.
[20] C. Kamphuis, A. P. de Vries, The OldDog Docker image for OSIRRC at SIGIR 2019, in: Proceedings of the Open-Source IR Replicability Challenge co-located with the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, OSIRRC@SIGIR 2019, Paris, France, July 25, 2019, CEUR-WS.org, Aachen, 2019, pp. 47-49. URL: http://ceur-ws.org/Vol-2409/docker07.pdf.
[21] P. Yang, H. Fang, J. Lin, Anserini: Enabling the use of Lucene for information retrieval research, in: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17, Association for Computing Machinery, New York, NY, USA, 2017, pp. 1253-1256. doi:10.1145/3077136.3080721.
[22] I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, D. Johnson, Terrier information retrieval platform, in: D. E. Losada, J. M. Fernández-Luna (Eds.), Advances in Information Retrieval, Springer Berlin Heidelberg, Berlin, Heidelberg, 2005, pp. 517-519.
[23] I. Soboroff, S. Huang, D. Harman, TREC 2018 news track overview, in: Proceedings of The Twenty-Seventh Text REtrieval Conference, TREC '18, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA, 2018.
[24] J. M. van Hulst, F. Hasibi, K. Dercksen, K. Balog, A. P. de Vries, REL: An entity linker standing on the shoulders of giants, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 2197-2200. doi:10.1145/3397271.3401416.
[25] C. Kamphuis, F. Hasibi, A. P. de Vries, T. Crijns, Radboud University at TREC 2019, in: Proceedings of The Twenty-Eighth Text REtrieval Conference, TREC '19, National Institute of Standards and Technology (NIST), Gaithersburg, Maryland, USA, 2019.
[26] J. Lin, X. Ma, S.-C. Lin, J.-H. Yang, R. Pradeep, R. Nogueira, Pyserini: An easy-to-use Python toolkit to support replicable IR research with sparse and dense representations, arXiv preprint arXiv:2102.10073 (2021).