Reproducible IR needs an (IR) (Graph) Query Language
                        Chris Kamphuis                                                        Arjen P. de Vries
                      ckamphuis@cs.ru.nl                                                       arjen@acm.org
                      Radboud University                                                     Radboud University
                   Nijmegen, The Netherlands                                              Nijmegen, The Netherlands
ABSTRACT                                                                    CCS CONCEPTS
IR research should concentrate on the retrieval model, and                  • Information systems → Query languages; Document rep-
not be hindered by its implementation in terms of low-level                 resentation; Evaluation of retrieval results; Search engine
data structures; and we should embrace graph databases to                   indexing.
realize that vision!
   Even results from retrieval systems based on the clas-                   KEYWORDS
sic Okapi BM25 retrieval model (an approach that seems                      information retrieval, graph query language, reproducibility,
straightforward to implement) have been remarkably difficult
to actually reproduce in practice. Pin-pointing the cause of
not obtaining identical results on an identical collection is               1   INTRODUCTION
surprisingly hard, especially since retrieval systems usually               The IR community should work towards being a community
mix the retrieval model itself with the code necessary to                   where reproducibility is the norm. We recognize that such a
make IR efficient. Unlike the database community, our field                 process can happen gradually, and that weaker forms of repro-
has never moved away from implementing query processing                     ducibility, like replicability, are better than none. However,
directly on the file system; historically best attributed to a              when introducing tools for replicability, researchers should
need for speed on large data sets of heterogeneous documents,               keep in mind that replicability is only a temporarily solution
never a good match for our database colleagues’ solutions.                  in the greater picture when working towards reproducibility.
   Nevertheless, this position paper calls for a shift toward                  The definition of reproducibility by the ACM1 states that,
a declarative approach to specify retrieval. Not only has                   for computational experiments, an independent group can
the state of the art in database query processing reached a                 obtain same results using artifacts they developed indepen-
level where doing IR as database queries is not unimagin-                   dently. Exact reproduction is not required, but results ob-
able, advancing retrieval effectiveness has become dependent                tained should support the claims of the original work. For IR
on systems that handle complex data models in multiple                      experiments, this implies that evaluation metrics observed in
layers of processing, usually involving machine learning and                the reproduction study should be within a tolerable margin
thereby including the data itself into the retrieval model. If              from those in the published study.
the results of ranking with a straightforward combination of                   In general, exact reproduction of studies is not a realistic
term frequency and document frequency can hardly be repro-                  expectation to hold, and illustrated well by two published
duced, how will we advance our field when researchers face                  results from the information retrieval field. First, work carried
the implementation of ranking formulas that integrate the                   out by Mühleisen et al. [5] compared multiple systems that
data itself, knowledge bases and advanced machine learning                  all claim to implement the BM25 ranking formula, and found
methods?                                                                    that four different implementations result in four different
   Reproducibility in IR will be much easier to achieve if the              effectiveness scores, both in 𝑀 𝐴𝑃 and 𝑃 @5. Given that
retrieval model can be expressed concisely (and precisely!)                 these systems included Indri [9] and Terrier [6], this came
in terms of operations over the document-term graph, or,                    out as quite a surprise; the authors took specific care to
probable today, a document-metadata-entity-term-graph. We                   keep document pre-processing identical for both systems,
propose to adopt G-core, a language proposed by the database                but the observed difference in 𝑀 𝐴𝑃 of 3% absolute was the
community to become a standard graph query language, and                    largest deviation in score reported in that study. Similarly,
where necessary extend it for IR purposes. IR researchers                   the RIGOR workshop compared different systems to each
and practitioners can represent documents and annotations                   other, out of which four systems implemented the BM25
naturally as graphs. We discuss the advantages of a graph                   ranking model Arguello et al. [2]. Again, the differences in
model over previous work using relational databases, and                    effectiveness amounted to 3% absolute on 𝑀 𝐴𝑃 @1000.
illustrate how the key properties of simple and extended IR                    Whether these differences are ‘within a tolerable margin’
models can be expressed naturally.                                          is arguable. Why the differences are so large has not been
                                                                            uncovered in detail; the authors of [2] pointed at tuning
                                                                            parameters, differences in interpretation of the equations,
                                                                            and optimizations. We suspect that implementations might
Copyright © 2019 for this paper by its authors. Use permitted under
                                                                            1
Creative Commons License Attribution 4.0 International (CC BY 4.0).          https://www.acm.org/publications/policies/artifact-review-
OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.        badging, last accessed July 4th, 2019.


                                                                       17
OSIRRC 2019, July 25, 2019, Paris, France                                                                                  Kamphuis and de Vries


(unintentionally) modify the specification of the ranking for-                 only have to focus on issuing queries, without having
mula when implementing techniques to obtain faster query                       to write additional code.
processing. Whatever the correct explanation, what we can                  All of these arguments relate directly to improving repro-
conclude is that, apparently, widely used experimental IR               ducibility of IR experimentation. However, what the authors
systems do not implement the same BM25 ranking formula                  did not mention is the increased friction of mapping all the
(and given the differences between all systems, it is not clear         elements of an information retrieval model onto a relational
which one implements the ranking formula that was proposed              representation, especially when both the documents and the
in Robertson and Walker [7]).                                           ranking functions have been increasing in complexity in re-
   As the scientific field that aims to identify the mechanisms         cent years. While it is definitely possible to express retrieval
that can predict relevance computationally, it is disappointing         models in SQL or a relational algebra variant (assuming
that we cannot even reach conclusive results for classic, highly        it includes aggregation operators), this is not a convenient
effective models. So, how can we improve upon the current               approach, and, in our opinion, error-prone as well.
situation?                                                                 We embrace the arguments in favour of a database ap-
   This paper takes the position that IR research should use            proach to IR, but propose to move along with recent trends in
higher level abstractions in the implementation of the re-              the database community and model documents and queries
trieval models we study. Prototyping retrieval models using             as graphs, and adopt a graph database model instead of a
a declarative query language would improve reproducibility,             relational model. Recently, a standard query language over
because the query expressions used provide a much better                graphs, G-CORE, has been introduced in Angles et al. [1].
basis to help investigate the differences that would explain            The language proposal is supported by database researchers
observed variations in effectiveness scores. If the reason be-          and practitioners united in the Linked Data Benchmark Coun-
tween such differences can be explained, they can either be             cil (LDBC)2 , and (at least partially) implemented by vendors
fixed or kept depending on the nature of the cause.                     like Neo4J. In other words, it is the right time to consider
   More specifically, we propose to adopt a graph query lan-            the question how to express information retrieval problems
guage as prototyping tool to achieve reproducibility in IR              in G-CORE, and identify possible extensions that might be
experimentation. We first review arguments given in favour              necessary for adoption in our community.
of ‘a database approach to IR’. We explain how a graph query
language would be used for IR tasks and discuss its advan-              2.1       Documents and terms
tages over other approaches. After sketching our solution
                                                                        G-core assumes the property graph data model: graphs are
direction, we conclude by identifying the challenges that still
                                                                        directed, have labels on both nodes and edges, and every node
lie ahead.
                                                                        and edge can have associated <property, value> pairs.
                                                                           We can represent both documents and terms as nodes
2    A GRAPH QUERY LANGUAGE FOR IR                                      of the graph. Edges are formed between a document node
Mühleisen et al. [5] have argued that relational databases              and a term node if the term appears in the document. If a
implemented as column store would be a suitable choice for              term appears multiple times in a document, G-CORE allows
creating information retrieval systems, and demonstrated how            for multiple edges between two nodes to exist. Edges may,
TF-IDF like ranking functions can be expressed easily and               optionally, save the position of the term in the document.
executed efficiently (show-casing a conjunctive BM25 ranking               G-CORE is a graph query language that is closed over
function). They give the following arguments in favour of ‘a            property graphs; when querying a graph, the result is another
database approach to IR’:                                               graph. Expressing a retrieval model like BM25 then corre-
                                                                        sponds to defining different graphs to determine the different
    ∙ A query language is a formal framework in which
                                                                        components in the ranking function. To illustrate, consider
      query evaluations have to be formulated precisely. Re-
                                                                        how the term frequency of term 𝑡 given a document 𝐷 is
      searchers do not have the option the resort to shortcuts
                                                                        computed from the graph specified using the G-CORE query
      in corner cases;
                                                                        shown in listing 1.
    ∙ data management components are separated from the
                                                                        CONSTRUCT (d)<-[:appearsIn]-(t)
      query evaluation specifications, reducing system com-
                                                                          MATCH (d:Document) ON document_graph,
      plexity and yielding a better structured architecture;
                                                                                (t:Term)     ON term_graph
    ∙ advances made in the database community (e.g. moving
                                                                            WHERE t.value = query_t.value
      from column-wise query processing to vectorized query
                                                                            AND d.id = doc_D.id
      processing) will benefit the retrieval engine ‘for free’;
    ∙ error analysis can be carried out on the same database                    Listing 1: Term frequency for term document pair
      that represents the collection. A researcher can write               The resulting graph contains two nodes; the document
      additional queries for analysis without having to write           node of document 𝐷 and the term node of term 𝑡. Term
      low-level code to interact with the data;                         frequency 𝑡𝐷 is calculated by counting the number of edges
    ∙ a database provides a rapid prototyping tool. IR re-              connecting these nodes. G-CORE does not offer methods
      searchers are primarily interested in questions regard-
                                                                        2
      ing the ranking components of their problem. They                     http://ldbcouncil.org, last accessed June 14th, 2019.


                                                                   18
Reproducible IR needs an (IR) (Graph) Query Language                                               OSIRRC 2019, July 25, 2019, Paris, France


to return properties yet, but Angles et al. [1] show how an            2.2.2 Entity ranking. For the entity ranking task, entities
extension of the language would implement this; listing 2              that appear in a query article, need to be ranked. The more
expresses this using the current implementation in Neo4J’s             important an entity is in context of the article, the higher
Cypher [4].                                                            it should be ranked. The set of entities that appear in the
CONSTRUCT (d)<-[e:appearsIn]-(t)                                       document is provided, and, obviously, considering the docu-
  MATCH (d:Document) ON document_graph,                                ments as graphs lets us represent this information directly.
        (t:Term)     ON term_graph                                     For ranking the entities one might want to know the collec-
    WHERE t.value = query_t.value                                      tion statistics of said entities. In order to measure these, we
    AND d.id = doc_D.id                                                would extend the documents by running an entity tagger.
RETURN COUNT(e)                                                        We can easily express entity occurrence in articles as graph
                                                                       queries, and, if deemed useful to improve effectiveness, bring
           Listing 2: Extension to return properties
                                                                       in additional information from a knowledge base, an external
   A term’s Inverse Document Frequency (IDF) statistic can             graph that we can join as easily, providing a complete and
be determined similarly: construct a graph consisting of all           high level specification of the exact way how the knowledge
nodes that are directly connected to the term node. The                base would be used in the ranking function.
number of document nodes in this graph is equal to the
number of documents in the collection containing query term
𝑡. Listing 3 shows how this graph can be constructed.
CONSTRUCT (d)
                                                                       2.3   Helping reproducibility
  MATCH (d:Document) ON document_graph,                                So far, we have focused on using a graph database for proto-
        (t:Term)     ON term_graph                                     typing information retrieval models. The common practice
    WHERE t.value = query_t.value                                      today among IR researchers and practitioners is however to
    AND (d)<-[:appearsIn]-(t)                                          implement their approach to retrieval directly on top of an
       Listing 3: Number of documents containing term                  inverted file index structure. The result is an implementa-
                                                                       tion with many different interwoven individual components:
                                                                       tokenisation, stemming, stop word removal, memory man-
2.2    Entities and metadata                                           agement, query processing, etc. When trying to reproduce a
For the BM25 example, both the relational and the graph                study, it can be quite a challenge to determine which parts
database would be appropriate. However, when the data                  of the process are different from the original study.
becomes more (semi-)structured, using a graph model will                  For a graph database solution, data management is taken
have clear advantages. Consider for example the TREC news              care of by the database system. The ranking function is ex-
track Soboroff et al. [8], that introduces two retrieval tasks:        pressed in the graph query language. This setup allows for
background linking and entity ranking. Both can be expected            easy analysis if a particular component fails in the reproduc-
to benefit from a richer document representation, where                tion process. Ideally the original work and the reproduction
the documents are enriched with their metadata and entity              study both have represented their work with a graph data-
annotations – perfectly modelled as a document graph.                  base. If reproduction fails in that case it easy to analyze
                                                                       whether the ranking function represented in the graph query
2.2.1 Background linking. For the background linking tasks,
                                                                       language is the same and whether the document representa-
the news articles to be retrieved provide background informa-
                                                                       tions are identical. If either one is different it might explain
tion to help understand a query article of interest. Annotating
                                                                       why it was not possible to reproduce results.
the documents in the collection with metadata can be ex-
                                                                          Consider again the study [5]. The query expression to cap-
pected to help this task. For example, articles can be linked
                                                                       ture the retrieval model is easily shared between research
with author information. Articles written by the same author
                                                                       groups. The column store approach was executed on two dif-
might have a higher chance of being relevant, as authors tend
                                                                       ferent database engines, MonetDB Boncz [3] and VectorWise
to write articles within a narrow range of topics. As articles
                                                                       Zukowski et al. [10]. Remarkably, only those two systems
can be written by multiple authors it also possible to identify
                                                                       produced identical effectiveness scores, even though the ex-
which authors co-authored often. This also might help when
                                                                       ecution engines are completely different; only the queries
determining if an article is suited for background reading.
                                                                       representing the IR model are identical. Had the effectiveness
These extra annotations can easily be represented in a graph
                                                                       scores not been the same, that would show that the systems
database, authors are simple nodes that are connected to
                                                                       processed the queries differently. The cause of a difference
article nodes if they authored the article. The graph of all
                                                                       would be clear: either a bug, or differences in the data. If
articles written by the same author as the query article can
                                                                       another group would also represent their document collection
be constructed using the query shown in listing 4.
                                                                       on one of these column store databases, and would find differ-
CONSTRUCT (d)                                                          ent effectiveness scores, the document representations must
  MATCH (d:document)                                                   differ. This might be the case because of text processing. It
    WHERE (d)-[:writtenBy]->()<-[:writtenBy]-(query_d)                 is immediately clear why differences probably occur when
         Listing 4: Documents written by same author                   investigating this way.


                                                                  19
OSIRRC 2019, July 25, 2019, Paris, France                                                                               Kamphuis and de Vries


   Even if the original work is not executed in the form               Finally, how can continuous updates of the graph be sampled
of graph queries, our approach is still useful for analysis            efficiently? It might be hard do determine when we need to
during a reproduction study. Because of the separation of              check which parts of the graph need to be updated when.
concerns between data management and the expression of
the ranking function, comparing minor modifications in data            4   CONCLUSION AND
(pre-)processing and ranking gives the researchers a low-                  RECOMMENDATIONS
effort opportunity to explore the effects of these changes. If         In this position paper we propose that IR practitioners should
some implementation details are not provided in a scientific           represent their data as graphs managed in a graph database
publication, the graph database allows for quick analysis to           system. We argue that this helps when trying to reproduce
compare different implementations that follow from the left            studies, and will simultaneously make your own research
out details.                                                           more reproducible for others.

3    FUTURE DIRECTIONS                                                 ACKNOWLEDGMENTS
                                                                       This work is part of the research program Commit2Data
Using a graph database with a graph query language instead
                                                                       with project number 628.011.001 (SQIREL-GRAPHS), which
of a column store database with a structured query language,
                                                                       is (partly) financed by the Netherlands Organisation for
offers new research opportunities. Firstly, we need empirical
                                                                       Scientific Research (NWO).
research to determine whether graph databases have matured
sufficiently to execute the information retrieval queries as
                                                                       REFERENCES
efficient as their alternatives. Secondly, how should keyword
                                                                        [1] Renzo Angles, Marcelo Arenas, Pablo Barcelo, Peter Boncz,
search be integrated in a graph query language? In the previ-               George Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus
ous section a suggestion on how this can be accomplished was                Paradies, Stefan Plantikow, Juan Sequeda, Oskar van Rest, and
                                                                            Hannes Voigt. 2018. G-CORE: A Core for Future Graph Query
made, but different representations may be explored, both                   Languages. In SIGMOD (SIGMOD ’18). ACM, New York, NY,
from an efficiency and an effectiveness perspective.                        USA, 1421–1432. https://doi.org/10.1145/3183713.3190654
   With respect to expressiveness of the language, the XQuery           [2] Jaime Arguello, Fernando Diaz, Jimmy Lin, and Andrew Trotman.
                                                                            2015. Sigir 2015 workshop on reproducibility, inexplicability, and
Fulltext vs. NEXI debate from the XML IR era should be                      generalizability of results (rigor). In SIGIR. ACM, 1147–1148.
revisited. Should we add operators that express clearly how             [3] Peter Boncz. 2002. Monet: A next-generation DBMS kernel for
the graphs are constructed from the underlying documents,                   query-intensive applications. Ph.D. Dissertation. Universiteit van
                                                                            Amsterdam.
including stemming and text normalization? With the in-                 [4] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin,
creased use of neural methods, topic models, and word embed-                Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Ryd-
                                                                            berg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving
dings, the question remains to what extent the construction                 query language for property graphs. In SIGMOD. ACM, 1433–
and/or their application should also be captured in declar-                 1445.
ative queries. Creation of a topic model can be viewed as               [5] Hannes Mühleisen, Thaer Samar, Jimmy Lin, and Arjen De Vries.
                                                                            2014. Old dogs are great at new tricks: Column stores for IR
an aggregation operation, just like we express the compu-                   prototyping. In SIGIR. ACM, 863–866.
tation of IDF in context of a collection graph. The nearest             [6] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig
neighbour search operations in high dimensions for using                    Macdonald, and Christina Lioma. 2006. Terrier: A high perfor-
                                                                            mance and scalable information retrieval platform. In Proceedings
word embeddings can be expressed declaratively, and even                    of the OSIR Workshop. 18–25.
the efficient query processing strategies necessary (e.g. cite-         [7] Stephen E Robertson and Steve Walker. 1994. Some simple
                                                                            effective approximations to the 2-poisson model for probabilistic
bond02). Future research will have to point out the need for                weighted retrieval. In SIGIR. Springer, 232–241.
expressing these (often merely pre-processing) steps in the             [8] Ian Soboroff, Shudong Huang, and Donna Harman. 2018. TREC
same graph query language, or are more naturally addressed                  2018 News Track Overview. In The Twenty-Seventh Text RE-
                                                                            trieval Conference (TREC 2018) Proceedings.
as user-defined functions that primarily call external machine          [9] Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce
learning libraries.                                                         Croft. 2005. Indri: A language-model based search engine for
   Finally, it is an open question how to rank edges and/or                 complex queries (extended version). IR 407. University of Mas-
                                                                            sachusetts.
nodes when taking into account the graph structure itself.             [10] Marcin Zukowski, Mark Van de Wiel, and Peter A Boncz. 2012.
When the data is represented as a graph, graph specific                     Vectorwise: A Vectorized Analytical DBMS.. In ICDE. 1349–1350.
ranking components like pagerank might easily be taken
advantage of. Here, a specific property of interest is the
notion of path expressions offered by the G-CORE query
language.
   Graph structures often appear in the context of social
media. In this context the graph constantly changes. New
nodes and edges are created all the time, and nodes and edges
might also leave the network. It will be useful to consider the
aspect that the graph structure constantly changes. So, it is
finally of interest to investigate which data structures and
algorithms are efficient in presence of continuous updates.


                                                                  20