Reproducible IR needs an (IR) (Graph) Query Language Chris Kamphuis Arjen P. de Vries ckamphuis@cs.ru.nl arjen@acm.org Radboud University Radboud University Nijmegen, The Netherlands Nijmegen, The Netherlands ABSTRACT CCS CONCEPTS IR research should concentrate on the retrieval model, and • Information systems → Query languages; Document rep- not be hindered by its implementation in terms of low-level resentation; Evaluation of retrieval results; Search engine data structures; and we should embrace graph databases to indexing. realize that vision! Even results from retrieval systems based on the clas- KEYWORDS sic Okapi BM25 retrieval model (an approach that seems information retrieval, graph query language, reproducibility, straightforward to implement) have been remarkably difficult to actually reproduce in practice. Pin-pointing the cause of not obtaining identical results on an identical collection is 1 INTRODUCTION surprisingly hard, especially since retrieval systems usually The IR community should work towards being a community mix the retrieval model itself with the code necessary to where reproducibility is the norm. We recognize that such a make IR efficient. Unlike the database community, our field process can happen gradually, and that weaker forms of repro- has never moved away from implementing query processing ducibility, like replicability, are better than none. However, directly on the file system; historically best attributed to a when introducing tools for replicability, researchers should need for speed on large data sets of heterogeneous documents, keep in mind that replicability is only a temporarily solution never a good match for our database colleagues’ solutions. in the greater picture when working towards reproducibility. Nevertheless, this position paper calls for a shift toward The definition of reproducibility by the ACM1 states that, a declarative approach to specify retrieval. Not only has for computational experiments, an independent group can the state of the art in database query processing reached a obtain same results using artifacts they developed indepen- level where doing IR as database queries is not unimagin- dently. Exact reproduction is not required, but results ob- able, advancing retrieval effectiveness has become dependent tained should support the claims of the original work. For IR on systems that handle complex data models in multiple experiments, this implies that evaluation metrics observed in layers of processing, usually involving machine learning and the reproduction study should be within a tolerable margin thereby including the data itself into the retrieval model. If from those in the published study. the results of ranking with a straightforward combination of In general, exact reproduction of studies is not a realistic term frequency and document frequency can hardly be repro- expectation to hold, and illustrated well by two published duced, how will we advance our field when researchers face results from the information retrieval field. First, work carried the implementation of ranking formulas that integrate the out by Mühleisen et al. [5] compared multiple systems that data itself, knowledge bases and advanced machine learning all claim to implement the BM25 ranking formula, and found methods? that four different implementations result in four different Reproducibility in IR will be much easier to achieve if the effectiveness scores, both in 𝑀 𝐴𝑃 and 𝑃 @5. Given that retrieval model can be expressed concisely (and precisely!) these systems included Indri [9] and Terrier [6], this came in terms of operations over the document-term graph, or, out as quite a surprise; the authors took specific care to probable today, a document-metadata-entity-term-graph. We keep document pre-processing identical for both systems, propose to adopt G-core, a language proposed by the database but the observed difference in 𝑀 𝐴𝑃 of 3% absolute was the community to become a standard graph query language, and largest deviation in score reported in that study. Similarly, where necessary extend it for IR purposes. IR researchers the RIGOR workshop compared different systems to each and practitioners can represent documents and annotations other, out of which four systems implemented the BM25 naturally as graphs. We discuss the advantages of a graph ranking model Arguello et al. [2]. Again, the differences in model over previous work using relational databases, and effectiveness amounted to 3% absolute on 𝑀 𝐴𝑃 @1000. illustrate how the key properties of simple and extended IR Whether these differences are ‘within a tolerable margin’ models can be expressed naturally. is arguable. Why the differences are so large has not been uncovered in detail; the authors of [2] pointed at tuning parameters, differences in interpretation of the equations, and optimizations. We suspect that implementations might Copyright © 2019 for this paper by its authors. Use permitted under 1 Creative Commons License Attribution 4.0 International (CC BY 4.0). https://www.acm.org/publications/policies/artifact-review- OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France. badging, last accessed July 4th, 2019. 17 OSIRRC 2019, July 25, 2019, Paris, France Kamphuis and de Vries (unintentionally) modify the specification of the ranking for- only have to focus on issuing queries, without having mula when implementing techniques to obtain faster query to write additional code. processing. Whatever the correct explanation, what we can All of these arguments relate directly to improving repro- conclude is that, apparently, widely used experimental IR ducibility of IR experimentation. However, what the authors systems do not implement the same BM25 ranking formula did not mention is the increased friction of mapping all the (and given the differences between all systems, it is not clear elements of an information retrieval model onto a relational which one implements the ranking formula that was proposed representation, especially when both the documents and the in Robertson and Walker [7]). ranking functions have been increasing in complexity in re- As the scientific field that aims to identify the mechanisms cent years. While it is definitely possible to express retrieval that can predict relevance computationally, it is disappointing models in SQL or a relational algebra variant (assuming that we cannot even reach conclusive results for classic, highly it includes aggregation operators), this is not a convenient effective models. So, how can we improve upon the current approach, and, in our opinion, error-prone as well. situation? We embrace the arguments in favour of a database ap- This paper takes the position that IR research should use proach to IR, but propose to move along with recent trends in higher level abstractions in the implementation of the re- the database community and model documents and queries trieval models we study. Prototyping retrieval models using as graphs, and adopt a graph database model instead of a a declarative query language would improve reproducibility, relational model. Recently, a standard query language over because the query expressions used provide a much better graphs, G-CORE, has been introduced in Angles et al. [1]. basis to help investigate the differences that would explain The language proposal is supported by database researchers observed variations in effectiveness scores. If the reason be- and practitioners united in the Linked Data Benchmark Coun- tween such differences can be explained, they can either be cil (LDBC)2 , and (at least partially) implemented by vendors fixed or kept depending on the nature of the cause. like Neo4J. In other words, it is the right time to consider More specifically, we propose to adopt a graph query lan- the question how to express information retrieval problems guage as prototyping tool to achieve reproducibility in IR in G-CORE, and identify possible extensions that might be experimentation. We first review arguments given in favour necessary for adoption in our community. of ‘a database approach to IR’. We explain how a graph query language would be used for IR tasks and discuss its advan- 2.1 Documents and terms tages over other approaches. After sketching our solution G-core assumes the property graph data model: graphs are direction, we conclude by identifying the challenges that still directed, have labels on both nodes and edges, and every node lie ahead. and edge can have associated pairs. We can represent both documents and terms as nodes 2 A GRAPH QUERY LANGUAGE FOR IR of the graph. Edges are formed between a document node Mühleisen et al. [5] have argued that relational databases and a term node if the term appears in the document. If a implemented as column store would be a suitable choice for term appears multiple times in a document, G-CORE allows creating information retrieval systems, and demonstrated how for multiple edges between two nodes to exist. Edges may, TF-IDF like ranking functions can be expressed easily and optionally, save the position of the term in the document. executed efficiently (show-casing a conjunctive BM25 ranking G-CORE is a graph query language that is closed over function). They give the following arguments in favour of ‘a property graphs; when querying a graph, the result is another database approach to IR’: graph. Expressing a retrieval model like BM25 then corre- sponds to defining different graphs to determine the different ∙ A query language is a formal framework in which components in the ranking function. To illustrate, consider query evaluations have to be formulated precisely. Re- how the term frequency of term 𝑡 given a document 𝐷 is searchers do not have the option the resort to shortcuts computed from the graph specified using the G-CORE query in corner cases; shown in listing 1. ∙ data management components are separated from the CONSTRUCT (d)<-[:appearsIn]-(t) query evaluation specifications, reducing system com- MATCH (d:Document) ON document_graph, plexity and yielding a better structured architecture; (t:Term) ON term_graph ∙ advances made in the database community (e.g. moving WHERE t.value = query_t.value from column-wise query processing to vectorized query AND d.id = doc_D.id processing) will benefit the retrieval engine ‘for free’; ∙ error analysis can be carried out on the same database Listing 1: Term frequency for term document pair that represents the collection. A researcher can write The resulting graph contains two nodes; the document additional queries for analysis without having to write node of document 𝐷 and the term node of term 𝑡. Term low-level code to interact with the data; frequency 𝑡𝐷 is calculated by counting the number of edges ∙ a database provides a rapid prototyping tool. IR re- connecting these nodes. G-CORE does not offer methods searchers are primarily interested in questions regard- 2 ing the ranking components of their problem. They http://ldbcouncil.org, last accessed June 14th, 2019. 18 Reproducible IR needs an (IR) (Graph) Query Language OSIRRC 2019, July 25, 2019, Paris, France to return properties yet, but Angles et al. [1] show how an 2.2.2 Entity ranking. For the entity ranking task, entities extension of the language would implement this; listing 2 that appear in a query article, need to be ranked. The more expresses this using the current implementation in Neo4J’s important an entity is in context of the article, the higher Cypher [4]. it should be ranked. The set of entities that appear in the CONSTRUCT (d)<-[e:appearsIn]-(t) document is provided, and, obviously, considering the docu- MATCH (d:Document) ON document_graph, ments as graphs lets us represent this information directly. (t:Term) ON term_graph For ranking the entities one might want to know the collec- WHERE t.value = query_t.value tion statistics of said entities. In order to measure these, we AND d.id = doc_D.id would extend the documents by running an entity tagger. RETURN COUNT(e) We can easily express entity occurrence in articles as graph queries, and, if deemed useful to improve effectiveness, bring Listing 2: Extension to return properties in additional information from a knowledge base, an external A term’s Inverse Document Frequency (IDF) statistic can graph that we can join as easily, providing a complete and be determined similarly: construct a graph consisting of all high level specification of the exact way how the knowledge nodes that are directly connected to the term node. The base would be used in the ranking function. number of document nodes in this graph is equal to the number of documents in the collection containing query term 𝑡. Listing 3 shows how this graph can be constructed. CONSTRUCT (d) 2.3 Helping reproducibility MATCH (d:Document) ON document_graph, So far, we have focused on using a graph database for proto- (t:Term) ON term_graph typing information retrieval models. The common practice WHERE t.value = query_t.value today among IR researchers and practitioners is however to AND (d)<-[:appearsIn]-(t) implement their approach to retrieval directly on top of an Listing 3: Number of documents containing term inverted file index structure. The result is an implementa- tion with many different interwoven individual components: tokenisation, stemming, stop word removal, memory man- 2.2 Entities and metadata agement, query processing, etc. When trying to reproduce a For the BM25 example, both the relational and the graph study, it can be quite a challenge to determine which parts database would be appropriate. However, when the data of the process are different from the original study. becomes more (semi-)structured, using a graph model will For a graph database solution, data management is taken have clear advantages. Consider for example the TREC news care of by the database system. The ranking function is ex- track Soboroff et al. [8], that introduces two retrieval tasks: pressed in the graph query language. This setup allows for background linking and entity ranking. Both can be expected easy analysis if a particular component fails in the reproduc- to benefit from a richer document representation, where tion process. Ideally the original work and the reproduction the documents are enriched with their metadata and entity study both have represented their work with a graph data- annotations – perfectly modelled as a document graph. base. If reproduction fails in that case it easy to analyze whether the ranking function represented in the graph query 2.2.1 Background linking. For the background linking tasks, language is the same and whether the document representa- the news articles to be retrieved provide background informa- tions are identical. If either one is different it might explain tion to help understand a query article of interest. Annotating why it was not possible to reproduce results. the documents in the collection with metadata can be ex- Consider again the study [5]. The query expression to cap- pected to help this task. For example, articles can be linked ture the retrieval model is easily shared between research with author information. Articles written by the same author groups. The column store approach was executed on two dif- might have a higher chance of being relevant, as authors tend ferent database engines, MonetDB Boncz [3] and VectorWise to write articles within a narrow range of topics. As articles Zukowski et al. [10]. Remarkably, only those two systems can be written by multiple authors it also possible to identify produced identical effectiveness scores, even though the ex- which authors co-authored often. This also might help when ecution engines are completely different; only the queries determining if an article is suited for background reading. representing the IR model are identical. Had the effectiveness These extra annotations can easily be represented in a graph scores not been the same, that would show that the systems database, authors are simple nodes that are connected to processed the queries differently. The cause of a difference article nodes if they authored the article. The graph of all would be clear: either a bug, or differences in the data. If articles written by the same author as the query article can another group would also represent their document collection be constructed using the query shown in listing 4. on one of these column store databases, and would find differ- CONSTRUCT (d) ent effectiveness scores, the document representations must MATCH (d:document) differ. This might be the case because of text processing. It WHERE (d)-[:writtenBy]->()<-[:writtenBy]-(query_d) is immediately clear why differences probably occur when Listing 4: Documents written by same author investigating this way. 19 OSIRRC 2019, July 25, 2019, Paris, France Kamphuis and de Vries Even if the original work is not executed in the form Finally, how can continuous updates of the graph be sampled of graph queries, our approach is still useful for analysis efficiently? It might be hard do determine when we need to during a reproduction study. Because of the separation of check which parts of the graph need to be updated when. concerns between data management and the expression of the ranking function, comparing minor modifications in data 4 CONCLUSION AND (pre-)processing and ranking gives the researchers a low- RECOMMENDATIONS effort opportunity to explore the effects of these changes. If In this position paper we propose that IR practitioners should some implementation details are not provided in a scientific represent their data as graphs managed in a graph database publication, the graph database allows for quick analysis to system. We argue that this helps when trying to reproduce compare different implementations that follow from the left studies, and will simultaneously make your own research out details. more reproducible for others. 3 FUTURE DIRECTIONS ACKNOWLEDGMENTS This work is part of the research program Commit2Data Using a graph database with a graph query language instead with project number 628.011.001 (SQIREL-GRAPHS), which of a column store database with a structured query language, is (partly) financed by the Netherlands Organisation for offers new research opportunities. Firstly, we need empirical Scientific Research (NWO). research to determine whether graph databases have matured sufficiently to execute the information retrieval queries as REFERENCES efficient as their alternatives. Secondly, how should keyword [1] Renzo Angles, Marcelo Arenas, Pablo Barcelo, Peter Boncz, search be integrated in a graph query language? In the previ- George Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus ous section a suggestion on how this can be accomplished was Paradies, Stefan Plantikow, Juan Sequeda, Oskar van Rest, and Hannes Voigt. 2018. G-CORE: A Core for Future Graph Query made, but different representations may be explored, both Languages. In SIGMOD (SIGMOD ’18). ACM, New York, NY, from an efficiency and an effectiveness perspective. USA, 1421–1432. https://doi.org/10.1145/3183713.3190654 With respect to expressiveness of the language, the XQuery [2] Jaime Arguello, Fernando Diaz, Jimmy Lin, and Andrew Trotman. 2015. Sigir 2015 workshop on reproducibility, inexplicability, and Fulltext vs. NEXI debate from the XML IR era should be generalizability of results (rigor). In SIGIR. ACM, 1147–1148. revisited. Should we add operators that express clearly how [3] Peter Boncz. 2002. Monet: A next-generation DBMS kernel for the graphs are constructed from the underlying documents, query-intensive applications. Ph.D. Dissertation. Universiteit van Amsterdam. including stemming and text normalization? With the in- [4] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, creased use of neural methods, topic models, and word embed- Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Ryd- berg, Petra Selmer, and Andrés Taylor. 2018. Cypher: An evolving dings, the question remains to what extent the construction query language for property graphs. In SIGMOD. ACM, 1433– and/or their application should also be captured in declar- 1445. ative queries. Creation of a topic model can be viewed as [5] Hannes Mühleisen, Thaer Samar, Jimmy Lin, and Arjen De Vries. 2014. Old dogs are great at new tricks: Column stores for IR an aggregation operation, just like we express the compu- prototyping. In SIGIR. ACM, 863–866. tation of IDF in context of a collection graph. The nearest [6] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig neighbour search operations in high dimensions for using Macdonald, and Christina Lioma. 2006. Terrier: A high perfor- mance and scalable information retrieval platform. In Proceedings word embeddings can be expressed declaratively, and even of the OSIR Workshop. 18–25. the efficient query processing strategies necessary (e.g. cite- [7] Stephen E Robertson and Steve Walker. 1994. Some simple effective approximations to the 2-poisson model for probabilistic bond02). Future research will have to point out the need for weighted retrieval. In SIGIR. Springer, 232–241. expressing these (often merely pre-processing) steps in the [8] Ian Soboroff, Shudong Huang, and Donna Harman. 2018. TREC same graph query language, or are more naturally addressed 2018 News Track Overview. In The Twenty-Seventh Text RE- trieval Conference (TREC 2018) Proceedings. as user-defined functions that primarily call external machine [9] Trevor Strohman, Donald Metzler, Howard Turtle, and W. Bruce learning libraries. Croft. 2005. Indri: A language-model based search engine for Finally, it is an open question how to rank edges and/or complex queries (extended version). IR 407. University of Mas- sachusetts. nodes when taking into account the graph structure itself. [10] Marcin Zukowski, Mark Van de Wiel, and Peter A Boncz. 2012. When the data is represented as a graph, graph specific Vectorwise: A Vectorized Analytical DBMS.. In ICDE. 1349–1350. ranking components like pagerank might easily be taken advantage of. Here, a specific property of interest is the notion of path expressions offered by the G-CORE query language. Graph structures often appear in the context of social media. In this context the graph constantly changes. New nodes and edges are created all the time, and nodes and edges might also leave the network. It will be useful to consider the aspect that the graph structure constantly changes. So, it is finally of interest to investigate which data structures and algorithms are efficient in presence of continuous updates. 20