PISA: Performant Indexes and Search for Academia

Antonio Mallia, Michał Siedlaczek
antonio.mallia@nyu.edu, michal.siedlaczek@nyu.edu
New York University, New York, US

Joel Mackenzie
joel.mackenzie@rmit.edu.au
RMIT University, Melbourne, Australia

Torsten Suel
torsten.suel@nyu.edu
New York University, New York, US

ABSTRACT

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.

KEYWORDS

Search Engine, Open-Source System, Replicability

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1 INTRODUCTION

Reproducibility, replicability, and generalizability have become increasingly important within the Information Retrieval (IR) community. For example, weak baselines [3, 18] are often used as comparison points against newly proposed systems, resulting in what often appear to be large improvements. One possible reason that weak baselines are used is that they are usually simple and well established, making them easy to reproduce or replicate. Indeed, replicating experimental runs is not a trivial task; minor changes in software, datasets, and even hardware can result in significant changes to experimental runs [10]. To this end, the 2019 Open Source Information Retrieval Replicability Challenge (OSIRRC) brings together academic groups with the aim of defining a reusable framework for easily running IR experiments, with a particular focus on replicability, where a different team (to those who proposed the system) uses the original experimental artifacts to replicate the given result. In an attempt to improve replicability, the OSIRRC workshop proposes to package and deploy various IR systems within a Docker container (https://www.docker.com/), which enables low-effort replication by reducing many experimental confounders.

The goal of this paper is to give an overview of the PISA system and to outline the process of building replicable runs from PISA with Docker. We outline the particulars of our submitted runs, and discuss where PISA is suited for use in IR experimentation. The remainder of this paper is structured as follows. In Section 2 we describe some of the core functionality that makes PISA the state-of-the-art for efficient search. Section 3 outlines the particular runs that were deployed for the workshop, and shows some reference experiments across a variety of collections. In Section 4, we briefly outline the future of the PISA system. Finally, we conclude this work in Section 5.

2 THE PISA ENGINE

PISA (https://github.com/pisa-engine/pisa) is an open source library implementing text indexing and search, primarily targeted for use in an academic setting. PISA implements a range of state-of-the-art indexing and search algorithms, making it useful for researchers to easily experiment with new technologies, especially those concerned with efficiency. Nevertheless, we strive for much more than just an efficient implementation. With a clean and extensible design and API, PISA provides a general framework that can be employed for miscellaneous research tasks, such as parsing, ranking, sharding, index compression, document reordering, and query processing, to name a few.

PISA started off as a fork of the ds2i library (https://github.com/ot/ds2i) by Ottaviano et al., which was used for prior research on efficient index representations [26, 27]. Since then, PISA has gone through substantial changes, and now considerably extends the original library. PISA is still being actively developed, integrating new features and improving its design and implementation at regular intervals.

2.1 Design Overview

PISA is designed to be efficient, extensible, and easy to use. We now briefly outline some of the design aspects of PISA.

Modern Implementation. The PISA engine itself is built using C++17, making use of many new features in the C++ standard. This allows us to ensure the implementations are both efficient and understandable, as some of the newest language features make for cleaner code and APIs. We aim to adhere to best design practices, such as RAII (Resource Acquisition Is Initialization), the C++ Core Guidelines (https://github.com/isocpp/CppCoreGuidelines), aided by the Guidelines Support Library (https://github.com/Microsoft/GSL), and strongly-typed aliases, all of which result in safer and cleaner code without sacrificing runtime performance.

Performance. One of the biggest advantages of C++ is its performance. Control over data memory layout allows us to implement and store efficient data structures with little to no runtime overhead. Furthermore, we make use of low-level optimizations, such as CPU intrinsics, branch prediction hinting, and SIMD instructions, which are especially important for efficiently encoding and decoding postings lists. Memory-mapped files provide fast and easy access to data structures persisted on disk. We also avoid the unnecessary indirection of runtime polymorphism in performance-sensitive code in favor of the static polymorphism of C++ templates and metaprogramming. Our performance is also much more predictable than when using languages with garbage collection. Finally, we suffer no runtime overhead as is the case with VM-based or interpreted languages.

Extensibility. Another important design aspect of PISA is extensibility. For example, interfaces are exposed which allow new components to be plugged in easily, such as different parsers, stemmers, and compression codecs. This is achieved through heavy use of generic programming, similar to that provided by the C++ Standard Template Library. For example, an encoding scheme is as much a parameter of an index as a custom allocator is a parameter of std::vector. By decoupling different parts of the framework, we provide an easy way of extending the library, both in future iterations of the project and by users of the library.

2.2 Feature Overview

In this section, we take a short tour of several important features of our system. We briefly discuss the indexing pipeline, document reordering, scoring, and the implemented retrieval methods.

Collection Parsing. The objective of parsing is to represent a given collection as a forward index, where each term is assigned a unique numerical ID, and each document consists of a list of such identifiers. This is a non-trivial task that involves several steps that can be critical to retrieval performance. First, the document content must be accessed by parsing a certain data exchange format, such as WARC, JSON, or XML. The document itself is typically represented by HTML, XML, or a custom annotated format, which must be parsed to retrieve the underlying text. The text must then be tokenized, and the resulting words are typically stemmed before indexing.

PISA supports a selection of file and content parsers. The parsing tool accepts the input formats of many standard IR collections, such as ClueWeb09 (https://lemurproject.org/clueweb09/), ClueWeb12 (https://lemurproject.org/clueweb12/), GOV2 (http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm), Robust04 (https://trec.nist.gov/data/robust/04.guidelines.html), Washington Post (https://trec.nist.gov/data/wapost/), and New York Times (https://catalog.ldc.upenn.edu/LDC2008T19). We also provide an HTML content parser, and the Porter2 [31] stemming algorithm for the English language. As discussed in Section 2.1, PISA is designed to allow new components, such as parsers or stemmers, to be plugged in with low implementation overhead.

As part of a forward index, we also encode a term lexicon. This is simply a mapping between strings and numerical IDs. We represent it as a payload vector. The structure is divided into two memory areas: the first is an array of integers representing payload offsets, while the other stores the payloads (strings). This representation allows us to quickly retrieve a word at a given position (which determines its ID) directly from disk using memory mapping. We achieve string lookups by assigning term IDs in lexicographical order and performing binary search. Note that reassigning term IDs requires little overhead, as it is applied directly when a number of small index batches are merged together. This design decision enables us to provide a set of command line tools to quickly access index data without unnecessary index loading overhead. Document titles (such as TREC IDs) are stored using the same structure but without sorting them first, as the order of the documents is established via an index reordering stage, described below.

The entire parsing process is performed in parallel when executed on a multi-core architecture. The forward index can be created under tight memory constraints by dividing the corpus and processing it in batches, and then merging the resulting forward indexes at the end. Currently, PISA only supports merging forward indexes together prior to generating the canonical inverted index. However, future work will aim to allow index updates through efficient inverted index merging operations.

Indexing. Once the parsing phase is complete, the forward index containing a collection can be used to build an inverted index in a process called inverting. The product of this phase is an inverted index in the canonical format. This representation is very similar to the forward index, but in reverse: it is a collection of terms, each containing a list of document IDs. The canonical representation is stored in an uncompressed and universally readable format, which simply uses binary sequences to represent lists. There are a few advantages to storing the canonical representation. Firstly, it allows various transformations, such as document reordering or index pruning, to be performed on the index before storing it in its final compressed form. Secondly, it allows different final representations to be built rapidly, such as indexes that use different compression algorithms. Thirdly, it allows PISA to be used to create an inverted index without requiring PISA beyond this point, making it easy for experimenters to import the canonical index into their own IR system.

Document Reordering. Document reordering corresponds to reassigning the document identifiers within the inverted index [4]. It generally aims to minimize the cost of representing the inverted index with respect to storage consumption. However, reordering can also be used to minimize other cost functions, such as query processing speed [41]. Interestingly, optimizing for space consumption has been empirically shown to speed up query processing [14, 15, 24], making document reordering an attractive step during indexing. In theory, index reordering can take place either on an existing inverted index or before the inverted index is constructed. PISA opts to use the canonical index as both input and output for the document reordering step, as this allows a generic reordering scheme to be used which can be easily extended to various reordering techniques, and allows the reordering functionality to be used without requiring further use of PISA. Many approaches for document reordering exist, including random ordering, reordering by URL [33], MinHash ordering [6, 9], and recursive graph bisection [13]. PISA supports efficient index reordering for all of the above techniques [21].

Figure 1: Index building pipeline in PISA. A collection is first parsed and encoded in a forward index. Subsequently, it is inverted and stored in the canonical inverted index format. This can be used to efficiently reorder documents. Eventually, a compressed representation is produced, which will be used at query time. Additional data might be extracted, depending on the algorithms used. The simplicity of the (uncompressed) inverted index format makes it easy to convert it to any given format used by another framework.
Index Compression. Given the extremely large collections indexed by current search engines, even a single node of a large search cluster typically contains many billions of integers in its index structure. In particular, compression of posting lists is of utmost importance, as they account for much of the data size and access costs. Compressing an inverted index introduces a twofold advantage over a non-compressed representation: it reduces the size of the index, and it allows us to better exploit the memory hierarchy, which consequently speeds up query processing. Compression allows the space requirements of indexes to be substantially decreased without loss of information.

The simple and extensible design of PISA allows new compression approaches to be plugged in easily. As such, a range of state-of-the-art compression schemes are currently supported, including variable-byte encoders (VarIntGB [12], VarInt-G8IU [34], MaskedVByte [30], and StreamVByte [17]), word-aligned encoders (Simple8b [2], Simple16 [43], QMX [35, 37], and SIMD-BP128 [16]), monotonic encoders (Interpolative [25], EF [40], and PEF [27]), and frame-of-reference encoders (Opt-PFD [42]). Oftentimes, the choice of encoding depends on both time and space constraints, as compression schemes usually trade off space efficiency for either encoding or decoding performance, or both. We refer the reader to [24] for more details.

Scoring. Currently, BM25 [32] is used as the weighting model for ranked retrieval. BM25 is a simple yet effective ranker for processing bag-of-words queries, and promotes effective dynamic pruning [28]. Given a document d and a query q, we use the following formulation of BM25:

    BM25(d, q) = sum_{t in q} IDF(t) · TF(d, t),                                    (1)

    IDF(t) = log( (N − f_t + 0.5) / (f_t + 0.5) ),                                  (2)

    TF(d, t) = ( f_{d,t} · (k_1 + 1) ) / ( f_{d,t} + k_1 · (1 − b + b · l_d / l_avg) ),   (3)

where N is the number of documents in the collection, f_t is the document frequency of term t, f_{d,t} is the frequency of t in d, l_d is the length of document d, and l_avg is the average document length. We set the parameters k_1 = 0.9 and b = 0.4, as described by Trotman et al. [36]. For a more exhaustive examination of BM25 variants, we refer the reader to the work by Trotman et al. [38].

Search. Because PISA supports a document-ordered index organization, both Boolean and scored conjunctions or disjunctions can be evaluated, exploiting either a Document-at-a-Time or a Term-at-a-Time index traversal strategy. Furthermore, PISA supports a range of state-of-the-art dynamic pruning algorithms such as MaxScore [39] and WAND [5], and their Block-Max counterparts, Block-Max MaxScore (BMM) [7] and Block-Max WAND (BMW) [14].

In order to facilitate these dynamic pruning algorithms, an auxiliary index metadata structure must be built, which stores the upper-bound score information required to enable efficient dynamic pruning. It can be built per posting list (for algorithms like WAND and MaxScore), or for each fixed-size block (for the Block-Max variants). In addition, variable-size blocks can be built (in lieu of fixed-size blocks) to support the variable-block family of Block-Max algorithms listed above, such as Variable Block-Max WAND (VBMW) [22, 23]. Ranked conjunctions are also supported, using the Ranked AND or (Variable) Block-Max Ranked AND (BMA) [14] algorithms.

Indeed, the logical blocks stored in the index metadata are decoupled from the compressed blocks inside the inverted index. The advantage of storing the metadata independently from the inverted index is that the metadata depends only on the canonical index, needs to be computed only once, and does not change if the underlying compression codec is changed.

PISA provides two ways to experiment with query retrieval. The first performs end-to-end search for a given list of queries, and prints out the results in TREC format. It can also be used to evaluate query speed, as was done for this workshop. Additionally, we provide a more granular approach, which focuses on comparing different retrieval methods directly. Here, we only report the time taken to fetch posting lists and perform search, excluding lexicon lookups and the printing of results to the standard output or a file. We encourage the interested reader to refer to PISA's documentation (https://pisa.readthedocs.io/en/latest/) for more details about running experiments.

            Robust04  Core17  Core18  Gov2  ClueWeb09  ClueWeb12
    λ         16.5      15     13.5    17     22.5       25.5

Table 1: Values of λ for the given collections using the Gumbo parser, Porter2 stemmer, and reordering with recursive graph bisection. These values yield an average block size of 40 ± 0.5 for the variable-block metadata.

                 Documents        Terms        Postings
    Robust04       528,155      587,561     107,481,358
    Core17       1,855,660    1,048,294     448,998,765
    Core18         595,037      621,941     191,042,917
    Gov2        25,205,178   32,407,061   5,264,576,636
    ClueWeb09   50,220,110   87,896,391  14,996,638,171
    ClueWeb12   52,343,021  133,248,235  14,173,094,439

Table 2: Quantitative properties of our indexes.

    Collection  Track             Topics              # Topics
    Robust04    Robust '04        301–450, 601–700    250
    Core17      Common Core '17   301–450, 601–700†   250
    Core18      Common Core '18   321–825‡            50
    Gov2        Terabyte '04–'06  701–850             150
    ClueWeb09   Web '09–'12       51–200              150
    ClueWeb12   Web '13–'14       201–300             100

Table 3: Query topics used in the experiments. Note that the Core17 topics are the same as the Robust04 topics, but some were modified to reflect temporal changes (†), and Core18 used 25 topics from Core17 and 25 new topics (‡).

3 REPRODUCIBLE EXPERIMENTATION

In the spirit of OSIRRC, we utilize the software and metrics made available by the organizers, including the jig (https://github.com/osirrc/jig/) and the trec_eval tool (https://github.com/usnistgov/trec_eval). In addition, we have decided to provide some further information and reference experiments that we consider important.

3.1 Default Runs

Given the many possibilities for the various components of the PISA system, we now outline the default system configuration for the OSIRRC workshop.
Further information can be found in the documentation of our OSIRRC system (https://github.com/osirrc/pisa-docker). Note that the block size for the variable-block indexes depends on a parameter λ [22]. In order to obtain the desired average block size for the variable blocks, the value of λ was searched for offline, and differs for each collection. For convenience, we tabulate the values of λ in Table 1. Note that these values of λ only apply when using the same parsing, stemming, and reordering as listed below.

• Parsing: Gumbo (https://github.com/google/gumbo-parser) with Porter2 stemming; no stopwords removed. We discard the content of any document with over 1,000 HTML parsing errors.
• Reordering: Recursive graph bisection. We optimize the objective function using the posting lists of length at least 4,096.
• Compression: SIMD-BP128 with a posting list block size of 128.
• Scoring: BM25 with k_1 = 0.9 and b = 0.4.
• Index Metadata: Variable blocks with a mean block size of 40 ± 0.5.
• Index Traversal: Variable Block-Max WAND.

3.2 Experimental Setup

We now briefly outline the experimental setup and the resources used for our experimentation.

Datasets. We performed our experiments on the following text collections:

• Robust04 consists of newswire articles from a variety of sources from the late 1980s through to the mid 1990s.
• Core17 corresponds to the New York Times news collection, which contains news articles between 1987 and 2007.
• Core18 is the TREC Washington Post Corpus, which consists of news articles and blog posts from January 2012 through August 2017.
• Gov2 is the TREC 2004 Terabyte Track test collection, consisting of around 25 million .gov sites crawled in early 2004; the documents are truncated to 256 kB.
• ClueWeb09 is the ClueWeb 2009 Category B collection, consisting of around 50 million English web pages crawled between January and February, 2009.
• ClueWeb12 is the ClueWeb 2012 Category B-13 collection, which contains around 52 million English web pages crawled between February and May, 2012.

Some quantitative properties of these collections are summarized in Table 2. The first three are relatively small, and contain newswire data. The remaining corpora are significantly larger, containing samples of the Web. Thus, the latter two should be more indicative of any differences in query efficiency. In fact, each of these can be thought of as representing a single shard in a large distributed search system.

Test queries. Each given collection has a set of test queries from various TREC tracks, which we use to validate the effectiveness of our system. These queries are described in Table 3.

Testing details. All experiments were conducted on a machine with two Intel Xeon E5-2667 v2 CPUs, with a total of 32 cores clocked at 3.30 GHz, and 256 GiB of RAM, running Linux 4.15.0. Furthermore, the experiments presented here are deployed within the Docker framework. Although we believe that this may cause a slight reduction in the efficiency of the presented algorithms, we preserve this setup in the spirit of the workshop and comparability. We leave further investigation of the potential overhead of Docker containers as future work.
A note on ClueWeb12. In preliminary experiments, we found that the memory consumption for reordering the ClueWeb12 index was high, which slowed down the indexing process considerably. Thus, we opted to skip reordering the ClueWeb12 collection in the following experiments, and our results are reported on an index that uses the default (crawl) order. Since the index order impacts the value of λ, we use λ = 26, which results in variable-block metadata with a mean block size in the desired range of 40 ± 0.5. Note that this value differs from the one reported in Table 1, which is correct if reordering is applied based on recursive graph bisection (see Section 3.1).

3.3 Results and Discussion

We now present our reference experiments, which involve end-to-end processing of each given collection.

Indexing and Compression. The HTML content of each document was extracted with the Gumbo parser. We then extracted three kinds of tokens: alphanumeric strings, acronyms, and possessives, which were then stemmed using the Porter2 algorithm. We reordered documents using the recursive graph bisection algorithm, which is known to improve both compression and query performance [13, 21, 24]. Then we compressed the index with SIMD-BP128 encoding, which has been shown to exhibit one of the best space-speed trade-offs [24].

Table 4 summarizes indexing times broken down into individual phases, while Table 5 shows compressed inverted index sizes as well as the average numbers of bits used to encode document gaps and frequencies. The entire building process was executed with 32 cores available; however, at the time of writing, only some parts of the pipeline support parallel execution. We also note that the index reordering step is usually the most expensive step in our indexing pipeline. If a fast indexing time is of high importance, this step can be omitted, as we did for ClueWeb12. Alternatively, less expensive reordering operations can be used. However, skipping the index reordering stage (or using a less effective reordering technique) will result in a larger inverted index and less efficient query-time performance.

                  Parse    Invert   Reorder   Compress  Metadata    Total
    Robust04    0:06:22   0:00:08   0:01:21    0:00:07   0:00:44   0:08:42
    Core17      0:11:41   0:00:42   0:07:18    0:00:14   0:02:59   0:22:54
    Core18      0:10:42   0:00:14   0:02:11    0:00:07   0:01:11   0:14:25
    Gov2        1:37:52   1:00:12   2:28:04    0:06:42   0:37:04   5:49:52
    ClueWeb09   4:08:08   3:11:50  10:28:30    0:32:42   2:01:01  20:22:12
    ClueWeb12   5:09:58   3:27:51      —       0:34:55   2:11:46  11:24:30

Table 4: Indexing times broken down into five phases: parsing, inverting, reordering, compression, and index metadata construction. Times are reported as hours:minutes:seconds.

                Index size (MiB)  Docs (bpi)  Freqs (bpi)
    Robust04            136.88        7.48         3.21
    Core17              545.90        7.07         3.13
    Core18              238.33        7.24         3.22
    Gov2              5,410.89        5.59         3.03
    ClueWeb09        20,715.29        7.96         3.63
    ClueWeb12        23,206.16        9.22         4.52

Table 5: Total index size and average number of bits per integer used to encode documents and frequencies within posting lists.

System Effectiveness. Next, we outline the effectiveness of the PISA system. In particular, we process rank-safe, disjunctive, top-k queries to depth k = 1,000. Since processing is rank-safe, all of the disjunctive index traversal algorithms result in the same top-k set. Table 6 reports the effectiveness for Mean Average Precision (MAP), Precision at rank 30 (P@30), and Normalized Discounted Cumulative Gain at rank 20 (NDCG@20).

                Topics    MAP     P@30    NDCG@20
    Robust04    All      0.2534  0.3120   0.4221
    Core17      All      0.2078  0.4260   0.3898
    Core18      All      0.2384  0.3500   0.3927
    Gov2        701-750  0.2638  0.4776   0.4070
                751-800  0.3305  0.5487   0.5073
                801-850  0.2950  0.4680   0.4925
    ClueWeb09   51-100   0.1009  0.2521   0.1509
                101-150  0.1093  0.2507   0.2177
                151-200  0.1054  0.2100   0.1311
    ClueWeb12   201-250  0.0449  0.1940   0.1529
                251-300  0.0217  0.1240   0.1484

Table 6: The effectiveness of the submitted run for each respective corpus.

Query Efficiency. To measure the efficiency of query processing, we measure how long it takes to process the entire query log for each collection. We use 32 threads to concurrently retrieve the top-k documents for all queries using either the MaxScore or the VBMW algorithm, with a single thread processing a single query at a time. MaxScore has been shown to outperform other algorithms for large values of k on the Gov2 and ClueWeb09 collections [24]. Table 7 shows the results. While MaxScore usually outperforms VBMW, we did not optimize the block size of the index metadata, so comparisons should be made with caution. Indeed, VBMW is likely to outperform MaxScore with optimized blocks and small values of k. For a more detailed analysis of per-query latency within PISA, we refer the interested reader to the recent work by Mallia et al. [24].

                     k = 10             k = 1,000
                MaxScore   VBMW    MaxScore   VBMW
    Robust04         7       7         21       26
    Core17          10       8         16       15
    Core18           5       4         12       14
    Gov2           115      99        215      200
    ClueWeb09      225     138        285      424
    ClueWeb12      220     415        248      842

Table 7: Time taken to process the entire query log for each collection. Time is reported in milliseconds.

3.4 Discussion

PISA is built for performance. We are able to rapidly process each query set thanks to efficient document retrieval algorithms and extremely fast compression. On the other hand, as we have shown, SIMD-BP128 encoding also exhibits a reasonable compression ratio, which allows us to store the index in main memory. We encourage the reader to study the work by Mallia et al. [24] for more information about query efficiency under different retrieval and compression methods.

At present, our query retrieval is tailored towards fast candidate selection, as we lack any complex ranking functionality, such as a learning-to-rank document reranking stage. However, the effectiveness we obtain using BM25 is consistent with other results found in the literature [19].

Furthermore, we provide a generic index building pipeline, which can be easily customized to one's needs. We offload most of the computationally intensive operations onto the initial stages of indexing to speed up experiments with many configurations; in particular, to deliver additional indexes with different integer encodings quickly and easily.

As per the workshop rules, we deliver a Docker image which reproduces the presented results. Note that the initial version of the image was derived from an image with a precompiled distribution of PISA. However, we quickly discovered this solution was not portable. The source of our issues was compiling the code with AVX2 support: once compiled, the binaries could not be executed on a machine that does not support AVX2. One solution could be to cross-compile and provide different versions of the image. However, we chose to simply distribute the source code, to be compiled at the initial stage of an experimental run.

4 FUTURE PLANS

Despite its clear strengths, PISA is still a relatively young project, aspiring to become a more widely used tool for IR experimentation. We recognize that many relevant features can still be developed to further enrich the framework. We have every intention of pursuing these in the near future.

An obvious direction is to continue our work on query performance. For instance, we intend to support precomputing quantized partial scores in order to further improve candidate selection performance [11]. We are also considering implementing other traversal techniques, including known approaches, such as Score-at-a-Time methods [1, 20], as well as novel techniques.

The next step would be to implement more complex document rankings based on learning-to-rank. Many of the data structures required for feature extraction are indeed already in place. We would also like to enhance our query retrieval pipeline with ranking cascades that are capable of applying learned models [8].

Other planned features include query expansion, content extraction (template detection, boilerplate removal), sharding, and distributed indexes. Work on some of these has in fact already started.

5 CONCLUSION

PISA is a relative newcomer on the scene of open source IR software, yet it has already proven its many benefits, including a flexible design which is specifically tailored for use in research. Indeed, PISA has been successfully used in several recent research papers [21, 23, 24, 29]. One of the indisputable advantages of PISA is its extremely fast query execution, achieved by careful optimization and the zero-cost abstractions of C++. Furthermore, it supports a multitude of state-of-the-art compression and query processing techniques that can be used interchangeably.

Although there are still several shortcomings, these are mostly due to the project's young age, and we hope to address them very soon. Furthermore, we plan to continue enhancing the system with novel solutions. Indeed, a good amount of time has been spent on PISA to provide a high quality experimental IR framework, not only in terms of performance, but also from a software engineering point of view. We use modern technologies and libraries, continuous integration, and test suites to ensure the quality of our code and the correctness of our implementations. We encourage any interested researchers to get involved with the PISA project.

ACKNOWLEDGEMENTS

This research was supported by NSF Grant IIS-1718680, a grant from Amazon, and the Australian Research Training Program Scholarship.

REFERENCES
[1] V. N. Anh, O. de Kretser, and A. Moffat. 2001. Vector-space ranking with effective early termination. In Proc. SIGIR. 35–42.
[2] V. N. Anh and A. Moffat. 2010. Index compression using 64-bit words. Soft. Prac. & Exp. 40, 2 (2010), 131–147.
[3] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. 2009. Improvements that don't add up: Ad-hoc retrieval results since 1998. In Proc. CIKM. 601–610.
[4] D. Blandford and G. Blelloch. 2002. Index Compression Through Document Reordering. In Proc. DCC. 342–352.
[5] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. 2003. Efficient Query Evaluation Using a Two-level Retrieval Process. In Proc. CIKM. 426–434.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. 2000. Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60, 3 (2000), 630–659.
[7] K. Chakrabarti, S. Chaudhuri, and V. Ganti. 2011. Interval-based pruning for top-k processing over compressed lists. In Proc. ICDE. 709–720.
[8] R-C. Chen, L. Gallagher, R. Blanco, and J. S. Culpepper. 2017. Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval. In Proc. SIGIR. 445–454.
[9] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. 2009. On Compressing Social Networks. In Proc. SIGKDD. 219–228.
[10] M. Crane. 2018. Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. Trans. ACL 6 (2018), 241–252.
[11] M. Crane, A. Trotman, and R. O'Keefe. 2013. Maintaining discriminatory power in quantized indexes. In Proc. CIKM. 1221–1224.
[12] J. Dean. 2009. Challenges in Building Large-scale Information Retrieval Systems: Invited Talk. In Proc. WSDM. 1–1.
[13] L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. 2016. Compressing Graphs and Indexes with Recursive Graph Bisection. In Proc. KDD. 1535–1544.
[14] S. Ding and T. Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proc. SIGIR. 993–1002.
[15] D. Hawking and T. Jones. 2012. Reordering an Index to Speed Query Processing Without Loss of Effectiveness. In Proc. ADCS. 17–24.
[16] D. Lemire and L. Boytsov. 2015. Decoding billions of integers per second through vectorization. Soft. Prac. & Exp. 45, 1 (2015), 1–29.
[17] D. Lemire, N. Kurz, and C. Rupp. 2018. Stream VByte: Faster byte-oriented integer compression. Inf. Proc. Letters 130 (2018), 1–6.
[18] J. Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (2019), 40–51.
[19] J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, and S. Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proc. ECIR. 408–420.
[20] J. Lin and A. Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In Proc. ICTIR. 301–304.
[21] J. Mackenzie, A. Mallia, M. Petri, J. S. Culpepper, and T. Suel. 2019. Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In Proc. ECIR. 339–352.
[22] A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proc. SIGIR. 625–634.
[23] A. Mallia and E. Porciani. 2019. Faster BlockMax WAND with longer skipping. In Proc. ECIR. 771–778.
[24] A. Mallia, M. Siedlaczek, and T. Suel. 2019. An Experimental Study of Index Compression and DAAT Query Processing Methods. In Proc. ECIR. 353–368.
[25] A. Moffat and L. Stuiver. 2000. Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3, 1 (2000), 25–47.
[26] G. Ottaviano, N. Tonellotto, and R. Venturini. 2015. Optimal Space-time Tradeoffs for Inverted Indexes. In Proc. WSDM. 47–56.
[27] G. Ottaviano and R. Venturini. 2014. Partitioned Elias-Fano indexes. In Proc. SIGIR. 273–282.
[28] M. Petri, J. S. Culpepper, and A. Moffat. 2013. Exploring the Magic of WAND. In Proc. Aust. Doc. Comp. Symp. 58–65.
[29] M. Petri, A. Moffat, J. Mackenzie, J. S. Culpepper, and D. Beck. 2019. Accelerated Query Processing Via Similarity Score Prediction. In Proc. SIGIR. To Appear.
[30] J. Plaisance, N. Kurz, and D. Lemire. 2015. Vectorized VByte Decoding. In Int. Symp. Web Alg.
[31] M. F. Porter. 1997. An Algorithm for Suffix Stripping. In Readings in IR. 313–316.
[32] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. 1994. Okapi at TREC-3. In Proc. TREC.
[33] F. Silvestri. 2007. Sorting out the Document Identifier Assignment Problem. In Proc. ECIR. 101–112.
[34] A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. 2011. SIMD-based Decoding of Posting Lists. In Proc. CIKM. 317–326.
[35] A. Trotman. 2014. Compression, SIMD, and Postings Lists. In Proc. Aust. Doc. Comp. Symp. 50.50–50.57.
[36] A. Trotman, X-F. Jia, and M. Crane. 2012. Towards an efficient and effective search engine. In Wkshp. Open Source IR. 40–47.
[37] A. Trotman and J. Lin. 2016. In Vacuo and In Situ Evaluation of SIMD Codecs. In Proc. Aust. Doc. Comp. Symp. 1–8.
[38] A. Trotman, A. Puurula, and B. Burgess. 2014. Improvements to BM25 and Language Models Examined. In Proc. Aust. Doc. Comp. Symp. 58–65.
[39] H. R. Turtle and J. Flood. 1995. Query Evaluation: Strategies and Optimizations. Inf. Proc. & Man. 31, 6 (1995), 831–850.
[40] S. Vigna. 2013. Quasi-succinct indices. In Proc. WSDM. 83–92.
[41] Q. Wang and T. Suel. 2019. Document Reordering for Faster Intersection. Proc. VLDB 12, 5 (2019), 475–487.
[42] H. Yan, S. Ding, and T. Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proc. WWW. 401–410.
[43] J. Zhang, X. Long, and T. Suel. 2008. Performance of Compressed Inverted List Caching in Search Engines. In Proc. WWW. 387–396.