PISA: Performant Indexes and Search for Academia

Antonio Mallia, New York University, New York, US (antonio.mallia@nyu.edu)
Michał Siedlaczek, New York University, New York, US (michal.siedlaczek@nyu.edu)
Joel Mackenzie, RMIT University, Melbourne, Australia (joel.mackenzie@rmit.edu.au)
Torsten Suel, New York University, New York, US (torsten.suel@nyu.edu)
ABSTRACT

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future directions and challenges for the PISA system.

KEYWORDS

Search Engine, Open-Source System, Replicability

1 INTRODUCTION

Reproducibility, replicability, and generalizability have become increasingly important within the Information Retrieval (IR) community. For example, weak baselines [3, 18] are often used as comparison points against newly proposed systems, resulting in what often appear to be large improvements. One possible reason that weak baselines are used is that they are usually simple and well established, making it easy to reproduce or replicate them. Indeed, replicating experimental runs is not a trivial task; minor changes in software, datasets, and even hardware can result in significant changes to experimental outcomes [10]. To this end, the 2019 Open Source Information Retrieval Replicability Challenge (OSIRRC) brings together academic groups with the aim of defining a reusable framework for easily running IR experiments, with a particular focus on replicability, where a different team (to those who proposed the system) uses the original experimental artifacts to replicate the given result. In an attempt to improve replicability, the OSIRRC workshop proposes to package and deploy various IR systems within a Docker container,¹ which enables low-effort replication by reducing many experimental confounders.

The goal of this paper is to give an overview of the PISA system and to outline the process of building replicable runs from PISA with Docker. We outline the particulars of our submitted runs, and discuss where PISA is suited for use in IR experimentation. The remainder of this paper is structured as follows. In Section 2 we describe some of the core functionality that makes PISA the state of the art for efficient search. Section 3 outlines the particular runs that were deployed for the workshop, and shows some reference experiments across a variety of collections. In Section 4, we briefly outline the future of the PISA system. Finally, we conclude this work in Section 5.

2 THE PISA ENGINE

PISA² is an open source library implementing text indexing and search, primarily targeted for use in an academic setting. PISA implements a range of state-of-the-art indexing and search algorithms, making it useful for researchers to easily experiment with new technologies, especially those concerned with efficiency. Nevertheless, we strive for much more than just an efficient implementation. With a clean and extensible design and API, PISA provides a general framework that can be employed for miscellaneous research tasks, such as parsing, ranking, sharding, index compression, document reordering, and query processing, to name a few.

PISA started off as a fork of the ds2i library³ by Ottaviano et al., which was used for prior research on efficient index representations [26, 27]. Since then, PISA has gone through substantial changes, and now considerably extends the original library. PISA is still being actively developed, integrating new features and improving its design and implementation at regular intervals.

2.1 Design Overview

PISA is designed to be efficient, extensible, and easy to use. We now briefly outline some of the design aspects of PISA.

Modern Implementation. The PISA engine itself is built using C++17, making use of many new features in the C++ standard. This allows us to ensure the implementations are both efficient and understandable, as some of the newest language features make for cleaner code and APIs. We aim to adhere to best design practices, such as RAII (Resource Acquisition Is Initialization), the C++ Core Guidelines⁴ (aided by the Guidelines Support Library⁵), and strongly-typed aliases, all of which result in safer and cleaner code without sacrificing runtime performance.

¹ https://www.docker.com/
² https://github.com/pisa-engine/pisa
³ https://github.com/ot/ds2i
⁴ https://github.com/isocpp/CppCoreGuidelines
⁵ https://github.com/Microsoft/GSL

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.






Performance. One of the biggest advantages of C++ is its performance. Control over data memory layout allows us to implement and store efficient data structures with little to no runtime overhead. Furthermore, we make use of low-level optimizations, such as CPU intrinsics, branch prediction hinting, and SIMD instructions, which are especially important for efficiently encoding and decoding postings lists. Memory-mapped files provide fast and easy access to data structures persisted on disk. We also avoid the unnecessary indirection of runtime polymorphism in performance-sensitive code in favor of the static polymorphism of C++ templates and metaprogramming. Our performance is also much more predictable than that of languages with garbage collection. Finally, we avoid the runtime overhead incurred by VM-based or interpreted languages.

Extensibility. Another important design goal of PISA is extensibility. For example, interfaces are exposed which allow new components, such as different parsers, stemmers, and compression codecs, to be plugged in easily. This is achieved through heavy use of generic programming, similar to that provided by the C++ Standard Template Library. For example, an encoding scheme is as much a parameter of an index as a custom allocator is a parameter of std::vector. By decoupling different parts of the framework, we provide an easy way of extending the library, both in future iterations of the project and by users of the library.
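To make the analogy concrete, the following sketch shows how a compression codec can be selected as a compile-time parameter of an index type, much as an allocator parameterizes std::vector. The type names (raw_codec, inverted_index) are illustrative only and do not correspond to PISA's actual API:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    // A trivial "codec" that copies 32-bit integers byte for byte. A real
    // codec (e.g., a variable-byte or SIMD-BP128 implementation) would
    // expose the same two static functions.
    struct raw_codec {
        static std::vector<uint8_t> encode(const std::vector<uint32_t>& in) {
            std::vector<uint8_t> out(in.size() * sizeof(uint32_t));
            if (!in.empty()) { std::memcpy(out.data(), in.data(), out.size()); }
            return out;
        }
        static std::vector<uint32_t> decode(const std::vector<uint8_t>& in) {
            std::vector<uint32_t> out(in.size() / sizeof(uint32_t));
            if (!in.empty()) { std::memcpy(out.data(), in.data(), in.size()); }
            return out;
        }
    };

    // The codec is a compile-time parameter of the index, so swapping
    // compression schemes requires no change to the code that traverses it.
    template <typename Codec>
    class inverted_index {
      public:
        void append(const std::vector<uint32_t>& postings) {
            lists_.push_back(Codec::encode(postings));
        }
        std::vector<uint32_t> postings(std::size_t i) const {
            return Codec::decode(lists_[i]);
        }
      private:
        std::vector<std::vector<uint8_t>> lists_;
    };

    using raw_index = inverted_index<raw_codec>;  // the codec is chosen here

Because the codec is resolved at compile time, there is no virtual dispatch on the decoding path, which is in line with the preference for static polymorphism described above.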
2.2 Feature Overview

In this section, we take a short tour of several important features of our system. We briefly discuss the indexing pipeline, document reordering, scoring, and the implemented retrieval methods.

Parsing the Collection. The objective of parsing is to represent a given collection as a forward index, where each term is assigned a unique numerical ID, and each document consists of a list of such identifiers. This is a non-trivial task that involves several steps, each of which can be critical to retrieval performance.

First, the document content must be accessed by parsing a certain data exchange format, such as WARC, JSON, or XML. The document itself is typically represented by HTML, XML, or a custom annotated format, which must be parsed to retrieve the underlying text. The text must then be tokenized, and the resulting words are typically stemmed before indexing.

PISA supports a selection of file and content parsers. The parsing tool accepts the input formats of many standard IR collections, such as ClueWeb09,⁶ ClueWeb12,⁷ GOV2,⁸ Robust04,⁹ Washington Post,¹⁰ and New York Times.¹¹ We also provide an HTML content parser, and the Porter2 [31] stemming algorithm for the English language. As discussed in Section 2.1, PISA is designed to allow new components, such as parsers or stemmers, to be plugged in with low implementation overhead.

As part of a forward index, we also encode a term lexicon. This is simply a mapping between strings and numerical IDs. We represent it as a payload vector. The structure is divided into two memory areas: the first is an array of integers representing payload offsets, while the other stores the payloads (strings). This representation allows us to quickly retrieve the word at a given position (which determines its ID) directly from disk using memory mapping. We achieve string lookups by assigning term IDs in lexicographical order and performing binary search. Note that reassigning term IDs requires little overhead, as it is applied directly when a number of small index batches are merged together. This design decision enables us to provide a set of command line tools to quickly access index data without unnecessary index loading overhead. Document titles (such as TREC IDs) are stored using the same structure but without sorting them first, as the order of the documents is established via an index reordering stage, as described below.
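As an illustration, the sketch below models this layout in memory: an offset array followed by a buffer of concatenated, lexicographically sorted strings, with a binary search resolving a term to its ID. It is a simplified model written for this paper, not PISA's actual on-disk structure or API; in PISA the two areas are memory mapped from disk rather than held in std::vector and std::string.

    #include <cstddef>
    #include <cstdint>
    #include <optional>
    #include <string>
    #include <string_view>
    #include <vector>

    // Simplified in-memory model of a payload vector: an offset array plus a
    // single buffer holding all payloads (here, the sorted term strings).
    struct payload_vector {
        std::vector<uint64_t> offsets;  // size() == number of terms + 1
        std::string payloads;           // concatenated terms, lexicographically sorted

        std::size_t size() const { return offsets.size() - 1; }

        // The term with a given ID is simply the slice between two offsets.
        std::string_view operator[](std::size_t id) const {
            return std::string_view(payloads).substr(offsets[id], offsets[id + 1] - offsets[id]);
        }

        // Term -> ID lookup: binary search over the sorted payloads.
        std::optional<uint32_t> term_id(std::string_view term) const {
            std::size_t lo = 0, hi = size();
            while (lo < hi) {
                std::size_t mid = lo + (hi - lo) / 2;
                if ((*this)[mid] < term) { lo = mid + 1; } else { hi = mid; }
            }
            if (lo < size() && (*this)[lo] == term) { return static_cast<uint32_t>(lo); }
            return std::nullopt;
        }
    };

Because term IDs are assigned in lexicographical order, the ID of a term is simply its position in this sorted sequence, so no separate hash table is required for lookups.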
The entire parsing process is performed in parallel when executed on a multi-core architecture. The forward index can be created under tight memory constraints by dividing the corpus into batches, processing each batch separately, and then merging the resulting forward indexes at the end. Currently, PISA only supports merging forward indexes prior to generating the canonical inverted index. However, future work will aim to allow index updates through efficient inverted index merging operations.

Indexing. Once the parsing phase is complete, the forward index containing a collection can be used to build an inverted index in a process called inverting. The product of this phase is an inverted index in the canonical format. This representation is very similar to the forward index, but in reverse: it is a collection of terms, each containing a list of document IDs. The canonical representation is stored in an uncompressed and universally readable format, which simply uses binary sequences to represent lists. There are a few advantages to storing the canonical representation. Firstly, it allows various transformations, such as document reordering or index pruning, to be performed on the index before storing it in its final compressed form. Secondly, it allows different final representations to be built rapidly, such as indexes that use different compression algorithms. Thirdly, it allows PISA to be used to create an inverted index without requiring PISA beyond this point, making it easy for experimenters to import the canonical index into their own IR system.
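To illustrate what is meant by binary sequences, the sketch below reads lists from a hypothetical flat file in which every list is stored as a 32-bit length followed by that many 32-bit integers. The exact canonical layout used by PISA is documented in its repository; this reader only conveys the general idea and is not PISA code.

    #include <cstdint>
    #include <fstream>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Reads lists from a flat binary file in which each list is written as a
    // 32-bit length n followed by n 32-bit integers. This mirrors the spirit
    // of an uncompressed, universally readable canonical index.
    std::vector<std::vector<uint32_t>> read_binary_lists(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        if (!in) { throw std::runtime_error("cannot open " + path); }
        std::vector<std::vector<uint32_t>> lists;
        uint32_t length = 0;
        while (in.read(reinterpret_cast<char*>(&length), sizeof(length))) {
            std::vector<uint32_t> list(length);
            if (length > 0 &&
                !in.read(reinterpret_cast<char*>(list.data()), length * sizeof(uint32_t))) {
                throw std::runtime_error("truncated list in " + path);
            }
            lists.push_back(std::move(list));
        }
        return lists;
    }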
Document Reordering. Document reordering corresponds to reassigning the document identifiers within the inverted index [4]. It generally aims to minimize the cost of representing the inverted index with respect to storage consumption. However, reordering can also be used to minimize other cost functions, such as query processing speed [41]. Interestingly, optimizing for space consumption has been empirically shown to speed up query processing [14, 15, 24], making document reordering an attractive step during indexing. In theory, index reordering can take place either on an existing inverted index, or before the inverted index is constructed. PISA opts to use the canonical index as both input and output for the document reordering step, as this allows a generic reordering scheme to be used which can be easily extended to various reordering techniques, and allows the reordering functionality to be used without requiring further use of PISA.

Many approaches for document reordering exist, including random ordering, reordering by URL [33], MinHash ordering [6, 9], and recursive graph bisection [13]. PISA supports efficient index reordering for all of the above techniques [21].
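As a toy illustration of what reassignment involves (not PISA's implementation), the sketch below derives a new identifier assignment by sorting documents by URL and then rewrites a posting list under that assignment; MinHash ordering or recursive graph bisection would plug into the same remapping step, with only the permutation changing.

    #include <algorithm>
    #include <cstdint>
    #include <numeric>
    #include <string>
    #include <vector>

    // Returns old_to_new[old_id] = new_id, where new IDs follow URL order.
    std::vector<uint32_t> url_ordering(const std::vector<std::string>& urls) {
        std::vector<uint32_t> by_url(urls.size());
        std::iota(by_url.begin(), by_url.end(), 0);
        std::sort(by_url.begin(), by_url.end(),
                  [&](uint32_t a, uint32_t b) { return urls[a] < urls[b]; });
        std::vector<uint32_t> old_to_new(urls.size());
        for (uint32_t new_id = 0; new_id < by_url.size(); ++new_id) {
            old_to_new[by_url[new_id]] = new_id;
        }
        return old_to_new;
    }

    // Rewrites one posting list under the new assignment; the list must be
    // re-sorted because remapping does not preserve the original order.
    std::vector<uint32_t> remap(const std::vector<uint32_t>& postings,
                                const std::vector<uint32_t>& old_to_new) {
        std::vector<uint32_t> out;
        out.reserve(postings.size());
        for (uint32_t doc : postings) { out.push_back(old_to_new[doc]); }
        std::sort(out.begin(), out.end());
        return out;
    }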
⁶ https://lemurproject.org/clueweb09/
⁷ https://lemurproject.org/clueweb12/
⁸ http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm
⁹ https://trec.nist.gov/data/robust/04.guidelines.html
¹⁰ https://trec.nist.gov/data/wapost/
¹¹ https://catalog.ldc.upenn.edu/LDC2008T19






[Figure 1 is a pipeline diagram: Collection -> (parse) -> Forward Index, with an associated Term Lexicon and Document Lexicon; Forward Index -> (invert) -> Canonical Inverted Index, on which documents can be reordered; from the canonical index, (compress) produces one or more Compressed Indexes, (extract) produces Index Metadata, and (export) feeds an External System.]

Figure 1: Index building pipeline in PISA. A collection is first parsed and encoded in a forward index. Subsequently, it is inverted and stored in the canonical inverted index format. This can be used to efficiently reorder documents. Eventually, a compressed representation is produced, which will be used at query time. Additional data might be extracted, depending on the algorithms used. The simplicity of the inverted index format (uncompressed) makes it easy to convert it to any given format used by another framework.



Index Compression. Given the extremely large collections indexed by current search engines, even a single node of a large search cluster typically contains many billions of integers in its index structure. In particular, compression of posting lists is of utmost importance, as they account for much of the data size and access costs. Compressing an inverted index offers a twofold advantage over a non-compressed representation: it reduces the size of the index, and it allows us to better exploit the memory hierarchy, which consequently speeds up query processing.

Compression allows the space requirements of indexes to be substantially decreased without loss of information. The simple and extensible design of PISA allows new compression approaches to be plugged in easily. As such, a range of state-of-the-art compression schemes are currently supported, including variable-byte encoders (VarIntGB [12], VarInt-G8IU [34], MaskedVByte [30], and StreamVByte [17]), word-aligned encoders (Simple8b [2], Simple16 [43], QMX [35, 37], and SIMD-BP128 [16]), monotonic encoders (Interpolative [25], EF [40], and PEF [27]), and frame-of-reference encoders (Opt-PFD [42]).

Oftentimes, the choice of encoding depends on both the time and space constraints, as compression schemes usually trade off space efficiency for either encoding or decoding performance, or both. We refer the reader to [24] for more details.
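As a concrete, deliberately simple member of the variable-byte family mentioned above, the sketch below gap-encodes a monotonically increasing posting list using the classic 7-bits-per-byte scheme with a terminator bit. The SIMD-accelerated codecs shipped with PISA are substantially more sophisticated; this code is illustrative only.

    #include <cstdint>
    #include <vector>

    // Classic variable-byte coding of an increasing posting list: store gaps
    // between consecutive document IDs, 7 bits per byte, with the high bit
    // marking the last byte of each value.
    std::vector<uint8_t> vbyte_encode(const std::vector<uint32_t>& docids) {
        std::vector<uint8_t> out;
        uint32_t previous = 0;
        for (uint32_t docid : docids) {
            uint32_t gap = docid - previous;  // assumes docids are non-decreasing
            previous = docid;
            while (gap >= 128) {
                out.push_back(static_cast<uint8_t>(gap & 0x7F));
                gap >>= 7;
            }
            out.push_back(static_cast<uint8_t>(gap | 0x80));  // terminator byte
        }
        return out;
    }

    std::vector<uint32_t> vbyte_decode(const std::vector<uint8_t>& bytes) {
        std::vector<uint32_t> docids;
        uint32_t value = 0, shift = 0, previous = 0;
        for (uint8_t byte : bytes) {
            value |= static_cast<uint32_t>(byte & 0x7F) << shift;
            if (byte & 0x80) {  // last byte of this gap
                previous += value;
                docids.push_back(previous);
                value = 0;
                shift = 0;
            } else {
                shift += 7;
            }
        }
        return docids;
    }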
                                                                                  algorithms.
Scoring. Currently, BM25 [32] is used as the weighting model for
                                                                                      Indeed, the logical blocks stored in the index metadata are de-
ranked retrieval. BM25 is a simple yet effective ranker for process-
                                                                                  coupled from the compressed blocks inside the inverted index. The
ing bag-of-words queries, and promotes effective dynamic prun-
                                                                                  advantage of storing the metadata independently from the inverted
ing [28]. Given a document d and a query q, we use the following
                                                                                  index is that the metadata depends only on the canonical index,
formulation of BM25:
                               Õ                                                  needs to be computed only once, and does not change if the under-
                BM25(d, q) =      IDF(t) · TF(d, t),             (1)              lying compression codec is changed.
                                 t ∈q                                                 PISA provides two ways to experiment with query retrieval. The
                                                                                  first one performs end-to-end search for a given list of queries, and
                                   
                                       N − ft + 0.5
                                                                                 prints out the results in the TREC format. It can also be used to
                   IDF(t) = log                       ,                (2)        evaluate query speed, as was done for this workshop. Additionally,
                                         ft + 0.5
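For concreteness, Equations (1)–(3) translate directly into code. The sketch below is a plain restatement of the formula with the parameters above; it is not PISA's internal scorer, and the collection statistics are assumed to be supplied by the caller.

    #include <cmath>
    #include <utility>
    #include <vector>

    // Direct transcription of Equations (1)-(3) with k1 = 0.9 and b = 0.4.
    struct bm25 {
        double N;      // number of documents in the collection
        double l_avg;  // average document length
        double k1 = 0.9;
        double b = 0.4;

        double idf(double f_t) const {
            return std::log((N - f_t + 0.5) / (f_t + 0.5));
        }
        double tf(double f_dt, double l_d) const {
            return f_dt * (k1 + 1.0) / (f_dt + k1 * (1.0 - b + b * l_d / l_avg));
        }
        // One (f_t, f_dt) pair per query term occurring in document d.
        double score(const std::vector<std::pair<double, double>>& terms, double l_d) const {
            double s = 0.0;
            for (const auto& [f_t, f_dt] : terms) {
                s += idf(f_t) * tf(f_dt, l_d);
            }
            return s;
        }
    };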
Search. Because PISA supports a document-ordered index organization, both Boolean and scored conjunctions or disjunctions can be evaluated, exploiting either a Document-at-a-Time or a Term-at-a-Time index traversal strategy.

Furthermore, PISA supports a range of state-of-the-art dynamic pruning algorithms such as MaxScore [39] and WAND [5], and their Block-Max counterparts, Block-Max MaxScore (BMM) [7] and Block-Max WAND (BMW) [14].

In order to facilitate these dynamic pruning algorithms, an auxiliary index metadata structure must be built, which stores the required upper-bound score information to enable efficient dynamic pruning. It can be built per posting list (for algorithms like WAND and MaxScore), or for each fixed-sized block (for the Block-Max variants). In addition, variable-sized blocks can be built (in lieu of fixed-sized blocks) to support the variable-block family of Block-Max algorithms listed above, such as Variable Block-Max WAND (VBMW) [22, 23]. Ranked conjunctions are also supported using the Ranked AND or the (Variable) Block-Max Ranked AND (BMA) [14] algorithms.

Note that the logical blocks stored in the index metadata are decoupled from the compressed blocks inside the inverted index. The advantage of storing the metadata independently from the inverted index is that the metadata depends only on the canonical index, needs to be computed only once, and does not change if the underlying compression codec is changed.
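As a simplified illustration of this metadata (not PISA's actual structures), the following sketch computes, for a single posting list, the last document ID and an upper bound on the score contribution of each fixed-size block; a Block-Max algorithm can then skip any block whose upper bound cannot lift a candidate above the current top-k threshold. Variable-sized blocks differ only in how the block boundaries are chosen.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    struct block_max_entry {
        uint32_t last_docid;  // largest document ID covered by the block
        float max_score;      // upper bound on any score contribution in the block
    };

    // Builds fixed-size block metadata for one posting list, given the
    // per-posting score contributions (e.g., precomputed BM25 term scores).
    // docids and scores are parallel arrays of equal length.
    std::vector<block_max_entry> build_block_max(const std::vector<uint32_t>& docids,
                                                 const std::vector<float>& scores,
                                                 std::size_t block_size) {
        std::vector<block_max_entry> blocks;
        for (std::size_t begin = 0; begin < docids.size(); begin += block_size) {
            std::size_t end = std::min(begin + block_size, docids.size());
            float max_score = *std::max_element(scores.begin() + begin, scores.begin() + end);
            blocks.push_back({docids[end - 1], max_score});
        }
        return blocks;
    }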
PISA provides two ways to experiment with query retrieval. The first one performs end-to-end search for a given list of queries, and prints out the results in the TREC format. It can also be used to evaluate query speed, as was done for this workshop. Additionally, we provide a more granular approach, which focuses on comparing different retrieval methods directly. Here, we only report the time to fetch posting lists and perform the search, excluding lexicon lookups and printing results to the standard output or a file. We encourage the interested reader to refer to PISA's documentation for more details about running experiments.¹²






              Robust04   Core17   Core18   Gov2   ClueWeb09   ClueWeb12
    λ             16.5       15     13.5     17        22.5        25.5

Table 1: Values of λ for the given collections using the Gumbo parser, Porter2 stemmer, and reordering with recursive graph bisection. These values will yield an average block size of 40 ± 0.5 for the variable block metadata.

                  Documents         Terms         Postings
    Robust04        528,155       587,561      107,481,358
    Core17        1,855,660     1,048,294      448,998,765
    Core18          595,037       621,941      191,042,917
    Gov2         25,205,178    32,407,061    5,264,576,636
    ClueWeb09    50,220,110    87,896,391   14,996,638,171
    ClueWeb12    52,343,021   133,248,235   14,173,094,439

Table 2: Quantitative properties of our indexes.
3 REPRODUCIBLE EXPERIMENTATION

In the spirit of OSIRRC, we utilize the software and metrics made available by the organizers, including the jig¹³ and the trec_eval¹⁴ tool. In addition, we have decided to provide some further information and reference experiments that we consider important.

3.1 Default Runs

Given the many possibilities for the various components of the PISA system, we now outline the default system configuration for the OSIRRC workshop. Further information can be found in the documentation of our OSIRRC system.¹⁵ Note that the block size for the variable-block indexes depends on a parameter λ [22]. In order to get the desired average block size for the variable blocks, the value of λ was searched for offline, and differs for each collection. For convenience, we tabulate the values of λ in Table 1. Note that such values of λ only apply when using the same parsing, stemmer, and reordering as listed below.
• Parsing: Gumbo¹⁶ with Porter2 stemming; no stopwords removed. We discard the content of any document with over 1,000 HTML parsing errors.
• Reordering: Recursive graph bisection. We optimize the objective function using the posting lists of lengths at least 4,096.
• Compression: SIMD-BP128 with a posting list block size of 128.
• Scoring: BM25 with k1 = 0.9 and b = 0.4.
• Index Metadata: Variable blocks with a mean block size of 40 ± 0.5.
• Index Traversal: Variable Block-Max WAND.
3.2 Experimental Setup

Now, we briefly outline the experimental setup and the resources used for our experimentation.

Datasets. We performed our experiments on the following text collections:

• Robust04 consists of newswire articles from a variety of sources from the late 1980s through to the mid 1990s.
• Core17 corresponds to the New York Times news collection, which contains news articles between 1987 and 2007.
• Core18 is the TREC Washington Post Corpus, which consists of news articles and blog posts from January 2012 through August 2017.
• Gov2 is the TREC 2004 Terabyte Track test collection, consisting of around 25 million .gov sites crawled in early 2004; the documents are truncated to 256 kB.
• ClueWeb09 is the ClueWeb 2009 Category B collection, consisting of around 50 million English web pages crawled between January and February 2009.
• ClueWeb12 is the ClueWeb 2012 Category B-13 collection, which contains around 52 million English web pages crawled between February and May 2012.
Some quantitative properties of these collections are summarized in Table 2. The first three are relatively small, and contain newswire data. The remaining corpora are significantly larger, containing samples of the Web. Thus, the latter two should be more indicative of any differences in query efficiency. In fact, each of these can be thought of as representing a single shard in a large distributed search system.

Test queries. Each given collection has an associated set of test queries from various TREC tracks, which we use to validate the effectiveness of our system. These queries are described in Table 3.

    Collection   Track              Topics              # Topics
    Robust04     Robust '04         301–450, 601–700         250
    Core17       Common Core '17    301–450, 601–700†        250
    Core18       Common Core '18    321–825‡                  50
    Gov2         Terabyte '04–'06   701–850                  150
    ClueWeb09    Web '09–'12        51–200                   150
    ClueWeb12    Web '13–'14        201–300                  100

Table 3: Query topics used in the experiments. Note that the Core17 topics are the same as the Robust04 topics, but some were modified to reflect temporal changes (†), and Core18 used 25 topics from Core17 and 25 new topics (‡).

Testing details. All experiments were conducted on a machine with two Intel Xeon E5-2667 v2 CPUs, with a total of 32 cores clocked at 3.30 GHz, and 256 GiB of RAM, running Linux 4.15.0. Furthermore, the experiments presented here are deployed within the Docker framework. Although we believe that this may cause a slight reduction in the efficiency of the presented algorithms, we preserve this setup in the spirit of the workshop and comparability. We leave further investigation of the potential overhead of Docker containers as future work.

¹² https://pisa.readthedocs.io/en/latest/
¹³ https://github.com/osirrc/jig/
¹⁴ https://github.com/usnistgov/trec_eval
¹⁵ https://github.com/osirrc/pisa-docker
¹⁶ https://github.com/google/gumbo-parser






A note on ClueWeb12. In preliminary experiments, we found that the memory consumption for reordering the ClueWeb12 index was high, which slowed down the indexing process considerably. Thus, we opted to skip reordering the ClueWeb12 collection in the following experiments, and our results are reported on an index that uses the default (crawl) order. Since the index order impacts the value of λ, we use λ = 26, which results in variable-block metadata with a mean block size in the desired range of 40 ± 0.5. Note that this value differs from the one reported in Table 1, which is correct if reordering based on recursive graph bisection is applied (see Section 3.1).

3.3 Results and Discussion

We now present our reference experiments, which involve end-to-end processing of each given collection.

Indexing and Compression. The HTML content of each document was extracted with the Gumbo parser. We then extracted three kinds of tokens: alphanumeric strings, acronyms, and possessives, which were then stemmed using the Porter2 algorithm. We reordered documents using the recursive graph bisection algorithm, which is known to improve both compression and query performance [13, 21, 24]. Then we compressed the index with SIMD-BP128 encoding, which has been shown to exhibit one of the best space-speed trade-offs [24].

Table 4 summarizes indexing times broken down into individual phases, while Table 5 shows compressed inverted index sizes as well as the average number of bits used to encode document gaps and frequencies. The entire building process was executed with 32 cores available; however, at the time of writing, only some parts of the pipeline support parallel execution. We also note that index reordering is usually the most expensive step in our indexing pipeline. If a fast indexing time is of high importance, this step can be omitted, as we did for ClueWeb12. Alternatively, less expensive reordering operations can be used. However, skipping the index reordering stage (or using a less effective reordering technique) will result in a larger inverted index and less efficient query-time performance.

System Effectiveness. Next, we outline the effectiveness of the PISA system. In particular, we process rank-safe, disjunctive, top-k queries to depth k = 1,000. Since processing is rank-safe, all of the disjunctive index traversal algorithms result in the same top-k set. Table 6 reports the effectiveness in terms of Mean Average Precision (MAP), Precision at rank 30 (P@30), and Normalized Discounted Cumulative Gain at rank 20 (NDCG@20).

Query Efficiency. To measure the efficiency of query processing, we measure how long it takes to process the entire query log for each collection. We use 32 threads to concurrently retrieve the top-k documents for all queries using either the MaxScore or the VBMW algorithm, with a single thread processing a single query at a time. MaxScore has been shown to outperform other algorithms for large values of k on the Gov2 and ClueWeb09 collections [24]. Table 7 shows the results. While MaxScore usually outperforms VBMW, we did not optimize the block size of the index metadata, so comparisons should be made with caution. Indeed, VBMW is likely to outperform MaxScore with optimized blocks and small values of k. For a more detailed analysis of per-query latency within PISA, we refer the interested reader to the recent work by Mallia et al. [24].
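The timing methodology described above can be pictured with the following sketch, which measures how long a pool of worker threads takes to drain a query log, each query being processed by exactly one thread. It is an illustration of the setup, not the code used to produce the numbers in Table 7, and the process callback stands in for a retrieval algorithm such as MaxScore or VBMW.

    #include <atomic>
    #include <chrono>
    #include <cstddef>
    #include <functional>
    #include <thread>
    #include <vector>

    // Times how long num_threads workers take to process all queries, with
    // each query evaluated by exactly one thread.
    template <typename Query>
    std::chrono::milliseconds time_query_log(const std::vector<Query>& queries,
                                             const std::function<void(const Query&)>& process,
                                             std::size_t num_threads = 32) {
        std::atomic<std::size_t> next{0};
        auto start = std::chrono::steady_clock::now();
        std::vector<std::thread> workers;
        for (std::size_t t = 0; t < num_threads; ++t) {
            workers.emplace_back([&] {
                for (std::size_t i = next.fetch_add(1); i < queries.size(); i = next.fetch_add(1)) {
                    process(queries[i]);  // e.g., MaxScore or VBMW top-k retrieval
                }
            });
        }
        for (auto& w : workers) { w.join(); }
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    }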
3.4 Discussion

PISA is built for performance. We are able to rapidly process each query set thanks to efficient document retrieval algorithms and extremely fast compression. On the other hand, as we have shown, SIMD-BP128 encoding also exhibits a reasonable compression ratio, which allows us to store the index in main memory. We encourage the reader to study the work by Mallia et al. [24] for more information about query efficiency under different retrieval and compression methods.

At present, our query retrieval is tailored towards fast candidate selection, as we lack any complex ranking functionality, such as a learning-to-rank document reranking stage. However, the effectiveness we obtain using BM25 is consistent with other results found in the literature [19].

Furthermore, we provide a generic index building pipeline, which can be easily customized to one's needs. We offload most of the computationally intensive operations onto the initial stages of indexing to speed up experiments with many configurations; in particular, to deliver additional indexes with different integer encodings quickly and easily.

As per the workshop rules, we deliver a Docker image which reproduces the presented results. Note that the initial version of the image was derived from an image with a precompiled distribution of PISA. However, we quickly discovered that this solution was not portable. The source of our issues was compiling the code with AVX2 support: once compiled, the binaries could not be executed on a machine that does not support AVX2. One solution could be to cross-compile and provide different versions of the image. However, we chose to simply distribute the source code, to be compiled at the initial stage of an experimental run.
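A lightweight startup guard along the lines sketched below, using the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports, could turn the resulting crash into a clear error message; it is a hypothetical addition, not something PISA ships, and distributing the source code sidesteps the problem entirely.

    #include <cstdio>
    #include <cstdlib>

    // If the binary was compiled with AVX2 enabled, refuse to run on a CPU
    // that lacks AVX2 instead of dying later with an illegal instruction.
    void require_avx2_if_built_with_it() {
    #if defined(__AVX2__)
        __builtin_cpu_init();
        if (!__builtin_cpu_supports("avx2")) {
            std::fprintf(stderr, "binary built with AVX2, but this CPU does not support it\n");
            std::exit(EXIT_FAILURE);
        }
    #endif
    }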
4 FUTURE PLANS

Despite its clear strengths, PISA is still a relatively young project, aspiring to become a more widely used tool for IR experimentation. We recognize that many relevant features can still be developed to further enrich the framework, and we have every intention of pursuing these in the near future.

An obvious direction is to continue our work on query performance. For instance, we intend to support precomputing quantized partial scores in order to further improve candidate selection performance [11]. We are also considering implementing other traversal techniques, including known approaches, such as Score-at-a-Time methods [1, 20], as well as novel techniques.

The next step would be to implement more complex document rankings based on learning-to-rank. Many of the data structures required for feature extraction are indeed already in place. We would also like to enhance our query retrieval pipeline with ranking cascades that are capable of applying learned models [8].

Other planned features include query expansion, content extraction (template detection, boilerplate removal), sharding, and distributed indexes. Work on some of these has in fact already started.






                 Parse     Invert    Reorder    Compress   Metadata      Total
    Robust04   0:06:22    0:00:08    0:01:21     0:00:07    0:00:44    0:08:42
    Core17     0:11:41    0:00:42    0:07:18     0:00:14    0:02:59    0:22:54
    Core18     0:10:42    0:00:14    0:02:11     0:00:07    0:01:11    0:14:25
    Gov2       1:37:52    1:00:12    2:28:04     0:06:42    0:37:04    5:49:52
    ClueWeb09  4:08:08    3:11:50   10:28:30     0:32:42    2:01:01   20:22:12
    ClueWeb12  5:09:58    3:27:51        —       0:34:55    2:11:46   11:24:30

Table 4: Indexing times broken down into five phases: parsing, inverting, reordering, compression, and index metadata construction. Times are reported in the format hours:minutes:seconds.



                Index size (MiB)   Docs (bpi)   Freqs (bpi)
    Robust04              136.88         7.48          3.21
    Core17                545.90         7.07          3.13
    Core18                238.33         7.24          3.22
    Gov2                5,410.89         5.59          3.03
    ClueWeb09          20,715.29         7.96          3.63
    ClueWeb12          23,206.16         9.22          4.52

Table 5: Total index size and average number of bits per integer while encoding documents and frequencies within posting lists.

                 Topics      MAP     P@30    NDCG@20
    Robust04        All   0.2534   0.3120     0.4221
    Core17          All   0.2078   0.4260     0.3898
    Core18          All   0.2384   0.3500     0.3927
    Gov2        701–750   0.2638   0.4776     0.4070
                751–800   0.3305   0.5487     0.5073
                801–850   0.2950   0.4680     0.4925
    ClueWeb09    51–100   0.1009   0.2521     0.1509
                101–150   0.1093   0.2507     0.2177
                151–200   0.1054   0.2100     0.1311
    ClueWeb12   201–250   0.0449   0.1940     0.1529
                251–300   0.0217   0.1240     0.1484

Table 6: The effectiveness of the submitted run for each respective corpus.

                      k = 10               k = 1,000
    Collection   MaxScore   VBMW      MaxScore   VBMW
    Robust04            7      7            21     26
    Core17             10      8            16     15
    Core18              5      4            12     14
    Gov2              115     99           215    200
    ClueWeb09         225    138           285    424
    ClueWeb12         220    415           248    842

Table 7: Time taken to process the entire query log for each collection. Time is reported in milliseconds.
5 CONCLUSION

PISA is a relative newcomer on the scene of open source IR software, yet it has already proven its many benefits, including a flexible design which is specifically tailored for use in research. Indeed, PISA has been successfully used in several recent research papers [21, 23, 24, 29].

One of the indisputable advantages of PISA is its extremely fast query execution, achieved by careful optimization and the zero-cost abstractions of C++. Furthermore, it supports a multitude of state-of-the-art compression and query processing techniques that can be used interchangeably.

Although there are still several shortcomings, these are mostly due to the project's young age, and we hope to address them very soon. Furthermore, we plan to continue enhancing the system with novel solutions. Indeed, a good amount of time has been spent on PISA to provide a high-quality experimental IR framework, not only in terms of performance, but also from a software engineering point of view. We use modern technologies and libraries, continuous integration, and test suites to ensure the quality of our code and the correctness of our implementations.

We encourage any interested researchers to get involved with the PISA project.

ACKNOWLEDGEMENTS

This research was supported by NSF Grant IIS-1718680, a grant from Amazon, and the Australian Research Training Program Scholarship.

REFERENCES

[1] V. N. Anh, O. de Kretser, and A. Moffat. 2001. Vector-space ranking with effective early termination. In Proc. SIGIR. 35–42.
[2] V. N. Anh and A. Moffat. 2010. Index compression using 64-bit words. Soft. Prac. & Exp. 40, 2 (2010), 131–147.
[3] T. G. Armstrong, A. Moffat, W. Webber, and J. Zobel. 2009. Improvements that don't add up: Ad-hoc retrieval results since 1998. In Proc. CIKM. 601–610.






[4] D. Blandford and G. Blelloch. 2002. Index Compression Through Document Reordering. In Proc. DCC. 342–352.
[5] A. Z. Broder, D. Carmel, M. Herscovici, A. Soffer, and J. Zien. 2003. Efficient Query Evaluation Using a Two-level Retrieval Process. In Proc. CIKM. 426–434.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. 2000. Min-Wise Independent Permutations. J. Comput. Syst. Sci. 60, 3 (2000), 630–659.
[7] K. Chakrabarti, S. Chaudhuri, and V. Ganti. 2011. Interval-based pruning for top-k processing over compressed lists. In Proc. ICDE. 709–720.
[8] R-C. Chen, L. Gallagher, R. Blanco, and J. S. Culpepper. 2017. Efficient Cost-Aware Cascade Ranking in Multi-Stage Retrieval. In Proc. SIGIR. 445–454.
[9] F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. 2009. On Compressing Social Networks. In Proc. SIGKDD. 219–228.
[10] M. Crane. 2018. Questionable Answers in Question Answering Research: Reproducibility and Variability of Published Results. Trans. ACL 6 (2018), 241–252.
[11] M. Crane, A. Trotman, and R. O'Keefe. 2013. Maintaining discriminatory power in quantized indexes. In Proc. CIKM. 1221–1224.
[12] J. Dean. 2009. Challenges in Building Large-scale Information Retrieval Systems: Invited Talk. In Proc. WSDM. 1–1.
[13] L. Dhulipala, I. Kabiljo, B. Karrer, G. Ottaviano, S. Pupyrev, and A. Shalita. 2016. Compressing Graphs and Indexes with Recursive Graph Bisection. In Proc. KDD. 1535–1544.
[14] S. Ding and T. Suel. 2011. Faster top-k document retrieval using block-max indexes. In Proc. SIGIR. 993–1002.
[15] D. Hawking and T. Jones. 2012. Reordering an Index to Speed Query Processing Without Loss of Effectiveness. In Proc. ADCS. 17–24.
[16] D. Lemire and L. Boytsov. 2015. Decoding billions of integers per second through vectorization. Soft. Prac. & Exp. 45, 1 (2015), 1–29.
[17] D. Lemire, N. Kurz, and C. Rupp. 2018. Stream VByte: Faster byte-oriented integer compression. Inf. Proc. Letters 130 (2018), 1–6.
[18] J. Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (2019), 40–51.
[19] J. Lin, M. Crane, A. Trotman, J. Callan, I. Chattopadhyaya, J. Foley, G. Ingersoll, C. Macdonald, and S. Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proc. ECIR. 408–420.
[20] J. Lin and A. Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In Proc. ICTIR. 301–304.
[21] J. Mackenzie, A. Mallia, M. Petri, J. S. Culpepper, and T. Suel. 2019. Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In Proc. ECIR. 339–352.
[22] A. Mallia, G. Ottaviano, E. Porciani, N. Tonellotto, and R. Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proc. SIGIR. 625–634.
[23] A. Mallia and E. Porciani. 2019. Faster BlockMax WAND with longer skipping. In Proc. ECIR. 771–778.
[24] A. Mallia, M. Siedlaczek, and T. Suel. 2019. An Experimental Study of Index Compression and DAAT Query Processing Methods. In Proc. ECIR. 353–368.
[25] A. Moffat and L. Stuiver. 2000. Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3, 1 (2000), 25–47.
[26] G. Ottaviano, N. Tonellotto, and R. Venturini. 2015. Optimal Space-time Tradeoffs for Inverted Indexes. In Proc. WSDM. 47–56.
[27] G. Ottaviano and R. Venturini. 2014. Partitioned Elias-Fano indexes. In Proc. SIGIR. 273–282.
[28] M. Petri, J. S. Culpepper, and A. Moffat. 2013. Exploring the Magic of WAND. In Proc. Aust. Doc. Comp. Symp. 58–65.
[29] M. Petri, A. Moffat, J. Mackenzie, J. S. Culpepper, and D. Beck. 2019. Accelerated Query Processing Via Similarity Score Prediction. In Proc. SIGIR. To Appear.
[30] J. Plaisance, N. Kurz, and D. Lemire. 2015. Vectorized VByte Decoding. In Int. Symp. Web Alg.
[31] M. F. Porter. 1997. An Algorithm for Suffix Stripping. In Readings in IR. 313–316.
[32] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, and M. Gatford. 1994. Okapi at TREC-3. In Proc. TREC.
[33] F. Silvestri. 2007. Sorting out the Document Identifier Assignment Problem. In Proc. ECIR. 101–112.
[34] A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. 2011. SIMD-based Decoding of Posting Lists. In Proc. CIKM. 317–326.
[35] A. Trotman. 2014. Compression, SIMD, and Postings Lists. In Proc. Aust. Doc. Comp. Symp. 50.50–50.57.
[36] A. Trotman, X-F. Jia, and M. Crane. 2012. Towards an efficient and effective search engine. In Wkshp. Open Source IR. 40–47.
[37] A. Trotman and J. Lin. 2016. In Vacuo and In Situ Evaluation of SIMD Codecs. In Proc. Aust. Doc. Comp. Symp. 1–8.
[38] A. Trotman, A. Puurula, and B. Burgess. 2014. Improvements to BM25 and Language Models Examined. In Proc. Aust. Doc. Comp. Symp. 58–65.
[39] H. R. Turtle and J. Flood. 1995. Query Evaluation: Strategies and Optimizations. Inf. Proc. & Man. 31, 6 (1995), 831–850.
[40] S. Vigna. 2013. Quasi-succinct indices. In Proc. WSDM. 83–92.
[41] Q. Wang and T. Suel. 2019. Document Reordering for Faster Intersection. Proc. VLDB 12, 5 (2019), 475–487.
[42] H. Yan, S. Ding, and T. Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proc. WWW. 401–410.
[43] J. Zhang, X. Long, and T. Suel. 2008. Performance of Compressed Inverted List Caching in Search Engines. In Proc. WWW. 387–396.



