<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>PISA: Performant Indexes and Search for Academia</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Antonio Mallia</string-name>
          <email>antonio.mallia@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Mackenzie</string-name>
          <email>joel.mackenzie@rmit.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michał Siedlaczek</string-name>
          <email>michal.siedlaczek@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Torsten Suel</string-name>
          <email>torsten.suel@nyu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>New York University</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">US</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>RMIT University</institution>
          ,
          <addr-line>Melbourne</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2019</year>
      </pub-date>
      <volume>25</volume>
      <issue>2019</issue>
      <fpage>342</fpage>
      <lpage>352</lpage>
      <abstract>
        <p>Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on eficient implementations of stateof-the-art representations and algorithms for text retrieval. In this work, we outline our efort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Reproducibility, replicability, and generalizability have become
increasingly important within the Information Retrieval (IR)
community. For example, weak baselines [
        <xref ref-type="bibr" rid="ref3">3, 18</xref>
        ] are often used as
comparison points against newly proposed systems, resulting in what often
appear to be large improvements. One possible reason that weak
baselines are used is that they are usually simple and well
established, making it easy to reproduce or replicate them. Indeed,
replicating experimental runs is not a trivial task; minor changes in
software, datasets, and even hardware can result in significant changes
to experimental runs [10]. To this end, the 2019 Open Source
Information Retrieval Replicability Challenge (OSIRRC) brings together
academic groups with the aim of defining a reusable framework for
easily running IR experiments with a particular focus on
replicability, where a diferent team (to those who proposed the system)
uses the original experimental artifacts to replicate the given result.
In an attempt to improve replicability, the OSIRRC workshop
proposes to package and deploy various IR systems within a Docker
container,1 which enables low-efort replication by reducing many
experimental confounders.
      </p>
      <p>The goal of this paper is to give an overview of the PISA system
and to outline the process of building replicable runs from PISA
with Docker. We outline the particulars of our submitted runs, and
2
PISA2 is an open source library implementing text indexing and
search, primarily targeted for use in an academic setting. PISA
implements a range of state-of-the-art indexing and search algorithms,
making it useful for researchers to easily experiment with new
technologies, especially those concerned with eficiency. Nevertheless,
we strive for much more than just an eficient implementation.
With clean and extensible design and API, PISA provides a general
framework that can be employed for miscellaneous research tasks,
such as parsing, ranking, sharding, index compression, document
reordering and query processing, to name a few.</p>
      <p>
        PISA started of as a fork repository of the ds2i library3 by
Ottaviano et al., which was used for prior research on eficient index
representations [
        <xref ref-type="bibr" rid="ref11 ref12">26, 27</xref>
        ]. Since then, PISA has gone through
substantial changes, and now considerably extends the original library.
PISA is still being actively developed, integrating new features and
improving its design and implementation at regular intervals.
2.1
      </p>
    </sec>
    <sec id="sec-2">
      <title>Design Overview</title>
      <p>PISA is designed to be eficient, extensible, and easy to use. We now
briefly outline some of the design aspects of PISA.</p>
      <p>Modern Implementation. The PISA engine itself is built using
C++17, making use of many new features in the C++ standard.
This allows us to ensure the implementations are both eficient and
understandable, as some of the newest language features can make
for cleaner code and APIs. We aim to adhere to best design practices,
such as RAII (Resource Acquisition Is Initialization), C++ Core
Guidelines4 (aided by Guidelines Support Library5), and
stronglytyped aliases, all of which result in safer and cleaner code without
sacrificing runtime performance.
2https://https://github.com/pisa-engine/pisa
3https://github.com/ot/ds2i
4https://github.com/isocpp/CppCoreGuidelines
5https://github.com/Microsoft/GSL
Performance. One of the biggest advantages of C++ is its
performance. Control over data memory layout allows us to implement
and store eficient data structures with little to no runtime overhead.
Furthermore, we make use of low level optimizations, such as CPU
intrinsics, branch prediction hinting, and SIMD instructions, which
are especially important for eficiently encoding and decoding
postings lists. Memory mapped files provide fast and easy access to data
structures persisted on disk. We also avoid unnecessary indirection
of runtime polymorphism in performance-sensitive code in favor of
the static polymorphism of C++ templates and metaprogramming.
Our performance is also much more predictable than when using
languages with garbage collection. Finally, we sufer no runtime
overhead as is the case with VM-based or interpreted languages.
Extensibility. Another important design aspect of PISA promotes
extensibility. For example, interfaces are exposed which allow for
new components to be plugged in easily, such as diferent parsers,
stemmers, and compression codecs. This is achieved through heavy
use of generic programming, similar to that provided by the C++
Standard Template Library. For example, an encoding schema is as
much a parameter of an index as a custom allocator is a parameter
of std::vector. By decoupling diferent parts of the framework,
we provide an easy way of extending the library both in future
iterations of the project, as well as by users of the library.
2.2</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Overview</title>
      <p>In this section, we take a short tour of several important features
of our system. We briefly discuss the indexing pipeline, document
reordering, scoring, and implemented retrieval methods.
Parsing Collection. The objective of parsing is to represent a
given collection as a forward index, where each term is assigned a
unique numerical ID, and each document consists of a list of such
identifiers. This is a non-trivial task that involves several steps that
can be critical to retrieval performance.</p>
      <p>First, the document content must be accessed by parsing a certain
data exchange format, such as WARC, JSON, or XML. The document
itself is typically represented by HTML, XML, or a custom annotated
format, which must be parsed to retrieve the underlying text. The
text must be then tokenized, and the resulting words are typically
stemmed before indexing.</p>
      <p>
        PISA supports a selection of file and content parsers. The parsing
tool allows input formats of many standard IR collections, such as
ClueWeb096, ClueWeb127, GOV28, Robust049, Washington Post10,
and New York Times.11 We also provide an HTML content parser,
and the Porter2 [
        <xref ref-type="bibr" rid="ref16">31</xref>
        ] stemming algorithm for English language.
As discussed in Section 2.1, PISA is designed to allow new
components, such as parsers or stemmers, to be plugged-in with low
implementation overhead.
      </p>
      <p>As part of a forward index, we also encode a term lexicon. This is
simply a mapping between strings and numerical IDs. We represent
6https://lemurproject.org/clueweb09/
7https://lemurproject.org/clueweb12/
8http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm
9https://trec.nist.gov/data/robust/04.guidelines.html
10https://trec.nist.gov/data/wapost/
11https://catalog.ldc.upenn.edu/LDC2008T19
it as a payload vector. The structure is divided into two memory
areas: the first one is an array of integers representing payload ofsets,
while the other stores the payloads (strings). This representation
allows us to quickly retrieve a word at a given position—which
determines its ID—directly from disk using memory mapping. We
achieve string lookups by assigning term IDs in lexicographical
order and performing binary search. Note that reassigning term
IDs requires little overhead, as it is applied directly when a number
of small index batches are merged together. This design decision
enables us to provide a set of command line tools to quickly access
index data without unnecessary index loading overhead.
Document titles (such as TREC IDs) are stored using the same structure
but without sorting them first, as the order of the documents is
established via an index reordering stage as described below.</p>
      <p>The entire parsing process is performed in parallel when
executed on a multi-core architecture. The forward index can be
created under tight memory constraints by dividing the corpus and
processing it in batches, and then merging the resulting forward
indexes at the end. Currently, PISA only supports merging forward
indexes together prior to generating the canonical inverted index.
However, future work will aim to allow index updates through
eficient inverted index merging operations.</p>
      <p>Indexing. Once the parsing phase is complete, the forward index
containing a collection can be used to build an inverted index in a
process called inverting. The product of this phase is an inverted
index in the canonical format. This representation is very similar
to the forward index, but in reverse: it is a collection of terms,
each a containing a list of document IDs. The canonical
representation is stored in an uncompressed and universally readable format,
which simply uses binary sequences to represent lists. There are
a few advantages of storing the canonical representation. Firstly,
it allows various transformations, such as document reordering
or index pruning, to be performed on the index before storing it
in its final compressed form. Secondly, it allows for diferent final
representations to be built rapidly, such as indexes that use diferent
compression algorithms. Thirdly, it allows the PISA system to be
used to create an inverted index without requiring the PISA system
to be used beyond this point, making it easy for experimenters to
import the canonical index into their own IR system.</p>
      <p>
        Document Reordering. Document reordering corresponds to
reassigning the document identifiers within the inverted index [ 4]. It
generally aims to minimize the cost of representing the inverted
index with respect to storage consumption. However, reordering can
also be used to minimize other cost functions, such as query
processing speed [
        <xref ref-type="bibr" rid="ref26">41</xref>
        ]. Interestingly, optimizing for space consumption has
been empirically shown to speed up query processing [
        <xref ref-type="bibr" rid="ref9">14, 15, 24</xref>
        ],
making document reordering an attractive step during indexing.
In theory, index reordering can take place either on an existing
inverted index, or before the inverted index is constructed. PISA
opts to use the canonical index as both input and output for the
document reordering step, as this allows a generic reordering scheme
to be used which can be easily extended to various reordering
techniques, and allows the reordering functionality to be used without
requiring further use of PISA.
      </p>
      <p>
        Many approaches for document reordering exist, including
random ordering, reordering by URL [
        <xref ref-type="bibr" rid="ref18">33</xref>
        ], MinHash ordering [6, 9],
and recursive graph bisection [13]. PISA supports eficient index
reordering for all of the above techniques [
        <xref ref-type="bibr" rid="ref6">21</xref>
        ].
      </p>
      <p>Index Compression. Given the extremely large collections
indexed by current search engines, even a single node of a large
search cluster typically contains many billions of integers in its
index structure. In particular, compression of posting lists is of
utmost importance, as they account for much of the data size and
access costs. Compressing an inverted index introduces a twofold
advantage over a non-compressed representation: it reduces the
size of the index, and it allows us to better exploit the memory
hierarchy, which consequently speeds up query processing.</p>
      <p>
        Compression allows the space requirements of indexes to be
substantially decreased without loss of information. The simple and
extensible design of PISA allows for new compression approaches
to be plugged in easily. As such, a range of state-of-the-art
compression schemes are currently supported, including variable byte
encoders (VarIntGB [12], VarInt-G8IU [
        <xref ref-type="bibr" rid="ref19">34</xref>
        ], MaskedVByte [
        <xref ref-type="bibr" rid="ref15">30</xref>
        ],
and StreamVByte [17]), word-aligned encoders (Simple8b [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
Simple16 [
        <xref ref-type="bibr" rid="ref28">43</xref>
        ], QMX [
        <xref ref-type="bibr" rid="ref20 ref22">35, 37</xref>
        ], and SIMD-BP128 [16]), monotonic
encoders (Interpolative [
        <xref ref-type="bibr" rid="ref10">25</xref>
        ], EF [
        <xref ref-type="bibr" rid="ref25">40</xref>
        ], and PEF [
        <xref ref-type="bibr" rid="ref12">27</xref>
        ]), and
frame-ofreference encoders (Opt-PFD [
        <xref ref-type="bibr" rid="ref27">42</xref>
        ]).
      </p>
      <p>
        Oftentimes, the choice of encoding depends on both the time
and space constraints, as compression schemes usually trade of
space eficiency for either encoding or decoding performance, or
both. We refer the reader to [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ] for more details.
      </p>
      <p>
        Scoring. Currently, BM25 [
        <xref ref-type="bibr" rid="ref17">32</xref>
        ] is used as the weighting model for
ranked retrieval. BM25 is a simple yet efective ranker for
processing bag-of-words queries, and promotes efective dynamic
pruning [
        <xref ref-type="bibr" rid="ref13">28</xref>
        ]. Given a document d and a query q, we use the following
formulation of BM25:
      </p>
      <p>BM25(d, q) =</p>
      <p>IDF(t ) = log
Õ IDF(t ) · TF(d, t ),
t ∈q</p>
      <p>N − ft + 0.5
ft + 0.5
,
TF(d, t ) =</p>
      <p>
        fd,t · (k1 + 1)
fd,t + k1 · (1 − b + b · ld /lavg)
where N is the number of documents in the collection, ft is the
document frequency of term t , fd,t is the frequency of t in d, ld is
the length of document d, and lavg is the average document length.
We set parameters k1 = 0.9 and b = 0.4, as described by Trotman
et al. [
        <xref ref-type="bibr" rid="ref21">36</xref>
        ]. For a more exhaustive examination of BM25 variants,
we refer the reader to the work by Trotman et al. [
        <xref ref-type="bibr" rid="ref23">38</xref>
        ].
Search. Because PISA supports document-ordered index
organization, both Boolean and scored conjunctions or disjunctions can be
evaluated, exploiting either a Document-at-a-Time or a
Term-at-aTime index traversal strategy.
      </p>
      <p>
        Furthermore, PISA supports a range of state-of-the-art dynamic
pruning algorithms such as MaxScore [
        <xref ref-type="bibr" rid="ref24">39</xref>
        ] and WAND [5], and
their Block-Max counterparts, Block-Max MaxScore (BMM) [7] and
Block-Max WAND (BMW) [14].
      </p>
      <p>
        In order to facilitate these dynamic pruning algorithms, an
auxiliary index metadata structure must be built, which stores the
required upper-bound score information to enable eficient dynamic
pruning. It can be built per postings list (for algorithms like WAND
and MaxScore), or for each fixed-sized block (for the Block-Max
variants). In addition, variable-sized blocks can be built (in lieu of
ifxed-sized blocks) to support the variable-block family of
BlockMax algorithms listed above, such as Variable Block-Max WAND
(VBMW) [
        <xref ref-type="bibr" rid="ref7 ref8">22, 23</xref>
        ]. Ranked conjunctions are also supported using
the Ranked AND or (Variable) Block-Max Ranked AND (BMA) [14]
algorithms.
      </p>
      <p>Indeed, the logical blocks stored in the index metadata are
decoupled from the compressed blocks inside the inverted index. The
advantage of storing the metadata independently from the inverted
index is that the metadata depends only on the canonical index,
needs to be computed only once, and does not change if the
underlying compression codec is changed.</p>
      <p>PISA provides two ways to experiment with query retrieval. The
ifrst one performs end-to-end search for a given list of queries, and
prints out the results in the TREC format. It can also be used to
evaluate query speed, as was done for this workshop. Additionally,
we provide a more granular approach, which focuses on comparing
diferent retrieval methods directly. Here, we only report the time
to fetch posting lists and perform search, excluding lexicon lookups
and printing results to the standard output or a file. We encourage
(1)
(2)
(3)
the interested reader to refer to PISA’s documentation for more
details about running experiments.12
3</p>
    </sec>
    <sec id="sec-4">
      <title>REPRODUCIBLE EXPERIMENTATION</title>
      <p>In the spirit of OSIRRC, we utilize the software and metrics made
available by the organizers, including the jig13 and the trec_eval14
tool. In addition, we have decided to provide some further
information and reference experiments that we consider important.
3.1</p>
    </sec>
    <sec id="sec-5">
      <title>Default Runs</title>
      <p>
        Given the many possibilities for the various components of the
PISA system, we now outline the default system configuration for
the OSIRRC workshop. Further information can be found in the
documentation of our OSIRRC system.15 Note that the block size for
the variable-block indexes depends on a parameter λ [
        <xref ref-type="bibr" rid="ref7">22</xref>
        ]. In order
to get the desired average block size for the variable blocks, the
value of λ was searched for ofline, and difers for each collection.
For convenience, we tabulate the values of λ in Table 1. Note that
such values of λ only apply when using the same parsing, stemmer,
and reordering as listed below.
      </p>
      <p>• Parsing: Gumbo16 with Porter2 stemming; no stopwords
removed. We discard the content of any document with over
1,000 HTML parsing errors.
• Reordering: Recursive graph bisection. We optimize the
objective function using the posting lists of lengths at least
4,096.
• Compression: SIMD-BP128 with a posting list block size
of 128.
• Scoring: BM25 with k1 = 0.9 and b = 0.4
• Index Metadata: Variable blocks with a mean block size of
40 ± 0.5.</p>
      <p>• Index Traversal: Variable BlockMax WAND.
3.2</p>
    </sec>
    <sec id="sec-6">
      <title>Experimental Setup</title>
      <p>Now, we briefly outline the experimental setup and the resources
used for our experimentation.</p>
      <p>Datasets. We performed our experiments on the following text
collections:
• Robust04 consists of newswire articles from a variety of
sources from the late 1980’s through to the mid 1990’s.
12https://pisa.readthedocs.io/en/latest/
13https://github.com/osirrc/jig/
14https://github.com/usnistgov/trec_eval
15https://github.com/osirrc/pisa-docker
16https://github.com/google/gumbo-parser</p>
      <p>Some quantitative properties of these collections are summarized
in Table 2. The first three are relatively small, and contain newswire
data. The remaining corpora are significantly larger, containing
samples of the Web. Thus, the latter two should be more indicative
of any diferences in query eficiency. In fact, each of these can
be thought of as representing a single shard in a large distributed
search system.</p>
      <p>Test queries. Each given collection contains a set of test queries
from various TREC tracks which we use to validate the efectiveness
of our system. These queries are described in Table 3.
Testing details. All experiments were conducted on a machine
with two Intel Xeon E5-2667 v2 CPUs, with a total of 32 cores
clocked at 3.30 GHz with 256 GiB RAM running Linux 4.15.0.
Furthermore, the experiments presented here are deployed within the
Docker framework. Although we believe that this may cause a
slight reduction in the eficiency of the presented algorithms, we
preserve this setup in the spirit of the workshop and
comparability. We leave further investigation of potential overhead of Docker
containers as future work.</p>
      <p>A note on ClueWeb12. In preliminary experiments, we found
that the memory consumption for reordering the ClueWeb12 index
was high, which slowed down the indexing process considerably.
Thus, we opted to skip reordering the ClueWeb12 collection in the
following experiments, and our results are reported on an index
that uses the default (crawl) order. Since index order impacts the
value of λ, we use λ = 26, which results in variable block metadata
with a mean block size in the desired range of 40 ± 0.5. Note that
this value difers from the one reported in Table 1, which is correct
if reordering is applied based on Recursive Graph Bisection (see
Section 3.1).
3.3</p>
    </sec>
    <sec id="sec-7">
      <title>Results and Discussion</title>
      <p>We now present our reference experiments, which involve
end-toend processing of each given collection.</p>
      <p>
        Indexing and Compression. The HTML content of each
document was extracted with the Gumbo parser. We then extracted
three kinds of tokens: alphanumeric strings, acronyms, and
possessives, which were then stemmed using the Porter2 algorithm.
We reordered documents using the recursive graph bisection
algorithm which is known to improve both compression and query
performance [
        <xref ref-type="bibr" rid="ref6 ref9">13, 21, 24</xref>
        ]. Then we compressed the index with
SIMDBP128 encoding, which has been proven to exhibit one of the best
space-speed trade ofs [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ].
      </p>
      <p>Table 4 summarizes indexing times broken down into individual
phases, while Table 5 shows compressed inverted index sizes as
well as average numbers of bits used to encode document gaps and
frequencies. The entire building process was executed with 32 cores
available; however, at the time of writing, only some parts of the
pipeline support parallel execution. We also note that the index
reordering step is usually the most expensive step in our indexing
pipeline. If a fast indexing time is of high importance, this step can
be omitted, as we did for ClueWeb12. Alternatively, less expensive
reordering operations can be used. However, skipping the index
reordering stage (or using a less efective reordering technique)
will result in a larger inverted index and less eficient query-time
performance.</p>
      <p>System Efectiveness. Next, we outline the efectiveness of the
PISA system. In particular, we are processing rank-safe, disjunctive,
top-k queries to depth k = 1,000. Since processing is rank-safe, all of
the disjunctive index traversal algorithms result in the same top-k
set. Table 6 reports the efectiveness for Mean Average Precision
(MAP), Precision at rank 30 (P@30), and Normalized Discounted
Cumulative Gain at rank 20 (NDCG@20).</p>
      <p>
        Query Eficiency. To measure the eficiency of query processing,
we measure how long it takes to process the entire query log for
each collection. We use 32 threads to concurrently retrieve the
top-k documents for all queries using either the MaxScore or the
VBMW algorithm, with a single thread processing a single query at
a time. MaxScore has been shown to outperform other algorithms
for large values of k on the Gov2 and ClueWeb09 collections [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ].
Table 7 shows the results. While MaxScore usually outperforms
VBMW, we did not optimize the block size of the index metadata,
so comparisons should be made with caution. Indeed, VBMW is
likely to outperform MaxScore with optimized blocks and small
values of k. For a more detailed analysis of per-query latency within
PISA, we refer the interested reader to the recent work by Mallia
et al. [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ].
3.4
      </p>
    </sec>
    <sec id="sec-8">
      <title>Discussion</title>
      <p>
        PISA is built for performance. We are able to rapidly process each
query set thanks to eficient document retrieval algorithms and
extremely fast compression. On the other hand, as we have shown,
SIMD-BP128 encoding also exhibits a reasonable compression ratio,
which allows us to store the index in main memory. We
encourage the reader to study the work by Mallia et al. [
        <xref ref-type="bibr" rid="ref9">24</xref>
        ] for more
information about query eficiency under diferent retrieval and
compression methods.
      </p>
      <p>
        At the present moment, our query retrieval is tailored towards
fast candidate selection, as we lack any complex ranking
functionality, such as a learning-to-rank document reranking stage. However,
the efectiveness we obtain using BM25 is consistent with other
results found in the literature [
        <xref ref-type="bibr" rid="ref4">19</xref>
        ].
      </p>
      <p>Furthermore, we provide a generic index building pipeline, which
can be easily customized to one’s needs. We unload most of the
computationally intensive operations onto the initial stages of indexing
to speed up experiments with many configurations; in particular, to
deliver additional indexes with diferent integer encodings quickly
and easily.</p>
      <p>As per the workshop rules, we deliver a Docker image, which
reproduces the presented results. Note that the initial version of the
image was derived from an image with a precompiled distribution
of PISA. However, we quickly discovered this solution was not
portable. The source of our issues was compiling the code with
AVX2 support. Once compiled, the binaries could not be executed
on a machine not supporting AVX2. One solution could be to
crosscompile and provide diferent versions of the image. However, we
chose to simply distribute the source code to be compiled at the
initial stage of an experimental run.
4</p>
    </sec>
    <sec id="sec-9">
      <title>FUTURE PLANS</title>
      <p>Despite its clear strengths, PISA is still a relatively young project,
aspiring to become a more widely used tool for IR experimentation.
We recognize that many relevant features can be still developed to
further enrich the framework. We have every intention of pursuing
these in the nearest future.</p>
      <p>
        An obvious direction is to continue our work on query
performance. For instance, we intend to support precomputing quantized
partial scores in order to further improve candidate selection
performance [11]. We are also considering implementing other traversal
techniques, including known approaches, such as Score-at-a-Time
methods [
        <xref ref-type="bibr" rid="ref1 ref5">1, 20</xref>
        ], as well as novel techniques.
      </p>
      <p>The next step would be to implement more complex document
rankings based on learning-to-rank. Many of the data structures
required for feature extraction are indeed already in place. We
Robust04
Core17
Core18
Gov2
ClueWeb09
ClueWeb12</p>
      <p>Parse
0.4776
0.5487
0.4680
0.2521
0.2507
0.2100
would also like to enhance our query retrieval pipeline with ranking
cascades that are capable of applying learned models [8].</p>
      <p>Other planned features include query expansion, content
extraction (template detection, boilerplate removal), sharding, and
distributed indexes. Work on some of these has in fact already
started.
5</p>
    </sec>
    <sec id="sec-10">
      <title>CONCLUSION</title>
      <p>PISA is a relative newcomer on the scene of open source IR software,
yet it has already proven its many benefits, including a flexible
design which is specifically tailored for use in research. Indeed, PISA</p>
      <p>
        Collection
has been successfully used in several recent research papers [
        <xref ref-type="bibr" rid="ref14 ref6 ref8 ref9">21,
23, 24, 29</xref>
        ].
      </p>
      <p>One of the indisputable advantages of PISA is its extremely fast
query execution, achieved by careful optimization and the
zerocost abstractions of C++. Furthermore, it supports a multitude of
state-of-the-art compression and query processing techniques that
can be used interchangeably.</p>
      <p>Although there are still several shortcomings, these are mostly
due to the project’s young age, and we hope to address these very
soon. Furthermore, we plan to continue enhancing the system with
novel solutions. Indeed, a good amount of time has been spent on
PISA to provide a high quality experimental IR framework, not
only in terms of performance, but also from a software engineering
point of view. We use modern technologies and libraries, continuous
integration, and test suites to ensure the quality of our code, and
the correctness of our implementations.</p>
      <p>We encourage any interested researchers to get involved with
the PISA project.</p>
    </sec>
    <sec id="sec-11">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This research was supported by NSF Grant IIS-1718680, a grant from
Amazon and the Australian Research Training Program Scholarship.
[18] J. Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR</p>
      <p>Forum 52, 2 (2019), 40–51.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          , O. de Kretser,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          .
          <year>2001</year>
          .
          <article-title>Vector-space ranking with efective early termination.</article-title>
          .
          <source>In Proc. SIGIR</source>
          .
          <volume>35</volume>
          -
          <fpage>42</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Anh</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Index compression using 64-bit words</article-title>
          .
          <source>Soft. Prac. &amp; Exp. 40</source>
          ,
          <issue>2</issue>
          (
          <year>2010</year>
          ),
          <fpage>131</fpage>
          -
          <lpage>147</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. G.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Webber</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Improvements that don't add up: Ad-hoc Retrieval Results since 1998</article-title>
          .
          <source>In Proc. CIKM</source>
          .
          <volume>601</volume>
          -
          <fpage>610</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Chattopadhyaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Foley</surname>
          </string-name>
          , G. Ingersoll,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge</article-title>
          .
          <source>In Proc. ECIR</source>
          .
          <volume>408</volume>
          -
          <fpage>420</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Anytime Ranking for Impact-Ordered Indexes</article-title>
          .
          <source>In Proc. ICTIR</source>
          .
          <volume>301</volume>
          -
          <fpage>304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mackenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Petri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study</article-title>
          .
          <source>In Proc. ECIR</source>
          .
          <volume>339</volume>
          -
          <fpage>352</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Porciani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Faster BlockMax WAND with Variable-sized Blocks</article-title>
          .
          <source>In Proc. SIGIR</source>
          .
          <volume>625</volume>
          -
          <fpage>634</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          and
          <string-name>
            <given-names>E.</given-names>
            <surname>Porciani</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Faster BlockMax WAND with longer skipping</article-title>
          .
          <source>In Proc. ECIR</source>
          .
          <volume>771</volume>
          -
          <fpage>778</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mallia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siedlaczek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>An Experimental Study of Index Compression and DAAT Query Processing Methods</article-title>
          .
          <source>In Proc. ECIR</source>
          .
          <volume>353</volume>
          -
          <fpage>368</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Stuiver</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>Binary Interpolative Coding for Efective Index Compression</article-title>
          .
          <source>Inf. Retr. 3</source>
          ,
          <issue>1</issue>
          (
          <year>2000</year>
          ),
          <fpage>25</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Optimal Space-time Tradeofs for Inverted Indexes</article-title>
          .
          <source>In Proc. WSDM</source>
          .
          <volume>47</volume>
          -
          <fpage>56</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ottaviano</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Venturini</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Partitioned Elias-Fano indexes</article-title>
          .
          <source>In Proc. SIGIR</source>
          .
          <volume>273</volume>
          -
          <fpage>282</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>M.</given-names>
            <surname>Petri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Exploring the Magic of WAND</article-title>
          .
          <source>In Proc. Aust</source>
          . Doc. Comp. Symp.
          <volume>58</volume>
          -
          <fpage>65</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>M.</given-names>
            <surname>Petri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mofat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mackenzie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Culpepper</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Beck</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Accelerated Query Processing Via Similarity Score Prediction</article-title>
          .
          <source>In Proc. SIGIR</source>
          . To Appear.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Plaisance</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kurz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Lemire</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Vectorized VByte Decoding</article-title>
          .
          <source>In Int. Symp. Web Alg.</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Porter</surname>
          </string-name>
          .
          <year>1997</year>
          .
          <article-title>Readings in IR. Chapter An Algorithm for Sufix Stripping</article-title>
          ,
          <fpage>313</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hancock-Beaulieu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Gatford</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Okapi at TREC-3</article-title>
          .
          <source>In Proc. TREC</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Sorting out the Document Identifier Assignment Problem</article-title>
          .
          <source>In Proc. ECIR</source>
          .
          <volume>101</volume>
          -
          <fpage>112</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Stepanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Gangolli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Rose</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Ernst</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Oberoi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>SIMD-based Decoding of Posting Lists</article-title>
          .
          <source>In Proc. CIKM</source>
          .
          <volume>317</volume>
          -
          <fpage>326</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          .
          <year>2014</year>
          . Compression,
          <string-name>
            <surname>SIMD</surname>
          </string-name>
          , and
          <article-title>Postings Lists</article-title>
          .
          <source>In Proc. Aust</source>
          . Doc. Comp. Symp.
          <volume>50</volume>
          .
          <fpage>50</fpage>
          -
          <lpage>50</lpage>
          .
          <fpage>57</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X-F.</given-names>
            <surname>Jia</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Crane</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Towards an eficient and efective search engine</article-title>
          . In Wkshp. Open Source IR.
          <fpage>40</fpage>
          -
          <lpage>47</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>In Vacuo and In Situ Evaluation of SIMD Codecs</article-title>
          .
          <source>In Proc. Aust. Doc. Comp. Symp. 1-8.</source>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trotman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Puurula</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Burgess</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Improvements to BM25 and Language Models Examined</article-title>
          .
          <source>In Proc. Aust</source>
          . Doc. Comp. Symp.
          <volume>58</volume>
          -
          <fpage>65</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Turtle</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Flood</surname>
          </string-name>
          .
          <year>1995</year>
          .
          <article-title>Query Evaluation: Strategies and Optimizations</article-title>
          .
          <source>Inf. Proc. &amp; Man. 31</source>
          ,
          <issue>6</issue>
          (
          <year>1995</year>
          ),
          <fpage>831</fpage>
          -
          <lpage>850</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vigna</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Quasi-succinct indices</article-title>
          .
          <source>In Proc. WSDM</source>
          .
          <volume>83</volume>
          -
          <fpage>92</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Document Reordering for Faster Intersection</article-title>
          .
          <source>Proc. VLDB 12</source>
          ,
          <issue>5</issue>
          (
          <year>2019</year>
          ),
          <fpage>475</fpage>
          -
          <lpage>487</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ding</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Inverted index compression and query processing with optimized document ordering</article-title>
          .
          <source>In Proc. WWW</source>
          .
          <volume>401</volume>
          -
          <fpage>410</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Suel</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Performance of Compressed Inverted List Caching in Search Engines</article-title>
          .
          <source>In Proc. WWW</source>
          .
          <volume>387</volume>
          -
          <fpage>396</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>