July

PISA: Performant Indexes and Search for Academia

Antonio Mallia (New York University, New York, US), antonio.mallia@nyu.edu

Joel Mackenzie (RMIT University, Melbourne, Australia), joel.mackenzie@rmit.edu.au

Michał Siedlaczek (New York University, New York, US), michal.siedlaczek@nyu.edu

Torsten Suel (New York University, New York, US), torsten.suel@nyu.edu

2019

25 2019 342 352

Performant Indexes and Search for Academia (PISA) is an experimental search engine that focuses on efficient implementations of state-of-the-art representations and algorithms for text retrieval. In this work, we outline our effort in creating a replicable search run from PISA for the 2019 Open Source Information Retrieval Replicability Challenge, which encourages the information retrieval community to produce replicable systems through the use of a containerized, Docker-based infrastructure. We also discuss the origins, current functionality, and future direction and challenges for the PISA system.

INTRODUCTION

Reproducibility, replicability, and generalizability have become increasingly important within the Information Retrieval (IR) community. For example, weak baselines [3, 18] are often used as comparison points against newly proposed systems, resulting in what often appear to be large improvements. One possible reason that weak baselines are used is that they are usually simple and well established, making it easy to reproduce or replicate them. Indeed, replicating experimental runs is not a trivial task; minor changes in software, datasets, and even hardware can result in significant changes to experimental runs [10]. To this end, the 2019 Open Source Information Retrieval Replicability Challenge (OSIRRC) brings together academic groups with the aim of defining a reusable framework for easily running IR experiments with a particular focus on replicability, where a different team (from the one that proposed the system) uses the original experimental artifacts to replicate the given result. In an attempt to improve replicability, the OSIRRC workshop proposes to package and deploy various IR systems within a Docker container,¹ which enables low-effort replication by reducing many experimental confounders.

The goal of this paper is to give an overview of the PISA system and to outline the process of building replicable runs from PISA with Docker. We outline the particulars of our submitted runs, and

2

PISA² is an open source library implementing text indexing and search, primarily targeted for use in an academic setting. PISA implements a range of state-of-the-art indexing and search algorithms, making it easy for researchers to experiment with new technologies, especially those concerned with efficiency. Nevertheless, we strive for much more than just an efficient implementation. With a clean and extensible design and API, PISA provides a general framework that can be employed for miscellaneous research tasks, such as parsing, ranking, sharding, index compression, document reordering, and query processing, to name a few.

PISA started off as a fork of the ds2i library³ by Ottaviano et al., which was used for prior research on efficient index representations [26, 27]. Since then, PISA has gone through substantial changes, and now considerably extends the original library. PISA is still being actively developed, integrating new features and improving its design and implementation at regular intervals.

2.1 Design Overview

PISA is designed to be efficient, extensible, and easy to use. We now briefly outline some of the design aspects of PISA.

Modern Implementation. The PISA engine itself is built using C++17, making use of many new features in the C++ standard. This allows us to ensure the implementations are both efficient and understandable, as some of the newest language features can make for cleaner code and APIs. We aim to adhere to best design practices, such as RAII (Resource Acquisition Is Initialization), the C++ Core Guidelines⁴ (aided by the Guidelines Support Library⁵), and strongly-typed aliases, all of which result in safer and cleaner code without sacrificing runtime performance.

Performance. One of the biggest advantages of C++ is its performance. Control over data memory layout allows us to implement and store efficient data structures with little to no runtime overhead. Furthermore, we make use of low-level optimizations, such as CPU intrinsics, branch prediction hinting, and SIMD instructions, which are especially important for efficiently encoding and decoding postings lists. Memory mapped files provide fast and easy access to data structures persisted on disk. We also avoid the unnecessary indirection of runtime polymorphism in performance-sensitive code in favor of the static polymorphism of C++ templates and metaprogramming. Our performance is also much more predictable than when using languages with garbage collection. Finally, we suffer none of the runtime overhead incurred by VM-based or interpreted languages.

Extensibility. Another important design aspect of PISA is extensibility. For example, interfaces are exposed which allow for new components to be plugged in easily, such as different parsers, stemmers, and compression codecs. This is achieved through heavy use of generic programming, similar to that provided by the C++ Standard Template Library. For example, an encoding schema is as much a parameter of an index as a custom allocator is a parameter of std::vector. By decoupling different parts of the framework, we provide an easy way of extending the library, both in future iterations of the project and by users of the library.

² https://github.com/pisa-engine/pisa
³ https://github.com/ot/ds2i
⁴ https://github.com/isocpp/CppCoreGuidelines
⁵ https://github.com/Microsoft/GSL

2.2 Feature Overview

In this section, we take a short tour of several important features of our system. We briefly discuss the indexing pipeline, document reordering, scoring, and implemented retrieval methods.

Parsing Collection. The objective of parsing is to represent a given collection as a forward index, where each term is assigned a unique numerical ID, and each document consists of a list of such identifiers. This is a non-trivial task that involves several steps that can be critical to retrieval performance.

First, the document content must be accessed by parsing a certain data exchange format, such as WARC, JSON, or XML. The document itself is typically represented by HTML, XML, or a custom annotated format, which must be parsed to retrieve the underlying text. The text must then be tokenized, and the resulting words are typically stemmed before indexing.
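To make the target of parsing concrete, the following minimal Python sketch builds a toy forward index from already-extracted plain text. It is an illustration only and not PISA's implementation: the regular-expression tokenizer, the omission of stemming, and the names `tokenize` and `build_forward_index` are assumptions made for this example.

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep maximal alphanumeric runs (no stemming)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_forward_index(docs: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Return (lexicon, forward index).

    The lexicon is the sorted list of distinct terms, so a term's ID is its
    position in that list. Each document becomes a list of term IDs.
    """
    tokenized = [tokenize(doc) for doc in docs]
    lexicon = sorted({term for doc in tokenized for term in doc})
    term_id = {term: i for i, term in enumerate(lexicon)}
    forward = [[term_id[term] for term in doc] for doc in tokenized]
    return lexicon, forward


docs = ["The quick brown fox", "The lazy dog", "Quick quick dog"]
lexicon, forward = build_forward_index(docs)
print(lexicon)
print(forward)
```

A real pipeline would add format-specific content extraction and stemming before the term IDs are assigned.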

PISA supports a selection of file and content parsers. The parsing tool accepts the input formats of many standard IR collections, such as ClueWeb09⁶, ClueWeb12⁷, GOV2⁸, Robust04⁹, Washington Post¹⁰, and New York Times¹¹. We also provide an HTML content parser, and the Porter2 [31] stemming algorithm for the English language. As discussed in Section 2.1, PISA is designed to allow new components, such as parsers or stemmers, to be plugged in with low implementation overhead.

As part of a forward index, we also encode a term lexicon. This is simply a mapping between strings and numerical IDs. We represent it as a payload vector. The structure is divided into two memory areas: the first one is an array of integers representing payload offsets, while the other stores the payloads (strings). This representation allows us to quickly retrieve a word at a given position (which determines its ID) directly from disk using memory mapping. We achieve string lookups by assigning term IDs in lexicographical order and performing binary search. Note that reassigning term IDs requires little overhead, as it is applied directly when a number of small index batches are merged together. This design decision enables us to provide a set of command line tools to quickly access index data without unnecessary index loading overhead. Document titles (such as TREC IDs) are stored using the same structure but without sorting them first, as the order of the documents is established via an index reordering stage as described below.

⁶ https://lemurproject.org/clueweb09/
⁷ https://lemurproject.org/clueweb12/
⁸ http://ir.dcs.gla.ac.uk/test_collections/gov2-summary.htm
⁹ https://trec.nist.gov/data/robust/04.guidelines.html
¹⁰ https://trec.nist.gov/data/wapost/
¹¹ https://catalog.ldc.upenn.edu/LDC2008T19
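The payload vector described above can be sketched as follows. This is a hypothetical layout chosen for illustration (a 64-bit count, then 64-bit offsets, then the concatenated strings); PISA's actual on-disk format may differ, and the names `encode_payload_vector` and `PayloadVector` are not taken from the library. The reader works on any buffer object, so the same code would also accept a memory-mapped file.

```python
import bisect
import struct
from typing import List, Optional


def encode_payload_vector(strings: List[str]) -> bytes:
    """Serialize as: count, (count + 1) offsets, then concatenated payloads."""
    payloads = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for payload in payloads:
        offsets.append(offsets[-1] + len(payload))
    header = struct.pack(f"<Q{len(offsets)}Q", len(payloads), *offsets)
    return header + b"".join(payloads)


class PayloadVector:
    """Read-only view over the serialized bytes; payloads are decoded lazily."""

    def __init__(self, buffer) -> None:
        self._buf = memoryview(buffer)
        (self._size,) = struct.unpack_from("<Q", self._buf, 0)
        # Payload area starts after the count and the (size + 1) offsets.
        self._base = 8 * (self._size + 2)

    def __len__(self) -> int:
        return self._size

    def __getitem__(self, index: int) -> str:
        if not 0 <= index < self._size:
            raise IndexError(index)
        begin, end = struct.unpack_from("<2Q", self._buf, 8 * (1 + index))
        raw = self._buf[self._base + begin:self._base + end]
        return bytes(raw).decode("utf-8")

    def find(self, term: str) -> Optional[int]:
        """Binary search for a term; only valid if the payloads are sorted."""
        position = bisect.bisect_left(self, term)
        if position < self._size and self[position] == term:
            return position
        return None


terms = ["brown", "dog", "fox", "lazy", "quick", "the"]  # already sorted
term_lexicon = PayloadVector(encode_payload_vector(terms))
print(term_lexicon[4], term_lexicon.find("lazy"), term_lexicon.find("cat"))
```

Looking up an ID reads two offsets and one payload, and looking up a string touches only the entries visited by the binary search, which is why no up-front loading step is needed.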

The entire parsing process is performed in parallel when executed on a multi-core architecture. The forward index can be created under tight memory constraints by dividing the corpus and processing it in batches, and then merging the resulting forward indexes at the end. Currently, PISA only supports merging forward indexes together prior to generating the canonical inverted index. However, future work will aim to allow index updates through efficient inverted index merging operations.

Indexing. Once the parsing phase is complete, the forward index containing a collection can be used to build an inverted index in a process called inverting. The product of this phase is an inverted index in the canonical format. This representation is very similar to the forward index, but in reverse: it is a collection of terms, each containing a list of document IDs. The canonical representation is stored in an uncompressed and universally readable format, which simply uses binary sequences to represent lists. There are a few advantages of storing the canonical representation. Firstly, it allows various transformations, such as document reordering or index pruning, to be performed on the index before storing it in its final compressed form. Secondly, it allows for different final representations to be built rapidly, such as indexes that use different compression algorithms. Thirdly, it allows the PISA system to be used to create an inverted index without requiring the PISA system to be used beyond this point, making it easy for experimenters to import the canonical index into their own IR system.
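The inverting step can be illustrated with a small in-memory sketch. It assumes the forward index is a list of term-ID lists (one per document) and returns, for each term, parallel lists of document IDs and in-document frequencies; the function name `invert` and this in-memory representation are assumptions of the example, not the canonical binary format itself.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Postings = Tuple[List[int], List[int]]  # (document IDs, frequencies)


def invert(forward: List[List[int]]) -> Dict[int, Postings]:
    """Turn a forward index into term -> (sorted doc IDs, frequencies)."""
    counts: Dict[int, Dict[int, int]] = defaultdict(dict)
    # Documents are visited in increasing ID order, so each term's
    # document list is produced already sorted.
    for doc_id, term_ids in enumerate(forward):
        for term in term_ids:
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {
        term: (list(per_doc.keys()), list(per_doc.values()))
        for term, per_doc in sorted(counts.items())
    }


forward_index = [[5, 4, 0, 2], [5, 3, 1], [4, 4, 1]]
inverted = invert(forward_index)
print(inverted[4])
```

Because the output is plain lists of integers, later stages such as reordering or compression can consume it without knowing how it was produced.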

Document Reordering. Document reordering corresponds to reassigning the document identifiers within the inverted index [ 4]. It generally aims to minimize the cost of representing the inverted index with respect to storage consumption. However, reordering can also be used to minimize other cost functions, such as query processing speed [ 41 ]. Interestingly, optimizing for space consumption has been empirically shown to speed up query processing [ 14, 15, 24 ], making document reordering an attractive step during indexing. In theory, index reordering can take place either on an existing inverted index, or before the inverted index is constructed. PISA opts to use the canonical index as both input and output for the document reordering step, as this allows a generic reordering scheme to be used which can be easily extended to various reordering techniques, and allows the reordering functionality to be used without requiring further use of PISA.

Many approaches for document reordering exist, including random ordering, reordering by URL [33], MinHash ordering [6, 9], and recursive graph bisection [13]. PISA supports efficient index reordering for all of the above techniques [21].
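The effect of reordering on index size can be seen in a toy example. The sketch below applies a hand-picked permutation of document IDs and compares the total number of bits needed to write the gaps between consecutive IDs, which is only a rough proxy for compressed size. The cost function, the permutation, and the names `gap_bits` and `reorder` are illustrative assumptions; none of the reordering algorithms listed above is implemented here.

```python
from typing import Dict, List


def gap_bits(postings: Dict[int, List[int]]) -> int:
    """Sum of the bit lengths of all document gaps (a crude size proxy)."""
    total = 0
    for doc_ids in postings.values():
        previous = -1
        for doc_id in doc_ids:
            total += (doc_id - previous).bit_length()
            previous = doc_id
    return total


def reorder(postings: Dict[int, List[int]], new_id: List[int]) -> Dict[int, List[int]]:
    """Reassign IDs (old ID d becomes new_id[d]) and re-sort every list."""
    return {
        term: sorted(new_id[doc_id] for doc_id in doc_ids)
        for term, doc_ids in postings.items()
    }


# Two terms whose documents interleave under the original ordering.
postings = {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
# A permutation that places documents sharing a term next to each other.
new_id = [0, 4, 1, 5, 2, 6, 3, 7]
clustered = reorder(postings, new_id)
print(gap_bits(postings), gap_bits(clustered))
```

Clustering similar documents makes the gaps smaller, which is the property that the reordering techniques above try to achieve at scale.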

Index Compression. Given the extremely large collections indexed by current search engines, even a single node of a large search cluster typically contains many billions of integers in its index structure. In particular, compression of posting lists is of utmost importance, as they account for much of the data size and access costs. Compressing an inverted index introduces a twofold advantage over a non-compressed representation: it reduces the size of the index, and it allows us to better exploit the memory hierarchy, which consequently speeds up query processing.

Compression allows the space requirements of indexes to be substantially decreased without loss of information. The simple and extensible design of PISA allows for new compression approaches to be plugged in easily. As such, a range of state-of-the-art compression schemes are currently supported, including variable byte encoders (VarIntGB [12], VarInt-G8IU [34], MaskedVByte [30], and StreamVByte [17]), word-aligned encoders (Simple8b [2], Simple16 [43], QMX [35, 37], and SIMD-BP128 [16]), monotonic encoders (Interpolative [25], EF [40], and PEF [27]), and frame-of-reference encoders (Opt-PFD [42]).

Oftentimes, the choice of encoding depends on both the time and space constraints, as compression schemes usually trade off space efficiency for either encoding or decoding performance, or both. We refer the reader to [24] for more details.
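As a minimal illustration of postings compression, the sketch below encodes the gaps of a sorted document-ID list with plain variable-byte coding. This is the textbook scalar scheme, not one of the optimized variants listed above, and the convention of marking the final byte of each value with the high bit is one of several in use. The function names are assumptions of the example.

```python
from typing import List


def vbyte_encode(doc_ids: List[int]) -> bytes:
    """Encode strictly increasing document IDs as variable-byte gaps."""
    out = bytearray()
    previous = -1
    for doc_id in doc_ids:
        gap = doc_id - previous  # at least 1 for strictly increasing IDs
        previous = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)  # low 7 bits, more bytes follow
            gap >>= 7
        out.append(gap | 0x80)  # high bit marks the last byte of a value
    return bytes(out)


def vbyte_decode(data: bytes) -> List[int]:
    """Invert vbyte_encode, rebuilding document IDs from the gaps."""
    doc_ids = []
    previous, gap, shift = -1, 0, 0
    for byte in data:
        if byte & 0x80:
            gap |= (byte & 0x7F) << shift
            previous += gap
            doc_ids.append(previous)
            gap, shift = 0, 0
        else:
            gap |= byte << shift
            shift += 7
    return doc_ids


ids = [3, 4, 130, 20000]
encoded = vbyte_encode(ids)
print(len(encoded), vbyte_decode(encoded))
```

Small gaps fit in a single byte, so the reordering step described earlier directly reduces the encoded size under a scheme like this one.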

Scoring. Currently, BM25 [32] is used as the weighting model for ranked retrieval. BM25 is a simple yet effective ranker for processing bag-of-words queries, and promotes effective dynamic pruning [28]. Given a document d and a query q, we use the following formulation of BM25:

\[ \mathrm{BM25}(d, q) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \mathrm{TF}(d, t), \tag{1} \]

\[ \mathrm{IDF}(t) = \log\left( \frac{N - f_t + 0.5}{f_t + 0.5} \right), \tag{2} \]

\[ \mathrm{TF}(d, t) = \frac{f_{d,t} \cdot (k_1 + 1)}{f_{d,t} + k_1 \cdot (1 - b + b \cdot l_d / l_{\mathrm{avg}})}, \tag{3} \]

where $N$ is the number of documents in the collection, $f_t$ is the document frequency of term $t$, $f_{d,t}$ is the frequency of $t$ in $d$, $l_d$ is the length of document $d$, and $l_{\mathrm{avg}}$ is the average document length. We set parameters $k_1 = 0.9$ and $b = 0.4$, as described by Trotman et al. [36]. For a more exhaustive examination of BM25 variants, we refer the reader to the work by Trotman et al. [38].

Search. Because PISA supports document-ordered index organization, both Boolean and scored conjunctions or disjunctions can be evaluated, exploiting either a Document-at-a-Time or a Term-at-a-Time index traversal strategy.
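As a rough illustration of the BM25 formulation above combined with Document-at-a-Time traversal, the following sketch evaluates a disjunctive top-k query exhaustively, without any dynamic pruning. The in-memory postings layout (lists of (document ID, frequency) pairs), the use of the list length as the document frequency, and the names `idf`, `tf`, and `daat_or` are assumptions made for the example and do not reflect PISA's API.

```python
import heapq
import math
from typing import List, Tuple

K1, B = 0.9, 0.4  # parameter values given in the text


def idf(n_docs: int, f_t: int) -> float:
    return math.log((n_docs - f_t + 0.5) / (f_t + 0.5))


def tf(f_dt: int, l_d: int, l_avg: float) -> float:
    return f_dt * (K1 + 1) / (f_dt + K1 * (1 - B + B * l_d / l_avg))


def daat_or(lists: List[List[Tuple[int, int]]], doc_lens: List[int],
            k: int) -> List[Tuple[float, int]]:
    """Score every document containing any query term; keep the k best.

    `lists` holds one postings list per query term, sorted by document ID.
    Returns (score, doc_id) pairs, best first.
    """
    n_docs = len(doc_lens)
    l_avg = sum(doc_lens) / n_docs
    weights = [idf(n_docs, len(postings)) for postings in lists]
    cursors = [0] * len(lists)
    heap: List[Tuple[float, int]] = []  # min-heap holding the current top-k
    while True:
        heads = [p[c][0] for p, c in zip(lists, cursors) if c < len(p)]
        if not heads:
            break
        doc = min(heads)  # next document in ID order across all lists
        score = 0.0
        for i, postings in enumerate(lists):
            c = cursors[i]
            if c < len(postings) and postings[c][0] == doc:
                score += weights[i] * tf(postings[c][1], doc_lens[doc], l_avg)
                cursors[i] += 1
        if len(heap) < k:
            heapq.heappush(heap, (score, doc))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc))
    return sorted(heap, key=lambda entry: (-entry[0], entry[1]))


doc_lens = [10] * 8  # eight documents of equal length
lists = [[(0, 1), (3, 2)], [(3, 1), (5, 1)]]  # two query terms
top = daat_or(lists, doc_lens, k=2)
print(top)
```

The smallest score in the heap acts as the entry threshold for later documents, which is the quantity that dynamic pruning algorithms compare against score upper bounds.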

Furthermore, PISA supports a range of state-of-the-art dynamic pruning algorithms such as MaxScore [ 39 ] and WAND [5], and their Block-Max counterparts, Block-Max MaxScore (BMM) [7] and Block-Max WAND (BMW) [14].

In order to facilitate these dynamic pruning algorithms, an auxiliary index metadata structure must be built, which stores the required upper-bound score information to enable efficient dynamic pruning. It can be built per postings list (for algorithms like WAND and MaxScore), or for each fixed-sized block (for the Block-Max variants). In addition, variable-sized blocks can be built (in lieu of fixed-sized blocks) to support the variable-block family of Block-Max algorithms listed above, such as Variable Block-Max WAND (VBMW) [22, 23]. Ranked conjunctions are also supported using the Ranked AND or (Variable) Block-Max Ranked AND (BMA) [14] algorithms.

Indeed, the logical blocks stored in the index metadata are decoupled from the compressed blocks inside the inverted index. The advantage of storing the metadata independently from the inverted index is that the metadata depends only on the canonical index, needs to be computed only once, and does not change if the underlying compression codec is changed.
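The kind of upper-bound metadata discussed above can be sketched in a few lines. The example assumes that the per-posting scores of each list are already known, computes one bound per list and one bound per fixed-sized block, and uses them in a simplified skip test. Real Block-Max algorithms align blocks by document ID ranges and move cursors accordingly; that logic, as well as variable-sized blocks, is omitted, and the names `block_max_scores` and `can_skip` are hypothetical.

```python
from typing import List


def block_max_scores(scores: List[float], block_size: int) -> List[float]:
    """Maximum score within each consecutive block of postings."""
    return [max(scores[i:i + block_size])
            for i in range(0, len(scores), block_size)]


def can_skip(upper_bounds: List[float], threshold: float) -> bool:
    """True if the summed bounds cannot strictly exceed the top-k threshold."""
    return sum(upper_bounds) <= threshold


scores_a = [0.2, 1.5, 0.3, 0.1, 0.4, 0.2]  # per-posting scores, term a
scores_b = [0.5, 0.1, 0.9, 0.3]            # per-posting scores, term b

list_max = [max(scores_a), max(scores_b)]  # one bound per postings list
blocks_a = block_max_scores(scores_a, 3)
blocks_b = block_max_scores(scores_b, 3)

threshold = 1.0  # score of the current k-th best document
print(can_skip(list_max, threshold))
print(can_skip([blocks_a[1], blocks_b[1]], threshold))
```

The list-wide bounds cannot rule anything out in this example, whereas the bounds of the second blocks can, which is the extra pruning power that block-level metadata provides. Since the bounds are derived from scores rather than from the encoded bytes, they remain valid whichever compression codec is used.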

PISA provides two ways to experiment with query retrieval. The first one performs end-to-end search for a given list of queries, and prints out the results in the TREC format. It can also be used to evaluate query speed, as was done for this workshop. Additionally, we provide a more granular approach, which focuses on comparing different retrieval methods directly. Here, we only report the time to fetch posting lists and perform search, excluding lexicon lookups and printing results to the standard output or a file. We encourage the interested reader to refer to PISA's documentation for more details about running experiments.¹²

3 REPRODUCIBLE EXPERIMENTATION

In the spirit of OSIRRC, we utilize the software and metrics made available by the organizers, including the jig¹³ and the trec_eval¹⁴ tool. In addition, we have decided to provide some further information and reference experiments that we consider important.

3.1 Default Runs

Given the many possibilities for the various components of the PISA system, we now outline the default system configuration for the OSIRRC workshop. Further information can be found in the documentation of our OSIRRC system.¹⁵ Note that the block size for the variable-block indexes depends on a parameter λ [22]. In order to get the desired average block size for the variable blocks, the value of λ was searched for offline, and differs for each collection. For convenience, we tabulate the values of λ in Table 1. Note that such values of λ only apply when using the same parsing, stemmer, and reordering as listed below.

• Parsing: Gumbo¹⁶ with Porter2 stemming; no stopwords removed. We discard the content of any document with over 1,000 HTML parsing errors.
• Reordering: Recursive graph bisection. We optimize the objective function using the posting lists of length at least 4,096.
• Compression: SIMD-BP128 with a posting list block size of 128.
• Scoring: BM25 with k1 = 0.9 and b = 0.4.
• Index Metadata: Variable blocks with a mean block size of 40 ± 0.5.
• Index Traversal: Variable Block-Max WAND.

3.2 Experimental Setup

Now, we briefly outline the experimental setup and the resources used for our experimentation.

Datasets. We performed our experiments on the following text collections:

• Robust04 consists of newswire articles from a variety of sources from the late 1980s through to the mid-1990s.

¹² https://pisa.readthedocs.io/en/latest/
¹³ https://github.com/osirrc/jig/
¹⁴ https://github.com/usnistgov/trec_eval
¹⁵ https://github.com/osirrc/pisa-docker
¹⁶ https://github.com/google/gumbo-parser

Some quantitative properties of these collections are summarized in Table 2. The first three are relatively small, and contain newswire data. The remaining corpora are significantly larger, containing samples of the Web. Thus, the latter two should be more indicative of any differences in query efficiency. In fact, each of these can be thought of as representing a single shard in a large distributed search system.

Test queries. Each given collection contains a set of test queries from various TREC tracks which we use to validate the effectiveness of our system. These queries are described in Table 3.

Testing details. All experiments were conducted on a machine with two Intel Xeon E5-2667 v2 CPUs, with a total of 32 cores clocked at 3.30 GHz with 256 GiB RAM running Linux 4.15.0. Furthermore, the experiments presented here are deployed within the Docker framework. Although we believe that this may cause a slight reduction in the efficiency of the presented algorithms, we preserve this setup in the spirit of the workshop and comparability. We leave further investigation of the potential overhead of Docker containers as future work.

A note on ClueWeb12. In preliminary experiments, we found that the memory consumption for reordering the ClueWeb12 index was high, which slowed down the indexing process considerably. Thus, we opted to skip reordering the ClueWeb12 collection in the following experiments, and our results are reported on an index that uses the default (crawl) order. Since index order impacts the value of λ, we use λ = 26, which results in variable block metadata with a mean block size in the desired range of 40 ± 0.5. Note that this value differs from the one reported in Table 1, which is correct if reordering is applied based on Recursive Graph Bisection (see Section 3.1).

3.3 Results and Discussion

We now present our reference experiments, which involve end-to-end processing of each given collection.

Indexing and Compression. The HTML content of each document was extracted with the Gumbo parser. We then extracted three kinds of tokens: alphanumeric strings, acronyms, and possessives, which were then stemmed using the Porter2 algorithm. We reordered documents using the recursive graph bisection algorithm, which is known to improve both compression and query performance [13, 21, 24]. Then we compressed the index with SIMD-BP128 encoding, which has been shown to exhibit one of the best space-speed trade-offs [24].

Table 4 summarizes indexing times broken down into individual phases, while Table 5 shows compressed inverted index sizes as well as average numbers of bits used to encode document gaps and frequencies. The entire building process was executed with 32 cores available; however, at the time of writing, only some parts of the pipeline support parallel execution. We also note that the index reordering step is usually the most expensive step in our indexing pipeline. If a fast indexing time is of high importance, this step can be omitted, as we did for ClueWeb12. Alternatively, less expensive reordering operations can be used. However, skipping the index reordering stage (or using a less effective reordering technique) will result in a larger inverted index and less efficient query-time performance.

System Effectiveness. Next, we outline the effectiveness of the PISA system. In particular, we process rank-safe, disjunctive, top-k queries to depth k = 1,000. Since processing is rank-safe, all of the disjunctive index traversal algorithms result in the same top-k set. Table 6 reports the effectiveness for Mean Average Precision (MAP), Precision at rank 30 (P@30), and Normalized Discounted Cumulative Gain at rank 20 (NDCG@20).
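For readers unfamiliar with these measures, the sketch below computes precision at a cutoff and average precision for a single ranked list with binary relevance judgments; MAP is the mean of the average precision over all queries. It is meant only to illustrate the definitions. The reported numbers come from trec_eval, and the function names and toy data here are assumptions of the example.

```python
from typing import List, Set


def precision_at(ranking: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the ranks of retrieved relevant documents,
    normalized by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2", "d9"}  # d9 is relevant but was never retrieved
print(precision_at(ranking, relevant, 2), average_precision(ranking, relevant))
```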

Query Efficiency. To measure the efficiency of query processing, we measure how long it takes to process the entire query log for each collection. We use 32 threads to concurrently retrieve the top-k documents for all queries using either the MaxScore or the VBMW algorithm, with a single thread processing a single query at a time. MaxScore has been shown to outperform other algorithms for large values of k on the Gov2 and ClueWeb09 collections [24]. Table 7 shows the results. While MaxScore usually outperforms VBMW, we did not optimize the block size of the index metadata, so comparisons should be made with caution. Indeed, VBMW is likely to outperform MaxScore with optimized blocks and small values of k. For a more detailed analysis of per-query latency within PISA, we refer the interested reader to the recent work by Mallia et al. [24].

3.4 Discussion

PISA is built for performance. We are able to rapidly process each query set thanks to efficient document retrieval algorithms and extremely fast compression. On the other hand, as we have shown, SIMD-BP128 encoding also exhibits a reasonable compression ratio, which allows us to store the index in main memory. We encourage the reader to study the work by Mallia et al. [24] for more information about query efficiency under different retrieval and compression methods.

At the present moment, our query retrieval is tailored towards fast candidate selection, as we lack any complex ranking functionality, such as a learning-to-rank document reranking stage. However, the effectiveness we obtain using BM25 is consistent with other results found in the literature [19].

Furthermore, we provide a generic index building pipeline, which can be easily customized to one's needs. We offload most of the computationally intensive operations onto the initial stages of indexing to speed up experiments with many configurations; in particular, this lets us deliver additional indexes with different integer encodings quickly and easily.

As per the workshop rules, we deliver a Docker image, which reproduces the presented results. Note that the initial version of the image was derived from an image with a precompiled distribution of PISA. However, we quickly discovered this solution was not portable. The source of our issues was compiling the code with AVX2 support. Once compiled, the binaries could not be executed on a machine not supporting AVX2. One solution could be to cross-compile and provide different versions of the image. However, we chose to simply distribute the source code to be compiled at the initial stage of an experimental run.

4 FUTURE PLANS

Despite its clear strengths, PISA is still a relatively young project, aspiring to become a more widely used tool for IR experimentation. We recognize that many relevant features can still be developed to further enrich the framework. We have every intention of pursuing these in the near future.

An obvious direction is to continue our work on query performance. For instance, we intend to support precomputing quantized partial scores in order to further improve candidate selection performance [11]. We are also considering implementing other traversal techniques, including known approaches, such as Score-at-a-Time methods [ 1, 20 ], as well as novel techniques.

The next step would be to implement more complex document rankings based on learning-to-rank. Many of the data structures required for feature extraction are indeed already in place. We would also like to enhance our query retrieval pipeline with ranking cascades that are capable of applying learned models [8].

          Robust04  Core17  Core18  Gov2    ClueWeb09  ClueWeb12
Parse     0.4776    0.5487  0.4680  0.2521  0.2507     0.2100

Other planned features include query expansion, content extraction (template detection, boilerplate removal), sharding, and distributed indexes. Work on some of these has in fact already started.

5 CONCLUSION

PISA is a relative newcomer on the scene of open source IR software, yet it has already proven its many benefits, including a flexible design which is specifically tailored for use in research. Indeed, PISA has been successfully used in several recent research papers [21, 23, 24, 29].

One of the indisputable advantages of PISA is its extremely fast query execution, achieved by careful optimization and the zero-cost abstractions of C++. Furthermore, it supports a multitude of state-of-the-art compression and query processing techniques that can be used interchangeably.

Although there are still several shortcomings, these are mostly due to the project’s young age, and we hope to address these very soon. Furthermore, we plan to continue enhancing the system with novel solutions. Indeed, a good amount of time has been spent on PISA to provide a high quality experimental IR framework, not only in terms of performance, but also from a software engineering point of view. We use modern technologies and libraries, continuous integration, and test suites to ensure the quality of our code, and the correctness of our implementations.

We encourage any interested researchers to get involved with the PISA project.

ACKNOWLEDGEMENTS

This research was supported by NSF Grant IIS-1718680, a grant from Amazon, and the Australian Research Training Program Scholarship.

REFERENCES

[1] V. N. Anh, O. de Kretser, and Moffat. 2001. Vector-space ranking with effective early termination. In Proc. SIGIR. 35-42.

[2] V. N. Anh and Moffat. 2010. Index compression using 64-bit words. Soft. Prac. & Exp. 40, 2 (2010), 131-147.

[3] T. G. Armstrong, Moffat, Webber, and Zobel. 2009. Improvements that don't add up: Ad-hoc Retrieval Results since 1998. In Proc. CIKM. 601-610.

[18] J. Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. SIGIR Forum 52, 2 (2019), 40-51.

[19] Lin, Crane, Trotman, Callan, I. Chattopadhyaya, Foley, G. Ingersoll, Macdonald, and Vigna. 2016. Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. In Proc. ECIR. 408-420.

[20] Lin and Trotman. 2015. Anytime Ranking for Impact-Ordered Indexes. In Proc. ICTIR. 301-304.

[21] Mackenzie, Mallia, Petri, J. S. Culpepper, and Suel. 2019. Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study. In Proc. ECIR. 339-352.

[22] Mallia, Ottaviano, Porciani, Tonellotto, and Venturini. 2017. Faster BlockMax WAND with Variable-sized Blocks. In Proc. SIGIR. 625-634.

[23] Mallia and Porciani. 2019. Faster BlockMax WAND with longer skipping. In Proc. ECIR. 771-778.

[24] Mallia, Siedlaczek, and Suel. 2019. An Experimental Study of Index Compression and DAAT Query Processing Methods. In Proc. ECIR. 353-368.

[25] Moffat and Stuiver. 2000. Binary Interpolative Coding for Effective Index Compression. Inf. Retr. 3, 1 (2000), 25-47.

[26] Ottaviano, Tonellotto, and Venturini. 2015. Optimal Space-time Tradeoffs for Inverted Indexes. In Proc. WSDM. 47-56.

[27] Ottaviano and Venturini. 2014. Partitioned Elias-Fano indexes. In Proc. SIGIR. 273-282.

[28] Petri, J. S. Culpepper, and Moffat. 2013. Exploring the Magic of WAND. In Proc. Aust. Doc. Comp. Symp. 58-65.

[29] Petri, Moffat, Mackenzie, J. S. Culpepper, and Beck. 2019. Accelerated Query Processing Via Similarity Score Prediction. In Proc. SIGIR. To appear.

[30] Plaisance, Kurz, and Lemire. 2015. Vectorized VByte Decoding. In Int. Symp. Web Alg.

[31] M. F. Porter. 1997. Readings in IR. Chapter An Algorithm for Suffix Stripping, 313-316.

[32] S. E. Robertson, Walker, Jones, Hancock-Beaulieu, and Gatford. 1994. Okapi at TREC-3. In Proc. TREC.

[33] Silvestri. 2007. Sorting out the Document Identifier Assignment Problem. In Proc. ECIR. 101-112.

[34] A. A. Stepanov, A. R. Gangolli, D. E. Rose, R. J. Ernst, and P. S. Oberoi. 2011. SIMD-based Decoding of Posting Lists. In Proc. CIKM. 317-326.

[35] Trotman. 2014. Compression, SIMD, and Postings Lists. In Proc. Aust. Doc. Comp. Symp. 50-57.

[36] Trotman, X-F. Jia, and Crane. 2012. Towards an efficient and effective search engine. In Wkshp. Open Source IR. 40-47.

[37] Trotman and Lin. 2016. In Vacuo and In Situ Evaluation of SIMD Codecs. In Proc. Aust. Doc. Comp. Symp. 1-8.

[38] Trotman, Puurula, and Burgess. 2014. Improvements to BM25 and Language Models Examined. In Proc. Aust. Doc. Comp. Symp. 58-65.

[39] H. R. Turtle and Flood. 1995. Query Evaluation: Strategies and Optimizations. Inf. Proc. & Man. 31, 6 (1995), 831-850.

[40] Vigna. 2013. Quasi-succinct indices. In Proc. WSDM. 83-92.

[41] Wang and Suel. 2019. Document Reordering for Faster Intersection. Proc. VLDB 12, 5 (2019), 475-487.

[42] Yan, Ding, and Suel. 2009. Inverted index compression and query processing with optimized document ordering. In Proc. WWW. 401-410.

[43] Zhang, Long, and Suel. 2008. Performance of Compressed Inverted List Caching in Search Engines. In Proc. WWW. 387-396.