=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_26
|storemode=property
|title=On the Reproducibility of Experiments of Indexing Repetitive Document Collections
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_26.pdf
|volume=Vol-2621
|authors=Antonio Fariña,Miguel A. Martínez-Prieto,Francisco Claude,Gonzalo Navarro,Juan J. Lastra-Díaz,Nicola Prezza,Diego Seco
|dblpUrl=https://dblp.org/rec/conf/circle/FarinaMCNLPS20
}}
==On the Reproducibility of Experiments of Indexing Repetitive Document Collections==
Antonio Fariña (antonio.farina@udc.es), Universidade da Coruña, CITIC, A Coruña, Spain
Miguel A. Martínez-Prieto (migumar2@infor.uva.es), University of Valladolid, Segovia, Spain
Francisco Claude (fclaude@recoded.cl), Universidad Diego Portales, Santiago, Chile
Gonzalo Navarro (gnavarro@dcc.uchile.cl), IMFD, DCC, University of Chile, Santiago, Chile
Juan J. Lastra-Díaz (jlastra@invi.uned.es), UNED, Madrid, Spain
Nicola Prezza (nicola.prezza@gmail.com), University of Pisa, Pisa, Italy
Diego Seco (dseco@udec.cl), IMFD, University of Concepción, Concepción, Chile

ABSTRACT

We summarize an already published work [3]: a companion paper that aims at allowing the exact replication of the methods, experiments, and results discussed in a previous work [2], where we proposed several techniques for compressing indexes that exploit the fact that highly repetitive collections are formed mostly of documents that are near-copies of each other. In this work, we present our replication framework (uiHRDC, available at https://github.com/migumar2/uiHRDC/), which permits replicating the actual experimental setup from our parent paper with little effort. The experimentation was carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, we also provided uiHRDC as a reproducibility package.

CCS CONCEPTS

• Information systems → Data compression; Search index compression.

KEYWORDS

Repetitive document collections, inverted index, self-index, reproducibility.

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION

Scientific advances have typically been disseminated in the form of scientific publications. Publications in Computer Science usually present a new technique, algorithm, etc. that aims at improving upon the state of the art, and experimental results are typically provided as evidence of the paper's achievements. Therefore, having public access to the original authors' setup (tools/source code, datasets, tuning parameters, etc.) used to drive such experimental evaluation becomes crucial to assess the validity of the results obtained, and to allow other researchers to reuse this previous research as a baseline against which to compare their future techniques.

According to the ACM [1], an experiment is reproducible only if an independent group of researchers can obtain the same results using artifacts that they developed independently. The ACM also defines replicability as a weaker type of reproducibility, where independent researchers must obtain the same results when using the artifacts provided by the original work's authors. Finally, repeatability refers to the possibility that the original authors of a work can obtain the same results when reusing the original setup (same conditions, system, etc.); i.e., they can repeat the experiments and obtain the same results.

The conclusion is that research works must be at least repeatable, and that achieving replicability is desirable. Yet, this is not always simple: we must not only have the source code, but also sometimes match the exact versions of external dependencies, tune the parameters of the tools (such tuning is sometimes difficult to obtain from the original paper), or even keep the original computer without software/hardware updates. In our case, this was accomplished with a Docker-based solution.

2 OUR PROPOSAL

We focused on making the experiments of our previous work [2] replicable. In that parent paper, we tackled the problem of indexing repetitive document collections. We showed that the existing compressed posting-list representations commonly used for positional and non-positional inverted indexes are not well suited to deal with highly repetitive collections. We provided new compressed posting-list representations (using run-length, Lempel-Ziv, and grammar-based compression) that improved upon the state of the art, and also included self-indexes in the comparisons. In our reproducible paper [3] we focused on:

• Briefly enumerating the 10 posting-list representations included in the parent paper and, when applicable, discussing the parameters of those techniques and the actual values used to tune them. Those techniques are: rice, vbyte (with three variants), Simple9, PforDelta, QMX-coding, rice-runs, vbyte-Lzma, vbyte-Lzend, and Repair (with four variants). For all these techniques we used our own implementation of inverted indexes. We also included the implementation of 4 techniques provided by a more recent work [4] (partitioned-EliasFano, Opt-PForDelta, Interpolative-codes, and Varint-G8IU).

• Briefly enumerating the 6 self-indexing techniques used: RLCSA, SLP, WCSA, WSLP, Lz77-index, and LzEnd-index. We also showed how they were tuned in the original experiments.

• Describing the original experimental framework, including both the datasets and the query patterns used, and finally describing how those experiments were run. We provided the actual source code for all the techniques compared and instructions on how to use them. This included the actual scripts used to create the different indexes at build time, and the scripts that led the search experiments. Both datasets and source code were also made available.

• Clearly describing the variables being measured; that is, the memory footprint of the different techniques (space requirements) and the CPU execution time at query time, particularly when performing the operations locate and extract with the available indexing/self-indexing techniques.

• Fixing some errors that were detected in the parent paper.

• Providing our reproducibility framework uiHRDC, which permits re-running all the experiments from the parent paper and gathering all the experimental results. It is discussed in Section 2.1.

• Finally, using uiHRDC to automatically run the experiments on a new/different computer, and showing the results obtained. We discussed how those results permitted us to assess that the new results actually lead to the same conclusions as those in the parent paper.

2.1 uiHRDC Framework

The main contribution of this work was the careful creation of our uiHRDC (universal indexes for Highly Repetitive Document Collections) reproducibility framework. We included in it: (1) the source code and all the dependencies (with the actual versions) required to build the test programs; (2) the document datasets and query patterns used; (3) scripts to compile the source code for each technique; (4) scripts to build/create the compressed inverted indexes and self-indexes over the document collections (such indexing/self-indexing structures are finally saved to disk); (5) scripts that load each representation from disk into memory and perform queries over it, with each run saving both space and time measurements into text result files; (6) a script to compile all the source code, build all the indexes, and perform all the querying experiments (it includes repeatedly launching steps 3 → 4 → 5 for each technique); (7) a script that first collects all the result files (in some cases these are gnuplot data files; in other cases they are simple text files containing the standard output of the search programs, which we parse with Python scripts to create well-formed gnuplot data files) and then creates all the figures from the parent paper using exactly the same styles, to simplify the comparison between both versions; and (8) a script that collects all those figures and, using a LaTeX template, generates a unique report in PDF format. In such a report we also provide information regarding the machine on which the results were obtained (CPU, memory, OS version, etc.), among other details. Figure 1 illustrates the workflow in uiHRDC.

[Figure 1: Workflow in uiHRDC to reproduce experiments. Source code, document collection, and queries feed the indexing and search steps (1–8); the resulting result files and figures are assembled, via a LaTeX template, into the final PDF report.]

Finally, we used Docker as the tool to provide a fully working environment in which we tested our uiHRDC framework, hence guaranteeing replicability. We provided a Docker configuration file (Dockerfile) with instructions to create a container that exactly reproduces the configuration of the test framework and deploys uiHRDC. Basically, it installs Ubuntu 14.04 (even though we also tested it successfully on Ubuntu 16.04) with gcc 4.8, and installs everything from libraries such as libboost to tools such as cmake, screen/byobu, texlive, gnuplot, or even an SSH server to make ssh/sftp connections possible. We provide simple instructions to build our Docker image, launch and connect to a Docker instance (this is really simple, as the user can connect by ssh as if the container were a running Linux server), and then execute a script to deploy uiHRDC, run all the experiments, and get the final PDF report using an sftp connection.

3 CONCLUSIONS

We have briefly described all the techniques and their parameters from our original research [2]. We have also described the reproducibility framework uiHRDC, which includes the datasets, query sets, and source code (with dependencies), as well as a set of scripts to automate the execution of all the experiments from [2]. These scripts also automate the generation of a final PDF report in which we collect the results (space/query time) obtained by each technique and create all the figures from the parent paper. Finally, we provide a Docker container that reproduces our test environment to ensure that all our experiments can be replicated with little effort on any computer that has Docker installed and matches our minimal RAM/disk requirements.

4 ACKNOWLEDGEMENTS

Supported by Xunta de Galicia/FEDER-UE [CSI: ED431G 2019/01 and GRC: ED431C 2017/58]; by Xunta de Galicia Conecta-Peme 2018 [Gema: IN852A 2018/14]; and by MCIU-AEI/FEDER-UE [Datos 4.0: TIN2016-78011-C4-1-R; BIZDEVOPS: RTI2018-098309-B-C32].

REFERENCES

[1] ACM. 2018. Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-badging.
[2] Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, and Gonzalo Navarro. 2016. Universal Indexes for Highly Repetitive Document Collections. Information Systems 61 (2016), 1–23. https://doi.org/10.1016/j.is.2016.04.002
[3] Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, and Diego Seco. 2019. On the Reproducibility of Experiments of Indexing Repetitive Document Collections. Information Systems 83 (2019), 181–194. https://doi.org/10.1016/j.is.2019.03.007
[4] Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano Indexes. In Proc. 37th Int. ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, NY, USA, 273–282. https://doi.org/10.1145/2600428.2609615
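Step (6) of the uiHRDC workflow repeatedly launches steps 3 → 4 → 5 (compile, build index, run queries) for each technique. A minimal dry-run sketch of that kind of orchestration, in which the script names (`compile.sh`, `build-index.sh`, `run-queries.sh`) are hypothetical placeholders and not the actual uiHRDC script names:

```python
import subprocess

# Hypothetical per-technique step scripts (steps 3, 4, and 5);
# the real uiHRDC scripts have different names and arguments.
STEPS = ["compile.sh", "build-index.sh", "run-queries.sh"]

def plan_experiments(techniques, steps=STEPS):
    """Return the ordered command list: steps 3 -> 4 -> 5 for each technique."""
    return [[step, tech] for tech in techniques for step in steps]

def run_experiments(techniques, dry_run=True):
    """Launch every step for every technique; with dry_run, just return the plan."""
    plan = plan_experiments(techniques)
    if not dry_run:
        for cmd in plan:
            subprocess.run(cmd, check=True)  # abort on the first failing step
    return plan

print(run_experiments(["vbyte", "rlcsa"]))
```

The dry-run plan makes the step ordering inspectable before anything is executed, which mirrors how a single top-level script can drive all per-technique experiments in sequence.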
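Step (7) parses the standard output of the search programs into well-formed gnuplot data files. A minimal sketch of that parsing, assuming a hypothetical output format with `space:` and `time:` lines (the actual uiHRDC result formats are technique-specific and differ from this):

```python
import re

def parse_result_text(text):
    """Extract (space, time) measurement pairs from a program's stdout.

    Assumes hypothetical 'space: <float>' and 'time: <float>' lines;
    the real search programs emit technique-specific formats.
    """
    spaces = [float(m) for m in re.findall(r"space:\s*([\d.]+)", text)]
    times = [float(m) for m in re.findall(r"time:\s*([\d.]+)", text)]
    return list(zip(spaces, times))

def to_gnuplot(pairs):
    """Render measurement pairs as gnuplot data: one 'x y' row per line."""
    return "\n".join(f"{s} {t}" for s, t in pairs)

raw = "space: 12.5\ntime: 0.84\nspace: 7.3\ntime: 1.92\n"
print(to_gnuplot(parse_result_text(raw)))  # -> "12.5 0.84" and "7.3 1.92" rows
```

Once every result file is normalized into such two-column data files, a single set of gnuplot style files can regenerate all figures with exactly the styles of the parent paper.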