CEUR Workshop Proceedings, Vol-2621 (CIRCLE 2020). Paper: https://ceur-ws.org/Vol-2621/CIRCLE20_26.pdf
On the Reproducibility of Experiments of Indexing Repetitive Document Collections

Antonio Fariña (antonio.farina@udc.es), Universidade da Coruña, CITIC, A Coruña, Spain
Miguel A. Martínez-Prieto (migumar2@infor.uva.es), University of Valladolid, Segovia, Spain
Francisco Claude (fclaude@recoded.cl), Universidad Diego Portales, Santiago, Chile
Gonzalo Navarro (gnavarro@dcc.uchile.cl), IMFD, DCC, University of Chile, Santiago, Chile
Juan J. Lastra-Díaz (jlastra@invi.uned.es), UNED, Madrid, Spain
Nicola Prezza (nicola.prezza@gmail.com), University of Pisa, Pisa, Italy
Diego Seco (dseco@udec.cl), IMFD, University of Concepción, Concepción, Chile

ABSTRACT
We summarize an already published work [3]: a companion paper that aims at allowing the exact replication of the methods, experiments, and results discussed in a previous work [2], where we proposed several techniques for compressing indexes that exploit the fact that highly repetitive collections are formed mostly of documents that are near-copies of each other. In this work, we present our replication framework (uiHRDC, available at https://github.com/migumar2/uiHRDC/), which permits replicating the actual experimental setup from our parent paper with little effort. The experimentation was carefully explained, providing precise details about the parameters that can be tuned for each indexing solution. Finally, we also provided uiHRDC as a reproducibility package.

CCS CONCEPTS
• Information systems → Data compression; Search index compression.

KEYWORDS
Repetitive document collections, inverted index, self-index, reproducibility.

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION
Scientific advances have typically been disseminated in the form of scientific publications. Publications in Computer Science usually present a new technique, algorithm, etc. that aims at improving upon the state of the art, and experimental results are typically provided as evidence of the achievements of the paper. Therefore, public access to the original authors' setup (tools/source code, datasets, tuning parameters, etc.) used to drive such an experimental evaluation becomes crucial both to assess the validity of the results obtained and to allow other researchers to reuse this previous research as a baseline against which to compare their future techniques.

According to the ACM [1], an experiment is reproducible only if an independent group of researchers can obtain the same results using artifacts that they developed independently. The ACM also defines replicability as a weaker type of reproducibility, where independent researchers must obtain the same results when using the artifacts provided by the original work's authors. Finally, repeatability refers to the possibility that the original authors of a work can obtain the same results when reusing their original setup (same conditions, system, etc.); i.e., they can repeat the experiments and obtain the same results. The conclusion is that research works must be at least repeatable, and that achieving replicability is desirable. Yet, this is not always simple: one must not only have the source code, but sometimes also match the exact versions of external dependencies, tune the parameters of the tools (such tuning is often difficult to recover from the original paper), or even keep the original computer without software/hardware updates. In our case, this was accomplished with a Docker-based solution.

2 OUR PROPOSAL
We focused on making the experiments of our previous work [2] replicable. In that parent paper, we tackled the problem of indexing repetitive document collections. We showed that the existing compressed posting-list representations commonly used for positional and non-positional inverted indexes are not well suited to deal with highly repetitive collections. We provided new compressed posting-list representations (using run-length, Lempel-Ziv, and grammar-based compression) that improved upon the state of the art, and we also included self-indexes in the comparisons. In our reproducible paper [3] we focused on:
• Briefly enumerating the 10 posting-list representations included in the parent paper and, when applicable, discussing the parameters of those techniques and the actual values used to tune them. Those techniques are: rice, vbyte (with three variants), Simple9, PforDelta, QMX-coding, rice-runs, vbyte-Lzma, vbyte-Lzend, and Repair (with four variants). For all these techniques we used our


implementation of inverted indexes. We also included the implementations of 4 techniques provided by a more recent work [4] (partitioned-EliasFano, Opt-PForDelta, Interpolative-codes, and Varint-G8IU).
• We briefly enumerated the 6 self-indexing techniques used: RLCSA, SLP, WCSA, WSLP, Lz77-index, and LzEnd-index. We also showed how they were tuned in the original experiments.
• We described the original experimental framework, including both the datasets and the query patterns used, and finally described how those experiments were run. We provided the actual source code for all the techniques compared, together with instructions on how to use them. This included the actual scripts used to create the different indexes at build time, as well as the scripts that drove the search experiments. Both datasets and source code were also made available.
• We clearly described the variables being measured: the memory footprint of the different techniques (space requirements) and the CPU execution time at query time, particularly when performing the operations locate and extract with the available indexing/self-indexing techniques.
• We fixed some errors that were detected in the parent paper.
• We provided our reproducibility framework uiHRDC, which permits re-running all the experiments from the parent paper and gathering all the experimental results. It is discussed in Section 2.1.
• Finally, we used uiHRDC to automatically run the experiments on a new/different computer and showed the results obtained. We discussed how these results permitted us to assess that the new results actually draw the same conclusions as those in the parent paper.
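To give a flavor of the simplest of the techniques compared, the sketch below shows a plain byte-oriented vbyte coder applied to a gap-encoded posting list. This is a generic illustration only: the actual vbyte variants in the parent paper differ in implementation details (and uiHRDC ships its own compiled code), so treat the bit layout here (7 data bits per byte, high bit marking the last byte of each number) as one common convention rather than the specific one evaluated.

```python
def vbyte_encode(gaps):
    """Encode positive integers using 7 data bits per byte;
    the high bit marks the final byte of each number."""
    out = bytearray()
    for n in gaps:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, continuation implied
            n >>= 7
        out.append(n | 0x80)      # final byte: high bit set
    return bytes(out)

def vbyte_decode(data):
    nums, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:              # last byte of this number
            nums.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return nums

# A posting list is stored as gaps between consecutive doc ids,
# which keeps the numbers small and thus the bytes few:
postings = [3, 9, 1000, 1000000]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
assert vbyte_decode(vbyte_encode(gaps)) == gaps
```

The point of the experiment families above is precisely that, on highly repetitive collections, such gap sequences contain long runs and repeated substrings that plain vbyte cannot exploit, which is what the run-length, Lempel-Ziv, and grammar-based representations target.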
2.1 uiHRDC Framework
The main contribution of this work was the careful creation of our uiHRDC (universal indexes for Highly Repetitive Document Collections) reproducibility framework. It includes: (1) the source code and all the dependencies (with the actual versions) required to build the test programs; (2) the document datasets and query patterns used; (3) scripts to compile the source code for each technique; (4) scripts to build the compressed inverted indexes and self-indexes over the document collections; these indexing/self-indexing structures are finally saved to disk; (5) scripts that load each representation from disk into memory and perform queries over it, with each run saving its space/time measurements into text result files; (6) a script to compile all the source code, build all the indexes, and perform all the querying experiments (it includes repeatedly launching steps 3 → 4 → 5 for each technique); (7) a script that first collects all the result files (in some cases these are gnuplot data files; in other cases they are plain text files containing the standard output of the search programs, which we parse with Python scripts to create well-formed gnuplot data files) and then creates all the figures from the parent paper using exactly the same styles, to simplify the comparison between both versions; and (8) a script that collects all those figures and, using a LaTeX template, generates a single report in PDF format. In this report we also provide information regarding the machine on which the results were obtained (CPU, memory, OS version, etc.), among other details. Figure 1 illustrates the workflow in uiHRDC.

Finally, we used Docker as the tool to provide a fully working environment in which we tested our uiHRDC framework, hence guaranteeing replicability. We provided a Docker configuration file (Dockerfile) with instructions to create a container that exactly reproduces the configuration of the test framework and deploys uiHRDC. Basically, it installs Ubuntu 14.04 (even though we also tested it successfully on Ubuntu 16.04) with gcc 4.8, and installs everything from libraries such as libboost to tools such as cmake, screen/byobu, texlive, and gnuplot, and even an ssh server to make ssh/sftp connections possible. We provide simple instructions to build our Docker image, launch and connect to a Docker instance (this is really simple, as the user can connect by ssh as if the container were a running Linux server), and then execute a script that deploys uiHRDC, runs all the experiments, and delivers the final PDF report through an sftp connection.

Figure 1: Workflow in uiHRDC to reproduce experiments. [Figure: the source code and the document collection with its queries feed the INDEXING step, which produces the compressed indexes; LOCATE and EXTRACT searches over those indexes produce result files that, via a LaTeX template, yield the final PDF report.]

3 CONCLUSIONS
We have briefly described all the techniques and their parameters from our original research [2]. We have also described the reproducibility framework uiHRDC, which includes the datasets, query sets, and source code (with dependencies), as well as a set of scripts to automate the execution of all the experiments from [2]. These scripts also automate the generation of a final PDF report that collects the results (space/query time) obtained by each technique and recreates all the figures from the parent paper. Finally, we provide a Docker container that reproduces our test environment, ensuring that all our experiments can be replicated with little effort on any computer that has Docker installed and meets our minimal RAM/disk requirements.

4 ACKNOWLEDGEMENTS
Supported by Xunta de Galicia/FEDER-UE [CSI: ED431G 2019/01 and GRC: ED431C 2017/58]; by Xunta de Galicia Conecta-Peme 2018 [Gema: IN852A 2018/14]; and by MCIU-AEI/FEDER-UE [Datos 4.0: TIN2016-78011-C4-1-R, BIZDEVOPS: RTI2018-098309-B-C32].

REFERENCES
[1] ACM. 2018. Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-badging.
[2] Francisco Claude, Antonio Fariña, Miguel A. Martínez-Prieto, and Gonzalo Navarro. 2016. Universal Indexes for Highly Repetitive Document Collections. Information Systems 61 (2016), 1–23. https://doi.org/10.1016/j.is.2016.04.002
[3] Antonio Fariña, Miguel A. Martínez-Prieto, Francisco Claude, Gonzalo Navarro, Juan J. Lastra-Díaz, Nicola Prezza, and Diego Seco. 2019. On the Reproducibility of Experiments of Indexing Repetitive Document Collections. Information Systems 83 (2019), 181–194. https://doi.org/10.1016/j.is.2019.03.007
[4] Giuseppe Ottaviano and Rossano Venturini. 2014. Partitioned Elias-Fano Indexes. In Proc. 37th Int. ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, NY, USA, 273–282. https://doi.org/10.1145/2600428.2609615