                                 A Progressive Visual Analytics Tool for
                                 Incremental Experimental Evaluation
                                                         Fabio Giachelle            Gianmaria Silvello
                                                  fabio.giachelle@unipd.it          gianmaria.silvello@unipd.it
                                                  ORCID: 0000-0001-5015-5498     ORCID: 0000-0003-4970-4554
                                                         Department of Information Engineering
                                                               University of Padua, Italy

ABSTRACT
This paper presents a visual tool, AVIATOR, that integrates the progressive visual analytics paradigm into the IR evaluation process. The tool speeds up and facilitates the performance assessment of retrieval models, enabling result analysis through visual facilities. AVIATOR goes one step beyond the common "compute-wait-visualize" analytics paradigm, introducing a continuous evaluation mechanism that minimizes human and computational resource consumption.

KEYWORDS
visual analytics; experimental evaluation; incremental indexing

1    MOTIVATIONS
The development of a new retrieval model is a demanding activity that goes beyond the definition and implementation of the model itself. A retrieval model can be conceived as part of an ecosystem where each component interacts with the others to produce the final document ranking for the user. As shown in [4], the effectiveness of a model highly depends on the pipeline components it interacts with (e.g., stoplist and stemmer). Determining which configuration gets the most out of a model is itself demanding: it requires inspecting several component pipelines and comparing them against baselines over multiple test collections and evaluation measures.

The typical evaluation process comprises the following phases: corpus preprocessing (e.g., tokenization, stopword removal, stemming) and indexing, retrieval, and evaluation. If something is modified in the preprocessing phase, the whole collection has to be re-indexed before testing the retrieval model and conducting the evaluation again. Unfortunately, indexing a collection may require hours, if not days, depending on the hardware and the collection size. Assessing the best configuration of components over multiple collections via grid search therefore requires great human effort and computational resources.

We propose an "all-in-one visual analytics tool for the evaluation of IR systems" (AVIATOR) to speed up this evaluation process. The idea behind the tool is to let the user test retrieval models, calculate approximate measures, explore the results and make baseline comparisons during the indexing phase. AVIATOR allows the user to issue queries to a system while the indexing phase is still running and to explore partial evaluation results in an intuitive way thanks to visual analytics advances. In particular, it leverages the progressive visual analytics paradigm, which "enable(s) an analyst to inspect partial results of an algorithm as they become available and interact with the algorithm to prioritize subspaces of interest" [8].

Visual analytics and IR experimental evaluation have interacted before, producing visual tools to design and ease failure analysis [2] and what-if analysis [3], to explore pooling strategies [7], and to enable interactive grid exploration over a large combinatorial space of systems [1]. Nevertheless, these tools follow the "compute-wait-visualize" paradigm of visual analytics. AVIATOR moves a step beyond by (partially) removing the "wait" phase. To the best of our knowledge, our paper is the first to focus on progressive visual analytics employed in IR to enable the dynamic and incremental evaluation of IR systems.

A video showing the main functionalities of the system is available at the URL: https://www.gigasolution.it/v/Aviator.mp4.

Outline. A general overview of AVIATOR is presented in Section 2. AVIATOR comprises a back-end component that deals with incremental indexing and retrieval (Section 3) and a front-end component that enables the interactive exploration of the partial experimental results (Section 4).

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
IIR 2019, September 16–18, 2019, Padova, Italy

2    SYSTEM OVERVIEW
AVIATOR comprises five phases: preprocessing, incremental indexing, retrieval, evaluation and visual analysis.

In the preprocessing phase the document corpus D is partitioned into n bundles B = [B_1, B_2, ..., B_n], where B_i, with i ∈ [1, n−1], has size k = ⌊|D|/n⌋ and B_n has size |D| − k(n−1). The bundles are populated by uniformly sampling D such that B_i ∩ B_j = ∅, ∀i, j ∈ [1, n]. This sampling strategy is described in [5], where it is also shown that biased sub-collections exhibit behavior similar to uniform samples in terms of precision.

As shown in Figure 1, in the incremental indexing phase we adopt two parallel system threads, each implementing an independent instance of the same Information Retrieval System (IRS). These threads are referred to as the dynamic and the stable core, respectively. The dynamic core indexes the first corpus bundle and then releases the partial index to the stable core. The stable core enables the user to run the retrieval phase on the partial index, while the dynamic core proceeds to index the second bundle. When the second bundle has been indexed, an interrupt is issued to the stable core and the user decides whether to update the index and run a new retrieval phase or to continue with the index already at hand.

In the retrieval phase the partial index is queried by the user. Currently, AVIATOR is based on batch retrieval on shared test collections. Hence, in each retrieval phase at least 50 queries are issued and a TREC-like run is returned for evaluation. The user can select among several standard retrieval models or use a custom one loaded into the system. This phase can be considered dynamic, since the user can keep querying the partial index while changing the retrieval model or its parameters.

The runs produced in the retrieval phase undergo continuous evaluation as they are being produced. Once the evaluation phase is performed, the results are visualized by the visual analytics component, which enables the user to conduct an in-depth and intuitive analysis.

3    BACK-END COMPONENT
The back-end component implements the first four phases described above. AVIATOR is a client-server application built on top of an IR system of choice. In the current implementation, AVIATOR is based on Apache Solr¹, which in turn exploits the widely-used Apache Lucene search engine. In the back-end, AVIATOR acts as a wrapper around the IR system, controlling every stage of the IR process (indexing, retrieval and evaluation) via HTTP through a REpresentational State Transfer (REST)ful Web service.

AVIATOR's demo version is based on Disks 4&5 of the TREC TIPSTER collection² and on the 50 topics (no. 351–400) of the TREC7 ad-hoc track [10]. For testing purposes AVIATOR was designed to work with 64 different IR system pipelines, combining four stoplists (indri, lucene, terrier, nostop), four stemmers (Hunspell, Krovetz, Porter, nostem), and four IR models (BM25, boolean³, Dirichlet LM, TF-IDF).

The incremental index is designed to work on 10 corpus bundles (10%, 20%, ..., 100% of the corpus). This implies that, at the time of writing, the AVIATOR demo version works on 160 (4 × 4 × 10) different indexes that, if statically stored in memory, would occupy up to 230 GB.

The system run obtained over a partial index is an approximation of the "true" run obtained on the complete index. Figure 2 shows the average nDCG relative difference between system performance on the partial indexes and on the full index. As expected, the precision of the measure grows with the index size and the approximate effectiveness is consistent across all the 64 tested systems. For instance, with index size 60%, for most of the systems the nDCG estimation is 40% lower than the true value obtained with the full index. Figure 3 shows that, on TREC7, the system rankings obtained on partial indexes are highly correlated with the ranking obtained on the full one. The correlation is based on Kendall's τ [6]; following a common rule of thumb [9], two rankings are considered highly correlated when τ > 0.8. Thus, when comparing all the 64 IR systems on 20% of the full index (B_2), our system ranking is already quite close to the one obtained with the full index. The correlation increases rather rapidly, and when half of the collection is indexed, AVIATOR generates a reliable estimation of average system performances.

Figure 4 illustrates a topic-based analysis of nDCG, where relative differences are computed between the 64 runs calculated on partial indexes and the final runs calculated on the full index. We can see that with a 10% index (bundle 1), half of the topics have an nDCG presenting an 80% difference with the true nDCG value. Nevertheless, the nDCG approximation improves steadily as the index grows. With half of the collection indexed (bundle 5), the nDCG approximation for half of the topics shows less than a 40% difference from the final value.

4    FRONT-END COMPONENT
The front-end component is a Web application designed on the basis of the Model-View-Controller design pattern. Its development leverages the HTML5, D3⁴ and jQuery⁵ JavaScript libraries.

Figure 5 shows the configuration page of AVIATOR. The user can select among different corpora, topic sets and pool files. The current version of AVIATOR is based on the TIPSTER collection and the TREC7 ad-hoc topics and pool file. For demo purposes the partial indexes have been precomputed; the interaction with the system can therefore be artificially sped up to avoid the actual waiting time between one index version and the next. Moreover, the user can select the stoplist and the stemmer to be used for building the index, as well as a retrieval model. The retrieval model can be changed afterwards, and other models can be added to the evaluation and analytics phase.

Figure 6 shows the main analytics interface for the topic-based analysis. At the top of the screen, the main settings related to the collection and index are reported as a reference for the user. Below, two tabs can be used to conduct a topic-based or an overall analysis. In the top-right corner, the user can see the percentage of the corpus and the number of documents currently indexed. The main interaction interface shows a scatter plot with the Average Precision values of the retrieval model selected in the configuration phase. Just above the scatter plot, two tabs can be used to add new retrieval models and to change the evaluation measure, as shown in Figure 7 (all measures returned by trec_eval are available).

Figure 8 illustrates the scatter plot. Four different retrieval models can be compared through a pop-up window triggered by mousing over the points of the plot. The pop-up reports the retrieval model, the measure value and the topic being inspected. The user can zoom over a specific part of the scatter plot to better inspect the results.

Figure 9 illustrates how the user is notified when a new version of the index is ready. The user can decide whether or not to work on the new version. When a new version of the index is loaded, all the visualizations are updated accordingly and the user settings are maintained from one version to the next.

Figure 10 shows the interface enabling the inspection of overall results (averaged over all topics) of the tested retrieval models. In this case too, mousing over the plot bars triggers a pop-up window providing detailed information on the inspected system.

¹ http://lucene.apache.org/solr/
² https://trec.nist.gov/data/qa/T8_QAdata/disks4_5.html
³ The boolean model, implemented in Apache Solr, uses a simple matching coefficient to rank documents.
⁴ https://d3js.org
⁵ https://jquery.com
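The uniform bundle partitioning described in Section 2 can be sketched as follows. This is a minimal illustration, not AVIATOR's actual code: the function name, the seeded shuffle, and the example corpus size are our own assumptions.

```python
import random

def partition_corpus(doc_ids, n, seed=42):
    """Partition a corpus into n disjoint bundles by uniform sampling.

    Bundles B_1..B_{n-1} have size k = floor(|D| / n); the last bundle
    B_n takes the remaining |D| - k * (n - 1) documents.
    """
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)  # uniform sampling without replacement
    k = len(ids) // n
    bundles = [ids[i * k:(i + 1) * k] for i in range(n - 1)]
    bundles.append(ids[k * (n - 1):])  # B_n gets the remainder
    return bundles

# 1005 documents into 10 bundles: nine of size 100, the last of size 105.
bundles = partition_corpus(range(1005), 10)
```

Because the shuffle draws without replacement, the bundles are pairwise disjoint and together cover the whole corpus, matching the B_i ∩ B_j = ∅ constraint above.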
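The stable/dynamic-core handover of the incremental indexing phase can be sketched with two cooperating threads sharing a toy inverted index. This is a hypothetical illustration only: AVIATOR delegates indexing to Solr, and the user's "update or keep" decision at each interrupt is replaced here by always adopting the new index.

```python
import threading

class StableCore:
    """Serves queries on the latest partial index released by the dynamic core."""
    def __init__(self):
        self._lock = threading.Lock()
        self._index = {}          # term -> set of doc ids (toy inverted index)
        self.indexed_bundles = 0

    def sync(self, partial_index, n_bundles):
        # Interrupt point: in AVIATOR the user chooses whether to adopt
        # the new index; here we always adopt it.
        with self._lock:
            self._index = partial_index
            self.indexed_bundles = n_bundles

    def search(self, term):
        with self._lock:
            return sorted(self._index.get(term, set()))

def dynamic_core(bundles, stable):
    """Index bundles one at a time, releasing a snapshot after each."""
    index = {}
    for i, bundle in enumerate(bundles, start=1):
        for doc_id, text in bundle:
            for term in text.split():
                index.setdefault(term, set()).add(doc_id)
        stable.sync({t: set(d) for t, d in index.items()}, i)  # snapshot copy

# Two tiny bundles; the stable core stays queryable throughout.
bundles = [[(1, "progressive visual analytics")],
           [(2, "incremental visual evaluation")]]
stable = StableCore()
t = threading.Thread(target=dynamic_core, args=(bundles, stable))
t.start()
t.join()
```

The snapshot copy on release is what lets the stable core answer queries on a consistent partial index while the dynamic core keeps writing to its own copy.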
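Since the back-end drives Solr over HTTP and returns TREC-like runs, a minimal wrapper might look as follows. The core name `aviator`, the helper names, and the run tag are assumptions, not AVIATOR's actual API; only the standard Solr select parameters (`q`, `rows`, `fl`, `wt`) and the standard TREC run-line layout are taken as given.

```python
import json
import urllib.parse
import urllib.request

SOLR_URL = "http://localhost:8983/solr/aviator/select"  # hypothetical core name

def solr_search(query, rows=1000):
    """Query Solr's select endpoint; return (doc_id, score) pairs by rank."""
    params = urllib.parse.urlencode(
        {"q": query, "rows": rows, "fl": "id,score", "wt": "json"})
    with urllib.request.urlopen(f"{SOLR_URL}?{params}") as resp:
        docs = json.load(resp)["response"]["docs"]
    return [(d["id"], d["score"]) for d in docs]

def trec_run_lines(topic_id, results, tag="AVIATOR"):
    """Format ranked results as TREC run lines: qid Q0 docid rank score tag."""
    return [f"{topic_id} Q0 {doc_id} {rank} {score:.4f} {tag}"
            for rank, (doc_id, score) in enumerate(results, start=1)]

# trec_run_lines("351", [("FT911-3", 12.5)])
# -> ["351 Q0 FT911-3 1 12.5000 AVIATOR"]
```

A run produced this way can be fed directly to trec_eval against the track's pool file, which is what the continuous evaluation phase consumes.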
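The ranking comparison behind Figure 3 relies on Kendall's τ [6] with the τ > 0.8 rule of thumb [9]. A self-contained sketch, assuming no tied ranks (the system names and ranks are illustrative):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau between two rankings of the same systems (no ties).

    rank_a and rank_b map each system to its rank position.
    """
    concordant = discordant = 0
    for s1, s2 in combinations(list(rank_a), 2):
        # Pair is concordant if both rankings order s1 and s2 the same way.
        agree = (rank_a[s1] - rank_a[s2]) * (rank_b[s1] - rank_b[s2])
        if agree > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Ranking of four hypothetical systems on a partial vs. the full index.
partial = {"BM25": 1, "DirichletLM": 2, "TFIDF": 3, "boolean": 4}
full    = {"BM25": 1, "DirichletLM": 3, "TFIDF": 2, "boolean": 4}
tau = kendall_tau(partial, full)
# (5 - 1) / 6 ≈ 0.67: one swapped pair keeps tau below the 0.8 threshold
```

With 64 systems the same computation runs over 2016 pairs per bundle, which is how the per-bundle correlations in Figure 3 would be obtained.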




DIAGRAMS, IMAGES, SCREEN SHOTS

Figure 1: Incremental indexing: the interaction between the stable and dynamic cores. [Diagram: at each step the dynamic core indexes the next bundle (indexed percentage 0%, 10%, 20%, ..., 100%) and syncs the partial index to the stable core, which answers the user's requests.]

Figure 2: Average nDCG relative difference between partial indexes at different levels of cut-off and the full index. Each line shows one of the 64 tested IR systems. [Line chart: nDCG relative difference (10%–90%) versus index size (10%–100%).]

Figure 3: Kendall's τ correlation between the system rankings (based on nDCG) obtained over increasingly more complete index bundles (B_i, i ∈ [1, 10]) and the complete index bundle (B_10). [Line chart: Kendall's τ (0.7–1) versus index size (10%–100%).]

Figure 4: Boxplot distribution of topic-based nDCG relative differences between partial and full indexes. [Boxplots: relative difference with the full index (0%–100%) versus index bundles 1–10.]

Figure 5: The AVIATOR configuration interface.

Figure 6: The AVIATOR inspection interface: topic per topic visualization with a single model.

Figure 7: The AVIATOR inspection interface: evaluation measure selection.

Figure 8: The AVIATOR inspection interface: in-depth result analysis.

Figure 9: The AVIATOR inspection interface: index update.

Figure 10: The AVIATOR inspection interface: overall analysis.

REFERENCES
[1] M. Angelini, V. Fazzini, N. Ferro, G. Santucci, and G. Silvello. 2018. CLAIRE: A combinatorial visual analytics system for information retrieval evaluation. Information Processing & Management (2018), in print.
[2] M. Angelini, N. Ferro, G. Santucci, and G. Silvello. 2014. VIRTUE: A Visual Tool for Information Retrieval Performance Evaluation and Failure Analysis. J. Vis. Lang. Comput. 25, 4 (2014), 394–413. https://doi.org/10.1016/j.jvlc.2013.12.003
[3] M. Angelini, N. Ferro, G. Santucci, and G. Silvello. 2016. A Visual Analytics Approach for What-If Analysis of Information Retrieval Systems. In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016). ACM Press, New York, USA.
[4] N. Ferro and G. Silvello. 2018. Toward an anatomy of IR system component performances. Journal of the Association for Information Science and Technology 69, 2 (2018), 187–200. https://doi.org/10.1002/asi.23910
[5] D. Hawking and S. E. Robertson. 2003. On Collection Size and Retrieval Effectiveness. Information Retrieval 6, 1 (2003), 99–105.
[6] M. G. Kendall. 1948. Rank Correlation Methods. Griffin, Oxford, England.
[7] A. Lipani, M. Lupu, and A. Hanbury. 2017. Visual Pool: A Tool to Visualize and Interact with the Pooling Method. In Proc. 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). ACM Press, New York, USA.
[8] C. D. Stolper, A. Perer, and D. Gotz. 2014. Progressive Visual Analytics: User-Driven Visual Exploration of In-Progress Analytics. IEEE Trans. Vis. Comput. Graph. 20, 12 (2014), 1653–1662.
[9] E. Voorhees. 2001. Evaluation by Highly Relevant Documents. In Proc. 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2001), D. H. Kraft, W. B. Croft, D. J. Harper, and J. Zobel (Eds.). ACM Press, New York, USA, 74–82.
[10] E. M. Voorhees and D. Harman. 1998. The Text Retrieval Conferences (TRECS). In TIPSTER TEXT PROGRAM PHASE III. Morgan Kaufmann.