     What-If Analysis: A Visual Analytics Approach to
            Information Retrieval Evaluation

    Marco Angelini2 , Nicola Ferro1 , Giuseppe Santucci2 , and Gianmaria Silvello1
                                1 University of Padua, Italy

                           {ferro,silvello}@dei.unipd.it
                         2 “La Sapienza” University of Rome, Italy

                        {angelini,santucci}@dis.uniroma1.it



       Abstract. This paper focuses on the innovative visual analytics approach realized by the Visual Analytics Tool for Experimental Evaluation (VATE2 ) system,
       which eases the experimental evaluation process and makes it more effective by
       introducing what-if analysis. What-if analysis aims at estimating the
       possible effects of a modification to an Information Retrieval (IR) system, in order to select the most promising fixes before implementing them, thus saving a
       considerable amount of effort.
       VATE2 builds on an analytical framework which models the behavior of the system under analysis in order to make estimations, and integrates this analytical framework with a
       visual component which, via proper interaction and animations, receives input from and provides feedback to the user. We conducted an experimental evaluation to assess
       the numerical performances of the analytical model and a validation of the visual
       analytics prototype with domain experts. Both the numerical evaluation and the
       user validation have shown that VATE2 is effective, innovative, and useful.


1   Introduction

IR systems operate using a best match approach: in response to an often vague user
query, they return a ranked list of documents ordered by the estimation of their relevance
to that query. In this context, effectiveness, meant as the ability of systems to retrieve
and better rank relevant documents while at the same time suppressing the retrieval of
not relevant ones, is the primary concern. Since there are no a-priori exact answers to
a user query, experimental evaluation [10] based on effectiveness is the main driver of
research and innovation in the field. Indeed, the measurement of system performances
from the effectiveness point of view is basically the only means to determine the best
approaches and to understand how to improve IR systems.
    Nowadays, user tasks and needs are becoming increasingly demanding, the data
sources to be searched are rapidly evolving and greatly heterogeneous, the interaction
between users and IR systems is much more articulated, and the systems themselves become increasingly complex and composed of many interrelated components. As an
example consider what web search is today: highly diversified results are returned from
web pages, news, social media, image and video search, products and more, and they
are all merged together by adaptive strategies driven by current and previous interaction
of the users with the system.
     Understanding and interpreting the results produced by experimental evaluation is
not a trivial task, due to the complex interactions among the components of an IR sys-
tem. Nevertheless, succeeding in this task is fundamental for detecting where systems
fail and hypothesizing possible fixes and improvements. As a consequence, this task is
mostly carried out manually and requires huge amounts of time and effort.
     Moreover, after such activity, the researcher needs to come back to design and then
implement the modifications that the previous analysis suggested as possible solutions
to the identified problems. Afterwards, a new experimentation cycle needs to be started
to verify whether the introduced modifications actually give the expected improvement.
Therefore, the overall process of improving an IR system is extremely time and re-
source demanding and proceeds through cycles where each new feature needs to be
implemented and experimented.
     The goal of this paper is to introduce a new phase in this cycle: we call it what-if
analysis and it falls between the experimental evaluation and the design and imple-
mentation of the identified modifications. What-if analysis aims at estimating what the
effects of a modification to the IR system under examination could be, before it is actually implemented. In this way researchers and developers can get a feeling of whether
a modification is worth implementing and, if so, they can go ahead with its imple-
mentation followed by a new evaluation and analysis cycle for understanding whether
it has produced the expected outcomes.
     What-if analysis exploits Visual Analytics (VA) techniques to make researchers and
developers: (i) interact with and explore the ranked result list produced by an IR system
and the achieved performances; (ii) hypothesize possible causes of failure and their
fixes; (iii) estimate the possible impact of such fixes through a powerful analytical
model of the system behavior.
     What-if analysis is a major step forward since it can save huge amounts of time and
effort in IR system development and, to the best of our knowledge, it has never been
attempted before.
     The paper is organized as follows: Section 2 discusses related work; Section 3 gives
an intuitive overview of the approach; Section 4 explains in detail the proposed analytical
framework, which is then experimentally evaluated in Section 5; Section 6 presents the visual
analytics environment called Visual Analytics Tool for Experimental Evaluation (VATE2 ); and, Section 7 draws some conclusions and presents an outlook on future work.


2   Related Work
Experimental evaluation is based on the Cranfield methodology [6] which makes use of
experimental collections C = (D, T, GT ) consisting of: a set of documents D represent-
ing the domain of interest; a set of topics T , which simulates and abstracts actual user
information needs; and, the ground-truth GT , i.e. a kind of “correct” answer, where for
each topic t ∈ T the documents d ∈ D relevant to it are determined. System outputs
are then scored with respect to the ground-truth using a whole breadth of performance
measures [15]. The ground-truth can consist of either binary or multi-graded relevance
judgments [12].
    Experimental evaluation is a demanding activity in terms of effort and required
resources that benefits from using shared datasets, which allow for repeatability of the
experiments and comparison among state-of-the-art approaches. Therefore, over the
last 20 years, experimental evaluation has been carried out in large-scale evaluation
campaigns at the international level, such as the Text REtrieval Conference (TREC, http://trec.nist.gov/) in the
US and the Conference and Labs of the Evaluation Forum (CLEF, http://www.clef-initiative.eu/) in Europe.
    The activity described in the previous section and aimed at understanding how and
why a system has failed is called failure analysis. To give the reader an idea of how demanding it can be, let us consider the case of the Reliable Information Access (RIA)
workshop [9], which was aimed at systematically investigating the behavior of just
one component in an IR system, namely the relevance feedback module. Harman and
Buckley in [9] reported that, for analyzing 8 systems, 28 people from 12 organizations
worked for 6 weeks, requiring from 11 to 40 person-hours per topic over 150 topics in
total. These figures do not take into account the time then needed to implement the
identified modifications and perform another evaluation cycle to understand if they had
the desired effect.
    VA is typically exploited for the presentation and exploration of the documents man-
aged by an IR system [18]. However, much less attention has been devoted to applying
VA techniques to the analysis and exploration of the performances of IR systems [4].
To the best of our knowledge, our previous work is the most systematic attempt. We ex-
plored several ways in which VA can help the interpretation and exploration of system
performances [7]. This preliminary work led to the development of Visual Information
Retrieval Tool for Upfront Evaluation (VIRTUE), a fully-fledged VA prototype which
specifically supports performance and failure analysis [3].
    We then started to explore the need for what-if analysis in the context of large-scale
evaluation campaigns, such as TREC or CLEF. In this context, evaluators do not have
access to the tested systems but they can only examine the final outputs, i.e. the ranked
result lists returned for each topic. In [1, 2] we continued the study for the evaluation
campaigns and we set up an analytical framework for trying to learn the behavior of a
system just from its outputs, in order to obtain a rough estimation of the possible effects
of a modification to the system.
    Therefore, in this paper, we put VATE2 in a context different from that of evaluation campaigns. Indeed, the present version of VATE2 is intended for designers and
developers of IR systems, who have access to all the internals of the system being tested
and have the know-how to hypothesize possible fixes and improvements.


3   Intuitive Overview
In order to understand how what-if analysis works we need to recall the basic ideas
underlying performance and failure analysis as designed and realized by VIRTUE [3].
    To quantify the performances of an IR system, VIRTUE adopts the Discounted
Cumulated Gain (DCG) family of measures [11], which have proved to be especially
well-suited for analyzing ranked result lists. This is because they allow for graded relevance judgments and embed a model of the user behavior while scrolling down the
result list, which also gives an account of the overall user satisfaction.
     We compare the result list produced by an experiment with respect to an ideal rank-
ing created starting from the relevant documents in the ground-truth, which represents
the best possible results that an experiment can return. In addition to what is typically
done, we compare the result list with respect to an optimal one created with the same
documents retrieved by the IR system but with an optimal ranking, i.e. a permutation
of the results retrieved by the experiment aimed at maximizing its performances by
sorting the retrieved documents in decreasing order of relevance. Therefore, the ideal
ranking compares the experiment at hand with respect to the best results possible, i.e.
also considering relevant documents not retrieved by the system, while the optimal
ranking compares an experiment with respect to what could have been done better with
the same retrieved documents.
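     To make the comparison concrete, the following Python sketch builds the optimal and ideal rankings for a topic and computes the corresponding DCG curves. It assumes the standard DCG definition of [11] with discounting base b = 2 and pads the ideal ranking with zero gains beyond the judged documents; the helper names (dcg, optimal_ranking, ideal_ranking) and the toy data are ours and are not part of VIRTUE or VATE2.

    import math

    def dcg(gains, b=2):
        # Discounted Cumulated Gain curve [11]: gains at ranks before b are summed
        # undiscounted, later ones are divided by log_b(rank).
        curve, total = [], 0.0
        for i, g in enumerate(gains, start=1):
            total += g if i < b else g / math.log(i, b)
            curve.append(total)
        return curve

    def optimal_ranking(run_rels):
        # Optimal ranking: same retrieved documents, sorted by decreasing relevance.
        return sorted(run_rels, reverse=True)

    def ideal_ranking(ground_truth, length):
        # Ideal ranking: all judged documents sorted by decreasing relevance,
        # truncated to `length` and padded with zero gains (our assumption).
        rels = sorted(ground_truth.values(), reverse=True)
        return rels[:length] + [0] * max(0, length - len(rels))

    # toy example: graded relevance degrees of the retrieved documents, in rank order
    run_rels = [0, 2, 1, 0, 3]
    ground_truth = {"d1": 3, "d2": 2, "d3": 1, "d4": 0, "d5": 0, "d6": 2}
    print(dcg(run_rels))                         # experiment curve
    print(dcg(optimal_ranking(run_rels)))        # optimal curve
    print(dcg(ideal_ranking(ground_truth, 5)))   # ideal curve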
     Looking at a performance curve, such as the DCG curve, it is not always easy to spot
the critical regions in a ranking. Indeed, DCG is a non-decreasing monotonic
function which increases only when a relevant document is found in the ranking. However, when DCG does not increase, this can be due to two different reasons: either
you are in an area of the ranking where you are expected to place relevant documents but
you are placing a not relevant one, and thus you gain nothing; or you are in an
area of the ranking where you are not expected to place relevant documents and, correctly,
you are placing a not relevant one, still gaining nothing.
     In order to overcome this and similar issues, we introduce the Relative Position (RP)
indicator [8]; RP allows us to quantify and explain what happens at each rank position
and it is paired with a visual counterpart which eases the exploration of the performances
across the ranking. RP allows us to immediately grasp the most critical areas, as can
be seen in Figure 1, which shows the VATE2 system. RP quantifies the effect of misplacing
relevant documents with respect to the ideal case, i.e. it accounts for how far a document
is from its ideal position. Overall, the greater the absolute value of RP, the bigger the
distance of the document from its ideal interval.
     We envision the following scenario. By exploiting VIRTUE, the user conducts a failure
analysis and hypothesizes the cause of the problem in the IR system at hand. At the same
time, the user hypothesizes that, if such a failure were fixed, a given relevant document
would be ranked higher than in the current system. As shown in Figure 1, what VATE2 offers
to the user is: (i) the possibility of dragging and dropping the target document to the
estimated position in the ranking; (ii) the estimation of which other documents would be
affected by the movement of the target document and how the overall ranking would
be modified; (iii) the computation of the system performances according to the new
ranking.


4   Analytical Framework
We introduce the basic notions of experimental evaluation in IR needed to describe the
functioning of VATE2 ; for a complete and formal definition of experimental evaluation
in IR refer to [3] and [8].
We consider relevance as an ordered set of naturals, say REL ⊆ N, where rel ∈ REL
indicates the degree of relevance of a document d ∈ D for a given topic t ∈ T . The
ground truth associates a relevance degree to each document with respect to a topic. So,
let T be a set of topics and D a set of documents, then we can define a ground truth for
each topic, say tk ∈ T , as a map GTtk of pairs (d, rel) with size |D|, where d ∈ D is a
document and rel ∈ REL is its relevance degree with respect to tk .
      We indicate with Ltk a list of triples (di , simi , reli ) of length N representing the
ranked list of documents retrieved for a topic tk ∈ T , where di ∈ D is a document
(di ≠ dj , ∀ i, j ∈ [1, N] | i ≠ j), simi ∈ R is a degree indicating the similarity of di to
tk , and reli ∈ REL indicates the relevance degree of di for tk ; the triples in Ltk are in
decreasing order of similarity degree, such that simi ≥ simj if i < j.
      Now, we can point out some methods to access the elements in a ranked list Lt that
we will use in the following. Lt (1, 1) = d1 returns the document in the first triple of
Lt , Lt (2, 1) = d2 the second document and so on; whereas, Lt (:, 1) returns the list of
documents in Lt . In the same vein, Lt (1, 2) = sim1 returns the similarity degree of the
first document in Lt and Lt (1, 3) = rel1 returns its relevance degree; moreover,
Lt (1, :) = (d1 , sim1 , rel1 ) returns the first triple in Lt .
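      As a reference for the Python sketches used throughout this section, a ranked list can be represented as a list of (document, similarity, relevance) triples; the accessors below mirror the Lt (i, j) notation. This representation is our own illustration, not code from VATE2 .

    from collections import namedtuple

    # one triple (d_i, sim_i, rel_i) of the ranked list L_t
    Entry = namedtuple("Entry", ["doc", "sim", "rel"])

    L_t = [Entry("d1", 0.91, 0), Entry("d2", 0.87, 2), Entry("d3", 0.55, 1)]

    L_t[0].doc              # Lt(1, 1) -> the first document, "d1"
    L_t[0].sim              # Lt(1, 2) -> its similarity degree
    L_t[0].rel              # Lt(1, 3) -> its relevance degree
    [e.doc for e in L_t]    # Lt(:, 1) -> the list of documents
    L_t[0]                  # Lt(1, :) -> the whole first triple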
      Relative Position (RP) is a measure that quantifies the misplacement of a document
in a ranked list with respect to the ideal one. In order to introduce RP we need to define
the concepts of minimum rank and maximum rank of a given relevance degree building
on the definition of ideal ranked list. Given an ideal ranked list It with length N ∈ N+
and a relevance degree rel ∈ REL, then the minimum rank minIt (rel) returns the first
position i ∈ [1, N] at which we find a document with relevance degree equal to rel, while
the maximum rank maxIt (rel) returns the last position i at which we find a document
with relevance degree equal to rel in the ideal ranking.

                                                                               
                 0
                                       if minIt Lt (i, 3)) ≤ i ≤ maxIt Lt (i, 3)
                                                             
       RPLt [i] = i − minIt Lt (i, 3)   if i < minIt Lt (i, 3)
                                                             
                  i − maxIt Lt (i, 3) if i > maxIt Lt (i, 3)
                 

    The RP measure points out the instantaneous and local effect of misplaced docu-
ments and how much they are misplaced with respect to the ideal ranking It . In the
above definition, zero values denote documents which are within the ideal inter-
val; positive values denote documents which are ranked below their ideal interval, i.e.
documents of higher relevance degree that are in a position of the ranking where less
relevant ones are expected; and negative values denote documents which are above their
ideal interval, i.e. less relevant documents that are in a position of the ranking where
documents of higher relevance degree are expected.
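     A minimal sketch of the RP computation as defined above: the ideal intervals [minIt (rel), maxIt (rel)] are derived from the ideal ranking and each rank position of the run is then scored. The function names and the toy example are ours.

    def rank_intervals(ideal_rels):
        # Map each relevance degree to its (min rank, max rank) interval in the ideal list (1-based).
        intervals = {}
        for pos, rel in enumerate(ideal_rels, start=1):
            lo, hi = intervals.get(rel, (pos, pos))
            intervals[rel] = (min(lo, pos), max(hi, pos))
        return intervals

    def relative_position(run_rels, ideal_rels):
        # RP_Lt[i]: 0 inside the ideal interval, positive below it, negative above it.
        intervals = rank_intervals(ideal_rels)
        rp = []
        for i, rel in enumerate(run_rels, start=1):
            lo, hi = intervals[rel]
            rp.append(0 if lo <= i <= hi else (i - lo if i < lo else i - hi))
        return rp

    # ideal ranking: 3, 2, 2, 1, 0 ; run ranking: 0, 2, 3, 1, 2
    print(relative_position([0, 2, 3, 1, 2], [3, 2, 2, 1, 0]))  # [-4, 0, 2, 0, 2]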
    In VATE2 , document clustering is adopted in the context of the failure hypothesis:
“closely associated documents tend to be affected by the same failures”. This states the
common intuition that a given failure will affect documents with common features
and, consequently, that a fix for that failure will have an effect on the documents sharing
those common features. Once the user has selected a target document, say d j , within a
ranked list Lt , our goal is to get a cluster of documents, Cd j , similar to d j , where the
similarity is quantified by the IR system, say SX , which generated Lt .
    The creation of a cluster of documents similar to d j is very close to the operation
done by the SX IR system to get a ranked list of documents starting from a topic t.
Indeed, SX , takes the topic t and calculates the similarity between the topic and each
document in D; afterwards it returns a ranked list Lt of documents ordered by decreasing
similarity to the topic. The document cluster creation methodology we adopt in VATE2
follows this very procedure: given a target document d j we use it as topic and we submit
it to the IR system SX which returns a ranked list of documents, say Cd j , ordered by
decreasing similarity to d j .
     The first document in Cd j is always d j , followed by progressively less similar
documents. Cd j tells us which documents are seen in “the same way” by the IR
system being tested and will probably be affected by the same issues found for d j .
We limit to 10 the number of documents in the cluster to be moved, in order to
consider only the documents most similar to the target document d j .
     A cluster Cd j is defined as a list of pairs (dk , simk ) where dk ∈ D is a document and
simk is the similarity of dk to d j . The first document in Cd j is d j and, by definition, it has
the maximum similarity value in the cluster. For every considered cluster we normalize
the similarities in the [0, 1] range by dividing their values by the maximum similarity value
in the cluster.
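     A sketch of the cluster construction under the procedure described above; search(text, k), standing for the retrieval function of the system SX under test, and doc_text are hypothetical names used only for illustration.

    def build_cluster(target_doc_id, doc_text, search, size=10):
        # Build C_dj: query the IR system with the target document's text, keep the
        # `size` most similar documents and normalize similarities to [0, 1].
        # search() is assumed to return (doc_id, similarity) pairs in decreasing similarity order.
        ranked = search(doc_text[target_doc_id], k=size)
        max_sim = ranked[0][1]  # by construction the first hit is the target document itself
        return [(doc_id, sim / max_sim) for doc_id, sim in ranked]

    # e.g. cluster = build_cluster("d6", doc_text, search)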
     The clusters of documents so defined play a central role in the document movement
estimation of VATE2 . Indeed, once a user spots a misplaced document, say d4 , and
s/he decides to move it upward, the ten documents in the Cd4 cluster are also moved
accordingly.
     We developed two variations of movement: a simple one called constant movement
(cm) and a slightly more complex one called similarity-based movement (sbm). We
present the constant movement first and then we build on it to explain the similarity-
based one.
     Let us consider a general environment where a ranked list Lt is composed of N
triples such that the list of documents is d1 , d2 , . . . , dN , where the subscript of the documents indicates their position in the ranking. As a consequence of the failure hypothesis,
if we move a document d j ∈ Lt from position j to position k – which means that we
move d j by ∆ = j − k positions – we also move the documents in the cluster C j by ∆
positions accordingly.
     The constant movement is based on three assumptions:

 1. Linear movement: if d j is moved upward from position j to position k where ∆ =
     j − k, all the documents in its cluster C j are moved by ∆ in the same direction.
 2. Cluster independence: the movement of the cluster C j does not imply the movement
    of other clusters. This means that when we move d j in the position of dk , dk is
    influenced by the movement, but the cluster Ck is not.
 3. Unary shifting: if C j is moved by ∆ positions, then the other documents in the
    ranking have to make room for them and thus they are moved downward by one
    position.

    The pseudo-code of the constant movement is reported in Algorithm 1 (implementing the actual movement) and Algorithm 2 (implementing the operations necessary to
reorder the ranked list after a movement)5 .
 5 For the sake of simplicity, in these algorithms we employ four convenience methods:
   POSITION (L, d), which returns the rank index of d in L; SIZE (L), which returns the number of
   elements of L; ADD(L, L′ ), which adds the elements of L′ at the end of L; and GET C LUSTER(Lt , dj ),
   which returns the cluster of d j .
  Algorithm 1: MOVEMENT
      Input: The ranked list Lt , the document dj to be moved, the target rank position tPos.
      Output: The ranked list Lt after the movements.
   1  Cdj ← GET C LUSTER(Lt , dj )
   2  sPos ← POSITION(Lt , Cdj (1, 1))
   3  for i ← SIZE(Cdj ) to 1 do
   4      oldPos ← POSITION(Lt , Cdj (i, 1))
   5      if oldPos == 0 then
   6          oldPos ← SIZE(Lt ) + 1
   7      newPos ← oldPos − (sPos − tPos)
   8      Lt ← REORDER L IST(Lt , oldPos, newPos, Cdj (i, 1))
   9  end
  10  return Lt




The algorithm starts by moving by ∆ the last document in the cluster; for example, if the
last document of a cluster Cd6 is d4 and ∆ brings it to the top of the ranking, d4 is put in
the place of d1 and d1 is shifted downward by one position. Afterwards, the algorithm repeats
the same operation for all the other documents in the cluster, generating the reordered list L′t .
     We have seen that, in a general setting, if we move d j upwards by ∆ positions, all
the documents in C j move accordingly by ∆ positions. There are cases where this is
not possible, because the movement is capped at the top by one or more documents in
the cluster. As an example, consider an upward movement of d j : if there is a document
dw ∈ C j such that w < ∆ , then dw cannot be moved upwards by ∆ positions, but at most
by w. In this case, dw is moved by w positions while the other documents, if possible,
are moved by ∆ positions.
     This movement can be easily changed by altering the three starting
assumptions. For instance, one can decide that constant movement is no longer a valid
assumption, e.g. by saying that when d j is moved by ∆ positions, the documents in C j
are moved by ∆ − σ , where σ is a variable calculated on the basis of the documents'
rank or similarity score. The similarity-based movement does exactly this by changing
the way in which the new document positions are calculated starting from the ∆ value
defined for the starting document d j ; the new movement is obtained by substituting the
instruction at line 7 of Algorithm 1 with the following:

                                                                   
                      newPos ← oldPos ∗ ( 1 − ((sPos − tPos) / sPos) ∗ Cdj (i, 2) )


    We can see that the new position of a document di is weighted by two terms: Cdj (i, 2),
which is the normalized similarity of di to d j in the cluster Cd j , and (sPos − tPos)/sPos,
which determines the relative movement of the starting document d j . Every document di is
moved by the same relative amount as d j , weighted by its normalized similarity in
the cluster. Basically, d j , which has similarity 1 by definition, is always moved by the
number of positions indicated by the user, whereas the other documents in the cluster
are moved by a number of positions depending on their similarity to d j : the higher it is, the
bigger the movement upward.
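    As a minimal illustration of this substitution, the function below computes the new position of a cluster member; rounding the result to an integer rank is our assumption, since the rounding policy is not specified here, and the function name is ours.

    def similarity_based_new_pos(old_pos, s_pos, t_pos, sim):
        # The member's relative movement is the relative movement (s_pos - t_pos) / s_pos
        # of the target document d_j, weighted by the member's normalized similarity Cdj(i, 2).
        return round(old_pos * (1 - ((s_pos - t_pos) / s_pos) * sim))

    # d_j itself (sim = 1.0, old_pos = s_pos) always lands exactly on the target position
    print(similarity_based_new_pos(old_pos=6, s_pos=6, t_pos=2, sim=1.0))  # 2
    # a less similar cluster member moves proportionally less
    print(similarity_based_new_pos(old_pos=4, s_pos=6, t_pos=2, sim=0.5))  # 3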
    Algorithm 2: REORDER L IST
        Input: The ranked list Lt , the old position oldPos of dj , the new position newPos of dj and the document dj to be
               moved.
        Output: The reordered ranked list L′t .
     1  if newPos < 1 then
     2      newPos ← 1
     3  if newPos > 1 then
     4      chunk1 ← Lt (1 : newPos, :)
     5  else
     6      chunk1 ← [ ]
     7  end
     8  if oldPos > SIZE(Lt ) then
     9      chunk2 ← Lt (newPos : SIZE(Lt ), :)
    10  else
    11      chunk2 ← Lt (newPos : oldPos + 1, :)
    12  end
    13  if oldPos > SIZE(Lt ) then
    14      chunk3 ← [ ]
    15  else
    16      chunk3 ← Lt (oldPos + 1 : SIZE(Lt ), :)
    17  end
    18  L′t ← chunk1
    19  L′t ← ADD(L′t , dj )
    20  L′t ← ADD(L′t , chunk2)
    21  L′t ← ADD(L′t , chunk3)
    22  return L′t
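    The following is a compact Python re-implementation of Algorithms 1 and 2 for the constant movement. It follows the remove-and-insert semantics of REORDER L IST, assumes upward movements (the case targeted by VATE2 ), and all names, including the Entry triples, are our own illustration rather than the actual VATE2 code.

    from collections import namedtuple

    Entry = namedtuple("Entry", ["doc", "sim", "rel"])  # one (d_i, sim_i, rel_i) triple

    def reorder_list(ranked, old_pos, new_pos, entry):
        # REORDERLIST: take `entry` out of its old position (if it was retrieved) and
        # re-insert it at `new_pos`, shifting the displaced documents downward by one.
        new_pos = max(1, new_pos)                  # movements are capped at the top of the ranking
        if old_pos <= len(ranked):                 # the document was in the ranked list
            ranked = ranked[:old_pos - 1] + ranked[old_pos:]
        return ranked[:new_pos - 1] + [entry] + ranked[new_pos - 1:]

    def constant_movement(ranked, cluster, t_pos):
        # MOVEMENT (constant variant): move the target document d_j (cluster[0]) to rank
        # t_pos and drag every cluster member upward by the same offset delta = sPos - tPos.
        s_pos = next(i + 1 for i, e in enumerate(ranked) if e.doc == cluster[0][0])
        delta = s_pos - t_pos
        for doc_id, _sim in reversed(cluster):     # last (least similar) cluster member first
            positions = {e.doc: i + 1 for i, e in enumerate(ranked)}
            old_pos = positions.get(doc_id, len(ranked) + 1)  # unretrieved documents enter at the bottom
            entry = next((e for e in ranked if e.doc == doc_id), Entry(doc_id, 0.0, 0))
            ranked = reorder_list(ranked, old_pos, old_pos - delta, entry)
        return ranked

    # toy run of seven documents; the cluster of d6 contains d6 itself and d4
    ranked = [Entry(f"d{i}", 1.0 - 0.1 * i, 0) for i in range(1, 8)]
    cluster = [("d6", 1.0), ("d4", 0.7)]
    print([e.doc for e in constant_movement(ranked, cluster, t_pos=1)])
    # ['d6', 'd4', 'd1', 'd2', 'd3', 'd5', 'd7']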




5        Experimental Evaluation of the Analytical Framework
VATE2 is expected to work as follows: the user examines a bugged system SB , identifies
the cause of a possible failure and makes a hypothesis about how the fixed version of the
system SF would rank the documents by dragging the spotted document to the expected
position.
    To conduct an experiment in a controlled environment which accounts for this be-
haviour, we start from a properly working IR system SF and we produce a “bugged”
version of it SB by changing one component or one specific feature at a time. Then,
we consider all the possible movements that move a relevant document from a wrong
position in SB to the correct one in SF and we count how many times the prediction
of VATE2 , i.e. improvement or deterioration of the performances, corresponds to the
actual improvement or deterioration passing from SB to SF .
    The system we used is Terrier ver. 4.0 (http://terrier.org/) [13], an open source system
widely used in the IR field, which is developed and maintained by the University of Glasgow. To run
the experiment, we used a standard and openly available experimental collection C , the
TREC 8 (1999) Ad-Hoc collection [17].
    We experimented with the use case of the stemmer: we set up the Terrier system with four different stemmers, keeping all the other components fixed, namely: the
Porter stemmer [14], the Weak Porter stemmer, the Snowball stemmer, and no stemmer.
    We considered those pairs of systems SB and SF which correspond to sensible and
useful cases in practice, i.e. passing from a lesser performing system SB to a
better one SF , as for example passing from no stemming (SB ) to stemming (SF ).
    For each topic, we identified the set of all the possible predictions TP, where each
misplaced relevant document d j in LBt is moved upwards along with its cluster Cd j to
the correct position determined by its rank in the fixed ranked list LFt , thus generating
a predicted ranked list LPt .
    We computed the DCG for each of the above ranked lists: DCG(LBt ) and DCG(LFt )
indicate the DCG of the bugged system SB and of the fixed system SF , while DCG(LPt )
indicates the DCG of the predicted system SP for a given possible movement p ∈ TP.
    We consider a prediction by VATE2 to be correct if a performance improvement (or
deterioration) between the actual bugged system SB and the fixed one SF corresponds
to a performance improvement (or deterioration) between the actual bugged systems SB
and the predicted one SP . Let sgn(x) = 1 if x ≥ 0 and sgn(x) = −1 if x < 0; then for
each possible prediction p ∈ T P we define the Correct Prediction (CP) measure as:
                                                                            
            CPt [p] = | sgn( DCG(LFt ) − DCG(LBt ) ) + sgn( DCG(LPt ) − DCG(LBt ) ) | / 2
Lastly, we define the Prediction Precision (PP) for topic t as the number of correct
predictions over the total number of possible predictions: PPt = (1 / |TP|) · Σ p∈TP CPt [p]
    PPt ranges between 0 and 1, where 0 indicates that no correct prediction has been
made and 1 indicates that all the predictions were correct.
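    A sketch of the CP and PP computations as defined above, taking as input the scalar DCG values of the fixed, bugged, and predicted ranked lists; the function names and the toy figures are ours.

    def sgn(x):
        return 1 if x >= 0 else -1

    def correct_prediction(dcg_fixed, dcg_bugged, dcg_predicted):
        # CP_t[p]: 1 if the predicted change goes in the same direction as the actual
        # change from the bugged to the fixed system, 0 otherwise.
        return abs(sgn(dcg_fixed - dcg_bugged) + sgn(dcg_predicted - dcg_bugged)) / 2

    def prediction_precision(predictions):
        # PP_t: fraction of correct predictions over all possible predictions for a topic;
        # `predictions` is a list of (DCG_F, DCG_B, DCG_P) triples, one per possible movement.
        return sum(correct_prediction(f, b, p) for f, b, p in predictions) / len(predictions)

    # toy example: two predictions agree with the actual improvement, one does not
    print(prediction_precision([(12.0, 9.0, 10.5), (12.0, 9.0, 11.0), (12.0, 9.0, 8.2)]))  # 0.666...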
    In Table 1 we report the mean DCG value (DCG is calculated topic by topic and
then it is averaged over all 50 topics) for the four different stemmers. We can see that
there are substantial differences between the systems, in particular the “Weak Porter”
and the “No Stemmer” systems have much lower performances with respect to the best
one which is the “Porter” system. In Table 2 we report the results of the tests in terms



Table 1: DCG averaged over all the 50 topics of the systems considered for the stemmer
failure family.
                           Porter Weak Porter Snowball No Stemmer
                   DCG     105.57   65.63      103.03     59.92


of Prediction Precision (PP) averaged over all the topics for the considered pairs of
systems. We can see that the PP is in general satisfactory and it is higher for those pairs
where the difference in DCG is higher – e.g. SB = No Stemmer and SF = Snowball.
    Even if the constant movement behaves better than the similarity-based one in 3
out of the 5 considered cases, there is no clear evidence that one of the two movements
performs better, since they are not significantly different from the statistical point of
view according to Student's t test [16], which returns a p-value p = 0.9459. Therefore,
we decided to use the constant movement in the running implementation of VATE2
because its behavior is more intuitive to the users.

6   Visual Analytics Environment
In Figure 1(a) we can see an overview of the VATE2 system, which is available at
http://ims-ws.dei.unipd.it/vate_ui/; a video is available at http://ims.dei.unipd.it/video/VATE/sigir-vate.mp4.
Table 2: Prediction Precision of VATE2 averaged over all the 50 topics. SB indicates
the bugged system; SF indicates the fixed system; T P indicates the total number of
predictions; cm indicates the constant movement and sbm the similarity-based one.
                         SB           SF    TP PP (cm) PP (sbm)
                     No Stemmer Porter      442 .5659      .6047
                     No Stemmer Snowball    279 .7106      .7278
                     No Stemmer Weak Porter 410 .6694      .6648
                     Weak Porter Porter     475 .6283      .6014
                     Weak Porter Snowball   436 .6385      .6093






Fig. 1: (a) selection of a document and highlight of its cluster; (b) the ranked list and
the DCG curve after the movement.


The functioning of VATE2 has been described in detail in [5].
     We can see that the system is structured in three main components. The Experimental Collection Information panel (A) allows the user to inspect and interact with
the information regarding the experimental collection. More in detail, it is divided into
three sub-components: the “Experiment Selection”, where the user can select
the experimental collection, the experiment to analyze, and the evaluation measure and
its parameters; the “Topic Information”, composed of the
structured description of the topic and the topic selection grid; and the “Document Information”, reporting the content of the document under analysis.
     The Ranked List Exploration panel (B) is placed in the center and shows a visual
representation of the ranked list. More in detail, the documents are represented as rectangles ordered by rank from top to bottom, where the color indicates the RP value. The
intensity of the color encodes the severity of the misplacement: the more intense the color,
the worse the misplacement. This visualization provides the user with an intuitive insight
into the quality of the ranking.
     The Performance View panel (C) is placed on the right side and shows the performance curves of the selected experiment. The yellow curve is the ideal one, the magenta
curve is the optimal one and the cyan curve is the experiment one. The user can analyze
the trend of the experiment by comparing the behavior of its curve with the ideal and
optimal rankings, spotting possible areas of improvement.
    The user can interactively select the topic to be analyzed in the topic selection grid;
the ranked list and the performance curves are then updated according to the selected
topic for the given experiment. The ranked list can be dynamically inspected by hovering the mouse over the documents. Moreover, the user can interact with the Performance View by hovering the mouse over the curves which, by means of a tooltip,
report information about the document and the performance score.
    As shown in Figure 1, concerning what-if analysis, once the user selects a document,
the system displaces to the right the rectangles corresponding to the documents in its
similarity cluster and reports their identifiers alongside. The user can then drag the selected
document to a new position in the ranked list; afterwards, the movement
algorithm is triggered and moves the document, along with its similarity cluster, to the
new positions. This action is visually represented by an
animated movement of the corresponding rectangles to the new positions. After the
movement, the ranked list is split in two parts: the old ranked list on the left and the new
ranked list produced after the movement on the right. In this way the user can visually
compare the effects of the movement and see which other documents have been affected
by it.

7   Conclusions
In this paper we explored the application of VA techniques to the problem of IR experi-
mental evaluation and we have seen how joining a powerful analytical framework with a
proper visual environment can foster the introduction of a new, yet highly needed, phase:
what-if analysis. Indeed, improving or fixing an IR system is an extremely
resource demanding activity and what-if analysis can help in getting an estimate of what
is worth doing, thus saving time and effort.
    We designed and developed the VATE2 system which has proven to be robust and
well-suited for its purposes from a two-fold point of view. The experimental evaluation
has numerically shown that the analytical engine, the failure hypothesis and the corre-
sponding way of clustering documents together with the document movement estima-
tion algorithms are satisfactory. The validation with domain experts has confirmed that
VATE2 is innovative, addresses an open and relevant problem and provides an effective
and intuitive solution to it.
    As future work, we plan to explore what happens when multiple movements are
considered all together. This will require an extension of the analytical engine in order to
account for the possible inter-dependencies among the different movements. Moreover,
the visual analytics environment will also require a substantial modification in order
to support users in intuitively dealing with multiple movements, interacting with the
history of the performed movements and moving back and forth within it.




References
 1. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. Visual Interactive Failure Analysis: Sup-
    porting Users in Information Retrieval Evaluation. In Proc. 4th Symposium on Information
    Interaction in Context (IIiX 2012), pages 195–203. ACM Press, 2012.
 2. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. Improving Ranking Evaluation Employ-
    ing Visual Analytics. In Information Access Evaluation meets Multilinguality, Multimodality,
    and Visualization. Proceedings of the Fourth International Conference of the CLEF Initiative
    (CLEF 2013), pages 29–40. LNCS 8138, Springer, 2013.
 3. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. VIRTUE: A visual tool for informa-
    tion retrieval performance evaluation and failure analysis. Journal of Visual Languages &
    Computing (JVLC), 25(4):394–413, 2014.
 4. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. Visual Analytics for Information Re-
    trieval Evaluation (VAIRË 2015). In Advances in Information Retrieval. Proc. 37th European
    Conference on IR Research (ECIR 2015), pages 709–812. LNCS 9022, Springer, 2015.
 5. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. A visual analytics approach for what-
    if analysis of information retrieval systems. In Proceedings of the 39th International ACM
    SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, July
    17-21, 2016. Accepted for publication, 2016.
 6. C. W. Cleverdon. The Cranfield Tests on Index Languages Devices. In Readings in Infor-
    mation Retrieval, pages 47–60. Morgan Kaufmann Publisher, Inc., 1997.
 7. N. Ferro, A. Sabetta, G. Santucci, and G. Tino. Visual Comparison of Ranked Result Cumu-
    lated Gains. In Proc. 2nd International Workshop on Visual Analytics (EuroVA 2011), pages
    21–24. Eurographics Association, 2011.
 8. N. Ferro, G. Silvello, H. Keskustalo, A. Pirkola, and K. Järvelin. The Twist Measure for
    IR Evaluation: Taking User’s Effort Into Account. Journal of the American Society for
    Information Science and Technology (JASIST), 67:620–648, 2016.
 9. D. Harman and C. Buckley. Overview of the Reliable Information Access Workshop. Infor-
    mation Retrieval, 12(6):615–641, 2009.
10. D. K. Harman. Information Retrieval Evaluation. Morgan & Claypool Publishers, USA,
    2011.
11. K. Järvelin and J. Kekäläinen. Cumulated Gain-Based Evaluation of IR Techniques. ACM
    Transactions on Information Systems (TOIS), 20(4):422–446, October 2002.
12. J. Kekäläinen and K. Järvelin. Using Graded Relevance Assessments in IR Evalua-
    tion. Journal of the American Society for Information Science and Technology (JASIST),
    53(13):1120–1129, November 2002.
13. I. Ounis, G. Amati, V. Plachouras, B. He, C. Macdonald, and C. Lioma. Terrier: A High
    Performance and Scalable Information Retrieval Platform. In Proceedings of ACM SIGIR’06
    Workshop on Open Source Information Retrieval (OSIR 2006), 2006.
14. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.
15. T. Sakai. Metrics, Statistics, Tests. In Bridging Between Information Retrieval and
    Databases - PROMISE Winter School 2013, Revised Tutorial Lectures, pages 116–163. Lec-
    ture Notes in Computer Science (LNCS) 8173, Springer, 2014.
16. Student. The Probable Error of a Mean. Biometrika, 6(1):1–25, March 1908.
17. E. M. Voorhees and D. K. Harman. Overview of the Eighth Text REtrieval Conference (TREC-
    8). In The Eighth Text REtrieval Conference (TREC-8), pages 1–24. National Institute of
    Standards and Technology (NIST), Special Publication 500-246, 1999.
18. J. Zhang. Visualization for Information Retrieval. Springer-Verlag, 2008.