Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation (Extended Abstract)*

Marco Angelini (2), Nicola Ferro (1), Giuseppe Santucci (2), and Gianmaria Silvello (1)

(1) University of Padua, Italy, {ferro,silvello}@dei.unipd.it
(2) "La Sapienza" University of Rome, Italy, {angelini,santucci}@dis.uniroma1.it

* The extended version of this abstract has been published in [1].

Abstract. Evaluation plays a crucial role in Information Retrieval (IR), and tools that support researchers and analysts in analyzing results and investigating strategies to improve IR system performance can make the analysis easier and more effective. To this purpose, we present a Visual Analytics-based approach to support the analyst in performing failure and what-if analysis.

1 Introduction

Designing, developing, and testing an IR system is a challenging task, especially when it comes to understanding and analysing the behaviour of the system under different conditions in order to tune or improve it so as to achieve the level of effectiveness needed to meet user expectations. Failure analysis is especially resource-demanding in terms of time and human effort, since it requires inspecting, for several queries, system logs, intermediate outputs of system components, and, above all, long lists of retrieved documents which need to be read one by one in order to figure out why they have been ranked in that way with respect to the query at hand.

Considering this, it is important to define new ways of helping IR researchers, analysts, and developers understand the limits and strengths of the IR system under investigation. Visual analytics techniques can assist this process by providing graphical tools which, combined with IR techniques, may ease the work of the users.

The goal of this paper is to exploit a visual analytics approach to design a methodology and develop an interactive visual system which supports IR researchers and developers in conducting experimental evaluation and improving their systems by: (i) reducing the effort needed to conduct failure analysis; (ii) allowing them to anticipate what the impact of a modification to their system could be before actually implementing it.

Fig. 1. The Visual Analytics prototype.

2 Failure Analysis

As far as failure analysis is concerned, we introduce a ranking model that allows us to understand what happens when documents with different relevance grades are misplaced in a ranked list. The proposed ranking model quantifies, rank by rank, the gain/loss obtained by an IR system with respect to both the ideal ranking, i.e. the best ranked list that can be produced for a given topic, and the optimal ranking, i.e. the best ranked list that can be produced using the documents actually retrieved by the system.

Starting from the Discounted Cumulative Gain (DCG) measures, we introduce two functions: the relative position, which quantifies how much a document has been misplaced with respect to its ideal (optimal) position, and the delta gain, which quantifies how much each document has gained/lost with respect to its ideal (optimal) DCG. On top of this ranking model, we propose a visualization (see Figure 1) where the DCG curves for the experiment ranking, the ideal ranking, and the optimal ranking are displayed together with two bars, on the left, representing the relative position and the delta gain. Note that an equivalent graph can be obtained by using nDCG in place of DCG.
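As a concrete illustration of how the quantities behind Figure 1 can be computed, the following Python sketch derives the DCG curve, a rank-by-rank delta gain, and a simple relative-position indicator from lists of relevance grades. It is only an approximation for illustration: the exact definitions of relative position and delta gain are given in the extended version [1], and the relevance grades and rankings used here are made up.

import math

def dcg_curve(gains):
    # Cumulative DCG at each rank, discounting gains by log2(rank) and
    # applying no discount at rank 1.
    discounted = [g / max(math.log2(i), 1.0) for i, g in enumerate(gains, start=1)]
    curve, total = [], 0.0
    for d in discounted:
        total += d
        curve.append(total)
    return curve

def delta_gain(run_gains, ref_gains):
    # Rank-by-rank difference between the discounted gain of the run and that
    # of a reference (ideal or optimal) ranking: positive = gained, negative = lost.
    return [(r - f) / max(math.log2(i), 1.0)
            for i, (r, f) in enumerate(zip(run_gains, ref_gains), start=1)]

def relative_position(run_gains, ref_gains):
    # Rough proxy for misplacement: 0 if the document sits inside the interval
    # of ranks that its relevance grade occupies in the reference ranking,
    # otherwise the signed distance to the closest end of that interval.
    intervals = {}
    for i, g in enumerate(ref_gains, start=1):
        lo, hi = intervals.get(g, (i, i))
        intervals[g] = (min(lo, i), max(hi, i))
    positions = []
    for i, g in enumerate(run_gains, start=1):
        lo, hi = intervals.get(g, (i, i))
        positions.append(0 if lo <= i <= hi else (i - hi if i > hi else i - lo))
    return positions

# Made-up relevance grades (0 = not relevant, 1 = relevant, 2 = highly relevant).
run     = [1, 0, 2, 0, 1]            # grades in the order retrieved by the system
optimal = sorted(run, reverse=True)  # best ordering of the retrieved documents
ideal   = [2, 2, 1, 1, 1]            # hypothetical best ranking over the collection

print(dcg_curve(run), dcg_curve(optimal), dcg_curve(ideal))
print(delta_gain(run, optimal))
print(relative_position(run, optimal))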
The proposed ranking model and the related visualization are quite innovative because information visualization and visual analytics are usually exploited to improve the presentation of the results of a system to the end user, rather than being applied to the exploration and understanding of the performance and behaviour of an IR system. Secondly, comparisons are usually made with respect to the ideal ranking only, while our method allows users to compare a system also with respect to the optimal ranking produced from the system results, thus making it possible to better interpret the obtained results [2].

Fig. 2. Data pipeline.

3 What-If Analysis

When it comes to what-if analysis, i.e. allowing users to anticipate the impact of a modification, we allow them to simulate what happens when the ranking of a given document for a certain topic is changed, not only in terms of which other documents will change their rank for that topic, but also in terms of the effect that this change has on the rankings of the other topics. In other words, we try to give the user an estimate of the "domino effect" that a change in the ranking of a single document can have.

Moreover, simulating the move of a single document (and all the related documents) produces a new ranking for a given topic, which corresponds to a new version of the system, in our case a bug fix in one of its components. However, this new version of the system will behave differently when ranking documents for the other topics in the experimental collection. Therefore, a change in the system which positively affects the performance on topic t1 may have the side effect of being detrimental to the performance on topic t2, and we would like to give users an estimate of this kind of "domino effect" as well.

Therefore, the overall goal is to obtain an initial rough estimate of the effect of a planned modification, both for the topic under examination and for the other topics, before actually implementing it. This gives researchers and developers the possibility of exploring several alternatives before having to implement them and of determining a reasonable trade-off between the effort and costs of given modifications and the expected improvements.

Figure 2 shows the block diagram describing the pipeline of the data exchanged in the whole process. We consider the general-purpose IR scenario composed of a set of topics T, a collection of documents D, and a ranking model RM; for a given topic tk ∈ T, an IR system retrieves a set of documents Dk ⊆ D, and the ranking model RM generates a ranked document list RLk. The whole set of ranked lists constitutes the input for building the Clustering via Learning to Rank Model, which is in charge of generating, for each document, a similarity cluster.

The Visualization deals with one topic t at a time: it takes as input the ranked document list for the topic t and the ideal ranked list, obtained by choosing the most relevant documents in the collection D for the topic t and ordering them in the best way. While visually inspecting the ranked list, it is possible to simulate the effect of interactively reordering it, moving a target document d and observing the effect on the ranking while this shift is propagated to all the documents of the cluster containing the documents similar to d. This cluster of documents simulates the "domino effect" within the given topic t, as illustrated by the sketch below.
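The following minimal Python sketch, in the same illustrative spirit as the previous one, mimics this within-topic propagation: the target document is moved to a new rank and the documents of its similarity cluster are dragged along by the same offset. In the prototype the clusters come from the Clustering via Learning to Rank Model; here they are simply given as input, and the re-ranking rule is a deliberately naive approximation.

def simulate_move(ranked_list, clusters, target, new_rank):
    # Move `target` to `new_rank` (0-based) and drag every document of its
    # similarity cluster by the same offset, clamping to the list boundaries.
    # `ranked_list` is a list of document ids, best first; `clusters` maps a
    # document id to the set of ids belonging to its cluster.
    offset = new_rank - ranked_list.index(target)
    cluster = clusters.get(target, {target})
    tentative = []
    for pos, doc in enumerate(ranked_list):
        shift = offset if doc in cluster else 0
        new_pos = min(max(pos + shift, 0), len(ranked_list) - 1)
        tentative.append((new_pos, doc not in cluster, pos, doc))
    # Rebuild the list; moved documents win ties at their requested position.
    tentative.sort()
    return [doc for _, _, _, doc in tentative]

# Toy usage: moving d4 to the top also drags d7, which lies in the same cluster.
run = ["d1", "d2", "d3", "d4", "d5", "d6", "d7"]
clusters = {"d4": {"d4", "d7"}}
print(simulate_move(run, clusters, target="d4", new_rank=0))
# -> ['d4', 'd1', 'd2', 'd3', 'd7', 'd5', 'd6']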
When the analyst is satisfied with the results, i.e. when he has produced a new ranking of the documents that corresponds to the effect expected from the modifications planned for the system, he can feed the newly produced ranked list into the Clustering via Learning to Rank Model, obtain a new model which takes the just-introduced modifications into account, and inspect the effects of this new model on the other topics. This re-learning phase simulates the "domino effect" that a possible modification in the system has on the topics other than t.

4 Final Remarks

This paper presented a fully-fledged analytical and visualization model to support the interactive exploration of IR experimental results. The overall goal of the paper has been to provide users with tools and methods to investigate the performance of a system and explore different alternatives for improving it, avoiding a continuous trial-and-error iteration to verify whether the proposed modifications actually provide the expected improvements.

Acknowledgements

The work reported in this paper has been supported by the PROMISE network of excellence (contract n. 258191) as a part of the 7th Framework Programme of the European Commission (FP7/2007-2013).

References

1. M. Angelini, N. Ferro, G. Santucci, and G. Silvello. Visual Interactive Failure Analysis: Supporting Users in Information Retrieval Evaluation. In J. Kamps, W. Kraaij, and N. Fuhr, editors, Proc. 4th Symposium on Information Interaction in Context (IIiX 2012). ACM Press, New York, USA, 2012.

2. E. Di Buccio, M. Dussin, N. Ferro, I. Masiero, G. Santucci, and G. Tino. To Re-rank or to Re-query: Can Visual Analytics Solve This Dilemma? In Multilingual and Multimodal Information Access Evaluation. Proc. of the 2nd Int. Conf. of the Cross-Language Evaluation Forum (CLEF 2011), pages 119–130. LNCS 6941, Springer, Heidelberg, Germany, 2011.