Interactive Analysis and Exploration of Experimental Evaluation Results Emanuele Di Buccio Marco Dussin Nicola Ferro University of Padua, Italy University of Padua, Italy University of Padua, Italy dibuccio@dei.unipd.it dussinma@dei.unipd.it ferro@dei.unipd.it Ivano Masiero Giuseppe Santucci Giuseppe Tino University of Padua, Italy Sapienza University of Rome, Sapienza University of Rome, masieroi@dei.unipd.it Italy Italy santucci@dis.uniroma1.it tino@dis.uniroma1.it ABSTRACT research groups and industries, producing a huge amount of This paper proposes a methodology based on discounted cu- valuable data to be analysed, mined, and understood. mulated gain measures and visual analytics techniques in The aim of this work is to explore how we can improve order to improve the analysis and understanding of IR ex- the comprehension of and the interaction with the experi- perimental evaluation results. The proposed methodology mental results by IR researchers and IR system developers. is geared to favour a natural and effective interaction of the We imagine the following scenarios: (i) a researcher or a de- researchers and developers with the experimental data and veloper is attending the workshop of one of the large-scale it is demonstrated by developing an innovative application evaluation campaigns and s/he wants to explore and under- based on Apple iPad. stand the experimental results as s/he is listening at the presentation discussing them; (ii) a team of researchers or developers is working on tuning and improving an IR sys- Categories and Subject Descriptors tem and they need tools and applications that allow them H.3.3 [Information Search and Retrieval]: [Search pro- to investigate and discuss the performances of the system cess]; H.3.4 [Systems and Software]: [Performance eval- under examination in a handy and effective way. uation (efficiency and effectiveness)] These scenarios call for: (a) proper metrics that allow us to understand the behaviour of a system; (b) effective General Terms analysis and visualization techniques that allow us to get an overall idea of the main factors and critical areas which have Experimentation, Human Factors, Measurement, Performance influenced performances in order to be able to dig into de- tails; (c) for tools and applications that allow us to interact Keywords with the experimental result in a both effective and natural Ranking, Visual Analytics, Interaction, Discounted Cumu- way. lated Gain, Experimental Evaluation, DIRECT To this end, we propose a methodology which allows us to quickly get an idea of the distance of an IR system with re- spect to both its own optimal performances and the best per- 1. INTRODUCTION formances possible. We rely on the (normalized) discounted The Information Retrieval (IR) field has a strong and long- cumulated gain (n)DCG family of measures [7] because they lived tradition, that dates back to late 50s/early 60s of the have shown to be especially well-suited not only to quantify last century, as far as the assessment of the performances of system performances but also to give an idea of the over- an IR system is concerned. In particular, in the last 20 years, all user satisfaction with a given ranked list considering the large-scale evaluation campaigns, such as the Text REtrieval persistence of the user in scanning the list. Conference (TREC)1 in the United States and the Cross- The contribution of this paper is to improve on the previ- Language Evaluation Forum (CLEF)2 in Europe, have con- ous work [7,11] by trying to better understand what happens ducted cooperative evaluation efforts involving hundreds of when you flip documents with different relevance grades in 1 http://trec.nist.gov/ a ranked list. This is achieved by providing a formal model 2 http://www.clef-campaign.org/ that allows us to properly frame the problem and quantify the gain/loss with respect to an optimal ranking, rank by rank, according to the actual result list produced by an IR system. The proposed model provides the basis for the develop- ment of Visual Analytics (VA) techniques that give us the possibility to get a quick and intuitive idea of what hap- pened in a result list and what determined its perceived performances. Visual Analytics [8, 10, 14] is an emerging Copyright c 2011 for the individual papers by the papers’ authors. Copy- ing permitted only for private and academic purposes. This volume is pub- multi-disciplinary area that takes into account both ad-hoc lished and copyrighted by the editors of euroHCIR2011. and classical Data Mining (DM) algorithms and Informa- tion Visualization (IV) techniques, combining the strengths vector of n documents V , i.e., V [1] contains the identifier of of human and electronic data processing. Visualisation be- the document predicted by the system to be most relevant, comes the medium of a semi-automated analytical process, V [n] the least relevant one. The ground truth GT function where human beings and machines cooperate using their re- assigns to each document V [i] a value in the relevance inter- spective distinct capabilities for the most effective results. val {0..k}, where k represents the highest relevance score, Decisions on which direction analysis should take in order e.g. k = 3 in [7]. The basic assumption is that the greater to accomplish a certain task are left to final user. While IV the position of a document the less likely it is that the user techniques have been extensively explored [4,13], combining will examine it, because of the required time and effort and them with automated data analysis for specific application the information coming from the documents already exam- domains is still a challenging activity [9]. Moreover, the ined. As a consequence, the greater the rank of a relevant Visual Analytics community acknowledges the relevance of document the less useful it is for the user. This is mod- interaction for visual data analysis, and that the current eled through a discounting function DF that progressively research activities very often focus only on visual represen- reduces the relevance of a document, GT (V [i]) as i increases: tation, neglecting the interaction design, as clearly stated  in [14]. This refers to two different typologies of interaction: GT (V [i]), if i ≤ x DF (V [i]) = (1) 1) interaction within a visualization and, 2), closer to the GT (V [i])/ logx (i), if i > x paper contribution, interaction between the visual and the The quality of a result can be assessed P using the discounted analytical components. cumulative gain function DCG(V, i) = ij=1 DF (V [j]) that The idea of exploring and applying VA techniques to the estimates the information gained by a user that examines experimental evaluation in the IR field is quite innovative the first i documents of V . since it has never been attempted before and, due to the The DCG function allows for comparing the performances complexity of the evaluation measures and the amount of of different search engines, e.g., plotting the DCG(i) values data produced by large-scale evaluation campaigns, there is of each engine and comparing the curve behavior. a strong need for better and more effective representation However, if the user’s task is to improve the ranking per- techniques. Moreover, visualizing and assessing ranked list formance of a single search engine, looking at the misplaced of items, to the best of the authors’ knowledge, has not been documents (i.e., ranked too high or too low with respect to addressed by the VA community. The few related propos- the other documents) the DCG function does not help: the als, see, e.g., [12], use rankings for presenting the user with same value DCG(i) could be generated by different permu- the most relevant visualizations, or for browsing the ranked tations of V and it does not point out the loss in cumulative result, see, e.g., [5], but do not deal with the problem of gain caused by misplaced elements. To this aim, we intro- observing the ranked item position, comparing it with an duce the following definitions and novel metrics. ideal solution, to assess and improve the ranking quality. A We denote with OptP erm(V ) the set of optimal permu- first attempt in such a direction is in [1], where the authors tations of V such as that ∀OV ∈ OptP erm(V ) it holds explored the basic issues associated with the problem, pro- that GT (OV [i]) ≥ GT (OV [j])∀i, j <= n V i < j, that viding basic metrics and introducing a VA web based system is, OV maximizes the values of DCG(OV, i)∀i. In other that allows for exploring the quality of a ranking with re- words, OptP erm(V ) represents the set of the optimal rank- spect to an optimal solution. ings for a given search result. It is worth noting that each On top of the proposed model, we have built a running vector in OptP erm(V ) is composed by k + 1 intervals of prototype where the experimental results and data are stored documents sharing the same GT values. As an example, as- in a dedicated system accessible via standard Web services. suming a result vector composed by 12 elements and k = 3, This allows for the design and development of various client a possible sequence of GT values of an optimal vector OV applications and tools for exploiting the managed data. In is <3,3,3,3,2,2,2,2,1,1,0,0>; according to this we define the particular, in this paper, we have started to explore the pos- max index(V, r) and min index(V, r) functions, with 0 ≤ sibility of adopting the Apple iPad3 as an appropriate device r ≤ k, that return the greatest and the lowest indexes of el- to allow users to easily and naturally interact with the ex- ements in a vector belonging to OptP erm(V ) that share the perimental data and we have developed an initial prototype same GT value r. As an example, considering the above 12 that allows us for interactively inspecting the actual experi- GT values, min index(V, 2) = 5 and max index(V, 2) = 8. mental data in order to get insights about the behaviour of Using the above definitions we can define the relative posi- a IR system. tion R P os(V [i]) function for each document in V as follows: Overall, the proposed model, the proposed visualization ( techniques, and the implemented prototype meet all the (a- 0, if min index(V, GT (V [i]) ≤ i ≤ max index(V, GT (V [i]) c) requirements for the two scenarios introduced above. min index(V, GT (V [i]) − i, if i < min index(V, GT (V [i]) The paper is organized as follows. Section 2 introduces the max index(V, GT (V [i]) − i, if i > max index(V, GT (V [i]) model underlying the system together with its visualization R P os(V [i]) allows for pointing out misplaced elements techniques; Section 3 describes the interaction strategies of and understanding how much they are misplaced: 0 values the system, Section 4 describes the overall architecture of denote documents that are within the optimal interval, nega- the system, and Section 5 concludes the paper, pointing out tive and positive values denote elements that are respectively ongoing research activities. below and above the optimal interval. The absolute value of R P os(V [i]) gives the minimum distance of a misplaced 2. THE PROTOTYPE element from its optimal interval. According to [7] we model the retrieval results as a ranked According to the actual relevance and rank position, the same value of R P os(V [i]) can produce different variations 3 http://www.apple.com/ipad/ of the DCG function. We measure the contributions of mis- Figure 1: The iPad prototype interface. placed elements with the function ∆ Gain(V, i) that com- position. Similarly, the ∆ Gain vector codes the function pares ∀i the actual values of DF (V [i]) with the correspond- using colors: light blue refers to positive values, light red ing values in OV , DF (OV [i]): ∆ Gain(V, i) = DF (V [i]) − codes negative values, and green 0 values. Moreover, if the DF (OV [i]). user touches a specific area of the R P os vector (that is sim- ulated by the gray round in Figure 1), the main results list automatically scrolls back, providing the end user with a de- 3. INTERACTION tailed view on the corresponding documents. The rightmost A multi-touch prototype interface based on the model pre- part of the screen shows the DCG graphs of the ideal, the sented in section 2 has been designed for the iPad device. It optimal and the experiment vector, i.e. the ranking curves. has been developed and tested on the iOS 4.24 with the inte- The navigation bar displays a back button on the right which gration of the Core Plot5 plotting framework for the graph- let the user visualize the results for a different topic. ical visualization of data. The interface allows the end user for comparing the curve of the ranked results, for a given experiment/topic, with the optimal one and with the ideal 4. ARCHITECTURE one. This facilitates the activities of failure analysis, eas- The design of the architecture of the system benefits from ily locating misplaced elements, blue or red items, that pop what has been learned in ten years of work for the CLEF and up from the visualization together with the extent of their in the design and implementation of Distributed Information displacement and the impact they have on DCG. Retrieval Evaluation Campaign Tool (DIRECT), the system Figure 1 shows a screenshot of the current interface: the developed in CLEF since 2005 to manage all the aspects of main list on the left represents the top n = 200 ranked result an evaluation campaign [2, 3]. for a given experiment/topic and it can be easily scrolled by The approach to the architecture is the implementation the user. Each row corresponds to a document ID, a short of a modular design, as sketched in Figure 2, with the aim snippet of the content is included in the subtitle of each to clearly separate the logic entailed by the application into cell and more information on a specific result (i.e. relevance three levels of abstraction – data, application, and interface score, DCG, R P os, ∆ Gain) can be viewed by touching the logic – able to reciprocally communicate, easily extensible row. On the right side there are two coloured vectors which and implementable using modular and reusable components. show the R P os and ∆ Gain functions. The R P os vec- The Data Logic layer, depicted at the bottom of Figure 2, tor presents the results using different color shadings: light deals with the persistence of the information coming from green, light red and light blue respectively for documents the other layers. From the implementation point of view, that are within, below and above the optimal interval. It data stored into databases and indexes are mapped to re- allows for locating misplaced documents and, thanks to the sources and communicate with the upper levels through the shading, understanding how they are far from the optimal mechanism granted by the Data Access Object (DAO) pat- 4 tern6 — see point (1) in Figure 2. The Application Logic http://developer.apple.com/ 5 6 http://code.google.com/p/core-plot/ http://java.sun.com/blueprints/corej2eepatterns/ Acknowledgements The work reported in this paper has been partially sup- atio n ported by the PROMISE network of excellence (contract plic Ap and e 5 r fa c n. 258191), as a part of the 7th Framework Program of the Inte ogic L European commission (FP7/2007-2013). 6. REFERENCES 4 [1] N. Ferro, A. Sabetta, G. Santucci, G. Tino, and F. Acc ess Veltri. Visual comparison of ranked result cumulated Con n trol atio gains. In Proc. of EuroVA 2011. Eurographics, 2011. RE plic STfu lW Ap Logic eb Serv ice 3 [2] M. Agosti, G. Di Nunzio, M. Dussin, and N. Ferro. 10 6 Years of CLEF Data in DIRECT: Where We Are and Resource 2 Where We Can Go. In Proc. of EVIA 2010, pages Logging Infrastructure Resource 16–24. Tokyo, Japan, 2010. Resource [3] M. Agosti and N. Ferro. Towards an Evaluation ogic Infrastructure for DL Performance Evaluation. In ta L Da Resource DAO Evaluation of Digital Libraries: An Insight to Useful Resource DAO Applications and Methods. Chandos Publishing, Resource DAO ta b ase s Oxford, UK, 2009. Da and s dexe [4] S. K. Card and J. Mackinlay. The structure of the In 1 ion information visualization design space. In Proc. of r act bst InfoVis ’97, pages 92–99, Washington, DC, USA, nA atio plic n Ap 1997. IEEE Computer Society. atio ent n Im ple m [5] M. Derthick, M. G. Christel, A. G. Hauptmann, and atio plic H. D. Wactlar. Constant density displays using Ap diversity sampling. In Proc. of the IEEE Information Visualization, pages 137–144, 2003. Figure 2: The Architecture of the Application. [6] R. T. Fielding and R. N. Taylor. Principled design of the modern web architecture. ACM TOIT, 2:115–150, layer is in charge of the high-level tasks made by the sys- 2002. tem, such as the enrichment of raw data, the calculation [7] K. Järvelin and J. Kekäläinen. Cumulated Gain-Based of metrics and the carrying out of statistical analyses on Evaluation of IR Techniques. ACM TOIS, experiments. These resources (2) are therefore accessible 20(4):422–446, October 2002. via HTTP through a RESTful Web service [6], sketched at [8] D. Keim, G. Andrienko, J.-D. Fekete, C. Görg, point (3). After the validation of credentials and permissions J. Kohlhammer, and G. Melançon. Information made by the access control mechanism (4), it is possible for visualization. chapter Visual Analytics: Definition, remote devices such as web browsers or custom clients (5) Process, and Challenges, pages 154–175. to create, modify, or delete resources attaching their rep- Springer-Verlag, Berlin, Heidelberg, 2008. resentation in XML7 or JSON8 format to the body of an [9] D. Keim, J. Kohlhammer, G. Santucci, F. Mansmann, HTTP request, and to read them as response of specific F. Wanner, and M. Schäfer. Visual Analytics queries. A logging infrastructure (6) grants the tracking of Challenges. In Proc. of eChallenges 2009, 2009. all the activities made at each layer and can be used to ob- [10] D. A. Keim, F. Mansmann, J. Schneidewind, and tain information about the provenance of all the managed H. Ziegler. Challenges in visual data analysis. In Proc. resources. of IV’06, pages 9–16, 2006. [11] H. Keskustalo, K. Järvelin, A. Pirkola, and 5. CONCLUSIONS J. Kekäläinen. Intuition-Supporting Visualization of We have presented a model and a prototype which allow User’s Performance Based on Explicit Negative users to easily interact with the experimental results and to Higher-Order Relevance. In Proc. of SIGIR ’08, pages work together in a cooperative way while actually accessing 675–681. ACM Press, NY, USA, 2008. the data. This first step uncovers new and interesting pos- [12] J. Seo and B. Shneiderman. A rank-by-feature sibilities for the experimental evaluation and for the way in framework for interactive exploration of which researchers and developers usually carry out such ac- multidimensional data. In Proc. of the IEEE tivities. For example, the proposed techniques may alleviate Information Visualization, pages 65–72, 2004. the burden of certain tasks, such as failure analysis, which [13] B. Shneiderman. The eyes have it: a task by data type are often overlooked due to their demanding nature, thus taxonomy for information visualizations. In Proc. of making easier and more common to perform them and, as a the 1996 IEEE Symposium on Visual Languages, consequence, improving the overall comprehension of system pages 336 –343, 1996. behaviour. This will be explored in the future work. [14] J. J. Thomas and K. A. Cook. A visual analytics Patterns/DataAccessObject.html agenda. IEEE Computer Graphics and Applications, 7 http://www.w3.org/XML/ 26:10–13, 2006. 8 http://www.ietf.org/rfc/rfc4627.txt