=Paper=
{{Paper
|id=None
|storemode=property
|title=A Visualization Tool of Probabilistic Models for Information Access Components
|pdfUrl=https://ceur-ws.org/Vol-560/paper6.pdf
|volume=Vol-560
|dblpUrl=https://dblp.org/rec/conf/iir/Nunzio10
}}
==A Visualization Tool of Probabilistic Models for Information Access Components==
Giorgio Maria Di Nunzio
Dept. of Information Engineering, University of Padua
Via Gradenigo 6/a, 35131 Padua, Italy
dinunzio@dei.unipd.it

This is an extended abstract of [1]. Appears in the Proceedings of the 1st Italian Information Retrieval Workshop (IIR'10), January 27–28, 2010, Padova, Italy. http://ims.dei.unipd.it/websites/iir10/index.html Copyright owned by the authors.

ABSTRACT

An effective graphic interface is a key tool for improving the use of the results retrieved by an Information Retrieval (IR) system. In this work, we describe a two-dimensional interface that represents the ranked documents on a Cartesian space and allows the user to interact with the documents in order to improve the results of the search engine. Results are classified and ranked according to the best separating line of the two classes of documents: relevant and non-relevant documents. Mathematical tools such as least squares distances are used to train the supervised algorithm that finds the separating and ranking lines.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - Information filtering, Relevance feedback, Retrieval models, Search process; H.5.2 [Information Interfaces and Presentation]: User Interfaces - Graphical user interfaces (GUI)

General Terms

Algorithms, Design, Experimentation

Keywords

Information Visualization, Machine Learning, Naïve Bayes Models, Relevance Feedback

1. INTRODUCTION

Visualization is the process of transforming data, information, and knowledge into graphic presentations to support tasks such as data analysis and information exploration. Defining a spatial structure for information visualization is challenging because data in an information space may be multi-faceted, and the relationships among data are interwoven and complicated. Moreover, defining such a space involves a complex process of extracting displayable attributes from objects, organizing the information, projecting objects onto the structure, and synthesizing search features, objects and object relationships into the visual space [5].

The introduction of a visualization environment may add cognitive load for the user, who needs to understand and learn the characteristics of the new environment and interact with it to get the best from the system. Yet the aim of visualization environments, as external representations of the world of interest, is to reduce the amount of cognitive effort required to solve informationally equivalent problems [4]. In particular, an IR system should provide users with an environment in which they can exploit their skills to maximize their cognitive abilities. The visualization of an IR system is a process that transforms invisible abstract data and their semantic relationships into a visible collection on a display, so that the user's information need can be satisfied more easily.

In this paper, we present the design and implementation of a tool for the visualization of Naïve Bayes (NB) probabilistic models for information access components that represents digital objects on a two-dimensional space [2, 3, 1]. The demonstration will be applied to the tasks of automatic text classification and text retrieval.

2. DESIGN

The model underlying the visualization tool defines a direct relationship between the probability of an object given a category of interest and a point on a two-dimensional space. In this light, it is possible to graph entire collections of objects on a Cartesian plane, and to design algorithms that categorize and retrieve documents directly on this two-dimensional representation. The tool also proves to be a valid aid for understanding the relationships between categories of objects.

The design of the two-dimensional visualization tool follows two main requirements:

• for end users, the interface should give the opportunity to define the query with simple or advanced options, and to express judgements on the retrieved documents, which will be used to re-rank the documents;

• for researchers, the interface should display the decisions taken by the search engine in terms of the separating line, and explain how the relevance feedback given by the user affects the list of ranked documents.

The interface offers the possibility to write free-text queries, as in any other search engine, or to load predefined queries; predefined queries are used for research purposes and recreate the environment of evaluation tasks organized by campaigns such as TREC (http://trec.nist.gov/) or CLEF (http://www.clef-campaign.org).

The interface associates each document of the collection with a point in the two-dimensional space according to a probabilistic algorithm: the abscissa reflects how much the document is relevant to the query, and the ordinate reflects how much the document is not relevant to the query. The pair of numbers gives an indication of the fraction of relevance for that particular document given the query. This pair is plotted on a frame, and the position of the point relative to the other documents in the collection determines its position in the list of ranked documents.

In the two-dimensional representation of documents, the equation of the ranking or classification function has to be written in such a way that each coordinate of a document is the sum of two addends: a variable component P(d|c_i), the probability of a document d given a category of interest c_i, and a constant component P(c_i), the prior of the category of interest c_i [3]. For example, in the case of NB models the equation becomes:

<math>\underbrace{\log P(d \mid c_i) + \log P(c_i)}_{X_i(d)} > \underbrace{\log P(d \mid \bar{c}_i) + \log P(\bar{c}_i)}_{Y_i(d)}</math>

When the inequality holds, the document is considered an element of category c_i. If c_i and c̄_i are taken to be, respectively, the set of relevant documents and the set of non-relevant documents, we can divide the collection of documents into these two sets; if we are only interested in the ranking of documents, we can compute the list of retrieved documents by combining the two components into one relevance weight.

Documents can be classified or ranked differently according to the Focused Angular Region algorithm, which computes the best separating (or ranking) line by means of regression techniques and least squares with orthogonal and vertical distances. Information about the categories of documents is collected during the interaction of the user with the interface; in particular, the relevance judgements that the user expresses on the documents are used to re-compute the probabilities and train the algorithm (details of this supervised algorithm are given in [3]). This step can be done automatically by selecting the option "Blind relevance feedback" in the interface, which takes the first n documents of the current list of documents and sets them as relevant.

3. RESULTS AND OPEN QUESTIONS

This visualization tool was tested on standard benchmark collections, and a demonstration was presented in [1] in order to answer the following research questions: how well the ranking or classification functions are learned from the data as separating lines; how particular unbalanced distributions of documents can be corrected by means of parameter estimation; how the multivariate model and the multinomial model perform on different languages; how blind and/or explicit relevance feedback affects the ranking list; and how the selection of relevant documents changes the shape of the clouds of relevant and non-relevant documents.

During the interaction with the system, new questions and new research ideas were collected about advanced types of interaction: changing the estimated probability of terms directly; changing smoothing parameters in order to see how the clouds of points move in the space and how the performance changes accordingly; and drawing the clouds of points incrementally, highlighting the contribution of each term in order to understand which terms better discriminate the two sets of points.

Figure 1 shows a screenshot of the main window of the visualization tool, in the version of the interface used by researchers. The different separating lines are calculated for a blind relevance feedback of 10 documents: the category of relevant documents is in blue, the category of non-relevant documents in red, and the best separating line in purple. The list of retrieved documents is presented on the right. The user can select a document, read it, and judge it as relevant or non-relevant. This information is stored and used to train the supervised algorithm when the user selects the "update search" box.

Figure 1: An example of the interface used by researchers.

4. REFERENCES

[1] L. De Stefani, G. M. Di Nunzio, and G. Vezzaro. A visualization tool of probabilistic models for information access components. In Proceedings of Research and Advanced Technology for Digital Libraries (ECDL 2009), Corfu, Greece, September/October 2009. LNCS, Springer.

[2] G. M. Di Nunzio. Visualization and Classification of Documents: A New Probabilistic Model to Automated Text Classification. Bulletin of the IEEE Technical Committee on Digital Libraries (IEEE-TCDL), 2(2), 2006.

[3] G. M. Di Nunzio. Using Scatterplots to Understand and Improve Probabilistic Models for Text Categorization and Retrieval. International Journal of Approximate Reasoning, 50(7):945–956, July 2009. http://dx.doi.org/10.1016/j.ijar.2009.01.002.

[4] M. Scaife and Y. Rogers. External cognition: how do graphical representations work? International Journal of Human-Computer Studies, 45:185–213, 1996.

[5] J. Zhang. Visualization for Information Retrieval, volume 23 of The Information Retrieval Series. Springer, 2008. ISBN: 978-3-540-75147-2.
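
As an illustration of the two-dimensional model described in Section 2, the following minimal Python sketch computes the coordinates X_i(d) and Y_i(d) of a document, applies the NB inequality, ranks documents, and simulates blind relevance feedback. It is not the tool's code. Several choices are assumptions made only for this example: a multinomial model with additive smoothing (`alpha`), the difference X - Y as the single relevance weight, and all function names.

```python
import math
from collections import Counter


def train_multinomial(docs, vocab, alpha=1.0):
    """Estimate smoothed term probabilities P(t|c) from tokenized documents.

    Additive smoothing with parameter alpha is an assumption of this sketch.
    """
    counts = Counter(t for doc in docs for t in doc if t in vocab)
    denom = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / denom for t in vocab}


def log_likelihood(doc, term_probs):
    """log P(d|c) under a multinomial model; out-of-vocabulary terms are skipped."""
    return sum(math.log(term_probs[t]) for t in doc if t in term_probs)


def coordinates(doc, probs_rel, probs_non, prior_rel):
    """Return (X, Y): log-likelihood plus log-prior for each of the two classes."""
    x = log_likelihood(doc, probs_rel) + math.log(prior_rel)
    y = log_likelihood(doc, probs_non) + math.log(1.0 - prior_rel)
    return x, y


def is_relevant(x, y):
    """NB decision rule: the document belongs to c_i when X > Y."""
    return x > y


def rank(points):
    """Rank document ids by decreasing X - Y (one possible relevance weight)."""
    return sorted(points, key=lambda d: points[d][0] - points[d][1], reverse=True)


def blind_feedback(ranking, n):
    """Blind relevance feedback: treat the first n ranked documents as relevant."""
    return set(ranking[:n])


# Toy data, invented for this example.
vocab = {"cat", "dog", "fish"}
probs_rel = train_multinomial([["cat", "cat", "dog"]], vocab)
probs_non = train_multinomial([["fish", "fish", "dog"]], vocab)
collection = {"d1": ["cat", "dog"], "d2": ["fish"]}
points = {d: coordinates(doc, probs_rel, probs_non, 0.5) for d, doc in collection.items()}
ranking = rank(points)
assumed_relevant = blind_feedback(ranking, 1)
```

In a feedback loop, the documents in `assumed_relevant` would be added to the relevant training set and the probabilities re-estimated, which moves the points in the plane.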
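
Section 2 mentions least squares with vertical and with orthogonal distances. The sketch below only shows how these two criteria can yield different lines for the same cloud of points; it is not the Focused Angular Region algorithm, whose details are given in [3]. The function names and the convention that a point below the line y = m·x + q is labelled relevant are assumptions of this example (with m = 1 and q = 0 the convention reduces to the NB inequality X > Y).

```python
import math


def _moments(points):
    """Means and centered sums of squares/products of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return mx, my, sxx, syy, sxy


def fit_vertical(points):
    """Ordinary least squares: minimizes squared vertical distances to the line."""
    mx, my, sxx, _, sxy = _moments(points)
    m = sxy / sxx
    return m, my - m * mx


def fit_orthogonal(points):
    """Total least squares: minimizes squared perpendicular distances.

    Requires a non-zero covariance term (sxy != 0).
    """
    mx, my, sxx, syy, sxy = _moments(points)
    m = (syy - sxx + math.sqrt((syy - sxx) ** 2 + 4 * sxy ** 2)) / (2 * sxy)
    return m, my - m * mx


def classify(point, m, q):
    """Assumed convention: relevant when the point lies below y = m*x + q."""
    x, y = point
    return y < m * x + q


# Toy cloud of points, invented for this example.
cloud = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0), (3.0, 3.0)]
m_v, q_v = fit_vertical(cloud)     # slope 0.8, intercept 0.3
m_o, q_o = fit_orthogonal(cloud)   # slope 1.0, intercept 0.0
```

The two fits coincide when the points lie exactly on a line and diverge as the scatter grows, which is one reason a tool might expose both.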