=Paper= {{Paper |id=None |storemode=property |title=A Visualization Tool of Probabilistic Models for Information Access Components |pdfUrl=https://ceur-ws.org/Vol-560/paper6.pdf |volume=Vol-560 |dblpUrl=https://dblp.org/rec/conf/iir/Nunzio10 }} ==A Visualization Tool of Probabilistic Models for Information Access Components== https://ceur-ws.org/Vol-560/paper6.pdf
                   A Visualization Tool of Probabilistic Models
                      for Information Access Components∗

                                                          Giorgio Maria Di Nunzio
                                                       Dept. of Information Engineering
                                                             University of Padua
                                                         Via Gradenigo 6/a, 35131
                                                                  Padua, Italy
                                                            dinunzio@dei.unipd.it

∗ This is an extended abstract of [1].
Appears in the Proceedings of the 1st Italian Information Retrieval
Workshop (IIR'10), January 27–28, 2010, Padova, Italy.
http://ims.dei.unipd.it/websites/iir10/index.html
Copyright owned by the authors.

ABSTRACT
An effective graphical interface is a key tool for improving the use
of the results retrieved by an Information Retrieval (IR) system. In
this work, we describe a two-dimensional interface that represents the
ranked documents as points on a Cartesian plane and allows the user to
interact with the documents in order to improve the results of the
search engine. Results are classified and ranked according to the best
separating line between the two classes of documents: relevant and
non-relevant. Mathematical tools such as least squares distances are
used to train the supervised algorithm that finds the separating and
ranking lines.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and
Retrieval - Information filtering, Relevance feedback, Retrieval
models, Search process; H.5.2 [Information Interfaces and
Presentation]: User Interfaces - Graphical user interfaces (GUI)

General Terms
Algorithms, Design, Experimentation

Keywords
Information Visualization, Machine Learning, Naïve Bayes Models,
Relevance Feedback

1. INTRODUCTION
   Visualization is the process of transforming data, information, and
knowledge into graphic presentations to support tasks such as data
analysis and information exploration. Defining a spatial structure for
information visualization is challenging because data in an
information space may be multi-faceted, and the relationships among
the data are interwoven and complicated. Moreover, defining such a
space involves a complex process of extracting displayable attributes
from objects, organizing the information, projecting objects onto the
structure, and synthesizing search features, objects and object
relationships into the visual space [5].
   The introduction of visualization environments may add cognitive
processes for the user, who needs to understand and learn the
characteristics of the new environment and interact with it to get the
best from the system. Yet the aim of visualization environments, as
external representations of the world of interest, is to reduce the
amount of cognitive effort required to solve informationally
equivalent problems [4]. In particular, an IR system should provide
users with an environment in which they can exploit their skills to
maximize their cognitive abilities. The visualization of an IR system
is a process that transforms invisible abstract data and their
semantic relationships into a visible collection on a display, so that
the user's information need can be satisfied more easily.
   In this paper, we present the design and implementation of a tool
for the visualization of Naïve Bayes (NB) probabilistic models for
information access components that represents digital objects on a
two-dimensional space [1, 2, 3]. The demonstration will be applied to
the tasks of automatic text classification and text retrieval.

2. DESIGN
   The model underlying the visualization tool defines a direct
relationship between the probability of an object given a category of
interest and a point on a two-dimensional space. In this light, it is
possible to graph entire collections of objects on a Cartesian plane,
and to design algorithms that categorize and retrieve documents
directly on this two-dimensional representation. The tool also proves
to be a valid means of understanding the relationships between
categories of objects.
   The design of the two-dimensional visualization tool follows two
main requirements:

   • for end users, the interface should give the opportunity to
     define the query with simple or advanced options, and to express
     judgements on the retrieved documents, which will be used to
     re-rank the documents;
   • for researchers, the interface should display the decisions taken
     by the search engine in terms of the separating line and explain
     how the relevance feedback given by the user affects the list of
     ranked documents.

The interface lets the user write free-text queries, as in any other
search engine, or load predefined queries; predefined queries are used
for research purposes and recreate the environment of evaluation tasks
organized by campaigns such as TREC¹ or CLEF².
    The interface associates each document of the collection with a
point in the two-dimensional space according to a probabilistic
algorithm: the abscissa reflects how relevant the document is to the
query, and the ordinate reflects how non-relevant it is. The pair of
numbers gives an indication of how relevant that particular document
is to the query. This pair is plotted on a frame, and the position of
the point relative to the other documents in the collection determines
the document's position in the list of ranked documents.
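
As a minimal sketch of how a document could be mapped to such a point, the Python snippet below assumes a multinomial NB model with Laplace smoothing and computes each coordinate in the NB form used in this paper, log P(d|c) + log P(c), once for the relevant class (abscissa) and once for the non-relevant class (ordinate). The function names, the toy term counts, the equal priors, and the use of the difference x - y as a single ranking weight are illustrative assumptions, not the tool's actual implementation.

```python
import math
from collections import Counter


def log_score(doc_terms, term_counts, prior, vocab_size, alpha=1.0):
    """Return log P(d|c) + log P(c) for one class.

    Assumes a multinomial NB model with Laplace (add-alpha) smoothing.
    """
    total = sum(term_counts.values())
    score = math.log(prior)
    for term in doc_terms:
        p_term = (term_counts.get(term, 0) + alpha) / (total + alpha * vocab_size)
        score += math.log(p_term)
    return score


def to_point(doc_terms, rel_counts, nonrel_counts, prior_rel, vocab_size):
    """Map a document to (x, y): x for the relevant class, y for the non-relevant one."""
    x = log_score(doc_terms, rel_counts, prior_rel, vocab_size)
    y = log_score(doc_terms, nonrel_counts, 1.0 - prior_rel, vocab_size)
    return x, y


# Hypothetical term statistics for the two classes (toy data).
rel_counts = Counter({"retrieval": 8, "ranking": 5, "query": 4, "football": 1})
nonrel_counts = Counter({"football": 9, "match": 6, "query": 1, "retrieval": 1})
vocab_size = len(set(rel_counts) | set(nonrel_counts))

docs = {
    "d1": ["retrieval", "ranking", "query"],
    "d2": ["football", "match"],
    "d3": ["query", "football"],
}

# One point per document on the two-dimensional plane.
points = {
    name: to_point(terms, rel_counts, nonrel_counts, 0.5, vocab_size)
    for name, terms in docs.items()
}

# Assumed ranking weight: how far the point lies on the "relevant" side (x - y).
ranked = sorted(points, key=lambda name: points[name][0] - points[name][1], reverse=True)

# Classification with the line y = x as separating line: relevant when x > y.
predicted_relevant = [name for name in docs if points[name][0] > points[name][1]]

for name in ranked:
    x, y = points[name]
    print(f"{name}: x={x:.3f}, y={y:.3f}, weight={x - y:.3f}")
```

In this sketch the separating line is fixed at y = x; a learned line with a different slope or intercept would change both the predicted class and, potentially, the order of the ranked list.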
    In the two-dimensional representation of documents, the equation
of the ranking or classification function has to be written in such a
way that each coordinate of a document is the sum of two addends: a
variable component $P(d \mid c_i)$, the probability of a document $d$
given a category of interest $c_i$, and a constant component
$P(c_i)$, the prior of the category of interest $c_i$ [3]. For
example, in the case of NB models the equation becomes:

\[
\underbrace{\log P(d \mid c_i) + \log P(c_i)}_{X_i(d)} \;>\; \underbrace{\log P(d \mid \bar{c}_i) + \log P(\bar{c}_i)}_{Y_i(d)}
\]

When the inequality holds, the document is considered an element of
category $c_i$. If $c_i$ and $\bar{c}_i$ are taken to be,
respectively, the set of relevant documents and the set of
non-relevant documents, we can divide the collection of documents into
these two sets; if we are only interested in the ranking of documents,
we can compute the list of retrieved documents by combining the two
components into one relevance weight.
    Documents can be classified or ranked differently according to the
Focused Angular Region algorithm, which computes the best separating
(or ranking) line by means of regression techniques and least squares
orthogonal, and vertical, distances. Information about the categories
of documents is collected during the interaction of the user with the
interface; in particular, the relevance judgements that the user
expresses for the documents are used to re-compute the probabilities
and train the algorithm (details of this supervised algorithm are
given in [3]). This part can be done automatically by selecting in the
interface the option "Blind relevance feedback", which takes the first
n documents of the current list of documents and sets them as
relevant.

3. RESULTS AND OPEN QUESTIONS
    This visualization tool was tested on standard benchmark
collections, and a demonstration was presented at ECDL 2009 [1] in
order to answer the following research questions: how well the ranking
or classification functions are learned from the data as separating
lines; how particular unbalanced distributions of documents can be
corrected by means of parameter estimation; how the multivariate model
and the multinomial model perform on different languages; how blind
and/or explicit relevance feedback affects the ranked list; and how
the selection of relevant documents changes the shape of the clouds of
relevant and non-relevant documents.
    During the interaction with the system, new questions and new
research ideas were collected about advanced types of interaction:
changing the estimated probability of terms directly; adjusting
smoothing parameters in order to see how the clouds of points move in
the space and how the performance changes accordingly; drawing the
clouds of points incrementally, highlighting the contribution of each
term to understand which terms better discriminate the two sets of
points.
    Figure 1 shows a screenshot of the main window of the
visualization tool, in the version of the interface used by
researchers. The different separating lines are calculated for a blind
relevance feedback of 10 documents: the category of relevant documents
is shown in blue, the category of non-relevant documents in red, and
the best separating line in purple. The list of retrieved documents is
presented on the right. The user can select a document, read it, and
judge it as relevant or non-relevant. This information is stored and
used to train the supervised algorithm when the user selects the
"update search" box.

Figure 1: An example of the interface used by researchers.

4. REFERENCES
[1] L. De Stefani, G. M. Di Nunzio, and G. Vezzaro. A visualization
    tool of probabilistic models for information access components. In
    Proceedings of Research and Advanced Technology for Digital
    Libraries (ECDL 2009), Corfu, Greece, September/October 2009.
    LNCS, Springer.
[2] G. M. Di Nunzio. Visualization and Classification of Documents: A
    New Probabilistic Model to Automated Text Classification. Bulletin
    of the IEEE Technical Committee on Digital Libraries (IEEE-TCDL),
    2(2), 2006.
[3] G. M. Di Nunzio. Using Scatterplots to Understand and Improve
    Probabilistic Models for Text Categorization and Retrieval.
    International Journal of Approximate Reasoning, 50(7):945–956,
    July 2009. http://dx.doi.org/10.1016/j.ijar.2009.01.002.
[4] M. Scaife and Y. Rogers. External cognition: how do graphical
    representations work? International Journal of Human-Computer
    Studies, 45:185–213, 1996.
[5] J. Zhang. Visualization for Information Retrieval, volume 23 of
    The Information Retrieval Series. Springer, 2008.
    ISBN: 978-3-540-75147-2.

¹ http://trec.nist.gov/
² http://www.clef-campaign.org