A Data Extraction and Visualization Framework for Information Retrieval Systems

Alessandro Celestini (Institute for Applied Computing, National Research Council of Italy, a.celestini@iac.cnr.it)
Antonio Di Marco (Institute for Applied Computing, National Research Council of Italy, a.dimarco@iac.cnr.it)
Giuseppe Totaro (Department of Computer Science, University of Rome "Sapienza", totaro@di.uniroma1.it)

ABSTRACT
In recent years we have witnessed a continuous growth in the amount of data that both public and private organizations collect and profit by. Search engines are the most common tools used to retrieve information and, more recently, clustering techniques have proved effective in helping users to skim query results. However, the majority of systems proposed to manage information provide textual interfaces to explore search results that are not specifically designed to offer an interactive experience to users. To address this problem, we focus on how to conveniently extract data from sources of interest, and how to enhance their analysis and consultation through visualization techniques. In this work we present a customizable framework able to acquire, search, and interactively visualize data. The framework is built upon a modular architectural schema, and its effectiveness is illustrated by a prototype implemented for a specific application domain.

Keywords
Data Visualization, Data Extraction, Acquisition.

1. INTRODUCTION
The amount of data collected by private and public organizations is steadily growing, and search engines are the most common tools used to quickly browse it. Many works, in different research areas, face the problem of how to manipulate such data and transform it into valuable information by making it navigable and easily searchable. Clustering techniques have been shown to be quite effective for that purpose and have been thoroughly investigated in past years [17, 18, 2]. However, the majority of currently available solutions (e.g., Carrot2, http://project.carrot2.org; Yippy, http://www.yippy.com/) just supply textual interfaces to explore search results.
In recent years, several works have studied how users interact with interfaces during exploratory search sessions, reporting useful results about their behavior [12, 11]. These works show that users spend the majority of their time looking at the results and at the facets, and only a negligible amount of time looking at the query itself [11], underlining the importance of user interface development. According to those works, it is clear that textual interfaces are not very effective at improving exploratory search, so a different solution has to be applied.
Data visualization techniques seem well suited to pursue such goals. Indeed, visualization offers an easy-to-use, efficient, and effective method for presenting data to a large and diverse audience, including users without any programming background. The main goal of such techniques is to present data in a fashion that supports intuitive interaction to spot patterns and trends, thus making the data usable and informative. In this work we focus on data extraction and data visualization for information retrieval systems, i.e., how to extract data from the sources of interest in a convenient way, and how to enhance their analysis and consultation through visualization techniques. To meet these goals we propose a general framework, presenting its architectural schema composed of four logic units: acquisition, elaboration, storage, and visualization. We also present a prototype developed for a case study. The prototype has been implemented for a specific application domain and is available online.
The rest of the paper is organized as follows. Section 2 discusses some frameworks and platforms related to our study. Section 3 presents the framework's architectural schema. Section 4 describes a prototype through a case study and, finally, Section 5 concludes the paper, suggesting directions for future work.
2. RELATED WORK
In this section we discuss some works proposing frameworks and platforms for data visualization.
WEKA [9] is a Java library that provides a collection of state-of-the-art machine learning algorithms and data processing tools for data mining tasks. It comes with several graphical user interfaces, but can also be extended through a simple API. The WEKA workbench includes a set of visualization tools and algorithms for classification, regression, attribute selection, and clustering, useful for discovering and understanding data.
Orange [6] is a collection of C++ routines providing a set of data mining and machine learning procedures that can be easily combined in order to develop new algorithms. The framework makes it possible to perform different tasks, including data input and manipulation, development of classification models, and visualization of processed data. Orange also provides a scriptable environment, based on Python, and a visual programming environment, based on a set of graphical widgets.
While WEKA and Orange contain several tools for data mining tasks, our aim is to improve information retrieval systems and users' understanding of data through visualization techniques. Basic statistical analyses should be exposed through interactive charts, so that users can perform them directly.
In [8] the authors present FuseViz, a framework for Web-based fusion and visualization of data. The framework provides two basic features: fusion and visualization. FuseViz collects data from multiple sources and fuses them into a single data stream; the joint data streams are then visualized through charts and maps in a Web page. FuseViz has been designed to operate in a smart environment, where several deployed probes sense the environment in real time, and the data to visualize are live time series.
The Biketastic platform [16] is an application developed to facilitate knowledge exchange among bikers. The platform enables users to share routes and experiences. For each route, Biketastic captures location, sensed data, and media, recorded while participants ride. Route data are then managed by a backend platform that makes visualizing and sharing route information easy and convenient.
FuseViz and Biketastic share the peculiarity of being explicitly designed to cope with a specific task in a particular environment. The proposed schemas could be re-implemented in different applications, but there is no clear extension and adaptation procedure defined (and possibly supported) by the authors. Our aim is to present a framework that: a) can be easily integrated with an existing information retrieval system; b) provides a set of tools to profitably extract data from heterogeneous sources; c) requires minimum effort to produce new interactive visualizations.

3. FRAMEWORK OVERVIEW
Our framework adheres to a simple and well-known schema (shown in Figure 1), structured in four logic units:

1. Acquisition: obtains data from the sources;
2. Elaboration: processes the acquired data to fit operational needs;
3. Storage: stores the processed data in a persistent way and makes them available to the users;
4. Visualization: provides a visual representation of the data.

Figure 1: Architectural Schema

The framework is mainly focused on the acquisition and visualization stages, whereas the other two are part of the architecture but are not implemented by us. From an engineering perspective, both middle stages (elaboration and storage) are considered black-box components: only their input and output specifications must be available. All logic units play a crucial role in visualizing data, so we describe each of them according to the purposes of our framework.
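To make these contracts concrete, the four units can be sketched as plain Java interfaces. This is an illustrative sketch only: the framework does not prescribe an API, and every name below (ParsedDocument included) is our own assumption.

```java
import java.io.InputStream;
import java.util.List;
import java.util.Map;

/** A parsed document: extracted text plus its metadata (our name, not the framework's). */
record ParsedDocument(String id, String text, Map<String, String> metadata) {}

/** Collects and parses raw sources into well-formed contents. */
interface Acquisition {
    List<ParsedDocument> acquire(Iterable<InputStream> sources);
}

/** Black box, e.g. a semantic engine (enriched XML) or a search engine (an index). */
interface Elaboration {
    String process(ParsedDocument document);
}

/** Black box that persists elaboration results for the visualization unit. */
interface Storage {
    void store(String docId, String result);
    String retrieve(String query);
}

/** Renders stored results, e.g. as interactive web-based charts. */
interface Visualization {
    void render(String storedResults);
}
```

Treating elaboration and storage as interfaces is what makes them swappable black boxes: any engine or database satisfying the contract can be plugged in.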
3.1 Acquisition
This component is in charge of collecting and preprocessing data. Given a collection of documents, possibly in different formats, the acquisition stage prepares data and organizes them to feed the elaboration unit.
Data acquisition can be considered the first (mandatory) phase of any data processing activity that precedes data visualization. Cleveland [5] and Fry [7] examine in depth the logical structure of visualizing data by identifying seven stages: acquire, parse, filter, mine, represent, refine, and interact. Each stage in turn requires applying techniques and methods from different fields of computer science. The seven stages are important in order to reconcile all scientific fields involved in data visualization, especially from the logical point of view. However, with regard to our prototype, we refer to data acquisition as a software component able to collect, parse, and extract data in an efficient and secure way. The output of data acquisition is a selection of well-formed contents that are intelligible to the elaboration unit.
We collect data by connecting the acquisition unit to a data source (e.g., files from a disk or data over a network). We assume to work with static data: static/persistent data are not modified during data acquisition, while dynamic data refer to information that is asynchronously updated. The approach to data collection depends on the goals and desired results. For instance, forensic data collection requires the application of scientifically sound and proven methods (see http://dfrws.org/2001/dfrws-rm-final.pdf) to produce a bit-stream copy of the data, that is, an exact bit-by-bit copy of the original media certified by a message digest and/or a secure hash algorithm. Thus, data collection in many circumstances has to address specific issues concerning the prevention, detection, and correction of errors.
The acquired data must then be parsed according to their digital structure in order to extract the data of interest and prepare them for the elaboration unit. Parsing is potentially a time-consuming process, especially when working with heterogeneous data formats. The parsing stage is also necessary to extract the metadata related to the examined data. Both textual contents and metadata are usually extracted and stored in specific data interchange formats like JSON or XML.
Moreover, security and efficiency aspects have to be considered during the design of a data acquisition unit. However, it is beyond the scope of the present work to discuss security- and efficiency-related issues, regardless of their important implications for data acquisition.
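As an aside on the forensic scenario above, certifying a bit-stream copy boils down to hashing the stream. A minimal JDK-only sketch (the choice of SHA-256 and the class name are ours) could be:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class Fingerprint {
    /** Streams a file through SHA-256 and returns the hex digest. */
    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // read fully, digesting as we go
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```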
Figure 2: Data enrichment over time (across parsing, processing, preservation, and presentation, the original data are progressively augmented with metadata and results)

3.2 Elaboration and Storage
The elaboration unit takes as input the data extracted during the acquisition phase, and it has to analyze them and extrapolate information from them. Data analysis, for instance, may be performed by a semantic engine or by a traditional search engine. In the former case we obtain as output the document collection enriched with semantic information; in the latter case the output is an index. Moreover, along with the analysis results, the elaboration unit may return an analysis of the metadata related to the documents received as input.
The main task of the storage unit is to store the analysis results produced by the elaboration unit and make them available to the visualization unit. At this stage the main issue is to optimize data access, specifically the querying time, in order to reduce the time spent by the visualization unit retrieving the information to display. Several storage solutions can be implemented; in particular, one may choose among different types of databases [3, 13]. The traditional choice could be a relational database, but there are several alternatives, e.g., XML databases or graph databases.
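Because the middle stages are black boxes, the storage contract can be satisfied by anything from a relational database to a throwaway test double. As a sketch, here is an in-memory stand-in for the hypothetical Storage interface outlined earlier (again, all names are our own):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the storage unit; purely a test double. */
class InMemoryStorage implements Storage {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    @Override
    public void store(String docId, String result) {
        // Write-once: elaboration results are never updated in place.
        results.putIfAbsent(docId, result);
    }

    @Override
    public String retrieve(String query) {
        // Toy semantics: treat the query as a document id.
        return results.get(query);
    }
}
```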
3.3 Visualization
The visualization unit is in charge of making data available and valuable to the user. As a matter of fact, visualization is fundamental to transform analysis results into valuable information for the user and to help her/him explore the data. In particular, the visualization of the results may help the user extract new information from the data and decide future queries. As previously discussed, the time spent by the user looking at the query itself is negligible, whereas the time spent looking at the results, and at how they are displayed, is long-lasting. Thus, the interface design is crucial for the effectiveness of this unit, and the guidelines outlined in [12] may become a useful guide for its design and implementation. Given the tight interaction with the user, it is quite important to take into account the response time and the usability of the interface. The visualizations provided should be interactive, to enable the user to perform analysis operations on the data. The same data should be displayed in several layouts to highlight their different aspects. Finally, it is quite important to provide multiple filters for each visualization, in order to offer the user the chance of a dynamic interaction with the results.

3.3.1 The "Wow-Effect"
A really effective data visualization technique has to be developed keeping in mind two fundamental guidelines: abstraction and correlation. However, scientists often focus on the creation of trendy, but not always useful, visualizations meant to arouse astonishment in the users who observe them, causing what McQuillan [14] defines as the Wow-Effect. Unfortunately, the Wow-Effect vanishes quickly and results in stunning visualizations that are worthless for the audience. This effect is also related to the intrinsic complexity of the data generated from the acquisition to the visualization stage. As shown in Figure 2, the impact of the original data on the total amount of information decreases over time. Thus, we invested effort in developing a framework able to overcome the "negative" wow effect by providing visualizations that are easy to use and effective.

4. CASE STUDY: 4P'S PIPELINE
In this section we present an application of the framework developed for a case study. According to the main task accomplished by each framework unit, we named the whole procedure the 4P's pipeline: parsing, processing, preservation, and presentation. The prototype is a browser-based application available online at http://kelvin.iac.rm.cnr.it/interface/. The data set used for testing the 4P's pipeline is a collection of documents in different file formats (e.g., PDF, HTML, MS Office types, etc.). The data set was obtained by collecting documents from several sources, mainly related to news in the English language.

4.1 Parsing task
The acquisition unit is designed to effectively address the issues discussed in Section 3.1. Parsing is the core task of our acquisition unit, and for its implementation we exploited the Apache Tika framework (http://tika.apache.org/). Apache Tika is a Java library that carries out detection of the document type and the extraction of both metadata and structured textual content. It uses existing parser libraries and supports most data formats.

4.1.1 Tika parsing
Tika is currently the de facto "babel fish", performing automatic text extraction and content analysis for more than 1200 data formats; furthermore, several projects aim at extending Tika to handle other formats. Document type detection is based on a taxonomy provided by the IANA media types registry (http://tools.ietf.org/html/rfc6838), which contains hundreds of officially registered types. There are also many unofficial media types that require attention, so Tika has its own media type registry that contains both officially registered types and other widely used, albeit unofficial, types. This registry maintains the information associated with each supported type. Tika implements six methods for type detection [4], respectively based on the following criteria: filename patterns, Content-Type hints, magic byte prefixes, character encodings, structure/schema detection, and combined approaches.
The Parser interface is the key concept of Apache Tika. It provides a high level of abstraction, hiding the complexity of different file formats and parsing libraries. Moreover, it represents an extension point for adding new parser Java classes to Apache Tika, which must implement the Parser interface. The selection of the parser implementation to be used for a given document may be either explicit or automatic (based on detection heuristics). Each Tika parser performs text extraction (only for text-oriented types) and metadata extraction from digital documents. Parsed metadata are written to the Metadata object after the parse() method returns.

Figure 3: Acquisition unit (the input file goes to the Tika detector; detectable files are handled by a Tika parser returning text and metadata, with a retry through an ad-hoc parser if a Tika exception occurs; undetectable files are marked as octet-stream and routed to the ad-hoc parsers)

4.1.2 Acquisition unit in detail
Our acquisition unit uses Tika to automatically perform type detection and parsing of the files collected from the data sources, using all available detectors and parser implementations.
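To give an idea of the Tika calls involved, automatic detection plus extraction reduces to a few lines. The wiring below is our minimal sketch, not the prototype's actual code:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtraction {
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        AutoDetectParser parser = new AutoDetectParser();         // type detection + parsing
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        for (String name : metadata.names()) {  // metadata filled in by the parser
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println(handler);            // the extracted textual content
    }
}
```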
Although Tika is, to the best of our knowledge, the most complete and effective way to extract text and metadata from documents, there are some situations in which it cannot accomplish its job, for example when Tika fails to detect the document format or when, even though it correctly recognizes the file type, an exception occurs during parsing. The acquisition unit handles both situations by using alternative parsers designed to work with specific types of data (see Figure 3); a sketch of this control flow follows the list below.

• Whenever Tika is not able to detect a file, because either it is not a supported file type or the document is not correctly detectable (for example, it has a malformed/misleading Content-Type attribute), the examined file is marked as application/octet-stream, i.e., a type used to indicate that a body contains arbitrary binary data. The acquisition unit then processes documents whose exact type is undetectable by using a customized set of ad-hoc parsers, each one specialized in handling specific types. For instance, Tika does not currently support Outlook PST files, so they are marked as octet-stream subtypes. The acquisition unit analyzes the undetected file using criteria such as filename extension patterns or more sophisticated heuristics, and finally sends the binary data to an ad-hoc parser based on the java-libpst library (https://code.google.com/p/java-libpst/).

• During parsing, even though a document is correctly detected by Tika, errors/exceptions can occur, interrupting the extraction process for the target file. In this case, the acquisition unit retries the parsing of the file that caused the Tika exception using, if available, a suitable parser selected from a list of ad-hoc parsers.
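The two fallback paths can be sketched as follows. AdHocParser, the registry, and the subtype heuristic are hypothetical names of ours (the paper names only java-libpst), but the branching mirrors Figure 3:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

/** Sketch of the acquisition unit's fallback logic; all names are ours. */
class FallbackAcquisition {
    interface AdHocParser { String parse(InputStream in) throws Exception; }

    // e.g. maps an octet-stream subtype to a java-libpst-based parser
    private final Map<String, AdHocParser> adHoc;
    private final Tika tika = new Tika();

    FallbackAcquisition(Map<String, AdHocParser> adHoc) { this.adHoc = adHoc; }

    String extractText(Path file) throws Exception {
        String type = tika.detect(file.toFile());
        if ("application/octet-stream".equals(type)) {
            return parseAdHoc(file, guessSubtype(file));   // case 1: undetectable
        }
        try (InputStream in = Files.newInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (TikaException e) {
            return parseAdHoc(file, type);                 // case 2: error during parsing
        }
    }

    private String parseAdHoc(Path file, String key) throws Exception {
        AdHocParser p = adHoc.get(key);
        if (p == null) throw new TikaException("no ad-hoc parser for " + key);
        try (InputStream in = Files.newInputStream(file)) {  // fresh stream for the retry
            return p.parse(in);
        }
    }

    private String guessSubtype(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        return name.endsWith(".pst") ? "outlook-pst" : "unknown"; // extension heuristic
    }
}
```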
The acquisition unit extracts metadata from documents according to a unified schema based on the basic metadata properties contained in the TikaCoreProperties interface, which all (Tika and ad-hoc) parsers attempt to extract. A unified schema is necessary in order to have a uniform experience when searching against metadata properties. A more complete and complex way to address "metadata interoperability" consists in applying schema matching techniques in order to provide suitable metadata crosswalks.

4.2 Processing and Preservation tasks
The second and third tasks are, respectively, the processing and the preservation of data. The elaboration and storage units that perform these tasks are tightly coupled: all processed data must be stored in order to preserve the elaboration results in a persistent way. The two units cooperate through a simple Write-Once-Read-Many strategy, where the visualization unit plays the reader role.

4.2.1 Elaboration unit
The elaboration unit consists of the semantic engine Cogito (http://www.expertsystem.net). Cogito analyzes text documents and is able to find hidden relationships, trends, and events, transforming unstructured information into structured data. Among its several analyses, it identifies three different types of entities (people, places, and companies/organizations), categorizes documents on the basis of several taxonomies, and extracts entity co-occurrences. Notice that this unit is outside the framework, although we included it in the architectural schema. Indeed, we did not design or develop the elaboration unit; we consider it as given. This unit is the entity with which the framework interacts and to which the framework provides functionalities, i.e., text extraction and visualization.

4.2.2 Storage unit
As the storage unit we resorted to BaseX (http://basex.org), an XML database. BaseX is an open source solution released under the terms of the BSD License. We decided to use an XML database because the results of the elaboration unit are returned in XML format. Moreover, the use of an XML database helps to reduce the time for XML document manipulation and processing, compared to a middleware application [10, 15]. An XML database also has the advantage of not constraining data to a rigid schema: in the same database we can add XML documents with different structures. Thus, the structure of the elaboration results can change without affecting the database structure itself.
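As a sketch of how elaboration results could be pushed into BaseX and queried back for the visualization unit, the snippet below uses the BaseXClient example class distributed with BaseX. The database name, credentials, sample document, and XQuery are illustrative only, and the client API is quoted from memory, so check it against the BaseX version in use:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class StoreAndQuery {
    public static void main(String[] args) throws Exception {
        // BaseXClient is the single-file example client shipped with BaseX.
        BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
        try {
            session.execute("CREATE DB analysis");  // schema-less: any XML shape fits
            session.add("doc1.xml", new ByteArrayInputStream(
                "<document id=\"doc1\"><entity type=\"place\">Rome</entity></document>"
                    .getBytes(StandardCharsets.UTF_8)));
            // The visualization unit plays the reader role (write once, read many).
            BaseXClient.Query q = session.query(
                "for $e in //entity[@type = 'place'] return string($e)");
            while (q.more()) {
                System.out.println(q.next());
            }
            q.close();
        } finally {
            session.close();
        }
    }
}
```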
4.3 Presentation task
For the development of the visualization unit we used D3.js [1] (http://d3js.org), a JavaScript library. The library provides several graphical primitives for implementing visualizations and uses only web standards, namely HTML, SVG, and CSS. With D3 it is possible to realize multi-stage animations and interactive visualizations of complex structures.
To improve data retrieval, we realized several visualization alternatives that exploit Cogito's analysis results. Figure 4 shows a treemap visualization that displays a document categorization; notice that the same document may fall into different categories. Not all categories are displayed, only eight of the most common ones; the categories reported are selected on the basis of the number of documents they contain.

Figure 4: Treemap with category zooming

The treemap visualization is quite effective in providing a global view of the data set. Our implementation also enables category zooming to restrict the set of interest, i.e., by clicking on a document the visualization displays only the documents in the same category. Moreover, the user is able to retrieve several pieces of information, such as the document's name, part of the document's content, and the document's acquisition date, directly from the visualization interface.

Figure 5: Geographic visualization and country selection

Figure 5 shows a geographic visualization that displays a geo-categorization of the documents. The countries appearing in the documents are rendered in a different color (green), to highlight the difference with respect to the others. The user can select each green country to get several pieces of information, reported inside a tooltip as shown in the figure. For each country, general information is reported, such as the capital's name, spoken languages, population figures, etc. Such information does not come from the Cogito analysis, but is added to enrich and enhance the retrieval process carried out by users. The tooltip also reports the list of documents in which the country appears and the features detected by Cogito. Features are identified according to a specific taxonomy, and for each country all the features detected inside the documents related to that country are reported. Moreover, this visualization displays geographic locations belonging to the country, possibly identified during the analysis, e.g., rivers, cities, mountains, etc.

Figure 6: Co-occurrences matrix

Figure 6 shows the visualization of entity co-occurrences (only a section of the matrix is reported in the figure). Three types of entities are identified by Cogito: places, people, and organizations. All entities are listed both on rows and on columns; when two entities appear inside the same document, the square at their intersection is highlighted. The color of the squares is always the same, but the opacity of each square is computed on the basis of the number of co-occurrences: the higher the number of co-occurrences, the darker the square at the intersection. Furthermore, a tooltip for each highlighted square reports the types of the two entities, information about the co-occurrence, and the list of documents in which the entities appear. Specifically, the tooltip reports the verb or noun connecting the entities and some information about the verb or noun used.

Figure 7: Entity-relations force directed graph

Figure 7 shows a force directed graph that displays the relations detected among the entities identified in the documents. Each entity is represented by a symbol denoting the entity's type. An edge connects two entities if a relation has been detected between them; self-loops are possible. Edges are rendered with different colors based on the relations' types. The legend concerning edges and nodes is reported on top of the visualization. A tooltip reports some information about the relations: for each edge, the sentence connecting the entities, the verb or noun used in the sentence, and the name of the document in which the sentence appears; for each node, the list of documents in which the entity appears. Furthermore, for each visualization the user may apply several filters. In particular, we give the possibility to filter data by acquisition date, geographic location, node type (co-occurrence matrix and force directed graph), relation type (force directed graph), and category (treemap).
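Force layouts in D3 conventionally consume a {nodes, links} JSON object. One way to serialize entities and relations into that shape from the Java side (our sketch, with hypothetical Entity/Relation carriers of Cogito's output, and no JSON escaping) is:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Serializes entities/relations into the {nodes, links} shape used by D3
 *  force layouts. Production code would also JSON-escape the strings. */
class GraphJson {
    record Entity(String id, String type) {}                 // e.g. ("Rome", "place")
    record Relation(int source, int target, String type) {}  // indexes into the node list

    static String toJson(List<Entity> nodes, List<Relation> links) {
        String ns = nodes.stream()
                .map(n -> String.format("{\"id\":\"%s\",\"type\":\"%s\"}", n.id(), n.type()))
                .collect(Collectors.joining(","));
        String ls = links.stream()
                .map(l -> String.format("{\"source\":%d,\"target\":%d,\"type\":\"%s\"}",
                        l.source(), l.target(), l.type()))
                .collect(Collectors.joining(","));
        return "{\"nodes\":[" + ns + "],\"links\":[" + ls + "]}";
    }
}
```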
5. CONCLUSIONS
The interest in data visualization techniques is increasing; indeed, these techniques are proving to be a useful tool in the processes of data analysis and understanding. In this paper we have discussed a general framework for data extraction and visualization, whose aim is to provide a methodology to conveniently extract data and facilitate the creation of effective visualizations. In particular, we described the framework's architecture, illustrating its components and functionalities, and a prototype. The prototype represents an example of how our framework can be applied when dealing with real information retrieval systems. Moreover, the online application demo provides several visualization examples that can be reused in different contexts and application domains.
Currently we are experimenting with our prototype for digital forensics and investigation purposes, aiming at providing law enforcement agencies with a tool for correlating and visualizing off-line forensic data that can be used by an investigator even if she/he does not have advanced skills in computer forensics. As a future activity we plan to release a full version of our prototype. At the moment the elaboration engine is a proprietary solution that we cannot make publicly available, hence we aim at replacing this unit with an open solution. Finally, we want to enhance our framework in order to facilitate the integration of data extraction and data visualization endpoints with arbitrary retrieval systems.

Acknowledgements
We would like to express our appreciation to Expert Systems for support in using Cogito. Moreover, financial support from EU projects HOME/2012/ISEC/AG/INT/4000003856 and HOME/2012/ISEC/AG/4000004362 is kindly acknowledged.

6. REFERENCES
[1] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-Driven Documents. IEEE TVCG, 17(12):2301–2309, Dec. 2011.
[2] C. Carpineto, S. Osiński, G. Romano, and D. Weiss. A survey of web clustering engines. ACM Comput. Surv., 41(3):1–17, Jul. 2009.
[3] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, May 2011.
[4] C. Mattmann and J. Zitting. Tika in Action. Manning Publications Co., 2011.
[5] W. S. Cleveland. Visualizing Data. Hobart Press, 1993.
[6] J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik, and B. Zupan. Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14(1):2349–2353, Jan. 2013.
[7] B. Fry. Visualizing Data: Exploring and Explaining Data with the Processing Environment. O'Reilly Media, Inc., 2007.
[8] G. Ghidini, S. Das, and V. Gupta. FuseViz: A framework for Web-based data fusion and visualization in smart environments. In Proc. of IEEE MASS '12, pages 468–472, Oct. 2012.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[10] S. Jokić, S. Krco, J. Vuckovic, N. Gligoric, and D. Drajic. Evaluation of an XML database based Resource Directory performance. In Proc. of TELFOR '11, pages 542–545, Nov. 2011.
[11] B. Kules, R. Capra, M. Banta, and T. Sierra. What do exploratory searchers look at in a faceted search interface? In Proc. of JCDL '09, pages 313–322, 2009.
[12] B. Kules and B. Shneiderman. Users can change their web search tactics: Design guidelines for categorized overviews. Information Processing & Management, 44(2):463–484, Mar. 2008.
[13] K. K.-Y. Lee, W.-C. Tang, and K.-S. Choi. Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 110(1):99–109, Apr. 2013.
[14] A. G. McQuillan. Honesty and foresight in computer visualizations. Journal of Forestry, 96(6):15–16, Jun. 1998.
[15] M. Paradies, S. Malaika, M. Nicola, and K. Xie. Comparing XML processing performance in middleware and database: A case study. In Proc. of Middleware Conference Industrial Track '10, pages 35–39, 2010.
[16] S. Reddy, K. Shilton, G. Denisov, C. Cenizal, D. Estrin, and M. Srivastava. Biketastic: Sensing and mapping for better biking. In Proc. of SIGCHI '10, pages 1817–1820, 2010.
[17] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web documents. In Proc. of KDD '97, pages 287–290, 1997.
[18] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to cluster web search results. In Proc. of SIGIR '04, pages 210–217, 2004.