<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Data Extraction and Visualization Framework for Information Retrieval Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Celestini</string-name>
          <email>a.celestini@iac.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Antonio Di Marco</string-name>
          <email>a.dimarco@iac.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Totaro</string-name>
          <email>totaro@di.uniroma1.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Rome “Sapienza”</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute for Applied Computing, National Research Council of</institution>
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In recent years we have witnessed continuous growth in the amount of data that both public and private organizations collect and profit from. Search engines are the most common tools used to retrieve information and, more recently, clustering techniques have proven to be an effective tool in helping users to skim query results. Most of the systems proposed to manage information provide textual interfaces for exploring search results, which are not specifically designed to offer an interactive experience to users. To address this problem, we focus on how to conveniently extract data from sources of interest and how to enhance their analysis and consultation through visualization techniques. In this work we present a customizable framework able to acquire, search and interactively visualize data. The framework is built upon a modular architectural schema, and its effectiveness is illustrated by a prototype implemented for a specific application domain.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Visualization</kwd>
        <kwd>Data Extraction</kwd>
        <kwd>Acquisition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        The volume of data collected by private and public organizations
is steadily growing, and search engines are the most common
tools used to browse them quickly. Many works, in different
research areas, address the problem of how to manipulate
such data and transform them into valuable information
by making them navigable and easily searchable.
Clustering techniques have been shown to be quite effective for that
purpose and have been thoroughly investigated in the past
years [
        <xref ref-type="bibr" rid="ref17 ref18 ref2">17, 18, 2</xref>
        ]. However, the majority of currently
available solutions (e.g., Carrot2<sup>1</sup>, Yippy<sup>2</sup>) just supply textual
interfaces to explore search results.
        <fn id="fn1"><label>1</label><p>http://project.carrot2.org</p></fn>
        <fn id="fn2"><label>2</label><p>http://www.yippy.com/</p></fn>
      </p>
      <p>
        In recent years, several works have studied how users interact with
interfaces during exploratory search sessions, reporting
useful results about their behavior [
        <xref ref-type="bibr" rid="ref11 ref12">12, 11</xref>
        ]. These works show
that users spend the majority of their time looking at the
results and at the facets, and only a negligible amount
of time looking at the query itself [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], underlining the
importance of user interface development. According to those
works, textual interfaces are not very effective
at supporting exploratory search, so a different solution has to
be applied.
      </p>
      <p>Data visualization techniques seem to be well suited to
pursue such goals. Indeed, visualization offers an easy-to-use,
efficient, and effective method capable of presenting data to a
large and diverse audience, including users without any
programming background. The main goal of such techniques
is to present data in a fashion that supports intuitive
interaction to spot patterns and trends, thus making the data
usable and informative. In this work we focus on data
extraction and data visualization for information retrieval
systems, i.e., how to extract data from the sources of
interest in a convenient way, and how to enhance their analysis
and consultation through visualization techniques. To meet
these goals we propose a general framework and present its
architectural schema, composed of four logic units:
acquisition, elaboration, storage, and visualization. We also present a
prototype developed for a case study. The prototype has
been implemented for a specific application domain and is
available online.</p>
      <p>The rest of the paper is organized as follows. Section 2
discusses some frameworks and platforms related to our study.
Section 3 presents the framework's architectural schema.
Section 4 describes a prototype through a case study and,
finally, Section 5 concludes the paper, suggesting directions
for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. RELATED WORK</title>
      <p>In this section we discuss some works proposing frameworks
and platforms for data visualization.</p>
      <p>
        WEKA [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a Java library that provides a collection
of state-of-the-art machine learning algorithms and data
processing tools for data mining tasks. It comes with
several graphical user interfaces, but can also be extended
by using a simple API. The WEKA workbench includes a
set of visualization tools and algorithms for classification,
regression, attribute selection, and clustering, useful to
discover and understand data.
      </p>
      <p>
        Orange [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] is a collection of C++ routines providing a set
of data mining and machine learning procedures which can
be easily combined in order to develop new algorithms.
The framework supports different tasks, including
data input and manipulation, methods for developing
classification models, visualization of processed data, etc.
Orange also provides a scriptable environment, based on
Python, and a visual programming environment, based on
a set of graphical widgets.
      </p>
      <p>
        While WEKA and Orange contain several tools to deal
with data mining tasks, our aim is to improve information
retrieval systems and users' understanding of data through
visualization techniques. Basic statistical analysis of the data
should be supported by the charts themselves through interaction
patterns, so that it can be performed directly by users.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] the authors present FuseViz, a framework for Web-based
fusion and visualization of data. The framework provides
two basic features: fusion and visualization. FuseViz
collects data from multiple sources and fuses them into
a single data stream. The joint data streams are then
visualized through charts and maps in a Web page. FuseViz
has been designed to operate in a smart environment, where
several deployed probes sense the environment in real time,
and the data to visualize are live time series.
      </p>
      <p>
        The Biketastic platform [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] is an application developed to
facilitate knowledge exchange among bikers. The platform
enables users to share routes and experiences. For each
route Biketastic captures location, sensed data and media.
This information is recorded while participants ride.
Route data are then managed by a backend platform that
makes visualizing and sharing route information easy and
convenient.
      </p>
      <p>FuseViz and Biketastic share the peculiarity of being
explicitly designed to cope with a specific task in a
particular environment. The proposed schemas could be
re-implemented in different applications, but no
clear extension and adaptation procedure is defined (or
supported) by the authors. Our aim is to present
a framework that: a) can be easily integrated with an
existing information retrieval system; b) provides a set of
tools to profitably extract data from heterogeneous sources;
c) requires minimum effort to produce new interactive
visualizations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. FRAMEWORK OVERVIEW</title>
      <p>Our framework adheres to a simple and well-known schema
(shown in Figure 1) structured in four logic units:</p>
      <list list-type="order">
        <list-item><p>Acquisition: aims at obtaining data from sources;</p></list-item>
        <list-item><p>Elaboration: responsible for processing the acquired
data to fit operational needs;</p></list-item>
        <list-item><p>Storage: stores the previously processed data in a
persistent way and makes them available to the users;</p></list-item>
        <list-item><p>Visualization: provides a visual representation of
data.</p></list-item>
      </list>
      <p>In practice, the framework focuses mainly on the acquisition
and visualization stages, whereas the other ones are
reported as part of the architecture but are not implemented
by us. From an engineering perspective, both middle stages
(elaboration and storage) are considered as black-box
components: only their input and output specifications
must be available. All logic units play a crucial role in
visualizing data, thus we describe them according to the
purposes of our framework.</p>
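      <p>As a rough illustration of this schema, the following Python sketch chains four units in sequence and treats elaboration and storage as black boxes known only through their input and output. It is not taken from the prototype: all names (acquire, Store, run_pipeline, word_count, as_text) are hypothetical and chosen for this example only.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]   # e.g. {"name": ..., "text": ...}
Result = Dict[str, object]  # output of the elaboration black box


def acquire(raw_items: Iterable[tuple]) -> List[Document]:
    """Acquisition: turn (name, bytes) pairs into well-formed documents."""
    return [{"name": name, "text": data.decode("utf-8", errors="replace")}
            for name, data in raw_items]


class Store:
    """Storage black box: only put() and all() are assumed to exist."""

    def __init__(self) -> None:
        self._items: List[Result] = []

    def put(self, result: Result) -> None:
        self._items.append(result)

    def all(self) -> List[Result]:
        return list(self._items)


def run_pipeline(raw_items, elaborate: Callable[[Document], Result],
                 store: Store, visualize: Callable[[List[Result]], str]) -> str:
    """Chain the four units: acquisition, elaboration, storage, visualization."""
    for doc in acquire(raw_items):
        store.put(elaborate(doc))  # the elaboration unit is supplied from outside
    return visualize(store.all())


def word_count(doc: Document) -> Result:
    # Stand-in for a real engine (semantic engine, search engine, ...).
    return {"name": doc["name"], "words": len(doc["text"].split())}


def as_text(results: List[Result]) -> str:
    # Stand-in for a real visualization: one line per document.
    return "\n".join(f"{r['name']}: {r['words']}" for r in results)


report = run_pipeline([("a.txt", b"hello world"), ("b.txt", b"one two three")],
                      word_count, Store(), as_text)
```
]]></preformat>
      <p>Because the middle units are passed in as parameters, replacing the stand-in elaboration function with a different engine does not require changes to the acquisition or visualization code, which is the point of the black-box assumption.</p>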
    </sec>
    <sec id="sec-4">
      <title>3.1 Acquisition</title>
      <p>This component is in charge of collecting and preprocessing
data. Given a collection of documents, possibly in different
formats, the acquisition stage prepares data and organizes
them to feed the elaboration unit.</p>
      <p>
        Data acquisition can be considered the first (mandatory)
phase of any data processing activity that precedes
data visualization. Cleveland [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Fry [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] examine in
depth the logical structure of visualizing data by identifying
seven stages: acquire, parse, filter, mine, represent, refine,
and interact. Each stage in turn requires applying techniques
and methods from different fields of computer science.
The seven stages are important in order to reconcile all
scientific fields involved in data visualization, especially from
the logical point of view. However, in our
prototype we refer to data acquisition as a software component
which is able to collect, parse and extract data in an efficient
and secure way. The output of data acquisition is
a selection of well-formed contents that are intelligible to
the elaboration unit.
      </p>
      <p>We can collect data<sup>3</sup> by connecting the acquisition unit to
a data source (e.g., files from a disk or data over a network).
The approach to data collection depends on goals and
desired results. For instance, forensic data collection requires
the application of scientifically sound and proven methods<sup>4</sup>
to produce a bit-stream copy of the data, that is, an exact
bit-by-bit copy of the original media certified by a message
digest and/or a secure hash algorithm. Thus, data collection
in many circumstances has to address specific issues concerning the
prevention, detection and correction of errors.</p>
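      <p>The certification of a bit-stream copy can be illustrated with a minimal Python sketch, which is not taken from the prototype. While copying a stream it computes a digest of the bytes read, then re-reads the copy and compares the two digests. The function names and the choice of SHA-256 are assumptions made for this example.</p>
      <preformat><![CDATA[
```python
import hashlib
import io


def sha256_of(stream, chunk_size=65536):
    """Return the SHA-256 hex digest of a binary stream, read in chunks."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def copy_and_certify(source, target, chunk_size=65536):
    """Copy source to target byte by byte; return the digest of what was read."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: source.read(chunk_size), b""):
        digest.update(chunk)
        target.write(chunk)
    return digest.hexdigest()


# In-memory streams stand in for the original media and its image.
original = io.BytesIO(b"evidence bytes" * 1000)
image = io.BytesIO()
source_digest = copy_and_certify(original, image)

# Verification: the copy is exact only if its digest matches the source digest.
image.seek(0)
copy_is_exact = sha256_of(image) == source_digest
```
]]></preformat>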
      <p>The acquired data must be parsed according to their digital
structure in order to extract the data of interest and prepare
them for an elaboration unit. Parsing is potentially a
time-consuming process, especially when working with
heterogeneous data formats. The parsing stage is also necessary to
extract the metadata related to the examined data. Both
textual contents and metadata are usually extracted and stored
in specific data interchange formats like JSON or XML.
Moreover, security and efficiency aspects have to be
considered during the design of a data acquisition unit. However,
it is beyond the scope of the present work to discuss
security- and efficiency-related issues, regardless of their important
implications for data acquisition.</p>
      <p>
        <fn id="fn3"><label>3</label><p>We assume that we work with static data. Static/persistent data
are not modified during data acquisition, while dynamic
data refer to information that is asynchronously updated.</p></fn>
        <fn id="fn4"><label>4</label><p>http://dfrws.org/2001/dfrws-rm-final.pdf</p></fn>
      </p>
    </sec>
    <sec id="sec-5">
      <title>3.2 Elaboration and Storage</title>
      <p>The elaboration unit takes as input the data extracted
during the acquisition phase, then analyzes them and
extrapolates information from them. Data analysis may, for instance,
be performed by a semantic engine or a traditional search
engine. In the former case the output is the
document collection enriched with semantic information; in the
latter case the output is an index. Moreover, along
with the analysis results, the elaboration unit may return
an analysis of the metadata related to the documents
received as input.</p>
      <p>
        The main task of the storage unit is to store the analysis results
produced by the elaboration unit and make them available
to the visualization unit. At this stage the main issue is to
optimize data access, specifically the querying time, in order
to reduce the time spent by the visualization unit retrieving
the information to display. Several storage solutions can be
implemented; in particular, one may choose among different
types of databases [
        <xref ref-type="bibr" rid="ref13 ref3">3, 13</xref>
        ]. The traditional choice could be a
relational database, but there are several alternatives, e.g.,
XML databases or graph databases.
      </p>
    </sec>
    <sec id="sec-6">
      <title>3.3 Visualization</title>
      <p>
        The visualization unit is in charge of making data available
and valuable for the user. Visualization
is fundamental to transform analysis results into valuable
information for the user and to help her/him explore data.
In particular, the visualization of the results may help the
user to extract new information from data and to decide
future queries. As previously discussed, the time spent by
the user looking at the query itself is negligible, whereas
most of the time is spent looking at the results and at how they are displayed.
Thus, the interface design is crucial for the
effectiveness of this unit, and the guidelines outlined in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
may serve as a useful reference for the design and
implementation of this unit. Given the tight interaction with the user,
it is important to take into account the response time
and usability of the interface. The visualizations provided
should be interactive, to enable the user to perform analysis
operations on data. The same data should be displayed in
several layouts to highlight their different aspects. Finally, it
is important to provide multiple filters for each
visualization, in order to offer the user the chance of a dynamic
interaction with the results.
      </p>
      <sec id="sec-6-1">
        <title>3.3.1 The “Wow-Effect”</title>
        <p>A truly effective data visualization technique has to be
developed keeping in mind two fundamental guidelines:
abstraction and correlation.</p>
        <p>
          However, scientists often focus on the creation of trendy
(but not always useful) visualizations that should arouse
astonishment in the users who observe them, causing what
McQuillan [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] defines as the Wow-Effect. Unfortunately,
the Wow-Effect vanishes quickly and results in
stunning visualizations that are worthless for the audience. This
effect is also related to the intrinsic complexity of the data
generated from the acquisition to the visualization stage. As shown
in Figure 2, the impact of the original data on the total amount
of information decreases over time. Thus, we invested
effort in developing a framework able to overcome the “negative”
wow effect by providing visualizations that are easy to use and
effective.
        </p>
        <p>[Figure 2: impact of the original data on the total amount of information over time. Axes: MEMORY (vertical), TIME (horizontal). Stages: Parsing, Processing, Preservation, Presentation, Interpretation. At each stage the original DATA are accompanied by METADATA, RESULTS, DATABASE content and other metadata.]</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>4. CASE STUDY: 4P’S PIPELINE</title>
      <p>In this section we present an application of the framework
developed for a case study. According to the main task
accomplished by each framework unit, we named the whole
procedure the 4P's pipeline: parsing, processing,
preservation, and presentation.</p>
      <p>The prototype is a browser-based application available
online<sup>5</sup>. The data set used for testing the 4P's pipeline is a
collection of documents in different file formats (e.g., PDF,
HTML, MS Office types, etc.). The data set was obtained by
collecting documents from several sources, mainly related to
news in English.</p>
    </sec>
    <sec id="sec-8">
      <title>4.1 Parsing task</title>
      <p>The acquisition unit is designed to effectively address the
issues discussed in Section 3.1. Parsing is the core task of
our acquisition unit, and for its implementation we used
the Apache Tika<sup>6</sup> framework. Apache Tika is a Java
library that carries out detection of the document type and
extraction of both metadata and structured textual content.
It uses existing parser libraries and supports most data
formats.</p>
      <sec id="sec-8-1">
        <title>4.1.1 Tika parsing</title>
        <p>
          Tika is currently the de facto “babel fish”, performing
automatic text extraction and content analysis of more than
1200 data formats. Furthermore, there are several projects
that aim at expanding Tika to handle other data formats.
Document type detection is based on a taxonomy provided
by the IANA media types registry<sup>7</sup>, which contains hundreds
of officially registered types. There are also many unofficial
media types that require attention, so Tika has its own
media types registry that contains both officially registered
types and other, widely used albeit unofficial, types. This
registry maintains information associated with each supported
type. Tika implements six methods for type detection [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]
respectively based on the following criteria: filename patterns,
Content-Type hints, magic byte prefixes, character
encodings, structure/schema detection, and combined approaches.
        </p>
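        <p>To make two of these criteria concrete, the following Python sketch combines magic byte prefixes with filename patterns and falls back to application/octet-stream when neither matches. It only illustrates the idea and does not reproduce Tika's detectors; the tables are small illustrative subsets and the function name detect_type is hypothetical.</p>
        <preformat><![CDATA[
```python
import os

# Illustrative subsets only; a real registry holds many more entries.
MAGIC_PREFIXES = {
    b"%PDF-": "application/pdf",
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"PK\x03\x04": "application/zip",
}
FILENAME_PATTERNS = {
    ".html": "text/html",
    ".txt": "text/plain",
    ".pdf": "application/pdf",
}
FALLBACK = "application/octet-stream"


def detect_type(filename, head):
    """Combine two criteria: magic byte prefixes first, then the filename.

    head is the first few bytes of the file content.
    """
    for prefix, media_type in MAGIC_PREFIXES.items():
        if head.startswith(prefix):
            return media_type
    extension = os.path.splitext(filename)[1].lower()
    return FILENAME_PATTERNS.get(extension, FALLBACK)
```
]]></preformat>
        <p>Checking the content before the name is a design choice of this sketch: a PDF saved with a misleading extension is still recognized from its first bytes, while a file with no known prefix and no known extension is reported as arbitrary binary data.</p>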
        <p>
          <fn id="fn5"><label>5</label><p>http://kelvin.iac.rm.cnr.it/interface/</p></fn>
          <fn id="fn6"><label>6</label><p>http://tika.apache.org/</p></fn>
          <fn id="fn7"><label>7</label><p>http://tools.ietf.org/html/rfc6838</p></fn>
        </p>
        <p>The Parser interface is the key concept of Apache Tika. It
provides a high level of abstraction, hiding the complexity
of different file formats and parsing libraries. Moreover, it
represents an extension point for adding new parser Java classes
to Apache Tika, which must implement the Parser interface.
The selection of the parser implementation to be used for
parsing a given document may be either explicit or
automatic (based on detection heuristics).</p>
        <p>Each Tika parser performs text extraction (only for
text-oriented types) and metadata extraction from digital
documents. Parsed metadata are written to the Metadata object
passed to the parse() method and are available once it returns.</p>
      </sec>
      <sec id="sec-8-2">
        <title>4.1.2 Acquisition unit in detail</title>
        <p>Our acquisition unit uses Tika to automatically perform
type detection and parsing on files collected from data
sources, using all available detectors and parser
implementations. Although Tika is, to the best of our knowledge,
the most complete and effective way to extract text and
metadata from documents, there are some situations where
it cannot accomplish its job: for example, when Tika fails
to detect the document format or, even if it correctly
recognizes the file type, when an exception occurs during parsing.
The acquisition unit handles both situations by using
alternative parsers which are designed to work with specific types
of data (see Figure 3):</p>
        <list list-type="bullet">
          <list-item><p>Whenever Tika is not able to detect a file, because
either it is not a supported file type or the document is
not correctly detectable (for example, it has a
malformed/misleading Content-Type attribute), the
examined file is marked as application/octet-stream,
i.e., a type used to indicate that a body contains
arbitrary binary data. Therefore, the acquisition unit
processes documents whose exact type is
undetectable by using a customized set of ad-hoc parsers,
each one specialized to handle specific types. For
instance, Tika does not currently support Outlook PST
files, so they are marked as octet-stream subtypes.
Then, the acquisition unit analyzes the undetected file
by using criteria such as the extension pattern or more
sophisticated heuristics, and finally it sends the binary data
to an ad-hoc parser based on the java-libpst<sup>8</sup> library.</p></list-item>
          <list-item><p>During parsing, even though a document is correctly
detected by Tika, some errors/exceptions can occur,
interrupting the extraction process related to the
target file. In this case, the acquisition unit tries to restart
the parsing of the file that has caused a Tika
exception by using, if available, a suitable parser selected
from a list of ad-hoc parsers.</p></list-item>
        </list>
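        <p>The two fallback cases described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the prototype's code: the primary parsers play the role of Tika, the ad-hoc parsers are selected by file extension only, and every name (extract, ParseError, the example parsers) is hypothetical.</p>
        <preformat><![CDATA[
```python
OCTET_STREAM = "application/octet-stream"


class ParseError(Exception):
    """Raised by a parser that cannot handle its input."""


def extract(filename, data, detect, primary_parsers, adhoc_parsers):
    """Return (text, metadata) from the primary parser, else an ad-hoc one.

    detect:          callable (filename, data) -> media type
    primary_parsers: dict media type -> parser (plays the role of Tika)
    adhoc_parsers:   dict file extension -> parser (hypothetical fallbacks)
    """
    media_type = detect(filename, data)
    if media_type != OCTET_STREAM and media_type in primary_parsers:
        try:
            return primary_parsers[media_type](data)
        except ParseError:
            pass  # second case: type detected, but parsing failed
    # First case (undetected type) and second case both end up here.
    extension = filename.rsplit(".", 1)[-1].lower()
    fallback = adhoc_parsers.get(extension)
    if fallback is None:
        raise ParseError(f"no parser available for {filename}")
    return fallback(data)


# Toy detector and parsers, used only to exercise the control flow.
def detect(filename, data):
    return "text/plain" if filename.endswith(".txt") else OCTET_STREAM


def strict_text_parser(data):
    try:
        return data.decode("ascii"), {"parser": "primary"}
    except UnicodeDecodeError as exc:
        raise ParseError(str(exc))


def lenient_text_parser(data):
    return data.decode("utf-8", errors="replace"), {"parser": "ad-hoc txt"}


def pst_parser(data):
    return "", {"parser": "ad-hoc pst", "size": len(data)}


primary = {"text/plain": strict_text_parser}
adhoc = {"txt": lenient_text_parser, "pst": pst_parser}

ok = extract("a.txt", b"hello", detect, primary, adhoc)
recovered = extract("b.txt", b"caff\xc3\xa8", detect, primary, adhoc)
undetected = extract("mail.pst", b"\x00\x01", detect, primary, adhoc)
```
]]></preformat>
        <p>In this sketch a file that no parser can handle raises an error instead of being silently skipped, so that the caller can decide how to report it; the source does not state how the prototype treats this situation.</p>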
        <p>The acquisition unit extracts metadata from documents
according to a unified schema based on the basic metadata
properties contained in the TikaCoreProperties interface, which
all (Tika and ad-hoc) parsers attempt to extract. A
unified schema is necessary in order to have a uniform experience
when searching against metadata properties. A more complete and
more complex way to address “metadata interoperability”
consists in applying schema matching techniques in order to
provide suitable metadata crosswalks.</p>
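        <p>A very small crosswalk can be sketched in Python to show what a unified schema buys at search time. The property names and the source keys below are illustrative assumptions and are not the schema used by the prototype; the function unify is hypothetical.</p>
        <preformat><![CDATA[
```python
# Unified property names (illustrative; the prototype bases its schema on
# the properties of Tika's TikaCoreProperties interface).
UNIFIED_KEYS = ("title", "creator", "created")

# Parser-specific key -> unified key. The source keys are examples only.
CROSSWALK = {
    "dc:title": "title",
    "Subject": "title",
    "dc:creator": "creator",
    "Author": "creator",
    "dcterms:created": "created",
    "Creation-Date": "created",
}


def unify(raw_metadata):
    """Map parser-specific metadata onto the unified schema.

    Missing properties are set to None so every record has the same keys;
    if several source keys map to one property, the first one seen wins.
    Keys without a crosswalk entry are dropped.
    """
    unified = dict.fromkeys(UNIFIED_KEYS)
    for key, value in raw_metadata.items():
        target = CROSSWALK.get(key)
        if target is not None and unified[target] is None:
            unified[target] = value
    return unified
```
]]></preformat>
        <p>With every record exposing the same keys, a query on, say, the creator property behaves the same way regardless of which parser produced the metadata.</p>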
          <p>
            <fn id="fn8"><label>8</label><p>https://code.google.com/p/java-libpst/</p></fn>
          </p>
          <p>[Figure 3: functional units of the acquisition unit. An Input File goes to the Tika Detector; if the type is detectable (YES) the file goes to the Tika Parser, otherwise (NO) to the ad-hoc octet-stream parsers. On a Tika exception ("If any error occurs, try to apply an ad-hoc parser") the file is passed to the ad-hoc parsers. Output: Text and Metadata.]</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4.2 Processing and Preservation tasks</title>
      <p>The second and the third tasks are, respectively, the
processing and the preservation of data. The elaboration and
storage units which perform these tasks are tightly coupled.
All processed data must be stored in order to preserve the
elaboration results in a persistent way. The two units follow
a simple strategy, the Write-Once-Read-Many pattern,
where the visualization unit plays the reader role.</p>
      <sec id="sec-9-1">
        <title>4.2.1 Elaboration unit</title>
        <p>The elaboration unit consists of the semantic engine
Cogito<sup>9</sup>. Cogito analyzes text documents and is able to find
hidden relationships, trends and events, transforming
unstructured information into structured data. Among its several
analyses, it identifies three different types of entities
(people, places and companies/organizations), categorizes
documents on the basis of several taxonomies and extracts entity
co-occurrences. Note that this unit is outside the
framework, although we included it in the architectural schema.
Indeed, we do not deal with the design and
development of the elaboration unit; we consider it as given. This unit is the
entity with which the framework interacts and to which the
framework provides functionalities, i.e., text extraction and
visualization.</p>
      </sec>
      <sec id="sec-9-2">
        <title>4.2.2 Storage unit</title>
        <p>
          As storage unit we adopted BaseX<sup>10</sup>, an XML database.
BaseX is an open source solution released under the terms
of the BSD License. We decided to use an XML database
because the results of the elaboration unit are returned in
XML format. Moreover, the use of an XML database helps
to reduce the time needed to manipulate and process XML documents,
compared to a middleware application [
          <xref ref-type="bibr" rid="ref10 ref15">10, 15</xref>
          ].
An XML database also has the advantage of not
constraining data to a rigid schema: in the same database we
can add XML documents with different structures. Thus,
the structure of the elaboration results can change without
affecting the database structure itself.
        </p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>4.3 Presentation task</title>
      <p>
        For the development of the visualization unit we used
D3.js<sup>11</sup> [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], a JavaScript library. The library provides several
graphical primitives to implement visualizations and uses
only web standards, namely HTML, SVG and CSS. With
D3 it is possible to realize multi-stage animations and
interactive visualizations of complex structures.
      </p>
      <sec id="sec-10-1">
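        <p>
          <fn id="fn9"><label>9</label><p>http://www.expertsystem.net</p></fn>
          <fn id="fn10"><label>10</label><p>http://basex.org</p></fn>
          <fn id="fn11"><label>11</label><p>http://d3js.org</p></fn>
        </p>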
        <p>To improve data retrieval, we implemented several visualization
alternatives that exploit Cogito's analysis results. Figure 4
shows a treemap visualization that displays a categorization of the
documents; note that the same document may fall into
different categories. Not all categories are displayed, only eight
among the most common ones. The categories reported are
selected on the basis of the number of documents contained
in the category itself. The treemap visualization is quite
effective in providing a global view of the data set. Our
implementation also enables category zooming to restrict the
set of interest, i.e., after clicking on a document the visualization
displays only the documents in the same category.
Moreover, the user is able to retrieve information such as
the document's name, part of the document content and the
document's acquisition date directly from the visualization
interface.</p>
        <p>Figure 5 shows a geographic visualization that
displays a geo-categorization of documents. The countries
appearing in the documents are rendered in a different
color (green) to distinguish them from the
others. The user can select each green country to get
information that is reported inside a tooltip, as shown in the
figure. For each country, general information is reported,
such as the capital's name, spoken languages, population
figures, etc. Such information does not come from the Cogito
analysis, but is added to enrich and enhance the retrieval
process carried out by users. The tooltip also reports the list
of documents in which the country appears and the features
detected by Cogito. Features are identified according to a
specific taxonomy, and for each country all the
features detected inside the documents related to that
country are reported. Moreover, this visualization displays geographic
locations belonging to the country, possibly identified during the
analysis, e.g., rivers, cities, mountains, etc.</p>
        <p>Figure 6 shows
the visualization of entity co-occurrences (only a section of
the matrix is reported in the figure). Three types of entities are
identified by Cogito: places, people, and organizations.
All entities are listed on both rows and columns; when two
entities appear inside the same document, the square at the
intersection is highlighted. The color of the squares is
always the same, but the opacity of each square is computed
on the basis of the number of co-occurrences. Thus, the
higher the number of co-occurrences, the darker the square
at the intersection. Furthermore, a tooltip for each
highlighted square reports the type of the two entities,
information about the co-occurrence and the list of documents in
which they appear. Specifically, the tooltip reports the verb
or noun connecting the entities and some information about
the verb or noun used.</p>
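        <p>The computation behind the co-occurrence matrix can be sketched as follows. The Python code counts, for each pair of entities, the documents in which both appear, and maps the count to an opacity value. The text above only states that opacity grows with the number of co-occurrences; the linear scaling and the minimum opacity used here are assumptions of this sketch, as are the function names and the sample data.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from itertools import combinations


def cooccurrences(documents):
    """Count, for each unordered entity pair, the documents containing both."""
    counts = Counter()
    for entities in documents.values():
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


def opacity(count, max_count, floor=0.2):
    """Map a co-occurrence count to an opacity in [floor, 1.0].

    Linear scaling is an assumption of this sketch; the floor keeps
    pairs with a single co-occurrence visible. Empty cells get 0.0.
    """
    if count == 0 or max_count == 0:
        return 0.0
    return floor + (1.0 - floor) * count / max_count


# Sample data: document name -> entities found in it.
docs = {
    "doc1": ["Rome", "CNR", "Sapienza"],
    "doc2": ["Rome", "CNR"],
    "doc3": ["Rome"],
}
counts = cooccurrences(docs)
peak = max(counts.values())
```
]]></preformat>
        <p>Each pair is stored once, in sorted order, so the matrix can be drawn symmetrically by looking up the same key for both the (row, column) and (column, row) cells.</p>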
        <p>Figure 7 shows a force-directed graph that displays the
relations detected among the entities identified in the
documents. Each entity is represented by a symbol denoting the
entity's type. An edge connects two entities if a relation has
been detected between them; self-loops are possible. Edges
are rendered in different colors based on the relation type.
The legend concerning edges and nodes is reported on top of
the visualization. A tooltip reports some information about
the relations. In particular, for each edge it reports the
sentence connecting the entities, the verb or noun used in
the sentence and the name of the document in which the
sentence appears. For each node, instead, a tooltip reports the
list of documents in which the entity appears. Furthermore,
for each visualization, the user may apply several filters. In
particular, data can be filtered by
acquisition date, geographic location, node type (co-occurrence
matrix and force-directed graph), relation type (force-directed
graph), and category (treemap).</p>
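        <p>Combining several filters can be sketched as a conjunction of optional predicates. The Python example below covers three of the criteria listed above (acquisition date, geographic location, category). It is not the prototype's code: the record fields and the function apply_filters are hypothetical, and the sample records are invented for illustration.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """One displayed item; the field names are assumptions of this sketch."""
    name: str
    acquired: date
    locations: frozenset
    categories: frozenset


def apply_filters(records, since=None, until=None, location=None, category=None):
    """Keep the records that satisfy every filter that is set (logical AND)."""
    kept = []
    for rec in records:
        if since is not None and rec.acquired < since:
            continue
        if until is not None and rec.acquired > until:
            continue
        if location is not None and location not in rec.locations:
            continue
        if category is not None and category not in rec.categories:
            continue
        kept.append(rec)
    return kept


records = [
    Record("a.pdf", date(2013, 5, 1), frozenset({"Italy"}), frozenset({"Politics"})),
    Record("b.html", date(2014, 1, 15), frozenset({"Italy", "France"}), frozenset({"Sport"})),
    Record("c.doc", date(2014, 3, 2), frozenset({"Spain"}), frozenset({"Sport", "Economy"})),
]
recent_italy = apply_filters(records, since=date(2014, 1, 1), location="Italy")
sport = apply_filters(records, category="Sport")
```
]]></preformat>
        <p>A filter left unset places no constraint, so calling the function with no filters returns all records unchanged.</p>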
      </sec>
    </sec>
    <sec id="sec-11">
      <title>5. CONCLUSIONS</title>
      <p>The interest in data visualization techniques is increasing;
indeed, these techniques are proving to be a useful tool in
the processes of data analysis and understanding. In this
paper we have discussed a general framework for data
extraction and visualization, whose aim is to provide a
methodology to conveniently extract data and facilitate the creation
of effective visualizations. In particular, we described the
framework's architecture, illustrating its components and their
functionalities, and a prototype. The prototype represents
an example of how our framework can be applied when
dealing with real information retrieval systems. Moreover, the
online application demo provides several visualization
examples that can be reused in different contexts and application
domains.</p>
      <p>Currently we are experimenting with our prototype for digital
forensics and investigation purposes, aiming at providing
law enforcement agencies with a tool for correlating and
visualizing off-line forensic data that can be used by an
investigator even if she/he does not have advanced skills in computer
forensics. As a future activity we plan to release a full
version of our prototype. At the moment the elaboration
engine is a proprietary solution that we cannot make publicly
available, hence we aim at replacing this unit with an open
solution. Finally, we want to enhance our framework in
order to facilitate the integration of data extraction and data
visualization endpoints with arbitrary retrieval systems.</p>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgements</title>
      <p>We would like to express our appreciation to Expert Systems
for support in using Cogito. Moreover, financial support
from EU projects HOME/2012/ISEC/AG/INT/4000003856
and HOME/2012/ISEC/AG/4000004362 is kindly
acknowledged.</p>
    </sec>
    <sec id="sec-13">
      <title>REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bostock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ogievetsky</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Heer</surname>
          </string-name>
          .
          <article-title>D3 Data-Driven Documents</article-title>
          .
          <source>IEEE TVCG</source>
          ,
          <volume>17</volume>
          (
          <issue>12</issue>
          ):
          <fpage>2301</fpage>
          –
          <lpage>2309</lpage>
          ,
          <month>Dec</month>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Carpineto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Osinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Romano</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiss</surname>
          </string-name>
          .
          <article-title>A survey of web clustering engines</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>41</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          –
          <lpage>17</lpage>
          ,
          <month>Jul</month>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cattell</surname>
          </string-name>
          .
          <article-title>Scalable SQL and NoSQL Data Stores</article-title>
          .
          <source>SIGMOD Rec.</source>
          ,
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>12</fpage>
          –
          <lpage>27</lpage>
          ,
          <month>May</month>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Mattmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zitting</surname>
          </string-name>
          .
          <source>Tika in Action</source>
          .
          <publisher-name>Manning Publications Co.</publisher-name>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. S.</given-names>
            <surname>Cleveland</surname>
          </string-name>
          .
          <source>Visualizing Data</source>
          .
          <publisher-name>Hobart Press</publisher-name>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Demsar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Curk</surname>
          </string-name>
          ,
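          <string-name>
            <given-names>A.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gorup</surname>
          </string-name>
          ,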
          <string-name>
            <given-names>T.</given-names>
            <surname>Hocevar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Milutinovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mozina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polajnar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Toplak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Staric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stajdohar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Umek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zbontar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zitnik</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Zupan</surname>
          </string-name>
          .
          <article-title>Orange: Data mining toolbox in Python</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>14</volume>
          (
          <issue>1</issue>
          ):
          <fpage>2349</fpage>
          –
          <lpage>2353</lpage>
          ,
          <month>Jan</month>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Fry</surname>
          </string-name>
          .
          <source>Visualizing Data: Exploring and Explaining Data with the Processing Environment</source>
          .
          <publisher-name>O'Reilly Media, Inc.</publisher-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Ghidini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Gupta</surname>
          </string-name>
          .
          <article-title>FuseViz: A Framework for Web-based Data Fusion and Visualization in Smart Environments</article-title>
          .
          <source>In Proc. of IEEE MASS '12</source>
          , pages
          <fpage>468</fpage>
          –
          <lpage>472</lpage>
          ,
          <month>Oct</month>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Holmes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pfahringer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Reutemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I. H.</given-names>
            <surname>Witten</surname>
          </string-name>
          .
          <article-title>The WEKA data mining software: An update</article-title>
          .
          <source>SIGKDD Explor. Newsl.</source>
          ,
          <volume>11</volume>
          (
          <issue>1</issue>
          ):
          <fpage>10</fpage>
          –
          <lpage>18</lpage>
          ,
          <month>Nov</month>
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jokic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Krco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vuckovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Gligoric</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Drajic</surname>
          </string-name>
          .
          <article-title>Evaluation of an XML database based Resource Directory performance</article-title>
          .
          <source>In Proc. of TELFOR '11</source>
          , pages
          <fpage>542</fpage>
          –
          <lpage>545</lpage>
          ,
          <month>Nov</month>
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kules</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Capra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Banta</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Sierra</surname>
          </string-name>
          .
          <article-title>What do exploratory searchers look at in a faceted search interface?</article-title>
          <source>In Proc. of JCDL '09</source>
          , pages
          <fpage>313</fpage>
          –
          <lpage>322</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>B.</given-names>
            <surname>Kules</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          .
          <article-title>Users can change their web search tactics: Design guidelines for categorized overviews</article-title>
          .
          <source>Information Processing &amp; Management</source>
          ,
          <volume>44</volume>
          (
          <issue>2</issue>
          ):
          <fpage>463</fpage>
          –
          <lpage>484</lpage>
          ,
          <month>Mar</month>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K. K.-Y.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Tang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.-S.</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <article-title>Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage</article-title>
          .
          <source>Computer Methods and Programs in Biomedicine</source>
          ,
          <volume>110</volume>
          (
          <issue>1</issue>
          ):
          <fpage>99</fpage>
          –
          <lpage>109</lpage>
          ,
          <month>Apr</month>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>McQuillan</surname>
          </string-name>
          .
          <article-title>Honesty and foresight in computer visualizations</article-title>
          .
          <source>Journal of Forestry</source>
          ,
          <volume>96</volume>
          (
          <issue>6</issue>
          ):
          <fpage>15</fpage>
          –
          <lpage>16</lpage>
          ,
          <month>Jun</month>
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Paradies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Malaika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nicola</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Xie</surname>
          </string-name>
          .
          <article-title>Comparing XML processing performance in middleware and database: A case study</article-title>
          .
          <source>In Proc. of Middleware Conference Industrial Track '10</source>
          , pages
          <fpage>35</fpage>
          –
          <lpage>39</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Denisov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cenizal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Estrin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          .
          <article-title>Biketastic: Sensing and Mapping for Better Biking</article-title>
          .
          <source>In Proc. of SIGCHI '10</source>
          , pages
          <fpage>1817</fpage>
          –
          <lpage>1820</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Zamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Madani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Karp</surname>
          </string-name>
          .
          <article-title>Fast and intuitive clustering of web documents</article-title>
          .
          <source>In Proc. of KDD '97</source>
          , pages
          <fpage>287</fpage>
          –
          <lpage>290</lpage>
          ,
          <year>1997</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.-J.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.-C.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          .
          <article-title>Learning to cluster web search results</article-title>
          .
          <source>In Proc. of SIGIR '04</source>
          , pages
          <fpage>210</fpage>
          –
          <lpage>217</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>