A Data Extraction and Visualization Framework for Information Retrieval Systems

Alessandro Celestini (Institute for Applied Computing, National Research Council of Italy, a.celestini@iac.cnr.it)
Antonio Di Marco (Institute for Applied Computing, National Research Council of Italy, a.dimarco@iac.cnr.it)
Giuseppe Totaro (Department of Computer Science, University of Rome "Sapienza", totaro@di.uniroma1.it)

ABSTRACT
In recent years we have witnessed a continuous growth in the amount of data that both public and private organizations collect and profit by. Search engines are the most common tools used to retrieve information and, more recently, clustering techniques have proved effective in helping users to skim query results. However, the majority of systems proposed to manage information provide textual interfaces to explore search results that are not specifically designed to offer an interactive experience to users. To address this problem, we focus on how to conveniently extract data from sources of interest, and how to enhance their analysis and consultation through visualization techniques. In this work we present a customizable framework able to acquire, search, and interactively visualize data. The framework is built upon a modular architectural schema, and its effectiveness is illustrated by a prototype implemented for a specific application domain.

Keywords
Data Visualization, Data Extraction, Acquisition.

1. INTRODUCTION
The amount of data collected by private and public organizations is steadily growing, and search engines are the most common tools used to quickly browse it. Many works, in different research areas, face the problem of how to manipulate such data and transform it into valuable information by making it navigable and easily searchable. Clustering techniques have been shown to be quite effective for that purpose and have been thoroughly investigated in past years [17, 18, 2]. However, the majority of currently available solutions (e.g., Carrot2, http://project.carrot2.org; Yippy, http://www.yippy.com/) just supply textual interfaces to explore search results.
In recent years, several works have studied how users interact with interfaces during exploratory search sessions, reporting useful results about their behavior [12, 11]. These works show that users spend the majority of their time looking at the results and at the facets, and only a negligible amount of time looking at the query itself [11], underlining the importance of user interface development. According to those works, it is clear that textual interfaces are not very effective at improving exploratory search, so a different solution has to be applied.
Data visualization techniques seem well suited to pursue such goals. Indeed, visualization offers an easy-to-use, efficient, and effective method for presenting data to a large and diverse audience, including users without any programming background. The main goal of such techniques is to present data in a fashion that supports intuitive interaction to spot patterns and trends, thus making the data usable and informative. In this work we focus on data extraction and data visualization for information retrieval systems, i.e., how to extract data from the sources of interest in a convenient way, and how to enhance their analysis and consultation through visualization techniques. To meet these goals we propose a general framework, presenting its architectural schema composed of four logic units: acquisition, elaboration, storage, and visualization. We also present a prototype developed for a case study. The prototype has been implemented for a specific application domain and is available online.
The rest of the paper is organized as follows. Section 2 discusses some frameworks and platforms related to our study. Section 3 presents the framework's architectural schema. Section 4 describes a prototype through a case study and, finally, Section 5 concludes the paper, suggesting directions for future work.
2. RELATED WORK
In this section we discuss some works proposing frameworks and platforms for data visualization.
WEKA [9] is a Java library that provides a collection of state-of-the-art machine learning algorithms and data processing tools for data mining tasks. It comes with several graphical user interfaces, but can also be extended through a simple API. The WEKA workbench includes a set of visualization tools and algorithms for classification, regression, attribute selection, and clustering, useful for discovering and understanding data.
Orange [6] is a collection of C++ routines providing a set of data mining and machine learning procedures that can be easily combined in order to develop new algorithms. The framework makes it possible to perform different tasks, including data input and manipulation, development of classification models, and visualization of processed data. Orange also provides a scriptable environment, based on Python, and a visual programming environment, based on a set of graphical widgets.
While WEKA and Orange contain several tools for data mining tasks, our aim is to improve information retrieval systems and users' understanding of data through visualization techniques. Basic statistical analyses should be exposed through interactive charts, so that users can perform them directly.
In [8] the authors present FuseViz, a framework for Web-based fusion and visualization of data. The framework provides two basic features: fusion and visualization. FuseViz collects data from multiple sources and fuses them into a single data stream; the joint data streams are then visualized through charts and maps in a Web page. FuseViz has been designed to operate in a smart environment, where several deployed probes sense the environment in real time, and the data to visualize are live time series.
The Biketastic platform [16] is an application developed to facilitate knowledge exchange among bikers. The platform enables users to share routes and experiences. For each route, Biketastic captures location, sensed data, and media, recorded while participants ride. Route data are then managed by a backend platform that makes visualizing and sharing route information easy and convenient.
FuseViz and Biketastic share the peculiarity of being explicitly designed to cope with a specific task in a particular environment. The proposed schemas could be re-implemented in different applications, but there is no clear extension and adaptation procedure defined (and possibly supported) by the authors. Our aim is to present a framework that: a) can be easily integrated with an existing information retrieval system; b) provides a set of tools to profitably extract data from heterogeneous sources; c) requires minimum effort to produce new interactive visualizations.

3. FRAMEWORK OVERVIEW
Our framework adheres to a simple and well-known schema (shown in Figure 1), structured in four logic units:

1. Acquisition: obtains data from the sources;
2. Elaboration: processes the acquired data to fit operational needs;
3. Storage: stores the processed data in a persistent way and makes them available to the users;
4. Visualization: provides a visual representation of the data.

Figure 1: Architectural Schema

The framework is mainly focused on the acquisition and visualization stages, whereas the other two are part of the architecture but are not implemented by us. From an engineering perspective, both middle stages (elaboration and storage) are considered black-box components: only their input and output specifications must be available. All logic units play a crucial role in visualizing data, so we describe each of them according to the purposes of our framework.
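To make these contracts concrete, the four units can be sketched as plain Java interfaces. This is an illustrative sketch only: the framework does not prescribe an API, and every name below (ParsedDocument included) is our own assumption.

```java
import java.io.InputStream;
import java.util.List;
import java.util.Map;

/** A parsed document: extracted text plus its metadata (our name, not the framework's). */
record ParsedDocument(String id, String text, Map<String, String> metadata) {}

/** Collects and parses raw sources into well-formed contents. */
interface Acquisition {
    List<ParsedDocument> acquire(Iterable<InputStream> sources);
}

/** Black box, e.g. a semantic engine (enriched XML) or a search engine (an index). */
interface Elaboration {
    String process(ParsedDocument document);
}

/** Black box that persists elaboration results for the visualization unit. */
interface Storage {
    void store(String docId, String result);
    String retrieve(String query);
}

/** Renders stored results, e.g. as interactive web-based charts. */
interface Visualization {
    void render(String storedResults);
}
```

Treating elaboration and storage as interfaces is what makes them swappable black boxes: any engine or database satisfying the contract can be plugged in.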
3.1 Acquisition
This component is in charge of collecting and preprocessing data. Given a collection of documents, possibly in different formats, the acquisition stage prepares data and organizes them to feed the elaboration unit.
Data acquisition can be considered the first (mandatory) phase of any data processing activity that precedes data visualization. Cleveland [5] and Fry [7] examine in depth the logical structure of visualizing data by identifying seven stages: acquire, parse, filter, mine, represent, refine, and interact. Each stage in turn requires applying techniques and methods from different fields of computer science. The seven stages are important in order to reconcile all scientific fields involved in data visualization, especially from the logical point of view. However, with regard to our prototype, we refer to data acquisition as a software component able to collect, parse, and extract data in an efficient and secure way. The output of data acquisition is a selection of well-formed contents that are intelligible to the elaboration unit.
We collect data by connecting the acquisition unit to a data source (e.g., files from a disk or data over a network). We assume to work with static data: static/persistent data are not modified during data acquisition, while dynamic data refer to information that is asynchronously updated. The approach to data collection depends on the goals and desired results. For instance, forensic data collection requires the application of scientifically sound and proven methods (see http://dfrws.org/2001/dfrws-rm-final.pdf) to produce a bit-stream copy of the data, that is, an exact bit-by-bit copy of the original media certified by a message digest and/or a secure hash algorithm. Thus, data collection in many circumstances has to address specific issues concerning the prevention, detection, and correction of errors.
The acquired data must then be parsed according to their digital structure in order to extract the data of interest and prepare them for the elaboration unit. Parsing is potentially a time-consuming process, especially when working with heterogeneous data formats. The parsing stage is also necessary to extract the metadata related to the examined data. Both textual contents and metadata are usually extracted and stored in specific data interchange formats like JSON or XML.
Moreover, security and efficiency aspects have to be considered during the design of a data acquisition unit. However, it is beyond the scope of the present work to discuss security- and efficiency-related issues, regardless of their important implications for data acquisition.
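As an aside on the forensic scenario above, certifying a bit-stream copy boils down to hashing the stream. A minimal JDK-only sketch (the choice of SHA-256 and the class name are ours) could be:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public final class Fingerprint {
    /** Streams a file through SHA-256 and returns the hex digest. */
    public static String sha256(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = new DigestInputStream(Files.newInputStream(file), md)) {
            in.transferTo(OutputStream.nullOutputStream()); // read fully, digesting as we go
        }
        return HexFormat.of().formatHex(md.digest());
    }
}
```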
Figure 2: Data enrichment over time (across parsing, processing, preservation, and presentation, the original data are progressively augmented with metadata and results)

3.2 Elaboration and Storage
The elaboration unit takes as input the data extracted during the acquisition phase, and it has to analyze them and extrapolate information from them. Data analysis, for instance, may be performed by a semantic engine or by a traditional search engine. In the former case we obtain as output the document collection enriched with semantic information; in the latter case the output is an index. Moreover, along with the analysis results, the elaboration unit may return an analysis of the metadata related to the documents received as input.
The main task of the storage unit is to store the analysis results produced by the elaboration unit and make them available to the visualization unit. At this stage the main issue is to optimize data access, specifically the querying time, in order to reduce the time spent by the visualization unit retrieving the information to display. Several storage solutions can be implemented; in particular, one may choose among different types of databases [3, 13]. The traditional choice could be a relational database, but there are several alternatives, e.g., XML databases or graph databases.
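Because the middle stages are black boxes, the storage contract can be satisfied by anything from a relational database to a throwaway test double. As a sketch, here is an in-memory stand-in for the hypothetical Storage interface outlined earlier (again, all names are our own):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for the storage unit; purely a test double. */
class InMemoryStorage implements Storage {
    private final Map<String, String> results = new ConcurrentHashMap<>();

    @Override
    public void store(String docId, String result) {
        // Write-once: elaboration results are never updated in place.
        results.putIfAbsent(docId, result);
    }

    @Override
    public String retrieve(String query) {
        // Toy semantics: treat the query as a document id.
        return results.get(query);
    }
}
```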
3.3 Visualization
The visualization unit is in charge of making data available and valuable to the user. As a matter of fact, visualization is fundamental to transform analysis results into valuable information for the user and to help her/him explore the data. In particular, the visualization of the results may help the user extract new information from the data and decide future queries. As previously discussed, the time spent by the user looking at the query itself is negligible, whereas the time spent looking at the results, and at how they are displayed, is long-lasting. Thus, the interface design is crucial for the effectiveness of this unit, and the guidelines outlined in [12] may become a useful guide for its design and implementation. Given the tight interaction with the user, it is quite important to take into account the response time and the usability of the interface. The visualizations provided should be interactive, to enable the user to perform analysis operations on the data. The same data should be displayed in several layouts to highlight their different aspects. Finally, it is quite important to provide multiple filters for each visualization, in order to offer the user the chance of a dynamic interaction with the results.

3.3.1 The "Wow-Effect"
A really effective data visualization technique has to be developed keeping in mind two fundamental guidelines: abstraction and correlation. However, scientists often focus on the creation of trendy, but not always useful, visualizations meant to arouse astonishment in the users who observe them, causing what McQuillan [14] defines as the Wow-Effect. Unfortunately, the Wow-Effect vanishes quickly and results in stunning visualizations that are worthless for the audience. This effect is also related to the intrinsic complexity of the data generated from the acquisition to the visualization stage. As shown in Figure 2, the impact of the original data on the total amount of information decreases over time. Thus, we invested effort in developing a framework able to overcome the "negative" wow effect by providing visualizations that are easy to use and effective.

4. CASE STUDY: 4P'S PIPELINE
In this section we present an application of the framework developed for a case study. According to the main task accomplished by each framework unit, we named the whole procedure the 4P's pipeline: parsing, processing, preservation, and presentation. The prototype is a browser-based application available online at http://kelvin.iac.rm.cnr.it/interface/. The data set used for testing the 4P's pipeline is a collection of documents in different file formats (e.g., PDF, HTML, MS Office types, etc.). The data set was obtained by collecting documents from several sources, mainly related to news in the English language.

4.1 Parsing task
The acquisition unit is designed to effectively address the issues discussed in Section 3.1. Parsing is the core task of our acquisition unit, and for its implementation we exploited the Apache Tika framework (http://tika.apache.org/). Apache Tika is a Java library that carries out detection of the document type and the extraction of both metadata and structured textual content. It uses existing parser libraries and supports most data formats.

4.1.1 Tika parsing
Tika is currently the de facto "babel fish", performing automatic text extraction and content analysis for more than 1200 data formats; furthermore, several projects aim at extending Tika to handle other formats. Document type detection is based on a taxonomy provided by the IANA media types registry (http://tools.ietf.org/html/rfc6838), which contains hundreds of officially registered types. There are also many unofficial media types that require attention, so Tika has its own media type registry that contains both officially registered types and other widely used, albeit unofficial, types. This registry maintains the information associated with each supported type. Tika implements six methods for type detection [4], respectively based on the following criteria: filename patterns, Content-Type hints, magic byte prefixes, character encodings, structure/schema detection, and combined approaches.
The Parser interface is the key concept of Apache Tika. It provides a high level of abstraction, hiding the complexity of different file formats and parsing libraries. Moreover, it represents an extension point for adding new parser Java classes to Apache Tika, which must implement the Parser interface. The selection of the parser implementation to be used for a given document may be either explicit or automatic (based on detection heuristics). Each Tika parser performs text extraction (only for text-oriented types) and metadata extraction from digital documents. Parsed metadata are written to the Metadata object after the parse() method returns.

Figure 3: Acquisition unit (the input file goes to the Tika detector; detectable files are handled by a Tika parser returning text and metadata, with a retry through an ad-hoc parser if a Tika exception occurs; undetectable files are marked as octet-stream and routed to the ad-hoc parsers)

4.1.2 Acquisition unit in detail
Our acquisition unit uses Tika to automatically perform type detection and parsing of the files collected from the data sources, using all available detectors and parser implementations.
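To give an idea of the Tika calls involved, automatic detection plus extraction reduces to a few lines. The wiring below is our minimal sketch, not the prototype's actual code:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class TikaExtraction {
    public static void main(String[] args) throws Exception {
        Path file = Path.of(args[0]);
        AutoDetectParser parser = new AutoDetectParser();         // type detection + parsing
        BodyContentHandler handler = new BodyContentHandler(-1);  // -1: no write limit
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, metadata, new ParseContext());
        }
        for (String name : metadata.names()) {  // metadata filled in by the parser
            System.out.println(name + " = " + metadata.get(name));
        }
        System.out.println(handler);            // the extracted textual content
    }
}
```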
Although Tika is, to the best of our knowledge, the most complete and effective way to extract text and metadata from documents, there are some situations in which it cannot accomplish its job, for example when Tika fails to detect the document format or when, even though it correctly recognizes the file type, an exception occurs during parsing. The acquisition unit handles both situations by using alternative parsers designed to work with specific types of data (see Figure 3); a sketch of this control flow follows the list below.

• Whenever Tika is not able to detect a file, because either it is not a supported file type or the document is not correctly detectable (for example, it has a malformed/misleading Content-Type attribute), the examined file is marked as application/octet-stream, i.e., a type used to indicate that a body contains arbitrary binary data. The acquisition unit then processes documents whose exact type is undetectable by using a customized set of ad-hoc parsers, each one specialized in handling specific types. For instance, Tika does not currently support Outlook PST files, so they are marked as octet-stream subtypes. The acquisition unit analyzes the undetected file using criteria such as filename extension patterns or more sophisticated heuristics, and finally sends the binary data to an ad-hoc parser based on the java-libpst library (https://code.google.com/p/java-libpst/).

• During parsing, even though a document is correctly detected by Tika, errors/exceptions can occur, interrupting the extraction process for the target file. In this case, the acquisition unit retries the parsing of the file that caused the Tika exception using, if available, a suitable parser selected from a list of ad-hoc parsers.
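The two fallback paths can be sketched as follows. AdHocParser, the registry, and the subtype heuristic are hypothetical names of ours (the paper names only java-libpst), but the branching mirrors Figure 3:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

/** Sketch of the acquisition unit's fallback logic; all names are ours. */
class FallbackAcquisition {
    interface AdHocParser { String parse(InputStream in) throws Exception; }

    // e.g. maps an octet-stream subtype to a java-libpst-based parser
    private final Map<String, AdHocParser> adHoc;
    private final Tika tika = new Tika();

    FallbackAcquisition(Map<String, AdHocParser> adHoc) { this.adHoc = adHoc; }

    String extractText(Path file) throws Exception {
        String type = tika.detect(file.toFile());
        if ("application/octet-stream".equals(type)) {
            return parseAdHoc(file, guessSubtype(file));   // case 1: undetectable
        }
        try (InputStream in = Files.newInputStream(file)) {
            BodyContentHandler handler = new BodyContentHandler(-1);
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
            return handler.toString();
        } catch (TikaException e) {
            return parseAdHoc(file, type);                 // case 2: error during parsing
        }
    }

    private String parseAdHoc(Path file, String key) throws Exception {
        AdHocParser p = adHoc.get(key);
        if (p == null) throw new TikaException("no ad-hoc parser for " + key);
        try (InputStream in = Files.newInputStream(file)) {  // fresh stream for the retry
            return p.parse(in);
        }
    }

    private String guessSubtype(Path file) {
        String name = file.getFileName().toString().toLowerCase();
        return name.endsWith(".pst") ? "outlook-pst" : "unknown"; // extension heuristic
    }
}
```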
The acquisition unit extracts metadata from documents according to a unified schema based on the basic metadata properties contained in the TikaCoreProperties interface, which all (Tika and ad-hoc) parsers attempt to extract. A unified schema is necessary in order to have a uniform experience when searching against metadata properties. A more complete and complex way to address "metadata interoperability" consists in applying schema matching techniques in order to provide suitable metadata crosswalks.

4.2 Processing and Preservation tasks
The second and third tasks are, respectively, the processing and the preservation of data. The elaboration and storage units that perform these tasks are tightly coupled: all processed data must be stored in order to preserve the elaboration results in a persistent way. The two units cooperate through a simple Write-Once-Read-Many strategy, where the visualization unit plays the reader role.

4.2.1 Elaboration unit
The elaboration unit consists of the semantic engine Cogito (http://www.expertsystem.net). Cogito analyzes text documents and is able to find hidden relationships, trends, and events, transforming unstructured information into structured data. Among its several analyses, it identifies three different types of entities (people, places, and companies/organizations), categorizes documents on the basis of several taxonomies, and extracts entity co-occurrences. Notice that this unit is outside the framework, although we included it in the architectural schema. Indeed, we did not design or develop the elaboration unit; we consider it as given. This unit is the entity with which the framework interacts and to which the framework provides functionalities, i.e., text extraction and visualization.

4.2.2 Storage unit
As the storage unit we resorted to BaseX (http://basex.org), an XML database. BaseX is an open source solution released under the terms of the BSD License. We decided to use an XML database because the results of the elaboration unit are returned in XML format. Moreover, the use of an XML database helps to reduce the time for XML document manipulation and processing, compared to a middleware application [10, 15]. An XML database also has the advantage of not constraining data to a rigid schema: in the same database we can add XML documents with different structures. Thus, the structure of the elaboration results can change without affecting the database structure itself.
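As a sketch of how elaboration results could be pushed into BaseX and queried back for the visualization unit, the snippet below uses the BaseXClient example class distributed with BaseX. The database name, credentials, sample document, and XQuery are illustrative only, and the client API is quoted from memory, so check it against the BaseX version in use:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class StoreAndQuery {
    public static void main(String[] args) throws Exception {
        // BaseXClient is the single-file example client shipped with BaseX.
        BaseXClient session = new BaseXClient("localhost", 1984, "admin", "admin");
        try {
            session.execute("CREATE DB analysis");  // schema-less: any XML shape fits
            session.add("doc1.xml", new ByteArrayInputStream(
                "<document id=\"doc1\"><entity type=\"place\">Rome</entity></document>"
                    .getBytes(StandardCharsets.UTF_8)));
            // The visualization unit plays the reader role (write once, read many).
            BaseXClient.Query q = session.query(
                "for $e in //entity[@type = 'place'] return string($e)");
            while (q.more()) {
                System.out.println(q.next());
            }
            q.close();
        } finally {
            session.close();
        }
    }
}
```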
4.3 Presentation task
For the development of the visualization unit we used D3.js [1] (http://d3js.org), a JavaScript library. The library provides several graphical primitives for implementing visualizations and uses only web standards, namely HTML, SVG, and CSS. With D3 it is possible to realize multi-stage animations and interactive visualizations of complex structures.
To improve data retrieval, we realized several visualization alternatives that exploit Cogito's analysis results. Figure 4 shows a treemap visualization that displays a document categorization; notice that the same document may fall into different categories. Not all categories are displayed, only eight of the most common ones; the categories reported are selected on the basis of the number of documents they contain.

Figure 4: Treemap with category zooming

The treemap visualization is quite effective in providing a global view of the data set. Our implementation also enables category zooming to restrict the set of interest, i.e., by clicking on a document the visualization displays only the documents in the same category. Moreover, the user is able to retrieve several pieces of information, such as the document's name, part of the document's content, and the document's acquisition date, directly from the visualization interface.

Figure 5: Geographic visualization and country selection

Figure 5 shows a geographic visualization that displays a geo-categorization of the documents. The countries appearing in the documents are rendered in a different color (green), to highlight the difference with respect to the others. The user can select each green country to get several pieces of information, reported inside a tooltip as shown in the figure. For each country, general information is reported, such as the capital's name, spoken languages, population figures, etc. Such information does not come from the Cogito analysis, but is added to enrich and enhance the retrieval process carried out by users. The tooltip also reports the list of documents in which the country appears and the features detected by Cogito. Features are identified according to a specific taxonomy, and for each country all the features detected inside the documents related to that country are reported. Moreover, this visualization displays geographic locations belonging to the country, possibly identified during the analysis, e.g., rivers, cities, mountains, etc.

Figure 6: Co-occurrences matrix

Figure 6 shows the visualization of entity co-occurrences (only a section of the matrix is reported in the figure). Three types of entities are identified by Cogito: places, people, and organizations. All entities are listed both on rows and on columns; when two entities appear inside the same document, the square at their intersection is highlighted. The color of the squares is always the same, but the opacity of each square is computed on the basis of the number of co-occurrences: the higher the number of co-occurrences, the darker the square at the intersection. Furthermore, a tooltip for each highlighted square reports the types of the two entities, information about the co-occurrence, and the list of documents in which the entities appear. Specifically, the tooltip reports the verb or noun connecting the entities and some information about the verb or noun used.

Figure 7: Entity-relations force directed graph

Figure 7 shows a force directed graph that displays the relations detected among the entities identified in the documents. Each entity is represented by a symbol denoting the entity's type. An edge connects two entities if a relation has been detected between them; self-loops are possible. Edges are rendered with different colors based on the relations' types. The legend concerning edges and nodes is reported on top of the visualization. A tooltip reports some information about the relations: for each edge, the sentence connecting the entities, the verb or noun used in the sentence, and the name of the document in which the sentence appears; for each node, the list of documents in which the entity appears. Furthermore, for each visualization the user may apply several filters. In particular, we give the possibility to filter data by acquisition date, geographic location, node type (co-occurrence matrix and force directed graph), relation type (force directed graph), and category (treemap).
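Force layouts in D3 conventionally consume a {nodes, links} JSON object. One way to serialize entities and relations into that shape from the Java side (our sketch, with hypothetical Entity/Relation carriers of Cogito's output, and no JSON escaping) is:

```java
import java.util.List;
import java.util.stream.Collectors;

/** Serializes entities/relations into the {nodes, links} shape used by D3
 *  force layouts. Production code would also JSON-escape the strings. */
class GraphJson {
    record Entity(String id, String type) {}                 // e.g. ("Rome", "place")
    record Relation(int source, int target, String type) {}  // indexes into the node list

    static String toJson(List<Entity> nodes, List<Relation> links) {
        String ns = nodes.stream()
                .map(n -> String.format("{\"id\":\"%s\",\"type\":\"%s\"}", n.id(), n.type()))
                .collect(Collectors.joining(","));
        String ls = links.stream()
                .map(l -> String.format("{\"source\":%d,\"target\":%d,\"type\":\"%s\"}",
                        l.source(), l.target(), l.type()))
                .collect(Collectors.joining(","));
        return "{\"nodes\":[" + ns + "],\"links\":[" + ls + "]}";
    }
}
```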
5. CONCLUSIONS
The interest in data visualization techniques is increasing; indeed, these techniques are proving to be a useful tool in the processes of data analysis and understanding. In this paper we have discussed a general framework for data extraction and visualization, whose aim is to provide a methodology to conveniently extract data and facilitate the creation of effective visualizations. In particular, we described the framework's architecture, illustrating its components and functionalities, and a prototype. The prototype represents an example of how our framework can be applied when dealing with real information retrieval systems. Moreover, the online application demo provides several visualization examples that can be reused in different contexts and application domains.
Currently we are experimenting with our prototype for digital forensics and investigation purposes, aiming at providing law enforcement agencies with a tool for correlating and visualizing off-line forensic data that can be used by an investigator even if she/he does not have advanced skills in computer forensics. As a future activity we plan to release a full version of our prototype. At the moment the elaboration engine is a proprietary solution that we cannot make publicly available, hence we aim at replacing this unit with an open solution. Finally, we want to enhance our framework in order to facilitate the integration of data extraction and data visualization endpoints with arbitrary retrieval systems.

Acknowledgements
We would like to express our appreciation to Expert Systems for support in using Cogito. Moreover, financial support from EU projects HOME/2012/ISEC/AG/INT/4000003856 and HOME/2012/ISEC/AG/4000004362 is kindly acknowledged.

6. REFERENCES
[1] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-Driven Documents. IEEE TVCG, 17(12):2301–2309, Dec. 2011.
[2] C. Carpineto, S. Osiński, G. Romano, and D. Weiss. A survey of web clustering engines. ACM Comput. Surv., 41(3):1–17, Jul. 2009.
[3] R. Cattell. Scalable SQL and NoSQL data stores. SIGMOD Rec., 39(4):12–27, May 2011.
[4] C. Mattmann and J. Zitting. Tika in Action. Manning Publications Co., 2011.
[5] W. S. Cleveland. Visualizing Data. Hobart Press, 1993.
[6] J. Demšar, T. Curk, A. Erjavec, Č. Gorup, T. Hočevar, M. Milutinovič, M. Možina, M. Polajnar, M. Toplak, A. Starič, M. Štajdohar, L. Umek, L. Žagar, J. Žbontar, M. Žitnik, and B. Zupan. Orange: Data mining toolbox in Python. Journal of Machine Learning Research, 14(1):2349–2353, Jan. 2013.
[7] B. Fry. Visualizing Data: Exploring and Explaining Data with the Processing Environment. O'Reilly Media, Inc., 2007.
[8] G. Ghidini, S. Das, and V. Gupta. FuseViz: A framework for Web-based data fusion and visualization in smart environments. In Proc. of IEEE MASS '12, pages 468–472, Oct. 2012.
[9] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: An update. SIGKDD Explor. Newsl., 11(1):10–18, Nov. 2009.
[10] S. Jokić, S. Krco, J. Vuckovic, N. Gligoric, and D. Drajic. Evaluation of an XML database based Resource Directory performance. In Proc. of TELFOR '11, pages 542–545, Nov. 2011.
[11] B. Kules, R. Capra, M. Banta, and T. Sierra. What do exploratory searchers look at in a faceted search interface? In Proc. of JCDL '09, pages 313–322, 2009.
[12] B. Kules and B. Shneiderman. Users can change their web search tactics: Design guidelines for categorized overviews. Information Processing & Management, 44(2):463–484, Mar. 2008.
[13] K. K.-Y. Lee, W.-C. Tang, and K.-S. Choi. Alternatives to relational database: Comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 110(1):99–109, Apr. 2013.
[14] A. G. McQuillan. Honesty and foresight in computer visualizations. Journal of Forestry, 96(6):15–16, Jun. 1998.
[15] M. Paradies, S. Malaika, M. Nicola, and K. Xie. Comparing XML processing performance in middleware and database: A case study. In Proc. of Middleware Conference Industrial Track '10, pages 35–39, 2010.
[16] S. Reddy, K. Shilton, G. Denisov, C. Cenizal, D. Estrin, and M. Srivastava. Biketastic: Sensing and mapping for better biking. In Proc. of SIGCHI '10, pages 1817–1820, 2010.
[17] O. Zamir, O. Etzioni, O. Madani, and R. M. Karp. Fast and intuitive clustering of web documents. In Proc. of KDD '97, pages 287–290, 1997.
[18] H.-J. Zeng, Q.-C. He, Z. Chen, W.-Y. Ma, and J. Ma. Learning to cluster web search results. In Proc. of SIGIR '04, pages 210–217, 2004.