
ACM Computing Surveys

Visualization of Large Datasets using Semantic Web Technologies

Suvodeep Mazumdar

s.mazumdar@sheffield.ac.uk, Department of Information Studies, University of Sheffield, Regent Court - 211 Portobello Street, S1 4DP, Sheffield, UK

2007

39 4

Visualization technologies provide the means to comprehend and explore data. Observing patterns and anomalies with visualization tools helps users understand issues and make informed decisions. Semantic web technologies, which represent different data types according to agreed standards, can be exploited to provide meaningful and intuitive visualizations. In this paper, we outline how we intend to provide intuitive and interactive visualizations for large datasets formalized by multiple ontologies.

Information Visualization, semantic web, dynamic queries

This research looks at highly complex domains such as aerospace engineering. A jet engine’s life cycle can last up to 50 years, requiring regular maintenance, overhauls, tests and services. Each of these activities produces documentation in the form of text reports, numeric data, images, in-flight data, CAD drawings etc. The volume of this information can easily exceed several terabytes, and some structuring is needed for such a large, heterogeneous information set to be usable. Information extraction and semantic web technologies can provide a standardized and structured representation of the multimedia information. An overarching domain ontology is essential to provide an overall view of the entire domain. Such an ontology achieves homogeneity, but at the cost of losing details embedded in the individual documents. Hence, each document type can be formalized by its own representative ontology, which provides more detailed information than the global (overarching) representation. Different ontologies therefore provide different lenses through which to look at the same document type. It thus becomes possible to explore the data at different levels of granularity: a coarse view provided by the domain ontology, and a fine-grained view that makes use of the document ontology. How these two levels can be combined in an effective user interface, and how users can effectively manipulate and explore them, is our main research question.

Related work and motivation

Several tools for data visualization and exploration have been proposed. For example, SemaPlorer [ 4 ] visualizes people, tags, photos etc. on a geographical map; GapMinder1 provides an exploratory tool for visualizing statistical trends in data over time; ManyEyes2 allows users to upload their own data and create visualizations. However, most of these visualization tools do not address the main questions of this research: generality and scalability for effective user interaction. For example, current visualization techniques cannot handle very large volumes of data. Katifori et al. [ 2 ] looked at tools for visualizing ontologies, the available visualization methods and the number of nodes they intend to support (Table 1). They found very few visualization methods capable of handling more than 10,000 nodes. GRIDL3 provides a scalable approach by presenting each axis hierarchically and showing a statistical display (bar chart) for each axis element; GreenMax [ 6 ] provides tree visualization for a million nodes on a representative smaller network of far fewer clusters.

[ 7 ] and [ 8 ] present a faceted searching and visualization interface for heterogeneous data by mapping them to known vocabularies extracted from the web and visualizing the data using pre-defined interface widgets. The presence of interactive multiple visualizations is desirable since it helps in effectively exploring the underlying data. One such example is Exhibit, part of the SIMILE Project4, which allows swapping between different perspectives such as timeline or maps. [ 3 ] proposes a more advanced approach, as the multiple visualizations are available and updated simultaneously. These visualizations, however, do not fuse different document sets formalized by different ontologies, the first goal of the research. Similarly, the work done by the information visualization community has been mainly limited to homogeneous data. To overcome these limitations, the research combines Semantic Web technology, used to aggregate and structure dispersed and heterogeneous data, with findings from the information visualization community, to provide large-scale, intuitive visualizations and manipulation of heterogeneous data. Indeed, despite evidence that highly dynamic interaction tools effectively support users in data exploration, very little has been done in the area of Semantic Data visualization. This research builds upon our previous work [ 3 ] that implements the concept of dynamic queries [ 1 ] to provide highly interactive manipulation of multiple visualizations, namely tables, timeline, geographical and topological plots.

Table 1. Excerpt from Katifori et al. [ 2 ], p. 10:38:

[...] peripheral, which are small but distinguishable, and fringe, which are not individually distinguishable but are useful to display the structure. The 3D Hyperbolic Browser can show up to 50 main nodes, 500 peripheral ones, and thousands of fringe ones.

In the user survey in Ernst and Storey [2003], five ontology size categories are identified:
1. Fewer than 100 nodes,
2. Between 101 and 1,000 nodes,
3. Between 1,001 and 10,000 nodes,
4. Between 10,001 and 100,000 nodes,
5. More than 100,001 nodes.

The number of nodes in this case includes both classes and instances. Most users are anticipated to be working with the second category of ontologies, whereas none is anticipated to be working with the last. In our case, we will use the three categories in Table X, as criterions for the classification of the ontology visualization methods (the two first categories of Ernst and Storey [2003] are merged into a single one, and so are the last two). In Table X each category lists the method that could be effectively used, up to the number of mentioned nodes. The classification is based upon the existing literature as presented in this section. When there was no information regarding which category the method belongs to, an estimation was made comparing it with others of its category.

As it is seen from Table X, only three methods claim to provide support for more than 10,000 nodes. This fact shows that the issue of scalability in the visualization domain is still an important one. Van Ham and Van Wijk [2002] propose three solutions to the problem of visualization of many nodes:
1. Increase available display space, by either using three dimensional and/or hyperbolic spaces.
2. Reduce the number of information elements by clustering or hiding nodes.
3. Use the given visualization space more efficiently by using every available pixel.

Such solutions have been employed by most of the presented visualizations with varying degrees of effectiveness.

On the whole, as Munzner [1997] also states, information density should not be the only metric in ontology visualization: when taken too far, it becomes a clutter. Drawing for example all the links in a highly connected graph yields a picture that can give a high level overview of the global structure but is useless for examining the details. There is always a trade-off between maximum number of nodes displayed [...]

1 GapMinder, http://www.gapminder.org/
2 ManyEyes, http://manyeyes.alphaworks.ibm.com/manyeyes/
3 Graphical Interface for Digital Libraries, www.cs.umd.edu/hcil/west-legal/gridl/
4 The SIMILE Project: http://simile.mit.edu; Exhibit: http://simile.mit.edu/wiki/Exhibit
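The dynamic-query idea [ 1 ] applied to multiple coordinated visualizations, as in [ 3 ], can be illustrated with a minimal sketch. It is not the implementation used in that work: the class `DynamicQuery`, the two view callbacks and the toy records are all hypothetical names and data introduced here for illustration. The point it shows is that every filter change immediately re-evaluates the selection and pushes it to all registered views.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class DynamicQuery:
    """Keeps a set of named filters and notifies every view on each change."""
    records: List[Record]
    views: List[Callable[[List[Record]], None]] = field(default_factory=list)
    filters: Dict[str, Callable[[Record], bool]] = field(default_factory=dict)

    def set_filter(self, name: str, predicate: Callable[[Record], bool]) -> None:
        # Adding or replacing a filter triggers an immediate refresh.
        self.filters[name] = predicate
        self._refresh()

    def clear_filter(self, name: str) -> None:
        self.filters.pop(name, None)
        self._refresh()

    def selected(self) -> List[Record]:
        # A record is selected only if it passes all active filters.
        return [r for r in self.records
                if all(p(r) for p in self.filters.values())]

    def _refresh(self) -> None:
        current = self.selected()
        for view in self.views:
            view(current)


# Two stand-in "visualizations" that simply store what they would draw.
table_rows: List[object] = []
timeline_years: List[object] = []


def table_view(rows: List[Record]) -> None:
    table_rows[:] = [r["id"] for r in rows]


def timeline_view(rows: List[Record]) -> None:
    timeline_years[:] = sorted({r["year"] for r in rows})


# Invented toy records, not real data.
reports = [
    {"id": "r1", "year": 2005, "site": "A"},
    {"id": "r2", "year": 2006, "site": "B"},
    {"id": "r3", "year": 2007, "site": "A"},
    {"id": "r4", "year": 2009, "site": "B"},
]

query = DynamicQuery(reports, views=[table_view, timeline_view])
query.set_filter("year", lambda r: 2006 <= r["year"] <= 2008)
```

In an interactive tool the predicates would be driven by sliders or selection widgets rather than set in code; keeping the filters independent of the views is what lets the table and the timeline stay synchronized.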

Proposed Approach

We aim to engage the user communities at Rolls-Royce actively during the research period, following a process of iterative user-centered design. Since there are different types of users from different areas of the Rolls-Royce aerospace engineering domain (design, manufacturing and service), the target system must be able to generate visualizations that are equally interactive, intuitive and informative for all. We intend to interview users individually to understand their daily jobs and the kinds of visualizations they are used to. We will then present them with use case scenarios supported by low-fidelity mockups and sketches of a system that we believe will benefit them. We will repeat this process iteratively to gain a sound understanding of the users’ requirements and expectations.

We will also study the different ontologies that have been developed for the different sets of data currently in use at Rolls-Royce, together with their inter-dependencies. This will help generate a taxonomy that relates the data types of the concepts to the visualizations and interactions they can support. We will use the results from the participatory design sessions to decide the best interactions for the different visualizations, so that the user can seamlessly explore the data in different hierarchical layers.
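One possible shape for such a taxonomy is a lookup from the data type of an ontology property to the visualizations able to display it. The sketch below is only an assumption about how this could be encoded: the data type names, the mapping entries and the function `supported_visualizations` are hypothetical, and the actual taxonomy is to be derived from the ontologies and the design sessions.

```python
from typing import Dict, List

# Hypothetical taxonomy: property data type -> visualizations that can show it.
VISUALIZATIONS_BY_DATATYPE: Dict[str, List[str]] = {
    "date": ["timeline", "table"],
    "location": ["map", "table"],
    "category": ["pie chart", "bar chart", "table"],
    "number": ["bar chart", "table"],
    "relation": ["topological plot", "table"],
}


def supported_visualizations(properties: Dict[str, str]) -> List[str]:
    """Return the visualizations usable for a concept.

    `properties` maps each property name of the concept to its data type.
    Unknown data types fall back to the plain table view.
    """
    found = set()
    for datatype in properties.values():
        found.update(VISUALIZATIONS_BY_DATATYPE.get(datatype, ["table"]))
    return sorted(found)
```

For instance, a concept with a date property and a location property would be offered a map, a table and a timeline, while a concept with only unrecognized data types would fall back to the table.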

The usability of the semantic data visualization tools is central. Applying filters to millions of documents generates very large retrieved sets with thousands of results, too much information for the user to process. Past proposals to mitigate this problem include increasing the display area by using 3D plots instead of 2D, clustering or hiding nodes, and utilizing every pixel in the visualization space [ 5 ]. Our approach is radically different: it uses classification, clustering and overlapping of data to provide contextual layered visualizations, where each layer contains only the information relevant to that layer. For example, consider a pie chart generated on the basis of the domain ontology and intended to provide a generic overview of the distribution of the data with respect to a specific concept; if the user clicks on a pie chart section that has more detailed information formalized by another supported ontology, further details corresponding to that ontology will be displayed, providing a semantic zoom.
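The pie chart example can be made concrete with a small sketch of the two layers. It assumes, purely for illustration, that each document carries a coarse concept from the domain ontology and, where a document ontology exists, a finer class; the field names, the functions `overview` and `semantic_zoom`, and the toy documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# Invented toy annotations: a coarse domain concept and, when available,
# a finer class taken from a document-specific ontology.
documents: List[Dict[str, Optional[str]]] = [
    {"domain_concept": "Maintenance", "document_class": "OverhaulReport"},
    {"domain_concept": "Maintenance", "document_class": "InspectionReport"},
    {"domain_concept": "Maintenance", "document_class": "OverhaulReport"},
    {"domain_concept": "Testing", "document_class": None},
    {"domain_concept": "Design", "document_class": "CADDrawing"},
]


def overview(docs: List[Dict[str, Optional[str]]]) -> Counter:
    """Coarse layer: document counts per domain concept (the pie slices)."""
    return Counter(d["domain_concept"] for d in docs)


def semantic_zoom(docs: List[Dict[str, Optional[str]]],
                  concept: str) -> Optional[Counter]:
    """Fine layer: counts per document-ontology class inside one slice.

    Returns None when no finer-grained ontology covers the selected slice,
    so the interface can leave that slice non-expandable.
    """
    classes = [d["document_class"] for d in docs
               if d["domain_concept"] == concept
               and d["document_class"] is not None]
    return Counter(classes) if classes else None
```

Clicking the "Maintenance" slice would correspond to calling `semantic_zoom(documents, "Maintenance")`, which exposes only the detail belonging to that layer.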

Methodology

Adopting the user-centered approach discussed above, a core part of the research is understanding how Rolls-Royce engineers conduct their daily work and what tools will be useful during data analysis. The starting point will be observations conducted at Rolls-Royce premises, aimed at identifying current practices of data display and analysis. By collecting examples of artifacts currently in use, we aim to find inspiration for a design that will be naturally usable because it is already familiar. We have already started a series of participatory design sessions with several potential users from different areas of the Rolls-Royce aerospace engineering domain (design, manufacturing and service). In these sessions we are discussing mock-ups of the visualizations and related interactions so as to actively involve the user community in selecting the potentially optimal solution(s). This requirements gathering is paired with the system architecture design, to be completed in the first year of research.

A series of exhaustive tests on the query response time, loading time, efficiency etc. of the various triple stores will be conducted to select the most efficient system architecture. Once a back-end system is determined, we will perform tests on loading query results ‘on-the-fly’. Tests conducted in X-Media show that there is a significant waiting time for the visualizations to be initialized. This is the baseline against which we will work to improve display efficiency, a core issue in user interaction. The software coding phase will run throughout the second year of the research, when we will also prepare evaluation and trial materials based on the use case scenarios developed in year one.
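A response-time comparison of this kind could be organized around a small timing harness such as the one sketched below. This is an assumption about one way to run the tests, not the planned test suite: each triple store is represented by a hypothetical callable that takes a query string and returns when the results have arrived, and the function names `time_query` and `compare_stores` are introduced here for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List

QueryRunner = Callable[[str], object]


def time_query(run_query: QueryRunner, query: str,
               repeats: int = 5) -> Dict[str, float]:
    """Run one query several times and summarise wall-clock time in seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(query)
        samples.append(time.perf_counter() - start)
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


def compare_stores(stores: Dict[str, QueryRunner], queries: List[str],
                   repeats: int = 5) -> Dict[str, Dict[str, float]]:
    """Median response time for every (store, query) pair."""
    return {
        name: {q: time_query(run, q, repeats)["median"] for q in queries}
        for name, run in stores.items()
    }
```

Reporting the median rather than the mean limits the influence of a single slow run, for example the first query issued against a cold cache; loading times could be measured with the same harness by passing a callable that performs the load.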

The evaluation of the solution will be carried out with the Rolls-Royce engineers at their premises during the first two months of the third year. We will follow the methodology we used previously in [ 3 ]: participants will be asked to carry out specific tasks designed in partnership with Rolls-Royce experts; the interaction will be logged and the screen activity recorded; participants will then be asked to fill in a questionnaire and answer a few targeted questions in an interview. Results from this user evaluation will be used to re-design and modify the application where needed, after which we will conduct a long-term user trial. The remainder of the third year will be dedicated to thesis writing and to providing bug fixes and minor enhancements.

Conclusions and Future Work

The work already done in X-Media shows the importance and effectiveness of multiple visualizations in a large, complex organization. The ability of a user to visualize the same data in different dimensions, query them and identify patterns and areas of interest is useful for finding possible solutions. The findings from the X-Media project have been a good stepping stone for the research we intend to conduct over the next few years.

The research, although organized around the case of aerospace engineering, is expected to be generic and applicable to different domains that share similar characteristics and problems. Specifically, we will test our results with data from GrassPortal8, an online resource for accessing data related to grass species, global environmental data, evolutionary relationships among grasses etc., to test the portability of the approach adopted. This will be a good way to measure how successfully the semantic visualization technology can be ported to other domains represented by their respective domain ontologies.

8 GrassPortal, http://www.grassportal.org

Acknowledgments. This research is supported by SAMULET, a Rolls-Royce and DTI funded project for knowledge management in the aerospace manufacturing domain.

1. Ahlberg, C., Williamson, C., Shneiderman, B. Dynamic Queries for Information Exploration: An Implementation and Evaluation. CHI '92, 619-626 (1992)

2. Katifori, A., Halatsis, C., Lepouras, G., Vassilakis, C., Giannopoulou, E. 2007. Ontology visualization methods-a survey. ACM Comput. Surv. 39, 4 (Nov. 2007), 10. DOI= http://doi.acm.org/10.1145/1287620.1287621

3. Petrelli, D., Mazumdar, S., Dadzie, A.-S., Ciravegna, F.: Multi Visualization and Dynamic Query for Effective Exploration of Semantic Data. In Proceedings of the 8th International Semantic Web Conference, pp. 505-520. Springer (2009)

4. Schenk, S., Saathoff, C., Staab, S., Scherp, A. 2009. SemaPlorer-Interactive semantic exploration of data and media based on a federated cloud infrastructure. Web Semant. 7, 4 (Dec. 2009), 298-304. DOI= http://dx.doi.org/10.1016/j.websem.2009.09.006

5. Van Ham, F., Van Wijk, J. J. 2002. Beamtrees: Compact visualization of large hierarchies. In Proceedings of the IEEE Conference on Information Visualization. IEEE CS Press, 93-100

6. Wong, P. C., Foote, H., Mackey, P., Chin Jr., G., Sofia, H., and Thomas, J., A Dynamic Multiscale Magnifying Tool for Exploring Large Sparse Graphs, Information Visualization 7, 2

7. Hildebrand, J. van Ossenbruggen, Hardman, and Jacobs. Supporting subject matter annotation using heterogeneous thesauri: A user study in web data reuse. International Journal of Human-Computer Studies, 67(10):887-902, 10 2009.

8. Hildebrand and J. van Ossenbruggen. Configuring semantic web interfaces by data mapping. In S. Handschuh, Heath, and V. Thai, editors, Visual Interfaces to the Social and the Semantic Web (VISSW 2009), volume 443, February 2009.