Towards Visualization Recommendation – A Semi-Automated Domain-Specific Learning Approach

Pawandeep Kaur, Michael Owonibi, Birgitta Koenig-Ries
Heinz-Nixdorf Chair for Distributed Information Systems
Friedrich-Schiller-Universität, Jena
pawandeep.kaur@uni-jena.de, michael.owonibi@uni-jena.de, birgitta.koenig-ries@uni-jena.de

27th GI-Workshop on Foundations of Databases (Grundlagen von Datenbanken), 26.05.2015 - 29.05.2015, Magdeburg, Germany. Copyright is held by the author/owner(s).

ABSTRACT
Information visualization is important in science as it helps scientists in exploring, analysing, and presenting both the obvious and less obvious features of their datasets. However, scientists are not typically visualization experts. It is therefore difficult and time-consuming for them to choose the optimal visualization to convey the desired message. To provide a solution for this problem of visualization selection, we propose a semi-automated, context-aware visualization recommendation model. In the model, information will be extracted from data and metadata, the latter providing relevant context. This information will be annotated with suitable domain-specific operations (like rank abundance), which will be mapped to the relevant visualizations. We also propose an interactive learning workflow for visualization recommendation that will enrich the model with the knowledge gathered from the interaction with the user. We will use biodiversity research as the application domain to guide the concrete instantiation of our approach and its evaluation.

Categories and Subject Descriptors
D.2.12 [Data mapping]

General Terms
Human Factors, Design

Keywords
Data Visualization, Machine Learning, Biodiversity Informatics, Text Mining, Recommender Systems

1. INTRODUCTION
The human brain can comprehend images much more easily than words or numbers. This makes effective graphics an especially important part of academic literature [19]. Visualization that condenses large amounts of data into effective and understandable graphics is therefore an important component of the presentation and communication of scientific research [14]. Supporting scientists in choosing the appropriate visualization during the research process is very important. We believe that an optimal choice leads to more interpretable graphics which keep readers interested in the publication and help them understand the research work and possibly build on it. Ultimately, this results in increased citation of such publications. In addition, it aids researchers in detecting recurring patterns, formulating hypotheses and discovering new knowledge from those patterns [24].

In this paper, we focus on the issue of visualization selection for data presentation and use biodiversity research as the application domain. In the next section, we first explain the biodiversity research domain and then analyze the challenges and requirements of researchers with respect to visualization selection. Then, we review the literature on existing solutions (Section 3). In Section 4, we present our approach to address the challenges that we have identified.

2. REQUIREMENTS ANALYSIS
Biodiversity research aims to understand the enormous diversity of life on earth and to identify the factors and interactions that generate and maintain this diversity [20]. Biodiversity data is the data accumulated from the research done by biologists and ecologists on different taxa and levels, land use, and ecosystem processes. For proper preservation, reusability, and sharing of such data, metadata is provided along with the data. This metadata contains vital contextual information related to the datasets, such as the purpose of the research work, the data collection method and other important keywords. In order to answer the most relevant questions of biodiversity research, synthesis of data stemming from the integration of datasets from different experiments or observation series is frequently needed. Collaborative projects thus tend to enforce centralized data management. This is true, e.g., for the Biodiversity Exploratories [16], a large-scale, long-term project funded by DFG. The Exploratories use the BExIS platform [15] for central data management. The instance of BExIS used within the Biodiversity Exploratories (BE) serves as one of the primary sources for collecting requirements for this study. The large collection of data available in the BE BExIS is the result of research activities by many disciplines involved in biodiversity science over the last eight years. This data is highly complex, heterogeneous and often not easy to understand. To interpret, analyze, present, and reuse such data, a system is required that can analyze and visualize these datasets effectively.

According to the survey of 57 journals conducted in [21], natural science journals use far more graphs than mathematical or social science journals. The objective of any graphic in the context of scientific publications and presentations is to communicate information effectively [19]. For that, it is important to choose the appropriate visualization with respect to the available data and the message to convey. However, studies [23] have shown that the potential of visualization has not been fully utilized in scientific journals. In [23], Lauren et al. identify two main reasons for this failure: scientists are overwhelmed by the numerous visualization techniques available, and they lack expertise in designing graphs.

In general, a visualization process is considered a 'search' process in which the user first makes a decision about visualization tools and techniques, after which other decisions are made about different controls like layout, structure etc. until a satisfactory visualization is produced [13]. With the growing amount of data and the increasing availability of different visualization techniques, this 'search' space becomes wider [13]. In order to successfully execute this search process, one needs to have clear knowledge about the information contained in the data, the message that should be conveyed and the semantics of different visualizations.

We argue here that understanding this complex process and acting on it appropriately requires some visualization expertise. However, scientists typically do not have the proficiency to manipulate the programs and design successful graphs [22]. Interactive visualization approaches make the visualization creation process more adaptive, but, due to their insufficient knowledge, scientists often have difficulties in mapping the data elements to graphical attributes [12]. An inappropriate mapping can impede analysis and even result in misleading conclusions [1].

Furthermore, matters related to visualization are made even more complex by human perception subjectivity [9], which means that people perceive the same thing differently under different circumstances. For better understanding, readers primarily need to relate the visualization to the realm of their existing knowledge domain [2]. To ensure that the chosen visualization does indeed convey the intended message to the target readers, a model like the one proposed in [6] should be the basis of visualization design.

Figure 2. Nested Model for Visualization Design [6]

This model, as depicted in Figure 2, divides the design process into four levels: 1) characterize the tasks and data in the vocabulary of the problem domain, 2) abstract this information into visual operations and data types, 3) design visual encoding and interaction techniques, and 4) create algorithms to execute these techniques efficiently.

An approach along the lines of the model above needs to rely on the domain knowledge and the visualizations used in that domain. In Section 4 we will propose such an approach.

3. STATE OF THE ART
The literature on visualization recommendation dates back to the early 1980s. The earliest such works are BHARAT [25], APT [26], Vis-WIZZ [27], Vista [28] and ViA [29]. BHARAT, APT and ViA have a similar direction: they all aim at encoding data variables as visual cues, analysing human perception, and exploiting knowledge of graphic design and displays. This work was independent of any domain. Vis-WIZZ and Vista recognized the need for knowledge-driven visualization mapping techniques, but their research is limited to numerical or quantitative data. Casner's BOZ system [5] analyses task descriptions to generate corresponding visualizations. However, the task first needs to be fed manually to the mapping engine. Many Eyes [10] by IBM, which uses the Rapidly Adaptive Visualization Engine (RAVE) based on the grammar of graphics by Leland Wilkinson [11], is an example of commercial approaches in this area. Similarly, the Polaris work on a visual query language (VizQL) is used in the Show Me module of the Tableau [17] software. Neither of these approaches considers contextual information for recommending visualizations.

PRAVDA (Perceptual Rule-Based Architecture for Visualizing Data Accurately) [4] introduced a rule-based architecture for assisting the user in choosing visualization color parameters. The appropriate visualization rule is selected based on higher-level abstractions of the data, i.e., metadata. Its authors were the first to introduce knowledge from the metadata into the visualization process.

Current knowledge-based visualization approaches are highly interactive [3] and use semantics from different ontologies to annotate visual and data components (see, e.g., Gilson et al. [8]). They extract the semantic information from the input data and try to find the best match by mapping three different ontologies, where one is the domain ontology, another is the visualization ontology and the last one is their own ontology, which is created by mapping the first two.

Though knowledge-based systems reduce the burden placed upon users to acquire knowledge about complex visualization techniques, they lack expert knowledge [13]. Such solutions should be based on some ground truth collected from relevant domain experts. Additionally, we argue that limited user interaction to obtain feedback would be useful to enhance the knowledge base.

4. PROPOSED APPROACH
Based on the requirements identified in Section 2 and the shortcomings of existing approaches discussed in Section 3, we propose a visualization recommendation model which will help scientists in making appropriate choices for presenting their data. It will be based on a knowledgebase created by reviewing the visualizations presented in biodiversity publications. Such knowledge will enrich our understanding of current trends in visualizations for representing biodiversity data. It will also provide the system with the scientific operations and concepts, and the variables used in presenting those concepts. We will be extracting information from metadata (which contains a description of various characteristics of the data and the context of the data collection and usage), integrating the knowledge from the domain vocabularies, and classifying this information with respect to the visual operations performed on the dataset. The knowledge obtained in this way will serve as a key parameter in recommending visualizations.

In addition, to deal with the problem of human perception subjectivity, we propose an interactive machine learning approach for visualization recommendation. We will track the input from the user at each interaction and feed it into the respective modules of the mapping engine. This will make the system learn from the user interaction. However, users will only be prompted to interact if they do not get satisfactory results. Thus, for users who are not computer experts (biodiversity researchers in our case), no interaction is needed if their preferred visualization is present in our recommended list.

In general, our approach is made up of two main components, namely the Visualization Mapping Model (Section 4.1) and the Interactive Learning Workflow (Section 4.2). The approach is explained using the metadata of a dataset from the BE BExIS as an example [30].

4.1 Visualization Mapping Model
Figure 3 provides an overview of the visualization mapping model. Each of the five phases identified and marked on the figure will be explained in detail below.

Figure 3: Visualization Mapping Model

1) Domain level task abstraction: Task here refers to domain-specific analytic operations which are computed on several variables of the dataset in order to derive a concept. For instance, species distribution is an ecological concept which is about the computation of the distribution (a task) of some species over a geographical area.

To understand the domain problem well, we first need to understand the dataset, the goals of the data collection, the analysis performed on the data and how these analytical operations can be mapped visually. Metadata provides information about the what, why, when, and who of the data, as well as the context, methodology and keywords related to the dataset and research. Extracting this information from the data and metadata and mapping it to the domain-specific vocabularies can reveal the biodiversity-related tasks that can be performed on the dataset. As an example, consider the excerpt of metadata of a specific dataset from the BE BExIS [30] shown here:

Detection of forest activities (harvesting / young stock maintenance, etc.) of the forest owner (Forest Service) on the EPs of exploratories. amount and spatial distribution of forest harvesting measures by the Forest Service on the EPs.

To keep the example short and precise, just one keyword ("spatial distribution") has been extracted and will be analyzed and processed. By annotating it with terms from an ontology, e.g., the SWEET ontology [7], a relation such as the one shown in Figure 4 has been found.

Figure 4: Keyword Annotation

This annotation can be explained as:
Domain Specific Task: Spatial Distribution on any of the distribution functions like Probability Distribution Function or Chi-square Distribution.
Representational Task: Distribution
Representational Variables: at least 2 (independent and dependent)

2) Data to visual encoding: Here, we will perform visual encoding on the variables and the values of the dataset. We will map the data variables to their equivalent visual marks/icons/variables (as in Figure 5 [18]) on the basis of some existing classification scheme (such as the one presented in [17]) for graphical presentation. Figure 5 shows how the relationship among various aspects of data can be represented within the visualization. For example, a variable that represents a size element (like area or length) of some entity could best be represented via bars of different sizes in a visualization.

To give an example, we have used the same dataset as above and have extracted some variables as shown in Table 1. In the visualization creation process, the variables are first identified with their respective datatypes (measurement units). We have done this and have appended another column named UNIT. Then, taking Figure 5 as a reference, we have transformed these variables into their respective visual icons/variables (the shaded column named 'Visual Icons' in Table 1). Tree species is an ordinal or categorical variable and thus could best be represented by colour, shape or orientation styles. In the same way, nominal variables could be used as X,Y scales in a 2-D visualization. In the next steps, we will be using these icons to represent the relations between variables in the visualization.

Table 1: Dataset variables

Figure 5: Bertin's Visual Variables [18]

3) Task to operation encoding: At this stage, we will combine the conceptual knowledge gained from metadata with the visual representation knowledge. Visual representation knowledge will be derived by analyzing existing publications. We will be creating a domain knowledgebase about visualizations used in biodiversity publications and will ask scientists to verify it and provide their feedback. This phase is important to capture the domain expertise about current visualization trends for representing different studies. The candidate visualizations will be chosen from this knowledgebase according to the domain tasks that we have extracted in Step 1.

In our preliminary work, we have tried to understand the different visualizations used in biodiversity research by reviewing the publications from the information system of the Biodiversity Exploratories. A small sample of the results is depicted in Figure 6. This figure shows which visualization has been used to represent which biodiversity study/analysis within the reviewed publications. Taking the same example as in the previous steps, with the information contained in Figure 6 we can accomplish two jobs. First, we can infer concepts that are related to the identified concept. For example, spatial distribution can be associated with other spatial analysis methods like Spatial Heterogeneity, Spatial Autocorrelation etc. Distribution itself relates to various concepts like Trajectory Distribution, Diversity Distribution and PFT Distribution. Second, we can identify related visualizations. In our example, these are Grid Heatmap, Kriged Map and Line Graph.

Figure 6: Sample Visualizations used in Biodiversity Research

4) Mapping: Our mapping model is an algorithm that will integrate the knowledge from the previous stages. This algorithm will generate a visualization recommendation list based on the priority of domain-specific tasks and feedback from users on the results. The following tasks will be performed:

• It will use the knowledge from previous steps to understand and define the structure of the visualizations appropriate for this dataset.
In our example, in Step 1 we have understood that the task to perform is 'Distribution' with the use of some distribution function for 'Spatial Analysis'. From Step 3, we have identified three candidate visualizations.

• It will integrate the knowledge from Step 2 to map the data attributes within the candidate visualizations.
For example, if the user selects 'heatmap', then a possible mapping of variables to visual icons is depicted in Table 2 (consider Table 1 and Figure 5 also).

• It will score/rank the candidate visualizations based on:
Review Phase (Step 3): By choosing the candidate visualizations most popular for that study.
System learning: Based on the user's feedback as introduced in Section 4.2 below.

Table 2: Variable attribute mapping

Visual Icons    Variables
X axis          PLOT
Y axis          NRderMassnahme
Value           Tree Height
Colour          Tree

5) Learning: If the user is not satisfied with the results of our automatic mapping process, he or she will be presented with an interactive workflow which will be explained in detail in the next section and which will improve future system suggestions.

4.2 Interactive Learning Workflow
We believe that fully automating the task of visualization recommendation is extremely difficult. Classical machine learning approaches, in which the system can be trained on visualization mapping for different domain concepts, might be an option. However, this is an expensive process, as it takes tremendous effort to gather knowledge about the domain (especially for wide domain areas like ours) and then takes a long time to train the model on the huge database. Moreover, this approach is not user-centric. Therefore, we suggest the use of interactive machine learning approaches to overcome these problems. Algorithms used in the mapping process can be continuously refined by training them on the logs of user interaction.

Such an interactive learning workflow is presented in Figure 7.

Figure 7: Interactive learning workflow

The learning aspect of the workflow will be triggered in three different cases. In the first case, if the user is satisfied with the list of recommended visualizations, the system will consider it a hit case. Every hit case will trigger the following actions:
1) The weight parameter will be increased for that recommended list.
2) Within the list, the visualization that the user selected will score higher than the other visualizations.

Returning to the example from Section 4.1, consider the identified visualization list and the respective probability of the visualizations to be selected. Initially, all will have the same probability of being chosen by the user. For example, given the following visualizations:
o Grid Heatmap: 33.33 %
o Kriged Map: 33.33 %
o Line Graph: 33.33 %

Suppose now the user has selected "Line Graph". Then it will rank higher in the list and will have a higher probability of being selected, as shown here:
o Line Graph: 66.66 %
o Grid Heatmap: 16.67 %
o Kriged Map: 16.67 %

The second case arises if the user is not satisfied with the list of recommended visualizations. We will consider this a miss case. Then, we need to know why the intended visualization (i.e., the visualization that the user wants) could not be generated. Therefore, we will ask the user to do a manual task selection. When the user has selected the task, a new visualization list will be recommended. If that is a hit case, then we will update our semantic algorithm (which we used in Step 3 of Section 4.1) with this task. In other words, we will make our algorithm consider this task (that the user has selected) when a similar context (metadata) is encountered next time. This has been depicted via red lines in Figure 7.

In the third case, the user is still not satisfied with the result, i.e., it is a miss case again. Then we know that the problem is not with the task extraction algorithm, but with something else. So, we will ask the user to select the visualization and variables. The selected visualization will be added to the list (from the publication review phase, which we used in Step 3 of Section 4.1), together with the corresponding variables that the user has selected.

5. CONCLUSION AND FUTURE WORK
In this paper, we have proposed an approach to semi-automatic visualization recommendation. It is based on understanding the problem domain and capturing knowledge from the domain vocabularies. We expect that this will assist users (biodiversity researchers in our case) in making suitable choices of visualization from the recommended list without needing to get into any technical details. We have also presented an interactive learning workflow that will improve the system based on the users' feedback in case the recommended visualizations are not suitable for them. This will make the system more human-centric by incorporating knowledge from different viewpoints, which will produce more effective and interpretable graphics.

Our work is in its initial stages and we are in the process of gathering the visualization requirements from the domain experts via surveys and publication reviews. This knowledge will be used as a ground truth for mapping the conceptual knowledge to the visual operations.

6. REFERENCES
[1] Grammel, L., Tory, M., and Storey, M. 2010. How information visualization novices construct visualizations. IEEE Transactions on Visualization and Computer Graphics, 16(6). 943-952. DOI=10.1109/TVCG.2010.164
[2] Amar, R. and Stasko, J. 2004. A Knowledge Task-Based Framework for Design and Evaluation of Information Visualizations. In Proceedings of the IEEE Symposium on Information Visualization (INFOVIS '04). IEEE Computer Society, Washington, DC, USA, 143-150. DOI=10.1109/INFOVIS.2004.10
[3] Martig, S., Castro, S., Fillottrani, P. and Estévez, E. 2003. Un Modelo Unificado de Visualización. Proceedings, 9º Congreso Argentino de Ciencias de la Computación. Argentina. 881-892
[4] Bergman, L.D., Rogowitz, B.E. and Treinish, L.A. 1995. A rule-based tool for assisting colormap selection. In Proceedings of the 6th conference on Visualization '95 (VIS '95). IEEE Computer Society, Washington, DC, USA, 118-125
[5] Casner, S.M. 1991. Task-analytic approach to the automated design of graphic presentations. ACM Trans. Graph. 10, 2 (April 1991), 111-151. DOI=10.1145/108360.108361
[6] Munzner, T. 2009. A Nested Model for Visualization Design and Validation. IEEE Transactions on Visualization and Computer Graphics 15, 6 (November 2009), 921-928. DOI=10.1109/TVCG.2009.111
[7] NASA JPL California Institute of Technology. Semantic Web for Earth and Environmental Technology (SWEET) version 2.3. Available at https://sweet.jpl.nasa.gov/download
[8] Gilson, O., Silva, N., Grant, P.W. and Chen, M. 2008. From web data to visualization via ontology mapping. In Proceedings of the 10th Joint Eurographics / IEEE - VGTC conference on Visualization (EuroVis'08), Eurographics Association, Aire-la-Ville, Switzerland, 959-966. DOI=10.1111/j.1467-8659.2008.01230.x
[9] Rui, Y., Huang, T.S., Ortega, M. and Mehrotra, S. 1998. Relevance Feedback: A Power Tool for Interactive Content-Based Image Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 8(5). 1-13
[10] IBM. Many Eyes. Available at http://www-969.ibm.com/software/analytics/manyeyes
[11] Wilkinson, L. Statistics and Computing, The Grammar of Graphics. Springer Press. Chicago, 2005
[12] Heer, J., Ham, F., Carpendale, S., Weaver, C. and Isenberg, P. 2008. Creation and Collaboration: Engaging New Audiences for Information Visualization. In Information Visualization, Lecture Notes In Computer Science, 4950. 92-133. DOI=10.1007/978-3-540-70956-5_5
[13] Chen, M., Ebert, D., Hagen, H., Laramee, R.S., Liere, R.V., Ma, K.L., Ribarsky, W., Scheuermann, G., and Silver, D. 2009. Data, Information, and Knowledge in Visualization. IEEE Computer Graphics and Applications, 29(1). 12-19. DOI=10.1109/MCG.2009.6
[14] Ware, C. Information Visualization: Perception for Design. Morgan Kaufmann Publishers, San Francisco, CA. 2000
[15] Lotz, T., Nieschulze, J., Bendix, J., Dobbermann, M. and König-Ries, B. 2012. Diversity or uniform? Intercomparison of two major German project databases for interdisciplinary collaborative functional biodiversity research. Ecological Informatics, 8, 10-19. DOI=10.1016/j.ecoinf.2011.11.004
[16] Fischer, M., Boch, S., Weisser, W.W., Prati, D. and Schoning, I. 2010. Implementing large-scale and long-term functional biodiversity research: The Biodiversity Exploratories. Basic and Applied Ecology, 11(6). 473-485. DOI=10.1016/j.baae.2010.07.009
[17] Tableau. Available at http://www.tableau.com/products/trial?os=windows
[18] Bertin, J. 1983. Semiology of graphics. University of Wisconsin Press, Berlin.
[19] Kelleher, C., Wagener, T. 2011. Ten guidelines for effective data visualization in scientific publications. Environmental Modelling & Software. 1-6. DOI=10.1016/j.envsoft.2010.12.006
[20] The Biodiversity Research Centre, University of British Columbia. Available at http://www.biodiversity.ubc.ca/research/groups.html
[21] Cleveland, W.S. 1984. Graphs in Scientific Publications. The American Statistician. 38(4). 261-269. DOI=10.1080/00031305.1984.10483223
[22] Schofield, E.L. 2002. Quality of Graphs in Scientific Journals: An Exploratory Study. Science Editor 25(2). 39-41
[23] Lauren, E.F., Kevin, C.C. 2012. Graphs, Tables, and Figures in Scientific Publications: The Good, the Bad, and How Not to Be the Latter. The Journal of Hand Surgery, 37(3). 591-596. DOI=10.1016/j.jhsa.2011.12.041
[24] Reda, K., Johnson, A., Mateevitsi, V., Offord, C., and Leigh, J. 2012. Scalable visual queries for data exploration on large, high-resolution 3D displays. In High Performance Computing, Networking, Storage and Analysis (SCC). IEEE. 196-205. DOI=10.1109/SC.Companion.2012.35
[25] Gnanamgari, S. 1981. Information presentation through default displays. Ph.D. dissertation, Philadelphia, PA, USA
[26] Mackinlay, J. 1986. Automating the design of graphical presentations of relational information. ACM Transactions on Graphics, 5(2), 110-141. DOI=10.1145/22949.22950
[27] Wehrend, S. and Lewis, C. 1990. A problem-oriented classification of visualization techniques. In Proceedings of the 1st conference on Visualization '90 (VIS '90), Arie Kaufman (Ed.). IEEE Computer Society Press, Los Alamitos, CA, USA, 139-143.
[28] Senay, H. and Ignatius, E. 1994. A Knowledge-Based System for Visualization Design. IEEE Computer Graphics and Applications. 14(6), 36-47. DOI=10.1109/38.329093
[29] Healey, C.G., Amant, R.S., and Elhaddad, M.S. 1999. Via: A perceptual visualization assistant. In 28th Workshop on Advanced Imagery Pattern Recognition (AIPR-99), 2-11.
[30] Biodiversity Exploratories Information System (BExIS). Available at https://www.bexis.uni-jena.de/Data/ShowXml.aspx?DatasetId=4020. Accessed on 07/05/2015
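
As an illustration of the hit-case update described in Section 4.2, the following Python sketch reproduces the figures of the running example: three equally likely candidates, then roughly two thirds for "Line Graph" and one sixth for each of the others once the user has selected it. The paper does not state the update rule. The rule used here, which averages the current probabilities with a one-hot vector for the selected visualization (a rate of 0.5), is an assumption chosen only because it matches those numbers. The names update_on_hit, rank and rate are hypothetical.

```python
def update_on_hit(probs, selected, rate=0.5):
    """Shift probability mass towards the visualization the user selected.

    Assumed rule (not specified in the paper): every entry is scaled by
    (1 - rate) and the selected entry additionally receives `rate`, so the
    result still sums to 1.
    """
    if selected not in probs:
        raise KeyError(f"unknown visualization: {selected}")
    return {
        name: (1.0 - rate) * p + (rate if name == selected else 0.0)
        for name, p in probs.items()
    }


def rank(probs):
    """Return visualization names ordered from most to least probable."""
    return sorted(probs, key=probs.get, reverse=True)


# Candidate list from the running example, initially equally likely.
initial = {"Grid Heatmap": 1 / 3, "Kriged Map": 1 / 3, "Line Graph": 1 / 3}

# The user picks "Line Graph" from the recommended list (a hit case).
updated = update_on_hit(initial, "Line Graph")

for name in rank(updated):
    print(f"{name}: {updated[name]:.2%}")
```

With a smaller rate the list would adapt more slowly to individual choices; which value suits real users is an open question that the planned surveys could help answer.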
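
The three cases of the interactive learning workflow in Section 4.2 can also be written down as a small decision function. This is only a sketch of the control flow described in the text, not of Figure 7 itself. The names Update, learning_action, first_list_ok and second_list_ok are hypothetical, and the actual update of each component is left out.

```python
from enum import Enum
from typing import Optional


class Update(Enum):
    """Which part of the system learns from the interaction."""
    RANKING = "raise the list weight and the score of the selected visualization"
    TASK_EXTRACTION = "remember the user-selected task for similar metadata"
    KNOWLEDGE_BASE = "store the user-selected visualization and variables"


def learning_action(first_list_ok: bool,
                    second_list_ok: Optional[bool] = None) -> Update:
    """Map the user's answers in the workflow to the component to update.

    first_list_ok:  is the user satisfied with the initial recommendation list?
    second_list_ok: after a manual task selection, is the user satisfied with
                    the new list? Only consulted when first_list_ok is False.
    """
    if first_list_ok:
        return Update.RANKING          # case 1: hit
    if second_list_ok is None:
        raise ValueError("second_list_ok is required after a miss")
    if second_list_ok:
        return Update.TASK_EXTRACTION  # case 2: miss, then hit after task selection
    return Update.KNOWLEDGE_BASE       # case 3: miss again


print(learning_action(True).name)
print(learning_action(False, True).name)
print(learning_action(False, False).name)
```

Keeping the decision separate from the updates makes it explicit that a user who is satisfied with the first list is never asked for further input.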