Visual Search Analytics: Combining Machine Learning and Interactive Visualization to Support Human-Centred Search

Orland Hoeber
University of Regina
Regina, SK, Canada
orland.hoeber@uregina.ca

Abstract

Searching within large online document collections has become a common activity in our modern information-centric society. While simple fact verification tasks are well supported by current search technologies, when the search tasks become more complex, a substantial cognitive burden is placed on the searcher to craft and refine their queries, evaluate and explore among the search results, and ultimately make sense of what is found. Visual search analytics provides a means for relieving this burden through a combination of automatic machine learning and interactive visualization. The goal is to automatically extract and infer relevant information during the search process and present this to the searcher in a visual format that allows for quick interpretation and easy manipulation of the search process, providing support for the full range of human-centred search activities.

Copyright © 2014 for the individual papers by the paper's authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors.
In: U. Kruschwitz, F. Hopfgartner and C. Gurrin (eds.): Proceedings of the MindTheGap'14 Workshop, Berlin, Germany, 4-March-2014, published at http://ceur-ws.org

1 Introduction

A substantial portion of human knowledge exists in textual formats within online domain-specific document collections. Examples include Wikipedia, the ACM Digital Library, Engadget, Twitter, Associated Press, the vast document collections within corporate environments, and the ever-growing public collections indexed by the top web search engines. For small collections or for simple fact verification tasks, browsing or using basic search interfaces is often satisfactory; however, to support complex information seeking activities within large collections, it is necessary to provide powerful search facilities that support exploratory search and analytical reasoning about the information that has been found.

In recent years, there have been significant advancements in search technologies. Web search companies like Google and Microsoft can index billions of documents, and return matches to user-supplied queries within fractions of a second. Open source information retrieval frameworks such as Apache Lucene can efficiently index large unstructured document collections and provide the backbone for custom search systems. Amid all of these development efforts on the back-end of the search process, the interfaces to the vast majority of search systems have remained largely unchanged. The searcher is provided with a query box in which to type a description of what is being sought, and the search results are provided in a list-based format that requires document-by-document inspection.

While such interfaces work well for highly targeted search tasks such as fact verification, their ability to support complex search activities such as disambiguation and exploration is limited. Because of the fundamental differences in why and how people search within various large online document collections, the one-size-fits-all approach to search interfaces may not be appropriate in all settings. For example, one might initiate a search within an online encyclopedia such as Wikipedia in order to find specific facts and explanations about a topic of interest. However, because of the lack of specific knowledge about the topic, the initial query may be ambiguous or may not even match the terminology used within the encyclopedia. In the process of exploring among the search results, new information will be acquired that will allow for the refinement of the query in order to re-focus it more precisely on the topic of interest, as well as to develop new search interests based on serendipitously discovered information. By contrast, one might conduct a search within a corporate document collection to re-find a specific policy document. Because of the common language used within such documents, even for a very specific and accurate query, there may be many search results to evaluate. By re-constructing approximately when the document was last viewed, the collection of documents can be narrowed down to a more manageable size. The differences in these search processes exemplify the need for further research and study of how search interfaces can support complex information seeking tasks within a broad range of online search contexts.

In this paper, visual search analytics is introduced as a special class of visual analytics, with a focus on supporting human-centred search activities [Hoe12]. The more general, multidisciplinary research domain of visual analytics combines data processing and machine learning with information visualization and human-computer interaction, with the goal of supporting data exploration, analytical reasoning, information synthesis, and decision-making [TC06, KAF+08]. Visual search analytics applies this philosophy to search contexts, using intelligent visual methods for guiding query refinement, explaining the composition of the search results, supporting document evaluation and comparison, allowing interactive filtering and re-ranking, and enabling analytical reasoning, sense-making, and exploration among the search results. By enhancing the abilities of people to search within large document collections, the ever-present big data and information overload problems can be better managed.

The remainder of this paper is organized as follows. The core concepts of information visualization, visual analytics, and visual text analytics are presented in Section 2. Section 3 outlines the fundamental features of visual search analytics, and explains its importance for supporting human-centred search activities. Section 4 presents a high-level research agenda for the advancement and study of visual search analytics, along with a critical discussion of its limitations. A summary of the primary contributions of this work is provided in Section 5.

2 Related Work

Information visualization is the field of study that explores the use of computer-generated graphical representations as a method for conveying abstract information to a user. It provides mechanisms for linking the data being processed within a computer system and the mind of the user, via the human vision system [War04]. The goal of information visualization is to take advantage of the parallel processing capabilities of human visual perception [WGK10], allowing people to see information, visually interpret patterns and relationships, and minimize the need to read or examine specific details. By making such visual representations interactive, users are able to manipulate and control the visualization as they seek to understand the data being shown [YaKSJ07].

Although information visualization has been explored in many different application domains, much of the focus has been on either the human perception of graphical entities, or the application of visualization approaches to various types of data (e.g., multidimensional data, graph data). The lack of focus on human-centric problem solving and data analysis tasks has led to the promotion of a new field of research: visual analytics. Visual analytics combines data processing and machine learning with information visualization and human-computer interaction, with a specific focus on supporting data analysis activities such as exploration, reasoning, information synthesis, and decision-making [TC06, KAF+08]. The ultimate aim is to take advantage of the powerful analytic capabilities of the computer whenever possible, using the resulting information to support the cognitive abilities of the user through interactive visualization.

Although text and document visualization have been active research domains for many years [HHWN02, Hea95, WTP+95], much of the recent work in this area has followed the visual analytics approach of combining automated processes with interactive visualizations, resulting in visual text analytics [AdOP12, CCP09, DWS+12, GLK+13, KKRS13, WLS+10]. These approaches generally focus on providing visual tools for exploration among document collections. While they may include basic keyword search as a means for filtering the data, very little attention has been given to supporting the core tasks associated with searching among the textual data.

Others have studied how visualization can enhance the search interface [Hea09] or the information seeking process [MW09], and have identified the importance of providing additional support to searchers within the context of exploratory search [WKDS06, WR09, WSS10]. For example, HotMap [HY09] provides lightweight visual encodings of the correspondence between query terms and search results, and WordBars [HY08] visually represents the relative frequency of the top terms within the search results set. Both of these approaches provide interactive methods for re-ranking the search results based on the additional input provided by the searcher, and have been shown to be helpful when the search tasks are complex or difficult [Hoe13].

3 Visual Search Analytics

Visual search analytics extends the normal keyword search paradigm by automatically extracting salient information from the query, search results, and/or the entire document collection, using visualization to convey this information to the searcher, and allowing the searcher to interactively engage in the fundamental tasks of query specification and refinement, search results evaluation and exploration, and knowledge discovery and management. This domain is related to a number of other important research areas, including big data [Rus11], text mining [AZ12], and visual text analytics [AdOP12]. Many large document collections exhibit the big data traits of volume, variety, and velocity, which must be effectively managed. The goal of text mining is to automatically infer structure from unstructured text, and the goal of visual text analytics is to use visualization to enable the human element of text analysis with the support of automated methods. Visual search analytics draws from advances in these domains, applying text mining within the context of the big data problems of large document collections, and focusing on a very important and far-reaching subclass of visual text analytics problem domains: search.

Figure 1: A framework for visual search analytics. [Diagram labels: Traditional Search Framework (Large Document Collection, Query, Search Engine, Search Results); Visual Search Analytics Framework (Statistical Modelling, Machine Learning, Visualization, Interaction; Comparison, Exploration, Synthesis, Reasoning, Understanding, Discovery, Decision-Making).]
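The extract, convey, and interact loop described in this section can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the stop-word list, the sample results, and the function names (tokenize, top_terms, rerank) are hypothetical, and the scoring is a plain term count rather than the method used by WordBars, HotMap, or any other cited system.

```python
from collections import Counter
import re

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "on", "with"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_terms(results, k=5):
    """Count term occurrences across all results and return the k most frequent.

    This is the kind of overview a term-frequency visualization could display.
    """
    counts = Counter()
    for text in results:
        counts.update(tokenize(text))
    return counts.most_common(k)


def rerank(results, selected_terms):
    """Order results by how often they mention the searcher-selected terms.

    sorted() is stable, so tied results keep the search engine's original order.
    """
    selected = set(selected_terms)

    def score(text):
        return sum(1 for t in tokenize(text) if t in selected)

    return sorted(results, key=score, reverse=True)


# Hypothetical result snippets for the ambiguous query "jaguar".
results = [
    "Jaguar cars and the history of the Jaguar brand",
    "The jaguar is a large cat native to the Americas",
    "Jaguar habitat loss threatens the cat in rainforest regions",
]

overview = top_terms(results, k=2)              # extraction: what the results are about
reranked = rerank(results, ["cat", "habitat"])  # interaction: the searcher picks terms
```

Here the overview step stands in for the information extraction stage, and the re-ranking step stands in for the searcher acting on what the visualization shows; a visual interface would replace the list of (term, count) pairs with a graphical encoding.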
Figure 1 illustrates a process-oriented framework for visual search analytics, extending the traditional search framework with automatic methods for information extraction and interactive visualization to facilitate communication with the searcher. Rather than the document-centric focus that is common within the traditional search framework, a human-centred approach is taken, with the ultimate goal of supporting the searcher's knowledge discovery and decision-making activities [Hoe12]. In particular, extracting, modelling, and learning from the query and corresponding search results set provides the information upon which to base the visualization and interaction features, supporting the fundamental search tasks of crafting and refining the query, evaluating and exploring among the search results, and ultimately making sense of what was found.

While the specific approaches for information extraction depend on the details of the available data within a given document collection setting, they are generally divided into two categories: statistical modelling and machine learning. Statistical modelling focuses on inferring structure from the unstructured textual data, and ranges in complexity from simple term frequency calculations to more complex approaches from the domain of natural language processing [MS99] such as sentiment analysis and named entity extraction. Machine learning attempts to learn generalized models of the data, and includes approaches such as document clustering or sentiment classification, or more complex methods such as topic modelling using latent Dirichlet allocation [BNJ03] or various graph-based inference approaches [Mur12]. In addition, one must consider whether these approaches should be applied to the queries, the search results, or even the entire document collection. A fundamentally important step in any visual search analytics research is to choose appropriate methods for the extraction of useful information upon which to base the visual representations and provide interactive tools to aid the searcher.

The visualization and interaction methods selected within any visual search analytics research may be guided by Shneiderman's information seeking mantra: "overview first, zoom and filter, then details-on-demand" [Shn96], focusing on supporting human-centric search processes [Hoe12, Hoe08]. More specifically, the information extracted may be used to provide visual overviews of the search results, as well as perhaps the query and the entire document collection; zooming and filtering may be implemented via query refinement, faceted navigation, and/or search results re-ranking; and accessing details-on-demand for specific search results is needed in order for the searcher to examine individual documents in detail.

In the design of the visualization features, it is important to consider the fundamental principles and theories that describe how and why users perceive and interpret visual information, including the Gestalt Laws [Kof35], colour theory [Her64], and the work of Bertin [Ber83] and Tufte [Tuf01]. In addition, Pirolli & Card's information foraging theory [PC99] provides a useful basis for understanding how visualization may be used to convey helpful information to searchers as they seek to fulfill their information seeking goals. An important consideration in any work on visual search analytics will be how to effectively abstract and visually convey the complexity of the textual information to the searcher, drawing upon and contributing to the more general field of visual text analytics.

4 Research Agenda

In order to realize the potential value of visual search analytics, the principles must be explored in the development of interactive search interfaces for a number of different large online document collection settings. Potential avenues for research include searching within encyclopedias and online digital libraries, blogs and microblogs, news websites, private corporate document collections, and the web in general. These document collections each feature important differences not only in the textual data available, but also in the types of common information seeking behaviours of the searchers. While the specifics for a given document collection must be carefully studied, a general summary of such data and behaviours is provided in Figure 2.

Figure 2: A summary of the common data features and search behaviours within five large online document collection search settings.

Settings: Encyclopedias & Digital Libraries; Blogs & Microblogs; News Websites; Corporate Doc. Collections; The Web.

Data Features
    unstructured text: all five settings
    existing meta-data: all five settings
    temporal features: all five settings
    geospatial features: three of the five settings
    document relationships: four of the five settings
    author relationships: three of the five settings
    general topics: three of the five settings
    focused topics: three of the five settings

Search Behaviours
    targeted search: all five settings
    exploratory search: three of the five settings
    re-finding: three of the five settings

More specifically, there is a need for the design, development, and study of visual search analytics prototypes that explore the broad range of statistical modelling and machine learning approaches for extracting meaningful information from within textual data, and the visual and interactive techniques for presenting this information to the searchers to support their information seeking tasks. By focusing on the human-centred aspects of search, query refinement can be supported, the composition of the search results can be illustrated, visual document evaluation and comparison can be enabled, search results can be filtered, re-ranked, and explored, and the cognitive activities of analytical reasoning, sense-making, and decision-making can be enhanced.

An important aspect of such research will be to conduct a well-planned series of user evaluations for each prototype at various levels of scale and complexity [Hoe09]. Whenever possible, comparisons should be made to carefully selected baseline systems representing the state-of-the-art and/or industry-standard search approaches, and measurements should be taken to capture not only absolute retrieval effectiveness, but also the searchers' perceptions of usefulness and ease of use. Analyzing the time taken to complete a search task should be done with careful consideration of the specific search activity being supported, noting that the extended engagement with an exploratory task, for example, may be considered a beneficial result. The outcomes of such evaluations can be used to identify aspects that need improvement, allowing for the incremental refinement of the prototype. Successful evaluations will build confidence in the value and benefits of the combination of specific machine learning and interactive visualization approaches employed in the creation of the visual search analytics prototype.

The ultimate goal of this research agenda will be to formalize the common elements of search across multiple online document collection settings, identify the reasons for the differences, and study how visual search analytics approaches support both the common and unique elements in each search setting. This will lead to further refinement of the framework and generalization of the evaluation results across multiple search settings and task types, allowing it to be used as the starting point when developing visual search analytics interfaces for new and emerging application domains.

While this paper has proposed visual search analytics as an approach for supporting human-centred search activities in a broad range of large online document search settings, it should not be considered a silver-bullet solution to all search problems in all situations. In addition to the computational cost of modelling and learning from the search data, there is also a cognitive cost imposed on the searcher to learn how to interpret the visual representations and make effective use of the interactive features. For search tasks that already have a high cognitive overhead and are frequently being performed (e.g., exploratory search in online digital libraries), searchers may be willing to accept a temporary increase in cognitive load, with the expectation that once they learn the features, the visual search analytics system will relieve the cognitive burden associated with the complex search task and provide a more effective way of finding relevant information. However, for search tasks that are already simple in nature (e.g., targeted search on the web) or infrequent (e.g., the occasional search for a document within a corporate intranet), the overhead of a visual search analytics approach may not make sense and a traditional search interface may be more readily accepted. As a result, one can expect some resistance to change if the value of the visual search analytics approach is not carefully measured against the current difficulty and frequency of searching within the target setting.

5 Conclusions

This paper proposes visual search analytics as a subclass of visual analytics, and as an avenue for new research focused on providing greater support for the human-centred elements of search within online document collection settings. In conducting such research, one must consider that searchers in different settings have different motivations for conducting their searches, which lead to different search behaviours that must be supported. The one-size-fits-all approach of providing a simple query box and search results list provides limited support for the complexity of search tasks beyond simple fact verification. As the sizes of the document collections in these settings continue to grow, searchers increasingly face information overload problems, making it more and more difficult to find the information they are seeking. The goal of visual search analytics is to leverage the power of automatic and intelligent information processing approaches, using these to provide the basis for visual and interactive support to the searcher, allowing them to conduct their search tasks in a more effective manner by supporting exploration, analysis, and sense-making among the information provided.

By exploring the application of visual search analytics within various online document collection settings, the common themes among the different search activities will lead to the refinement of the visual search analytics framework proposed in this paper. This framework may then be applied to a wide range of search settings beyond the specific document collections discussed. These include searching within email, desktop files, image collections, and textual data within other visual analytics problem domains. The value of such a framework is that it will provide guidance from both the collection/document perspective, as well as the searcher behaviour perspective.

Further research on visual search analytics will be significant and important because of the ubiquity of textual data and the difficulty in analyzing it. In any domain where such text is important, taking a visual search analytics approach will move search beyond a simple filtering mechanism, making it a fundamental tool for analyzing and understanding the textual information. While text is everywhere, it is seldom used to its fullest potential; visual search analytics is the key to enhancing the human-centred aspects of search and unlocking the value of textual data.

References

[AdOP12] Aretha B. Alencar, Maria Cristina F. de Oliveira, and Fernando V. Paulovich. Seeing beyond reading: A survey on visual text analytics. WIREs Data Mining and Knowledge Discovery, 2(6):476–492, 2012.

[AZ12] Charu C. Aggarwal and Cheng Xiang Zhai, editors. Mining Text Data. Springer Science+Business Media LLC, Philadelphia, PA, 2012.

[Ber83] Jacques Bertin. Semiology of Graphics. Translated by W. J. Berg. University of Wisconsin Press, Madison, WI, 1983.

[BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022, 2003.

[CCP09] Christopher Collins, Sheelagh Carpendale, and Gerald Penn. DocuBurst: Visualizing document content using language structure. Computer Graphics Forum, 28(3):1039–1046, 2009.

[DWS+12] Wenwen Dou, Xiaoyu Wang, Drew Skau, William Ribarsky, and Michelle X. Zhou. LeadLine: Interactive visual analysis of text data through event identification and exploration. In Proceedings of the IEEE Conference on Visual Analytics Science and Technology, pages 93–102, 2012.

[GLK+13] Carsten Görg, Zhicheng Liu, Jaeyeon Kihm, Jaegul Choo, Haesun Park, and John Stasko. Combining computational analyses and interactive visualization for document exploration and sense-making in Jigsaw. IEEE Transactions on Visualization and Computer Graphics, 19(10):1646–1663, 2013.

[Hea95] Marti Hearst. TileBars: Visualization of term distribution information in full text information access. In Proceedings of the ACM Conference on Human Factors in Computing Systems, pages 59–66, New York, NY, USA, 1995. ACM.

[Hea09] Marti Hearst. Search User Interfaces. Cambridge University Press, Cambridge, UK, 2009.

[Her64] Ewald Hering. Outlines of a Theory of Light Sense (Grundzüge der Lehre vom Lichtsinn, 1920). Harvard University Press, 1964.

[HHWN02] Susan Havre, Elizabeth Hetzler, Paul Whitney, and Lucy Nowell. ThemeRiver: Visualizing thematic changes in large document collections. IEEE Transactions on Visualization and Computer Graphics, 8(1):9–20, 2002.

[Hoe08] Orland Hoeber. Web information retrieval support systems: The future of web search. In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence - Workshops (International Workshop on Web Information Retrieval Support Systems), pages 29–32, 2008.

[Hoe09] Orland Hoeber. User evaluation methods for visual web search interfaces. In Proceedings of the International Conference on Information Visualization, pages 139–145, 2009.

[Hoe12] Orland Hoeber. Human-centred web search. In C. Jouis, I. Biskri, J-G Ganascia, and M. Roux, editors, Next Generation Search Engines: Advanced Models for Information Retrieval, pages 217–238. IGI Global, 2012.

[Hoe13] Orland Hoeber. A longitudinal study of HotMap web search. Online Information Review, 37(2):252–267, 2013.

[HY08] Orland Hoeber and Xue Dong Yang. Evaluating WordBars in exploratory web search scenarios. Information Processing and Management, 44(2):485–510, 2008.

[HY09] Orland Hoeber and Xue Dong Yang. HotMap: Supporting visual explorations of web search results. Journal of the American Society for Information Science and Technology, 60(1):90–110, 2009.

[KAF+08] Daniel A. Keim, Gennady Andrienko, Jean-Daniel Fekete, Carsten Görg, Jörn Kohlhammer, and Guy Melançon. Visual analytics: Definition, process, and challenges. In Andreas Kerren, John T. Stasko, Jean-Daniel Fekete, and Chris North, editors, Information Visualization: Human-Centered Issues and Perspectives, LNCS 4950, pages 154–175. Springer, Berlin, 2008.

[KKRS13] Daniel A. Keim, Miloš Krstajić, Christian Rohrdantz, and Tobias Schreck. Real-time visual analytics for text streams. IEEE Computer, 46(7):47–55, 2013.

[Kof35] Kurt Koffka. Principles of Gestalt Psychology. Harcourt-Brace, New York, 1935.

[MS99] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, MA, 1999.

[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge, MA, 2012.

[MW09] Gary Marchionini and Ryen W. White. Information seeking support systems. IEEE Computer, 42(3):30–32, March 2009.

[PC99] Peter Pirolli and Stuart Card. Information foraging. Psychological Review, 106(4):643–675, 1999.

[Rus11] Philip Russom. Big Data Analytics. TDWI Research, Renton, WA, 2011.

[Shn96] Ben Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. In Proceedings of IEEE Symposium on Visual Languages, pages 336–343, 1996.

[TC06] James J. Thomas and Kristin A. Cook. A visual analytics agenda. IEEE Computer Graphics and Applications, 26(1):10–13, 2006.

[Tuf01] Edward Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, CT, 2001.

[War04] Colin Ware. Information Visualization: Perception for Design. Morgan Kaufmann, San Francisco, second edition, 2004.

[WGK10] Matthew Ward, Georges Grinstein, and Daniel Keim. Interactive Data Visualization: Foundations, Techniques, and Applications. A K Peters, Natick, MA, 2010.

[WKDS06] Ryen W. White, Bill Kules, Steven M. Drucker, and m. c. schraefel. Supporting exploratory search. Communications of the ACM, 49(4):37–39, 2006.

[WLS+10] Furu Wei, Shixia Liu, Yangqiu Song, Shimei Pan, Michelle X. Zhou, Weihong Qian, Lei Shi, Li Tan, and Qiang Zhang. TIARA: A visual exploratory text analytic system. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 153–162, 2010.

[WR09] Ryen W. White and Resa A. Roth. Exploratory Search: Beyond the Query-Response Paradigm. Morgan & Claypool Publishers, San Rafael, CA, 2009.

[WSS10] Max L. Wilson, m. c. schraefel, and Ben Shneiderman. From keyword search to exploration: Designing future search interfaces for the web. Foundations and Trends in Web Science, 2(1):1–97, 2010.

[WTP+95] James A. Wise, James J. Thomas, Kelly Pennock, David Lantrip, Marc Pottier, Anne Schur, and Vern Crow. Visualizing the non-visual: Spatial analysis and interaction with information from text documents. In Proceedings of IEEE Information Visualization, 1995.

[YaKSJ07] Ji Soo Yi, Youn ah Kang, John T. Stasko, and Julie A. Jacko. Toward a deeper understanding of the role of interaction in information visualization. IEEE Transactions on Visualization and Computer Graphics, 13(6):1224–1231, 2007.