Olga Scrivner et al. MAICS 2017 pp. 93–97

Building Customized Text Mining Tools via Shiny Framework: The Future of Data Visualization

Olga Scrivner, Vinita Chakilam, Jivitesh Poojary, Nilima Sahoo, Chandan Uppuluri, Stephan De Spiegeleire

ABSTRACT
With the increasing volume of data, there is a growing need for dynamic data visualization to help reveal instant changes in data patterns. There exist many commercial visualization tools, but traditional scholars are often disengaged from the tool development process; thus, the choice of functionalities is contingent upon tool developers, whose choices may not fit the end-users. A collaboration between the two groups, however, has the potential to bridge the gap between traditional scholars, who are more interested in sense-making of the text than in the tools, and data scientists, who are more interested in the tools than in the substance, but must still contextualize the outcomes. Until recently, this collaborative process was hindered by the complexity of customization procedures and the technological hurdles imposed on users by new installations. With the advent of reactive web frameworks, such as Shiny, user-driven customization becomes not only feasible, but also essential to advance scientific research. In this paper, we demonstrate a collaborative effort between learned scholars and tool developers, allowing for a computational and humanistic fusion.

Keywords
visualization, text mining, Shiny web application

1. INTRODUCTION
In the last decade, the volumes of data collections have grown so “large and complex that it becomes difficult to process using on-hand databases, management tools or traditional data processing applications” [14]. As Jockers [4] points out, these massive digital collections “invite, even demand, a new type of evidence gathering and meaning making”. Consequently, visual analytics is becoming the cornerstone of scientific analysis by combining “visualization, human factors and data analysis” and contributing to an information synthesis interpretable to the human eye [5]. Furthermore, the recent proliferation of visualization tools, both commercial and open source, has led to an increasing usage of visual analytics among traditional humanities scholars. Since most of these tools have been developed by software engineers, traditional scholars are often disengaged from the tool development process. A collaboration between the two groups, however, has the potential to bridge the gap between traditional scholars, who are more interested in sense-making of the text than in the tools, and data scientists, who are more interested in the tools than in the substance, but must still contextualize the outcomes. The insights gained from learned scholars would lead to a mutual enrichment, allowing for “synthesis of computational and humanistic modes of inquiry” [6]. A process of collaboration can be achieved through the following steps:

1) Scholars learn from data scientists about analytical tools, techniques, and what they can and cannot achieve

2) They exchange research questions and the implicit or explicit heuristics used in their work

3) They collaborate on how these discoveries can be made and assess the ‘quality’ of developing tools with real data

Until recently, this collaboration was unfeasible. Software development not only necessitates a team of software engineers and designers, but the resulting software also requires installation and consistent updates, which is a technical hurdle for users. Furthermore, the design of collaborative visualization has been commonly described as a grand challenge for visualization research [12]. While most visualization research has explored the cognitive and perceptual aspects of design, social interaction has only recently been recognized as a part of visualization system design [13, 3]. For example, some studies examined synchronous and asynchronous collaboration between team players to improve analytical interpretation [2, 3]. In contrast, a collaboration to enhance analytical functionalities and tool design is not common and is mostly related to commercial customizable software [15].

With the advent of reactive web applications, such as Shiny, user-driven tool customization becomes a reality. First, these applications require no installation and are accessible from any web browser, which enables direct testing of new functionalities by users. Second, the reactive framework allows for the creation of highly dynamic tools with minimal knowledge of web development. Finally, Shiny is a web application framework developed for R, an open source language with a large collection of libraries for data visualization.

In this paper, we describe our current collaborative research on text mining and visualization customization. Our goal is to assist scholars in their process of ingestion (‘reading’), digestion (analyzing and sense-making), and egestion (through the creation of new learned texts via queries). Our workflow is illustrated in Figure 1.

Figure 1: Information Visualization workflow: From the initial stage to a custom-designed stage

Our initial stage begins with the current version of Interactive Text Mining Suite,¹ a Shiny web application developed to test various text mining and visualization techniques for digital humanities scholars [10, 11]. Our second stage comprises a direct collaboration with various scholars via Rizzoma, a collaborative social platform for discussions, and by means of various cloud storage platforms. The goal of this social interaction is to 1) identify scholarly research needs, 2) discuss design and functionalities, and, finally, 3) develop and embed new functionalities into a web application. This stage also includes bug reports, constant feedback, and suggestions on design improvement directly from scholars. The final stage involves a fully customized version of the web application.

¹ http://www.interactivetextminingsuite.com

This paper is organized as follows. In Section 2 we introduce Shiny, a reactive web framework, and describe Interactive Text Mining Suite and its current functionalities. Section 3 presents our user-driven collaboration. Sections 4 and 5 overview the development of customized functionalities for scholarly research, followed by conclusions and future directions in Section 6.

2. SHINY APPLICATION

2.1 Shiny Web Framework
The traditional imperative web framework model was developed by Trygve Reenskaug in 1979 and follows a three-component structure: model, view, and controller [8]. In this model, the controller plays an essential and explicit role: “you have to specify what to do when you receive user requests and what resources you are going to mobilize to carry out the necessary tasks outlined in the model” [9]. In contrast, the recent shift toward a reactive web framework has erased such strict control, thus enabling dynamic systems that are highly responsive to users’ input and interaction. Shiny, an R package, is one such framework. After its release as an open source software package in 2012, the use of Shiny has been expanding at an unprecedented rate. This trend can be attributed to the combination of several factors: 1) Shiny web applications do not require a knowledge of web development, 2) web applications are user-friendly and dynamic, allowing for instant feedback to users, 3) web applications are accessible via browser from any device, including mobile devices, which makes them convenient for users, and 4) web applications are highly customizable, allowing for instant modification, as compared to traditional software version releases. In the next section, we will briefly describe our recently developed Shiny application, namely Interactive Text Mining Suite.

2.2 Interactive Text Mining Suite
Interactive Text Mining Suite (henceforth, ITMS) is designed to assist humanities scholars in the discovery of new insights and patterns within large digital collections, and to provide access to natural language processing techniques with a user-friendly design. Its major strength is the ability to work with data in various formats: PDF and text formats, as well as CSV, JSON, and XML, as shown in Figure 2.

Figure 2: Interactive Text Mining Suite: Importing data

In contrast, many existing text mining tools are limited to specific importing formats. Additionally, ITMS performs a wide range of common preprocessing tasks, allowing for maximum flexibility and user control, as illustrated in Figure 3 (for a more detailed description, see [10, 11]).

Figure 3: Interactive Text Mining Suite: Preprocessing data

3. USER-DRIVEN CUSTOMIZATION
As mentioned in 2.2, ITMS was designed as a digital humanities tool suitable for performing common text mining tasks and visualization methods. That is, it was built for scholars, but not by humanities scholars. However, there exists a gap between scholars, who have been doing more qualitative text-based research for public and government sectors, and data scientists and computational linguistics scholars, who work on theoretical text-mining research.² To bridge this gap, we have developed a collaborative communication between these two communities (a.k.a. end-users and developers).
Instead of a typical GitHub environment for reporting progress and issues, we chose Rizzoma,³ a social collaborative platform. Rizzoma is built as a knowledge-management and discussion platform allowing for real-time team communication and multimedia support. Figure 4 illustrates our collaborative project structure.

² From a personal communication with The Hague Center for Strategic Studies (www.hcss.nl)
³ https://rizzoma.com

Figure 4: A collaborative platform Rizzoma: ITMS project

In the following sections, we describe our workflow and design considerations based on this collaboration.

4. DATA INPUT CONSTRAINT
Based on the previous work with existing text mining tools, it was determined that the main source of noise in scholarly research is the inability to pre-define text excerpts within the text collection. It appears that stopword filtering and text preprocessing were not sufficient to obtain intuitive data interpretation for qualitative scholarly studies. Collaboratively, we have developed and tested the following algorithm (see also Figure 5):

1. Parse document collection

2. Scan every document for a specific term defined by the user (e.g., “security”) or two terms (e.g., “influenc*” within 10 words of “Europ*”)

3. Define a window around these terms (e.g., 10 words to the left and 10 words to the right)

4. Include only the extracted segments into data analysis and visualization

Figure 5: Development of data segmentation: window constraint

In addition, scholarly research collections are often stored and accessed via bibliographic management systems (e.g., Zotero, Mendeley, and Endnote). While most of these systems do not perform text mining analysis, the Zotero plugin Paper Machines⁴ offers a wide range of interactive visualizations for document collections. Nevertheless, the user cannot control text segmentation, which yields very broad topic and metadata visualizations. Given that Zotero is the main bibliographic management system in our collaborative project, data import from Zotero into ITMS became one of the most essential primary tasks for our team. Several options exist for exporting library collections from Zotero, namely the rdf and csv formats. However, a few issues were discovered during the exploratory phase: 1) csv and rdf files only contain local paths to the actual document articles (see Figure 6); 2) local paths cannot be accessed directly from a remote web application; 3) running ITMS locally would require R installation and some programming knowledge, thus generating technological hurdles for end-users.

⁴ http://papermachines.org/

Figure 6: Zotero collection: rdf file format (top) and Zotero internal directory structure (bottom)

Two solutions are currently being tested based on the following criteria: 1) functionality and 2) the level of complexity. The first approach is the development of a small Shiny application installed locally that would process rdf library collections, export pdf files, convert them into text files, and place them into one directory, which can be accessed from our web application (see Figure 7).

Figure 7: Small Shiny application: local conversion of Zotero files

This application is used only once and has a low level of complexity, yet the functionality is less user-friendly, as it creates an additional directory with extracted files from Zotero. These files can then be imported into ITMS. The second approach is suggested by the end-users: export the Zotero library as a csv file, run a local script to extract all pdf files, and add them into the csv file as plain text. While the functionality is high, the level of complexity is much higher.

5. DATA VISUALIZATION
There is no doubt that visual analytics facilitates analytical reasoning [12]. For a tool developer, however, it is not always clear whether the implemented visualization methods assist the user in their research. The current project proposes to address this issue by a collaborative examination of various visualization types in order to determine their usability for the Shiny application and for the end-users. First, we describe n-gram analysis, followed by interactive visualization and topic modeling visualization.

Figure 8: N-gram visualization with JSTORr package: populism

5.1 N-gram Visualization
An N-gram allows users to identify the co-occurrence of words within a single text or text collection. The ‘n’ indicates the number of words being selected to create uni-grams, bi-grams, tri-grams, etc. By using a combination of words, instead of a single occurrence, the user can generate higher quality results for data interpretation. Preferably, the user can also set a ‘search window’ within which co-occurrence should take place (i.e., within a certain amount of words or certain documents). In addition, our current work is concentrated on interactive and more meaningful n-gram visualization (e.g., tree visualization), as compared to traditional static graphics (Figure 8). Based on our collaborative feedback, the tree N-gram visualization was identified as more meaningful for scholarly interpretation. This tree will share prefixes of N-grams (e.g., “airport”), each connected to the root node. The root node is the set of focus words selected in the query. Every path in the tree, i.e., a path from the root node to a leaf node, corresponds to the N-gram made of the words encountered along the path, and has the score associated with the leaf node. Another possible visualization is a network representation, where the central node is the key word. There exist multiple libraries that might be used to enhance n-gram interpretation, such as JSTORr, ngram, NSP, and WordStat, among many others. In order to identify the best fit for the web application, we address the following criteria: 1) user-friendliness, 2) easy human interpretation, and 3) functionality.

5.2 Interactive Visualization
The ability to perform dynamic and interactive visualization is one of the strengths of reactive applications. While there are many R libraries implementing various types of interactive visualization, we decided to examine two packages, namely plot.ly and googleVis. Comparison and parallel testing feed our decision to implement their functionalities into ITMS. Table 1 presents our current summary.

Table 1: Functional comparison of 2 R libraries: plot.ly and googleVis

Types (columns: googleVis, plot.ly)
Stepped Area chart
Bubble chart
Gauge
Intensity Map
Geo Chart
Table with pages
Tree Map: NA
Annotation chart
Sankey chart
Calendar chart: NA
Timeline chart
Merging charts: NA
Flash charts: NA
Annotated time line chart
Chord diagram: NA
Filled Chord diagram: NA
k-means clustering
Stream Graph: NA
PCA: NA
Hierarchical Clustering: NA
Doughnut Chart

After identifying their functionalities, our next step is to determine the best fit via our collaborative feedback.

5.3 Topic Modeling Story Telling
Topic modeling is a statistical technique used in machine learning and natural language processing for discovering abstract topics that occur in a collection of documents. This analysis assists in “classification, novelty detection, summarization, and similarity and relevance judgements” [1]. While topic modeling results can be visualized in different forms, the most common form is a table (see Figure 9).

Figure 9: Topic visualization in ITMS

By comparing other software with their unique options for topic representation, two candidates for ITMS were identified: topic bubbling and topic coupling from MALLET, a topic modeling package. The goal of topic bubbling is to compare the relative importance of all the topics; the size of a topic bubble is the accumulated size of all word bubbles within that topic. In contrast, topic coupling reveals the relations between the topics based on their associated words. In this representation, topics are shown as a network of terms (nodes) linked by their interaction with other topics.

6. CONCLUSION
In recent years, we have seen growing interest in the use of data visualization tools in the humanities. However, many of the existing tools are unable to integrate the humanistic component of exploratory research. Thus, the overarching goal of the current work on ITMS is to bridge the gap between tool developers and learned scholars by adding a user-customization component. In addition, the social interaction between scholars and data scientists has a strong potential to promote text mining methods among humanities scholars as well as to enhance the capabilities and functionality of visualization tools. We have also shown that the recent development of the reactive Shiny framework has facilitated the task of user customization. On the one hand, a wide range of open source R libraries and the overall simplicity of deployment have made the Shiny framework very accessible to inexperienced programmers. On the other hand, a Shiny application is a user-friendly web application, where users are not constrained by the limitations of their local computer memory or by platform dependency, as compared to other software tools. While this project focuses only on a collaboration with political and social science scholars, this idea can be extended to other fields. Below we summarize some of the possible implementations for future research:

1. Teaching tool: The web application is developed in conjunction with lesson plans, for example statistics modules. The collaboration can also be expanded by including students in the development and testing phases.

2. Digital Humanities: Based on individual research, the web application can be augmented with additional visualization types, for example spatial or chronological maps for literature analysis.

3. Social Science: The user could specify additional media for research and customize their appearance, e.g., tweets, blogs, or photos.

4. Libraries: ITMS is unique in that, in addition to its ability to analyze data collections, it can add bibliographic metadata. As Jockers suggests, library metadata “has been largely untapped as a means of exploring literary history” and could “reveal useful information about literary trends” [4].

All these considerations and the scholarly collaboration also present new opportunities for the field of data visualization and analytics, advancing our understanding of computation and human nature, namely the “synthesis of computational and humanistic modes of inquiry” [6].

7. REFERENCES
[1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[2] K. Brodlie, D. Duce, J. Gallop, J. Walton, and J. Wood. Distributed and collaborative visualization. Computer Graphics Forum, 23:223–251, 2004.
[3] J. Heer and M. Agrawala. Design Considerations for Collaborative Visual Analytics. Information Visualization, 7:49–62, 2008.
[4] M. L. Jockers. Topics in the Digital Humanities: Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Urbana, IL, USA, 2013.
[5] D. Keim, F. Mansmann, J. Schneidewind, and H. Ziegler. Challenges in Visual Data Analysis. In Information Visualization (IV 2006), IEEE, pages 9–16, 2006.
[6] L. Klein and J. Eisenstein. Reading Thomas Jefferson with TopicViz: Towards a Thematic Method for Exploring Large Cultural Archives, 2013.
[7] L. Klein and J. Eisenstein. Reading Thomas Jefferson with TopicViz: Towards a Thematic Method for Exploring Large Cultural Archives. Scholarly and Research Communication, 4(3), 2013.
[8] T. M. H. Reenskaug. Models-views-controllers. http://heim.ifi.uio.no/~trygver/1979/mvc-2/1979-12-MVC.pdf
[9] B. B. Ribeiro. The two frameworks. https://github.com/rstudio/shiny/issues/250, 2017.
[10] O. Scrivner and J. Davis. Topic Modeling of Scholarly Articles: Interactive Text Mining Suite. In International Conference “Dialogue 2016”, 2016.
[11] O. Scrivner and J. Davis. Interactive Text Mining Suite: Data Visualization for Literary Studies. In the Workshop on Corpora in the Digital Humanities, pages 29–38, 2017.
[12] J. Thomas and K. Cook, editors. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Press, 2005.
[13] F. Viégas and M. Wattenberg. Communication-minded visualization: A call to action. IBM Systems Journal, 45, 2006.
[14] T. White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale. O’Reilly Media/Yahoo Press, Sebastopol, 3rd edition, 2012.
[15] J. Whitehead. Collaboration in Software Engineering: A Roadmap. In 2007 Future of Software Engineering, pages 214–225, Washington, DC, USA, 2007. IEEE Computer Society.