Olga Scrivner et al.                                      MAICS 2017                                                     pp. 93–97


           Building Customized Text Mining Tools via Shiny
            Framework: The Future of Data Visualization

             Olga Scrivner, Vinita Chakilam, Jivitesh Poojary, Nilima Sahoo, Chandan Uppuluri, Stephan De Spiegeleire


ABSTRACT
With the increasing volume of data, there is a growing need for dynamic data visualization to help reveal instant changes in data patterns. There exist many commercial visualization tools, but traditional scholars are often disengaged from the tool development process; thus, the choice of functionalities is contingent upon tool developers, whose choices may not fit the end-users. Closer collaboration, however, has the potential to bridge the gap between traditional scholars, who are more interested in sense-making of the text than in the tools, and data scientists, who are more interested in the tools than in the substance, but must still contextualize the outcomes. Until recently, this collaborative process was hindered by the complexity of customization procedures and the technological hurdles imposed on users by new installations. With the advent of reactive web frameworks, such as Shiny, user-driven customization becomes not only feasible, but also essential to advance scientific research. In this paper, we demonstrate a collaborative effort between learned scholars and tool developers, allowing for a computational and humanistic fusion.

Keywords
visualization, text mining, Shiny web application

1. INTRODUCTION
   In the last decade, the volumes of data collections have grown so “large and complex that it becomes difficult to process using on-hand databases, management tools or traditional data processing applications” [14]. As Jockers [4] points out, these massive digital collections “invite, even demand, a new type of evidence gathering and meaning making”. Consequently, visual analytics is becoming the cornerstone of scientific analysis by combining “visualization, human factors and data analysis” and contributing to an information synthesis interpretable to the human eye [5]. Furthermore, the recent proliferation of visualization tools, both commercial and open source, has led to an increasing usage of visual analytics among traditional humanities scholars. Since most of these tools have been developed by software engineers, traditional scholars are often disengaged from the tool development process. Collaboration, however, has the potential to bridge the gap between traditional scholars, who are more interested in sense-making of the text than in the tools, and data scientists, who are more interested in the tools than in the substance, but must still contextualize the outcomes. The insights gained from learned scholars would lead to a mutual enrichment, allowing for “synthesis of computational and humanistic modes of inquiry” [6]. A process of collaboration can be achieved through the following steps:

  1) Scholars learn from data scientists about analytical tools, techniques, and what they can and cannot achieve

  2) They exchange research questions and the implicit or explicit heuristics used in their work

  3) They collaborate on how these discoveries can be made and assess the ‘quality’ of developing tools with real data

   Until recently, this collaboration was unfeasible. Software not only necessitates a team of software engineers and designers, but also requires installation and consistent updates, which is a technical hurdle for users. Furthermore, the design of collaborative visualization has been commonly described as a grand challenge for visualization research [12]. While most visualization research has explored the cognitive and perceptual aspects of design, social interaction has only recently been recognized as a part of visualization system design [13, 3]. For example, some studies examined synchronous and asynchronous collaboration between team players to improve analytical interpretation [2, 3]. In contrast, collaboration to enhance analytical functionalities and tool design is not common and is mostly related to commercial customizable software [15].
   With the advent of reactive web applications, such as Shiny, user-driven tool customization becomes a reality. First, these applications require no installation and are accessible from any web browser, which enables direct testing of new functionalities by users. Second, the reactive framework allows for the creation of highly dynamic tools with minimum knowledge of web development. Finally, Shiny is a web application framework developed for R, which is an open source language with a large library for data visualization.
   In this paper, we describe our current collaborative research on text mining and visualization customization. Our goal is to assist scholars in their process of ingestion (‘reading’), digestion (analyzing and sense-making), and egestion (through the creation of new learned texts via queries). Our workflow is illustrated in Figure 1.




Figure 1: Information Visualization workflow: From the initial stage to a custom-designed stage

Our initial stage begins with the current version of Interactive Text Mining Suite,¹ a Shiny web application developed to test various text mining and visualization techniques for digital humanities scholars [10, 11]. Our second stage comprises a direct collaboration with various scholars via Rizzoma, a collaborative social platform for discussions, and by means of various cloud storage platforms. The goal of this social interaction is to 1) identify scholarly research needs, 2) discuss design and functionalities, and, finally, 3) develop and embed new functionalities into a web application. This stage also includes bug reports, constant feedback, and suggestions on design improvement directly from scholars. The final stage involves a fully customized version of the web application.

¹ http://www.interactivetextminingsuite.com

   This paper is organized as follows. In Section 2 we introduce Shiny, a reactive web framework, and describe Interactive Text Mining Suite and its current functionalities. Section 3 presents our user-driven collaboration. Sections 4 and 5 give an overview of the development of customized functionalities for scholarly research, followed by conclusions and future directions presented in Section 6.

2. SHINY APPLICATION

2.1 Shiny Web Framework
   The traditional imperative web framework model was developed by Trygve Reenskaug in 1979 and followed a three-component structure: model, view, and controller [8]. In this model, the controller plays an essential and explicit role: “you have to specify what to do when you receive user requests and what resources you are going to mobilize to carry out the necessary tasks outlined in the model” [9]. In contrast, the recent shift toward a reactive web framework has removed such strict control, thus enabling dynamic systems that are highly responsive to users’ input and interaction. Shiny, an R package, is one such framework. After its release as an open source software package in 2012, its use has been expanding at an unprecedented rate. This trend can be attributed to the combination of several factors: 1) Shiny web applications do not require a knowledge of web development, 2) web applications are user-friendly and dynamic, allowing for instant feedback to users, 3) web applications are accessible via browser from any device, including mobile devices, which makes them convenient for users, and 4) web applications are highly customizable, allowing for instant modification, as compared to traditional software version releases. In the next section, we will briefly describe our recently developed Shiny application, namely Interactive Text Mining Suite.

2.2 Interactive Text Mining Suite
   Interactive Text Mining Suite (henceforth, ITMS) is designed to assist humanities scholars in the discovery of new insights and patterns within large digital collections, and to provide access to natural language processing techniques with a user-friendly design. Its major strength is the ability to work with data in various formats: PDF and text formats, as well as CSV, JSON, and XML, as shown in Figure 2.

Figure 2: Interactive Text Mining Suite: Importing data

In contrast, many existing text mining tools are limited to specific importing formats. Additionally, ITMS performs a wide range of common preprocessing tasks, allowing for maximum flexibility and user control, as illustrated in Figure 3 (for a more detailed description, see [10, 11]).

Figure 3: Interactive Text Mining Suite: Preprocessing data

3. USER-DRIVEN CUSTOMIZATION
   As mentioned in 2.2, ITMS was designed as a digital humanities tool suitable for performing common text mining tasks and visualization methods. That is, it was built for humanities scholars, but not by them. However, there exists a gap between scholars, who have been doing more qualitative text-based research for public and government sectors, and data science/computational linguistics scholars, who work on theoretical text-mining research.² To bridge this gap, we have developed a collaborative communication between these two communities (i.e., end-users and developers). Instead of a typical GitHub environment for reporting progress and issues, we chose a social collaborative platform, Rizzoma.³ Rizzoma is built as a knowledge-management and discussion platform allowing for real-time team communication and multimedia support. Figure 4 illustrates our collaborative project structure.


Figure 4: A collaborative platform Rizzoma: ITMS project

In the following sections, we describe our workflow and design considerations based on this collaboration.

4. DATA INPUT CONSTRAINT
   Based on the previous work with existing text mining tools, it was determined that the main obstacle for scholarly research is the inability to pre-define text excerpts within the text collection. It appears that stopword filtering and text preprocessing were not sufficient to obtain intuitive data interpretation for qualitative scholarly studies. Collaboratively, we have developed and tested the following algorithm (see also Figure 5):

  1. Parse document collection

  2. Scan every document for a specific term defined by the user (e.g., “security”) or two terms (e.g., “influenc*” within 10 words of “Europ*”)

  3. Define a window around these terms (e.g., 10 words to the left and 10 words to the right)

  4. Include only the extracted segments into data analysis and visualization

Figure 5: Development of data segmentation: window constraint

In addition, scholarly research collections are often stored and accessed via bibliographic management systems (e.g., Zotero, Mendeley, and EndNote). While most of these systems do not perform text mining analysis, the Zotero plugin Paper Machines⁴ offers a wide range of interactive visualizations for document collections. Nevertheless, the user cannot control text segmentation, which yields very broad topic and metadata visualizations. Given that Zotero is the main bibliographic management system in our collaborative project, data import from Zotero into ITMS became one of the most essential primary tasks for our team. Several options exist for exporting library collections from Zotero, namely rdf and csv formats. However, a few issues were discovered during the exploratory phase: 1) csv and rdf files only contain local paths to the actual documents (see Figure 6); 2) local paths cannot be accessed directly from a remote web application; 3) running ITMS locally would require R installation and some programming knowledge, thus generating technological hurdles for end-users.

Figure 6: Zotero collection: rdf file format (top) and Zotero internal directory structure (bottom)
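The four-step segmentation algorithm of this section (parse, scan for a user-defined term, define a window, keep only the extracted segments) can be sketched in a few lines. This is a minimal Python illustration rather than the ITMS implementation, which is written in R; the function names, the whitespace-insensitive tokenizer, and the treatment of `*` as a wildcard are assumptions made for the example.

```python
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Step 1 (simplified): split a document into word tokens."""
    return re.findall(r"\w+", text)


def matches(token, pattern):
    """Case-insensitive match; '*' is a wildcard, as in "influenc*"."""
    return fnmatchcase(token.lower(), pattern.lower())


def extract_segments(documents, term, window=10, near=None, near_distance=10):
    """Return text windows around every occurrence of `term`.

    If `near` is given, an occurrence is kept only when a token matching
    `near` appears within `near_distance` words of it (step 2, two-term case).
    Overlapping windows are returned separately; they are not merged.
    """
    segments = []
    for doc in documents:
        tokens = tokenize(doc)
        for i, tok in enumerate(tokens):
            if not matches(tok, term):
                continue
            if near is not None:
                lo = max(0, i - near_distance)
                neighbours = tokens[lo:i] + tokens[i + 1:i + near_distance + 1]
                if not any(matches(t, near) for t in neighbours):
                    continue
            # Step 3: `window` words to the left and to the right of the hit.
            start = max(0, i - window)
            segments.append(" ".join(tokens[start:i + window + 1]))
    # Step 4: only these segments would be passed on to analysis.
    return segments


docs = ["Russia seeks to influence European energy policy through gas exports."]
print(extract_segments(docs, "influenc*", window=2, near="Europ*"))
```

Returning overlapping windows separately keeps the sketch short; a production version would probably merge them so that the same words are not counted twice in later frequency analysis.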

² From a personal communication with The Hague Center for Strategic Studies (www.hcss.nl)
³ https://rizzoma.com
⁴ http://papermachines.org/

Two solutions are currently being tested based on the following criteria: 1) functionality and 2) the level of complexity. The first approach is the development of a small Shiny application installed locally that would process rdf library collections, export pdf files, convert them into text files, and place them into one directory, which can be accessed from our web application (see Figure 7).

Figure 7: Small Shiny application: local conversion of Zotero files

This application is used only once and has a low level of complexity, yet the functionality is less user-friendly, as it creates an additional directory with extracted files from Zotero. These files can then be imported into ITMS. The second approach was suggested by the end-users: export the Zotero library as a csv file, run a local script to extract all pdf files, and add their content to the csv file as plain text. While the functionality is high, the level of complexity is much higher.

5. DATA VISUALIZATION
   There is no doubt that visual analytics facilitates analytical reasoning [12]. For a tool developer, however, it is not always clear whether implemented visualization methods assist the user in their research. The current project proposes to address this issue by a collaborative examination of various visualization types in order to determine their usability for the Shiny application and for the end-users. First, we describe n-gram analysis, followed by interactive visualization and topic modeling visualization.

5.1 N-gram Visualization
   An N-gram allows users to identify the co-occurrence of words within a single text or text collection. The ‘n’ indicates the number of words being selected to create unigrams, bi-grams, tri-grams, etc. By using a combination of words, instead of a single occurrence, the user can generate higher quality results for data interpretation. Preferably, the user can also set a ‘search window’ within which co-occurrence should take place (i.e., within a certain number of words or certain documents). In addition, our current work is concentrated on interactive and more meaningful n-gram visualization (e.g., tree visualization), as compared to traditional static graphics (Figure 8). Based on our collaborative feedback, the tree N-gram visualization was identified as more meaningful for scholarly interpretation. This tree will share prefixes of N-grams (e.g., “airport”), each connected to the root node. The root node is the set of focus words selected in the query. Every path in the tree, i.e., a path from the root node to a leaf node, corresponds to the N-gram made of the words encountered along the path, and has the score associated with the leaf node. Another possible visualization is a network representation, where the central node is the key word. There exist multiple packages and tools that might be used to enhance n-gram interpretation, such as JSTORr, ngram, NSP, and WordStat, among many others. In order to identify the best fit for the web application, we address the following criteria: 1) user-friendliness, 2) easy human interpretation, and 3) functionality.

Figure 8: N-gram visualization with JSTORr package: populism

5.2 Interactive Visualization
   The ability to perform dynamic and interactive visualization is one of the strengths of reactive applications. While there are many R libraries implementing various types of interactive visualization, we decided to examine two packages, namely plot.ly and googleVis. Comparison and parallel testing feed our decision to implement their functionalities into ITMS. Table 1 presents our current summary.

 Types                       googleVis     plot.ly
 Stepped Area chart
 Bubble chart
 Gauge
 Intensity Map
 Geo Chart
 Table with pages
 Tree Map                                  NA
 Annotation chart
 Sankey chart
 Calendar chart                            NA
 Timeline chart
 Merging charts                            NA
 Flash charts                              NA
 Annotated time line chart
 Chord diagram               NA
 Filled Chord diagram        NA
 k-means clustering
 Stream Graph                NA
 PCA                         NA
 Hierarchical Clustering     NA
 Doughnut Chart

Table 1: Functional comparison of 2 R libraries: plot.ly and googleVis

After identifying their functionalities, our next step is to determine the best fit via our collaborative feedback.

5.3 Topic Modeling Story Telling
   Topic modeling is a statistical technique used in machine learning and natural language processing for discovering abstract topics that occur in a collection of documents. This analysis assists in “classification, novelty detection, summarization, and similarity and relevance judgements” [1]. While topic modeling results can be visualized in different forms, the most common form is a table (see Figure 9).
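As a small illustration of the table format, the sketch below ranks the words of each topic by weight and prints one row per topic. It assumes that a topic model has already produced per-topic word weights; the weights, topic labels, and function names are hypothetical and are not taken from ITMS.

```python
def top_words_table(topics, k=3):
    """Return (topic label, top-k words) rows from word weights.

    `topics` maps a topic label to a dict of word -> weight.
    Ties are broken alphabetically so the output is deterministic.
    """
    rows = []
    for label, weights in topics.items():
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        rows.append((label, [word for word, _ in ranked[:k]]))
    return rows


def format_table(rows):
    """Render the rows as a plain-text table, one topic per line."""
    width = max(len(label) for label, _ in rows)
    return "\n".join(
        f"{label.ljust(width)}  {', '.join(words)}" for label, words in rows
    )


# Hypothetical weights, for illustration only.
topics = {
    "Topic 1": {"energy": 0.40, "gas": 0.30, "price": 0.05},
    "Topic 2": {"border": 0.20, "security": 0.50},
}
print(format_table(top_words_table(topics, k=2)))
```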




Figure 9: Topic visualization in ITMS

   By comparing other software with their unique options for topic representation, two candidates for ITMS were identified: topic bubbling and topic coupling from MALLET, a topic modeling package. The goal of topic bubbling is to compare the relative importance of all the topics; the size of a topic bubble is the accumulated size of all word bubbles within that topic. In contrast, topic coupling reveals the relations between the topics based on their associated words. In this representation, topics are shown as a network of terms (nodes) linked by their interaction with other topics.

6. CONCLUSION
   In recent years, we have seen growing interest in the use of data visualization tools in the humanities. However, many of the existing tools are unable to integrate the humanistic component of exploratory research. Thus, the overarching goal of the current work on ITMS is to bridge the gap between tool developers and learned scholars by adding a user-customization component. In addition, the social interaction between scholars and data scientists has a strong potential to promote text mining methods among humanities scholars as well as to enhance the capabilities and functionality of visualization tools. We have also shown that the recent development of the reactive Shiny framework has facilitated the task of user-customization. On the one hand, a wide range of open source R libraries and its overall simplicity of deployment make the Shiny framework very accessible to inexperienced programmers. On the other hand, a Shiny application is a user-friendly web application, where users are not constrained by the limitations of their local computer memory or by platform dependency, as compared to other software tools. While this project only focuses on a collaboration with political/social science scholars, this idea can be extended to other fields. Below we summarize some of the possible implementations for future research:

  1. Teaching tool: The web application is developed in conjunction with the lesson plans, for example statistics modules. The collaboration can also be expanded by including students in the development and testing phases.

  2. Digital Humanities: Based on individual research, the web application can be augmented with additional visualization types, for example spatial or chronological maps for literature analysis.

  3. Social Science: The user could specify additional media for research and customize their appearance, e.g., tweets, blogs, or photos.

  4. Libraries: ITMS is unique in that, in addition to its ability to analyze data collections, it can add bibliographic metadata. As Jockers suggests, library metadata “has been largely untapped as a means of exploring literary history” and could “reveal useful information about literary trends” [4].

All these considerations and scholarly collaboration also present new opportunities for the field of data visualization and analytics, advancing our understanding of computation and human nature, namely “synthesis of computational and humanistic modes of inquiry” [6].

7. REFERENCES
 [1] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
 [2] K. Brodlie, D. Duce, J. Gallop, J. Walton, and J. Wood. Distributed and collaborative visualization. Computer Graphics Forum, 23:223–251, 2004.
 [3] J. Heer and M. Agrawala. Design Considerations for Collaborative Visual Analytics. Information Visualization, 7:49–62, 2008.
 [4] M. L. Jockers. Topics in the Digital Humanities: Macroanalysis: Digital Methods and Literary History. University of Illinois Press, Urbana, IL, USA, 2013.
 [5] D. Keim, F. Mansmann, J. Schneidewind, and H. Ziegler. Challenges in Visual Data Analysis. In Information Visualization (IV 2006), IEEE, pages 9–16, 2006.
 [6] L. Klein and J. Eisenstein. Reading Thomas Jefferson with TopicViz: Towards a Thematic Method for Exploring Large Cultural Archives, 2013.
 [7] L. Klein and J. Eisenstein. Reading Thomas Jefferson with TopicViz: Towards a Thematic Method for Exploring Large Cultural Archives. Scholarly and Research Communication, 4(3), 2013.
 [8] T. M. H. Reenskaug. Models-views-controllers. http://heim.ifi.uio.no/~trygver/1979/mvc-2/1979-12-MVC.pdf
 [9] B. B. Ribeiro. The two frameworks. https://github.com/rstudio/shiny/issues/250, 2017.
[10] O. Scrivner and J. Davis. Topic Modeling of Scholarly Articles: Interactive Text Mining Suite. In International Conference “Dialogue 2016”, 2016.
[11] O. Scrivner and J. Davis. Interactive Text Mining Suite: Data Visualization for Literary Studies. In the Workshop on Corpora in the Digital Humanities, pages 29–38, 2017.
[12] J. Thomas and K. Cook, editors. Illuminating the Path: The Research and Development Agenda for Visual Analytics. IEEE Press, 2005.
[13] F. Viégas and M. Wattenberg. Communication-minded visualization: A call to action. IBM Systems Journal, 45, 2006.
[14] T. White. Hadoop: The Definitive Guide. Storage and Analysis at Internet Scale. O’Reilly Media/Yahoo Press, Sebastopol, 3rd edition, 2012.
[15] J. Whitehead. Collaboration in Software Engineering: A Roadmap. In 2007 Future of Software Engineering, pages 214–225, Washington, DC, USA, 2007. IEEE Computer Society.

