Spiral me to the core: Getting a visual grasp on text
corpora through clusters and keywords

Maren Scheffel, Katja Niemann, Sarah Leon Rojas
Fraunhofer FIT
Schloss Birlinghoven
53754 Sankt Augustin, Germany
{maren.scheffel, katja.niemann, sarah.leon.rojas}@fit.fraunhofer.de

Hendrik Drachsler, Marcus Specht
Open University of the Netherlands
Valkenburgerweg 177
6419 AT Heerlen, The Netherlands
{hendrik.drachsler, marcus.specht}@ou.nl

ABSTRACT
The amount of literature within a research domain is ever growing, thus
making it difficult to stay on top of everything. Getting a grasp on the
important topics of and areas within a domain or even knowing where to start
is often tough and tedious. This paper therefore presents a visualization,
namely a cluster spiral, that offers a fast but plain and simple way of
exploring the content of large text collections.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.3
[Information Search and Retrieval]: Clustering, Information filtering; I.2.7
[Natural Language Processing]: Text analysis; I.5.3 [Clustering]: Algorithms;
I.5.4 [Applications]: Text processing; I.7.5 [Document Capture]: Document
analysis

General Terms
Algorithm, Visualization

Keywords
learning analytics, natural language processing, clustering, keyword
extraction, visualization

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
Submitted to LAK Data Challenge 2014 held at LAK2014.
Copyright by the authors.

1. INTRODUCTION
   One typical aspect of the world of research is the fact that the amount of
literature being produced and published is growing every day. The more years
pass, the more articles, papers, and books are available. Some research
domains might have a slowly but steadily growing literature corpus while
others grow rapidly. Looking only at the publications from the last year can
be a fairly easy thing to do. But taking several years or even decades of
publications into account when trying to get an overview of a chosen domain
might prove rather difficult. When wanting to write a literature review within
a certain domain of research or about a specific topic, it is thus often
difficult to get a grasp on it and to know where to start. One way can be to
rely on previous literature reviews. But when a topic spans several domains,
several research communities and a longer period of time, it would be
preferable to take all of that into account at the same time in order to get a
feel for what one is dealing with. This paper therefore describes a fast and
easy way of getting a grasp on a collection of publications, using the LAK
Dataset of the LAK Challenge 2014¹ as an example corpus.

¹ http://lak.linkededucation.org/

2. THE LAK DATASET
   The LAK Dataset contains a collection of structured data of several
proceedings and journal volumes from the field of learning analytics and
educational data mining [11]. The data have been processed according to Linked
Data principles² and are thus available in machine-readable format. As the
dataset includes the proceedings of the LAK conferences 2011-13, the
proceedings of the EDM conferences 2008-13, plus some journal editions (in
progress) of Educational Technology & Society and the Journal of Educational
Data Mining, it is ideal for our purpose. Currently, there are 462 papers, 853
distinct authors and 272 distinct institutions included in the dataset. The
data are available in several formats: RDF/XML, a format compatible with the R
statistics software, and via a SPARQL endpoint.

² http://www.w3.org/DesignIssues/LinkedData.html

   The LAK Dataset has previously been used for the first LAK Challenge that
took place during LAK2013 [1]. Derntl et al. [2] extract topic models and
visualize topic dynamics and evolution over time with a special focus on how
the introduction of the LAK conferences changed the topic dynamics of learning
analytics and educational data mining. Fazeli et al. [3] look at
socio-semantic networks of authors and papers within the learning analytics
community in order to provide recommendations to users, e.g. conference
attendees. Maturana et al. [4] use their gnoss platform to provide faceted
search within the LAK Dataset and provide visualizations of geographical
author and organization networks as well as paper evolution and distribution.
Another visualization of topic evolution within the LAK and EDM community is
presented by Milikic et al. [5] with their tool Paperista. One more social
network analysis, this time with a focus on authors and institutions, is
presented by Nawaz et al. [6]. The Cite4Me tool by Pereira Nunes et al. [7]
offers search and recommendation functionalities within the LAK Dataset as
well as reference datasets. Taibi et al. [12] analyze rhetorical patterns over
time while Zouaq et al. [13] create an ontology of LAK and EDM based on
concept mapping in order to compare the two communities.
   While some of these publications also deal with topic and concept mining,
they often focus either on the evolution over time or on relations between
individual papers, authors, institutions, etc. within the LAK and EDM
community when visualizing their results. Our approach, however, focuses on
grouping a collection of publications based on their textual content and
visualizing that clustered content rather than individual papers in order to
get an overall impression of the collection in question. The LAK Dataset
serves as just one example domain, as our approach also works for other large
collections of texts.

Figure 1: Start view of the visualization

3. ANALYSIS
   Our approach for the analysis and visualization of the LAK Dataset makes
use of the RDF version and is based on the following ideas: in order to get a
grasp on what a collection of papers is about, keywords play an important
role. Keywords offer a superficial but still highly useful semantic
representation of a text as they "represent in condensed form the essential
content of a document" [9]. For our analysis, a keyword can be one word as
well as a sequence of up to three words. Another important means to get an
overview of a collection of documents and thus a better grasp on such a
collection is clustering. Su et al. [10] define clustering as "a process of
partitioning a dataset into groups, or clusters, so that elements of the same
cluster are more similar to each other than to elements of different
clusters". We therefore employ both methods and combine their results into a
visualization that supports users in getting an overview of what large text
collections are about.
   We did not want to rely on the keywords already provided within the LAK
Dataset by the papers' authors, as we assumed that they would not be broad
enough: they are assigned manually and are most likely based on a narrow word
range typical for that research domain, and are thus not properly
representative of the texts. We therefore automatically extracted keywords
from all papers' abstracts and bodies using the AlchemyAPI³. Its algorithm
extracts keywords from any given text using statistical algorithms as well as
natural language processing techniques and ranks the extracted keywords
according to their relevance. Although the AlchemyAPI already makes use of a
stop word list, e.g. words such as "and", "to", "me", "you", etc., we created
our own stop word list, as keywords such as "learning analytics", "educational
data mining", "data analysis", "discussion", "result", etc. would otherwise
quite likely come up as keywords for the individual papers but would not help
to form a distinguishing semantic representation within the given collection.
Were our approach to be used for another domain or for a more mixed one, the
stop word list could easily be adapted.

³ http://www.alchemyapi.com/

   In the next step we clustered the paper collection by calling the carrot2
Java API⁴. Three different clustering algorithms were available: Lingo, STC
and k-means. We looked at all three algorithms and at first sight quite liked
the clusters created by Lingo. Unfortunately, however, both Lingo and STC use
soft clustering techniques, that is, they create overlapping clusters with
papers possibly being assigned to more than one cluster. As the overlap is not
limited to only a few documents but affects rather a lot of them, we decided
to use neither of the two and instead chose the bisecting k-means algorithm
offered by carrot2, i.e. the algorithm starts with k = 2 and then always
bisects the largest cluster until the final k is reached. The calculation of
the clusters is based on the papers' abstracts and main text bodies. In
addition to the clustering of the text collection, the carrot2 algorithm also
calculates labels for every cluster. For our analysis we chose to work with
two labels per cluster.

⁴ http://project.carrot2.org/

   Finally, in a third step, a JSON file was created combining the keyword
extraction results with the clustering results as a source for the
visualization: for every cluster, the keywords of its papers are combined and
sorted according to their rank. Then the ten keywords with the highest rank
are kept for each cluster. A source file thus contains two labels, ten
keywords and a list of the respective papers for each cluster. In order to
offer users several views and to look at the dataset from different angles, we
calculated clusters for several publication-year combinations.

4. VISUALIZATION
   When dealing with the analysis of large amounts of text data, visualization
is "of crucial importance in facilitating knowledge discovery, as well as
providing a big picture overview of overwhelmingly large amounts of data" [8].
For our visualization we used the Data-Driven Documents D3.js framework⁵, i.e.
a JavaScript library, paired with HTML, CSS and jQuery to process the
previously created JSON files.

⁵ http://d3js.org/

   Figure 1 shows the default starting view of our visualization⁶: the
clusters for all publications from all years in the LAK Dataset. On this
starting page, the users can choose the publication(s) and the year(s) they
want to visualize. They can either choose each publication individually (i.e.
LAK, EDM or JETS) or all of them together. When all publications are chosen,
the user can choose between individual years or all years combined. If a
single publication is chosen, only all of its years combined can be displayed.
This adds up to a total of ten possible combinations.

⁶ The visualization is available at
  http://mitarbeiter.fit.fraunhofer.de/~niemann/LAKchallenge2014/

   The clusters take the form of circles and are ordered according to their
size in the form of a spiral, with the largest cluster having the largest
circle and being positioned at the outside of the spiral and the smallest
cluster being in the middle of the spiral. In addition to size and position,
every cluster also has its own color and is labeled with the two terms
calculated by the carrot2 algorithm.
   By clicking on a cluster, the view changes and the visualization zooms into
the chosen cluster. Next to it a list of all the papers in that specific
cluster is given, showing the papers' titles and the publications they were
taken from. The titles in that list are linked to a Google search for the
respective paper so that users can immediately take a closer look at it if
needed. Figure 2 shows the cluster labeled Analytics/Institutions.

Figure 2: Cluster view and paper list

   Once the users have zoomed into a cluster, the keywords of that cluster
become visible. Figure 2 shows that the AlchemyAPI algorithm indeed extracts
single words as well as word sequences, e.g. "tool", "student success",
"online learning environments", etc. Tag clouds are a very common
visualization method for keywords as "a tag cloud is highly effective in
summarizing large amounts of text in an easily readable, and understandable,
visual manner" [8]. In order to continue the circle approach used for the
clusters, however, we adapted the common usage of font size, coloring and word
positioning in tag clouds and used sized and spirally ordered circles instead:
the more often a keyword appears in a cluster, the larger and the further out
in the spiral its circle is.
   Clicking on a keyword circle results in a new list next to the
visualization. All papers represented by that keyword within that cluster are
given, followed by a list of papers from other clusters that also have the
chosen word as a keyword. The lists are also color-coded and the keyword
circle is highlighted in all clusters so that the corresponding cluster(s) can
be found more easily. Figure 3 shows the keyword paper list for the keyword
"activity" in the Social/Network cluster and three other clusters.

5. DISCUSSION AND CONCLUSION
   Looking at the visualization of the whole dataset, i.e. with all
publications from all years taken into account at the same time, the fourteen
clusters and their labels offer a good overview of what the text collection is
about. For example, we can see that social networks, teachers and institutions
play an important role, but skills, courses and clusters are important topics
within the research area as well. When zooming into the clusters and looking
at their paper lists, it is noticeable that many of the lists contain
considerably more papers from EDM than from LAK. Only two of the clusters are
dominated by LAK papers while ten are dominated by those from EDM. This
effect, however, is mainly due to the fact that there are about three times as
many papers from EDM as from LAK. After normalizing the numbers, about half of
the clusters are still dominated by one publication type (two by LAK and five
by EDM) and the other half is split between them. In general one can say that
LAK papers share their topic range quite well with JETS and EDM, as only the
Social/Network and the Analytics/Institutions clusters are dominated by LAK
papers. Some EDM topics, however, seem to be more exclusive and specific to
EDM, e.g. Skill/Parameters and Detector/Game, as many of the five
EDM-dominated clusters contain no or very few papers from LAK or JETS. Topics
common to LAK as well as to EDM are, among others, Clusters/Features,
Teachers/Concept and User/Visualization.
   Another result that the visualization provides becomes clear when
inspecting the clusters' keywords more closely. For many clusters the keywords
cover aspects of a domain,
an approach, a goal, the data used and the stakeholders involved. Take the
Analytics/Institutions cluster for example: the keyword "higher education"
tells us the domain that is important for this cluster; we can also see that
the approaches of social network analysis and machine learning play a role. As
for the goals that this cluster deals with, there are learning process and
student success, and the data analyzed is student data coming from online
learning environments and LMSs. Taking the Course/Grade cluster as a second
example, we can see that it deals with the approaches of formative evaluation
and classification algorithms that are applied to data taken from online
learning activities in online courses, submissions, assignments and posts in
order to supply predictive models dealing with final grades to instructors.

Figure 3: Overview with highlighted keywords and corresponding paper list

   These two analyses offer a first step to getting a grasp on the main
research topics of the learning analytics and educational data mining
literature, including their commonalities and differences. We will use the
cluster spiral to delve further into these domains and plan to provide an
extensive review that is based on the publications' essential characteristics,
e.g. application domain, stakeholders, methodologies, and goals. For
scientists new to these communities such a review can offer an entry point to
the field. It is also useful for bridging the gap between the LAK and EDM
communities and for providing researchers from one side with insight into the
other. A third valuable aspect of a literature review would be the retrieval
of new and important research questions.

6. REFERENCES
 [1] M. d'Aquin, S. Dietze, H. Drachsler, E. Herder, and D. Taibi.
     Proceedings of the LAK Data Challenge, volume 974. CEUR Workshop
     Proceedings, Leuven, Belgium, 2013.
 [2] M. Derntl, N. Günnemann, and R. Klamma. A dynamic topic model of
     learning analytics research. 2013. In [1].
 [3] S. Fazeli, H. Drachsler, and P. Sloep. Socio-semantic networks of
     research publications in the learning analytics community. 2013. In [1].
 [4] R. Maturana, M. Alvarado, S. López-Sola, M. Ibañez, and L. Ruiz
     Elósegui. Linked data based applications for learning analytics
     research: faceted searches, enriched contexts, graph browsing and
     dynamic graphic visualisation of data. 2013. In [1].
 [5] N. Milikic, U. Krcadinac, J. Jovanovic, B. Brankov, and S. Keca.
     Paperista: Visual exploration of semantically annotated research papers.
     2013. In [1].
 [6] S. Nawaz, F. Marbouti, and J. Strobel. Analysis of the community of
     learning analytics. 2013. In [1].
 [7] B. Pereira Nunes, B. Fetahu, and M. Casanova. Cite4Me: Semantic
     retrieval and analysis of scientific publications. 2013. In [1].
 [8] A. A. Puretskiy, G. L. Shutt, and M. W. Berry. Survey of text
     visualization techniques. In M. W. Berry and J. Kogan, editors, Text
     Mining: Applications and Theory, pages 107–127. John Wiley & Sons, Ltd,
     2010.
 [9] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword
     extraction from individual documents. In M. W. Berry and J. Kogan,
     editors, Text Mining: Applications and Theory, pages 3–20. John Wiley &
     Sons, Ltd, 2010.
[10] Z. Su, J. Kogan, and C. Nicholas. Constrained clustering with k-means
     type algorithms. In M. W. Berry and J. Kogan, editors, Text Mining:
     Applications and Theory, pages 81–103. John Wiley & Sons, Ltd, 2010.
[11] D. Taibi and S. Dietze. Fostering analytics on learning analytics
     research: the LAK dataset. 2013. In [1].
[12] D. Taibi, Á. Sándor, D. Simsek, S. Buckingham Shum, A. Deliddo, and
     R. Ferguson. Visualizing the LAK/EDM literature using combined concept
     and rhetorical sentence extraction. 2013. In [1].
[13] A. Zouaq, S. Joksimović, and D. Gašević. Ontology learning to analyze
     research trends in learning analytics publications. 2013. In [1].
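
The third analysis step described in Section 3 (merging the keyword and
cluster results into a JSON source file) can be illustrated with a minimal
Python sketch. It is not the authors' implementation: the field names, the toy
input and the choice to rank a keyword by its highest relevance score across a
cluster's papers are assumptions made for illustration, and the stop word
entries are only the examples named in Section 3.

```python
import json
from collections import defaultdict

# Hypothetical custom stop-word list; entries are the examples from Section 3.
STOP_KEYWORDS = {"learning analytics", "educational data mining",
                 "data analysis", "discussion", "result"}


def build_cluster_entry(labels, papers, top_n=10):
    """Combine the keywords of a cluster's papers and keep the top_n.

    labels: the two cluster labels from the clustering step.
    papers: list of dicts with the assumed fields "title", "source" and
            "keywords", the latter a list of (keyword, relevance) pairs.
    """
    best = defaultdict(float)   # highest relevance seen per keyword (assumption)
    count = defaultdict(int)    # number of papers carrying the keyword
    for paper in papers:
        seen = set()
        for keyword, relevance in paper["keywords"]:
            key = keyword.lower()
            if key in STOP_KEYWORDS or key in seen:
                continue
            seen.add(key)
            best[key] = max(best[key], relevance)
            count[key] += 1
    # Sort by relevance, highest first; ties are broken alphabetically.
    ranked = sorted(best, key=lambda k: (-best[k], k))[:top_n]
    return {
        "labels": list(labels),
        "keywords": [{"text": k, "relevance": best[k], "papers": count[k]}
                     for k in ranked],
        "papers": [{"title": p["title"], "source": p["source"]}
                   for p in papers],
    }


# Toy input, invented purely to exercise the function.
example_papers = [
    {"title": "Paper A", "source": "LAK",
     "keywords": [("student success", 0.9), ("learning analytics", 0.8),
                  ("tool", 0.4)]},
    {"title": "Paper B", "source": "EDM",
     "keywords": [("higher education", 0.7), ("student success", 0.6)]},
]

entry = build_cluster_entry(("Analytics", "Institutions"), example_papers)
print(json.dumps(entry, indent=2))
```

Taking the maximum relevance is only one way to combine per-paper ranks;
summing or averaging the scores would favor keywords shared by many papers.
The per-keyword paper count is kept because the circle size in the
visualization depends on how often a keyword appears in a cluster.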