Spiral me to the core: Getting a visual grasp on text corpora through clusters and keywords

Maren Scheffel, Katja Niemann, Sarah Leon Rojas
Fraunhofer FIT
Schloss Birlinghoven
53754 Sankt Augustin, Germany
{maren.scheffel, katja.niemann, sarah.leon.rojas}@fit.fraunhofer.de

Hendrik Drachsler, Marcus Specht
Open University of the Netherlands
Valkenburgerweg 177
6419 AT Heerlen, The Netherlands
{hendrik.drachsler, marcus.specht}@ou.nl

ABSTRACT
The amount of literature within a research domain is ever growing, thus making it difficult to stay on top of everything. Getting a grasp on the important topics of and areas within a domain or even knowing where to start is often tough and tedious. This paper therefore presents a visualization, the cluster spiral, that offers a fast but plain and simple way of exploring the content of large text collections.

Categories and Subject Descriptors
H.3.1 [Content Analysis and Indexing]: Linguistic processing; H.3.3 [Information Search and Retrieval]: Clustering, Information filtering; I.2.7 [Natural Language Processing]: Text analysis; I.5.3 [Clustering]: Algorithms; I.5.4 [Applications]: Text processing; I.7.5 [Document Capture]: Document analysis

General Terms
Algorithm, Visualization

Keywords
learning analytics, natural language processing, clustering, keyword extraction, visualization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Submitted to LAK Data Challenge 2014 held at LAK2014.
Copyright by the authors.

1. INTRODUCTION
One typical aspect of the world of research is the fact that the amount of literature being produced and published is growing every day. The more years pass, the more articles, papers, and books are available. Some research domains might have a slowly but steadily growing literature corpus while others grow rapidly. Looking only at the publications from the last year can be a fairly easy thing to do. But taking several years or even decades of publications into account when trying to get an overview of a chosen domain might prove rather difficult. When wanting to write a literature review within a certain domain of research or about a specific topic, it is thus often difficult to get a grasp on it and to know where to start. One way can be to rely on previous literature reviews. But when a topic spans several domains, several research communities and a longer period of time, it would be preferable to take all of that into account at the same time in order to get a feel for what one is dealing with. This paper therefore describes a fast and easy way of getting a grasp on a collection of publications, using the LAK Dataset of the LAK Challenge 2014¹ as an example corpus.

¹ http://lak.linkededucation.org/

2. THE LAK DATASET
The LAK Dataset contains a collection of structured data of several proceedings and journal volumes from the field of learning analytics and educational data mining [11]. The data have been processed according to Linked Data principles² and are thus available in machine readable format. As the data set includes the proceedings of the LAK conferences 2011-13, the proceedings of the EDM conferences 2008-13, plus some journal editions (in progress) of Educational Technology & Society (JETS) and the Journal of Educational Data Mining, it is ideal for our purpose. Currently, there are 462 papers, 853 distinct authors and 272 distinct institutions included in the dataset. The data are available in several formats: RDF/XML, R statistic software compatible, and via a SPARQL endpoint.

² http://www.w3.org/DesignIssues/LinkedData.html

The LAK Dataset has previously been used for the first LAK Challenge that took place during LAK2013 [1]. Derntl et al. [2] extract topic models and visualize topic dynamics and evolution over time with a special focus on how the introduction of the LAK conferences changed the topic dynamics of learning analytics and educational data mining. Fazeli et al. [3] look at socio-semantic networks of authors and papers within the learning analytics community in order to provide recommendations to users, e.g. conference attendees. Maturana et al. [4] use their gnoss platform to provide faceted search within the LAK Dataset and provide visualizations of geographical author and organization networks as well as paper evolution and distribution. Another visualization of topic evolution within the LAK and EDM community is presented by Milikic et al. [5] with their tool Paperista. One more social network analysis, this time with a focus on authors and institutions, is presented by Nawaz et al. [6]. The Cite4Me tool by Pereira Nunes et al. [7] offers search and recommendation functionalities within the LAK Dataset as well as reference datasets. Taibi et al. [12] analyze rhetorical patterns over time while Zouaq et al. [13] create an ontology of LAK and EDM based on concept mapping in order to compare the two communities.

While some of these publications also deal with topic and concept mining, they often either focus on the evolution over time or relations between individual papers, authors, institutions, etc. within the LAK and EDM community when visualizing their results.
Our approach, however, focuses on grouping a collection of publications based on their textual content and visualizing that clustered content rather than individual papers in order to get an overall impression of the collection in question. The LAK Dataset serves as one domain example for our analysis; the approach also works for other large collections of texts.

3. ANALYSIS
Our approach for the analysis and visualization of the LAK Dataset makes use of the RDF version and is based on the following ideas: in order to get a grasp on what a collection of papers is about, keywords play an important role. Keywords offer a superficial but still highly useful semantic representation of a text as they "represent in condensed form the essential content of a document" [9]. For our analysis, a keyword can be one word as well as a sequence of up to three words. Another important means to get an overview of a collection of documents and thus a better grasp on such a collection is clustering. Su et al. [10] define clustering as "a process of partitioning a dataset into groups, or clusters, so that elements of the same cluster are more similar to each other than to elements of different clusters". We therefore employ both methods and combine their results into a visualization that supports users in getting an overview of what large text collections are about.

We assumed that the keywords already provided within the LAK Dataset by the papers' authors would not be broad enough: they are assigned manually and are most likely based on a narrow word range typical for that research domain, and are thus not properly representative of the texts. We therefore did not want to rely on them and instead automatically extracted keywords from all papers' abstracts and bodies using the AlchemyAPI³. Their algorithm extracts keywords from any given text using statistical algorithms as well as natural language processing techniques and ranks the extracted keywords according to their relevance. Although the AlchemyAPI already makes use of a stop word list, e.g. words such as and, to, me, you, etc., we created our own stop word list as keywords such as learning analytics, educational data mining, data analysis, discussion, result, etc. would otherwise quite likely come up as keywords for the individual papers but would not help in forming a distinguishing semantic representation within the given collection. Were our approach to be used for another domain or for a more mixed one, the stop word list could easily be adapted.

³ http://www.alchemyapi.com/

In the next step we clustered the paper collection by calling the carrot2 Java API⁴. Three different clustering algorithms were available: Lingo, STC and k-means. We looked at all three algorithms and liked the clusters created by Lingo quite well at first sight. Unfortunately, however, Lingo as well as STC use soft clustering techniques, that is, they create overlapping clusters with papers possibly being assigned to more than one cluster. As the overlap is not limited to only a few documents but affects rather a lot of them, we decided not to use either of the two algorithms but to use the bisecting k-means algorithm offered by carrot2 instead, i.e. the algorithm starts with k = 2 and then always bisects the largest cluster until the final k is reached. The calculation of the clusters is based on the papers' abstracts and main text bodies. In addition to the clustering of the text collection, the carrot2 algorithm also calculates labels for every cluster. For our analysis we chose to work with two labels per cluster.

⁴ http://project.carrot2.org/

Finally, in a third step, a JSON file was created combining the keyword extraction results with the clustering results as a source for the visualization: for every cluster, the keywords of its papers are combined and sorted according to their rank. Then the ten keywords with the highest rank are kept for each cluster. A source file thus contains two labels, ten keywords and a list of the respective papers for each cluster. In order to offer users several views and to look at the dataset from different angles, we calculated clusters for several publication-year combinations.

4. VISUALIZATION
When dealing with the analysis of large amounts of text data, visualization is "of crucial importance in facilitating knowledge discovery, as well as providing a big picture overview of overwhelmingly large amounts of data" [8]. For our visualization we used the Data-Driven Documents D3.js framework⁵, i.e. a JavaScript library, paired with HTML, CSS and JQuery to process the previously created JSON files.

⁵ http://d3js.org/

Figure 1 shows the default starting view of our visualization⁶: the clusters for all publications from all years in the LAK Dataset. On this starting page, the users can choose the publication(s) and the year(s) they want to visualize. They can either choose each publication individually (i.e. LAK, EDM or JETS) or all of them together. When all publications are chosen, the user can choose between individual years or all years combined. If a single publication is chosen, only all of its years can be chosen for display. This adds up to a total of ten possible combinations.

⁶ The visualization is available at http://mitarbeiter.fit.fraunhofer.de/~niemann/LAKchallenge2014/

Figure 1: Start view of the visualization

The clusters take the form of circles and are ordered according to their size in the form of a spiral, with the largest cluster having the largest circle and being positioned at the outside of the spiral and the smallest cluster being in the middle of the spiral. In addition to size and position, every cluster also has its own color and is labeled with the two terms calculated by the carrot2 algorithm.

By clicking on a cluster, the view changes and the visualization zooms into the chosen cluster. Next to it a list of all the papers in that specific cluster is given, showing the papers' titles and the publications they were taken from. The titles in that list are linked to a Google search for the respective paper so that users can immediately take a closer look at it if needed. Figure 2 shows the cluster labeled Analytics/Institutions.

Figure 2: Cluster view and paper list

Once the users have zoomed into a cluster, the keywords of that cluster become visible. Figure 2 shows that the AlchemyAPI algorithm indeed extracts single words as well as word sequences, e.g. tool, student success, online learning environments, etc. A very common visualization method for keywords is the tag cloud as "a tag cloud is highly effective in summarizing large amounts of text in an easily readable, and understandable, visual manner" [8]. In order to continue the circle approach used for the clusters, however, we adapted the common usage of font size, coloring and word positioning in tag clouds and used sized and spirally ordered circles instead: the more often a keyword appears in a cluster, the larger and the further out in the spiral its circle is.

Clicking on a keyword circle results in a new list next to the visualization. All papers represented by that keyword within that cluster are given, followed by a list of papers from other clusters that also have the chosen word as a keyword. The lists are also color-coded and the keyword circle is highlighted in all clusters so that the corresponding cluster(s) can be found more easily. Figure 3 shows the keyword paper list for the keyword activity of the Social/Network cluster and three other ones.

Figure 3: Overview with highlighted keywords and corresponding paper list

5. DISCUSSION AND CONCLUSION
Looking at the visualization of the whole dataset, i.e. with all publications from all years taken into account at the same time, the fourteen clusters and their labels offer a good overview of what the text collection is about. For example, we can see that social networks, teachers and institutions play an important role, but skills, courses and clusters are important topics within the research area as well. When zooming into the clusters and looking at their paper lists, it is noticeable that many of the lists contain far more papers from EDM than from LAK. Only two of the clusters are dominated by LAK papers while ten are dominated by those from EDM. This effect, however, is mainly due to the fact that there are about three times more papers from EDM than from LAK. After normalizing the numbers, about half of the clusters are still dominated by one publication type (two by LAK and five by EDM) and the other half is split between them. In general one can say that LAK papers share their topic range quite well with JETS and EDM as only the Social/Network and the Analytics/Institutions clusters are dominated by LAK papers. Some EDM topics, however, seem to be more exclusive and specific to EDM, e.g. Skill/Parameters and Detector/Game, as many of the five EDM-dominated clusters contain no or very few papers from LAK or JETS. Topics common to LAK as well as to EDM are, among others, Clusters/Features, Teachers/Concept and User/Visualization.

Another result that the visualization provides becomes clear when inspecting the clusters' keywords more closely. For many clusters the keywords cover aspects of a domain, an approach, a goal, the data used and the stakeholders involved. Take the Analytics/Institutions cluster for example: the keyword higher education tells us the domain that is important for this cluster, and we can also see that the approaches of social network analysis and machine learning play a role. As for the goals that this cluster deals with, there are learning process and student success, and the data analyzed is student data coming from online learning environments and LMSs. Taking the Course/Grade cluster as a second example, we can see that it deals with the approaches of formative evaluation and classification algorithms that are applied to data taken from online learning activities in online courses, submissions, assignments and posts in order to supply predictive models dealing with final grades to instructors.

These two analyses offer a first step to getting a grasp on the main research topics of the learning analytics and educational data mining literature, including their commonalities and differences. We will use the cluster spiral to delve further into these domains and plan to provide an extensive review that is based on the publications' essential characteristics, e.g. application domain, stakeholders, methodologies, and goals. For scientists new to these communities such a review can offer an entry point to the field. It is also useful to bridge the gap between the LAK and EDM communities and provide researchers from one side with insight into the other. A third valuable aspect of a literature review would be the retrieval of new and important research questions.

6. REFERENCES
[1] M. d'Aquin, S. Dietze, H. Drachsler, E. Herder, and D. Taibi. Proceedings of the LAK Data Challenge, volume 974. CEUR Workshop Proceedings, Leuven, Belgium, 2013.
[2] M. Derntl, N. Günnemann, and R. Klamma. A dynamic topic model of learning analytics research. 2013. In [1].
[3] S. Fazeli, H. Drachsler, and P. Sloep. Socio-semantic networks of research publications in the learning analytics community. 2013. In [1].
[4] R. Maturana, M. Alvarado, S. López-Sola, M. Ibañez, and L. Ruiz Elósegui. Linked data based applications for learning analytics research: faceted searches, enriched contexts, graph browsing and dynamic graphic visualisation of data. 2013. In [1].
[5] N. Milikic, U. Krcadinac, J. Jovanovic, B. Brankov, and S. Keca. Paperista: Visual exploration of semantically annotated research papers. 2013. In [1].
[6] S. Nawaz, F. Marbouti, and J. Strobel. Analysis of the community of learning analytics. 2013. In [1].
[7] B. Pereira Nunes, B. Fetahu, and M. Casanova. Cite4me: Semantic retrieval and analysis of scientific publications. 2013. In [1].
[8] A. A. Puretskiy, G. L. Shutt, and M. W. Berry. Survey of text visualization techniques. In M. W. Berry and J. Kogan, editors, Text Mining: Applications and Theory, pages 107–127. John Wiley & Sons, Ltd, 2010.
[9] S. Rose, D. Engel, N. Cramer, and W. Cowley. Automatic keyword extraction from individual documents. In M. W. Berry and J. Kogan, editors, Text Mining: Applications and Theory, pages 3–20. John Wiley & Sons, Ltd, 2010.
[10] Z. Su, J. Kogan, and C. Nicholas. Constrained clustering with k-means type algorithms. In M. W. Berry and J. Kogan, editors, Text Mining: Applications and Theory, pages 81–103. John Wiley & Sons, Ltd, 2010.
[11] D. Taibi and S. Dietze. Fostering analytics on learning analytics research: the lak dataset. 2013. In [1].
[12] D. Taibi, Á. Sándor, D. Simsek, S. Buckingham Shum, A. Deliddo, and R. Ferguson. Visualizing the lakedm literature using combined concept and rhetorical sentence extraction. 2013. In [1].
[13] A. Zouaq, S. Joksimović, and D. Gašević. Ontology learning to analyze research trends in learning analytics publications. 2013. In [1].
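
As an illustration of the clustering step in Section 3, the bisecting k-means procedure (start with k = 2, then always bisect the largest cluster until the final k is reached) can be sketched as follows. This is a minimal Python sketch and not the carrot2 implementation: it assumes that every paper has already been converted into a numeric feature vector (for example term weights), and the function names, the seed parameter and the plain 2-means split are illustrative choices that do not come from the paper.

```python
import random


def _dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, rng, iterations=20):
    """Split at least two points into two groups with plain 2-means."""
    centers = rng.sample(points, 2)
    # Fallback split, used only if the very first assignment is degenerate.
    left, right = points[:1], points[1:]
    for _ in range(iterations):
        new_left, new_right = [], []
        for p in points:
            closer_to_first = _dist2(p, centers[0]) <= _dist2(p, centers[1])
            (new_left if closer_to_first else new_right).append(p)
        if not new_left or not new_right:
            break  # degenerate assignment: keep the previous split
        left, right = new_left, new_right
        centers = [_mean(left), _mean(right)]
    return left, right


def bisecting_kmeans(points, k, seed=0):
    """Bisect the largest cluster until k clusters exist (hard clustering)."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        if len(largest) < 2:
            clusters.append(largest)
            break  # nothing left to split
        clusters.extend(two_means(largest, rng))
    return clusters
```

Because every paper ends up in exactly one cluster, the result is a hard partition, which is the property that motivated the choice of bisecting k-means over Lingo and STC.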
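
The construction of the JSON source file described in Section 3 (combine the keywords of a cluster's papers, sort them by rank, keep the ten best, and store them together with two labels and the paper list) can be sketched in a similar way. The sketch below is written under stated assumptions: the input structures, the JSON field names and the stop word set are hypothetical, the extractor's relevance score is used as the rank, and a keyword that occurs in several papers is merged by keeping its highest relevance and counting the papers it occurs in. The paper does not specify these details.

```python
import json
from collections import defaultdict

# Hypothetical excerpt of a domain-specific stop word list.
STOP_WORDS = {"learning analytics", "educational data mining",
              "data analysis", "discussion", "result"}


def build_source(papers, clusters, top_n=10):
    """Combine keyword and clustering results into one structure.

    papers:   {paper_id: {"title": str, "keywords": [(text, relevance), ...]}}
    clusters: [{"labels": [str, ...], "papers": [paper_id, ...]}, ...]
    """
    source = []
    for cluster in clusters:
        best = {}                 # keyword -> highest relevance in the cluster
        count = defaultdict(int)  # keyword -> number of papers that carry it
        for pid in cluster["papers"]:
            seen = set()
            for text, relevance in papers[pid]["keywords"]:
                key = text.lower()
                if key in STOP_WORDS or key in seen:
                    continue
                seen.add(key)
                best[key] = max(best.get(key, 0.0), relevance)
                count[key] += 1
        # Highest relevance first; ties are broken alphabetically.
        ranked = sorted(best, key=lambda k: (-best[k], k))[:top_n]
        source.append({
            "labels": cluster["labels"][:2],
            "keywords": [{"text": k, "relevance": best[k], "papers": count[k]}
                         for k in ranked],
            "papers": [{"id": pid, "title": papers[pid]["title"]}
                       for pid in cluster["papers"]],
        })
    return source


def write_source(path, papers, clusters):
    """Write the combined structure to a JSON file for the visualization."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_source(papers, clusters), handle, indent=2)
```

The per-keyword paper count is kept because the visualization sizes keyword circles by how often a keyword appears in a cluster.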
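
The spiral arrangement from Section 4 (smallest cluster in the middle, largest cluster with the largest circle on the outside) can be approximated with a simple Archimedean-style layout. The actual visualization is implemented with D3.js; the Python sketch below only illustrates the ordering idea, and the step sizes, the area-proportional circle radius and the function name are assumptions rather than values taken from the paper. It makes no attempt to prevent circles from overlapping.

```python
import math


def spiral_layout(sizes, angle_step=math.radians(50), radial_step=30.0, scale=6.0):
    """Place one circle per cluster on a spiral, ordered from the center outward.

    sizes: {label: number_of_papers}
    Returns a list of dicts with the label, the center (x, y) and the radius r.
    """
    # Smallest cluster first, so that it ends up in the middle of the spiral.
    ordered = sorted(sizes.items(), key=lambda item: (item[1], item[0]))
    layout = []
    for i, (label, size) in enumerate(ordered):
        theta = i * angle_step   # angle grows with every step
        dist = i * radial_step   # and so does the distance from the center
        layout.append({
            "label": label,
            "x": dist * math.cos(theta),
            "y": dist * math.sin(theta),
            "r": scale * math.sqrt(size),  # circle area proportional to size
        })
    return layout
```

The same function could be reused for the keyword circles inside a cluster by passing keyword frequencies instead of paper counts.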
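
The normalization mentioned in Section 5 is not spelled out in the paper. One plausible reading, sketched below, divides the number of a venue's papers in a cluster by the total number of papers of that venue and only calls a cluster dominated when the leading normalized share exceeds the runner-up by a margin. Both the margin of 1.5 and the function name are hypothetical.

```python
def dominant_venue(cluster_counts, venue_totals, margin=1.5):
    """Return the venue that dominates a cluster after normalization, or None.

    cluster_counts: {venue: papers of that venue in the cluster}
    venue_totals:   {venue: papers of that venue in the whole corpus}
                    (at least two venues, all totals greater than zero)
    """
    shares = {venue: cluster_counts.get(venue, 0) / total
              for venue, total in venue_totals.items()}
    ranked = sorted(shares, key=shares.get, reverse=True)
    top, second = ranked[0], ranked[1]
    if shares[top] > 0 and shares[top] >= margin * shares[second]:
        return top
    return None  # the cluster is shared between venues
```

With a corpus in which one venue has three times as many papers as another, a cluster holding 10 papers of the smaller venue and 15 of the larger one counts as dominated by the smaller venue under this reading, although the raw counts suggest the opposite.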