A Dynamic Topic Model of Learning Analytics Research

Michael Derntl, Nikou Günnemann, Ralf Klamma
RWTH Aachen University, Advanced Community Information Systems (ACIS), Aachen, Germany
derntl@dbis.rwth-aachen.de, nikou@dbis.rwth-aachen.de, klamma@dbis.rwth-aachen.de

ABSTRACT
Research on learning analytics and educational data mining has been published since the first conference on Educational Data Mining (EDM) in 2008 and gained momentum through the establishment of the Learning Analytics and Knowledge (LAK) conference in 2011. This paper addresses the LAK Data Challenge from the perspective of visual analytics of topic dynamics in the LAK Dataset between 2008 and 2012. The data set was processed using probabilistic, dynamic topic mining algorithms. To enable exploration and visual analysis of the resulting topic model by LAK researchers and stakeholders we developed and deployed D-VITA, a web-based browsing tool for dynamic topic models. In this paper we explore answers to the questions about past, present, and future of LAK posed in the Data Challenge based on a topic model of all papers in the LAK Dataset. We also briefly describe how users can explore the LAK topic model on their own using D-VITA.

Copyright 2013 by the authors

1. OBJECTIVES
The LAK Data Challenge called for contributions to make sense of the field of learning analytics including its "roots, current state, and future trends, based on how its members report and debate their research"¹. This paper tackles the challenge by presenting facts obtained from statistical analyses of the paper full texts included in the provided LAK Dataset [7]. The main contributions are as follows:

1. A dynamic topic model was computed using the approach presented in [3]. Using this dynamic topic model we explore in Section 4 three questions about the evolution of topics in the LAK Dataset to distill knowledge about past, present and future of LAK research.
2. In Section 5 we describe the visual analytics application D-VITA², which puts the toolkit to answer the questions posed in the LAK Data Challenge into the user's hands. D-VITA is a web-based tool that offers topic-based views on the LAK Dataset using a point-and-click metaphor and simple visualizations.

¹ http://solaresearch.org/events/lak/lak-data-challenge/
² http://monet.informatik.rwth-aachen.de/DVita/?id=16

2. DATASET AND PREPROCESSING
The LAK Dataset underlying the analyses presented in this paper includes the EDM conference proceedings 2008–2012 (239 papers), the LAK conference proceedings 2011–2012 (66 papers), and the papers of the 2012 Special Issue on Learning Analytics in the Educational Technology and Society journal (10 papers; hereafter referred to as ETS). The RDF representation of the LAK Dataset was processed by a script that extracted for each paper the identifier, venue ∈ {LAK, EDM, ETS}, year of publication, title, authors, abstract, full text, and hyperlink to the full RDF description on data.linkededucation.org. The distribution of the 315 papers over time and venues is given in Figure 1.

Figure 1: Yearly distribution of papers over venues

In the next preprocessing step the paper records were cleaned by removing stopwords and by applying stemming methods on the included word sets. For word stemming we used the Porter Stemming technique [6], which is well established for this purpose. As a result, close to 5000 distinct word stems were identified as being used in the 315 papers.

3. DYNAMIC TOPIC MINING
From a text mining perspective the LAK Dataset represents a text corpus in which a set of words is used in a set of papers. To identify what is relevant to LAK research, we used the dynamic topic modeling approach described in [3] to obtain the distribution of words over a pre-defined number of topics. This is a probabilistic, unsupervised machine learning approach that has been gaining increasing prominence recently [2]. In these probabilistic topic models a topic is a distribution of words, so each topic is typically represented by its most frequently occurring words. Topic mining also obtains the distribution of these topics over the papers in the data set. Dynamic topic mining applies these analysis steps using several consecutive time slices in the data set. For the LAK Dataset, we chose the five calendar years ∈ {2008 . . . 2012} as time slices. The results will thus reveal the evolution of topics over documents during those discrete time slices, and the evolution of words used in the papers for each topic over time.

Dynamic topic mining requires the analyst to pre-set the number of topics. Based on previous experiments with varying numbers of topics in paper collections in well-defined subject areas, we decided to run the analysis of the LAK Dataset with a set of 20 topics. This number, while somewhat arbitrary, shall provide for sufficient discriminatory power for both the distribution of topics over papers and the distribution of words over topics. With fewer topics, terms like 'learning', for instance, are more likely to be present with relatively high relevance in many topics, while a larger preset would increase the number of topics exposed in each paper. Both situations would impede reasonable interpretation and visualization of the results.

A word of explanation regarding the labels used to refer to topics in this paper: mathematically each topic is a distribution over words. In a dynamic topic model this distribution changes over time, i.e. a specific word may rise or fall in relevance for a topic. In the rest of the paper we will therefore label each topic with an ordered tuple representing those words with the highest mean relevance for this topic over time. In topic modeling literature we found that four words is a good number to form a topic label. For instance, for topic "students model parameters skill" the most relevant word on average is student followed by model, parameters, and skill. For illustration, based on the word distribution for this topic in 2008 only, the label would be "model student skill learning". Often, such word tuples are rephrased as more expressive labels; for instance "student modeling" could be appropriate in our example.

The obtained topic model including 20 topics was analyzed to see whether the topics have sufficient discriminatory power. To this end, we used the ten most important words for each topic and the corresponding probability distributions to compute a dissimilarity measure of the distributions by using the Jensen-Shannon divergence measure [5]. The matrix with pairwise divergence values is displayed in Figure 2. The maximum Jensen-Shannon divergence value is ln(2) ≈ 0.69. The darker the cell color, the lower the divergence, thus the higher the similarity. The matrix is generally "light-colored", indicating that the topics' word distributions diverge to a high degree. Topic pair (A, S) has the lowest dissimilarity value, and Figure 3 reveals why: both topics are about student modeling. Topic S generally appears to have several loosely related topics.

Figure 2: Overview of topic divergence

4. ANALYSIS OF LAK TOPIC DYNAMICS
In this section we explore three questions about the LAK Dataset, intending to shed some light on the past and present topics of learning analytics research, along with a cautious glimpse into the future.

Question 1: What have been the most relevant topics overall in the LAK data set?
This question addresses the LAK Data Challenge aspects of roots and current state of learning analytics. Figure 3 shows an overview chart of the 20 topics identified in the LAK Dataset. The horizontal axis reflects the rank of mean relevance of each topic and the vertical axis reflects the rank of stability³ over the five time slices in the dataset. The size of each bubble reflects the relevance of the topic in 2012, the most recent period. We make several observations:

• The most relevant topics most prominently feature the terms students/learners, model, and data. This aligns well with SoLAR's definition of learning analytics as "the measurement, collection, analysis and reporting of data about learners and their contexts, for purposes of understanding and optimizing learning and the environments in which it occurs" [1], considering that understanding and optimization are necessarily based on models of learners and data.
• The topic with the highest mean relevance is "student model parameters skill" (A); this topic also has the highest variance in relevance.
• In the top-right quadrant we find topic "model data features prediction" (B), which has a strong relevance in 2012, a high mean relevance rank over all years and a high stability. As such, it can be considered one of the core topics in the LAK Dataset. In 2012 the distribution of words in this topic would advocate the label "prediction model data students", i.e. prediction is currently most relevant for this topic.
• Topic "network community discussion analysis" (R) is also worth looking at. While its mean relevance is relatively low and it is volatile, it is among the relevant topics in 2012 (cf. the bubble size). The topic evolution chart in Figure 4 reveals that this topic accumulated most of its relevance in 2011, the year of the first LAK conference. Also, 8 of the 10 papers with the strongest focus on this topic in 2011 were published in the LAK conference (see bottom portion of Figure 4) although EDM published 2.5 times the number of papers in that year. This topic, in 2012 represented by the word order "network community social user", therefore appears to be a genuine LAK topic which was previously rather irrelevant for the EDM conference.

³ Stability was computed by inverting the variance of the topic's relevance over time.

Figure 3: Topic stability plotted against average topic relevance over time.

Figure 4: Evolution (top) and most representative papers in 2011 (bottom) of topic "network community discussion analysis"

Question 2: What changes in topic dynamics did the first LAK conference in 2011 bring about?
This question aims to reveal whether and how the LAK community relates to the EDM community in terms of topics covered by their papers. To explore this we look (a) at the overall distribution of topics over time and (b) at the relative change of topic relevance between 2010 and 2011.

The evolution of the overall distribution of topics is illustrated in the ThemeRiver in Figure 5. In a ThemeRiver [4] the horizontal axis represents the points in time to which the documents in a dataset belong (in the LAK Dataset that is the publication date), and the vertical axis represents the relevance of the topic. Each current in the ThemeRiver therefore presents the dynamic development of a selected topic over time. The wider the current, the more relevant is the topic, i.e. the more documents expose this topic. Since each document exposes different topics to varying degrees, the relevance of topic $k$ at time $t$ is formally defined as $\mathrm{relevance}(k, t) := \frac{1}{|D_t|} \sum_{d \in D_t} \theta_d[k]$, where $D_t$ is the set of documents belonging to time $t$, and $\theta_d$ is the topic distribution for document $d$.

Figure 5: Overall distribution of topic relevance between 2008 and 2012

Observing the ThemeRiver in Figure 5 it is evident that there were some shifts in topic focus between 2008 and 2010, the period for which the dataset contains only EDM papers. Between 2010 and 2011 we identify the strongest turbulence, presumably based on substantial shifts in topic foci introduced by the 2011 LAK conference. Interestingly the topic distribution remains rather stable during the last time slice, in which LAK 2012, EDM 2012 and the ETS special issue are included. This might suggest that these three publication venues propelled the convergence of LAK research as represented in the LAK Dataset.

To see which topics rose in relevance between 2010 and 2011 we filter for topics and zoom into the transition between 2010 and 2011 as illustrated in Figure 6. Those three topics that have their absolute highest relevance in 2011 are marked with an up-pointing triangle with a solid-black outline. These are "model students data probability", "network community discussion analysis", and "problem students model types", indicating an increased focus on student modeling as well as community and network analysis through the first LAK conference in 2011.

Figure 6: Topics with rising relevance in 2011

Question 3: What topics rose the most in 2012, the most recent time slice in the data set?
This question looks into what the dynamic topic model of the LAK Dataset suggests as rising topics over the next year(s). We try to answer this by identifying those five topics that had the highest rise in relevance between 2011 and 2012. The topic labels represent the word distribution in 2012, and the number in parentheses indicates the absolute gain in relevance:

1. students data courses system (+.054)
2. students interaction participants analysis (+.036)
3. learning analytics social learners (+.035)
4. students actions learning state (+.025)
5. data user learning dataset (+.013)

In sum, these five topics have accumulated a share of 42% of the topic distribution by 2012, starting from 11% in 2008 (cf. Figure 7). These developments indicate a strong increase in focus on the students' activities and actions in courses as well as social and interaction analytics.

Figure 7: Cumulative relevance of the top-five rising topics 2012 over all years

5. D-VITA TOPIC ANALYTICS TOOLKIT
Except for Figures 1 and 3 all figures were produced using D-VITA, a web-based visual analytics tool we developed and deployed for visual analytics of dynamic topic models. The tool allows users to visually interact with the output of the dynamic topic mining algorithms on the LAK Dataset. The application window shown in Figure 8 has three panels:

The Topics Panel shows the list of topics obtained by the dynamic topic modeling algorithm; topics can be sorted by rising, falling and mean relevance, as well as variance of relevance. The topics can be filtered using keywords; in the screenshot the keyword "visual" is used as a filter. The topic list thus only includes topics whose set of relevant words includes this word stem. Topics checked by the user will be visualized in the ThemeRiver in the Topic Evolution Panel.

The Topic Evolution Panel shows a ThemeRiver of the evolution of relevance of the topics selected in the Topics Panel. Data points for each topic and time slice can be clicked, which triggers the display of detailed information on the clicked topic at the selected time slice in the Document and Word Evolution Panel.

The Document and Word Evolution Panel shows for the selected topic an ordered list of the most relevant papers in the "Relevant Documents" tab. The icons next to each document allow showing the topic pie for the document and its content, respectively. The "Similar Docs" icon will bring up the Document Browser with a list of similar documents. Under the "Word Evolution" tab the user will find a ThemeRiver illustrating the evolution of the distribution of words in the selected topic over time.

D-VITA also offers a Document Browser to perform keyword-based search, explore the topic distribution of documents, and navigate documents based on similarity.

Figure 8: Application window showing ThemeRiver and document list (rotated image)

6. CONCLUSION
In a nutshell, we discovered the following: Regarding the past, we found that LAK and EDM do have a substantial shared topic foundation including themes like student modeling, data classification, and clustering. We also found that the EDM conference series had some turbulence in topical focus between 2008 and 2010, the time window when only EDM papers are present in the dataset.

Regarding the present we found that the LAK Dataset exposes a strong emphasis on learner modeling, data modeling, analysis and prediction. The first LAK conference in 2011 also brought some considerable shifts in topic focus; e.g. LAK 2011 has visibly strengthened network and social analysis aspects on top of EDM topics.

Regarding the near future we found that the shifts in the topics' proportions in 2012 appear rather moderate, thus indicating a phase of convergence of LAK research topics. Projecting recent topic shifts into the future, we can expect increased emphasis on social and interaction aspects and a sustained, strong role of students as research subjects.

7. ACKNOWLEDGMENTS
This work was supported by the European Commission through the support action TEL-Map (FP7-257822) and the integrated project Layers (FP7-318209).

8. REFERENCES
[1] About SoLAR, 2012. http://www.solaresearch.org/mission/about/.
[2] D. M. Blei. Probabilistic topic models. Commun. ACM, 55(4):77–84, 2012.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models. In ICML, pages 113–120, 2006.
[4] S. Havre, E. G. Hetzler, P. Whitney, and L. T. Nowell. Themeriver: Visualizing thematic changes in large document collections. IEEE Trans. Vis. Comput. Graph., 8(1):9–20, 2002.
[5] J. Lin. Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 1991.
[6] M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980.
[7] D. Taibi and S. Dietze. Fostering analytics on learning analytics research: the LAK dataset, Technical Report, 03/2013, 2013. http://resources.linkededucation.org/2013/03/lak-dataset-taibi.pdf.
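The preprocessing step in Section 2 (stopword removal followed by stemming, then counting distinct stems) can be pictured with the short Python sketch below. It is not the script used for the paper. The stopword list, the function names and the two sample sentences are invented for illustration, and `toy_stem` is a deliberately crude suffix stripper standing in for the Porter algorithm [6]; a real Porter implementation would be passed in through the `stem` parameter.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the paper does not give its list).
STOPWORDS = frozenset({"a", "an", "and", "for", "in", "is", "of", "on", "the", "to"})


def toy_stem(word):
    """Crude suffix stripper used as a placeholder. This is NOT the Porter algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=toy_stem, stopwords=STOPWORDS):
    """Lowercase, tokenize on letters, drop stopwords, then stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in stopwords]


def count_stems(texts, stem=toy_stem, stopwords=STOPWORDS):
    """Count how often each stem occurs across a collection of texts."""
    counts = Counter()
    for text in texts:
        counts.update(preprocess(text, stem, stopwords))
    return counts


# Two made-up sentences standing in for paper full texts.
papers = ["Modeling the skills of students", "A model of student learning"]
stems = count_stems(papers)
print(len(stems), "distinct stems:", sorted(stems))
```

Keeping the stemmer as a parameter means the counting logic stays the same when the placeholder is swapped for a proper Porter stemmer.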
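The labeling rule from Section 3 (an ordered tuple of the four words with the highest mean relevance over time) can be sketched in Python as follows. The function name `topic_label`, the data layout and all probabilities are assumptions made for this illustration; the numbers are invented so that the output mirrors the example labels quoted in the text and are not taken from the actual topic model.

```python
def topic_label(word_probs_by_year, size=4):
    """Build a label from the words with the highest mean probability over time.

    word_probs_by_year maps each time slice to a {word: probability} dict
    for one topic. Words missing from a slice count as probability 0.
    """
    years = list(word_probs_by_year)
    words = set()
    for distribution in word_probs_by_year.values():
        words.update(distribution)
    mean = {
        word: sum(word_probs_by_year[year].get(word, 0.0) for year in years) / len(years)
        for word in words
    }
    # Sort by descending mean probability; break ties alphabetically.
    ranked = sorted(words, key=lambda word: (-mean[word], word))
    return " ".join(ranked[:size])


# Invented word distributions for a single topic in two time slices.
topic = {
    2008: {"model": 0.30, "student": 0.25, "skill": 0.20, "learning": 0.15, "parameters": 0.10},
    2009: {"student": 0.40, "parameters": 0.25, "model": 0.20, "skill": 0.10, "learning": 0.05},
}

print(topic_label(topic))                # label over all slices
print(topic_label({2008: topic[2008]}))  # label from the 2008 slice only
```

With these made-up numbers the label over both slices is "student model parameters skill", while the 2008 slice alone yields "model student skill learning", which shows how a label can shift when a single time slice is used.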
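For readers who want to reproduce the topic dissimilarity check from Section 3, the following Python sketch computes the Jensen-Shannon divergence [5] with natural logarithms, which gives the upper bound of ln(2) mentioned in the text. How the ten most important words of two topics were aligned is not described in the paper; the `align` helper below assumes the union of both word sets with renormalized probabilities, and the two small topics are invented.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the midpoint distribution."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


def align(top_words_p, top_words_q):
    """Put two {word: probability} dicts on a shared vocabulary.

    Assumption: words missing from one topic get probability 0 and each
    vector is renormalized to sum to 1.
    """
    vocabulary = sorted(set(top_words_p) | set(top_words_q))

    def as_vector(top_words):
        total = sum(top_words.values())
        return [top_words.get(word, 0.0) / total for word in vocabulary]

    return as_vector(top_words_p), as_vector(top_words_q)


# Invented top-word distributions for two topics.
topic_a = {"student": 0.5, "model": 0.3, "skill": 0.2}
topic_s = {"student": 0.4, "model": 0.4, "problem": 0.2}

print(js_divergence(*align(topic_a, topic_s)))            # partially overlapping topics
print(js_divergence(*align({"a": 1.0}, {"b": 1.0})))      # disjoint topics reach ln(2)
```

Identical distributions give a divergence of 0 and topics with no shared words give ln(2), so values near the upper bound correspond to the light-colored cells described for Figure 2.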
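The relevance measure defined under Question 2 in Section 4 is the average topic proportion over all documents in a time slice. The Python sketch below illustrates the computation; the function name, the data layout and the three toy documents are assumptions made for the example and do not come from the LAK Dataset.

```python
def relevance(theta_by_doc, year_by_doc, k, t):
    """Mean share of topic k over all documents published in time slice t.

    theta_by_doc maps a document id to its topic distribution (a list that
    sums to 1); year_by_doc maps a document id to its time slice.
    """
    docs_in_slice = [doc for doc, year in year_by_doc.items() if year == t]
    if not docs_in_slice:
        raise ValueError(f"no documents in time slice {t}")
    return sum(theta_by_doc[doc][k] for doc in docs_in_slice) / len(docs_in_slice)


# Three invented documents with distributions over two topics.
theta_by_doc = {"p1": [0.7, 0.3], "p2": [0.5, 0.5], "p3": [0.2, 0.8]}
year_by_doc = {"p1": 2011, "p2": 2011, "p3": 2012}

for topic_index in range(2):
    print(topic_index, relevance(theta_by_doc, year_by_doc, topic_index, 2011))
```

Because every document's topic distribution sums to 1, the relevance values of all topics within one time slice also sum to 1, which is why the currents of a ThemeRiver over all topics fill a band of constant height.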
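Two further quantities from Section 4 follow directly from the yearly relevance values: stability, computed by inverting the variance of a topic's relevance over time (footnote 3), and the absolute gain in relevance between two years that is used to rank rising topics under Question 3. The Python sketch below shows both. The paper does not say whether population or sample variance was used, so population variance is an assumption here, as are the function names and the invented relevance values.

```python
import math
from statistics import pvariance


def stability(relevance_series):
    """Inverse of the variance of a topic's relevance over time.

    Assumption: population variance. A perfectly constant series has zero
    variance and is treated as infinitely stable.
    """
    variance = pvariance(relevance_series)
    return math.inf if variance == 0 else 1.0 / variance


def top_rising(relevance_by_topic, year_from, year_to, n=5):
    """Return the n topics with the largest absolute gain in relevance."""
    gains = {
        topic: series[year_to] - series[year_from]
        for topic, series in relevance_by_topic.items()
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:n]


# Invented yearly relevance values for three topics.
relevance_by_topic = {
    "A": {2011: 0.10, 2012: 0.08},
    "B": {2011: 0.05, 2012: 0.09},
    "C": {2011: 0.02, 2012: 0.03},
}

for topic, series in relevance_by_topic.items():
    print(topic, "stability:", stability(list(series.values())))
print(top_rising(relevance_by_topic, 2011, 2012, n=2))
```

Since Figure 3 plots ranks rather than raw values, only the ordering of the stability scores matters there, so the choice between population and sample variance would not change the chart as long as all topics cover the same number of time slices.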