A Dynamic Topic Model of Learning Analytics Research

Michael Derntl, Nikou Günnemann, Ralf Klamma
RWTH Aachen University
Advanced Community Information Systems (ACIS)
Aachen, Germany
derntl@dbis.rwth-aachen.de, nikou@dbis.rwth-aachen.de, klamma@dbis.rwth-aachen.de

ABSTRACT
Research on learning analytics and educational data min-
ing has been published since the first conference on Educa-
tional Data Mining (EDM) in 2008 and gained momentum
through the establishment of the Learning Analytics and
Knowledge (LAK) conference in 2011. This paper addresses
the LAK Data Challenge from the perspective of visual an-
alytics of topic dynamics in the LAK Dataset between 2008
and 2012. The data set was processed using probabilistic,
dynamic topic mining algorithms. To enable exploration             Figure 1: Yearly distribution of papers over venues
and visual analysis of the resulting topic model by LAK
researchers and stakeholders we developed and deployed D-
VITA, a web-based browsing tool for dynamic topic models.               questions posed in the LAK Data Challenge into the
In this paper we explore answers to the questions about past,           user’s hands. D-VITA is a web-based tool that offers
present, and future of LAK posed in the Data Challenge                  topic-based views on the LAK Dataset using a point-
based on a topic model of all papers in the LAK Dataset.                and-click metaphor and simple visualizations.
We also briefly describe how users can explore the LAK topic
model on their own using D-VITA.                                   2.   DATASET AND PREPROCESSING
                                                                   The LAK Dataset underlying the analyses presented in this
1.     OBJECTIVES                                                  paper includes the EDM conference proceedings 2008–2012
The LAK Data Challenge called for contributions to make            (239 papers), the LAK conference proceedings 2011–2012
sense of the field of learning analytics including its “roots,     (66 papers), and the papers of the 2012 Special Issue on
current state, and future trends, based on how its members         Learning Analytics in the Educational Technology and
report and debate their research”1. This paper tackles the         Society journal (10 papers; hereafter referred to as ETS). The
challenge by presenting facts obtained from statistical anal-      RDF representation of the LAK Dataset was processed by a
yses of the paper full texts included in the provided LAK          script that extracted for each paper the identifier, venue ∈
Dataset [7]. The main contributions are as follows:                {LAK, EDM, ETS}, year of publication, title, authors, ab-
                                                                   stract, full text, and hyperlink to the full RDF description
                                                                   on data.linkededucation.org. The distribution of the 315
     1. A dynamic topic model was computed using the ap-
                                                                   papers over time and venues is given in Figure 1.
        proach presented in [3]. Using this dynamic topic model
        we explore in Section 4 three questions about the evo-
                                                                   In the next preprocessing step the paper records were cleaned
        lution of topics in the LAK Dataset to distill knowledge
                                                                   by removing stopwords and by applying stemming methods
        about past, present and future of LAK research.
                                                                   on the included word sets. For word stemming we used the
     2. In Section 5 we describe the visual analytics application  Porter Stemming technique [6], which is well established for
        D-VITA2 , which puts the toolkit to answer the             this purpose. As a result, close to 5000 distinct word stems
1
                                                                   were identified as being used in the 315 papers.
    http://solaresearch.org/events/lak/lak-data-challenge/
2
    http://monet.informatik.rwth-aachen.de/DVita/?id=16
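The cleaning step described in Section 2 (stopword removal followed by stemming and counting of distinct word stems) can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the paper: the stopword list is a tiny hypothetical one, the sample texts are invented, and `crude_stem` is a deliberately simplified suffix stripper that only stands in for the Porter stemmer [6].

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list; the list used for the paper is not given.
STOPWORDS = {"a", "an", "and", "are", "by", "for", "in", "is",
             "of", "on", "the", "to", "we", "with"}


def crude_stem(word):
    """Simplified suffix stripping; a stand-in for, not an implementation of,
    the Porter stemmer [6]."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, stem=crude_stem):
    """Lowercase, tokenize on letters, drop stopwords, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def vocabulary(papers, stem=crude_stem):
    """Count how often each word stem occurs over all paper texts."""
    counts = Counter()
    for text in papers:
        counts.update(preprocess(text, stem))
    return counts


# Invented sample texts, for illustration only.
papers = ["Students are modeling skills", "We model the student data"]
vocab = vocabulary(papers)
```

The `stem` parameter is there so that a real Porter implementation can be swapped in, for example the `stem` method of NLTK's `PorterStemmer`, if that library is available.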
                                                                   3.   DYNAMIC TOPIC MINING
                                                                   From a text mining perspective the LAK Dataset represents
                                                                   a text corpus in which a set of words is used in a set of pa-
                                                                   pers. To identify what is relevant to LAK research, we used
                                                                   the dynamic topic modeling approach described in [3] to ob-
                                                                   tain the distribution of words over a pre-defined number of
                                                                   topics. This is a probabilistic, unsupervised machine learn-
                                                                   ing approach that has been gaining increasing prominence
                                                                   recently [2]. In these probabilistic topic models a topic is a
Copyright 2013 by the authors                                      distribution of words, so each topic is typically represented
by its most frequently occurring words. Topic mining also          with pairwise divergence values is displayed in Figure 2. The
obtains the distribution of these topics over the papers in        maximum Jensen-Shannon divergence value is ln(2) ≈ .69.
the data set. Dynamic topic mining applies these analy-            The darker the cell color, the lower the divergence, thus the
sis steps using several consecutive time slices in the data        higher the similarity. The matrix is generally “light-colored”,
set. For the LAK Dataset, we chose the five calendar years         indicating that the topics’ word distributions diverge to a
∈ {2008, ..., 2012} as time slices. The results will thus reveal   high degree. Topic pair (A, S) has the lowest dissimilarity
the evolution of topics over documents during those discrete       value, and Figure 3 reveals why: both topics are about stu-
time slices, and the evolution of words used in the papers         dent modeling. Topic S generally appears to have several
for each topic over time.                                          loosely related topics.
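The Jensen-Shannon divergence [5] used to compare the word distributions of two topics can be sketched as follows. This is an illustrative sketch only: the function name, the representation of a topic as a dictionary from word to probability, and the renormalization of each truncated top-word list over the union of both word sets are assumptions, since the paper does not spell out these details. With the natural logarithm the value ranges from 0 (identical distributions) to ln(2) (no shared words).

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two topics, each
    given as a dict mapping word -> probability."""
    words = set(p) | set(q)
    # Assumption: truncated top-word lists are renormalized to sum to 1.
    zp, zq = sum(p.values()), sum(q.values())
    total = 0.0
    for w in words:
        pw = p.get(w, 0.0) / zp
        qw = q.get(w, 0.0) / zq
        mw = 0.5 * (pw + qw)  # mixture distribution M = (P + Q) / 2
        # JSD = 0.5 * KL(P || M) + 0.5 * KL(Q || M); zero terms are skipped.
        if pw > 0:
            total += 0.5 * pw * math.log(pw / mw)
        if qw > 0:
            total += 0.5 * qw * math.log(qw / mw)
    return total


# Invented toy topics, for illustration only.
topic_a = {"student": 0.5, "model": 0.3, "skill": 0.2}
topic_b = {"network": 0.6, "community": 0.4}
disjoint = js_divergence(topic_a, topic_b)  # no shared words
same = js_divergence(topic_a, topic_a)      # identical distributions
```

Computing this value for every pair of the 20 topics yields a symmetric matrix of the kind shown in Figure 2.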

Dynamic topic mining requires the analyst to pre-set the           4.     ANALYSIS OF LAK TOPIC DYNAMICS
number of topics. Based on previous experiments with vary-         In this section we explore three questions about the LAK
ing numbers of topics in paper collections in well-defined         Dataset, intending to shed some light on the past and present
subject areas, we decided to run the analysis of the LAK           topics of learning analytics research, along with a cautious
Dataset with a set of 20 topics. This number, while somewhat       glimpse into the future.
arbitrary, should provide sufficient discriminatory
power for both the distribution of topics over papers and the
distribution of words over topics. With fewer topics, terms        Question 1: What have been the most relevant topics
like ‘learning’, for instance, are more likely to be present       overall in the LAK data set?
with relatively high relevance in many topics, while a larger      This question addresses the LAK Data Challenge aspects
preset would increase the number of topics exposed in each         of roots and current state of learning analytics. Figure 3
paper. Both situations would impede reasonable interpreta-         shows an overview chart of the 20 topics identified in the
tion and visualization of the results.                             LAK Dataset. The horizontal axis reflects the rank of mean
                                                                   relevance of each topic and the vertical axis reflects the rank
A word of explanation regarding the labels used to refer to        of stability3 over the five time slices in the dataset. The size
topics in this paper: mathematically each topic is a distri-       of each bubble reflects the relevance of the topic in 2012, the
bution over words. In a dynamic topic model this distri-           most recent period. We make several observations:
bution changes over time, i.e. a specific word may rise or
fall in relevance for a topic. In the rest of the paper we
will therefore label each topic with an ordered tuple repre-            • The most relevant topics most prominently feature the
senting those words with the highest mean relevance for this              terms students/learners, model, and data. This aligns
topic over time. In topic modeling literature we found that               well with SoLAR’s definition of learning analytics as
four words is a good number to form a topic label. For in-                “the measurement, collection, analysis and reporting
stance, for topic “students model parameters skill” the                   of data about learners and their contexts, for purposes
most relevant word on average is student followed by model,               of understanding and optimizing learning and the en-
parameters, and skill. For illustration, based on the word                vironments in which it occurs,”[1] considering that un-
distribution for this topic in 2008 only, the label would be              derstanding and optimization is necessarily based on
“model student skill learning”. Often, such word tuples                   models of learners and data.
are rephrased as more expressive labels; for instance “student
modeling” could be appropriate in our example.                          • The topic with the highest mean relevance is “student
                                                                          model parameters skill” (A); this topic also has the
The obtained topic model including 20 topics was analyzed                 highest variance in relevance.
to see whether the topics have sufficient discriminatory power.         • In the top-right quadrant we find topic “model data
To this end, we used the ten most important words for                     features prediction” (B) which has a strong rele-
each topic and the corresponding probability distributions to             vance in 2012, high mean relevance rank over all years
compute a dissimilarity measure of the distributions by us-               and a high stability. As such, it can be considered
ing the Jensen-Shannon divergence measure [5]. The matrix                 as one of the core topics in the LAK Dataset. In
                                                                          2012 the distribution of words in this topic would ad-
                                                                          vocate the label “prediction model data students”,
                                                                          i.e. prediction is currently most relevant for this topic.

                                                                        • Topic “network community discussion analysis” (R)
                                                                          is also worth looking at. While it is relatively irrele-
                                                                          vant and volatile, it is among the relevant topics in
                                                                          2012 (cf. the bubble size). The topic evolution chart
                                                                          in Figure 4 reveals that this topic accumulated most
                                                                          of its relevance in 2011, the year of the first LAK con-
                                                                          ference. Also, 8 of the 10 papers with the strongest
                                                                          focus on this topic in 2011 were published in the LAK
                                                                          conference (see bottom portion of Figure 4) although
                                                                          EDM published 2.5 times the number of papers in that
                                                                   3
                                                                     Stability was computed by inverting the variance of the
       Figure 2: Overview of topic divergence                      topic’s relevance over time
                    Figure 3: Topic stability plotted against average topic relevance over time.


                                                                 the topic, i.e. the more documents expose this topic. Since
                                                                 each document exposes different topics to varying degrees
                                                                 the relevance of topic $k$ at time $t$ is formally defined as
                                                                 $\mathrm{relevance}(k, t) := \frac{1}{|D_t|} \sum_{d \in D_t} \theta_d[k]$, where $D_t$ is the set of
                                                                 documents belonging to time $t$, and $\theta_d$ is the topic distribution
                                                                 for document $d$. Observing the ThemeRiver in Figure 5
                                                                 it is evident that there were some shifts in topic focus between
                                                                 2008 and 2010, when the dataset contains only EDM papers.
                                                                 Between 2010 and 2011 we identify the
                                                                 strongest turbulence, presumably based on substantial shifts
                                                                 in topic foci introduced by the 2011 LAK conference. Inter-
                                                                 estingly the topic distribution remains rather stable during
                                                                 the last time slice, in which LAK 2012, EDM 2012 and the
                                                                 ETS special issue are included. This might suggest that
                                                                 these three publication venues propelled the convergence of
                                                                 LAK research as represented in the LAK Dataset.
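The relevance measure underlying the ThemeRiver, i.e. the mean of the topic proportions θ_d[k] over all documents d of time slice t, can be sketched as follows. The data structures (`theta` as a dictionary from document identifier to a list of topic proportions, `year_of` as a dictionary from document identifier to publication year) and the sample values are hypothetical and chosen only for illustration.

```python
def relevance(k, t, theta, year_of):
    """Mean proportion of topic k over all documents published in year t."""
    docs = [d for d, year in year_of.items() if year == t]
    if not docs:
        raise ValueError(f"no documents in time slice {t}")
    return sum(theta[d][k] for d in docs) / len(docs)


# Invented toy data: three papers, two topics.
theta = {
    "p1": [0.7, 0.3],
    "p2": [0.1, 0.9],
    "p3": [0.4, 0.6],
}
year_of = {"p1": 2011, "p2": 2011, "p3": 2012}

r0_2011 = relevance(0, 2011, theta, year_of)  # (0.7 + 0.1) / 2
r1_2011 = relevance(1, 2011, theta, year_of)  # (0.3 + 0.9) / 2
```

Because each θ_d sums to 1, the relevance values of all topics within one time slice also sum to 1, which is why the currents of a ThemeRiver over all topics fill a band of constant total width.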

                                                                 To see which topics rose in relevance between 2010 and
                                                                 2011 we filter for topics and zoom into the transition be-
                                                                 tween 2010 and 2011 as illustrated in Figure 6. Those three
                                                                 topics that have their absolute highest relevance in 2011
                                                                 are marked with an up-pointing triangle with a solid-black
Figure 4: Evolution (top) and most representative
                                                                 outline. These are “model students data probability”,
papers in 2011 (bottom) of topic “network commu-
                                                                 “network community discussion analysis”, and “problem
nity discussion analysis”
                                                                 students model types”, indicating an increased focus on
                                                                 student modeling as well as community and network analy-
                                                                 sis through the first LAK conference in 2011.
     year. This topic, in 2012 represented by the word or-
     der “network community social user”, therefore ap-
     pears to be a genuine LAK topic which was previously
                                                                 Question 3: What topics rose the most in 2012, the
     rather irrelevant for the EDM conference.
                                                                 most recent time slice in the data set?
                                                                 This question looks into what the dynamic topic model of
Question 2: What changes in topic dynamics did the               the LAK Dataset suggests as rising topics over the next
first LAK conference in 2011 bring about?                        year(s). We try to answer this by identifying those five top-
This question aims to reveal whether and how the LAK com-        ics that had the highest rise in relevance between 2011 and
munity relates to the EDM community in terms of topics           2012. The topic labels represent the word distribution in
covered by their papers. To explore this we look (a) at          2012, and the number in parentheses indicates the absolute
the overall distribution of topics over time and (b) at the      gain in relevance:
relative change of topic relevance between 2010 and 2011.

The evolution of the overall distribution of topics is illus-
trated in the ThemeRiver in Figure 5. In a ThemeRiver [4]
the horizontal axis represents the points in time to which the
documents in a dataset belong (in the LAK Dataset that is
the publication date), and the vertical axis represents the
relevance of the topic. Each current in the ThemeRiver
therefore presents the dynamic development of a selected         Figure 5: Overall distribution of topic relevance be-
topic over time. The wider the current, the more relevant is     tween 2008 and 2012
                                                                  The Document and Word Evolution Panel shows for the
                                                                  selected topic an ordered list of the most relevant papers
                                                                  in the “Relevant Documents” tab. The icons next to each
                                                                  document allow showing the topic pie for the document
                                                                  and its content, respectively. The “Similar Docs” icon will
                                                                  bring up the Document Browser with a list of similar docu-
                                                                  ments. Under the “Word Evolution” tab the user will find a
                                                                  ThemeRiver illustrating the evolution of the distribution of
                                                                  words in the selected topic over time.

                                                                  D-VITA also offers a Document Browser to perform keyword-
                                                                  based search, explore the topic distribution of documents,
     Figure 6: Topics with rising relevance in 2011               and navigate documents based on similarity.

                                                                  6.   CONCLUSION
     1. students data courses system (+.054)                      In a nutshell, we discovered the following: Regarding the
     2. students interaction participants analysis (+.036)        past, we found that LAK and EDM do have a substantial
     3. learning analytics social learners (+.035)                shared topic foundation including themes like student mod-
     4. students actions learning state (+.025)                   eling, data classification, and clustering. We also found that
     5. data user learning dataset (+.013)                        the EDM conference series had some turbulence in topical
                                                                  focus between 2008 and 2010, the time window when only
                                                                  EDM papers are present in the dataset.
In sum these five topics have accumulated a share of 42% of
the topic distribution by 2012, starting from 11% in 2008 (cf.    Regarding the present we found that the LAK Dataset ex-
Figure 7). These developments indicate a strong increase in       poses a strong emphasis on learner modeling, data model-
focus on the students’ activities and actions in courses as       ing, analysis and prediction. The first LAK conference in
well as social and interaction analytics.                         2011 also brought some considerable shifts in topic focus;
                                                                  e.g. LAK 2011 has visibly strengthened network and social
                                                                  analysis aspects on top of EDM topics.
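The ranking used for Question 3, i.e. sorting topics by their absolute gain in relevance between two consecutive time slices and summing the relevance of the top-ranked topics, can be sketched as follows. The function names, the nested-dictionary layout, and the numbers are hypothetical; they are not the values of the LAK topic model.

```python
def top_rising(relevance_by_topic, prev, curr, n=5):
    """Return the n (label, gain) pairs with the largest absolute gain in
    relevance from time slice prev to time slice curr."""
    gains = {
        label: series[curr] - series[prev]
        for label, series in relevance_by_topic.items()
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:n]


def cumulative_share(relevance_by_topic, labels, year):
    """Sum of the relevance values of the given topics in one time slice."""
    return sum(relevance_by_topic[label][year] for label in labels)


# Invented toy relevance values per topic and year.
rel = {
    "T1": {2011: 0.10, 2012: 0.15},
    "T2": {2011: 0.20, 2012: 0.18},
    "T3": {2011: 0.05, 2012: 0.12},
}

rising = top_rising(rel, 2011, 2012, n=2)
share_2012 = cumulative_share(rel, [label for label, _ in rising], 2012)
```

Applied to the relevance values of the 20 topics with `prev=2011`, `curr=2012` and `n=5`, this procedure corresponds to the list of rising topics and the cumulative share plotted in Figure 7.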

                                                                  Regarding the near future we found that the shifts in the
                                                                  topics’ proportions in 2012 appear rather moderate, thus
                                                                  indicating a phase of convergence of LAK research topics.
                                                                  Projecting recent topic shifts into the future, we can expect
                                                                  increased emphasis on social and interaction aspects and a
                                                                  sustained, strong role of students as research subjects.

Figure 7: Cumulative relevance of the top-five rising             7.   ACKNOWLEDGMENTS
topics 2012 over all years                                        This work was supported by the European Commission through
                                                                  the support action TEL-Map (FP7-257822) and the
5.     D-VITA TOPIC ANALYTICS TOOLKIT                             integrated project Layers (FP7-318209).
Except for Figures 1 and 3 all figures were produced using
D-VITA, a web-based visual analytics tool we developed and        8.   REFERENCES
deployed for visual analytics of dynamic topic models. The        [1] About SoLAR, 2012.
tool allows users to visually interact with the output of the         http://www.solaresearch.org/mission/about/.
dynamic topic mining algorithms on the LAK Dataset. The           [2] D. M. Blei. Probabilistic topic models. Commun. ACM,
application window shown in Figure 8 has three panels:                55(4):77–84, 2012.
                                                                  [3] D. M. Blei and J. D. Lafferty. Dynamic topic models.
The Topics Panel shows the list of topics obtained by the
                                                                      In ICML, pages 113–120, 2006.
dynamic topic modeling algorithm; topics can be sorted by
                                                                  [4] S. Havre, E. G. Hetzler, P. Whitney, and L. T. Nowell.
rising, falling and mean relevance, as well as variance of
                                                                      Themeriver: Visualizing thematic changes in large
relevance. The topics can be filtered using keywords; in the
                                                                      document collections. IEEE Trans. Vis. Comput.
screen shot the keyword “visual” is used as a filter. The topic
                                                                      Graph., 8(1):9–20, 2002.
list thus only includes topics whose set of relevant words
includes this word stem. Topics checked by the user will be       [5] J. Lin. Divergence Measures Based on the Shannon
visualized in the ThemeRiver in the Topic Evolution Panel.            Entropy. IEEE Transactions on Information Theory,
                                                                      37(1), 1991.
The Topic Evolution Panel shows a ThemeRiver of evolu-            [6] M. F. Porter. An algorithm for suffix stripping.
tion of relevance of the topics selected in the Topics Panel.         Program, 14(3):130–137, 1980.
Data points for each topic and time slice, respectively, can      [7] D. Taibi and S. Dietze. Fostering analytics on learning
be clicked, which will trigger the display of detailed infor-         analytics research: the LAK dataset, Technical Report,
mation on the clicked topic at the selected time slice in the         03/2013, 2013. http://resources.linkededucation.
Document and Word Evolution Panel.                                    org/2013/03/lak-dataset-taibi.pdf.
Figure 8: Application window showing ThemeRiver and document list (rotated image)