<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Deconstruct and Reconstruct: Using Topic Modeling on an Analytics Corpus</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mike Sharkey</string-name>
          <email>mike@bluecanarydata.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Ansari</string-name>
          <email>mohammed@bluecanarydata.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Blue Canary</institution>
          ,
          <addr-line>145 S. 79th St., ste. 59, Chandler, AZ 85226, 480-262-3438</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Blue Canary</institution>
          ,
          <addr-line>145 S. 79th St., ste. 59, Chandler, AZ 85226, 602-617-4174</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>The question posed by the 2014 LAK Data Challenge is “What do analytics on learning analytics tell us?” We took a two-pronged approach to this challenge. First, we wanted to use advanced analytical techniques on the corpus to make the “eat your own dog food” point. Since many of the EDM/LAK submissions explain advanced statistical or semantic analytic approaches, we wanted to use those same methods for our analysis. To that end, we used two natural language processing (NLP) tools to analyze the corpus of papers. First, we used latent Dirichlet allocation (LDA) to extract clusters of terms from the content. Second, we used Turbo Topics to convert the LDA output into phrases (bi-grams and tri-grams). These NLP tools allowed us to execute the second part of our approach to the challenge. Once the corpus was aggregated into topics, we used Tableau to visually inspect the corpus for trends. In addition to standard descriptive visualizations, we were able to identify trends in the corpus topics from 2008 through 2013. Most interesting is that with both EDM and LAK, we noticed a trend of topic convergence after three years. We were also able to easily discern topic trends such as the increased popularity of “social” and “network” over the last three years, and the consistent appearance of “Cognitive Tutor” related topics (e.g. intelligent tutoring, concept map). While these findings may not be unexpected, we believe that the ability to extract and visualize these outcomes is unique.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural language</kwd>
        <kwd>LDA</kwd>
        <kwd>Turbo Topics</kwd>
        <kwd>EDM</kwd>
        <kwd>LAK</kwd>
        <kwd>corpus</kwd>
        <kwd>Tableau</kwd>
        <kwd>visualization</kwd>
        <kwd>analytics</kwd>
        <kwd>social networks</kwd>
        <kwd>cognitive tutor</kwd>
        <kwd>IRT</kwd>
        <kwd>assessment</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>• Human-centered computing~Information visualization
• Computing methodologies~Natural language processing
• Computing methodologies~Topic modeling
• Computing methodologies~Latent Dirichlet allocation</p>
    </sec>
    <sec id="sec-2">
      <title>1. INTRODUCTION</title>
      <p>The 2014 LAK Data Challenge is a classic meta problem. The
challenge poses the question “What do analytics on learning
analytics tell us?” Blue Canary chose to enter the challenge in order
to contribute to the analytical body of work that the EDM and LAK
community members have been building for years. As we thought
about how the meta-analysis would work, we tried to use key tenets
of good data analysis. One such tenet can be summarized as
“automation, but with the human touch.” By this we mean that we
use our engineering skills to automate as many parts of the
analytical process as possible, but we still rely on human
intervention when required/appropriate.</p>
      <p>The corpus comprised papers submitted to the
Educational Data Mining (EDM) conferences from 2008 to 2013
and papers submitted to the Learning Analytics and Knowledge
(LAK) conferences from 2011 to 2013. Our approach to analyzing
the corpus was twofold: extract topics from the corpus and then
use visualizations to surface findings.</p>
    </sec>
    <sec id="sec-3">
      <title>1.1 NLP and Topic Modeling</title>
      <p>First, we used natural language processing (NLP) tools to model
topics from the corpus. Precious metals processing is an
appropriate analogy in this case. A gold mining operation
excavates large rocks, breaks them down into ore, refines the pure
gold, and then sells the gold so that designers can create jewelry.
For the end user, it is the gold jewelry that is of value. Similarly,
we looked at a large corpus of papers, broke it down into word
vectors, aggregated those vectors into topics and aggregated again
into concepts.</p>
    </sec>
    <sec id="sec-4">
      <title>1.2 Visualization</title>
      <p>Continuing to use the gold analogy, most buyers don’t judge the
value of a piece of jewelry by examining the quality of the gold.
The value is assessed by looking at the overall presentation of the
piece. For our analysis, we wanted to present data visualizations
that would surface the findings and information that peers would
find interesting. We also wanted to allow
users to ‘inspect the gold’ if desired. We used Tableau to create
and deliver the visualizations, and we used a topic browser to let
users browse topics in the context of their original papers.</p>
    </sec>
    <sec id="sec-5">
      <title>2. TOPIC MODELING METHODOLOGY</title>
      <p>
        The bulk of the analysis we performed was guided by the topic
modeling work driven by David M. Blei at Princeton.<sup>1</sup> Specifically,
we felt that using Blei’s work on Turbo Topics would be the best
approach to the EDM/LAK corpus. Turbo Topics builds on
single-term topics and aggregates the findings into multi-term
n-grams that give the user more context [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. An n-gram (a bi-gram, for example, is a two-word phrase)
such as “cognitive tutor” has much more
meaning in this space than the terms “cognitive” and “tutor”
independently.
      </p>
      <sec id="sec-5-1">
        <title>1 http://www.cs.princeton.edu/~blei/topicmodeling.html</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>2.1 Analysis Process</title>
      <p>
        We followed a specific process to deconstruct the corpus
down to topics and then aggregate the topics back up to meaningful
n-grams. Figure 1 shows the steps involved.
The papers started in XML format<sup>2</sup> thanks to the work done by
Taibi &amp; Dietze [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We had to convert that into a format that was
more suitable for our NLP pipeline. The papers were converted
into a CSV file where each paper was a line in the CSV. Once the
corpus was more readily machine readable, we used LDA to
perform the first round of topic aggregation. LDA assumes that
a corpus contains latent underlying topics and that each topic is
characterized by a set of words that tend to occur together [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
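      <p>The XML-to-CSV step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the conversion code used for this paper: the element and attribute names (corpus, paper, title, body, id, year, venue) and the function name corpus_to_csv are hypothetical and do not reflect the actual schema of the LAK dataset.</p>
      <preformat><![CDATA[
```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical input: these element names are illustrative only and are
# not the actual schema of the LAK dataset.
SAMPLE_XML = """
<corpus>
  <paper id="p1" year="2011" venue="LAK">
    <title>Social network analysis of discussion forums</title>
    <body>Students interact in
    online communities.</body>
  </paper>
  <paper id="p2" year="2012" venue="EDM">
    <title>Item difficulty in an intelligent tutoring system</title>
    <body>We estimate item difficulty, then fit a model.</body>
  </paper>
</corpus>
"""


def corpus_to_csv(xml_text):
    """Flatten each <paper> element into one CSV row."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "year", "venue", "text"])
    for paper in root.iter("paper"):
        title = paper.findtext("title", default="")
        body = paper.findtext("body", default="")
        # Collapse internal whitespace so each paper stays on a single line.
        text = " ".join((title + " " + body).split())
        writer.writerow([paper.get("id"), paper.get("year"), paper.get("venue"), text])
    return out.getvalue()


csv_text = corpus_to_csv(SAMPLE_XML)
rows = list(csv.reader(io.StringIO(csv_text)))
print(len(rows) - 1, "papers written")
```
]]></preformat>
      <p>Collapsing whitespace before writing keeps the one-paper-per-line property even when the source text contains line breaks, and the csv module quotes any field that contains a comma.</p>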
      <p>The output of LDA was a series of vectors, with each vector
corresponding to an assumed underlying topic. These vectors then
became the input for Turbo Topics – a process that would ingest
the LDA results and create a series of n-grams that are relevant to
the topics in the corpus. Examples of such n-grams from this
corpus included “classification algorithms”, “intelligent tutoring”,
and “decision tree”.</p>
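      <p>To make the idea of multi-word topics concrete, the sketch below counts adjacent word pairs and keeps those that recur. It is deliberately not the Turbo Topics procedure, which relies on statistical significance testing rather than raw counts; it is a simplified stand-in, and the sample documents, function names, and the min_count threshold are all hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical tokenized documents, used only to illustrate bigram counting.
docs = [
    "the cognitive tutor logs each step".split(),
    "a cognitive tutor adapts to the student".split(),
    "decision tree models beat a rule based baseline".split(),
    "we trained a decision tree".split(),
]


def count_bigrams(token_lists):
    """Count adjacent word pairs within each document."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts


def frequent_bigrams(token_lists, min_count=2):
    """Return bigrams seen at least min_count times as space-joined phrases."""
    counts = count_bigrams(token_lists)
    return {" ".join(pair): n for pair, n in counts.items() if n >= min_count}


phrases = frequent_bigrams(docs)
print(phrases)
```
]]></preformat>
      <p>On this toy input only “cognitive tutor” and “decision tree” pass the threshold. A raw count like this would also promote pairs built from very common words on a larger corpus, which is one reason a significance-based method is preferable in practice.</p>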
    </sec>
    <sec id="sec-7">
      <title>2.2 Human Intervention</title>
      <p>While Blue Canary attempted to systematize as much of the
analysis as possible, we realize that there is still a need for human
intelligence to guide the topic modeling.</p>
      <p>2 http://meco.l3s.uni-hannover.de:9080/wp2/?page_id=16</p>
      <sec id="sec-7-1">
        <title>3 http://lak14.bluecanarydata.com</title>
        <p>The first step of human intervention was selecting stop words.
These are words that should be excluded from the analysis because
their frequency doesn’t add value to the observations. In the CSV
step, we experimented with different upper and lower bounds on
word frequency. We settled on including only words that appeared
more than 50 times but fewer than 200 times; running LDA with
these limits produced the set of topic vectors that we observed to
work best.</p>
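        <p>The frequency bounds described above can be sketched as follows. The bounds of 50 and 200 come from the text; the function names and the toy corpus are hypothetical, and the sketch assumes the counts are taken over the whole corpus rather than per paper.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def build_vocabulary(token_lists, lower=50, upper=200):
    """Keep words whose corpus-wide count is strictly between the bounds."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {word for word, n in counts.items() if lower < n < upper}


def filter_tokens(token_lists, vocabulary):
    """Drop every token that is not in the retained vocabulary."""
    return [[t for t in tokens if t in vocabulary] for tokens in token_lists]


# Hypothetical toy corpus of 10 documents: "the" occurs 300 times (too common),
# "tutor" occurs 100 times (kept), and "zipf" occurs 10 times (too rare).
docs = [["the"] * 30 + ["tutor"] * 10 + ["zipf"] for _ in range(10)]
vocab = build_vocabulary(docs)
filtered = filter_tokens(docs, vocab)
print(sorted(vocab))
```
]]></preformat>
        <p>Both bounds are exclusive here, matching “more than 50” and “fewer than 200”. The filtered token lists are what would be passed on to LDA.</p>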
        <p>A second example of human intervention was in the final step of
grouping topics into concepts. This was purely a manual process
that involved browsing the approximately 80 n-grams, looking for
analytic themes, and grouping them accordingly. For example,
topics such as “error rate”, “feature selection”, and “activity
sequences” were grouped as Machine Learning while “discussion
forums”, “natural language”, and “topic words” were grouped as
Semantic/Text. The entire list of topics and concepts can be seen
in the “Concepts and Terms” tab of the accompanying Blue Canary
LAK site.<sup>3</sup></p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>3. VISUALIZING THE RESULTS</title>
      <p>In the analytics space, visualizing one’s data is an effective way of
finding trends and patterns in the underlying set. For the LAK
challenge, we had the topics and concepts as the core results from
our analysis. We created two metrics that we could use to frame
the observation of these results.</p>
      <p>The first was frequency – how many times does the topic appear in
the corpus. Since both the number of papers per year and the
number of words per paper varied from 2008 to 2013, we had to
normalize this metric. We chose to index the frequency on the most
frequent term. To illustrate, we look at the corpus in 2011 where
the most frequent topic was ‘social network’ with 112 appearances.
The next most frequent topic was ‘item difficulty’ with 90
appearances. In our analysis, we gave ‘social network’ a frequency
of 1.0 and ‘item difficulty’ a frequency of 0.8 (90/112 ≈ 0.80).</p>
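      <p>The indexing of frequency on the most frequent topic can be written as a short calculation. The sketch below uses the two 2011 counts quoted above; the function name index_frequencies is hypothetical.</p>
      <preformat><![CDATA[
```python
def index_frequencies(counts):
    """Divide every count by the largest count so the top topic scores 1.0."""
    top = max(counts.values())
    return {topic: n / top for topic, n in counts.items()}


# Counts for 2011 as reported in the text.
counts_2011 = {"social network": 112, "item difficulty": 90}
indexed = index_frequencies(counts_2011)
for topic, score in indexed.items():
    print(f"{topic}: {score:.1f}")
```
]]></preformat>
      <p>Because each year is indexed against its own top topic, scores are comparable across years even though the number and length of papers differ.</p>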
      <p>The second metric was breadth – the number of different papers
containing a given topic. Again, we had to normalize since the
number of papers increased from 27 to 144 over the 6-year
timeframe. We normalized breadth by using percent of documents
in which the topic appeared. To remove outliers, we set a rule that
a topic must appear in at least 5% of the papers to be included in
our analysis.</p>
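      <p>A minimal sketch of the breadth metric and the 5% inclusion rule follows. The 5% threshold is taken from the text; the function names, the toy documents, and the use of simple substring matching to decide whether a paper contains a topic are assumptions made for illustration.</p>
      <preformat><![CDATA[
```python
def breadth(topic, docs):
    """Fraction of documents whose text contains the topic phrase."""
    hits = sum(1 for text in docs if topic in text)
    return hits / len(docs)


def topics_above_threshold(topics, docs, min_share=0.05):
    """Keep topics that appear in at least min_share of the documents."""
    shares = {t: breadth(t, docs) for t in topics}
    return {t: s for t, s in shares.items() if s >= min_share}


# Hypothetical corpus of 40 short documents.
docs = ["social network analysis"] * 10 + ["item difficulty"] * 29 + ["free fall"]
kept = topics_above_threshold(["social network", "item difficulty", "free fall"], docs)
print(kept)
```
]]></preformat>
      <p>In this toy corpus “free fall” appears in 1 of 40 documents (2.5%) and is dropped, while the other two topics are kept.</p>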
    </sec>
    <sec id="sec-9">
      <title>3.1 Tableau</title>
      <p>The first tool used for visualization was Tableau. Tableau’s
advanced visual features make it an ideal tool for exploring our
results.</p>
      <p>All Tableau visualizations can be found at the Blue Canary LAK
site (http://lak14.bluecanarydata.com). The visualizations
range from simple descriptive charts (like Figure 2, showing the size
and breadth of papers in the corpus) to interactive trend charts (like
Figure 3, where the user can find the top N topics from any of the
six years of the corpus).</p>
      <p>In the context of the LAK Data Challenge, Tableau was the most
useful tool used by the Blue Canary team. We did not approach
the task with a specific hypothesis to be proved or disproved.
Rather, we took the challenge more literally and asked ‘What do the
data have to say?’ Tableau helped us find answers to that question.</p>
    </sec>
    <sec id="sec-10">
      <title>3.2 Topic Browser</title>
      <p>
        A second, more detailed way to view the output is using a
web-based topic browser. Researchers have developed different tools to
accomplish this task, including TopicExplorer [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and
Topic Model Visualization Engine [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Blue Canary chose to use
Topic Model Visualization Engine (as shown in Figure 4). The
topic browser can also be found on the Blue Canary LAK site
(http://lak14.bluecanarydata.com).
This topic browser allows the user to explore the occurrences of
any topic in the context of its native paper. This browser is useful
when trying to decipher irregularities in the topic modeling output.
For example, the topic “free fall” showed significant presence in
the 2013 papers. It turns out that two different papers used a
Physical Sciences class as the backdrop for their analysis and “free
fall” was one of the course concepts.
      </p>
    </sec>
    <sec id="sec-11">
      <title>4. FINDINGS</title>
      <p>After processing the data and looking for trends, the Blue Canary
team found two things that we believe are of interest to the analytics
community.</p>
    </sec>
    <sec id="sec-12">
      <title>4.1 Topic Convergence</title>
      <p>We created a scatter plot of all topics year by year. We excluded
topics that appeared in fewer than 5% of the papers for the year and
we plotted against our two main metrics (topic frequency and
breadth).
The resulting scatter plots (as shown in Figure 5 and in the
‘Convergence’ report on the Blue Canary LAK site) show an
interesting trend. The slope of the regression line fitted to each
year’s points reflects the
homo/heterogeneity of the paper topics. A shallow slope indicates
that few topics dominate the overall corpus conversation. A steeper
slope indicates that there are more topics that are frequently
mentioned across more papers. The purpose of the regression is to
help highlight the trend. It is not meant to comment on the strength
of the fit.</p>
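      <p>The per-year regression slope can be computed with ordinary least squares, as in the sketch below. The data points are hypothetical and are not values from the corpus, and placing breadth on the x-axis and indexed frequency on the y-axis is an assumption, since the axis assignment is not stated here.</p>
      <preformat><![CDATA[
```python
def least_squares_slope(xs, ys):
    """Slope of the ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical (breadth, indexed frequency) points; not values from the corpus.
points_by_year = {
    2011: [(0.05, 0.10), (0.10, 0.20), (0.20, 0.40), (0.40, 0.80)],
    2013: [(0.05, 0.30), (0.10, 0.35), (0.20, 0.45), (0.40, 0.65)],
}
slopes = {}
for year, points in points_by_year.items():
    xs, ys = zip(*points)
    slopes[year] = least_squares_slope(xs, ys)
    print(year, round(slopes[year], 2))
```
]]></preformat>
      <p>Comparing the slope from one year to the next is what surfaces the trend; the sketch reports only the slope and, in keeping with the note above, says nothing about goodness of fit.</p>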
      <p>Looking at the patterns in the scatter plot, we see that in the first
three years of the LAK papers, the topics trended towards a more
concentrated set. The implication here is that at the start of the
conferences, papers tended to be more diverse. However, after a
few years, submissions started to address a similar set of topics.
One possible explanation of this is that after two years, the topics
become more accepted in the space and therefore get adopted/used
more frequently.</p>
      <p>One caveat is that this trend might be specific to the conference
(EDM vs. LAK). Additional work to split the data by conference
might shed more light on this trend.</p>
    </sec>
    <sec id="sec-13">
      <title>4.2 Concept Trends</title>
      <p>The most obvious output of our LDA to Turbo Topics to Concept
process is to look at the popularity of the overall concepts over time.
The ‘Concept Frequency’ report shows that starting in 2011 (when
the LAK papers were introduced to the corpus), the topics
associated with ‘Social and Networks’ tended to dominate in
popularity. This is not surprising as this concept includes topics
such as ‘interaction network’, ‘online communities’, and ‘network
structure’.</p>
      <p>A second observation about concept frequency is the consistent
presence of topics associated with ‘Cognitive Tutors’. There is a
close link between work done by researchers in this area and both
the EDM and LAK communities so it’s not too surprising to see
this outcome. What makes the presence of this concept more
striking is that there are two other concept groups about related
fields (“Item Response Theory” and “Assessment Topics”). Even
with these topics being spread across three different concept
groups, their high frequency still shows up in the reports.</p>
    </sec>
    <sec id="sec-14">
      <title>4.3 Top Topics</title>
      <p>For reference purposes, the following tables list the top 3 topics that
appeared in the corpus from 2008 to 2013. This information is also
available at the Blue Canary LAK site
(http://lak14.bluecanarydata.com) under the ‘Topic Frequency’
report tab:
‘Relative %’ in Table 1 refers to the frequency with which a topic
appears in the corpus relative to the most frequently appearing
topic.</p>
    </sec>
    <sec id="sec-15">
      <title>5. COMPARISON TO OTHER ANALYSES</title>
      <p>Blue Canary is by no means the first to apply NLP techniques to
the corpus of analytic papers in an attempt to extract meaning. Prior
LAK Data Challenge entrants have taken similar approaches.</p>
    </sec>
    <sec id="sec-16">
      <title>5.1 LAK13 Ontology Learning</title>
      <p>In the 2013 LAK Data Challenge, Zouaq et al. [6] used an ontology
learning tool to extract concepts and concept maps from the corpus.
The researchers presented the top ranked concepts from both the
EDM and LAK papers, and from the EDM and LAK abstracts. The
resulting table showed that most of the top concepts were unigrams
such as “student”, “datum”, “model”, “learner”, and “result”.
Contrasting this to Blue Canary’s Topic Modeling approach, we see
a natural progression from these unigrams to bigrams (as can be
seen in Table 1). This progression is a good example of how one
body of research can build upon previous works in order to add
more clarity for the audience.</p>
    </sec>
    <sec id="sec-17">
      <title>5.2 LAK13 Dynamic Topic Modeling</title>
      <p>Another 2013 LAK Data Challenge entrant, Derntl et al. [<xref ref-type="bibr" rid="ref6">7</xref>], used
an approach that was more similar to what Blue Canary did with
Turbo Topics. The researchers used Dynamic Topic Modeling, a
precursor to the Turbo Topics technique developed by Blei. One
key difference was that the Blue Canary work tried to make the
topics more understandable. That is, a grouping of keywords forms
a topic, but that topic needs to be something palatable to the reader.
Derntl et al. labelled their topics as an amalgam of the keywords
(e.g. students – data – courses – system). While this is descriptive
of the content, it is less relatable in context. Blue Canary’s use of
bi-grams and concept labelling helps to better bridge the context
gap.</p>
    </sec>
    <sec id="sec-18">
      <title>5.3 Google Trends</title>
      <p>As a litmus test for the topic trends, Blue Canary also looked at a
Google Trends chart of the popularity of some of the LAK/EDM
topics (http://bit.ly/P23CRn). While interesting to look at (Figure
6), this avenue doesn’t provide much insight into the trajectory of
the LAK/EDM topics. Google Trends takes its popularity metrics
from a wider swath of sources so the ratings shouldn’t be expected
to be correlated with the work from the corpus. As an example, the
topic “social network” was left off of the Google Trends search.
The popularity of the 2010 movie “The Social Network” made the
scale of that term dwarf all others.</p>
    </sec>
    <sec id="sec-19">
      <title>6. CONCLUSION</title>
      <p>Blue Canary went into the LAK Data Challenge assuming that we
could accomplish two goals. First, that we could use our
engineering expertise and analytical knowledge to efficiently
process the corpus. Second, that we could visualize the processed
results and uncover findings that would be of interest to the EDM
and LAK communities. We believe that the steps outlined in this
paper combined with the visualizations created with the output
(http://lak14.bluecanarydata.com) prove that we have successfully
accomplished our goals.</p>
      <p>Perhaps the most salient takeaway is what’s referenced in the title
of this paper. The process that Blue Canary used to analyze topics
was to deconstruct the corpus to a more atomic level and then to
reconstruct the findings into contextual parts. It is this
reconstruction that we believe has the most value. This paper builds
on the work of previous researchers who did a similar job of deconstructing
the papers. What makes this paper different, though, is that Blue
Canary reconstructed the findings to a coarser level that allows
others to better understand the topics discussed in the corpus.</p>
      <p>Blue Canary took this approach as a way to stress the fact that the
analytics must be usable by others in order for the work to have
some tangible value beyond pure research. The research furthers
the state of the art, and then the application of the research is what’s
used by institutions and businesses to help students and customers.
We reconstructed keywords into topics and concepts, and we also
created a companion web application
(http://lak14.bluecanarydata.com) that allows users to browse and
drill into the findings. This is a good example of how the research
can be extended to an applied solution that can derive value from
analytics.</p>
    </sec>
    <sec id="sec-20">
      <title>7. ACKNOWLEDGEMENTS</title>
      <p>Deepest thanks to Andy Allen and the rest of the Blue Canary team
members who contributed directly and indirectly. Papers like this
are a great example of what smart people can do when they work
collaboratively.</p>
      <p>Additional thanks to the EDM and LAK communities for
continually fostering a culture of innovation around data and
analytics in higher education.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lafferty</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Visualizing topics with multi-word expressions</article-title>
          .
          <source>arXiv preprint arXiv:0907.1013</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Taibi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Dietze</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Fostering analytics on learning analytics research: the LAK dataset</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ng</surname>
            ,
            <given-names>A. Y.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Jordan</surname>
            ,
            <given-names>M. I.</given-names>
          </string-name>
          (
          <year>2001</year>
          ).
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          (pp.
          <fpage>601</fpage>
          -
          <lpage>608</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Hinneburg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Preiss</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Schröder</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2012</year>
          ).
          <article-title>TopicExplorer: Exploring document collections with topic models</article-title>
          .
          <source>In Machine Learning and Knowledge Discovery in Databases</source>
          (pp.
          <fpage>838</fpage>
          -
          <lpage>841</lpage>
          ). Springer Berlin Heidelberg.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Chaney</surname>
            ,
            <given-names>A. J. B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Blei</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          (
          <year>2012</year>
          , March).
          <article-title>Visualizing topic models</article-title>
          .
          <source>In ICWSM</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Derntl</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Günnemann</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Klamma</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>A Dynamic Topic Model of Learning Analytics Research</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>