Ontology Learning to Analyze Research Trends in Learning Analytics Publications

Amal Zouaq, Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON, Canada, +1 613 541 6000, Ext. 6478, amal.zouaq@rmc.ca
Srećko Joksimović, School of Interactive Arts and Technologies, Simon Fraser University, Surrey, BC, Canada, +1 778 782 7474, sjoksimo@sfu.ca
Dragan Gašević, School of Computing and Information Systems, Athabasca University, Athabasca, AB, Canada, +1 604 569 8515, dgasevic@acm.org

ABSTRACT
In this paper, we show how ontology learning tools can be used to reveal (i) the central research topics that are tackled in the published literature on learning analytics and educational data mining; (ii) the relationships between these research topics; and (iii) the (dis)similarities between learning analytics and educational data mining.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing; G.2.2 [Discrete Mathematics]: Graph Theory

General Terms
Algorithms, Measurement, Experimentation

Keywords
Ontology learning, deep parsing, filtering, information retrieval, ranking algorithms, graph theoretic statistics

1. INTRODUCTION
Learning analytics is a new research discipline. Although it has attracted a considerable amount of attention in educational research and practice, debate is still very active about the scope of the discipline. The definition of learning analytics offered by the Society for Learning Analytics Research [7], which is commonly used in the literature to date, gives a general framework for the main tasks learning analytics is about. However, given the youth of the discipline, there are generally three open questions:
- What are the central research topics that are tackled in the published literature?
- What are the relationships between the central research topics?
- What are the similarities and differences between learning analytics and educational data mining?

To address the above questions, we aimed to systematically analyze the textual content available in the LAK Challenge dataset. In particular, we used a state-of-the-art ontology learning tool, OntoCmaps, that enabled the automatic (i) parsing of textual content, (ii) creation of conceptual maps based on the extracted concepts and relationships, and (iii) filtering/ranking of the most important concepts and relationships based on measures from information retrieval, graph theory, and voting theory. The concept extraction and filtering/ranking was done (i) for each edition of the two conferences and the journal special issue (from the LAK 2013 Challenge dataset) individually (i.e., LAK 2011-2012, EDM 2008-2012, and the LAK ET&S special issue), to see the trends emerging through the years; and (ii) on two subsets – one for the papers presented at the LAK conference editions and another one for the papers presented at the EDM conference editions – in order to compare the two conferences based on the concepts and relationships gauged as most important. We also performed the analysis based on (a) paper abstracts only and (b) the main body of text of the papers.

In this short report, we first describe the data analysis pipeline. This is followed by a very brief discussion of a small fragment of the results we obtained in our analysis. The complete results in CSV format are available at [8].
2. DATA ANALYSIS PIPELINE
The data analysis relies on our ontology learning tool, OntoCmaps [10]. Ontology learning from text is a multi-layer knowledge extraction task that targets the following components:

Terms and concepts: The first step consists in identifying candidate expressions in texts. These expressions are then ranked using some kind of measure (statistical metrics, graph-based metrics, etc.) to extract those that are relevant for the domain. These filtered relevant expressions are then considered "concepts" in the ontology learning community.

Taxonomy: This step identifies "is-a" links in texts, generally using patterns indicating a taxonomical link in text, such as Hearst's patterns [12], or using the inner structure of multiword expressions. For example, a "carnivorous plant" can be considered a "plant" just by looking at the syntactic structure "Adjective noun" of the expression.

Conceptual relationships: This step uses various techniques (patterns, machine learning, etc.) to identify any kind of transversal relations, with a domain and a range.

Axioms: Finally, axioms here mean defined classes, or rules extracted from texts.

OntoCmaps requires a domain corpus as input. As such, the LAK and EDM proceedings (the LAK dataset [13]) were an appropriate set of texts to test the ontology learning process. OntoCmaps relies on three main phases to learn a domain ontology: 1) the extraction phase, which performs a deep semantic analysis based on dependency patterns; 2) the integration phase, which builds concept maps composed of terms and labeled relationships and uses basic disambiguation techniques (these concept maps form a graph); and finally 3) the filtering phase, where various metrics rank the items (terms and relationships) in the concept maps.

2.1 The Extraction Phase
In the extraction phase, OntoCmaps is based on a hierarchy of syntactic patterns. Each pattern describes a set of syntactic relationships that permit the extraction of a "semantic representation". OntoCmaps does not rely on any predefined domain knowledge. It uses two NLP tools to obtain the syntactic representations: the Stanford Parser along with its dependency module [2] and the Stanford part-of-speech (POS) tagger [6]. Given a sentence, the Stanford parser generates syntactic dependency relations between each pair of related words of the sentence. The POS tagger identifies the words' parts of speech. Based on these two inputs, OntoCmaps creates a pattern syntactic format that enriches the words in each dependency relation with their parts of speech. This enriched representation is then used as input to a pattern recognition task. A recognized pattern fires a rule that applies various transformations on the syntactic representation to obtain a "semantic representation", in the form of expressions, triples, or sets of triples. The patterns are divided into conceptual patterns and hierarchical patterns. Hierarchical patterns concentrate on the extraction of taxonomical links, following the work of [12], but based on the dependency formalism. Conceptual patterns identify the main structures of the language that can be transformed into triples useful for the extraction of conceptual relations. They are organized into a hierarchy from the most detailed patterns (containing the biggest number of dependency relationships) to the least detailed. The extraction phase targets deeper levels of the hierarchy first, to avoid extracting too abstract or incomplete representations. For instance, if the pattern "nsubj-dobj-xcomp" exists in the text, the extractor should fire it instead of firing one of its higher-level counterparts "nsubj-dobj" and "nsubj-xcomp", which contain only a subset of the syntactic relationships of interest. If a pattern is instantiated, then all its parents in the hierarchy are disregarded.
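To make the pattern mechanism concrete, the following minimal sketch (ours, not the OntoCmaps code) shows how a low-level pattern such as "nsubj-dobj" could be matched against dependency tuples to produce a semantic triple. The dependency tuples are hand-written stand-ins for Stanford parser output, and the function name and data layout are our own assumptions.

```python
# A minimal sketch of dependency-pattern firing, assuming parser output is
# represented as (relation, head, dependent) tuples. This is an illustration,
# not the OntoCmaps implementation.

# Hand-written dependencies for: "Carnivorous plants eat insects."
DEPS = [
    ("nsubj", "eat", "carnivorous_plant"),  # syntactic subject of the verb
    ("dobj", "eat", "insect"),              # direct object of the verb
]

def fire_nsubj_dobj(deps):
    """Fire the 'nsubj-dobj' pattern: a verb that has both a subject and a
    direct object yields a (subject, verb, object) semantic triple."""
    subjects = {head: dep for rel, head, dep in deps if rel == "nsubj"}
    objects = {head: dep for rel, head, dep in deps if rel == "dobj"}
    return [(subjects[v], v, objects[v]) for v in subjects if v in objects]

print(fire_nsubj_dobj(DEPS))  # [('carnivorous_plant', 'eat', 'insect')]
```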
2.2 The Integration Phase
In the integration phase, all the extracted relationships are gathered into concept maps. Some basic term disambiguation tasks are performed at this level, mainly: i) lemmatization, which considers singular, plural, and other forms of the same terms or relationships as referring to a single concept or relationship; ii) basic synonym detection, based on the abbreviation relations that are generated by the Stanford parser; and iii) a kind of co-reference resolution that is built into some of the patterns and that allows for the creation of semantic links between terms in a sentence, even if no direct dependency links existed in the original dependency representation. For example, in the sentence "carnivorous plants are organisms which eat insects", the co-reference resolution creates a relation "eat" between the term "carnivorous plants" and the term "insects", while the grammatical representation links the term "plants" to the term "insects".

All these operations result in concept maps around various terms. For example, if there were a number of statements around the term "carnivorous plants" in the texts, it is likely that a concept map around "carnivorous plants" would be created. This process is repeated for all identified terms and relationships and results in an aggregation of concept maps through links between the various concept maps, thus constituting a graph, with terms representing nodes and relationships representing edges.

2.3 The Filtering Phase
The third and last phase for learning the domain ontology is the filtering phase, which aims at ranking the items in the concept maps (domain terms, taxonomical links, and conceptual links).

2.3.1 Concept Filtering
A number of metrics from graph theory and from information retrieval are used to identify relevant terms. The graph-based metrics were computed using the JUNG framework [3]. These metrics include (see the sketch after this list):
• The degree centrality of a node, which counts the number of edges from and to a given node;
• The betweenness centrality, which assigns each node a value derived from the number of shortest paths that pass through it;
• The HITS algorithm, which ranks nodes according to the importance of hubs and authorities [5], resulting in two measures, HITS-Hubs and HITS-Authority;
• The PageRank of a node [1];
• Standard information retrieval metrics, mainly term frequency (TF) and TF-IDF.
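As a rough illustration of how the concept-map graph can be scored with these metrics, the sketch below uses the Python networkx library on a few invented triples. OntoCmaps itself computed the metrics with JUNG in Java [3], so this is an analogy under our own assumptions rather than the actual implementation.

```python
import networkx as nx

# Invented (term, relation, term) triples standing in for the output of the
# integration phase; terms become nodes and relationships become edges.
TRIPLES = [
    ("student", "generate", "datum"),
    ("model", "can be used to inform", "teacher"),
    ("model", "can detect", "student"),
    ("tutoring_system", "is a", "system"),
    ("student", "study with", "tutoring_system"),
]

G = nx.DiGraph()
for subj, rel, obj in TRIPLES:
    G.add_edge(subj, obj, label=rel)

degree = nx.degree_centrality(G)            # edges from and to a node
betweenness = nx.betweenness_centrality(G)  # shortest paths through a node
hubs, authorities = nx.hits(G)              # HITS-Hubs and HITS-Authority
pagerank = nx.pagerank(G)                   # PageRank of a node

for term in G.nodes:
    print(f"{term}: degree={degree[term]:.2f} "
          f"betweenness={betweenness[term]:.2f} "
          f"hubs={hubs[term]:.2f} pagerank={pagerank[term]:.2f}")
```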
Finally, using the graph-based metrics, we defined a number of voting schemes (VS) with the aim of improving the precision of the filtering. All the VS relied on three metrics that were identified as being among the best metrics in previous experiments [10], [11]: Degree, Betweenness, and HITS-Hubs. The VS include (see the sketch after this list):
• The majority voting scheme, which recognizes a term as an important one if it is chosen by at least k metrics out of n, with k > n/2.
• The Borda count voting scheme, which assigns a "rank" to each candidate. A candidate who is ranked first receives n points (n = the number of domain terms to be ranked), the second n-1, the third n-2, and so on. The "score" of a term over all metrics is equal to the sum of the points obtained by the term in each metric.
• The Nauru voting scheme, which is based on the sum of the inverted ranks of each term in each metric. It is used to put more emphasis on higher ranks.
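The sketch below gives one plausible reading of the three schemes over invented per-metric rankings; the cut-off used to decide that a metric "chooses" a term, and the tie handling, are our assumptions rather than the OntoCmaps settings.

```python
# A minimal sketch of the three voting schemes, assuming each metric produces
# a ranked list of terms (most important first). Data is invented.

RANKINGS = {
    "degree":      ["student", "model", "datum", "teacher"],
    "betweenness": ["model", "student", "teacher", "datum"],
    "hits_hubs":   ["student", "datum", "model", "teacher"],
}

def majority(rankings, top_k=2):
    """Keep a term if more than half of the metrics rank it in their top k."""
    counts = {}
    for ranked in rankings.values():
        for term in ranked[:top_k]:
            counts[term] = counts.get(term, 0) + 1
    return [t for t, c in counts.items() if c > len(rankings) / 2]

def borda(rankings):
    """Rank 1 earns n points, rank 2 earns n-1, ...; scores sum over metrics."""
    scores = {}
    for ranked in rankings.values():
        n = len(ranked)
        for i, term in enumerate(ranked):
            scores[term] = scores.get(term, 0) + (n - i)
    return scores

def nauru(rankings):
    """Sum of inverted ranks (1/1, 1/2, 1/3, ...) to emphasize top ranks."""
    scores = {}
    for ranked in rankings.values():
        for i, term in enumerate(ranked, start=1):
            scores[term] = scores.get(term, 0) + 1.0 / i
    return scores

print(majority(RANKINGS))  # ['student', 'model']
print(borda(RANKINGS))
print(nauru(RANKINGS))
```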
Table 1 shows the top-ranked concepts based on the majority voting scheme. All the base metrics (Betweenness, PageRank, Degree, etc.) and voting schemes have been computed and can be found at [8]. The website [8] also features a visualization of the extracted data based on the obtained concept maps. The visualization is performed per venue (EDM/LAK/ETS-SI), per corpus (abstracts only or main texts), and per year (2008-2012).

2.3.2 Relationship Filtering
Similarly, a number of metrics were used to identify important relationships.

The first measure considers as important all the relationships that occur between important terms (determined through the voting schemes). This constitutes our voting scheme for relationships, which was based on the results of the majority voting scheme for concepts.

The second measure ranks relationships based on edge betweenness centrality, which is a measure of the importance of edges based on the number of shortest paths that contain them.

The third measure assigns co-occurrence frequency weights based on the Dice coefficient [9], a standard measure for semantic relatedness (the second and third measures are illustrated in the sketch below).

Table 2 shows an excerpt of the top-ranked relationships based on the majority voting scheme. Contrary to standard named entity extractors, an important aspect of using ontology learning is the ability to extract relationships as well, thus obtaining not only topics but also the relationships (taxonomical and conceptual) between these topics. A better approach would mix the two and combine topic extraction using named entity extractors, linked data semantic annotators, and ontology learning.
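The sketch below illustrates the second and third measures under stated assumptions: edge betweenness is computed with networkx on an invented graph, and the Dice coefficient is taken as dice(x, y) = 2*f(x, y) / (f(x) + f(y)) over invented sentence-level co-occurrence counts.

```python
import networkx as nx
from itertools import combinations

# Invented concept graph; terms are nodes, relationships are edges.
G = nx.Graph([("student", "datum"), ("datum", "model"),
              ("model", "teacher"), ("student", "model")])

# Second measure: edge betweenness centrality, i.e., the (normalized) number
# of shortest paths that contain a given edge.
for edge, score in nx.edge_betweenness_centrality(G).items():
    print(edge, round(score, 2))

# Third measure: Dice coefficient over co-occurrence of terms in sentences.
SENTENCES = [{"student", "datum"}, {"student", "model"},
             {"datum", "model"}, {"student", "datum", "model"}]

def dice(x, y, sentences):
    fx = sum(x in s for s in sentences)              # f(x)
    fy = sum(y in s for s in sentences)              # f(y)
    fxy = sum(x in s and y in s for s in sentences)  # f(x, y)
    return 2 * fxy / (fx + fy) if fx + fy else 0.0

for x, y in combinations(["student", "datum", "model"], 2):
    print(x, y, round(dice(x, y, SENTENCES), 2))
```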
Table 1. Top-ranked concepts based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset.

LAK (abstracts) | LAK (paper body) | EDM (abstracts) | EDM (paper body)
student (0.50) | student (0.75) | student (0.75) | student (0.75)
datum (0.45) | datum (0.20) | model (0.38) | model (0.23)
informal_learn (0.31) | learner (0.15) | datum (0.37) | datum (0.19)
learn (0.31) | course (0.15) | method (0.19) | skill (0.09)
teacher (0.29) | analysis (0.12) | paper (0.16) | problem (0.08)
model (0.27) | activity (0.11) | system (0.13) | result (0.06)
learning_analytics (0.26) | user (0.10) | result (0.12) | method (0.06)
learner (0.25) | tool (0.10) | approach (0.11) | parameter (0.05)
social_factor (0.21) | learn (0.09) | skill (0.08) | question (0.05)
social_learn (0.19) | analytics (0.07) | analysis (0.07) | performance (0.05)
effective_learn (0.19) | intelligent_tutoring_system (0.07) | group (0.07) | system (0.05)
group_learn (0.17) | system (0.07) | behavior (0.07) | approach (0.04)
knowledge_professional (0.17) | teacher (0.06) | tool (0.07) | example (0.04)
Lak (0.17) | instructor (0.06) | work (0.06) | feature (0.04)
knowledge (0.17) | network (0.06) | Researcher (0.06) | item (0.04)

Table 2. Top-ranked relationships based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset. Each cell in the table contains a concept–relationship–concept triplet.

LAK (abstracts) | LAK (paper body) | EDM (abstracts) | EDM (paper body)
course–being recorded as well as to–student (1) | learner–build–knowledge (1) | datum–mining–method (1) | model–fit–student (1)
datum–break ability to educate effectively–student (0.60) | datum–are collected in–paper (0.95) | datum–obtained from–learner (0.96) | method–linguistics far from–student (0.81)
skill–will have been covered by–student (0.67) | learning_analytics–important step for–teachers_of_tomorrow (0.78) | system–addresses individually–student (0.45) | model–are trained over–datum (0.70)
teachers_of_tomorrow–is a–teacher (0.77) | analysis–have since been moved as–student (0.37) | problem–assign for–student (0.67) | system–provides–student (0.61)
example–parameterization by–student (0.63) | tool–incorporate functionality to access–datum (0.65) | network–impacting–student (0.31) | student–are represented by–model (0.56)
process–finally should promote reflection on–instructor (0.29) | model–can be used to inform–student (0.64) | question–were based–student (0.62) | model–can detect–student (0.50)
student–provides useful evidence to–model (0.60) | datum–obtained from–instructor (0.62) | tool–identify–student (0.27) | datum–derived from–student (0.43)
datum–may be presented to–learner (0.25) | goal–has been investigated by–researcher (0.42) | step–requires–student (0.57) | learner–generating–datum (0.58)
performance–dependent upon–student (0.56) | student–accessing–online_discussion_forum (0.56) | activity–conducted by–user (0.25) | tutoring_system–is a–system (0.40)
model–can be used to inform–teacher (0.51) | group–will contain–student (0.25) | student–study with–intelligent_tutoring_system (0.39) | accuracy–varies across–student (0.48)
student–flock to–online_service (0.48) | environment–capture–datum (0.24) | skill–studied in–tutoring_system (0.38) | student–is guessing–result (0.48)
datum–are combined to calculate–likelihood_of_student (0.45) | model–highly accurate on–student (0.22) | intelligent_tutoring_system–are informed by–datum (0.32) | student–collect–datum (0.45)
average–miss–student (0.21) | analysis–reveals–unexpected_result (0.30) | word–uttered by–student (0.44) | instructor–guide–student (0.39)
learn–integral to–success_of_community (0.37) | role–are imposed on–student (0.21) | datum–were used to build–model (0.44) | unexpected_result–is a–result (0.30)
likelihood_of_student–is related to–student (0.36) | information–useful for–student (0.20) | collaborative–learning–interactions_of_student (0.29) | skill–are included in–model (0.41)

We can also notice that we were not always successful in extracting meaningful relationship labels from this corpus. One possible explanation is the type of the texts (publications) and the amount of noise in them. In fact, OntoCmaps is made to run on clean plain sentences that describe a domain of interest and define it. Parts of research papers such as figure captions, formulas, and references represent noise for OntoCmaps. Additional cleaning of the input texts would be necessary. However, even when the labels were not meaningful, the existence of a link between two concepts (an unlabeled relationship) still shed some light on the domain (see Section 3).
3. FINDINGS
In this section, we present only the results for the top 15 ranked concepts and relationships according to the majority voting scheme (Betweenness, Degree, and HITS-Hubs), as shown in Tables 1-2. (N.B. As can be noticed in the tables, the majority of the terms are lemmatized, that is, we show only their lemma or root – for example, informal_learn for informal learning, or datum for data. In a few cases, such as learning_analytics, the lemmatizer returned the expression itself.) First, we could not possibly include the results of all the metrics we calculated in our experiment (those results are available at [8]). Second, we selected the metrics which were proven to be most accurate in our previous research [10], [11]. Finally, it should be noted that the purpose of our experiment was not to evaluate the effectiveness of individual metrics, but rather to test whether ontology learning technology can shed some light on the questions posed in the introduction, which are of relevance to the LAK 2013 Data Challenge.

Concepts reported in Table 1 reveal that the papers of both the LAK and EDM conferences have students, data, and models as shared concepts. However, it is clear that LAK papers also focus on teachers/instructors, informal learning, and social, networked, and group learning. On the other hand, EDM papers focus on (data mining) methods and approaches, intelligent tutoring systems, feature (extraction), and various types of parameters.

Figure 1. Two conceptual maps extracted from the abstracts of the papers presented at the LAK conference.

Relationships reported in Table 2 further corroborate the observation that the LAK papers are more focused on teachers, in order to empower them with learning analytics and to help them guide students. Moreover, there is an emphasis on (promoting) reflection of both students and instructors. Various aspects of social learning, such as role playing and the impact of communities, appear to be highly popular topics in the LAK papers. On the other hand, EDM papers are much more focused on intelligent tutoring systems, the accuracy of different types of (predictive) models, and revealing unexpected patterns. Certainly, the focus on data is shared by both the LAK and EDM communities, but LAK also seems to be focused on data collected by and for instructors, not only for students. This probably indicates a trend that the LAK community has so far acknowledged the role of instructors in the learning process and aimed at supporting them as much as learners. The EDM community has, however, focused more on measuring and predicting specific types of skills. This is consistent with its focus on intelligent tutoring systems, in which automated assessment of learners' skills is of paramount importance.

Finally, we were also able to visualize the extracted conceptual graphs. In Figure 1, we show the relationships of the concept learning analytics as extracted from the abstracts of the papers presented at the LAK conference. This figure further corroborates the earlier observations by indicating that learning analytics is an integral part of the teaching profession, is an important step for teachers of tomorrow and learners, and offers a new approach. The figure also reveals the role of learning analytics in promoting qualitative understanding of the context of information. Learning analytics is also (strongly) related to discourse analytics, which seems to be consistent with the strong emphasis of learning analytics on social learning, and which is further confirmed by the extracted relationships of discourse analytics with sense-making, argumentation, and social skills, all of which are recognized as important for the modern society.

In future work, we plan to further analyze the research trends over the years for the LAK and EDM communities. Another of our goals is to compare the extractions of an ontology learning system such as OntoCmaps with those of linked data semantic annotators such as DBPedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/) or Alchemy (http://www.alchemyapi.com/).

Figure 2. Visualization of the top 30 ranked concepts based on the majority voting scheme, extracted from the abstracts of the LAK 2013 Challenge dataset.

4. CONCLUSION
Of course, ontology learning tools are not perfectly accurate, and thus, a few "strange" concepts and relationships are shown in our tables. An opportunity lies, however, in using such ontology learning tools as starting points for the concept map development of the learning analytics domain, which can then be refined through crowdsourcing (e.g., in a Wiki-like manner).

Funnily, our text analysis tool inferred that EDM is an abbreviation of learning analytics. This probably comes from the open debate, reflected in the analyzed papers, about the relationships between learning analytics and educational data mining. We hope that this paper sheds some light on the (dis)similarities of the two areas. We also hope that our analysis of the LAK 2013 Data Challenge dataset with the ontology learning tools indicated a high potential of this type of analytics to help the research community of a new research discipline define itself and its relationships with the closest communities. More interesting results are available on our website [8]. For example, those results allow for (i) comparing the results of different concept/relationship measures and (ii) observing the chronological trends emerging throughout the years of the individual editions of both conferences. An example of one of the visualizations available at [8] is presented in Figure 2.

5. REFERENCES
[1] Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine, Stanford University.
[2] De Marneffe, M.-C., MacCartney, B. and Manning, C.D. (2006). Generating Typed Dependency Parses from Phrase Structure Parses. In Proc. of LREC, pp. 449-454, ELRA.
[3] JUNG (2013). Last retrieved from http://jung.sourceforge.net/
[4] Klein, D. and Manning, C.D. (2003). Accurate Unlexicalized Parsing. In Proc. of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
[5] Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment, Journal of the ACM 46(5): 604-632, ACM.
[6] Toutanova, K., Klein, D., Manning, C.D. & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL, pp. 252-259.
[7] http://www.solaresearch.org/mission/about/
[8] http://lakchallenge.co.nf
[9] Van Rijsbergen, C.J. (1979). Information Retrieval. London: Butterworths.
[10] Zouaq, A., Gasevic, D. and Hatala, M. (2011). Towards Open Ontology Learning and Filtering. Information Systems, 36(7): 1064-1081.
[11] Zouaq, A., Gasevic, D. and Hatala, M. (2012). Voting Theory for Concept Detection. In Proc. of the 9th Extended Semantic Web Conference (ESWC 2012), pp. 315-329.
[12] Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th Conference on Computational Linguistics – Vol. 2 (COLING '92), pp. 539-545.
[13] Taibi, D. and Dietze, S. (2013). Fostering analytics on learning analytics research: the LAK dataset. Technical Report, 03/2013. URL: http://resources.linkededucation.org/2013/03/lak-dataset-taibi.pdf