Ontology Learning to Analyze Research Trends in Learning Analytics Publications

Amal Zouaq, Department of Mathematics and Computer Science, Royal Military College of Canada, Kingston, ON, Canada, +1 613 541 6000, Ext. 6478, amal.zouaq@rmc.ca
Srećko Joksimović, School of Interactive Arts and Technologies, Simon Fraser University, Surrey, BC, Canada, +1 778 782 7474, sjoksimo@sfu.ca
Dragan Gašević, School of Computing and Information Systems, Athabasca University, Athabasca, AB, Canada, +1 604 569 8515, dgasevic@acm.org

ABSTRACT
In this paper, we show how ontology learning tools can be used to reveal (i) the central research topics that are tackled in the published literature on learning analytics and educational data mining; (ii) the relationships between these research topics; and (iii) the (dis)similarities between learning analytics and educational data mining.

Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing; G.2.2 [Discrete Mathematics]: Graph Theory

General Terms
Algorithms, Measurement, Experimentation

Keywords
Ontology learning, deep parsing, filtering, information retrieval, ranking algorithms, graph theoretic statistics

1. INTRODUCTION
Learning analytics is a new research discipline. Although it has attracted a considerable amount of attention in educational research and practice, debate is still very active about the scope of the discipline. The definition of learning analytics offered by the Society for Learning Analytics Research [7], which is commonly used in the literature to date, gives a general framework for the main tasks learning analytics is about. However, given the youth of the discipline, there are generally three open questions:
- What are the central research topics that are tackled in the published literature?
- What are the relationships between the central research topics?
- What are the similarities and differences between learning analytics and educational data mining?

To address the above questions, we aimed to systematically analyze the textual content available in the LAK Challenge dataset. In particular, we used a state-of-the-art ontology learning tool, OntoCmaps, that enabled the automatic (i) parsing of textual content, (ii) creation of conceptual maps based on the extracted concepts and relationships, and (iii) filtering/ranking of the most important concepts and relationships based on measures from information retrieval, graph theory, and voting theory. The concept extraction and filtering/ranking was done (i) for each edition of the two conferences and the journal special issue (from the LAK 2013 Challenge dataset) individually (i.e., LAK 2011-2012, EDM 2008-2012, and the LAK ET&S special issue), to see the trends emerging through the years; and (ii) on two subsets – one for the papers presented at the LAK conference editions and another one for the papers presented at the EDM conference editions – in order to compare the two conferences based on the concepts and relationships gauged as most important. We also performed the analysis based on (a) paper abstracts only and (b) the main body of text of the papers.

In this short report, we first describe the data analysis pipeline. This is followed by a very brief discussion of a small fragment of the results we obtained in our analysis. The complete results in CSV format are available at [8].
2. DATA ANALYSIS PIPELINE
The data analysis relies on our ontology learning tool, OntoCmaps [10]. Ontology learning from text is a multi-layer knowledge extraction task that targets the following components:

Terms and concepts: The first step consists in identifying candidate expressions in texts. These expressions are then ranked using some kind of measure (statistical metrics, graph-based metrics, etc.) to extract those that are relevant for the domain. These filtered relevant expressions are then considered "concepts" in the ontology learning community.

Taxonomy: This step identifies "is-a" links in texts, generally using patterns indicating a taxonomical link in text, such as Hearst's patterns [12], or using the inner structure of multiword expressions. For example, a "carnivorous plant" can be considered a "plant" just by looking at the syntactic structure "Adjective noun" of the expression.

Conceptual relationships: This step uses various techniques (patterns, machine learning, etc.) to identify any kind of transversal relations, with a domain and a range.

Axioms: Finally, axioms here mean defined classes, or rules extracted from texts.

OntoCmaps requires a domain corpus as input. As such, the LAK and EDM proceedings (the LAK dataset [13]) were an appropriate set of texts to test the ontology learning process. OntoCmaps relies on three main phases to learn a domain ontology: 1) the extraction phase, which performs a deep semantic analysis based on dependency patterns; 2) the integration phase, which builds concept maps composed of terms and labeled relationships and uses basic disambiguation techniques (these concept maps form a graph); and finally 3) the filtering phase, where various metrics rank the items (terms and relationships) in the concept maps.

2.1 The Extraction Phase
In the extraction phase, OntoCmaps is based on a hierarchy of syntactic patterns. Each pattern describes a set of syntactic relationships that permit the extraction of a "semantic representation". OntoCmaps does not rely on any predefined domain knowledge. It uses two NLP tools to obtain the syntactic representations: the Stanford Parser along with its dependency module [2] and the Stanford part-of-speech (POS) tagger [6]. Given a sentence, the Stanford parser generates syntactic dependency relations between each pair of related words of the sentence. The POS tagger identifies the words' parts of speech. Based on these two inputs, OntoCmaps creates a pattern syntactic format that enriches the words in each dependency relation with their parts of speech. This enriched representation is then used as input to a pattern recognition task. A recognized pattern fires a rule that applies various transformations on the syntactic representation to obtain a "semantic representation", in the form of expressions, triples, or sets of triples. The patterns are divided into conceptual patterns and hierarchical patterns. Hierarchical patterns concentrate on the extraction of taxonomical links, following the work of [12], but based on the dependency formalism. Conceptual patterns identify the main structures of the language that can be transformed into triples useful for the extraction of conceptual relations. They are organized into a hierarchy from the most detailed patterns (containing the biggest number of dependency relationships) to the least detailed. The extraction phase targets deeper levels of the hierarchy first, to avoid extracting too abstract or incomplete representations. For instance, if the pattern "nsubj-dobj-xcomp" exists in the text, the extractor should fire it instead of firing one of its higher-level counterparts "nsubj-dobj" and "nsubj-xcomp", which contain only a subset of the syntactic relationships of interest. If a pattern is instantiated, then all its parents in the hierarchy are disregarded.
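To make the pattern mechanism concrete, the following minimal sketch (ours, not the OntoCmaps code) shows how a low-level pattern such as "nsubj-dobj" could be matched against dependency tuples to produce a semantic triple. The dependency tuples are hand-written stand-ins for Stanford parser output, and the function name and data layout are our own assumptions.

```python
# A minimal sketch of dependency-pattern firing, assuming parser output is
# represented as (relation, head, dependent) tuples. This is an illustration,
# not the OntoCmaps implementation.

# Hand-written dependencies for: "Carnivorous plants eat insects."
DEPS = [
    ("nsubj", "eat", "carnivorous_plant"),  # syntactic subject of the verb
    ("dobj", "eat", "insect"),              # direct object of the verb
]

def fire_nsubj_dobj(deps):
    """Fire the 'nsubj-dobj' pattern: a verb that has both a subject and a
    direct object yields a (subject, verb, object) semantic triple."""
    subjects = {head: dep for rel, head, dep in deps if rel == "nsubj"}
    objects = {head: dep for rel, head, dep in deps if rel == "dobj"}
    return [(subjects[v], v, objects[v]) for v in subjects if v in objects]

print(fire_nsubj_dobj(DEPS))  # [('carnivorous_plant', 'eat', 'insect')]
```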
2.2 The Integration Phase
In the integration phase, all the extracted relationships are gathered into concept maps. Some basic term disambiguation tasks are performed at this level, mainly: i) lemmatization, which considers singular, plural, and other forms of the same terms or relationships as referring to a single concept or relationship; ii) basic synonym detection, based on the abbreviation relations that are generated by the Stanford parser; and iii) a kind of co-reference resolution that is built into some of the patterns and that allows for the creation of semantic links between terms in a sentence, even if no direct dependency links existed in the original dependency representation. For example, in the sentence "carnivorous plants are organisms which eat insects", the co-reference resolution creates a relation "eat" between the term "carnivorous plants" and the term "insects", while the grammatical representation links the term "plants" to the term "insects".

All these operations result in concept maps around various terms. For example, if there were a number of statements around the term "carnivorous plants" in the texts, it is likely that a concept map around "carnivorous plants" would be created. This process is repeated for all identified terms and relationships and results in an aggregation of concept maps through links between the various concept maps, thus constituting a graph, with terms representing nodes and relationships representing edges.

2.3 The Filtering Phase
The third and last phase for learning the domain ontology is the filtering phase, which aims at ranking the items in the concept maps (domain terms, taxonomical links, and conceptual links).

2.3.1 Concept Filtering
A number of metrics from graph theory and from information retrieval are used to identify relevant terms. The graph-based metrics were computed using the JUNG framework [3]. These metrics include (see the sketch after this list):
• The degree centrality of a node, which counts the number of edges from and to a given node;
• The betweenness centrality, which assigns each node a value derived from the number of shortest paths that pass through it;
• The HITS algorithm, which ranks nodes according to the importance of hubs and authorities [5], resulting in two measures, HITS-Hubs and HITS-Authority;
• The PageRank of a node [1];
• Standard information retrieval metrics, mainly term frequency (TF) and TF-IDF.
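As a rough illustration of how the concept-map graph can be scored with these metrics, the sketch below uses the Python networkx library on a few invented triples. OntoCmaps itself computed the metrics with JUNG in Java [3], so this is an analogy under our own assumptions rather than the actual implementation.

```python
import networkx as nx

# Invented (term, relation, term) triples standing in for the output of the
# integration phase; terms become nodes and relationships become edges.
TRIPLES = [
    ("student", "generate", "datum"),
    ("model", "can be used to inform", "teacher"),
    ("model", "can detect", "student"),
    ("tutoring_system", "is a", "system"),
    ("student", "study with", "tutoring_system"),
]

G = nx.DiGraph()
for subj, rel, obj in TRIPLES:
    G.add_edge(subj, obj, label=rel)

degree = nx.degree_centrality(G)            # edges from and to a node
betweenness = nx.betweenness_centrality(G)  # shortest paths through a node
hubs, authorities = nx.hits(G)              # HITS-Hubs and HITS-Authority
pagerank = nx.pagerank(G)                   # PageRank of a node

for term in G.nodes:
    print(f"{term}: degree={degree[term]:.2f} "
          f"betweenness={betweenness[term]:.2f} "
          f"hubs={hubs[term]:.2f} pagerank={pagerank[term]:.2f}")
```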
Finally, using the graph-based metrics, we defined a number of voting schemes (VS) with the aim of improving the precision of the filtering. All the VS relied on three metrics that were identified as being among the best metrics in previous experiments [10], [11]: Degree, Betweenness, and HITS-Hubs. The VS include (see the sketch after this list):
• The majority voting scheme, which recognizes a term as an important one if it is chosen by at least k metrics out of n, with k > n/2.
• The Borda count voting scheme, which assigns a "rank" to each candidate. A candidate who is ranked first receives n points (n = the number of domain terms to be ranked), the second n-1, the third n-2, and so on. The "score" of a term over all metrics is equal to the sum of the points obtained by the term in each metric.
• The Nauru voting scheme, which is based on the sum of the inverted ranks of each term in each metric. It is used to put more emphasis on higher ranks.
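The sketch below gives one plausible reading of the three schemes over invented per-metric rankings; the cut-off used to decide that a metric "chooses" a term, and the tie handling, are our assumptions rather than the OntoCmaps settings.

```python
# A minimal sketch of the three voting schemes, assuming each metric produces
# a ranked list of terms (most important first). Data is invented.

RANKINGS = {
    "degree":      ["student", "model", "datum", "teacher"],
    "betweenness": ["model", "student", "teacher", "datum"],
    "hits_hubs":   ["student", "datum", "model", "teacher"],
}

def majority(rankings, top_k=2):
    """Keep a term if more than half of the metrics rank it in their top k."""
    counts = {}
    for ranked in rankings.values():
        for term in ranked[:top_k]:
            counts[term] = counts.get(term, 0) + 1
    return [t for t, c in counts.items() if c > len(rankings) / 2]

def borda(rankings):
    """Rank 1 earns n points, rank 2 earns n-1, ...; scores sum over metrics."""
    scores = {}
    for ranked in rankings.values():
        n = len(ranked)
        for i, term in enumerate(ranked):
            scores[term] = scores.get(term, 0) + (n - i)
    return scores

def nauru(rankings):
    """Sum of inverted ranks (1/1, 1/2, 1/3, ...) to emphasize top ranks."""
    scores = {}
    for ranked in rankings.values():
        for i, term in enumerate(ranked, start=1):
            scores[term] = scores.get(term, 0) + 1.0 / i
    return scores

print(majority(RANKINGS))  # ['student', 'model']
print(borda(RANKINGS))
print(nauru(RANKINGS))
```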
Table 1 shows the top-ranked concepts based on the majority voting scheme. All the base metrics (Betweenness, PageRank, Degree, etc.) and voting schemes have been computed and can be found at [8]. The website [8] also features a visualization of the extracted data based on the obtained concept maps. The visualization is performed per venue (EDM/LAK/ETS-SI), per corpus (abstracts only or main texts), and per year (2008-2012).

2.3.2 Relationship Filtering
Similarly, a number of metrics were used to identify important relationships.

The first measure considers as important all the relationships that occur between important terms (determined through the voting schemes). This constitutes our voting scheme for relationships, which was based on the results of the majority voting scheme for concepts.

The second measure ranks relationships based on edge betweenness centrality, which is a measure of the importance of edges based on the number of shortest paths that contain them.

The third measure assigns co-occurrence frequency weights based on the Dice coefficient [9], a standard measure for semantic relatedness (the second and third measures are illustrated in the sketch below).

Table 2 shows an excerpt of the top-ranked relationships based on the majority voting scheme. Contrary to standard named entity extractors, an important aspect of using ontology learning is the ability to extract relationships as well, thus obtaining not only topics but also the relationships (taxonomical and conceptual) between these topics. A better approach would mix the two and combine topic extraction using named entity extractors, linked data semantic annotators, and ontology learning.
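The sketch below illustrates the second and third measures under stated assumptions: edge betweenness is computed with networkx on an invented graph, and the Dice coefficient is taken as dice(x, y) = 2*f(x, y) / (f(x) + f(y)) over invented sentence-level co-occurrence counts.

```python
import networkx as nx
from itertools import combinations

# Invented concept graph; terms are nodes, relationships are edges.
G = nx.Graph([("student", "datum"), ("datum", "model"),
              ("model", "teacher"), ("student", "model")])

# Second measure: edge betweenness centrality, i.e., the (normalized) number
# of shortest paths that contain a given edge.
for edge, score in nx.edge_betweenness_centrality(G).items():
    print(edge, round(score, 2))

# Third measure: Dice coefficient over co-occurrence of terms in sentences.
SENTENCES = [{"student", "datum"}, {"student", "model"},
             {"datum", "model"}, {"student", "datum", "model"}]

def dice(x, y, sentences):
    fx = sum(x in s for s in sentences)              # f(x)
    fy = sum(y in s for s in sentences)              # f(y)
    fxy = sum(x in s and y in s for s in sentences)  # f(x, y)
    return 2 * fxy / (fx + fy) if fx + fy else 0.0

for x, y in combinations(["student", "datum", "model"], 2):
    print(x, y, round(dice(x, y, SENTENCES), 2))
```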
Table 1. Top-ranked concepts based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset.

LAK (abstracts) | LAK (paper body) | EDM (abstracts) | EDM (paper body)
student (0.50) | student (0.75) | student (0.75) | student (0.75)
datum (0.45) | datum (0.20) | model (0.38) | model (0.23)
informal_learn (0.31) | learner (0.15) | datum (0.37) | datum (0.19)
learn (0.31) | course (0.15) | method (0.19) | skill (0.09)
teacher (0.29) | analysis (0.12) | paper (0.16) | problem (0.08)
model (0.27) | activity (0.11) | system (0.13) | result (0.06)
learning_analytics (0.26) | user (0.10) | result (0.12) | method (0.06)
learner (0.25) | tool (0.10) | approach (0.11) | parameter (0.05)
social_factor (0.21) | learn (0.09) | skill (0.08) | question (0.05)
social_learn (0.19) | analytics (0.07) | analysis (0.07) | performance (0.05)
effective_learn (0.19) | intelligent_tutoring_system (0.07) | group (0.07) | system (0.05)
group_learn (0.17) | system (0.07) | behavior (0.07) | approach (0.04)
knowledge_professional (0.17) | teacher (0.06) | tool (0.07) | example (0.04)
Lak (0.17) | instructor (0.06) | work (0.06) | feature (0.04)
knowledge (0.17) | network (0.06) | Researcher (0.06) | item (0.04)

Table 2. Top-ranked relationships based on the majority voting scheme, extracted from the subsets of the LAK 2013 Challenge dataset. Each cell in the table contains a concept–relationship–concept triplet.

LAK (abstracts) | LAK (paper body) | EDM (abstracts) | EDM (paper body)
course–being recorded as well as to–student (1) | learner–build–knowledge (1) | datum–mining–method (1) | model–fit–student (1)
datum–break ability to educate effectively–student (0.60) | datum–are collected in–paper (0.95) | datum–obtained from–learner (0.96) | method–linguistics far from–student (0.81)
skill–will have been covered by–student (0.67) | learning_analytics–important step for–teachers_of_tomorrow (0.78) | system–addresses individually–student (0.45) | model–are trained over–datum (0.70)
teachers_of_tomorrow–is a–teacher (0.77) | analysis–have since been moved as–student (0.37) | problem–assign for–student (0.67) | system–provides–student (0.61)
example–parameterization by–student (0.63) | tool–incorporate functionality to access–datum (0.65) | network–impacting–student (0.31) | student–are represented by–model (0.56)
process–finally should promote reflection on–instructor (0.29) | model–can be used to inform–student (0.64) | question–were based–student (0.62) | model–can detect–student (0.50)
student–provides useful evidence to–model (0.60) | datum–obtained from–instructor (0.62) | tool–identify–student (0.27) | datum–derived from–student (0.43)
datum–may be presented to–learner (0.25) | goal–has been investigated by–researcher (0.42) | step–requires–student (0.57) | learner–generating–datum (0.58)
performance–dependent upon–student (0.56) | student–accessing–online_discussion_forum (0.56) | activity–conducted by–user (0.25) | tutoring_system–is a–system (0.40)
model–can be used to inform–teacher (0.51) | group–will contain–student (0.25) | student–study with–intelligent_tutoring_system (0.39) | accuracy–varies across–student (0.48)
student–flock to–online_service (0.48) | environment–capture–datum (0.24) | skill–studied in–tutoring_system (0.38) | student–is guessing–result (0.48)
datum–are combined to calculate–likelihood_of_student (0.45) | model–highly accurate on–student (0.22) | intelligent_tutoring_system–are informed by–datum (0.32) | student–collect–datum (0.45)
average–miss–student (0.21) | analysis–reveals–unexpected_result (0.30) | word–uttered by–student (0.44) | instructor–guide–student (0.39)
learn–integral to–success_of_community (0.37) | role–are imposed on–student (0.21) | datum–were used to build–model (0.44) | unexpected_result–is a–result (0.30)
likelihood_of_student–is related to–student (0.36) | information–useful for–student (0.20) | collaborative–learning–interactions_of_student (0.29) | skill–are included in–model (0.41)

We can also notice that we were not always successful in extracting meaningful relationship labels from this corpus. One possible explanation is the type of the texts (publications) and the amount of noise in them. In fact, OntoCmaps is made to run on clean plain sentences that describe a domain of interest and define it. Parts of research papers such as figure captions, formulas, and references represent noise for OntoCmaps. Additional cleaning of the input texts would be necessary. However, even when the labels were not meaningful, the existence of a link between two concepts (an unlabeled relationship) still shed some light on the domain (see Section 3).
3. FINDINGS
In this section, we present only the results for the top 15 ranked concepts and relationships according to the majority voting scheme (Betweenness, Degree, and HITS-Hubs), as shown in Tables 1-2. (N.B. As can be noticed in the tables, the majority of the terms are lemmatized, that is, we show only their lemma or root – for example, informal_learn for informal learning, or datum for data. In a few cases, such as learning_analytics, the lemmatizer returned the expression itself.) First, we could not possibly include the results of all the metrics we calculated in our experiment (those results are available at [8]). Second, we selected the metrics which were proven to be most accurate in our previous research [10], [11]. Finally, it should be noted that the purpose of our experiment was not to evaluate the effectiveness of individual metrics, but rather to test whether ontology learning technology can shed some light on the questions posed in the introduction, which are of relevance to the LAK 2013 Data Challenge.

Concepts reported in Table 1 reveal that the papers of both the LAK and EDM conferences have students, data, and models as shared concepts. However, it is clear that LAK papers also focus on teachers/instructors, informal learning, and social, networked, and group learning. On the other hand, EDM papers focus on (data mining) methods and approaches, intelligent tutoring systems, feature (extraction), and various types of parameters.

Figure 1. Two conceptual maps extracted from the abstracts of the papers presented at the LAK conference.

Relationships reported in Table 2 further corroborate the observation that the LAK papers are more focused on teachers, in order to empower them with learning analytics and to help them guide students. Moreover, there is an emphasis on (promoting) reflection of both students and instructors. Various aspects of social learning, such as role playing and the impact of communities, appear to be highly popular topics in the LAK papers. On the other hand, EDM papers are much more focused on intelligent tutoring systems, the accuracy of different types of (predictive) models, and revealing unexpected patterns. Certainly, the focus on data is shared by both the LAK and EDM communities, but LAK also seems to be focused on data collected by and for instructors, not only for students. This probably indicates a trend that the LAK community has so far acknowledged the role of instructors in the learning process and aimed at supporting them as much as learners. The EDM community has, however, focused more on measuring and predicting specific types of skills. This is consistent with its focus on intelligent tutoring systems, in which automated assessment of learners' skills is of paramount importance.

Finally, we were also able to visualize the extracted conceptual graphs. In Figure 1, we show the relationships of the concept learning analytics as extracted from the abstracts of the papers presented at the LAK conference. This figure further corroborates the earlier observations by indicating that learning analytics is an integral part of the teaching profession, is an important step for teachers of tomorrow and learners, and offers a new approach. The figure also reveals the role of learning analytics in promoting qualitative understanding of the context of information. Learning analytics is also (strongly) related to discourse analytics, which seems to be consistent with the strong emphasis of learning analytics on social learning, and which is further confirmed by the extracted relationships of discourse analytics with sense-making, argumentation, and social skills, all of which are recognized as important for the modern society.

In future work, we plan to further analyze the research trends over the years for the LAK and EDM communities. Another of our goals is to compare the extractions of an ontology learning system such as OntoCmaps with those of linked data semantic annotators such as DBPedia Spotlight (https://github.com/dbpedia-spotlight/dbpedia-spotlight/) or Alchemy (http://www.alchemyapi.com/).

Figure 2. Visualization of the top 30 ranked concepts based on the majority voting scheme, extracted from the abstracts of the LAK 2013 Challenge dataset.

4. CONCLUSION
Of course, ontology learning tools are not perfectly accurate, and thus, a few "strange" concepts and relationships are shown in our tables. An opportunity lies, however, in using such ontology learning tools as starting points for the concept map development of the learning analytics domain, which can then be refined through crowdsourcing (e.g., in a Wiki-like manner).

Funnily, our text analysis tool inferred that EDM is an abbreviation of learning analytics. This probably comes from the open debate, reflected in the analyzed papers, about the relationships between learning analytics and educational data mining. We hope that this paper sheds some light on the (dis)similarities of the two areas. We also hope that our analysis of the LAK 2013 Data Challenge dataset with the ontology learning tools indicated a high potential of this type of analytics to help the research community of a new research discipline define itself and its relationships with the closest communities. More interesting results are available on our website [8]. For example, those results allow for (i) comparing the results of different concept/relationship measures and (ii) observing the chronological trends emerging throughout the years of the individual editions of both conferences. An example of one of the visualizations available at [8] is presented in Figure 2.

5. REFERENCES
[1] Brin, S. & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine, Stanford University.
[2] De Marneffe, M.-C., MacCartney, B. and Manning, C.D. (2006). Generating Typed Dependency Parses from Phrase Structure Parses. In Proc. of LREC, pp. 449-454, ELRA.
[3] JUNG (2013). Last retrieved from http://jung.sourceforge.net/
[4] Klein, D. and Manning, C.D. (2003). Accurate Unlexicalized Parsing. In Proc. of the 41st Meeting of the Association for Computational Linguistics, pp. 423-430.
[5] Kleinberg, J. (1999). Authoritative sources in a hyperlinked environment, Journal of the ACM 46(5): 604-632, ACM.
[6] Toutanova, K., Klein, D., Manning, C.D. & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. of HLT-NAACL, pp. 252-259.
[7] http://www.solaresearch.org/mission/about/
[8] http://lakchallenge.co.nf
[9] Van Rijsbergen, C.J. (1979). Information Retrieval. London: Butterworths.
[10] Zouaq, A., Gasevic, D. and Hatala, M. (2011). Towards Open Ontology Learning and Filtering. Information Systems, 36(7): 1064-1081.
[11] Zouaq, A., Gasevic, D. and Hatala, M. (2012). Voting Theory for Concept Detection. In Proc. of the 9th Extended Semantic Web Conference (ESWC 2012), pp. 315-329.
[12] Hearst, M.A. (1992). Automatic acquisition of hyponyms from large text corpora. In Proc. of the 14th Conference on Computational Linguistics – Vol. 2 (COLING '92), pp. 539-545.
[13] Taibi, D. and Dietze, S. (2013). Fostering analytics on learning analytics research: the LAK dataset. Technical Report, 03/2013. URL: http://resources.linkededucation.org/2013/03/lak-dataset-taibi.pdf