Relating educational materials via extraction of their topics

Márcio de Carvalho Saraiva
supervised by Claudia Bauzer Medeiros
Institute of Computing, University of Campinas
13083-852 Campinas-SP / Brazil
marcio.saraiva@ic.unicamp.br

Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT

Digital educational documents are growing in size and variety, and scientists face difficulties finding their way through them. One of the initiatives that have emerged to solve this problem involves the use of automatic classification algorithms. However, it is difficult to analyze implicit relationships among topics of materials. This paper presents CIMAL, a framework for enabling flexible access to material stored in arbitrary repositories. CIMAL combines semantic classification, taxonomies and graphs to elicit relationships among topics of educational documents. We validated our work using materials from Coursera (courses offered by Johns Hopkins University and University of Michigan) and a Higher Education Institute in Brazil.

1. INTRODUCTION

Usually, lecturers use educational material repositories to publish, store and share materials with their peers in academia and with students. Access to those documents is usually open. Given such availability, how can one find and choose the material(s) most suitable to study a given topic?

Sites such as the International Bank of Educational Objects, the ACM Learning Center and the ACM Techpack, the Coursera platform, MERLOT and SlideShare show that access to collections of educational materials in different formats, and the analysis of their contents, are still done in a restricted way. Even simple queries through the interfaces of these repositories can result in a large number of items, making it difficult to understand them and select the relevant ones. Furthermore, none of these repositories offers means to analyze relationships among the stored objects, which would help select material.

This paper presents the design and implementation of CIMAL (Courseware Integration under Multiple relations to Assist Learning), abstractly presented in [10]. CIMAL is a framework to analyze repositories of educational documents, allowing visualizations of relationships among the materials' topics through the use of graph algorithms. This work was validated with data from Johns Hopkins University and University of Michigan provided at Coursera, which is one of the largest e-learning repositories at the moment, and with data from a Higher Education Institute from São Paulo, Brazil. Our work expands the analysis options in educational material repositories. Moreover, our proposal improves the search among different material formats by standardizing the topics they cover.

2. THEORETICAL FOUNDATION AND RELATED WORK

2.1 Educational Data Mining

According to Romero and Ventura [9], Educational Data Mining (EDM) is concerned with "researching, developing, and applying computerized methods to detect patterns in collections of educational data that would otherwise be hard or impossible to analyze due to the enormous volume of data within which they exist".

Typically, research towards helping users to select educational material can be roughly classified as (i) development of tools to analyze, access or store materials in repositories, (ii) mechanisms to integrate heterogeneous materials via user monitoring, and (iii) use of learning objects to encapsulate and standardize contents.

2.2 Components and Content from Educational Material

The strategy we adopted to extract and represent topics of educational material is inspired by a concept that we name components of educational material. Components are positional structures that highlight information of a given material in order to facilitate its understanding. Header, body, footer and numbering of slides are examples of components of slides; titles, subtitles and the progress bar are examples of components of videos. This information can also be used for analysis; in our work, we use these characteristics in classification, indexing, comparison and retrieval tasks.

Unlike other approaches in the literature that use the entire text of a document equally, we also extract information of components from different types of material to guide classification tasks. Our work presents a novel strategy for document analysis, which considers the components present in the documents to facilitate the identification of their topics.

2.3 Classification of topics

To classify educational materials, we use a technique called Explicit Semantic Analysis. According to Egozi et al. [4], in natural language processing and information retrieval, Explicit Semantic Analysis (ESA) is a semantic representation of text (entire documents or individual words) that uses a document corpus as a knowledge base.

2.4 Recognition of relationships

According to Jiang [5], extraction of relations is the task of detecting and characterizing the semantic relations between entities in texts. The author states that current state-of-the-art methods use carefully designed features or kernels and standard classification to solve this problem.

Mining of metadata (e.g., number of accesses to data or identification of entities in the documentation of objects) is often used to derive relationships among data, as in the work of Pereira [8]. Relationships of educational materials are viewed as the connections or associations among materials considering educational aspects, such as the association of contents or the connection of lecturers' schedules [7].

Another approach to recognize relationships is to use external taxonomies [6] or to build an architecture with hierarchies to organize objects in levels, so that the relationships among the objects become the relationships between the levels [12].

2.5 Analysis using graph databases

We can characterize a graph database through its data model, which differentiates it from traditional relational databases [1]. A data model is a set of conceptual tools to manage and represent data, consisting of three components [3]: 1) data structure types, 2) a collection of operators or inferencing rules, and 3) a collection of general integrity rules. Data in a graph database are stored and represented as nodes, edges, and properties.

Each graph database management system has its own specialized graph query language, and there are many graph models. For example, many graph databases based on the Resource Description Framework (RDF) use SPARQL (SPARQL Protocol and RDF Query Language), but Neo4j, a graph database widely used in research, uses the Cypher language. Finally, integrity rules in a graph database are based on its graph constraints.

Several researchers have adopted graph representations and graph database systems as a computational means to deal with situations where relationships are first-class citizens (e.g. [2]). They interpret scientific data using concepts of linked data, interactions with other data and topological properties about data organization.

3. THE CIMAL ARCHITECTURE

CIMAL's architecture is a novel design to support the analysis of relationships among educational materials based on their implicit topics. This architecture combines multiple algorithms for content extraction and classification of topics given a suite of educational material repositories.

Figure 1 presents an overview of our architecture, which comprises three layers. The Persistence Layer is composed of six repositories: Local Courseware, Components and Contents, Representations, Enriched Taxonomy, Classification and Relations. The Preprocessing Layer prepares data from educational material for subsequent search. The Interface Layer provides all the services needed to look for materials using graph algorithms. These services can be accessed through the User Interface by lecturers and students.

Figure 1: System Architecture for Analysis of Relationships among Educational Material Contents.

The first step is to set up the repositories (actions represented by arrows with letters 'a' and 'b') before users can perform a search (arrows with letter 'c'). Preprocessing starts when the Courseware Crawler imports materials from external resources (1a) and stores them in a Local Courseware Repository (2a). Next, the Components and Contents Collector extracts texts and the position of these texts from the materials in the Local Courseware Repository (3a). Extracted data are stored in the Components and Contents Repository (4a). Next, the Intermediate Graph Representation Builder creates a graph representation for each material from the repositories via the components and contents stored by the previous step (5a). These representations are stored in the Representations Repository (6a).

In parallel, the Combiner, also proposed in our research, imports an external taxonomy from a Taxonomy Repository, and a set of external expert texts from a Domain Textual Documents Repository (1a). These data are unified in an Enriched Taxonomy, in which each concept of the taxonomy has a reference to a text by experts, and stored in the Enriched Taxonomy Repository (1b).

Once the representations and enriched taxonomy repositories are created, the Classifier is ready to define the topics covered in each of the materials (2b, 3b, 7a). This information is then stored in the Classification Repository (8a).

Lastly, the Relationships Analyzer looks for prespecified relationships among the items and their topics in the Classification Repository (9a), creating the Relations Repository (10a).

All preprocessing steps must be performed every time we add educational material, a taxonomy or texts from a domain textual base.

After such preprocessing, lecturers and students can run queries through the Interface Layer (1c). It redirects the query to the Graph Engine and the Search Engine (2c). The latter accesses the Relations Repository (3c) to find relevant educational materials that are related to the user query.

4. IMPLEMENTATION

The CIMAL software is the first implementation of the architecture described in Section 3. We developed the components of the Interface and Preprocessing Layers in Java, using methods of Apache Lucene, a high-performance, full-featured text search engine library. Our expert texts come from Wikipedia and the taxonomy from the ACM Computing Classification System.

Since CIMAL uses graphs to perform relationship analysis, the Persistence Layer stores all data in a database with native support for graphs (Neo4j). With this approach, we are able to use already established technologies and solutions for processing graphs. We chose the Neo4j database system because it is the most popular graph database in big companies (e.g. eBay and Walmart) and in research, according to the DB-Engines site, an initiative to collect and present information on 341 database management systems.

Our main implementation is divided in four steps: (Step A) Extraction of elements of interest; (Step B) Intermediate Representation Instantiation, based on the schema defined in our research; (Step C) Intermediate Representation Analysis; (Step D) Interaction with users.

4.1 Step A - Extraction of elements of interest

At Step A, the Components and Contents Collector extracts components from material based on a Java framework called DDEx and several APIs for document handling. It scans educational material based on a set of positional rules defined by users and identifies the desired components. Each identified component is encapsulated in a standard representation and forwarded to Step B.

The header and body texts and the slide number were extracted automatically using DDEx as components of each slide. In addition, the texts present in the body of a slide were also extracted. Through the subtitle file, available for each of the videos, the texts and the time stamps of each of the lecturers' statements were extracted.

4.2 Step B - Intermediate Representation Instantiation

Step B creates the Intermediate Graph Representation and stores this representation in a repository. The use of this representation enables the manipulation of parts of educational material without interfering with the material itself.

The components and contents of a material are transformed into a graph where the nodes represent the elements of interest that are used in our work. These elements differ according to the kind of material; for example, in a video we would like to extract the subtitles and in a slide we extract sections.

4.3 Step C - Intermediate Representation Analysis

Step C has three software modules that we implemented. The first module (Combiner tool) is concerned with the creation and storage of an enriched taxonomy. The second (Classifier tool) recognizes the topics of each Intermediate Representation according to the taxonomy and creates a document about the "Classification of Representations". In our studies, we defined that the words present in the components of the slides, or that are among the five most repeated in video subtitles, should be 3 times more important in the classification than the words in the rest of the documents. The third module (Relationship Analyzer tool) concerns the production of information about relations, based on the "Classification of Representations".

The Combiner tool adds one Wikipedia page to each node of the taxonomy, thus producing an Enriched Taxonomy. Next, the Classifier tool calculates the similarity of each text of the Intermediate Graph Representation (related to each educational material) to each page of the Enriched Taxonomy.

4.4 Step D - Interaction with users

Finally, in Step D users can perform queries to find relevant content. Here we implemented the Interface Layer tools using Java programs and 2graph. 2graph is a Java-based API to Extract, Transform and Load (ETL) resources into graph structures/databases; we use it to handle the information produced by CIMAL and to interact with users.

5. RESEARCH CHALLENGES

To achieve the objective of this research, the following obstacles have been faced:

1) Although widespread, the idea of sharing teaching materials still faces resistance from lecturers. In order to perform classification tests and also to verify relationships between the topics, it is necessary to find different materials with similar approaches to explain topics. The solution found was to use materials from the same repository (Coursera) and from the Computing area, in which the idea of electronic sharing is more popular.

2) Most of the lesson videos are produced for a specific audience. Consequently, many lectures only explain concepts in a specific language, and do not provide subtitles for other audiences. Automatic transcription of captions is still a research problem. Therefore, we have selected only videos that had their subtitles produced manually, which drastically reduced the number of educational videos available in educational repositories that could be used. Thus, we used videos from the Coursera platform, which follow a standard of subtitle production, thereby making the analysis of video content more adequate.

3) The use of graphs for analysis of relationships is very common in many research domains, but this practice is not yet widespread in the educational field. In our work we only use volunteers with knowledge of graphs to analyze the contributions of this research.

6. CASE STUDIES

6.1 Analysis of important topics in a Specialization Course from Coursera

We collected 97 sets of slides and 97 videos from the Specialization course in Data Science, offered by Johns Hopkins University, to be used as a case study. Using our system, we are able to discover the topics covered throughout the specialization course without requiring annotations or other extra tasks for teachers. We point out that CIMAL can thus also be used by lecturers to annotate and classify their materials. More details on this case study can be found in [11].

6.2 Proposed new multidisciplinary activities in an educational institution

A second case study was conducted at an educational institution in the state of São Paulo, Brazil. We show how we find similarities among different courses, thereby highlighting possible intersections and revealing potential multi-course activities.

We were able to extract the contents and topics covered in each of the documents that regulated the courses of this institution and relate each of their contents through graphs. Documents with many relations revealed possible interactions between their respective courses.

6.3 Standardizing validation

To finalize our study, we designed a questionnaire to evaluate the classification of topics extracted from 6 materials (randomly chosen so that the questionnaire would not get too long) from the "Python for Everybody Specialization", provided by University of Michigan. Thirty volunteers of different levels of education and specialties in sub-areas of Computer Science gave opinions for each of five topics extracted using the CIMAL implementation. After this activity, we can see that CIMAL classifies the materials using pertinent topics, since 64% of the topics indicated by the framework were evaluated as "Some related" (16.5%), "Related" (15%) or "Closely related" (32.5%) by the volunteers.

7. CONCLUSIONS AND FUTURE WORK

This paper presented the design and implementation of CIMAL, which allows searching content from educational material and eliciting relationships among topics. This framework contributes to helping lecturers and students navigate through collections of materials. Our implementation was validated on slides and videos from case studies and showed that the components of slides and videos can be used to classify text and relate the topics of these materials.

One particular question is of interest to us: "Can the history of courses taken by students influence the topics that the students are looking for in educational material repositories?"

To answer this question, it is necessary to collect data on user accesses to these materials. For example, data on the last courses that a student took in Coursera could be used to construct a personalized study guide on subjects that would be interesting for this student; the recommendation system could also recommend more Coursera courses.

8. REFERENCES

[1] R. Angles and C. Gutierrez. Survey of graph database models. ACM Comput. Surv., 40(1):1:1–1:39, Feb. 2008.
[2] P. Cavoto, V. Cardoso, R. Vignes Lebbe, and A. Santanchè. FishGraph: A Network-Driven Data Analysis. In 11th IEEE Int. Conf. on eScience, Germany, 2015.
[3] E. F. Codd. Data models in database management. SIGPLAN Not., 16(1):112–114, June 1980.
[4] O. Egozi, S. Markovitch, and E. Gabrilovich. Concept-based information retrieval using explicit semantic analysis. ACM Trans. Inf. Syst., 29(2):8:1–8:34, Apr. 2011.
[5] J. Jiang. Information extraction from text. In C. C. Aggarwal and C. Zhai, editors, Mining Text Data, pages 11–41. Springer US, 2012.
[6] O. Matos-Junior, N. Ziviani, F. C. Botelho, M. Cristo, A. Lacerda, and A. S. da Silva. Using taxonomies for product recommendation. JIDM, 3(2):85–100, 2012.
[7] Y. Ouyang and M. Zhu. eLORM: Learning object relationship mining based repository. Proc. - IEEE Int. Conf. on E-Commerce Technology and CEC/EEE, pages 691–698, 2007.
[8] B. Pereira. Entity Linking with Multiple Knowledge Bases: An Ontology Modularization Approach. In ISWC, pages 513–520. Springer, 2014.
[9] C. Romero and S. Ventura. Data mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(1):12–27, 2013.
[10] M. C. Saraiva and C. B. Medeiros. Use of graphs and taxonomic classifications to analyze content relationships among courseware. In SBBD 2016, Salvador, Bahia, Brazil, pages 265–270, 2016.
[11] M. C. Saraiva and C. B. Medeiros. Finding out topics in educational materials using their components. In 47th Annual IEEE FIE, Indianapolis, IN, USA, pages 1–7, 2017.
[12] K. Sathiyamurthy, T. V. Geetha, and M. Senthilvelan. An approach towards dynamic assembling of learning objects. In ICACCI, pages 1193–1198. ACM, 2012.
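
As a supplementary illustration of the component weighting described in Section 4.3, the following minimal Java sketch counts words found in components three times and all other words once, then compares two weighted term vectors by cosine similarity. It is not the CIMAL implementation: the class name, method names and the use of plain term-frequency cosine similarity are assumptions made for this example, whereas the paper relies on Explicit Semantic Analysis and Apache Lucene.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; the class and method names are not part of CIMAL. */
class WeightedTopicSketch {

    /** Weight for words found in components (the factor of 3 from Section 4.3). */
    static final double COMPONENT_WEIGHT = 3.0;

    /** Builds a weighted term vector: component words count 3x, other words 1x. */
    static Map<String, Double> weightedVector(List<String> componentWords,
                                              List<String> otherWords) {
        Map<String, Double> vector = new HashMap<>();
        for (String word : componentWords) {
            vector.merge(word.toLowerCase(Locale.ROOT), COMPONENT_WEIGHT, Double::sum);
        }
        for (String word : otherWords) {
            vector.merge(word.toLowerCase(Locale.ROOT), 1.0, Double::sum);
        }
        return vector;
    }

    /** Cosine similarity between two sparse term vectors; 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            normA += entry.getValue() * entry.getValue();
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        for (double value : b.values()) {
            normB += value * value;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Under these assumptions, a classifier could call `weightedVector` once for a material (header words as component words, body words as other words) and once for the text attached to each taxonomy node, and then rank the nodes by `cosine`. For videos, the five most repeated subtitle words would play the role of the component words.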