Relating educational materials via extraction of their topics

                                                      Márcio de Carvalho Saraiva
                                                  supervised by Claudia Bauzer Medeiros
                                                          Institute of Computing
                                                          University of Campinas
                                                                 13083-852
                                                           Campinas-SP / Brazil
                                                    marcio.saraiva@ic.unicamp.br

ABSTRACT                                                                         als’ topics through the use of graph algorithms. This work
Digital educational documents are growing in size and va-                        was validated with data from Johns Hopkins University and
riety, and scientists are having difficulty finding their way                    University of Michigan provided at Coursera, which is one
through them. One of the initiatives that have emerged to                        of the largest e-learning repositories at the moment, and a
solve this problem involves the use of automatic classifica-                     Higher Education Institute from São Paulo - Brazil. Our
tion algorithms. However, it is difficult to analyze implicit                    work expands the analysis options in educational material
relationships among topics of materials. This paper presents                     repositories. Moreover, our proposal improves the search
CIMAL, a framework for enabling flexible access to material                      among different material formats by standardizing topics
stored in arbitrary repositories. CIMAL combines seman-                          they cover.
tic classification, taxonomies and graphs to elicit relation-
ships among topics of educational documents. We validated                        2.    THEORETICAL FOUNDATION AND
our work using materials from Coursera (courses offered by                             RELATED WORK
Johns Hopkins University and University of Michigan) and
a Higher Education Institute, from Brazil.                                       2.1    Educational Data Mining
                                                                                    According to Romero and Ventura [9], Educational Data
                                                                                 Mining (EDM) is concerned with “researching, developing,
                                                                                 and applying computerized methods
1.    INTRODUCTION                                                               to detect patterns in collections of educational data that
   Usually, lecturers use educational material repositories                      would otherwise be hard or impossible to analyze due to the
to publish, store and share materials with their peers in                        enormous volume of data within which they exist”.
academia and students. Access to those documents is                                 Typically, research towards helping users to select educa-
usually open. Given such availability, how can one find and                      tional material can be roughly classified as (i) development
choose the materials most suitable for studying a given topic?                   of tools to analyze, access or store materials in reposito-
   Sites such as the International Bank of Educational Ob-                       ries, (ii) mechanisms to integrate heterogeneous materials
jects, the ACM Learning Center and the ACM Techpack,                             via user monitoring, and (iii) use of learning objects to en-
the Coursera platform, MERLOT and SlideShare show that                           capsulate and standardize contents.
the access to collections of educational materials in different
formats and the analysis of their contents are still done in a                   2.2    Components and Content from Educational
restricted way. Even simple queries through the interfaces                              Material
of these repositories can result in a large number of items,                        The strategy we adopted to extract and represent top-
making it difficult to understand them and select the rel-                       ics of educational material is inspired by a concept that we
evant ones. Furthermore, none of these repositories offers                       name components of educational material. Components are
means to analyze relationships among the stored objects,                         positional structures that highlight information of a given
which would help select material.                                                material in order to facilitate its understanding. Header,
   This paper presents the design and implementation of                          body, footer and numbering of slides are examples of com-
CIMAL (Courseware Integration under Multiple relations                           ponents of slides; titles, subtitles and the progress bar are
to Assist Learning), abstractly presented in [10]. CIMAL                         examples of components of videos. This information also
is a framework to analyze educational documents reposito-                        can be used for analysis; in our work, we use these charac-
ries, allowing visualizations of relationships among materi-                     teristics in classification, indexing, comparison and retrieval
                                                                                 tasks.
                                                                                    Unlike other approaches in the literature that use the en-
                                                                                 tire text of a document equally, we also extract information
                                                                                 of components from different types of material to guide clas-
                                                                                 sification tasks. Our work presents a novel strategy for doc-
                                                                                 uments analysis, which considers the components present in
Proceedings of the VLDB 2018 Ph.D. Workshop, August 27, 2018. Rio de             the documents to facilitate the identification of topics in the
Janeiro, Brazil. Copyright (C) 2018 for this paper by its authors. Copying       documents.
permitted for private and academic purposes.
                                                                                 2.3    Classification of topics

                                                                       educational material for subsequent search. The latter pro-
   To classify educational materials, we use a technique called        vides all the services needed to look for materials using graph
Explicit Semantic Analysis (ESA). According to Egozi et al.            algorithms. These services can be accessed through the User
[4], in natural language processing and information retrieval,         Interface by lecturers and students.
ESA is a semantic representation of text (entire documents                The first step is to set up the repositories (actions repre-
or individual words) that uses a document corpus as a                  sented by arrows with letters ’a’ and ’b’) before users can
knowledge base.                                                        perform a search (arrows with letter ’c’). Preprocessing
                                                                       starts when the Courseware Crawler imports such materi-
2.4    Recognition of relationships                                    als from external resources (1a) and stores them in a Local
According to Jiang et al. [5], extraction of relations is the          Courseware Repository (2a). Next, the Components and
task of detecting and characterizing the semantic relations            Contents Collector extracts texts and the position of these
between entities in texts. They affirm that current state-of-          texts from the materials in the Local Courseware Repository
the-art methods use carefully designed features or kernels             (3a). Extracted data are stored in the Components and Con-
and standard classification to solve this problem.                     tents Repository (4a). Next, the Intermediate Graph Rep-
   Mining of metadata (e.g., number of accesses to data or             resentation Builder creates a graph representation for each
identification of entities in the documentation of objects) is         material from the repositories via the components and con-
often used to derive relationships among data, such as the             tents stored by the previous step (5a). These representations
work of Pereira [8]. Relationships of educational materials            are stored in the Representations Repository (6a).
are viewed as the connections or associations among mate-                 In parallel, the Combiner, also proposed in our research,
rials considering educational aspects, such as the association         imports an external taxonomy from a Taxonomy Reposi-
of contents or the connection of lecturers’ schedules [7].             tory, and a set of external expert texts from Domain textual
   Another approach to recognize relationships is to use ex-           documents Repository (1a). These data are unified in an
ternal taxonomies ([6]) or to build an architecture with hi-           Enhanced Taxonomy, in which each concept of the taxon-
erarchies to organize objects in levels, so that these relation-       omy has a reference to a text by experts, and stored in the
ships among the objects become the relationships between               Enriched Taxonomy Repository (1b).
the levels ([12]).                                                        Once representations and enriched taxonomy repositories
                                                                       are created, the Classifier is ready to define the topics cov-
2.5    Analysis using graph databases                                  ered in each of the materials (2b,3b,7a). This information
                                                                       is then stored in the Classification Repository (8a).
   We can characterize a graph database through its data                  Lastly, the Relationships Analyzer looks for prespecified
model that differentiates it from traditional relational databases     relationships among the items and their topics in the Clas-
[1]. A data model is a set of conceptual tools to manage               sification Repository (9a), creating the Relations Repository
and represent data, consisting of three components [3]: 1)             (10a).
data structure types, 2) collection of operators or inferenc-             All preprocessing steps must be performed every time we
ing rules, and 3) a collection of general integrity rules. Data        add educational material, taxonomy or texts from a domain
in a graph database are stored and represented as nodes,               textual base.
edges, and properties.                                                    After such preprocessing, lecturers and students can run
   Each graph database management system has its own spe-              queries through the Interface Layer (1c). It redirects the
cialized graph query language, and there are many graph                query to the Graph Engine and the Search Engine (2c). The
models. For example, many graph databases based on Re-                 latter accesses the Relations Repository (3c) to find relevant
source Description Framework (RDF) use SPARQL (SPARQL                  educational materials that are related to the user query.
Protocol and RDF Query Language), but Neo4j, a graph
database widely used in research, uses the Cypher language.
Finally, integrity rules in a graph database are based on its
graph constraints.                                                     4.   IMPLEMENTATION
   Several researchers have adopted graph representations                 The CIMAL software is the first implementation of the
and graph database systems as a computational means to                 architecture described in Section 3. We developed the
deal with situations where relationships are first-class citi-         components of the Interface and Preprocessing Layers in
zens (e.g. [2]). They interpret scientific data using concepts         Java, using methods from Apache Lucene, a high-performance,
of linked data, interactions with other data and topological           full-featured text search engine library. Our expert texts
properties about data organization.                                    come from Wikipedia, and the taxonomy is the ACM
                                                                       Computing Classification System.
                                                                          Since CIMAL uses graphs to perform relationship analy-
3.    THE CIMAL ARCHITECTURE                                           sis, the Persistence Layer stores all data in a database with
  CIMAL’s architecture is a novel design to support the                native support for graphs (Neo4j). With this approach, we
analysis of relationships among educational material based             are able to use already established technologies and solu-
on their implicit topics. This architecture combines multiple          tions for processing graphs. We chose the Neo4j database
algorithms for content extraction and classification of topics         system because it is the most popular graph database in big
given a suite of educational material repositories.                    companies (e.g. eBay and Walmart) and in research, ac-
  Figure 1 presents an overview of our architecture, which             cording to the DB-Engines site, an initiative to collect and
comprises three layers. The Persistence Layer is composed              present information on 341 database management systems.
of six repositories: Local Courseware, Components and Con-                Our main implementation is divided into four steps: (Step
tents, Representations, Enriched Taxonomy, Classification              A) Extraction of elements of interest; (Step B) Intermediate
and Relations. The Preprocessing Layer prepares data from              Representation Instantiation – based on the schema defined


      Figure 1: System Architecture for Analysis of Relationships among Educational Material Contents.
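
The Relationships Analyzer step of Figure 1 (9a, 10a) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the only prespecified relationship is "two materials share at least one topic"; the paper does not list the actual relationships, and the names build_relations and relation_counts, as well as the in-memory dictionaries, are hypothetical. CIMAL itself is implemented in Java and stores its relations in Neo4j, so this is not the authors' code.

```python
from itertools import combinations


def build_relations(classification):
    """Relate every pair of materials that share at least one topic.

    `classification` maps a material identifier to the set of topics
    assigned to it (a stand-in for the Classification Repository).
    Returns a dict mapping (material_a, material_b) to the shared topics
    (a stand-in for the Relations Repository).
    """
    relations = {}
    # Sorting the identifiers makes the pair order deterministic.
    for a, b in combinations(sorted(classification), 2):
        shared = classification[a] & classification[b]
        if shared:
            relations[(a, b)] = shared
    return relations


def relation_counts(relations):
    """Count, for each material, how many relations it takes part in."""
    counts = {}
    for a, b in relations:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts
```

Under this assumption, materials with a high count in relation_counts correspond to the "documents with many relations" that Section 6.2 reports as revealing possible interactions between courses.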


in our research; (Step C) Intermediate Representation Anal-            tool) recognizes the topics of each Intermediate Represen-
ysis; (Step D) Interaction with users.                                 tation according to the taxonomy and creates a document
                                                                       about the ”Classification of Representations”. In our stud-
4.1    Step A - Extraction of elements of interest                     ies, we defined that the words present in the components of
At Step A, the Components and Contents Collector extracts              the slides or that are among the five most repeated in videos
components from material based on a Java Framework called              subtitles should be 3 times more important in the classifica-
DDEx and several APIs for document handling. It scans ed-              tion than the words in the rest of the documents. The third
ucational material based on a set of positional rules defined          module (Relationship Analyzer tool) concerns the produc-
by users and identifies the desired components. Each identi-           tion of information about relations, based on the ”Classifi-
fied component is encapsulated in a standard representation            cation of Representations”.
and forwarded to Step B.                                                  The Combiner tool adds one page of Wikipedia to each
   The header and body texts and the slide numbers were                node of the Taxonomy, thus producing an Enriched Taxon-
extracted automatically using DDEx as components of each               omy. Next, the Classifier tool calculates the similarity of
slide. In addition, the texts present in the body of the slides        each text of the Intermediate Graph Representation (related
were also extracted. From the subtitle file available for              to each educational material) to each page of the Enriched
each of the videos, the texts and the time stamps of each              Taxonomy.
of the lecturers’ statements were extracted.
                                                                       4.4    Step D - Interaction with users
4.2    Step B - Intermediate Representation In-                        Finally, in Step D users can perform queries to find rele-
       stantiation                                                     vant content. We implemented the Interface Layer tools
Step B creates the Intermediate Graph Representation and               using Java programs and 2graph, a Java-based API that
stores this representation in a repository. The use of this rep-       performs Extract, Transform and Load (ETL) of resources
resentation enables the manipulation of parts of educational           into graph structures/databases, to handle the information
material without interfering with the material itself.                 produced by CIMAL and interact with users.
  The components and contents of a material are trans-
formed into a graph where the nodes represent the elements             5.    RESEARCH CHALLENGES
of interest that are used in our work. These elements differ           To achieve the objective of this research the following ob-
according to the kind of material, for example in a video we           stacles have been faced:
would like to extract the subtitles and in a slide we extract             1) Although widespread, the idea of sharing teaching ma-
sections.                                                              terials still faces resistance from lecturers. In order to per-
                                                                       form classification tests and also to verify relationships be-
4.3    Step C - Intermediate Representation                            tween the topics, it is necessary to find different materials
       Analysis                                                        but with similar approaches to explain topics. The solu-
Step C has three software modules we implemented: The                  tion found was to use materials from the same repository
first module (”Combiner” tool) is concerned with creation              (Coursera) and from the Computing area, in which the idea
and storage of an enriched taxonomy. The second (Classifier            of electronic sharing is more popular.


   2) Most of the lesson videos are produced for a specific          framework contributes to helping lecturers and students nav-
audience. Consequently, many lectures only explain con-              igate through collections of materials. Our implementa-
cepts in a specific language, and do not produce subtitles for       tion is validated on slides and videos from case studies and
other audiences. Automatic transcription of captions is still        showed that the components on slides and videos can be
a research problem. Therefore, we have selected only videos          used to classify text and relate topics of these materials.
that had their subtitle produced manually, which drastically            One particular question is of interest to us: ”Can the his-
reduced the amount of educational videos available in ed-            tory of courses taken by students influence the topics that
ucational repositories that could be used. Thus, we used             the students are looking for in educational material reposi-
videos from the Coursera platform, which follow a standard           tories?”
of subtitle production, thereby making the analysis of video            To answer this question, it is necessary to collect data of
content more adequate.                                               user accesses to these materials. For example, data on the
   3) The use of graphs for analysis of relationships is very        last courses that a student took on Coursera could be used to
common in many research domains, but this practice is not            construct a personalized study guide on subjects that would
yet widespread in the educational field. In our work we              be interesting for this student; the recommendation system
only use volunteers with knowledge in graphs to analyze the          could also recommend more Coursera courses.
contributions of this research.
                                                                     8.   REFERENCES
6.    CASE STUDIES                                                    [1] R. Angles and C. Gutierrez. Survey of graph database
                                                                          models. ACM Comput. Surv., 40(1):1:1–1:39, Feb.
6.1   Analysis of important topics in a Special-                          2008.
      ization Course from Coursera                                    [2] P. Cavoto, V. Cardoso, R. Vignes Lebbe, and
   We collected 97 sets of slides and 97 videos from the Spe-             A. Santanchè. FishGraph: A Network-Driven Data
cialization course in Data Science, offered by Johns Hopkins              Analysis. In 11th IEEE Int. Conf. on eScience,
University, to be used as a case study. Using our system,                 Germany, 2015.
we are able to discover the topics covered throughout the             [3] E. F. Codd. Data models in database management.
specialization course without requiring annotations or other              SIGPLAN Not., 16(1):112–114, June 1980.
extra tasks for teachers. We point out that CIMAL can                 [4] O. Egozi, S. Markovitch, and E. Gabrilovich.
thus also be used by lecturers to annotate and classify their             Concept-based information retrieval using explicit
materials. More details on this case study can be found at                semantic analysis. ACM Trans. Inf. Syst.,
[11].                                                                     29(2):8:1–8:34, Apr. 2011.
                                                                      [5] J. Jiang. Information extraction from text. In C. C.
6.2   Proposed new multidisciplinary activities                           Aggarwal and C. Zhai, editors, Mining Text Data,
      in an educational institution                                       pages 11–41. Springer US, 2012.
   A second case study was conducted at an educational in-            [6] O. Matos-Junior, N. Ziviani, F. C. Botelho, M. Cristo,
stitution in the state of São Paulo, Brazil. We show how we              A. Lacerda, and A. S. da Silva. Using taxonomies for
find similarities among different courses, thereby highlight-             product recommendation. JIDM, 3(2):85–100,
ing possible intersections, thus revealing potential multi-               2012.
course activities.                                                    [7] Y. Ouyang and M. Zhu. eLORM: Learning object
   We were able to extract the contents and topics covered                relationship mining based repository. Proc. - IEEE
in each of the documents that regulated the courses of this               Int. Conf. on E-Commerce Technology and
institution and relate each of their contents through graphs.             CEC/EEE, pages 691–698, 2007.
Documents with many relations revealed possible interac-              [8] B. Pereira. Entity Linking with Multiple Knowledge
tions between their respective courses.                                   Bases: An Ontology Modularization Approach. In
                                                                          ISWC, pages 513–520. Springer, 2014.
6.3   Standardizing validation
  To finalize our study, we designed a questionnaire to eval-         [9] C. Romero and S. Ventura. Data mining in education.
uate the classification of topics extracted from 6 materials              Wiley Interdisciplinary Reviews: Data Mining and
(randomly chosen so that the questionnaire would not get too              Knowledge Discovery, 3(1):12–27, 2013.
long) from the “Python for Everybody Specialization”, pro-           [10] M. C. Saraiva and C. B. Medeiros. Use of graphs and
vided by University of Michigan. Thirty volunteers of differ-             taxonomic classifications to analyze content
ent levels of education and specialties in sub-areas of Com-              relationships among courseware. In SBBD 2016,
puter Science gave opinions for each of five topics extracted             Salvador, Bahia, Brazil, pages 265–270, 2016.
using the CIMAL implementation. After this activity, we              [11] M. C. Saraiva and C. B. Medeiros. Finding out topics
can see that CIMAL classifies the materials using pertinent               in educational materials using their components. In
topics, since 64% of the topics indicated by the framework                47th Annual IEEE FIE, Indianapolis, IN, USA, pages
were evaluated as “Some related” (16.5%), “Related” (15%) or              1–7, 2017.
“Closely related” (32.5%) by the volunteers.                         [12] K. Sathiyamurthy, T. V. Geetha, and M. Senthilvelan.
                                                                          An approach towards dynamic assembling of learning
                                                                          objects. In ICACCI, pages 1193–1198. ACM, 2012.
7.    CONCLUSIONS AND FUTURE WORK
  This paper presented the design and implementation of
CIMAL, which allows searching content from educational
material, and eliciting relationships among topics. This


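
The weighting rule of Section 4.3 (words found in components count 3 times as much as words in the rest of the document) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the CIMAL implementation: CIMAL relies on Apache Lucene in Java, whereas this sketch substitutes plain term-frequency vectors with cosine similarity, and the names tokenize, weighted_vector, cosine and classify are hypothetical. The choice to add the weight per component occurrence is also an assumption about how "3 times more important" is applied.

```python
import math
import re
from collections import Counter

COMPONENT_WEIGHT = 3  # weight stated in Section 4.3 for component words


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_vector(component_text, body_text, weight=COMPONENT_WEIGHT):
    """Term-frequency vector where component words count `weight` times."""
    vector = Counter()
    for token in tokenize(body_text):
        vector[token] += 1
    for token in tokenize(component_text):
        vector[token] += weight
    return vector


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(component_text, body_text, taxonomy_pages, top_k=5,
             weight=COMPONENT_WEIGHT):
    """Rank taxonomy concepts by similarity to one material.

    `taxonomy_pages` maps a concept name to its expert text, mimicking
    the Enriched Taxonomy (one Wikipedia page per taxonomy node).
    """
    doc = weighted_vector(component_text, body_text, weight)
    scores = {
        concept: cosine(doc, Counter(tokenize(page)))
        for concept, page in taxonomy_pages.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```

The default top_k of 5 mirrors the five topics per material evaluated in Section 6.3; passing weight=1 treats all words equally, which makes the effect of the component weighting easy to compare.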