Automatic Learning Content Sequence via Linked Open Data

Ruben Manrique
Systems and Computing Engineering Department, School of Engineering, Universidad de los Andes, Bogotá, Colombia
rf.manrique@uniandes.edu.co

1 Problem Statement

The paradigm of lifelong learning supported by technology is redefining the classic approach, making closed models in which objectives, content and sequence are predetermined more open and self-directed. In this new era of learning, people take on learning processes that are not necessarily carried out under the guidance of a tutor. In this format, learners are in charge of searching for and selecting the most appropriate resources to accomplish their goals. Learning content is any type of resource relevant to a learning goal, and the main source for such content is the Web. However, even though most documents on the Web are not annotated as educational, they are still being used as learning resources. Given these characteristics, self-directed learners face several challenges in selecting and organizing adequate learning resources. We identify three main challenges: effective search and selection of educational content, sound organization of this content, and personalization of the two previous processes. Although personalized learning is recognized as a key concept in future learning support systems, our proposal is centered on the other two challenges, namely educational content search, selection and organization.

The first of these challenges is that learners do not always have sufficient literacy skills to perform an effective and selective search [8]. The great volume of resources available on the Web can overwhelm learners. Secondly, unlike traditional approaches, learners do not have the assistance of an expert to structure and organize the resources so that the learning goal is achieved in a logical manner. The sequence in which knowledge is acquired is important because complex concepts require the understanding of more basic ones. Although considerable attention has been paid to content organization in e-learning courses, most systems are still under development [3, 17] or are implemented in closed scenarios, or in courses where objectives, learning materials and sequence are not driven by the learner himself [18, 7, 1]. This proposal addresses this problem and explores strategies to organize and select resources by automatically enriching them with metadata about the concepts involved and by applying an ordering strategy. Concepts found in Knowledge Bases (KB) belonging to the Linked Open Data (LOD) initiative are used as the concept space. The ordering strategy can be defined explicitly via a specific property modeled in the KB ontology, or it can be generated from prerequisite relationships discovered automatically through the background knowledge found in such bases.

2 Relevancy

On a practical level, we want to support lifelong learners. As a result of using our system, learners will spend less time searching for and selecting the learning resources that best suit their learning goals. Also, presenting a plan for using the materials gives learners a general overview of a topic, allowing a more comprehensive understanding and a logical order to follow.

This dissertation will also contribute to current research in Technology Enhanced Learning and in the Semantic Web.
Regarding Technology Enhanced Learning, our work will contribute to advancing current research on connecting Linked Data with open learning resources, in particular for autonomous lifelong learning. The Semantic Web research community should benefit from our products and results, as all our algorithmic proposals take advantage of background knowledge found in KBs belonging to the LOD initiative to infer knowledge about content documents and the relationships between them, which can be used for purposes other than educational ones.

3 Related Work

Learning Content Sequence: There have been several works on learning content sequence in closed learning scenarios such as e-learning courses. In such scenarios, the problem is known as “pedagogical sequencing” or “curriculum sequencing” and is defined as the problem of how the learning material is arranged and delivered to a particular learner according to his/her preferences, level of knowledge and/or pedagogical conventions [2, 18, 7, 1]. While some approaches solve the problem sequentially by defining events and rules that guide the construction of the sequence [18], others have modeled it as a constraint satisfaction problem in which diverse computational techniques can be employed [2, 23]. These approaches have proven useful in the context of well-defined, closed-domain ontologies and of learning objects (LO) annotated with standards such as LOM, SCORM or LRMI. In these cases, the sequencing of learning content can be achieved by following subsumption relations in the domain ontology and by aligning ontology concepts with LO annotation terms. The main limitations of these approaches are availability, scalability and maintenance. To start, not all domains have well-established and agreed-upon ontologies, and even in domains that do, both LO annotation and ontology construction and maintenance require intensive expert human intervention, as neither task is completely automatic. A second problem concerns scope. In the context of lifelong self-directed learning, the learner might be interested in cross-domain problems, and pertinent content might include learning resources not originally labeled as LO. In order to address these issues, research has been oriented toward the automatic extraction of concepts and prerequisite relationships to guide the selection and organization of learning resources.

Automatic Extraction of Concept-Based Metadata: [22] developed an automatic annotation tool for annotating documents with metadata such as concepts, type of concepts, topic and the learning resource type. They justified the importance of extending IEEE LOM metadata with concept-based attributes since it allows a better retrieval process. Domain ontologies are used to map lexical terms to concepts. To evaluate the importance of a concept, they analyze the frequency of its related concepts. The type of concept (i.e. outcome, prerequisite, defined and used) is identified by analyzing the documents with a shallow parsing approach and by applying some inference rules. In more recent but similar works, [9, 6] enrich learning resource metadata with semantic concepts from a domain ontology. Using an ontology extraction algorithm named Hieron, they extract the concepts covered by the content resource together with an associated weight that represents the degree of pertinence. [4] addresses the problem of determining an effective learning path from a corpus of Web documents via the annotation of outcome and prerequisite concepts.
They employ machine-learning techniques to predict the class of a concept on the basis of contextual and local text features. Regarding information extraction techniques that take advantage of LOD-KB background knowledge, we only found the work of [10]. In that work, the authors propose an automatic semantic description of Web documents based on LOD concepts. The semantic description is called a “fingerprint” and allows one to judge whether a Web resource is relevant with respect to a learning context. We did not find evidence of its use to select or organize resources.

Prerequisite relationship identification: The problem of identifying prerequisite relationships in open settings has been addressed by several researchers. We focus here on the state-of-the-art proposals [24, 25, 11]. [24] proposed a supervised method for predicting prerequisite dependencies between Wikipedia pages. They used a set of features such as random walk with restart (RWR) and PageRank scores in combination with a Maximum Entropy (MaxEnt) classifier. Later, [25] formulated an optimization problem to obtain concept maps linked with prerequisite relationships extracted from textbooks. A set of objective functions is proposed that takes advantage of the order and frequency in which concepts appear in the textbook structure. Finally, a reference distance (RefD) is proposed in [11] to measure the prerequisite relation among concepts. Interestingly, this measure can be generalized to different concept spaces and weighting functions. Our approach differs from the preceding ones in that we rely on LOD-KBs (e.g. DBpedia, YAGO) as a universal concept space. We believe that, by employing similarity measures for LOD concepts and the background knowledge present in the KB, it is possible to infer candidate prerequisite relationships.

4 Hypotheses and Research Questions

In order to address the main problem we propose the following hypotheses and research questions:

(H1) It is possible to generate a learning concept graph (LCG) that expresses prerequisite relationships from the well-structured information in LOD KBs. (Q1) How can LOD KB background information be exploited to infer candidate prerequisite relationships? (Q2) How accurate are these inferred prerequisite relationships?

(H2) It is possible to augment learning resources with metadata associated with the main concepts they address. (Q3) How can a concept-based description of a learning resource be automatically extracted from its content? (Q4) How well does this concept-based description represent the resource?

5 Approach

This dissertation is aimed at selecting learning resources to satisfy a learning need and at automatically inferring prerequisite relationships between the concepts related to this learning need, so that the resources found can be ordered logically. The resources, therefore, must be given a representation that makes it possible to discern the main concepts they address. Knowing how the concepts are related and which concepts each resource addresses, it is possible to select and order the learning resources for a given learning objective. In general, our methodology to address these issues relies on (i) the automatic discovery of prerequisite relationships between concepts and (ii) the automatic construction of a concept-based representation for learning resources.
Both processes are guided by the knowledge found in LOD-KBs and allow a learning resource to be enriched with metadata about the main concepts it addresses and their corresponding prerequisite concepts. There are two main components in our proposal: the Learning Concept Graph (LCG) and the Automatic Learning Content Sequence (ALCS) generator service. The LCG automatically infers prerequisite relationships among concepts by analyzing their semantic relationships and linkages. This information is exploited by the ALCS to extract the concepts and the order in which they have to be covered to achieve an input learning goal. Finally, by extracting information about the main concepts involved in a learning resource, the ordered sequence of concepts is transformed into an ordered sequence of resources. The paragraphs below explain both components.

Learning Concept Graph (LCG): The main objective of the LCG is to represent prerequisite relationships in a concept space framed by the selected KB. The prerequisite relationship pre(c_a, c_b) ≡ [c_a isPrerequisiteOf c_b] satisfies the following properties: (1) asymmetry, pre(c_a, c_b) ⇒ ¬pre(c_b, c_a); (2) irreflexivity, pre(c_a, c_a) is not defined; (3) transitivity, pre(c_a, c_b) ∧ pre(c_b, c_c) ⇒ pre(c_a, c_c). In order to measure prerequisite relations among concepts, we propose an approach that infers candidate relationships by exploiting the graph structure and the semantics of the concepts in the KB. Our proposal to find candidate prerequisite relationships is inspired by the RefD measure [11], which fulfills the properties of our prerequisite relationship and shows better results than more sophisticated supervised approaches. RefD infers prerequisite dependencies by analyzing observable reference relationships between concepts (a reference could be a hyperlink, a citation, a mention, etc.). This measure is defined as:

\mathrm{RefD}(c_a, c_b) = \frac{\sum_{i=1}^{k} r(c_i, c_b)\, w(c_i, c_a)}{\sum_{i=1}^{k} w(c_i, c_a)} - \frac{\sum_{i=1}^{k} r(c_i, c_a)\, w(c_i, c_b)}{\sum_{i=1}^{k} w(c_i, c_b)}    (1)

where k is the size of the concept universe, r(c_i, c_a) is an indicator function showing the existence of a reference between c_i and c_a, and w(c_i, c_a) weights the importance of c_i to c_a. In order to leverage RefD with the rich semantic relationships present in LOD KBs, we will explore the use of hierarchical information (i.e. categories and other taxonomical information) and of common related concepts reached by following ontology properties to calculate w(c_i, c_a). Likewise, following property paths (https://www.w3.org/TR/sparql11-property-paths/) between concepts is indicative of a reference; hence, different ontology properties of the KB could be explored for r(c_i, c_a). Recently, different similarity measures for LOD-KBs have been proposed [19, 16] and appear to be a promising option to calculate w(c_i, c_a). In preliminary results, we employed the similarity measures used by [20, 19].
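To make Eq. (1) concrete, the following is a minimal sketch of the RefD computation over a fixed concept universe, assuming caller-supplied reference and weight functions; the function names, the toy link table and the 0/1 weighting are illustrative assumptions, not the actual LCG implementation.

```python
# Minimal sketch of Eq. (1). The reference function r and the weight function w
# are supplied by the caller; the toy link table below is purely illustrative.
from typing import Callable, Iterable


def refd(c_a: str, c_b: str,
         universe: Iterable[str],
         r: Callable[[str, str], int],
         w: Callable[[str, str], float]) -> float:
    """RefD(c_a, c_b) as in Eq. (1). Following [11], a clearly positive value
    suggests that c_b is a candidate prerequisite of c_a."""
    universe = list(universe)
    num_ab = sum(r(c_i, c_b) * w(c_i, c_a) for c_i in universe)
    den_a = sum(w(c_i, c_a) for c_i in universe)
    num_ba = sum(r(c_i, c_a) * w(c_i, c_b) for c_i in universe)
    den_b = sum(w(c_i, c_b) for c_i in universe)
    if den_a == 0 or den_b == 0:
        return 0.0  # no evidence either way
    return num_ab / den_a - num_ba / den_b


# Toy usage: r is a hyperlink indicator, and w marks as important those concepts
# linked from the target concept (or the concept itself).
links = {("Integral", "Derivative"), ("Derivative", "Function"),
         ("Integral", "Function")}
concepts = ["Integral", "Derivative", "Function"]


def r(ci: str, cj: str) -> int:
    return int((ci, cj) in links)


def w(ci: str, cj: str) -> float:
    return float((cj, ci) in links or ci == cj)


print(refd("Integral", "Derivative", concepts, r, w))  # > 0: Derivative before Integral
```

In the actual LCG, r(c_i, c_a) would be derived from KB linkage (e.g. property paths between concepts) and w(c_i, c_a) from hierarchical information or LOD similarity measures, as discussed above.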
Automatic Learning Content Sequence (ALCS): The ALCS has two main purposes. First, it builds a concept-based representation for learning resources (Definition 1). This representation allows resources to be enriched with metadata about the main topic concepts that they address. Second, given an ordered sequence of concepts extracted from the LCG, it selects an ordered sequence of learning resources to address these concepts.

Definition 1: A resource r_i is represented as R_i = {(c, w(r_i, c)) | c ∈ C}. The weight w(r_i, c) denotes how well the concept c is covered by the resource. The concept space C is limited by the selected LOD-KB, so each concept is basically a URI.

(1) Concept-Based Resource Representation: The advantage of the bag-of-concepts approach (Definition 1) is that it allows us to exploit semantic information found in the KB to leverage and extend the representation. In the first step, the system extracts mentions of KB concepts found in the text. This “enrichment” process takes an input text and returns a set of URIs of structured LOD entities to allow further reasoning on related concepts. Services such as DBpedia Spotlight, OpenCalais, AlchemyAPI or BioPortal could be employed for this task. Then, on top of these mentions, different semantic expansion strategies can be applied [13]: (i) categorical expansion (CE): for each annotation, we include its categories (or other hierarchical information about concepts in the KB) in the representation; (ii) property expansion (PE): for each annotation, the representation is enriched with the concepts found by traversing some of the properties of the KB ontology. We have found that large documents may involve concepts that have no relation to the main topic addressed by the resource, so the above expansion strategies can introduce additional noise. Hence, instead of such expansion strategies, we propose a filtering strategy that has already shown better results in this scenario. The idea behind this filtering strategy is that noisy concepts tend to be disconnected from those that represent the main topic of the resource. The filtering is achieved by analyzing how the concepts are related via the linkages in the KB and discarding those that are disconnected or only weakly connected [13]. For the weighting strategy (i.e. w(r_i, c) in Definition 1) we have followed approaches such as the frequency of the concept and an abstraction of the well-known TF-IDF [21]. These weighting schemes have proven sufficient to represent the coverage of a concept by a given resource. Concepts with higher weights, together with their prerequisites extracted from the LCG, are incorporated as main-concept and prerequisite-concept metadata.

(2) Resource Selection: Given a target learning concept c_j (i.e. a learning goal), the ALCS extracts an organized sequence of concepts g = {c_1, c_2, ..., c_j} from the LCG to achieve it. Then, we select the set of resources lr = {r_1, r_2, ..., r_j} that maximizes the coverage of the concepts in g. We follow a greedy approach that selects, for each concept c_k in the sequence, the resource r_i with the highest w(r_i, c_k), as sketched below. We plan to explore more sophisticated resource selection strategies based on the definition of a constraint satisfaction problem. Likewise, we will also investigate how to limit the number of prerequisites retrieved (|g|) in an “intelligent” way.
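As an illustration of the two ALCS steps, the sketch below builds the bag-of-concepts representation of Definition 1 with simple frequency weights and then applies the greedy selection; the annotation step is mocked with hand-made concept lists (in practice an entity-linking service such as DBpedia Spotlight would produce the URIs), and all names and toy data are assumptions for illustration only.

```python
# Minimal sketch: bag-of-concepts representations (Definition 1) and the greedy
# concept-to-resource selection. Concept annotation is mocked here; in practice
# an entity-linking service would return the LOD entity URIs.
from collections import Counter
from typing import Dict, List, Sequence

DBP = "http://dbpedia.org/resource/"


def represent(concept_mentions: Sequence[str]) -> Dict[str, float]:
    """Build R_i = {(c, w(r_i, c))} using simple frequency weights,
    normalized by the total number of mentions in the resource."""
    counts = Counter(concept_mentions)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def greedy_select(concept_sequence: Sequence[str],
                  resources: Dict[str, Dict[str, float]]) -> List[str]:
    """For each concept c_k in the ordered sequence g, pick the resource r_i
    with the highest w(r_i, c_k)."""
    selected = []
    for c_k in concept_sequence:
        best = max(resources, key=lambda r_i: resources[r_i].get(c_k, 0.0))
        selected.append(best)
    return selected


# Toy data: two resources annotated with (hypothetical) DBpedia concept URIs.
resources = {
    "doc_basics": represent([DBP + "Function_(mathematics)"] * 5 + [DBP + "Derivative"] * 2),
    "doc_advanced": represent([DBP + "Derivative"] * 4 + [DBP + "Integral"] * 6),
}
g = [DBP + "Function_(mathematics)", DBP + "Derivative", DBP + "Integral"]  # from the LCG
print(greedy_select(g, resources))  # ['doc_basics', 'doc_advanced', 'doc_advanced']
```

In the actual service, the plain frequency weights would be replaced by the TF-IDF-style weighting, and the representation would first pass through the KB-linkage filtering described above.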
6 Evaluation Plan and Preliminary Results

We plan to evaluate the prerequisite relationships in the LCG using the CrowdComp dataset [24]. This dataset contains binary-labeled concept pairs from five different domains collected using Amazon Mechanical Turk. We plan to report accuracy, precision, recall, and F1 as evaluation measures, and to compare our approach to discovering prerequisites with the strategies reported by [24, 11, 12] (Q1, Q2). Some trials on a limited set of concepts, using RefD and the LOD similarities explained above, allowed us to test our approach on a small scale.

In order to test the information extraction and the resource representation (Q3, Q4), we have built a dataset with more than 30,000 documents that includes open academic articles recovered from the CORE service (https://core.ac.uk/), Web pages with technology and science news, and a set of learning objects retrieved from MERLOT (https://www.merlot.org/). The complete dataset, as well as parts of it, has already been employed in [14, 13]. We also plan to use a recently released dataset [5] that contains all embedded Learning Resource Metadata Initiative (LRMI, http://lrmi.dublincore.net/) markup statements extracted from the Common Crawl releases 2013-2015. Each entity description contains the URL of the Web document from which it was extracted, so it is possible to access the Web resource content and apply our learning resource representation.

So far, we have developed a service that maps documents to bag-of-concepts representations (Definition 1) with expansion and filtering strategies. This service can operate with different LOD-KBs, though all our experiments have been performed using DBpedia. We have partially evaluated the learning resource representation in an academic document recommendation task (Q3, Q4) [13, 15]. All academic documents in the candidate recommendation set were represented as R_i and different semantic expansion and filtering strategies were applied. In terms of classical measures for top-N recommendation tasks, namely MRR (Mean Reciprocal Rank), MAP@N (Mean Average Precision), and NDCG@N (Normalized Discounted Cumulative Gain), we obtained better results than with word-based representations. In general, our results validate that: (i) the proposed concept-based representation correctly expresses the content of an academic document; (ii) a concept space framed by DBpedia was enough to achieve a correct representation, and it is expected that domain-specific ontologies could lead to even better results; and (iii) different strategies that exploit the background knowledge present in the KB can be employed to leverage the representation. We plan to evaluate how well this representation leads to correct metadata (i.e. main concepts) with human reviewers who manually annotate the resources with their main concepts.
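For reference, the ranked-list measures mentioned above can be computed as in the following sketch for binary relevance judgments; this is standard textbook arithmetic, not code from the actual evaluation pipeline.

```python
# Sketch of MRR, MAP@N and NDCG@N for binary relevance judgments.
import math
from typing import Sequence, Set


def mrr(rankings: Sequence[Sequence[str]], relevant: Sequence[Set[str]]) -> float:
    """Mean Reciprocal Rank over a collection of ranked result lists."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for pos, item in enumerate(ranked, start=1):
            if item in rel:
                total += 1.0 / pos
                break
    return total / len(rankings)


def average_precision_at_n(ranked: Sequence[str], rel: Set[str], n: int) -> float:
    """Average Precision over the top-N results (averaged over queries for MAP@N)."""
    hits, score = 0, 0.0
    for pos, item in enumerate(ranked[:n], start=1):
        if item in rel:
            hits += 1
            score += hits / pos
    return score / min(len(rel), n) if rel else 0.0


def ndcg_at_n(ranked: Sequence[str], rel: Set[str], n: int) -> float:
    """Normalized Discounted Cumulative Gain with binary gains."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, item in enumerate(ranked[:n], start=1) if item in rel)
    ideal = sum(1.0 / math.log2(pos + 1)
                for pos in range(1, min(len(rel), n) + 1))
    return dcg / ideal if ideal > 0 else 0.0
```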
7 Reflections

This research addresses the problem of finding and organizing learning resources to support self-directed learners in achieving a learning goal. All the proposed strategies are oriented toward the use of semantic technologies and, in particular, of LOD-KBs. Given a learning need, the system draws a sequence that expresses the concepts involved and the order in which they should be addressed. The organization of the sequence can be driven by some property of the KB's ontology, or by prerequisite relationships between the concepts inferred automatically from the background knowledge in the KB. To ensure that the sequence of concepts is translated into a sequence of resources, we propose a representation based on concepts automatically extracted from the resources' contents. This representation allows the resource to be enriched with metadata about the main concepts it addresses. We are still addressing the problem of generating the LCG that expresses the prerequisite relationships amongst concepts.

Acknowledgment. This thesis is funded by Colciencias under PhD scholarship No. 647/2014. I would like to express my heartfelt gratitude to my supervisor Dr. Olga Mariño for her continuous encouragement. Likewise, I would also like to thank Prof. Stefan Dietze and the anonymous reviewers for their feedback.

References

1. Acampora, G., Loia, V., Gaeta, M.: Exploring e-learning knowledge through ontological memetic agents. IEEE Computational Intelligence Magazine 5(2), 66–77 (2010)
2. Al-Muhaideb, S., Menai, M.E.B.: Evolutionary computation approaches to the curriculum sequencing problem. Natural Computing 10(2), 891–920 (2011)
3. Bariso, E.U.: Personalised eLearning in Further Education. Technology-Supported Environments for Personalized Learning (2010)
4. Changuel, S., Labroche, N., Bouchon-Meunier, B.: Resources sequencing using automatic prerequisite–outcome annotation. ACM Trans. Intell. Syst. Technol. 6(1), 6:1–6:30 (Mar 2015)
5. Dietze, S., Taibi, D., Yu, R., Barker, P., d'Aquin, M.: Analysing and improving embedded markup of learning resources on the web. In: Proceedings of the 26th International Conference on World Wide Web Companion. pp. 283–292. WWW '17 Companion, Republic and Canton of Geneva, Switzerland (2017)
6. Farhat, R., Jebali, B., Jemni, M.: Ontology based semantic metadata extraction system for learning objects. pp. 247–250 (2015)
7. Garrido, A., Morales, L., Serina, I.: On the use of case-based planning for e-learning personalization. Expert Systems with Applications 60, 1–15 (2016)
8. Hill, J.R.: Resource-Based Learning. pp. 2850–2852. Springer US, Boston, MA (2012)
9. Jebali, B., Farhat, R.: Ontology-based semantic metadata extraction approach. In: 2013 International Conference on Electrical Engineering and Software Applications. pp. 1–5 (March 2013)
10. Krieger, K., Schneider, J., Nywelt, C., Rösner, D.: Creating semantic fingerprints for web documents. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics. pp. 11:1–11:6. WIMS '15 (2015)
11. Liang, C., Wu, Z., Huang, W., Giles, C.L.: Measuring prerequisite relations among concepts. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. pp. 1668–1674. Lisbon, Portugal (September 2015)
12. Liang, C., Ye, J., Wu, Z., Pursel, B., Giles, C.L.: Recovering concept prerequisite relations from university course dependencies. In: The 7th Symposium on Educational Advances in Artificial Intelligence (2017)
13. Manrique, R., Herazo, O., Mariño, O.: Exploring the use of linked open data for user research interest modeling. In: 12mo Computing Colombian Conference (12CCC) (2017)
14. Manrique, R., Mariño, O.: Diversified semantic query reformulation. In: The International Conference on Knowledge Engineering and Semantic Web (KESW 2017) (Submitted) (2017)
15. Manrique, R., Mariño, O.: How does the size of a document affect linked open data user modeling strategies. In: Int. Workshop on Web Personalization, Recommender Systems and Social Media (WPRSM 2017) (2017)
16. Meymandpour, R., Davis, J.G.: A semantic similarity measure for linked data: An information content-based approach. Knowledge-Based Systems 109, 276–293 (2016)
17. O'Donnell, E., Lawless, S., Sharp, M., Wade, V.P.: A Review of Personalised E-Learning: towards supporting learner diversity. International Journal of Distance Education Technologies 13(1), 22–47 (2015)
18. Paquette, G., Mariño, O., Rogozan, D., Léonard, M.: Competency-based personalization for massive online learning. Smart Learning Environments 2(1), 4 (2015)
19. Paul, C., Rettinger, A., Mogadala, A., Knoblock, C.A., Szekely, P.: Efficient Graph-Based Document Similarity. pp. 334–349. Cham (2016)
Pekar, V., Staab, S.: Taxonomy learning: factoring the structure of a taxonomy into a semantic classification decision. Proceedings of the 19th international conference on Computational linguistics - Volume 1 pp. 1–7 (2002) 21. Piao, G., Breslin, J.G.: Analyzing aggregated semantics-enabled user modeling on google+ and twitter for personalized link recommendations. In: Proceedings of the 2016 Conference on User Modeling Adaptation and Personalization. pp. 105–109. UMAP ’16 (2016) 22. Roy, D., Sarkar, S., Ghose, S.: Automatic extraction of pedagogic metadata from learning content. Int. J. Artif. Intell. Ed. 18(2), 97–118 (Apr 2008) 23. Sharma, R., Banati, H., Bedi, P.: Adaptive Content Sequencing for e-Learning Courses Using Ant Colony Optimization, pp. 579–590 (2012) 24. Talukdar, P.P., Cohen, W.W.: Crowdsourced comprehension: Predicting prerequi- site structure in wikipedia. In: Proceedings of the Seventh Workshop on Building Educational Applications Using NLP. Stroudsburg, PA, USA (2012) 25. Wang, S., Ororbia, A., Wu, Z., Williams, K., Liang, C., Pursel, B., Giles, C.L.: Using prerequisites to extract concept maps from textbooks. In: Proceedings of the 25th ACM International on Conference on Information and Knowledge Man- agement. pp. 317–326. CIKM ’16 (2016)