=Paper=
{{Paper
|id=Vol-2674/paper03
|storemode=property
|title=Digging Into Prerequisite Annotation
|pdfUrl=https://ceur-ws.org/Vol-2674/paper03.pdf
|volume=Vol-2674
|authors=Chiara Alzetta,Ilenia Galluccio,Frosina Koceva,Samuele Passalacqua,Ilaria Torre
|dblpUrl=https://dblp.org/rec/conf/aied/AlzettaGKP020
}}
==Digging Into Prerequisite Annotation==
Chiara Alzetta, Ilenia Galluccio, Frosina Koceva, Samuele Passalacqua, Ilaria Torre

DIBRIS, University of Genoa, Italy

{chiara.alzetta,ilenia.galluccio,frosina.koceva}@edu.unige.it, samuele.passalacqua@dibris.unige.it, ilaria.torre@unige.it

Abstract. Intelligent textbooks are often engineered with an explicit representation of their concepts and prerequisite relations (PR). PR identification is hence crucial for intelligent textbooks, but it still presents some challenges, even when performed by human experts. This may cause PR-annotated datasets to be inconsistent and compromise the accuracy of the automatic creation of enhanced learning materials. This paper investigates possible reasons for PR disagreement and the nature of PR itself, with the aim of contributing to the development of shared strategies for PR annotation, analysis and modelling in textbooks.

Keywords: prerequisite relation · annotation · agreement.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Introduction===

Fundamental functionalities of intelligent textbooks, such as content enrichment [23, 21, 9] and personalization [35, 31, 2], exploit the knowledge contained in the text to select relevant content and organise it according to a structure that reflects the prerequisite relations (PR) among concepts [17, 19]. PRs are highly relevant in education, since they establish which concepts a student needs in order to understand a further concept without loss of information or misconceptions [20, 15, 1, 14].

Despite their importance, PR recognition, whether manual or automatic, still presents many open challenges. Manual PR annotation is a common way to obtain a PR-enriched intelligent textbook, and it is also an essential step for developing automatic textbook content modelling. Systems for automatic PR learning (e.g. [22, 1]) often rely on manually annotated gold standards [29, 34, 15] for evaluating or training machine learning algorithms. Agreement rates [6] are often used to estimate the reliability of annotations [8], but these measures are conditioned by many factors, including the design of the annotation task and the subjectivity of the phenomenon to be labelled [5, 28]. PR annotations rarely achieve high agreement values unless the annotation task is extremely guided [18]. A broad literature discusses causes of disagreement in annotation [12, 11, 7], but this has not been fully explored for PRs. Our goal is to investigate this matter and highlight possible causes of errors and issues, in order to provide a way to improve PR modelling for intelligent textbooks. This is especially useful when increasing the size of the dataset is not a practical way to reduce the impact of errors (as in the case of highly demanding and time-consuming annotation tasks like PR annotation).

===2 Experimental Study, Results and Discussion===

Our analyses are conducted on the PRET dataset described in [4], obtained by asking five domain experts to manually annotate PRs between the concepts mentioned in a computer science textbook chapter [10]. The experts had to read the text and, in parallel, annotate PRs among the encountered concepts. We stress this point since our annotation goal is to model the textbook content rather than elicit the PRs of a domain from the expert's mind.
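Each expert's annotation can thus be viewed as a set of directed (prerequisite, target) concept pairs. The following sketch (not the authors' code; the concept names and data are invented for illustration) shows how such per-expert pair sets can be tallied to obtain agreement counts of the kind reported in the next paragraph.

<syntaxhighlight lang="python">
from collections import Counter

# Hypothetical annotations: one set of directed (prerequisite, target) pairs per expert.
annotations = [
    {("protocol", "internet"), ("packet", "protocol")},  # expert 1
    {("protocol", "internet"), ("router", "packet")},    # expert 2
    {("protocol", "internet")},                          # expert 3
]

# How many experts annotated each directed pair.
pair_counts = Counter(pair for expert in annotations for pair in expert)

# Distribution: how many pairs were annotated by exactly k experts.
agreement_distribution = Counter(pair_counts.values())

print(pair_counts)             # ('protocol', 'internet') is annotated by 3 experts
print(agreement_distribution)  # {3: 1, 1: 2} for the toy data above
</syntaxhighlight>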
For the purposes of this paper, in order to investigate the variability of the annotations, we performed an in-context annotation check, showing each expert a subset of his/her annotations and asking him/her to confirm or revise each annotation after re-reading the text. This process involved only "rare" annotations (concept pairs annotated by that expert only), since errors are known to be more probable among them [13]. In the end, the manual annotation resulted in 25 concept pairs annotated by all five experts, 46 annotated by four experts, 83 by three, 214 by two and 477 by only one annotator, for a total of 845 unique pairs.

====2.1 Interpretation Variability and Annotation Errors====

This section addresses our research question, namely what the main causes of disagreement and errors are. Our results cannot be generalised, but are discussed in the light of the literature. During the annotation check mentioned above, we asked the experts to specify the motivation for their revisions, among the following: a) Not a Concept: at least one concept of the pair is not a domain term; b) Background Knowledge: the PR derives from the expert's domain knowledge, since it is not expressed in the text; c) Too Far: the concepts are too distant in the path; d) Annotation Error: a mistake due to distraction; e) Wrong Direction: the concepts should be reversed; f) Co-Requisites: there is no dependency relation between the two concepts. The table in Fig. 1 reports the distribution of these labels in the revised pairs. Not a Concept is the most recurrent problem for all annotators: common-usage terms like channel and system were added as domain concepts, but were then revised. As will be discussed below, this does not seem to be mostly a trivial error, but rather an interpretation error commonly observed for educational concepts [33], whose boundaries verge on subjectivity [5]. The other error types are unevenly distributed. Too Far seems again related to interpretation and subjectivity issues, which may arise from the expert questioning his/her annotation style (e.g., annotating as PRs also indirectly related concepts). Background Knowledge might also be accounted an annotation-style error; however, in this case it is not due to subjectivity but to a wrong interpretation of the task (i.e., the tendency to infer from other sources PRs that are not explicitly written in the text). The last three types all seem due to distraction. While distraction is inevitable (thus some kind of revision could be a good practice), subjectivity and annotation-style errors are trickier, but the possibility of revision could allow a more reliable annotation to be obtained [26, 24]. Concept interpretation errors are analysed in more depth below, since they are particularly critical for intelligent textbooks, even independently of PRs, as also discussed in [33].

Fig. 1. 'Revision' and 'Error Type' percentages.

Concept Interpretation. Despite the guidelines, the notion of concept was treated in different ways, with fine- vs coarse-grained interpretations. To understand the role played by concepts in agreement, we computed the Pearson correlation between root and leaf nodes (those with zero in-degree and zero out-degree, respectively) in the annotation graphs [16]. Our intuition is that a high correlation between the graphs' root nodes suggests agreement on what is intended as core knowledge, i.e. Primary Notions (PN) for the domain; similarly, a correlation on the leaves, which reflect the paths' Learning Outcomes (LO), indicates a common interpretation of concept granularity. We noticed that the revision helped to harmonise the annotations: while prior to the revision we obtained on average a stronger correlation between root nodes (0.49) than between leaves (0.29) (p < 0.05), on the revised graphs the correlations became similar (0.47 for roots and 0.46 for leaves, with p < 0.05).
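A minimal sketch of this idea follows; it is not the paper's implementation, and the indicator_vectors helper, the shared vocabulary, the toy graphs and the binary-indicator encoding of roots and leaves are our own assumptions, one possible reading of the correlation described above. Roots and leaves are read off each annotator's directed graph with networkx, and the resulting vectors are correlated across annotators with Pearson's r.

<syntaxhighlight lang="python">
import networkx as nx
from scipy.stats import pearsonr

def indicator_vectors(graph, vocabulary):
    """Return 0/1 root and leaf indicator vectors over a shared concept vocabulary."""
    roots = [1 if graph.has_node(c) and graph.in_degree(c) == 0 else 0 for c in vocabulary]
    leaves = [1 if graph.has_node(c) and graph.out_degree(c) == 0 else 0 for c in vocabulary]
    return roots, leaves

# Toy annotation graphs of two experts over the same chapter
# (edges point from prerequisite to target concept).
vocabulary = ["bit", "packet", "protocol", "router", "internet"]
g1 = nx.DiGraph([("bit", "packet"), ("packet", "protocol"), ("protocol", "internet")])
g2 = nx.DiGraph([("bit", "packet"), ("packet", "router"), ("router", "internet")])

roots1, leaves1 = indicator_vectors(g1, vocabulary)
roots2, leaves2 = indicator_vectors(g2, vocabulary)

r_roots, p_roots = pearsonr(roots1, roots2)
r_leaves, p_leaves = pearsonr(leaves1, leaves2)
print(f"root correlation: r={r_roots:.2f} (p={p_roots:.3f})")
print(f"leaf correlation: r={r_leaves:.2f} (p={p_leaves:.3f})")
</syntaxhighlight>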
====2.2 Semantic and Lexical Relationships in PRs====

We investigate here whether there are semantic relations and linguistic patterns that frequently occur in PRs with high vs low agreement.

Fig. 2. Semantic Relationship Type distribution and description for PRs annotated by 3 or more experts (High Agr) or by 1 expert only (Low Agr).

As Fig. 2 shows, lexical relations are the most frequent semantic type among PRs with high agreement, covering more than 60% of cases, while Functional Relations are more common among PRs with low agreement. This could be due to the fact that taxonomies, by definition, exploit a "dependency relation" to classify elements. On the other hand, the Functional Relation is highly affected by the presentation of concepts in the text, which entails a text interpretation and a consequently low agreement due to subjectivity. These distributions highlight the need for annotation guidelines with clear examples of PRs not involved in lexical relations, since those types might be harder to identify in a text and could give rise to disagreement.

====2.3 Holistic Nature of the Annotation Process====

Previous work [29, 34, 36] mostly addressed PR annotation and evaluation as a pairwise combination of concepts; we also used this approach [3]. However, we believe that such an approach over-simplifies the annotation and may result in misinterpretations of the relations contained in the text. Indeed, semantic annotations often result in (directed) graphs [27, 25, 30] in which each path is an interpretation of a relation that arises from reading the whole text (this explains their holistic nature), and they should be evaluated accounting for these peculiarities. The commonly adopted pairwise evaluation of PRs misses the interdependence between the concepts involved in a PR path and does not take into account the pedagogical characteristics of the annotated graph as a whole. Temporal relation processing may represent an interesting ground of comparison for PRs, since the precedence relation also has a transitive and sequential nature. Researchers in both fields encounter similar limitations with the traditional performance metrics used in information retrieval, e.g. precision and recall [32]. A common scenario in both fields is when three items A, B, C (concepts or events) are annotated by one rater such that A < B and B < C, but another rater identifies the relation A < C (here < indicates both the temporal relation before and the prerequisite relation ≺): in such cases, traditional agreement metrics fail to identify A < C as a shared relation, even though it is an implicit consequence of the other two relations [32]. This suggests that agreement could be computed by considering transitive edges and path similarity in the two graphs. However, it is worth noting that this approach, involving transitive closure, might result in considering as PRs some relations that are too far apart in a path. This is an open issue, as is the selection of proper metrics to compare PR graph similarities.
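The A, B, C scenario above can be reproduced in a few lines. The sketch below (our own illustration, not a metric proposed in the paper) compares the raw edge sets of two raters and then the edge sets of their transitive closures, showing that only the latter credits A < C as a shared relation.

<syntaxhighlight lang="python">
import networkx as nx

# Rater 1 annotates A < B and B < C; rater 2 annotates only A < C.
rater1 = nx.DiGraph([("A", "B"), ("B", "C")])
rater2 = nx.DiGraph([("A", "C")])

def shared_edges(g1, g2, use_closure=False):
    """Edges annotated by both raters, optionally after transitive closure."""
    if use_closure:
        g1, g2 = nx.transitive_closure(g1), nx.transitive_closure(g2)
    return set(g1.edges()) & set(g2.edges())

print(shared_edges(rater1, rater2))                    # set(): no agreement on raw edges
print(shared_edges(rater1, rater2, use_closure=True))  # {('A', 'C')}: implicit agreement
</syntaxhighlight>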
In our current work we are testing metrics that work at different levels, among them Graph Edit Distance (GED), Vertex Edge Overlap (VEO), PageRank, and inter-agreement on learning paths based on the transitive nature of PRs (for details see [16]). We believe this research direction is a relevant contribution with respect to current approaches, since it makes it possible to compute agreement not only between annotators, but also between automatic methods for knowledge graph creation, which is highly useful for intelligent textbooks.
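As an illustration of such graph-level comparisons, the sketch below (our own assumptions, not the implementation referenced in [16]) computes Vertex Edge Overlap, an L1 distance between PageRank vectors, and the exact graph edit distance with networkx on two toy annotation graphs.

<syntaxhighlight lang="python">
import networkx as nx

def vertex_edge_overlap(g1, g2):
    """VEO similarity: overlap of node and edge sets, normalised to [0, 1]."""
    shared = len(set(g1.nodes()) & set(g2.nodes())) + len(set(g1.edges()) & set(g2.edges()))
    total = g1.number_of_nodes() + g2.number_of_nodes() + g1.number_of_edges() + g2.number_of_edges()
    return 2 * shared / total

def pagerank_distance(g1, g2):
    """L1 distance between PageRank vectors computed over the union of concepts."""
    pr1, pr2 = nx.pagerank(g1), nx.pagerank(g2)
    concepts = set(pr1) | set(pr2)
    return sum(abs(pr1.get(c, 0.0) - pr2.get(c, 0.0)) for c in concepts)

# Toy annotation graphs of two experts (edges point from prerequisite to target).
g1 = nx.DiGraph([("bit", "packet"), ("packet", "protocol"), ("protocol", "internet")])
g2 = nx.DiGraph([("bit", "packet"), ("packet", "internet")])

print("VEO similarity:      ", vertex_edge_overlap(g1, g2))
print("PageRank L1 distance:", pagerank_distance(g1, g2))
# Exact GED is exponential in graph size and feasible only for small graphs.
print("Graph edit distance: ", nx.graph_edit_distance(g1, g2))
</syntaxhighlight>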
===Conclusions===

In this paper we investigated different forms of disagreement in PR annotation that affect PR processing and management, and consequently textbook modelling. Our aim is thus to contribute to the modelling of PR structure in textbooks by developing a subjectivity-aware PR coding protocol. The analysis also opens up future research on using fuzzy PRs, instead of yes/no (binary) PRs, for structuring the knowledge behind a text.

===References===

1. Adorni, G., Alzetta, C., Koceva, F., Passalacqua, S., Torre, I.: Towards the identification of propaedeutic relations in textbooks. In: International Conference on Artificial Intelligence in Education. pp. 1–13. Springer (2019)
2. Alpizar-Chacon, I., Sosnovsky, S.: Interlingua: Linking textbooks across different languages (2019)
3. Alzetta, C., Koceva, F., Passalacqua, S., Torre, I., Adorni, G.: PRET: Prerequisite-enriched terminology. A case study on educational texts. In: Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 (2018)
4. Alzetta, C., Miaschi, A., Adorni, G., Dell'Orletta, F., Koceva, F., Passalacqua, S., Torre, I.: Prerequisite or not prerequisite? That's the problem! An NLP-based approach for concept prerequisites learning. In: 6th Italian Conference on Computational Linguistics, CLiC-it 2019. vol. 2481. CEUR-WS (2019)
5. Amidei, J., Piwek, P., Willis, A.: Rethinking the agreement in human evaluation tasks (2018)
6. Artstein, R.: Inter-annotator agreement. In: Handbook of Linguistic Annotation, pp. 297–313. Springer (2017)
7. Artstein, R., Poesio, M.: Inter-coder agreement for computational linguistics. Computational Linguistics 34(4), 555–596 (2008)
8. Bayerl, P.S., Paul, K.I.: What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics 37(4), 699–725 (2011)
9. Boulanger, D., Kumar, V.: An overview of recent developments in intelligent e-textbooks and reading analytics (2019)
10. Brookshear, G., Brylow, D.: Computer Science: An Overview, Global Edition, chap. 4 Networking and the Internet. Pearson Education Limited (2015)
11. Carletta, J.: Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2), 249–254 (1996)
12. Feinstein, A.R., Cicchetti, D.V.: High agreement but low kappa: I. The problems of two paradoxes. Journal of Clinical Epidemiology 43(6), 543–549 (1990)
13. Fort, K., Nazarenko, A., Rosset, S.: Modeling the complexity of manual annotation tasks: a grid of analysis. In: International Conference on Computational Linguistics. pp. 895–910 (2012)
14. Gagne, R.M.: Learning hierarchies. Educational Psychologist 6(1), 1–9 (1968)
15. Gasparetti, F., De Medio, C., Limongelli, C., Sciarrone, F., Temperini, M.: Prerequisites between learning objects: Automatic extraction based on a machine learning approach. Telematics and Informatics 35(3), 595–610 (2018)
16. Koceva, F., Alzetta, C., Galluccio, I., Passalacqua, S., Torre, I.: Prerequisite similarity metrics, http://teldh.dibris.unige.it/pr-similarity-metrics/
17. Labutov, I., Huang, Y., Brusilovsky, P., He, D.: Semi-supervised techniques for mining learning outcomes and prerequisites. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 907–915. ACM (2017)
18. Li, I., Fabbri, A.R., Tung, R.R., Radev, D.R.: What should I learn first: Introducing LectureBank for NLP education and prerequisite chain learning. Proceedings of AAAI 2019 (2019)
19. Liang, C., Ye, J., Wang, S., Pursel, B., Giles, C.L.: Investigating active learning for concept prerequisite learning. Proc. EAAI (2018)
20. Manrique, R., Sosa, J., Marino, O., Nunes, B.P., Cardozo, N.: Investigating learning resources precedence relations via concept prerequisite learning. In: 2018 IEEE/WIC/ACM International Conference on Web Intelligence (WI). pp. 198–205. IEEE (2018)
21. Matsuda, N., Shimmei, M.: PASTEL: Evidence-based learning engineering method to create intelligent online textbook at scale (2019)
22. Miaschi, A., Alzetta, C., Cardillo, F.A., Dell'Orletta, F.: Linguistically-driven strategy for concept prerequisites learning on Italian. In: Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2019) (2019)
23. Miller, B., Ranum, D.: Runestone Interactive: tools for creating interactive course materials. In: Proceedings of the First ACM Conference on Learning@Scale. pp. 213–214. ACM (2014)
24. Plank, B., Hovy, D., Søgaard, A.: Linguistically debatable or just plain wrong? In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). pp. 507–511 (2014)
25. Poesio, M.: The MATE/GNOME proposals for anaphoric annotation, revisited. In: Proceedings of the 5th SIGdial Workshop on Discourse and Dialogue at HLT-NAACL 2004. pp. 154–162 (2004)
26. Pustejovsky, J., Stubbs, A.: Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O'Reilly Media, Inc. (2012)
27. Ries, K.: Segmenting conversations by topic, initiative, and style. In: Workshop on Information Retrieval Techniques for Speech Applications. pp. 51–66. Springer (2001)
28. Saldaña, J.: The coding manual for qualitative researchers. Sage (2015)
29. Talukdar, P.P., Cohen, W.W.: Crowdsourced comprehension: predicting prerequisite structure in Wikipedia. In: Proceedings of the Seventh Workshop on Building Educational Applications Using NLP. pp. 307–315. Association for Computational Linguistics (2012)
30. Tannier, X., Muller, P.: Evaluating temporal graphs built from texts via transitive reduction. Journal of Artificial Intelligence Research 40, 375–413 (2011)
31. Thaker, K., Brusilovsky, P., He, D.: Student modeling with automatic knowledge component extraction for adaptive textbooks (2019)
32. UzZaman, N., Allen, J.: Temporal evaluation. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. pp. 351–356 (2011)
33. Wang, M., Chau, H., Thaker, K., Brusilovsky, P., He, D.: Concept annotation for intelligent textbooks. arXiv preprint arXiv:2005.11422 (2020)
34. Wang, S., Ororbia, A., Wu, Z., Williams, K., Liang, C., Pursel, B., Giles, C.L.: Using prerequisites to extract concept maps from textbooks. In: Proceedings of the 25th ACM International Conference on Information and Knowledge Management. pp. 317–326. ACM (2016)
35. Yin, C., Uosaki, N., Chu, H.C., Hwang, G.J., Hwang, J., Hatono, I., Tabata, Y.: Learning behavioral pattern analysis based on students' logs in reading digital books. In: Proceedings of the 25th International Conference on Computers in Education. pp. 549–557 (2017)
36. Zhou, Y., Xiao, K.: Extracting prerequisite relations among concepts in Wikipedia. In: 2019 International Joint Conference on Neural Networks (IJCNN). pp. 1–8. IEEE (2019)