NLP-CIC @ PRELEARN: Mastering Prerequisites Relations, from Handcrafted Features to Embeddings

Jason Angel, Segun Taofeek Aroyehun, Alexander Gelbukh
Instituto Politécnico Nacional, Mexico City, Mexico
ajason08@gmail.com, aroyehun.segun@gmail.com, www.gelbukh.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present our systems and findings for the prerequisite relation learning task (PRELEARN) at EVALITA 2020. The task is to classify whether a pair of concepts holds a prerequisite relation or not. We model the problem using handcrafted features and embedding representations for in-domain and cross-domain scenarios. Our submissions ranked first in both scenarios, with average F1 scores of 0.887 and 0.690 respectively across domains on the test sets. Our code is freely available at https://github.com/ajason08/EVALITA2020_PRELEARN.

1 Introduction

A prerequisite relation is a pedagogical relation that indicates the order in which concepts should be presented to learners. The relation can be used to guide the sequence of topics and subjects during the design of academic programs, lectures, curricula, and instructional materials.

In this work, we present our systems for automatically detecting prerequisite relations for the Italian language in the context of the PRELEARN shared task (Alzetta et al., 2020) at EVALITA 2020 (Basile et al., 2020). The evaluation of submissions considers: (1) in-domain and cross-domain scenarios, defined by either the inclusion (in-domain) or exclusion (cross-domain) of the target domain from the training set; the four domains are 'data mining' (DM), 'geometry' (Geo), 'precalculus' (Prec), and 'physics' (Phy); and (2) the type of resources (features) used to train the model: raw text vs. structured information. The combination of these settings defines the four PRELEARN subtasks.

Formally, a prerequisite relation exists between two concepts if one has to be known beforehand in order to understand the other. For the PRELEARN task, given a pair of concepts, the relation exists only if the latter concept is a prerequisite for the former. The task is therefore a binary classification task.

We approach the problem from two perspectives: handcrafted features based on lexical complexity and pre-trained embeddings. We employ static embeddings from Wikipedia and Wikidata, and contextual embeddings from the Italian-BERT model.

2 Related works

Prerequisite relation learning has mostly been studied for the English language (Liang et al., 2018; Talukdar and Cohen, 2012). Adorni et al. (2019) performed unsupervised extraction of prerequisite relations from textbooks using word co-occurrence and the order in which words appear in the text. For the Italian language there is ITA-PREREQ (Miaschi et al., 2019), the first dataset for prerequisite learning and the one used in the present work. It was automatically built as a projection of AL-CPL (Liang et al., 2018) from the English Wikipedia to the Italian Wikipedia. In addition, Miaschi et al. (2019) examine the utility of lexical features for individual concepts and of features derived from concept pairs.

3 Methodology

This section describes the data analysis, the features we used to model the task, and the systems we submitted to the PRELEARN competition.
3.1 Dataset

The dataset provided by the organizers contains concept pairs split into the following domains: 'data mining', 'geometry', 'precalculus', and 'physics'. It also contains the list of concepts with a link to the corresponding Wikipedia article; the first paragraph of that article is called the concept description. All concept descriptions are pre-cleaned to facilitate the extraction of information from the text, e.g. mathematical expressions are already tagged with a formula placeholder.

Table 1 shows the number of samples and the proportion of prerequisite relations (positive samples) per domain for the training set. The test sets, in turn, exhibit a 50-50 distribution over positive and negative samples. The only preprocessing we applied was lowercasing the concept descriptions and removing line breaks.

Domain        Samples   Prerequisite rel.
Data mining       424   0.257
Geometry         1548   0.214
Precalculus      2220   0.142
Physics          1716   0.238

Table 1: Number of training samples and proportion of prerequisite relations (positive samples) per domain.

3.2 Features

The following are the sets of features we experiment with.

Complexity-based: a set of handcrafted features intended to measure how complex a concept is. The rationale is that less complex concepts are prerequisites for more complex ones. We use features that have been found effective for the task of complex word identification (Aroyehun et al., 2018); specifically, they are the following (a computation sketch is given after the list):

- Age of acquisition of concept: we use ItAoA (Montefinese et al., 2019), a dataset of age-of-acquisition norms (we average the values of the different entries per word). To derive the age of acquisition of a concept, we compute the geometric mean of the ItAoA values of the words that occur in the concept description, after replacing outliers with the closest permitted value. In addition, we use the number of matched words as a feature.

- Age of acquisition of related concepts: we derive a list of concepts related to each concept by checking which other concepts appear in its description. We then average the age of acquisition of those related concepts and also use their count as a feature.

- Description length: the number of words in the concept description.

- Number of mathematical expressions: the number of occurrences of mathematical expressions. We assume that more complex concepts have more mathematical expressions in their descriptions.

- Concept view frequency: the average number of daily unique visits by Wikipedia users (including editors, anonymous editors, and readers) over the last year. We expect the number of visitors to correlate with the degree of complexity of a concept. We gathered this information with the Pageviews Analysis tool of Wikipedia (https://pageviews.toolforge.org).
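To make these definitions concrete, below is a minimal sketch of how the complexity-based features could be computed for one concept. It assumes the cleaned description is a plain string, that a word-level AoA lookup built from ItAoA is available, and that mathematical expressions are marked with a "formula" placeholder; all helper and argument names are ours and are not taken from the released code. The concept view frequency feature is omitted because it requires querying the Pageviews Analysis tool.

```python
# Illustrative sketch of the complexity-based features of Section 3.2
# (helper names and the "formula" placeholder handling are assumptions,
# not the authors' released code).
import re
from statistics import geometric_mean

def concept_aoa(description, aoa_norms):
    """Geometric mean of the ItAoA values of the description words found in
    the norms (outlier clipping omitted here), plus the number of matches."""
    words = description.lower().split()
    values = [aoa_norms[w] for w in words if w in aoa_norms]
    return (geometric_mean(values) if values else 0.0), len(values)

def complexity_features(concept, description, aoa_norms, concept_aoa_table, all_concepts):
    """Complexity-based features for one concept.

    aoa_norms: word -> averaged age-of-acquisition value (from ItAoA).
    concept_aoa_table: concept title -> its own AoA value (precomputed with concept_aoa).
    all_concepts: iterable of concept titles, used to detect related concepts.
    """
    text = description.lower()
    aoa_value, n_aoa_matches = concept_aoa(description, aoa_norms)

    # Related concepts: other concept titles that appear in this description.
    related = [c for c in all_concepts if c != concept and c.lower() in text]
    related_aoa = [concept_aoa_table[c] for c in related if c in concept_aoa_table]

    return {
        "aoa": aoa_value,
        "n_aoa_matches": n_aoa_matches,
        "n_related": len(related),
        "avg_related_aoa": sum(related_aoa) / len(related_aoa) if related_aoa else 0.0,
        "desc_len": len(text.split()),                        # description length
        "n_formulas": len(re.findall(r"\bformula\b", text)),  # tagged math expressions
    }
```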
Concept-to-concept: these features aim to model the relation between the two concepts of a pair. Specifically, we check whether one concept appears as a substring in the title or description of the other concept; we do this in both directions, resulting in two features. We also represent the domain the pair belongs to as a one-hot vector.

Wiki-embeddings: we map each concept identifier to its corresponding Wikipedia title and Wikidata identifier using the Wikidata Query Service (https://query.wikidata.org). We then obtain a 100-dimensional vector for each Wikipedia title from a pre-trained Wikipedia embedding (Yamada et al., 2020; http://wikipedia2vec.s3.amazonaws.com/models/it/2018-04-20/itwiki_20180420_100d.pkl.bz2). Similarly, we use the Wikidata embedding (Lerer et al., 2019; https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz) to represent the Wikidata identifiers as 200-dimensional vectors.

Italian-BERT features: we use a pre-trained uncased Italian BERT (base model) provided by the MDZ Digital Library team (dbmdz), trained on 13GB of text mainly from Wikipedia and other text sources (https://huggingface.co/dbmdz/bert-base-italian-uncased). With this model, we obtain the 768-dimensional vector representation of a sequence from the [CLS] token, as in the original implementation of BERT (Devlin et al., 2019). The sequence consists of the concatenation of the concept and its Wikipedia description.

3.3 Systems

Considering the proposed features and our experimental results in Section 5, we propose the following three systems to address both the in-domain and cross-domain scenarios. For the in-domain scenario we train on the combination of all training samples across domains. Likewise, for each cross-domain experiment we combine the remaining three domains (i.e., excluding samples from the target domain).

Complex: a fully handcrafted machine learning system. It uses all the complexity-based and concept-to-concept features (except the domain vector in the cross-domain scenario), and we normalize the features using Z-score normalization. The classifier is a tree ensemble (random forest) with the default parameters of Breiman (2001) (https://cran.r-project.org/web/packages/randomForest/index.html); other classifiers were tested and obtained lower performance. This system participated in the structured resource setting because the concept view frequency feature is structured information.

Complex+wd: an improved version of the Complex system that additionally concatenates the Wikidata embedding of each concept in the pair to the feature set. This system also participated in the structured resource setting. We decided not to include the Wikipedia embeddings based on the ablation analysis presented in Table 4.

Italian-BERT: a single-layer neural network mapping the 768 [CLS] features to an output space of dimension 2, treating the task as sequence pair classification. In addition, the pre-trained weights of the base model are fine-tuned on the training data. We fine-tune the base model using the Hugging Face transformers library (version 3.1) for PyTorch (Wolf et al., 2019). In the in-domain scenario, we use the following training parameters: 10 epochs, learning rate 5e-5, weight decay 0.01, batch size 32, 100 warm-up steps, and the AdamW optimizer with a linear schedule after the warm-up period. We find that the model exhibits high variance across runs in our cross-domain experiments. Hence, in addition to the parameter settings of the in-domain experiments, we choose the number of training steps using a validation set for the unseen target domain. Accordingly, we set the maximum number of training steps to 400, and the warm-up steps to 100, 200, 150, and 200 for the data mining, geometry, physics, and precalculus cross-domain scenarios, respectively.
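As a rough illustration of the Italian-BERT system, the sketch below fine-tunes dbmdz/bert-base-italian-uncased as a sequence pair classifier with the in-domain hyperparameters listed above. It is our own sketch, not the authors' released code: it uses the current Hugging Face Trainer API rather than the exact version 3.1 code, and the `pairs`/`labels` placeholders stand in for the actual PRELEARN data preparation.

```python
# Illustrative fine-tuning sketch: Italian BERT as a sequence pair classifier
# over (concept A, concept B) sequences, each a concept title plus description.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical placeholder data: real inputs would come from the PRELEARN files.
pairs = [("concetto A. descrizione ...", "concetto B. descrizione ...")]
labels = [1]  # 1 = prerequisite relation holds, 0 = it does not

encodings = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                      truncation=True, padding=True)

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# In-domain hyperparameters reported in Section 3.3; Trainer's default optimizer
# is AdamW with a linear schedule after warm-up.
args = TrainingArguments(
    output_dir="prelearn-bert",
    num_train_epochs=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    warmup_steps=100,
)

Trainer(model=model, args=args, train_dataset=PairDataset(encodings, labels)).train()
```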
4 Results

Table 2 shows the per-domain results of our systems, indicating the scenario and the kind of resources they use. We observe the clear superiority of Italian-BERT, which relies only on raw-text resources. This suggests that simply fine-tuning BERT is enough to acquire a notion of prerequisite relations between concepts. Still, the systems based on handcrafted features and non-contextual embeddings show competitive results, with performance good enough to rank first in the structured resource setting, while being faster, more interpretable, and simpler than the Italian-BERT counterpart.

Scenario      Resources   System        DM     Geo    Phy    Prec   AVG
in-domain     raw-text    Italian-BERT  0.838  0.925  0.855  0.930  0.887
in-domain     structured  Complex+wd    0.808  0.905  0.795  0.915  0.856
in-domain     structured  Complex       0.828  0.895  0.785  0.885  0.848
cross-domain  raw-text    Italian-BERT  0.565  0.785  0.635  0.775  0.690
cross-domain  structured  Complex+wd    0.535  0.775  0.600  0.760  0.668
cross-domain  structured  Complex       0.494  0.735  0.595  0.730  0.639

Table 2: Test set results (F1 score) for the four PRELEARN subtasks.

The results also show a large performance reduction in the cross-domain scenario. The largest drop is in the 'data mining' domain. Given that we train our models on the combination of examples from all other domains, the most probable cause is domain mismatch. Yet, the reductions on the test sets are smaller than what we observe in our k-fold experiments and on the validation sets.

In addition, Table 3 shows the performance advantage we obtained over the next best participant, based on the ranking released by the organizers. One can see that the larger advantage is in the structured resource setting. This suggests that the concept view frequency and Wikidata embedding features are effective.

Settings     In-domain   Cross-domain
raw-text     +2.1%       +4.2%
structured   +15.6%      +4.8%

Table 3: Performance advantage over the next best participant, on average across domains.

5 Discussion: ablation analysis

While building our systems we performed several experiments over the candidate features. We used 10-fold cross-validation for the in-domain experiments, except for Italian-BERT, for which, due to its high computational requirements, we used a stratified 30% split as validation set. Table 4 shows the experimental results on the training (validation) data for both the in-domain and cross-domain scenarios. The 'Resources' column indicates the type of resources used by each feature set.

Scenario      Resources   Feature set                          DM     Geo    Phy    Prec   AVG
in-domain     raw         complexity                           0.646  0.817  0.622  0.792  0.720
in-domain     raw         wp embedding                         0.705  0.818  0.670  0.827  0.755
in-domain     raw         Italian-BERT                         0.947  0.746  0.829  0.842  0.841
in-domain     structured  complexity +page view                0.648  0.805  0.629  0.804  0.721
in-domain     structured  wd embedding                         0.660  0.814  0.674  0.838  0.746
in-domain     structured  wd+wp embedding                      0.694  0.824  0.672  0.831  0.755
in-domain     structured  complexity +page view +wd embedding  0.697  0.823  0.686  0.845  0.763
cross-domain  raw         complexity                           0.072  0.592  0.258  0.586  0.377
cross-domain  raw         wp embedding                         0.000  0.622  0.079  0.344  0.261
cross-domain  raw         Italian-BERT                         0.145  0.646  0.460  0.570  0.455
cross-domain  structured  complexity +page view                0.107  0.588  0.297  0.577  0.392
cross-domain  structured  wd embedding                         0.000  0.661  0.355  0.608  0.406
cross-domain  structured  wd+wp embedding                      0.000  0.660  0.332  0.605  0.399
cross-domain  structured  complexity +page view +wd embedding  0.064  0.645  0.366  0.630  0.426

Table 4: Ablation analysis results (F1 score; validation set for Italian-BERT, 10-fold cross-validation for the others).
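For the feature-based systems, the in-domain ablation protocol above amounts to a standard cross-validation loop. The sketch below is ours, not the released code: scikit-learn's StandardScaler and RandomForestClassifier stand in for the Z-score normalization and tree-ensemble learner of Section 3.3, and the feature matrix X and label vector y are placeholders for features built as in Section 3.2.

```python
# Illustrative 10-fold evaluation of a feature-based system (X and y are
# placeholders; scikit-learn stands in for the original setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 8)                 # placeholder feature matrix
y = rng.randint(0, 2, size=200)      # placeholder 0/1 prerequisite labels

# Z-score normalization followed by a tree-ensemble classifier with default parameters.
clf = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

scores = cross_val_score(clf, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="f1")
print(f"10-fold F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```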
We observe that the 'data mining' domain is particularly difficult in the cross-domain scenario: models based on the non-contextual embedding features obtain F1 scores of zero. We suspect that this difficulty is due to the domain mismatch.

Based on these results, we selected Italian-BERT for the raw-text setting, and the 'complexity +page view' feature set augmented with the Wikidata embeddings ('wd embedding') for the structured resource setting as our submissions.

6 Conclusion

We tackle the task of prerequisite relation learning with a variety of systems that explore three sets of features: handcrafted features based on complexity intuitions, embedding models from Wikipedia and Wikidata, and contextual embeddings from the Italian-BERT model. We examine the capabilities of our models in in-domain and cross-domain scenarios. Our models ranked first in all the subtasks of the PRELEARN competition at EVALITA 2020. We found that although our Italian-BERT model outperformed the others, the simpler models show competitive results.

We plan to further examine the impact of using a combination of all possible domains as training set on the performance of our models.

Acknowledgments

The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

References

Giovanni Adorni, Chiara Alzetta, Frosina Koceva, Samuele Passalacqua, and Ilaria Torre. 2019. Towards the identification of propaedeutic relations in textbooks. In International Conference on Artificial Intelligence in Education, pages 1–13. Springer.

Chiara Alzetta, Alessio Miaschi, Felice Dell'Orletta, Frosina Koceva, and Ilaria Torre. 2020. PRELEARN @ EVALITA 2020: Overview of the prerequisite relation learning task for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Segun Taofeek Aroyehun, Jason Angel, Daniel Alejandro Pérez Alvarez, and Alexander Gelbukh. 2018. Complex word identification: Convolutional neural network vs. feature engineering. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 322–327.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. In Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA.
Chen Liang, Jianbo Ye, Han Zhao, Bart Pursel, and C. Lee Giles. 2018. Active learning of strict partial orders: A case study on concept prerequisite relations. arXiv preprint arXiv:1801.06481.

Alessio Miaschi, Chiara Alzetta, Franco Alberto Cardillo, and Felice Dell'Orletta. 2019. Linguistically-driven strategy for concept prerequisites learning on Italian. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 285–295.

Maria Montefinese, David Vinson, Gabriella Vigliocco, and Ettore Ambrosini. 2019. Italian age of acquisition norms for a large set of words (ItAoA). Frontiers in Psychology, 10:278.

Partha Talukdar and William Cohen. 2012. Crowdsourced comprehension: Predicting prerequisite structure in Wikipedia. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 307–315.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. arXiv preprint arXiv:1812.06280v3.