NLP-CIC @ PRELEARN: Mastering Prerequisites Relations, from Handcrafted Features to Embeddings

Jason Angel, Segun Taofeek Aroyehun, Alexander Gelbukh
Instituto Politécnico Nacional, Mexico City, Mexico
ajason08@gmail.com, aroyehun.segun@gmail.com, www.gelbukh.com

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

We present our systems and findings for the prerequisite relation learning task (PRELEARN) at EVALITA 2020. The task is to classify whether a pair of concepts holds a prerequisite relation or not. We model the problem using handcrafted features and embedding representations for in-domain and cross-domain scenarios. Our submissions ranked first in both scenarios, with average F1 scores of 0.887 and 0.690 respectively across domains on the test sets. Our code is freely available at https://github.com/ajason08/EVALITA2020_PRELEARN.

1 Introduction

A prerequisite relation is a pedagogical relation that indicates the order in which concepts should be presented to learners. The relation can be used to guide the sequence of topics and subjects during the design of academic programs, lectures, curricula, and instructional materials.

In this work, we present our systems for automatically detecting prerequisite relations for the Italian language in the context of the PRELEARN shared task (Alzetta et al., 2020) at EVALITA 2020 (Basile et al., 2020). The evaluation of submissions considers: (1) in-domain and cross-domain scenarios, defined by either the inclusion (in-domain) or exclusion (cross-domain) of the target domain from the training set; the four domains are 'data mining' (DM), 'geometry' (Geo), 'precalculus' (Prec), and 'physics' (Phy); and (2) the type of resources (features) used to train the model: raw text vs. structured information. The combination of these settings defines the four PRELEARN subtasks.

Formally, a prerequisite relation exists between two concepts if one has to be known beforehand in order to understand the other. For the PRELEARN task, given a pair of concepts, the relation exists only if the latter concept is a prerequisite for the former. The task is therefore a binary classification task.

We approach the problem from two perspectives: handcrafted features based on lexical complexity and pre-trained embeddings. We employ static embeddings from Wikipedia and Wikidata, and contextual embeddings from the Italian-BERT model.

2 Related works

Prerequisite relation learning has mostly been studied for the English language (Liang et al., 2018; Talukdar and Cohen, 2012). Adorni et al. (2019) performed unsupervised extraction of prerequisite relations from textbooks using word co-occurrence and the order in which words appear in the text. For the Italian language there is ITA-PREREQ (Miaschi et al., 2019), the first dataset for prerequisite learning and the one used in the present work. It was automatically built as a projection of AL-CPL (Liang et al., 2018) from the English Wikipedia to the Italian Wikipedia. In addition, Miaschi et al. (2019) examine the utility of lexical features for individual concepts and of features derived from concept pairs.

3 Methodology

This section describes the data analysis, the features we used to model the task, and the systems we submitted to the PRELEARN competition.
3.1 Dataset

The dataset provided by the organizers contains concept pairs split into the following domains: 'data mining', 'geometry', 'precalculus', and 'physics'. It also contains the list of concepts with a link to the corresponding Wikipedia article; the first paragraph of that article is called the concept description. All concept descriptions are pre-cleaned to facilitate the extraction of information from the text, e.g. mathematical expressions are already tagged with a formula placeholder.

Table 1 shows the number of samples and the proportion of prerequisite relations (positive samples) per domain for the training set. The test sets, in turn, exhibit a 50-50 distribution over positive and negative samples. The only preprocessing we applied was lowercasing the concept descriptions and removing line breaks.

Domain        Samples   Prerequisite rel.
Data mining       424   0.257
Geometry         1548   0.214
Precalculus      2220   0.142
Physics          1716   0.238

Table 1: Number of training samples and proportion of prerequisite relations (positive samples) per domain.

3.2 Features

The following are the sets of features we experiment with.

Complexity-based: a set of handcrafted features intended to measure how complex a concept is. The rationale is that less complex concepts are prerequisites for more complex ones. We use features that have been found effective for the task of complex word identification (Aroyehun et al., 2018); specifically, they are the following (a computation sketch is given after the list):

- Age of acquisition of concept: we use ItAoA (Montefinese et al., 2019), a dataset of age-of-acquisition norms (we average the values of the different entries per word). To derive the age of acquisition of a concept, we compute the geometric mean of the ItAoA values of the words that occur in the concept description, after replacing outliers with the closest permitted value. In addition, we use the number of matched words as a feature.

- Age of acquisition of related concepts: we derive a list of concepts related to each concept by checking which other concepts appear in its description. We then average the age of acquisition of those related concepts and also use their count as a feature.

- Description length: the number of words in the concept description.

- Number of mathematical expressions: the number of occurrences of mathematical expressions. We assume that more complex concepts have more mathematical expressions in their descriptions.

- Concept view frequency: the average number of daily unique visits by Wikipedia users (including editors, anonymous editors, and readers) over the last year. We expect the number of visitors to correlate with the degree of complexity of a concept. We gathered this information with the Pageviews Analysis tool of Wikipedia (https://pageviews.toolforge.org).
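To make these definitions concrete, below is a minimal sketch of how the complexity-based features could be computed for one concept. It assumes the cleaned description is a plain string, that a word-level AoA lookup built from ItAoA is available, and that mathematical expressions are marked with a "formula" placeholder; all helper and argument names are ours and are not taken from the released code. The concept view frequency feature is omitted because it requires querying the Pageviews Analysis tool.

```python
# Illustrative sketch of the complexity-based features of Section 3.2
# (helper names and the "formula" placeholder handling are assumptions,
# not the authors' released code).
import re
from statistics import geometric_mean

def concept_aoa(description, aoa_norms):
    """Geometric mean of the ItAoA values of the description words found in
    the norms (outlier clipping omitted here), plus the number of matches."""
    words = description.lower().split()
    values = [aoa_norms[w] for w in words if w in aoa_norms]
    return (geometric_mean(values) if values else 0.0), len(values)

def complexity_features(concept, description, aoa_norms, concept_aoa_table, all_concepts):
    """Complexity-based features for one concept.

    aoa_norms: word -> averaged age-of-acquisition value (from ItAoA).
    concept_aoa_table: concept title -> its own AoA value (precomputed with concept_aoa).
    all_concepts: iterable of concept titles, used to detect related concepts.
    """
    text = description.lower()
    aoa_value, n_aoa_matches = concept_aoa(description, aoa_norms)

    # Related concepts: other concept titles that appear in this description.
    related = [c for c in all_concepts if c != concept and c.lower() in text]
    related_aoa = [concept_aoa_table[c] for c in related if c in concept_aoa_table]

    return {
        "aoa": aoa_value,
        "n_aoa_matches": n_aoa_matches,
        "n_related": len(related),
        "avg_related_aoa": sum(related_aoa) / len(related_aoa) if related_aoa else 0.0,
        "desc_len": len(text.split()),                        # description length
        "n_formulas": len(re.findall(r"\bformula\b", text)),  # tagged math expressions
    }
```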
Concept-to-concept: these features aim to model the relation between the two concepts of a pair. Specifically, we check whether one concept appears as a substring in the title or description of the other concept; we do this in both directions, resulting in two features. We also represent the domain the pair belongs to as a one-hot vector.

Wiki-embeddings: we map each concept identifier to its corresponding Wikipedia title and Wikidata identifier using the Wikidata Query Service (https://query.wikidata.org). We then obtain a 100-dimensional vector for each Wikipedia title from a pre-trained Wikipedia embedding (Yamada et al., 2020; http://wikipedia2vec.s3.amazonaws.com/models/it/2018-04-20/itwiki_20180420_100d.pkl.bz2). Similarly, we use the Wikidata embedding (Lerer et al., 2019; https://dl.fbaipublicfiles.com/torchbiggraph/wikidata_translation_v1.tsv.gz) to represent the Wikidata identifiers as 200-dimensional vectors.

Italian-BERT features: we use a pre-trained uncased Italian BERT (base model) provided by the MDZ Digital Library team (dbmdz), trained on 13GB of text mainly from Wikipedia and other text sources (https://huggingface.co/dbmdz/bert-base-italian-uncased). With this model, we obtain the 768-dimensional vector representation of a sequence from the [CLS] token, as in the original implementation of BERT (Devlin et al., 2019). The sequence consists of the concatenation of the concept and its Wikipedia description.

3.3 Systems

Considering the proposed features and our experimental results in Section 5, we propose the following three systems to address both the in-domain and cross-domain scenarios. For the in-domain scenario we train on the combination of all training samples across domains. Likewise, for each cross-domain experiment we combine the remaining three domains (i.e., excluding samples from the target domain).

Complex: a fully handcrafted machine learning system. It uses all the complexity-based and concept-to-concept features (except the domain vector in the cross-domain scenario), and we normalize the features using Z-score normalization. The classifier is a tree ensemble (random forest) with the default parameters of Breiman (2001) (https://cran.r-project.org/web/packages/randomForest/index.html); other classifiers were tested and obtained lower performance. This system participated in the structured resource setting because the concept view frequency feature is structured information.

Complex+wd: an improved version of the Complex system that additionally concatenates the Wikidata embedding of each concept in the pair to the feature set. This system also participated in the structured resource setting. We decided not to include the Wikipedia embeddings based on the ablation analysis presented in Table 4.

Italian-BERT: a single-layer neural network mapping the 768 [CLS] features to an output space of dimension 2, treating the task as sequence pair classification. In addition, the pre-trained weights of the base model are fine-tuned on the training data. We fine-tune the base model using the Hugging Face transformers library (version 3.1) for PyTorch (Wolf et al., 2019). In the in-domain scenario, we use the following training parameters: 10 epochs, learning rate 5e-5, weight decay 0.01, batch size 32, 100 warm-up steps, and the AdamW optimizer with a linear schedule after the warm-up period. We find that the model exhibits high variance across runs in our cross-domain experiments. Hence, in addition to the parameter settings of the in-domain experiments, we choose the number of training steps using a validation set for the unseen target domain. Accordingly, we set the maximum number of training steps to 400, and the warm-up steps to 100, 200, 150, and 200 for the data mining, geometry, physics, and precalculus cross-domain scenarios, respectively.
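As a rough illustration of the Italian-BERT system, the sketch below fine-tunes dbmdz/bert-base-italian-uncased as a sequence pair classifier with the in-domain hyperparameters listed above. It is our own sketch, not the authors' released code: it uses the current Hugging Face Trainer API rather than the exact version 3.1 code, and the `pairs`/`labels` placeholders stand in for the actual PRELEARN data preparation.

```python
# Illustrative fine-tuning sketch: Italian BERT as a sequence pair classifier
# over (concept A, concept B) sequences, each a concept title plus description.
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "dbmdz/bert-base-italian-uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Hypothetical placeholder data: real inputs would come from the PRELEARN files.
pairs = [("concetto A. descrizione ...", "concetto B. descrizione ...")]
labels = [1]  # 1 = prerequisite relation holds, 0 = it does not

encodings = tokenizer([a for a, _ in pairs], [b for _, b in pairs],
                      truncation=True, padding=True)

class PairDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# In-domain hyperparameters reported in Section 3.3; Trainer's default optimizer
# is AdamW with a linear schedule after warm-up.
args = TrainingArguments(
    output_dir="prelearn-bert",
    num_train_epochs=10,
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=32,
    warmup_steps=100,
)

Trainer(model=model, args=args, train_dataset=PairDataset(encodings, labels)).train()
```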
4 Results

Table 2 shows the per-domain results of our systems, indicating the scenario and the kind of resources they use. We observe the clear superiority of Italian-BERT, which relies only on raw-text resources. This suggests that simply fine-tuning BERT is enough to acquire a notion of prerequisite relations between concepts. Still, the systems based on handcrafted features and non-contextual embeddings show competitive results, with performance good enough to rank first in the structured resource setting, while being faster, more interpretable, and simpler than the Italian-BERT counterpart.

Scenario      Resources   System        DM     Geo    Phy    Prec   AVG
in-domain     raw-text    Italian-BERT  0.838  0.925  0.855  0.930  0.887
in-domain     structured  Complex+wd    0.808  0.905  0.795  0.915  0.856
in-domain     structured  Complex       0.828  0.895  0.785  0.885  0.848
cross-domain  raw-text    Italian-BERT  0.565  0.785  0.635  0.775  0.690
cross-domain  structured  Complex+wd    0.535  0.775  0.600  0.760  0.668
cross-domain  structured  Complex       0.494  0.735  0.595  0.730  0.639

Table 2: Test set results (F1 score) for the four PRELEARN subtasks.

The results also show a large performance reduction in the cross-domain scenario. The largest drop is in the 'data mining' domain. Given that we train our models on the combination of examples from all other domains, the most probable cause is domain mismatch. Yet, the reductions on the test sets are smaller than what we observe in our k-fold experiments and on the validation sets.

In addition, Table 3 shows the performance advantage we obtained over the next best participant, based on the ranking released by the organizers. One can see that the larger advantage is in the structured resource setting. This suggests that the concept view frequency and Wikidata embedding features are effective.

Settings     In-domain   Cross-domain
raw-text     +2.1%       +4.2%
structured   +15.6%      +4.8%

Table 3: Performance advantage over the next best participant, on average across domains.

5 Discussion: ablation analysis

While building our systems we performed several experiments over the candidate features. We used 10-fold cross-validation for the in-domain experiments, except for Italian-BERT, for which, due to its high computational requirements, we used a stratified 30% split as validation set. Table 4 shows the experimental results on the training (validation) data for both the in-domain and cross-domain scenarios. The 'Resources' column indicates the type of resources used by each feature set.

Scenario      Resources   Feature set                          DM     Geo    Phy    Prec   AVG
in-domain     raw         complexity                           0.646  0.817  0.622  0.792  0.720
in-domain     raw         wp embedding                         0.705  0.818  0.670  0.827  0.755
in-domain     raw         Italian-BERT                         0.947  0.746  0.829  0.842  0.841
in-domain     structured  complexity +page view                0.648  0.805  0.629  0.804  0.721
in-domain     structured  wd embedding                         0.660  0.814  0.674  0.838  0.746
in-domain     structured  wd+wp embedding                      0.694  0.824  0.672  0.831  0.755
in-domain     structured  complexity +page view +wd embedding  0.697  0.823  0.686  0.845  0.763
cross-domain  raw         complexity                           0.072  0.592  0.258  0.586  0.377
cross-domain  raw         wp embedding                         0.000  0.622  0.079  0.344  0.261
cross-domain  raw         Italian-BERT                         0.145  0.646  0.460  0.570  0.455
cross-domain  structured  complexity +page view                0.107  0.588  0.297  0.577  0.392
cross-domain  structured  wd embedding                         0.000  0.661  0.355  0.608  0.406
cross-domain  structured  wd+wp embedding                      0.000  0.660  0.332  0.605  0.399
cross-domain  structured  complexity +page view +wd embedding  0.064  0.645  0.366  0.630  0.426

Table 4: Ablation analysis results (F1 score; validation set for Italian-BERT, 10-fold cross-validation for the others).
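For the feature-based systems, the in-domain ablation protocol above amounts to a standard cross-validation loop. The sketch below is ours, not the released code: scikit-learn's StandardScaler and RandomForestClassifier stand in for the Z-score normalization and tree-ensemble learner of Section 3.3, and the feature matrix X and label vector y are placeholders for features built as in Section 3.2.

```python
# Illustrative 10-fold evaluation of a feature-based system (X and y are
# placeholders; scikit-learn stands in for the original setup).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(200, 8)                 # placeholder feature matrix
y = rng.randint(0, 2, size=200)      # placeholder 0/1 prerequisite labels

# Z-score normalization followed by a tree-ensemble classifier with default parameters.
clf = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

scores = cross_val_score(clf, X, y,
                         cv=StratifiedKFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="f1")
print(f"10-fold F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```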
We observe that the 'data mining' domain is particularly difficult in the cross-domain scenario: models based on the non-contextual embedding features obtain F1 scores of zero. We suspect that this difficulty is due to the domain mismatch.

Based on these results, we selected Italian-BERT for the raw-text setting, and the 'complexity +page view' feature set augmented with the Wikidata embeddings ('wd embedding') for the structured resource setting as our submissions.

6 Conclusion

We tackle the task of prerequisite relation learning with a variety of systems that explore three sets of features: handcrafted features based on complexity intuitions, embedding models from Wikipedia and Wikidata, and contextual embeddings from the Italian-BERT model. We examine the capabilities of our models in in-domain and cross-domain scenarios. Our models ranked first in all the subtasks of the PRELEARN competition at EVALITA 2020. We found that although our Italian-BERT model outperformed the others, the simpler models show competitive results.

We plan to further examine the impact of using a combination of all possible domains as training set on the performance of our models.

Acknowledgments

The authors thank CONACYT for the computer resources provided through the INAOE Supercomputing Laboratory's Deep Learning Platform for Language Technologies.

References

Giovanni Adorni, Chiara Alzetta, Frosina Koceva, Samuele Passalacqua, and Ilaria Torre. 2019. Towards the identification of propaedeutic relations in textbooks. In International Conference on Artificial Intelligence in Education, pages 1–13. Springer.

Chiara Alzetta, Alessio Miaschi, Felice Dell'Orletta, Frosina Koceva, and Ilaria Torre. 2020. PRELEARN @ EVALITA 2020: Overview of the prerequisite relation learning task for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Segun Taofeek Aroyehun, Jason Angel, Daniel Alejandro Pérez Alvarez, and Alexander Gelbukh. 2018. Complex word identification: Convolutional neural network vs. feature engineering. In Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 322–327.

Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro. 2020. EVALITA 2020: Overview of the 7th evaluation campaign of natural language processing and speech tools for Italian. In Valerio Basile, Danilo Croce, Maria Di Maro, and Lucia C. Passaro, editors, Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR.org.

Leo Breiman. 2001. Random forests. Machine Learning, 45(1):5–32.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Adam Lerer, Ledell Wu, Jiajun Shen, Timothee Lacroix, Luca Wehrstedt, Abhijit Bose, and Alex Peysakhovich. 2019. PyTorch-BigGraph: A large-scale graph embedding system. In Proceedings of the 2nd SysML Conference, Palo Alto, CA, USA.
Chen Liang, Jianbo Ye, Han Zhao, Bart Pursel, and C. Lee Giles. 2018. Active learning of strict partial orders: A case study on concept prerequisite relations. arXiv preprint arXiv:1801.06481.

Alessio Miaschi, Chiara Alzetta, Franco Alberto Cardillo, and Felice Dell'Orletta. 2019. Linguistically-driven strategy for concept prerequisites learning on Italian. In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, pages 285–295.

Maria Montefinese, David Vinson, Gabriella Vigliocco, and Ettore Ambrosini. 2019. Italian age of acquisition norms for a large set of words (ItAoA). Frontiers in Psychology, 10:278.

Partha Talukdar and William Cohen. 2012. Crowdsourced comprehension: Predicting prerequisite structure in Wikipedia. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 307–315.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2019. HuggingFace's Transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.

Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. arXiv preprint arXiv:1812.06280v3.