Multi-label Classification using BERT and Knowledge Graphs with a Limited Training Dataset

Malick Ebiele1,*,†, Lucy McKenna1,†, Malika Bendechache1,† and Rob Brennan2,†
1 ADAPT Centre, School of Computing, Dublin City University, Dublin, Ireland
2 ADAPT Centre, School of Computer Science, University College Dublin, Dublin, Ireland

Abstract
This paper provides a new approach combining BERT and Knowledge Graphs (KGs) to solve a multi-label classification problem with limited training data. The paper introduces a method of using taxonomies and a dataset with 518 entries and 340 concepts to fine-tune BERT. It also introduces a new data augmentation technique called Perfect Binary Tree (PBT)-Flow to deal with limited or imbalanced training data. The proposed approach obtained a recall@10 of 61.12%, a precision@10 of 11.86% and an F1score@10 of 18.83%. While these results seem low, they are promising given the simple architecture of the model used (BERT+2xFC), the limited size of the training data, and the large number of output concepts.

Keywords
Multi-label classification, BERT, Knowledge graphs, Data augmentation

1. Introduction
Multi-label classification is the task of assigning one or more concepts to an object or text [1]. This is a challenging task, especially with limited training data and a large number of output concepts. Fortunately, the complexity of the task can be reduced by using a KG of the concepts. By leveraging the hierarchy defined in the concepts’ ontology, one can considerably simplify the task at hand with little loss in semantics, provided the ontology is complete, well formed, semantically consistent, and of high quality. BERT [2] is a machine learning framework for natural language processing (NLP) that can be applied to multi-label classification. GAN-BERT [3] extends BERT by using Semi-Supervised Generative Adversarial Networks in the fine-tuning stage.
This paper introduces a new approach combining BERT and KGs to classify textual data, and a new data augmentation technique called Perfect Binary Tree-Flow (PBT-Flow). This paper investigates the research question: “To what extent can KGs, BERT and the PBT-Flow data augmentation technique improve the precision, recall and F1 score of multi-label classification using a limited training dataset?” In order to explore this research question, the ARK-Virus Project [4] was selected as a use case. The ARK-Virus Project is an extension of the ARK Platform [5] for risk management of personal protective equipment in healthcare settings during the COVID-19 pandemic. This is discussed in more detail in the Use Case and Requirements Section below (Section 3).
This paper has two main contributions. First, a method which uses KGs to simplify multi-label classification by reducing the number of output concepts. Second, the presentation of the PBT-Flow data augmentation technique for dealing with limited or unbalanced training datasets.
The remainder of this paper is structured as follows: Section 2 presents the related work. Section 3 describes the use case and requirements. Section 4 depicts the design of the proposed approach. Section 5 details the experimental settings.

Woodstock’22: Symposium on the irreproducible science, June 07–11, 2022, Woodstock, NY
* Corresponding author.
† These authors contributed equally.
$ malick.ebiele@adaptcentre.ie (M. Ebiele); lucy.mckenna@adaptcentre.ie (L. McKenna); malika.bendechache@adaptcentre.ie (M. Bendechache); rob.brennan@adaptcentre.ie (R. Brennan)
0000-0001-5019-6839 (M. Ebiele); 0000-0002-6035-7656 (L. McKenna); 0000-0003-0069-1860 (M. Bendechache); 0000-0001-8236-362X (R. Brennan)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073
Section 6 provides an evaluation and Section 7 presents the conclusion.

2. Related Work
Different mechanisms have been used to solve multi-label classification problems. Rios and Kavuluru [6] proposed a model combining a Convolutional Neural Network (CNN) and a 2-Layer Graph CNN (GCNN) to perform experiments on MIMIC II and MIMIC III. Heo et al. [7], on the other hand, proposed D2SBERT, a sequence of n BERT+Multilayer Perceptrons (MLPs) with an attention layer in between, for medical discharge summary code prediction. Finally, Khezrian et al. [8] introduced TagBERT (BERT+CNN+MLP) to produce tag recommendations for online Q&A communities and performed experiments on the Freecode dataset. None of these previous works leveraged the hierarchy of the label space or performed optimised data augmentation. This work differs from them in two ways: first, by leveraging the hierarchy of the label space using a KG; second, by introducing and using an optimised data augmentation technique.

3. Use Case and Requirements
The ARK-Virus Project uses the ARK Platform, a socio-technical risk governance system, to manage and analyse risk projects in the area of infection prevention and control. Data entered on the ARK Platform can be annotated with concepts from controlled SKOS1 taxonomies - the ARK Risk Terminology and the ARK Health Terminology2. These taxonomies contain a combined total of 525 concepts plus definitions. The taxonomies use a three-layer hierarchy with the top level having a total of 141 concepts. It is also worth mentioning that these taxonomies have been built, used, and validated by domain experts over the past couple of years. The annotation of text on the ARK Platform is currently a manual process which, given the large number of concepts, can be time-consuming. Providing a set of suggested concepts, based on text entered into the ARK Platform, would be extremely useful to users.
This paper demonstrates how KGs and BERT can be used together to solve this multi-label classification problem.

1 Simple Knowledge Organization System - https://www.w3.org/TR/skos-reference/
2 Taxonomies and platform demo available at https://openark.adaptcentre.ie/

The approach presented in this paper can be applied to other use cases. The only requirement that needs to be met is a hierarchical label space. However, the ontology of the label space should be well formed, semantically consistent, and of high quality [9], with a reasonable number of concepts in the top layer. The KG’s structure can negatively affect the model performance, for example, if a broad domain is modelled with a narrow taxonomy. This will lead to a loss of semantics which will negatively impact the performance of the proposed approach. Future work will assess how much this loss impacts the performance of the proposed approach. In this paper, the top-level labels have been used for one main reason: the labels’ taxonomies only have a three-layer hierarchy, which makes the loss of semantics acceptable compared to a much more complex multi-label classification task given the low resources. For a deeper taxonomy, one could use the second layer, the second-to-last layer, or any other layer. The idea is to leverage the taxonomies’ hierarchy to simplify the problem at hand.

4. Design
Figure 1: Knowledge Graph (KG) Enhanced BERT Training Process for a Limited Training Dataset.
Figure 1 outlines the proposed approach. First, the input data - tabular data extracted from the ARK Platform, where each entry is a collection of sentences annotated with concepts defined in the ARK taxonomies - is cleaned and enriched. The enrichment process consists of replacing the original concepts with the top-level concepts from the taxonomy hierarchy. In this case, this process reduced the total number of concepts from 340 to 116.
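The enrichment step can be sketched as follows. This is a minimal illustration assuming the taxonomy's skos:broader links are available as a simple child-to-parent mapping; the `broader` dictionary and concept names here are hypothetical examples, not the actual ARK taxonomies:

```python
def top_level_concept(concept, broader):
    """Walk skos:broader links up to the top of the taxonomy.

    `broader` maps each concept to its parent; top-level concepts
    are absent from the mapping (they have no broader concept).
    """
    while concept in broader:
        concept = broader[concept]
    return concept

def enrich(labels, broader):
    """Replace each annotated concept with its top-level ancestor,
    de-duplicating while preserving order."""
    seen = []
    for concept in labels:
        top = top_level_concept(concept, broader)
        if top not in seen:
            seen.append(top)
    return seen

# Illustrative three-layer taxonomy (hypothetical, not the real ARK concepts)
broader = {"FFP2 mask": "Respiratory PPE",
           "Respiratory PPE": "PPE",
           "Hand sanitiser": "Hand hygiene"}
print(enrich(["FFP2 mask", "Hand sanitiser"], broader))
# → ['PPE', 'Hand hygiene']
```

Because distinct concepts can share a top-level ancestor, the de-duplication is what shrinks the label space (here, from 340 to 116 concepts).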
Second, the new data of 518 entries is then split into 50%, 25% and 25% for the training, test, and validation sets, respectively (see Appendix A for more details). Third, the training set is augmented using the PBT-Flow technique. Fourth, the augmented data is fed into the model (BERT+2xFC). Fifth and finally, the model outputs the class probabilities. The performance is then measured using the precision@k, recall@k and F1score@k metrics with k={5,10}.

5. Experimental Settings
Given the limited training data, data augmentation was used to acquire more data. Data augmentation is a technique for increasing the diversity of training data without explicitly collecting new data [10]. The PBT-Flow3 data augmentation technique consists of applying a set of data augmentation techniques to the input data following the Perfect Binary Tree (PBT) structure. The nodes of the tree represent datasets and the edges indicate whether or not an augmentation technique was applied to the data. Each node has two children - one child is the result of applying an augmentation to the node (data) and the other child is a copy of the parent (no augmentation applied). The same augmentation technique is applied to every node at the same depth, and each augmentation technique is applied once. The output of PBT-Flow is the concatenation of the leaves of the PBT. PBT-Flow therefore generates a new dataset of n × 2^m entries from input data of n entries and a set of m data augmentation techniques. PBT-Flow used five augmentation techniques: two Synonym Replacements [11] (k words, with k between 1 and 10, from the original text are replaced with their respective synonyms using the PPDB and WordNet vocabulary databases), Back Translation [12] (the original English text is translated to German and then translated back), Random Swap [13] (randomly swap k words of the original text), and Contextual Augmentation [14] (k words in the original text are replaced with other words with paradigmatic relations).
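The tree construction described above can be sketched in a few lines. This is a minimal sketch with toy string transformations standing in for the real augmentation techniques (synonym replacement, back translation, etc.), and it omits the subsequent removal of duplicates and NAs:

```python
def pbt_flow(data, augmentations):
    """Perfect Binary Tree-Flow: each tree depth applies one augmentation
    technique; every node gets two children - an augmented copy and an
    unchanged copy. The output concatenates the 2**m leaves, giving
    n * 2**m entries for n input entries and m techniques."""
    level = [data]                                         # nodes at current depth
    for augment in augmentations:
        next_level = []
        for node in level:
            next_level.append([augment(x) for x in node])  # augmented child
            next_level.append(list(node))                  # untouched copy
        level = next_level
    return [entry for leaf in level for entry in leaf]     # concatenate leaves

# Toy stand-ins for the real augmentation techniques
augs = [str.upper, lambda s: s[::-1]]
out = pbt_flow(["a b", "c d"], augs)
print(len(out))  # → 8, i.e. n * 2**m = 2 * 2**2
```

With the paper's m = 5 techniques and n = 259 training entries this yields 259 × 2^5 = 8288 entries before cleaning, consistent with the 5232 entries reported below after duplicates and NAs are removed.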
Applying the PBT-Flow technique to the original training data of 259 entries resulted in training data of 5232 entries (after removing duplicated entries and NAs). For GAN-BERT [3], entries with fewer than 4 concepts were treated as unlabelled, which resulted in sets of 809 labelled and 4423 unlabelled training entries.

6. Evaluation
Two approaches have been used to fine-tune BERT: a supervised approach and a semi-supervised approach based on GAN-BERT [3]. The Generator and the Discriminator of GAN-BERT are defined as discussed in [3]. However, the soft-max function has been replaced by the sigmoid function, and the cross-entropy loss by the binary cross-entropy loss. The classifier of the supervised learning model is the same as the Discriminator minus one neuron in the output layer (the output layer of the Discriminator has 116 + 1 neurons; the extra neuron outputs the probability of the input text being fake or real). The data was augmented using the five augmentation techniques mentioned above. BERT ran for 34 epochs and GAN-BERT for 11 epochs. Early stopping monitoring the validation loss, with patience and min_delta equal to 5 and 5e-05 respectively, was used.
From Table 1 it can be seen that the GAN-BERT+PBT-Flow model outperformed the other models by margins ranging from 0.24% (Recall@10) up to 11.89% (Recall@5). In general, models using PBT-Flow outperformed the others. These results corroborate the experimental results presented in [3], which showed that GAN-BERT outperformed BERT when both models are fine-tuned using very limited labelled data. One can notice that the margin of improvement in GAN-BERT is greater than in BERT. This could be due to the filtering of inputs labelled with fewer than 4 concepts for this model.

7. Conclusion
This paper demonstrates that combining KGs and PBT-Flow improves BERT models’ performance for multi-label classification, in both supervised and semi-supervised approaches.
3 Source code available at https://github.com/malick-jaures/research/tree/main/PBT-flow

Table 1
Experimental Results. (*NoAug - no augmentation applied)

Models                 Precision@k     Recall@k        F1score@k
                       k=5     k=10    k=5     k=10    k=5     k=10
BERT (NoAug)           15.34   10.07   39.37   51.13   20.80   16.04
BERT + PBT-Flow        18.29   11.62   46.25   58.74   24.66   18.48
GAN-BERT (NoAug)       15.50   10.07   39.14   50.56   20.92   15.95
GAN-BERT + PBT-Flow    19.53   11.86   51.03   61.12   26.34   18.83

These results are interesting: combined with a probability threshold, the model could suggest good concepts. While the results cannot be directly compared to the state-of-the-art models, they are similar to previously published works, especially in terms of recall@10 [8, 6]. TagBERT [8] is the state of the art in terms of Precision@10 and F1score@10 on the Freecode dataset, with 40.25% and 46.5%, respectively. On the other hand, the same model obtained a Recall@10 of 64.42% while TagCNN [15] achieved 94.9%. In future work, we envisage running experiments testing our approach on public benchmarks alongside TagBERT, TagCNN and other models. We also intend to retrain our model with data extracted from the ARK Platform as soon as more data is available on the platform.

Acknowledgments
This research has received funding from the ADAPT Centre for Digital Content Technology, funded under the SFI Research Centres Programme (Grant 13/RC/2106 P2), co-funded by the European Regional Development Fund. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission. We would also like to express our gratitude to Dr. Brian Davis for his advice and support.

References
[1] G. Tsoumakas, I. M. Katakis, Multi-label classification: An overview, Int. J. Data Warehous. Min. 3 (2007) 1–13.
[2] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.
org/abs/1810.04805.
[3] D. Croce, G. Castellucci, R. Basili, GAN-BERT: Generative adversarial learning for robust text classification with a bunch of labeled examples, 2020.
[4] L. McKenna, J. Liang, N. Duda, N. McDonald, R. Brennan, ARK-Virus: An ARK platform extension for mindful risk governance of personal protective equipment use in healthcare, 2021.
[5] N. McDonald, L. McKenna, R. Vining, B. Doyle, J. Liang, M. E. Ward, P. Ulfvengren, U. Geary, J. Guilfoyle, A. Shuhaiber, J. Hernandez, M. Fogarty, U. Healy, C. Tallon, R. Brennan, Evaluation of an Access-Risk-Knowledge (ARK) platform for governance of risk and change in complex socio-technical systems, International Journal of Environmental Research and Public Health 18 (2021).
[6] A. Rios, R. Kavuluru, Few-shot and zero-shot multi-label learning for structured label spaces, 2018.
[7] T. Heo, Y. Yoo, Y. Park, B. Jo, Medical code prediction from discharge summary: Document to sequence BERT using sequence attention, CoRR abs/2106.07932 (2021). URL: https://arxiv.org/abs/2106.07932.
[8] N. Khezrian, J. Habibi, I. Annamoradnejad, Tag recommendation for online Q&A communities based on BERT pre-training technique, CoRR abs/2010.04971 (2020). URL: https://arxiv.org/abs/2010.04971.
[9] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, S. Auer, Quality assessment for linked data: A survey (2016). doi:10.3233/SW-150175.
[10] S. Y. Feng, V. Gangal, J. Wei, S. Chandar, S. Vosoughi, T. Mitamura, E. H. Hovy, A survey of data augmentation approaches for NLP, CoRR abs/2105.03075 (2021). URL: https://arxiv.org/abs/2105.03075.
[11] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, volume 28, Curran Associates, Inc., 2015. URL: https://proceedings.neurips.cc/paper/2015/file/250cf8b51c773f3f8dc8b4be867a9a02-Paper.pdf.
[12] R. Sennrich, B. Haddow, A. Birch, Improving neural machine translation models with monolingual data, CoRR abs/1511.06709 (2015).
URL: http://arxiv.org/abs/1511.06709.
[13] J. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, 2019.
[14] S. Kobayashi, Contextual augmentation: Data augmentation by words with paradigmatic relations, 2018.
[15] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, 2014.

Appendix
A. Statistics of our dataset
Table 2 below gives the statistics of the number of concepts per entry of the training, test, and validation sets used in the experiments above.

Table 2
Statistics of the number of concepts per entry of the training, test, and validation sets

Sets         count   mean   std   min   25%   50%   75%   max
Training     259     1.9    1.4   1     1     1     2     10
Test         129     1.9    1.3   1     1     1     3     7
Validation   130     1.9    1.3   1     1     1     2     8

B. Our dataset compared to the Freecode dataset
Our dataset has 518 entries with 116 unique labels, while the Freecode4 dataset has 46995 entries with 9000 unique labels. This means that the Freecode dataset contains about 77.6 times more labels but also 90.7 times more entries than ours. In other words, Freecode has a ratio of 5.22 entries per label while our dataset has a ratio of 4.46 entries per label. Moreover, the top 3 most frequent sets of labels in Freecode have 1390, 711, and 571 entries, respectively. In our dataset, the top 3 most frequent sets of labels have 60, 19, and 13 entries, respectively. The least frequent sets of labels in both datasets have only 1 entry. Table 3 below gives the statistics of the number of concepts/tags per entry of our dataset and the Freecode dataset.

Table 3
Statistics of the number of concepts/tags per entry of our and Freecode datasets

Datasets   count   mean   std    min   25%   50%   75%   max
Ours       518     1.93   1.35   1     1     1     2     10
Freecode   46995   3.55   2.48   1     2     3     5     38

Table 4 below gives the statistics of the number of words per entry of our dataset and the Freecode dataset.
Table 4
Statistics of the number of words per entry of our and Freecode datasets

Datasets   count   mean    std     min   25%   50%   75%   max
Ours       518     31.93   35.79   1     12    21    38    328
Freecode   46995   50.3    26.9    1     31    45    66    330

4 Available at https://www.kaggle.com/datasets/navidkhezrian/freecode