Transformer-based Model for Text Classification in Ukrainian

Larysa Katerynych, Maksym Veres and Eduard Safarov
Taras Shevchenko National University of Kyiv, Academician Glushkov Avenue 4d, Kyiv, 03680, Ukraine
Information Technology and Implementation (IT&I-2021), December 01-03, 2021, Kyiv, Ukraine
EMAIL: katerynych@gmail.com (L. Katerynych); veres@ukr.net (M. Veres); edward.saf99@gmail.com (E. Safarov)
ORCID: 0000-0001-7837-764X (L. Katerynych); 0000-0002-8512-5560 (M. Veres); 0000-0001-9651-4679 (E. Safarov)

Abstract
The purpose of this paper is to find a solution for printed text classification in Ukrainian and to choose the means for its implementation. The paper considers the problem of identifying short texts by their scientific topic. A model built for classification is described, and the current state of NLP and transfer learning is reviewed. The effectiveness of the implemented methods, which rely on concepts such as transfer learning, NLP and BERT, is demonstrated in practice and yields good text classification results. The model is built using the Python programming language and several of its machine learning libraries. A multilingual BERT model by Google is additionally trained on Ukrainian texts from school subjects. After that, questions from the external independent evaluation are submitted as input, and the model classifies them.

Keywords
Text classification, deep learning, recurrent neural networks, long short-term memory, convolutional neural networks, natural language processing, bidirectional encoder representations from transformers, local interpretable model-agnostic explanations.

1. Introduction

Searching for scientific papers and articles related to a particular field of knowledge can be difficult for students, scientists, researchers, etc. Thematic classification of works facilitates the search process. Given a list of subjects in a research field, one needs to find out which subject area a particular study is most related to. However, manually categorizing a large collection of resources is a time-consuming process. The usual strategy is to search based on keywords and certain terms in the title of the article or even in the whole text. However, processing the entire text of an article takes a long time. In this case, neural network (NN) text classification can be applied.

With deep learning evolving, NN architectures such as recurrent NNs (RNN), long short-term memory (LSTM) and convolutional NNs (CNN) have demonstrated good performance in natural language processing (NLP) tasks such as text classification, machine translation and more. However, the effectiveness of deep learning models in NLP depends on large sets of labeled text data. Most labeled datasets are not sufficient for deep NN training, because these networks have many parameters, and training them on small datasets leads to overfitting. Another reason that held NLP progress back was the lack of transfer learning. It was not widely used in NLP until 2018, when Google introduced BERT, a transformer-based model. Since then, transfer learning in NLP has helped to solve complex text processing problems.

2. Transfer learning in NLP

The classic machine learning approach is depicted in Figure 1. As can be seen, each separate task demands training a separate NN with its own model and data. If a new problem arises for a NN to solve, it can be difficult to build an effective system for this purpose. Transfer learning is depicted in Figure 2. It is a technique in which a deep learning model trained on a large dataset is used to perform similar tasks on another dataset [1]. Such a deep learning model is known as a pre-trained model [2]. Most tasks in NLP, such as text classification, machine translation, etc., are sequence modeling tasks.
Classic machine learning models and NNs cannot capture the sequential information present in text. Therefore, RNNs have been used, as these architectures can model sequential information.

Figure 1: Classic machine learning
Figure 2: Transfer learning

However, these recurrent NNs have their drawbacks. One of the main problems is that an RNN cannot be parallelized (as opposed to a linear NN [3]), because it accepts one input at a time. In the case of a text sequence, an RNN or LSTM accepts one token at a time as input. Therefore, training such a model on a large dataset takes a long time. As mentioned above, in 2018 Google introduced BERT, a transformer-based model that gave a significant impulse to NLP systems [4]. Soon, a wide range of transformer-based models was proposed for various NLP tasks. There are many advantages to using transformer-based models, but the most important are the following [2]:
1. These models do not process the input sequence token by token. They take the entire sequence at once, which is a significant improvement over RNN-based models, as the computation can now be accelerated by a graphics processing unit (GPU).
2. Labeled data is not required to prepare these models; only a huge amount of unlabeled text data is needed. The resulting pre-trained model can then be used for other NLP tasks, which include text classification; named-entity recognition (NER), for example, of people, geographical and company names; text generation; etc.
Bidirectional encoder representations from transformers (BERT) and the second version of the generative pre-trained transformer (GPT-2) are the most popular transformer-based NLP models. For example, it is possible to use a previously trained BERT model to classify Ukrainian text.

3. Model fine-tuning

BERT is a large NN architecture with a number of parameters that ranges from about 110 million (BERT-base) to 340 million (BERT-large). Therefore, training a BERT model from scratch on a small dataset would lead to overfitting. It is better to use a pre-trained BERT model as a starting point; such models are usually trained on big datasets. There are several variants of BERT, all of which are listed on the official page [4]. A universal multilingual BERT covering 104 languages was chosen. It uses a vocabulary of whole words as well as the most common subword pieces (WordPiece tokens). A part of the vocabulary of the multilingual version of BERT can be seen in Figure 3. It shows which tokens the NN uses as input. Some are whole words, for example, кілька. But most Ukrainian words are broken down into subword pieces, so the word прийшов will be split into прий and ##шов.

Figure 3: Some words from the BERT vocabulary
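This splitting can be reproduced directly with the WordPiece tokenizer of the multilingual model. The short sketch below uses the Hugging Face transformers library, which is not part of the toolchain applied later in this paper; the checkpoint name bert-base-multilingual-cased corresponds to the multilingual model listed in [4].

# Illustration only (not part of the paper's pipeline): inspect how the
# multilingual BERT WordPiece vocabulary splits Ukrainian words.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("кілька"))     # kept as a whole word, per the vocabulary excerpt in Figure 3
print(tokenizer.tokenize("прийшов"))    # split into subword pieces, e.g. ['прий', '##шов']
print(len(tokenizer))                   # total vocabulary size (roughly 119 000 entries)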
The training of the model can be continued on another, relatively smaller dataset. This process is known as fine-tuning of the model [5]. Fine-tuning strategies depend on various factors, but the most important are the size of the new dataset and its similarity to the original one. Given that a typical NN for NLP is more universal in its early layers and becomes more closely tied to a specific dataset in subsequent layers, four main scenarios can be identified:
1. The new dataset is smaller and similar in content to the original dataset. If the amount of data is small, it makes no sense to fine-tune the whole NN because of overfitting. Since the data is similar to the original, it can be assumed that the distinguishing features learned by the NN will be relevant for this dataset as well. Therefore, the optimal solution is to train a linear classifier on top of the features produced by the pre-trained NN.
2. The new dataset is relatively large and similar in content to the original dataset. Since there is more data, overfitting does not occur if the entire NN is fine-tuned.
3. The new dataset is smaller and significantly different in content from the original dataset. Since the amount of data is small, a linear classifier alone will be sufficient. Since the data is significantly different, it is better not to train the classifier on features from the top of the NN, which are more specific to the original data; instead, it is better to train the classifier on activations from earlier layers of the NN.
4. The new dataset is relatively large and differs significantly in content from the original dataset. Since the dataset is very large, it is possible to train the entire NN from scratch. Nevertheless, in practice it is often still more advantageous to initialize the weights from a pre-trained model; in this case, there is enough data to fine-tune the entire NN.

4. BERT architecture

First, BERT is based on the transformer architecture, as stated above. Second, BERT is pre-trained on a large set of unlabeled text, including the entire Wikipedia (2.5 bln words) and the BooksCorpus (800 mln words). This process took 4 days on 16 tensor processing units (TPUs). The pre-training step is half the success of BERT: when a model trains on a large body of text, it begins to gain a deeper understanding of how the given language "works". Third, BERT is a deep bidirectional model. Bidirectionality means that during training BERT learns information from both the left and the right side of a token's context. A similar method of bidirectional processing is also used by the embeddings from language models (ELMo) system developed by the Allen Institute for Artificial Intelligence. However, BERT captures a more complex relationship between the layers of language representation, so it is considered deeply bidirectional, whereas ELMo is shallowly bidirectional. For comparison, a visualization of the different NN architectures is displayed in Figure 4; it includes an example of a unidirectional model, OpenAI GPT.

Figure 4: BERT, OpenAI GPT and ELMo, respectively
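The practical effect of this bidirectional pre-training can be illustrated by asking the pre-trained multilingual model to restore a hidden token from the words on both sides of it. The sketch below again relies on the Hugging Face transformers library, which is not part of the toolchain used in this paper, and the example sentence is only illustrative.

# Illustration only: the pre-trained multilingual BERT predicts a masked token
# from both its left and right context and returns the most probable candidates.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")
for candidate in fill_mask("Київ є столицею [MASK].")[:3]:
    print(candidate["token_str"], round(candidate["score"], 3))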
5. ktrain library

The ktrain library [6] for the Python programming language can be used for BERT fine-tuning. It is a wrapper for the Keras framework that helps to build, train and deploy NN models with a minimal amount of code. ktrain provides means for [7]:
• a learning rate finder, which helps to choose an initial learning rate for the model;
• visual learning rate plots to help improve training performance;
• pre-trained models for text data (e.g., text classification, NER), images (e.g., image classification) and graphs (e.g., link prediction);
• methods for loading and pre-processing text and images in various formats;
• inspection of misclassified examples to improve the model;
• an application programming interface (API) for saving and deploying models and pre-processing data.

6. Problem statement

As an example of a practical solution to the problem of Ukrainian text classification, a BERT model can be taken and taught to classify the scientific topics of given texts. These texts cover the following 7 school subjects:
• history of Ukraine,
• physics,
• geography,
• biology,
• mathematics (algebra and geometry),
• Ukrainian (language and literature),
• chemistry.
The task is to create a system that automatically determines which subject a question relates to. The dataset used for this case consists of electronic textbooks on the specified subjects. To test the model, questions from the tests offered to school graduates at the 2021 external independent evaluation (also known as "ЗНО") were used. A sample can be seen in Figure 5. It should also be noted that only 11th (final) grade textbooks are included in the dataset, while the external evaluation questions cover several years of study and relate to a wider range of knowledge. Google Colab was used for development.

6.1. Download and process input data

The ktrain library is needed to get started:

!pip install ktrain
import ktrain
from ktrain import text

Publicly available textbooks for remote studying [9] are used as input data. The textbooks were converted to TXT format and divided into smaller files. ktrain automatically detects the natural language and character encoding, processes the data, and sets up the model:

(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    '/content/drive/MyDrive/dataset/',
    maxlen=75,
    max_features=10000,
    preprocess_mode='bert',
    train_test_names=['train', 'test'],
    val_pct=0.1,
    classes=['history', 'physics', 'math', 'geography', 'biology', 'Ukrainian', 'chemistry'])

Figure 5: A page from the external independent evaluation test on Ukrainian language and literature

The first argument is the path to the dataset folder. The maxlen argument specifies the maximum number of words kept from each file (BERT supports up to 512 tokens, but it is better to use fewer to reduce memory usage and increase speed), with extra words being cut off; maxlen=75 is used because the input text files are small. The text must be pre-processed for use with BERT, which is achieved by setting preprocess_mode to 'bert'; the BERT model and vocabulary will then be downloaded automatically, if necessary. val_pct=0.1 means that 10% of the data is automatically set aside for validation. Finally, the texts_from_folder function expects the following directory structure:

dataset/
    train/
        history/  physics/  math/  geography/  biology/  Ukrainian/  chemistry/
    test/
        history/  physics/  math/  geography/  biology/  Ukrainian/  chemistry/

So, a dataset folder with the corresponding content for the 7 classes 'history', 'physics', 'math', 'geography', 'biology', 'Ukrainian', 'chemistry' is created. The result may look like this:

detected encoding: UTF-8-SIG
downloading pretrained BERT model (multi_cased_L-12_H-768_A-12.zip)...
extracting pretrained BERT model...
done.
cleanup downloaded zip...
done.
preprocessing train...
language: uk
done.
Is Multi-Label? False
preprocessing test...
language: uk
done.
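For completeness, the train/test folder layout shown above could have been prepared with a small helper script of the following kind. This sketch is not from the paper: the source folder name, its per-subject subfolders and the 90/10 train/test split are assumptions made only for illustration.

# Hypothetical helper (not from the paper): distribute pre-split TXT fragments of the
# textbooks into the train/test layout expected by texts_from_folder.
import os, random, shutil

SOURCE = '/content/drive/MyDrive/textbooks'   # assumed: one subfolder per subject with *.txt fragments
TARGET = '/content/drive/MyDrive/dataset'
CLASSES = ['history', 'physics', 'math', 'geography', 'biology', 'Ukrainian', 'chemistry']

random.seed(0)
for cls in CLASSES:
    files = sorted(os.listdir(os.path.join(SOURCE, cls)))
    random.shuffle(files)
    split = int(0.9 * len(files))             # assumed 90% train / 10% test split
    for part, names in (('train', files[:split]), ('test', files[split:])):
        out_dir = os.path.join(TARGET, part, cls)
        os.makedirs(out_dir, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(SOURCE, cls, name), out_dir)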
6.2. Use a BERT learner object for content in Ukrainian

Creating a model and wrapping it with a learner:

model = text.text_classifier('bert', (x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model,
                             train_data=(x_train, y_train),
                             val_data=(x_test, y_test),
                             batch_size=32)

The first argument of the get_learner function is the model built by text_classifier: a pre-trained BERT with a randomly initialized final dense layer. The second and third arguments are the training and validation data, respectively. The last argument of get_learner is the batch size; a relatively small batch_size=32 is used based on Google's recommendations.

6.3. Model training

To train the model, a learning rate that suits the problem being solved is found first. ktrain offers an effective method called lr_find, which trains the model with different learning rates and plots the model loss as the learning rate increases. The loss graph is displayed in Figure 6.

learner.lr_find()
learner.lr_plot()

The graph of the loss function shows that the classifier attains minimum loss when the learning rate is between 1e-5 and 1e-4. Model training:

learner.fit_onecycle(2e-5, 1)

The fit_onecycle function from the ktrain library is used. It implements the 1cycle learning rate policy, which linearly increases the learning rate during the first half of training and then decreases it during the second half [8]. Result:

begin training using onecycle policy with max lr of 2e-05...
542/542 [==============================] - 710s 1s/step - loss: 0.5202 - accuracy: 0.8294 - val_loss: 0.2498 - val_accuracy: 0.9182

Validation:

learner.validate(val_data=(x_test, y_test))

Result:

              precision    recall  f1-score   support
           0       0.91      0.89      0.90       370
           1       0.87      0.93      0.90       313
           2       0.90      0.87      0.89       348
           3       0.85      0.77      0.81       201
           4       0.93      0.95      0.94       792
           5       0.98      0.97      0.97       541
           6       0.89      0.91      0.90       222
    accuracy                           0.92      2787
   macro avg       0.90      0.90      0.90      2787
weighted avg       0.92      0.92      0.92      2787

Figure 6: Loss graph

As can be seen from the validation results, per-class precision reaches 85-98%, with an overall accuracy of 92%, after one epoch. The examples with the largest classification errors can be inspected:

learner.view_top_losses(n=10, preproc=preproc)

Creating a predictor:

p = ktrain.get_predictor(learner.model, preproc)

It is better to save the model for later use:

p.save('./drive/MyDrive/predictor')

After loading the predictor back, questions from the external independent evaluation of 2021 are offered to the model:

fin_bert_model = ktrain.load_predictor('./drive/MyDrive/predictor')

p.predict("Вільгельм фон Габсбург-Лотрінген – австрійський архікнязь, полковник армії УНР, поет. Під яким псевдонімом він відомий як полковник Українських січових стрільців (УСС)?")
history

p.predict("Усю воду із широкої посудини перелили у високу вузьку порожню посудину. Якими стануть сила тиску й тиск води на дно вузької посудини після цього порівняно із силою тиску и тиском цієї води на дно широкої посудини? Уважайте, що посудини мають циліндричну форму")
physics

p.predict("Установіть відповідність між графіком (1 — 3) функції, визначеної на проміжку [— 4; 4], та її властивістю (А — Д)")
math

p.predict("Укажіть материк, на якому лежить Україна")
geography

p.predict("Під час експерименту декілька яєць морських їжаків помістили в морську воду, де їх запліднили. У цю воду добавили мічений Тритієм (3Н) тимідиловий нуклеотид (рис. 1), який поглинали клітини ембріонів")
biology
p.predict("Зображене в уривку - Христос Воскресе! – І розвіявся морок. Упали кайдани з невольницьких рук. - Спішіть! Байдаки у відкритому морі! Поблизу нема ні галер, ні фелюк! суголосне з подіями твору")
Ukrainian

p.predict("Алмаз і графіт — прості речовини")
chemistry

fin_bert_model.predict("Однакові кульки, підвішені на нитках, заряджені так, як це показано на рисунках. У якому з випадків правильно зображено положення цих кульок, зумовлене їхньою взаємодією?")
physics

fin_bert_model.predict("Яка владна інституція звернулася із цитованою відозвою до населення України?")
history

fin_bert_model.predict("Установіть відповідність між виразом (1 — 3) і твердженням про його значення (А — Д), яке є правильним, якщо a = -2")
math

fin_bert_model.predict("З будь-якої точки Світового океану можна дістатися в будь-яку іншу, не перетнувши суходіл. Це доводить, що Світовий океан —")
geography

fin_bert_model.predict("Уключення міченої сполуки в молекули клітин ембріона відбувається під час")
biology

fin_bert_model.predict("Однаковий звук позначають букви, підкреслені в окремих словах речення")
Ukrainian

fin_bert_model.predict("Укажіть формулу вуглекислого газу")
chemistry

Taking a closer look at how the model reaches its conclusions for certain inputs:

fin_bert_model.explain("Мічена сполука в клітинах ембріонів потрапляє в молекули")

Result:

y=biology (probability 0.989, score 5.285) top features
Contribution  Feature
+5.804        Highlighted in text (sum)
-0.519        мічена сполука в клітинах ембріонів потрапляє в молекули

This visualization is generated using a technique called local interpretable model-agnostic explanations (LIME) [10]. It helps to understand the relative importance of different words for the final prediction by means of an interpretable linear model. "Green" words contribute to the correct classification (biology), while "red" words reduce the probability of a correct prediction. The shade of the color reflects the strength (the size of the coefficients) in the final linear model.

7. Other approaches

There are a number of solutions for the classification of English texts, and many of them provide algorithms for building text classifiers. It is hard, however, to find works describing algorithms for constructing a classifier of Ukrainian text. The main problem is not the algorithm itself, but the lack of resources for experiments to train the classifier. A researcher can:
• use the Brown Corpus of the Ukrainian Language (BCUL),
• use specific dictionaries,
• create their own dataset.
Deep learning tends to require large and robust datasets in order to perform well. One attempt at Ukrainian text classification is described in [11], whose authors consider random forest, support vector machine (SVM), naive Bayes and logistic regression classifiers on the BCUL dataset. The best result is shown by the SVM model, with an average accuracy of 80%. Another approach proposes a solution that detects sentiment in Ukrainian text (namely hotel reviews) and classifies it as positive or negative [12]. Besides, it analyzes the reviews of a given hotel and summarizes its most important positive and negative properties. The dataset, which contains user reviews in Ukrainian, was parsed from Booking.com and TripAdvisor. The following techniques were considered: fastText (created by the Facebook Artificial Intelligence Research lab), Seq2Seq, CNN, RNN and the recurrent convolutional neural network (RCNN). The RCNN model gives the best accuracy on the available dataset: 85% for text classification and 86% for sentence classification.
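For contrast with the transformer-based approach described above, a classical pipeline of the kind compared in [11] can be assembled in a few lines of scikit-learn. The sketch below is only an illustration of such a baseline, not the exact setup from [11]; the tiny in-line training set is a stand-in for the textbook fragments used elsewhere in this paper.

# Illustrative classical baseline (not from the paper): TF-IDF features + a linear SVM.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy data only; a real experiment would load the same train/ folders as the BERT model.
texts = ["Укажіть формулу вуглекислого газу",
         "Укажіть материк, на якому лежить Україна",
         "Алмаз і графіт — прості речовини",
         "З будь-якої точки Світового океану можна дістатися в будь-яку іншу"]
labels = ["chemistry", "geography", "chemistry", "geography"]

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
baseline.fit(texts, labels)
print(baseline.predict(["Укажіть формулу метану"]))   # likely ['chemistry'] given the word overlap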
8. Conclusions

BERT and ktrain were used to solve the problem of classifying documents in Ukrainian. A complex model like BERT can be applied, using ktrain, to this kind of problem in any of the 104 most popular languages in the world, including Ukrainian, for which a pre-trained BERT exists. A per-class precision of 85-98% and an overall accuracy of 92% were achieved in one training epoch.

Although the effectiveness of BERT was demonstrated, it is relatively slow both in training and in making predictions on new data. Therefore, if training lasts more than one epoch, it may be better to omit the val_data argument from get_learner and check the accuracy only after training. This can be done in ktrain using the learner.validate method, as shown in the code samples above. BERT can also be quite demanding on memory. If errors indicating that the GPU memory limit has been exceeded are encountered, the value of either the maxlen or the batch_size parameter can be reduced.

If the model performs well after training, it should be saved for future classifications. When using BERT, Keras's built-in load_model function does not work, although model.save_weights and model.load_weights can still be used to save and load the weights. To load the whole model, the learner.load_model function from ktrain should be used instead.

Finally, the proposed BERT solution for Ukrainian text classification was compared to other approaches to this problem. As can be seen, the described model performs well, and its accuracy is sufficiently high.

9. References

[1] I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, The MIT Press, 2016.
[2] E. Olivas, J. Guerrero, M. Sober, J. Benedito, A. Lopez, Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods and Techniques, IGI Publishing, 2009.
[3] L. Katerynych, M. Veres, E. Safarov, Neural Networks' Learning Process Acceleration, in: Proceedings of the 12th International Scientific and Practical Conference of Programming, UkrPROG'2020, Problems in Programming Scientific Journal, Kyiv, 2020, pp. 313–321. doi:10.15407/pp2020.02-03.313.
[4] J. Devlin, S. Petrov, Multilingual BERT Models, 2019. URL: https://github.com/google-research/bert/blob/master/multilingual.md.
[5] J. Devlin, M. Chang, Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing, 2018. URL: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html.
[6] A. Maiya, ktrain: A Low-Code Library for Augmented Machine Learning, 2020. URL: https://arxiv.org/pdf/2004.10703.pdf.
[7] S. Ravichandiran, Getting Started with Google BERT: Build and Train State-of-the-Art Natural Language Processing Models Using BERT, Packt Publishing Ltd, 2021.
[8] L. Smith, A Disciplined Approach to Neural Network Hyper-parameters: Part 1, 2018. URL: https://arxiv.org/pdf/1803.09820.pdf.
[9] Electronic Versions of Textbooks, Institute for Modernization of Educational Content, 2021. URL: https://lib.imzo.gov.ua/yelektronn-vers-pdruchnikv.
[10] M. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: 22nd International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pp. 1135–1144.
[11] K. Bobrovnyk, K. Dukhnovska, M. Piroh, Thematic Classification of Ukrainian Texts, Difficulties of its Introductions, in: Control Systems and Computers, Kyiv, Ukraine. doi:10.15407/usim.2019.01.041.
[12] D. Babenko, Determining Sentiment and Important Properties of Ukrainian-Language User Reviews, Master's thesis, Ukrainian Catholic University, Lviv, Ukraine, 2020.