Feasibility of Improving BERT for Linguistic Prediction on Ukrainian Corpus

Hanna Livinska1 [0000-0001-9676-7932] and Oleksandr Makarevych2 [0000-0002-2084-5209]

1, 2 Taras Shevchenko National University of Kyiv, Volodymyrska str. 64, Kyiv, 01601, Ukraine
2 oleksandrmakarevych@knu.ua

Abstract. What makes BERT (Bidirectional Encoder Representations from Transformers) different from other recently published language models is that it supports numerous languages, including Ukrainian. The first purpose of this research is to examine how well the published BERT model is actually trained, taking into account that Ukrainian is a low-resource language. The second is to create a hand-picked dataset, further train the published model on it, and compare the results of the two models. Training in this research is based on texts written in Ukrainian, including fairy tales, novels and stories for children. This dataset was chosen mainly because children's stories use a comparatively small vocabulary, given their audience, and usually follow similar narrative patterns. Our model is trained on the two tasks from the original paper: masked token prediction and next sentence classification. The resulting model shows a clear improvement for Ukrainian compared to the original version.

Keywords: NLP, BERT, transformer, attention, next sentence prediction, machine learning.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

In recent years, Data Science has been a buzzword in the Computer Science industry. It has been attracting both theoretical scientists and practicing developers with different backgrounds and has been called one of the best jobs of the 21st century by the Harvard Business Review. This renewed interest can be attributed to the fact that today we have more computational power and storage, which opens the door to developing new, more powerful machine learning models.

Among other reasons, Data Science attracts so many researchers because it requires not only classical programming skills but also a strong knowledge of calculus, probability and statistics, a desire to experiment with data, and creative thinking in order to push the industry further and find insights in data that have not been seen before.

Today, one of the most prominent fields within Computer Science is Natural Language Processing (NLP), as it allows us to analyze natural languages and solve problems we could not solve before. It has been shown that language model pre-training is an effective tool for many natural language processing tasks ([1], [2], [3], [4]). In particular, the development of the attention mechanism and the transformer architecture has allowed new models to reach state-of-the-art performance in text generation and understanding. This research focuses on pre-training the BERT model (Bidirectional Encoder Representations from Transformers), which utilizes both the attention mechanism and the transformer, to see whether its results can be improved when working with a low-resource language. BERT was originally developed and published in 2018 ([5], [6]) and is considered one of the best models in the industry.
In 2012, the deep neural network submitted to the ImageNet Large Scale Visual Recognition Challenge by Alex Krizhevsky and Ilya Sutskever [7] demonstrated that deep learning was a viable strategy for machine learning and thus led to increased interest in deep learning and machine learning research. The success of the AlexNet model was due to the fact that the lower layers of the model learned low-level features such as edges, while the higher layers focused on higher-level concepts such as patterns and entire parts of objects. A key property of an ImageNet-like dataset is thus that it encourages a model to learn features that are likely to generalize to new tasks in the problem domain. Previous studies [7] revealed that state-of-the-art models for tasks such as reading comprehension and natural language inference did not in fact possess deep natural language understanding but rather picked up on cues to perform specialized pattern matching. The attention mechanism is used to address this problem and give the model the ability to understand linguistic context better [5].

2 Architecture

In a nutshell, the BERT model is a stack of transformer encoders (twelve of them in the base configuration). The stack processes the input sequence and, for a masked position, outputs probabilities over the model's vocabulary. The input to the encoder consists of three main parts:

• Token embeddings
• Segment embeddings
• Position embeddings

Token embeddings are the actual embeddings of the words in the sequence. Each sequence starts with a special [CLS] token that denotes its beginning, and the two sentences are separated by a special [SEP] token. Segment embeddings denote whether a token belongs to the first or the second sentence in the sequence. Position embeddings represent the numerical position of the token in the overall sequence. These three parts are summed to form the input, which is then processed by the attention mechanism to produce state-of-the-art results.

To understand the attention mechanism better, let us first look at the problem with the sequence-to-sequence model.

A sequence-to-sequence model is a model that takes a sequence of items and outputs another sequence of items; a typical application is translation, for example from English to Ukrainian. These models are composed of two parts: an encoder and a decoder. The encoder processes each item in the input sequence and produces a vector called the context. The decoder, in turn, takes the context vector as input and produces the output sequence item by item.

It is important to mention that both the encoder and the decoder are usually recurrent neural networks (RNNs). At each time step, the encoder or the decoder updates its hidden state based on its current input and the inputs it has seen previously. The last hidden state of the encoder becomes the context vector that the decoder uses. The most obvious drawback of this method is that the longer the input sequence, the more information is lost by the time the context vector is produced: while the encoder processes the input sequence word by word, at each time step it produces a hidden state that summarizes all the previous hidden states. Another problem with this approach is that, while producing the output sequence, the decoder cannot focus on the parts of the input sequence that are relevant for the current output.

Fig. 1. Standard encoder-decoder architecture.

In this scenario we can clearly see how attention can help. Rather than passing only its final hidden state, the encoder passes to the decoder all the hidden states produced while processing the input sequence. As a consequence, the decoder does an extra step before producing its output. Since each encoder hidden state is mostly associated with a certain word, a score is assigned to each hidden state to show how important it is for producing the output at the current decoder time step. Each hidden state is then multiplied by its softmaxed score, which amplifies hidden states with high scores and drowns out hidden states with low scores. Finally, a new context vector is obtained by summing the weighted hidden-state vectors. The decoder hidden state and the context vector are then passed through a feed-forward neural network to obtain the output word (the raw decoder output is discarded, but the decoder RNN produces a new hidden state). This procedure is repeated at each time step. As a result, the attention mechanism can dramatically improve the performance of sequence-to-sequence models.
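To make the three steps above concrete (scoring, softmax weighting, summation), the following is a minimal sketch of a single attention step in PyTorch. The toy tensors, the dimensions and the simple dot-product scoring are illustrative assumptions: in a real model the scores are typically produced by a small learned network, and the hidden states come from the encoder and decoder RNNs rather than from random initialization.

import torch
import torch.nn.functional as F

# Toy setting: a 4-token input sequence, hidden size 8.
torch.manual_seed(0)
encoder_states = torch.randn(4, 8)   # one hidden state per input token
decoder_state = torch.randn(8)       # decoder hidden state at the current time step

# 1. Score each encoder hidden state against the decoder state
#    (plain dot-product scoring, used here only for simplicity).
scores = encoder_states @ decoder_state            # shape: (4,)

# 2. Softmax the scores so they sum to one: high scores are amplified,
#    low scores are drowned out.
weights = F.softmax(scores, dim=0)                 # shape: (4,)

# 3. The new context vector is the weighted sum of the encoder hidden states.
context = (weights.unsqueeze(1) * encoder_states).sum(dim=0)   # shape: (8,)

# The decoder state and the context vector are concatenated and passed through
# a feed-forward layer to produce the output word (the layer here is untrained
# and serves only to show the shapes involved).
output = torch.tanh(torch.nn.Linear(16, 8)(torch.cat([decoder_state, context])))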
The transformer model uses the above-mentioned attention mechanism to produce state-of-the-art results. It was first introduced in [8] in 2017. The architecture consists of six encoders and six decoders stacked together. The encoders are all identical in structure and have two components: a self-attention layer followed by a feed-forward neural network. The encoder's inputs first flow through the self-attention layer, which helps the encoder look at other words in the input sequence as it encodes a specific word. The decoder has both of these layers, but between them lies an attention layer that helps the decoder focus on relevant parts of the input sentence.

As the model processes each position in the input sequence, self-attention allows it to look at other positions for clues that can lead to a better encoding of a particular word. It is similar to how hidden states are used in RNNs to connect previously processed words with the current word. Essentially, self-attention is the method used to incorporate the understanding of other relevant words into the one currently being processed. For example, while processing the word 'they' in the sentence "Kate and Mark didn't listen to their parents and rushed downstairs to open their presents. They were really excited", the tokens 'Kate' and 'Mark' contribute significantly more to the encoding of 'they' than the other words do.

Fig. 2. Example of applying self-attention.

3 BERT for Ukrainian Corpus

Now let us move on to the most interesting part of this research project and see how all of the above is used in a real-world application.

There have been multiple models based on the transformer with some degree of alteration. Two of the most well-known are GPT (Generative Pre-trained Transformer) by OpenAI ([3], [9]) and BERT by Google [5]. GPT is a language model developed by OpenAI, a company co-founded by Elon Musk. It can produce texts that are so well written that they cannot be distinguished from text written by a human; this model, however, only supports English. The other model is BERT, which stands for Bidirectional Encoder Representations from Transformers.
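The published multilingual checkpoint that serves as the baseline in this work can be loaded with the PyTorch-Transformers library described later in Section 4. The following is a minimal sketch, assuming the bert-base-multilingual-cased checkpoint and an example sentence from our corpus; it only shows how the tokenizer and the raw encoder are obtained and applied.

import torch
from pytorch_transformers import BertTokenizer, BertModel

# Load the published multilingual checkpoint and its WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained('bert-base-multilingual-cased')
model.eval()

# A Ukrainian sentence wrapped in the special [CLS]/[SEP] tokens from Section 2.
text = '[CLS] Жили собі дід і баба . [SEP]'
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    sequence_output, pooled_output = model(input_ids)[:2]

print(tokens)                  # WordPiece pieces; many Ukrainian words are split
print(sequence_output.shape)   # (1, number of tokens, 768)

The rest of this section looks at how this published checkpoint was pre-trained and how well it actually handles Ukrainian text.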
BERT is a method of pre-training language representations, meaning that a general understanding of the language is trained on a large text corpus and the resulting model is then used for downstream NLP tasks. BERT is conceptually simple and empirically powerful. It was built with contextual representation in mind, so the word 'bank' has different representations in 'bank deposit' and 'river bank'. The code and pre-trained models are available at [6].

The model is trained on two tasks: the masked language model and next sentence prediction. In the masked language model task, a word in the input sequence is masked and the model tries to infer the word that fits the blank best. The model looks at both the right and the left context surrounding the missing word to make a suitable inference. For example, in the sentence "the little ___ played with a red car that his father bought him for his birthday", the model places the word "boy" in the blank. The next sentence prediction task, in turn, tries to understand the relation between two sentences and computes the probability that one sentence is a continuation of the other. For example, if we take "It was really cold outside" as the first sentence, the model gives a 93% probability to "That's why they stayed at home and watched Netflix" and a 15% probability to "The Earth is the third planet from the Sun" as its continuation.

BERT is particularly interesting compared to other state-of-the-art models because it was published in English and multilingual versions. Both were trained on a corpus derived from Wikipedia. However, since Ukrainian is a low-resource language, the Ukrainian Wikipedia does not reflect the nature of the language well, which raises a fair question about the performance of the model and how easy it would be to improve upon the existing pre-training. Our initial guess was that, although BERT claims to be multilingual, it would not perform well on low-resource languages like Ukrainian. As shown later, this assumption proved to be true.

One of the biggest challenges faced in this project was finding a suitable dataset. Since Ukrainian corpora are not widespread, it was necessary to create one. Because generating a general-purpose corpus would take an unreasonable amount of time, it was decided to focus on a specific part of the language, namely children's literature, including fairy tales, novels and short stories. This segment of Ukrainian is a good candidate to train the model on, since a child's vocabulary is smaller and less developed than that of an adult, and children's literature usually follows similar narrative patterns, which allows the model to better capture the underlying meaning and produce better results. At the same time, literature for children still covers a significant part of the language, so the model can benefit substantially from training on it.

For training the model, we hand-picked 742 texts ranging in size from short novels to chapters from Ukrainian classics. Each text was preprocessed, cleaned and split into sentences for later use. Moreover, each sentence was analyzed for the specific delimiters it used: all delimiters were standardized, and similar ones were replaced by one universal representation.
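The paper does not list the exact normalization rules, so the sketch below only illustrates what such a cleaning and sentence-splitting step might look like; the particular replacements and the naive splitter are assumptions made for illustration.

import re

# Hypothetical cleaning rules in the spirit of the preprocessing described above.
QUOTE_PATTERN = re.compile(r'[«»„“”]')       # various quotation marks -> "
DASH_PATTERN = re.compile(r'[‒–—]')          # various dashes -> -
ELLIPSIS_PATTERN = re.compile(r'\.{2,}|…')   # ellipses -> a single full stop
SPACE_PATTERN = re.compile(r'\s+')

def normalize_delimiters(text: str) -> str:
    """Map similar delimiters to one universal representation."""
    text = QUOTE_PATTERN.sub('"', text)
    text = DASH_PATTERN.sub('-', text)
    text = ELLIPSIS_PATTERN.sub('.', text)
    return SPACE_PATTERN.sub(' ', text).strip()

def split_into_sentences(text: str) -> list:
    """A naive splitter on terminal punctuation, applied after normalization."""
    parts = re.split(r'(?<=[.!?])\s+', normalize_delimiters(text))
    return [p for p in parts if p]

print(split_into_sentences('Жили собі дід і баба… І була в них курочка ряба.'))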
Standardizing the delimiters was an important step, as it allowed the model to avoid delimiter-specific bias caused by the statistical occurrence of these delimiters in specific contexts. In other words, it allowed the model to focus more on the actual meaning of the sentences rather than picking up on patterns involving delimiter use. All in all, more than 93,000 Ukrainian sentences were prepared for the training phase.

The project focuses on altering stories based on the masked language model. The input for this part consists of stories for children. Each story is broken down into sentences, and in each sentence one word is picked at random and replaced by the [MASK] token. It is important to note that in reality a compound word would be masked by multiple [MASK] tokens, but for the sake of this project every word, regardless of its size, was masked by a single [MASK] token. The idea is to see what the original BERT places in the position of the masked token and to compare the results with the further-trained version to see which one understands Ukrainian better.

As suggested by the authors of the original paper [5], 2-4 epochs is the optimal amount of training for BERT. Each of the three suggested values was tested with different learning rates; however, only one combination of hyperparameters turned out to be effective. Specifically, training the model for 4 epochs with a learning rate of 0.00005 caused the loss to decrease from 5.45397 to 1.66947, while the other combinations of hyperparameters inevitably caused the loss to increase. This decrease can be seen as a good improvement considering the size of the dataset and the number of weights to be adjusted in the model. Training was run on Google Colaboratory with one free GPU and took 8 hours for 4 epochs.

Below are the sentences with masked words and the predictions of the original and the further-trained models. Every sentence is given first in Ukrainian and then in its English translation. Again, in the original implementation a word might be masked by multiple tokens, but for the sake of our project any word, regardless of size, was masked by exactly one [MASK] token to see how the two models would perform.

The inference produces some interesting results that need to be discussed. First of all, unfortunately, the original BERT does a poor job on masked words. For example, for the sentence '[CLS] Раз прийшов лис до [MASK] в гості , та й тхір його гарно погостив. [SEP]' ('[CLS] Once upon a time, the fox came to visit [MASK], so the ferret served him well. [SEP]'), the original BERT outputs '[CLS] Раз прийшов лис доu в гості, та й тхір його гарно погостив. [SEP]' ('[CLS] Once upon a time, the fox came toU visit, so the ferret served him well. [SEP]'), and for '[CLS] Той чоловік пригонить бички додому та й [MASK]: [SEP]' ('[CLS] That man yarded the bulls home and [MASK]: [SEP]') the output is '[CLS] Той чоловік пригонить бички додому та йо: [SEP]' ('[CLS] That man yarded the bulls home andO: [SEP]'). As we can see, the original BERT not only fails to predict the correct part of speech, but in the first sentence even fails to produce a prediction in Ukrainian. This is clear evidence that the corpus used to train multilingual BERT for Ukrainian does not generalize the language well enough, which can be attributed to the fact that Ukrainian is a low-resource language and it is hard to create a good corpus for it. It is also possible that, because multilingual BERT was trained on numerous languages, it would take a long time to gather a good corpus for each of them. This gives researchers from different countries an opportunity to improve BERT, as it is easier for them to gather a corpus specific to their region.
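For reference, a masked-token prediction of this kind can be obtained with the BertForMaskedLM class from the PyTorch-Transformers library. The sketch below is a simplification: MODEL_PATH would be either the published multilingual checkpoint or the directory containing our further-trained weights, and only the single most probable token is decoded.

import torch
from pytorch_transformers import BertTokenizer, BertForMaskedLM

# The same procedure is applied to the original and the further-trained model;
# only the weights that are loaded differ.
MODEL_PATH = 'bert-base-multilingual-cased'   # or the directory with fine-tuned weights

tokenizer = BertTokenizer.from_pretrained(MODEL_PATH)
model = BertForMaskedLM.from_pretrained(MODEL_PATH)
model.eval()

text = '[CLS] Жили собі [MASK] і баба . [SEP]'
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
masked_index = tokens.index('[MASK]')

with torch.no_grad():
    predictions = model(input_ids)[0]   # (1, sequence length, vocabulary size)

predicted_id = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_id])[0]
print(predicted_token)   # the single token the model places instead of [MASK]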
The further-trained BERT, on the contrary, gives promising results. For the same sentences, '[CLS] Раз прийшов лис до [MASK] в гості, та й тхір його гарно погостив. [SEP]' ('[CLS] Once upon a time, the fox came to visit [MASK], so the ferret served him well. [SEP]') and '[CLS] Той чоловік пригонить бички додому та й [MASK]: [SEP]' ('[CLS] That man yarded the bulls home and [MASK]: [SEP]'), the trained model outputs '[CLS] Раз прийшов лис до нього в гості, та й тхір його гарно погостив. [SEP]' ('[CLS] Once upon a time, the fox came to visit him, so the ferret served him well. [SEP]') and '[CLS] Той чоловік пригонить бички додому та й каже: [SEP]' ('[CLS] That man yarded the bulls home and said: [SEP]').

These are remarkable results, considering that the original word was masked with only one [MASK] token. As we can see, our model not only identifies the correct part of speech but also outputs a pronoun of the right gender (in Ukrainian). This is because BERT uses the attention mechanism to infer context from the surrounding words and, as a result, understands the context better and can make quite good predictions.

The results clearly suggest that even with a small corpus of just over 700 documents, BERT was able to learn the underlying meaning of sentences and make better masked-language-model predictions. For example, in the sentence '[CLS] Жили собі [MASK] і баба.[SEP]' ('[CLS] There were a [MASK] and a grandmother[SEP]') our model places the word 'мати' ('mother') in the blank, while the original model tries to put a comma. It should be noted that here our model puts the wrong word ('mother' instead of 'grandfather'), but it is the correct part of speech and a theoretically possible continuation. Even the trained BERT, however, sometimes fails to understand the context: the model may place punctuation marks such as ",", "." or "–" in place of the actual word. This result is not surprising, since the corpus the model was trained on contains many dialogues with those particular punctuation marks.

4 Possible Applications

Models based on Natural Language Processing are widely used for different applications in the modern world. Some of the most common use cases are:

• Named Entity Recognition
• Sentiment Analysis
• Topic Modeling
• Fake News Detection
• Machine Translation
• Question Answering
• Natural Language Generation
• Information Extraction

and many others. The model described above can be used for several of these purposes. For Named Entity Recognition, it can be used to identify categories such as names, organizations, locations, time expressions, monetary values, percentages, etc. Moreover, the model can be used in language classes: it can be incorporated into educational software that helps young children learn the language. The result of this research can also be used for question answering, one of the most common applications of language models, in which the model has to understand the factual information in a given text and, based on it, find the answer to a question. However, for this application the training of the model would have to be slightly changed, as shown in the sketch below.
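As an illustration of how such a downstream task builds on the same pre-trained weights, here is a minimal sketch of attaching a question-answering head with the PyTorch-Transformers classes listed in the next section. The question, the passage and the segment handling are illustrative assumptions; the span-prediction head is freshly initialized, so it must be fine-tuned on a question-answering dataset before its output is meaningful.

import torch
from pytorch_transformers import BertTokenizer, BertForQuestionAnswering

# Attach a span-prediction head on top of the multilingual encoder.
# The head itself is randomly initialized; this snippet only shows the
# input/output plumbing, not a trained question-answering system.
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertForQuestionAnswering.from_pretrained('bert-base-multilingual-cased')
model.eval()

question = '[CLS] Хто прийшов в гості ? [SEP]'
context = 'Раз прийшов лис до нього в гості , та й тхір його гарно погостив . [SEP]'
tokens = tokenizer.tokenize(question) + tokenizer.tokenize(context)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

# Segment embeddings: 0 for the question, 1 for the context passage.
sep_index = tokens.index('[SEP]')
token_type_ids = torch.tensor([[0] * (sep_index + 1) +
                               [1] * (len(tokens) - sep_index - 1)])

with torch.no_grad():
    start_logits, end_logits = model(input_ids, token_type_ids=token_type_ids)[:2]

start = torch.argmax(start_logits[0])   # predicted start of the answer span
end = torch.argmax(end_logits[0])       # predicted end of the answer span
# With an untrained head these indices are essentially random; after
# fine-tuning they mark the answer span inside the passage.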
Considering that BERT has no alternatives on the Ukrainian market, it can become quite a powerful tool in numerous government-related applications: machine learning and especially natural language processing models for Ukrainian can modernize many day-to-day tasks. In particular, similar models can be adapted for government purposes, providing potential ways to help solve vital problems. For instance, NLP models are widely used across the globe to identify terrorists and prevent their attacks. If trained correctly, such a model can recognize the underlying intent of a message written on Instagram, Twitter and especially Facebook, and the police can adopt it to scan the internet for potential threats and eliminate them before they cause any damage. The model could also become quite useful for the forecasts made by the National Bank of Ukraine. It is well established that raw numbers do not necessarily tell the whole story: the nation's mood and its reaction to different monetary policies can provide very valuable input that can noticeably improve forecasts of inflation, GDP and the UAH exchange rate.

The PyTorch-Transformers library can be used for training multiple language models, including BERT, for different purposes. The library supports the following models:

• BertModel – the raw, fully pre-trained transformer;
• BertForMaskedLM – the BERT transformer with a pre-trained masked language modelling head on top;
• BertForNextSentencePrediction – the BERT transformer with a pre-trained next sentence prediction classifier on top;
• BertForPreTraining – the BERT transformer with both the masked language modelling head and the next sentence prediction classifier on top;
• BertForSequenceClassification – the BERT transformer with a sequence classification head on top (the head is only initialized and has to be trained);
• BertForTokenClassification – the BERT transformer with a token classification head on top (the head is only initialized and has to be trained);
• BertForQuestionAnswering – the BERT transformer for answering questions about a text.

As can be seen from the list, the model is offered in multiple pre-trained configurations for different NLP tasks. The model can also potentially be used for text summarization, which is quite useful nowadays: with the self-attention mechanism, more contextual information can be extracted and higher scores can be assigned to the parts of the text that carry the most important information.

5 Conclusions

Even though the model shows a clear improvement compared to the original version, it definitely leaves room for improvement and further analysis. One possible way to improve the model would be to increase the dataset size, for example by adding not only novels for children but also novels for adults. Another option would be to experiment with different topics to see which of them the model learns better. In addition, all punctuation could be removed from the texts at the cleaning stage, since the model often produced a full stop when the masked word was close to the end of the sentence.

Moreover, a goal of this research is to attract the research community to the Ukrainian language in NLP-specific applications. As of now, there is no well-established benchmark for estimating the performance of Ukrainian language models.
With this paper, the authors hope to take the first step towards including Ukrainian in modern NLP research. The authors intend to continue this work and establish a more robust and reliable approach to estimating the performance of Ukrainian language models.

In conclusion, this research project shows that even with few resources, the performance of a model that was trained on Wikipedia text can be improved and later used for other downstream tasks such as classification, question answering and reading comprehension. It is also important to mention that, although the model was trained on Ukrainian, the same approach could be used to train similar models for other Slavic languages. This is not surprising, as the languages of this group share quite similar grammatical and semantic structures, and nearly all European countries have their own folklore, with features of both the modern language and the older one.

References

1. Dai, A.M., Le, Q.V.: Semi-supervised sequence learning. In: Advances in Neural Information Processing Systems (NIPS) 28, pp. 3079–3087 (2015).
2. Peters, M., Neumann, M., Zettlemoyer, L., Yih, W.: Dissecting contextual word embeddings: Architecture and representation. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1499–1509 (2018).
3. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving language understanding with unsupervised learning. Technical report, OpenAI (2018).
4. Howard, J., Ruder, S.: Universal language model fine-tuning for text classification. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 328–339, Melbourne, Australia (2018).
5. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186, Minneapolis, Minnesota (2019).
6. Google-research/BERT, https://github.com/google-research/bert.
7. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. In: Advances in Neural Information Processing Systems (NIPS) 25, pp. 1097–1105 (2012).
8. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention Is All You Need. In: Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA (2017).
9. Better Language Models and Their Implications, https://openai.com/blog/better-language-models/ (2019).