Context-Based Question-Answering System
                   for the Ukrainian Language

                         Serhii Tiutiunnyk and Vsevolod Dyomkin

                              Ukrainian Catholic University
                        Faculty of Applied Sciences, Lviv, Ukraine
                  tiutiunnyk@ucu.edu.ua, vseloved@gmail.com


        Abstract. We introduce a context-based question answering model for the
        Ukrainian language based on Wikipedia articles using Bidirectional Encoder
        Representations from Transformers (BERT) [1] model which takes a context
        (Wikipedia article) and a question to the context. The result of the model is an
        answer to the question. The model consists of two parts. The first one is a pre-
        trained multilingual BERT model which are trained on the top-100 the most pop-
        ular languages on Wikipedia articles. The second part is the fine-tuned model,
        which is trained on the data set of questions and answers to the Wikipedia articles.
        The training and validation data is Stanford Question Answering Dataset
        (SQuAD) [2].There is no any question answering datasets for the Ukrainian lan-
        guage. The plan is to build an appropriate dataset with machine translate and use
        it for the fine-tuning training stage and compare the result with models which
        were fine-tuned on the other languages. The next experiment is to train a model
        on the Slavic languages dataset before fine-tuning on the Ukrainian language and
        compare the results.


        Keywords: Context-based Question Answering · Bidirectional Encoder · Rep-
        resentations from Transformers · multilingual BERT · fine-tuning · Generative
        Pre-trained Transformer · Stanford Question Answering Dataset


1       Introduction and Motivation

1.1     Problem Importance
Nowadays, it becomes more challenging to stay in the context of an expert area without
handling huge volumes of data. Textual information grows exponentially together with
video, audio, photo, and other types of data. Therefore, a model, which answers a ques-
tion, is significant. It can be used for building chat-bots, automatic quiz generation.
Finally, it helps to handle text documents and retrieve the necessary information much
faster. For example, it is useful for companies with a massive base of inner instructions.
Employees can retrieve the required data based on the scope of the documents.


Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License
Attribution 4.0 International (CC BY 4.0).
In: Proceedings of the 1st Masters Symposium on Advances in Data Mining, Machine Learning,
and Computer Vision (MS-AMLV 2019), Lviv, Ukraine, November 15-16, 2019, pp. 81–88
82                                                     Serhii Tiutiunnyk and Vsevolod Dyomkin


  Along with it, the question answering system might be beneficial for layers, medical
workers, and other specific professions.


1.2    General Formulation of the Problem
Question answering task is one of the classical problems in natural language processing
(NLP). At the input for content-based question-answering model has a context and a
question. As a context, we can take an article, a document, an essay, a paper, or any
other piece of textual information. In this project, we will use articles from Wikipedia.
A question is a natural human language question. Articles and questions are in the
Ukrainian language. The result of the model is a phrase from the context, which con-
tains the answer to the question.


2      Review of Related Work

Despite the importance of the problem, it is not appropriately solved for the Ukrainian
language yet. There was no public result for the Ukrainian language found except some
multilingual models like BERT [1].


2.1    Classical Methods
Let us start a review of existed methods from the classical approaches. Under the term
classical, we mean methods, which use well-known strategies without artificial neural
network models. There are unsupervised and supervised methods. Unsupervised ap-
proaches are based on word embedding [3] distances and word frequencies. Supervised
methods use labeled dataset for training (logistic regression, support vector machine,
etc.). Also, we can attribute logic-based methods (for example, Machine Comprehen-
sion Using Commonsense Knowledge [16]) to the set of classical methods. Such meth-
ods are used to solve question-answering task because logical representations yield
more abstract concepts, such as temporal or logical relations. This is very useful for
learning a type of commonsense knowledge.
   Unsupervised Methods. Two different approaches are distinguished within this cat-
egory of methods – based on measuring Euclidean distance between sentences and
counting word and phrase frequencies.
   Euclidean distance between sentences. The first traditional method we came across
during reviewing of related works is finding the minimal Euclidean distance between
question and sentences from the context [4]. The idea of this approach is to find an
average vector of words for each sentence. The answer to the question is the closest
sentence from the context to the question according to Euclidean distance. It is possible
to specify the answer by splitting the sentence into phrases, but it is an additional task,
which will decrease the accuracy of the method. One more drawback of the described
method is relying on the quality of word embeddings. Also, this method does not take
into account a dependency between the words in the sentence.
Context-Based Question-Answering System …                                                83


    Word and phrase frequency. It is possible to use n-gram approach [5] for generating
an answer. The question is parsed into the dependency tree and rebuilt into a narrative
sentence with missing the target word or phrase. The missed phrase is filling by n-gram
model. An artificial neural network model can replace the n-gram model. It will be
discussed below. The drawback of this approach is a low accuracy of dependency parser
models and relying on the phrase frequency in a relatively small volume of text.
    Supervised Methods. This category of methods often use logistic regression and
support vector machine approaches. Supervised traditional methods are described in
[4]. The author uses the SQuAD dataset mentioned above for learning. Sentences from
the context are split into the sentences and added to a binary vector. The target sentence
is marked as 1 and all other items are 0. After that, multinomial logistic regression [6]
is being trained by the labeled data or support vector machine [7]. One of the advantages
of this approach is the ability to add some features to the model (dependency between
the words, term frequency (TF), inverse document frequency (IDF) [8], etc.). A term
frequency is a feature, which increases the weight of frequent words and inverse docu-
ment frequency wise verse decreases the weight of widespread words.
    Pros and Cons. The advantages of the classical approaches are simplicity and high
transparency of the models. Along with it, the model performance on artificial samples
is not good enough (near 70% accuracy on the SQuAD validation set). The result will
be worse with increasing size of the context or setting a goal to retrieve a more specific
answer (a phrase instead of a sentence). Moreover, the results for the Ukrainian lan-
guage are even worse than the English language. It happens due to higher grammar
complexity of the Ukrainian language, fewer text corpora, the presence of word cases
and other language specifics.


2.2    Artificial Neural Network Models
In this part, we will review supervised and unsupervised cases for each main model.
   Long Short-Term Memory Model. Long short-term memory model (LSTM) [9] is
a recurrent neural network architecture, which allows building sequence-to-sequence
models. Also, the input and output vector sizes are not fixed. As an input, LSTM model
takes a context and a question and returns a word scores from the context. To connect
a vector for context and a vector for a question, we add an attention layer. It is a crucial
part of the question answering system based on LSTM model. Attention layer is a dot
product of context and question output vectors. After that, the result of the dot product
converts into the probability of being an answer to the question. The approach men-
tioned above is described in the paper dedicated to Bidirectional Attention Flow (BAF)
[10].
   Generative pre-trained transformer (GPT). There is a second version of this
model called GPT-2 [11]. GPT-2 is one of the State-of-the-Art models in language
modeling tasks. This model was trained on the Wikipedia articles and internet pages to
make the style of generated text more various. This model can only generate the next
word based on the previous text. Hence, to make it answer the question, we have to
rephrase questions sentence into a narrative sentence with a skipped phrase for the an-
swer. GPT-2 will generate the answer. The peculiarity of this model is the absence of
84                                                    Serhii Tiutiunnyk and Vsevolod Dyomkin


the context. On the one hand, it can be an advantage if there is no specific data to re-
trieve the answer. On the other hand, the accuracy will be low for the tasks from special
areas (law, medicine, etc.), as the model was not trained on data from the corresponded
areas. Anyway, GPT-2 cannot be applied to the Ukrainian language, as it is trained only
on English texts. Along with it, training the model from scratch or even pre-training on
Ukrainian corpora requires a lot of resources and time.


2.3    Bidirectional Encoder Representations from Transformers
Bidirectional Encoder Representations from Transformers (BERT) [1] is a transformer-
based neural which shows state-of-the-art results in a wide variety of NLP tasks pro-
vided by Google researchers. Multilingual BERT model was built for top-100 of the
most popular languages used in Wikipedia and it can be used for a hundred languages
out-of-the-box. BERT model training process consists of two stages. The first stage is
pre-training on the text corpora for language modeling task. The second stage is fine-
tuning on the question-answer datasets. The first stage requires substantial computa-
tional resource. Fine-tuning, however, can be performed even on a single graphics pro-
cessing unit (GPU).
   Multilingual BERT results. There are several modifications of BERT multilingual
models, which differ by the fine-tuning process. There are BERT models fine-tuned on
a translated dataset, original dataset (English), cased (use original word case) and un-
cased (all words are lowercased). Table 1 shows the result of BERT modifications on
Cross-lingual Natural Language Inference (XNLI) [13] dataset (translated datasets).

                     Table 1. BERT multilingual model performance

Model                          English   Chinese Spanish German Arabic Urdu
BERT - Translate Train Cased   81.9      76.6    77.8    75.9   70.7   61.6
BERT - Translate Train Uncased 81.4      74.2    77.3    75.2   70.5   61.7
BERT - Translate Test Uncased 81.4       70.1    74.9    74.4   70.4   62.1
BERT - Zero Shot Uncased       81.4      63.8    74.3    70.5   62.1   58.3

   As we can see, the best result for the vast majority of languages is provided by model
pre-trained on the translated cased dataset. One more important thing is that the trans-
lated cased BERT model performs better for non-Latin alphabets languages [12].


3      Research Hypotheses and Problem

3.1    Hypotheses
Accuracy Hypothesis. The main objective of this project is to build a question-answer
model for the Ukrainian language, which shows accuracy near results shown in Table
1 on the well-known benchmarks. The first hypothesis says it is possible to achieve an
Context-Based Question-Answering System …                                              85


efficiency near 70-80%, which is close to results for the other languages provided by
Google researchers.
   Model Comparison Hypothesis. One more goal is to compare different approaches
for pre-training. Some datasets have human translated data into the Russian and other
Slavic languages. It seems that fine-tuning model on Slavic languages datasets and then
fine-tuning on the turned into Ukrainian language dataset might improve performance
for the Ukrainian language comparing with direct fine-tuning on the Ukrainian lan-
guage dataset. So, the next task of this project is to confirm or deny this hypothesis.


3.2    Problems
Translation Problems. To achieve the project goals mentioned above, we need to find
an appropriate machine translator to create the dataset in the Ukrainian language, build
different model pipelines, and compare results. Furthermore, it might require a human
translated small dataset in the Ukrainian language to verify the models.
   Articles Retrieval Problems. Besides, the project needs to retrieve Wikipedia arti-
cles in the Ukrainian language. There are articles in the datasets which exist in the Eng-
lish Wikipedia and are absent in the Ukrainian part. Hence, we have to detect such items
and exclude them from the datasets. Moreover, Wikipedia provides articles in the Ex-
tensible Markup Language (XML) format, which must be converted into the human-
readable text.


4      Envisioned Approach

4.1    Dataset Generation
The very first task is to generate Ukrainian language dataset from the existing datasets
(SQuAD [2]) by machine translator. There is a subtask related to the machine transla-
tion. It is comparison and checking the quality of the translation. The quality of the
translated dataset directly affects the quality of the model. Translation quality can be
checked by reverse translation. If the difference between the original text and the text
after the forward and the backward translation is small enough, it indicates high quality
of the translator.


4.2    Data Storing
Generated datasets and retrieved articles from Ukrainian Wikipedia are stored in the
database to make access to the data more convenient. As the data size is bigger than
read-only memory capacity, we will need to split and read data partially.


4.3    Models Pipeline
The base pre-trained model is multilingual BERT model. Then it is fine-tuned on the
different datasets and variations. The first model will be fine-tuned on the translated
86                                                     Serhii Tiutiunnyk and Vsevolod Dyomkin


training datasets (SQuAD). The accuracy of the model is calculated on the test sets of
the corresponded datasets. The next model is fine-tuned on the human-translated da-
tasets for Slavic languages. After that, the model will be fine-tuned on the machine-
translated dataset for the Ukrainian language. Combinations on the fine-tuning stage
produce different models, which are being compared on the test sets and the human-
translated Ukrainian language dataset created manually.


5      Research Methodology and Plan

5.1    Methodological Approach
One can distinguish three methodological approaches [14]:
─ Quantitative methods are appropriate for measuring, ranking, comparing, etc.
─ Qualitative methods are best to measure describing, interpreting, contextualizing.
  Very often, it is related to the textual results.
─ Mixed methods, which combine a numerical measurement and exploration.
   On the one hand, quantitative methods are the best for comparison fine-tuned models
between each other and with state-of-the-art models for the English language.
   There will be applied the F1 [15] score and precision to get a quantitative measure
and two types of matching. The first one is exact matching answers, and the second one
will check if the original answer is present in the model answer.
   On the other hand, sometimes answer to the question is not precisely equal to the
expected value, but the meaning is correct. That is why mixed methods are the most
appropriate for question-answering model evaluation. On this stage, the question-an-
swer system may require a subsystem (additional artificial neural network or a simple
set of rules) which decides if the answer is correct even if the model response is not
equal to the labeled value.
   Along with it, as it was mentioned above, the last stage of model evaluation is a
human-translated test set in the Ukrainian language, which allows providing a qualita-
tive measurement.


5.2    Plan for the Research
Table 2 shows a plan for the research.

                            Table 2. Timeline for the research

Milestone                                                  Start Date End Date
Coordination of the direction of the thesis                Aug 2019 Aug 2019
Review of past and current related work                    Aug 2019 Oct 2019
Thesis proposal                                            Aug 2019 Sep 2019
Comparison machine translators                             Sep 2019 Oct 2019
Context-Based Question-Answering System …                                                   87


Milestone                                                     Start Date End Date
Translation datasets and building database of articles        Sep 2019 Oct 2019
Building baseline model                                       Oct 2019 Oct 2019
Building advanced fine-tuned models                           Oct 2019 Dec 2019
Model evaluation                                              Nov 2019 Dec 2019
Writing master thesis                                         Oct 2019 Jan 2020
Submission of thesis for final review                         -          8 Jan 2020
Master Thesis Defense                                         -          End of Jan 2020


6      Conclusive Remarks and Outlook

The most valuable thing from the potential results of the project is a high performance
context-based question-answering model for the Ukrainian language. After the comple-
tion of this work, we will know how to build question-answering systems for the
Ukrainian language. Further, these methods can be applied to the other Slavic languages
or languages with very complicated grammar, peculiar properties, or non-Latin charac-
ters.
    Translated datasets will be reusable for the other researches and projects and can be
taken as a start point for the human translation process.
    There is a point in the research plan where hypothesis might fail, and research must
start from scratch. It is a hypothesis about building a high-performed question-answer-
ing model based on fine-tuning on the machine-translated datasets. This approach
showed good results for English, Spanish, German, and Arabic languages. Along with
it, the efficiency for the Urdu language is significantly worse than for the languages
mentioned earlier (see Table 1).


References
 1. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional
    transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
 2. Stanford Question Answering Dataset. https://rajpurkar.github.io/SQuADexplorer/
 3. Al-Rfou, R., Perozzi, B., Skiena, S.: Polyglot: Distributed word representations for multi-
    lingual NLP. arXiv preprint arXiv:1307.1662 (2013)
 4. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine
    comprehension of text. arXiv preprint arXiv:1606.05250 (2016)
 5. Sperandei, S., Understanding logistic regression analysis. Biochemia Medica 24(1), 12–18
    (2014). doi: 10.11613/BM.2014.003
 6. Kwok, J.T.Y.: Automated text categorization using support vector machine. In: 5th Interna-
    tional Conference on Neural Information Processing, pp. 347–351 (1998)
 7. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, Inc.,
    New York (1983)
 8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation, 9(8), 1735–
    1780 (1997)
88                                                          Serhii Tiutiunnyk and Vsevolod Dyomkin


 9. Seo, M., Kembhavi, A., Farhadi, A., Hajishirzi, H.: Bidirectional attention flow for machine
    comprehension. arXiv preprint arXiv:1611.01603 (2016)
10. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are
    unsupervised multitask learners. OpenAI Blog 1(8) (2019)
11. Google AI Research. https://github.com/google-research/bert
12. Cross-lingual Natural Language Inference dataset. https://github.com/facebookre-
    search/XNLI
13. Newman, I., Benz, C.R., Ridenour, C.S.: Qualitative-Quantitative Research Methodology:
    Exploring the Interactive Continuum. SIU Press, Carbondale and Edwardsville (1998)
14. Goutte, C., Gaussier, E.: A probabilistic interpretation of precision, recall, and F-score, with
    implication for evaluation. In: Losada D.E., Fernández-Luna J.M. (eds.) ECIR 2005. LNCS,
    vol. 3408, pp. 345–359. Springer, Berlin, Heidelberg (2005)
15. Apidianaki, M., Mohammad, S., May, J., Shutova, E., Bethard, S., Carpuat,M.: Proceedings
    of the 12th International Workshop on Semantic Evaluation. Association for Computational
    Linguistics (2018)
16. Ostermann, S., Roth, M., Modi, A., Thater, S., Pinkal, M.: SemEval 2018 Task 11: Machine
    comprehension using commonsense knowledge. In: 12th International Workshop on Seman-
    tic Evaluation, pp. 747–757. Association for Computational Linguistics (2018)