Investigation of Transitivity Relation in Natural Language Inference

Petro Zdebskyi1, Andrii Berko1 and Victoria Vysotska1,2
1 Lviv Polytechnic National University, S. Bandera Street, 12, Lviv, 79013, Ukraine
2 Osnabrück University, Friedrich-Janssen-Str. 1, Osnabrück, 49076, Germany

Abstract
The motivation for this work is a data-centric approach to improving model accuracy: improving data quality instead of the model architecture. The idea is to enrich a dataset with transitivity relations so that a machine learning model can learn such dependencies. Alongside enriching the dataset, we investigate how well a previously trained model captures such relations. The study can therefore be divided into two main parts: investigating the dataset and investigating a machine learning model trained on it. It was found that the existing model captures transitive dependencies well. It was also found that the "entailment" relation is more directional than "contradiction" and "neutral".

Keywords
Natural Language Inference, Recognizing Textual Entailment, transitive relation

COLINS-2023: 7th International Conference on Computational Linguistics and Intelligent Systems, April 20–21, 2023, Kharkiv, Ukraine
EMAIL: petro.v.zdebskyi@lpnu.ua (P. Zdebskyi); Andrii.Y.Berko@lpnu.ua (A. Berko); victoria.a.vysotska@lpnu.ua (V. Vysotska)
ORCID: 0000-0002-0478-2308 (P. Zdebskyi); 0000-0003-2892-9519 (A. Berko); 0000-0001-6417-3689 (V. Vysotska)
© 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org)

1. Introduction

The NLI task is a good benchmark for NLU research: two sentences are passed to a model, which is asked to determine the relationship between them by selecting one of three options: entailment, neutral, or contradiction. Success in NLI does not demand complex machine learning skills, but rather an ability to understand the meaning of sentences, lexical and compositional semantics, as well as phenomena such as quantification, tense, beliefs, modality, and lexical and syntactic ambiguity. The study relies solely on the MultiNLI dataset, which is considered the largest dataset for Natural Language Inference. MultiNLI consists of sentence pairs annotated with textual entailment information and covers multiple genres of text [1].

Figure 1: MultiNLI samples example

The NLI task is a directional relation [2]. We also check the relation in the opposite direction: instead of making the inference from the first sentence to the second, we swap them and make the inference from the second sentence to the first.

The model used is RoBERTa, a large deep learning model trained by Facebook and fine-tuned on the MultiNLI dataset mentioned above. It is an improved BERT model: performance was improved by training the model on more data for a longer period of time, removing the next-sentence-prediction objective, and dynamically changing the masking pattern applied to the training data. This improved training procedure, called RoBERTa, achieves better results on various benchmarks without fine-tuning for GLUE or any additional data for SQuAD.

The plan is, first, to check whether the second sentence of some samples appears as the first sentence of other samples (strict and fuzzy matching) in order to create a dataset for analysing transitivity, and second, to check RoBERTa's ability to learn the transitivity relation on the dataset created in the previous step.
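As a minimal illustration of the setup, the publicly available roberta-large-mnli checkpoint from the Hugging Face transformers library (assumed here as a stand-in for the fine-tuned model used in this study) can classify a premise/hypothesis pair as follows:

```python
# Sketch: classify an NLI pair with a RoBERTa model fine-tuned on MultiNLI.
# Assumes the Hugging Face `transformers` package and the public
# `roberta-large-mnli` checkpoint; the exact model used in the paper may differ.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

premise = "Last year I was in Paris."
hypothesis = "I was abroad a year ago."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# For this checkpoint, id2label maps 0/1/2 to contradiction/neutral/entailment.
label = model.config.id2label[int(logits.argmax(dim=-1))]
print(label)
```

Swapping the `premise` and `hypothesis` arguments gives the reverse-direction prediction discussed in Section 5.2.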
2. Related works. Transformer architecture

The Transformer is a neural network architecture designed for natural language processing (NLP) tasks and has become one of the most widely used architectures in NLP. It is based on a self-attention mechanism, which allows the model to better understand context and the relationships between words because it looks at different parts of the input sequence at once [1-3]. Unlike traditional NLP models such as convolutional or recurrent neural networks, the Transformer is faster and more efficient because it processes the entire sequence in parallel [4-7]. It is composed of an encoder and a decoder, each of which is made up of a stack of identical layers. The encoder takes the input sequence and produces a sequence of hidden states, while the decoder takes the encoder output and produces the final output sequence. Both the encoder and the decoder use self-attention to process the input and produce the output.

One of the main advantages of the Transformer is its ability to handle variable-length input sequences without the need for padding or truncation. This is achieved through the use of positional encoding, which encodes the position of each word in the input sequence as a vector. The Transformer has achieved state-of-the-art performance on a wide range of NLP tasks, including machine translation, language modeling, question answering, and sentiment analysis. Many popular NLP models, such as BERT, GPT-3, and RoBERTa, are based on the Transformer architecture.

Positional encoding is a technique used in NLP to encode the relative position of tokens in a sequence. Sequences of tokens are often processed by models such as the Transformer, which are based on attention mechanisms. Since these models have no notion of order, they require an additional signal to represent the order of the sequence. Positional encoding achieves this by adding an encoding vector to the embedding of each token that captures the token's position in the sequence. The encoding vector is calculated based on the position of the token in the sequence and the dimension of the embedding space, and adding it to the token's embedding allows the model to differentiate between tokens based on their position. For tasks such as language modeling, machine translation, and text classification it is important to capture both the semantics of the tokens and their position in the sequence, and the Transformer with positional encoding achieves this.

2.1. Encoder and Decoder

Encoder-decoder is a type of neural network architecture commonly used in natural language processing and machine translation tasks. It consists of two main components: an encoder and a decoder. As shown in Fig. 2, the encoder takes an input sequence and converts it into a fixed-length vector representation, which captures the essential information of the input. This vector is passed on to the decoder, which generates an output sequence based on it. Such an architecture is well suited for tasks where the input and output have different lengths, because it can handle variable-length sequences on both sides; examples include machine translation and text summarization.

The encoder contains six identical layers stacked together [1-7]. Each of them has two sub-layers: a multi-head self-attention and a fully connected feed-forward network. There is a residual connection around each of the two sub-layers, followed by a layer-normalization step [8-12].
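To make the sub-layer arrangement concrete, the sketch below shows one encoder layer in the style described above (multi-head self-attention and a feed-forward network, each wrapped in a residual connection followed by layer normalization). It is a simplified illustration with assumed dimensions, using PyTorch's built-in nn.MultiheadAttention, not the reference implementation from [3].

```python
# Sketch of a single Transformer encoder layer: self-attention + feed-forward,
# each wrapped in a residual connection followed by layer normalization.
# Simplified illustration (post-norm, no dropout), with assumed sizes.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: multi-head self-attention, residual connection, normalization.
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: position-wise feed-forward network, residual connection, normalization.
        x = self.norm2(x + self.ffn(x))
        return x

x = torch.randn(2, 10, 512)     # (batch, sequence length, d_model)
print(EncoderLayer()(x).shape)  # torch.Size([2, 10, 512])
```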
The decoder takes the vector produced by the encoder and generates an output sequence; like the encoder, it typically has residual connections and normalization. In a generic encoder-decoder model, the encoder is typically a stack of recurrent or convolutional neural networks that process the input sequence one token at a time and produce a sequence of hidden states: it takes an input sequence, such as a sentence in one language, and transforms it into a fixed-length vector representation called a context vector.

Figure 2: Transformer architecture

2.2. Scaled Dot-Product Attention

Scaled dot-product attention is one of the main components of the Transformer. It is the attention mechanism that helps the model generate the output sequence by concentrating on the most relevant parts of the input sequence. It computes a weighted sum of values based on a set of key-value pairs, where the weights are computed by taking the dot product of the query with each key and then applying a softmax function to normalize them. To keep the dot products from growing too large, which would leave the softmax with extremely small gradients, the dot product is scaled by the square root of the dimension of the key vectors. The resulting weighted sum is then used as input to the next layer in the network. For the attention computation, the set of queries is stored in a matrix Q, the keys in a matrix K, and the values in a matrix V. The formula for scaled dot-product attention is:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$   (1)

Two attention functions are commonly used: additive attention and multiplicative (dot-product) attention. They are similar in theoretical complexity, but the dot product is faster and more space-efficient because it is implemented with highly optimized matrix multiplication code. For small values of $d_k$ the two approaches work similarly, but additive attention outperforms dot-product attention without scaling for larger values of $d_k$: the dot products grow large and push the softmax function into regions where it has extremely small gradients. Such an attention mechanism is very useful in NLP tasks where the input sequences can be very long, because it allows the model to focus only on the most relevant parts of the sequence and as a result improves the accuracy of the model.

2.3. Multi-Head Attention

Multi-head attention is commonly used in neural network models for NLP tasks such as machine translation and text classification. It allows the model to attend simultaneously to distinct parts of the sequence and improves its ability to capture complex relationships between parts of the input. Instead of using a single attention function with $d_{model}$-sized keys, values, and queries, it is better to linearly project the queries, keys, and values h times with different learned projections to dimensions $d_k$, $d_k$, and $d_v$. The attention function is then performed in parallel on these projected versions of the queries, keys, and values, yielding $d_v$-dimensional output values. The weighted sum of the input sequence is computed for each head, and the results are concatenated to form a single output vector, which is then passed through a basic feed-forward (linear) layer. Multi-head attention allows the model to jointly attend to information from different representation subspaces at different positions; with a single attention head, averaging inhibits this.

Figure 3: Scaled Dot-Product Attention and Multi-Head Attention
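A compact sketch of formula (1), assuming batched query, key, and value tensors; this is an illustrative implementation, not the optimized one used in production Transformer libraries. The optional mask argument corresponds to the masking of the softmax input mentioned in Section 2.4.

```python
# Sketch: scaled dot-product attention as in formula (1).
# Q and K have shape (batch, sequence length, d_k); V has shape (batch, sequence length, d_v).
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.size(-1)
    # Dot products of queries with keys, scaled by sqrt(d_k).
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        # Masked positions receive -inf so the softmax assigns them zero weight.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    # Weighted sum of the values.
    return weights @ V

Q = K = V = torch.randn(1, 5, 64)
print(scaled_dot_product_attention(Q, K, V).shape)  # torch.Size([1, 5, 64])
```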
2.4. Application of Attention in the Model

Multi-head attention is used in the Transformer in three different ways:
• In encoder-decoder attention, the memory keys and values come from the output of the encoder, while the queries come from the previous decoder layer. Specifically, at each time step in the decoder, the attention mechanism computes an attention weight vector over the encoded input sequence. This allows every position in the decoder to attend over all positions in the input sequence, which mimics the typical encoder-decoder attention in sequence-to-sequence models.
• In encoder self-attention, all of the keys, values, and queries come from the same place: the output of the previous encoder layer. Each head of the attention mechanism attends to a different part of the input sequence and computes an attention weight vector, and each position in the encoder can attend to all positions in the previous encoder layer.
• Similarly, decoder self-attention allows each position in the decoder to attend to all positions in the decoder up to and including that position. At each time step, the decoder attends to the previously generated tokens in the target sequence to compute an attention weight vector. This is implemented inside scaled dot-product attention by masking out the corresponding inputs to the softmax in order to prevent leftward information flow.

By using multi-head attention in these ways, the Transformer is able to capture complex relations between the input and the output, which is why it is one of the most successful models in NLP [3].

Figure 4: Attention mechanism visualization

3. Methods and technology
3.1. BERT

This section gives details about the BERT implementation. In general, there are two steps in training the model: pre-training and fine-tuning. During pre-training the model is trained in an unsupervised manner (on unlabeled data) on different tasks. For fine-tuning, the model is initialized with the pre-trained parameters and all parameters are then adjusted using a supervised approach (labeled data). For each downstream task there is a separate fine-tuned model, despite the fact that they are all initialized with the identical pre-trained parameters. A distinctive feature of the model is that it uses the same architecture for different tasks, so there is only a slight difference between the final task-specific architecture and the pre-trained architecture.

The BERT model architecture is a multilayer bidirectional Transformer encoder. The base BERT model has the same model size as OpenAI's first-generation GPT model. However, the GPT Transformer uses constrained self-attention, so it can only look to the left in the context, while BERT uses bidirectional self-attention.

BERT is trained on a task called Masked Language Modeling (MLM), where it is given a sentence with some of the words randomly masked and is asked to predict the masked words based on the context of the other words in the sentence [4]. This helps the model learn relationships between words and understand the meaning of a sentence [11]. BERT is also trained on a task called Next Sentence Prediction (NSP), where it is given a pair of sentences and is asked to predict whether the second sentence follows the first one in the document or not. The input representation can encode both a single sentence and a pair of sentences, so that the model can work with different downstream tasks.
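To illustrate the MLM objective described above, the sketch below builds a masked training input by replacing a random 15% of tokens with a [MASK] token. It is a simplified, hypothetical preprocessing step; the full BERT recipe in [4] additionally replaces some selected tokens with random words or leaves them unchanged.

```python
# Sketch: building masked-language-model training inputs in the spirit of BERT.
# Simplified: 15% of tokens are selected and replaced by a [MASK] token.
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK_TOKEN)  # the model must predict the original token here
            labels.append(token)
        else:
            masked.append(token)
            labels.append(None)        # no prediction target for unmasked positions
    return masked, labels

tokens = "the model predicts the masked words from context".split()
print(mask_tokens(tokens))
```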
A sequence in this context is the sequence of tokens passed to the model, which can be a single sentence or two sentences packed together. The embedding used in BERT is the WordPiece embedding, and there is a special marker at the start of each sequence. There are two ways to differentiate between sentences: the first is to separate them with a special token, and the second is to add to each token a learned embedding that indicates whether it belongs to the first sentence or the second [4, 11, 13-15].

3.2. RoBERTa

RoBERTa (Robustly Optimized BERT Pre-training Approach) is a variant of the BERT (Bidirectional Encoder Representations from Transformers) model. It uses a much larger dataset for pre-training than BERT (160 GB of text data) and dynamic masking, which randomly masks different tokens at each training epoch. RoBERTa is trained on the masked language modeling task and drops Next Sentence Prediction. In the original BERT implementation, random replacement and masking is performed once at preprocessing time and kept throughout training, although in practice the data is duplicated so that the mask is not always the same for each sentence [5].

NSP, or Next Sentence Prediction, is a binary classification task: predict whether two sentences follow one another in the text. Positive training examples are created by extracting consecutive sentences from the training text data, while negative examples are created using sentences from different documents. The Next Sentence Prediction task is intended to improve performance on tasks, such as natural language inference, that require understanding the relationship between sentences [5, 12, 16-18].

3.3. MultiNLI dataset

The MultiNLI (Multi-Genre Natural Language Inference) dataset is a popular benchmark dataset for natural language understanding tasks in machine learning. It has 433,000 examples, which makes it one of the largest datasets for natural language inference. MultiNLI achieves higher complexity by using data from ten genres of written and spoken English, which allows models to be evaluated on almost the full complexity of the language and provides an obvious setting for assessing domain adaptation. The dataset consists of sentence pairs drawn from various genres, including fiction, government reports, and conversational speech. Each sentence pair is labeled with one of three relationship types: entailment, contradiction, or neutral.

NLP tasks depend on natural language understanding (NLU) to succeed [19-21]. A lot of work has been done to advance applied NLU tasks so that models can succeed on them. In general, a model must be accurate both in NLU and in additional machine learning tasks such as memory access or structured prediction, which makes it rather difficult to assess correctly how well NLP models understand the meaning of language [22-27].

The entailment relationship indicates that the first sentence logically entails the second sentence, meaning that if the first sentence is true, then the second sentence must also be true. The contradiction relationship indicates that the first sentence contradicts the second sentence, meaning that if the first sentence is true, then the second sentence must be false. The neutral relationship indicates that there is no logical relationship between the two sentences [28-36].

The methodology for data collection in MultiNLI is very similar to that of SNLI: the premise sentence is taken from a pre-existing source, and an annotator is then asked to write a new sentence for it.
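As a practical note, the corpus can be inspected with the Hugging Face datasets library. The sketch assumes the publicly hosted "multi_nli" copy of the corpus; the field names and label encoding below reflect that copy.

```python
# Sketch: loading MultiNLI with the Hugging Face `datasets` library and
# inspecting one premise/hypothesis pair. Assumes the public "multi_nli" dataset,
# where labels 0/1/2 are expected to mean entailment/neutral/contradiction.
from datasets import load_dataset

mnli = load_dataset("multi_nli", split="train")
example = mnli[0]
print(example["genre"])       # e.g. "government", "fiction", "telephone", ...
print(example["premise"])
print(example["hypothesis"])
print(example["label"])       # 0 = entailment, 1 = neutral, 2 = contradiction
```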
The dataset was created to encourage research into natural language understanding across multiple genres, rather than just a single genre, which was the case with many previous datasets. This makes it more challenging and realistic for models to perform well on it. The text of the MultiNLI premise sentences comes from ten sources of freely available text, so it should be diverse and representative of American English. The sentence pairs were drawn from ten different genres, including telephone conversations, travel guides, and government documents, which ensures that the dataset covers a wide range of topics and writing styles [37-41].

A gold label was assigned to each sentence pair as the majority vote among the label assigned by the original annotator and the four labels assigned by the verifying annotators. A small number of sentence pairs did not reach a consensus; such pairs should not be used in standard evaluations and are included in the distributed corpus with a dash as the gold label. As shown in Fig. 5, each dataset record contains the sentence pair, the genre, the gold label, and all labels assigned by the annotators [1].

Figure 5: Randomly chosen examples from the development set of the MultiNLI dataset

3.4. Properties of binary relations

A binary relation can be defined as follows: given any sets A and B, a binary relation R from A to B, written R : A × B, is a subset of the set A × B. A relation R on a set A is called reflexive if (a, a) ∈ R for every element a ∈ A; in graph terms, every vertex has a self-loop.

Figure 6: Reflexive relation
Figure 7: Symmetric relation

A relation R is symmetric if (b, a) ∈ R whenever (a, b) ∈ R, for all a, b ∈ A. This means that if there is an edge from one vertex to another, there is also an edge in the opposite direction. A relation is antisymmetric if, for all a, b ∈ A, (a, b) ∈ R and (b, a) ∈ R imply a = b; in graph terms, between distinct vertices there is at most one directed edge.

Figure 8: Antisymmetric relation
Figure 9: Transitive relation

A binary relation is called transitive if (a, b) ∈ R and (b, c) ∈ R imply (a, c) ∈ R, for all a, b, c ∈ A. In graph terms, whenever there is a path from one vertex to another, there is also a direct edge between them [6].

3.5. Textual Entailment as a Directional Relation

In the textual entailment task the relation between premise and hypothesis is considered directional. For example, take "Last year I was in Paris" as a premise and "I was abroad a year ago" as a hypothesis. The relation is clearly directional: we can entail the hypothesis from the premise, but not the other way around, because someone being abroad does not mean they were in Paris. Only a few authors have exploited the directional character of the entailment relation, which means that if T → H holds, it is unlikely that the reverse H → T also holds. From a logical point of view, the entailment relation is akin to implication which, contrary to equivalence, is not symmetric [2, 8, 9].

4. Check of how many hypotheses are premises in other samples

The idea is to check whether there are hypotheses (second sentences) that appear as premises (first sentences) in other samples. This allows us to create a dataset for validating how well the model captures the transitive relation, as sketched below. Strict matching is used, but fuzzy matching could be utilized as well.
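A minimal sketch of this chaining step, assuming the records are available as (premise, hypothesis, label) tuples (hypothetical variable names, exact-string matching only):

```python
# Sketch: find pairs whose hypothesis appears verbatim as the premise of another pair,
# producing chained (premise1, hypothesis2) candidates for the transitivity check.
from collections import defaultdict

def build_transitive_candidates(pairs):
    """pairs: iterable of (premise, hypothesis, label) tuples."""
    by_premise = defaultdict(list)
    for premise, hypothesis, label in pairs:
        by_premise[premise.strip()].append((hypothesis, label))

    candidates = []
    for premise, hypothesis, label in pairs:
        # Exact (strict) match; a fuzzy matcher could be substituted here.
        for hypothesis2, label2 in by_premise.get(hypothesis.strip(), []):
            candidates.append((premise, hypothesis2, label, label2))
    return candidates

pairs = [
    ("I have been to Paris", "I have been to France", "entailment"),
    ("I have been to France", "I visited France", "entailment"),
]
print(build_transitive_candidates(pairs))
# [('I have been to Paris', 'I visited France', 'entailment', 'entailment')]
```

Indexing the premises in a dictionary keeps the lookup roughly linear; the quadratic pairwise scan discussed in the next section corresponds to comparing every hypothesis against every premise directly.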
A sample of the dataset was checked, because the dataset is huge and running the calculation on the whole dataset would take a long time:

$\mathit{TimeElapsed} = (\mathit{IterationsNumber} \cdot \mathit{IterationTime})^{2}$   (2)

The iteration time on our machine was 0.0028 s, so we can plot the graph of calculation time (Fig. 10). Even for half of the dataset, about 84 hours of calculations would be needed.

Figure 10: Time complexity graph

5. Discussion and conclusions
5.1. Validation of model entailment on data with transitive relations

Of all these rules, only the first one can be considered strict: if there is entailment in the first pair and entailment in the second, then there should be an entailment relation between the premise of the first pair and the hypothesis of the second pair. For example, with the first pair "I have been to Paris" / "I have been to France" and the second pair "I have been to France" / "I visited France", the chained pair "I have been to Paris" / "I visited France" is an entailment. The rules for the other labels can be explained similarly. Based on domain knowledge, the following rules were created:

Figure 11: Transitive relation rules

Table 1
Classification of original samples from the dataset

| First pair: premise | First pair: hypothesis | Predicted | True | Second pair: premise | Second pair: hypothesis | Predicted | True |
| I'm afraid not, sir. | I don't think so. | entailment | entailment | I don't think so. | You are looking in the wrong place. | neutral | neutral |
| I'm afraid not, sir. | I don't think so. | entailment | entailment | I don't think so. | I have no real idea. | neutral | contradiction |
| you know and he you know i don't know | I don't know. | entailment | entailment | I don't know. | I know. | contradiction | contradiction |
| Not at all--or at least I don't think so. | I don't think so. | entailment | entailment | I don't think so. | You are looking in the wrong place. | neutral | neutral |
| Not at all--or at least I don't think so. | I don't think so. | entailment | entailment | I don't think so. | I have no real idea. | neutral | contradiction |
| uh-huh that's true that's true yeah | That's right. | entailment | entailment | That's right. | That's correct. | entailment | entailment |
| Quite.' | Absolutely. | entailment | entailment | Absolutely. | There's no doubt that I'll do it. | neutral | neutral |
| Quite.' | Absolutely. | entailment | entailment | Absolutely. | Definitely. | entailment | entailment |

Table 2
Classification of created samples with transitive relation

| Premise | Hypothesis | Predicted | True |
| I'm afraid not, sir. | You are looking in the wrong place. | neutral | neutral |
| I'm afraid not, sir. | I have no real idea. | contradiction | neutral |
| you know and he you know i don't know | I know. | contradiction | contradiction |
| Not at all--or at least I don't think so. | You are looking in the wrong place. | neutral | neutral |
| Not at all--or at least I don't think so. | I have no real idea. | contradiction | neutral |
| uh-huh that's true that's true yeah | That's correct. | entailment | entailment |
| Quite.' | There's no doubt that I'll do it. | neutral | neutral |
| Quite.' | Definitely. | entailment | entailment |

In these tables the misclassified samples are those where the predicted label differs from the true label (highlighted with a grey background in the original layout). We can use the accuracy metric here because the classes are balanced; accuracy equals 75% (6/8 · 100%). It is calculated on only 8 samples, so it cannot be considered statistically reliable, but this restriction is due to computational complexity. Going deeper into the results, the two errors occur on samples for which there were already errors in the previous table; for the samples of the other classes the predictions are accurate.
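For completeness, the 75% figure can be reproduced mechanically from the predicted/true label pairs of Table 2:

```python
# Sketch: accuracy of the model on the chained (transitive) pairs from Table 2.
# Each entry is (predicted label, true label); 6 of 8 match, giving 75%.
chained_results = [
    ("neutral", "neutral"),
    ("contradiction", "neutral"),
    ("contradiction", "contradiction"),
    ("neutral", "neutral"),
    ("contradiction", "neutral"),
    ("entailment", "entailment"),
    ("neutral", "neutral"),
    ("entailment", "entailment"),
]
accuracy = sum(p == t for p, t in chained_results) / len(chained_results)
print(f"{accuracy:.0%}")  # 75%
```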
So, there are problems only with samples on which the model failed even on the original dataset, and if we look at those samples it is genuinely hard to decide whether the two sentences really are a contradiction. As a conclusion, we can say that in general the model captures the transitive relation, except for the samples on which it fails on the original dataset. A possible improvement would be to validate the samples with the "contradiction" class in the dataset the model was trained on.

5.2. Model entailment from hypotheses to premises

We decided to check what we get if we swap the first and second sentences: the second sentence is treated as the premise and the first one as the hypothesis [10]. Because the dataset is huge, a random subset of 10,000 samples was taken (the calculation time was already more than 2 hours); a sketch of this reversed evaluation is given below.

Table 3
Metrics of classification from hypothesis to premise

| Metric | From premise to hypothesis | From hypothesis to premise (reverse order) | Decreased by a factor of |
| Accuracy | 0.95 | 0.63 | 1.51 |
| Accuracy for class "similar" (entailment) | 0.98 | 0.33 | 2.97 |
| Accuracy for class "neutral" | 0.90 | 0.77 | 1.17 |
| Accuracy for class "contradiction" | 0.98 | 0.78 | 1.26 |

The table shows that accuracy decreased by a factor of 2.97, 1.17, and 1.26 for the classes similar (entailment), neutral, and contradiction respectively. From this we can infer that the similar (entailment) class is the most directional relation, followed by contradiction and then neutral. These numbers are intuitive: if one sentence contradicts another, the contradiction usually also holds in the reverse order, from the second sentence to the first. For the "neutral" class accuracy drops even less than for "contradiction", because neutral sentence pairs are usually neutral in the other direction as well. For the "entailment" class the direction is very strong, and the numbers support this: accuracy became lower approximately by a factor of 3.
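A minimal sketch of the reversed-direction evaluation, assuming the public roberta-large-mnli checkpoint and gold-labelled records shaped like (premise, hypothesis, gold_label); the exact model and data loading used in the study may differ.

```python
# Sketch: forward vs. reversed classification accuracy on a random subset of pairs.
# Assumes the public roberta-large-mnli checkpoint and gold labels given as
# "entailment" / "neutral" / "contradiction".
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-large-mnli")
model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def predict(premise: str, hypothesis: str) -> str:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax(dim=-1))].lower()

def accuracy(samples, reverse: bool = False) -> float:
    correct = 0
    for premise, hypothesis, gold in samples:
        if reverse:  # swap the roles of the two sentences
            premise, hypothesis = hypothesis, premise
        correct += int(predict(premise, hypothesis) == gold)
    return correct / len(samples)

# subset = random.sample(all_pairs, 10000)  # all_pairs: hypothetical list of labelled pairs
# print(accuracy(subset), accuracy(subset, reverse=True))
```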
6. References

[1] A. Williams, N. Nangia, S. R. Bowman, A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426 (2017).
[2] D. Tatar, G. Serban, A. D. Mihis, R. Mihalcea, Textual entailment as a directional relation. Journal of Research and Practice in Information Technology 41(1) (2009) 53-64.
[3] A. Vaswani, et al., Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[4] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[5] Y. Liu, et al., RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
[6] William McDaniel Albritton, ICS 241: Discrete Mathematics II, Waterloo, 2015.
[7] P. Zdebskyi, V. Lytvyn, Y. Burov, Z. Rybchak, P. Kravets, O. Lozynska, R. Holoshchuk, S. Kubinska, A. Dmytriv, Intelligent System for Semantically Similar Sentences Identification and Generation Based on Machine Learning Methods, CEUR Workshop Proceedings Vol-2604 (2020) 317-346.
[8] A. L. Kalouli, A. Buis, L. Real, M. Palmer, V. De Paiva, Explaining simple natural language inference. In: Proceedings of the 13th Linguistic Annotation Workshop, 2019, pp. 132-143.
[9] T. Chen, Z. Jiang, A. Poliak, K. Sakaguchi, B. Van Durme, Uncertain natural language inference. arXiv preprint arXiv:1909.03042, 2019.
[10] A. Poliak, A survey on recognizing textual entailment as an NLP evaluation. arXiv preprint arXiv:2010.03061, 2020.
[11] Y. You, et al., Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962, 2019.
[12] L. Z. Liu, Y. Wang, J. Kasai, H. Hajishirzi, N. A. Smith, Probing across time: What does RoBERTa know and when?. arXiv preprint arXiv:2104.07885, 2021.
[13] N. Khairova, A. Shapovalova, O. Mamyrbayev, N. Sharonova, K. Mukhsina, Using BERT model to Identify Sentences Paraphrase in the News Corpus, CEUR Workshop Proceedings, Vol-3171 (2022) 38-48.
[14] P. Zweigenbaum, P. Jacquemart, N. Grabar, B. Habert, Building a text corpus for representing the variety of medical language, Studies in Health Technology and Informatics 84 (2001) 290–294.
[15] H. Livinska, O. Makarevych, Feasibility of Improving BERT for Linguistic Prediction on Ukrainian Corpus. CEUR Workshop Proceedings Vol-2604 (2020) 552-561.
[16] V. Lytvyn, P. Pukach, V. Vysotska, M. Vovk, N. Kholodna, Identification and Correction of Grammatical Errors in Ukrainian Texts Based on Machine Learning Technology. Mathematics 11 (2023) 904. https://doi.org/10.3390/math11040904
[17] N. Sharonova, I. Kyrychenko, I. Gruzdo, G. Tereshchenko, Generalized Semantic Analysis Algorithm of Natural Language Texts for Various Functional Style Types, CEUR Workshop Proceedings, Vol-3171 (2022) 16-26.
[18] V. Shynkarenko, I. Demidovich, Natural Language Texts Authorship Establishing Based on the Sentences Structure, CEUR Workshop Proceedings, Vol-3171 (2022) 328-337.
[19] O. Kuropiatnyk, V. Shynkarenko, Automation of Template Formation to Identify the Structure of Natural Language Documents. In: CEUR Workshop Proceedings, Vol-2870 (2021) 179-190.
[20] V. Shynkarenko, I. Demidovich, Authorship Determination of Natural Language Texts by Several Classes of Indicators with Customizable Weights. In: CEUR Workshop Proceedings, Vol-2870 (2021) 832-844.
[21] O. Kuropiatnyk, V. Shynkarenko, Text Borrowings Detection System for Natural Language Structured Digital Documents, CEUR Workshop Proceedings Vol-2604 (2020) 294-305.
[22] V. Lytvyn, S. Kubinska, A. Berko, T. Shestakevych, L. Demkiv, Y. Shcherbyna, Peculiarities of Generation of Semantics of Natural Language Speech by Helping Unlimited and Context-Dependent Grammar, CEUR Workshop Proceedings, Vol-2604 (2020) 536-551.
[23] O. Bisikalo, Y. Ivanov, N. Karevina, Encoding of Natural Language Information on the Basis of the Power Set. In: 2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies, CSIT 2018 - Proceedings, 2018, 2, pp. 17–20.
[24] O. Bisikalo, I. Bogach, V. Sholota, The Method of Modelling the Mechanism of Random Access Memory of System for Natural Language Processing. In: Proceedings - 15th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering, TCSET 2020, 2020, pp. 472–477.
[25] O. V. Bisikalo, O. M. Vasilevskyi, Evaluation of uncertainty in the measurement of sense of natural language constructions, International Journal of Metrology and Quality Engineering, 2017, 8.
[26] S. Albota, Linguistically Manipulative, Disputable, Semantic Nature of the Community Reddit Feed Post. In: CEUR Workshop Proceedings, Vol-2870 (2021) 769-783.
[27] S. Albota, Semantic analysis of the Reddit vaccination news feed in the Reddit social network. In Computer science and information technologies: proceedings of IEEE 16th International conference CSIT, Lviv, Ukraine, 22–25 September, 2021, pp. 56–59.
[28] S. Albota, War Implications in the Reddit News Feed: Semantic Analysis.
In Computer science and information technologies: proceedings of IEEE 17th International conference CSIT, Lviv, Ukraine, 10–12 November, 2022, pp. 99–102.
[29] I. Khomytska, V. Teslyuk, The Multifactor Method Applied for Authorship Attribution on the Phonological Level, CEUR Workshop Proceedings, Vol-2604 (2020) 189-198.
[30] I. Khomytska, V. Teslyuk, I. Bazylevych, Y. Kordiiaka, Machine Learning and Classical Methods Combined for Text Differentiation, CEUR Workshop Proceedings, Vol-3171 (2022) 1107-1116.
[31] I. Khomytska, V. Teslyuk, K. Prysyazhnyk, N. Hrytsiv, The Lehmann-Rosenblatt test applied for determination of statistical parameters of Charles Dickens's authorial style. In Computer Science and Information Technologies (CSIT): Proceedings of IEEE XVIth Scientific and Technical Conference, Lviv, Ukraine, 22–25 Sept. 2021, Vol. 2, pp. 64–67.
[32] I. Khomytska, V. Teslyuk, I. Bazylevych, Yu. Kordiiaka, Machine learning and classical methods combined for text differentiation, CEUR Workshop Proceedings, Vol-3171 (2022) 1107-1116.
[33] I. Khomytska, V. Teslyuk, I. Bazylevych, The statistical parameters of Ivan Franko's authorial style determined by the chi-square test. In Computer Science and Information Technologies (CSIT): Proceedings of IEEE XVIIth Scientific and Technical Conference, Lviv, Ukraine, 10–12 November, 2022, pp. 73–76.
[34] V. Husak, O. Lozynska, I. Karpov, I. Peleshchak, S. Chyrun, A. Vysotskyi, Information System for Recommendation List Formation of Clothes Style Image Selection According to User's Needs Based on NLP and Chatbots, CEUR Workshop Proceedings, Vol-2604 (2020) 788-818.
[35] O. Veres, I. Rishnyak, H. Rishniak, Application of Methods of Machine Learning for the Recognition of Mathematical Expressions, CEUR Workshop Proceedings 2362 (2019) 378-389.
[36] Yu. M. Furgala, B. P. Rusyn, Peculiarities of melin transform application to symbol recognition. In: 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering, TCSET, 2018, pp. 251-254.
[37] E. Fedorov, O. Nechyporenko, Method for Recognizing Linguistic Constructions Based on Stochastic Neural Networks, CEUR Workshop Proceedings, Vol-3171 (2022) 104-115.
[38] S. Kubinska, R. Holoshchuk, S. Holoshchuk, L. Chyrun, Ukrainian Language Chatbot for Sentiment Analysis and User Interests Recognition based on Data Mining, CEUR Workshop Proceedings, Vol-3171 (2022) 315-327.
[39] K. Smelyakov, A. Chupryna, D. Darahan, S. Midina, Effectiveness of Modern Text Recognition Solutions and Tools for Common Data Sources, CEUR Workshop Proceedings 2870 (2021) 154.
[40] R. Martsyshyn, et al., Technology of speaker recognition of multimodal interfaces automated systems under stress. In: International Conference: The Experience of Designing and Application of CAD Systems in Microelectronics, CADSM, 2013, pp. 447-448.
[41] P. Zhezhnych, O. Markiv, Recognition of tourism documentation fragments from web-page posts. In: 14th International Conference on Advanced Trends in Radioelectronics, Telecommunications and Computer Engineering, TCSET, 2018, pp. 948-951.