Collection and Processing of a Medical Corpus
                         in Ukrainian

      Olga Cherednichenko 1[0000-0002-9391-5220], Olga Kanishcheva 1[0000-0002-9035-1765],
         Olena Yakovleva 2[0000-0002-6129-6146], Denis Arkatov 1[0000-0003-0162-059X]
       1 National Technical University “Kharkiv Polytechnic Institute”, Kharkiv, Ukraine
           2 Kharkiv National University of Radio Electronics, Kharkiv, Ukraine

      olha.cherednichenko@gmail.com, kanichshevaolga@gmail.com,
             olena.yakovleva@nure.ua, denarkatov@gmail.com


        Abstract. Text corpora are the basis of natural language study. We describe
        the structure of a Ukrainian-language corpus (UKRMED), which contains a variety
        of medical text genres (clinical protocols, blogs, and Wikipedia). The paper
        shows the process of collecting, creating, and processing a corpus of medical
        data in Ukrainian. We present our own framework for creating a text corpus. The
        medical domain and text simplification are chosen as the corpus directions. We
        give statistical characteristics of the corpus and an analysis of its parts of
        speech. The most frequent lemmas of this medical corpus are analyzed. The UKRMED
        corpus can be used for solving the task of natural language simplification.

        Keywords: Medicine Corpus, Corpus Linguistics, Ukrainian, Text Collection.


1       Introduction

Text corpora are the basis of natural language study, and we select medical corpora as
the domain of our research. It is important for everybody to understand healthcare
records and doctors' recommendations regarding their health or the health of their
family. At the same time, the Internet contains a huge amount of medical information,
from descriptions of diseases and symptoms to recommendations for prevention and
treatment, including descriptions of medicines. Medical texts often contain many special
terms and abbreviations, which complicate their understanding by ordinary people. While
working on the problem of medical text simplification [1], we faced a lack of suitable
data sets for conducting experiments with the Ukrainian language. Thus, the goal of this
study is to create a Ukrainian text corpus in the medical domain in order to study the
challenges of specialized text simplification.
   In a general sense, a corpus is a collection of interconnected documents in a natural
language. Work on a natural language model usually begins with a general or specialized
corpus. For example, many scientific papers are based on ready-made data sets; as a rule,
a text corpus from Wikipedia is used. However, the most successful language models are
often highly specialized for a particular domain. This leads to the task of creating and
developing a specialized text corpus. In this paper we investigate the issues of text
corpus creation in the medical domain as well as the particularities of the Ukrainian
language in the medical field.

    Copyright © 2020 for this paper by its authors.
    Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

    We can highlight some features of Ukrainian medical texts. Medical texts contain
many borrowed words, for instance from Latin. Terms and notions may have a variety of
synonyms, which makes the text even more complicated. Such domain-specific texts are
really difficult for ordinary people to comprehend. We suggest that non-expert readers
may need simplified medical texts in the following cases. Firstly, when a person does not
have a medical education, a simplified text may help them get an idea of what a medical
prescription actually means. For example, when a medical instruction prescribes some
examinations, the test name can be quite complex (such as biochemical blood assay), while
for the patient it means just a blood test that requires some particular preparation
beforehand. Secondly, a person may wish to get a more comprehensible explanation of the
diagnosis written in his/her medical records. A simplified text, in this case, may help
to specify the issue, clarify its features and possible consequences, and make the
recommendations understandable. Thirdly, a patient should be able to distinguish true
from fake information in Internet sources. Therefore, medical text simplification can
help people to follow the doctor’s instructions and obtain clear explanations of the
meaning of particular treatment methods or prescriptions.
    In addition, text simplification is quite important for automated text processing.
Original texts may be too complex for Natural Language Processing (NLP) techniques, so
the text has to be simplified before NLP algorithms are applied. Examples of such
problems include information retrieval and parsing, summarization and annotation,
machine translation, etc.
    This paper presents an empirical study of medical text corpus creation in order to
investigate the readability and comprehension of original medical texts in Ukrainian.


2      Background and Related Work

NLP is a field of computer science and computational linguistics that studies the
peculiarities of communication between software and humans by means of language. Many
areas in NLP relate to natural language understanding and the ability to derive meaning
from natural language input [1, 3, 6, 13]. One of the traditional NLP tasks is the
creation and development of text corpora. Many authors deal with the problems of corpus
linguistics [2, 4, 7, 14].
    Medical texts are very difficult to understand because they contain not only words
that are complex in the general sense but also many special terms and notions from the
medical domain. This makes the texts difficult to perceive and inconvenient to read.
Natural Language Processing techniques apply statistics, machine learning, deep
learning, and linguistics in order to solve such tasks. Linguistically complex tasks,
such as medical text understanding, are the most challenging because of their
specificity, and they require dedicated models and methods.
    We started our research with the task described in [1], where we investigated how a
linguistic approach can be applied to the identification of complex words. The subject
of our research is Ukrainian medical texts. In order to study medical text
simplification, we analyzed unified medical protocols and tested an NLP approach to
medical word identification [1].
    Medicine is not a new domain for NLP researchers. Some articles are devoted to the
creation of text resources for medical discourse. For instance, user opinion mining is
considered in [5]. Opinion mining in the medical field has not been studied in depth.
Text resources for this task in the medical domain, such as polarized lexicons, are
presented in [5].
    A method for comparing simplified and original texts is presented in [6]. The goal
of that paper is to align a parallel text corpus that consists of news in Spanish and
their simplified versions. The proposed algorithm is used to create a corpus for the
study of text simplification.
    The paper [2] presents work in progress on an annotated text corpus for Latvian. An
important aspect considered in [2] is the variety and balance of the corpus in terms of
genres, authors, and lexical units. The paper [3] considers issues of corpus creation as
well. The framework presented there provides several built-in NLP tools to automatically
preprocess texts and is highly customizable [3].
    The development of a doctor-patient dialogue corpus to support a speech-to-speech
machine translation effort for English-Persian medical dialogues is described in the pa-
per [4]. The described corpus was developed by recording and transcribing dialogues
in English, and then translated into Persian. The authors highlight the benefits and
drawbacks of creating a corpus in this way. Benefits include the ability to customize
the corpus in a way that would be infeasible for actual doctor-patient data and avoidance
of privacy and legal issues, while drawbacks include the fact that the Persian does not
originate as speech, but as text translation of English speech [4].
    Another side of corpus linguistics concerns low-resource languages. Building
language corpora for low-resource languages is challenging because of limited digitized
texts [7]. Language corpora are needed for building information retrieval services such
as search and translation and to support further online content creation. A novel
solution to source relevant multilingual content is proposed in [7]. Data is gathered by
crowdsourcing translations via an online competitive game in which participants are paid
for their contributions [7].
    Some researchers stress the significance of combining results from different study
areas. As shown in [8], agglomerating results from studies of individual biological
components may lead to biomedical discovery and holds the promise of therapeutic
development. Such knowledge integration could be facilitated by automated text mining.
The paper [8] notes that the creation of appropriate datasets is hampered by the absence
of a resource for launching a distributed annotation effort, as well as by the lack of a
standardized annotation schema. An annotation schema and the corresponding tool are
proposed in [8]. These proposals can be widely adopted so that the resulting annotated
corpora from a multitude of disease studies are assembled into a unified benchmark
dataset.
    The paper [9] is devoted to studying some cultural peculiarities of rendering English
texts into Ukrainian. The authors of the study [9] emphasize the importance of the ex-
perience exchange in order to strengthen the ties with economically developed coun-
tries, as well as to improve the level of professional and ethical training of current and
future physicians. The studied type of text combines the features of both medical and
moral-ethical discourses, thus causing some difficulties in the adequate translation from
English into Ukrainian.
    The purpose of the study [10] is to assess the readiness of Ukrainian psychiatrists
to introduce a new classification reflecting the latest trends in the integration of
modern neurobiological research into clinical practice. The new terminology and
classification in this domain are studied on the basis of survey results from European
countries and Ukraine. The authors point out the relevance and precision of the terms
used in the medical domain [10], which highlights the relevance of our research
direction.
    An analysis of the references shows that little attention is paid to the creation of
linguistic corpora, especially for the Ukrainian language and in the medical domain.
However, many authors note the importance of studying medical texts. Therefore, in this
paper we focus on the process of text corpus creation. The aim of this work is a medical
text corpus in Ukrainian that can serve as a reference data set for further study.


3      Collection and Processing of Texts

3.1    Corpus Creation

Creating a quality text corpus is a challenge because the initial data significantly
influence the processing results. Corpus linguistics offers no default technique for
creating a text data set. We propose our own way of gathering texts in the field of
medicine in order to build the medical text corpus. The main requirement for the corpus
is the ability to provide data for the study of language issues. To create a medical
corpus in Ukrainian, this study proposes the following pipeline (Fig. 1). Wikipedia
texts, texts from social networks, blogs, and digital libraries, as well as official
websites of medical clinics and the Ministry of Health, are considered as data sources.
The first step is identifying the data sources and estimating the availability of
Ukrainian medical texts in them.
   Data is collected under the assumption that three categories of text complexity can
be distinguished: complex, moderate, and simple texts. We rely on our own perception of
medical texts to divide them into these three categories. One of the most important
requirements of the second pipeline step is collecting the same amount of data in each
group. In addition, we try to use semantically close sources to collect text data. At
the pre-processing stage, the text is cleared of unnecessary characters and converted to
UTF-8 encoding. The third step includes such common preprocessing actions as
tokenization and normalization, and describes the texts with a set of statistical
indicators.
                             Fig. 1. Text corpus creation pipeline
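
   The cleaning, encoding conversion, and tokenization steps of the pipeline can be
illustrated with a minimal Python sketch. It is not the implementation used for UKRMED:
the fallback encoding (cp1251), the set of characters treated as unnecessary (Unicode
control and format characters), the token pattern, and all function names are
assumptions made for illustration.

```python
import re
import unicodedata

# Assumption: a token is a run of letters, optionally joined by an apostrophe or hyphen.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def to_utf8_text(raw: bytes, source_encoding: str = "cp1251") -> str:
    """Decode bytes as UTF-8, falling back to an assumed legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(source_encoding, errors="replace")


def clean(text: str) -> str:
    """Normalize Unicode, replace control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Unicode categories starting with "C" cover control, format and unassigned characters.
    text = "".join(ch if unicodedata.category(ch)[0] != "C" else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list[str]:
    """Return lower-cased word tokens; digits and punctuation are dropped."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]
```

   For example, `tokenize(clean(to_utf8_text(raw)))` turns the raw bytes of one document
into a list of normalized tokens that the later steps can count and tag.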

   The simplest and most common way to organize and manage a text corpus is to store
documents in a file system on disk. Placing the documents in subdirectories classifies
and separates them meaningfully according to the available meta-information, such as
dates or sources. Because each document is stored in a separate file, corpus reading
tools can quickly select different subsets of documents, and these subsets can be
processed in parallel, with each process receiving its own subset. We use this way of
storage in this research.
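
   A minimal sketch of such a layout is given below, assuming one subdirectory per
complexity category and one UTF-8 `.txt` file per document. The directory names, the
file extension, and the function names are hypothetical; the paper does not specify
them.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def list_documents(root: Path) -> dict[str, list[Path]]:
    """Map each category subdirectory to its sorted list of .txt files."""
    return {
        sub.name: sorted(sub.glob("*.txt"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def read_document(path: Path) -> str:
    """Read a single document stored as UTF-8 text."""
    return path.read_text(encoding="utf-8")


def read_category(root: Path, category: str, workers: int = 4) -> list[str]:
    """Read all documents of one category, one file per task, preserving file order."""
    paths = list_documents(root)[category]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_document, paths))
```

   Threads are used here because reading files is I/O-bound; for CPU-heavy processing
of each subset, a process pool would be the more natural choice.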
   The pre-validation of the text data is carried out manually, and some texts can be
returned to the pre-processing step. We use a crowdsourcing technique for manual data
validation. At the processing stage, we determine the parts of speech and obtain the
statistical descriptors of the collected texts. The final step is the validation of the
collected text data. First of all, we check the POS tags. We also pay attention to the
balance of the text corpus and analyze its statistical characteristics. Prepared and
marked-up texts are placed in the repository.

3.2    Corpus Description
Our corpus (UKRMED), the UKRainian MEDicine text corpus, combines three medical text
genres: clinical protocols (“Complex texts”), medical forums (“Simple texts”), and texts
from Wikipedia (“Moderate texts”). The clinical protocols and other complex texts are
taken from the official website of the Ministry of Health of Ukraine
(https://guidelines.moz.gov.ua), dissertations, and scientific papers
(http://idvamnu.com.ua/journal). The texts of the “Simple texts” category are taken from
such sites as KinesisLife (https://kinesislife.ua), hospital sites
(https://gynecology.kyiv.ua), Dr. Komarovsky (https://komarovskiy.net), etc. The data of
the “Moderate texts” category is taken from Ukrainian Wikipedia
(https://uk.wikipedia.org/wiki/).
   A quantitative report in terms of the number of sentences and text tokens
distributed over the different genres is given in Table 1.

Table 1. Text Genre Distribution and Quantitative Data of the UKRMED Medical Text Corpus.

        Text Category                        Number of sentences Number of tokens
        Complex texts                        26,730                    329,837
        Simple texts                         25,395                    320,209
        Moderate texts                       27,081                    363,539
        Total                                79,209                    1,013,585
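
   The balance requirement of the pipeline can be checked directly on the token counts
of Table 1. The sketch below is only an illustration: the function names and the choice
of "deviation from an even split" as the balance measure are assumptions, not a
procedure described in the paper.

```python
# Token counts copied from Table 1.
TOKENS = {
    "Complex texts": 329_837,
    "Simple texts": 320_209,
    "Moderate texts": 363_539,
}


def token_shares(counts: dict[str, int]) -> dict[str, float]:
    """Return each category's share of all tokens, in percent."""
    total = sum(counts.values())
    return {name: 100 * n / total for name, n in counts.items()}


def max_imbalance(counts: dict[str, int]) -> float:
    """Largest deviation (in percentage points) from a perfectly even split."""
    even = 100 / len(counts)
    return max(abs(share - even) for share in token_shares(counts).values())
```

   With the counts above, each category holds roughly a third of the tokens, and the
largest deviation from an even split is under three percentage points.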

   This corpus was created for the experiments with medical text simplification that
were started in [1], and for experiments with readability metrics [11, 12]. We selected
a set of feature indices that are calculated for our text corpus. These features are
defined in Table 2. Table 3 shows the corresponding values for the features from Table 2
except “Percentage of speech parts”. The analysis of the parts of speech is considered
separately.

                                 Table 2. Statistical text features.

 Text Level                          Feature                                  Abbreviation
 Macro level (paragraph level)       Text length in tokens                    TLT
                                     Text length in letters                   TLL
 Syntactic level                     Average sentence length in tokens        ASLT
                                     Average sentence length in letters       ASLL
 Lexical level                       Average token length in symbols          ATLS
                                     Percentage of unique words               US
                                     Percentage of monosyllables              M
                                     Average word repetition rate             AWR
                                     Percentage of speech parts               PSP
                                     (nouns, verbs, adjectives etc.)
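
   Most of the features in Table 2 can be computed from a tokenized text, as in the
sketch below. The exact definitions used for UKRMED are not given in the paper, so the
ones here are assumptions and need not reproduce the reported values: US is taken as the
share of distinct lower-cased tokens, a monosyllable is a token with exactly one
Ukrainian vowel letter, and token length counts every character of the token. AWR and
PSP are left out because their definitions would have to be guessed.

```python
VOWELS = set("аеєиіїоуюя")  # Ukrainian vowel letters


def syllables(word: str) -> int:
    """Approximate the syllable count by counting vowel letters."""
    return sum(ch in VOWELS for ch in word.lower())


def text_features(sentences: list[list[str]]) -> dict[str, float]:
    """Compute assumed versions of TLT, TLL, ASLT, ASLL, ATLS, US and M.

    `sentences` is a list of sentences, each given as a list of word tokens.
    """
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("empty text")
    letters = sum(len(tok) for tok in tokens)
    types = {tok.lower() for tok in tokens}
    mono = sum(syllables(tok) == 1 for tok in tokens)
    return {
        "TLT": len(tokens),
        "TLL": letters,
        "ASLT": len(tokens) / len(sentences),
        "ASLL": letters / len(sentences),
        "ATLS": letters / len(tokens),
        "US": 100 * len(types) / len(tokens),
        "M": 100 * mono / len(tokens),
    }
```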

   As Table 3 shows, the TLT, TLL, ASLT, ATLS, and US indices have very close values
across the categories. The ASLL and M features differ slightly. These differences are
not intuitive: the average sentence length is largest for “Moderate texts”, and the
share of monosyllables (for example, “жар”/fever, “корь”/measles) is largest for
“Complex texts”.
                          Table 3. Values of statistical text features.

       Text Feature    Complex texts          Simple texts              Moderate texts
       TLT                    329,837                  320,209               363,539
       TLL                  2,051,544                1,925,313             2,313,698
       ASLT                      7.63                     7.23                  7.67
       ASLL                     76.75                    75.81                 85.44
       ATLS                      5.87                     5.65                  6.03
       US                          4%                       5%                    4%
       M                          40%                      38%                   35%
       AWR                      11.34                    10.35                 11.62


   The main purpose of future work with UKRMED is the analysis of readability and text
simplification in the medical domain. With these plans in mind, we take the values of
average token length in symbols (ATLS) for all the categories (Fig. 2) and analyze them.
Fig. 2 shows that the “Simple texts” category has the smallest average word length,
while “Moderate texts” and “Complex texts” have similar values of token length. This
observation will help our future research on text simplification and the assessment of
text difficulty.


                           Fig. 2. ATLS values for all categories.

3.3    Part-of-Speech Tagset
We used the Pymorphy2 POS tagger (https://pymorphy2.readthedocs.io/en/latest/index.html)
to analyze the parts of speech in our corpus. Part of the tagset used for the POS
annotation of the UKRMED Medical Text Corpus is presented in Table 4.
                      Table 4. The tagset of the Pymorphy2 POS tagger.

                              POS Tag    Value
                              NOUN       noun
                              ADJF       adjective (full form)
                              COMP       comparative
                              VERB       verb (personal form)
                              GRND       gerund
                              NUMR       numeral
                              ADVB       adverb
                              NPRO       pronoun
                              PRED       predicative
                              PREP       preposition
                              CONJ       conjunction
                              INTJ       interjection
                              PRCL       particle
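
   The tagging step can be sketched as follows. The sketch assumes that the `pymorphy2`
package and its Ukrainian dictionaries (`pymorphy2-dicts-uk`) are installed and that the
first candidate analysis of each token is used; the paper does not state how a single
analysis was selected. The helper names and the "None" bucket for tokens without a POS
value are also assumptions.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def make_pymorphy2_tagger() -> Callable[[str], Optional[str]]:
    """Build a tagger backed by pymorphy2 with Ukrainian dictionaries."""
    import pymorphy2  # imported lazily so the rest of the sketch runs without it

    morph = pymorphy2.MorphAnalyzer(lang="uk")

    def tag(token: str) -> Optional[str]:
        # parse() returns a list of candidate analyses; take the first one.
        # tag.POS is None when the analyzer assigns no part of speech.
        return morph.parse(token)[0].tag.POS

    return tag


def pos_counts(tokens: Iterable[str], tag: Callable[[str], Optional[str]]) -> Counter:
    """Count POS tags; tokens without a POS value are counted under 'None'."""
    return Counter(tag(tok) or "None" for tok in tokens)
```

   Calling `pos_counts(tokens, make_pymorphy2_tagger())` for the tokens of one category
yields per-tag counts of the kind reported in Table 5. Because only one analysis per
token is kept, ambiguous word forms can receive the wrong tag.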


   As a result of our experiments, we obtained the distribution of parts of speech for
our three categories (Table 5). Table 5 shows that the most numerous parts of speech are
NOUN, ADJF, and VERB. The other parts of speech are rather rare in comparison.

                            Table 5. Parts of speech assignment.

POS Tag          Complex texts       Simple texts        Moderate texts     Total
NOUN                130,467             119,641             148,749        398,857
ADJF                 57,300              42,173                 61,107     160,580
VERB                 21,973              36,063                  31,013    89,049
INTJ                   15                  24                     11         50
NPRO                 7,854               15,762                  13,607    37,223
ADVB                 7,431               13,371                 11,550     32,352
None                 3,605                7,091                  2,637     13,333
PRCL                  805                 2,124                  1,529      4,458
CONJ                  928                 1,426                  1,839      4,193
PREP                 1,079                 826                   1,416      3,321
PRED                  451                  910                    561       1,922
GRND                  466                  715                    595       1,776
NUMR                  426                  550                    643       1,619
COMP                 1,611                  -                      -        1,611


   We also found a fairly large category, None. It contains words that the morphological
analyzer could not analyze.
   In our work, we analyzed the most common words for each category. For this, at the
pre-processing stage, all tokens were lemmatized using Pymorphy2. The results are
presented in Table 6.

                       Table 6. Lemma statistics for each category
                      (Lemma – POS – Frequency – Relative frequency).

Complex texts                               Simple texts                                Moderate texts
лікування/treatment - NOUN - 1977 - 0.6     який/which/who - NPRO - 2629 - 0.82         який/which/who - NPRO - 2826 - 0.78
який/which/who - NPRO - 1658 - 0.5          цей/this - NPRO - 1551 - 0.48               цей/this - NPRO - 1374 - 0.38
пацієнт/patient - NOUN - 1453 - 0.44        захворювання/disease - NOUN - 1332 - 0.42   може/maybe - None - 1338 - 0.37
хворий/sick - ADJF - 1301 - 0.39            може/maybe - None - 1198 - 0.37             лікування/treatment - NOUN - 1256 - 0.35
дослідження/research - NOUN - 1002 - 0.3    такий/such - NPRO - 1160 - 0.36             також/also - CONJ - 1240 - 0.34
захворювання/disease - NOUN - 925 - 0.28    лікування/treatment - NOUN - 1108 - 0.35    інший/other - NPRO - 1198 - 0.33
цей/this - NPRO - 907 - 0.27                шкіра/skin - NOUN - 1068 - 0.33             захворювання/disease - NOUN - 1162 - 0.32
після/after - ADVB - 865 - 0.26             весь/all - NPRO - 971 - 0.3                 такий/such - NPRO - 1106 - 0.3
мати/have - NOUN - 849 - 0.26               вони/they - NPRO - 942 - 0.29               мати/have - NOUN - 1017 - 0.28
даний/given - ADJF - 754 - 0.23             про/about - NOUN - 874 - 0.27               випадок/happening - NOUN - 982 - 0.27
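
   The relative frequencies in Table 6 are consistent with the lemma frequency divided
by the token count of the category from Table 1, expressed in percent (for example,
1977 / 329,837 ≈ 0.6%). Under that reading, a top-n lemma list can be produced with the
following sketch; the function name and the rounding to two decimals are assumptions,
and the POS column is omitted.

```python
from collections import Counter


def top_lemmas(lemmas: list[str], n: int = 10) -> list[tuple[str, int, float]]:
    """Return (lemma, frequency, relative frequency in %) for the n most common lemmas."""
    total = len(lemmas)
    return [
        (lemma, count, round(100 * count / total, 2))
        for lemma, count in Counter(lemmas).most_common(n)
    ]
```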

    The table shows the 10 most frequent lemmas for each category. Table 6 shows that
some of the words are repeated across categories and some are not. Words that are
repeated in all categories are highlighted in blue, those found in “Complex texts” and
“Moderate texts” in red, and those found in “Simple texts” and “Moderate texts” in
green.
    The category “Complex texts” is well characterized by such words as “treatment”,
“patient”, “sick”, and “research”, which are at the top of its frequency list. The word
“which” is frequent in all three categories because, in our opinion, sentences in the
medical field are quite long and require pronouns and subordinate clauses. Relative
pronouns (for example, “який”/“which/who”) serve as connectives that join subordinate
clauses to main ones.
    The word “може”/“can”, which is frequent in “Moderate texts” and “Simple texts”,
appears because the discussion of diseases, treatment, or diagnosis in these texts is
advisory. Frequent pronouns, in turn, make it possible to point to objects in the speech
situation; they connect the text and help to avoid repetition. The word “мати” (which
can be the noun “mother” or the verb “have”) was mislabeled: in our case it is a verb,
not a noun. Therefore, we also face a disambiguation problem.


4      Conclusion and Future Work

Medical texts include drug packages, medical records, fact sheets, medical reference
books, training materials, certificates, etc. To solve the problem of medical text
simplification, it is first necessary to single out the features of such texts. In this
study, we rely on the texts of medical clinical protocols. In order to accelerate the
development and implementation of state standards in the field of health, the Ministry
of Health of Ukraine approves medical and technological documents on the basis of
evidence-based medicine. Such documents include a unified clinical protocol for medical
care, as well as an adapted clinical trial that is based on evidence. Depending on the
disease, the plan of treatment and preventive measures may differ, which is also
prescribed by legislation in the local protocols of prevention and treatment.
   The main idea of our research is that the simplification of a medical text depends on
the complexity of the text and on the stakeholder who reads it. For patients, such parts
of the protocol as the passport of the protocol or the list of references can be
omitted. The parts of the protocol that describe the symptoms of the disease, the
epidemiology, the necessary actions of the doctor and, especially, the recommendations
are of the greatest interest to patients. It should also be noted that all medical
records are provided in the state language. As a result, the medical text is replete not
only with special Latin terms, but also with complex medical words in the Ukrainian
language. Linguistic analysis of Ukrainian text is a daunting task. In this case, the
problem is complicated by the huge amount of medical terminology. At the same time, the
text also contains words from the subject area that do not require simplification.
   In this paper, we described the structure and the quantitative features of UKRMED,
an annotated Ukrainian text corpus that contains three classes of medical texts. More
broadly, language resources of this kind are a valuable asset for up-to-date research
and for the development of effective technologies targeted at the medical domain. Future
work will focus on medical fact extraction from official documents, Ukrainian text
simplification in the medical field, the development of question answering services,
etc.


References
 1. Cherednichenko, O., Kanishcheva, O., Babkova, N.: Complex term identification for
    Ukrainian medical texts. In: Proc. of the 1st International Workshop on Informatics & Data-
    Driven Medicine (IDDM 2018), Vol. 2255, pp. 146–154, CEUR-WS (2018).
 2. Gruzitis, N., Pretkalnina, L., Saulite, B., Rituma, L., Nespore-Berzkalne, G., Znotins, A.,
    Paikens, P.: Creation of a balanced state-of-the-art multilayer corpus for NLU. In: Proc. of
    the 11th International Conference on Language Resources and Evaluation, pp. 4506–4513
    (2019).
 3. Janssen, M.: TEITOK: Text-faithful annotated corpora. In: Proc. of the 10th International
    Conference on Language Resources and Evaluation, LREC 2016, pp. 4037–4043 (2016).
 4. Belvin, R. S., May, W., Narayanan, S., Georgiou, P., Ganjavi, S.: Creation of a doctor-pa-
    tient dialogue corpus using standardized patients. In: Proc. of the 4th International Confer-
    ence on Language Resources and Evaluation, LREC 2004, pp. 187–190 (2004).
 5. Goeuriot, L., Na, J. C., Kyaing, W. Y. M., Khoo, C., Chang, Y. K., Theng, Y. L., Kim, J. J.:
    Sentiment lexicons for health-related opinion mining. In: Proc. of the 2nd ACM SIGHIT
    International Health Informatics Symposium, pp. 219–225 (2012).
 6. Bott, S., Saggion, H.: An Unsupervised Alignment Algorithm for Text Simplification Cor-
    pus Construction. In: Proc. of the Workshop on Monolingual Text-To-Text Generation,
    pp. 20–26 (2011).
 7. Packham, S., Suleman, H.: Crowdsourcing a text corpus is not a game. In: Lecture Notes in
    Computer Science, Vol. 9469, pp. 225–234 (2015).
 8. Cano, C., Monaghan, T., Blanco, A., Wall, D. P., Peshkin, L.: Collaborative text-annotation
    resource for disease-centered relation extraction from biomedical text. Journal of Biomedi-
    cal Informatics, 42(5), pp. 967–977 (2009).
 9. Velychenko, O., Popova, O.: Cross-cultural specificities of rendering texts on medical ethics
    in Ukrainian translation. Naukovy Visnyk of South Ukrainian National Pedagogical Univer-
    sity Named after K. D. Ushynsky: Linguistic Sciences, 2019(29), pp. 36–50 (2019).
10. Zubatiuk, O., Nosova, E.: Neuroscience-based nomenclature in Ukraine. European Neuro-
    psychopharmacology, 27, S663 (2017).
11. Wermter, J., Hahn, U.: An Annotated German-Language Medical Text Corpus as Language
    Resource. In: Proc. of the 4th International Conference on Language Resources and
    Evaluation, LREC 2004, pp. 473–476 (2004).
12. Readability formulas (title from screen), https://readable.com/features/readability-formulas/
    last accessed 2020/04/12.
13. Vysotska, V., Lytvyn, V., Burov, Y., Gozhyj, A., Makara, S.: The consolidated information
    web-resource about pharmacy networks in city. In: Proc. of the 1st International Workshop
    on Informatics & Data-Driven Medicine (IDDM 2018), CEUR Workshop Proceedings,
    pp. 239–255 (2018).
14. Lytvyn, V., Burov, Y., Kravets, P., Vysotska, V., Demchuk, A., Berko, A., Ryshkovets, Y.,
    Shcherbak, S., Naum, O.: Methods and Models of Intellectual Processing of Texts for
    Building Ontologies of Software for Medical Terms Identification in Content
    Classification. In: Proc. of the 2nd International Workshop on Informatics & Data-Driven
    Medicine (IDDM 2019), CEUR Workshop Proceedings, Vol. 2362, pp. 354–368 (2019).