<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Complex Term Identification for Ukrainian Medical Texts</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>2, Kyrpychova str., 61002 Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Medical texts are difficult to understand because they contain not only words that are complex in the general sense but also many special terms and notions. This makes texts in the medical domain hard for non-experts to follow. Natural language processing brings together statistics, machine learning, deep learning, and linguistics to address such tasks. Linguistically complex tasks, such as medical text understanding, are among the most challenging because they require linguistic intuition. In this paper, we study how a linguistic approach can be applied to the identification of complex words. To study medical text simplification, we analyze unified medical protocols as a case and test an NLP approach to identifying medical words. A dataset of medical protocols from official resources is developed, and the features of special medical words are studied.</p>
      </abstract>
      <kwd-group>
        <kwd>Term Identification</kwd>
        <kwd>Medicine Text</kwd>
        <kwd>Text Simplification</kwd>
        <kwd>Explanation</kwd>
        <kwd>NLP</kwd>
        <kwd>Ukrainian Texts</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Medicine, biology, pharmacology and other neighboring areas are well known to be
overloaded with complex terms and notions. Medical texts contain many borrowed
words, for instance from Latin. Moreover, terms may have multiple
synonyms, which makes the texts even more complicated. Non-expert readers may
benefit from simplified medical texts in several situations. Firstly, when a person
does not have a medical education, a simplified text may help them get an
idea of what some medical instructions actually mean. For example, when a medical
prescription orders some tests, the test name can be quite complex (such as
biochemical blood assay), while for the patient it means just a blood test that requires
some particular preparation. Secondly, a person may wish to obtain a more
comprehensible explanation of his/her diagnosis as written in the medical assessment
report. A simplified text, in this case, may help to understand the character of the disorder
but not the accurate and detailed diagnosis. Thirdly, a prescription from the physician
that contains many complex medical terms may lead to
confusion. Text simplification can therefore help a person to follow the doctor’s instructions
and get clear explanations of what particular treatment methods mean.</p>
      <p>Medical information is available on the web in the information systems of medical
establishments, healthcare portals, medical libraries, etc. More and more non-expert
readers would like to use this information. The digital character and availability of this
data, together with a broad range of potential readers, motivate the development of an
information system for medical text simplification.</p>
      <p>
        The text simplification problem arises when an initial text has to be modified in order
to make it more readable and understandable for the audience [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. There can be different
reasons why the initial text is unsuitable for use [2]. The most probable causes
are the complexity of the text itself or the peculiarities of the audience that wants
to use the text. In either case, simplification of texts requires modern methods of
language processing and data analysis [3, 4]. The difficulty of a text’s syntactic and
lexical structure may cause many inconveniences for readers. For example, the text
might be too complex for people with specific disabilities and disorders that
make it hard to perceive the information. Other categories of potential readers who may
face problems with complex texts include children, people with low literacy,
language learners, etc. For all of them, a simplified text would be a good way to grasp
and use the textual information.
      </p>
      <p>Another area where the problem of text simplification becomes essential is
machine processing of textual data. The original texts may be too complex for
Natural Language Processing (NLP) techniques. Therefore, in order to solve some
problems of language processing, it would be convenient to apply NLP algorithms to
text that has already been simplified. Such problems include information
retrieval and parsing, information summarization and annotation, machine translation,
etc.</p>
      <p>This paper presents an empirical study of medical information simplification aimed at
increasing the readability and comprehensibility of original medical texts.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>Text simplification is the process of decreasing the linguistic complexity of a text
while retaining the original information and meaning. It can
be used for many purposes: supporting second language learners, preprocessing in pipelines,
assistive technology, and automatic extraction of data from doctors’ notes, laboratory
results and other medical documents [5-7]. Different kinds of text simplification are also
used to convert textual information into computable forms for tasks such as
classification, clustering, sentiment analysis, recommendation and information retrieval.</p>
      <p>Current automatic lexical simplification systems either lack sufficient
coverage (supervised approaches) or perform only one-to-one word substitutions, so
they cannot simplify longer lexical phrases and do not perform any kind of word
reordering [8]. In [9], an initial dataset for automated text simplification was created using
a refined set of operationalized guidelines for manual simplification, and a
methodology for expanding the dataset was developed. An adaptation of statistical machine
translation to text simplification, which takes advantage of large-scale paraphrases
learned from bilingual texts and a small number of manual simplifications with multiple
references, is presented in [10]. Semantic term weighting, which considers term
meanings, is significant for specific applications of machine learning [4].</p>
      <p>The use of text simplification as a pre-processing step for statistical machine
translation of grammatically complex under-resourced languages is explored in [11]. The
experiments on English-to-Serbian translation show that this approach can improve the
grammaticality of the translation output and reduce technical post-editing effort (the number of
post-edit operations). Simplification can be applied on the lexical, syntactic, and discourse
levels [12]. Some lexical plugins [13] allow the use of different synonyms in order to
avoid repetition; for syntactic simplification, they let the user see all the
conjunctions in the text and separate complex sentences with a few simple clicks.</p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>Medical texts include drug packages, medical records, fact sheets, medical reference
books, training materials, certificates, etc. To solve the problem of
simplifying a medical text, it is first necessary to single out the features of such texts. In this
study, we rely on the texts of medical clinical protocols. In order to accelerate the
development and implementation of state standards in the field of health care, the Ministry
of Health of Ukraine approves medical and technological documents on the basis of
evidence-based medicine. Such documents include a unified clinical protocol for
medical care, as well as an adapted clinical trial that is based on evidence. Depending on the
disease, the plan of treatment and preventive measures may differ, which is also
set out in legislation and in the local protocols of prevention and treatment.</p>
      <p>In the modern conditions of Ukraine, which is actively integrating into the
European and world community, health care reforms have been adopted. These reforms
provide a legal basis for evidence-based medicine. A universal clinical protocol provides
the foundation for the functioning of medical institutions and private doctors.
Therefore, in this work, it is the texts of medical protocols that are used to construct the corpus
of medical texts for the analysis of linguistic complexity.</p>
      <p>At the first stage, a typical medical protocol is considered. Let us take, as an
example, a protocol for the prevention of cardiovascular disease [14]. According to the
order of the Ministry of Health of Ukraine, this protocol defines the signs and criteria for
the diagnosis of the disease; the conditions in which medical aid should be provided;
a diagnostic program consisting of compulsory and additional examinations; a medical
program; and recommendations for further medical care and prevention.</p>
      <p>Clinical protocols, as well as other guidance documents, are publicly available on
the Internet [14]. They are a guide for medical practitioners and for the administration of
health facilities, and also a source of information for patients. Before approval, every
protocol passes several rounds of collective expert review. The data entered in a protocol meet
medical standards at the national and international levels.</p>
      <p>The analysis shows that all protocols have a common structure: introduction,
abbreviations, passport part, general part, main part, description of the stages of medical care,
resource support for the implementation of the protocol, indicators of the quality of
medical care, a list of references and appendixes.</p>
      <p>The main idea of our research is that the simplification of a medical text depends on
the complexity of the text and on the stakeholder who reads it. For patients,
such parts of the protocol as the passport of the protocol or the list of references can be
omitted. The parts of greatest interest to patients are those that describe the symptoms of the
disease, the epidemiology, the necessary actions of the doctor and, especially, the
recommendations. It should also be noted that all medical records
are provided in the state language. As a result, the medical text is replete not only with
special Latin terms but also with complex medical words in the Ukrainian language.
Linguistic analysis of Ukrainian text is a daunting task. In this case, the
problem is complicated by the huge amount of medical terminology. At the same time,
the text also contains words from the subject area that do not require simplification.
The proposed pipeline (Fig. 1) is quite general for this kind of task.</p>
      <p>We would like to emphasize the step of complex word identification, which is based on
feature extraction. We propose to use morphological templates to identify complex
medical words. The aim of this empirical study is to implement this approach to complex
word identification for Ukrainian medical texts.</p>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <p>Dataset.
We use twenty existing clinical protocols [14] for our experiments, which contain
more than 2,500 thousand words. The protocols were taken from the site "Register of
medical-technological documents" (http://mtd.dec.gov.ua) and cover several specialties,
such as dermatovenereology, genetics, gastroenterology, dermatology, allergology, and
hematology. The protocol texts have several obligatory parts; inside the
paragraphs, however, they are weakly structured. An example of the original text is presented in
Fig. 2.</p>
      <p>In Fig. 2 we marked the following parts of the document: yellow marks abbreviations
(АГ); blue marks complex medical words (кортикостероїдів/ corticosteroids,
гепатопротекторів/ hepatoprotectors); and pink marks special medical terms (цироз
печінки/ liver cirrhosis, портальна гіпертензія/ portal hypertension).</p>
      <p>We used the pymorphy2 morphological analyzer
(https://github.com/kmike/pymorphy2) and a list of stopwords to preprocess the texts.
Our list of stopwords was compiled from different resources and contains such items as
pronouns, adjectives, prepositions, interjections, suffixes, letter combinations,
etc. As a first step, we tried to use simple frequency for term identification.
Examples of the most and least frequently used words are presented in Table 1 and
Table 2, respectively. The results were very poor: the words we needed were scattered across
different parts of the term list.</p>
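      <p>The frequency step described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function name rank_by_frequency, the regular-expression tokenizer and the sample stopword set are assumptions, and lemmatization with a morphological analyzer is omitted for brevity.</p>
      <preformat>
```python
import re
from collections import Counter


def rank_by_frequency(text, stopwords):
    """Return (word, count) pairs ordered from most to least frequent."""
    # Lowercase the text and treat runs of letters as tokens; an inner
    # apostrophe or hyphen keeps words such as "здоров’я" in one piece.
    tokens = re.findall(r"[^\W\d_]+(?:[’'-][^\W\d_]+)*", text.lower())
    # Drop stopwords before counting (hypothetical stopword list).
    counts = Counter(tok for tok in tokens if tok not in stopwords)
    return counts.most_common()


# Hypothetical sample sentence and stopword set, for illustration only.
sample = "Цироз печінки і гепатит. Гепатит та цироз печінки, гепатит."
ranked = rank_by_frequency(sample, {"і", "та"})
```
      </preformat>
      <p>Reading the ranked list from either end gives the most and least frequent candidates; as noted above, frequency alone did not separate special terms from general vocabulary.</p>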
      <p>Table 2 contains not only special medical terms such as сероконверсія
(seroconversion) and гломерулонефрит (glomerulonephritis) but also words of general
medical vocabulary such as вагітність (pregnancy) and спектроскопія (spectroscopy). Thus, we
cannot use simple frequency to find special medical terms.</p>
      <p>As the next step, we selected nouns from the general frequency lexicons. This
allowed us to find terms in the text that have the following characteristic structures:
noun + noun or adjective + noun (adjective + adjective + noun). Lexicons of
compound terms for the noun + noun and adjective + noun constructs are presented in Table 3 and Table 4, respectively.</p>
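      <p>A template search of this kind can be sketched as a sliding window over part-of-speech tags. The sketch below is an illustration under stated assumptions, not the implementation used in the study: the input is assumed to be already tagged, the labels NOUN and ADJF are assumed to follow the OpenCorpora-style tags reported by pymorphy2, and the names TEMPLATES and match_templates are hypothetical.</p>
      <preformat>
```python
# Part-of-speech templates from the text: adjective + adjective + noun,
# adjective + noun, and noun + noun. The tag names are an assumption.
TEMPLATES = [
    ("ADJF", "ADJF", "NOUN"),
    ("ADJF", "NOUN"),
    ("NOUN", "NOUN"),
]


def match_templates(tagged, templates=TEMPLATES):
    """Collect phrases whose tag sequence equals one of the templates.

    `tagged` is a list of (word, pos) pairs in text order.
    Returns a list of (phrase, template) pairs.
    """
    found = []
    for start in range(len(tagged)):
        for template in templates:
            window = tagged[start:start + len(template)]
            # A window cut short at the end of the text cannot match.
            if tuple(pos for _, pos in window) == template:
                phrase = " ".join(word for word, _ in window)
                found.append((phrase, template))
    return found


# Hypothetical pre-tagged input, for illustration only.
tagged_sample = [
    ("хронічний", "ADJF"),
    ("вірусний", "ADJF"),
    ("гепатит", "NOUN"),
    ("печінки", "NOUN"),
]
candidates = [phrase for phrase, _ in match_templates(tagged_sample)]
```
      </preformat>
      <p>Overlapping matches are all kept, so a longer phrase and the shorter phrases inside it appear side by side; choosing between them would be a separate filtering step.</p>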
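      <table-wrap id="tab-3">
        <label>Table 3</label>
        <table>
          <thead>
            <tr><th>Term (Ukrainian)</th><th>English</th></tr>
          </thead>
          <tbody>
            <tr><td>охорона здоров’я</td><td>health care</td></tr>
            <tr><td>кваліфікація хвороб</td><td>qualification of diseases</td></tr>
            <tr><td>дитина-інвалід</td><td>disabled child</td></tr>
            <tr><td>інфаркт міокарда</td><td>myocardial infarction</td></tr>
            <tr><td>забезпечення якості</td><td>quality assurance</td></tr>
            <tr><td>синдром Дауна</td><td>Down syndrome</td></tr>
            <tr><td>стан здоров’я</td><td>health status</td></tr>
            <tr><td>хвороба Гоше</td><td>Gaucher's disease</td></tr>
            <tr><td>емфізема легень</td><td>emphysema of the lungs</td></tr>
            <tr><td>цироз печінки</td><td>liver cirrhosis</td></tr>
          </tbody>
        </table>
      </table-wrap>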
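      <table-wrap id="tab-4">
        <label>Table 4</label>
        <table>
          <thead>
            <tr><th>Term (Ukrainian)</th><th>English</th></tr>
          </thead>
          <tbody>
            <tr><td>практика сімейних</td><td>family practice</td></tr>
            <tr><td>контрольний рівень</td><td>control level</td></tr>
            <tr><td>хірургічне втручання</td><td>surgical intervention</td></tr>
            <tr><td>підвищений ризик</td><td>high risk</td></tr>
            <tr><td>шлунково-кишковий тракт</td><td>gastrointestinal tract</td></tr>
            <tr><td>хронічний кашель</td><td>chronic cough</td></tr>
            <tr><td>протозойна інфекція</td><td>protozoal infection</td></tr>
            <tr><td>бронхіальна астма</td><td>bronchial asthma</td></tr>
            <tr><td>вірусний гепатит</td><td>viral hepatitis</td></tr>
            <tr><td>церебральний параліч</td><td>cerebral palsy</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 5. Lexical templates: noun; noun + noun; adjective + noun; adjective + adjective + noun.</p>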
      <p>We then collected statistics on our lexical templates. This information
is presented in Table 5. The total number of nouns in our dataset documents is 5,973.</p>
      <p>We tried adjusting our templates to identify complex medical terms. The issue is the
quality and quantity of data in the templates. We used our testing data set to evaluate the
obtained results. The precision, recall and F-measure are presented in Table 6.</p>
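      <p>For reference, the three measures can be computed from a set of extracted terms and a set of reference terms as in the sketch below. The function name precision_recall_f1 and the sample sets are assumptions made for illustration; the F-measure is taken here to be the balanced F1 score, which the text does not state explicitly.</p>
      <preformat>
```python
def precision_recall_f1(predicted, gold):
    """Compare extracted terms with reference terms.

    precision = correct / extracted, recall = correct / reference,
    F1 = harmonic mean of precision and recall.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted.intersection(gold))
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical extracted and reference term sets, for illustration only.
extracted = {"цироз печінки", "вірусний гепатит", "стан здоров’я", "високий рівень"}
reference = {"цироз печінки", "вірусний гепатит", "бронхіальна астма"}
p, r, f = precision_recall_f1(extracted, reference)
```
      </preformat>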
      <p>Table 6. Lexical templates evaluated: noun; noun + noun; adjective + noun.</p>
      <p>Quality analysis of the experiment showed unsatisfactory results. The main reason
is that the analyzed data set is in Ukrainian, and the morphological analyzer
does not work well enough on Ukrainian texts. The list of stopwords was changed, but
this did not improve the results of the search for special words. The results indicate that the
templates occur often in the texts, but a more reliable morphological analyzer
is needed for their automatic extraction. In addition, it is reasonable to investigate sentence structure in medical
texts. The issues mentioned above are topics for future research.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Works</title>
      <p>In our work, analysis of the frequency dictionary showed that specific and complex
terms are used less frequently, which allowed us to discard the words with the highest
frequency of occurrence. Thus, the results of our experiments show that we can
distinguish two main groups that require simplification:</p>
      <p>1) Specific words (for example, glucocorticosteroid), which can be replaced by
simpler "spoken" synonyms;</p>
      <p>2) Compound terms, which can be simplified by clarifying their meaning. This will
influence the syntactic structure of a sentence.</p>
      <p>For the first group, it is advisable to use a dictionary of synonyms. To work
with the second group, it is necessary to create a special terminological ontology or
another lexical resource.</p>
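      <p>For the first group, dictionary-based replacement could look like the sketch below. It is only an illustration: the function name simplify_with_synonyms and the single dictionary entry are hypothetical, and matching is done on exact surface forms, so inflected Ukrainian words would first need to be reduced to their dictionary form.</p>
      <preformat>
```python
import re


def simplify_with_synonyms(text, synonyms):
    """Replace every word that has an entry in `synonyms` with that entry.

    `synonyms` maps a lowercase complex word to a simpler wording.
    Words without an entry are left unchanged.
    """
    def substitute(match):
        word = match.group(0)
        return synonyms.get(word.lower(), word)

    # Visit each run of letters; punctuation and spacing stay untouched.
    return re.sub(r"[^\W\d_]+", substitute, text)


# Hypothetical one-entry synonym dictionary, for illustration only.
lay_terms = {"glucocorticosteroid": "steroid hormone"}
simplified = simplify_with_synonyms("Take the glucocorticosteroid daily.", lay_terms)
```
      </preformat>
      <p>Compound terms from the second group span several words and would need phrase-level lookup, which this word-by-word sketch does not attempt.</p>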
      <p>The experiment shows that special medical texts, such as protocols, are written
according to a specific pattern, with the result that words and phrases such as "treatment",
"patient" or "Ukraine" are found hundreds of times more often than other nouns.
Among the words with an average frequency (about 100 in our experiment) we find
commonly known words (professor − 91, document − 91), complex words (clinical
examination − 91, syndrome − 178), and diagnoses (hepatitis − 123, tuberculosis
− 141). It should be noted that among the words with a low frequency (less than 60) the
share of special medical terms is 0.87. This supports the hypothesis that the frequency
of a word in the corpus of medical texts is an indicator of its complexity. At the same time,
words found in this way require additional study by other methods.
      </p>
      <p>2. Shardlow, M.: A Survey of Automated Text Simplification. In: International Journal of Advanced Computer Science and Applications, pp. 58-70 (2014)</p>
      <p>3. Falkenjack, J. et al.: Services for Text Simplification and Analysis. In: Proceedings of the 21st Nordic Conference of Computational Linguistics, pp. 309-313, Gothenburg, Sweden, 23-24 May (2017)</p>
      <p>4. Matsuo, R., Tu Bao Ho: Semantic Term Weighting for Clinical Texts. In: Expert Systems With Applications (2018) doi.org/10.1016/j.eswa.2018.08.028</p>
      <p>5. Popolov, D., Barr, J. R.: ‘Units of meaning’ in medical documents. In: IEEE International Conference on Semantic Computing, pp. 320-323 (2014) doi.org/10.1109/ICSC.2014.62</p>
      <p>6. Mukherjee, P. et al.: NegAIT: A new parser for medical text simplification using morphological, sentential and double negation. In: Journal of Biomedical Informatics 69, pp. 55-62 (2017)</p>
      <p>7. Sridevi, M., Arunkumar, B. R.: Information Extraction from Clinical Text using NLP and Machine Learning: Issues and Opportunities. In: International Journal of Computer Applications, pp. 11-16 (2016)</p>
      <p>8. Stajner, S., Saggion, H., Ponzetto, S. P.: Improving lexical coverage of text simplification systems for Spanish. In: Expert Systems with Applications, vol. 118, pp. 80-91 (2018)</p>
      <p>9. Djamasbi, S.: Improving Manual and Automated Text Simplification (2017)</p>
      <p>10. Xu, W. et al.: Optimizing Statistical Machine Translation for Text Simplification. In: Transactions of the Association for Computational Linguistics, vol. 4, pp. 401-415 (2016)</p>
      <p>11. Baltic, J.: Can Text Simplification Help Machine Translation. In: Modern Computing, vol. 4, No. 2, pp. 230-242 (2016)</p>
      <p>12. Stajner, S., Glavas, G.: Leveraging event-based semantics for automated text simplification. In: Expert Systems With Applications 82, pp. 383-395 (2017)</p>
      <p>13. Hervas, R. et al.: Integration of lexical and syntactic simplification capabilities in a text editor. In: Procedia Computer Science, 27, pp. 94-103 (2014)</p>
      <p>14. Registry of medical and technological documents http://mtd.dec.gov.ua/index.php/uk/</p>
      <p>15. Abrahamsson, E. et al.: Medical text simplification using synonym replacement: Adapting assessment of word difficulty to a compounding language. In: Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR) @ EACL 2014, pp. 57-65 (2014)</p>
      <p>16. Jackson, R. G., Patel, R., Jayatilleke, N. et al.: Natural language processing to extract symptoms of severe mental illness from clinical text: the Clinical Record Interactive Search Comprehensive Data Extraction (CRIS-CODE). In: project BMJ Open 2017 (2016) doi.org/10.1136/bmjopen-2016-012012</p>
      <p>17. Nojavan, F. et al.: Explanation of “taghtirolbol” in traditional medical texts Which one Dribbling or Pollakiurea. In: Journal of Islamic and Iranian Traditional Medicine, vol. 6, No. 2 (2015)</p>
      <p>18. Chen, J. et al.: A Natural Language Processing System That Links Medical Terms in Electronic Health Record Notes to Lay Definitions: System Development Using Physician Reviews. In: J Med Internet Res 2018 (2018) doi.org/10.2196/jmir.8669</p>
      <p>19. Salah, Ait-Mokhtar et al.: A Framework to Generate Sets of Terms from Large Scale Medical Vocabularies for Natural Language Processing. In: European FP7-ICT EURECA project: http://eurecaproject.eu/ (2015)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Siddharthan</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>A survey of research on text simplification</article-title>
          .
          <source>In: International Journal of Applied Linguistics</source>
          , Peeters Publishers, Belgium (
          <year>2014</year>
          )
          doi.org/10.1075/itl.165.2.06sid
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>