<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Collection and Processing of a Medical Corpus in Ukrainian</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Kharkiv National University of Radio Electronics</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>National Technical University “Kharkiv Polytechnic Institute”</institution>
          ,
          <addr-line>Kharkiv</addr-line>
          ,
          <country country="UA">Ukraine</country>
        </aff>
      </contrib-group>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Text corpora are the basis of natural language studies. We describe the structure of a Ukrainian-language corpus (UKRMED) that contains a variety of medical text genres (clinical protocols, blogs, and Wikipedia articles). The paper describes the process of collecting, creating, and processing a corpus of medical data in Ukrainian. We present our own framework for creating a text corpus. The corpus targets the medical domain and the task of text simplification. We give the statistical characteristics of the corpus, provide an analysis of the parts of speech, and analyze the most frequent lemmas of this medical corpus. The UKRMED corpus can be used for solving the task of natural language simplification.</p>
      </abstract>
      <kwd-group>
        <kwd>Medicine Corpus</kwd>
        <kwd>Corpus Linguistics</kwd>
        <kwd>Ukrainian</kwd>
        <kwd>Text Collection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Text corpora are the basis of natural language studies, and medical corpora
are the domain selected for this research. It is important for everybody to understand
healthcare records and doctors' recommendations regarding their own health or the health
of their family. At the same time, the Internet contains a huge amount of medical
information, from descriptions of diseases and symptoms to recommendations for
prevention and treatment, including descriptions of medicines. Medical texts
often contain many special terms and abbreviations, which complicate their
understanding by ordinary people. While working on the problem of medical text simplification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], we
faced a lack of suitable data sets for conducting experiments with the Ukrainian
language. Thus, the goal of this study is to create a Ukrainian text corpus in the
medical domain in order to study the challenges of specialized text simplification.
      </p>
      <p>In the general sense, a corpus is a collection of interconnected documents in a natural
language. The study of a natural language model usually begins with a general or
specialized corpus. For example, many scientific papers rely on
ready-made data sets; as a rule, a text corpus from Wikipedia is used. However, the most
successful language models are often highly specialized for a particular domain, which leads
to the task of creating and developing a specialized text corpus. In this paper we
investigate the issues of text corpus creation in the medical domain as well as the
particularities of the Ukrainian language in the medical field.</p>
      <p>We can highlight some features of Ukrainian medical texts. Medical texts contain
many borrowed words, for instance from Latin. Terms and
notions may have a variety of synonyms, which makes the text even more complicated. It
is really difficult for ordinary people to comprehend such domain-specific texts. We suggest
that non-expert readers may need simplified medical texts in the
following cases. Firstly, when a person does not have a medical education, a
simplified text may help them get an idea of what a medical prescription
actually means. For example, when a medical instruction prescribes an
examination, the test name can be quite complex (like biochemical blood assay), while for
the patient it means just a blood test that requires some particular preparation.
Secondly, a person may wish to get a more comprehensive explanation of
the diagnosis written in their medical records. A simplified text, in this case, may
help to specify the issue, clarify its features and possible consequences, and
make the recommendations understandable. Thirdly, a patient should be able to
distinguish true from fake information in Internet sources. Therefore, medical text
simplification can help people to follow the doctor’s instructions and obtain clear
explanations of particular treatment methods or prescriptions.</p>
      <p>In addition, text simplification is quite important for
automated text processing. Original texts may be too complex for Natural
Language Processing (NLP) techniques, so a pre-simplified text is required
before NLP algorithms are applied. Examples of such problems include information retrieval
and parsing, information summarization and annotation, and machine translation.</p>
      <p>This paper presents an empirical study of medical text corpus creation
aimed at investigating the readability and comprehension of original medical texts in
Ukrainian.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>
        NLP is a field of computer science and computational linguistics that studies the
peculiarities of communication between software and humans by means of language.
Many areas of NLP relate to natural language understanding and the ability to
derive meaning from natural language input [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3, 6, 13</xref>
        ]. One of the traditional NLP tasks
is the creation and development of text corpora. Many authors deal with the problems of
corpus linguistics [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4, 7, 14</xref>
        ].
      </p>
      <p>Medical texts are very difficult to understand because they contain not only
words that are complex in the general sense but also many special terms and notions from the
medical domain. This makes the texts hard to perceive and inconvenient to read.
Natural Language Processing techniques apply statistics, machine learning,
deep learning, and linguistics to solve such tasks. Linguistically complex
tasks, such as medical text understanding, are the most challenging because of their
specificity, and they require dedicated models and methods.</p>
      <p>
        We started our research with the task described in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], where we investigated how a
linguistic approach can be applied to the identification of complex
words. Our research focuses on Ukrainian medical texts. In order to study
medical text simplification, we analyze unified medical protocols and test the NLP approach
for medical word identification [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Medicine is not a new domain for NLP researchers. Some articles are devoted
to the creation of text resources specific to medical discourse. For instance, user opinion
mining is considered in [5]. Opinion mining in the medical field has not been
studied in depth. Text resources for the medical domain, such as polarized lexicons, are
presented in [5] for this task.</p>
      <p>A method for comparing simplified and original texts is presented in [6].
The goal of that paper is to align a parallel text corpus that consists of news in
Spanish and their simplified versions. The proposed algorithm is used for
the creation of a corpus for the study of text simplification.</p>
      <p>
        The paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presents a work in progress on creating an annotated text corpus for
Latvian. An important aspect considered in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is the variety and balance of the corpus
in terms of genres, authors and lexical units. The paper [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] also considers issues of
corpus creation. The framework presented there provides several built-in NLP
tools to automatically preprocess texts and is highly customizable [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The development of a doctor-patient dialogue corpus to support a speech-to-speech
machine translation effort for English-Persian medical dialogues is described in the
paper [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The described corpus was developed by recording and transcribing dialogues
in English and then translating them into Persian. The authors highlight the benefits and
drawbacks of creating a corpus in this way. Benefits include the ability to customize
the corpus in a way that would be infeasible for actual doctor-patient data and avoidance
of privacy and legal issues, while drawbacks include the fact that the Persian does not
originate as speech, but as text translation of English speech [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Another side of corpus linguistics concerns low-resource languages. Building
language corpora for low-resource languages is challenging because digitized
texts are limited [7]. Language corpora are needed for building information retrieval services such
as search and translation and for supporting further online content creation. A novel solution
for sourcing relevant multilingual content is proposed in [7]: data are gathered
by crowdsourcing translations via an online competitive game in which
participants are paid for their contributions [7].</p>
      <p>Some researchers outline the significance of combining results from different study
areas. As shown in [8], aggregating results from studies of
individual biological components may lead to biomedical discoveries and holds promise
for therapeutic development. Such knowledge integration could be facilitated by
automated text mining. The paper [8] notes that the creation of appropriate datasets is
hampered by the absence of a resource for launching a distributed annotation effort, as
well as by the lack of a standardized annotation schema. An annotation schema and
a corresponding tool are proposed in [8]. These proposals can be widely adopted so
that the annotated corpora resulting from a multitude of disease studies can be assembled
into a unified benchmark dataset.</p>
      <p>The paper [9] is devoted to studying some cultural peculiarities of rendering English
texts into Ukrainian. The authors of [9] emphasize the importance of
exchanging experience in order to strengthen ties with economically developed
countries, as well as to improve the level of professional and ethical training of current and
future physicians. The studied type of text combines the features of both medical and
moral-ethical discourses, which causes some difficulties in adequate translation from
English into Ukrainian.</p>
      <p>The purpose of the study [10] is to assess the readiness of Ukrainian psychiatrists to
introduce a new classification reflecting the latest trends in the integration of modern
neurobiological research into clinical practice. The new terminology and the new
classification in this domain are studied on the basis of survey results from European countries
and Ukraine. The authors point out the relevance and precision of the terms used in the
medical domain [10], which underlines the relevance of our research direction.</p>
      <p>An analysis of the references shows that little attention is paid to the creation of
linguistic corpora, especially for the Ukrainian language and the medical domain.
However, many authors note the importance of studying medical texts. Therefore, in this paper we
focus on the process of text corpus creation. The aim of this
work is a medical text corpus in Ukrainian that serves as a reference data set for further study.</p>
    </sec>
    <sec id="sec-3">
      <title>Collection and Processing of Texts</title>
      <sec id="sec-3-1">
        <title>Corpus Creation</title>
        <p>Creating a quality text corpus is a challenge because the initial
data significantly influence the processing results. Corpus linguistics offers no default technique for creating a text data
set. We suggest our own way to gather texts in the
field of medicine in order to build a medical text corpus. The main requirement for
the corpus is the ability to provide data for the study of language issues. To create a medical
corpus in Ukrainian, this study proposes the following pipeline (Fig. 1). Wikipedia
texts, texts from social networks, blogs, and digital libraries, as well as official
websites of medical clinics and the Ministry of Health, are considered as data sources.
The first step is identifying the data sources and estimating the availability of Ukrainian medical texts
in the chosen sources.</p>
        <p>Data are collected under the assumption that three categories of text complexity can
be distinguished: complex, moderate, and simple texts.
We rely on our own perception of medical texts to
divide them into these three categories. One of the most important requirements of the second
pipeline step is collecting the same amount of data in each group. In addition, we try to
use semantically close sources to collect text data. At the pre-processing stage, the text
is cleared of unnecessary characters and converted to UTF-8 encoding. The
third step includes such common preprocessing actions as tokenization and normalization,
and describes the texts with a set of statistical indicators.</p>
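        <p>The cleaning, encoding conversion, and tokenization steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used for UKRMED: the set of characters treated as unnecessary, the fallback encoding (cp1251), and the function names decode_to_text, clean, and tokenize are all hypothetical.</p>
        <preformat><![CDATA[
```python
import re
import unicodedata

# Assumption: "unnecessary characters" are control characters; the paper
# does not list the exact characters that were removed.
_CONTROL = re.compile(r"[\u0000-\u0008\u000b\u000c\u000e-\u001f\u007f]")
_SPACES = re.compile(r"\s+")
# A token is a run of letters, optionally joined by an apostrophe or a hyphen
# (Ukrainian words such as "здоров'я" contain an apostrophe).
_TOKEN = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def decode_to_text(raw: bytes, fallback_encoding: str = "cp1251") -> str:
    """Decode raw bytes, trying UTF-8 first and a legacy encoding second."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding, errors="replace")


def clean(text: str) -> str:
    """Normalize Unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = _CONTROL.sub(" ", text)
    return _SPACES.sub(" ", text).strip()


def tokenize(text: str) -> list[str]:
    """Return lower-cased word tokens (lower-casing is the only normalization here)."""
    return [match.group(0).lower() for match in _TOKEN.finditer(text)]


if __name__ == "__main__":
    raw_bytes = "Жар\x00  і кашель".encode("cp1251")  # simulated legacy-encoded page
    text = clean(decode_to_text(raw_bytes))
    print(text)
    print(tokenize(text))
    # Storing the result as UTF-8 is then a plain text.encode("utf-8").
```
]]></preformat>
        <p>Trying UTF-8 first keeps already converted documents untouched, while the fallback handles pages saved in a legacy Cyrillic encoding; a real collection would probably need per-source encoding detection.</p>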
        <p>The simplest and most common way to organize a text corpus is to
store documents in a file system on disk. Placing the documents
of the corpus into subdirectories makes it possible to classify and meaningfully
separate them according to the available meta-information, such as dates or sources. Because
each document is stored in a separate file, corpus reading tools can quickly
select different subsets of documents, and these subsets can be processed in
parallel, with each process receiving its own subset of
documents. We use this way of storage in our research.</p>
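        <p>A file-system layout of this kind can be read with a few lines of standard-library code. The sketch below assumes a hypothetical layout root/category/document.txt with UTF-8 files; the directory names and the helper names iter_documents and documents_per_category are illustrative and do not describe the actual UKRMED repository.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from pathlib import Path
from typing import Iterator, Optional


def iter_documents(root: Path, category: Optional[str] = None) -> Iterator[tuple[str, Path, str]]:
    """Yield (category, path, text) for every .txt file one level below root.

    The category is taken from the name of the subdirectory (an assumption).
    """
    for path in sorted(root.glob("*/*.txt")):
        cat = path.parent.name
        if category is None or cat == category:
            yield cat, path, path.read_text(encoding="utf-8")


def documents_per_category(root: Path) -> dict[str, int]:
    """Count documents in each category, e.g. to check the balance of the corpus."""
    return dict(Counter(cat for cat, _path, _text in iter_documents(root)))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for cat, name in [("complex", "protocol_1"), ("simple", "post_1"), ("simple", "post_2")]:
            (root / cat).mkdir(exist_ok=True)
            (root / cat / f"{name}.txt").write_text("приклад тексту", encoding="utf-8")
        print(documents_per_category(root))
```
]]></preformat>
        <p>Because each category is an independent subdirectory, a call such as iter_documents(root, "simple") can be handed to a separate worker process without any shared state.</p>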
        <p>The pre-validation of the text data is carried out manually, and some texts can be
returned to the pre-processing step. We use a crowdsourcing technique for manual
data validation. At the processing stage, we determine the parts of speech
and obtain the statistical descriptors of the collected texts. The final step is the validation
of the collected text data. First of all, we check the POS tags. Attention is also paid to the balance of the text corpus
and to the analysis of its statistical characteristics.
Prepared and marked-up texts are placed in the repository.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Corpus Description</title>
        <p>Our corpus (UKRMED), the UKRainian MEDicine text corpus, combines three
medical text genres: clinical protocols (“Complex texts”), medical forums
(“Simple texts”), and texts from Wikipedia (“Moderate texts”). The clinical protocols
and other complex texts are taken from the official website of the Ministry of Health of
Ukraine (https://guidelines.moz.gov.ua) and from dissertations and scientific papers
(http://idvamnu.com.ua/journal). The texts of the “Simple texts” category are taken
from such sites as KinesisLife (https://kinesislife.ua), hospital sites
(https://gynecology.kyiv.ua), Dr. Komarovsky (https://komarovskiy.net), etc. The data of the
“Moderate texts” category are taken from Ukrainian Wikipedia (https://uk.wikipedia.org/wiki/).</p>
        <p>A quantitative report in terms of the number of sentences, text tokens, and types
distributed over the different genres is given in Table 1.</p>
        <p>[Table 1: 1,013,585]</p>
        <p>
          This corpus was created for the experiments with medical text simplification that
were launched in [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and for experiments with readability metrics [11, 12]. We
selected a set of feature indices that are calculated for our text corpus. The defined
features are shown in Table 2. Table 3 shows the corresponding values for the features
from Table 2, except “Percentage of speech parts”. The parts of
speech are analyzed separately.
        </p>
        <p>Table 2. Text levels: macro level (paragraph level), syntactic level, lexical level. Feature labels: US, M, AWR, PSP.</p>
        <p>As Table 3 shows, the indices TLT, TLL, ASLT, ATLS, and US have
very close values across the categories, while ASLL and M differ slightly. Some of these values are not
intuitive: for example, the average sentence length is largest for “Moderate texts”, and the
share of monosyllables (for example, “жар”/fever, “корь”/measles) is largest for
“Complex texts”.</p>
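        <p>To make indices of this kind concrete, the following sketch computes three of them for a single text: the average sentence length in tokens, the average token length in symbols, and the share of monosyllabic tokens. The exact definitions used for UKRMED are not given in the text, so the definitions here are assumptions: sentences are split naively on terminal punctuation, and a monosyllable is a token with exactly one Ukrainian vowel letter. The names split_sentences, syllable_count, and text_features are hypothetical.</p>
        <preformat><![CDATA[
```python
import re

UKRAINIAN_VOWELS = set("аеєиіїоуюя")
_SENTENCE_END = re.compile(r"[.!?…]+\s+")  # naive splitter, an assumption
_TOKEN = re.compile(r"[^\W\d_]+(?:['’ʼ-][^\W\d_]+)*")


def split_sentences(text: str) -> list[str]:
    """Split on terminal punctuation followed by whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def syllable_count(token: str) -> int:
    """Approximate the number of syllables by counting vowel letters."""
    return sum(ch in UKRAINIAN_VOWELS for ch in token.lower())


def text_features(text: str) -> dict[str, float]:
    """Compute simple length-based indices for one text."""
    sentences = split_sentences(text)
    tokens = [tok for sent in sentences for tok in _TOKEN.findall(sent)]
    if not tokens:
        return {"tokens": 0, "avg_sentence_len_tokens": 0.0,
                "avg_token_len_symbols": 0.0, "monosyllable_share": 0.0}
    return {
        "tokens": len(tokens),
        "avg_sentence_len_tokens": len(tokens) / len(sentences),
        "avg_token_len_symbols": sum(len(t) for t in tokens) / len(tokens),
        "monosyllable_share": sum(syllable_count(t) == 1 for t in tokens) / len(tokens),
    }


if __name__ == "__main__":
    sample = "У хворого жар. Лікар радить пити воду!"
    for name, value in text_features(sample).items():
        print(name, value)
```
]]></preformat>
        <p>Counting vowel letters is a common approximation for syllables in Ukrainian; a production pipeline would also need a sentence splitter that handles abbreviations, which are frequent in medical texts.</p>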
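        <p>Table 3. Feature values for the three text categories.
Features: TLT, TLL, ASLT, ASLL, ATLS, US, M, AWR.
Column 1: 2,051,544; 7.63; 76.75; 5.87; 4%; 40%; 11.34.
Column 2: 1,925,313; 7.23; 75.81; 5.65; 5%; 38%; 10.35.
Column 3: 2,313,698; 7.67; 85.44; 6.03; 4%; 35%; 11.62.</p>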
        <p>The main purpose of future work with UKRMED is the analysis of readability
and text simplification in the medical domain. With this in mind, we take the values of the
average word length in symbols (ATLS) for all the categories (Fig. 2) and analyze
them. Fig. 2 shows that the “Simple texts” category has the smallest
average word length, while “Moderate texts” and “Complex texts” have similar
token lengths. This observation will support our future research on text simplification and the
assessment of text difficulty.</p>
        <p>We used the Pymorphy2 POS tagger (https://pymorphy2.readthedocs.io/en/latest/index.html)
to analyze the parts of speech in our corpus. A part of the tagset for the POS
annotation of the UKRMED medical text corpus is presented in Table 4.</p>
        <p>As a result of our experiments, we obtained the part-of-speech distribution for our
three categories (Table 5). Table 5 shows that the most numerous parts of speech are NOUN,
ADJF, and VERB. The other parts of speech are rather rare in comparison.</p>
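        <p>A part-of-speech distribution like the one reported in Table 5 can be computed as sketched below. The tagger is passed in as a function so that the example runs without third-party packages; make_pymorphy2_tagger shows how Pymorphy2 could be plugged in, assuming the package and its Ukrainian dictionaries are installed. Taking the first parse returned by the analyzer is an assumption of this sketch, and the names Tagger, pos_distribution, and toy_lexicon are hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Callable, Iterable, Optional

# A tagger maps a token to a POS label, or to None when the token is unknown.
Tagger = Callable[[str], Optional[str]]


def make_pymorphy2_tagger() -> Tagger:
    """Build a tagger backed by Pymorphy2 (needs the Ukrainian dictionaries)."""
    import pymorphy2  # third-party; imported lazily so the rest runs without it

    morph = pymorphy2.MorphAnalyzer(lang="uk")

    def tag(token: str) -> Optional[str]:
        # Assumption: use the first parse; tag.POS is None for unanalyzed words.
        return morph.parse(token)[0].tag.POS

    return tag


def pos_distribution(tokens: Iterable[str], tagger: Tagger) -> dict[str, float]:
    """Return the percentage of each POS label; unknown tokens are counted as "None"."""
    counts = Counter(str(tagger(token)) for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: 100.0 * n / total for pos, n in counts.most_common()}


if __name__ == "__main__":
    # Toy lexicon standing in for a real morphological analyzer.
    toy_lexicon = {"лікар": "NOUN", "призначає": "VERB", "нове": "ADJF", "лікування": "NOUN"}
    tokens = ["лікар", "призначає", "нове", "лікування", "xyz"]
    print(pos_distribution(tokens, toy_lexicon.get))
```
]]></preformat>
        <p>With a real analyzer, the call would be pos_distribution(tokens, make_pymorphy2_tagger()); tokens that cannot be analyzed end up under the label "None".</p>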
        <p>We also found a fairly large category, None. It contains the words that
the morphological analyzer could not analyze.</p>
        <p>We also analyzed the most common words in each category. For this purpose,
all tokens were lemmatized with Pymorphy2 at the pre-processing stage. The results
are presented in Table 6.</p>
        <p>The table shows the 10 most frequent lemmas for each category. Table 6 shows
that some of the words are repeated across categories and some are not. Words that occur in
all categories are highlighted in blue, words found in both “Complex texts” and
“Moderate texts” in red, and words found in both “Simple texts” and “Moderate texts” in green.</p>
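        <p>The frequency lists and the overlap between categories can be obtained as in the following sketch. It assumes that the lemmas are already available as lists of strings (for example, from the normal_form attribute of a Pymorphy2 parse); the function names top_lemmas and overlap_groups and the sample data are hypothetical and are not taken from Table 6.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from typing import Iterable


def top_lemmas(lemmas: Iterable[str], k: int = 10) -> list[str]:
    """Return the k most frequent lemmas in order of decreasing frequency."""
    return [lemma for lemma, _count in Counter(lemmas).most_common(k)]


def overlap_groups(top_by_category: dict[str, list[str]]) -> dict[frozenset, set[str]]:
    """Group lemmas by the exact set of categories whose top list contains them."""
    categories_of: dict[str, set[str]] = {}
    for category, lemmas in top_by_category.items():
        for lemma in lemmas:
            categories_of.setdefault(lemma, set()).add(category)
    groups: dict[frozenset, set[str]] = {}
    for lemma, categories in categories_of.items():
        groups.setdefault(frozenset(categories), set()).add(lemma)
    return groups


if __name__ == "__main__":
    # Illustrative top lists only; they do not reproduce the corpus results.
    tops = {
        "complex": ["який", "лікування", "пацієнт"],
        "moderate": ["який", "може", "лікування"],
        "simple": ["який", "може", "мати"],
    }
    for categories, lemmas in overlap_groups(tops).items():
        print(sorted(categories), sorted(lemmas))
```
]]></preformat>
        <p>Each key of the result corresponds to one highlighting group: the lemmas shared by all three categories, those shared by two of them, and those specific to a single category.</p>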
        <p>The category “Complex texts” is well characterized by such words as
“treatment”, “patient”, “patient”, “study”, which are at the top of the list of frequent
tokens. The word “which” is frequent in all three categories since, in our
opinion, sentences in the medical field are quite long and require pronouns and
subordinate clauses. Relative pronouns (for example, “який”/“which/who”) serve as
connecting words that join subordinate clauses to main ones.</p>
        <p>The word “може”/“can”, which is found in “Moderate texts” and “Simple texts”,
reflects the advisory nature of discussions of diseases, treatment, or diagnosis. The lists
also contain a pronoun that highlights objects in a speech situation. Thus, such
pronouns connect the text and help to avoid repetition. The word “мати” (which can be translated as the
noun “mother” or the verb “to have”) was mislabeled: in our case, it is a verb, not a noun.
Therefore, we also face a disambiguation problem.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>Medical texts include drug packages, medical records, fact sheets, medical reference
books, training materials, certificates, etc. To solve the problem of
medical text simplification, it is first necessary to single out the features of such texts. In this
study, we rely on the texts of clinical protocols. In order to accelerate the
development and implementation of state standards in the field of health, the Ministry
of Health of Ukraine approves medical and technological documents on the basis of
evidence-based medicine. Such documents include a unified clinical protocol for
medical care, as well as an adapted clinical trial that is based on evidence. Depending on the
disease, the plan of treatment and preventive measures may differ, which is also
prescribed by the legislation in the local protocols of prevention and treatment.</p>
      <p>The main idea of our research is that the simplification of a medical text depends on
the complexity of the text and on the stakeholder who studies it. For patients,
such parts of the protocol as the passport of the protocol or the list of references can be
omitted. Of greatest interest to patients are the parts of the protocol that describe the symptoms of the
disease, the epidemiology, the necessary actions of the doctor and, especially, the
recommendations. It should also be noted that all medical records
are provided in the state language. As a result, the medical text is replete not only with
special Latin terms but also with complex medical words in Ukrainian.
Linguistic analysis of Ukrainian text is a daunting task, and here the
problem is complicated by the huge amount of medical terminology. At the same time,
the text also contains words from the subject area that do not require simplification.</p>
      <p>In this paper, we described the structure and the quantitative features of UKRMED,
an annotated Ukrainian text corpus that contains three classes of medical texts. More
broadly, language resources of this kind are a valuable asset for up-to-date research
and for the development of effective technologies targeted at the medical domain. Future
work will focus on medical fact extraction from official documents, Ukrainian text
simplification in the medical field, the development of question answering services, etc.</p>
      <p>5. Goeuriot, L., Na, J. C., Kyaing, W. Y. M., Khoo, C., Chang, Y. K., Theng, Y. L., Kim, J. J.: Sentiment lexicons for health-related opinion mining. In: Proc. of the 2nd ACM SIGHIT International Health Informatics Symposium, pp. 219–225 (2012).</p>
      <p>6. Bott, S., Saggion, H.: An Unsupervised Alignment Algorithm for Text Simplification Corpus Construction. In: Proc. of the Workshop on Monolingual Text-To-Text Generation, pp. 20–26 (2011).</p>
      <p>7. Packham, S., Suleman, H.: Crowdsourcing a text corpus is not a game. In: Lecture Notes in Computer Science, Vol. 9469, pp. 225–234 (2015).</p>
      <p>8. Cano, C., Monaghan, T., Blanco, A., Wall, D. P., Peshkin, L.: Collaborative text-annotation resource for disease-centered relation extraction from biomedical text. Journal of Biomedical Informatics, 42(5), pp. 967–977 (2009).</p>
      <p>9. Velychenko, O., Popova, O.: Cross-cultural specificities of rendering texts on medical ethics in Ukrainian translation. Naukovy Visnyk of South Ukrainian National Pedagogical University Named after K. D. Ushynsky: Linguistic Sciences, 2019(29), pp. 36–50 (2019).</p>
      <p>10. Zubatiuk, O., Nosova, E.: Neuroscience-based nomenclature in Ukraine. European Neuropsychopharmacology, 27, S663 (2017).</p>
      <p>11. Wermter, J., Hahn, U.: An Annotated German-Language Medical Text Corpus as Language Resource. In: Proc. of the Fourth International Conference on Language Resources and Evaluation (LREC’04), pp. 473–476 (2004).</p>
      <p>12. Readability formulas (title from screen), https://readable.com/features/readability-formulas/, last accessed 2020/04/12.</p>
      <p>13. Vysotska, V., Lytvyn, V., Burov, Y., Gozhyj, A., Makara, S.: The consolidated information web-resource about pharmacy networks in city. In: Proc. of the 1st International Workshop on Informatics &amp; Data-Driven Medicine (IDDM 2018), CEUR Workshop Proceedings, pp. 239–255 (2018).</p>
      <p>14. Lytvyn, V., Burov, Y., Kravets, P., Vysotska, V., Demchuk, A., Berko, A., Ryshkovets, Y., Shcherbak, S., Naum, O.: Methods and Models of Intellectual Processing of Texts for Building Ontologies of Software for Medical Terms Identification in Content Classification. In: Proc. of the 2nd International Workshop on Informatics &amp; Data-Driven Medicine (IDDM 2019), CEUR Workshop Proceedings, Vol. 2362, pp. 354–368 (2019).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Cherednichenko</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanishcheva</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Babkova</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Complex term identification for Ukrainian medical texts</article-title>
          .
          <source>In: Proc. of the 1st International Workshop on Informatics &amp; Data-Driven Medicine (IDDM 2018)</source>
          , Vol.
          <volume>2255</volume>
          , pp.
          <fpage>146</fpage>
          -
          <lpage>154</lpage>
          , CEUR-WS (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Gruzitis</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pretkalnina</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saulite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rituma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nespore-Berzkalne</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Znotins</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paikens</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Creation of a balanced state-of-the-art multilayer corpus for NLU</article-title>
          .
          <source>In: Proc. of the 11th International Conference on Language Resources and Evaluation</source>
          , pp.
          <fpage>4506</fpage>
          -
          <lpage>4513</lpage>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Janssen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>TEITOK: Text-faithful annotated corpora</article-title>
          .
          <source>In: Proc. of the 10th International Conference on Language Resources and Evaluation</source>
          ,
          LREC
          <year>2016</year>
          , pp.
          <fpage>4037</fpage>
          -
          <lpage>4043</lpage>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Belvin</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>May</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Narayanan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Georgiou</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ganjavi</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Creation of a doctor-patient dialogue corpus using standardized patients</article-title>
          .
          <source>In: Proc. of the 4th International Conference on Language Resources and Evaluation</source>
          ,
          LREC
          <year>2004</year>
          , pp.
          <fpage>187</fpage>
          -
          <lpage>190</lpage>
          (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>