MedSimples: An Automated Simplification Tool
  for Promoting Health Literacy in Brazil*

    Liana Paraguassu (1) [0000-0003-4043-1836], Leonardo Zilio (2) [0000-0002-6101-0814],
    Luis Antonio Leiva Hercules (3), and Maria José Bocorny Finatto (1) [0000-0002-6022-8408]

    (1) Universidade Federal do Rio Grande do Sul, UFRGS, Brazil
        liana@linguatraducoes.com, mariafinatto@gmail.com
    (2) University of Surrey, United Kingdom, l.zilio@surrey.ac.uk
    (3) Automatic Data Processing, ADP, Brazil

    Functional illiteracy rates in Brazil are critical. According to a recent study
(2018) conducted by the Paulo Montenegro Institute, 3 out of 10 Brazilians
are considered functionally illiterate. Also according to INAF (footnote 4), only 12% are
truly proficient readers. On the other hand, with the significant increase of
Internet access in Brazil in the past years, information is available to a much
larger number of people. According to the Brazilian Institute of Geography and
Statistics (IBGE), in 2017, 67% of the Brazilian population had access to the
Internet. Although a search on the Internet by a layperson cannot replace going
to the doctor, the growing number of Internet users who rely on it as a source of
information is a reality that cannot be overlooked. Therefore, it is important not
only that the sources be reliable, but also that the information they provide be
linguistically accessible to and understandable by people with low levels of literacy.
In this scenario, our research problem is this: How can we make health-related
information available on the Web linguistically accessible to people
with limited education and low literacy skills? Our project combined Natural
Language Processing, Corpus Linguistics and Terminology Studies to develop
MedSimples, an online tool that automatically identifies complex phrases in
health-related texts provided by the user and suggests simplifications.
MedSimples is an example of how research on medical language, associated with
NLP, can enrich the current scenario of Digital Humanities in its broadest scope.
    The prototypical user of the tool is the Health or Communication professional
interested in producing accessible texts graded according to the needs
of their audience. However, it can also be used by anyone who is interested in
getting simplified health-related information for themselves. Our Text Simplification
approach focuses on the lexical and terminological levels, and the goal is
to reduce complexity while preserving meaning. Lexical simplification can follow
*
  Copyright © 2020 for this paper by its authors. Use permitted under Creative
  Commons License Attribution 4.0 International (CC BY 4.0). DHandNLP, 2 March 2020,
  Evora, Portugal.
  Supported by LARA 2019 (Google Inc.), CNPq - Brasil (Productivity in research
  305625/2016-0), Universal (403521/2016-5), SEAD-UFRGS, and Expanding
  Excellence in England (E3).
4 INAF is a Brazilian literacy indicator. http://www.ipm.org.br/inaf

two approaches [3]: modifying the vocabulary of a text by selecting words that
are better suited to the reader’s reading skills, or adding explanations or definitions
for the vocabulary that cannot be replaced. MedSimples combines both
approaches, suggesting replacements mostly for complex phrases
and providing simpler definitions for the terminology that is present in the text.
We believe that this way the information is preserved and remains accurate and
reliable and, in addition, by explaining the complex vocabulary, we can
promote health literacy by educating our readers on health issues.
    The tool was initially built based on a Parkinson’s Disease (PD) corpus. The
PD corpus is a representative collection of original texts available on the Internet
and published by reliable sources. These texts were simplified by Linguistics
graduate and undergraduate students and validated by health specialists to
build a parallel simplified PD corpus. This corpus was then further analysed
and converted into an initial list of terms and a list of example sentences with
varying degrees of complexity, which health professionals can consult.
    To identify complex phrases and terms automatically and to offer the user
suggestions of simplification or simpler definitions, MedSimples currently relies
on three lexical resources and one parser.
The three lexical resources comprise a list of words that are considered simple,
a list of words that are considered complex along with simpler synonyms, and a
glossary of terms with simple definitions. The first step in the MedSimples process is
to parse the text selected by the user; we use the PassPort parser [4] to
annotate the text with morphological and lemma information. This annotated
text is then processed and, when MedSimples recognizes a term that is listed in
our glossary, it automatically provides a simplified explanation. For the complex
words that are not considered health-related terms, we use the synsets present
in the Thesaurus of Portuguese (TeP 2.0) [1], which were filtered based on a list
of simple words extracted from CorPop [2], a corpus of texts considered fairly
accessible to the average Brazilian reader.
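
    The synset filtering step described above can be illustrated with a minimal
sketch. It is not the MedSimples implementation: the function name
simpler_synonyms, the data layout, and the toy entries in SIMPLE_WORDS and
SYNSETS are assumptions made for illustration only, and they stand in for the
much larger CorPop word list and TeP 2.0 synsets.

```python
# Illustrative sketch only: toy data standing in for CorPop and TeP 2.0.
# All names and entries below are hypothetical, not the actual resources.

# Lemmas assumed to be "simple" (in the paper, extracted from CorPop).
SIMPLE_WORDS = {"chamar", "classificar", "designar", "nomear", "doença", "pessoa"}

# Lemma -> list of synsets (in the paper, taken from TeP 2.0).
SYNSETS = {
    "denominar": [
        {"denominar", "chamar", "intitular", "nomear"},
        {"denominar", "classificar", "designar"},
    ],
}


def simpler_synonyms(lemma, synsets=SYNSETS, simple_words=SIMPLE_WORDS):
    """Return the synonyms of `lemma` that appear in the simple-word list."""
    if lemma in simple_words:
        return []  # the word is already considered simple; nothing to suggest
    candidates = set()
    for synset in synsets.get(lemma, []):
        candidates |= synset  # pool the members of every synset of the lemma
    candidates.discard(lemma)  # never suggest the word itself
    # Keep only candidates that the simple-word list accepts.
    return sorted(candidates & simple_words)


print(simpler_synonyms("denominar"))
```

    In this toy setting, the synonym "intitular" is discarded because it is not
in the simple-word list, while the remaining candidates are kept as suggestions.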


         Fig. 1. Parkinson’s Disease Simplification. Source text: Wikipedia.

    Figure 1 shows an excerpt of a text about PD extracted from the Internet
and processed by MedSimples. The synonyms in blue come from the filtered TeP
2.0 synsets and the explanations in green come from our term glossary. Examples:
a term from the glossary, doença degenerativa (tipo de doença em que a condição
da pessoa vai piorando aos poucos), and a filtered item (difficult word), denominam
(chamar, classificar, designar, nomear).
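
    The kind of inline annotation shown in these examples can be sketched as
follows. This is a simplified illustration under stated assumptions, not the
tool's actual code: the function annotate, the (form, lemma) token format, the
lemma "degenerativo", and the two small dictionaries are hypothetical. The
sketch assumes the text has already been lemmatized by a parser, and that
glossary terms are matched on lemma sequences, longest match first.

```python
# Illustrative sketch only: hypothetical data structures, not MedSimples code.

# Glossary keyed by tuples of lemmas, so multiword terms can be matched.
GLOSSARY = {
    ("doença", "degenerativo"):
        "tipo de doença em que a condição da pessoa vai piorando aos poucos",
}

# Complex lemma -> simpler synonyms (already filtered by the simple-word list).
SYNONYMS = {"denominar": ["chamar", "classificar", "designar", "nomear"]}


def annotate(tokens, glossary=GLOSSARY, synonyms=SYNONYMS):
    """Annotate (form, lemma) pairs with glossary definitions or synonyms."""
    max_len = max((len(key) for key in glossary), default=1)
    output = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest glossary term starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(lemma for _, lemma in tokens[i:i + n])
            if key in glossary:
                surface = " ".join(form for form, _ in tokens[i:i + n])
                output.append(f"{surface} ({glossary[key]})")
                i += n
                matched = True
                break
        if matched:
            continue
        form, lemma = tokens[i]
        if lemma in synonyms:
            # Complex non-term word: list the simpler synonyms.
            output.append(f"{form} ({', '.join(synonyms[lemma])})")
        else:
            output.append(form)
        i += 1
    return " ".join(output)


# Hypothetical parser output: (surface form, lemma) pairs.
example = [("denominam", "denominar"), ("doença", "doença"),
           ("degenerativa", "degenerativo")]
print(annotate(example))
```

    Checking the glossary before the synonym list reflects the priority described
in the text: a word recognized as a term receives an explanation rather than a
list of replacement candidates.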
    Although MedSimples is in its early stages, it has shown promising results
with PD material. Currently, we are expanding our database to cover a wider
range of health-related topics. In addition, we are working on improving the quality of
our glossary of health-related terms and our list of complex words. For instance,
we are refining and expanding our list of terms to cover words that should be
considered terms, such as “demência” and “tremores” (Fig. 1), but that were
previously listed as complex words, which usually led to inaccurate
suggestions of simplification.


References
1. Maziero, E., Pardo, T.: Interface de acesso ao TeP 2.0 – Thesaurus para o português
   do Brasil. Relatório técnico. University of Sao Paulo (2008)
2. Pasqualini, B.F.: CorPop: um corpus de referência do português popular escrito do
   Brasil. Ph.D. thesis, Universidade Federal do Rio Grande do Sul (2018)
3. Saggion, H.: Automatic Text Simplification. Synthesis Lectures on Human Language
   Technologies, vol. 10(1). Morgan & Claypool Publishers, California (2017)
4. Zilio, L., Wilkens, R., Fairon, C.: PassPort: A dependency parsing model for
   Portuguese. In: International Conference on Computational Processing of the
   Portuguese Language. pp. 479–489. Springer (2018)