Automatic Medical Text Simplification: Challenges of Data Quality and Curation

Chandrayee Basu¹, Rosni Vasu², Michihiro Yasunaga¹, Sohyeong Kim¹, Qian Yang³

¹Stanford University   ²University of Zurich   ³Cornell University
cbasu@stanford.edu

Abstract

Health literacy is the degree to which individuals can comprehend basic health information needed to make appropriate health decisions. The topmost reason for low health literacy is the vocabulary gap between providers and patients. Automatic medical text simplification can contribute to improving health literacy by assisting providers with patient-friendly communication, improving health data search, and making online medical texts more accessible. It is, however, extremely challenging to curate a quality corpus for this natural language processing (NLP) task. In this position paper, we observe that, despite recent research efforts, existing open corpora for medical text simplification are poor in quality and size. In order to match the progress in general text simplification and style transfer, we must leverage careful crowd-sourcing. We discuss the challenges of naive crowd-sourcing, and we propose that careful crowd-sourcing for medical text simplification is possible when combined with automatic data labeling, a well-designed expert-layman collaboration framework, and context-dependent crowd-sourcing instructions.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Low health literacy has been associated with non-adherence to treatment plans and regimens, poor patient self-care, lack of timely communication of health issues, and increased risk of hospitalization and mortality (King 2010). Simplification of medical documents and of online communications such as email messages and patient instructions can go a long way toward mitigating health literacy challenges. While the consumer versions of medical journals, news articles, and a few trusted websites (NIA 2018; Savery et al. 2020) are written by trained experts, they are by no means exhaustive. Automated approaches are necessary to keep pace with the rapidly growing body of biomedical literature. In this work, we evaluate some of the open corpora that power automated text simplification in the medical domain.

We define text simplification, following Siddharthan (2014), as the process of reducing the linguistic complexity of a text while still retaining the original information content and meaning. A domain-specific expert text undergoes various kinds of transformations to reach the final simple form. Research in automatic non-medical text simplification has been burgeoning with the introduction of large parallel corpora (Zhu, Bernhard, and Gurevych 2010; Woodsend and Lapata 2011; Coster and Kauchak 2011; Xu, Callison-Burch, and Napoles 2015; Paetzold and Specia 2017). The creation of multi-references enabled models that can learn different kinds of textual transformations separately, viz. lexical changes (e.g. paraphrasing), syntactic modifications (e.g. reordering of concepts, splitting texts, reducing sentence length, etc.), and compression (e.g. deleting peripheral information irrelevant to the target domain) (Alva-Manchego et al. 2020).

References are gold-standard, human-generated simplifications used to validate model outputs. The success of automatic text simplification and style transfer hinges on large amounts of crowd-sourced multiple references. However, crowd-sourcing even a single set of references for medical texts is challenging: it requires recruiting a specific sub-population with a certain degree of domain expertise. For example, Nye et al. (2018) described an elaborate process of recruiting MDs and medical experts from Upwork for PICO data annotation. Naturally, we observe a dearth of high-quality parallel training corpora in medical AI. Furthermore, the text simplification task poses additional challenges: only the expert knows what content of the domain-specific text is relevant to laymen, whereas only laymen or medical writers trained to translate medical texts can judge the quality and accessibility of the simplified versions.

In this work, we make the following contributions:

• identify the open-source datasets for medical text simplification;
• characterize the datasets by their quantity, quality, diversity, and representativeness;
• identify the challenges of scaling high-quality corpus generation for medical text simplification.

Assumptions: We treat summarization as a subset of text simplification. We only consider corpora that represent composite textual transformations (simple text derived through a combination of syntactic, semantic, thematic, and lexical transformations of the expert text) (Lyu et al. 2021) for further analysis.
Datasets for Medical Text Simplification

Datasets for medical text simplification support two kinds of document simplification: sentence-level and paragraph-level. We focus on sentence-level and short paragraph-level simplification. After an elaborate search, we found three datasets in English for medical text simplification: two parallel corpora, SIMPWIKI (Van den Bercken, Sips, and Lofi 2019) and PARASIMP (Devaraj et al. 2021), and one non-parallel corpus, MSD (Cao et al. 2020).

Next, we delve deeper into how these datasets are created and into the potential artifacts of the data collection and annotation processes.

Artifacts of Corpus Curation

In the absence of reliable crowd-sourcing of medical texts, researchers resort to crawling medical websites. The expert texts are sampled from the online articles and checked post hoc for adequate corpus representativeness. The layman texts are retrieved from the layman or consumer versions of the professional articles, based on the alignment of section titles and text content. The alignment is either checked manually for a small fraction of the corpus or derived automatically using different algorithms, and only a few of the automatically aligned pairs are validated by experts. Automatic alignment is not always reasonable (Alva-Manchego, Scarton, and Specia 2020). Random sampling of expert texts from larger articles and unreliable automatic retrieval can lead to text pieces that are not stand-alone (Choi et al. 2021). We found that the process of expert verification is insufficient for quality data curation and can still leave pairs lacking correspondence. On the other end, models trained using highly aligned text pairs may exhibit limited generalizability.

A more recent trend is to generate large volumes of non-parallel corpora, obviating validation of automatically aligned pairs. This follows similar approaches in non-medical text style transfer (Shen et al. 2017; He and McAuley 2016; Madaan et al. 2020). Some researchers distinguish between text simplification and text style transfer; we consider text simplification a sub-domain of text style transfer in which the goal is to transform text from the expert style to the layman style.

Datasets

Van den Bercken, Sips, and Lofi (2019) contributed the first publicly available medical text simplification corpus, which we refer to as SIMPWIKI, following Cao et al. (2020). The authors created three subsets: a fully-aligned expert subset, the medical subset of the Wikipedia data from Hwang et al. (2015), gleaned using QuickUMLS (Soldaini and Goharian 2016) for named entity recognition and later validated by experts; and partly-aligned expert and fully-aligned automatic subsets, with texts from Wikipedia and Simple Wikipedia aligned using the BLEU score (Papineni et al. 2002). Fully aligned text pairs have strong one-to-one correspondence; partly aligned simple texts cover the expert text entirely but contain additional facts. This dataset has 9,212 expert-layman pairs. The texts are ≤ 128 tokens long.

MSD is a non-parallel corpus derived from the Merck Manuals, a trusted health reference for 100 years covering a wide range of medical topics. For each topic, the manual contains a consumer version and an expert version of the text, making it an ideal candidate for text simplification corpus curation. This dataset offers wide coverage of medical topics and medical PICO elements (Cao et al. 2020). The authors scraped raw consumer and professional texts from the MSD website, split them into sentences, identified parallel groups by matching document titles and subsection titles, and picked linked sentences from the matched sections of the articles. The resulting text pairs were validated by non-native English speakers; the annotators used native-language translations to speed up annotation. The text pairs are also annotated with UMLS concepts (Bodenreider 2004) for domain knowledge. MSD has 130,349 expert texts and 114,674 layman texts in the non-parallel training set, and 675 expert-layman pairs for validation. The texts are ≤ 245 tokens long.

We also considered a paragraph-level simplification corpus (Devaraj et al. 2021). The corpus consists of technical abstracts of biomedical systematic reviews and the corresponding plain language summaries (PLS) from the Cochrane Database of Systematic Reviews (McIlwain et al. 2014). The PLS are written in simple English; they usually represent the key essence of the abstracts and are structured heterogeneously (Kadic et al. 2016). We decided to exclude this corpus from our analysis due to the abstractive summary nature of the layman versions.

The size of the parallel corpora is extremely small compared to those for non-medical text simplification, where the median corpus size is 154K pairs (Alva-Manchego, Scarton, and Specia 2020).

Automatic Dataset Quality Assessment

We assessed MSD and SIMPWIKI for their overall quality, diversity, and representativeness. We define these terms as follows. Quality: grammatical correctness, average readability score, and adherence to domain-specific styles. Diversity: coverage of the various transformations that text simplifications entail in the medical domain (distinct from the diversity of language generation (Ippolito et al. 2019)). Representativeness: coverage of various medical sub-domains (e.g. gynecology, neurology, cardiology) and topics (e.g. symptoms, signs, treatments).

Metrics

We measured the above features separately for the parallel and non-parallel corpora.

Quality: For grammatical correctness, we used the average acceptability score returned by textattack's RoBERTa-based classifier for CoLA (Morris et al. 2020; Warstadt, Singh, and Bowman 2019; HuggingFace 2021). We computed the readability of the two corpora in terms of Flesch-Kincaid Reading Ease, Flesch-Kincaid Grade Level (Kincaid et al. 1975), and the Automated Readability Index (ARI) (Senter and Smith 1967), similar to Li and Nenkova (2015) and Devaraj et al. (2021).
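To make these quality metrics concrete, the sketch below scores a list of texts for CoLA acceptability and readability. The checkpoint name, its label mapping, and the use of the textstat package are our assumptions about tooling, not details reported above.

import textstat
from transformers import pipeline

# Assumption: the "textattack/roberta-base-CoLA" checkpoint, for which we
# take LABEL_1 to be the "acceptable" class.
cola = pipeline("text-classification", model="textattack/roberta-base-CoLA")

def quality_scores(texts):
    # Acceptability: probability assigned to the "acceptable" CoLA label.
    preds = cola(texts, truncation=True)
    accept = [p["score"] if p["label"] == "LABEL_1" else 1.0 - p["score"]
              for p in preds]
    # Readability: three standard formulas, averaged over the corpus.
    fre = [textstat.flesch_reading_ease(t) for t in texts]
    fkg = [textstat.flesch_kincaid_grade(t) for t in texts]
    ari = [textstat.automated_readability_index(t) for t in texts]
    n = float(len(texts))
    return {"acceptability": sum(accept) / n,
            "flesch_reading_ease": sum(fre) / n,
            "flesch_kincaid_grade": sum(fkg) / n,
            "ari": sum(ari) / n}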
We used classifiability, relative lexical complexity, and elaboration as metrics of domain-specific style. We measured classifiability by the test accuracy of a trained attribute model (Yang et al. 2018; Subramanian et al. 2018; Prabhumoye et al. 2018); specifically, we trained a 1D CNN attribute model (Kim 2014) over GPT-2 embeddings (Radford et al. 2019). Following Siddharthan (2014), we expect a good-quality simplified corpus to contain sufficient elaborations of technical concepts and jargon, and fewer low-frequency words. We report how many elaborations are present in the simple texts of the corpora using thresholded cosine similarity between Sentence-BERT embeddings (Zhong et al. 2020; Reimers and Gurevych 2019) of the text pairs: we embedded each sentence of the simple text and of the expert text and computed pairwise alignments. We used Sentence-BERT because it is tuned on several corpora, including SciDocs (Cohan et al. 2020), to embed sentences and short paragraphs, and it performed better than competing models on several downstream tasks. That said, wherever possible, we avoided language-model-based metrics due to the mismatch between medical texts and the models' training data.
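The elaboration check can be approximated as in the minimal sketch below, assuming the sentence-transformers package with a generic checkpoint; the model name and the similarity threshold are illustrative, since the exact values are not reported above.

from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt")
from sentence_transformers import SentenceTransformer, util

# Assumption: a generic Sentence-BERT checkpoint stands in for the one used.
model = SentenceTransformer("all-MiniLM-L6-v2")

def has_elaboration(expert_text, simple_text, threshold=0.5):
    # Embed each sentence of the simple text and of the expert text.
    simple_emb = model.encode(sent_tokenize(simple_text), convert_to_tensor=True)
    expert_emb = model.encode(sent_tokenize(expert_text), convert_to_tensor=True)
    # Pairwise cosine similarities: rows are simple sentences, columns expert.
    sims = util.cos_sim(simple_emb, expert_emb)
    # A simple sentence whose best expert match falls below the threshold is
    # treated as added content, i.e. a candidate elaboration or explanation.
    best_match = sims.max(dim=1).values
    return bool((best_match < threshold).any())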
Diversity: We argue that a quality corpus for text simplification should be diverse enough to accommodate the various textual transformations that domain-specific simplifications entail. These transformations can be lexical, semantic, and syntactic. Lexical transformations refer to the substitution of complex terms or phrases by more accessible ones and can also include elaborations (extensions) or explanations (intentions). Syntactic transformations are more style-dependent, such as formality change, voice change, and tense change. We measured the semantic diversity of the MSD validation data and of the entire SIMPWIKI corpus using Sentence-BERT-based corpus alignment.

We measured lexical and syntactic transformations using referenceless quality features such as Levenshtein similarity, the proportions of words added, deleted, or kept, the compression ratio, and the lexical complexity ratio, from the EASSE library (Martin et al. 2018, 2019; Alva-Manchego et al. 2020).
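For illustration, the core referenceless features can be computed directly as below. The definitions are our reading of these metrics and may differ in detail from EASSE's implementation; the snippet is kept self-contained on purpose.

import Levenshtein  # pip install python-Levenshtein

def transformation_features(expert, simple):
    e_tokens, s_tokens = expert.split(), simple.split()
    e_set, s_set = set(e_tokens), set(s_tokens)
    return {
        # Layman-to-expert length ratio; values > 1 suggest elaboration.
        "compression_ratio": len(s_tokens) / max(len(e_tokens), 1),
        "levenshtein_similarity": Levenshtein.ratio(expert, simple),
        "exact_copy": float(expert == simple),
        # Share of layman tokens absent from the expert text, and vice versa.
        "additions_proportion": len(s_set - e_set) / max(len(s_set), 1),
        "deletions_proportion": len(e_set - s_set) / max(len(e_set), 1),
        "kept_words": len(e_set & s_set),
    }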
Representativeness: We also checked which of the two corpora covers a wider range of medical subdomains and topics. Cao et al. (2020) already measured the representativeness of MSD by the distributions of its PICO elements (slightly different from the PICO elements in Nye et al. (2018)) and medical subdomains.

SIMPWIKI being a subset of Wikipedia articles relevant to medical topics, we referred to Shafee et al. (2017) for its representativeness. There are 30,000 articles on medical topics in Wikipedia, rated for quality and importance by editors. The top-rated articles are on tuberculosis and pneumonia. High-importance articles cover common diseases and treatments; mid-importance articles encompass conditions, tests, drugs, anatomy, and symptoms; and the remaining low-importance articles include niche or peripheral medical topics such as laws, physicians, and rare conditions.

Results

Quality: Approximately 90% of the expert texts in MSD, and > 97% of the MSD layman texts and of the SIMPWIKI texts, were acceptable according to the CoLA model. This means that ≈ 360 texts each, in the expert and layman versions of the SIMPWIKI corpus, were not acceptable, and that 10% of the expert texts in MSD (11,460 texts) had a low acceptability score, possibly because of unique vocabulary and sentence structures and incomplete references.

See Table 2 for the readability scores. We found discrepancies with the readability scores reported in Cao et al. (2020). A paired t-test shows that the expert and the simple texts in both MSD and SIMPWIKI have statistically significant differences in readability, measured by Flesch Reading Ease, Flesch-Kincaid Grade, and the Automated Readability Index (p < 0.001). The minimum readability of medical texts is low compared to general English corpora (the minimum Flesch-Kincaid grade level is 11.9), as also observed by Devaraj et al. (2021).
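A minimal sketch of that significance test, assuming per-text Flesch-Kincaid grades computed with the textstat package over aligned expert/layman pairs:

import textstat
from scipy import stats

def readability_gap(expert_texts, layman_texts):
    # Per-pair Flesch-Kincaid grade levels for aligned expert/layman texts.
    expert_fkg = [textstat.flesch_kincaid_grade(t) for t in expert_texts]
    layman_fkg = [textstat.flesch_kincaid_grade(t) for t in layman_texts]
    # Paired t-test: do the two readability distributions differ?
    t_stat, p_value = stats.ttest_rel(expert_fkg, layman_fkg)
    return t_stat, p_value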
Lexical complexity, computed using the EASSE package (Martin et al. 2019), represents the word-rank score distribution of the corpus. While the mean complexity of MSD is not very different between the expert and layman versions, a much lower standard deviation confirms that the expert texts contain more rare words. SIMPWIKI has more common words than MSD in both expert and layman versions, and its complexity varies across the corpus. We measured the percentage of simple texts that potentially contain elaborations, for MSD and separately for the differently aligned pairs of SIMPWIKI. Based on our coarse approach, we found a high proportion of elaborations in MSD, which is desirable; however, further human validation is required to confirm the relevance of these elaborations.

We trained two different attribute models for the classifiability check and did not notice a significant difference in test accuracy between the two corpora, although the training data size was significantly larger for MSD. The accuracy was 0.88 for MSD and 0.81 for SIMPWIKI.

Diversity

We computed several referenceless text quality metrics using the EASSE library (Martin et al. 2019), with some modifications to output the mean, standard deviation, and standard error of each metric, and we used these automatic metrics as a proxy for simplification-related transformations (Table 1). An average compression ratio > 1 in MSD points to more elaborations and explanations (and potentially irrelevant facts). A higher standard deviation of the compression ratio indicates more diversity in transformations. Higher additions in MSD indicate more domain-specific words (possibly more common words) being introduced in the simpler versions. Overall, we observe that MSD represents more textual transformations than SIMPWIKI.
Table 1: Transformation Diversity Metrics (layman relative to expert; mean ± std. dev. where applicable).

Metric                   MSD                SIMPWIKI
Compression ratio        1.257 ± 0.9        0.907 ± 0.46
Levenshtein similarity   0.519 ± 0.166      0.641 ± 0.219
Exact copies             0.029              0.07
Additions proportion     0.526 ± 0.254      0.304 ± 0.251
Deletions proportion     0.439 ± 0.244      0.421 ± 0.286
Added words              20.135 ± 18.144    7.951 ± 8.941
Deleted words            17.181 ± 20.504    12.16 ± 11.89
Kept words               11.914 ± 9.881     12.529 ± 9.5
Corpus alignment         0.428 ± 0.226      0.832 ± 0.161 (auto full)
                                            0.862 ± 0.125 (exp full)
                                            0.597 ± 0.163 (exp part)
Table 2: Quality metrics (mean ± std. dev. where applicable).

Metric                       MSD Test Expert   MSD Test Layman   SIMPWIKI Expert   SIMPWIKI Layman
Acceptability score          0.907             0.976             0.977             0.965
Flesch Reading Ease          17.44 ± 32.25     37.116 ± 28.19    30.07 ± 28.34     41.47 ± 29.12
Flesch-Kincaid Grade Level   15.2 ± 5.4        12.6 ± 5.7        14.4 ± 5.5        11.9 ± 5.1
ARI                          15.6 ± 6.4        13 ± 6.9          15.1 ± 6.7        12.4 ± 6.2
Lexical complexity           9.17 ± 0.087      9 ± 0.792         8.842 ± 0.79      8.695 ± 0.867
Elaboration (% of texts)                       27.4                                4.6 (auto full)
                                                                                   0.8 (exp full)
                                                                                   0.8 (exp part)


Human Data Quality Assessment

In the previous section, we used automatic metrics to evaluate the approximate quality and diversity of the corpora for medical text simplification. We found that MSD is potentially more diverse but also has lower acceptability, because of the sheer scale of the data and its unique vocabulary. The expert texts in MSD require a higher minimum reading grade. While this corpus seems to contain more elaborations in the validation set compared to SIMPWIKI, these elaborations cannot be explicitly learnt from the non-parallel training data. All of the above points to the need for further data collection and quality human annotation.

Crowd-sourcing

In many NLP tasks, it is customary to complement automatic model validations with human evaluations. A large body of work has been dedicated to analysing and correcting the mismatch between human judgement and automatic evaluation. Researchers found that both the metrics (Banerjee and Lavie 2005; Zhang et al. 2019; Ma et al. 2019) and artifacts of data collection (Freitag, Grangier, and Caswell 2020) can be responsible for the mismatch. One solution to ensure data diversity is to crowd-source multiple references (Freitag, Grangier, and Caswell 2020). Lyu et al. (2021) and Alva-Manchego et al. (2020) released text simplification multi-reference corpora annotated with various simplification transformations, and the Newsela corpus for general text simplification was annotated for different grades of education (Xu, Callison-Burch, and Napoles 2015). Multi-references would also be useful in the medical domain for personalization (Paetzold and Specia 2016; Su et al. 2021).

To assess whether crowd-sourcing is a valid option for quality checking and multi-reference generation of medical texts, we conducted an internal test between two coauthors of this paper. Both authors had high school biology in English. One author consumes medical information weekly, from scientific articles, popular science news, and blogs, and communicates with a medical practitioner online; the other uses Google search infrequently, for medical symptom lookup only. We sampled 60 sentence pairs from MSD: 20 with longer simple texts, 20 with longer expert texts, and 20 where the simple and expert texts have a similar number of tokens. We asked each author to indicate agreement with several statements covering content preservation, coverage, textual simplicity, concept simplicity, and fluency of the simple text, e.g.:

• The simple sentence explains all the unknown concepts adequately.
• The simple sentence removes all redundancy and covers only the key point in the reference sentence.
• I cannot think of an alternative way to simplify it.

The average Krippendorff's alpha (Krippendorff 2011) across the 10 quality questions, between the two authors, was 0.299 ± 0.048.
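For reference, a sketch of the agreement computation using the krippendorff package; the package choice and the ratings shown are our assumptions, illustrative rather than the study's data.

import numpy as np
import krippendorff  # pip install krippendorff

# reliability_data has one row per rater and one column per rated item;
# np.nan marks a missing rating. The values below are illustrative only.
ratings = np.array([
    [4, 2, 3, 5, np.nan],  # rater A
    [3, 2, 4, 4, 1],       # rater B
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")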
The results show high disagreement between the authors, questioning the plausibility of reliable human evaluation and crowd-sourcing of medical texts. However, in the absence of crowd-sourcing, we cannot generate diverse enough data to train and validate models with good generalizability.

Can laymen assess the simplification quality and provide alternative references?

To test this question, we conducted a pilot study with two users, in which we iterated on a few different designs for layman evaluation of the MSD validation data. The users had high school biology in English but minimal experience of consuming medical information online. We found that the users were unmotivated to read the entire expert text because of the jargon, resulting in an inability to judge the quality of the simplification. More importantly, some of the ratings changed after the texts were explained to the users. A prominent artifact of data scraping and automatic alignment was a change in the subject of the text, which confused the evaluation. Consider this text pair. Expert: "In adults, BMI, defined as weight (kg) divided by the square of the height (m2), is used to screen for overweight or obesity (see table Body Mass Index (BMI)): Overweight = 25 to 29.9 kg/m2; Obesity = ≥ 30 kg/m2." Simple: "Obesity is diagnosed by determining the BMI." BMI is the subject of the former, while obesity is the subject of the latter. When asked whether they were confident that they could rewrite the simplification better, the users gave a unanimous yes.

We concluded that only experts have the ability to comprehend which sections of the expert texts are useful for laymen, while only laymen and trained writers can validate whether the simple versions are readable and meaningful. In other words, scaling up human evaluation and annotation, in this case, calls for well-designed collaboration between experts and laymen.

Expert-layman collaboration

We delineated various potential formats of expert-layman collaboration. The experts could be MD and biomedical students, physicians, and nurses directly, or they could be models of expert behavior. The simplest approach would be to show definitions of the UMLS concepts, but we found that these concepts are not always accessible to laymen. Other researchers have used Google's "define:" to improve the readability of medical texts (Elhadad 2006). Some potential expert-layman collaborations could look like the following: show examples of text pairs rated by experts, along with their rationale behind the rating and their corrections to unacceptable simplifications; or ask experts to generate a question from the expert text and ask a layman to answer the question after reading the simple version of the text. The expert-generated question is based on the key content of the expert text, so the layman must understand the content of the simple text to answer it. We could also use limited expert-annotated data to model expert behavior in terms of extracting key concepts from texts, identifying concepts that need elaborations, and so on. Such a model can be leveraged to improve layman evaluations.

Discussion

Automatic medical text simplification can contribute to improving health literacy by assisting providers with patient-friendly communication, improving health data search, and making online medical texts more accessible. However, it is challenging to create a large annotated and parallel corpus for this task, unlike for non-medical texts. In this paper, we identified the existing corpora for training automatic text simplification models and analyzed their quality and diversity using several automatic metrics. We found that taking snapshots from expert and consumer articles that are not aligned can lead to a poor-quality parallel corpus. We also assessed the potential of leveraging crowd-sourcing for large-scale model evaluation and data annotation for this task. We found that laymen evaluate medical texts very differently depending upon their exposure to medical information. We proposed some crowd-sourcing solutions that could use expert-layman collaboration. In future work, we plan to explore such collaborative data curation and annotation in practice. Another exciting research avenue would be to train controllable simplification models that can interface with and learn from these two stakeholders.

References

Alva-Manchego, F.; Martin, L.; Bordes, A.; Scarton, C.; Sagot, B.; and Specia, L. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4668–4679. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.424. URL https://aclanthology.org/2020.acl-main.424.

Alva-Manchego, F.; Scarton, C.; and Specia, L. 2020. Data-driven sentence simplification: Survey and benchmark. Computational Linguistics 46(1): 135–187.

Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.

Bodenreider, O. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1): D267–D270.

Cao, Y.; Shui, R.; Pan, L.; Kan, M.-Y.; Liu, Z.; and Chua, T.-S. 2020. Expertise style transfer: A new task towards better communication between experts and laymen. arXiv preprint arXiv:2005.00701.

Choi, E.; Palomaki, J.; Lamm, M.; Kwiatkowski, T.; Das, D.; and Collins, M. 2021. Decontextualization: Making Sentences Stand-Alone. Transactions of the Association for Computational Linguistics 9: 447–461.
Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, D. S. 2020. SPECTER: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180.

Coster, W.; and Kauchak, D. 2011. Simple English Wikipedia: A New Text Simplification Task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 665–669. Portland, Oregon, USA: Association for Computational Linguistics. URL https://aclanthology.org/P11-2117.

Devaraj, A.; Marshall, I.; Wallace, B.; and Li, J. J. 2021. Paragraph-level Simplification of Medical Texts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4972–4984. Online: Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main.395. URL https://aclanthology.org/2021.naacl-main.395.

Elhadad, N. 2006. Comprehending technical texts: Predicting and defining unfamiliar terms. In AMIA Annual Symposium Proceedings, volume 2006, 239. American Medical Informatics Association.

Freitag, M.; Grangier, D.; and Caswell, I. 2020. BLEU might be Guilty but References are not Innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 61–71. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.5. URL https://aclanthology.org/2020.emnlp-main.5.

He, R.; and McAuley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517.

HuggingFace. 2021. The AI community building the future. URL https://huggingface.co/.

Hwang, W.; Hajishirzi, H.; Ostendorf, M.; and Wu, W. 2015. Aligning sentences from standard Wikipedia to Simple Wikipedia. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 211–217.

Ippolito, D.; Kriz, R.; Kustikova, M.; Sedoc, J.; and Callison-Burch, C. 2019. Comparison of diverse decoding methods from conditional language models. arXiv preprint arXiv:1906.06362.

Kadic, A. J.; Fidahic, M.; Vujcic, M.; Saric, F.; Propadalo, I.; Marelja, I.; Dosenovic, S.; and Puljak, L. 2016. Cochrane plain language summaries are highly heterogeneous with low adherence to the standards. BMC Medical Research Methodology 16(1): 1–4.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882. URL http://arxiv.org/abs/1408.5882.

Kincaid, J. P.; Fishburne Jr, R. P.; Rogers, R. L.; and Chissom, B. S. 1975. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Technical report, Naval Technical Training Command, Millington, TN, Research Branch.

King, A. 2010. Poor health literacy: a 'hidden' risk factor. Nature Reviews Cardiology 7(9): 473–474.

Krippendorff, K. 2011. Computing Krippendorff's alpha-reliability.

Li, J. J.; and Nenkova, A. 2015. Fast and accurate prediction of sentence specificity. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Lyu, Y.; Liang, P. P.; Pham, H.; Hovy, E.; Póczos, B.; Salakhutdinov, R.; and Morency, L.-P. 2021. StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer. arXiv preprint arXiv:2104.05196.

Ma, Q.; Wei, J.; Bojar, O.; and Graham, Y. 2019. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), 62–90. Florence, Italy: Association for Computational Linguistics. doi:10.18653/v1/W19-5302. URL https://aclanthology.org/W19-5302.

Madaan, A.; Setlur, A.; Parekh, T.; Poczos, B.; Neubig, G.; Yang, Y.; Salakhutdinov, R.; Black, A. W.; and Prabhumoye, S. 2020. Politeness Transfer: A Tag and Generate Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1869–1881. Online: Association for Computational Linguistics. doi:10.18653/v1/2020.acl-main.169. URL https://aclanthology.org/2020.acl-main.169.

Martin, L.; Humeau, S.; Mazaré, P.; Bordes, A.; de la Clergerie, É. V.; and Sagot, B. 2019. Reference-less Quality Estimation of Text Simplification Systems. CoRR abs/1901.10746. URL http://arxiv.org/abs/1901.10746.

Martin, L.; Humeau, S.; Mazaré, P.-E.; de La Clergerie, É.; Bordes, A.; and Sagot, B. 2018. Reference-less Quality Estimation of Text Simplification Systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), 29–38. Tilburg, the Netherlands: Association for Computational Linguistics. doi:10.18653/v1/W18-7005. URL https://aclanthology.org/W18-7005.

McIlwain, C.; Santesso, N.; Simi, S.; Napoli, M.; Lasserson, T.; Welsh, E.; Churchill, R.; Rader, T.; Chandler, J.; Tovey, D.; et al. 2014. Standards for the reporting of Plain Language Summaries in new Cochrane Intervention Reviews (PLEACS).

Morris, J.; Lifland, E.; Yoo, J. Y.; Grigsby, J.; Jin, D.; and Qi, Y. 2020. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 119–126.

NIA. 2018. Online Health Information: Is It Reliable? URL https://www.nia.nih.gov/health/online-health-information-it-reliable.

Nye, B.; Li, J. J.; Patel, R.; Yang, Y.; Marshall, I. J.; Nenkova, A.; and Wallace, B. C. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, volume 2018, 197. NIH Public Access.

Paetzold, G.; and Specia, L. 2016. Anita: An Intelligent Text Adaptation Tool. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 79–83. Osaka, Japan: The COLING 2016 Organizing Committee. URL https://aclanthology.org/C16-2017.

Paetzold, G. H.; and Specia, L. 2017. A survey on lexical simplification. Journal of Artificial Intelligence Research 60: 549–593.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Prabhumoye, S.; Tsvetkov, Y.; Salakhutdinov, R.; and Black, A. W. 2018. Style transfer through back-translation. arXiv preprint arXiv:1804.09000.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.

Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Savery, M.; Abacha, A. B.; Gayen, S.; and Demner-Fushman, D. 2020. Question-driven summarization of answers to consumer health questions. Scientific Data 7(1): 1–9.

Senter, R.; and Smith, E. A. 1967. Automated readability index. Technical report, Cincinnati University, OH.

Shafee, T.; Masukume, G.; Kipersztok, L.; Das, D.; Häggström, M.; and Heilman, J. 2017. Evolution of Wikipedia's medical content: past, present and future. J Epidemiol Community Health 71(11): 1122–1129.

Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. arXiv preprint arXiv:1705.09655.

Siddharthan, A. 2014. A survey of research on text simplification. ITL-International Journal of Applied Linguistics 165(2): 259–298.

Soldaini, L.; and Goharian, N. 2016. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR, 1–4.

Su, L.; Duan, N.; Cui, E.; Ji, L.; Wu, C.; Luo, H.; Liu, Y.; Zhong, M.; Bharti, T.; and Sacheti, A. 2021. GEM: A General Evaluation Benchmark for Multimodal Tasks. arXiv preprint arXiv:2106.09889.

Subramanian, S.; Lample, G.; Smith, E. M.; Denoyer, L.; Ranzato, M.; and Boureau, Y.-L. 2018. Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552.

Van den Bercken, L.; Sips, R.-J.; and Lofi, C. 2019. Evaluating neural text simplification in the medical domain. In The World Wide Web Conference, 3286–3292.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7: 625–641.

Woodsend, K.; and Lapata, M. 2011. Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 409–420. Edinburgh, Scotland, UK: Association for Computational Linguistics. URL https://aclanthology.org/D11-1038.

Xu, W.; Callison-Burch, C.; and Napoles, C. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics 3: 283–297.

Yang, Z.; Hu, Z.; Dyer, C.; Xing, E. P.; and Berg-Kirkpatrick, T. 2018. Unsupervised text style transfer using language models as discriminators. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 7298–7309.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Zhong, Y.; Jiang, C.; Xu, W.; and Li, J. J. 2020. Discourse level factors for sentence deletion in text simplification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9709–9716.

Zhu, Z.; Bernhard, D.; and Gurevych, I. 2010. A Monolingual Tree-based Translation Model for Sentence Simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 1353–1361. Beijing, China: Coling 2010 Organizing Committee. URL https://aclanthology.org/C10-1152.