Automatic Medical Text Simplification: Challenges of Data Quality and Curation

Chandrayee Basu1, Rosni Vasu2, Michihiro Yasunaga1, Sohyeong Kim1, Qian Yang3
1 Stanford University (cbasu@stanford.edu), 2 University of Zurich, 3 Cornell University

Abstract

Health literacy is the degree to which individuals can comprehend the basic health information needed to make appropriate health decisions. The topmost reason for low health literacy is the vocabulary gap between providers and patients. Automatic medical text simplification can contribute to improving health literacy by assisting providers with patient-friendly communication, improving health data search, and making online medical texts more accessible. It is, however, extremely challenging to curate a quality corpus for this natural language processing (NLP) task. In this position paper, we observe that, despite recent research efforts, existing open corpora for medical text simplification are poor in quality and size. In order to match the progress in general text simplification and style transfer, we must leverage careful crowd-sourcing. We discuss the challenges of naive crowd-sourcing, and we propose that careful crowd-sourcing for medical text simplification is possible when combined with automatic data labeling, a well-designed expert-layman collaboration framework, and context-dependent crowd-sourcing instructions.

Low health literacy has been associated with non-adherence to treatment plans and regimens, poor patient self-care, lack of timely communication of health issues, and increased risk of hospitalization and mortality (King 2010). Simplification of medical documents and of online communications, such as email messages and patient instructions, can go a long way toward mitigating health literacy challenges. While the consumer versions of medical journals, news articles, and a few trusted websites (NIA 2018; Savery et al. 2020) are written by trained experts, they are by no means exhaustive. Automated approaches are necessary to keep pace with the rapidly growing body of biomedical literature. In this work, we evaluate some of the open corpora that power automated text simplification in the medical domain.

We define text simplification, following Siddharthan (2014), as the process of reducing the linguistic complexity of a text while still retaining the original information content and meaning. A domain-specific expert text undergoes various kinds of transformations to reach the final simple form. Research in automatic non-medical text simplification has been burgeoning, with the introduction of large parallel corpora (Zhu, Bernhard, and Gurevych 2010; Woodsend and Lapata 2011; Coster and Kauchak 2011; Xu, Callison-Burch, and Napoles 2015; Paetzold and Specia 2017). The creation of multi-references enabled models that can learn different kinds of textual transformations separately, viz. lexical changes (e.g., paraphrasing), syntactic modifications (e.g., reordering of concepts, splitting texts, reducing sentence length), and compression (e.g., deleting peripheral information irrelevant to the target domain) (Alva-Manchego et al. 2020).

References are gold-standard, human-generated simplifications used to validate model outputs. The success of automatic text simplification and style transfer hinges on large amounts of crowd-sourced multiple references. However, crowd-sourcing even a single set of references for medical texts is challenging: it requires recruiting a specific sub-population with a certain degree of domain expertise. For example, Nye et al. (2018) described an elaborate process of recruiting MDs and medical experts from Upwork for PICO data annotation. Naturally, we observe a dearth of high-quality parallel training corpora in medical AI. Furthermore, the text simplification task poses an additional challenge: only the expert knows what content of the domain-specific text is relevant to the laymen, whereas the laymen, or medical writers trained to translate medical texts, can judge the quality and accessibility of the simplified versions.
In this work, we make the following contributions:
• identify the open-source datasets for medical text simplification;
• characterize the datasets by their quantity, quality, diversity, and representativeness;
• identify challenges of scaling high-quality corpus generation for medical text simplification.

Assumptions: We treat summarization as a subset of text simplification. We only consider, for further analysis, corpora that represent composite textual transformations, i.e., where the simple text is derived through a combination of syntactic, semantic, thematic, and lexical transformations of the expert text (Lyu et al. 2021).

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Datasets for Medical Text Simplification

Datasets for medical text simplification support two kinds of document simplification: sentence-level and paragraph-level. We focus on sentence-level and short paragraph-level simplification. After an elaborate search, we found three datasets in English for medical text simplification: two parallel corpora, SIMPWIKI (Van den Bercken, Sips, and Lofi 2019) and PARASIMP (Devaraj et al. 2021), and one non-parallel corpus, MSD (Cao et al. 2020).

Next, we delve deeper into how these datasets are created and into the potential artifacts of the data collection and annotation processes.

Artifacts of Corpus Curation

In the absence of reliable crowd-sourcing of medical texts, researchers resort to crawling medical websites. The expert texts are sampled from the online articles and checked post hoc for adequate corpus representativeness. The layman texts are retrieved from the layman or consumer versions of the professional articles, based on the alignment of section titles and text content. The alignment is either checked manually for a small fraction of the corpus or derived automatically using different algorithms. Only a few of the automatically aligned pairs are validated by experts. Automatic alignment is not always reasonable (Alva-Manchego, Scarton, and Specia 2020). Random sampling of expert texts from larger articles and unreliable automatic retrieval can lead to text pieces that are not stand-alone (Choi et al. 2021). We found that the process of expert verification is insufficient for quality data curation and could still lead to pairs lacking correspondence. On the other end, models trained using highly aligned text pairs may exhibit limited generalizability.
A more recent trend is to generate large volumes of non-parallel corpora, obviating the validation of automatically aligned pairs. This follows similar approaches in non-medical text style transfer (Shen et al. 2017; He and McAuley 2016; Madaan et al. 2020). Some researchers distinguish between text simplification and text style transfer; we consider text simplification a sub-domain of text style transfer in which the goal is to transform text from the expert style to the layman style.

Datasets

Van den Bercken, Sips, and Lofi (2019) contributed the first publicly available medical text simplification corpus, which, like Cao et al. (2020), we refer to as SIMPWIKI. The authors created three subsets: fully-aligned expert, a medical subset of the Wikipedia data from Hwang et al. (2015), gleaned using QuickUMLS (Soldaini and Goharian 2016) for NER and later validated by experts; and partly-aligned expert and fully-aligned automatic, texts from Wikipedia and Simple Wikipedia aligned using BLEU score (Papineni et al. 2002). Fully aligned text pairs have strong one-to-one correspondence; partly aligned simple texts cover the expert text entirely but contain additional facts. This dataset has 9,212 expert-layman pairs. The texts are ≤ 128 tokens long.

MSD is a non-parallel corpus derived from the Merck Manuals, a trusted health reference for 100 years covering a wide range of medical topics. For each topic, the manual contains a consumer version and an expert version of the text, making it an ideal candidate for text simplification corpus curation. This dataset offers wide coverage of medical topics and medical PICO elements (Cao et al. 2020). The authors scraped raw consumer and professional texts from the MSD website, split them into sentences, identified parallel groups by matching document titles and subsection titles, and picked linked sentences from the matched sections of the articles. The resulting text pairs were validated by non-native English speakers; the annotators used native-language translations to speed up annotation. The text pairs are also annotated with UMLS concepts (Bodenreider 2004) for domain knowledge. MSD has 130,349 expert texts and 114,674 layman texts in the non-parallel training set, and 675 expert-layman pairs for validation. The texts are ≤ 245 tokens long.

We also considered a paragraph-level simplification corpus (Devaraj et al. 2021). The corpus consists of technical abstracts of biomedical systematic reviews and corresponding plain-language summaries (PLS) from the Cochrane Database of Systematic Reviews (McIlwain et al. 2014). The PLS are written in simple English; they usually represent the key essence of the abstracts and are structured heterogeneously (Kadic et al. 2016). We decided to exclude this corpus from our analysis due to the abstractive-summary nature of its layman versions.

The size of the parallel corpora is extremely small compared to those for non-medical text simplification, where the median corpus size is 154K (Alva-Manchego, Scarton, and Specia 2020).

Automatic Dataset Quality Assessment

We assessed MSD and SIMPWIKI for their overall quality, diversity, and representativeness. We define these terms as follows. Quality: grammatical correctness, average readability score, and adherence to domain-specific styles. Diversity: coverage of the various transformations that text simplification entails in the medical domain (different from the diversity of language generation (Ippolito et al. 2019)). Representativeness: coverage of various medical sub-domains (e.g., gynecology, neurology, cardiology) and topics (e.g., symptoms, signs, treatments).

Metrics

We measured the above features separately for the parallel and non-parallel corpora.

Quality: For grammatical correctness, we used the average acceptability score returned by textattack's RoBERTa-based classifier for CoLA (Morris et al. 2020; Warstadt, Singh, and Bowman 2019; HuggingFace 2021).
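As a rough illustration, this scoring can be sketched in a few lines of Python. The checkpoint name and the convention that LABEL_1 means "acceptable" are assumptions about the public TextAttack release on the Hugging Face hub, not details reported here; verify against the model card before use.

```python
# Sketch: mean CoLA acceptability over a corpus, assuming the public
# "textattack/roberta-base-CoLA" checkpoint with LABEL_1 = acceptable.
from transformers import pipeline

cola = pipeline("text-classification",
                model="textattack/roberta-base-CoLA", top_k=None)

def mean_acceptability(sentences):
    scores = []
    for preds in cola(sentences, truncation=True):
        # preds holds one {"label", "score"} dict per class for a sentence
        scores.append(next(p["score"] for p in preds if p["label"] == "LABEL_1"))
    return sum(scores) / len(scores)

print(mean_acceptability(["The infection resolved after treatment.",
                          "Infection the after resolved treatment."]))
```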
We computed the readability of the two corpora in terms of Flesch Reading Ease, Flesch-Kincaid Grade Level (Kincaid et al. 1975), and the Automated Readability Index (ARI) (Senter and Smith 1967), similar to Li and Nenkova (2015) and Devaraj et al. (2021).
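A minimal sketch of the corpus-level readability computation follows; the `textstat` package is our illustrative choice, as the paper does not name its implementation of these formulas.

```python
# Sketch: corpus-level readability statistics (mean and standard deviation)
# using the `textstat` package, an assumed substitute implementation.
import statistics
import textstat

FORMULAS = {
    "Flesch Reading Ease": textstat.flesch_reading_ease,
    "Flesch-Kincaid Grade Level": textstat.flesch_kincaid_grade,
    "ARI": textstat.automated_readability_index,
}

def readability_stats(texts):
    stats = {}
    for name, formula in FORMULAS.items():
        scores = [formula(t) for t in texts]
        stats[name] = (statistics.mean(scores), statistics.stdev(scores))
    return stats  # {metric: (mean, standard deviation)}
```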
We used classifiability, relative lexical complexity, and elaboration as metrics of domain-specific style. We measured classifiability by the test accuracy of a trained attribute model (Yang et al. 2018; Subramanian et al. 2018; Prabhumoye et al. 2018). Following Siddharthan (2014), we expect a good-quality simplified corpus to contain sufficient elaborations of technical concepts and jargon, and fewer low-frequency words. For computing classifiability, we trained a 1D CNN attribute model (Kim 2014) over GPT-2 embeddings (Radford et al. 2019).
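A sketch of such an attribute classifier is given below. The kernel sizes, filter counts, and the use of GPT-2's contextual hidden states (rather than its static embedding table) are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: Kim (2014)-style 1D CNN over frozen GPT-2 features for
# expert-vs-layman classification. Hyperparameters are illustrative.
import torch
import torch.nn as nn
from transformers import GPT2Model, GPT2Tokenizer

class AttributeCNN(nn.Module):
    def __init__(self, emb_dim=768, n_filters=100, kernel_sizes=(3, 4, 5), n_classes=2):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(n_filters * len(kernel_sizes), n_classes)

    def forward(self, emb):                       # emb: (batch, seq, emb_dim)
        x = emb.transpose(1, 2)                   # Conv1d wants (batch, emb_dim, seq)
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))  # logits: expert vs. layman

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token         # GPT-2 has no pad token
encoder = GPT2Model.from_pretrained("gpt2").eval()

def embed(texts):
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():                         # frozen feature extractor
        return encoder(**batch).last_hidden_state

logits = AttributeCNN()(embed(
    ["Dyspnea on exertion was noted.", "He felt short of breath."]))
```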
We report how many elaborations are present in the simple texts of the corpora using thresholded cosine similarity between Sentence-BERT embeddings (Zhong et al. 2020; Reimers and Gurevych 2019) of the text pairs: we embedded each sentence of the simple text and of the expert text and computed pairwise alignments. We used Sentence-BERT because it is tuned on several corpora, including SciDocs (Cohan et al. 2020), to embed sentences and short paragraphs, and because it performed better than competing models on several downstream tasks. That said, wherever possible, we avoided language-model-based metrics due to the mismatch between medical texts and the models' training data.
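The elaboration heuristic can be sketched as follows; the specific checkpoint and the 0.5 threshold are our illustrative choices, not the settings used for the numbers reported below.

```python
# Sketch: flag a pair as containing a probable elaboration when some sentence
# of the simple text aligns poorly with every expert sentence.
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("all-MiniLM-L6-v2")   # illustrative checkpoint

def has_elaboration(expert_sents, simple_sents, threshold=0.5):
    expert_emb = sbert.encode(expert_sents, convert_to_tensor=True)
    simple_emb = sbert.encode(simple_sents, convert_to_tensor=True)
    sims = util.cos_sim(simple_emb, expert_emb)   # (n_simple, n_expert)
    best = sims.max(dim=1).values                 # best-aligned expert sentence
    return bool((best < threshold).any())         # unaligned => new content
```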
Diversity: We argue that a quality corpus for text simplification should be diverse enough to accommodate the various textual transformations that domain-specific simplifications entail. These transformations can be lexical, semantic, and syntactic. Lexical transformations refer to the substitution of complex terms or phrases by more accessible ones and can also include elaborations (extensions) or explanations (intentions). Syntactic transformations are more style-dependent, such as formality change, voice change, and tense change. We measured the semantic diversity of the MSD validation data and of the entire SIMPWIKI corpus using Sentence-BERT-based corpus alignment.

We measured lexical and syntactic transformations using referenceless quality features from the EASSE library (Martin et al. 2018, 2019; Alva-Manchego et al. 2020), such as Levenshtein similarity, the proportions of words added, deleted, or kept, compression ratio, and lexical complexity ratio.
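A sketch of this computation follows, assuming EASSE exposes a corpus-level `corpus_quality_estimation` entry point; the exact API and the metric names it returns should be checked against the installed EASSE version. Per-pair calls let us also report means and standard deviations, as in Table 1 below.

```python
# Sketch: referenceless quality-estimation features over expert/layman pairs,
# assuming EASSE's corpus_quality_estimation() entry point.
import statistics
from easse.quality_estimation import corpus_quality_estimation

def qe_stats(expert_texts, layman_texts):
    # One call per aligned pair, so that mean and standard deviation can be
    # computed across pairs (the modification described above).
    per_pair = [corpus_quality_estimation([exp], [lay])
                for exp, lay in zip(expert_texts, layman_texts)]
    return {name: (statistics.mean(d[name] for d in per_pair),
                   statistics.stdev(d[name] for d in per_pair))
            for name in per_pair[0]}
```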
Results

Quality: Approximately 90% of the expert texts in MSD, and more than 97% of the MSD layman texts and of the SIMPWIKI texts, were deemed acceptable by the CoLA model. This means that ≈360 texts, in each of the expert and layman versions of the SIMPWIKI corpus, were not acceptable. 10% of the expert texts in MSD (11,460 texts) had a low acceptability score, possibly because of unique vocabulary and sentence structures, and incomplete references.

See Table 2 for the readability scores. We found discrepancies with the readability scores reported in Cao et al. (2020). A paired t-test shows that the expert and simple texts in both MSD and SIMPWIKI have statistically significant differences in readability, measured by Flesch Reading Ease, Flesch-Kincaid Grade, and the Automated Readability Index (p < 0.001). The minimum readability of medical texts is low compared to general English corpora (the minimum Flesch-Kincaid grade level is 11.9), as also observed by Devaraj et al. (2021).

Table 2: Quality metrics (mean ± standard deviation where applicable).

Metric                       MSD Test Expert   MSD Test Layman   SIMPWIKI Expert   SIMPWIKI Layman
Acceptability score          0.907             0.976             0.977             0.965
Flesch Reading Ease          17.44 ± 32.25     37.116 ± 28.19    30.07 ± 28.34     41.47 ± 29.12
Flesch-Kincaid Grade Level   15.2 ± 5.4        12.6 ± 5.7        14.4 ± 5.5        11.9 ± 5.1
ARI                          15.6 ± 6.4        13 ± 6.9          15.1 ± 6.7        12.4 ± 6.2
Lexical complexity           9.17 ± 0.087      9 ± 0.792         8.842 ± 0.79      8.695 ± 0.867
Elaboration (% of simple texts): MSD 27.4; SIMPWIKI 4.6 (auto full), 0.8 (exp full), 0.8 (exp part)
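The significance check above can be sketched with SciPy's paired t-test over per-pair readability scores; `textstat` and the Flesch-Kincaid grade are illustrative choices standing in for whichever implementation is used.

```python
# Sketch: paired t-test for the expert-vs-layman readability difference,
# assuming expert_texts[i] and layman_texts[i] form an aligned pair.
import textstat
from scipy.stats import ttest_rel

def readability_difference(expert_texts, layman_texts):
    expert = [textstat.flesch_kincaid_grade(t) for t in expert_texts]
    layman = [textstat.flesch_kincaid_grade(t) for t in layman_texts]
    return ttest_rel(expert, layman)   # returns t statistic and p-value

result = readability_difference(
    ["Dyspnea on exertion was noted at presentation.",
     "Bilateral pedal edema was observed on examination."],
    ["He felt short of breath when he arrived.",
     "Both of his feet were swollen."])
```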
Lexical complexity, computed using the EASSE package (Martin et al. 2019), represents the word-rank score distribution of a corpus. While the mean complexity of MSD is not very different between the expert and layman versions, a much lower standard deviation confirms that the expert texts have more rare words. SIMPWIKI has more common words than MSD in both the expert and layman versions, and the complexity varies across the corpus.

We measured the percentage of simple texts that potentially contain elaborations, both for MSD and separately for the differently aligned pairs of SIMPWIKI. Based on our coarse approach, we found a high proportion of elaborations in MSD, which is desirable. However, further human validation is required to confirm the relevance of these elaborations.

We trained two different attribute models for the classifiability check and did not notice a significant difference in the test accuracy of the two corpora: the accuracy was 0.88 for MSD and 0.81 for SIMPWIKI. Note that the training data size was significantly larger for MSD.

Diversity: We computed several referenceless text quality metrics using the EASSE library (Martin et al. 2019), with some modifications to output the mean, standard deviation, and standard error of the metrics, and used these automatic metrics as a proxy for simplification-related transformations (Table 1). An average compression ratio > 1 in MSD points to more elaborations and explanations (potentially irrelevant facts). A higher standard deviation of the compression ratio indicates more diversity in transformations. Higher additions in MSD indicate more domain-specific words (possibly more common words) being introduced in the simpler versions. Overall, we observe that MSD represents more textual transformations than SIMPWIKI.

Table 1: Transformation diversity metrics (layman-to-expert ratio metrics; mean ± standard deviation).

Metric                   MSD               SIMPWIKI
Compression ratio        1.257 ± 0.9       0.907 ± 0.46
Levenshtein similarity   0.519 ± 0.166     0.641 ± 0.219
Exact copies             0.029             0.07
Additions proportion     0.526 ± 0.254     0.304 ± 0.251
Deletions proportion     0.439 ± 0.244     0.421 ± 0.286
Added words              20.135 ± 18.144   7.951 ± 8.941
Deleted words            17.181 ± 20.504   12.16 ± 11.89
Kept words               11.914 ± 9.881    12.529 ± 9.5
Corpus alignment         0.428 ± 0.226     0.832 ± 0.161 (auto full); 0.862 ± 0.125 (exp full); 0.597 ± 0.163 (exp part)

Representativeness: We also checked which of the two corpora covers a wider range of medical subdomains and topics. Cao et al. (2020) already measured the representativeness of MSD by the distributions of its PICO elements (slightly different from the PICO elements in Nye et al. (2018)) and medical subdomains. SIMPWIKI being a subset of Wikipedia articles relevant to medical topics, we referred to Shafee et al. (2017) for its representativeness. There are 30,000 articles on medical topics in Wikipedia, rated for quality and importance by editors. The top-rated articles are on tuberculosis and pneumonia. High-importance articles include common diseases and treatments; mid-importance encompasses conditions, tests, drugs, anatomy, and symptoms; the remaining low-importance articles cover niche or peripheral medical topics such as laws, physicians, and rare conditions.

Human Data Quality Assessment

In the previous section, we used automatic metrics to evaluate the approximate quality and diversity of the corpora for medical text simplification. We found that MSD is potentially more diverse but also has lower acceptability, because of the sheer scale of the data and its unique vocabulary. The expert texts in MSD require a higher minimum reading grade. While this corpus seems to contain more elaborations in the validation set compared to SIMPWIKI, the elaborations cannot be explicitly learnt from the non-parallel training data. All of the above points to the need for further data collection and quality human annotation.

Crowd-sourcing

In many NLP tasks, it is customary to complement automatic model validations with human evaluations. A large body of work has been dedicated to analysing and correcting the mismatch between human judgement and automatic evaluation; researchers have found that both the metrics (Banerjee and Lavie 2005; Zhang et al. 2019; Ma et al. 2019) and artifacts of data collection (Freitag, Grangier, and Caswell 2020) can be responsible for the mismatch. One solution to ensure data diversity is to crowd-source multiple references (Freitag, Grangier, and Caswell 2020). Lyu et al. (2021) and Alva-Manchego et al. (2020) released text simplification multi-reference corpora annotated with various simplification transformations. The Newsela corpus for general text simplification was annotated for different grades of education (Xu, Callison-Burch, and Napoles 2015). Multi-references will also be useful in the medical domain for personalization (Paetzold and Specia 2016; Su et al. 2021).

To assess whether crowd-sourcing is a valid option for quality checking and multi-reference generation of medical texts, we conducted an internal test between two coauthors of this paper. Both authors had high-school biology in English. One author consumes medical information weekly from scientific articles, popular-science news, and blogs, and communicates with a medical practitioner online; the other uses Google search infrequently, for medical symptom lookup only. We sampled 60 sentences from MSD: 20 with longer simple texts, 20 with longer expert texts, and 20 where the simple and expert texts have a similar number of tokens. We asked each author to indicate agreement with several statements covering content preservation, coverage, textual simplicity, concept simplicity, and fluency of the simple text, e.g.:
• The simple sentence explains all the unknown concepts adequately.
• The simple sentence removes all redundancy and covers only the key point in the reference sentence.
• I cannot think of an alternative way to simplify it.

The average Krippendorff's alpha (Krippendorff 2011) across the 10 quality questions, between the two authors, was 0.299 ± 0.048. The results show high disagreement between the authors, questioning the plausibility of reliable human evaluation and crowd-sourcing of medical texts. However, in the absence of crowd-sourcing, we cannot generate diverse enough data to train and validate models with good generalizability.
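For reproducibility, the agreement computation can be sketched with the `krippendorff` package, which is our choice of implementation; the ratings shown are made-up illustrative values, not the study's data.

```python
# Sketch: inter-rater agreement for one quality question. Rows are raters,
# columns are rated items; ratings here are hypothetical ordinal Likert values.
import numpy as np
import krippendorff

ratings = np.array([
    [3, 4, 2, 5, 1, 3],   # author 1 (illustrative values)
    [2, 4, 1, 3, 2, 3],   # author 2 (illustrative values)
])
alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.3f}")
```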
Can laymen assess the simplification quality and provide alternative references?

To test this question, we conducted a pilot study with two users, in which we iterated on a few different designs of layman evaluation of the MSD validation data. The users had high-school biology in English but minimal experience of consuming medical information online. We found that the users were unmotivated to read the entire expert text because of the jargon, resulting in an inability to judge the quality of the simplification. More importantly, some of the ratings changed after the texts were explained to the users. A prominent artifact of data scraping and automatic alignment was a change in the subject of the text, which confused the evaluation. For example, consider this text pair. Expert: "In adults, BMI, defined as weight (kg) divided by the square of the height (m2), is used to screen for overweight or obesity (see table Body Mass Index (BMI)): Overweight = 25 to 29.9 kg/m2; Obesity = ≥ 30 kg/m2." Simple: "Obesity is diagnosed by determining the BMI." BMI is the subject in the former, and obesity is the subject in the latter. When asked whether they were confident that they could rewrite the simplification better, the users gave a unanimous yes.

We concluded that only experts have the ability to comprehend which sections of the expert texts are useful for laymen, while only laymen and trained writers can validate whether the simple versions are readable and meaningful. In other words, scaling up human evaluation and annotation, in this case, calls for a well-designed collaboration between experts and laymen.

Expert-layman collaboration

We delineate various potential formats of expert-layman collaboration. The experts could be MD and biomedical students, physicians, and nurses directly, or they could be models of expert behavior. The simplest approach would be to show definitions of the UMLS concepts; however, we found that these concepts are not always accessible to laymen. Other researchers have used Google's "define:" to improve the readability of medical texts (Elhadad 2006). A potential expert-layman collaboration could look like the following: show examples of text pairs rated by experts, together with the rationale behind their ratings and their corrections to unacceptable simplifications; or ask experts to generate a question from the expert text and ask a layman to answer the question after reading the simple version of the text. The expert-generated question is automatically based on the key content of the expert text, so the layman must understand the content of the simple text to answer it. We could also use limited expert-annotated data to model expert behavior in terms of extracting key concepts from texts, identifying concepts that need elaboration, and so on. Such a model can be leveraged to improve layman evaluations.

Discussion

Automatic medical text simplification can contribute to improving health literacy by assisting providers with patient-friendly communication, improving health data search, and making online medical texts more accessible. However, unlike for non-medical texts, it is challenging to create a large annotated and parallel corpus for this task. In this paper, we identified the existing corpora for training automatic text simplification models and analyzed their quality and diversity using several automatic metrics. We found that taking snapshots from expert and consumer articles that are not aligned can lead to a poor-quality parallel corpus. We also assessed the potential of leveraging crowd-sourcing for large-scale model evaluation and data annotation for this task, and found that laymen evaluate medical texts very differently depending upon their exposure to medical information. We proposed some crowd-sourcing solutions that could use expert-layman collaboration. In the future, we plan to explore such collaborative data curation and annotation in practice. Another exciting research avenue would be to train controllable simplification models that can interface with and learn from these two stakeholders.
References

Alva-Manchego, F.; Martin, L.; Bordes, A.; Scarton, C.; Sagot, B.; and Specia, L. 2020. ASSET: A Dataset for Tuning and Evaluation of Sentence Simplification Models with Multiple Rewriting Transformations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 4668–4679. doi:10.18653/v1/2020.acl-main.424. URL https://aclanthology.org/2020.acl-main.424.

Alva-Manchego, F.; Scarton, C.; and Specia, L. 2020. Data-driven sentence simplification: Survey and benchmark. Computational Linguistics 46(1): 135–187.

Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.

Bodenreider, O. 2004. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Research 32(suppl 1): D267–D270.

Cao, Y.; Shui, R.; Pan, L.; Kan, M.-Y.; Liu, Z.; and Chua, T.-S. 2020. Expertise style transfer: A new task towards better communication between experts and laymen. arXiv preprint arXiv:2005.00701.

Choi, E.; Palomaki, J.; Lamm, M.; Kwiatkowski, T.; Das, D.; and Collins, M. 2021. Decontextualization: Making Sentences Stand-Alone. Transactions of the Association for Computational Linguistics 9: 447–461.

Cohan, A.; Feldman, S.; Beltagy, I.; Downey, D.; and Weld, D. S. 2020. SPECTER: Document-level representation learning using citation-informed transformers. arXiv preprint arXiv:2004.07180.

Coster, W.; and Kauchak, D. 2011. Simple English Wikipedia: A New Text Simplification Task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 665–669. URL https://aclanthology.org/P11-2117.

Devaraj, A.; Marshall, I.; Wallace, B.; and Li, J. J. 2021. Paragraph-level Simplification of Medical Texts. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 4972–4984. doi:10.18653/v1/2021.naacl-main.395. URL https://aclanthology.org/2021.naacl-main.395.

Elhadad, N. 2006. Comprehending technical texts: Predicting and defining unfamiliar terms. In AMIA Annual Symposium Proceedings, volume 2006, 239. American Medical Informatics Association.

Freitag, M.; Grangier, D.; and Caswell, I. 2020. BLEU might be Guilty but References are not Innocent. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 61–71. doi:10.18653/v1/2020.emnlp-main.5. URL https://aclanthology.org/2020.emnlp-main.5.

He, R.; and McAuley, J. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web, 507–517.

HuggingFace. 2021. The AI community building the future. URL https://huggingface.co/.

Hwang, W.; Hajishirzi, H.; Ostendorf, M.; and Wu, W. 2015. Aligning sentences from standard Wikipedia to Simple Wikipedia. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 211–217.

Ippolito, D.; Kriz, R.; Kustikova, M.; Sedoc, J.; and Callison-Burch, C. 2019. Comparison of diverse decoding methods from conditional language models. arXiv preprint arXiv:1906.06362.

Kadic, A. J.; Fidahic, M.; Vujcic, M.; Saric, F.; Propadalo, I.; Marelja, I.; Dosenovic, S.; and Puljak, L. 2016. Cochrane plain language summaries are highly heterogeneous with low adherence to the standards. BMC Medical Research Methodology 16(1): 1–4.

Kim, Y. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882. URL http://arxiv.org/abs/1408.5882.

Kincaid, J. P.; Fishburne Jr., R. P.; Rogers, R. L.; and Chissom, B. S. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Technical report, Naval Technical Training Command, Millington, TN, Research Branch.

King, A. 2010. Poor health literacy: a 'hidden' risk factor. Nature Reviews Cardiology 7(9): 473–474.

Krippendorff, K. 2011. Computing Krippendorff's alpha-reliability.

Li, J. J.; and Nenkova, A. 2015. Fast and accurate prediction of sentence specificity. In Twenty-Ninth AAAI Conference on Artificial Intelligence.

Lyu, Y.; Liang, P. P.; Pham, H.; Hovy, E.; Póczos, B.; Salakhutdinov, R.; and Morency, L.-P. 2021. StylePTB: A Compositional Benchmark for Fine-grained Controllable Text Style Transfer. arXiv preprint arXiv:2104.05196.

Ma, Q.; Wei, J.; Bojar, O.; and Graham, Y. 2019. Results of the WMT19 Metrics Shared Task: Segment-Level and Strong MT Systems Pose Big Challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), 62–90. doi:10.18653/v1/W19-5302. URL https://aclanthology.org/W19-5302.

Madaan, A.; Setlur, A.; Parekh, T.; Poczos, B.; Neubig, G.; Yang, Y.; Salakhutdinov, R.; Black, A. W.; and Prabhumoye, S. 2020. Politeness Transfer: A Tag and Generate Approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 1869–1881. doi:10.18653/v1/2020.acl-main.169. URL https://aclanthology.org/2020.acl-main.169.

Martin, L.; Humeau, S.; Mazaré, P.-E.; de La Clergerie, É.; Bordes, A.; and Sagot, B. 2018. Reference-less Quality Estimation of Text Simplification Systems. In Proceedings of the 1st Workshop on Automatic Text Adaptation (ATA), 29–38. doi:10.18653/v1/W18-7005. URL https://aclanthology.org/W18-7005.

Martin, L.; Humeau, S.; Mazaré, P.-E.; Bordes, A.; de la Clergerie, É. V.; and Sagot, B. 2019. Reference-less Quality Estimation of Text Simplification Systems. CoRR abs/1901.10746. URL http://arxiv.org/abs/1901.10746.

McIlwain, C.; Santesso, N.; Simi, S.; Napoli, M.; Lasserson, T.; Welsh, E.; Churchill, R.; Rader, T.; Chandler, J.; Tovey, D.; et al. 2014. Standards for the reporting of Plain Language Summaries in new Cochrane Intervention Reviews (PLEACS).

Morris, J.; Lifland, E.; Yoo, J. Y.; Grigsby, J.; Jin, D.; and Qi, Y. 2020. TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 119–126.

NIA. 2018. Online Health Information: Is It Reliable? URL https://www.nia.nih.gov/health/online-health-information-it-reliable.

Nye, B.; Li, J. J.; Patel, R.; Yang, Y.; Marshall, I. J.; Nenkova, A.; and Wallace, B. C. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In Proceedings of the Conference. Association for Computational Linguistics. Meeting, volume 2018, 197. NIH Public Access.

Paetzold, G.; and Specia, L. 2016. Anita: An Intelligent Text Adaptation Tool. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 79–83. URL https://aclanthology.org/C16-2017.

Paetzold, G. H.; and Specia, L. 2017. A survey on lexical simplification. Journal of Artificial Intelligence Research 60: 549–593.

Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

Prabhumoye, S.; Tsvetkov, Y.; Salakhutdinov, R.; and Black, A. W. 2018. Style transfer through back-translation. arXiv preprint arXiv:1804.09000.

Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I.; et al. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1(8): 9.

Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Savery, M.; Abacha, A. B.; Gayen, S.; and Demner-Fushman, D. 2020. Question-driven summarization of answers to consumer health questions. Scientific Data 7(1): 1–9.

Senter, R.; and Smith, E. A. 1967. Automated Readability Index. Technical report, Cincinnati University, OH.

Shafee, T.; Masukume, G.; Kipersztok, L.; Das, D.; Häggström, M.; and Heilman, J. 2017. Evolution of Wikipedia's medical content: past, present and future. Journal of Epidemiology and Community Health 71(11): 1122–1129.

Shen, T.; Lei, T.; Barzilay, R.; and Jaakkola, T. 2017. Style transfer from non-parallel text by cross-alignment. arXiv preprint arXiv:1705.09655.

Siddharthan, A. 2014. A survey of research on text simplification. ITL-International Journal of Applied Linguistics 165(2): 259–298.

Soldaini, L.; and Goharian, N. 2016. QuickUMLS: a fast, unsupervised approach for medical concept extraction. In MedIR Workshop, SIGIR, 1–4.

Su, L.; Duan, N.; Cui, E.; Ji, L.; Wu, C.; Luo, H.; Liu, Y.; Zhong, M.; Bharti, T.; and Sacheti, A. 2021. GEM: A General Evaluation Benchmark for Multimodal Tasks. arXiv preprint arXiv:2106.09889.

Subramanian, S.; Lample, G.; Smith, E. M.; Denoyer, L.; Ranzato, M.; and Boureau, Y.-L. 2018. Multiple-attribute text style transfer. arXiv preprint arXiv:1811.00552.

Van den Bercken, L.; Sips, R.-J.; and Lofi, C. 2019. Evaluating neural text simplification in the medical domain. In The World Wide Web Conference, 3286–3292.

Warstadt, A.; Singh, A.; and Bowman, S. R. 2019. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics 7: 625–641.

Woodsend, K.; and Lapata, M. 2011. Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 409–420. URL https://aclanthology.org/D11-1038.

Xu, W.; Callison-Burch, C.; and Napoles, C. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics 3: 283–297.

Yang, Z.; Hu, Z.; Dyer, C.; Xing, E. P.; and Berg-Kirkpatrick, T. 2018. Unsupervised text style transfer using language models as discriminators. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 7298–7309.

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2019. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675.

Zhong, Y.; Jiang, C.; Xu, W.; and Li, J. J. 2020. Discourse-level factors for sentence deletion in text simplification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 9709–9716.

Zhu, Z.; Bernhard, D.; and Gurevych, I. 2010. A Monolingual Tree-based Translation Model for Sentence Simplification. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), 1353–1361. URL https://aclanthology.org/C10-1152.