Towards dataset creation and establishing baselines for sentence-level neural clinical paraphrase generation and simplification Viraj Adduru1*, Sadid A. Hasan2, Joey Liu2, Yuan Ling2, Vivek Datla2, Kathy Lee2, Ashequl Qadir2, Oladimeji Farri2 1 Rochester Institute of Technology, Rochester, NY, USA 2 Artificial Intelligence Lab, Philips Research North America, Cambridge, MA, USA {vra2128}@rit.edu, {firstname.lastname, kathy.lee_1, dimeji.farri}@philips.com Abstract other physicians [Qenam et al., 2017]. They contain com- plex medical terminologies that the patients are not familiar A paraphrase is a restatement of a text while retain- with. A recent study reported that allowing patients to ac- ing the meaning. Clinical paraphrasing involves re- cess their clinical notes has showed an improvement in their statement of sentences, paragraphs, or documents health care process [Kosten et al., 2012]. Realizing the need containing complex vocabulary used by clinicians. for increased inclusion of patients in their health care pro- Paraphrasing can result in an alternative text that is cess, large health care systems have allowed for the patients either simple or complex form of the original input to access their medical records [Delbanco et al., 2015]. text. Simplification is a form of paraphrasing in However, these medical records contain raw complex clini- which a sentence is restated into a linguistically cal text intended for the communication between medical simpler sentence yet retaining the meaning of the professionals. Paraphrasing or simplification of clinical text original sentence. Clinical text simplification has will improve the patients’ understanding of their health con- potential applications such as simplification of ditions and thereby play an important role in connecting clinical reports for patients towards better under- patients and caregivers across the clinical continuum to- standing of their clinical conditions. Deep learning wards better patient outcome. has emerged as a successful technique for various Traditional clinical paraphrasing and simplification ap- natural language understanding tasks precondi- proaches use lexical methods [Kandula et al., 2010; tioned with large annotated datasets. In this paper, Pivovarov and Elhadad, 2015; Qenam et al., 2017], which we propose a methodology to create preliminary are typically focused on identifying complex clinical words, datasets for clinical paraphrasing, and clinical text phrases, or sentences and replace them with their alterna- simplification to foster training of deep learning- tives in case of paraphrasing or simpler versions in case of based clinical paraphrase generation and simplifi- simplification. Lexical methods take advantage of cation models. knowledge sources like Unified Medical Language System (UMLS) metathesaurus [Lindberg et al., 1993] which con- 1 Introduction and related work tains grouped words and phrases that describe various medi- Paraphrasing (a.k.a. paraphrase generation) is transforming cal concepts. Simplification is traditionally performed by a text that can be a word, phrase, sentence, paragraph, or a mapping UMLS concepts to their alternatives provided in document, while retaining the meaning and content. For consumer health vocabulary (CHV) [Qenam et al., 2017]. example, the sentence ‘I am very well’ can be paraphrased Recently, paraphrase generation was casted as a monolin- as ‘I am doing great’. Paraphrasing can lead to a new text gual machine translation problem resulting in the develop- which may be simpler, more complex or at the same com- ment of data-driven methods using statistical machine trans- plexity level as the source text. The task of paraphrasing text lation (SMT) [Koehn, 2010], and neural machine translation to a simpler form is called simplification. In simplification, (NMT) principles [Koehn, 2017]. SMT methods [Quirk et the output text is a linguistically simplified version of the al., 2004; Wubben et al., 2010; Zhao et al., 2009] model the input text. Paraphrasing and simplification may have nu- conditional distributions of the words and phrases and re- merous applications such as document summarization, text place the phrases in the source text with the phrases that simplification for target audience e.g. children, and question maximize the probability of the resulting text. However, answering [Madnani and Dorr, 2010]. syntactic relationships are difficult to model using SMT In the clinical context, health care systems and medical methods. Monolingual NMT systems use neural network knowledge-bases contain large collections of texts that are architectures to model complex relationships by automati- often not comprehensible to the layman population. For cally learning from large datasets containing source and example, clinical texts like radiology reports are used by target text pairs, both belonging to the same language. Cur- radiologists to professionally communicate their findings to rent NMT systems for paraphrase generation or simplifica- tion [Brad and Rebedea, 2017; Hasan et al., 2016; Prakash —————— et al., 2016] use sequence-to-sequence networks based on  *This work was conducted as part of an internship program at encoder-decoder architectures. Unlike traditional methods, Philips Research. NMTs do not need semantic or syntactic rules to be explicit- nicians. We crawl the articles with same topics from two or ly defined. However, they need carefully constructed da- more web-based knowledge sources. Each sentence in a tasets that contain sufficient information to robustly train the topic (i.e. in an article) from one resource is mapped to sen- deep neural networks. tences belonging to the same topic from another resource(s) Existing clinical paraphrasing and simplification datasets using a one-to-many scheme to create all possible sentence are limited to short phrases. Hasan et al., (2016) trained an pair combinations. These sentence pairs essentially contain attention-based encoder-decoder model [Bahdanau et al., a large number of unrelated pairs from which meaningful 2015] using a dataset created by merging two word and paraphrasing pairs are identified. phrase level datasets: the paraphrase database (PPDB) Manual identification of the relevant paraphrase pairs is a [Pavlick et al., 2015] and the UMLS metathesaurus. They tedious task as the sentence pair combinations (as discussed showed that their model outperformed an upper bound para- above) contain a large number (in millions) of unrelated phrasing baseline. However, they used a phrasal dataset that sentence pairs. Therefore, we use an automated approach to does not contain more complex contextual knowledge like a identify the paraphrase pairs from the sentence pair combi- sentential dataset, and the ability of the network to simplify nations. Our method is similar to the approach by [Zhu et the clinical text was not explored. In contrast to paraphras- al., 2010]. They use TF-IDF [M. Shieber and Nelken, 2006] ing, simplification is a harder problem and may involve ad- metric to align sentences between Wikipedia and Simple- dition, deletion or splitting of sentences to suite the target Wikipedia knowledge sources to create sentence pairs for audience. These operations require additional knowledge the text simplification task. However, some studies, e.g. Xu that a dataset with longer sequences like sentences or para- et al., 2015, reveal the noisy nature of such datasets, which graphs could provide. Other studies [Brad and Rebedea, motivated us to explore various textual similarity/distance 2017; Prakash et al., 2016] have trained encoder-decoder metrics instead of relying on one single metric for sentence architectures with attention for paraphrasing using general alignment. Our intuition is that the strengths of a collection domain sentence level datasets like Microsoft Common Ob- of diverse metrics may be useful for better sentence align- jects in Context (MSCOCO) [Lin et al., 2014], Newsela ment. In addition to various existing metrics, we train a neu- [Xu et al., 2015] and Wikianswers [Fader et al., 2013]. They ral paraphrase identification model to estimate a similarity demonstrated that neural machine translation models suc- score between two sentences, which is also used as a sup- cessfully captured the complex semantic relationships from plementary sentence alignment metric. the general domain datasets. However, it is unclear how these networks would perform on complex clinical text. 2.2 Sentence alignment In this paper, our aim is to pioneer the creation of parallel Paraphrase pairs can be identified by computing various (with source and target pairs) sentential datasets for clinical sentence similarity/distance metrics between the two sen- paraphrase generation and simplification. Web-based un- tences in a pair. Various character-level and word-level met- structured knowledge sources like www.mayoclinic.com rics that we used are described below. contain articles on various medical topics. We obtain arti- cles with matching titles from different web-based Levenshtein distance knowledge sources and align the sentences using various Levenshtein distance [Levenshtein, 1966] is defined as the metrics to create paraphrase and simplification pairs. Addi- minimum number of string operations consisting of addi- tionally, we train NMT models using the prepared clinical tions, deletions, and substitutions of symbols that are neces- datasets and present baseline performance metrics for both sary to transform one string into another. Normalized Le- clinical paraphrase generation and simplification. venshtein distance (LDN) is computed by dividing the num- Next section outlines our approach to create clinical para- ber of string operations required by the length of the longer phrase generation and simplification datasets. First, we dis- string. Character- or word-level LDN is calculated by treat- cuss our proposed methodology for extracting sentence pairs ing characters or words as symbols respectively: from web-based clinical knowledge sources. Then we de- 𝑁 scribe various metrics to align the pairs of related sentences 𝐿𝐷𝑁 = (1) 𝑚𝑎𝑥 (𝑛, 𝑚) for dataset creation. Section 3 discusses the neural network architectures used for establishing baselines. Sections 4 and where N is the minimum number of string operations to 5 present the performance evaluation of the models and in transform a text x to y or vice versa, and n and m are the section 6 we conclude and discuss the future work. number of symbols in the texts x and y respectively. Damerau-Levenshtein distance 2 Approach Damerau-Levenshtein distance [Damerau, 1964] is similar to LDN and is defined as the minimum number of string 2.1 Paraphrase pairs from web-based resources operations needed to transform one string into the other. In addition to the string operations in Levenshtein distance, Web-based textual resources contain large collections of Damerau-Levenshtein distance further includes transposi- articles for various medical topics related to diseases, anat- tion of two adjacent symbols. Normalized Demerau- omy, treatment, symptoms etc. These articles are often tar- Levenshtein distance (DLDN) is calculated by dividing the geted for general (non-clinician) users and are easier to un- derstand unlike the complex clinical reports written by cli- number of string operations by the number of symbols in the Sorensen similarity longer string. Sorensen similarity (also called Sorensen-Dice coefficient) [Sørensen, 1948] is similar to Jaccard similarity and it is Optimal string alignment distance computed as the ratio of twice the number of common items Optimal string alignment distance [Herranz et al., 2011] is a (intersection) and the sum of number of items in the two variant of DLDN but under a restriction that no substring is strings. edited more than once. The normalized form is computed All the above metrics are used in their normalized forms similarly as in DLDN. (values between 0 to 1). These metrics calculate the simi- Jaro-Winkler distance larity/distance between the sentence pairs using the charac- Jaro-Winkler distance (JWD) [Winkler, 1990] computes the ter- or word-level overlap and the pattern of their occurrenc- distance between two strings, where the substitution of two es in the sentences. However, these metrics do not consider close symbols is considered more important than the substi- the presence of concepts (e.g. words or phrases) that are tution of two symbols that are far from each other. The Jaro- paraphrased using a different vocabulary (e.g. ‘glioma’ can Winkler distance JWD is given by: be paraphrased with its synonym ‘brain tumor’) and also do not perform well for sentences that differ by a few words 𝑑𝑗 , 𝑖𝑓 𝑑𝑗 < 0 resulting in contradicting sentences. Therefore, we need a 𝐽𝑊𝐷 = { (2) 𝑑𝑗 + 𝑘 𝑝 (1 − 𝑑𝑗 ), 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 similarity metric that can consider complex semantic rela- tionships between the concepts represented in the sentences. where k is the length of the common prefix at the start of Deep neural network architectures with recurrent neural the string up to 4 symbols, p is the constant usually set to networks (RNNs), and convolution neural networks (CNNs) 0.1 and dj is the Jaro distance given by: have so far demonstrated state-of-the-art performance [Conneau et al., 2017] in learning semantic associations 0 𝑖𝑓 𝑞 = 0 between the sentences. Therefore, deep-learning based sys- 𝑑𝑗 = {1 𝑞 𝑞 𝑞 − 𝑡 tems are increasingly being used for advanced natural lan- ( + + ), 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 (3) guage inferencing tasks like paraphrase identification, and 3 𝑛 𝑚 𝑞 textual entailment [Ghaeini et al., 2018], which motivated us where q is the number of matching words between the to create a neural paraphrase identification model for the two texts x and y with lengths 𝑛 and 𝑚 respectively and 𝑡 is purpose of supplementing our sentence similarity measures half of the number of transpositions. Jaro-Winkler distance for better sentence pair alignment. is a normalized quantity ranging from 0 to 1. 2.3 Paraphrase identification metric Longest common subsequence Neural paraphrase identification can be stated as a binary Longest common subsequence distance (LCSD) classification task in which a neural network model esti- [Bakkelund, 2009] is computed using the following equa- mates the probability that the two sentences are paraphrases. tion: This estimated probability can be used as a similarity metric 𝐿𝐶𝑆(𝑥, 𝑦) to align the sentence pairs. 𝐿𝐶𝑆𝐷 = 1 − (4) 𝑚𝑎𝑥(𝑛, 𝑚) Neural paraphrase identification where LCS (longest common subsequence) is the longest The network consists of stacked bidirectional long short- subsequence common to strings x and y with lengths n and term memory (BiLSTM) layers in a Siamese architecture m respectively. [Dadashov et al., 2017] (Figure 1). Each arm of the Siamese network consists of three stacked BiLSTM layers. The out- N-gram distance puts of the final BiLSTM layers of both the arms are con- N-gram is a contiguous sequence of n items from a given catenated and fed into the dense layer with ReLU activation sample of a text. N-gram distance [Kondrak, 2005] is simi- followed by a second dense layer with a sigmoid activation lar to computing LCS but in this case the symbols are n- function. We use a depth of 300 for all the BiLSTM layers grams. We used n = 4 in this paper. and the dense layers. The maximum sequence length of the Cosine similarity BiLSTM layers is set to 30. The words in the input sentenc- Cosine similarity between two strings is computed as the es are embedded using Word2Vec embeddings pre-trained cosine of the angle between the vector representation of two on the Google news corpus. strings (x and y) and is given by the equation: Hybrid dataset for paraphrase identification 𝑉𝑥 . 𝑉𝑦 Our paraphrase identification model is trained using a hy- 𝐶𝑆 = (5) brid corpus created by merging two paraphrase corpora: |𝑉𝑥 |. |𝑉𝑦 | Quora question pairs, and Paralex question pairs. The Quora Jaccard similarity question pair corpus [Iyer et al., 2017] consists of 404289 Jaccard similarity is calculated as the ratio of the intersec- question pairs with 149263 paraphrase pairs and 255027 tion to the union of the items in the two strings. non-paraphrase pairs. The Paralex dataset [Fader et al., 2013] consists of 35,692,309 question pairs, where all the question pairs are paraphrases of each other. The Paralex Paraphrase identification model for sentence alignment dataset is unbalanced as it does not contain any non- The probability score from our paraphrase identification paraphrase pairs. After merging the sentence pairs from both model for the predicted class is used along with the word- the corpora, we have 35692309 sentence pairs with and character-level similarity/distance metrics to calculate a 35437283 paraphrase pairs and only 255027 non-paraphrase mean similarity score. Note that, all normalized distance pairs. To balance the dataset, we identify the list of unique metrics are converted into similarity metrics by subtracting questions and then randomly select two questions from the the corresponding score from 1, thereby obtaining 12 differ- unique questions list and add the pair to the merged corpus ent similarity metrics. The mean similarity score is comput- as a non-paraphrase pair if the pair does not already exist. ed using the formula given below: Non-paraphrase pairs are created until the non-paraphrase, and paraphrase pairs are equal in number, resulting in a bal- 𝑚𝑒𝑎𝑛 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑠𝑐𝑜𝑟𝑒 = 𝑚𝑒𝑎𝑛 (𝑎𝑙𝑙 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑠𝑐𝑜𝑟𝑒s) (6) anced dataset of 70 million pairs. Minimum and maximum thresholds of 0.5 and 0.8 are empirically selected by observing the sentence pairs. Sen- tence pairs with a similarity score within these thresholds are considered paraphrase pairs. We use a maximum thresh- old of 0.8 to avoid selection of identical sentences. 2.4 Paraphrase generation dataset The paraphrase sentence pairs are obtained from three web- based unstructured knowledge sources: Wikipedia, Sim- pleWikipedia, and MayoClinic. These sentence pairs form our clinical paraphrase generation dataset, which is later used to train a baseline neural paraphrase generation model. Wikipedia and SimpleWikipedia SimpleWikipedia contains simplified versions of pages from the original Wikipedia. However, the text in the correspond- ing documents is unaligned (no sentence-to-sentence match- ing). Pairing sentences from Wikipedia to those in Sim- pleWikipedia leads to a parallel corpus for the simplification dataset as the later mostly contains simplified versions of the former. However, in the case of paraphrase generation, the resultant pairs can be swapped as paraphrasing applies in Figure 1. Paraphrase identification architecture. Gray arrows both directions. The swapping also helps augmenting the represent weight-sharing between the left and right BiLSTM. dataset. We create a parallel corpus using 164 matched titles from Wikipedia from clinically relevant categories such as Training anatomy, brain, disease, medical condition etc. Sentences The dataset is preprocessed by removing punctuations, nor- from each of 164 Wikipedia documents are paired with all malization with respect to case, and standard tokenization. the sentences from the documents with identical titles from The tokens are embedded using Word2Vec embeddings pre- SimpleWiki. Thus, we obtain 818520 sentence pairs for trained on Google news corpus [Mikolov et al., 2013]. which we compute similarity scores as discussed in the pre- Words that are not found in the pre-trained vocabulary are vious subsections. We finally obtain 1491 related sentence embedded with a zero-vector representing an UNK token. pairs after thresholding the mean similarity score and we Longer sentences (> 30 words) are truncated, and smaller name this parallel corpus as WikiSWiki. sentences (< 30 words) are padded with UNK tokens. As the Mayoclinic sentences are in a bidirectional relationship to each other, Mayoclinic contains pages for 48 identically matched titles the training pairs are swapped to increase the dataset size. from the 164 titles identified from Wikipedia. Unique sen- The dataset is split into 80%, 10% and 10% for training, tences from WikiSWiki were paired with the sentences ob- validation and testing respectively. tained from the pages with matched titles from Mayoclinic The paraphrase identification model is trained using Ad- and similarity scores are computed. Using the same thresh- am optimizer [Kingma and Ba, 2014] with Nesterov mo- olds as above, 3203 sentence pairs are selected. These pairs mentum [Nesterov, 1983] to optimize a binary cross entropy are added to the WikiSWiki corpus to form a corpus con- loss. The update direction is calculated using a batch size of taining 4694 sentence pairs; we name it WikiSwikiMayo. 512. We utilize early stopping using validation error with patience of 3 epochs to prevent overfitting. 2.5 Simplification dataset The network is trained for 18 epochs before early stop- The WikiSWiki is a simplification corpus as it mostly con- ping at 22 minutes per epoch. The validation accuracy of tains sentences mapped to their simpler forms. However, the our model is 95% and test accuracy is 97%. small number of sentence pairs may be insufficient for train- ing the network to learn complex relationships required for clinical text simplification. Therefore, we use additional web-based knowledge sources to increase the dataset size. Web-based knowledge sources: www.webmd.com (webmd) and www.medicinenet.com (medicinenet), are other clini- cal knowledge sources that are similar to MayoClinic. Through manual inspection, we found that webmd contains simpler sentences than medicinenet in many topics that we have examined, which is reasonable as medicinenet content Figure 2. Encoder-decoder architecture. x and y are the source is curated by clinicians. Therefore, we use them as addition- and target sequences respectively. al knowledge sources to create our simplification dataset. For paraphrase generation, the network is trained using For 164 topics from the WikiSWiki dataset we perform a the WikiSwikiMayo corpus containing 4694 sentence pairs. google search with ‘webmd’ and ‘medicinenet’ as additional search terms. The search returns 61314 sentences from The source and target sentences are swapped as paraphras- ing is bidirectional thereby, doubling the number of sen- webmd and medicinenet for all 164 topics. Sentences from tence pairs to 9388. The dataset is divided into training, val- medicinenet are paired to the sentences from SimpleWiki and webmd from the articles with matched titles. Sentences idation and test sets. The training sentence pairs that contain sentences from the source side of the test are removed to from Wikipedia articles are paired with sentences from prevent data leak issues. Same is repeated for the validation webmd separately as they are already paired with Sim- pleWikipedia. We obtain 714608 new sentence pairs result- set. Using this we make sure that any sentence occurs as a source sentence in exactly one of the sets (training, test or ing in 1002 final pairs after computing similarity scores and validation). The number of sentence pairs in training, test thresholding. These sentence pairs are merged with WikiS- Wiki dataset to create the monolingual clinical simplifica- and validation datasets are 6095, 611 and 611 respectively. The paraphrase generation network is trained for 10000 tion dataset containing 2493 sentence pairs. Although our steps with a batch size of 128 samples per step. final corpus contains a small number of sentence pairs, our main contribution in this paper is to introduce an automated The simplification corpus containing 2493 sentence pairs is used to train the simplification network. Vocabularies for method to create sentence pairs from web-based knowledge source and target are created separately in case of simplifi- sources, towards creating a large clinical simplification cor- pus in the future. cation. The source and target vocabularies are different in case of text simplification. As simplification is a unidirec- tional task, we do not use data swapping. We prevent data 3 Paraphrase generation and simplification leak issues using the same procedure as paraphrase genera- tion while splitting the data. The training, test and validation 3.1 Model sets contain 1918, 187 and 187 sentence pairs respectively. Sequence-to-sequence models using encoder-decoder archi- The simplification network is trained for 3500 steps. tecture with attention [Vinyals et al., 2015] (Figure 2) are trained for both paraphrase generation and simplification 4 Evaluation metrics tasks. The encoder and decoder are made of three stacked BLEU [Papineni et al., 2002], METEOR [Banerjee and RNN layers using BiLSTM cells and LSTM cells respec- Lavie, 2005] and translation error rate (TER) [Snover et al., tively. We use a cell depth of 1024 for all the layers in the 2006] are used to evaluate our models. These metrics are encoder and the decoder. The maximum sequence length is shown to correlate with human judgements for evaluating set to 50. The sentences are preprocessed, and the words are paraphrase generation models [Wubben et al., 2010]. BLEU encoded using one-hot vector encoding. The outputs of the looks for exact string matching using n-gram overlaps to decoder are projected onto the output vocabulary space us- evaluate the similarity between two sentences. METEOR ing a dense layer with a softmax activation function. uses WordNet to obtain synonymously related words to evaluate sentence similarity. Higher BLEU and METEOR 3.2 Training scores indicate higher similarity. TER score measures the The network parameters are optimized by minimizing a number of edits necessary to transform the source sentence sampled softmax loss function. The gradients are truncated to the target. Lower TER score indicates higher similarity. by limiting the global norm to 1. The network is trained using mini-batch gradient descent algorithm with batch size 5 Results and discussion of 128. An initial learning rate of 0.5 is used with a decay of 0.99 for every step. The training set is shuffled for every epoch. The networks are trained using 80% of the sentence 5.1 Sentence alignment pairs and validated on 10% and tested on 10%. Both models Table 1 presents a few examples of the aligned sentence are developed using Tensorflow, version 1.2, and two Tesla pairs for both clinical paraphrase generation and simplifica- K20 GPUs. tion. Mean to the same topic. Furthermore, using other similarity met- Clinical Paraphrase Generation Sim. rics that are based on word matching helps in overcoming Score this problem in cases where the paraphrase identification Example 1: Good metric is insensitive. We examined that this holds true in S1: No drug is currently approved for the treatment of small- majority of the pairs by visual inspection of the selected 0.52 pox. sentence pairs, for both the datasets. S2: No cure or treatment for smallpox exists Example 2: Acceptable 5.2 Paraphrase generation and simplification S1: Worldwide, breast cancer is the most common invasive Average quality scores on the test sets for the clinical para- cancer in women. 0.62 phrase generation and the text simplification models are S2: After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States presented in Table 2. These scores serve as baselines for clinical paraphrase generation and text simplification for the Example 3: Bad datasets that we have created. The quality metrics are lower S1: Gallbladder cancer is a rare type of cancer which forms for clinical text simplification than the paraphrase genera- in the gallbladder. 0.53 S2: At this stage, gallbladder cancer is confined to the inner tion. This is expected as in the case of paraphrase generation layers of the gallbladder many of the words from the source sentence can be retained in the paraphrased sentence whereas simplification involves Clinical Text Simplification complex transformations which results in different words in Example 1: Good the resulting sentence and hence the quality scores are low. S1: In Western cultures, ingestion of or exposure to peanuts, Further human evaluations are required to better rate the wheat, nuts, certain types of seafood like shellfish, milk, performance of the simplification model. and eggs are the most prevalent causes. 0.54 S2: In the Western world, the most common causes are eating Task BLEU METEOR TER or touching peanuts, wheat, tree nuts, shellfish, milk, and Clinical Paraphrase eggs. 9.4±0.5 15.1±0.3 108.7±1.5 Generation Example 2: Acceptable Clinical Text 9.9±1.6 10.6±0.8 97.7±2.9 S1: Together the bones in the body form the skeleton. 0.54 Simplification S2: The bones are the framework of the body. Table 2. Average scores computed over test sentence pairs. Example 3: Bad Few example outputs of the clinical paraphrase genera- S1: There are two major types of diabetes, called type 1 and type 2 0.54 tion and simplification system are presented in Table 3. The S2: There are other kinds of diabetes, like diabetes insipidus. examples show that both paraphrase generation and simpli- fication models retained the knowledge of the overall topic Table 1. Examples of aligned sentence pairs. Good represents that in the generated sentences. Example 2 in both models shows accepted sentences are paraphrases. Bad represents that accepted that, though the topic of the generated sentence matches sentences are not paraphrases. with the source, the sentence is not a paraphrase or the sim- plification respectively, as the context in the resultant sen- In Table 1, for both paraphrase generation and text sim- tence is different from that of the source. This may be be- plification tasks, though the similarity score between the cause of the failure in the alignment of the sentences while sentence pairs is similar across all the examples there is a creating the datasets. This shows that the paraphrase identi- large variability in the classification of the sentence pair. fication model and the metrics were not fully sufficient to This means there is an overlap between the distributions of pair the sentences accurately. In particular, the paraphrase the mean similarity score of the paraphrase pairs and the identification model trained on general domain question non-paraphrase pairs. Therefore, the selection of minimum pairs may not generalize well to identify paraphrase pairs in threshold less than 0.5 introduces more non-paraphrase pairs case of clinical texts. The solution may be using transfer into the dataset and by selecting the threshold more than 0.5 learning and training the paraphrase identification network we lose a large number of pairs that are paraphrases. One on a subset of human rated clinical paraphrases. desirable approach is to train a linear regression or any mul- ti-variate machine learning model to classify the paraphrase Clinical Example 1 Example 2 pairs using all the computed similarity metrics. However, Paraphrase Generation training such machine learning systems requires ground- Source dengue fever pro- Lung cancer often truth data and therefore is outside the scope of this paper. nounced den gay is an spreads (metastasiz- Our paraphrase identification system uses a vocabulary infectious disease caused es) to other parts of from the Google News corpus dataset. The words that are by the dengue virus the body, such as the not present in this vocabulary are assigned the UNK token. brain and the bones Therefore, the neural paraphrase identification network is not sensitive when two semantically similar sentences refer to different objects. However, this problem is minimized in our case as we pair the sentences from the pages belonging Target dengue fever is caused Primary lung cancers text during sentence alignment, which would help to create by any of the four den- themselves most cleaner datasets. gue viruses spread by commonly metasta- Previous research has found that existing simplification mosquitoes that thrive in size to the brain, datasets created using Wikipedia-like knowledge sources and near human lodgings bones, liver and ad- renal glands are noisy [Xu et al., 2015] as these knowledge sources are Generated Dengue fever is a mos- Lung cancer staging not created with a specific objective. However, task specific quito borne tropical is an assessment of datasets for clinical paraphrase generation and simplifica- disease caused by the the degree of spread tion do not exist as of writing this paper. Therefore, we ap- dengue virus of the cancer from its proached the creation of such datasets for clinical para- original source phrase generation and simplification using web-based Clinical Text knowledge sources. We hope that this serves as a starting Simplification point towards developing automated approaches for creating Source Diabetes is due to either Ventricular tachycar- task specific datasets using unstructured knowledge sources. the pancreas not produc- dia can be classified ing enough insulin or the based on its mor- cells of the body not phology 6 Conclusion and future work responding properly to This paper presents a preliminary work on automated meth- the insulin produced odology to create clinical paraphrase generation and simpli- Target Diabetes is the condition Ventricular tachycar- fication datasets. We use web-based knowledge sources and that results from lack of dia can be treated in a insulin in a person blood few different ways automatically align sentence pairs from matching topics to or when their body has a create the datasets. Additionally, these datasets are used to problem using the insu- train sequence-to-sequence models leveraging an encoder- lin it produces insulin decoder architecture with attention for paraphrase genera- resistance tion and simplification. Further research to improve string Generated Diabetes can occur when Ventricular tachycar- similarity metrics is required to accurately identify similar the pancreas produces dia can be caused by sentence pairs to create cleaner datasets. In future, we will very little to no insulin many different things include more knowledge sources and topics to create larger or when the body does datasets and use automated methods to remove unrelated or not respond appropriate- ly to insulin unwanted text in the paired sentences. Table 3. Example outputs from clinical paraphrase generation and References simplification models. [Bahdanau et al., 2015] Bahdanau, D., Cho, K., Bengio, Y.,. Our datasets consist of a small number of sentence pairs Neural Machine Translation By Jointly Learning To Align and (few thousands) and may not be sufficient for the neural Translate, in: ICLR. pp. 1–15, 2015. network models to learn complex clinical concepts. Fur- [Bakkelund, 2009] Bakkelund, D.,. An LCS-based string metric. thermore, we use only 164 medical topics from Wikipedia for this work. Improving the efficiency of paraphrase identi- University of Oslo, Oslo, Norway, 2009. fication and inclusion of more knowledge sources and topics [Banerjee and Lavie, 2005] Banerjee, S., Lavie, A.,. METEOR: will create larger and better training datasets. Many sentenc- An Automatic Metric for MT Evaluation with Improved es that are paired contain text related to additional infor- Correlation with Human Judgments, in: ACL. pp. 65–72, 2005. mation that the other sentence does not contain. For exam- [Brad and Rebedea, 2017] Brad, F., Rebedea, T.,. Neural ple: Paraphrase Generation using Transfer Learning, in: INLG. pp. Source: “It isn’t clear why some people get asthma and 257–261, 2017. others don’t, but it’s probably due to a combination of envi- [Conneau et al., 2017] Conneau, A., Kiela, D., Schwenk, H.,. ronmental and genetic factors”. Target: “Asthma is thought to be caused by a combina- Supervised Learning of Universal Sentence Representations tion of genetic and environmental factors”. from Natural Language Inference Data, in: CoRR. 2017. [Dadashov et al., 2017] Dadashov, E., Sakshuwong, S., Yu, K.,. The removal of the additional text in the first part of the source sentence will improve the training of the neural net- Quora Question Duplication 1–9, 2017. work as it can focus more on the important text. The un- [Damerau, 1964] Damerau, F.J.,. A Technique for Computer wanted text in this example can be easily removed as it is Detection and Correction of Spelling Errors. Commun. ACM 7, clearly separated from the rest of the sentence. However, 171–176, 1964. many sentences that contain unwanted text are not easily [Delbanco et al., 2015] Delbanco, T., Walker, J., Darer, J.D., separable. Moreover, manual removal of unwanted text from thousands of sentences (if not millions) is not practi- Elmore, J.G., Feldman, H.J.,. Open Notes: Doctors and Patients cal. Automated methods are needed to remove unwanted Signing On. Ann. Intern. Med. 153, 121–126, 2015. [Fader et al., 2013] Fader, A., Zettlemoyer, L., Etzioni, O.,. Paraphrase-Driven Learning for Open Question Answering, in: o(1/k^2). Dokl. AN USSR 269, 543–547, 1983. ACL. pp. 1608–1618, 2013. [Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., Zhu, [Ghaeini et al., 2018] Ghaeini, R., Hasan, S.A., Datla, V. et al. W.,. BLEU: a method for automatic evaluation of machine DR-BiLSTM: Dependent Reading Bidirectional LSTM for translation, in: ACL. pp. 311–318, 2002. Natural Language Inference, in: NAACL HTL, 2018. [Pavlick et al., 2015] Pavlick, E., Rastogi, P., Ganitkevitch, J., [Hasan et al., 2016] Hasan, S.A., Liu, B., Liu, J. et al. Neural Durme, B. Van, Callison-Burch, C.,. PPDB 2.0: Better Clinical Paraphrase Generation with Attention, in: CNLP paraphrase ranking, fine-grained entailment relations, word Workshop. pp. 42–53, 2016. embeddings, and style classification. ACL 425–430, 2015. [Herranz et al., 2011] Herranz, J., Nin, J., Sole, M.,. Optimal [Pivovarov and Elhadad, 2015] Pivovarov, R., Elhadad, N.,. Symbol Alignment Distance: A New Distance for Sequences of Automated methods for the summarization of electronic health Symbols. IEEE Trans. Knowl. Data Eng. 23, 1541–1554, 2011. records. J. Am. Med. Informatics Assoc. 22, 938–947, 2015. [Iyer et al., 2017] Iyer, S., Dandekar, N., Csernai, K.,. Quora [Prakash et al., 2016] Prakash, A., Hasan, S.A., Lee, K., Datla, V., question pair dataset [WWW Document], 2017. Qadir, A., Liu, J., Farri, O.,. Neural Paraphrase Generation with [Kandula et al., 2010] Kandula, S., Curtis, D., Zeng-Treitler, Q.,. Stacked Residual LSTM Networks, in: COLING. pp. 2923– A semantic and syntactic text simplification tool for health 2934, 2016. content., in: AMIA. pp. 366–70, 2010. [Qenam et al., 2017] Qenam, B., Kim, T.Y., Carroll, M.J., [Kingma and Ba, 2014] Kingma, D.P., Ba, J.,. Adam: A Method Hogarth, M.,. Text Simplification Using Consumer Health for Stochastic Optimization, in: ICLR. pp. 1–15, 2014. Vocabulary to Generate Patient-Centered Radiology Reporting: Translation and Evaluation. J. Med. Internet Res. 19, e417, [Koehn, 2017] Koehn, P.,. Neural Machine Translation. CoRR, 2017. 2017. [Quirk et al., 2004] Quirk, C., Brockett, C., Dolan, B.,. [Koehn, 2010] Koehn, P.,. Statistical Machine Translation, 1st ed. Monolingual Machine Translation for Paraphrase Generation, Cambridge University Press, NY, USA, 2010. in: ACL, 2004. [Kondrak, 2005] Kondrak, G.,. N-gram similarity and distance. [Snover et al., 2006] Snover, M., Dorr, B., Schwartz, R., SPIR 115–126, 2005. Micciulla, L., Makhoul, J.,. A Study of Translation Edit Rate [Kosten et al., 2012] Kosten, T.R., Domingo, C.B., Shorter, D., with Targeted Human Annotation, in: AMTA. pp. 223–231, Orson, F. et al. Inviting Patients to Read Their Doctors’ Notes: 2006. A Quasi- experimental Study and a Look Ahead. Ann. Intern. [Sørensen, 1948] Sørensen, T.,. A method of establishing groups Med. 157, 461–470, 2012. of equal amplitude in plant sociology based on similarity of [Levenshtein, 1966] Levenshtein, V.,. Binary Codes Capable of species and its application to analyses of the vegetation on Correcting Deletions, Insertions, and Reversals. Sov. Phys. Danish commons. Biol. Skr. 5, 1–34, 1948. Dokl. 10, 707–710, 1966. [Vinyals et al., 2015] Vinyals, O., Kaiser, L., Koo, T., Petrov, S., [Lin et al., 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Sutskever, I., Hinton, G.,. Grammar as a Foreign Language, in: Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.,. Microsoft NIPS, 2015. COCO: Common objects in context, in: ECCV. pp. 740–755, [Winkler, 1990] Winkler, W.E.,. String Comparator Metrics and 2014. Enhanced Decision Rules in the Fellegi-Sunter Model of Record [Lindberg et al., 1993] Lindberg, D.A., Humphreys, B.L., Linkage, in: ASA. pp. 354–359, 1990. McCray, A.T.,. The Unified Medical Language System. [Wubben et al., 2010] Wubben, S., van den Bosch, A., Krahmer, Methods Inf. Med. 32, 281–291, 1993. E.,. Paraphrase Generation As Monolingual Translation: Data [M. Shieber and Nelken, 2006] M. Shieber, S., Nelken, R.,. and Evaluation, in: INLG, INLG ’10. pp. 203–207, 2010. Towards robust context-sensitive sentence alignment for [Xu et al., 2015] Xu, W., Callison-Burch, C., Napoles, C.,. monolingual corpora, 2006. Problems in Current Text Simplification Research: New Data [Madnani and Dorr, 2010] Madnani, N., Dorr, B.J.,. Generating Can Help, in: ACL. pp. 283–297, 2015. Phrasal and Sentential Paraphrases: A Survey of Data-Driven [Zhao et al., 2009] Zhao, S., Lan, X., Liu, T., Li, S.,. Application- Methods. Comput. Linguist. 36, 341–387, 2010. driven statistical paraphrase generation, in: ACL. pp. 834–842, [Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., Dean, 2009. J.,. Distributed Representations of Words and Phrases and their [Zhu et al., 2010] Zhu, Z., Bernhard, D., Gurevych, I.,. A Compositionality. NIPS 1–9, 2013. Monolingual Tree-based Translation Model for Sentence [Nesterov, 1983] Nesterov, Y.,. A method for unconstrained Simplification, in: COLING. pp. 1353–1361, 2010. convex minimization problem with the rate of convergence