       Towards dataset creation and establishing baselines for sentence-level neural
                   clinical paraphrase generation and simplification
           Viraj Adduru1*, Sadid A. Hasan2, Joey Liu2, Yuan Ling2, Vivek Datla2, Kathy Lee2,
                                       Ashequl Qadir2, Oladimeji Farri2
                             1 Rochester Institute of Technology, Rochester, NY, USA
              2 Artificial Intelligence Lab, Philips Research North America, Cambridge, MA, USA
                                                {vra2128}@rit.edu,
                          {firstname.lastname, kathy.lee_1, dimeji.farri}@philips.com

* This work was conducted as part of an internship program at Philips Research.

                           Abstract

A paraphrase is a restatement of a text while retaining its meaning. Clinical paraphrasing involves restatement of sentences, paragraphs, or documents containing complex vocabulary used by clinicians. Paraphrasing can result in an alternative text that is either a simpler or a more complex form of the original input text. Simplification is a form of paraphrasing in which a sentence is restated as a linguistically simpler sentence while retaining the meaning of the original sentence. Clinical text simplification has potential applications such as simplification of clinical reports for patients towards a better understanding of their clinical conditions. Deep learning has emerged as a successful technique for various natural language understanding tasks, preconditioned on large annotated datasets. In this paper, we propose a methodology to create preliminary datasets for clinical paraphrasing and clinical text simplification to foster training of deep learning-based clinical paraphrase generation and simplification models.

1   Introduction and related work

Paraphrasing (a.k.a. paraphrase generation) is transforming a text that can be a word, phrase, sentence, paragraph, or a document, while retaining its meaning and content. For example, the sentence ‘I am very well’ can be paraphrased as ‘I am doing great’. Paraphrasing can lead to a new text which may be simpler, more complex, or at the same complexity level as the source text. The task of paraphrasing text into a simpler form is called simplification. In simplification, the output text is a linguistically simplified version of the input text. Paraphrasing and simplification have numerous applications such as document summarization, text simplification for target audiences (e.g. children), and question answering [Madnani and Dorr, 2010].

In the clinical context, health care systems and medical knowledge bases contain large collections of texts that are often not comprehensible to the lay population. For example, clinical texts like radiology reports are used by radiologists to professionally communicate their findings to other physicians [Qenam et al., 2017]. They contain complex medical terminologies that patients are not familiar with. A recent study reported that allowing patients to access their clinical notes showed an improvement in their health care process [Kosten et al., 2012]. Realizing the need for increased inclusion of patients in their health care process, large health care systems have allowed patients to access their medical records [Delbanco et al., 2015]. However, these medical records contain raw, complex clinical text intended for communication between medical professionals. Paraphrasing or simplification of clinical text will improve patients’ understanding of their health conditions and thereby play an important role in connecting patients and caregivers across the clinical continuum towards better patient outcomes.

Traditional clinical paraphrasing and simplification approaches use lexical methods [Kandula et al., 2010; Pivovarov and Elhadad, 2015; Qenam et al., 2017], which typically focus on identifying complex clinical words, phrases, or sentences and replacing them with their alternatives in the case of paraphrasing, or with simpler versions in the case of simplification. Lexical methods take advantage of knowledge sources like the Unified Medical Language System (UMLS) metathesaurus [Lindberg et al., 1993], which contains grouped words and phrases that describe various medical concepts. Simplification is traditionally performed by mapping UMLS concepts to their alternatives provided in the consumer health vocabulary (CHV) [Qenam et al., 2017].
Recently, paraphrase generation has been cast as a monolingual machine translation problem, resulting in the development of data-driven methods using statistical machine translation (SMT) [Koehn, 2010] and neural machine translation (NMT) principles [Koehn, 2017]. SMT methods [Quirk et al., 2004; Wubben et al., 2010; Zhao et al., 2009] model the conditional distributions of words and phrases and replace the phrases in the source text with the phrases that maximize the probability of the resulting text. However, syntactic relationships are difficult to model using SMT methods. Monolingual NMT systems use neural network architectures to model complex relationships by automatically learning from large datasets containing source and target text pairs, both belonging to the same language. Current NMT systems for paraphrase generation or simplification [Brad and Rebedea, 2017; Hasan et al., 2016; Prakash et al., 2016] use sequence-to-sequence networks based on encoder-decoder architectures. Unlike traditional methods, NMT systems do not need semantic or syntactic rules to be explicitly defined. However, they need carefully constructed datasets that contain sufficient information to robustly train the deep neural networks.

Existing clinical paraphrasing and simplification datasets are limited to short phrases. Hasan et al. (2016) trained an attention-based encoder-decoder model [Bahdanau et al., 2015] using a dataset created by merging two word- and phrase-level datasets: the paraphrase database (PPDB) [Pavlick et al., 2015] and the UMLS metathesaurus. They showed that their model outperformed an upper-bound paraphrasing baseline. However, they used a phrasal dataset that does not contain the more complex contextual knowledge of a sentential dataset, and the ability of the network to simplify clinical text was not explored. In contrast to paraphrasing, simplification is a harder problem and may involve addition, deletion, or splitting of sentences to suit the target audience. These operations require additional knowledge that a dataset with longer sequences like sentences or paragraphs could provide. Other studies [Brad and Rebedea, 2017; Prakash et al., 2016] have trained encoder-decoder architectures with attention for paraphrasing using general-domain sentence-level datasets like Microsoft Common Objects in Context (MSCOCO) [Lin et al., 2014], Newsela [Xu et al., 2015] and Wikianswers [Fader et al., 2013]. They demonstrated that neural machine translation models successfully captured the complex semantic relationships from the general-domain datasets. However, it is unclear how these networks would perform on complex clinical text.

In this paper, our aim is to pioneer the creation of parallel (with source and target pairs) sentential datasets for clinical paraphrase generation and simplification. Web-based unstructured knowledge sources like www.mayoclinic.com contain articles on various medical topics. We obtain articles with matching titles from different web-based knowledge sources and align the sentences using various metrics to create paraphrase and simplification pairs. Additionally, we train NMT models using the prepared clinical datasets and present baseline performance metrics for both clinical paraphrase generation and simplification.

The next section outlines our approach to create clinical paraphrase generation and simplification datasets. First, we discuss our proposed methodology for extracting sentence pairs from web-based clinical knowledge sources. Then we describe various metrics to align the pairs of related sentences for dataset creation. Section 3 discusses the neural network architectures used for establishing baselines. Sections 4 and 5 present the performance evaluation of the models, and in Section 6 we conclude and discuss future work.

2   Approach

2.1 Paraphrase pairs from web-based resources

Web-based textual resources contain large collections of articles on various medical topics related to diseases, anatomy, treatment, symptoms, etc. These articles are often targeted at general (non-clinician) users and are easier to understand, unlike the complex clinical reports written by clinicians. We crawl articles with the same topics from two or more web-based knowledge sources. Each sentence in a topic (i.e. in an article) from one resource is mapped to the sentences belonging to the same topic from the other resource(s) using a one-to-many scheme to create all possible sentence pair combinations. These sentence pair combinations essentially contain a large number of unrelated pairs, from which meaningful paraphrase pairs are identified.

Manual identification of the relevant paraphrase pairs is a tedious task as the sentence pair combinations (as discussed above) contain a large number (in the millions) of unrelated sentence pairs. Therefore, we use an automated approach to identify the paraphrase pairs from the sentence pair combinations. Our method is similar to the approach of [Zhu et al., 2010], who use a TF-IDF metric [M. Shieber and Nelken, 2006] to align sentences between the Wikipedia and SimpleWikipedia knowledge sources to create sentence pairs for the text simplification task. However, some studies, e.g. Xu et al., 2015, reveal the noisy nature of such datasets, which motivated us to explore various textual similarity/distance metrics instead of relying on one single metric for sentence alignment. Our intuition is that the strengths of a collection of diverse metrics may be useful for better sentence alignment. In addition to various existing metrics, we train a neural paraphrase identification model to estimate a similarity score between two sentences, which is also used as a supplementary sentence alignment metric.
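To make the one-to-many pairing scheme concrete, the sketch below enumerates all candidate sentence pairs for topics present in two sources. It is only a minimal illustration under our own assumptions about the data layout (dictionaries mapping topic titles to sentence lists); the function and example names are ours and not part of the original pipeline.

from itertools import product

def candidate_pairs(articles_a, articles_b):
    """Create all sentence pair combinations for topics present in both sources.

    articles_a, articles_b: dicts mapping a topic title to a list of sentences.
    Returns a list of (sentence_from_a, sentence_from_b) tuples.
    """
    pairs = []
    for topic in set(articles_a) & set(articles_b):
        # One-to-many scheme: every sentence of an article is paired with every
        # sentence of the same-topic article from the other source.
        pairs.extend(product(articles_a[topic], articles_b[topic]))
    return pairs

# Toy usage with hypothetical sentences.
wiki = {"Asthma": ["Asthma is a long-term inflammatory disease of the airways.",
                   "Symptoms include wheezing and shortness of breath."]}
mayo = {"Asthma": ["Asthma causes the airways to narrow and swell."]}
print(len(candidate_pairs(wiki, mayo)))  # 2 candidate pairs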
2.2 Sentence alignment

Paraphrase pairs can be identified by computing various sentence similarity/distance metrics between the two sentences in a pair. The character-level and word-level metrics that we used are described below.

Levenshtein distance
Levenshtein distance [Levenshtein, 1966] is defined as the minimum number of string operations, consisting of additions, deletions, and substitutions of symbols, that are necessary to transform one string into another. The normalized Levenshtein distance (LDN) is computed by dividing the number of string operations required by the length of the longer string. Character- or word-level LDN is calculated by treating characters or words as symbols, respectively:

   LDN = N / max(n, m)                                        (1)

where N is the minimum number of string operations to transform a text x to y or vice versa, and n and m are the number of symbols in the texts x and y respectively.
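As a rough illustration of Eq. (1), a character- or word-level normalized Levenshtein distance can be computed as in the sketch below (a straightforward dynamic-programming implementation written for this description; in practice an off-the-shelf edit-distance library could be used instead).

def levenshtein(seq_a, seq_b):
    """Minimum number of additions, deletions, and substitutions (edit distance)."""
    prev = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        curr = [i]
        for j, b in enumerate(seq_b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # addition
                            prev[j - 1] + (a != b)))    # substitution
        prev = curr
    return prev[-1]

def ldn(x, y, level="char"):
    """Normalized Levenshtein distance (Eq. 1) at character or word level."""
    sx, sy = (x, y) if level == "char" else (x.split(), y.split())
    longest = max(len(sx), len(sy))
    return levenshtein(sx, sy) / longest if longest else 0.0

print(ldn("no cure for smallpox exists", "no treatment for smallpox", level="word"))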
Damerau-Levenshtein distance
Damerau-Levenshtein distance [Damerau, 1964] is similar to LDN and is defined as the minimum number of string operations needed to transform one string into the other. In addition to the string operations in Levenshtein distance, Damerau-Levenshtein distance further includes transposition of two adjacent symbols. The normalized Damerau-Levenshtein distance (DLDN) is calculated by dividing the number of string operations by the number of symbols in the longer string.

Optimal string alignment distance
Optimal string alignment distance [Herranz et al., 2011] is a variant of DLDN with the restriction that no substring is edited more than once. The normalized form is computed in the same way as DLDN.

Jaro-Winkler distance
Jaro-Winkler distance (JWD) [Winkler, 1990] computes the distance between two strings, where the substitution of two close symbols is considered more important than the substitution of two symbols that are far from each other. The Jaro-Winkler distance is given by:

   JWD = d_j                         if d_j < b_t
   JWD = d_j + k p (1 - d_j)         otherwise                (2)

where b_t is a boost threshold below which no prefix bonus is applied, k is the length of the common prefix at the start of the string (up to 4 symbols), p is a constant usually set to 0.1, and d_j is the Jaro distance given by:

   d_j = 0                                     if q = 0
   d_j = (1/3) (q/n + q/m + (q - t)/q)         otherwise      (3)

where q is the number of matching symbols between the two texts x and y with lengths n and m respectively, and t is half of the number of transpositions. The Jaro-Winkler distance is a normalized quantity ranging from 0 to 1.

Longest common subsequence
The longest common subsequence distance (LCSD) [Bakkelund, 2009] is computed using the following equation:

   LCSD = 1 - LCS(x, y) / max(n, m)                           (4)

where LCS(x, y) is the length of the longest subsequence common to strings x and y with lengths n and m respectively.

N-gram distance
An n-gram is a contiguous sequence of n items from a given sample of text. N-gram distance [Kondrak, 2005] is similar to the LCS-based distance, but in this case the symbols are n-grams. We used n = 4 in this paper.

Cosine similarity
Cosine similarity between two strings is computed as the cosine of the angle between the vector representations of the two strings (x and y) and is given by the equation:

   CS = (V_x . V_y) / (|V_x| |V_y|)                           (5)

Jaccard similarity
Jaccard similarity is calculated as the ratio of the intersection to the union of the items in the two strings.

Sorensen similarity
Sorensen similarity (also called the Sorensen-Dice coefficient) [Sørensen, 1948] is similar to Jaccard similarity and is computed as the ratio of twice the number of common items (intersection) to the sum of the numbers of items in the two strings.
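For illustration, the token-overlap metrics above can be computed in a few lines of Python; the sketch below shows Jaccard, Sorensen-Dice, and cosine similarity over word tokens. The exact tokenization and term weighting used in the paper are not specified, so this is only one assumed variant.

from collections import Counter
from math import sqrt

def jaccard(x_tokens, y_tokens):
    a, b = set(x_tokens), set(y_tokens)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def sorensen(x_tokens, y_tokens):
    a, b = set(x_tokens), set(y_tokens)
    return 2 * len(a & b) / (len(a) + len(b)) if (a or b) else 1.0

def cosine(x_tokens, y_tokens):
    """Cosine similarity over term-frequency vectors (Eq. 5)."""
    va, vb = Counter(x_tokens), Counter(y_tokens)
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

s1 = "no drug is currently approved for the treatment of smallpox".split()
s2 = "no cure or treatment for smallpox exists".split()
print(jaccard(s1, s2), sorensen(s1, s2), cosine(s1, s2))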
All the above metrics are used in their normalized forms (values between 0 and 1). These metrics calculate the similarity/distance between the sentences in a pair using character- or word-level overlap and the pattern of their occurrences in the sentences. However, these metrics do not consider the presence of concepts (e.g. words or phrases) that are paraphrased using a different vocabulary (e.g. ‘glioma’ can be paraphrased with its synonym ‘brain tumor’), and they also do not perform well for sentences that differ by only a few words but contradict each other. Therefore, we need a similarity metric that can consider complex semantic relationships between the concepts represented in the sentences. Deep neural network architectures with recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have demonstrated state-of-the-art performance [Conneau et al., 2017] in learning semantic associations between sentences. Therefore, deep learning-based systems are increasingly being used for advanced natural language inference tasks like paraphrase identification and textual entailment [Ghaeini et al., 2018], which motivated us to create a neural paraphrase identification model for the purpose of supplementing our sentence similarity measures for better sentence pair alignment.

2.3 Paraphrase identification metric

Neural paraphrase identification can be stated as a binary classification task in which a neural network model estimates the probability that two sentences are paraphrases. This estimated probability can be used as a similarity metric to align the sentence pairs.

Neural paraphrase identification
The network consists of stacked bidirectional long short-term memory (BiLSTM) layers in a Siamese architecture [Dadashov et al., 2017] (Figure 1). Each arm of the Siamese network consists of three stacked BiLSTM layers. The outputs of the final BiLSTM layers of both arms are concatenated and fed into a dense layer with ReLU activation, followed by a second dense layer with a sigmoid activation function. We use a depth of 300 for all the BiLSTM layers and the dense layers. The maximum sequence length of the BiLSTM layers is set to 30. The words in the input sentences are embedded using Word2Vec embeddings pre-trained on the Google News corpus.

Figure 1. Paraphrase identification architecture. Gray arrows represent weight-sharing between the left and right BiLSTM arms.
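The following is a minimal tf.keras sketch of such a Siamese BiLSTM classifier, written to mirror the description above (three stacked 300-unit BiLSTM layers shared between the two arms, concatenation, a ReLU dense layer, and a sigmoid output); the exact layer configuration of the original model may differ, and the toy embedding matrix stands in for pre-trained Word2Vec vectors.

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM, UNITS = 30, 300, 300

def build_siamese_paraphrase_model(embedding_matrix):
    # Shared layers: reusing the same layer objects for both arms shares their weights.
    embed = layers.Embedding(embedding_matrix.shape[0], EMB_DIM,
                             embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
                             trainable=False, mask_zero=True)
    bilstm1 = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))
    bilstm2 = layers.Bidirectional(layers.LSTM(UNITS, return_sequences=True))
    bilstm3 = layers.Bidirectional(layers.LSTM(UNITS))  # final sentence representation

    def encode(tokens):
        return bilstm3(bilstm2(bilstm1(embed(tokens))))

    left = layers.Input(shape=(MAX_LEN,), dtype="int32")
    right = layers.Input(shape=(MAX_LEN,), dtype="int32")
    merged = layers.concatenate([encode(left), encode(right)])
    hidden = layers.Dense(UNITS, activation="relu")(merged)
    prob = layers.Dense(1, activation="sigmoid")(hidden)  # paraphrase probability
    model = Model(inputs=[left, right], outputs=prob)
    # Nadam corresponds to Adam with Nesterov momentum; binary cross-entropy loss.
    model.compile(optimizer="nadam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Toy embedding matrix for a 1000-word vocabulary; real Word2Vec vectors would be loaded here.
model = build_siamese_paraphrase_model(np.random.rand(1000, EMB_DIM).astype("float32"))
model.summary()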
Hybrid dataset for paraphrase identification
Our paraphrase identification model is trained on a hybrid corpus created by merging two paraphrase corpora: Quora question pairs and Paralex question pairs. The Quora question pair corpus [Iyer et al., 2017] consists of 404,289 question pairs, with 149,263 paraphrase pairs and 255,027 non-paraphrase pairs. The Paralex dataset [Fader et al., 2013] consists of 35,692,309 question pairs, where all the question pairs are paraphrases of each other. The Paralex dataset is unbalanced as it does not contain any non-paraphrase pairs. After merging the sentence pairs from both corpora, we have 35,692,309 sentence pairs, with 35,437,283 paraphrase pairs and only 255,027 non-paraphrase pairs. To balance the dataset, we identify the list of unique questions, randomly select two questions from this list, and add the pair to the merged corpus as a non-paraphrase pair if the pair does not already exist. Non-paraphrase pairs are created until the non-paraphrase and paraphrase pairs are equal in number, resulting in a balanced dataset of 70 million pairs.
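A rough sketch of this balancing step is shown below: randomly sampled question pairs are treated as non-paraphrases and added until the two classes are of equal size. The function and argument names are ours, not from the original implementation.

import random

def balance_with_random_negatives(pairs, labels, seed=0):
    """Add randomly sampled question pairs as non-paraphrases (label 0) until balanced.

    pairs: list of (question_1, question_2) tuples; labels: 1 = paraphrase, 0 = non-paraphrase.
    """
    rng = random.Random(seed)
    questions = sorted({q for pair in pairs for q in pair})
    existing = {frozenset(pair) for pair in pairs}
    needed = labels.count(1) - labels.count(0)
    while needed > 0:
        q1, q2 = rng.sample(questions, 2)
        key = frozenset((q1, q2))
        if key not in existing:  # skip pairs that already exist in the merged corpus
            existing.add(key)
            pairs.append((q1, q2))
            labels.append(0)
            needed -= 1
    return pairs, labels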
Training
The dataset is preprocessed by removing punctuation, normalizing case, and applying standard tokenization. The tokens are embedded using Word2Vec embeddings pre-trained on the Google News corpus [Mikolov et al., 2013]. Words that are not found in the pre-trained vocabulary are embedded with a zero vector representing an UNK token. Longer sentences (> 30 words) are truncated, and shorter sentences (< 30 words) are padded with UNK tokens. As the sentences in a pair are in a bidirectional relationship to each other, the training pairs are also swapped to increase the dataset size. The dataset is split into 80%, 10% and 10% for training, validation and testing respectively.

The paraphrase identification model is trained using the Adam optimizer [Kingma and Ba, 2014] with Nesterov momentum [Nesterov, 1983] to optimize a binary cross-entropy loss. The update direction is calculated using a batch size of 512. We utilize early stopping on the validation error with a patience of 3 epochs to prevent overfitting.

The network is trained for 18 epochs before early stopping, at 22 minutes per epoch. The validation accuracy of our model is 95% and the test accuracy is 97%.

Paraphrase identification model for sentence alignment
The probability score from our paraphrase identification model for the predicted class is used along with the word- and character-level similarity/distance metrics to calculate a mean similarity score. Note that all normalized distance metrics are converted into similarity metrics by subtracting the corresponding score from 1, thereby obtaining 12 different similarity metrics. The mean similarity score is computed using the formula given below:

   mean similarity score = mean(all similarity scores)        (6)

Minimum and maximum thresholds of 0.5 and 0.8 are empirically selected by observing the sentence pairs. Sentence pairs with a mean similarity score within these thresholds are considered paraphrase pairs. We use a maximum threshold of 0.8 to avoid the selection of identical sentences.
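The sketch below illustrates Eq. (6) and the thresholding step: each candidate pair carries its normalized similarity scores, and only pairs whose mean falls in the empirically chosen [0.5, 0.8] window are kept. The data layout and names are assumed for illustration.

def mean_similarity(scores):
    """Eq. (6): average of the normalized similarity scores (distances converted as 1 - d)."""
    return sum(scores) / len(scores)

def select_paraphrase_pairs(scored_pairs, low=0.5, high=0.8):
    """Keep pairs whose mean similarity lies within the empirically chosen thresholds.

    scored_pairs: iterable of (sentence_1, sentence_2, scores), where scores is the list of
    similarity values for that pair. The upper threshold discards near-identical pairs.
    """
    return [(s1, s2) for s1, s2, scores in scored_pairs
            if low <= mean_similarity(scores) <= high]

example = [("No drug is approved for smallpox.", "No cure for smallpox exists.", [0.52, 0.61, 0.55]),
           ("Asthma is a lung disease.", "Asthma is a lung disease.", [0.98, 1.0, 0.97])]
print(select_paraphrase_pairs(example))  # the near-identical pair is filtered out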
2.4 Paraphrase generation dataset

The paraphrase sentence pairs are obtained from three web-based unstructured knowledge sources: Wikipedia, SimpleWikipedia, and MayoClinic. These sentence pairs form our clinical paraphrase generation dataset, which is later used to train a baseline neural paraphrase generation model.

Wikipedia and SimpleWikipedia
SimpleWikipedia contains simplified versions of pages from the original Wikipedia. However, the text in the corresponding documents is unaligned (there is no sentence-to-sentence matching). Pairing sentences from Wikipedia with those in SimpleWikipedia leads to a parallel corpus for the simplification dataset, as the latter mostly contains simplified versions of the former. In the case of paraphrase generation, however, the resulting pairs can be swapped, as paraphrasing applies in both directions. The swapping also helps to augment the dataset. We create a parallel corpus using 164 matched titles from Wikipedia from clinically relevant categories such as anatomy, brain, disease, medical condition, etc. Sentences from each of the 164 Wikipedia documents are paired with all the sentences from the documents with identical titles from SimpleWikipedia. Thus, we obtain 818,520 sentence pairs, for which we compute similarity scores as discussed in the previous subsections. We finally obtain 1,491 related sentence pairs after thresholding the mean similarity score, and we name this parallel corpus WikiSWiki.

MayoClinic
MayoClinic contains pages for 48 identically matched titles from the 164 titles identified from Wikipedia. Unique sentences from WikiSWiki were paired with the sentences obtained from the pages with matched titles from MayoClinic, and similarity scores were computed. Using the same thresholds as above, 3,203 sentence pairs are selected. These pairs are added to the WikiSWiki corpus to form a corpus containing 4,694 sentence pairs; we name it WikiSwikiMayo.

2.5 Simplification dataset

WikiSWiki is a simplification corpus, as it mostly contains sentences mapped to their simpler forms. However, its small number of sentence pairs may be insufficient for training a network to learn the complex relationships required for clinical text simplification. Therefore, we use additional web-based knowledge sources to increase the dataset size.

The web-based knowledge sources www.webmd.com (webmd) and www.medicinenet.com (medicinenet) are other clinical knowledge sources similar to MayoClinic. Through manual inspection, we found that webmd contains simpler sentences than medicinenet for many of the topics we examined, which is reasonable as medicinenet content is curated by clinicians. Therefore, we use them as additional knowledge sources to create our simplification dataset. For the 164 topics from the WikiSWiki dataset, we perform a Google search with ‘webmd’ and ‘medicinenet’ as additional search terms. The search returns 61,314 sentences from webmd and medicinenet for all 164 topics. Sentences from medicinenet are paired with the sentences from SimpleWikipedia and webmd from the articles with matched titles. Sentences from Wikipedia articles are paired with sentences from webmd separately, as they are already paired with SimpleWikipedia. We obtain 714,608 new sentence pairs, resulting in 1,002 final pairs after computing similarity scores and thresholding. These sentence pairs are merged with the WikiSWiki dataset to create the monolingual clinical simplification dataset containing 2,493 sentence pairs. Although our final corpus contains a small number of sentence pairs, our main contribution in this paper is to introduce an automated method to create sentence pairs from web-based knowledge sources, towards creating a large clinical simplification corpus in the future.
                                                                  case of text simplification. As simplification is a unidirec-
                                                                  tional task, we do not use data swapping. We prevent data
3   Paraphrase generation and simplification                      leak issues using the same procedure as paraphrase genera-
                                                                  tion while splitting the data. The training, test and validation
3.1 Model                                                         sets contain 1918, 187 and 187 sentence pairs respectively.
Sequence-to-sequence models using encoder-decoder archi-          The simplification network is trained for 3500 steps.
tecture with attention [Vinyals et al., 2015] (Figure 2) are
trained for both paraphrase generation and simplification         4     Evaluation metrics
tasks. The encoder and decoder are made of three stacked          BLEU [Papineni et al., 2002], METEOR [Banerjee and
RNN layers using BiLSTM cells and LSTM cells respec-              Lavie, 2005] and translation error rate (TER) [Snover et al.,
tively. We use a cell depth of 1024 for all the layers in the     2006] are used to evaluate our models. These metrics are
encoder and the decoder. The maximum sequence length is           shown to correlate with human judgements for evaluating
set to 50. The sentences are preprocessed, and the words are      paraphrase generation models [Wubben et al., 2010]. BLEU
encoded using one-hot vector encoding. The outputs of the         looks for exact string matching using n-gram overlaps to
decoder are projected onto the output vocabulary space us-        evaluate the similarity between two sentences. METEOR
ing a dense layer with a softmax activation function.             uses WordNet to obtain synonymously related words to
                                                                  evaluate sentence similarity. Higher BLEU and METEOR
3.2 Training                                                      scores indicate higher similarity. TER score measures the
The network parameters are optimized by minimizing a              number of edits necessary to transform the source sentence
sampled softmax loss function. The gradients are truncated        to the target. Lower TER score indicates higher similarity.
by limiting the global norm to 1. The network is trained
using mini-batch gradient descent algorithm with batch size       5     Results and discussion
of 128. An initial learning rate of 0.5 is used with a decay of
0.99 for every step. The training set is shuffled for every
epoch. The networks are trained using 80% of the sentence         5.1 Sentence alignment
pairs and validated on 10% and tested on 10%. Both models         Table 1 presents a few examples of the aligned sentence
are developed using Tensorflow, version 1.2, and two Tesla        pairs for both clinical paraphrase generation and simplifica-
K20 GPUs.                                                         tion.
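As an illustration of the leak-free split described above, the sketch below groups pairs by their source sentence before splitting, which guarantees that a source sentence appears in only one of the three sets. The paper removes offending training pairs after splitting, but the resulting property is the same; the names and exact logic here are our own simplification.

import random

def leak_free_split(pairs, val_frac=0.1, test_frac=0.1, seed=0):
    """Split (source, target) pairs so that each source sentence occurs in only one set."""
    rng = random.Random(seed)
    sources = sorted({src for src, _ in pairs})
    rng.shuffle(sources)
    n_val = int(len(sources) * val_frac)
    n_test = int(len(sources) * test_frac)
    val_sources = set(sources[:n_val])
    test_sources = set(sources[n_val:n_val + n_test])
    train = [p for p in pairs if p[0] not in val_sources and p[0] not in test_sources]
    val = [p for p in pairs if p[0] in val_sources]
    test = [p for p in pairs if p[0] in test_sources]
    return train, val, test

# Toy usage: swapped pairs share source sentences, so each source lands in a single split.
pairs = [("a complex sentence", "a simpler sentence"), ("a simpler sentence", "a complex sentence")]
train, val, test = leak_free_split(pairs)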
4   Evaluation metrics

BLEU [Papineni et al., 2002], METEOR [Banerjee and Lavie, 2005] and translation error rate (TER) [Snover et al., 2006] are used to evaluate our models. These metrics have been shown to correlate with human judgements when evaluating paraphrase generation models [Wubben et al., 2010]. BLEU looks for exact string matches using n-gram overlaps to evaluate the similarity between two sentences. METEOR uses WordNet to obtain synonymously related words to evaluate sentence similarity. Higher BLEU and METEOR scores indicate higher similarity. The TER score measures the number of edits necessary to transform the source sentence into the target; a lower TER score indicates higher similarity.
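As a small example of the kind of scoring used here, corpus-level BLEU can be computed with NLTK as sketched below (METEOR and TER are available in similar open-source packages); the tokenization and smoothing choices are ours and may differ from the paper's evaluation setup.

from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_100(references, hypotheses):
    """Corpus BLEU scaled to 0-100 for tokenized sentences.

    references: list of reference sentences (each a list of tokens);
    hypotheses: list of generated sentences (each a list of tokens).
    """
    smooth = SmoothingFunction().method1  # avoids zero scores on very short sentences
    return 100 * corpus_bleu([[ref] for ref in references], hypotheses,
                             smoothing_function=smooth)

refs = ["dengue fever is caused by the dengue virus".split()]
hyps = ["dengue fever is a disease caused by the dengue virus".split()]
print(round(bleu_100(refs, hyps), 1))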
5   Results and discussion

5.1 Sentence alignment
Table 1 presents a few examples of the aligned sentence pairs for both clinical paraphrase generation and simplification.

Clinical Paraphrase Generation

Example 1: Good (mean similarity score: 0.52)
S1: No drug is currently approved for the treatment of smallpox.
S2: No cure or treatment for smallpox exists

Example 2: Acceptable (mean similarity score: 0.62)
S1: Worldwide, breast cancer is the most common invasive cancer in women.
S2: After skin cancer, breast cancer is the most common cancer diagnosed in women in the United States

Example 3: Bad (mean similarity score: 0.53)
S1: Gallbladder cancer is a rare type of cancer which forms in the gallbladder.
S2: At this stage, gallbladder cancer is confined to the inner layers of the gallbladder

Clinical Text Simplification

Example 1: Good (mean similarity score: 0.54)
S1: In Western cultures, ingestion of or exposure to peanuts, wheat, nuts, certain types of seafood like shellfish, milk, and eggs are the most prevalent causes.
S2: In the Western world, the most common causes are eating or touching peanuts, wheat, tree nuts, shellfish, milk, and eggs.

Example 2: Acceptable (mean similarity score: 0.54)
S1: Together the bones in the body form the skeleton.
S2: The bones are the framework of the body.

Example 3: Bad (mean similarity score: 0.54)
S1: There are two major types of diabetes, called type 1 and type 2
S2: There are other kinds of diabetes, like diabetes insipidus.

Table 1. Examples of aligned sentence pairs, with their mean similarity scores. Good indicates that the accepted sentences are paraphrases; Bad indicates that they are not.

In Table 1, for both the paraphrase generation and text simplification tasks, although the mean similarity score is similar across all the examples, there is large variability in the classification of the sentence pairs. This means there is an overlap between the distributions of the mean similarity score of the paraphrase pairs and the non-paraphrase pairs. Therefore, selecting a minimum threshold lower than 0.5 introduces more non-paraphrase pairs into the dataset, while selecting a threshold higher than 0.5 loses a large number of pairs that are paraphrases. One desirable approach is to train a linear regression or any other multivariate machine learning model to classify the paraphrase pairs using all the computed similarity metrics. However, training such machine learning systems requires ground-truth data and is therefore outside the scope of this paper.
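If a small set of human-labeled pairs were available, the multivariate classifier suggested above could be trained directly on the vector of similarity metrics. The sketch below uses scikit-learn's logistic regression as one such model, with toy data standing in for labeled pairs; it is only an illustration of the suggested future direction.

import numpy as np
from sklearn.linear_model import LogisticRegression

# X: one row per candidate pair, one column per similarity metric (12 in our setup,
# 3 shown here for brevity); y: 1 if the pair was judged a paraphrase, 0 otherwise.
X = np.array([[0.62, 0.55, 0.70],
              [0.51, 0.40, 0.45],
              [0.80, 0.75, 0.90],
              [0.30, 0.20, 0.35]])
y = np.array([1, 0, 1, 0])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba(X)[:, 1])  # estimated probability that each pair is a paraphrase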
Our paraphrase identification system uses a vocabulary from the Google News corpus. Words that are not present in this vocabulary are assigned the UNK token. Therefore, the neural paraphrase identification network is not sensitive when two semantically similar sentences refer to different objects. However, this problem is minimized in our case, as we pair sentences from pages belonging to the same topic. Furthermore, using the other similarity metrics that are based on word matching helps to overcome this problem in cases where the paraphrase identification metric is insensitive. We verified by visual inspection of the selected sentence pairs that this holds true for the majority of the pairs, for both datasets.

5.2 Paraphrase generation and simplification
Average quality scores on the test sets for the clinical paraphrase generation and text simplification models are presented in Table 2. These scores serve as baselines for clinical paraphrase generation and text simplification on the datasets that we have created. The quality metrics are lower for clinical text simplification than for paraphrase generation. This is expected: in paraphrase generation many of the words from the source sentence can be retained in the paraphrased sentence, whereas simplification involves complex transformations that result in different words in the output sentence, and hence the quality scores are lower. Further human evaluations are required to better rate the performance of the simplification model.

Task                              BLEU        METEOR        TER
Clinical Paraphrase Generation    9.4±0.5     15.1±0.3      108.7±1.5
Clinical Text Simplification      9.9±1.6     10.6±0.8      97.7±2.9

Table 2. Average scores computed over the test sentence pairs.

A few example outputs of the clinical paraphrase generation and simplification systems are presented in Table 3. The examples show that both the paraphrase generation and simplification models retained the knowledge of the overall topic in the generated sentences. Example 2 for both models shows that, although the topic of the generated sentence matches that of the source, the sentence is not a paraphrase or a simplification respectively, as the context of the resulting sentence is different from that of the source. This may be because of failures in the alignment of the sentences while creating the datasets. This shows that the paraphrase identification model and the metrics were not fully sufficient to pair the sentences accurately. In particular, the paraphrase identification model trained on general-domain question pairs may not generalize well to identifying paraphrase pairs in clinical texts. A solution may be to use transfer learning and train the paraphrase identification network on a subset of human-rated clinical paraphrases.

Clinical Paraphrase Generation
Example 1
Source: dengue fever pronounced den gay is an infectious disease caused by the dengue virus
Target: dengue fever is caused by any of the four dengue viruses spread by mosquitoes that thrive in and near human lodgings
Generated: Dengue fever is a mosquito borne tropical disease caused by the dengue virus
Example 2
Source: Lung cancer often spreads (metastasizes) to other parts of the body, such as the brain and the bones
Target: Primary lung cancers themselves most commonly metastasize to the brain, bones, liver and adrenal glands
Generated: Lung cancer staging is an assessment of the degree of spread of the cancer from its original source

Clinical Text Simplification
Example 1
Source: Diabetes is due to either the pancreas not producing enough insulin or the cells of the body not responding properly to the insulin produced
Target: Diabetes is the condition that results from lack of insulin in a person blood or when their body has a problem using the insulin it produces insulin resistance
Generated: Diabetes can occur when the pancreas produces very little to no insulin or when the body does not respond appropriately to insulin
Example 2
Source: Ventricular tachycardia can be classified based on its morphology
Target: Ventricular tachycardia can be treated in a few different ways
Generated: Ventricular tachycardia can be caused by many different things

Table 3. Example outputs from the clinical paraphrase generation and simplification models.

Our datasets consist of a small number of sentence pairs (a few thousand) and may not be sufficient for the neural network models to learn complex clinical concepts. Furthermore, we use only 164 medical topics from Wikipedia for this work. Improving the efficiency of paraphrase identification and including more knowledge sources and topics will create larger and better training datasets. Many of the paired sentences contain text related to additional information that the other sentence does not contain. For example:

Source: “It isn’t clear why some people get asthma and others don’t, but it’s probably due to a combination of environmental and genetic factors”.
Target: “Asthma is thought to be caused by a combination of genetic and environmental factors”.

The removal of the additional text in the first part of the source sentence would improve the training of the neural network, as it could focus more on the important text. The unwanted text in this example can be easily removed, as it is clearly separated from the rest of the sentence. However, many sentences that contain unwanted text are not easily separable. Moreover, manual removal of unwanted text from thousands of sentences (if not millions) is not practical. Automated methods are needed to remove unwanted text during sentence alignment, which would help to create cleaner datasets.

Previous research has found that existing simplification datasets created using Wikipedia-like knowledge sources are noisy [Xu et al., 2015], as these knowledge sources are not created with a specific objective. However, task-specific datasets for clinical paraphrase generation and simplification do not exist as of the writing of this paper. Therefore, we approached the creation of such datasets for clinical paraphrase generation and simplification using web-based knowledge sources. We hope that this serves as a starting point towards developing automated approaches for creating task-specific datasets from unstructured knowledge sources.

6   Conclusion and future work

This paper presents preliminary work on an automated methodology to create clinical paraphrase generation and simplification datasets. We use web-based knowledge sources and automatically align sentence pairs from matching topics to create the datasets. Additionally, these datasets are used to train sequence-to-sequence models leveraging an encoder-decoder architecture with attention for paraphrase generation and simplification. Further research to improve string similarity metrics is required to accurately identify similar sentence pairs and create cleaner datasets. In the future, we will include more knowledge sources and topics to create larger datasets and use automated methods to remove unrelated or unwanted text in the paired sentences.
References

[Bahdanau et al., 2015] Bahdanau, D., Cho, K., Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate, in: ICLR, pp. 1–15, 2015.

[Bakkelund, 2009] Bakkelund, D. An LCS-based string metric. University of Oslo, Oslo, Norway, 2009.

[Banerjee and Lavie, 2005] Banerjee, S., Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: ACL, pp. 65–72, 2005.

[Brad and Rebedea, 2017] Brad, F., Rebedea, T. Neural Paraphrase Generation using Transfer Learning, in: INLG, pp. 257–261, 2017.

[Conneau et al., 2017] Conneau, A., Kiela, D., Schwenk, H. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data, in: CoRR, 2017.

[Dadashov et al., 2017] Dadashov, E., Sakshuwong, S., Yu, K. Quora Question Duplication, 1–9, 2017.

[Damerau, 1964] Damerau, F.J. A Technique for Computer Detection and Correction of Spelling Errors. Commun. ACM 7, 171–176, 1964.

[Delbanco et al., 2015] Delbanco, T., Walker, J., Darer, J.D., Elmore, J.G., Feldman, H.J. Open Notes: Doctors and Patients Signing On. Ann. Intern. Med. 153, 121–126, 2015.

[Fader et al., 2013] Fader, A., Zettlemoyer, L., Etzioni, O. Paraphrase-Driven Learning for Open Question Answering, in: ACL, pp. 1608–1618, 2013.

[Ghaeini et al., 2018] Ghaeini, R., Hasan, S.A., Datla, V. et al. DR-BiLSTM: Dependent Reading Bidirectional LSTM for Natural Language Inference, in: NAACL HLT, 2018.

[Hasan et al., 2016] Hasan, S.A., Liu, B., Liu, J. et al. Neural Clinical Paraphrase Generation with Attention, in: CNLP Workshop, pp. 42–53, 2016.

[Herranz et al., 2011] Herranz, J., Nin, J., Sole, M. Optimal Symbol Alignment Distance: A New Distance for Sequences of Symbols. IEEE Trans. Knowl. Data Eng. 23, 1541–1554, 2011.

[Iyer et al., 2017] Iyer, S., Dandekar, N., Csernai, K. Quora question pair dataset [WWW Document], 2017.

[Kandula et al., 2010] Kandula, S., Curtis, D., Zeng-Treitler, Q. A semantic and syntactic text simplification tool for health content, in: AMIA, pp. 366–70, 2010.

[Kingma and Ba, 2014] Kingma, D.P., Ba, J. Adam: A Method for Stochastic Optimization, in: ICLR, pp. 1–15, 2014.

[Koehn, 2017] Koehn, P. Neural Machine Translation. CoRR, 2017.

[Koehn, 2010] Koehn, P. Statistical Machine Translation, 1st ed. Cambridge University Press, NY, USA, 2010.

[Kondrak, 2005] Kondrak, G. N-gram similarity and distance. SPIRE, 115–126, 2005.

[Kosten et al., 2012] Kosten, T.R., Domingo, C.B., Shorter, D., Orson, F. et al. Inviting Patients to Read Their Doctors’ Notes: A Quasi-experimental Study and a Look Ahead. Ann. Intern. Med. 157, 461–470, 2012.

[Levenshtein, 1966] Levenshtein, V. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 10, 707–710, 1966.

[Lin et al., 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L. Microsoft COCO: Common Objects in Context, in: ECCV, pp. 740–755, 2014.

[Lindberg et al., 1993] Lindberg, D.A., Humphreys, B.L., McCray, A.T. The Unified Medical Language System. Methods Inf. Med. 32, 281–291, 1993.

[M. Shieber and Nelken, 2006] Shieber, S.M., Nelken, R. Towards robust context-sensitive sentence alignment for monolingual corpora, 2006.

[Madnani and Dorr, 2010] Madnani, N., Dorr, B.J. Generating Phrasal and Sentential Paraphrases: A Survey of Data-Driven Methods. Comput. Linguist. 36, 341–387, 2010.

[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., Dean, J. Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9, 2013.

[Nesterov, 1983] Nesterov, Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Dokl. AN USSR 269, 543–547, 1983.

[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., Zhu, W. BLEU: a method for automatic evaluation of machine translation, in: ACL, pp. 311–318, 2002.

[Pavlick et al., 2015] Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., Callison-Burch, C. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. ACL, 425–430, 2015.

[Pivovarov and Elhadad, 2015] Pivovarov, R., Elhadad, N. Automated methods for the summarization of electronic health records. J. Am. Med. Informatics Assoc. 22, 938–947, 2015.

[Prakash et al., 2016] Prakash, A., Hasan, S.A., Lee, K., Datla, V., Qadir, A., Liu, J., Farri, O. Neural Paraphrase Generation with Stacked Residual LSTM Networks, in: COLING, pp. 2923–2934, 2016.

[Qenam et al., 2017] Qenam, B., Kim, T.Y., Carroll, M.J., Hogarth, M. Text Simplification Using Consumer Health Vocabulary to Generate Patient-Centered Radiology Reporting: Translation and Evaluation. J. Med. Internet Res. 19, e417, 2017.

[Quirk et al., 2004] Quirk, C., Brockett, C., Dolan, B. Monolingual Machine Translation for Paraphrase Generation, in: ACL, 2004.

[Snover et al., 2006] Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J. A Study of Translation Edit Rate with Targeted Human Annotation, in: AMTA, pp. 223–231, 2006.

[Sørensen, 1948] Sørensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol. Skr. 5, 1–34, 1948.

[Vinyals et al., 2015] Vinyals, O., Kaiser, L., Koo, T., Petrov, S., Sutskever, I., Hinton, G. Grammar as a Foreign Language, in: NIPS, 2015.

[Winkler, 1990] Winkler, W.E. String Comparator Metrics and Enhanced Decision Rules in the Fellegi-Sunter Model of Record Linkage, in: ASA, pp. 354–359, 1990.

[Wubben et al., 2010] Wubben, S., van den Bosch, A., Krahmer, E. Paraphrase Generation as Monolingual Translation: Data and Evaluation, in: INLG ’10, pp. 203–207, 2010.

[Xu et al., 2015] Xu, W., Callison-Burch, C., Napoles, C. Problems in Current Text Simplification Research: New Data Can Help, in: ACL, pp. 283–297, 2015.

[Zhao et al., 2009] Zhao, S., Lan, X., Liu, T., Li, S. Application-driven statistical paraphrase generation, in: ACL, pp. 834–842, 2009.

[Zhu et al., 2010] Zhu, Z., Bernhard, D., Gurevych, I. A Monolingual Tree-based Translation Model for Sentence Simplification, in: COLING, pp. 1353–1361, 2010.