Towards dataset creation and establishing baselines for sentence-level neural
                   clinical paraphrase generation and simplification
           Viraj Adduru1*, Sadid A. Hasan2, Joey Liu2, Yuan Ling2, Vivek Datla2, Kathy Lee2,
                                       Ashequl Qadir2, Oladimeji Farri2
                            1
                              Rochester Institute of Technology, Rochester, NY, USA
            2
              Artificial Intelligence Lab, Philips Research North America, Cambridge, MA, USA
                                                {vra2128}@rit.edu,
                          {firstname.lastname, kathy.lee_1, dimeji.farri}@philips.com

                           Abstract                                  other physicians [Qenam et al., 2017]. They contain com-
                                                                     plex medical terminologies that the patients are not familiar
    A paraphrase is a restatement of a text while retain-
                                                                     with. A recent study reported that allowing patients to ac-
    ing the meaning. Clinical paraphrasing involves re-
                                                                     cess their clinical notes has showed an improvement in their
    statement of sentences, paragraphs, or documents                 health care process [Kosten et al., 2012]. Realizing the need
    containing complex vocabulary used by clinicians.
                                                                     for increased inclusion of patients in their health care pro-
    Paraphrasing can result in an alternative text that is
                                                                     cess, large health care systems have allowed for the patients
    either simple or complex form of the original input              to access their medical records [Delbanco et al., 2015].
    text. Simplification is a form of paraphrasing in
                                                                     However, these medical records contain raw complex clini-
    which a sentence is restated into a linguistically
                                                                     cal text intended for the communication between medical
    simpler sentence yet retaining the meaning of the                professionals. Paraphrasing or simplification of clinical text
    original sentence. Clinical text simplification has
                                                                     will improve the patients’ understanding of their health con-
    potential applications such as simplification of
                                                                     ditions and thereby play an important role in connecting
    clinical reports for patients towards better under-              patients and caregivers across the clinical continuum to-
    standing of their clinical conditions. Deep learning
                                                                     wards better patient outcome.
    has emerged as a successful technique for various
                                                                        Traditional clinical paraphrasing and simplification ap-
    natural language understanding tasks precondi-                   proaches use lexical methods [Kandula et al., 2010;
    tioned with large annotated datasets. In this paper,
                                                                     Pivovarov and Elhadad, 2015; Qenam et al., 2017], which
    we propose a methodology to create preliminary
                                                                     are typically focused on identifying complex clinical words,
    datasets for clinical paraphrasing, and clinical text            phrases, or sentences and replace them with their alterna-
    simplification to foster training of deep learning-
                                                                     tives in case of paraphrasing or simpler versions in case of
    based clinical paraphrase generation and simplifi-
                                                                     simplification. Lexical methods take advantage of
    cation models.                                                   knowledge sources like Unified Medical Language System
                                                                     (UMLS) metathesaurus [Lindberg et al., 1993] which con-
1   Introduction and related work                                   tains grouped words and phrases that describe various medi-
Paraphrasing (a.k.a. paraphrase generation) is transforming          cal concepts. Simplification is traditionally performed by
a text that can be a word, phrase, sentence, paragraph, or a         mapping UMLS concepts to their alternatives provided in
document, while retaining the meaning and content. For               consumer health vocabulary (CHV) [Qenam et al., 2017].
example, the sentence ‘I am very well’ can be paraphrased               Recently, paraphrase generation was casted as a monolin-
as ‘I am doing great’. Paraphrasing can lead to a new text           gual machine translation problem resulting in the develop-
which may be simpler, more complex or at the same com-               ment of data-driven methods using statistical machine trans-
plexity level as the source text. The task of paraphrasing text      lation (SMT) [Koehn, 2010], and neural machine translation
to a simpler form is called simplification. In simplification,       (NMT) principles [Koehn, 2017]. SMT methods [Quirk et
the output text is a linguistically simplified version of the        al., 2004; Wubben et al., 2010; Zhao et al., 2009] model the
input text. Paraphrasing and simplification may have nu-             conditional distributions of the words and phrases and re-
merous applications such as document summarization, text             place the phrases in the source text with the phrases that
simplification for target audience e.g. children, and question       maximize the probability of the resulting text. However,
answering [Madnani and Dorr, 2010].                                  syntactic relationships are difficult to model using SMT
   In the clinical context, health care systems and medical          methods. Monolingual NMT systems use neural network
knowledge-bases contain large collections of texts that are          architectures to model complex relationships by automati-
often not comprehensible to the layman population. For               cally learning from large datasets containing source and
example, clinical texts like radiology reports are used by           target text pairs, both belonging to the same language. Cur-
radiologists to professionally communicate their findings to         rent NMT systems for paraphrase generation or simplifica-
                                                                     tion [Brad and Rebedea, 2017; Hasan et al., 2016; Prakash
——————                                                               et al., 2016] use sequence-to-sequence networks based on
    
      *This work was conducted as part of an internship program at   encoder-decoder architectures. Unlike traditional methods,
Philips Research.
NMTs do not need semantic or syntactic rules to be explicit-       nicians. We crawl the articles with same topics from two or
ly defined. However, they need carefully constructed da-           more web-based knowledge sources. Each sentence in a
tasets that contain sufficient information to robustly train the   topic (i.e. in an article) from one resource is mapped to sen-
deep neural networks.                                              tences belonging to the same topic from another resource(s)
   Existing clinical paraphrasing and simplification datasets      using a one-to-many scheme to create all possible sentence
are limited to short phrases. Hasan et al., (2016) trained an      pair combinations. These sentence pairs essentially contain
attention-based encoder-decoder model [Bahdanau et al.,            a large number of unrelated pairs from which meaningful
2015] using a dataset created by merging two word and              paraphrasing pairs are identified.
phrase level datasets: the paraphrase database (PPDB)                 Manual identification of the relevant paraphrase pairs is a
[Pavlick et al., 2015] and the UMLS metathesaurus. They            tedious task as the sentence pair combinations (as discussed
showed that their model outperformed an upper bound para-          above) contain a large number (in millions) of unrelated
phrasing baseline. However, they used a phrasal dataset that       sentence pairs. Therefore, we use an automated approach to
does not contain more complex contextual knowledge like a          identify the paraphrase pairs from the sentence pair combi-
sentential dataset, and the ability of the network to simplify     nations. Our method is similar to the approach by [Zhu et
the clinical text was not explored. In contrast to paraphras-      al., 2010]. They use TF-IDF [M. Shieber and Nelken, 2006]
ing, simplification is a harder problem and may involve ad-        metric to align sentences between Wikipedia and Simple-
dition, deletion or splitting of sentences to suite the target     Wikipedia knowledge sources to create sentence pairs for
audience. These operations require additional knowledge            the text simplification task. However, some studies, e.g. Xu
that a dataset with longer sequences like sentences or para-       et al., 2015, reveal the noisy nature of such datasets, which
graphs could provide. Other studies [Brad and Rebedea,             motivated us to explore various textual similarity/distance
2017; Prakash et al., 2016] have trained encoder-decoder           metrics instead of relying on one single metric for sentence
architectures with attention for paraphrasing using general        alignment. Our intuition is that the strengths of a collection
domain sentence level datasets like Microsoft Common Ob-           of diverse metrics may be useful for better sentence align-
jects in Context (MSCOCO) [Lin et al., 2014], Newsela              ment. In addition to various existing metrics, we train a neu-
[Xu et al., 2015] and Wikianswers [Fader et al., 2013]. They       ral paraphrase identification model to estimate a similarity
demonstrated that neural machine translation models suc-           score between two sentences, which is also used as a sup-
cessfully captured the complex semantic relationships from         plementary sentence alignment metric.
the general domain datasets. However, it is unclear how
these networks would perform on complex clinical text.             2.2 Sentence alignment
   In this paper, our aim is to pioneer the creation of parallel   Paraphrase pairs can be identified by computing various
(with source and target pairs) sentential datasets for clinical    sentence similarity/distance metrics between the two sen-
paraphrase generation and simplification. Web-based un-            tences in a pair. Various character-level and word-level met-
structured knowledge sources like www.mayoclinic.com               rics that we used are described below.
contain articles on various medical topics. We obtain arti-
cles with matching titles from different web-based                 Levenshtein distance
knowledge sources and align the sentences using various            Levenshtein distance [Levenshtein, 1966] is defined as the
metrics to create paraphrase and simplification pairs. Addi-       minimum number of string operations consisting of addi-
tionally, we train NMT models using the prepared clinical          tions, deletions, and substitutions of symbols that are neces-
datasets and present baseline performance metrics for both         sary to transform one string into another. Normalized Le-
clinical paraphrase generation and simplification.                 venshtein distance (LDN) is computed by dividing the num-
   Next section outlines our approach to create clinical para-     ber of string operations required by the length of the longer
phrase generation and simplification datasets. First, we dis-      string. Character- or word-level LDN is calculated by treat-
cuss our proposed methodology for extracting sentence pairs        ing characters or words as symbols respectively:
from web-based clinical knowledge sources. Then we de-                                              𝑁
scribe various metrics to align the pairs of related sentences                        𝐿𝐷𝑁 =                                 (1)
                                                                                              𝑚𝑎𝑥 (𝑛, 𝑚)
for dataset creation. Section 3 discusses the neural network
architectures used for establishing baselines. Sections 4 and         where N is the minimum number of string operations to
5 present the performance evaluation of the models and in          transform a text x to y or vice versa, and n and m are the
section 6 we conclude and discuss the future work.                 number of symbols in the texts x and y respectively.
                                                                   Damerau-Levenshtein distance
2   Approach                                                       Damerau-Levenshtein distance [Damerau, 1964] is similar
                                                                   to LDN and is defined as the minimum number of string
2.1 Paraphrase pairs from web-based resources                      operations needed to transform one string into the other. In
                                                                   addition to the string operations in Levenshtein distance,
Web-based textual resources contain large collections of           Damerau-Levenshtein distance further includes transposi-
articles for various medical topics related to diseases, anat-     tion of two adjacent symbols. Normalized Demerau-
omy, treatment, symptoms etc. These articles are often tar-        Levenshtein distance (DLDN) is calculated by dividing the
geted for general (non-clinician) users and are easier to un-
derstand unlike the complex clinical reports written by cli-
number of string operations by the number of symbols in the      Sorensen similarity
longer string.                                                   Sorensen similarity (also called Sorensen-Dice coefficient)
                                                                 [Sørensen, 1948] is similar to Jaccard similarity and it is
Optimal string alignment distance
                                                                 computed as the ratio of twice the number of common items
Optimal string alignment distance [Herranz et al., 2011] is a
                                                                 (intersection) and the sum of number of items in the two
variant of DLDN but under a restriction that no substring is     strings.
edited more than once. The normalized form is computed
                                                                    All the above metrics are used in their normalized forms
similarly as in DLDN.
                                                                 (values between 0 to 1). These metrics calculate the simi-
Jaro-Winkler distance                                            larity/distance between the sentence pairs using the charac-
Jaro-Winkler distance (JWD) [Winkler, 1990] computes the         ter- or word-level overlap and the pattern of their occurrenc-
distance between two strings, where the substitution of two      es in the sentences. However, these metrics do not consider
close symbols is considered more important than the substi-      the presence of concepts (e.g. words or phrases) that are
tution of two symbols that are far from each other. The Jaro-    paraphrased using a different vocabulary (e.g. ‘glioma’ can
Winkler distance JWD is given by:                                be paraphrased with its synonym ‘brain tumor’) and also do
                                                                 not perform well for sentences that differ by a few words
                𝑑𝑗 ,                    𝑖𝑓 𝑑𝑗 < 0                resulting in contradicting sentences. Therefore, we need a
      𝐽𝑊𝐷 = {                                           (2)
                𝑑𝑗 + 𝑘 𝑝 (1 − 𝑑𝑗 ),    𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒                 similarity metric that can consider complex semantic rela-
                                                                 tionships between the concepts represented in the sentences.
   where k is the length of the common prefix at the start of    Deep neural network architectures with recurrent neural
the string up to 4 symbols, p is the constant usually set to     networks (RNNs), and convolution neural networks (CNNs)
0.1 and dj is the Jaro distance given by:                        have so far demonstrated state-of-the-art performance
                                                                 [Conneau et al., 2017] in learning semantic associations
             0                          𝑖𝑓 𝑞 = 0                 between the sentences. Therefore, deep-learning based sys-
      𝑑𝑗 = {1 𝑞 𝑞 𝑞 − 𝑡                                          tems are increasingly being used for advanced natural lan-
             ( + +      ),             𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒        (3)      guage inferencing tasks like paraphrase identification, and
            3 𝑛 𝑚   𝑞
                                                                 textual entailment [Ghaeini et al., 2018], which motivated us
   where q is the number of matching words between the           to create a neural paraphrase identification model for the
two texts x and y with lengths 𝑛 and 𝑚 respectively and 𝑡 is     purpose of supplementing our sentence similarity measures
half of the number of transpositions. Jaro-Winkler distance      for better sentence pair alignment.
is a normalized quantity ranging from 0 to 1.
                                                                 2.3 Paraphrase identification metric
Longest common subsequence                                       Neural paraphrase identification can be stated as a binary
Longest    common      subsequence    distance   (LCSD)          classification task in which a neural network model esti-
[Bakkelund, 2009] is computed using the following equa-          mates the probability that the two sentences are paraphrases.
tion:                                                            This estimated probability can be used as a similarity metric
                            𝐿𝐶𝑆(𝑥, 𝑦)                            to align the sentence pairs.
                𝐿𝐶𝑆𝐷 = 1 −                         (4)
                            𝑚𝑎𝑥(𝑛, 𝑚)                            Neural paraphrase identification
   where LCS (longest common subsequence) is the longest         The network consists of stacked bidirectional long short-
subsequence common to strings x and y with lengths n and         term memory (BiLSTM) layers in a Siamese architecture
m respectively.                                                  [Dadashov et al., 2017] (Figure 1). Each arm of the Siamese
                                                                 network consists of three stacked BiLSTM layers. The out-
N-gram distance                                                  puts of the final BiLSTM layers of both the arms are con-
   N-gram is a contiguous sequence of n items from a given       catenated and fed into the dense layer with ReLU activation
sample of a text. N-gram distance [Kondrak, 2005] is simi-       followed by a second dense layer with a sigmoid activation
lar to computing LCS but in this case the symbols are n-         function. We use a depth of 300 for all the BiLSTM layers
grams. We used n = 4 in this paper.                              and the dense layers. The maximum sequence length of the
Cosine similarity                                                BiLSTM layers is set to 30. The words in the input sentenc-
Cosine similarity between two strings is computed as the         es are embedded using Word2Vec embeddings pre-trained
cosine of the angle between the vector representation of two     on the Google news corpus.
strings (x and y) and is given by the equation:                  Hybrid dataset for paraphrase identification
                               𝑉𝑥 . 𝑉𝑦                           Our paraphrase identification model is trained using a hy-
                      𝐶𝑆 =                             (5)       brid corpus created by merging two paraphrase corpora:
                             |𝑉𝑥 |. |𝑉𝑦 |
                                                                 Quora question pairs, and Paralex question pairs. The Quora
Jaccard similarity                                               question pair corpus [Iyer et al., 2017] consists of 404289
Jaccard similarity is calculated as the ratio of the intersec-   question pairs with 149263 paraphrase pairs and 255027
tion to the union of the items in the two strings.               non-paraphrase pairs. The Paralex dataset [Fader et al.,
                                                                 2013] consists of 35,692,309 question pairs, where all the
question pairs are paraphrases of each other. The Paralex        Paraphrase identification model for sentence alignment
dataset is unbalanced as it does not contain any non-            The probability score from our paraphrase identification
paraphrase pairs. After merging the sentence pairs from both     model for the predicted class is used along with the word-
the corpora, we have 35692309 sentence pairs with                and character-level similarity/distance metrics to calculate a
35437283 paraphrase pairs and only 255027 non-paraphrase         mean similarity score. Note that, all normalized distance
pairs. To balance the dataset, we identify the list of unique    metrics are converted into similarity metrics by subtracting
questions and then randomly select two questions from the        the corresponding score from 1, thereby obtaining 12 differ-
unique questions list and add the pair to the merged corpus      ent similarity metrics. The mean similarity score is comput-
as a non-paraphrase pair if the pair does not already exist.     ed using the formula given below:
Non-paraphrase pairs are created until the non-paraphrase,
and paraphrase pairs are equal in number, resulting in a bal-         𝑚𝑒𝑎𝑛 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑠𝑐𝑜𝑟𝑒 = 𝑚𝑒𝑎𝑛 (𝑎𝑙𝑙 𝑠𝑖𝑚𝑖𝑙𝑎𝑟𝑖𝑡𝑦 𝑠𝑐𝑜𝑟𝑒s)   (6)
anced dataset of 70 million pairs.
                                                                    Minimum and maximum thresholds of 0.5 and 0.8 are
                                                                 empirically selected by observing the sentence pairs. Sen-
                                                                 tence pairs with a similarity score within these thresholds
                                                                 are considered paraphrase pairs. We use a maximum thresh-
                                                                 old of 0.8 to avoid selection of identical sentences.

                                                                 2.4 Paraphrase generation dataset
                                                                 The paraphrase sentence pairs are obtained from three web-
                                                                 based unstructured knowledge sources: Wikipedia, Sim-
                                                                 pleWikipedia, and MayoClinic. These sentence pairs form
                                                                 our clinical paraphrase generation dataset, which is later
                                                                 used to train a baseline neural paraphrase generation model.
                                                                 Wikipedia and SimpleWikipedia
                                                                 SimpleWikipedia contains simplified versions of pages from
                                                                 the original Wikipedia. However, the text in the correspond-
                                                                 ing documents is unaligned (no sentence-to-sentence match-
                                                                 ing). Pairing sentences from Wikipedia to those in Sim-
                                                                 pleWikipedia leads to a parallel corpus for the simplification
                                                                 dataset as the later mostly contains simplified versions of
                                                                 the former. However, in the case of paraphrase generation,
                                                                 the resultant pairs can be swapped as paraphrasing applies in
 Figure 1. Paraphrase identification architecture. Gray arrows   both directions. The swapping also helps augmenting the
 represent weight-sharing between the left and right BiLSTM.     dataset. We create a parallel corpus using 164 matched titles
                                                                 from Wikipedia from clinically relevant categories such as
Training                                                         anatomy, brain, disease, medical condition etc. Sentences
The dataset is preprocessed by removing punctuations, nor-       from each of 164 Wikipedia documents are paired with all
malization with respect to case, and standard tokenization.      the sentences from the documents with identical titles from
The tokens are embedded using Word2Vec embeddings pre-           SimpleWiki. Thus, we obtain 818520 sentence pairs for
trained on Google news corpus [Mikolov et al., 2013].            which we compute similarity scores as discussed in the pre-
Words that are not found in the pre-trained vocabulary are       vious subsections. We finally obtain 1491 related sentence
embedded with a zero-vector representing an UNK token.           pairs after thresholding the mean similarity score and we
Longer sentences (> 30 words) are truncated, and smaller         name this parallel corpus as WikiSWiki.
sentences (< 30 words) are padded with UNK tokens. As the        Mayoclinic
sentences are in a bidirectional relationship to each other,     Mayoclinic contains pages for 48 identically matched titles
the training pairs are swapped to increase the dataset size.     from the 164 titles identified from Wikipedia. Unique sen-
The dataset is split into 80%, 10% and 10% for training,         tences from WikiSWiki were paired with the sentences ob-
validation and testing respectively.                             tained from the pages with matched titles from Mayoclinic
   The paraphrase identification model is trained using Ad-      and similarity scores are computed. Using the same thresh-
am optimizer [Kingma and Ba, 2014] with Nesterov mo-             olds as above, 3203 sentence pairs are selected. These pairs
mentum [Nesterov, 1983] to optimize a binary cross entropy       are added to the WikiSWiki corpus to form a corpus con-
loss. The update direction is calculated using a batch size of   taining 4694 sentence pairs; we name it WikiSwikiMayo.
512. We utilize early stopping using validation error with
patience of 3 epochs to prevent overfitting.                     2.5 Simplification dataset
   The network is trained for 18 epochs before early stop-
                                                                 The WikiSWiki is a simplification corpus as it mostly con-
ping at 22 minutes per epoch. The validation accuracy of
                                                                 tains sentences mapped to their simpler forms. However, the
our model is 95% and test accuracy is 97%.
small number of sentence pairs may be insufficient for train-
ing the network to learn complex relationships required for
clinical text simplification. Therefore, we use additional
web-based knowledge sources to increase the dataset size.
Web-based knowledge sources: www.webmd.com (webmd)
and www.medicinenet.com (medicinenet), are other clini-
cal knowledge sources that are similar to MayoClinic.
Through manual inspection, we found that webmd contains
simpler sentences than medicinenet in many topics that we
have examined, which is reasonable as medicinenet content             Figure 2. Encoder-decoder architecture. x and y are the source
is curated by clinicians. Therefore, we use them as addition-         and target sequences respectively.
al knowledge sources to create our simplification dataset.
                                                                     For paraphrase generation, the network is trained using
For 164 topics from the WikiSWiki dataset we perform a
                                                                  the WikiSwikiMayo corpus containing 4694 sentence pairs.
google search with ‘webmd’ and ‘medicinenet’ as additional
search terms. The search returns 61314 sentences from             The source and target sentences are swapped as paraphras-
                                                                  ing is bidirectional thereby, doubling the number of sen-
webmd and medicinenet for all 164 topics. Sentences from
                                                                  tence pairs to 9388. The dataset is divided into training, val-
medicinenet are paired to the sentences from SimpleWiki
and webmd from the articles with matched titles. Sentences        idation and test sets. The training sentence pairs that contain
                                                                  sentences from the source side of the test are removed to
from Wikipedia articles are paired with sentences from
                                                                  prevent data leak issues. Same is repeated for the validation
webmd separately as they are already paired with Sim-
pleWikipedia. We obtain 714608 new sentence pairs result-         set. Using this we make sure that any sentence occurs as a
                                                                  source sentence in exactly one of the sets (training, test or
ing in 1002 final pairs after computing similarity scores and
                                                                  validation). The number of sentence pairs in training, test
thresholding. These sentence pairs are merged with WikiS-
Wiki dataset to create the monolingual clinical simplifica-       and validation datasets are 6095, 611 and 611 respectively.
                                                                  The paraphrase generation network is trained for 10000
tion dataset containing 2493 sentence pairs. Although our
                                                                  steps with a batch size of 128 samples per step.
final corpus contains a small number of sentence pairs, our
main contribution in this paper is to introduce an automated         The simplification corpus containing 2493 sentence pairs
                                                                  is used to train the simplification network. Vocabularies for
method to create sentence pairs from web-based knowledge
                                                                  source and target are created separately in case of simplifi-
sources, towards creating a large clinical simplification cor-
pus in the future.                                                cation. The source and target vocabularies are different in
                                                                  case of text simplification. As simplification is a unidirec-
                                                                  tional task, we do not use data swapping. We prevent data
3   Paraphrase generation and simplification                      leak issues using the same procedure as paraphrase genera-
                                                                  tion while splitting the data. The training, test and validation
3.1 Model                                                         sets contain 1918, 187 and 187 sentence pairs respectively.
Sequence-to-sequence models using encoder-decoder archi-          The simplification network is trained for 3500 steps.
tecture with attention [Vinyals et al., 2015] (Figure 2) are
trained for both paraphrase generation and simplification         4     Evaluation metrics
tasks. The encoder and decoder are made of three stacked          BLEU [Papineni et al., 2002], METEOR [Banerjee and
RNN layers using BiLSTM cells and LSTM cells respec-              Lavie, 2005] and translation error rate (TER) [Snover et al.,
tively. We use a cell depth of 1024 for all the layers in the     2006] are used to evaluate our models. These metrics are
encoder and the decoder. The maximum sequence length is           shown to correlate with human judgements for evaluating
set to 50. The sentences are preprocessed, and the words are      paraphrase generation models [Wubben et al., 2010]. BLEU
encoded using one-hot vector encoding. The outputs of the         looks for exact string matching using n-gram overlaps to
decoder are projected onto the output vocabulary space us-        evaluate the similarity between two sentences. METEOR
ing a dense layer with a softmax activation function.             uses WordNet to obtain synonymously related words to
                                                                  evaluate sentence similarity. Higher BLEU and METEOR
3.2 Training                                                      scores indicate higher similarity. TER score measures the
The network parameters are optimized by minimizing a              number of edits necessary to transform the source sentence
sampled softmax loss function. The gradients are truncated        to the target. Lower TER score indicates higher similarity.
by limiting the global norm to 1. The network is trained
using mini-batch gradient descent algorithm with batch size       5     Results and discussion
of 128. An initial learning rate of 0.5 is used with a decay of
0.99 for every step. The training set is shuffled for every
epoch. The networks are trained using 80% of the sentence         5.1 Sentence alignment
pairs and validated on 10% and tested on 10%. Both models         Table 1 presents a few examples of the aligned sentence
are developed using Tensorflow, version 1.2, and two Tesla        pairs for both clinical paraphrase generation and simplifica-
K20 GPUs.                                                         tion.
                                                                  Mean    to the same topic. Furthermore, using other similarity met-
Clinical Paraphrase Generation                                    Sim.    rics that are based on word matching helps in overcoming
                                                                  Score   this problem in cases where the paraphrase identification
Example 1: Good                                                           metric is insensitive. We examined that this holds true in
S1: No drug is currently approved for the treatment of small-             majority of the pairs by visual inspection of the selected
                                                                  0.52
    pox.                                                                  sentence pairs, for both the datasets.
S2: No cure or treatment for smallpox exists
Example 2: Acceptable                                                     5.2 Paraphrase generation and simplification
S1: Worldwide, breast cancer is the most common invasive                  Average quality scores on the test sets for the clinical para-
    cancer in women.                                              0.62    phrase generation and the text simplification models are
S2: After skin cancer, breast cancer is the most common
    cancer diagnosed in women in the United States
                                                                          presented in Table 2. These scores serve as baselines for
                                                                          clinical paraphrase generation and text simplification for the
Example 3: Bad                                                            datasets that we have created. The quality metrics are lower
S1: Gallbladder cancer is a rare type of cancer which forms               for clinical text simplification than the paraphrase genera-
    in the gallbladder.                                           0.53
S2: At this stage, gallbladder cancer is confined to the inner
                                                                          tion. This is expected as in the case of paraphrase generation
    layers of the gallbladder                                             many of the words from the source sentence can be retained
                                                                          in the paraphrased sentence whereas simplification involves
Clinical Text Simplification                                              complex transformations which results in different words in
Example 1: Good                                                           the resulting sentence and hence the quality scores are low.
S1: In Western cultures, ingestion of or exposure to peanuts,             Further human evaluations are required to better rate the
    wheat, nuts, certain types of seafood like shellfish, milk,           performance of the simplification model.
    and eggs are the most prevalent causes.                       0.54
S2: In the Western world, the most common causes are eating                Task                    BLEU        METEOR             TER
    or touching peanuts, wheat, tree nuts, shellfish, milk, and            Clinical Paraphrase
    eggs.                                                                                         9.4±0.5       15.1±0.3        108.7±1.5
                                                                           Generation
Example 2: Acceptable                                                      Clinical Text
                                                                                                  9.9±1.6       10.6±0.8        97.7±2.9
S1: Together the bones in the body form the skeleton.             0.54     Simplification
S2: The bones are the framework of the body.
                                                                             Table 2. Average scores computed over test sentence pairs.
Example 3: Bad                                                               Few example outputs of the clinical paraphrase genera-
S1: There are two major types of diabetes, called type 1 and
    type 2
                                                                  0.54    tion and simplification system are presented in Table 3. The
S2: There are other kinds of diabetes, like diabetes insipidus.           examples show that both paraphrase generation and simpli-
                                                                          fication models retained the knowledge of the overall topic
Table 1. Examples of aligned sentence pairs. Good represents that         in the generated sentences. Example 2 in both models shows
accepted sentences are paraphrases. Bad represents that accepted          that, though the topic of the generated sentence matches
sentences are not paraphrases.                                            with the source, the sentence is not a paraphrase or the sim-
                                                                          plification respectively, as the context in the resultant sen-
   In Table 1, for both paraphrase generation and text sim-               tence is different from that of the source. This may be be-
plification tasks, though the similarity score between the                cause of the failure in the alignment of the sentences while
sentence pairs is similar across all the examples there is a              creating the datasets. This shows that the paraphrase identi-
large variability in the classification of the sentence pair.             fication model and the metrics were not fully sufficient to
This means there is an overlap between the distributions of               pair the sentences accurately. In particular, the paraphrase
the mean similarity score of the paraphrase pairs and the                 identification model trained on general domain question
non-paraphrase pairs. Therefore, the selection of minimum                 pairs may not generalize well to identify paraphrase pairs in
threshold less than 0.5 introduces more non-paraphrase pairs              case of clinical texts. The solution may be using transfer
into the dataset and by selecting the threshold more than 0.5             learning and training the paraphrase identification network
we lose a large number of pairs that are paraphrases. One                 on a subset of human rated clinical paraphrases.
desirable approach is to train a linear regression or any mul-
ti-variate machine learning model to classify the paraphrase               Clinical         Example 1                   Example 2
pairs using all the computed similarity metrics. However,                  Paraphrase
                                                                           Generation
training such machine learning systems requires ground-
                                                                              Source        dengue fever pro-           Lung cancer often
truth data and therefore is outside the scope of this paper.                                nounced den gay is an       spreads (metastasiz-
   Our paraphrase identification system uses a vocabulary                                   infectious disease caused   es) to other parts of
from the Google News corpus dataset. The words that are                                     by the dengue virus         the body, such as the
not present in this vocabulary are assigned the UNK token.                                                              brain and the bones
Therefore, the neural paraphrase identification network is
not sensitive when two semantically similar sentences refer
to different objects. However, this problem is minimized in
our case as we pair the sentences from the pages belonging
    Target        dengue fever is caused      Primary lung cancers      text during sentence alignment, which would help to create
                  by any of the four den-     themselves most           cleaner datasets.
                  gue viruses spread by       commonly metasta-            Previous research has found that existing simplification
                  mosquitoes that thrive in   size to the brain,        datasets created using Wikipedia-like knowledge sources
                  and near human lodgings     bones, liver and ad-
                                              renal glands              are noisy [Xu et al., 2015] as these knowledge sources are
    Generated     Dengue fever is a mos-      Lung cancer staging       not created with a specific objective. However, task specific
                  quito borne tropical        is an assessment of       datasets for clinical paraphrase generation and simplifica-
                  disease caused by the       the degree of spread      tion do not exist as of writing this paper. Therefore, we ap-
                  dengue virus                of the cancer from its    proached the creation of such datasets for clinical para-
                                              original source           phrase generation and simplification using web-based
 Clinical Text                                                          knowledge sources. We hope that this serves as a starting
 Simplification                                                         point towards developing automated approaches for creating
    Source        Diabetes is due to either   Ventricular tachycar-     task specific datasets using unstructured knowledge sources.
                  the pancreas not produc-    dia can be classified
                  ing enough insulin or the   based on its mor-
                  cells of the body not       phology                   6   Conclusion and future work
                  responding properly to                                This paper presents a preliminary work on automated meth-
                  the insulin produced                                  odology to create clinical paraphrase generation and simpli-
    Target        Diabetes is the condition   Ventricular tachycar-     fication datasets. We use web-based knowledge sources and
                  that results from lack of   dia can be treated in a
                  insulin in a person blood   few different ways
                                                                        automatically align sentence pairs from matching topics to
                  or when their body has a                              create the datasets. Additionally, these datasets are used to
                  problem using the insu-                               train sequence-to-sequence models leveraging an encoder-
                  lin it produces insulin                               decoder architecture with attention for paraphrase genera-
                  resistance                                            tion and simplification. Further research to improve string
    Generated     Diabetes can occur when     Ventricular tachycar-     similarity metrics is required to accurately identify similar
                  the pancreas produces       dia can be caused by      sentence pairs to create cleaner datasets. In future, we will
                  very little to no insulin   many different things     include more knowledge sources and topics to create larger
                  or when the body does
                                                                        datasets and use automated methods to remove unrelated or
                  not respond appropriate-
                  ly to insulin                                         unwanted text in the paired sentences.

Table 3. Example outputs from clinical paraphrase generation and        References
simplification models.
                                                                        [Bahdanau et al., 2015] Bahdanau, D., Cho, K., Bengio, Y.,.
   Our datasets consist of a small number of sentence pairs               Neural Machine Translation By Jointly Learning To Align and
(few thousands) and may not be sufficient for the neural                  Translate, in: ICLR. pp. 1–15, 2015.
network models to learn complex clinical concepts. Fur-
                                                                        [Bakkelund, 2009] Bakkelund, D.,. An LCS-based string metric.
thermore, we use only 164 medical topics from Wikipedia
for this work. Improving the efficiency of paraphrase identi-             University of Oslo, Oslo, Norway, 2009.
fication and inclusion of more knowledge sources and topics             [Banerjee and Lavie, 2005] Banerjee, S., Lavie, A.,. METEOR:
will create larger and better training datasets. Many sentenc-            An Automatic Metric for MT Evaluation with Improved
es that are paired contain text related to additional infor-              Correlation with Human Judgments, in: ACL. pp. 65–72, 2005.
mation that the other sentence does not contain. For exam-
                                                                        [Brad and Rebedea, 2017] Brad, F., Rebedea, T.,. Neural
ple:
                                                                          Paraphrase Generation using Transfer Learning, in: INLG. pp.
   Source: “It isn’t clear why some people get asthma and                 257–261, 2017.
others don’t, but it’s probably due to a combination of envi-
                                                                        [Conneau et al., 2017] Conneau, A., Kiela, D., Schwenk, H.,.
ronmental and genetic factors”.
   Target: “Asthma is thought to be caused by a combina-                  Supervised Learning of Universal Sentence Representations
tion of genetic and environmental factors”.                               from Natural Language Inference Data, in: CoRR. 2017.
                                                                        [Dadashov et al., 2017] Dadashov, E., Sakshuwong, S., Yu, K.,.
   The removal of the additional text in the first part of the
source sentence will improve the training of the neural net-              Quora Question Duplication 1–9, 2017.
work as it can focus more on the important text. The un-                [Damerau, 1964] Damerau, F.J.,. A Technique for Computer
wanted text in this example can be easily removed as it is                Detection and Correction of Spelling Errors. Commun. ACM 7,
clearly separated from the rest of the sentence. However,                 171–176, 1964.
many sentences that contain unwanted text are not easily
                                                                        [Delbanco et al., 2015] Delbanco, T., Walker, J., Darer, J.D.,
separable. Moreover, manual removal of unwanted text
from thousands of sentences (if not millions) is not practi-              Elmore, J.G., Feldman, H.J.,. Open Notes: Doctors and Patients
cal. Automated methods are needed to remove unwanted                      Signing On. Ann. Intern. Med. 153, 121–126, 2015.
                                                                        [Fader et al., 2013]   Fader, A., Zettlemoyer, L., Etzioni, O.,.
  Paraphrase-Driven Learning for Open Question Answering, in:           o(1/k^2). Dokl. AN USSR 269, 543–547, 1983.
  ACL. pp. 1608–1618, 2013.                                           [Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., Zhu,
[Ghaeini et al., 2018] Ghaeini, R., Hasan, S.A., Datla, V. et al.        W.,. BLEU: a method for automatic evaluation of machine
  DR-BiLSTM: Dependent Reading Bidirectional LSTM for                    translation, in: ACL. pp. 311–318, 2002.
  Natural Language Inference, in: NAACL HTL, 2018.                    [Pavlick et al., 2015] Pavlick, E., Rastogi, P., Ganitkevitch, J.,
[Hasan et al., 2016] Hasan, S.A., Liu, B., Liu, J. et al. Neural        Durme, B. Van, Callison-Burch, C.,. PPDB 2.0: Better
  Clinical Paraphrase Generation with Attention, in: CNLP               paraphrase ranking, fine-grained entailment relations, word
  Workshop. pp. 42–53, 2016.                                            embeddings, and style classification. ACL 425–430, 2015.
[Herranz et al., 2011] Herranz, J., Nin, J., Sole, M.,. Optimal       [Pivovarov and Elhadad, 2015] Pivovarov, R., Elhadad, N.,.
  Symbol Alignment Distance: A New Distance for Sequences of             Automated methods for the summarization of electronic health
  Symbols. IEEE Trans. Knowl. Data Eng. 23, 1541–1554, 2011.             records. J. Am. Med. Informatics Assoc. 22, 938–947, 2015.
[Iyer et al., 2017] Iyer, S., Dandekar, N., Csernai, K.,. Quora       [Prakash et al., 2016] Prakash, A., Hasan, S.A., Lee, K., Datla, V.,
   question pair dataset [WWW Document], 2017.                           Qadir, A., Liu, J., Farri, O.,. Neural Paraphrase Generation with
[Kandula et al., 2010] Kandula, S., Curtis, D., Zeng-Treitler, Q.,.      Stacked Residual LSTM Networks, in: COLING. pp. 2923–
  A semantic and syntactic text simplification tool for health           2934, 2016.
  content., in: AMIA. pp. 366–70, 2010.                               [Qenam et al., 2017] Qenam, B., Kim, T.Y., Carroll, M.J.,
[Kingma and Ba, 2014] Kingma, D.P., Ba, J.,. Adam: A Method             Hogarth, M.,. Text Simplification Using Consumer Health
  for Stochastic Optimization, in: ICLR. pp. 1–15, 2014.                Vocabulary to Generate Patient-Centered Radiology Reporting:
                                                                        Translation and Evaluation. J. Med. Internet Res. 19, e417,
[Koehn, 2017] Koehn, P.,. Neural Machine Translation. CoRR,
                                                                        2017.
  2017.
                                                                      [Quirk et al., 2004] Quirk, C., Brockett, C., Dolan, B.,.
[Koehn, 2010] Koehn, P.,. Statistical Machine Translation, 1st ed.
                                                                        Monolingual Machine Translation for Paraphrase Generation,
  Cambridge University Press, NY, USA, 2010.
                                                                        in: ACL, 2004.
[Kondrak, 2005] Kondrak, G.,. N-gram similarity and distance.
                                                                      [Snover et al., 2006] Snover, M., Dorr, B., Schwartz, R.,
  SPIR 115–126, 2005.
                                                                        Micciulla, L., Makhoul, J.,. A Study of Translation Edit Rate
[Kosten et al., 2012] Kosten, T.R., Domingo, C.B., Shorter, D.,         with Targeted Human Annotation, in: AMTA. pp. 223–231,
  Orson, F. et al. Inviting Patients to Read Their Doctors’ Notes:      2006.
  A Quasi- experimental Study and a Look Ahead. Ann. Intern.
                                                                      [Sørensen, 1948] Sørensen, T.,. A method of establishing groups
  Med. 157, 461–470, 2012.
                                                                        of equal amplitude in plant sociology based on similarity of
[Levenshtein, 1966] Levenshtein, V.,. Binary Codes Capable of           species and its application to analyses of the vegetation on
  Correcting Deletions, Insertions, and Reversals. Sov. Phys.           Danish commons. Biol. Skr. 5, 1–34, 1948.
  Dokl. 10, 707–710, 1966.
                                                                      [Vinyals et al., 2015] Vinyals, O., Kaiser, L., Koo, T., Petrov, S.,
[Lin et al., 2014] Lin, T.Y., Maire, M., Belongie, S., Hays, J.,        Sutskever, I., Hinton, G.,. Grammar as a Foreign Language, in:
   Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.,. Microsoft       NIPS, 2015.
   COCO: Common objects in context, in: ECCV. pp. 740–755,
                                                                      [Winkler, 1990] Winkler, W.E.,. String Comparator Metrics and
   2014.
                                                                        Enhanced Decision Rules in the Fellegi-Sunter Model of Record
[Lindberg et al., 1993] Lindberg, D.A., Humphreys, B.L.,                Linkage, in: ASA. pp. 354–359, 1990.
   McCray, A.T.,. The Unified Medical Language System.
                                                                      [Wubben et al., 2010] Wubben, S., van den Bosch, A., Krahmer,
   Methods Inf. Med. 32, 281–291, 1993.
                                                                        E.,. Paraphrase Generation As Monolingual Translation: Data
[M. Shieber and Nelken, 2006] M. Shieber, S., Nelken, R.,.              and Evaluation, in: INLG, INLG ’10. pp. 203–207, 2010.
  Towards robust context-sensitive sentence alignment for
                                                                      [Xu et al., 2015] Xu, W., Callison-Burch, C., Napoles, C.,.
  monolingual corpora, 2006.
                                                                        Problems in Current Text Simplification Research: New Data
[Madnani and Dorr, 2010] Madnani, N., Dorr, B.J.,. Generating           Can Help, in: ACL. pp. 283–297, 2015.
  Phrasal and Sentential Paraphrases: A Survey of Data-Driven
                                                                      [Zhao et al., 2009] Zhao, S., Lan, X., Liu, T., Li, S.,. Application-
  Methods. Comput. Linguist. 36, 341–387, 2010.
                                                                        driven statistical paraphrase generation, in: ACL. pp. 834–842,
[Mikolov et al., 2013] Mikolov, T., Chen, K., Corrado, G., Dean,        2009.
  J.,. Distributed Representations of Words and Phrases and their
                                                                      [Zhu et al., 2010] Zhu, Z., Bernhard, D., Gurevych, I.,. A
  Compositionality. NIPS 1–9, 2013.
                                                                        Monolingual Tree-based Translation Model for Sentence
[Nesterov, 1983] Nesterov, Y.,. A method for unconstrained              Simplification, in: COLING. pp. 1353–1361, 2010.
  convex minimization problem with the rate of convergence