 Automated Detection of Adverse Drug Reactions in the Biomedical Literature
  Using Convolutional Neural Networks and Biomedical Word Embeddings
                                        Diego Saldana Miranda
                                          Novartis Pharma A.G.
                                      Applied Technology Innovation
                                            Novartis Campus
                                               4056 Basel
                                diego.saldana miranda@novartis.com

In: Mark Cieliebak, Don Tuggener and Fernando Benites (eds.): Proceedings of the 3rd Swiss Text Analytics Conference (SwissText 2018), Winterthur, Switzerland, June 2018

Abstract

Monitoring the biomedical literature for cases of Adverse Drug Reactions (ADRs) is a critically important and time-consuming task in pharmacovigilance. The development of computer-assisted approaches to aid this process in different forms has been the subject of many recent works.

One particular area that has shown promise is the use of Deep Neural Networks, in particular Convolutional Neural Networks (CNNs), for the detection of ADR relevant sentences. Using token-level convolutions and general-purpose word embeddings, this architecture has shown good performance relative to more traditional models as well as to Long Short Term Memory (LSTM) models.

In this work, we evaluate and compare two different CNN architectures using the ADE corpus. In addition, we show that by de-duplicating the ADR relevant sentences, we can greatly reduce overoptimism in the classification results. Finally, we evaluate the use of word embeddings specifically developed for biomedical text and show that they lead to better performance on this task.

1 Introduction

Pharmacovigilance is a crucial component at every stage of the drug development cycle, and regulations require pharmaceutical companies to prepare periodic reports, such as Development Safety Update Reports (DSURs) and Periodic Safety Update Reports (PSURs), regarding the safety of their drugs and products (Krishnamurthy et al., 2017).

One of the most important sources of information to be monitored in pharmacovigilance is the biomedical literature (Pontes et al., 2014). To this end, large numbers of scientific abstracts and publications need to be screened and/or read in full in order to collect information relevant to safety, and in particular Adverse Drug Reactions (ADRs) associated with a particular drug.

Screening and reading the biomedical literature is a time-consuming task of critical importance. It requires particular expertise and needs to be performed by well-trained readers. Given this, systems that enable human readers to perform this task faster and more effectively would be of great value.

2 Background

Computer-assisted pharmacovigilance and, more specifically, the automation of the detection of ADR relevant information across various data sources have the potential to have a great positive impact on the pharmaceutical industry. There is a vast array of sources of potential ADR relevant information, including both structured and unstructured data resources.

In many cases, adverse reactions are initially detected through unstructured means of communication, such as a patient speaking to a healthcare professional, or case reports written by physicians and published in biomedical literature sources such as MEDLINE, PubMed and EMBASE (Rison, 2013). Spontaneous reporting can also be made through telephone calls, email communication, and even fax (Vallano et al., 2005).
Such information is processed, generally through human intervention, in order to properly categorize it and add the necessary metadata.

Other potential sources of safety signals include electronic medical/health records (EMRs/EHRs) (Park et al., 2011). Similarly, omics, chemical, phenotypic and metabolic pathway data can be analyzed using a diverse array of methods to find associations between drugs and specific side effects (Liu et al., 2012; Mizutani et al., 2012; Lee et al., 2011). In recent years, social media websites have also become a potential source of safety signals (Karimi et al., 2015; Sarker and Gonzalez, 2015; Tafti et al., 2017).

Finally, after careful processing, the data is usually aggregated and stored in structured databases for reporting. Many regulatory agencies maintain databases that aggregate information regarding reported adverse events, such as the FDA Adverse Event Reporting System (FAERS) (Fang et al., 2014) in the U.S., EudraVigilance in Europe (Banovac et al., 2017), and the MedEffect Adverse Reaction Online Database in Canada (Barry et al., 2014).

The aim of our work is to contribute towards the development of systems that assist readers in charge of finding ADR signals in the biomedical literature. As such, the ideal system should be able to accurately discriminate between ADR relevant and irrelevant sentences in the documents that it processes.

In the following section, we detail some of the past efforts to automate this as well as other tasks related to the extraction of ADR relevant information from the biomedical literature.

3 Related Work

The automation of the detection of ADR relevant information across various data sources has received much attention in recent years. Ho et al. performed a systematic review and summarized their findings on various methods to predict ADEs, ranging from omics to social media (Ho et al., 2016). In addition, the authors presented a list of public and commercial data sources available for the task. Similarly, Tan et al. summarized the available data resources and presented the state of computational decision support systems for ADRs (Tan et al., 2016). Harpaz et al. prepared an overview of the state of the art in text mining for Adverse Drug Events (ADEs) (Harpaz et al., 2014) in various contexts, such as the biomedical literature, product labelling, social media and web search logs.

Xu et al. initially proposed a method based on manually curated lexicons which could be used to build cancer drug-side effect (drug-SE) pair knowledge bases from scientific publications (Xu and Wang, 2014c). The authors also described a method to extract syntactical patterns, via parse trees from the Stanford Parser (Xu and Wang, 2014a), based on known seed cancer drug-SE pairs. The patterns can then be used to extract new cancer drug-SE pairs. They further proposed an approach using SVM classifiers to categorize tables from cancer-related literature as either ADR relevant or not (Xu and Wang, 2015a). The authors then extracted cancer drug-SE pairs from the tables using a lexicon-based approach and compared them with data from the FDA label information. Xu et al. also evaluated their method on a large-scale, full-text corpus of oncological publications (Xu and Wang, 2015b), extracting drug-SE pairs and showing good correlation of the extracted pairs with gene targets and disease indications.

There are a number of data resources available for the purpose of ADR signal detection. Gurulingappa et al. introduced the ADE corpus, a large corpus of MEDLINE sentences annotated as ADR relevant or not (Gurulingappa et al., 2012). Karimi et al. described CADEC, a corpus of social media posts with ADE annotations (Karimi et al., 2015), including mappings to vocabularies such as SNOMED. Further, the annotations include detailed information such as drug-event and drug-dose relationships. Sarker et al. described an approach using SVM classifiers, as well as diverse feature engineering methods, to classify clinical reports and social media posts from multiple corpora as ADR relevant or not (Sarker and Gonzalez, 2015). Odom et al. explored an approach using relational gradient boosting (FRGB) models to combine information learned from labelled data with advice from human readers in the identification of ADRs in the biomedical literature (Odom et al., 2015). Adams et al. proposed an approach using custom PubMed search queries making use of MeSH subheadings to automatically identify ADR related publications. The authors conducted an evaluation by comparing with results manually tagged by investigators, obtaining a precision of 0.90 and a recall of 0.93.

Some researchers have tried to combine information from structured databases with the unstructured data found in the biomedical literature. For example, Xu et al. showed that, by combining information from FAERS and MEDLINE using signal boosting and ranking algorithms, it is possible to improve cancer drug-side effect (drug-SE pair) signal detection (Xu and Wang, 2014b).
There have recently been efforts to use neural networks to improve the performance of the ADR sentence detection, entity extraction, and relation extraction tasks. Gupta et al. proposed a two-step approach for extracting mentions of adverse events from social media: (1) predicting the drug based on the context, unsupervised; and (2) predicting adverse event mentions based on a tweet and the features learned in the previous step, supervised (Gupta et al., 2017). Li et al. proposed approaches combining CNNs and bi-LSTMs to perform named entity recognition as well as relation extraction for ADRs in the annotated sentences of the ADE dataset (Li et al., 2017). More recently, Ramamoorthy et al. described an approach using bi-LSTMs with an attention mechanism to jointly perform relation extraction and visualize the patterns in the sentence.

Huynh et al. proposed using convolutional recurrent neural networks (CRNN) and convolutional neural networks with attention (CNNA) to identify ADR related tweets and MEDLINE article sentences (Huynh et al., 2016). The CNNA's attention component has the attractive property that it allows visualization of the influence of each word on the decision of the network.

In this work, we introduce approaches building upon previous results using convolutional neural networks (CNNs) (Huynh et al., 2016) to detect ADR relevant sentences in the biomedical literature. Our key contributions are as follows:

• We compare Huynh's CNN approach, which is based on the architecture proposed by Kim (2014), with a deeper architecture based on the one proposed by Hughes et al. (2017), using the ADE dataset, showing that Kim's architecture performs much better for this task and dataset.

• We apply a de-duplication of the ADR relevant sentences in the ADE dataset (Gurulingappa et al., 2012), which we believe leads to a better estimation of the performance of the algorithm and does not seem to have been applied in some of the previous works.

• We evaluate the use of word embeddings developed specifically for biomedical text, introduced by Pyysalo et al. (2013), and show that, by using these embeddings in place of general-purpose GloVe embeddings, it is possible to improve the performance of the algorithm.

4 Dataset

The ADE corpus was introduced by Gurulingappa et al. (2012) in order to provide a benchmark dataset for the development of algorithms for the detection of ADRs in case reports. The original source of the data was 2972 MEDLINE case reports. The data was labelled by three trained annotators, and their annotation results were consolidated into a final dataset including 6728 ADE relations (in 4272 sentences), as well as 16688 non-ADR relevant sentences.

The authors calculated Inter-Annotator Agreement (IAA), using F1 scores as a criterion, for adverse event entities of between 0.77 and 0.80 for partial matches and between 0.63 and 0.72 for exact matches. For more detail, the reader can refer to the work of Gurulingappa et al. (Gurulingappa et al., 2012).

4.1 Preprocessing

The dataset is suitable for two types of tasks: (1) categorization of sentences as either relevant for ADRs or not; and (2) extraction of drug-adverse event relations and drug-dose relations. Because there can be more than one relation in the same sentence, the ADR relevant sentences are sometimes duplicated.

The presence of duplicates can lead to situations where the same sentence is present in both the training and test datasets, as well as to an overall distortion of the distribution of the sentences. In order to prevent this, we de-duplicate these sentences, which results in 4272 ADR relevant sentences, as stated in the work of Gurulingappa et al. (Gurulingappa et al., 2012).
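As an illustration, the de-duplication step amounts to a single pandas operation applied before the data is split into folds. The snippet below is a minimal sketch; the column names and toy rows are our own assumptions and do not reflect the ADE corpus' actual file layout.

    import pandas as pd

    # One row per annotated relation; ADR relevant sentences with several
    # relations therefore appear more than once. (Illustrative rows only.)
    relations = pd.DataFrame({
        "sentence": [
            "The patient developed a rash after starting drug X.",
            "The patient developed a rash after starting drug X.",
            "No adverse events were observed during follow-up.",
        ],
        "is_adr": [1, 1, 0],
    })

    # Keep each sentence exactly once, so that the same text cannot end up in
    # both the training and the test split of a cross-validation fold.
    sentences = relations.drop_duplicates(subset=["sentence"]).reset_index(drop=True)
    print(len(relations), "relation rows ->", len(sentences), "unique sentences")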
5 Methods

In the following sections, we describe (1) the word embeddings used in our learning algorithms; and (2) the two different CNN architectures evaluated in our experiments.

5.1 Embeddings

GloVe 840B

As in Huynh's work (Huynh et al., 2016), we use pre-trained word embeddings. Huynh focused mainly on the general-purpose GloVe Common Crawl 840B, 300-dimensional word embeddings (Pennington et al., 2014).

Pyysalo's Embeddings

We also evaluate the use of the 200-dimensional word2vec embeddings introduced by Pyysalo et al. (Pyysalo et al., 2013). These word embeddings were fitted on a corpus combining PubMed abstracts, PubMed Central Open Access (PMC OA) full-text articles, and Wikipedia articles. We also initialize zero-valued vectors for the unknown word symbol as well as for the padding symbol.

Preprocessing

As in Huynh's work, no new word vectors are initialized for tokens not present in the pre-trained vocabulary, and only the tokens that are among the 20000 most frequent words in the dataset are included. The remaining tokens are mapped to the unknown word symbol vector. We allow the algorithm to optimize the pre-trained weights after initialization. We follow the preprocessing strategy used by Huynh (Huynh et al., 2016), which is itself based on that of Kim (Kim, 2014) and includes expansion of contractions; additionally, all non-alphabetic characters are replaced with spaces prior to tokenization.
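The vocabulary and embedding setup described above can be sketched as follows. The function names are ours, `pretrained` stands for any dict-like mapping from word to vector (e.g. loaded GloVe or word2vec vectors), and contraction expansion is omitted for brevity.

    import re
    from collections import Counter

    import numpy as np

    def tokenize(text):
        # Replace all non-alphabetic characters with spaces prior to tokenization.
        return re.sub(r"[^a-zA-Z]+", " ", text.lower()).split()

    def build_vocab_and_matrix(tokenized_sentences, pretrained, dim, max_words=20000):
        # Keep only the 20000 most frequent dataset words that have a pre-trained
        # vector; index 0 is the padding symbol and index 1 the unknown-word
        # symbol, both initialized as zero-valued vectors.
        counts = Counter(tok for sent in tokenized_sentences for tok in sent)
        vocab = [w for w, _ in counts.most_common(max_words) if w in pretrained]
        word_index = {w: i + 2 for i, w in enumerate(vocab)}
        matrix = np.zeros((len(vocab) + 2, dim), dtype=np.float32)
        for word, idx in word_index.items():
            matrix[idx] = pretrained[word]
        return word_index, matrix

    def encode(tokens, word_index, seq_len):
        # Out-of-vocabulary tokens map to the unknown-word symbol (index 1),
        # and sentences are padded with the padding symbol (index 0).
        ids = [word_index.get(tok, 1) for tok in tokens][:seq_len]
        return ids + [0] * (seq_len - len(ids))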
6 Convolutional Neural Network Architectures

In all architectures described below, the sentences are mapped to a vector representation, v. Dropout is applied to v during training with a dropout probability of 0.5. As in usual classification tasks, the predicted probability of a positive outcome, that is, of the sentence being ADR relevant, is given by

    ŷ = ρ(v^T w + b),    (1)

where w is a vector of coefficients, b is the intercept, and ρ is the sigmoid function.

The objective function to be optimized is the cross-entropy, which can also be interpreted as an average negative log-likelihood, and is given by

    L(Θ) = −(1/N) Σ_{i=1}^{N} [ y_i log ŷ_i + (1 − y_i) log(1 − ŷ_i) ].    (2)
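Equations (1) and (2) map directly onto a few lines of NumPy. The sketch below is our own illustration of the classification head and loss, not the original implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(v, w, b):
        # Eq. (1): probability that a sentence is ADR relevant, given its vector v.
        return sigmoid(v @ w + b)

    def cross_entropy(y_true, y_pred, eps=1e-7):
        # Eq. (2): average negative log-likelihood over the N sentences in a batch.
        y_pred = np.clip(y_pred, eps, 1.0 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred))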
Huynh's CNN architecture

This architecture consists of a 1D-convolution layer with 300 filters and a 5-token window applied to the word vectors. This is followed by a Rectified Linear Unit (ReLU) and a 1D-max pooling over the full axis of the 1D-convolution results. This leads to a 300-dimensional vector representation, v, which is used as an input for the classification network described above. Figure 1 shows a diagram of the resulting architecture. Note that M, the number of embedding dimensions, may be equal to either 300 or 200, but is shown as 300 for illustration in the figure.

[Figure 1: Diagram of the architecture proposed by Huynh (Huynh et al., 2016).]

To reduce overfitting, a constraint is added to ensure that the L2 norms of each of the 1D-convolution filters are never above a threshold value, s, after each batch. For more detail, the reader can refer to the works of Huynh (Huynh et al., 2016) and Kim (Kim, 2014).
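The following is a minimal sketch of this architecture using the Keras API bundled with current TensorFlow releases; the original work used TensorFlow 1.2.0 directly, so the function name, parameters, and the use of a max-norm kernel constraint to approximate the L2 norm clipping are our own assumptions. `embedding_matrix` is the matrix built in Section 5.1 and `seq_len` the padded sentence length.

    import tensorflow as tf

    def build_huynh_cnn(embedding_matrix, seq_len, s=9.0):
        vocab_size, embed_dim = embedding_matrix.shape    # embed_dim is M (300 or 200)
        inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
        x = tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=True)(inputs)                       # pre-trained vectors, fine-tuned
        x = tf.keras.layers.Conv1D(
            filters=300, kernel_size=5, activation="relu",
            kernel_constraint=tf.keras.constraints.MaxNorm(s))(x)
        v = tf.keras.layers.GlobalMaxPooling1D()(x)       # 300-dimensional sentence vector v
        v = tf.keras.layers.Dropout(0.5)(v)
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(v)   # eq. (1)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy")   # eq. (2), Adam as in Section 7
        return model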
Hughes' CNN architecture

Based on the approach proposed by Hughes (Hughes et al., 2017), we explored a deeper architecture with multiple successive stages of 1D-convolutions, non-linear transformations, and max pooling.

This architecture starts with two successive stages of 1D-convolutions with 256 filters and a 5-token window, each followed by a ReLU transformation. After this, a 1D-max pooling on the axis of the convolutions with a window of length 5 is applied. Finally, another two successive stages of 1D-convolutions with 256 filters and a window of length 5, each followed by a ReLU transformation, are applied, followed by a 1D-max pooling over the full axis of the 1D-convolutions.

Similar to the previous architecture, this leads to a 256-dimensional vector representation, v, and a constraint is used to keep the L2 norms of all 1D-convolution filters under a threshold value s. Figure 2 shows a diagram of the resulting architecture. As previously, note that M may be equal to either 300 or 200, but is shown as 300 for illustration in the figure.

[Figure 2: Diagram of an architecture based on the one proposed by Hughes (Hughes et al., 2017).]

For further detail, the reader can refer to the work of Hughes (2017).
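Under the same assumptions as the previous sketch, the deeper architecture can be written as follows. This is an illustrative reconstruction from the description above, not the original code, and it assumes sentences padded to at least a few dozen tokens so that the successive convolutions and pooling remain valid.

    import tensorflow as tf

    def build_hughes_cnn(embedding_matrix, seq_len, s=9.0):
        vocab_size, embed_dim = embedding_matrix.shape
        constraint = tf.keras.constraints.MaxNorm(s)
        inputs = tf.keras.Input(shape=(seq_len,), dtype="int32")
        x = tf.keras.layers.Embedding(
            vocab_size, embed_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix),
            trainable=True)(inputs)
        # Two successive 1D-convolution stages (256 filters, window of 5), each with ReLU.
        x = tf.keras.layers.Conv1D(256, 5, activation="relu", kernel_constraint=constraint)(x)
        x = tf.keras.layers.Conv1D(256, 5, activation="relu", kernel_constraint=constraint)(x)
        # 1D-max pooling with a window of length 5.
        x = tf.keras.layers.MaxPooling1D(pool_size=5)(x)
        # Another two convolution stages, then max pooling over the full axis.
        x = tf.keras.layers.Conv1D(256, 5, activation="relu", kernel_constraint=constraint)(x)
        x = tf.keras.layers.Conv1D(256, 5, activation="relu", kernel_constraint=constraint)(x)
        v = tf.keras.layers.GlobalMaxPooling1D()(x)       # 256-dimensional sentence vector v
        v = tf.keras.layers.Dropout(0.5)(v)
        outputs = tf.keras.layers.Dense(1, activation="sigmoid")(v)
        model = tf.keras.Model(inputs, outputs)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        return model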
7 Experimental Setup

Following the approach used by Huynh et al. (2016), we used 10-fold cross-validation to evaluate the performance of our classifiers. The normalization threshold used to clip the L2 norms of the filters, s, was set to 9.

The Adam optimizer (Kingma and Ba, 2014) was used to minimize the loss, L(Θ), with 8 epochs and a batch size of 50. To avoid overfitting, early stopping is used based on a development set consisting of 10% of the training data of each fold. For the decision of the classifier, instead of a ŷ threshold of 0.5, we determine the optimum threshold by evaluating all possible thresholds present in the development set of each fold and keeping the threshold that results in the best F1 score.

After every 10 batches, the optimal threshold is determined from the development set and the associated best F1 score is obtained. Optimization is stopped if the F1 score on the development set fails to improve after 6 such steps. The set of CNN parameters associated with the best F1 score observed throughout the training process is then kept and used to evaluate the network's performance on the test set of each fold.
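A simplified version of this evaluation loop is sketched below. The per-10-batch checkpointing is condensed into standard per-epoch early stopping, and the stratified splits are our own addition; `build_model` is either of the constructors above, and `X`, `y` stand for the encoded sentences and labels as NumPy arrays.

    import numpy as np
    import tensorflow as tf
    from sklearn.metrics import f1_score
    from sklearn.model_selection import StratifiedKFold, train_test_split

    def best_f1_threshold(y_dev, p_dev):
        # Evaluate every probability observed on the development set as a
        # candidate decision threshold and keep the one with the best F1 score.
        best_t, best_f1 = 0.5, -1.0
        for t in np.unique(p_dev):
            f1 = f1_score(y_dev, (p_dev >= t).astype(int))
            if f1 > best_f1:
                best_t, best_f1 = t, f1
        return best_t

    def cross_validate(build_model, X, y, n_folds=10):
        f1_scores = []
        for train_idx, test_idx in StratifiedKFold(n_splits=n_folds, shuffle=True).split(X, y):
            # Hold out 10% of the fold's training data as a development set.
            X_tr, X_dev, y_tr, y_dev = train_test_split(
                X[train_idx], y[train_idx], test_size=0.1, stratify=y[train_idx])
            model = build_model()
            model.fit(X_tr, y_tr, epochs=8, batch_size=50,
                      validation_data=(X_dev, y_dev),
                      callbacks=[tf.keras.callbacks.EarlyStopping(
                          monitor="val_loss", patience=6, restore_best_weights=True)])
            threshold = best_f1_threshold(y_dev, model.predict(X_dev).ravel())
            y_pred = (model.predict(X[test_idx]).ravel() >= threshold).astype(int)
            f1_scores.append(f1_score(y[test_idx], y_pred))
        return float(np.mean(f1_scores))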
We use the architecture originally proposed by Huynh (Huynh et al., 2016), without de-duplication, as the baseline to understand the impact of the de-duplication, the choice of embeddings, and the CNN architecture.

All CNN implementations were done using Python 3.4.5 (Rossum, 1995) and TensorFlow 1.2.0 (Abadi et al., 2015).

8 Results

8.1 Impact of De-duplication on Classification Performance Estimates

Table 1 shows a comparison of the performance metrics of our implementation of Huynh's architecture with GloVe 840B word embeddings, with and without de-duplication of the sentences labelled as ADR relevant. After de-duplication, most of the performance metrics were lower, since the presence of duplicates in the positive samples resulted in overly optimistic results.

    De-duplication     No      Yes
    Accuracy        0.919    0.914
    Precision       0.858    0.784
    Recall          0.860    0.798
    F1-score        0.859    0.790
    Specificity     0.942    0.943
    AUROC           0.966    0.954

Table 1: Performance metrics of Huynh's architecture using GloVe 840B embeddings with and without de-duplication of the ADR relevant sentences.

The biggest impact was observed on precision, recall and F1 scores. Overall accuracies and the area under the ROC curve (AUROC) did not seem to be greatly affected. Note that the specificity, which is the true negative rate, was higher after de-duplication.

We initially obtained somewhat lower performances for the baseline model without de-duplication compared to those reported by Huynh et al. (2016), even though we accurately followed the described architecture. After investigating the differences in the code, we noticed that during pre-processing, characters that are not alphabetic are replaced with spaces prior to tokenization. After incorporating this step into our code, the results matched the previously reported ones much more closely.

8.2 Impact of Biomedical Word Embeddings

    Word Embeddings    GloVe 840B    Pyysalo
    Accuracy                0.914      0.918
    Precision               0.784      0.800
    Recall                  0.798      0.797
    F1-score                0.790      0.798
    Specificity             0.943      0.949
    AUROC                   0.954      0.958

Table 2: Performance metrics of Huynh's architecture with de-duplication, using GloVe 840B embeddings and Pyysalo's embeddings.

Table 2 shows a comparison of the performance metrics with de-duplication of ADR relevant sentences, using the GloVe 840B word embeddings and the word embeddings fit for biomedical text proposed by Pyysalo et al. (Pyysalo et al., 2013).

In most cases, the use of biomedical word embeddings was favorable or non-detrimental to the performance metrics. The largest improvement was the increase in average precision from 0.784 with GloVe 840B to 0.800 with the biomedical embeddings. This also led to an increased average F1 score, from 0.790 to 0.798. The average AUROC also increased from 0.954 to 0.958. Specificity increased from 0.943 to 0.949, and recall was the only metric that was slightly reduced, from 0.798 to 0.797.

8.3 Comparison With Hughes' CNN Architecture

    Architecture    Huynh    Hughes
    Accuracy        0.918     0.905
    Precision       0.800     0.765
    Recall          0.797     0.771
    F1-score        0.798     0.767
    Specificity     0.949     0.939
    AUROC           0.958     0.940

Table 3: Performance metrics of Huynh's and Hughes' architectures with de-duplication and Pyysalo's embeddings.

Table 3 shows a comparison between the performances of our implementations of Huynh's and Hughes' architectures. In both cases, de-duplication of ADR relevant sentences and biomedical embeddings were used. The former outperformed the latter in every performance metric. The biggest improvement was in metrics associated with the positive class, such as precision, recall, and F1 score.
9 Discussion

The purpose of this work was to evaluate the use of convolutional neural network (CNN) architectures and biomedical word embeddings for the automatic categorization of sentences relevant to adverse drug reactions (ADRs) in case reports present in the biomedical literature. For this purpose, we used the ADE corpus, which consists of sentences coming from 2972 MEDLINE case reports labelled by trained annotators. This includes 4272 ADR relevant sentences, as well as 16688 non-ADR relevant sentences.

We showed that, because of the duplications present in the ADE corpus, using this dataset for sentence classification without performing a de-duplication can lead to overoptimistic performance estimates. In addition, we showed that, by using biomedical word embeddings, as opposed to general-purpose word embeddings, it is possible to improve upon the performance of the algorithm. Finally, we compared the performance of our implementations of two CNN architectures, with the architecture proposed by Huynh outperforming the architecture proposed by Hughes on this task and dataset in every metric.

One important measure of the potential noise in the inputs of human annotators is the Inter-Annotator Agreement (IAA) (Gurulingappa et al., 2012), which for this dataset was measured by its original authors by calculating inter-annotator F1 scores. Although this measure was calculated at the entity (partial and exact) matching level, and although there has been a harmonization process, it is informative of the potential noise in the inputs used to build the dataset. The fact that the IAAs for partial matches of adverse events ranged between 0.77 and 0.80 indicates that aiming for near-perfect predictions may be unrealistic, since there is a considerable degree of disagreement between human annotators.

10 Conclusions and Future Work

Our results highlight the importance of sentence de-duplication, pre-processing, the choice of word embeddings, and the neural network architecture when applying convolutional neural networks (CNNs) to the detection of adverse drug reaction (ADR) relevant sentences in the biomedical literature using the ADE dataset. We believe that these are only a few of the factors that can greatly influence the performance of the algorithms performing these tasks.

Future work could include the use of grid-based, random, or reinforcement-learning based search for more optimal CNN architectures, as well as the evaluation of architectures other than CNNs. In addition, another very interesting area explored in previous works (Huynh et al., 2016) was visualization using CNNs with Attention (CNNAs). However, this algorithm seemed to underperform compared to the plain CNN. Building upon this approach to improve its performance while retaining its attractive visualization properties would be an important step towards the development of systems that assist human readers.

11 Acknowledgements

The author would like to thank Abhimanyu Verma as well as the Technology Architecture & Digital department at Novartis Pharma A.G. for their support in this research.

References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/.

Marin Banovac, Gianmario Candore, Jim Slattery, Francois Houez, David Haerry, Georgy Genov, and Peter Arlett. 2017. Patient reporting in the EU: Analysis of EudraVigilance data. Drug Safety 40(7):629–645. https://doi.org/10.1007/s40264-017-0534-1.

Arden R. Barry, Sheri L. Koshman, and Glen J. Pearson. 2014. Adverse drug reactions. Canadian Pharmacists Journal / Revue des Pharmaciens du Canada 147(4):233–238. https://doi.org/10.1177/1715163514536523.

H Fang, Z Su, Y Wang, A Miller, Z Liu, P C Howard, W Tong, and S M Lin. 2014. Exploring the FDA adverse event reporting system to generate hypotheses for monitoring of disease characteristics. Clinical Pharmacology & Therapeutics 95(5):496–498. https://doi.org/10.1038/clpt.2014.17.

Shashank Gupta, Sachin Pawar, Nitin Ramrakhiyani, Girish Keshav Palshikar, and Vasudeva Varma. 2017. Semi-supervised recurrent neural network for adverse drug reaction mention extraction. CoRR abs/1709.01687. http://arxiv.org/abs/1709.01687.

Harsha Gurulingappa, Abdul Mateen Rajput, Angus Roberts, Juliane Fluck, Martin Hofmann-Apitius, and Luca Toldo. 2012. Development of a benchmark corpus to support the automatic extraction of drug-related adverse effects from medical case reports. Journal of Biomedical Informatics 45(5):885–892. https://doi.org/10.1016/j.jbi.2012.04.008.

Rave Harpaz, Alison Callahan, Suzanne Tamang, Yen Low, David Odgers, Sam Finlayson, Kenneth Jung, Paea LePendu, and Nigam H. Shah. 2014. Text mining for adverse drug events: the promise, challenges, and state of the art. Drug Safety 37(10):777–790. https://doi.org/10.1007/s40264-014-0218-z.
Tu-Bao Ho, Ly Le, Dang Tran Thai, and Siriwon Taewijit. 2016. Data-driven approach to detect and predict adverse drug reactions. Current Pharmaceutical Design 22(23):3498–3526. https://doi.org/10.2174/1381612822666160509125047.

Mark Hughes, Irene Li, Spyros Kotoulas, and Toyotaro Suzumura. 2017. Medical text classification using convolutional neural networks. CoRR abs/1704.06841. http://arxiv.org/abs/1704.06841.

Trung Huynh, Yulan He, Alistair Willis, and Stefan Rüger. 2016. Adverse drug reaction classification with deep learning. In International Conference of Computational Linguistics (COLING).

Sarvnaz Karimi, Alejandro Metke-Jimenez, Madonna Kemp, and Chen Wang. 2015. Cadec: A corpus of adverse drug event annotations. Journal of Biomedical Informatics 55:73–81. https://doi.org/10.1016/j.jbi.2015.03.010.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. CoRR abs/1408.5882. http://arxiv.org/abs/1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.

Arun Chander Yadav Krishnamurthy, Jayasudha Dhanasekaran, and Anusha Natarajan. 2017. A succinct medical safety: periodic safety update reports. International Journal of Basic & Clinical Pharmacology 6(7):1545. https://doi.org/10.18203/2319-2003.ijbcp20172714.

Sejoon Lee, Kwang H Lee, Min Song, and Doheon Lee. 2011. Building the process-drug-side effect network to discover the relationship between biological processes and side effects. BMC Bioinformatics 12(Suppl 2):S2. https://doi.org/10.1186/1471-2105-12-s2-s2.

Fei Li, Meishan Zhang, Guohong Fu, and Donghong Ji. 2017. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinformatics 18(1). https://doi.org/10.1186/s12859-017-1609-9.

Mei Liu, Yonghui Wu, Yukun Chen, Jingchun Sun, Zhongming Zhao, Xue wen Chen, Michael Edwin Matheny, and Hua Xu. 2012. Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs. Journal of the American Medical Informatics Association 19(e1):e28–e35. https://doi.org/10.1136/amiajnl-2011-000699.

S. Mizutani, E. Pauwels, V. Stoven, S. Goto, and Y. Yamanishi. 2012. Relating drug-protein interaction network with drug side effects. Bioinformatics 28(18):i522–i528. https://doi.org/10.1093/bioinformatics/bts383.

Phillip Odom, Vishal Bangera, Tushar Khot, David Page, and Sriraam Natarajan. 2015. Extracting adverse drug events from text using human advice. In Artificial Intelligence in Medicine, Springer International Publishing, pages 195–204. https://doi.org/10.1007/978-3-319-19551-3_26.

Man Young Park, Dukyong Yoon, KiYoung Lee, Seok Yun Kang, Inwhee Park, Suk-Hyang Lee, Woojae Kim, Hye Jin Kam, Young-Ho Lee, Ju Han Kim, and Rae Woong Park. 2011. A novel algorithm for detection of adverse drug reaction signals using a hospital electronic medical record database. Pharmacoepidemiology and Drug Safety 20(6):598–607. https://doi.org/10.1002/pds.2139.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.

Helena Pontes, Mallorie Clément, and Victoria Rollason. 2014. Safety signal detection: The relevance of literature review. Drug Safety 37(7):471–479. https://doi.org/10.1007/s40264-014-0180-9.

S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, and S. Ananiadou. 2013. Distributional semantics resources for biomedical text processing. In Proceedings of LBM 2013. pages 39–44. http://lbm2013.biopathway.org/lbm2013proceedings.pdf.

Richard A Rison. 2013. A guide to writing case reports for the Journal of Medical Case Reports and BioMed Central Research Notes. Journal of Medical Case Reports 7(1). https://doi.org/10.1186/1752-1947-7-239.

Guido Rossum. 1995. Python reference manual. Technical report, Amsterdam, The Netherlands.

Abeed Sarker and Graciela Gonzalez. 2015. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics 53:196–207. https://doi.org/10.1016/j.jbi.2014.11.002.

Ahmad P Tafti, Jonathan Badger, Eric LaRose, Ehsan Shirzadi, Andrea Mahnke, John Mayer, Zhan Ye, David Page, and Peggy Peissig. 2017. Adverse drug event discovery using biomedical literature: A big data neural network adventure. JMIR Medical Informatics 5(4):e51. https://doi.org/10.2196/medinform.9170.

Yuxiang Tan, Yong Hu, Xiaoxiao Liu, Zhinan Yin, Xue wen Chen, and Mei Liu. 2016. Improving drug safety: From adverse drug reaction knowledge discovery to clinical implementation. Methods 110:14–25. https://doi.org/10.1016/j.ymeth.2016.07.023.
A. Vallano, G. Cereza, C. Pedròs, A. Agustí, I. Danés, C. Aguilera, and J. M. Arnau. 2005. Obstacles and solutions for spontaneous reporting of adverse drug reactions in the hospital. British Journal of Clinical Pharmacology 60(6):653–658. https://doi.org/10.1111/j.1365-2125.2005.02504.x.

Rong Xu and QuanQiu Wang. 2014a. Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. Journal of Biomedical Informatics 51:191–199. https://doi.org/10.1016/j.jbi.2014.05.013.

Rong Xu and QuanQiu Wang. 2014b. Large-scale combining signals from both biomedical literature and the FDA adverse event reporting system (FAERS) to improve post-marketing drug safety signal detection. BMC Bioinformatics 15(1):17. https://doi.org/10.1186/1471-2105-15-17.

Rong Xu and QuanQiu Wang. 2014c. Toward creation of a cancer drug toxicity knowledge base: automatically extracting cancer drug-side effect relationships from the literature. Journal of the American Medical Informatics Association 21(1):90–96. https://doi.org/10.1136/amiajnl-2012-001584.

Rong Xu and QuanQiu Wang. 2015a. Combining automatic table classification and relationship extraction in extracting anticancer drug-side effect pairs from full-text articles. Journal of Biomedical Informatics 53:128–135. https://doi.org/10.1016/j.jbi.2014.10.002.

Rong Xu and QuanQiu Wang. 2015b. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles. Journal of Biomedical Informatics 55:64–72. https://doi.org/10.1016/j.jbi.2015.03.009.