                  Detection of Adverse Drug Reaction from Twitter Data

         Fuchiang Tsui, PhD1,2,3, Lingyun Shi, MS1, Victor Ruiz, MS1, Amie D. Barda, MS1,
                  Ye Ye, MS1,2, Diyang Xue, MS1,2, Fan Mi, MS1, and Utkars Jain, MS3
      1
        Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, PA, USA
             2
               Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
           3
             Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA, USA


Abstract
In response to the challenges set forth by the 2nd Social Media Mining for Health Applications Shared Task 2017, we
describe a framework to automatically detect Twitter tweets that mention adverse drug reactions (ADRs). We used
a dataset with 10,822 annotated tweets provided by the event organizers to develop a framework comprising a
preprocessing module, 6 feature extraction modules, and one predictive modeling module with 5 models (K2 Bayesian
network, naïve Bayes, decision tree, random forest, and support vector machine). To evaluate our framework, we
employed a blind test dataset with 9,961 tweets provided by the event organizer. The area under the ROC curve (AUC)
of the K2 Bayesian network model was 0.74 (95% C.I. 0.721-0.759), and the F-measure ranged from 0.339 to 0.342.
The framework described in this paper demonstrates potential as a public health surveillance tool for ADR surveillance
from Twitter tweets.


I.        Introduction
The 2nd Social Media Mining for Health Applications Shared Task (SMM4H) 2017 put forth three competition
challenge tasks: Task 1: Automatic classification of adverse drug reaction mentioning posts—binary classification,
Task 2: Automatic classification of posts describing medication intake—three-class classification, and Task 3:
Automatic normalization of adverse drug reaction mentions. In this paper, we describe a framework to address the
Task 1 challenge: classification of Twitter tweets that mention adverse drug reactions.
Adverse drug reactions (ADRs) impose clinical and economic burdens in the U.S. The U.S. Food and Drug
Administration (FDA) reports that patients with ADRs have double the length of stay and mortality of non-ADR
patients.1 Moreover, up to 39% of ADRs in pediatric inpatients can be life-threatening or fatal.2 A study estimated that
between 32% and 69% of drug-related admissions could be potentially preventable.3 In the U.S., the cost of ADRs
may be up to $136 billion annually.4
With 328 million monthly active users generating 500 million tweets per day in 2017 and 80% of users posting tweets
about themselves,5 Twitter provides a potential channel for ADR surveillance.
We hypothesize that our proposed pipeline process and predictive models can identify individual tweets that mention ADRs.
II.       Methods
The training dataset provided by SMM4H 2017 originally comprised a total of 15,667 tweet identifiers with annotated
ADR results from two batches of 10,822 and 4,845 identifiers, respectively; after running the Python script
provided by the event organizer, we retrieved only 10,281 tweets for the training dataset. The blind test dataset,
released 7 days prior to the competition deadline, comprised 9,961 tweets. All the datasets were annotated and
provided by the event organizer.
II.1 Data Preprocessing
Figure 1 summarizes our proposed framework, which contained a data preprocessing module, 6 feature extraction
modules, and one predictive modeling module. We first preprocessed each tweet in the training dataset in 7 steps: (1)
lowercase text conversion, (2) sentence partition, (3) stop words removal, (4) tokenization, (5) spelling correction, (6)
lemmatization and (7) part-of-speech (POS) tagging. In the stop words removal step, we removed commonly used
words that are typically ignored by a search engine, such as “the”, “a”, “at”, etc. In the (Twitter-specific) tokenization
step, we parsed Twitter-specific tokens such as retweet flag (RT), @USERNAME, Hashtag (#), URL and emoticon,
and used them as features for each tweet. We used the Natural Language Processing Toolkit6 (NLTK) to perform the
data preprocessing.




Figure 1. Framework for classification of adverse drug reaction mentioning Twitter tweets.
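The following minimal sketch illustrates these preprocessing steps with NLTK. It is illustrative only: the function and variable names are our own, the spelling-correction step (step 5) is omitted because the paper does not name the specific corrector used, and the code tokenizes before removing stop words, which approximates rather than reproduces the listed order.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer, sent_tokenize

# Required NLTK resources (download once):
# nltk.download('punkt'); nltk.download('stopwords')
# nltk.download('wordnet'); nltk.download('averaged_perceptron_tagger')

STOP_WORDS = set(stopwords.words('english'))
tokenizer = TweetTokenizer()        # keeps Twitter-specific tokens (@user, #tag, emoticons)
lemmatizer = WordNetLemmatizer()

def preprocess(tweet):
    """Return a list of (lemma, POS) pairs per sentence for one tweet."""
    processed = []
    for sentence in sent_tokenize(tweet.lower()):            # steps 1-2: lowercase, sentence partition
        tokens = [t for t in tokenizer.tokenize(sentence)    # step 4: Twitter-aware tokenization
                  if t not in STOP_WORDS]                     # step 3: stop word removal
        # step 5 (spelling correction) would be applied here
        lemmas = [lemmatizer.lemmatize(t) for t in tokens]    # step 6: lemmatization
        processed.append(nltk.pos_tag(lemmas))                # step 7: POS tagging
    return processed

print(preprocess("RT @user Cipro pills made me so dizzy. Never again! #sideeffects"))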
II.2 Feature Modules
We employed six feature extraction modules: N-grams, term frequency-inverse document frequency (TF-IDF), word
embeddings, sentiment statistics, cTAKES findings, and ADR tags.
II.2.1 N-grams
We used unigram (single word), bi-gram (two words) and tri-gram (three words) as features for predictive modeling.
For example, single words such as “sick” and “weight” are unigrams, and “Cipro pills”, which contains two words, is a
bigram.
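As a small illustration (the tokens below are invented, not drawn from the dataset), n-gram features can be enumerated with NLTK:

from nltk.util import ngrams

tokens = ["cipro", "pills", "made", "me", "sick"]
unigrams = list(ngrams(tokens, 1))   # [('cipro',), ('pills',), ('made',), ...]
bigrams  = list(ngrams(tokens, 2))   # [('cipro', 'pills'), ('pills', 'made'), ...]
trigrams = list(ngrams(tokens, 3))   # [('cipro', 'pills', 'made'), ...]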
II.2.2 Term Frequency-Inverse Document Frequency
TF-IDF score measures how important a term, e.g., word, is to a document (tweet) in a corpus (training dataset). The
following equation defines TF-IDF.
$$\mathrm{TFIDF}(t, d, D) = \mathrm{TF}(t, d) \cdot \log\frac{|D| + 1}{\mathrm{DF}(t, D) + 1},$$
where TF(t,d) denotes the term frequency (number of times that a term t appears in a tweet d), |D| is the total number
of tweets, and DF(t, D) is the document frequency (number of tweets that contain the term t in all tweets D). Instead
of using words directly in the above equation, we used hashed terms for computational efficiency. We used the vector
of hashed terms from each tweet as a feature vector.
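A hedged sketch of this hashed TF-IDF computation with Spark ML follows; the column names, example tokens, and numFeatures value are our own assumptions, but Spark's IDF applies the same log((|D| + 1) / (DF(t, D) + 1)) weighting as the equation above.

from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF

spark = SparkSession.builder.appName("adr-tfidf").getOrCreate()
tweets = spark.createDataFrame(
    [(0, ["cipro", "pills", "made", "me", "sick"]),
     (1, ["love", "this", "weather"])],
    ["id", "tokens"])

# Hash each term into a fixed-size index space instead of keeping an explicit vocabulary.
hashing_tf = HashingTF(inputCol="tokens", outputCol="tf", numFeatures=1 << 18)
tf = hashing_tf.transform(tweets)

# IDF weighting over the hashed term frequencies.
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)
tfidf.select("id", "tfidf").show(truncate=False)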
II.2.3 Word Embeddings
We employed the Apache Spark Word2Vec7 to obtain a word embedding matrix from the training dataset. We used
the skip-gram model with a softmax activation function. The skip-gram model learns word vector representations
from sentences by maximizing the average log-likelihood of the words surrounding each word $w_t$ in a sentence,
under a conditional independence assumption, as shown below:
$$\frac{1}{T} \sum_{t=1}^{T} \sum_{j=-k}^{k} \log p(w_{t+j} \mid w_t),$$
where k is the size of a pre-defined training window, T is the total number of words within a sentence, and the
conditional probability $p(w_{t+j} \mid w_t)$ is determined by the softmax model.

For each tweet, we computed an averaged word vector based on the word embeddings matrix, which becomes a feature
vector.
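Below is a minimal sketch of this step with Spark ML's Word2Vec (a skip-gram implementation); the vector size, window size, column names, and example tokens are illustrative assumptions. Spark's transform() produces the per-tweet average of the learned word vectors, which serves as the feature vector described above.

from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("adr-word2vec").getOrCreate()
tweets = spark.createDataFrame(
    [(0, ["cipro", "pills", "made", "me", "sick"]),
     (1, ["feeling", "great", "today"])],
    ["id", "tokens"])

word2vec = Word2Vec(vectorSize=100, windowSize=5, minCount=1,
                    inputCol="tokens", outputCol="embedding")
model = word2vec.fit(tweets)

# transform() averages the learned word vectors over each tweet's tokens,
# yielding one fixed-length feature vector per tweet.
model.transform(tweets).select("id", "embedding").show(truncate=False)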
II.2.4 Sentiment statistics
We employed the Stanford NLP toolkit8,9 to annotate the sentiment level of each sentence in a tweet as one of 5 levels:
very negative, negative, neutral, positive, or very positive. From the annotation, we generated 6 sentiment features for each
tweet: 1) the count of negative sentences annotated with negative and very negative sentiment levels, 2) the ratio
between the numbers of negative sentences and all sentences, 3) the count of sentences annotated with neutral
sentiment level, 4) the ratio between the numbers of neutral sentences and all sentences, 5) the count of positive
sentences annotated with positive and very positive sentiment levels, 6) the ratio between the numbers of positive
sentences and all sentences.
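A small sketch of how these six features can be computed, assuming the per-sentence sentiment labels have already been produced by the Stanford CoreNLP sentiment annotator (the function name and label strings below are our own):

def sentiment_features(labels):
    """labels: one sentiment label per sentence, e.g.
    'Very negative', 'Negative', 'Neutral', 'Positive', or 'Very positive'."""
    n = len(labels)
    neg = sum(l in ("Very negative", "Negative") for l in labels)
    neu = sum(l == "Neutral" for l in labels)
    pos = sum(l in ("Positive", "Very positive") for l in labels)
    return {
        "neg_count": neg, "neg_ratio": neg / n if n else 0.0,
        "neu_count": neu, "neu_ratio": neu / n if n else 0.0,
        "pos_count": pos, "pos_ratio": pos / n if n else 0.0,
    }

print(sentiment_features(["Negative", "Neutral", "Very negative"]))
# {'neg_count': 2, 'neg_ratio': 0.666..., 'neu_count': 1, ...}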
II.2.5 cTAKES findings
We employed cTAKES10, a medical natural language processing application for extracting symptoms and findings from
clinical narratives. We further customized cTAKES with the 2017 US Edition of SNOMED CT11 as a clinical-term look-
up dictionary. We identified 10 annotation types, including drug, disorder, finding, procedure, and lab.
II.2.6 ADR tags
We employed ADRMine12, a social-media-specific named entity recognition (NER) application, to identify ADR
mentions in a tweet. Any non-negative ADR mentions in the training dataset were included in our feature set.
II.3 Predictive Modeling
We used five machine learning algorithms to build predictive models and validated the models through nested 10-fold
cross validation. All the features were obtained from the feature modules stated in Section II.2. For each fold, we
applied information gain first followed by correlation-based feature selection. We then used K2, Naïve Bayes (NB),
decision tree (DT), random forest (RF), and SVM algorithms to build predictive models.
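The sketch below illustrates the key point that feature selection is refit inside every cross-validation fold. It uses scikit-learn stand-ins: mutual information in place of the information gain plus correlation-based feature selection used in the paper, and a random forest in place of the K2 Bayesian network (neither K2 nor correlation-based feature selection ships with scikit-learn); the feature matrix is synthetic.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=500, n_features=200, random_state=0)  # placeholder features

pipeline = Pipeline([
    # Selection is fit on each training fold only, never on the held-out fold.
    ("select", SelectKBest(mutual_info_classif, k=50)),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])

auc_per_fold = cross_val_score(pipeline, X, y, cv=10, scoring="roc_auc")
print(auc_per_fold.mean())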
We identified three different methods to determine probability thresholds. The first method is cross-validation
threshold, which used the average threshold that optimized F1-score during cross-validation testing. The second
method is prevalence threshold, which calculated the threshold that would produce the same class prevalence observed
in the training set. The third method is training threshold, which calculated the threshold that optimized F1-score in
the entire training dataset.
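Hedged sketches of the prevalence-threshold and training-threshold methods follow, given a vector of training labels and predicted probabilities; these are our own minimal implementations of the descriptions above, not the authors' code.

import numpy as np
from sklearn.metrics import f1_score

def prevalence_threshold(y_true, y_prob):
    """Threshold at which the predicted-positive rate matches the training prevalence."""
    prevalence = np.mean(y_true)
    return np.quantile(y_prob, 1.0 - prevalence)

def training_threshold(y_true, y_prob):
    """Threshold that maximizes the F1-score over the entire training set."""
    candidates = np.unique(y_prob)
    f1_scores = [f1_score(y_true, y_prob >= t) for t in candidates]
    return candidates[int(np.argmax(f1_scores))]

# The cross-validation threshold method averages the per-fold thresholds obtained by
# applying training_threshold() to each fold's held-out predictions.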
The evaluation metrics for predictive models are F-measure (also known as F1-measure), precision, recall, and the
area under the ROC curve (AUC).
III.     Results
Compared to other models, the K2 model had the highest average AUC (0.844) and F1-score (0.469) across 10 test
folds within the 10-fold cross validation (CV). Table 1 lists predictive results from the 5 algorithms within the 10-
fold CV. DT had the best recall (0.641), and SVM had the best precision (0.620).

Table 1. Prediction performance within 10-fold cross validation from 5 algorithms: K2, Naïve Bayes (NB),
Decision Tree (DT), Random Forest (RF), and Support Vector Machines (SVM).
 Algorithm   AUC     F-Measure   Precision   Recall
 K2          0.844   0.469       0.431       0.519
 NB          0.828   0.446       0.399       0.519
 DT          0.783   0.384       0.310       0.641
 RF          0.756   0.388       0.326       0.504
 SVM         0.581   0.269       0.620       0.174

The final K2 model comprised 26 nodes from unigram (9 nodes), bi-gram (2 nodes), tri-gram (1 node), cTAKES (8
nodes), word embeddings (1 node), and ADR tags (5 nodes).

Based on Table 1, we used the K2 model for the final test (blind) dataset, which the organizer released a week before
the predictive result submission deadline. Table 2 summarizes the predictive performance on the test dataset. The AUC
of the K2 model was 0.74 (95% C.I. 0.721-0.759). The K2 model with the training threshold method had the best F-measure
(0.342) and the prevalence threshold method had the best recall (0.394).

Table 2. K2 Prediction Performance in the Test (blind) Dataset.
 Algorithm       AUC    F-Measure   Precision   Recall
 K2-CV           0.74   0.341       0.333       0.35
 K2-Prevalence   0.74   0.339       0.298       0.394
 K2-Training     0.74   0.342       0.336       0.348


IV.      Conclusion
In this study, we built and evaluated a framework to identify individual tweets that mention ADRs. The framework
comprises a preprocessing module, feature extraction modules, and a predictive modeling module. The results
demonstrate the framework's potential as a public health surveillance tool for ADR surveillance from Twitter tweets.
                                                       References
1.  U.S. Food and Drug Administration. Costs associated with ADRs.
    https://www.fda.gov/drugs/developmentapprovalprocess/developmentresources/druginteractionslabeling/ucm11
    4949.htm. Accessed October 5, 2017.
2. Impicciatore P, Choonara I, Clarkson A, Provasi D, Pandolfini C, Bonati M. Incidence of adverse drug reactions
    in paediatric in/out-patients: a systematic review and meta-analysis of prospective studies. Br J Clin Pharmacol.
    2001;52(1):77-83. doi:10.1046/j.0306-5251.2001.01407.x.
3. Sultana J, Cutroneo P, Trifirò G. Clinical and economic burden of adverse drug reactions. J Pharmacol
    Pharmacother. 2013;4(Suppl 1):S73-7. doi:10.4103/0976-500X.120957.
4. Bootman J, Johnson JA. Drug-related morbidity and mortality: A cost-of-illness model. Arch Intern Med.
    1995;155(18):1949-1956. doi:10.1001/archinte.1995.00430180043006.
5. Naaman M, Boase J, Lai C-H. Is it really about me? In: Proceedings of the 2010 ACM Conference on Computer
    Supported Cooperative Work - CSCW ’10. ; 2010:189. doi:10.1145/1718918.1718953.
6. Bird S, Klein E, Loper E. Natural Language Processing with Python. Vol 43.; 2009. doi:10.1097/00004770-
    200204000-00018.
7. Meng X, Bradley J, Yavuz B, et al. MLlib: Machine Learning in Apache Spark. J Mach Learn Res.
    2016;17:1-7. http://arxiv.org/abs/1505.06807.
8. Manning C, Surdeanu M, Bauer J, Finkel J, Bethard S, McClosky D. The Stanford CoreNLP Natural Language
    Processing Toolkit. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics:
    System Demonstrations. ; 2014:55-60. doi:10.3115/v1/P14-5010.
9. Socher R, Perelygin A, Wu J. Recursive deep models for semantic compositionality over a sentiment treebank.
    Proc …. 2013:1631-1642. doi:10.1371/journal.pone.0073791.
10. Savova GK, Masanz JJ, Ogren P V, et al. Mayo clinical Text Analysis and Knowledge Extraction System
    (cTAKES): architecture, component evaluation and applications. J Am Med Inf Assoc. 2010;17(5):507-513.
    doi:10.1136/jamia.2009.001560.
11. NIH-NLM. SNOMED Clinical Terms® (SNOMED CT®). NIH-US Natl Libr Med. 2015.
12. Nikfarjam A, Sarker A, O’Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: Mining
    adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med
    Informatics Assoc. 2015;22(3):671-681. doi:10.1093/jamia/ocu041.