=Paper=
{{Paper
|id=Vol-1996/paper9
|storemode=property
|title=Team UKNLP: Detecting ADRs, Classifying Medication Intake Messages, and Normalizing ADR Mentions on Twitter
|pdfUrl=https://ceur-ws.org/Vol-1996/paper9.pdf
|volume=Vol-1996
|authors=Sifei Han,Tung Tran,Anthony Rios,Ramakanth Kavuluru
|dblpUrl=https://dblp.org/rec/conf/amia/HanTRK17
}}
==Team UKNLP: Detecting ADRs, Classifying Medication Intake Messages, and Normalizing ADR Mentions on Twitter==
Sifei Han, B.S.¹, Tung Tran, B.S.¹, Anthony Rios, B.S.¹, and Ramakanth Kavuluru, Ph.D.¹,²

¹ Department of Computer Science, University of Kentucky
² Div. of Biomedical Informatics, Department of Internal Medicine, University of Kentucky

Abstract

This paper describes the systems we developed for all three tasks of the 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017. The first task focuses on identifying Twitter posts containing mentions of adverse drug reactions (ADRs). The second task focuses on automatic classification of medication intake messages (among those containing drug names) on Twitter. The last task is on identifying the MEDDRA Preferred Term (PT) code for ADR mentions expressed in casual social text. We propose convolutional neural network (CNN) and traditional linear model (TLM) approaches for the first and second tasks and use hierarchical long short-term memory (LSTM) recurrent neural networks for the third task. Among 11 teams, our systems ranked 4th in ADR detection with an F-score of 40.2% and 2nd in classifying medication intake messages with an F-score of 68.9%. For MEDDRA PT code identification, we obtained an accuracy of 87.2%, which is nearly 1% lower than the top score from the only other team that participated.

1. Introduction

Online social networks and forums provide a new way for people to share their experiences with others. Nowadays patients also share their symptoms and treatment progress online to help other patients dealing with similar conditions. Due to individual differences and other factors that cannot be tested during clinical trials, patients may experience adverse drug reactions (ADRs) even when taking FDA-approved medications. ADRs lead to a financial burden of 30.1 billion dollars annually in the USA [1]. Automatic detection of ADRs is therefore receiving significant attention from the biomedical community. While traditional approaches of reporting to the FDA (by phone or online) are still important, given that millions of patients share their drug reactions on social media, automatic detection of ADRs reported in online posts may offer a valuable complementary signal for pharmacovigilance. Among these posts, some ADRs are reported from personal experience while others are observed in other people. Identifying whether or not a Twitter user has actually consumed the medication is therefore also an important task, as is normalizing the different ways of expressing the same event to a standardized terminology. These are the three tasks in the 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017. In task 1 there are 6822 training, 3555 development, and 9961 testing samples; for task 2 there are 7617 training, 2105 development, and 7513 testing samples. In task 3, there are 6650 training and 2500 testing samples. The first task is a binary classification task with the F-score for the ADR (positive) class as the evaluation metric. Task 2 is a 3-way classification task where the evaluation metric is the micro-averaged F-score over the "intake" and "possible intake" classes. Task 3 is a multiclass classification task where Twitter phrases are mapped to their corresponding MEDDRA PT codes and accuracy is the evaluation metric.

2. Methods

For tasks 1 and 2, we employ traditional methods (e.g., SVMs) and recent deep learning methods (e.g., CNNs) as well as their ensembles, as detailed in Section 2.1. For task 3, we use a hierarchical character-LSTM model, which is detailed in Section 2.2.

2.1. Tasks 1 and 2: TLMs, CNNs, CNNs with attention mechanism, and their ensembles

For tasks 1 and 2, we used both TLMs, specifically linear SVMs and logistic regression, and deep nets, specifically CNNs. In the first task, we simply averaged the probability estimates from each TLM classifier, with and without the CNN in the ensemble. The features used in the TLMs are itemized below; a feature-extraction sketch follows the list.

• Uni/bi-grams: Counts and real values of uni/bi-gram features in the tweet via CountVectorizer and TfidfVectorizer, respectively, from the Scikit-learn machine learning package [2].

• Sentiment lexicons (11 features): Four count features from the post based on the positive and negative counts of unigrams and bigrams using the NRC Canada emoticon lexicon (http://saifmohammad.com/WebPages/AccessResource.htm), four additional sentiment score aggregation features corresponding to the count features, and the overall sentiment score of unigrams, bigrams, and uni+bi-grams.

• Word embeddings (400 features): The average 400-dimensional vector of all the word vectors of a post, where the word vectors ($\in \mathbb{R}^{400}$) are obtained from a pre-trained Twitter word2vec model [3].

• ADR lexicon [4] (7423 features): One Boolean feature per concept indicating whether the concept ID is mentioned in the tweet; the rest are count features identifying the number of drugs from a particular source (SIDER, CHV, COSTART, and DIEGO_lab) and the number of times different drugs are mentioned in the tweet.

• Negation words (2 features): The first is a count of certain negation words (not, don't, can't, won't, doesn't, never) and the second is the proportion of negation words to the total number of words in the post.

• PMI score (1 feature): The sum of the words' pointwise mutual information (PMI) [5] scores as a real-valued feature, computed from the training examples and their class membership.

• For task 2 only – handcrafted lexical pairs of drug mentions preceded by pronouns (6 features): Counts of first, second, and third person pronouns, with and without negation, followed (potentially with other intermediate words) by a drug mention, with drug names taken from https://www.drugs.com/drug_information.html (e.g., "I did a line of cocaine").
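Below is a minimal sketch (not our exact pipeline) of the uni/bi-gram and averaged-embedding feature groups using the Scikit-learn vectorizers named above; the lexicon, negation, PMI, and pronoun–drug features are omitted, and twitter_w2v (any mapping from a word to its 400-dimensional vector, e.g., a gensim KeyedVectors) is an assumed input.

```python
# Sketch of the uni/bi-gram and averaged word-embedding TLM features (other feature
# groups omitted). `twitter_w2v` is assumed: word -> 400-d vector (e.g., gensim KeyedVectors).
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def embedding_features(tweets, twitter_w2v, dim=400):
    """Average the word vectors of each tweet; zero vector if no word is in the vocabulary."""
    feats = np.zeros((len(tweets), dim))
    for i, tweet in enumerate(tweets):
        vecs = [twitter_w2v[w] for w in tweet.lower().split() if w in twitter_w2v]
        if vecs:
            feats[i] = np.mean(vecs, axis=0)
    return feats

def tlm_features(train_tweets, test_tweets, twitter_w2v):
    count_vec = CountVectorizer(ngram_range=(1, 2))   # uni/bi-gram counts
    tfidf_vec = TfidfVectorizer(ngram_range=(1, 2))   # uni/bi-gram tf-idf values
    X_train = hstack([
        count_vec.fit_transform(train_tweets),
        tfidf_vec.fit_transform(train_tweets),
        csr_matrix(embedding_features(train_tweets, twitter_w2v)),
    ])
    X_test = hstack([
        count_vec.transform(test_tweets),
        tfidf_vec.transform(test_tweets),
        csr_matrix(embedding_features(test_tweets, twitter_w2v)),
    ])
    return X_train, X_test
```

The resulting matrices feed the logistic regression and linear SVM classifiers whose probability estimates we average.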
For the CNN models, each tweet is passed to the model as a sequence of word vectors, $[w_1, w_2, \ldots, w_n]$, where $n$ is the number of words in the tweet. We begin by concatenating each window spanning $k$ words, $w_{j-k+1} \| \ldots \| w_j$, into a local context vector $c_j \in \mathbb{R}^{k d_{emb}}$, where $d_{emb}$ is the dimensionality of the word vectors. Intuitively, CNNs extract informative n-grams from text, and these n-grams are extracted with the use of convolutional filters (CFs). We define the CFs as $W_k \in \mathbb{R}^{q \times k d_{emb}}$, where $q$ is the number of feature maps generated using filter width $k$. Next, using a non-linear function $f$, we convolve over each context vector, $\hat{c}_j = f(W_k c_j + b)$, where $b \in \mathbb{R}^{q}$. Given the convolved context vectors $[\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_{n-k+1}]$, we map them to a fixed-size vector using max-over-time pooling, $m^k = [\hat{c}^{max}_1, \hat{c}^{max}_2, \ldots, \hat{c}^{max}_q]$, where $\hat{c}^{max}_j$ is the maximum value across the $j$-th feature map, i.e., $\hat{c}^{max}_j = \max(\hat{c}_{1,j}, \hat{c}_{2,j}, \ldots, \hat{c}_{n-k+1,j})$. To improve our model, we use convolutional filters of different widths. With filters spanning different numbers of words $(k_1, k_2, \ldots, k_p)$, we generate multiple sentence representations $m^{k_1}, m^{k_2}, \ldots, m^{k_p}$, where $p$ is the total number of filter widths we use.
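The following is a minimal PyTorch sketch of this multi-width convolutional encoder with max-over-time pooling; the framework, filter widths, feature-map count, and ReLU non-linearity are illustrative assumptions rather than our exact implementation.

```python
# Sketch of the multi-width CNN encoder with max-over-time pooling described above.
# Framework and hyperparameters are assumptions, not the submitted configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiWidthCNNEncoder(nn.Module):
    def __init__(self, d_emb=400, q=100, widths=(3, 4, 5)):
        super().__init__()
        # One filter bank W_k per width k, each producing q feature maps.
        self.convs = nn.ModuleList([nn.Conv1d(d_emb, q, kernel_size=k) for k in widths])

    def forward(self, word_vectors):
        # word_vectors: (batch, n_words, d_emb); Conv1d expects (batch, d_emb, n_words).
        x = word_vectors.transpose(1, 2)
        reps = []
        for conv in self.convs:
            c_hat = F.relu(conv(x))        # convolved contexts: (batch, q, n - k + 1)
            m_k = c_hat.max(dim=2).values  # max-over-time pooling: (batch, q)
            reps.append(m_k)
        return reps                        # [m^{k_1}, ..., m^{k_p}]
```

These $p$ representations are the inputs that the attention mechanism described next weights dynamically.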
When datasets contain many noisy features, it can be beneficial to use a simpler model, since simpler models are less prone to overfitting to the noise in the dataset. We therefore augment the features generated by the CNN with a simple average of all the word vectors in a given tweet,

$m_{bow} = \frac{1}{n} \sum_{i=1}^{n} w_i$.

We now have $p+1$ feature representations of the tweet, $[m^{k_1}, m^{k_2}, \ldots, m^{k_p}, m_{bow}]$. Prior work using CNNs [6] simply concatenated each representation and passed the result to the final fully-connected softmax layer. Rather than concatenating each $m^{k_j}$, we use self-attention [7] (with multiple attention scores per feature representation) to allow the model to dynamically weight the feature representations. Specifically, we define the $j$-th attention weight of the $i$-th feature representation as

$\alpha_{j,i} = \frac{\exp(e_{j,i})}{\sum_{k=1}^{p+1} \exp(e_{j,k})}$, where $e_{j,i} = v_j^{T} \tanh(W_a m_i)$,

such that $W_a \in \mathbb{R}^{t \times q}$, $v_j \in \mathbb{R}^{t}$, and $\alpha_{j,i} \in [0, 1]$. Intuitively, $\alpha_{j,i}$ represents the importance given to the feature representation $m_i$. Next, we merge all feature representations into a matrix $M \in \mathbb{R}^{(p+1) \times q}$. Likewise, given $s$ total attentions, the attention weights are combined to form an attention matrix $A \in \mathbb{R}^{s \times (p+1)}$; the $j$-th row of $A$ represents the importance of each $m_i$ with respect to the attention vector $v_j$. Finally, we represent each tweet as the weighted sum of the feature representations according to each of the attention weight vectors, $h = \mathrm{vec}(AM)$, where $\mathrm{vec}$ denotes matrix vectorization and $h \in \mathbb{R}^{sq}$. Lastly, $h$ is passed to a final fully-connected softmax layer.
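A minimal PyTorch sketch of this self-attention pooling over the $p+1$ feature representations follows; the dimensions $t$ and $s$ are illustrative assumptions, and the module is meant to sit on top of the encoder sketch above together with the averaged word vector $m_{bow}$.

```python
# Sketch of self-attention over the p+1 feature representations [m^{k_1}, ..., m^{k_p}, m_bow].
# Dimensions t and s are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureSelfAttention(nn.Module):
    def __init__(self, q=100, t=64, s=4):
        super().__init__()
        self.W_a = nn.Linear(q, t, bias=False)  # W_a in R^{t x q}
        self.v = nn.Linear(t, s, bias=False)    # rows are the s attention vectors v_j

    def forward(self, reps):
        # reps: list of p+1 tensors, each of shape (batch, q).
        M = torch.stack(reps, dim=1)                  # (batch, p+1, q)
        e = self.v(torch.tanh(self.W_a(M)))           # scores e_{j,i}: (batch, p+1, s)
        A = torch.softmax(e, dim=1).transpose(1, 2)   # attention matrix A: (batch, s, p+1)
        h = torch.bmm(A, M).flatten(start_dim=1)      # vec(AM): (batch, s * q)
        return h
```

The vector $h$ then feeds the final fully-connected softmax layer (e.g., nn.Linear(s * q, n_classes)).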
2.2. Task 3: Hierarchical Character-LSTM

The deep model we propose for task 3 realizes a hierarchical composition in which an example phrase is segmented into $N$ constituent words and each word is treated as a sequence of characters. In our formulation, the word at position $i \in [1, N]$ has character length $T_i$. Herein, we formulate the model composition from the bottom up. At the character level, word representations are composed using a forward LSTM over each character and its corresponding character class. The former is a case-insensitive character embedding lookup and the latter is a separate label embedding lookup indicating the class of the character: lowercase letter, uppercase letter, punctuation, or other. We denote $c_{i,t}$ and $z_{i,t}$ as the $t$-th character and class embedding, respectively, of the $i$-th word in the phrase, for $t \in [1, T_i]$. For a word at position $i$, we feed its character embeddings to a forward LSTM layer with 200 output units such that

$\vec{g}_{i,t} = \mathrm{CLSTM}(c_{i,t} \,\|\, z_{i,t})$ for $t = 1, \ldots, T_i$,

where $\|$ is the vector concatenation operation, $\mathrm{CLSTM}$ is an LSTM unit composition at the character level, and $\vec{g}_{i,t} \in \mathbb{R}^{200}$ is the output at timestep $t$. The output state at the last step, $\vec{g}_{i,T_i}$, encodes the left-to-right context accumulated at the last character and is used to represent the word at position $i$. Next, we deploy a bi-directional LSTM layer at the word level, which is useful for capturing the contextual information of the phrase with respect to each word. Concretely,

$\overrightarrow{h}_i = \mathrm{WLSTM}_{\rightarrow}(\vec{g}_{i,T_i})$, $\overleftarrow{h}_i = \mathrm{WLSTM}_{\leftarrow}(\vec{g}_{i,T_i})$, and $h_i = \overrightarrow{h}_i \,\|\, \overleftarrow{h}_i$ for $i = 1, \ldots, N$,

where $\overrightarrow{h}_i, \overleftarrow{h}_i \in \mathbb{R}^{200}$, $h_i \in \mathbb{R}^{400}$, and $\mathrm{WLSTM}_{\rightarrow}, \mathrm{WLSTM}_{\leftarrow}$ are LSTM units composing words in the forward and backward directions, respectively. We then perform a max-pooling operation over all $h_i$ vectors to produce the feature vector $\hat{h} = [h^{max}_1, \ldots, h^{max}_{400}]$, where $h^{max}_j = \max(h_{1,j}, \ldots, h_{N,j})$.

We use a fully-connected output layer of unit size $m$, where $m$ is the number of MEDDRA codes in the label space. The output is computed as $q = W_q \hat{h} + b_q$, where $q \in \mathbb{R}^{m}$ and $W_q$ and $b_q$ are network parameters. To obtain a categorical distribution, we apply a softmax layer to the vector $q$ such that

$p_j = \frac{e^{q_j}}{\sum_{k=1}^{m} e^{q_k}}$,

where $p_j \in [0, 1]$ is the probability estimate of the label at index $j$.
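The following is a minimal PyTorch sketch of this hierarchical composition (character-level forward LSTM, word-level bi-directional LSTM, max-pooling, and a fully-connected output layer); the framework, tensor layout, and padding handling are assumptions for illustration rather than our exact implementation.

```python
# Sketch of the hierarchical character-LSTM for one phrase: char-level forward LSTM ->
# word-level BiLSTM -> max-pooling -> linear output layer (softmax applied in the loss).
# Padding/packing of variable-length words is omitted for brevity.
import torch
import torch.nn as nn

class HierCharLSTM(nn.Module):
    def __init__(self, n_chars, n_codes, char_dim=32, class_dim=8, hidden=200):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)   # case-insensitive characters
        self.class_emb = nn.Embedding(4, class_dim)       # lower / upper / punctuation / other
        self.char_lstm = nn.LSTM(char_dim + class_dim, hidden, batch_first=True)
        self.word_lstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_codes)         # scores over MEDDRA PT codes

    def forward(self, char_ids, class_ids):
        # char_ids, class_ids: (n_words, max_word_len) integer tensors for a single phrase.
        x = torch.cat([self.char_emb(char_ids), self.class_emb(class_ids)], dim=-1)
        _, (g_last, _) = self.char_lstm(x)   # final char state per word: (1, n_words, hidden)
        h, _ = self.word_lstm(g_last)        # word-level BiLSTM over the phrase: (1, n_words, 2*hidden)
        h_hat = h.max(dim=1).values          # max-pool over word positions: (1, 2*hidden)
        return self.out(h_hat)               # (1, n_codes)
```

In practice, variable-length words would be packed (e.g., with torch.nn.utils.rnn.pack_padded_sequence) so that padding does not leak into the final character state; that bookkeeping is omitted here.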
Additional Training Data: In addition to the training set provided for this task, we also experiment with additional external datasets for training. Namely, we use TwADR-L [8] and the training data released as part of SMM2016 task 3. Since these datasets are labeled with CUIs rather than MEDDRA codes, we map the CUIs to MEDDRA codes by referencing the NLM Metathesaurus, keeping only examples for which such a mapping exists. This amounts to approximately 905 additional examples.

Model Configuration: We train the model with 80% of the training data and use the remaining 20% for validation. We train for 40 epochs with a batch size of 8 and a learning rate of 0.01, using the RMSprop optimizer with an exponential decay rate of 0.95. The character embeddings are of size 32 and the character class embeddings are of size 8. We apply dropout regularization with a probability of 50% at the feature vector layer. To make the final prediction, we perform model averaging over an ensemble of 10 such models, each with a different random parameter initialization and a different random validation split.
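Below is a minimal sketch of this prediction-averaging ensemble (also used, with different member models, for tasks 1 and 2): several members are trained on different random splits and initializations and their class probability estimates are averaged. A Scikit-learn logistic regression stands in for the neural members purely so the sketch is self-contained.

```python
# Sketch of model averaging over an ensemble trained on different random splits and
# initializations; a logistic regression stands in for the neural member models.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def ensemble_predict(X_train, y_train, X_test, n_members=10):
    probs = []
    for seed in range(n_members):
        # A different random 80%/20% split (and initialization) per ensemble member.
        X_tr, _X_val, y_tr, _y_val = train_test_split(
            X_train, y_train, test_size=0.2, random_state=seed
        )
        member = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        probs.append(member.predict_proba(X_test))    # (n_test, n_classes)
    return np.mean(probs, axis=0).argmax(axis=1)       # average, then pick the top class
```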
3. Results and Discussion

3.1. Task 1 Results

Table 1 shows our official scores on task 1. The CNN with attention (CNN-Att) model is observed to outperform the other models. On the training and development data, we found logistic regression to perform better than the SVM; a TLM ensemble (model averaging of two LR models and one SVM model) and the base CNN had about the same performance, while averaging the TLM ensemble with the CNN was the best performer. CNN-Att by itself was slightly better than the TLM ensemble and the CNN considered separately, but worse than the combination of the TLM ensemble and the CNN via model averaging. Our ensemble model has the top precision score among all teams and shows the potential of ensemble approaches. However, our recall is significantly lower than that of the best performer, whose recall was over 48%. We will further investigate the discrepancies between train and test set performances of the various models. Our initial assessment is that the attention model is able to more effectively weight what different CFs are capturing from each tweet. To verify this, we will examine the "popular" n-grams of filters (based on the values of $\hat{c}_j$ in Section 2.1) that are consistently weighted above others by the attention mechanism. In turn, these n-grams can be used as additional features in the regular CNN or TLM models to potentially improve their performance. We will need to experiment with these additional approaches to improve our recall without major compromises in precision.

Table 1: Task 1: Performance on the test set
Model               ADR Precision   ADR Recall   ADR F-score
TLM-ensemble        0.459           0.237        0.313
CNN+TLM-ensemble    0.567           0.259        0.356
CNN-Att             0.498           0.337        0.402

3.2. Task 2 Results

Table 2 shows our official scores on task 2. The CNN-Att model again has the best performance among the three models we submitted. On the training and development data, we found that the deep nets significantly outperformed the TLMs (and their ensembles); therefore, in task 2 we focused on incorporating deep nets into all ensembles.

Table 2: Task 2: Performance on the test set (micro-averaged over classes 1 and 2)
Model               Precision   Recall   F-score
CNN+TLM-ensemble    0.688       0.607    0.645
CNNs                0.705       0.666    0.685
CNN-Att             0.701       0.677    0.689

Since each task allowed the submission of only three runs, the models we submitted may not have been the best among the additional models we built as part of our participation in the challenge. We therefore ran further experiments with unsubmitted models and found better-performing ensembles for tasks 1 and 2, shown in Table 3. The results of our submitted CNN and CNN-Att models come from averaging ensembles of ten models each, built using both stratified and random 80%–20% splits. We did this because the test distribution may differ from the training distribution, and we believed tuning only on stratified splits might cause overfitting; random splits, we reasoned, could help the model handle test data with a different distribution. In this particular case, the test distribution turned out to be similar to that of the training data, so when we used only stratified splits to tune the parameters, our results improved. In task 1, the ensemble of CNN-Att and CNN with logistic regression yields an improvement of almost 1% in F1 score. In task 2, CNN-Att trained with stratified splitting only has an F1 score 0.4% higher than that of our submitted model, which equals the winning team's performance.

Table 3: Experiments with unsubmitted models
Model                                          Precision   Recall   F-score
CNN_Att+CNN+LR (task 1)                        0.483       0.358    0.411
CNN_Att (averaging_stratified_only) (task 2)   0.701       0.686    0.693

3.3. Task 3 Results

Our results for task 3 are reported in Table 4. For our initial experiments, we performed testing over 30 different runs, each evaluating on a different random held-out development set; the averaged results are recorded in the "Dev." column of the table. We submitted only two models, marked accordingly in the table. Our first submission, the Hierarchical Char-LSTM model described in Section 2.2, performed best at 87.4% on the validation set, which is consistent with its performance on the actual test set at 87.2%. Our second submission, which is the same as the first except that it is trained with the additional external datasets, performed reasonably but did not generalize to the test set as well as initially hoped. The flat character-based models performed poorly on both the validation and the test sets, but nevertheless saw a slight increase on the test set. The Hierarchical Char-CNN model's results were underwhelming on the validation set; however, on the test set it is observed to outperform even the Hierarchical Char-LSTM models, which is unexpected. Furthermore, we experimented with appending pretrained word embeddings (trained on a Twitter corpus [3]) to the word representation layer of the model; this resulted in poor performance on the validation set, at only 83.3% for either variant, although the results on the test set were much more competitive.

Table 4: Accuracy of candidate models for task 3 (* models that were submitted)
Model                                                  Dev.   Test
Flat Char-CNN                                          82.4   84.8
Flat Char-LSTM                                         81.7   84.7
Hierarchical Char-CNN                                  85.0   87.7
Hierarchical Char-LSTM *                               87.4   87.2
Hierarchical Char-LSTM with External Training Data *   85.6   86.7
Hierarchical Char-CNN with Word Embeddings             83.3   87.4
Hierarchical Char-LSTM with Word Embeddings            83.3   86.0

Acknowledgments

Our work is primarily supported by the National Library of Medicine through grant R21LM012274 and the National Cancer Institute through grant R21CA218231.

References

[1] Sultana J, Cutroneo P, Trifirò G. Clinical and economic burden of adverse drug reactions. Journal of Pharmacology & Pharmacotherapeutics. 2013;4(Suppl 1):S73.
[2] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
[3] Godin F, Vandersmissen B, De Neve W, Van de Walle R. Multimedia Lab @ ACL W-NUT NER shared task: named entity recognition for Twitter microposts using distributed word representations. In: Proceedings of the Workshop on Noisy User-generated Text (W-NUT) at ACL-IJCNLP; 2015. p. 146–153.
[4] Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics. 2015;53:196–207.
[5] Bouma G. Normalized (pointwise) mutual information in collocation extraction. In: Proceedings of GSCL; 2009. p. 31–40.
[6] Rios A, Kavuluru R. Convolutional neural networks for biomedical text classification: application in indexing biomedical articles. In: Proceedings of the 6th ACM Conference on Bioinformatics, Computational Biology and Health Informatics. ACM; 2015. p. 258–267.
[7] Lin Z, Feng M, Santos CNd, Yu M, Xiang B, Zhou B, et al. A structured self-attentive sentence embedding. In: Proceedings of the 5th International Conference on Learning Representations; 2017.
[8] Limsopatham N, Collier N. Normalising medical concepts in social media texts by learning semantic representation. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1); 2016. p. 1014–1023.