NRC-Canada at SMM4H Shared Task: Classifying Tweets Mentioning Adverse Drug Reactions and Medication Intake

Svetlana Kiritchenko, Ph.D., Saif M. Mohammad, Ph.D., Jason Morin, Ph.D., Berry de Bruijn, Ph.D.

National Research Council Canada, Ottawa, Canada

Our team, NRC-Canada, participated in two shared tasks at the AMIA-2017 Workshop on Social Media Mining for Health Applications (SMM4H): Task 1 - classification of tweets mentioning adverse drug reactions, and Task 2 - classification of tweets describing personal medication intake. For both tasks, we trained Support Vector Machine classifiers using a variety of surface-form, sentiment, and domain-specific features. With nine teams participating in each task, our submissions ranked first on Task 1 and third on Task 2. Handling considerable class imbalance proved crucial for Task 1: we applied an under-sampling technique to reduce class imbalance from about 1:10 to 1:2. Standard n-gram features, n-grams generalized over domain terms, and general-domain and domain-specific word embeddings had a substantial impact on the overall performance in both tasks. On the other hand, including sentiment lexicon features did not result in any improvement.

1 Introduction

Task 1 was formulated as follows: given a tweet, determine whether it mentions an adverse drug reaction. This was a binary classification task:

class 1 (ADR) - tweets that mention adverse drug reactions

Example: Nicotine lozenges are giving me stomach cramps.

class 0 (non-ADR) - tweets that do not mention adverse drug reactions

Example: I need a injection of Prozac ! ...now!!!!

The official evaluation metric was the F-score for class 1 (ADR):

$$P_{\text{class 1}} = \frac{TP_{\text{class 1}}}{TP_{\text{class 1}} + FP_{\text{class 1}}}; \qquad R_{\text{class 1}} = \frac{TP_{\text{class 1}}}{TP_{\text{class 1}} + FN_{\text{class 1}}}; \qquad F_{\text{class 1}} = \frac{2 \times P_{\text{class 1}} \times R_{\text{class 1}}}{P_{\text{class 1}} + R_{\text{class 1}}}$$

The data for this task was created as part of a large project on ADR detection from social media by the DIEGO lab at Arizona State University. The tweets were collected using the generic and brand names of the drugs as well as their phonetic misspellings. Two domain experts under the guidance of a pharmacology expert annotated the tweets for the presence or absence of an ADR mention. The inter-annotator agreement for the two annotators was Cohen's Kappa = 0.69 [8].

Two labeled datasets were provided to the participants: a training set containing 10,822 tweets and a development set containing 4,845 tweets. These datasets were distributed as lists of tweet IDs, and the participants needed to download the tweets using the provided Python script. However, only about 60–70% of the tweets were accessible at the time of download (May 2017). The training set contained several hundred duplicate or near-duplicate messages, which we decided to remove. Near-duplicates were defined as tweets containing mostly the same text but differing in user mentions, punctuation, or other non-essential context. A separate test set of 9,961 tweets was provided without labels during the evaluation period. This set was distributed to the participants, in full, by email. Table 1 shows the number of instances we used for training and testing our model.
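The near-duplicate filtering described above can be sketched as follows. The exact matching rule is not given in the text, so this sketch assumes that two tweets are near-duplicates when they become identical after lowercasing and after removing URLs, user mentions, and punctuation; the function names `normalize` and `drop_near_duplicates` are hypothetical.

```python
import re

# Assumption: these three patterns approximate "user mentions, punctuation,
# or other non-essential context"; the original rule is not specified.
URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")
NON_WORD = re.compile(r"[^\w\s]")


def normalize(tweet: str) -> str:
    """Reduce a tweet to a key that ignores case, URLs, mentions, punctuation."""
    text = tweet.lower()
    text = URL.sub(" ", text)
    text = MENTION.sub(" ", text)
    text = NON_WORD.sub(" ", text)
    return " ".join(text.split())


def drop_near_duplicates(tweets):
    """Keep the first tweet seen for each normalized key, preserving order."""
    seen = set()
    kept = []
    for tweet in tweets:
        key = normalize(tweet)
        if key not in seen:
            seen.add(key)
            kept.append(tweet)
    return kept


if __name__ == "__main__":
    sample = [
        "@anna Nicotine lozenges are giving me stomach cramps.",
        "@bob nicotine lozenges are giving me stomach cramps!!",
        "I need Tylenol",
    ]
    for tweet in drop_near_duplicates(sample):
        print(tweet)
```

A stricter or looser notion of "mostly the same text" (for example, an edit-distance threshold) would change which tweets are dropped; the exact-match key is only the simplest choice.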

Task 1 was a rerun of the shared task organized in 2016 [8]. The best result obtained in 2016 was $F_{\text{class 1}} = 0.42$ [9]. The participants in the 2016 challenge employed various statistical machine learning techniques, such as Support Vector Machines, Maximum Entropy classifiers, Random Forests, and other ensembles [9, 10]. A variety of features (e.g., word n-grams, word embeddings, sentiment, and topic models) as well as extensive medical resources (e.g., UMLS, lexicons of ADRs, drug lists, and lists of known drug-side effect pairs) were explored.

Task 2: Classification of Tweets for Medication Intake

Task 2 was formulated as follows: given a tweet, determine whether it mentions personal medication intake, possible medication intake, or no intake. This was a multi-class classification problem with three classes:

class 1 (personal medication intake) - tweets in which the user clearly expresses a personal medication intake/consumption

Example: Advil just saved my life :))

class 2 (possible medication intake) - tweets that are ambiguous but suggest that the user may have taken the medication

Example: Having pains and all my Tylenol gone

class 3 (non-intake) - tweets that mention medication names but do not indicate personal intake

Example: Going thru this pain without Tylenol..

The official evaluation metric for this task was the micro-averaged F-score of class 1 (intake) and class 2 (possible intake):

$$P_{\text{class 1+class 2}} = \frac{TP_{\text{class 1}} + TP_{\text{class 2}}}{TP_{\text{class 1}} + FP_{\text{class 1}} + TP_{\text{class 2}} + FP_{\text{class 2}}}$$

$$R_{\text{class 1+class 2}} = \frac{TP_{\text{class 1}} + TP_{\text{class 2}}}{TP_{\text{class 1}} + FN_{\text{class 1}} + TP_{\text{class 2}} + FN_{\text{class 2}}}$$

$$F_{\text{class 1+class 2}} = \frac{2 \times P_{\text{class 1+class 2}} \times R_{\text{class 1+class 2}}}{P_{\text{class 1+class 2}} + R_{\text{class 1+class 2}}}$$

Information on how the data was collected and annotated was not available until after the evaluation. Two labeled datasets were provided to the participants: a training set containing 8,000 tweets and a development set containing 2,260 tweets. As for Task 1, the training and development sets were distributed through tweet IDs and a download script. Around 95% of the tweets were accessible through download. Again, we removed duplicate and near-duplicate messages. A separate test set of 7,513 tweets was provided without labels during the evaluation period. This set was distributed to the participants, in full, by email. Table 2 shows the number of instances we used for training and testing our model.

Table 2: Number of instances used for training and testing (Task 2).

Dataset    Training set    Development set    Test set
All        7,528           2,068              7,513
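The micro-averaged metric defined above can be computed directly from gold and predicted labels. The following is a minimal sketch; the function name `micro_f_score` and its argument names are illustrative and do not come from the official evaluation script. With `positive_classes={1}` and 0/1 labels, the same function yields the class-1 F-score used for Task 1.

```python
def micro_f_score(gold, predicted, positive_classes):
    """Micro-averaged precision, recall, and F-score over positive_classes.

    True positives, false positives, and false negatives are summed over
    the positive classes before precision and recall are computed.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p == g and p in positive_classes:
            tp += 1
        else:
            if p in positive_classes:
                fp += 1  # predicted a positive class that is not the gold class
            if g in positive_classes:
                fn += 1  # gold positive class was not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    gold = [1, 1, 2, 2, 3, 3]
    pred = [1, 2, 2, 3, 3, 1]
    print(micro_f_score(gold, pred, positive_classes={1, 2}))
```

Note that confusing class 1 with class 2 counts as both a false positive (for the predicted class) and a false negative (for the gold class), which is what the summed denominators in the formulas imply.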

For each task, three submissions were allowed from each participating team.

3 System Description

Both our systems, for Task 1 and Task 2, share the same classification framework and feature pool. The specific configurations of features and parameters were chosen for each task separately through cross-validation experiments (see Section 3.3).

3.1 Machine Learning Framework

For both tasks, we trained linear-kernel Support Vector Machine (SVM) classifiers. Past work has shown that SVMs are effective on text categorization tasks and robust when working with large feature spaces. In our cross-validation experiments on the training data, a linear-kernel SVM trained with the features described below obtained better performance than a number of other statistical machine-learning algorithms, such as Stochastic Gradient Descent, AdaBoost, and Random Forests, as well as SVMs with other kernels (e.g., RBF, polynomial). We used an in-house implementation of SVM.

Handling Class Imbalance: For Task 1 (Classification of tweets for ADR), the provided datasets were highly imbalanced: the ADR class occurred in less than 12% of instances in the training set and less than 8% in the development and test sets. Most conventional machine-learning algorithms experience difficulty with such data, classifying most of the instances into the majority class. Several techniques have been proposed to address the issue of class imbalance, including over-sampling, under-sampling, cost-sensitive learning, and ensembles [11]. We experimented with several such techniques. The best performance in our cross-validation experiments was obtained using under-sampling with the class proportion 1:2. To train the model, we provided the classifier with all available data for the minority class (ADR) and a randomly sampled subset of the majority class (non-ADR) data in such a way that the number of instances in the majority class was twice the number of instances in the minority class. We found that this strategy significantly outperformed the more traditional balanced under-sampling where the majority class is sub-sampled to create a balanced class distribution.

In one of our submissions for Task 1 (submission 3), we created an ensemble of three classifiers trained on the full set of instances in the minority class (ADR) and different subsets of the majority class (non-ADR) data. We varied the proportion of minority class instances to majority class instances: 1:2, 1:3, and 1:4. The final predictions were obtained by majority voting on the predictions of the three individual classifiers.

For Task 2 (Classification of tweets for medication intake), the provided datasets were also imbalanced but not as much as for Task 1: the class proportion in all subsets was close to 1:2:3. However, even for this task, we found some of the techniques for reducing class imbalance helpful. In particular, training an SVM classifier with different class weights improved the performance in the cross-validation experiments. These class weights are used to increase the cost of misclassification errors for the corresponding classes. The cost for a class is calculated as the generic cost parameter (parameter C in SVM) multiplied by the class weight. The best performance on the training data was achieved with class weights set to 4 for class 1 (intake), 2 for class 2 (possible intake), and 1 for class 3 (non-intake).

Preprocessing: The following pre-processing steps were performed. URLs and user mentions were normalized to http://someurl and @username, respectively. Tweets were tokenized with the CMU Twitter NLP tool [12].

3.2 Features

The classification model leverages a variety of general textual features as well as sentiment and domain-specific features described below. Many features were inspired by previous work on ADR [9, 10, 13] and our work on sentiment analysis (such as the winning system in the SemEval-2013 task on sentiment analysis in Twitter [14] and the best performing stance detection system [15]).

General Textual Features

The following surface-form features were used:

- N-grams: word n-grams (contiguous sequences of n tokens), non-contiguous word n-grams (n-grams with one token replaced by *), character n-grams (contiguous sequences of n characters), unigram stems obtained with the Porter stemming algorithm;
- General-domain word embeddings:
  - dense word representations generated with word2vec on ten million English-language tweets, summed over all tokens in the tweet,
  - word embeddings distributed as part of ConceptNet 5.5 [16], summed over all tokens in the tweet;
- General-domain word clusters: presence of tokens from the word clusters generated with the Brown clustering algorithm on 56 million English-language tweets [12];
- Negation: presence of simple negators (e.g., not, never); negation also affects the n-gram features: a term t becomes t_NEG if it occurs after a negator and before a punctuation mark;
- Twitter-specific features: the number of tokens with all characters in upper case, the number of hashtags, presence of positive and negative emoticons, whether the last token is a positive or negative emoticon, the number of elongated words (e.g., soooo);
- Punctuation: presence of exclamation and question marks, whether the last token contains an exclamation or question mark.
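Two of the surface-form features above, negation marking and non-contiguous n-grams, can be illustrated with a short sketch. Several details are assumptions made for the example rather than facts from the system description: the negator list, the set of characters treated as punctuation, the choice to leave the negator itself unmarked, and the choice to replace only interior tokens with * (so that n is at least 3).

```python
# Assumption: illustrative negator list; the actual list is not given.
NEGATORS = {"not", "never", "no", "cannot"}
# Assumption: these characters end a negation scope.
SCOPE_ENDERS = ".,;:!?"


def is_punctuation(token):
    """True if the token consists only of scope-ending punctuation."""
    return bool(token) and all(ch in SCOPE_ENDERS for ch in token)


def mark_negation(tokens):
    """Append _NEG to tokens between a negator and the next punctuation mark."""
    marked = []
    in_scope = False
    for token in tokens:
        if is_punctuation(token):
            in_scope = False
            marked.append(token)
        elif token.lower() in NEGATORS:
            in_scope = True
            marked.append(token)
        else:
            marked.append(token + "_NEG" if in_scope else token)
    return marked


def word_ngrams(tokens, n):
    """Contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def non_contiguous_ngrams(tokens, n):
    """N-grams with one interior token replaced by '*'."""
    grams = []
    for gram in word_ngrams(tokens, n):
        for j in range(1, n - 1):
            grams.append(gram[:j] + ("*",) + gram[j + 1:])
    return grams


if __name__ == "__main__":
    print(mark_negation("this does not help me . at all".split()))
    print(non_contiguous_ngrams(["makes", "me", "feel", "sick"], 3))
```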

Domain-Specific Features

To generate domain-specific features, we used the following domain resources:

- Medication list: we compiled a medication list by selecting all one-word medication names from RxNorm (e.g., acetaminophen, nicorette, zoloft), since most of the medications mentioned in the training datasets were one-word strings;
- Pronoun lexicon: we compiled a lexicon of first-person pronouns (e.g., I, ours, we'll), second-person pronouns (e.g., you, yourself), and third-person pronouns (e.g., them, mom's, parents');
- ADR lexicon: a list of 13,699 ADR concepts compiled from COSTART, SIDER, CHV, and drug-related tweets by the DIEGO lab [17];
- Domain word embeddings: dense word representations generated by the DIEGO lab by applying word2vec on one million tweets mentioning medications [17];
- Domain word clusters: word clusters generated by the DIEGO lab using the word2vec tool to perform K-means clustering on the above-mentioned domain word embeddings [17].

From these resources, the following domain-specific features were generated:

- N-grams generalized over domain terms (or domain generalized n-grams, for short): n-grams where words or phrases representing a medication (from our medication list) or an adverse drug reaction (from the ADR lexicon) are replaced with <MED> and <ADR>, respectively (e.g., <MED> makes me);
- Pronoun lexicon features: the number of tokens from the Pronoun lexicon matched in the tweet;
- Domain word embeddings: the sum of the domain word embeddings for all tokens in the tweet;
- Domain word clusters: presence of tokens from the domain word clusters.
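The generalization of n-grams over domain terms can be sketched as a token-rewriting step followed by ordinary n-gram extraction. The tiny `MEDICATIONS` and `ADR_PHRASES` sets below are hypothetical stand-ins for the RxNorm-derived medication list and the ADR lexicon, and the greedy longest-match strategy for multi-word ADR phrases is an assumption, not a documented detail of the system.

```python
# Hypothetical miniature resources, for illustration only.
MEDICATIONS = {"tylenol", "prozac", "metformin"}
ADR_PHRASES = {("stomach", "cramps"), ("headache",)}
MAX_ADR_LEN = max(len(phrase) for phrase in ADR_PHRASES)


def generalize(tokens):
    """Replace ADR phrases with <ADR> and medication names with <MED>."""
    out = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest ADR phrase first (assumed greedy matching).
        for length in range(min(MAX_ADR_LEN, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + length])
            if candidate in ADR_PHRASES:
                out.append("<ADR>")
                i += length
                matched = True
                break
        if matched:
            continue
        out.append("<MED>" if tokens[i].lower() in MEDICATIONS else tokens[i])
        i += 1
    return out


def ngrams(tokens, n):
    """Contiguous n-grams joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    generalized = generalize("Tylenol gives me stomach cramps".split())
    print(generalized)
    print(ngrams(generalized, 3))
```

Because every medication maps to the same placeholder, a pattern such as "<MED> gives me" can be learned from tweets about one drug and applied to tweets about another.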

Sentiment Lexicon Features

We generated features using the sentiment scores provided in the following lexicons: Hu and Liu Lexicon [18], Norms of Valence, Arousal, and Dominance [19], labMT [20], and NRC Emoticon Lexicon [21]. The first three lexicons were created through manual annotation while the last one, NRC Emoticon Lexicon, was generated automatically from a large collection of tweets with emoticons. The following set of features was calculated separately for each tweet and each lexicon: the number of tokens with $score(w) \neq 0$; the total score $= \sum_{w \in tweet} score(w)$; the maximal score $= \max_{w \in tweet} score(w)$; the score of the last token in the tweet.
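The four per-lexicon features can be computed in a few lines. This is a sketch under two assumptions that the text does not state: tokens missing from the lexicon receive a score of 0, and lookup is case-insensitive. The toy lexicon is invented for the example and is not taken from any of the cited resources.

```python
def lexicon_features(tokens, lexicon):
    """Return the four sentiment features for one tweet and one lexicon."""
    # Assumption: out-of-lexicon tokens score 0; lookup ignores case.
    scores = [lexicon.get(token.lower(), 0.0) for token in tokens]
    return {
        "count_nonzero": sum(1 for s in scores if s != 0),
        "total": sum(scores),
        "max": max(scores) if scores else 0.0,
        "last": scores[-1] if scores else 0.0,
    }


if __name__ == "__main__":
    toy_lexicon = {"love": 1.0, "hate": -1.5, "good": 0.5}  # invented scores
    print(lexicon_features("i love this but hate mornings".split(), toy_lexicon))
```

In a full system this function would be called once per lexicon, and the resulting values concatenated into the feature vector.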

We experimented with a number of other existing manually created or automatically generated sentiment and emotion lexicons, such as the NRC Emotion Lexicon [22] and the NRC Hashtag Emotion Lexicon [23] (http://saifmohammad.com/WebPages/lexicons.html), but did not observe any improvement in the cross-validation experiments. None of the sentiment lexicon features were effective in the cross-validation experiments on Task 1; therefore, we did not include them in the final feature set for this task.

3.3 Official Submissions

For each task, our team submitted three sets of predictions. The submissions differed in the sets of features and parameters used to train the classification models (Table 3).

[Table 3: feature sets and parameters used in the official submissions; only the row labels survive. General textual features: word n-grams (n up to), non-contiguous n-grams (n up to), character n-grams (n up to), unigram stems, general-domain word embeddings, general-domain word clusters, negation, Twitter-specific features, punctuation. Domain-specific features: domain generalized n-grams (n up to), domain generalized non-contiguous n-grams (n up to), ADR lexicon, Pronoun lexicon, domain word embeddings, domain word clusters. Sentiment lexicon features. SVM parameters: C, class weights. Under-sampling: class proportion.]

While developing the system for Task 1, we noticed that the results obtained through cross-validation on the training data were almost 13 percentage points higher than the results obtained by the model trained on the full training set and applied on the development set. This drop in performance was mostly due to a drop in precision. This suggests that the datasets had substantial differences in language use, possibly because they were collected and annotated at separate times. Therefore, we decided to optimize the parameters and features for submission 1 and submission 2 using two different strategies. The models for the three submissions were trained as follows:

Submission 1: we randomly split the development set into 5 equal folds. We trained a classification model on the combination of four folds and the full training set, and tested the model on the remaining fifth fold of the development set. The procedure was repeated five times, each time testing on a different fold. The feature set and the classification parameters that resulted in the best $F_{\text{class 1}}$ were used to train the final model.

Submission 2: the features and parameters were selected based on the performance of the model trained on the full training set and tested on the full development set.

Submission 3: we used the same features and parameters as in submission 1, except we trained an ensemble of three models, varying the class distribution in the sub-sampling procedure (1:2, 1:3, and 1:4).
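The under-sampling and voting steps behind these submissions can be sketched as below. The sketch covers only the data selection and the vote; the classifier itself is left out because the in-house SVM is not described in enough detail to reproduce. The function names, the `(instance, label)` pair representation, and the use of a per-ratio random seed are all assumptions made for illustration.

```python
import random
from collections import Counter


def under_sample(data, minority_label, ratio, seed=0):
    """Keep all minority instances and `ratio` times as many majority ones.

    `data` is a list of (instance, label) pairs; ratio=2 gives the 1:2
    minority-to-majority proportion.
    """
    rng = random.Random(seed)
    minority = [pair for pair in data if pair[1] == minority_label]
    majority = [pair for pair in data if pair[1] != minority_label]
    k = min(len(majority), ratio * len(minority))
    sample = minority + rng.sample(majority, k)
    rng.shuffle(sample)
    return sample


def majority_vote(predictions):
    """Combine per-classifier label lists by taking the most common label."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


if __name__ == "__main__":
    data = [(i, 1) for i in range(10)] + [(i, 0) for i in range(10, 110)]
    # One training subset per ensemble member: proportions 1:2, 1:3, 1:4.
    subsets = [under_sample(data, 1, ratio, seed=ratio) for ratio in (2, 3, 4)]
    print([Counter(label for _, label in s) for s in subsets])
    # Predictions of three hypothetical classifiers on three tweets.
    print(majority_vote([[1, 0, 1], [1, 1, 0], [0, 0, 1]]))
```

With an odd number of binary classifiers the vote can never tie, which is one practical reason to use three members.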

For Task 2, the features and parameters were selected based on the cross-validation results run on the combination of the training and development set. We randomly split the development set into 3 equal folds. We trained a classification model on the combination of two folds and the full training set, and tested the model on the remaining third fold of the development set. The procedure was repeated three times, each time testing on a different fold. The models for the three submissions were trained as follows:

Submission 1: we used the features and parameters that gave the best results during cross-validation.

Submission 2: we used the same features and parameters as in submission 1, but added features derived from two domain resources: the ADR lexicon and the Pronoun lexicon.

Submission 3: we used the same features as in submission 1, but changed the SVM C parameter to 0.1.
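The fold scheme used for parameter selection (five development folds for Task 1 submission 1, three for Task 2) differs from ordinary cross-validation in that the full training set is always part of the fitting data and only the development set is partitioned. A minimal sketch of the split generation follows; the function name `dev_fold_splits` and the round-robin fold assignment are assumptions made for the example.

```python
import random


def dev_fold_splits(train, dev, k, seed=0):
    """Yield (fit_data, held_out) pairs.

    Each held_out set is one of k folds of the shuffled development set;
    fit_data is the full training set plus the other k - 1 folds.
    """
    rng = random.Random(seed)
    shuffled = list(dev)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # near-equal fold sizes
    for i, held_out in enumerate(folds):
        rest = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield list(train) + rest, held_out


if __name__ == "__main__":
    train = list(range(100))
    dev = list(range(100, 115))
    for fit_data, held_out in dev_fold_splits(train, dev, k=3):
        print(len(fit_data), len(held_out))
```

Evaluating only on development folds keeps the selection criterion close to the data the model struggled with, while the training set still contributes all of its instances to every fit.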

For both tasks and all submissions, the final models were trained on the combination of the full training set and full development set, and applied on the test set.

4 Results and Discussion

Task 1 (Classification of Tweets for ADR)

The results for our three official submissions are presented in Table 4 (rows c.1–c.3). The best results in $F_{\text{class 1}}$ were obtained with submission 1 (row c.1). The results for submission 2 are the lowest, with F-measure being 3.5 percentage points lower than the result for submission 1 (row c.2). The ensemble classifier (submission 3) shows a slightly worse performance than the best result. However, in the post-competition experiments, we found that larger ensembles (with 7–11 classifiers, each trained on a random sub-sample of the majority class to reduce class imbalance to 1:2) outperform our best single-classifier model by over one percentage point, with $F_{\text{class 1}}$ reaching up to 0.446 (row d). Our best submission is ranked first among the nine teams that participated in this task (rows b.1–b.3).

Table 4 also shows the results for two baseline classifiers. The first baseline is a classifier that assigns class 1 (ADR) to all instances (row a.1). The performance of this baseline is very low ($F_{\text{class 1}} = 0.143$) due to the small proportion of class 1 instances in the test set. The second baseline is an SVM classifier trained only on the unigram features (row a.2). Its performance is much higher than the performance of the first baseline, but substantially lower than that of our system. By adding a variety of textual and domain-specific features as well as applying under-sampling, we are able to improve the classification performance by almost ten percentage points in F-measure.
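The score of the first baseline can be cross-checked with a short derivation. Suppose the classifier labels every test tweet as class 1, and let $p$ (a symbol introduced here only for illustration) denote the fraction of test tweets that truly belong to class 1. Then every ADR tweet is a true positive, every non-ADR tweet is a false positive, and there are no false negatives:

$$P_{\text{class 1}} = p, \qquad R_{\text{class 1}} = 1, \qquad F_{\text{class 1}} = \frac{2 \times p \times 1}{p + 1} = \frac{2p}{1 + p}$$

Solving for $p$ with the reported $F_{\text{class 1}} = 0.143$ gives

$$p = \frac{F_{\text{class 1}}}{2 - F_{\text{class 1}}} = \frac{0.143}{1.857} \approx 0.077,$$

which agrees with the earlier statement that the ADR class makes up less than 8% of the test set.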

To investigate the impact of each feature group on the overall performance, we conduct ablation experiments where we repeat the same classification process but remove one feature group at a time. Table 5 shows the results of these ablation experiments for our best system (submission 1). Comparing the two major groups of features, general textual features (row b) and domain-specific features (row c), we observe that they both have a substantial impact on the performance. Removing one of these groups leads to a drop of two percentage points in $F_{\text{class 1}}$. The general textual features mostly affect recall of the ADR class (row b) while the domain-specific features impact precision (row c). Among the general textual features, the most influential feature is general-domain word embeddings (row b.2). Among the domain-specific features, n-grams generalized over domain terms (row c.1) and domain word embeddings (row c.3) provide a noticeable contribution to the overall performance. In the Appendix, we provide a list of the top 25 n-gram features (including n-grams generalized over domain terms) ranked by their importance in separating the two classes.

[Table 5: ablation results for Task 1 (submission 1); only row labels and some values survive. Rows: b. all − general textual features; b.1. all − general n-grams; b.2. all − general embeddings; b.3. all − general clusters; b.4. all − Twitter-specific; c. all − domain-specific features; c.1. all − domain generalized n-grams; c.2. all − Pronoun lexicon; c.3. all − domain embeddings; c.4. all − domain clusters; d. all − under-sampling; punctuation. Surviving values, row and column assignment not recoverable: 0.390, 0.397, 0.365, 0.383, 0.382.]

As mentioned before, the data for Task 1 has high class imbalance, which significantly affects performance. Not applying any of the techniques for handling class imbalance results in a drop of more than ten percentage points in F-measure: the model assigns most of the instances to the majority (non-ADR) class (row d). Also, applying under-sampling with the balanced class distribution results in performance significantly worse ($F_{\text{class 1}} = 0.387$) than the performance of submission 1, where under-sampling with a class distribution of 1:2 was applied.

Error analysis on our best submission showed that there were 395 false negative errors (tweets that report ADRs, but were classified as non-ADR) and 582 false positives (non-ADR tweets classified as ADR). Most of the false negatives were due to the creative ways in which people express themselves (e.g., i have metformin tummy today :-( ). Large amounts of labeled training data or the use of semi-supervised techniques to take advantage of large unlabeled domain corpora may help improve the detection of ADRs in such tweets. False positives were mostly caused by confusion between ADRs and other relations between a medication and a symptom. Tweets may mention both a medication and a symptom, but the symptom may not be an ADR. The medication may have an unexpected positive effect (e.g., reversal of hair loss), or may alleviate an existing health condition. Sometimes, the relation between the medication and the symptom is not explicitly mentioned in a tweet, yet an ADR can be inferred by humans.

Task 2 (Classification of Tweets for Medication Intake)

The results for our three official submissions on Task 2 are presented in Table 6 (rows c.1–c.3). The best results in $F_{\text{class 1+class 2}}$ are achieved with submission 1 (row c.1). The results for the other two submissions, submission 2 and submission 3, are quite similar to the results of submission 1 in both precision and recall (rows c.2–c.3). Adding the features from the ADR lexicon and the Pronoun lexicon did not result in performance improvement on the test set. Our best system is ranked third among the nine teams that participated in this task (rows b.1–b.3).

Table 6 also shows the results for two baseline classifiers. The first baseline is a classifier that assigns class 2 (possible medication intake) to all instances (row a.1). Class 2 is the majority class among the two positive classes, class 1 and class 2, in the training set. The performance of this baseline is quite low ($F_{\text{class 1+class 2}} = 0.452$) since class 2 covers only 36% of the instances in the test set. The second baseline is an SVM classifier trained only on the unigram features (row a.2). The performance of such a simple model is surprisingly high ($F_{\text{class 1+class 2}} = 0.646$), only 4.7 percentage points below the top result in the competition.

Table 7 shows the performance of our best system (submission 1) when one of the feature groups is removed. In this task, the general textual features (row b) played a bigger role in the overall performance than the domain-specific (row c) or sentiment lexicon (row d) features. Removing this group of features results in a drop of more than 2.5 percentage points in the F-measure, affecting both precision and recall (row b). However, removing any one feature subgroup in this group (e.g., general n-grams, general clusters, general embeddings, etc.) results only in a slight drop or even an increase in the performance (rows b.1–b.4). This indicates that the features in this group capture similar information. Among the domain-specific features, the n-grams generalized over domain terms are the most useful. The model trained without these n-gram features performs almost one percentage point worse than the model that uses all the features (row c.1). The sentiment lexicon features were not helpful (row d).

[Table 6: results for Task 2; only part of the table survives. a. Baselines: a.1. Assigning class 2 to all instances (0.452); a.2. SVM-unigrams (0.646). b. Top 3 teams in the shared task: b.1. InfyNLP; b.2. UKNLP; b.3. NRC-Canada.]

[Table 7: ablation results for Task 2 (submission 1); only some row labels survive. c. all − domain-specific features; c.1. all − domain generalized n-grams; c.2. all − domain embeddings; punctuation.]

Our strategy of handling class imbalance through class weights did not prove successful on the test set (even though it resulted in an increase of one point in F-measure in the cross-validation experiments). The model trained with the default class weights of 1 for all classes performs 0.7 percentage points better than the model trained with the class weights selected in cross-validation (row e).
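For reference, the class-weighting rule from Section 3.1 (per-class cost equals the generic parameter C multiplied by the class weight) amounts to the arithmetic below. The sketch uses the weights 4, 2, and 1 reported for Task 2; the value of C and the hinge-loss form of the penalty are assumptions chosen for illustration, since the in-house SVM implementation is not described further.

```python
def per_class_cost(c, class_weights):
    """Misclassification cost per class: generic C times the class weight."""
    return {label: c * weight for label, weight in class_weights.items()}


def weighted_hinge_loss(margin, cost):
    """Hinge loss for one instance, scaled by the cost of its gold class.

    `margin` is the signed score of the gold class (assumed convention):
    values of 1 or more incur no loss.
    """
    return cost * max(0.0, 1.0 - margin)


if __name__ == "__main__":
    selected = {1: 4, 2: 2, 3: 1}   # weights chosen in cross-validation
    default = {1: 1, 2: 1, 3: 1}    # default weights
    c = 0.1                         # assumed value of C for the example
    print(per_class_cost(c, selected))
    print(per_class_cost(c, default))
    # The same margin violation costs more for a class-1 tweet.
    print(weighted_hinge_loss(0.5, per_class_cost(c, selected)[1]))
    print(weighted_hinge_loss(0.5, per_class_cost(c, selected)[3]))
```

Under this rule an error on a class 1 tweet is penalized four times as heavily as an error on a class 3 tweet, which pushes the decision boundary toward predicting the rarer intake classes.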

The difference in how people can express medication intake vs. how they express that they have not taken a medication can be rather subtle. For example, the expression I need Tylenol indicates that the person has not taken the medication yet (class 3), whereas the expression I need more Tylenol indicates that the person has taken the medication (class 1). In still other instances, the word more might not be the deciding factor in whether a medication was taken or not (e.g., more Tylenol didn’t help). A useful avenue of future work is to explore the role function words play in determining the semantics of a sentence, specifically, when they imply medication intake, when they imply the lack of medication intake, and when they are not relevant to determining medication intake.

Conclusion

Our submissions to the 2017 SMM4H Shared Tasks Workshop obtained the first and third ranks in Task 1 and Task 2, respectively. In Task 1, the systems had to determine whether a given tweet mentions an adverse drug reaction. In Task 2, the goal was to label a given tweet with one of three classes: personal medication intake, possible medication intake, or non-intake. For both tasks, we trained an SVM classifier leveraging a number of textual, sentiment, and domain-specific features. Our post-competition experiments demonstrate that the most influential features in our system for Task 1 were general-domain word embeddings, domain-specific word embeddings, and n-grams generalized over domain terms. Moreover, under-sampling the majority class (non-ADR) to reduce class imbalance to 1:2 proved crucial to the success of our submission. Similarly, n-grams generalized over domain terms improved results significantly in Task 2. On the other hand, sentiment lexicon features were not helpful in either task.

Appendix

We list the top 25 n-gram features (word n-grams and n-grams generalized over domain terms) ranked by mutual information of the presence/absence of n-gram features ($f$) and class labels ($C$):

$$I(f; C) = \sum_{f \in \{0, 1\}} \sum_{c \in C} p(f, c) \log \frac{p(f, c)}{p(f)\, p(c)},$$

where $C = \{0, 1\}$ for Task 1 and $C = \{1, 2, 3\}$ for Task 2.

Here, <ADR> represents a word or a phrase from the ADR lexicon; <MED> represents a medication name from our one-word medication list.
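The ranking criterion can be computed from simple counts. The sketch below estimates the probabilities by relative frequency and uses a base-2 logarithm; the base is an assumption (the formula above does not fix it) and does not affect the ranking. The function name `mutual_information` is illustrative.

```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """Mutual information between a binary feature and the class labels.

    `presence[i]` is 1 if the n-gram occurs in tweet i and 0 otherwise;
    `labels[i]` is the class of tweet i. Empty joint cells contribute 0.
    """
    n = len(labels)
    joint = Counter(zip(presence, labels))
    feature_counts = Counter(presence)
    class_counts = Counter(labels)
    mi = 0.0
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = feature_counts[f] / n
        p_c = class_counts[c] / n
        mi += p_fc * math.log2(p_fc / (p_f * p_c))
    return mi


if __name__ == "__main__":
    labels = [1, 1, 0, 0]
    print(mutual_information([1, 1, 0, 0], labels))  # feature mirrors the class
    print(mutual_information([1, 0, 1, 0], labels))  # feature independent of class
```

Features are then sorted by this value in descending order, and the top 25 are reported.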

Task 1

14. <MED> makes me
15. gain
16. weight
17. <ADR> and
18. headache
19. made
20. tired
21. rivaroxaban diary
22. withdrawals
23. zomby
24. day
25. <MED> diary

References

1. Jason Lazarou, Bruce H Pomeranz, and Paul N Corey. Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies. JAMA, 279(15):1200-1205, 1998.
2. Patrick Waller and Mira Harrison-Woolrych. Types and sources of data. In An Introduction to Pharmacovigilance, pages 37-53. John Wiley & Sons, Ltd, 2017.
3. Mittmann, S. R. Knowles, Gomez, J. S. Fish, Cartotto, and N. H. Shear. Evaluation of the extent of underreporting of serious adverse drug reactions: the case of toxic epidermal necrolysis. Drug Safety, 27(7):477-487, 2004.
4. A. C. Tricco, Zarin, Lillie, Pham, and S. E. Straus. Utility of social media and crowd-sourced data for pharmacovigilance: a scoping review protocol. BMJ Open, 7(1):e013474, Jan 2017.
5. Mascolo, Scavone, Sessa, G. di Mauro, D. Cimmaruta, V. Orlando, F. Rossi, L. Sportiello, and Capuano. Can causality assessment fulfill the new European definition of adverse drug reaction? A review of methods used in spontaneous reporting. Pharmacological Research, 123:122-129, Sep 2017.
6. R. P. Naidu. Causality assessment: A brief insight into practices in pharmaceutical industry. Perspectives in Clinical Research, 4(4):233-236, Oct 2013.
7. Lardon, Abdellaoui, Bellet, Asfari, Souvignet, Texier, M. C. Jaulent, M. N. Beyens, Burgun, and Bousquet. Adverse Drug Reaction Identification and Extraction in Social Media: A Scoping Review. Journal of Medical Internet Research, 17(7):e171, Jul 2015.
8. Abeed Sarker, Azadeh Nikfarjam, and Graciela Gonzalez. Social media mining shared task workshop. In Proceedings of the Pacific Symposium on Biocomputing, 2016.
9. Majid Rastegar-Mojarad, Ravikumar Komandur Elayavilli, Yue Yu, and Hongfang Liu. Detecting signals in noisy data - can ensemble classifiers help identify adverse drug reaction in tweets? In Proceedings of the Social Media Mining Shared Task Workshop at the Pacific Symposium on Biocomputing, 2016.
10. Dominic Egger, Fatih Uzdilli, and Mark Cieliebak. Adverse drug reaction detection using an adapted sentiment classifier. In Proceedings of the Social Media Mining Shared Task Workshop at PSB, 2016.
11. Guo Haixiang, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing. Learning from class-imbalanced data: review of methods and applications. Expert Systems with Applications, 73:220-239, 2017.
12. Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Proceedings of the Annual Meeting of ACL, 2011.
13. Abeed Sarker and Graciela Gonzalez. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics, 53:196-207, 2015.
14. Saif M. Mohammad, Svetlana Kiritchenko, and Xiaodan Zhu. NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets. In Proceedings of the International Workshop on Semantic Evaluation, Atlanta, Georgia, 2013.
15. Saif M. Mohammad, Parinaz Sobhani, and Svetlana Kiritchenko. Stance and sentiment in tweets. ACM Transactions on Internet Technology, 17(3), 2017.
16. Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4444-4451, 2017.
17. Azadeh Nikfarjam, Abeed Sarker, Karen O'Connor, Rachel Ginn, and Graciela Gonzalez. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. Journal of the American Medical Informatics Association, 22(3):671-681, 2015.
18. Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), pages 168-177, USA, 2004.
19. Amy Beth Warriner, Victor Kuperman, and Marc Brysbaert. Norms of valence, arousal, and dominance for 13,915 English lemmas. Behavior Research Methods, 45(4):1191-1207, 2013.
20. Peter Sheridan Dodds, Kameron Decker Harris, Isabel M. Kloumann, Catherine A. Bliss, and Christopher Danforth. Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PloS One, 6(12):e26752, 2011.
21. Svetlana Kiritchenko, Xiaodan Zhu, and Saif Mohammad. Sentiment analysis of short informal texts. Journal of Artificial Intelligence Research, 50:723-762, 2014.
22. Saif Mohammad and Peter D. Turney. Crowdsourcing a word-emotion association lexicon. Computational Intelligence, 29(3):436-465, 2013.
23. Saif Mohammad. #Emotional tweets. In Proceedings of the Conference on Lexical and Computational Semantics (*Sem), pages 246-255, Montréal, Canada, June 2012.

Task 2

14. you
15. he
16. me
17. need a
18. kick
19. i need a
20. she
21. headache
22. kick in
23. this <MED>
24. need a <MED>
25. need <MED>