<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NRC-Canada at SMM4H Shared Task: Classifying Tweets Mentioning Adverse Drug Reactions and Medication Intake</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Svetlana Kiritchenko</string-name>
          <degrees>Ph.D.</degrees>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saif M. Mohammad</string-name>
          <degrees>Ph.D.</degrees>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason Morin</string-name>
          <degrees>Ph.D.</degrees>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Berry de Bruijn</string-name>
          <degrees>Ph.D.</degrees>
        </contrib>
        <aff>National Research Council Canada, Ottawa, Canada</aff>
      </contrib-group>
      <abstract>
        <p>Our team, NRC-Canada, participated in two shared tasks at the AMIA-2017 Workshop on Social Media Mining for Health Applications (SMM4H): Task 1 - classification of tweets mentioning adverse drug reactions, and Task 2 - classification of tweets describing personal medication intake. For both tasks, we trained Support Vector Machine classifiers using a variety of surface-form, sentiment, and domain-specific features. With nine teams participating in each task, our submissions ranked first on Task 1 and third on Task 2. Handling considerable class imbalance proved crucial for Task 1. We applied an under-sampling technique to reduce class imbalance (from about 1:10 to 1:2). Standard n-gram features, n-grams generalized over domain terms, and general-domain and domain-specific word embeddings had a substantial impact on the overall performance in both tasks. On the other hand, including sentiment lexicon features did not result in any improvement.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>Task 1: Classification of Tweets for ADR</title>
      <p>[Table 1: number of instances in the Task 1 training set, development set, and test set.]</p>
      <p>Task 1 was formulated as follows: given a tweet, determine whether it mentions an adverse drug reaction. This was a binary classification task:</p>
      <p>class 1 (ADR) - tweets that mention adverse drug reactions</p>
      <sec id="sec-2-1">
        <title>Example: Nicotine lozenges are giving me stomach cramps.</title>
        <p>class 0 (non-ADR) - tweets that do not mention adverse drug reactions</p>
      </sec>
      <sec id="sec-2-2">
        <title>Example: I need a injection of Prozac ! ...now!!!!</title>
        <sec id="sec-2-2-1">
          <title>The official evaluation metric was the F-score for class 1 (ADR):</title>
          <p>
            <disp-formula><tex-math>P_{\text{class 1}} = \frac{TP_{\text{class 1}}}{TP_{\text{class 1}} + FP_{\text{class 1}}}; \quad R_{\text{class 1}} = \frac{TP_{\text{class 1}}}{TP_{\text{class 1}} + FN_{\text{class 1}}}; \quad F_{\text{class 1}} = \frac{2 \cdot P_{\text{class 1}} \cdot R_{\text{class 1}}}{P_{\text{class 1}} + R_{\text{class 1}}}</tex-math></disp-formula>
          </p>
          <p>The data for this task was created as part of a large project on ADR detection from social media by the DIEGO lab at Arizona State University. The tweets were collected using the generic and brand names of the drugs as well as their phonetic misspellings. Two domain experts under the guidance of a pharmacology expert annotated the tweets for the presence or absence of an ADR mention. The inter-annotator agreement for the two annotators was Cohen's Kappa = 0.69.<xref ref-type="bibr" rid="ref8">8</xref></p>
          <p>Two labeled datasets were provided to the participants: a training set containing 10,822 tweets and a development set containing 4,845 tweets. These datasets were distributed as lists of tweet IDs, and the participants needed to download the tweets using the provided Python script. However, only about 60–70% of the tweets were accessible at the time of download (May 2017). The training set contained several hundred duplicate or near-duplicate messages, which we decided to remove. Near-duplicates were defined as tweets containing mostly the same text but differing in user mentions, punctuation, or other non-essential context. A separate test set of 9,961 tweets was provided without labels at the evaluation period. This set was distributed to the participants, in full, by email. Table 1 shows the number of instances we used for training and testing our model.</p>
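          <p>As an illustration of the official Task 1 metric, the following minimal sketch computes precision, recall, and F-score for class 1 from gold and predicted labels. The function name and the convention of returning 0 when a denominator is empty are assumptions made for this example; they are not taken from the organizers' evaluation script.</p>
          <preformat><![CDATA[
```python
def f_score_class1(gold, pred, positive=1):
    """Return (precision, recall, F) for the positive class (class 1, ADR)."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [1, 1, 0, 0, 1]
pred = [1, 0, 1, 0, 1]
precision, recall, f1 = f_score_class1(gold, pred)
```
]]></preformat>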
          <p>
            Task 1 was a rerun of the shared task organized in 2016.<xref ref-type="bibr" rid="ref8">8</xref>
            The best result obtained in 2016 was F<sub>class 1</sub> = 0.42.<xref ref-type="bibr" rid="ref9">9</xref>
            The participants in the 2016 challenge employed various statistical machine learning techniques, such as Support Vector Machines, Maximum Entropy classifiers, Random Forests, and other ensembles.<sup>9, 10</sup> A variety of features (e.g., word n-grams, word embeddings, sentiment, and topic models) as well as extensive medical resources (e.g., UMLS, lexicons of ADRs, drug lists, and lists of known drug-side effect pairs) were explored.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Task 2: Classification of Tweets for Medication Intake</title>
      <p>Task 2 was formulated as follows: given a tweet, determine whether it mentions personal medication intake, possible medication intake, or no intake. This was a multi-class classification problem with three classes:</p>
      <p>class 1 (personal medication intake) - tweets in which the user clearly expresses a personal medication intake/consumption</p>
      <sec id="sec-3-1">
        <title>Example: Advil just saved my life :))</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <sec id="sec-4-1">
        <title>Training set Development set Test set</title>
        <p>Table 2 (all classes): training set 7,528 instances; development set 2,068 instances; test set 7,513 instances.</p>
        <p>class 2 (possible medication intake) - tweets that are ambiguous but suggest that the user may have taken the medication</p>
        <sec id="sec-4-1-1">
          <title>Example: Having pains and all my Tylenol gone</title>
          <p>class 3 (non-intake) - tweets that mention medication names but do not indicate personal intake</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Example: Going thru this pain without Tylenol..</title>
          <p>The official evaluation metric for this task was the micro-averaged F-score of class 1 (intake) and class 2 (possible intake):</p>
          <p>
            <disp-formula><tex-math>P_{\text{class 1+class 2}} = \frac{TP_{\text{class 1}} + TP_{\text{class 2}}}{TP_{\text{class 1}} + FP_{\text{class 1}} + TP_{\text{class 2}} + FP_{\text{class 2}}}</tex-math></disp-formula>
            <disp-formula><tex-math>R_{\text{class 1+class 2}} = \frac{TP_{\text{class 1}} + TP_{\text{class 2}}}{TP_{\text{class 1}} + FN_{\text{class 1}} + TP_{\text{class 2}} + FN_{\text{class 2}}}</tex-math></disp-formula>
            <disp-formula><tex-math>F_{\text{class 1+class 2}} = \frac{2 \cdot P_{\text{class 1+class 2}} \cdot R_{\text{class 1+class 2}}}{P_{\text{class 1+class 2}} + R_{\text{class 1+class 2}}}</tex-math></disp-formula>
          </p>
          <p>Information on how the data was collected and annotated was not available until after the evaluation.</p>
          <p>Two labeled datasets were provided to the participants: a training set containing 8,000 tweets and a development set containing 2,260 tweets. As for Task 1, the training and development sets were distributed through tweet IDs and a download script. Around 95% of the tweets were accessible through download. Again, we removed duplicate and near-duplicate messages. A separate test set of 7,513 tweets was provided without labels at the evaluation period. This set was distributed to the participants, in full, by email. Table 2 shows the number of instances we used for training and testing our model.</p>
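          <p>The micro-averaged metric for Task 2 pools the true positives, false positives, and false negatives of class 1 and class 2 before computing precision and recall. The sketch below is one way to compute it; the function name and the handling of empty denominators are assumptions made for illustration and do not reproduce the official scorer.</p>
          <preformat><![CDATA[
```python
def micro_f_intake(gold, pred, positive=(1, 2)):
    """Micro-averaged (precision, recall, F) over the positive classes 1 and 2."""
    tp = fp = fn = 0
    for cls in positive:
        for g, p in zip(gold, pred):
            if p == cls and g == cls:
                tp += 1
            elif p == cls and g != cls:
                fp += 1
            elif g == cls and p != cls:
                fn += 1
    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [1, 2, 3, 1, 2, 3]
pred = [1, 2, 3, 2, 3, 1]
precision, recall, f_micro = micro_f_intake(gold, pred)
```
]]></preformat>
          <p>Note that a class 1 tweet predicted as class 2 counts both as a false negative for class 1 and as a false positive for class 2, so confusions between the two positive classes are penalized twice.</p>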
          <p>For each task, three submissions were allowed from each participating team.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>System Description</title>
      <p>Both our systems, for Task 1 and Task 2, share the same classification framework and feature pool. The specific configurations of features and parameters were chosen for each task separately through cross-validation experiments (see Section 3.3).</p>
    </sec>
    <sec id="sec-6">
      <title>Machine Learning Framework</title>
      <p>For both tasks, we trained linear-kernel Support Vector Machine (SVM) classifiers. Past work has shown that SVMs are effective on text categorization tasks and robust when working with large feature spaces. In our cross-validation experiments on the training data, a linear-kernel SVM trained with the features described below obtained better performance than a number of other statistical machine-learning algorithms, such as Stochastic Gradient Descent, AdaBoost, Random Forests, as well as SVMs with other kernels (e.g., RBF, polynomial). We used an in-house implementation of SVM.</p>
      <p>Handling Class Imbalance: For Task 1 (Classification of tweets for ADR), the provided datasets were highly imbalanced: the ADR class occurred in less than 12% of instances in the training set and less than 8% in the development and test sets. Most conventional machine-learning algorithms experience difficulty with such data, classifying most of the instances into the majority class. Several techniques have been proposed to address the issue of class imbalance, including over-sampling, under-sampling, cost-sensitive learning, and ensembles.<sup>11</sup> We experimented with several such techniques. The best performance in our cross-validation experiments was obtained using under-sampling with the class proportion 1:2. To train the model, we provided the classifier with all available data for the minority class (ADR) and a randomly sampled subset of the majority class (non-ADR) data, chosen so that the number of instances in the majority class was twice the number of instances in the minority class. We found that this strategy significantly outperformed the more traditional balanced under-sampling, where the majority class is sub-sampled to create a balanced class distribution. In one of our submissions for Task 1 (submission 3), we created an ensemble of three classifiers trained on the full set of instances in the minority class (ADR) and different subsets of the majority class (non-ADR) data. We varied the proportion of minority class instances to majority class instances: 1:2, 1:3, and 1:4. The final predictions were obtained by majority voting on the predictions of the three individual classifiers.</p>
      <p>For Task 2 (Classification of tweets for medication intake), the provided datasets were also imbalanced but not as much as for Task 1: the class proportion in all subsets was close to 1:2:3. However, even for this task, we found some of the techniques for reducing class imbalance helpful. In particular, training an SVM classifier with different class weights improved the performance in the cross-validation experiments. These class weights are used to increase the cost of misclassification errors for the corresponding classes. The cost for a class is calculated as the generic cost parameter (parameter C in SVM) multiplied by the class weight. The best performance on the training data was achieved with class weights set to 4 for class 1 (intake), 2 for class 2 (possible intake), and 1 for class 3 (non-intake).</p>
      <p>Preprocessing: The following pre-processing steps were performed. URLs and user mentions were normalized to http://someurl and @username, respectively. Tweets were tokenized with the CMU Twitter NLP tool.<sup>12</sup></p>
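      <p>The under-sampling and majority-voting steps described in this section can be sketched as follows. This is a minimal illustration and not the in-house implementation: the function names, the fixed random seed, and the toy data are assumptions introduced for the example.</p>
      <preformat><![CDATA[
```python
import random
from collections import Counter


def undersample(instances, labels, minority=1, ratio=2, seed=0):
    """Keep every minority instance plus `ratio` times as many
    randomly chosen majority instances (ratio=2 gives 1:2)."""
    rng = random.Random(seed)  # assumption: fixed seed for repeatability
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    k = min(len(majority_idx), ratio * len(minority_idx))
    keep = minority_idx + rng.sample(majority_idx, k)
    rng.shuffle(keep)
    return [instances[i] for i in keep], [labels[i] for i in keep]


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per instance."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data: 3 ADR tweets and 20 non-ADR tweets (identified by index only).
tweets = list(range(23))
labels = [1] * 3 + [0] * 20
sample_x, sample_y = undersample(tweets, labels, ratio=2)

# Three hypothetical classifiers voting on three test tweets.
combined = majority_vote([[1, 0, 1], [1, 0, 0], [0, 0, 1]])
```
]]></preformat>
      <p>Calling <monospace>undersample</monospace> with ratio values 2, 3, and 4 and different seeds gives the three training sets of the ensemble variant; with an odd number of binary classifiers, the vote cannot tie.</p>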
    </sec>
    <sec id="sec-7">
      <title>Features</title>
      <p>
        The classification model leverages a variety of general textual features as well as sentiment and domain-specific features described below. Many features were inspired by previous work on ADR<sup>9, 10, 13</sup> and our work on sentiment analysis (such as the winning system in the SemEval-2013 task on sentiment analysis in Twitter<sup>14</sup> and the best-performing stance detection system<sup>15</sup>).
      </p>
      <sec id="sec-7-1">
        <title>General Textual Features</title>
        <p>The following surface-form features were used:</p>
        <list list-type="bullet">
          <list-item><p>N-grams: word n-grams (contiguous sequences of n tokens), non-contiguous word n-grams (n-grams with one token replaced by *), character n-grams (contiguous sequences of n characters), unigram stems obtained with the Porter stemming algorithm;</p></list-item>
          <list-item><p>General-domain word embeddings: (a) dense word representations generated with word2vec on ten million English-language tweets, summed over all tokens in the tweet; (b) word embeddings distributed as part of ConceptNet 5.5<sup>16</sup>, summed over all tokens in the tweet;</p></list-item>
          <list-item><p>General-domain word clusters: presence of tokens from the word clusters generated with the Brown clustering algorithm on 56 million English-language tweets;<sup>12</sup></p></list-item>
          <list-item><p>Negation: presence of simple negators (e.g., not, never); negation also affects the n-gram features: a term t becomes t_NEG if it occurs after a negator and before a punctuation mark;</p></list-item>
          <list-item><p>Twitter-specific features: the number of tokens with all characters in upper case, the number of hashtags, presence of positive and negative emoticons, whether the last token is a positive or negative emoticon, the number of elongated words (e.g., soooo);</p></list-item>
          <list-item><p>Punctuation: presence of exclamation and question marks, whether the last token contains an exclamation or question mark.</p></list-item>
        </list>
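        <p>Two of the surface-form features above, negation marking and non-contiguous n-grams, can be sketched as follows. The negator list, the punctuation pattern, and the choice to allow the wildcard in any position of the n-gram are assumptions made for this illustration; the exact rules of the submitted system are not specified here.</p>
        <preformat><![CDATA[
```python
import re

NEGATORS = {"not", "never", "no", "cannot"}  # assumption: illustrative list
PUNCTUATION = re.compile(r"^[.,:;!?]+$")     # assumption: scope-ending tokens


def mark_negation(tokens):
    """Append _NEG to every token between a negator and the next punctuation."""
    marked, in_scope = [], False
    for token in tokens:
        if PUNCTUATION.match(token):
            in_scope = False
            marked.append(token)
        elif token.lower() in NEGATORS:
            in_scope = True
            marked.append(token)
        else:
            marked.append(token + "_NEG" if in_scope else token)
    return marked


def word_ngrams(tokens, n):
    """Contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def noncontiguous_ngrams(tokens, n):
    """Every n-gram with exactly one token replaced by the wildcard *."""
    result = []
    for gram in word_ngrams(tokens, n):
        for j in range(n):
            result.append(gram[:j] + ("*",) + gram[j + 1:])
    return result


tokens = mark_negation("this did not help me . thanks".split())
trigrams = noncontiguous_ngrams(["a", "b", "c"], 3)
```
]]></preformat>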
      </sec>
      <sec id="sec-7-2">
        <title>Domain-Specific Features</title>
        <p>To generate domain-specific features, we used the following domain resources:</p>
        <list list-type="bullet">
          <list-item><p>Medication list: we compiled a medication list by selecting all one-word medication names from RxNorm (e.g., acetaminophen, nicorette, zoloft), since most of the medications mentioned in the training datasets were one-word strings.</p></list-item>
          <list-item><p>Pronoun lexicon: we compiled a lexicon of first-person pronouns (e.g., I, ours, we’ll), second-person pronouns (e.g., you, yourself), and third-person pronouns (e.g., them, mom’s, parents’).</p></list-item>
          <list-item><p>ADR lexicon: a list of 13,699 ADR concepts compiled from COSTART, SIDER, CHV, and drug-related tweets by the DIEGO lab;<sup>17</sup></p></list-item>
          <list-item><p>Domain word embeddings: dense word representations generated by the DIEGO lab by applying word2vec on one million tweets mentioning medications;<sup>17</sup></p></list-item>
          <list-item><p>Domain word clusters: word clusters generated by the DIEGO lab using the word2vec tool to perform K-means clustering on the above-mentioned domain word embeddings.<sup>17</sup></p></list-item>
        </list>
        <p>From these resources, the following domain-specific features were generated:</p>
        <list list-type="bullet">
          <list-item><p>N-grams generalized over domain terms (or domain generalized n-grams, for short): n-grams where words or phrases representing a medication (from our medication list) or an adverse drug reaction (from the ADR lexicon) are replaced with &lt;MED&gt; and &lt;ADR&gt;, respectively (e.g., &lt;MED&gt; makes me);</p></list-item>
          <list-item><p>Pronoun lexicon features: the number of tokens from the Pronoun lexicon matched in the tweet;</p></list-item>
          <list-item><p>Domain word embeddings: the sum of the domain word embeddings for all tokens in the tweet;</p></list-item>
          <list-item><p>Domain word clusters: presence of tokens from the domain word clusters.</p></list-item>
        </list>
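        <p>A minimal sketch of how n-grams generalized over domain terms might be produced is given below. The tiny medication set and ADR phrase set are hypothetical stand-ins for the RxNorm-derived list and the ADR lexicon, and the longest-match-first strategy for multi-word ADR phrases is an assumption of this example.</p>
        <preformat><![CDATA[
```python
def generalize(tokens, medications, adr_phrases):
    """Replace medication names with <MED> and ADR phrases with <ADR>."""
    longest = max((len(p) for p in adr_phrases), default=0)
    out, i = [], 0
    while i < len(tokens):
        span = 0
        # Assumption: prefer the longest ADR phrase starting at position i.
        for n in range(min(longest, len(tokens) - i), 0, -1):
            if tuple(t.lower() for t in tokens[i:i + n]) in adr_phrases:
                span = n
                break
        if span:
            out.append("<ADR>")
            i += span
        elif tokens[i].lower() in medications:
            out.append("<MED>")
            i += 1
        else:
            out.append(tokens[i])
            i += 1
    return out


def word_ngrams(tokens, n):
    """Contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical miniature resources, for illustration only.
MEDICATIONS = {"tylenol", "zoloft"}
ADR_PHRASES = {("stomach", "cramps"), ("sleepy",)}

generalized = generalize("Zoloft makes me sleepy".split(), MEDICATIONS, ADR_PHRASES)
domain_trigrams = word_ngrams(generalized, 3)
```
]]></preformat>
        <p>Generalizing before extracting n-grams lets a pattern such as &lt;MED&gt; makes me be shared across all medications instead of being learned separately for each drug name.</p>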
      </sec>
      <sec id="sec-7-3">
        <title>Sentiment Lexicon Features</title>
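        <p>We generated features using the sentiment scores provided in the following lexicons: Hu and Liu Lexicon<sup>18</sup>, Norms of Valence, Arousal, and Dominance<sup>19</sup>, labMT<sup>20</sup>, and NRC Emoticon Lexicon<sup>21</sup>. The first three lexicons were created through manual annotation, while the last one, NRC Emoticon Lexicon, was generated automatically from a large collection of tweets with emoticons. The following set of features was calculated separately for each tweet and each lexicon:</p>
        <list list-type="bullet">
          <list-item><p>the number of tokens with <inline-formula><tex-math>\mathrm{score}(w) \neq 0</tex-math></inline-formula>;</p></list-item>
          <list-item><p>the total score <inline-formula><tex-math>= \sum_{w \in \mathrm{tweet}} \mathrm{score}(w)</tex-math></inline-formula>;</p></list-item>
          <list-item><p>the maximal score <inline-formula><tex-math>= \max_{w \in \mathrm{tweet}} \mathrm{score}(w)</tex-math></inline-formula>;</p></list-item>
          <list-item><p>the score of the last token in the tweet.</p></list-item>
        </list>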
        <p>We experimented with a number of other existing manually created or automatically generated sentiment and emotion lexicons, such as the NRC Emotion Lexicon<sup>22</sup> and the NRC Hashtag Emotion Lexicon<sup>23</sup> (http://saifmohammad.com/WebPages/lexicons.html), but did not observe any improvement in the cross-validation experiments. None of the sentiment lexicon features were effective in the cross-validation experiments on Task 1; therefore, we did not include them in the final feature set for this task.</p>
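        <p>The four per-lexicon features listed in this section can be computed along the following lines. The dictionary-based lexicon, the lower-casing of tokens, and the score of 0 for out-of-lexicon tokens are assumptions made for this sketch; the example scores are invented for illustration and do not come from any of the named lexicons.</p>
        <preformat><![CDATA[
```python
def lexicon_features(tokens, lexicon):
    """Compute the four sentiment features of one tweet for one lexicon."""
    # Assumption: tokens missing from the lexicon receive a score of 0.
    scores = [lexicon.get(token.lower(), 0.0) for token in tokens]
    return {
        "nonzero_count": sum(1 for s in scores if s != 0),
        "total_score": sum(scores),
        "max_score": max(scores, default=0.0),
        "last_score": scores[-1] if scores else 0.0,
    }


# Hypothetical scores, for illustration only.
toy_lexicon = {"good": 1.0, "bad": -2.0, "great": 3.0}
features = lexicon_features(["good", "bad", "day", "great"], toy_lexicon)
```
]]></preformat>
        <p>Running the function once per lexicon and concatenating the resulting values gives one feature block per lexicon, matching the "for each tweet and each lexicon" description.</p>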
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Official Submissions</title>
      <p>For each task, our team submitted three sets of predictions. The submissions differed in the sets of features and
parameters used to train the classification models (Table 3).</p>
      <p>[Table 3: feature sets and parameters used for training the classification models in the official submissions. Row groups: general textual features (word n-grams, n up to; non-contiguous n-grams, n up to; character n-grams, n up to; unigram stems; general-domain word embeddings; general-domain word clusters; negation; Twitter-specific features; punctuation); domain-specific features (domain generalized n-grams, n up to; domain generalized non-contiguous n-grams, n up to; ADR lexicon; Pronoun lexicon; domain word embeddings; domain word clusters); sentiment lexicon features; SVM parameters (C, class weights); under-sampling (class proportion).]</p>
      <p>While developing the system for Task 1, we noticed that the results obtained through cross-validation on the training data were almost 13 percentage points higher than the results obtained by the model trained on the full training set and applied to the development set. This drop in performance was mostly due to a drop in precision. It suggests that the datasets differed substantially in language use, possibly because they were collected and annotated at separate times. Therefore, we decided to optimize the parameters and features for submission 1 and submission 2 using two different strategies. The models for the three submissions were trained as follows:</p>
      <list list-type="bullet">
        <list-item><p>Submission 1: we randomly split the development set into 5 equal folds. We trained a classification model on the combination of four folds and the full training set, and tested the model on the remaining fifth fold of the development set. The procedure was repeated five times, each time testing on a different fold. The feature set and the classification parameters that resulted in the best F<sub>class 1</sub> were used to train the final model.</p></list-item>
        <list-item><p>Submission 2: the features and parameters were selected based on the performance of the model trained on the full training set and tested on the full development set.</p></list-item>
        <list-item><p>Submission 3: we used the same features and parameters as in submission 1, except that we trained an ensemble of three models, varying the class distribution in the sub-sampling procedure (1:2, 1:3, and 1:4).</p></list-item>
      </list>
      <p>For Task 2, the features and parameters were selected based on the cross-validation results obtained on the combination of the training and development sets. We randomly split the development set into 3 equal folds. We trained a classification model on the combination of two folds and the full training set, and tested the model on the remaining third fold of the development set. The procedure was repeated three times, each time testing on a different fold. The models for the three submissions were trained as follows:</p>
      <list list-type="bullet">
        <list-item><p>Submission 1: we used the features and parameters that gave the best results during cross-validation.</p></list-item>
        <list-item><p>Submission 2: we used the same features and parameters as in submission 1, but added features derived from two domain resources: the ADR lexicon and the Pronoun lexicon.</p></list-item>
        <list-item><p>Submission 3: we used the same features as in submission 1, but changed the SVM C parameter to 0.1.</p></list-item>
      </list>
      <p>For both tasks and all submissions, the final models were trained on the combination of the full training set and full development set, and applied to the test set.</p>
    </sec>
    <sec id="sec-9">
      <title>Results and Discussion</title>
    </sec>
    <sec id="sec-10">
      <title>Task 1 (Classification of Tweets for ADR)</title>
      <p>The results for our three official submissions are presented in Table 4 (rows c.1–c.3). The best results in F<sub>class 1</sub> were obtained with submission 1 (row c.1). The results for submission 2 are the lowest, with the F-measure being 3.5 percentage points lower than the result for submission 1 (row c.2). The ensemble classifier (submission 3) performs slightly worse than the best result. However, in the post-competition experiments, we found that larger ensembles (with 7–11 classifiers, each trained on a random sub-sample of the majority class to reduce class imbalance to 1:2) outperform our best single-classifier model by over one percentage point, with F<sub>class 1</sub> reaching up to 0.446 (row d). Our best submission is ranked first among the nine teams that participated in this task (rows b.1–b.3).</p>
      <p>Table 4 also shows the results for two baseline classifiers. The first baseline is a classifier that assigns class 1 (ADR) to all instances (row a.1). The performance of this baseline is very low (F<sub>class 1</sub> = 0.143) due to the small proportion of class 1 instances in the test set. The second baseline is an SVM classifier trained only on the unigram features (row a.2). Its performance is much higher than the performance of the first baseline, but substantially lower than that of our system. By adding a variety of textual and domain-specific features as well as applying under-sampling, we are able to improve the classification performance by almost ten percentage points in F-measure.</p>
      <p>To investigate the impact of each feature group on the overall performance, we conduct ablation experiments where we repeat the same classification process but remove one feature group at a time. Table 5 shows the results of these ablation experiments for our best system (submission 1). Comparing the two major groups of features, general textual features (row b) and domain-specific features (row c), we observe that they both have a substantial impact on the performance. Removing one of these groups leads to a drop of two percentage points in F<sub>class 1</sub>. The general textual features mostly affect recall of the ADR class (row b), while the domain-specific features impact precision (row c). Among the general textual features, the most influential feature is general-domain word embeddings (row b.2). Among the domain-specific features, n-grams generalized over domain terms (row c.1) and domain word embeddings (row c.3) provide a noticeable contribution to the overall performance. In the Appendix, we provide a list of the top 25 n-gram features (including n-grams generalized over domain terms) ranked by their importance in separating the two classes.</p>
      <p>[Table 5: results of the ablation experiments for Task 1 (submission 1); each row removes the named group from the full feature set. Recovered row labels: b. general textual features (b.1 general n-grams, b.2 general embeddings, b.3 general clusters, b.4 Twitter-specific, punctuation); c. domain-specific features (c.1 domain generalized n-grams, c.2 Pronoun lexicon, c.3 domain embeddings, c.4 domain clusters); d. under-sampling. Recovered values without row assignment: 0.390, 0.397, 0.365, 0.383, 0.382.]</p>
      <p>As mentioned before, the data for Task 1 has high class imbalance, which significantly affects performance. Not applying any of the techniques for handling class imbalance results in a drop of more than ten percentage points in F-measure: the model assigns most of the instances to the majority (non-ADR) class (row d). Also, applying under-sampling with the balanced class distribution results in performance significantly worse (F<sub>class 1</sub> = 0.387) than the performance of submission 1, where under-sampling with a class distribution of 1:2 was applied.</p>
      <p>Error analysis on our best submission showed that there were 395 false negative errors (tweets that report ADRs but were classified as non-ADR) and 582 false positives (non-ADR tweets classified as ADR). Most of the false negatives were due to the creative ways in which people express themselves (e.g., i have metformin tummy today :-( ). Large amounts of labeled training data or the use of semi-supervised techniques to take advantage of large unlabeled domain corpora may help improve the detection of ADRs in such tweets. False positives were mostly caused by confusion between ADRs and other relations between a medication and a symptom. Tweets may mention both a medication and a symptom, but the symptom may not be an ADR. The medication may have an unexpected positive effect (e.g., reversal of hair loss), or may alleviate an existing health condition. Sometimes, the relation between the medication and the symptom is not explicitly mentioned in a tweet, yet an ADR can be inferred by humans.</p>
    </sec>
    <sec id="sec-11">
      <title>Task 2 (Classification of Tweets for Medication Intake)</title>
      <p>The results for our three official submissions on Task 2 are presented in Table 6 (rows c.1–c.3). The best results in F<sub>class 1+class 2</sub> are achieved with submission 1 (row c.1). The results for the other two submissions, submission 2 and submission 3, are quite similar to the results of submission 1 in both precision and recall (rows c.2–c.3). Adding the features from the ADR lexicon and the Pronoun lexicon did not result in performance improvement on the test set. Our best system is ranked third among the nine teams that participated in this task (rows b.1–b.3).</p>
      <p>Table 6 also shows the results for two baseline classifiers. The first baseline is a classifier that assigns class 2 (possible medication intake) to all instances (row a.1). Class 2 is the majority class among the two positive classes, class 1 and class 2, in the training set. The performance of this baseline is quite low (F<sub>class 1+class 2</sub> = 0.452) since class 2 covers only 36% of the instances in the test set. The second baseline is an SVM classifier trained only on the unigram features (row a.2). The performance of such a simple model is surprisingly high (F<sub>class 1+class 2</sub> = 0.646), only 4.7 percentage points below the top result in the competition.</p>
      <p>[Table 6: results for Task 2. Recovered rows: a. Baselines (a.1 assigning class 2 to all instances, F = 0.452; a.2 SVM-unigrams, F = 0.646); b. Top 3 teams in the shared task (b.1 InfyNLP, b.2 UKNLP, b.3 NRC-Canada).]</p>
      <p>Table 7 shows the performance of our best system (submission 1) when one of the feature groups is removed. In this task, the general textual features (row b) played a bigger role in the overall performance than the domain-specific (row c) or sentiment lexicon (row d) features. Removing this group of features results in a drop of more than 2.5 percentage points in the F-measure, affecting both precision and recall (row b). However, removing any one feature subgroup in this group (e.g., general n-grams, general clusters, general embeddings, etc.) results in only a slight drop or even an increase in the performance (rows b.1–b.4). This indicates that the features in this group capture similar information. Among the domain-specific features, the n-grams generalized over domain terms are the most useful. The model trained without these n-gram features performs almost one percentage point worse than the model that uses all the features (row c.1). The sentiment lexicon features were not helpful (row d).</p>
      <p>[Table 7: results of the ablation experiments for Task 2 (submission 1). Recovered row labels: c. domain-specific features (c.1 domain generalized n-grams, c.2 domain embeddings); punctuation.]</p>
      <p>Our strategy of handling class imbalance through class weights did not prove successful on the test set (even though it resulted in an increase of one point in F-measure in the cross-validation experiments). The model trained with the default class weights of 1 for all classes performs 0.7 percentage points better than the model trained with the class weights selected in cross-validation (row e).</p>
      <p>The difference in how people can express medication intake vs. how they express that they have not taken a medication
can be rather subtle. For example, the expression I need Tylenol indicates that the person has not taken the medication
yet (class 3), whereas the expression I need more Tylenol indicates that the person has taken the medication (class 1).
In still other instances, the word more might not be the deciding factor in whether a medication was taken or not (e.g.,
more Tylenol didn’t help). A useful avenue of future work is to explore the role function words play in determining
the semantics of a sentence, specifically, when they imply medication intake, when they imply the lack of medication
intake, and when they are not relevant to determining medication intake.</p>
    </sec>
    <sec id="sec-12">
      <title>Conclusion</title>
      <p>Our submissions to the 2017 SMM4H Shared Tasks Workshop obtained the first and third ranks in Task 1 and Task 2, respectively. In Task 1, the systems had to determine whether a given tweet mentions an adverse drug reaction. In Task 2, the goal was to label a given tweet with one of the three classes: personal medication intake, possible medication intake, or non-intake. For both tasks, we trained an SVM classifier leveraging a number of textual, sentiment, and domain-specific features. Our post-competition experiments demonstrate that the most influential features in our system for Task 1 were general-domain word embeddings, domain-specific word embeddings, and n-grams generalized over domain terms. Moreover, under-sampling the majority class (non-ADR) to reduce class imbalance to 1:2 proved crucial to the success of our submission. Similarly, n-grams generalized over domain terms improved results significantly in Task 2. On the other hand, sentiment lexicon features were not helpful in either task.</p>
    </sec>
    <sec id="sec-13">
      <title>Appendix</title>
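      <p>We list the top 25 n-gram features (word n-grams and n-grams generalized over domain terms) ranked by mutual information of the presence/absence of n-gram features (f) and class labels (C):</p>
      <p>
        <disp-formula><tex-math>I(f; C) = \sum_{f \in \{\text{present},\, \text{absent}\}} \; \sum_{c \in C} p(f, c) \log \frac{p(f, c)}{p(f)\, p(c)},</tex-math></disp-formula>
      </p>
      <p>where <inline-formula><tex-math>C = \{0, 1\}</tex-math></inline-formula> for Task 1 and <inline-formula><tex-math>C = \{1, 2, 3\}</tex-math></inline-formula> for Task 2.</p>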
      <p>Here, &lt;ADR&gt; represents a word or a phrase from the ADR lexicon; &lt;MED&gt; represents a medication name from
our one-word medication list.</p>
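      <p>For illustration, the mutual information between the presence/absence of one n-gram feature and the class labels can be estimated from counts as sketched below. The function name, the use of the natural logarithm, and the maximum-likelihood probability estimates are assumptions of this example; the choice of logarithm base only rescales the scores and does not change the ranking.</p>
      <preformat><![CDATA[
```python
import math
from collections import Counter


def mutual_information(presence, labels):
    """Estimate I(f; C) from per-tweet feature presence (0/1) and class labels."""
    n = len(labels)
    joint = Counter(zip(presence, labels))
    feature_counts = Counter(presence)
    class_counts = Counter(labels)
    mi = 0.0
    # Cells with zero joint count contribute nothing and never appear here.
    for (f, c), count in joint.items():
        p_fc = count / n
        p_f = feature_counts[f] / n
        p_c = class_counts[c] / n
        mi += p_fc * math.log(p_fc / (p_f * p_c))
    return mi


# A feature that perfectly predicts the class, and one that is independent of it.
informative = mutual_information([1, 1, 0, 0], [1, 1, 0, 0])
uninformative = mutual_information([1, 0, 1, 0], [1, 1, 0, 0])
```
]]></preformat>
      <p>Computing this score for every n-gram and sorting in descending order gives a ranking of the kind reported in this appendix.</p>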
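      <p>Task 1</p>
      <list list-type="simple">
        <list-item><p>14. &lt;MED&gt; makes me</p></list-item>
        <list-item><p>15. gain</p></list-item>
        <list-item><p>16. weight</p></list-item>
        <list-item><p>17. &lt;ADR&gt; and</p></list-item>
        <list-item><p>18. headache</p></list-item>
        <list-item><p>19. made</p></list-item>
        <list-item><p>20. tired</p></list-item>
        <list-item><p>21. rivaroxaban diary</p></list-item>
        <list-item><p>22. withdrawals</p></list-item>
        <list-item><p>23. zomby</p></list-item>
        <list-item><p>24. day</p></list-item>
        <list-item><p>25. &lt;MED&gt; diary</p></list-item>
      </list>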
      <p>Task 2</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Jason</given-names>
            <surname>Lazarou</surname>
          </string-name>
          ,
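          <string-name>
            <given-names>Bruce H</given-names>
            <surname>Pomeranz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Paul N</given-names>
            <surname>Corey</surname>
          </string-name>
          .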
          <article-title>Incidence of adverse drug reactions in hospitalized patients: a meta-analysis of prospective studies</article-title>
          .
          <source>JAMA</source>
          ,
          <volume>279</volume>
          (
          <issue>15</issue>
          ):
          <fpage>1200</fpage>
          -
          <lpage>1205</lpage>
          ,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Patrick</given-names>
            <surname>Waller</surname>
          </string-name>
          and
          <string-name>
            <given-names>Mira</given-names>
            <surname>Harrison-Woolrych</surname>
          </string-name>
          .
          <article-title>Types and sources of data</article-title>
          .
          <source>In An Introduction to Pharmacovigilance</source>
          , pages
          <fpage>37</fpage>
          -
          <lpage>53</lpage>
          . John Wiley &amp; Sons, Ltd,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>N.</given-names>
            <surname>Mittmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R.</given-names>
            <surname>Knowles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Fish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cartotto</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N. H.</given-names>
            <surname>Shear</surname>
          </string-name>
          .
          <article-title>Evaluation of the extent of underreporting of serious adverse drug reactions: the case of toxic epidermal necrolysis</article-title>
          .
          <source>Drug Safety</source>
          ,
          <volume>27</volume>
          (
          <issue>7</issue>
          ):
          <fpage>477</fpage>
          -
          <lpage>487</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Tricco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zarin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Lillie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Pham</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Straus</surname>
          </string-name>
          .
          <article-title>Utility of social media and crowd-sourced data for pharmacovigilance: a scoping review protocol</article-title>
          .
          <source>BMJ Open</source>
          ,
          <volume>7</volume>
          (
          <issue>1</issue>
          ):e013474,
          <year>Jan 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>A.</given-names>
            <surname>Mascolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Scavone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sessa</surname>
          </string-name>
          , G. di
          <string-name>
            <surname>Mauro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Cimmaruta</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Orlando</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Rossi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Sportiello</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Capuano</surname>
          </string-name>
          .
          <article-title>Can causality assessment fulfill the new European definition of adverse drug reaction? A review of methods used in spontaneous reporting</article-title>
          .
          <source>Pharmacological Research</source>
          ,
          <volume>123</volume>
          :
          <fpage>122</fpage>
          -
          <lpage>129</lpage>
          ,
          <year>Sep 2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>R. P.</given-names>
            <surname>Naidu</surname>
          </string-name>
          .
          <article-title>Causality assessment: A brief insight into practices in pharmaceutical industry</article-title>
          .
          <source>Perspectives in Clinical Research</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ):
          <fpage>233</fpage>
          -
          <lpage>236</lpage>
          ,
          <year>Oct 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lardon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Abdellaoui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bellet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Asfari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Souvignet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Texier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Jaulent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Beyens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burgun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bousquet</surname>
          </string-name>
          .
          <article-title>Adverse Drug Reaction Identification and Extraction in Social Media: A Scoping Review</article-title>
          .
          <source>Journal of Medical Internet Research</source>
          ,
          <volume>17</volume>
          (
          <issue>7</issue>
          ):e171,
          <year>Jul 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Abeed</given-names>
            <surname>Sarker</surname>
          </string-name>
          , Azadeh Nikfarjam, and
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>Social media mining shared task workshop</article-title>
          .
          <source>In Proceedings of the Pacific Symposium on Biocomputing</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Majid</given-names>
            <surname>Rastegar-Mojarad</surname>
          </string-name>
          , Ravikumar Komandur Elayavilli,
          <string-name>
            <given-names>Yue</given-names>
            <surname>Yu</surname>
          </string-name>
          , and Hongfang Liu.
          <article-title>Detecting signals in noisy data - can ensemble classifiers help identify adverse drug reaction in tweets?</article-title>
          <source>In Proceedings of the Social Media Mining Shared Task Workshop at the Pacific Symposium on Biocomputing</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dominic</surname>
            <given-names>Egger</given-names>
          </string-name>
          , Fatih Uzdilli, and
          <string-name>
            <given-names>Mark</given-names>
            <surname>Cieliebak</surname>
          </string-name>
          .
          <article-title>Adverse drug reaction detection using an adapted sentiment classifier</article-title>
          .
          <source>In Proceedings of the Social Media Mining Shared Task Workshop at PSB</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Guo</surname>
            <given-names>Haixiang</given-names>
          </string-name>
          , Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and
          <string-name>
            <given-names>Gong</given-names>
            <surname>Bing</surname>
          </string-name>
          .
          <article-title>Learning from classimbalanced data: review of methods and applications</article-title>
          .
          <source>Expert Systems with Applications</source>
          ,
          <volume>73</volume>
          :
          <fpage>220</fpage>
          -
          <lpage>239</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kevin</surname>
            <given-names>Gimpel</given-names>
          </string-name>
          , Nathan Schneider,
          <string-name>
            <surname>Brendan O'Connor</surname>
          </string-name>
          ,
          <string-name>
            <surname>Dipanjan Das</surname>
          </string-name>
          ,
          <string-name>
            <surname>Daniel Mills</surname>
          </string-name>
          , Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and
          <string-name>
            <surname>Noah</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Smith.</surname>
          </string-name>
          <article-title>Part-of-speech tagging for Twitter: Annotation, features, and experiments</article-title>
          .
          <source>In Proceedings of the Annual Meeting of ACL</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Abeed</given-names>
            <surname>Sarker</surname>
          </string-name>
          and
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>Portable automatic text classification for adverse drug reaction detection via multi-corpus training</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          ,
          <volume>53</volume>
          :
          <fpage>196</fpage>
          -
          <lpage>207</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Saif M. Mohammad</surname>
            , Svetlana Kiritchenko, and
            <given-names>Xiaodan</given-names>
          </string-name>
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          .
          <article-title>NRC-Canada: Building the state-of-the-art in sentiment analysis of tweets</article-title>
          .
          <source>In Proceedings of the International Workshop on Semantic Evaluation</source>
          , Atlanta, Georgia,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Saif M. Mohammad</surname>
            , Parinaz Sobhani, and
            <given-names>Svetlana</given-names>
          </string-name>
          <string-name>
            <surname>Kiritchenko</surname>
          </string-name>
          .
          <article-title>Stance and sentiment in tweets</article-title>
          .
          <source>ACM Transactions on Internet Technology</source>
          ,
          <volume>17</volume>
          (
          <issue>3</issue>
          ),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Robert</surname>
            <given-names>Speer</given-names>
          </string-name>
          ,
          <source>Joshua Chin, and Catherine Havasi. ConceptNet 5</source>
          .
          <article-title>5: An open multilingual graph of general knowledge</article-title>
          .
          <source>In Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , pages
          <fpage>4444</fpage>
          -
          <lpage>4451</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Azadeh</surname>
            <given-names>Nikfarjam</given-names>
          </string-name>
          , Abeed Sarker,
          <string-name>
            <surname>Karen</surname>
            <given-names>OConnor</given-names>
          </string-name>
          , Rachel Ginn, and
          <string-name>
            <given-names>Graciela</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          .
          <article-title>Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features</article-title>
          .
          <source>Journal of the American Medical Informatics Association</source>
          ,
          <volume>22</volume>
          (
          <issue>3</issue>
          ):
          <fpage>671</fpage>
          -
          <lpage>681</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Minqing</given-names>
            <surname>Hu</surname>
          </string-name>
          and
          <string-name>
            <given-names>Bing</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>Mining and summarizing customer reviews</article-title>
          .
          <source>In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)</source>
          , pages
          <fpage>168</fpage>
          -
          <lpage>177</lpage>
          , USA,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Amy Beth Warriner, Victor Kuperman, and
          <string-name>
            <given-names>Marc</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          .
          <article-title>Norms of valence, arousal, and dominance for 13,915 English lemmas</article-title>
          .
          <source>Behavior Research Methods</source>
          ,
          <volume>45</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1191</fpage>
          -
          <lpage>1207</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Peter Sheridan Dodds</surname>
          </string-name>
          , Kameron Decker Harris,
          <string-name>
            <surname>Isabel M. Kloumann</surname>
          </string-name>
          , Catherine A.
          <string-name>
            <surname>Bliss</surname>
          </string-name>
          , and
          <string-name>
            <surname>Christopher</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Danforth</surname>
          </string-name>
          .
          <article-title>Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter</article-title>
          .
          <source>PloS One</source>
          ,
          <volume>6</volume>
          (
          <issue>12</issue>
          ):
          <fpage>e26752</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Svetlana</surname>
            <given-names>Kiritchenko</given-names>
          </string-name>
          , Xiaodan Zhu, and
          <string-name>
            <surname>Saif</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mohammad</surname>
          </string-name>
          .
          <article-title>Sentiment analysis of short informal texts</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>50</volume>
          :
          <fpage>723</fpage>
          -
          <lpage>762</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Saif</surname>
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Mohammad</surname>
          </string-name>
          and
          <string-name>
            <surname>Peter D. Turney</surname>
          </string-name>
          .
          <article-title>Crowdsourcing a word-emotion association lexicon</article-title>
          .
          <source>Computational Intelligence</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ):
          <fpage>436</fpage>
          -
          <lpage>465</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          . #
          <article-title>Emotional tweets</article-title>
          .
          <source>In Proceedings of the Conference on Lexical and Computational Semantics (*Sem)</source>
          , pages
          <fpage>246</fpage>
          -
          <lpage>255</lpage>
          , Montre´al, Canada,
          <year>June 2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>