CSaRUS-CNN at AMIA-2017 Tasks 1, 2: Undersampled CNN for text classification

Arjun Magge, MS1, Matthew Scotch, PhD1, Graciela Gonzalez, PhD2
1 Department of Biomedical Informatics, Arizona State University, Scottsdale, AZ, USA; 2 Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA, USA

Abstract

Most practical text classification tasks in natural language processing involve training sets in which the number of training instances belonging to each class is not equal. The performance of a classifier in such cases can be affected by the sampling strategy used in training. In this work, we describe cost-sensitive and random-undersampling variants of convolutional neural networks (CNNs) for classifying texts in imbalanced datasets and analyze their results. The classifier proposed in this paper achieves a maximum F1-score of 0.414, placing 2nd on the ADR dataset, and a maximum F1-score of 0.652, placing 6th on the medication intake dataset.

Introduction

Text classification tasks in natural language processing (NLP) can involve datasets where the number of training instances belonging to each class is not equal. The annotation cost for gold-standard labels (i.e., labels assigned by humans) is largely determined by the number of instances labeled in the dataset. When classes are highly imbalanced, the dataset may not contain enough training examples belonging to the minority class, and this can lead to a drop in classification performance: many classifiers tend to simply assign the majority class to a given instance. Learning from imbalanced data is a well-known problem and has been studied extensively.1 Standard approaches to classifying imbalanced datasets in NLP at the data level involve oversampling and undersampling.2 The most common method for tackling the problem at the algorithmic level is cost-sensitive learning.3 In this work, we experiment with undersampling and cost-sensitive learning in CNN architectures to compensate for class imbalance.

For Task 1, the classification dataset contains tweets mentioning a drug, and the objective is to classify each tweet into two classes: 1) No-ADR: the tweet contains no evidence of an adverse drug reaction (ADR), and 2) ADR: the tweet contains evidence of an ADR. Detecting ADRs from social media and health forum texts has been an intensive area of research for early detection of ADRs and possible interventions.4–8 For Task 2, the dataset contains tweets mentioning a drug, and the objective is to classify each tweet into three classes: 1) Intake: the tweet contains evidence of medication intake, 2) Possible-Intake: the tweet contains evidence to suspect medication intake, and 3) No-Intake: the tweet contains no evidence of medication intake. For additional information about the dataset and its annotations, see Klein et al.9

Method

Input: The datasets for the tasks contained tweet IDs and their respective categorical annotations. The first set of annotations provided for each task was used as the training dataset, and the second set was used as the development/validation set. The original texts were available for only about 40% of the annotations for Task-1 and 60% for Task-2. In Table 1, we show the details of the datasets for both tasks and their respective class distributions.

Table 1: Dataset details for the ADR dataset (Task-1) and the medication intake dataset (Task-2).

            Category            Annotated   Available   Class-1   Class-2   Class-3
    Task-1  Training Set           10,822       4,966     4,407       559         -
            Development Set         4,845       2,178     2,024       154         -
    Task-2  Training Set            8,000       5,244     1,006     1,611     2,627
            Development Set         2,260       1,159       221       374       564

Classifier: For the CNN classifier used in this paper, we implemented our models based on the original CNN architecture proposed by Kim for sentence classification.10 We use this architecture to construct cost-sensitive and random-undersampling variants to tackle the class imbalance problem. The random-undersampling variant (Undersampling-CNN) is constructed by randomly sampling an equal number of instances from each class in every epoch, which means that far fewer training instances are seen per epoch. The cost-sensitive variant (CostSensitive-CNN) instead trains on all instances but weights the classification cost so that errors on the minority class are penalized more heavily. Both strategies are sketched below.
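The two strategies can be summarized in the following minimal Python sketch. The sampling logic follows the description above; the inverse-frequency weighting shown for the cost-sensitive variant is an assumption on our part (it mirrors scikit-learn's "balanced" heuristic), since the paper does not state the exact cost scheme. The function names (undersample_epoch, class_weights) are illustrative, not taken from the authors' code.

    import numpy as np

    def undersample_epoch(labels, rng):
        # Undersampling-CNN: draw an equal number of instances from each
        # class, resampled independently at the start of every epoch.
        classes, counts = np.unique(labels, return_counts=True)
        n_min = counts.min()
        chosen = [rng.choice(np.where(labels == c)[0], size=n_min, replace=False)
                  for c in classes]
        idx = np.concatenate(chosen)
        rng.shuffle(idx)
        return idx

    def class_weights(labels):
        # CostSensitive-CNN (assumed scheme): weight each class inversely
        # to its frequency so that minority-class errors cost more.
        classes, counts = np.unique(labels, return_counts=True)
        w = counts.sum() / (len(classes) * counts)
        return dict(zip(classes.tolist(), w.tolist()))

    rng = np.random.default_rng(0)
    y = np.array([0] * 4407 + [1] * 559)   # Task-1 training distribution (Table 1)
    epoch_idx = undersample_epoch(y, rng)  # 559 instances per class, redrawn each epoch
    weights = class_weights(y)             # e.g. {0: ~0.56, 1: ~4.44}

Under this weighting, one ADR tweet contributes roughly eight times as much to the loss as one No-ADR tweet, which is the intended counterweight to the 4407:559 imbalance.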
For our experiments we use a fixed maximum sentence length of 50 words. Tweets shorter than 50 words are padded with zeros, and sentences with more than 50 words are truncated. As pre-processing steps, we tokenize each tweet and normalize punctuation. For word embeddings, we use the word vectors generated from millions of tweets containing drug names and made available by Sarker and Gonzalez11 for mining health-related data online.

Hyperparameters: For experimentation we use filter sizes in the range of 1 to 5 words, with the number of filters (i.e., model hidden dimensions) in the range 50-150. The best models were obtained with filter sizes of 2, 3, and 4 and 75 filters per size. A softmax cross-entropy function is used to compute the cost for optimization. For optimization, we use the Adam optimizer with a learning rate of 0.001.12 We employ a dropout keep probability of 0.5 during training to prevent overfitting,13 and we apply an L2 regularization rate of 0.001, training for up to 50 epochs. The model with the best performance on the validation/development set is saved and used on the evaluation/test set. The Undersampling-CNN sees fewer training samples per epoch than training on the entire set; hence it had to be trained at half the learning rate (0.0005) and took around 40-50 epochs to arrive at the optimal model, compared to 10-15 epochs for the CNN and CostSensitive-CNN models. Although we could add feature embeddings for each word in the architecture, we do not add any task-specific features. A sketch of the resulting model follows.
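For concreteness, here is a minimal Keras sketch of the model described above. This is not the authors' original implementation: the vocabulary size (VOCAB) and embedding dimension (EMB_DIM) are placeholders, the embedding layer would in practice be initialized from the Twitter word vectors of Sarker and Gonzalez11, and the padding side is an assumption since the paper states only that shorter tweets are zero-padded.

    import tensorflow as tf
    from tensorflow.keras import layers, regularizers

    MAX_LEN, N_CLASSES = 50, 2        # 50-word inputs; 2 classes for Task-1
    VOCAB, EMB_DIM = 20000, 400       # placeholders, not taken from the paper

    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    emb = layers.Embedding(VOCAB, EMB_DIM)(inp)  # load pre-trained vectors here
    pooled = []
    for size in (2, 3, 4):                       # best filter sizes reported
        conv = layers.Conv1D(75, size, activation="relu",
                             kernel_regularizer=regularizers.l2(0.001))(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Dropout(0.5)(layers.Concatenate()(pooled))  # keep probability 0.5
    out = layers.Dense(N_CLASSES, activation="softmax",
                       kernel_regularizer=regularizers.l2(0.001))(x)

    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="sparse_categorical_crossentropy")

    # Zero-pad/truncate tokenized tweets to MAX_LEN (padding side assumed):
    # X = tf.keras.preprocessing.sequence.pad_sequences(
    #         seqs, maxlen=MAX_LEN, padding="post", truncating="post")
    # Cost-sensitive variant: pass the weights from the earlier sketch:
    # model.fit(X, y, epochs=50, class_weight=weights)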
Results

In Table 2, we show the results for both tasks. For Task-1, the CostSensitive-CNN model achieved the best score. As described earlier, the Undersampling-CNN takes longer to train through all of the randomized training samples of the majority class; however, it did not show an improvement over the CNN or the CostSensitive-CNN model.

Table 2: Performance comparison of the classifiers used in this paper. For Task-1, the results are for the ADR class. For Task-2, the results are micro-averaged scores for classes 1 and 2. Undersampling-CNN and CostSensitive-CNN were not computed or evaluated for Task-2.

                                        Validation               Evaluation
            Implementation           P      R      F1        P      R      F1
    Task-1  CNN                    0.350  0.490  0.409     0.396  0.431  0.412
            Undersampling-CNN      0.435  0.352  0.389     0.467  0.357  0.404
            CostSensitive-CNN      0.493  0.393  0.438     0.437  0.393  0.414
    Task-2  CNN                    0.692  0.625  0.656     0.696  0.601  0.645

Conclusion and Future Work

In this work we evaluate CNN classifiers for detecting ADRs and medication intake as part of two shared tasks at AMIA-2017. The classifiers presented in this work placed 2nd and 6th in Tasks 1 and 2, respectively. As improvements to the proposed classifiers, we would like to experiment with further variants of cost-sensitive training in CNN architectures for tackling class imbalance, as well as strategies for introducing controlled synthetic sentence variants to oversample the minority class.

References

1. Haibo He and Edwardo A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
2. Shoushan Li, Zhongqing Wang, Guodong Zhou, and Sophia Yat Mei Lee. Semi-supervised learning for imbalanced sentiment classification. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), volume 22, page 1826, 2011.
3. Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2009.
4. Xiao Liu and Hsinchun Chen. AZDrugMiner: an information extraction system for mining patient-reported adverse drug events in online patient forums. In International Conference on Smart Health, pages 134–150. Springer, 2013.
5. Rachel Ginn, Pranoti Pimpalkhute, Azadeh Nikfarjam, Apurv Patki, Karen O'Connor, Abeed Sarker, Karen Smith, and Graciela Gonzalez. Mining Twitter for adverse drug reaction mentions: a corpus and classification benchmark. In Proceedings of the Fourth Workshop on Building and Evaluating Resources for Health and Biomedical Text Processing, 2014.
6. Abeed Sarker, Karen O'Connor, Rachel Ginn, Matthew Scotch, Karen Smith, Dan Malone, and Graciela Gonzalez. Social media mining for toxicovigilance: automatic monitoring of prescription medication abuse from Twitter. Drug Safety, 39(3):231–240, 2016.
7. Abeed Sarker and Graciela Gonzalez. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. Journal of Biomedical Informatics, 53:196–207, 2015.
8. Abeed Sarker, Rachel Ginn, Azadeh Nikfarjam, Karen O'Connor, Karen Smith, Swetha Jayaraman, Tejaswi Upadhaya, and Graciela Gonzalez. Utilizing social media data for pharmacovigilance: a review. Journal of Biomedical Informatics, 54:202–212, 2015.
9. Ari Klein, Abeed Sarker, Masoud Rouhizadeh, Karen O'Connor, and Graciela Gonzalez. Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. BioNLP 2017, pages 136–142, 2017.
10. Yoon Kim. Convolutional neural networks for sentence classification. EMNLP, 2014.
11. Abeed Sarker and Graciela Gonzalez. A corpus for mining drug-related knowledge from Twitter chatter: language models and their utilities. Data in Brief, 10:122–131, 2017.
12. Diederik Kingma and Jimmy Ba. Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
13. Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.