Ensemble of Convolutional Neural Networks for Medicine Intake Recognition in Twitter

Kai Hakala 1,2,*, Farrokh Mehryary 1,2,*, Hans Moen 1,3, Suwisa Kaewphan 1,2,4, Tapio Salakoski 1,4, Filip Ginter 1

1 Department of Future Technologies, University of Turku, Turku, Finland
2 University of Turku Graduate School, University of Turku, Turku, Finland
3 Department of Nursing Science, University of Turku, Turku, Finland
4 Turku Centre for Computer Science, Turku, Finland

* These authors contributed equally.

Abstract

We present the results from our participation in the 2nd Social Media Mining for Health Applications Shared Task – Task 2. The goal of this task is to develop systems capable of recognizing mentions of medication intake in Twitter. Our best performing classification system is an ensemble of neural networks with features generated by word- and character-level convolutional neural network channels and a condensed, weighted bag-of-words representation. The system achieves a relatively strong performance, with an F-score of 66.3 according to the official evaluation, placing it 5th in the shared task, close to the best systems created by the other participating teams.

Introduction

Pharmacovigilance is the science of detecting, assessing and preventing drug-related adverse effects. A central focus and challenge is to detect adverse drug reactions (ADRs), which are undesired and harmful effects resulting from taking medications. Traditionally, ADRs are identified and recorded by health care professionals, part of whose work is weighing the risks and benefits of using medications. However, the number of documented ADRs is limited, and it is believed that some of the rarer ADRs have not yet been discovered. As an alternative approach, pharmacovigilance has turned to social media, which represents a valuable forum for drug safety surveillance where text-mining techniques can be applied to extract potentially ADR-related events from a large population.

Our team participated in Task 2 of the 2nd Social Media Mining for Health Applications Shared Task at AMIA 2017. The goal of this shared task was to develop systems capable of classifying mentions of medication intake in tweets. Each of the provided tweets was to be assigned one of three classes: personal medication intake, possible medication intake or non-intake. This is an important preliminary task for extracting ADRs from social media, since it can filter out the majority of tweets that mention drugs without any indication of personal intake. We participated with classifiers based on support vector machines (SVMs) and neural networks (NNs), as described in more detail in the Method section.

Data

The organizers provided a training dataset with a manually assigned label (intake, possible intake or non-intake) for each tweet. The intake class is defined as clearly expressing a personal intake of medication, whereas possible intake is more ambiguous, yet still suggests an intake by the tweet writer. The non-intake class includes the rest of the tweets, all of which mention a drug but refer to an intake by another person or discuss drugs in general. Approximately 50% of the data belongs to the non-intake class, whereas the intake and possible intake classes constitute 19% and 31% of the data, respectively. Due to the data sharing restrictions of Twitter, the organizers only provided the IDs of the tweets instead of their actual content.
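The tweet texts must therefore be fetched ("hydrated") by ID through the Twitter API. The paper does not specify the tooling used for this; the following is a minimal sketch of one way to do it, assuming the tweepy library (version 3.x) and valid API credentials, with all key strings being placeholders.

```python
# Hypothetical hydration script: fetch tweet texts for the released IDs.
# Assumes tweepy 3.x and valid credentials; all key strings are placeholders.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

def hydrate(tweet_ids):
    """Fetch tweet texts in batches of 100, the lookup endpoint's limit.

    Deleted or protected tweets are silently absent from the response,
    which is why part of the annotated data could no longer be obtained.
    """
    texts = {}
    for i in range(0, len(tweet_ids), 100):
        batch = tweet_ids[i:i + 100]
        for status in api.statuses_lookup(batch, tweet_mode="extended"):
            texts[status.id] = status.full_text
    return texts
```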
Since we started working on this task well after the data was released, we were only able to obtain the content of 7,444 of the 8,000 annotated tweets, as some of the tweets had already been removed by their authors; i.e., we had 7% less training data than teams that had been involved in the shared task from the beginning. The organizers also provided a separate development dataset consisting of 2,260 annotated tweets, of which we lost roughly the same proportion. In the Results and Discussion section we evaluate the impact of the lost data in more detail.

Method

As our baseline approach, we form term frequency–inverse document frequency (TF–IDF) weighted sparse bag-of-words (BOW) representations for all given tweets [1]. These representations are constructed not only for single tokens, but also for token bigrams and trigrams as well as character n-grams of lengths 1 to 4. The representations are then fed as features to a linear SVM classifier [2]; a minimal sketch of this pipeline is given at the end of this section. The regularization parameter is selected to optimize the micro-averaged F-score of the intake and possible intake classes, the official evaluation metric, on the development set. For the final submission in the shared task we merge the training and development sets and train the system on the combined dataset.

As the sparse representations do not generalize well to unseen vocabulary, we also test various NN approaches on this task. The final system is based on an ensemble of convolutional neural networks (CNNs) [3] and utilizes both word and character information. Each tweet is represented as two separate sequences, words and characters, which are processed with separate convolutional channels. Each element in these sequences is represented with a latent feature vector, i.e. an embedding. The word embeddings are initialized using word2vec [4] trained on approximately 1 billion drug-related tweets, as provided by Sarker and Gonzalez [5]. We also tested GloVe vectors trained on 2B general-domain tweets [6], but these experiments resulted in decreased performance. The character embeddings are initialized randomly, and the network is allowed to backpropagate into both the word and character embeddings.

The convolutional kernels are applied to the two sequences using sliding windows. The outputs are subsequently max-pooled and concatenated. The concatenated vectors are further fed through two densely connected layers, the latter having an output dimensionality corresponding to the number of labels in the dataset and a softmax activation. In addition to the convolutional layers, we utilize the same TF–IDF weighted sparse vector representations as in the baseline method. As these representations have a dimensionality in the order of hundreds of thousands, we first densify them into 4000-dimensional vectors using truncated singular value decomposition (SVD) [7] and concatenate these vectors with the CNN outputs. The dimensionality reduction is performed mainly for computational reasons, since the approach was prototyped on a consumer-grade GPU with a limited amount of memory. Projecting the sparse vectors to 4K dimensions preserves 74% of the variance in the data and may have caused a minor performance loss.

The network is trained on the official training data using the Nadam optimization algorithm. The network is regularized with a dropout rate of 0.2 after the first dense layer; no explicit regularization is applied to the convolutional part of the network. Sketches of the baseline pipeline and of the network architecture are given below.
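The baseline can be summarized with the following sketch, assuming scikit-learn. The toy data, the label strings and the grid of C values are placeholders for illustration, not the exact values used in our experiments.

```python
# A sketch of the TF-IDF + linear SVM baseline, assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline, make_union
from sklearn.svm import LinearSVC

# Toy stand-in data; the real system uses the ~7.4K/2.2K annotated tweets.
train_texts = ["i just took ibuprofen for my headache",
               "thinking i should take some xanax",
               "tylenol was recalled again"]
train_labels = ["intake", "possible_intake", "non_intake"]
dev_texts, dev_labels = train_texts, train_labels  # placeholder only

# TF-IDF weighted word uni-/bi-/trigrams and character 1-4-grams.
features = make_union(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 3)),
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
)

# Select the regularization parameter C on the development set, optimizing
# the official metric: micro-averaged F-score over the two intake classes.
best_c, best_f = None, -1.0
for c in (0.01, 0.1, 1.0, 10.0):  # illustrative grid
    model = make_pipeline(features, LinearSVC(C=c))
    model.fit(train_texts, train_labels)
    pred = model.predict(dev_texts)
    f = f1_score(dev_labels, pred, average="micro",
                 labels=["intake", "possible_intake"])
    if f > best_f:
        best_c, best_f = c, f
```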
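The network architecture can be sketched in Keras as follows, using the hyperparameter values from Table 1. The vocabulary sizes, sequence lengths and the pre-trained embedding matrix (w2v_matrix) are placeholders, and only a single network of the ensemble is shown.

```python
# A sketch of one two-channel CNN of the ensemble, assuming tf.keras.
import numpy as np
from tensorflow.keras import layers, models, optimizers

w2v_matrix = np.random.rand(100000, 400)  # stand-in for pre-trained word2vec

def conv_channel(seq_input, emb, window_sizes, n_filters):
    # One convolutional channel: embed the sequence, convolve with several
    # window sizes, max-pool each feature map over time and concatenate.
    x = emb(seq_input)
    pooled = [layers.GlobalMaxPooling1D()(
                  layers.Conv1D(n_filters, w, activation="tanh")(x))
              for w in window_sizes]
    return layers.concatenate(pooled)

words = layers.Input(shape=(50,), name="words")        # placeholder max length
chars = layers.Input(shape=(280,), name="chars")       # placeholder max length
tfidf = layers.Input(shape=(4000,), name="tfidf_svd")  # SVD-compressed BOW,
                                                       # e.g. TruncatedSVD(4000)

word_emb = layers.Embedding(100000, 400, weights=[w2v_matrix])  # word2vec init
char_emb = layers.Embedding(200, 25)                            # random init

word_feats = conv_channel(words, word_emb, window_sizes=(2, 4), n_filters=200)
char_feats = conv_channel(chars, char_emb, window_sizes=(2, 3, 4, 5),
                          n_filters=50)

# Concatenate the CNN outputs with the densified BOW vectors, then apply the
# two dense layers, with dropout 0.2 after the first one.
merged = layers.concatenate([word_feats, char_feats, tfidf])
hidden = layers.Dropout(0.2)(layers.Dense(400, activation="tanh")(merged))
output = layers.Dense(3, activation="softmax")(hidden)  # 3 labels

model = models.Model([words, chars, tfidf], output)
model.compile(optimizer=optimizers.Nadam(), loss="categorical_crossentropy")
```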
Training is stopped once the performance on the development set, measured with the official evaluation metric, no longer improves. Table 1 gives the comprehensive list of used hyperparameters.

Hyperparameter                                   | Optimal value | Tested values
-------------------------------------------------|---------------|---------------------------
Character embedding dimensionality               | 25            | [25, 50, 75, 100]
Word embedding dimensionality                    | 400           | pre-trained
Character CNN, number of filters per window size | 50            | [50, 100, 150, 200]
Character CNN, window sizes                      | [2,3,4,5]     | [2,3,4,5]
Word CNN, number of filters per window size      | 200           | [100, 200, 300]
Word CNN, window sizes                           | [2,4]         | any subset of [2,3,4,5]
Dimensionality of first dense layer              | 400           | [100, 200, 300, 400, 500]
Dropout rate                                     | 0.2           | [0, 0.2, 0.5]
Activation functions                             | tanh          | [ReLU, tanh, sigmoid]

Table 1: The optimal and tested hyperparameter values of the CNN-based system.

Training the network on this dataset resulted in relatively large variance in the measured performance, caused by the random initialization of the weights. We therefore stabilize the system by training 15 networks, identical apart from their initial random weights. We then select the optimal subset of these networks, as measured on the development set, for the final system; the final predictions are created by summing the confidences of all selected networks and choosing the label with the highest overall confidence (a sketch of this step is given at the end of this section). The final system included a subset of 6 of the 15 networks. We note that this approach may overfit to the development set.

Other NN architectures tested during this shared task include various BiLSTM and attention-based networks [8, 9], but none of these experiments resulted in better performance than the CNN architecture described above. However, due to the time limits of the shared task, we cannot rule out the possibility of these approaches being competitive as well.

We also experimented with (pre-)tuning the utilized word embeddings to this specific classification task, in an attempt to give the word-level CNN a better starting point for training. This was done using the principles underlying the random indexing (RI) method [10]. Unique index vectors are first assigned to each of the three classes (intake, possible intake and non-intake), and empty context vectors are assigned to each word in the dataset. When traversing the training set, each word in each tweet has the index vector associated with the tweet's class added to its context vector. After training, the resulting word context vectors are normalized to unit length and summed with the corresponding word2vec-generated embeddings [5] (also normalized to unit length). To give the RI signal only a modest impact on the combined vectors, the context vectors are first multiplied by a weight of 0.3 (see the sketch below). However, this approach did not result in a positive performance impact compared to using the original word2vec embeddings.

We also tested the potential benefit of including part-of-speech (POS) tags, produced using the Twitter NLP toolkit [11]. The sequences of POS tags were treated in a similar fashion to the word and character sequences. Although the benefits of POS tagging seem intuitive, as for instance past-tense verbs are twice as common in the intake class as in the possible intake class, we did not see any increase in performance when POS tags were utilized.
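The ensembling step can be sketched as follows. The paper does not state how the subset of networks was selected, so greedy forward selection is shown here as one plausible strategy; each network is assumed to expose a predict() method returning per-class softmax confidences, and score() is assumed to implement the official metric.

```python
# A sketch of ensembling by summed confidences, with a hypothetical greedy
# subset selection on the development set.
import numpy as np

def ensemble_predict(nets, inputs):
    # Sum the softmax confidences over the networks and pick the label
    # with the highest overall confidence.
    summed = sum(net.predict(inputs) for net in nets)
    return summed.argmax(axis=1)

def select_subset(nets, dev_inputs, dev_labels, score):
    # Greedy forward selection: repeatedly add the network that most improves
    # the development set score. Note that this may overfit to the dev data.
    chosen, remaining, best = [], list(nets), -1.0
    while remaining:
        gains = [(score(dev_labels, ensemble_predict(chosen + [n], dev_inputs)), n)
                 for n in remaining]
        f, n = max(gains, key=lambda t: t[0])
        if f <= best:
            break
        chosen.append(n)
        remaining.remove(n)
        best = f
    return chosen
```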
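The RI-style tuning described above can be summarized with the sketch below. Gaussian index vectors are used here for simplicity (classic RI uses sparse ternary vectors), and w2v is assumed to be a mapping from words to their pre-trained 400-dimensional vectors.

```python
# A sketch of the RI-style (pre-)tuning of word embeddings.
import numpy as np

def tune_embeddings(tweets, labels, w2v, dim=400, weight=0.3, seed=0):
    rng = np.random.default_rng(seed)
    # One index vector per class; Gaussian here as a simplification.
    index = {c: rng.standard_normal(dim) for c in set(labels)}
    # Each word accumulates the index vectors of the classes of the tweets
    # it occurs in.
    context = {}
    for tweet, label in zip(tweets, labels):
        for word in tweet.split():
            context.setdefault(word, np.zeros(dim))
            context[word] += index[label]
    # Add the unit-normalized, down-weighted context vector to the
    # unit-normalized word2vec vector.
    tuned = {}
    for word, vec in w2v.items():
        v = vec / np.linalg.norm(vec)
        if word in context:
            c = context[word]
            v = v + weight * (c / np.linalg.norm(c))
        tuned[word] = v
    return tuned
```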
Results and Discussion

We measure the performance of our systems using the micro-averaged F-score of the intake and possible intake classes, following the official evaluation, and conduct all our experiments on the official development set. However, the reported development set results are not directly comparable with other systems, as we only had access to a subset of the original data (see the Data section). The test set results are as reported by the organizers and are thus comparable to other systems.

The overall performance of our baseline (i.e. SVM) and CNN-based systems is relatively strong, with F-scores of 69.6 and 72.7 on the development set, respectively (see Table 2). We suspect the main advantage of the CNN approach to be the generalizability of the word embeddings, which leads to the 3.1pp improvement in F-score. We also briefly tested a nonlinear multilayer perceptron with the same BOW features as used with the SVM; it slightly outperformed the linear SVM model but was not able to outperform our CNN-based system. Thus, model complexity alone does not explain the performance difference between the SVM and CNN approaches.

System  | Class           | Dev P | Dev R | Dev F | Test P | Test R | Test F
--------|-----------------|-------|-------|-------|--------|--------|-------
SVM     | Intake          | 70.5  | 64.5  | 67.4  |   -    |   -    |   -
SVM     | Possible intake | 73.3  | 68.6  | 70.9  |   -    |   -    |   -
SVM     | Overall         | 72.3  | 67.0  | 69.6  | 69.2   | 60.1   | 64.3
CNN     | Intake          | 70.9  | 71.3  | 71.1  |   -    |   -    |   -
CNN     | Possible intake | 76.3  | 71.1  | 73.6  |   -    |   -    |   -
CNN     | Overall         | 74.2  | 71.2  | 72.7  | 70.1   | 63.0   | 66.3
InfyNLP | Overall         |   -   |   -   |   -   | 72.5   | 66.4   | 69.3

Table 2: Precision (P), recall (R) and F-score (F) of our SVM and CNN-based systems. The development set results are measured with our own evaluation, whereas the test set scores are as reported by the organizers. Class-specific performance was not evaluated by the organizers and has thus been left out of the table. For comparison, we include the results of the best performing team, InfyNLP.

An unexpected observation is that the intake class seems to be harder to predict than possible intake, although eyeballing the data suggests otherwise and the annotation guidelines provide a more precise definition for the intake class. For the CNN-based system, precision and recall are rather well balanced, so no performance improvements could have been gained by further trading one off against the other.

The test set results follow the same pattern as the development set evaluation: the SVM and CNN systems reach F-scores of 64.3 and 66.3, respectively. It thus seems that either the test set is somewhat harder than the development set or both systems overfit the development set to a similar degree, even though the CNN ensemble could have been expected to overfit more. According to the official evaluation, our best system trails the winning system by 3pp in F-score, the difference being roughly the same in both precision and recall. This places our system in the 5th position in the shared task. Inspecting the confusion matrices shows that our classifiers tend to confuse the intake class with the possible intake and non-intake classes equally often, whereas possible intake is more often confused with the non-intake class.

As we only had access to partial training data, we try to estimate how much the performance of the systems could have been improved with the missing data. To accomplish this, we train the CNN-based system with different subsets of the training data, starting from 5K training examples and incrementally increasing the size in steps of 300 up to all the training data available to us. After every increment we evaluate the system's performance on the development set. To reduce the variance caused by different initial random weights, we train 5 networks with each subset of the training data and calculate the mean performance for each subset. A sketch of this analysis is given below.
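The learning-curve analysis can be sketched as follows, assuming a hypothetical train_and_score() helper that trains one CNN on the given subset and returns its development set F-score.

```python
# Learning-curve sketch: train on growing subsets, average over 5 random
# initializations, and fit a line to estimate the value of the ~7% of
# annotated tweets we could not download. train_and_score() is hypothetical.
import numpy as np

def learning_curve(train_texts, train_labels, train_and_score, full_size=8000):
    sizes = list(range(5000, len(train_texts) + 1, 300))
    mean_scores = []
    for n in sizes:
        runs = [train_and_score(train_texts[:n], train_labels[:n], seed=s)
                for s in range(5)]  # 5 differently initialized networks
        mean_scores.append(np.mean(runs))
    # Fit F-score ~ slope * n + intercept; np.polyfit returns the
    # coefficients from the highest degree down.
    slope, intercept = np.polyfit(sizes, mean_scores, deg=1)
    # Extrapolate to the full training set released by the organizers.
    return slope * (full_size - len(train_texts))
```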
Fitting a linear regression to the resulting measurements (Figure 1) shows that, in this region, the learning curve is fairly linear and decent performance improvements can still be gained by adding more training data. Assuming a performance increase equal to the slope of the fitted regression line, having the full training dataset would have increased our performance by 0.7pp in F-score, placing our system close to the top 3 teams in the shared task.

Figure 1: Influence of the number of training examples on the performance of the CNN system, as evaluated on the development set.

As most of the approaches we tested, as well as the systems created by other teams, resulted in roughly the same performance level, we wanted to assess the theoretical performance limit for this task. To this end, we manually annotated a random subset of 100 tweets from the development set and evaluated the annotations against the gold standard. Surprisingly, our manual annotations reached an F-score of only 59.3, notably lower than the developed systems and than the official inter-annotator agreement would suggest [12]. This indicates that the task is complex even for humans and that a deep understanding of the annotation guidelines is required for high-quality annotations.

Conclusions and Future Work

We have shown that strong results in detecting tweets describing personal medication intake can be achieved using convolutional neural networks and word embeddings. However, more traditional methods relying on bag-of-words features and linear classifiers also deliver competitive performance. Considering that such a system can be implemented in less than an hour with existing tools and libraries, and is easily interpretable, the simpler methods may be the more practical choice in many use cases.

Since the amount of training data for this task is fairly limited, we plan to explore various approaches for pretraining NN classifiers in future work. The goal is to find a suitable domain-related proxy task for initializing the network before the actual training. Such a task could be, for instance, sentiment detection, as many tweets expressing drug intake also express a sentiment about the condition of the user or the effects of the drug.

Acknowledgements

This work was supported by the ATT Tieto käyttöön grant and the Tekes – Räätäli project (no. 644/31/2015). Computational resources were provided by CSC – IT Center for Science Ltd., Espoo, Finland.

References

1. Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Information Processing & Management. 1988;24(5):513–523.
2. Joachims T. Making large-scale SVM learning practical. In: Schölkopf B, Burges C, Smola A, editors. Advances in Kernel Methods – Support Vector Learning. Cambridge, MA: MIT Press; 1999. p. 169–184.
3. LeCun Y, Bengio Y, et al. Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks. 1995;3361(10).
4. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26; 2013. p. 3111–3119.
5. Sarker A, Gonzalez G. A corpus for mining drug-related knowledge from Twitter chatter: language models and their utilities. Data in Brief. 2017;10:122–131.
6. Pennington J, Socher R, Manning C. GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); 2014. p. 1532–1543.
7. Halko N, Martinsson PG, Tropp JA. Finding structure with randomness: stochastic algorithms for constructing approximate matrix decompositions; 2009.
8. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Computation. 1997;9(8):1735–1780.
9. Luong MT, Pham H, Manning CD. Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025. 2015.
10. Kanerva P, Kristofersson J, Holst A. Random indexing of text samples for latent semantic analysis. In: Proceedings of the 22nd Annual Conference of the Cognitive Science Society. Philadelphia, PA, USA; 2000. p. 1036.
11. Owoputi O, O'Connor B, Dyer C, Gimpel K, Schneider N, Smith NA. Improved part-of-speech tagging for online conversational text with word clusters. In: Proceedings of NAACL-HLT. Association for Computational Linguistics; 2013.
12. Klein A, Sarker A, Rouhizadeh M, O'Connor K, Gonzalez G. Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. In: BioNLP 2017. Vancouver, Canada: Association for Computational Linguistics; 2017. p. 136–142.