Overview of the Second Social Media Mining for Health (SMM4H) Shared Tasks at AMIA 2017

Abeed Sarker, Ph.D., Graciela Gonzalez-Hernandez, Ph.D.
Health Language Processing Laboratory, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA

Abstract

The volume of data encapsulated within social media continues to grow, and, consequently, there is growing interest in developing effective systems that can convert this data into usable knowledge. In recent years, initiatives have been taken to enable and promote the use of knowledge derived from social media for health-related tasks. These initiatives include the development of data mining systems and the preparation of datasets that can be used to train such systems. The overarching focus of the SMM4H shared tasks is to release annotated, social media-based, health-related datasets to the research community, and to compare the performances of distinct natural language processing and machine learning systems on tasks involving these datasets. The second execution of the SMM4H shared tasks comprised three subtasks involving annotated user posts from Twitter (tweets): (i) automatic classification of tweets mentioning an adverse drug reaction (ADR), (ii) automatic classification of tweets containing reports of first-person medication intake, and (iii) automatic normalization of ADR mentions to MedDRA concepts. A total of 15 teams participated and 55 system runs were submitted. The best performing systems for tasks 2 and 3 outperformed the current state-of-the-art systems.

Introduction

The second execution of the SMM4H shared tasks built on the success of the first shared task workshop1, which was held at the Pacific Symposium on Biocomputing (PSB), 2016. In line with the previous shared task, the data comprised medication-mentioning posts from Twitter, which were retrieved using the Twitter public streaming API2. We designed and provided annotated data for three tasks, and the annotated data were made publicly available for download. The performances of participating systems were compared on blind evaluation sets for each task.
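As noted above, the underlying data were collected by matching medication names against the public tweet stream. The following minimal sketch illustrates that style of keyword-based filtering; the medication lexicon and the tweet source are hypothetical stand-ins for illustration only, not the actual SMM4H collection pipeline.

```python
# Illustrative sketch only: the keyword list and tweet source below are
# hypothetical stand-ins, not the actual SMM4H collection pipeline.
MEDICATION_KEYWORDS = {"seroquel", "paxil", "humira"}  # hypothetical lexicon

def mentions_medication(tweet_text: str) -> bool:
    """Return True if the tweet contains any tracked medication name."""
    tokens = tweet_text.lower().split()
    return any(keyword in tokens for keyword in MEDICATION_KEYWORDS)

def filter_stream(tweets):
    """Keep only the medication-mentioning tweets from an iterable of texts."""
    return [text for text in tweets if mentions_medication(text)]

print(filter_stream(["Day 3 on seroquel and so groggy", "Great game last night!"]))
# -> ['Day 3 on seroquel and so groggy']
```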
Shared Task Design

The overall shared task consisted of three independent tasks/subtasks, and teams could participate in one or multiple tasks. From the perspective of text mining, the first two tasks focused on text classification and the third task focused on concept normalization. Manually annotated training data for the three tasks were made available to the participants in May 2017, and unlabeled evaluation data were released in September 2017. Evaluations of participant submissions were conducted from 5th to 12th September. In total, 15 teams participated in the shared tasks and 55 system runs were accepted from them (maximum of three submissions per team per task). We received 24 submissions for task 1, 26 for task 2 and 5 for task 3. Participating teams were invited to submit system descriptions of their approaches to the tasks; teams participating in multiple tasks submitted a single system description. Each system description was peer reviewed by at least one reviewer. Nine system descriptions were accepted for inclusion in the SMM4H workshop proceedings, including one that was accepted as a full paper at the workshop after undergoing peer review by two reviewers. We provide descriptions of the three tasks and the associated data in the following sections/subsections.

Task Descriptions

Tasks

The primary goal of the SMM4H shared tasks is to promote community-driven development and evaluation of systems focusing on social media-based health data. This year's tasks involved medication-mentioning user posts from Twitter. We included two tasks from the previous execution at PSB and one new task. Outlines of the tasks are as follows:

(i) Automatic classification of ADR-mentioning tweets. This is a binary text classification task in which systems were required to predict whether or not a tweet mentions an ADR. Such a system is crucial for active surveillance of ADRs from social media data because most of the medication-related chatter in the domain, including that on Twitter, is noise. This task was also part of the first execution of the SMM4H shared tasks; further details can be found in our past publication3.

(ii) Automatic classification of medication intake-mentioning posts. This is a three-class text classification task in which each medication-mentioning tweet is categorized into one of three classes: definite intake (the user presents clear evidence of personal consumption), possible intake (it is likely that the user consumed the medication, but the evidence is unclear), and no intake (there is no evidence that the user consumed the medication). This task was new in the 2017 SMM4H shared tasks; further details can be found in our recent publication4.

(iii) Normalization of ADR mentions. The goal of this task is to normalize different natural language expressions of the same ADR concept to standard concept IDs. This is a particularly challenging task, and although it was proposed in the first execution of the shared tasks, there were no participants.

To facilitate the shared task, we made large annotated Twitter datasets available. The overall shared task was designed to capitalize on the interest in social media mining and to appeal to a diverse set of researchers working on distinct topics such as natural language processing, biomedical informatics, and machine learning. The subtasks presented a number of interesting challenges, including the noisy nature of the data, the informal language of the user posts, misspellings, and data imbalance. We provide details of the data used for each of the three abovementioned tasks, and the tasks themselves, in the following subsection.

Data

The datasets made available for the shared tasks were collected from Twitter using the public streaming API. The annotated datasets provided as training sets had already been made available to the public with our prior publications3,4; only task 3 included new, previously unpublished training data.

Task 1: ADR Classification. Participants were provided with a training/development set containing tweets that were annotated in a binary fashion to indicate the presence or absence of ADRs. Initially, a total of 10,822 annotated tweets were made available*. Later, an additional 4,895 tweets (the previous shared task's evaluation set) were released in the same fashion to active participants. The evaluation set consisted of 9,961 tweets. The per-class distributions of the tweets in the three sets are shown in Table 1. The evaluation metric for this task was the F-score for the ADR class, since the primary intent of this task is to be able to filter ADR-mentioning tweets out of large amounts of noise.

* Due to Twitter's privacy policy, the actual tweets were not shared publicly. Instead, we made available the TweetIDs and UserIDs for the tweets along with a download script, which can be used to retrieve the tweets that remain publicly available.
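To make the task 1 metric concrete, the following is a minimal sketch of how the ADR-class precision, recall and F-score could be computed, assuming scikit-learn is available; the labels are illustrative, and the official evaluation script may differ in its details.

```python
# Sketch of the task 1 metric: F-score computed over the ADR class only.
# Assumes scikit-learn; labels are illustrative (1 = ADR, 0 = non-ADR).
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 0, 0, 1, 0, 1]  # hypothetical gold annotations
y_pred = [1, 0, 1, 0, 0, 1]  # hypothetical system predictions

precision, recall, f_score, _ = precision_recall_fscore_support(
    y_true, y_pred, pos_label=1, average="binary"
)
print(f"ADR class: P={precision:.3f} R={recall:.3f} F={f_score:.3f}")
```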
Table 1. Training and evaluation datasets for task 1 of the SMM4H shared tasks.

Set          Total Tweets   ADR Class   Non-ADR Class
Training 1   10,822         1,239       9,583
Training 2   4,895          367         4,528
Evaluation   9,961          771         9,190

Task 2: Medication Intake Classification. Participants were provided with tweets that had been manually categorized into three classes: definite intake, possible intake and no intake. As in task 1, data were released in three phases: initially, 8,000 annotated tweets were released, followed by an additional 2,260 tweets for active participants. The evaluation set consisted of 7,513 tweets. The per-class distributions of the tweets are shown in Table 2. For this task, the evaluation metric was the micro-averaged F-score over the definite intake and possible intake classes. This metric was chosen because tweets belonging to these two classes are of interest to social media-based drug safety surveillance systems, while the no intake class primarily represents noise.

Table 2. Training and evaluation datasets for task 2 of the SMM4H shared tasks.

Set          Total Tweets   Definite Intake   Possible Intake   No Intake
Training 1   8,000          1,528             2,502             3,970
Training 2   2,260          424               717               1,119
Evaluation   7,513          1,731             2,697             3,085

Task 3: Adverse Drug Reaction Mention Normalization. The training data consisted of ADR mentions mapped to MedDRA (Medical Dictionary for Regulatory Activities)† Preferred Terms (PTs). The training set consisted of 6,650 phrases mapped to 472 PTs (14.09 mentions per concept on average), and the test set consisted of 2,500 mentions mapped to 254 PTs. The evaluation metric for this task was accuracy, i.e., the number of correctly identified MedDRA PTs divided by the total number of instances in the evaluation set.

† Available at: https://www.meddra.org/.
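For the other two tasks, the following sketch shows how the corresponding metrics could be computed, again assuming scikit-learn; the class labels and MedDRA codes are illustrative, and the official evaluation scripts may differ.

```python
# Sketch of the task 2 and task 3 metrics, assuming scikit-learn.
from sklearn.metrics import accuracy_score, f1_score

# Task 2: micro-averaged F-score restricted to the two classes of interest
# (1 = definite intake, 2 = possible intake; 3 = no intake is ignored).
y_true = [1, 2, 3, 3, 1, 2]  # hypothetical gold labels
y_pred = [1, 2, 3, 1, 1, 3]  # hypothetical system predictions
micro_f = f1_score(y_true, y_pred, labels=[1, 2], average="micro")

# Task 3: accuracy over the predicted MedDRA Preferred Terms.
gold_pts = ["10019211", "10027599", "10041349"]  # hypothetical PT codes
pred_pts = ["10019211", "10012378", "10041349"]
accuracy = accuracy_score(gold_pts, pred_pts)

print(f"Task 2 micro-averaged F = {micro_f:.3f}")
print(f"Task 3 accuracy = {accuracy:.3f}")
```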
Results

Task 1

Eleven teams registered to participate in this task, and 24 submissions from nine teams were included in the final evaluations. System submissions were excluded if they missed the deadline, were incompatible or incomplete, or did not follow the shared task guidelines. Table 3 presents the performances of the 24 included systems, grouped by team. Team NRC_Canada5 had the best performing system for this task, obtaining an ADR-class F-score of 0.435.

Table 3. System performances for each team for task 1 of the shared task. Precision, recall and F-score over the ADR class are shown; the top score in each column is marked with an asterisk (*).

Team                 Institution(s) – Country                           ADR Precision   ADR Recall   ADR F-score
TsuiLab              University of Pittsburgh – United States           0.333           0.350        0.341
                                                                        0.298           0.394        0.339
                                                                        0.336           0.348        0.342
NRC_Canada           National Research Council – Canada                 0.392           0.488*       0.435*
                                                                        0.386           0.413        0.399
                                                                        0.464           0.396        0.427
NorthEasternNLP      Northeastern University – United States            0.551           0.306        0.394
                                                                        0.395           0.431        0.412
NTTMU                Taipei Medical University, Academia Sinica,        0.213           0.433        0.286
                     National Taitung University – Taiwan               0.362           0.249        0.295
                                                                        0.226           0.403        0.290
CSaRUS-CNN           Arizona State University – United States           0.437           0.393        0.414
                                                                        0.467           0.357        0.404
                                                                        0.396           0.431        0.412
TJIIP                University of Montreal – Canada                    0.359           0.398        0.378
                                                                        0.422           0.154        0.226
                                                                        0.325           0.400        0.359
UKNLP                University of Kentucky – United States             0.459           0.237        0.313
                                                                        0.567*          0.259        0.356
                                                                        0.498           0.337        0.402
deepCyberNet         Amrita School of Engineering Coimbatore – India    0.078           0.170        0.107
AMRITA_CEN_NLP_RBG   Amrita School of Engineering Coimbatore – India    0.056           0.109        0.074
                                                                        0.087           0.204        0.121
                                                                        0.186           0.481        0.268

Task 2

Eleven teams registered to participate in this task, including eight teams that also registered for task 1; 26 submissions from ten teams were included in the final evaluations. The exclusion criteria were identical to those of task 1. Table 4 presents the performances of these 26 systems, grouped by team. Team InfyNLP6 had the best performing system for this task, obtaining a micro-averaged F-score of 0.693 over the two relevant classes.

Table 4. System performances for each team for task 2 of the shared task. Micro-averaged precision, recall and F-scores are shown for the definite intake (class 1) and possible intake (class 2) classes; the top score in each column is marked with an asterisk (*).

Team                 Institution(s) – Country                           Precision   Recall   F-score
CSaRUS-CNN           Arizona State University – United States           0.696       0.601    0.645
                                                                        0.708       0.599    0.649
                                                                        0.709       0.604    0.652
AMRITA_CEN_NLP_RBG   Amrita School of Engineering Coimbatore – India    0.569       0.390    0.462
NRC_Canada           National Research Council – Canada                 0.708       0.642    0.673
                                                                        0.705       0.639    0.671
                                                                        0.704       0.635    0.668
NTTMU                Taipei Medical University, Academia Sinica,        0.690       0.554    0.614
                     National Taitung University – Taiwan               0.644       0.588    0.615
                                                                        0.662       0.572    0.614
RITUAL               University of Houston – United States              0.630       0.571    0.599
                                                                        0.643       0.578    0.609
                                                                        0.650       0.575    0.610
TJIIP                University of Montreal – Canada                    0.691       0.641    0.665
                                                                        0.628       0.557    0.590
                                                                        0.654       0.664    0.659
TurkuNLP             University of Turku, Turku Centre for              0.692       0.601    0.643
                     Computer Science – Finland                         0.701       0.630    0.663
UKNLP                University of Kentucky – United States             0.688       0.607    0.645
                                                                        0.705       0.666    0.685
                                                                        0.701       0.677*   0.689
InfyNLP              Infosys Ltd. – United States;                      0.716       0.664    0.689
                     Indian Institute of Technology – India             0.721       0.661    0.690
                                                                        0.725       0.664    0.693*
deepCyberNet         Amrita School of Engineering Coimbatore – India    0.414       0.107    0.171
                                                                        0.843*      0.487    0.617

Task 3

Two teams registered to participate in this task, and five system runs were submitted. Table 5 summarizes the performances of the five systems. The systems showed similar performances, with one system from team gnTeam7 obtaining the best accuracy of 88.5%.

Table 5. System performances for task 3. Accuracies over the evaluation set are shown; the best performance is marked with an asterisk (*).

Team     Institution – Country                        Accuracy (%)
gnTeam   University of Manchester – United Kingdom    87.7
                                                      85.5
                                                      88.5*
UKNLP    University of Kentucky – United States       87.2
                                                      86.7

Conclusion

The number of submissions received for the second execution of the SMM4H shared tasks was more than double that received for the first execution. The submitted systems employed a wide range of machine learning methods, and the system descriptions published in the shared task proceedings provide further details about these methods and their relative performances. The successful execution of the shared tasks suggests that this is an effective model for encouraging community-driven development of systems for social media-based health-related text mining, and it warrants future efforts.
Acknowledgments

This work was supported by National Institutes of Health (NIH) National Library of Medicine (NLM) grant number NIH NLM 5R01LM011176. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NLM or NIH. The authors would like to thank the members of the Health Language Processing Laboratory for their support, and, in particular, to acknowledge the efforts of Karen O'Connor and Alexis Upshur in preparing the annotated datasets. The authors would also like to thank Dr. Davy Weissenbacher, Dr. Masoud Rouhizadeh and Dr. Ari Z. Klein for reviewing system descriptions and contributing to the selection process.

References

1. Sarker A, Nikfarjam A, Gonzalez G. Social media mining shared task workshop. Pac Symp Biocomput. 2016;21:581-592. http://www.ncbi.nlm.nih.gov/pubmed/26776221. Accessed January 5, 2017.
2. Twitter. Twitter Public Streaming API. https://developer.twitter.com/en/docs.
3. Sarker A, Gonzalez G. Portable automatic text classification for adverse drug reaction detection via multi-corpus training. J Biomed Inform. 2014;53:196-207. doi:10.1016/j.jbi.2014.11.002.
4. Klein A, Sarker A, Rouhizadeh M, O'Connor K, Gonzalez G. Detecting personal medication intake in Twitter: an annotated corpus and baseline classification system. In: Proceedings of the BioNLP 2017 Workshop. Vancouver, BC, Canada; 2017:136-142.
5. Kiritchenko S, Mohammad SM, Morin J, de Bruijn B. NRC-Canada at SMM4H shared task: classifying tweets mentioning adverse drug reactions and medication intake. In: Proceedings of the Second Workshop on Social Media Mining for Health Applications (SMM4H). Health Language Processing Laboratory; 2017.
6. Friedrichs J, Mahata D, Gupta S. InfyNLP at SMM4H task 2: stacked ensemble of shallow convolutional neural networks for identifying personal medication intake from Twitter. In: Proceedings of the Second Workshop on Social Media Mining for Health Applications (SMM4H). Health Language Processing Laboratory; 2017.
7. Belousov M, Dixon W, Nenadic G. Using an ensemble of linear and deep learning models in the SMM4H 2017 medical concept normalisation task. In: Proceedings of the Second Workshop on Social Media Mining for Health Applications (SMM4H). Health Language Processing Laboratory; 2017.