
J Biomed Inform.

10.1016/j.jbi.2014.11.002

Overview of the Second Social Media Mining for Health (SMM4H) Shared Tasks at AMIA 2017

Abeed Sarker, Ph.D.; Graciela Gonzalez-Hernandez, Ph.D.

Health Language Processing Laboratory, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

2014

53 196 207

The volume of data encapsulated within social media continues to grow, and, consequently, there is a growing interest in developing effective systems that can convert this data into usable knowledge. Over recent years, initiatives have been taken to enable and promote the utilization of knowledge derived from social media to perform health-related tasks. These initiatives include the development of data mining systems and the preparation of datasets that can be used to train such systems. The overarching focus of the SMM4H shared tasks is to release annotated, social media based, health-related datasets to the research community, and to compare the performances of distinct natural language processing and machine learning systems on tasks involving these datasets. The second execution of the SMM4H shared tasks comprised three subtasks involving annotated user posts from Twitter (tweets): (i) automatic classification of tweets mentioning an adverse drug reaction (ADR), (ii) automatic classification of tweets containing reports of first-person medication intake, and (iii) automatic normalization of ADR mentions to MedDRA concepts. A total of 15 teams participated and 55 system runs were submitted. The best performing systems for tasks 2 and 3 outperformed the current state-of-the-art systems.

Introduction

The primary goal of the SMM4H shared tasks is to promote community-driven development and evaluation of systems focusing on social media based health data. This year’s tasks involved medication-mentioning user posts from Twitter. We included two tasks from the last execution at PSB and a new task. Outlines of the tasks are as follows:

(i) Automatic classification of ADR mentioning tweets. This is a binary text classification task for which systems were required to predict whether or not a tweet mentions an ADR. Such a system is crucial for active surveillance of ADRs from social media data, as most of the medication-related chatter in the domain, including that on Twitter, is noise. This task was also part of the first execution of the SMM4H shared tasks. Further details about this task can be found in our past publication3.

(ii) Automatic classification of medication intake mentioning posts. This is a three-class text classification task. Each medication-mentioning tweet is assigned to one of three classes: definite intake (where the user presents clear evidence of personal consumption), possible intake (where it is likely that the user consumed the medication, but the evidence is unclear), and no intake (where there is no evidence that the user consumed the medication). This task was new in the 2017 SMM4H shared tasks. Further details about this task can be found in our recent publication4.

(iii) Normalization of ADR mentions. The goal of this task is to normalize different natural language expressions of the same ADR concept into standard IDs. This is a particularly challenging task and, although it was proposed in the first execution of the shared tasks, it attracted no participants at that time.

To facilitate the shared task, we made available large annotated Twitter data sets. The overall shared task was designed to capitalize on the interest in social media mining and appeal to a diverse set of researchers working on distinct topics such as natural language processing, biomedical informatics, and machine learning. The different subtasks presented a number of interesting challenges including the noisy nature of the data, the informal language of the user posts, misspellings, and data imbalance. We provide details of the data used for each of the three abovementioned tasks, and the tasks themselves, in the following subsection.

Data

The datasets made available for the shared tasks were collected from Twitter using the public streaming API. The annotated datasets provided as training sets were made available to the public with our prior publications3,4. Only task 3 included new, previously unpublished data for training.

Task 1: ADR Classification. Participants were provided with a training/development set containing tweets which were annotated in a binary fashion to indicate the presence or absence of ADRs. Initially, a total of 10,822 annotated tweets were made available1. Later on, an additional 4895 tweets (the previous shared task’s evaluation set) were released in the same fashion to active participants. The evaluation set consisted of 9961 tweets. The per-class distributions of the tweets in the three sets are shown in Table 1. The evaluation metric for this task was the F-score for the ADR class, since the primary intent of this task is to be able to filter ADR-indicating tweets out of large amounts of noise.

1Due to Twitter’s privacy policy, the actual tweets were not shared publicly. We made available a download script and the TweetIDs and UserIDs for the tweets. The publicly available tweets can be downloaded using the download script.

Task 2: Medication Intake Classification. Participants were provided with tweets that had been manually categorized into three classes: definite intake, possible intake and no intake. Like task 1, data was released in three phases. Initially, 8000 annotated tweets were released, followed by an additional 2260 tweets for active participants. The evaluation set consisted of 7513 tweets. The per-class distributions of the tweets are shown in Table 2. For this task, the evaluation metric was the micro-averaged F-score for the definite intake and possible intake classes. This metric was chosen for evaluation because the tweets belonging to these two classes are of interest in social media based drug safety surveillance systems, while the no intake class primarily represents noise.

Task 3: Adverse Drug Reaction Mention Normalization. The training data consisted of ADR mentions mapped to MedDRA (Medical Dictionary for Regulatory Activities)3 Preferred Terms (PTs). The training set consisted of 6,650 phrases mapped to 472 PTs (14.09 mentions per concept on average). The test set consisted of 2500 mentions mapped to 254 classes. The evaluation metric for this task was accuracy (i.e., the number of correctly identified MedDRA PTs divided by the total number of instances in the evaluation set).
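The three evaluation metrics described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the organizers' evaluation script: the function names and the label strings used in the usage note are hypothetical, and gold and predicted labels are assumed to be given as parallel lists.

```python
def _counts(gold, pred, label):
    """Count true positives, false positives and false negatives for one label."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        if p == label and g == label:
            tp += 1
        elif p == label:
            fp += 1
        elif g == label:
            fn += 1
    return tp, fp, fn


def _f_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 when it is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def class_f_score(gold, pred, label):
    """Task 1 style metric: F-score for a single class of interest."""
    return _f_score(*_counts(gold, pred, label))


def micro_f_score(gold, pred, labels):
    """Task 2 style metric: pool the counts over `labels`, then compute F."""
    totals = [0, 0, 0]
    for label in labels:
        for i, count in enumerate(_counts(gold, pred, label)):
            totals[i] += count
    return _f_score(*totals)


def accuracy(gold, pred):
    """Task 3 style metric: fraction of instances with the correct label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

For example, with hypothetical labels, `class_f_score(gold, pred, "ADR")` scores only the ADR class, while `micro_f_score(gold, pred, ["definite", "possible"])` leaves the no intake class out of the pooled counts, so errors that confuse the two intake classes with noise still count as false positives or false negatives.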

Results

Task 1

Eleven teams registered to participate in the task and 24 submissions from nine teams were included in the final evaluations. System submissions were excluded if they did not meet the deadline, were incompatible, did not follow the shared task guidelines or were incomplete. Table 3 presents the performances of the 24 included systems grouped by team name. Team NRC_Canada had the best performing system for this task, obtaining an ADR class F-score of 0.4355.

3Available at: https://www.meddra.org/.

Task 2

Eleven teams registered to participate in this task, including eight teams that also registered for Task 1. Twenty-six submissions from ten teams were included in the final evaluations. Exclusion criteria were identical to those of task 1. Table 4 presents the performances of these 26 systems grouped by team name. Team InfyNLP had the best performing system for this task, obtaining a micro-averaged F-score of 0.693 for the two relevant classes6.

[Results table residue: scores not recoverable. Teams listed: UKNLP, deepCyberNet, AMRITA_CEN_NLP_RBG, CSaRUS-CNN, NRC_Canada, NTTMU, RITUAL, TJIIP, TurkuNLP]

Task 3

Two teams registered to participate in this task and five system runs were submitted. Table 5 summarizes the performances of the five systems. It can be seen from the table that the different systems showed similar performances7, with one system from team gnTeam obtaining the best accuracy of 88.5%.

[Results table residue: scores not recoverable. Teams listed: UKNLP, InfyNLP, deepCyberNet]

The number of submissions received for the second execution of the SMM4H shared tasks was more than double that received for the first execution. The submitted systems employed a wide range of machine learning methods. The system descriptions published with the shared task proceedings provide further details about these methods and their relative performances. The successful execution of the shared tasks suggests that this is an effective model for encouraging community-driven development of systems for social media based health-related text mining, and that it warrants further efforts.

Acknowledgments

This work was supported by National Institutes of Health (NIH) National Library of Medicine (NLM) grant number NIH NLM 5R01LM011176. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NLM or NIH.

The authors would like to thank the members of the Health Language Processing Laboratory for their support. In particular, the authors would like to acknowledge the efforts by Karen O’Connor and Alexis Upshur for preparing the annotated data sets. The authors would also like to thank Dr. Davy Weissenbacher, Dr. Masoud Rouhizadeh and Dr. Ari Z. Klein for reviewing system descriptions and contributing to the selection process.

Sarker, Nikfarjam, Gonzalez G. Social media mining shared task workshop. Pac Symp Biocomput. 2016;21:581-592. http://www.ncbi.nlm.nih.gov/pubmed/26776221. Accessed January 5, 2017.