<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>J Biomed Inform.</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.1016/j.jbi.2014.11.002</article-id>
      <title-group>
        <article-title>Overview of the Second Social Media Mining for Health (SMM4H) Shared Tasks at AMIA 2017</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Abeed Sarker</string-name>
          <degrees>Ph.D.</degrees>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Graciela Gonzalez-Hernandez</string-name>
          <degrees>Ph.D.</degrees>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Health Language Processing Laboratory, Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <volume>53</volume>
      <fpage>196</fpage>
      <lpage>207</lpage>
      <abstract>
        <p>The volume of data encapsulated within social media continues to grow and, consequently, there is growing interest in developing effective systems that can convert this data into usable knowledge. In recent years, initiatives have been taken to enable and promote the use of knowledge derived from social media to perform health-related tasks. These initiatives include the development of data mining systems and the preparation of datasets that can be used to train such systems. The overarching focus of the SMM4H shared tasks is to release annotated, social media based, health-related datasets to the research community, and to compare the performance of distinct natural language processing and machine learning systems on tasks involving these datasets. The second execution of the SMM4H shared tasks comprised three subtasks involving annotated user posts from Twitter (tweets): (i) automatic classification of tweets mentioning an adverse drug reaction (ADR), (ii) automatic classification of tweets containing reports of first-person medication intake, and (iii) automatic normalization of ADR mentions to MedDRA concepts. A total of 15 teams participated and 55 system runs were submitted. The best performing systems for tasks 2 and 3 outperformed the current state-of-the-art systems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <sec id="sec-1-1">
        <title>Tasks</title>
        <p>The primary goal of the SMM4H shared tasks is to promote community-driven development and evaluation of
systems focusing on social media based health data. This year’s tasks involved medication-mentioning user posts from
Twitter. We included two tasks from the last execution at PSB and a new task. Outlines of the tasks are as follows:</p>
        <p>(i) Automatic classification of ADR mentioning tweets. This is a binary text classification task for which
systems were required to predict whether or not a tweet mentions an ADR. Such a system is crucial for active
surveillance of ADRs from social media data, since most of the medication-related chatter in the domain,
including that on Twitter, is noise. This task was also part of the first execution of the SMM4H shared
tasks. Further details about this task can be found in our past publication<sup>3</sup>.</p>
        <p>(ii) Automatic classification of medication intake mentioning posts. This is a three-class text classification
task. Each medication-mentioning tweet is categorized into one of three classes: definite intake (where the user
presents clear evidence of personal consumption), possible intake (where it is likely that the user
consumed the medication, but the evidence is unclear), and no intake (where there is no evidence that
the user consumed the medication). This task was new in the 2017 SMM4H shared tasks.
Further details about this task can be found in our recent publication<sup>4</sup>.</p>
        <p>(iii) Normalization of ADR mentions. The goal of this task is to normalize different natural language
expressions of the same ADR concept into standard IDs. This is a particularly challenging task and,
although it was proposed in the first execution of the shared tasks, there were no participants.</p>
        <p>To facilitate the shared task, we made available large annotated Twitter data sets. The overall shared task was designed
to capitalize on the interest in social media mining and appeal to a diverse set of researchers working on distinct topics
such as natural language processing, biomedical informatics, and machine learning. The different subtasks presented
a number of interesting challenges including the noisy nature of the data, the informal language of the user posts,
misspellings, and data imbalance. We provide details of the data used for each of the three abovementioned tasks, and
the tasks themselves, in the following subsection.</p>
      </sec>
      <sec id="sec-1-2">
        <title>Data</title>
        <p>The datasets made available for the shared tasks were collected from Twitter using the public streaming API. The
annotated datasets provided as training sets were made available to the public with our prior publications<sup>3,4</sup>. Only task
3 included new, previously unpublished data for training.</p>
        <p>Task 1: ADR Classification. Participants were provided with a training/development set of tweets
annotated in a binary fashion to indicate the presence or absence of ADRs. Initially, a total of 10,822 annotated
tweets were made available<sup>1</sup>. Later, an additional 4,895 tweets (the previous shared task’s evaluation set)
were released in the same fashion to active participants. The evaluation set consisted of 9,961 tweets. The per-class
distributions of the tweets in the three sets are shown in Table 1. The evaluation metric for this task was the F-score
for the ADR class, since the primary intent of this task is to filter ADR-indicating tweets out of large
amounts of noise.</p>
        <p><sup>1</sup> Due to Twitter’s privacy policy, the actual tweets were not shared publicly. We made available a download script
along with the TweetIDs and UserIDs for the tweets. The publicly available tweets can be downloaded using the download
script.</p>
        <p>Task 2: Medication Intake Classification. Participants were provided with tweets that had been manually categorized
into three classes: definite intake, possible intake, and no intake. As in task 1, the data were released in three phases.
Initially, 8,000 annotated tweets were released, followed by an additional 2,260 tweets for active participants. The
evaluation set consisted of 7,513 tweets. The per-class distributions of the tweets are shown in Table 2. For this task,
the evaluation metric was the micro-averaged F-score for the definite intake and possible intake classes. This metric was
chosen because the tweets belonging to these two classes are of interest in social media based drug
safety surveillance systems, while the no intake class primarily represents noise.</p>
        <p>Task 3: Adverse Drug Reaction Mention Normalization. The training data consisted of ADR mentions mapped to
MedDRA (Medical Dictionary for Regulatory Activities)<sup>3</sup> Preferred Terms (PTs). The training set consisted of 6,650
phrases mapped to 472 PTs (14.09 mentions per concept on average). The test set consisted of 2,500 mentions mapped
to 254 classes. The evaluation metric for this task was accuracy (i.e., the number of correctly identified MedDRA PTs
divided by the total number of instances in the evaluation set).</p>
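        <p>The three evaluation metrics described above can be illustrated with a short sketch. It is a minimal Python example that assumes the standard definitions of per-class F-score, micro-averaged F-score, and accuracy; it is not the organizers' evaluation script, and the function names, label strings, and placeholder concept IDs are hypothetical.</p>

```python
def pooled_f_score(gold, predicted, positive_classes):
    """F-score with counts pooled (micro-averaged) over positive_classes.

    With one positive class this is the ordinary per-class F-score (Task 1);
    with two positive classes it is their micro-averaged F-score (Task 2).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        if p in positive_classes and p == g:
            tp += 1  # correct prediction of a positive class
        else:
            if p in positive_classes:
                fp += 1  # a positive class was predicted wrongly
            if g in positive_classes:
                fn += 1  # a gold positive class was missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(gold, predicted):
    """Fraction of instances whose predicted label equals the gold label (Task 3)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Task 1: F-score for the ADR class (label strings are hypothetical).
task1 = pooled_f_score(
    ["ADR", "noADR", "ADR", "noADR"],
    ["ADR", "ADR", "noADR", "noADR"],
    {"ADR"},
)

# Task 2: micro-averaged F-score over the two intake classes.
task2 = pooled_f_score(
    ["definite", "possible", "no", "definite", "no"],
    ["definite", "no", "no", "possible", "possible"],
    {"definite", "possible"},
)

# Task 3: accuracy over concept IDs (placeholders, not real MedDRA codes).
task3 = accuracy(
    ["PT_A", "PT_B", "PT_A", "PT_C"],
    ["PT_A", "PT_B", "PT_C", "PT_C"],
)

print(task1, task2, task3)
```

        <p>Because the counts are pooled over the positive classes only, the no intake class affects the Task 2 score solely through its confusions with the two intake classes, which is consistent with treating it as noise.</p>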
      </sec>
    </sec>
    <sec id="sec-2">
      <title>Results</title>
      <sec id="sec-2-1">
        <title>Task 1</title>
        <p>Eleven teams registered to participate in the task and 24 submissions from nine teams were included in the final
evaluations. System submissions were excluded if they did not meet the deadline, were incompatible, did not follow
the shared task guidelines, or were incomplete. Table 3 presents the performances of the 24 included systems grouped
by team name. Team NRC_Canada had the best performing system for this task, obtaining an ADR class
F-score of 0.4355.</p>
        <p><sup>3</sup> Available at: https://www.meddra.org/.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Task 2</title>
        <p>Eleven teams registered to participate in this task, including eight teams that also registered for Task 1. Twenty-six submissions
from ten teams were included in the final evaluations. Exclusion criteria were identical to those of task 1. Table 4
presents the performances of these 26 systems grouped by team name. Team InfyNLP had the best performing system
for this task, obtaining a micro-averaged F-score of 0.693 for the two relevant classes<sup>6</sup>.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <p>[Results tables not recovered. Team names surviving from the table rows: UKNLP, deepCyberNet, AMRITA_CEN_NLP_RBG, CSaRUS-CNN, NRC_Canada, NTTMU, RITUAL, TJIIP, TurkuNLP.]</p>
    </sec>
    <sec id="sec-11">
      <title>Task 3</title>
      <p>Two teams registered to participate in Task 3 and five system runs were submitted. Table 5 summarizes the
performances of the five systems. The table shows that the systems performed similarly,
with one system from team gnTeam obtaining the best accuracy of 88.5%<sup>7</sup>.</p>
    </sec>
    <sec id="sec-12">
      <title>Conclusion</title>
      <p>The number of submissions received for the second execution of the SMM4H shared tasks was more than double
that received for the first execution. The submitted systems employed a wide range of machine learning methods. The
system descriptions that have been published with the shared task proceedings provide further details about these
methods and their relative performances. The successful execution of the shared tasks suggests that this is an
effective model for encouraging community-driven development of systems for social media based health-related text
mining, and that it warrants further efforts.</p>
    </sec>
    <sec id="sec-13">
      <title>Acknowledgments</title>
      <p>This work was supported by National Institutes of Health (NIH) National Library of Medicine (NLM) grant number
NIH NLM 5R01LM011176. The content is solely the responsibility of the authors and does not necessarily represent
the official views of the NLM or NIH.</p>
      <p>The authors would like to thank the members of the Health Language Processing Laboratory for their support. In
particular, the authors would like to acknowledge the efforts by Karen O’Connor and Alexis Upshur for preparing the
annotated data sets. The authors would also like to thank Dr. Davy Weissenbacher, Dr. Masoud Rouhizadeh and Dr.
Ari Z. Klein for reviewing system descriptions and contributing to the selection process.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Sarker</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikfarjam</surname>
            <given-names>A</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            <given-names>G</given-names>
          </string-name>
          .
          <article-title>Social Media Mining Shared Task Workshop</article-title>
          .
          <source>Pac Symp Biocomput</source>
          .
          <year>2016</year>
          ;
          <volume>21</volume>
          :
          <fpage>581</fpage>
          -
          <lpage>592</lpage>
          . http://www.ncbi.nlm.nih.gov/pubmed/26776221. Accessed January 5, 2017.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>