<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <article-id pub-id-type="doi">10.4225/08/5490FA2E01A90</article-id>
      <title-group>
        <article-title>Concept Identification and Normalisation for Adverse Drug Event Discovery in Medical Forums</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alejandro Metke-Jimenez</string-name>
          <email>alejandro.metke@csiro.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sarvnaz Karimi</string-name>
          <email>sarvnaz.karimi@csiro.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CSIRO</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Social media is becoming an increasingly important source of information to complement traditional pharmacovigilance methods. In order to identify signals of potential adverse drug reactions, it is necessary to first identify medical concepts and drugs in the text. We evaluate different concept extraction techniques on medical forums, and for the machine learning approaches we encode complex annotations using a scheme that showed good results in other domains. Our study shows that the extended encoding scheme, although imperfect, still produces good results despite the complexities of social media. The comparison of techniques shows that the machine learning approach significantly outperforms the other approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Text Mining</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Ontology-based Text Normalisation</kwd>
        <kwd>Drug Safety</kwd>
        <kwd>Adverse Drug Reaction Discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Adverse Drug Reactions (ADRs) are a major concern for public health. An ADR
is an injury caused by a medication that is administered at the recommended
dosage, for recommended symptoms. The traditional pharmacovigilance
methods have shown limitations that have prompted the search for alternative sources
that might help identify signals of potential ADRs.</p>
      <p>One of these sources is social media. However, it is first necessary to identify
concepts of interest, such as mentions of adverse effects, in the text, which is
unstructured and noisy. This step is critical because errors can affect the
subsequent stages of the signal detection process.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and related work</title>
      <p>
        Although there is a large body of literature on generic information extraction
from text such as news and social media, especially Twitter, there is limited
work on the specific area of ADR detection. A comprehensive survey of text and
data mining techniques used for ADR signal detection can be found in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>In this paper we are concerned with concept extraction, which can be divided
into two steps: identifying spans of text that represent a concept of interest,
referred to as concept identification, and mapping the spans to the corresponding
concepts in a chosen ontology, referred to as concept normalisation.</p>
      <p>
        The problem of medical concept extraction has been extensively studied by
the clinical text mining community. Most techniques used to extract ADRs from
social media use dictionary-based approaches. A review of these approaches and
the most commonly used lexicons can be found in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        More recently, machine learning techniques have been applied to extract
ADRs from social media. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors implemented a Conditional Random Field (CRF) classifier to
detect mentions of ADRs in a corpus of Twitter and DailyStrength posts and
reported improvements over dictionary-based approaches.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Problem formulation</title>
      <p>Our goal is to evaluate the concept extraction task specifically on medical forums.
Apart from addressing the challenges that this type of data raises, such as
misspellings and colloquial language, we also aim to compare widely used
techniques to determine how well they perform against each other.</p>
      <sec id="sec-3-1">
        <title>Concept identification</title>
        <p>Concept identification consists of identifying spans of text that represent medical
concepts. This task can be framed as a binary classification problem and
evaluated using precision, recall, and F-score. In the strict version of the evaluation,
the spans are required to match exactly. In the relaxed version the spans only
need to overlap to be considered a positive match.</p>
        <p>In order to consider the correct classification of negative examples we also
evaluate the systems using accuracy. The set of negative examples is defined as
all the spans that are created by all the systems under evaluation that are not
part of the gold standard.</p>
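        <p>The difference between strict and relaxed matching can be illustrated with a minimal sketch. It assumes spans are half-open character offsets (start, end); the function names and the sample offsets are illustrative and are not taken from the evaluation code used in the study.</p>

```python
def overlaps(a, b):
    """True when two half-open (start, end) character spans share text."""
    return min(a[1], b[1]) - max(a[0], b[0]) > 0


def evaluate(predicted, gold, relaxed=False):
    """Return (precision, recall, f_score) for two collections of spans."""
    predicted, gold = set(predicted), set(gold)
    if relaxed:
        # A span counts as matched when it overlaps any span on the other side.
        matched_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
        matched_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    else:
        # Strict mode: offsets must be identical.
        matched_pred = matched_gold = len(predicted.intersection(gold))
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Made-up offsets: one exact match, one partial overlap, one spurious span.
gold = [(0, 8), (20, 31)]
predicted = [(0, 8), (22, 31), (40, 45)]
strict = evaluate(predicted, gold)
relaxed = evaluate(predicted, gold, relaxed=True)
print(strict, relaxed)
```

        <p>With these made-up spans the strict F-score is 0.4 and the relaxed F-score is 0.8, because the partially overlapping span only counts in relaxed mode.</p>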
      </sec>
      <sec id="sec-3-2">
        <title>Concept normalisation</title>
        <p>The normalisation step takes the spans that were identified in the identification
step and maps them to a concept in an ontology. ADR spans are mapped to the
Clinical Finding hierarchy of SNOMED CT and drug spans to concepts in the
Australian Medicines Terminology (AMT).</p>
        <p>Concept normalisation is often evaluated using a metric referred to as
accuracy. To avoid confusion with the metric used in the first part of the task, we
refer to this metric as effectiveness, which is defined as
<disp-formula><tex-math>\mathrm{effectiveness}_{\mathrm{strict}} = \frac{n_{\mathrm{correct}}}{t_g}, \qquad \mathrm{effectiveness}_{\mathrm{relaxed}} = \frac{n_{\mathrm{correct}}}{n_{TP}}</tex-math></disp-formula>
where <italic>n</italic><sub>TP</sub> is the number of spans that match the gold standard exactly, <italic>n</italic><sub>correct</sub>
is the number of spans that were mapped to the correct concept in the
corresponding ontology, and <italic>t</italic><sub>g</sub> is the total number of identified concepts or spans
in the gold standard. The relaxed version only considers the spans that were
correctly identified in the previous stage.</p>
        <p>[Figure 1: example forum sentences annotated with extended BIO tags (DB, DI, HB, HI): "Next time I'll try my luck with Paracetamol."; "The pill I took consisted of 50 MG Diclofenac and 200 MG Misoprostol."; "... it has left me feeling exausted, and depressed."]</p>
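        <p>A minimal sketch of the effectiveness computation follows. It assumes one reading of the definitions above: the strict score divides the number of correctly mapped spans by the total number of gold-standard spans, and the relaxed score divides it by the number of exactly matching spans. The function name, the span offsets and the concept ids are made up for illustration.</p>

```python
def effectiveness(predicted, gold, relaxed=False):
    """predicted and gold map a span (start, end) to a concept id."""
    # Spans whose offsets match the gold standard exactly (n_TP).
    exact = [span for span in predicted if span in gold]
    # Of those, the spans mapped to the gold concept (n_correct).
    n_correct = sum(predicted[span] == gold[span] for span in exact)
    # Assumed denominators: n_TP for relaxed, t_g (all gold spans) for strict.
    denominator = len(exact) if relaxed else len(gold)
    return n_correct / denominator if denominator else 0.0


# Made-up spans and concept ids.
gold = {(0, 8): "C1", (20, 31): "C2", (40, 52): "C3", (60, 70): "C4"}
predicted = {(0, 8): "C1", (20, 31): "C9", (41, 52): "C3"}
strict_score = effectiveness(predicted, gold)
relaxed_score = effectiveness(predicted, gold, relaxed=True)
print(strict_score, relaxed_score)
```

        <p>Here two predicted spans match the gold offsets and one of them carries the right concept, so the sketch gives 0.25 in strict mode and 0.5 in relaxed mode. This shows how a system that identifies few spans can still score well on the relaxed measure.</p>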
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Dataset</title>
      <p>
        In our experiments, we used an annotated corpus called the CSIRO Adverse Drug
Event Corpus (CADEC). This corpus is a collection of medical posts sourced
from the medical forum AskaPatient. A detailed description of the corpus can
be found in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To develop and evaluate a machine learning approach, we divided
the data into training and testing sets, using a 70/30 split.
      </p>
    </sec>
    <sec id="sec-5">
      <title>Methods</title>
      <p>Most existing approaches to ADR mining in social media use dictionary-based
techniques based on pattern matching rules or sliding windows. We implemented
a sliding window approach using the Lucene search engine, without using
stemming or removing stop words.</p>
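      <p>The sliding window idea can be sketched in a few lines. This is a pure-Python stand-in rather than the Lucene-based implementation: it assumes an in-memory lexicon that maps lowercased terms to concept ids, a maximum window of five tokens, and greedy longest-match selection. The lexicon entries and ids are invented.</p>

```python
import re


def tokenize(text):
    """Lowercase word tokens with their character offsets."""
    return [(m.group().lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]


def sliding_window_match(text, lexicon, max_window=5):
    """Greedy longest-match lookup of token windows in a term dictionary."""
    tokens = tokenize(text)
    matches, i = [], 0
    while len(tokens) > i:
        step = 1
        # Try the widest window first so longer terms win over their parts.
        for size in range(min(max_window, len(tokens) - i), 0, -1):
            window = tokens[i:i + size]
            phrase = " ".join(tok for tok, _, _ in window)
            if phrase in lexicon:
                matches.append((window[0][1], window[-1][2], lexicon[phrase]))
                step = size
                break
        i += step
    return matches


# Invented lexicon; no stemming and no stop-word removal, as in the text.
lexicon = {"stomach pain": "ADR-1", "headache": "ADR-2", "pain": "ADR-3"}
matches = sliding_window_match(
    "Bad stomach pain and a headache after two days", lexicon)
print(matches)
```

      <p>Each match is a (start, end, concept id) triple. Because no stemming is applied, a post that says "headaches" would not match the entry "headache" in this sketch.</p>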
      <p>
        We also implemented a CRF classifier, similar to the one used in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] but with
fewer features, using the Stanford NER suite [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A CRF classifier takes as input
different features that are derived from the text, such as the words that surround
each token, letter n-grams and word shape features.
      </p>
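      <p>The kinds of features mentioned above can be illustrated with a small sketch. It does not reproduce the feature templates of the Stanford NER suite; the function names, the window size, the n-gram sizes and the word shape alphabet are assumptions chosen for illustration.</p>

```python
import re


def word_shape(token):
    """Collapse characters into classes: X upper, x lower, d digit."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"[0-9]", "d", shape)
    # Squeeze repeated classes so that "Xxxxx" becomes "Xx".
    return re.sub(r"(.)\1+", r"\1", shape)


def letter_ngrams(token, sizes=(2, 3)):
    """All character n-grams of the lowercased token for the given sizes."""
    token = token.lower()
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def token_features(tokens, i, window=2):
    """Feature dictionary for tokens[i], including the surrounding words."""
    token = tokens[i]
    features = {"word": token.lower(), "shape": word_shape(token)}
    for gram in letter_ngrams(token):
        features["ngram=" + gram] = True
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and j in range(len(tokens)):
            features["word[%+d]" % offset] = tokens[j].lower()
    return features


tokens = ["Took", "50", "MG", "Diclofenac", "daily"]
feats = token_features(tokens, 3)
print(feats)
```

      <p>A sequence labeller would receive one such dictionary per token, together with the tag of that token during training.</p>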
      <p>One of the challenges of dealing with discontinuous spans is representing
them in a format that is suitable as input to the classifier. Continuous spans are
typically represented using the standard Begin, Inside, Outside (BIO) chunking
representation. This format does not support the notion of discontinuous spans
and several solutions have been proposed to overcome this limitation. The most
successful approach in tasks such as CLEF has been to extend the BIO format
with additional tags to represent the discontinuous spans.</p>
      <p>With the extended BIO format, the following additional tags are introduced:
D{B, I} and H{B, I}. The first set of tags is used to represent discontinuous,
non-overlapping spans. The second set of tags is used to represent discontinuous,
overlapping spans that share one or more tokens (the H stands for Head, as in
head word). Figure 1 shows an example of a complex span.</p>
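      <p>A minimal sketch of the encoding step follows, under stated assumptions: spans are given as sorted lists of token indexes, tokens that belong to more than one span receive H tags, the remaining tokens of discontinuous or overlapping spans receive D tags, and simple continuous spans keep plain B and I tags. The function name and the annotation used in the example are illustrative, not the authors' implementation.</p>

```python
from collections import Counter


def encode(n_tokens, spans):
    """Tag n_tokens tokens; each span is a sorted list of token indexes."""
    usage = Counter(i for span in spans for i in span)
    tags = ["O"] * n_tokens
    for span in spans:
        contiguous = span == list(range(span[0], span[-1] + 1))
        shares = any(usage[i] > 1 for i in span)
        for pos, i in enumerate(span):
            shared = usage[i] > 1
            if shared:
                prefix = "H"   # token belongs to more than one span
            elif contiguous and not shares:
                prefix = ""    # plain BIO for simple spans
            else:
                prefix = "D"   # unshared token of a complex span
            previous = span[pos - 1] if pos else None
            # A fragment starts at the first token, after a gap, or when
            # the token kind (shared or unshared) changes.
            starts = (
                previous is None
                or previous != i - 1
                or (usage[previous] > 1) != shared
            )
            tags[i] = prefix + ("B" if starts else "I")
    return tags


tokens = "it has left me feeling exausted , and depressed".split()
# Assumed annotation: two ADR spans that share "left me feeling".
overlapping = encode(len(tokens), [[2, 3, 4, 5], [2, 3, 4, 8]])
simple = encode(3, [[0, 1]])
discontinuous = encode(5, [[0, 3, 4]])
print(list(zip(tokens, overlapping)))
```

      <p>Under these assumptions the shared tokens "left me feeling" are tagged HB HI HI, while "exausted" and "depressed" are each tagged DB.</p>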
      <p>One limitation of this approach is that it is impossible to represent several
discontinuous spans in the same sentence unambiguously. To determine how this
might affect the performance of the CRF approach with the CADEC dataset, a
round trip transformation was done on the gold standard annotations and the
results are shown in Table 1. This is equivalent to having a perfect classifier.</p>
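      <p>The ambiguity can be made concrete with a small round trip sketch. The decoder below is an assumption: it merges every D fragment in a sentence into a single span, which is one simple policy and not necessarily the one used in the study. Spans are lists of (start, end) token ranges with an exclusive end, and all names are illustrative.</p>

```python
def encode_discontinuous(n_tokens, spans):
    """Tag each fragment of each span with DB followed by DI tags."""
    tags = ["O"] * n_tokens
    for span in spans:
        for start, end in span:
            tags[start] = "DB"
            for i in range(start + 1, end):
                tags[i] = "DI"
    return tags


def fragments(tags, prefix="D"):
    """Return (start, end) token ranges, end exclusive, of tagged runs."""
    found, start = [], None
    for i, tag in enumerate(list(tags) + ["O"]):
        if start is not None and tag != prefix + "I":
            found.append((start, i))
            start = None
        if tag == prefix + "B":
            start = i
    return found


def decode(tags):
    """Assumed policy: all D fragments in a sentence form one span."""
    found = fragments(tags)
    return [found] if found else []


def round_trip_ok(n_tokens, spans):
    decoded = decode(encode_discontinuous(n_tokens, spans))
    return sorted(map(sorted, decoded)) == sorted(map(sorted, spans))


# Two different gold annotations over the same four fragments.
gold_a = [[(0, 2), (5, 6)], [(3, 4), (7, 8)]]
gold_b = [[(0, 2), (3, 4)], [(5, 6), (7, 8)]]
same_tags = encode_discontinuous(8, gold_a) == encode_discontinuous(8, gold_b)
single_ok = round_trip_ok(8, [[(0, 2), (5, 6)]])
double_ok = round_trip_ok(8, gold_a)
print(same_tags, single_ok, double_ok)
```

      <p>The two gold annotations produce identical tag sequences, so no decoder can recover both. A sentence with a single discontinuous span survives the round trip, while one with two such spans does not.</p>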
      <p>The CRF classifier only identifies relevant spans but does not map them to
concepts. Two approaches were explored to achieve this mapping. The first one
is based on the Vector Space Model (VSM) and was implemented using Lucene.
The target ontology was indexed, with stemming and stop-word removal, by
creating a document for each term and storing the corresponding concept id.
Then, the text of each span was used to query the index, without requiring all
the tokens to match. The top ranked concept was assigned to the span and if
the query returned no results then the span was annotated as concept-less.</p>
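      <p>The VSM lookup can be sketched without Lucene. The example below is a stand-in under stated assumptions: TF-IDF weights with cosine similarity replace Lucene's scoring, a crude rule that strips a final "s" replaces a real stemmer, and the stop-word list, class name, terms and concept ids are invented.</p>

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def analyse(text):
    """Lowercase, drop stop words, and crudely stem by stripping a final s."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if w.endswith("s") and len(w) > 3 else w
            for w in words if w not in STOP_WORDS]


def length(vector):
    return math.sqrt(sum(v * v for v in vector.values()))


class TermIndex:
    def __init__(self, terms):
        # One "document" per term, stored with its concept id.
        self.docs = [(cid, Counter(analyse(term))) for term, cid in terms]
        df = Counter(tok for _, doc in self.docs for tok in doc)
        self.idf = {tok: math.log(1 + len(self.docs) / n)
                    for tok, n in df.items()}

    def _vector(self, counts):
        return {tok: tf * self.idf[tok]
                for tok, tf in counts.items() if tok in self.idf}

    def normalise(self, span_text):
        """Concept id of the best-scoring term, or None if nothing matches."""
        query = self._vector(Counter(analyse(span_text)))
        best_id, best_score = None, 0.0
        for cid, doc in self.docs:
            vec = self._vector(doc)
            dot = sum(w * vec.get(tok, 0.0) for tok, w in query.items())
            norm = length(vec) * length(query)
            score = dot / norm if norm else 0.0
            if score > best_score:
                best_id, best_score = cid, score
        return best_id


index = TermIndex([("Abdominal pain", "C1"),
                   ("Pain in the chest", "C2"),
                   ("Headaches", "C3")])
print(index.normalise("chest pains"), index.normalise("dizzy"))
```

      <p>A query only needs to share some tokens with a term to be scored, which mirrors the partial matching described above. A span with no indexed token returns None, the counterpart of a concept-less annotation.</p>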
      <p>
        The second approach uses Ontoserver, a terminology server developed at the
Australian e-Health Research Centre that, given a free-text query, returns the
most relevant SNOMED CT and AMT concepts. Ontoserver uses a
purpose-tuned retrieval function based on a multi-prefix matching algorithm [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
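      <p>The general idea of multi-prefix matching can be illustrated as follows. This sketch is not Ontoserver's retrieval function, whose details are not given here: it assumes that a term matches when every query token is a prefix of a distinct token of the term, and it ranks shorter terms first. The terms and concept ids are invented.</p>

```python
def multi_prefix_match(query, term):
    """True when every query token is a prefix of a distinct term token."""
    remaining = term.lower().split()
    # Longest prefixes first: they have the fewest candidate tokens.
    for prefix in sorted(query.lower().split(), key=len, reverse=True):
        hit = next((tok for tok in remaining if tok.startswith(prefix)), None)
        if hit is None:
            return False
        remaining.remove(hit)
    return True


def search(query, terms):
    """Concept ids of all matching terms, shortest term first."""
    hits = [(len(term), term, cid) for term, cid in terms
            if multi_prefix_match(query, term)]
    return [cid for _, _, cid in sorted(hits)]


terms = [("Myocardial infarction", "C1"),
         ("Acute myocardial infarction", "C2"),
         ("Migraine", "C3")]
print(search("myo inf", terms))
print(search("inf acu", terms))
```

      <p>Token order in the query does not matter in this sketch, so "inf acu" still finds the acute variant of the term.</p>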
      <p>To determine if the improvements obtained with any two different methods
were statistically significant, we used McNemar's test.</p>
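      <p>McNemar's test compares two methods on the same items using only the cases where they disagree. The sketch below computes the exact two-sided variant from the two discordant counts; whether the study used the exact or the chi-squared form is not stated, so this choice, the function names and the sample data are assumptions.</p>

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count items that only method A, or only method B, got right."""
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0
    # Under the null hypothesis each disagreement favours either method
    # with probability 0.5, so the smaller count is Binomial(n, 0.5).
    smaller = min(only_a, only_b)
    tail = sum(math.comb(n, k) for k in range(smaller + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up per-span outcomes for two methods on the same four spans.
counts = discordant_counts([True, True, False, True],
                           [True, False, False, False])
p_small = mcnemar_exact(1, 9)
p_equal = mcnemar_exact(5, 5)
print(counts, p_small, p_equal)
```

      <p>With 1 and 9 discordant items the p-value is 22/1024, roughly 0.021, while an even 5 and 5 split gives 1.0.</p>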
    </sec>
    <sec id="sec-6">
      <title>Results and discussion</title>
      <p>The results of the concept identification task are shown in Table 2. The CRF
implementation outperforms MetaMap and all the dictionary-based
implementations in all of the metrics that were considered, in both strict and relaxed
modes, as expected.</p>
      <p>Identifying drugs usually involves less ambiguity than identifying ADRs and
therefore better results were expected in this task. The results show that the
CRF indeed performs better in this task than in the ADR identification task.
Note also that most of the dictionary-based implementations achieve good recall
but low precision; this is likely due to some of the constraints in the annotation
guidelines, for example, drug classes are excluded. The CRF is capable of learning
these constraints while the dictionary-based approaches are not.</p>
      <p>Table 3 shows the results of the concept normalisation task. In this case the
strict metric is more relevant, because some implementations can achieve a very
high score in the relaxed version despite having a very poor overall performance.
The results show that Ontoserver outperforms the other approaches when
normalising ADRs. Overall, however, the results are quite poor. This highlights two
important aspects of the task. First, it is inherently difficult to map colloquial
language to ontologies that contain more formal terms. Second, because in this
task the goal is to map the spans to SNOMED CT concepts, the quality of the
results when using approaches that rely on other controlled vocabularies will
depend on the quality of the mappings between those vocabularies and SNOMED
CT.</p>
      <p>It was also expected that the different methods would perform better when
normalising drugs than when normalising ADRs. For most implementations this
turned out to be true, except for the dictionary-based methods that are not
based on AMT. These methods were unable to normalise any concepts because
maps between the other controlled vocabularies and AMT do not currently exist.</p>
    </sec>
    <sec id="sec-7">
      <title>Conclusions and future work</title>
      <p>Pharmacovigilance should no longer rely only on manual reports of potential
adverse drug effects. One viable alternative is actively detecting signals of adverse
drug reactions in social media through text mining.</p>
      <p>We conducted an empirical evaluation of different methods to automatically
extract concepts from medical forums. We explored the implications of
representing complex annotations in a format suitable for use with machine learning
methods. Finally, we proposed and implemented two concept normalisation
techniques that we used in conjunction with our machine learning implementation.</p>
      <p>We showed that there is some ambiguity when using the extended BIO format
to represent the complex annotations, but the impact on the overall performance
is not substantial. The experimental results showed that the CRF
implementation combined with Ontoserver outperformed all the other methods that were
evaluated. Even though these results show that machine learning methods
perform better than simple dictionary-based methods, they also highlight the
complexities in mapping the spans of text to concepts in an underlying ontology or
controlled vocabulary.</p>
      <p>Regarding future work, existing concept normalisation implementations in
social media do not make use of the context of the spans. We believe more
advanced methods may benefit from having access not only to the text in the
span but also to the surrounding tokens and previously identified concepts.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>AskaPatient kindly provided the data used in this study for research purposes
only. Ethics approval for this project was obtained from the CSIRO ethics
committee, which classified the work as low risk (CSIRO Ecosciences #07613).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name><given-names>Sarvnaz</given-names> <surname>Karimi</surname></string-name>,
          <string-name><given-names>Chen</given-names> <surname>Wang</surname></string-name>,
          <string-name><given-names>Alejandro</given-names> <surname>Metke-Jimenez</surname></string-name>,
          <string-name><given-names>Raj</given-names> <surname>Gaire</surname></string-name>, and
          <string-name><given-names>Cecile</given-names> <surname>Paris</surname></string-name>.
          <article-title>Text and data mining techniques in adverse drug reaction detection</article-title>.
          <source>ACM Computing Surveys</source>,
          <volume>47</volume>(<issue>4</issue>):<fpage>56</fpage>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name><given-names>Abeed</given-names> <surname>Sarker</surname></string-name>,
          <string-name><given-names>Rachel</given-names> <surname>Ginn</surname></string-name>,
          <string-name><given-names>Azadeh</given-names> <surname>Nikfarjam</surname></string-name>,
          <string-name><given-names>Karen</given-names> <surname>O'Connor</surname></string-name>,
          <string-name><given-names>Karen</given-names> <surname>Smith</surname></string-name>,
          <string-name><given-names>Swetha</given-names> <surname>Jayaraman</surname></string-name>,
          <string-name><given-names>Tejaswi</given-names> <surname>Upadhaya</surname></string-name>, and
          <string-name><given-names>Graciela</given-names> <surname>Gonzalez</surname></string-name>.
          <article-title>Utilizing social media data for pharmacovigilance: A review</article-title>.
          <source>Journal of Biomedical Informatics</source>,
          <volume>54</volume>:<fpage>202</fpage>–<lpage>212</lpage>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name><given-names>Azadeh</given-names> <surname>Nikfarjam</surname></string-name>,
          <string-name><given-names>Abeed</given-names> <surname>Sarker</surname></string-name>,
          <string-name><given-names>Karen</given-names> <surname>O'Connor</surname></string-name>,
          <string-name><given-names>Rachel</given-names> <surname>Ginn</surname></string-name>, and
          <string-name><given-names>Graciela</given-names> <surname>Gonzalez</surname></string-name>.
          <article-title>Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features</article-title>.
          <source>Journal of the American Medical Informatics Association</source>,
          <year>2015</year>.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
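          4.
          <string-name><given-names>Sarvnaz</given-names> <surname>Karimi</surname></string-name>,
          <string-name><given-names>Alejandro</given-names> <surname>Metke-Jimenez</surname></string-name>,
          <string-name><given-names>Madonna</given-names> <surname>Kemp</surname></string-name>, and
          <string-name><given-names>Chen</given-names> <surname>Wang</surname></string-name>.
          <article-title>CADEC: A corpus of adverse drug event annotations</article-title>.
          <source>Journal of Biomedical Informatics</source>,
          <volume>55</volume>:<fpage>73</fpage>–<lpage>81</lpage>,
          <year>2015</year>.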
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name><given-names>Jenny Rose</given-names> <surname>Finkel</surname></string-name>,
          <string-name><given-names>Trond</given-names> <surname>Grenager</surname></string-name>, and
          <string-name><given-names>Christopher</given-names> <surname>Manning</surname></string-name>.
          <article-title>Incorporating nonlocal information into information extraction systems by Gibbs sampling</article-title>.
          In <source>The 43rd Annual Meeting on Association for Computational Linguistics</source>,
          pages <fpage>363</fpage>–<lpage>370</lpage>,
          <conf-loc>Ann Arbor, Michigan</conf-loc>,
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name><given-names>Merlijn</given-names> <surname>Sevenster</surname></string-name>,
          <string-name><given-names>Rob</given-names> <surname>van Ommering</surname></string-name>, and
          <string-name><given-names>Yuechen</given-names> <surname>Qian</surname></string-name>.
          <article-title>Algorithmic and user study of an autocompletion algorithm on a large medical vocabulary</article-title>.
          <source>Journal of Biomedical Informatics</source>,
          <volume>45</volume>(<issue>1</issue>):<fpage>107</fpage>–<lpage>119</lpage>,
          <year>2012</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>