<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Machine Learning Approach to Extract Drug-Drug Interactions in an Unbalanced Dataset</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jacinto Mata</string-name>
          <email>jacinto.mata@dti.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ramón Santano</string-name>
          <email>ramon.santano@alu.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Blanco</string-name>
          <email>daniel.blanco@alu.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marcos Lucero</string-name>
          <email>marcos.lucero@alu.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel J. Maña</string-name>
          <email>manuel.mana@dti.uhu.es</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Escuela Técnica Superior de Ingeniería, Universidad de Huelva</institution>
          <addr-line>Ctra. Huelva - Palos de la Frontera s/n, 21819 La Rábida, Huelva</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>8</lpage>
      <abstract>
        <p>Drug-Drug Interaction (DDI) extraction from the pharmacological literature is an emerging challenge in text mining. In this paper we describe a DDI extraction system based on a machine learning approach. We propose solutions to deal with the high dimensionality of the problem and the unbalanced representation of classes in the dataset. On the test dataset, our best run reaches an F-measure of 0.4702.</p>
      </abstract>
      <kwd-group>
        <kwd>Drug-drug interaction</kwd>
        <kwd>machine learning</kwd>
        <kwd>unbalanced classification</kwd>
        <kwd>feature selection</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        One of the most relevant problems in patient safety is the occurrence of adverse reactions caused by
drug interactions. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], it is claimed that 1.5 million adverse drug events and tens of
thousands of hospital admissions take place each year. A Drug-Drug Interaction
(DDI) occurs when the effect of a particular drug is altered when it is taken with
another drug. The most up-to-date source of DDI information is the specialized
pharmacological literature. However, the automatic extraction of DDI information from
this huge document repository is not a trivial problem. In this scenario, text mining
techniques are well suited to this kind of problem.
      </p>
      <p>
        Different approaches are used in DDI extraction. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], the authors propose a
hybrid method based on linguistic and pattern rules to detect DDI in the literature.
Linguistic rules capture syntactic structures or semantic meanings that can reveal
relations in unstructured texts. Pattern-based rules encode the various forms of
expressing a given relationship. As far as we know, few works apply machine learning
approaches to this task, owing to the lack of available
corpora. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] an SVM classifier was used to extract DDI from the DrugDDI corpus.
However, machine learning has been widely used for the similar problem of
protein-protein interaction (PPI) extraction, with promising effectiveness, as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The main advantages of this
approach are that it can be easily extended to new datasets and that the development
effort is considerably lower than for manual encoding of rules and patterns.
      </p>
      <p>
        In this paper we present a machine learning approach to extract DDI using the
DrugDDI corpus [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Natural Language Processing (NLP) techniques are used to
analyze documents and extract the features that represent them. The unbalanced
proportion between positive and negative classes in the corpus led us to
apply sampling techniques. We have experimented with several machine
learning algorithms (SVM, Naïve Bayes, Decision Trees, Adaboost) in combination
with feature selection techniques in order to reduce the dimensionality of the problem.
      </p>
      <p>The paper is organized as follows. The system architecture is presented in Section
2. In Section 3 we describe the set of features that represents each pair of drugs
appearing in the documents, together with the feature selection methods used to
reduce the initial set of attributes. Section 4 describes the techniques that we
have used to deal with this unbalanced classification problem. In Section 5 we
evaluate the results obtained with the training corpus. The results on the test corpus
are presented in Section 6. Finally, the conclusions are given in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>System Architecture</title>
      <p>Two different document formats have been provided by the organizers: the Unified
format and the MMTx format. We have used the latter to develop and test our
system.</p>
      <p>The words around the drugs in a sentence have been selected as attributes of the
database because they could provide clues about the existence of an interaction between
two drugs. We have experimented using the words as they appear in the documents
and, in other cases, with the lemmas provided by the Stanford University morphological
parser<sup>1</sup>.</p>
      <p>For each drug pair in a sentence a set of features was extracted. The main features
focus on keywords, distances between drugs and drug semantic types. The
next section describes each attribute in more detail.</p>
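      <p>As an illustration of how candidate instances could be generated, the following minimal sketch enumerates every unordered pair of drug mentions in one sentence. The assumption that all such pairs are considered, the function name <monospace>candidate_pairs</monospace> and the drug identifiers are hypothetical and are not taken from the system described here.</p>
      <preformat>
```python
from itertools import combinations


def candidate_pairs(drug_ids):
    """Return every unordered pair of drug mentions found in one sentence."""
    return list(combinations(drug_ids, 2))


# Hypothetical sentence with three annotated drug mentions.
pairs = candidate_pairs(["d0", "d1", "d2"])
print(pairs)
```
      </preformat>
      <p>Under this assumption, a sentence with n drug mentions yields n(n-1)/2 candidate pairs, which is one reason why instances without interaction can greatly outnumber the interacting ones.</p>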
      <p>In order to carry out the experimentation, the feature database (DB of Features) was split into two
datasets for training and testing. We have used 2/3 of the original database for training the
classifier. The remaining 1/3 was used to test the system during the development
phase.</p>
      <p>Before training the classifier we have experimented with two preprocessing
techniques. Because this problem is an unbalanced classification task, we have applied
sampling techniques. In addition, to reduce the dimensionality of the dataset, a feature
selection technique was applied. To obtain the model, we have experimented with
several machine learning algorithms (SVM, Naïve Bayes, Decision Trees, Adaboost).</p>
      <p>Each resulting model was evaluated on the test dataset. The
results of this evaluation are shown in Section 5.</p>
    </sec>
    <sec id="sec-3">
      <title>Feature Extraction and Selection</title>
      <p>The most important part of this kind of classification problem is choosing the set of
features that represents each pair of drugs as well as possible. This means that we need to
find those features that provide important information for differentiating pairs of
drugs that interact from pairs that do not.</p>
      <p>In this section we describe the features we have chosen to build the dataset.</p>
      <sec id="sec-3-1">
        <title>Features</title>
        <p>Firstly, we have extracted the drug ID, which indicates the sentence and the phrase of
the dataset to which the drug belongs.</p>
        <p>Secondly, a feature subset composed of keywords was chosen. Each attribute is
represented by a binary value that indicates the presence or absence of the keyword.
Three windows of tokens have been considered to locate the keywords: between the
first and the second drug, before the first drug and after the second drug. In the last
two cases, only three tokens were taken into account.
<fn id="fn1"><label>1</label><p>http://nlp.stanford.edu/index.shtml</p></fn></p>
        <p>In this work, a keyword is a word that could provide relevant information about
whether a pair of drugs interacts or not. In order to build the list of keywords we
extracted all the words between each pair of drugs, before the first drug or after the
second drug, depending on the case. This set of words was filtered with a short list of
stopwords. The POS tag of each word was taken into account to make the selection.
We considered that verbs carry important semantic content, so we
decided to include all of them in the final list. For nouns, we made a
manual selection, choosing those that could be semantically related to drug
interactions. Finally, in the case of prepositions, adverbs and conjunctions, we
selected those that could be related to negation or frequency.</p>
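        <p>The keyword windows described above can be sketched as follows. This is a minimal illustration, not the actual implementation: the function name <monospace>window_features</monospace>, the choice of one binary feature per window and keyword, the example sentence and the three example keywords are all assumptions.</p>
        <preformat>
```python
def window_features(tokens, i, j, keywords, width=3):
    """Binary keyword features for a drug pair at token positions i and j.

    Position i (first drug) must come before position j (second drug).
    Three windows are inspected: up to `width` tokens before the first drug,
    all tokens between the two drugs, and up to `width` tokens after the
    second drug.
    """
    windows = {
        "before": tokens[max(0, i - width):i],
        "between": tokens[i + 1:j],
        "after": tokens[j + 1:j + 1 + width],
    }
    features = {}
    for name, window in windows.items():
        present = {t.lower() for t in window}
        for kw in keywords:
            # 1 if the keyword occurs in this window, 0 otherwise.
            features[f"{name}_{kw}"] = 1 if kw in present else 0
    return features


# Hypothetical sentence; DRUG1 is at position 3 and DRUG2 at position 9.
tokens = "Concurrent use of DRUG1 may increase the effect of DRUG2 in patients".split()
feats = window_features(tokens, 3, 9, keywords=["increase", "not", "use"])
print(feats)
```
        </preformat>
        <p>In this example the keyword "use" is found in the window before the first drug and "increase" in the window between the two drugs, so only those two features are set to 1.</p>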
        <p>We have experimented using the keywords as they appear in the documents and, in
other cases, with the lemmas provided by the Stanford University morphological
analyzer. With lemmatization, the number of keywords was reduced because distinct verb
tenses or plurals of a word were reduced to their lemmas, giving a total of 459
attributes.</p>
        <p>Next, we added to the feature set the distance between the drugs, measured in number of
words and in number of phrases. We also included two features that represent the semantic type of
each drug (encoded as integers).</p>
        <p>Finally, the feature set is completed with the class, a binary value where 1 means
that the pair of drugs interacts and 0 means that it does not.</p>
        <p>As we can see in Table 1, we have extracted a total of 600 features from the
original dataset to build the development dataset.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Feature Selection</title>
        <p>
          Due to the high dimensionality of the training dataset, we have experimented with
the chi-squared feature selection method [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This method returns a ranking of the
features in decreasing order of the value of the chi-squared statistic with respect to
the class. We selected the attributes for which the statistic had a value greater than 0. The
resulting dataset, in the case of keywords without lemmatization, had 496 attributes.
        </p>
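        <p>The ranking and filtering step can be illustrated with a minimal sketch for binary features and a binary class, using the standard chi-squared statistic of a 2x2 contingency table, N(ad - bc)^2 / ((a+b)(c+d)(a+c)(b+d)). The function names, the toy keyword columns and the labels are hypothetical, and this is not the implementation used in the experiments.</p>
        <preformat>
```python
def chi_squared(feature, labels):
    """Chi-squared statistic of a binary feature against a binary class."""
    n = len(labels)
    a = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 1)
    b = sum(1 for f, y in zip(feature, labels) if f == 1 and y == 0)
    c = sum(1 for f, y in zip(feature, labels) if f == 0 and y == 1)
    d = n - a - b - c
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A constant feature (or class) carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_features(columns, labels):
    """Rank features by decreasing statistic and keep those above 0."""
    scores = {name: chi_squared(col, labels) for name, col in columns.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, score) for name, score in ranked if score > 0]


# Toy data: four drug pairs, two of which interact (class 1).
labels = [1, 1, 0, 0]
columns = {
    "inhibit": [1, 1, 0, 0],  # perfectly associated with the class
    "the": [1, 1, 1, 1],      # constant, statistic is 0
    "with": [1, 0, 1, 0],     # independent of the class, statistic is 0
}
selected = select_features(columns, labels)
print(selected)
```
        </preformat>
        <p>Only the feature associated with the class survives the filter in this toy example; constant or class-independent features obtain a statistic of 0 and are discarded.</p>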
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Unbalanced Classification</title>
      <p>
        As shown in Table 2, there are 23827 drug pairs in the development dataset and only 2409
are real drug interactions. Therefore, the positive class accounts for about 10% (10.11%) of
the total number of instances, which makes this a classification task with unbalanced classes. To
deal with this problem we have used the SMOTE algorithm [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in order to balance the
classes.
      </p>
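      <p>The idea behind SMOTE can be sketched as follows: each synthetic minority instance is placed at a random point on the segment joining a minority instance and one of its nearest minority neighbours. The code below is a simplified illustration under stated assumptions (numeric feature vectors, Euclidean distance, a randomly chosen base instance, and hypothetical names and parameters); it is not the implementation used in the experiments.</p>
      <preformat>
```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by interpolation.

    Each new sample lies between a randomly chosen minority sample and one
    of its k nearest minority neighbours. Needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((x - y) ** 2 for x, y in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([x + gap * (y - x) for x, y in zip(base, neighbour)])
    return synthetic


# Toy minority class with four two-dimensional instances.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=6, k=2)
print(len(new_samples))
```
      </preformat>
      <p>Note that interpolation produces non-binary values, so binary keyword attributes would need separate handling (for example, treating them as nominal) in a faithful implementation.</p>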
      <p>
        Several classification algorithms have been selected in order to obtain the best
effectiveness results with respect to the F-measure of the positive class. We have used
the Weka [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] implementation of the following algorithms: RandomForest [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], Naïve
Bayes [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], SMO [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and MultiBoosting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>In some cases, to build the classification model, we have applied a cost-sensitive
matrix in order to penalize false positives.</p>
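      <p>One common way to use a cost matrix is to predict the class with the lowest expected cost given the classifier's estimated probability. The sketch below illustrates that decision rule only; the cost values, the function name and the choice of this rule (rather than, for instance, reweighting the training instances) are assumptions, since the matrix actually used is not specified here.</p>
      <preformat>
```python
def min_cost_class(p_positive, cost):
    """Pick the class with the lowest expected cost.

    cost[actual][predicted] is the penalty for predicting `predicted`
    when the true class is `actual` (0 = no interaction, 1 = interaction).
    """
    p = [1.0 - p_positive, p_positive]
    expected = [
        sum(p[actual] * cost[actual][pred] for actual in (0, 1))
        for pred in (0, 1)
    ]
    return 1 if expected[0] > expected[1] else 0


# Hypothetical matrix: a false positive costs twice as much as a false negative.
fp_penalized = [[0, 2], [1, 0]]
uniform = [[0, 1], [1, 0]]

print(min_cost_class(0.6, uniform))       # uniform costs: predict interaction
print(min_cost_class(0.6, fp_penalized))  # penalized false positives: predict no interaction
```
      </preformat>
      <p>With these hypothetical costs a pair is labelled as interacting only when its estimated probability exceeds 2/3, which trades recall for precision.</p>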
    </sec>
    <sec id="sec-5">
      <title>Experimentation on Training Corpus</title>
      <p>The development corpus contains a collection of pharmacological texts labeled with drug
interactions. This collection consists of 4267 sentences extracted from a total of 435
documents, which describe interactions between drugs (Drug-Drug Interactions, or
DDI). From these documents we have extracted 23827 drug pairs as possible cases of
interaction. In total, there are 2409 instances corresponding to drug interactions and
21418 instances where there is no interaction between drugs.</p>
      <p>Table 2 summarizes the training corpus statistics.</p>
      <p>In the experiment phase, we divided the dataset into two new datasets for training
and testing, respectively. The training dataset consists of 2/3 of the total instances
(15885). The test dataset consists of the remaining instances (7942).</p>
      <p>The instances were assigned to the training and test datasets at random,
preserving the proportion of instances with and without drug interaction (10% and
90%, respectively).</p>
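      <p>A random split that preserves the class proportions can be sketched by sampling each class separately. The function name, the seed and the toy labels below are hypothetical; the snippet is an illustration of the idea rather than the procedure actually followed.</p>
      <preformat>
```python
import random


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Return (train_idx, test_idx), splitting each class separately so that
    both parts keep roughly the original class proportions."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Toy labels with a 10% / 90% class ratio.
labels = [1] * 30 + [0] * 270
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```
      </preformat>
      <p>Applying the same per-class rounding to the reported counts (2409 and 21418) would give 1606 + 14279 = 15885 training instances and 7942 test instances, which is consistent with the sizes stated above.</p>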
      <p>Table 3 shows the precision, recall and F-measure on the
positive class for the 10 best evaluations. Each row of the table corresponds to a different
combination of classification algorithm, cost-sensitive training, feature selection,
sampling and keyword lemmatization.</p>
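      <p>For reference, the three measures on the positive class can be computed as in the following sketch, which assumes that the F-measure is the balanced harmonic mean of precision and recall (F1). The function name and the toy predictions are hypothetical.</p>
      <preformat>
```python
def positive_class_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Toy example: 3 real interactions, 2 of them found, 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f_measure = positive_class_scores(y_true, y_pred)
print(precision, recall, f_measure)
```
      </preformat>
      <p>Because only the positive class is scored, the many correctly rejected non-interacting pairs do not inflate the result, which is why this measure suits an unbalanced task.</p>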
      <p>As can be seen, the best results are obtained with the RandomForest algorithm.
Moreover, cost-sensitive training, feature selection, sampling and lemmatization
of the keywords all contribute to the best F-measures.</p>
      <p>In order to submit runs with different characteristics, we did not submit the five runs with
the highest F-measure. According to Table 3, runs 1, 2, 4, 7 and 8 were
submitted. We chose this strategy because we did not know the characteristics of the
test corpus.</p>
      <p>In Table 4, we present the results obtained for the five submitted runs. The
approaches that obtain the best results on the training dataset also obtain the best
results on the test dataset. Although there are no significant differences between
the precision on the training and test datasets, a larger drop in recall causes
the F-measure to fall by approximately 10%. We think that this drop in
effectiveness is due to possible overfitting of the classification models.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusions</title>
      <p>In this paper we have presented a DDI extraction system based on a machine learning
approach. We have proposed distinct solutions to deal with the high dimensionality of
the problem and the unbalanced representation of classes in the dataset. The results
obtained on both datasets are promising and we think that this could be a good
starting point for future improvements.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <article-title>Random Forests</article-title>
          .
          <source>Machine Learning</source>
          ,
          <year>2001</year>
          . Vol.
          <volume>45</volume>
          (
          <issue>1</issue>
          ):
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Chawla</surname>
            ,
            <given-names>N.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bowyer</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>L.O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kegelmeyer</surname>
            ,
            <given-names>W.P.</given-names>
          </string-name>
          <article-title>SMOTE: Synthetic Minority Over-sampling Technique</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <year>2002</year>
          . Vol.
          <volume>16</volume>
          :
          <fpage>321</fpage>
          -
          <lpage>357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Classen</surname>
            ,
            <given-names>D.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Phansalkar</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bates</surname>
            ,
            <given-names>D.W.</given-names>
          </string-name>
          <article-title>Critical drug-drug interactions for use in electronic health records systems with computerized physician order entry: review of leading approaches</article-title>
          .
          <source>J. Patient Safety</source>
          <year>2011</year>
          ,Jun;
          <volume>7</volume>
          (
          <issue>2</issue>
          ):
          <fpage>61</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hall</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Holmes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pfahringer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reutemann</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
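          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          <article-title>The WEKA Data Mining Software: An Update</article-title>
          .
          <source>SIGKDD Explorations</source>
          ,
          <year>2009</year>
          . Vol.
          <volume>11</volume>
          (
          <issue>1</issue>
          ).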
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>John</surname>
            ,
            <given-names>G.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Langley</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Estimating Continuous Distributions in Bayesian Classifiers</article-title>
          .
          <source>In: Eleventh Conference on Uncertainty in Artificial Intelligence</source>
          , San Mateo,
          <fpage>338</fpage>
          -
          <lpage>345</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Keerthi</surname>
            ,
            <given-names>S.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shevade</surname>
            ,
            <given-names>S.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhattacharyya</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Murthy</surname>
            ,
            <given-names>K.R.K.</given-names>
          </string-name>
          <article-title>Improvements to Platt's SMO Algorithm for SVM Classifier Design</article-title>
          .
          <source>Neural Computation</source>
          ,
          <year>2001</year>
          .
          <volume>13</volume>
          (
          <issue>3</issue>
          ):
          <fpage>637</fpage>
          -
          <lpage>649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Krallinger</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Leitner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valencia</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <article-title>The BioCreative II.5 challenge overview</article-title>
          .
          <source>Proceedings of the BioCreative II.5 Workshop 2009 on Digital Annotations</source>
          <year>2009</year>
          ,
          <volume>19</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Setiono</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <article-title>Chi2: Feature selection and discretization of numeric attributes</article-title>
          .
          <source>Proc. IEEE 7th International Conference on Tools with Artificial Intelligence</source>
          ,
          <fpage>388</fpage>
          -
          <lpage>391</lpage>
          ,
          <year>1995</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Pablo-Sánchez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>A linguistic rule-based approach to extract drug-drug interactions from pharmacological documents</article-title>
          .
          <source>BMC Bioinformatics</source>
          , March
          <year>2011</year>
          . Vol.
          <volume>12</volume>
          (
          <issue>Suppl 2</issue>
          ):
          <fpage>S1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martínez</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Pablo-Sánchez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Using a shallow linguistic kernel for drug-drug interaction extraction</article-title>
          .
          <source>Journal of Biomedical Informatics</source>
          , In Press, Corrected Proof, Available online 24 April
          <year>2011</year>
          . DOI: 10.1016/j.jbi.2011.04.005.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Webb</surname>
            ,
            <given-names>G.I.</given-names>
          </string-name>
          <article-title>MultiBoosting: A Technique for Combining Boosting and Wagging</article-title>
          .
          <source>Machine Learning</source>
          ,
          <year>2000</year>
          . Vol.
          <volume>40</volume>
          (
          <issue>2</issue>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>