<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Drug-Drug Interaction Detection: A Machine Learning Approach With Maximal Frequent Sequence Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sandra Garcia-Blasco</string-name>
          <email>sandra.garcia@bitsnbrains.net</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Santiago M. Mola-Velasco</string-name>
          <email>santiago.mola@bitsnbrains.net</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roxana Danger</string-name>
          <email>rdanger@imperial.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Rosso</string-name>
          <email>prosso@dsic.upv.es</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Imperial College London</institution>
          ,
          <country country="UK">UK -</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NLE Lab. - ELiRF, DSIC, Universidad Polit ́ecnica de Valencia</institution>
          ,
          <country country="ES">Spain -</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>bitsnbrains S.L. and Universidad Polit ́ecnica de Valencia</institution>
          ,
          <country country="ES">Spain -</country>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>9</lpage>
      <abstract>
        <p>A Drug-Drug Interaction (DDI) occurs when the effects of a drug are modified by the presence of other drugs. DDIExtraction2011 proposes a first challenge task, Drug-Drug Interaction Extraction, to compare different techniques for DDI extraction and to set a benchmark that will enable future systems to be tested. The goal of the competition is for every pair of drugs in a sentence, decide whether an interaction is being described or not. We built a system based on machine learning based on bag of words and pattern extraction. Bag of words and other drug-level and character-level have been proven to have a high discriminative power for detecting DDI, while pattern extraction provided a moderated improvement indicating a good line for further research.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        4 These statistics cover only documents and sentences that contain, at least, one drug
pair.
&amp;
*
+
&amp;
,) ,
Even though the problem of DDI extraction is relatively new, some authors
have already presented approximations to solve it. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the author presents
two approximations to face the problem: a hybrid approach, combining shallow
parsing and matching of patterns described by a pharmacist; and an
approximation based on kernel methods that obtained better results that the hybrid
approach, reaching 55% precision and 84% recall.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] the authors propose a first approximation for DDI detection based
on automatically determining the patterns that identify DDI from a training
set. The patterns extracted were Maximal Frequent Sequences (MFS), based on
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this work, the identified MFS were used to determine whether a sentence
contains or not a description of a DDI, without identifying the pair of interacting
drugs. MFS have been useful in different tasks such as text summarization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
measuring text similarities [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and authorship attribution [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. MFS will also be
part of our approximation, and will be defined further on.
      </p>
      <p>
        Protein-Protein Interaction (PPI) extraction is an area of research very
similar to DDI extraction that has received a bigger attention from the scientific
comunity. The BioCreative III Workshop hosted two tasks of PPI document
classification and interaction extraction [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Some of the features present in a wide
range of participants were bag-of-words, bigrams, co-occurrences and character
ngrams. This kind of features will have a key role in our system. In [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] the authors
use patterns as one of their main features to extract PPI. In [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], the authors
use a hybrid approach with clustering and machine learning classification using
Support Vector Machines (SVM).
3
      </p>
    </sec>
    <sec id="sec-2">
      <title>Our System</title>
      <p>We built a system based on machine learning5, therefore we had to define a
feature set to estimate the model. Each sample is one possible interaction, this
is, each unique combination of two drugs appearing in a sentence of the corpus.
Given the small size of the corpus and the difficulty of properly estimating the
model, it was necessary to represent the features in a reduced space.</p>
      <p>The first step was to preprocess the corpus. For doing so, each sentence was
tokenized6 with standard English tokenization rules (e.g. split by spaces, removal
5 We used RapidMiner for every classification and clustering model. Available at http:
//rapid-i.com/.
6 The tokenization was performed with Apache Lucene. Available at http://lucene.
apache.org.
of apostrophes, conversion to lower case, removal of punctuation marks) with the
following particularities:
– Each group of tokens that represent a drug were replaced by #drug#.
– Numbers were replaced by num .
– Stop words were not removed.
– Stemming was applied7.
– Percentage symbols were preserved as independent tokens.</p>
      <p>In the following subsections, we will describe the different features used in
the system.
3.1</p>
      <sec id="sec-2-1">
        <title>Bag of Words</title>
        <p>From the set of all words appearing in the preprocessed corpus, we discarded
those with a frequency lower than 3 and stop words. With the resulting set of
words, we generated a dataset where each sample was a possible interaction in
the corpus and each feature was the presence or not of each word between the
two drugs of the potential interaction. Using this dataset, every word was ranked
using information gain ratio with respect to the label 8. Then, every word with
an information gain ratio lower than 0.0001 was discarded. The presence of each
of the remaining words was a feature in the final dataset. Finally, 1,010 words
were kept.</p>
        <p>Samples of words with a high gain ratio are: exceed, add, solubl, amphetamin,
below, lowest, second, defici, occurr, stimul and acceler.
3.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Word Categories</title>
        <p>
          In biomedical literature complex sentences are used very frequently. MFS and
bag of words are not able to capture relations that are far apart inside a
sentence. To somehow reflect the structure of the sentence, we defined some word
categories. This way, we can have some information about dependent and
independent clauses, coordinate and subordinate structures, etc. Some of these
categories were also included in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. We added two categories that include
absolute terms and quantifiers, as well as a category for negations. Table 2 enumerates
the words included in each category.
        </p>
        <p>For each word category we defined two features. One indicating how many
times the words in the category appeared in the sentence, and the other
indicating how many times they appeared between the two drugs of the potential
interaction.
7 The stemming algorithm used was Snowball for English. Available at http://
snowball.tartarus.org.
8 Information Gain Ratio was calculated using Weka. Available at http://www.cs.
waikato.ac.nz/ml/weka/.</p>
        <p>/
&amp;
*
+
&amp;
,) ,</p>
        <p>Category Words included
Subordinate after, although, as, because, before, if, since, though, unless, until,
whatever, when, whenever, whether, while.</p>
        <p>Independent markers however, moreover, furthermore, consequently, nevertheless, therefore.
Appositions like, including, e.g., i.e.</p>
        <p>Coordinators for, and, nor, but, or, yet, so.</p>
        <p>Absolute never, always.</p>
        <p>Quantifiers higher, lower.</p>
        <p>Negations no, not.
3.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Maximal Frequent Sequences</title>
        <p>
          Similar to bag of words, we used sequences of words as features. For this, we used
Maximal Frequent Sequences (MFS).9 Following [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], a sequence is defined as an
ordered list of elements, in this case, words. A sequence is maximal if it is not
a subsequence of any other, this is, if it does not appear in any other sequence
in the same order. Given a collection of sentences, a sequence is β-frequent if it
appears in at least β sentences, where β is the defined frequency threshold. MFS
are all the sequences that are β-frequent and maximal.
        </p>
        <p>We extracted all the MFS from the training corpus, with a β of 10 minimum
length of 2. Given the size of the corpus, sometimes very long MFS have no
capability to generalize knowledge because they sometimes represent full sentences,
instead of patterns that should be frequent in a kind of sentence. To avoid this,
we restricted the MFS to a maximum length of 7 words. With this, we obtained
1.010 patterns. In order to reduce the feature space we calculated clusters of
MFS.</p>
        <p>
          Clusters were calculated with the Kernel K-Means algorithm [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], using
radial kernel, with respect to the relative frequency of each MFS in the following
contexts: a) sentences, b) sentences containing an interaction, c) MFS appearing
between two drugs, c) MFS appearing before the first drug of an interaction
and d) MFS appearing after the last drug of an interaction. Clustering helped
to avoid pattern redundancy. This was necessary because some patterns could
be considered equivalent since they only differed in one or a few words not
relevant in the context of DDI. We obtained 274 clusters. Each of this clusters is a
feature of the final dataset which is set to 1 if, at least, one of the MFS of the
cluster matches with the potential interaction. The matching algorithm is shown
in Algorithm 1.
3.4
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>Token and Char Level Features</title>
        <p>At the token and char level, several features were defined. We must recall that,
during preprocessing, every token or group of tokens labeled as drugs where
replaced by the token #drug#. Table 3 describes this subset of features. Each
one of these features appears twice in the final dataset, once computed on the
9 We used a proprietary library by bitsnbrains, http://bitsnbrains.net.
,
else
Algorithm 1: MFS matching algorithm.</p>
        <p>Input: mf s, sentence, drug1index, drug2index
Output: match
startT hreshold ← 0
endT hreshold ← 0
if ”#drug#” ∈ mf s then
startT hreshold ← First index of ”#drug#” in mf s
endT hreshold ← length(mf s)− last index of ”#drug#” in mf s
startIndex ← drug1index − startT hreshold
if startIndex &lt; 0 then</p>
        <p>startIndex ← 0
endIndex ← drug2index + endT hreshold
if endIndex &gt; length(sentence) then</p>
        <p>endIndex ← length(sentence)
textBetweenDrugs ← Substring of sentence from index startIndex to
endIndex
if mf s is subsequence of textBetweenDrugs then
match ← 1
match ← 0
whole sentence and once computed only in the text between the two drugs of
the potential interaction.
With the features defined so far, we have not taken into account the two drugs
of the potential interaction. We believe this is important in order to have more
information when deciding wether if they interact or not.</p>
        <p>For each document, we calculated the main drug as the drug after which the
document was named, this is, the name of the article of the DrugBank database
where the text was extracted from. In the case of scientific articles, the main
drug would be calculated as the drug or drug names appearing in the title of the
article, if any. Also for each document, we calculated the most frequent drug as
the token labeled as drug that appeared more times in the document.
&amp;
*
+
&amp;
,) ,</p>
        <p>
          We noticed that, sometimes, drugs are referred to using their trade names. To
ensure good treatment of drugs in the drug level features, we replaced each trade
name with the original drug name10. Table 4 describes the drug level features.
During preliminary research, we explored the performance of a wide range of
classification models, notably Support Vector Machines, Decision Trees and
multiple ensemble classifiers such as Bagging, MetaCost and Random Forests [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
Our best choice was Random Forest with 100 iterations and 100 attributes per
iteration.
4
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>We evaluated our model with standard performance measures for binary
classification: Precision (P), Recall (R) and F-Measure (F). For each label, our model
outputs a confidence value. In order to decide the label, we define a confidence
threshold above which the decision will be positive and below which it will be
negative. A quick way to visualize every possible set up of the system is the
PR curve, where P and R are ploted for different confidence thresholds.
Analogously, we can plot F-Measure and confidence thresholds to visualize the
optimum threshold with respect F-Measure. AUC-PR is defined as the area under
the PR curve. AUC-PR is a very stable measure to compare binary classification
models.</p>
      <p>We are evaluating the performance of our system for the test set, with and
without MFS. Figure 1 shows PR and F curves for both settings. The PR curves
are convex, which makes the decision of an optimum threshold much easier and
less risky. Table 5 shows Precision, Recall, F-Measure, AUC-PR, precision at
recall 0.8 and recall at precision 0.8 for test with MFS.
10 Trade names were extracted from the KEGG DRUG database, from the Kyoto
Encyclopedia of Genes and Genomes. Available at http://www.genome.jp/kegg/
drug/
,</p>
      <p>MFS improve moderately the performance of the system, increasing about
0.02 in AUC-PR. We expected more influence of MFS. Patterns were extracted
using all sentences, even the ones that did not include any drug interaction. We
believe that this could have reduced the performance.</p>
      <p>n 0.6
o
ii
s
c
e
rP 0.4</p>
      <p>1
0.8
0.2
0</p>
      <p>Test
Test w/o MFS</p>
      <p>Submitted run</p>
      <p>Test</p>
      <p>Test w/o /MFS</p>
      <p>Submitted run
1
0.8
re 0.6
u
s
a
e
-FM 0.4
We presented a system for DDI extraction based on bag-of-words and Maximal
Frequent Sequences, as used for the DDIExtraction2011 competition. Our
submission obtained a F-Measure of 0.5829 and a AUC-PR of 0.6341 for the test
corpus. Our system can be set up to reach recall of 0.3205 with a precision of
0.8, or precision of 0.4309 and a recall 0.8. The use of MFS increased AUC-PR
by 0.02.</p>
      <p>One of the main problems we have encountered is the complexity of the
language structures used in biomedical literature. Most of the sentence contained
appositions, coordinators, etc. Therefore it was very difficult to reflect those
structures using MFS. The reduced size of the corpus is also a serious limitation
for our approach.</p>
      <p>Our system should be improved by complementing it with other
state-ofthe-art techniques used in the PPI field that have not been explored yet during
our participation, such as character n-grams and co-occurrences. It could also
be improved by extracting MFS with reduced restrictions and improving the
clustering step.</p>
      <p>Acknowledgments. This work has been done in the framework of the VLC/
CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
Contributions of first and second authors have been supported and partially funded
by bitsnbrains S.L. Contribution of fourth author has been partially funded
by the European Commission as part of the WIQEI IRSES project (grant no.
269180) within the FP 7 Marie Curie People Framework, by MICINN as part of
the Text-Enterprise 2.0 project (TIN2009-13391-C04-03) within the Plan I+D+i.
Computational resources for this research have been kindly provided by Daniel
Kuehn from Data@UrService.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Martinez</surname>
          </string-name>
          , P.,
          <string-name>
            <surname>de Pablo-Sanchez</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Using a shallow linguistic kernel for drug-drug interaction extraction</article-title>
          .
          <source>J. Biomed. Inform</source>
          . In Press, Corrected Proof,
          <source>Available online 24 April</source>
          <year>2011</year>
          , DOI 10.1016/j.jbi.
          <year>2011</year>
          .
          <volume>04</volume>
          .005.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Segura-Bedmar</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Application of Information Extraction techniques to pharmacological domain: Extracting drug-drug interactions</article-title>
          .
          <source>PhD thesis</source>
          , UC3M, Madrid, Spain (April
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Garc</surname>
          </string-name>
          <article-title>´ıa-</article-title>
          <string-name>
            <surname>Blasco</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Danger</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Drug-Drug Interaction Detection</surname>
          </string-name>
          :
          <article-title>A New Approach Based on Maximal Frequent Sequences</article-title>
          .
          <source>SEPLN</source>
          <volume>45</volume>
          (
          <year>2010</year>
          )
          <fpage>263</fpage>
          -
          <lpage>266</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Ahonen-Myka</surname>
          </string-name>
          , H.:
          <article-title>Discovery of Frequent Word Sequences in Text</article-title>
          .
          <source>In: Pattern Detection and Discovery</source>
          . Volume
          <volume>2447</volume>
          of LNCS., London, UK, Springer (
          <year>2002</year>
          )
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Garc´ıa, R.A.:
          <article-title>Algoritmos para el descubrimiento de patrones secuenciales maximales</article-title>
          .
          <source>PhD thesis</source>
          , INAOE, Mexico (
          <year>September 2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Garc</surname>
          </string-name>
          <article-title>´ıa-</article-title>
          <string-name>
            <surname>Blasco</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          : Extraccio´n de secuencias maximales de una coleccio´n de textos.
          <source>Final degree project, UPV</source>
          , Valencia,
          <source>Spain (December</source>
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Coyotl-Morales</surname>
            ,
            <given-names>R.M.</given-names>
          </string-name>
          , Villasen˜or Pineda,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Montes-y</surname>
          </string-name>
          <string-name>
            <surname>G</surname>
          </string-name>
          ´omez,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>Authorship Attribution using Word Sequences</article-title>
          .
          <source>In: Proc. 11th Iberoamerican Congress on Pattern Recognition</source>
          ,
          <string-name>
            <surname>CIARP</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Volume 4225 of LNCS</article-title>
          ., Springer (
          <year>2006</year>
          )
          <fpage>844</fpage>
          -
          <lpage>853</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Arighi</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , et al., eds.
          <source>: Proceedings of BioCreative III Workshop</source>
          , Bethesda, MD USA (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Sullivan</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miller</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tari</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baral</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gonzalez</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Efficient extraction of protein-protein interactions from full-text articles</article-title>
          .
          <source>IEEE/ACM Trans. Comput. Biology Bioinform</source>
          .
          <volume>7</volume>
          (
          <issue>3</issue>
          ) (
          <year>2010</year>
          )
          <fpage>481</fpage>
          -
          <lpage>494</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Bui</surname>
            ,
            <given-names>Q.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Katrenko</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sloot</surname>
            ,
            <given-names>P.M.A.</given-names>
          </string-name>
          :
          <article-title>A hybrid approach to extract proteinprotein interactions</article-title>
          .
          <source>Bioinformatics</source>
          <volume>27</volume>
          (
          <issue>2</issue>
          ) (
          <year>2011</year>
          )
          <fpage>259</fpage>
          -
          <lpage>265</lpage>
          Code available at http://staff.science.uva.nl/˜bui/PPIs.zip.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Zhang</surname>
          </string-name>
          , R.,
          <string-name>
            <surname>Rudnicky</surname>
            ,
            <given-names>A.I.:</given-names>
          </string-name>
          <article-title>A Large Scale Clustering Scheme for Kernel K-Means</article-title>
          .
          <source>In: 16th Conference on Pattern Recognition</source>
          . Volume
          <volume>4</volume>
          ., Los Alamitos, CA, USA, IEEE Computer Society (
          <year>2002</year>
          )
          <fpage>289</fpage>
          -
          <lpage>292</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.: Random</given-names>
          </string-name>
          <string-name>
            <surname>Forests</surname>
          </string-name>
          .
          <source>Machine Learning 45(1)</source>
          (
          <year>2001</year>
          )
          <fpage>5</fpage>
          -
          <lpage>32</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>