<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Annotation of Drug Databases with the Anatomical Therapeutic Chemical Classi cation System</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Olivier Cure</string-name>
          <email>focureg@univ-mlv.fr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LIGM Universite Paris-Est</institution>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Anatomical Therapeutic Chemical (ATC) classi cation system is frequently used to classify drugs according to an encoding system that considers the organ or system on which they act on and/or their therapeutic class and chemical characteristics. Motivated by the maintenance of a French drug database, we have transformed, using an inductive approach, the chemical layer of this classi cation into an OWL knowledge base. Using this knowledge base, we have been able to design interesting novel application functionalities. Nevertheless, these functionalities require that drugs are annotated in a di erent way than is generally done right now. In this paper, we highlight several French databases that are not fully bene ting from their ATC usage, present our annotation best practice and emphasize on the gains of its adoption.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        We have developed a self-medication Web application that provides information
to the general public on mild clinical signs and associated Over The Counter
(OTC) Drugs. This application is based on a drug database that stores almost
all drugs sold in France. Being mainly used by end-users without some medical
knowledge, the data quality of the database is a crucial aspect of our application,
e.g., incorrect or missing contraindications on a drug description may impact the
health of the patient. We have implemented a set of tools that help to check the
data quality and assist the domain experts in a curation process when
information are updated [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Our experience with pharmaceutical information motivated
us to design these tools at the molecule level. That is, for each molecule of
interest, we store its characteristics in terms of dimensions such as contraindications,
side-e ects, molecule interactions, warnings, therapeutic classes. To design such
a knowledge base (KB), we have decided to adopt the Anatomical Therapeutic
Chemical (ATC) classi cation system due to its wide adoption in drug databases.
Note that several projects, e:g:, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], propose to map this classi cation with
other medical terminologies, e:g:, NDF-RT, and hence enable its integration in
even more information sources. Being solely an encoding system that considers
the organ or system on which they act on and/or their therapeutic class and
chemical characteristics, none of the information needed for our data quality
system are natively proposed in ATC. Hence, we have decided to take advantage
of the ATC encoding system to create a pratical pharmaceutical KB. This has
been performed using a probabilistic inductive approach on the data stored on
our drug database whose entities are annotated with identi ers of the ATC and
other classi cations, e.g. EphMRA. This approach is semi-automatic since it
requires the supervision of a medical expert to validate the information attributed
to a given molecule.
      </p>
      <p>To keep our database up-to-date with the release, suppression and
modication of drug products, we study the databases provided by several health
organizations in France, e.g., Agence National de Securite du Medicament et
des produits de sante (ANSM), and some over french databases, such as Vidal
and BCB (e.g., Banque Claude Bernard). After studying these databases and
their evolution for several years, we have a good understanding of their
annotation practice with the ATC system. We argue that the organization of the ATC
system motivates a form of annotation that prevents users from using its full
potential. This is mainly due to the large numbers of identi ers corresponding
to molecule combinations. This drawback not only restricts each drug database
provider to design novel functionalities but also precludes data integration with
other systems, e.g., linked data.</p>
      <p>This paper is organized as follows. In Section 2, we present the ATC
classication system. Then we detail our probabilistic inductive approach to design
an OWL KB from our drug database. In Section 4, we highlight some of the
bene ts of adopting our KB to reasoning about molecules and drugs containing
them. Section 5, introduces our simple best pratice in annotating drug products
with the ATC system. We conclude and provide some future work in Section 6.
2</p>
      <p>ATC
The ATC system (http://www.whocc.no/atcddd/) is an international classi
cation of drugs and is part of WHO's initiatives to achieve universal access to
needed drugs and their rational use. This classi cation takes the form of a tree
structure with ve di erent levels. The rst level of the code is represented with
a letter corresponding to one of the 14 prede ned anatomical groups. The
second level corresponds to a pharmacological/therapeutic subgroup. The third and
fourth levels are chemical/pharmacological/therapeutic subgroups. Finally, the
fth level corresponds to chemical substances and supports the classi cation of
drugs according to Recommended International Non-proprietary Names (rINN).
We now provide an extract from the ATC hierarchy for some cough suppressants:
R: Respiratory system</p>
      <p>R05: Cough and cold preparations</p>
      <p>R05D: Cough suppressants, excluding combinations with expectorants
R05DA: Opium alkaloids and derivatives
...</p>
      <p>R05DA08 Pholcodin
R05DA09 Dextromethorphan</p>
      <p>The current version of ATC classi cation contains 5,658 identi ers out of
which 4,406 are leaves and are hence supposed to identify molecules. Among
these leaves, 16,3% do not identify a single molecule but rather describe a
combination of molecules, e.g. R05DA20 in our previous example. Of course,
annotating a drug with a 'combination' identi er does not provide much information
on the active principle composing it.
3</p>
    </sec>
    <sec id="sec-2">
      <title>Inductive creation of the ATC ontology</title>
      <p>
        In this section, we detail our probabilistic inductive approach to transform the
ATC classi cation into an OWL KB. This transformation has been performed
using the DBOM system [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a tool we have designed to exchange data between
a relational database and a Semantic Web compliant KB. The result is an OWL
DL ontology where each ATC identi er is transformed into an OWL concept,
owl:disjointWith properties are declared mutually between all sibling concepts
and the following properties are set between cascading concepts: rdfs:subClassOf
for levels 5 and 4, 3 and 2, hasGroup for levels 4 and 3, hasSystem between levels
2 and 1.
      </p>
      <p>In the following extract of our ATC ontology, we provide the description
associated with the Pholcodine concept, with ATC code R05DA08. On line 1, we
can see that this concept is identi ed by a given URI with a local name
corresponding to R05DA08 ATC code. Line 2 de nes the concept associated with
the R05DA code (Opium alkaloids and derivatives) to be a super concept of this
concept. On lines 3 and 4, we present examples of disjointWith properties
between sibling concepts, only the rst and last concepts are displayed for brevity.
Line 5 states a comment in the french language.
1. &lt;owl:Class rdf:about="&amp;p1;R05DA08"&gt;
2. &lt;rdfs:subClassOf rdf:resource="&amp;p1;R05DA"/&gt;
3. &lt;owl:disjointWith rdf:resource="&amp;p1;R05DA01"/&gt;
...
4. &lt;owl:disjointWith rdf:resource="&amp;p1;R05DA20"/&gt;
5. &lt;rdfs:comment xml:lang="fr"&gt;Pholcodine
6. &lt;/rdfs:comment&gt;
7. &lt;/owl:Class&gt;</p>
      <p>Starting from this OWL ontology, we can now perform an enrichment by
inductive reasoning using our drug database. This database takes the form of
a star schema where the central table stores information on general drug
information, e.g., name, price, and is identi ed with a drug french identi er (CIP).
This table is also related, using one-to-many or many-to-many associations of
the Entity-Association logic model, to all kinds of medical information, e.g.,
side-e ects, therapeutic classes, contraindications, including some ATC
identiers. The method aims to cluster relevant groups of products, generated using
the ATC hierarchy. Intuitively, we navigate in the hierarchy of concepts and
create groups of products for each level, using the ATC to drug relation in our
database. Then, for each group we study some speci c domains which
correspond to elds in SPCs (Summary of Product Characteristics), e.g.
contraindications, and for each possible value in these domains we calculate the ratio of
this value occurences on the total number of elements of the group. Table 1
proposes an extract of the results for the concepts of the respiratory system
and the contraindication domain. This table highlights that our self medication
database contains 56 antitussives (identi ed by code R05D ), which are divided
into 44 plain antitussives products (R05DA) and 12 antitussives in combinations
(R05DB ). For the contraindication identi ed by the number 76, i.e. allergy to
one of the product's constituants, we can see that a ratio of 1 has been
calculated for the group composed of the R ATC code. This means that all 152
products (100 %) of this group present this contraindication. We can also stress
that for this same group, the breast-feeding contraindication (#9) has a ratio of
48 %, this means that only 72 products out the 152 of this group present this
constraints.</p>
      <p>We now consider this ratio as a con dence value for a given concept on
the membership of a given domain's value. This membership is materialized
in the ontology with the association of an concept to a property, e.g. the
hasContraIndication property, that has the value of the given contraindication, e.g.
breast-feeding (#9). In our approach, we only materialize memberships when the
con dence values are superior to a prede ned threshold , in the contraindication
example we set to 60%.</p>
      <p>This membership is only related to the highest concept in the concept
hierarchy and inherited by its sub-concepts. For instance, the breast feeding
contraindication (#9) is associated to the R05 concept as its con dence value (83%)
is the rst column on line with contraId 9 that presents a superior to 60% in
the R hierarchy. Also, the pregnancy contraindication (#21) is related to the
R05DB ATC concept since its value is (73%).</p>
      <p>Using this simple approach, we are able to enrich the ATC ontology with
axioms related to several elds of SPCs. At the end of this enrichment phase, the
expressiveness of the newly generated ontology still corresponds to OWL DL.</p>
      <p>This method can easily be applied to other drug related classi cations, e.g.
EphMRA, as soon as we consider that the ontology is presented in a DL
formalism and a relation relates CIPs to identi ers of this ontology.
4</p>
    </sec>
    <sec id="sec-3">
      <title>Features supported by the Knowledge Base</title>
      <p>In this section, we highlight three important functionalities that have been
implemented using our medical KB. The rst two functionalities enable to improve
the data quality of the drug database while the last one supports data integration
to over KBs.</p>
      <p>The rst one consists in reasoning at the molecule level over updates
performed on the drugs of our database. That is, whenever a drug composed of a
molecule represented in our KB is inserted, updated or deleted in our database,
the system automatically checks whether this modi cations will produce a
consistent database instance. This is performed using a semantic query expansion
approach that produces SPARQL queries from inferences over the knowledge.</p>
      <p>The second feature is related to the automatic enrichment of a drug SCP
given the knowledge stored in our molecule KB. That is, our system
administrator can now insert a drug name and its composition and the system automatically
inserts properties associated to the molecules composing the drugs. We found
out that even in the case of compound drugs, this approach is relevant and
produces an important amount of the information expected from this drug's SPC.
Note that all of these data updates are double-checked by medical experts for
data quality reasons. Nevertheless, this semi-automatic approach enables us to
improve the soundness and completeness as well as save a lot of time on the
database maintenance issues.</p>
      <p>Finally, the design of this KB together with the ne-grained annotations of
drug databases supports a better data integration with other terminologies and
KBs. For instance, we have submitted a project at ISWC 2012's Semantic Web
Challenge ( nishing at the 3rd place) on the implementation of a self-medication
application using Linked Open Data data sets. In that Web application, we
were checking the consistency of the proposed information based on a mapping
between the Drugbank terminology and our molecule KB.
5</p>
    </sec>
    <sec id="sec-4">
      <title>ATC annotation best practice</title>
      <p>This section claims that organizations supporting the maintenance of a drug
database, at least in France, do not take full bene ts of annotating drug with
the ATC. For instance, the french ANSM organization maintains drug databases
containing 16,516 drug products out of which 13,358 contain a single molecule
(i.e., we do not consider excipients here which are generally not represented in
ATC, but that is another problem we are not considering in this paper). That is
3,159 drug products contain more than one molecule, representing 19,2% of the
total number products available in France. Most of these drugs are either not
encoded with the ATC or annotated with a 'combination' identi er. Note that
this proportion is also encountered in the drug databases maintained by Vidal
and BCB, two important medical providers in France.</p>
      <p>In order to create and maintain a KB as presented in Section 3 and
implement application features as presented in Section 4, one has to annotate drug
products with each molecule that is composing the drugs rather than a
combination that does not describe su ciently the product in question. For instance,
the R05DA20 code identi es compound chemical products combining opium
alkaloids with other substances. The Hexapneumine syrup is generally annotated
with this code but it is inaccurate since it does not quantify and qualify the
contained molecules. We argue for another approach where this drug is classi ed
by the conjunction of its compounds, i.e. pholcodin, chlorphenamin and
biclotymol which respectively correspond to R05DA08, R06AB04 and R02AA20. In
terms of a database modeling, this approach only requires to de ne one-to-many
relationship between a drug and its ATC identi ers rather than de ning a single
column that stores an ATC code in the drug table.
6</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we emphasized that providing a ne-grained annotation of drug
products with the ATC classi cation opens possibilities in developing
interesting functionalities in medical applications. This approach mainly invites drug
database producers to annotate with ATC identi ers corresponding to molecule
identi ers rather than with 'combinations' ones. This form of annotation has
been quite positive in our self-medication application and helped to improve the
overall data quality and reduce the time our health experts spend on curating
the database. As future work, we started to enrich our own molecule KB with
chemical products that are found as excipients in drug compositions. The idea
is to describe excipients that are known to have side-e ects or contraindications,
e.g. aspartam and phenylcetonuria. This extension to the molecule KB will help
in improving our data quality approach.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>O.</given-names>
            <surname>Cure</surname>
          </string-name>
          .
          <article-title>Improving the data quality of drug databases using conditional dependencies and ontologies</article-title>
          .
          <source>J. Data and Information Quality</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>3</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>O.</given-names>
            <surname>Cure</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Bensaid</surname>
          </string-name>
          .
          <article-title>Integration of relational databases into owl knowledge bases: demonstration of the dbom system</article-title>
          .
          <source>In ICDE Workshops</source>
          , pages
          <volume>230</volume>
          {
          <fpage>233</fpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>F.</given-names>
            <surname>Mougin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Burgun</surname>
          </string-name>
          , and
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <article-title>Comparing drug-class membership in atc and ndf-rt</article-title>
          .
          <source>In Proceedings of the 2Nd ACM SIGHIT International Health Informatics Symposium, IHI '12</source>
          , pages
          <fpage>437</fpage>
          {
          <fpage>444</fpage>
          , New York, NY, USA,
          <year>2012</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C. G.</given-names>
            <surname>Chute</surname>
          </string-name>
          .
          <article-title>Standardized drug and pharmacological class network construction</article-title>
          .
          <source>In MedInfo, page 1125</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>