<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Semantic Web-Based Approach for Harvesting Multilingual Textual Definitions from Wikipedia to Support ICD-11 Revision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guoqian Jiang</string-name>
          <email>jiang.guoqian@mayo.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Harold R. Solbrig</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christopher G. Chute</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Health Sciences Research, Mayo Clinic College of Medicine</institution>
          ,
          <addr-line>Rochester, MN</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In the beta phase of the 11th revision of the International Classification of Diseases (ICD-11), the World Health Organization (WHO) intends to accept public input through a distributed model of authoring, in which creating textual definitions for ICD categories is a core use case. In a previous study, Wikipedia was demonstrated to be a useful source of textual definitions of diseases. The objective of this study is to develop and evaluate a semantic web-based approach for harvesting multilingual textual definitions from Wikipedia to support ICD-11 revision and its public review. In a prototype implementation, we developed a semantic web service application known as LexReview that automates the harvesting process in a dynamic way by invoking and integrating three online web services: 1) WHO ICD-11 content services; 2) NCBO BioPortal annotation services; and 3) DBpedia SPARQL endpoint query services. The Simple Knowledge Organization System (SKOS) lexical and mapping properties are used to represent the harvested definitions. The LexReview service application could be extended to integrate textual definitions from other resources and subsequently be consumed by a review application to support ICD-11 revision.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        The 11th revision of the International Classification of Diseases
(ICD-11) was officially launched by the World Health
Organization (WHO) in March 2007 (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ). The beta phase of the
ICD-11 revision started in May 2012, and WHO intends to
accept public input through a distributed model of
authoring. An ICD-11 Beta Browser application has been
developed and released by WHO (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ). The browser provides
simple commenting functionality that allows domain
professionals to comment on existing content, and WHO
intends to introduce more social computing capabilities.
Lexical properties of ICD categories including titles,
synonyms, and textual definitions should be reviewed following
a standard and homogeneous terminological approach. The
provision of textual definitions has been regarded as one of
the important criteria for measuring the quality of a
terminology/ontology (
        <xref ref-type="bibr" rid="ref3">3</xref>
        ). In our previous study (
        <xref ref-type="bibr" rid="ref4">4</xref>
        ), we demonstrated
that the textual definitions from the Unified Medical
Language System (UMLS) (
        <xref ref-type="bibr" rid="ref5">5</xref>
        ), the formal definitions of the
Systematized Nomenclature of Medicine – Clinical Terms
(SNOMED CT) (
        <xref ref-type="bibr" rid="ref6">6</xref>
        ) and the linked open data (LOD) from
DBpedia (
        <xref ref-type="bibr" rid="ref7">7</xref>
        ) are potentially useful resources for supporting
ICD-11 textual definitions authoring. We argued that the
ICD-11 project might potentially take advantage of the
crowd-sourcing model of Wikipedia (
        <xref ref-type="bibr" rid="ref8">8</xref>
        ). Using this model,
each ICD-11 category would be seeded as a Wikipedia page
for public input and the definitions of ICD categories would
be harvested using DBpedia.
      </p>
      <p>The objective of this study is to develop and evaluate a
semantic web-based approach for harvesting multilingual
textual definitions from Wikipedia to support ICD-11 revision
and its public review. In a prototype implementation, we
developed a semantic web service application known as
LexReview that automates the harvesting process in a
dynamic way by invoking and integrating three
online web services: 1) WHO ICD-11 content services; 2)
NCBO BioPortal annotation services; and 3) DBpedia
SPARQL endpoint query services. The Simple Knowledge
Organization System (SKOS) lexical and mapping
properties are used to represent the harvested definitions.</p>
    </sec>
    <sec id="sec-2">
      <title>2 BACKGROUND</title>
      <sec id="sec-2-1">
        <title>2.1 WHO ICD-11 Content Model and Services</title>
        <p>
          An ICD-11 content model has been developed by WHO to
present the knowledge that underlies the definitions of an
ICD entity. The content model is composed of three layers:
a foundation component, a linearization component and an
ontological component (
          <xref ref-type="bibr" rid="ref9">9</xref>
          ). The foundation component
stores the full range of knowledge of all classification units
in ICD. The linearization component corresponds to the
classical print versions of ICD. The ontological component
provides references to formal definitions of terms and
relationships. The content model currently defines 13 main
parameters for describing an ICD category, one of which is
“Textual Definitions”.
        </p>
        <p>Recently, WHO proposed an ICD URI scheme for naming
ICD entities and supporting web services. The proposed base URI is
http://id.who.int, with
http://id.who.int/icd/schema as the prefix for the vocabulary
terms related to the ICD classification efforts maintained by
WHO, and http://id.who.int/icd/entity as the prefix for the
foundation entities related to ICD concepts.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 BioPortal Annotation Services</title>
        <p>
          The National Center for Biomedical Ontology Annotator is
an ontology-based web service for annotating textual
biomedical data with biomedical ontology concepts (
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ).
        </p>
        <p>
          The biomedical community can use the Annotator service to
tag datasets automatically with concepts from more than
300 ontologies coming from the two most important
biomedical ontology &amp; terminology repositories: the Unified
Medical Language System (UMLS) Metathesaurus and
NCBO BioPortal. Such annotations contribute to creating a
biomedical semantic web that facilitates translational
scientific discoveries by integrating annotated data. In this study,
the Annotator was configured to use the Medical Subject Headings (MeSH) (
          <xref ref-type="bibr" rid="ref12">12</xref>
          ) to annotate
the preferred labels of ICD-11 categories.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 DBpedia SPARQL Endpoint</title>
        <p>
          DBpedia is a crowd-sourced community effort to extract
structured information from Wikipedia and make this
information available on the Web (
          <xref ref-type="bibr" rid="ref7">7</xref>
          ). DBpedia adopts
Semantic Web Linked Open Data technology and its datasets
are rendered in RDF format and can be accessed online via a
public SPARQL query endpoint at http://dbpedia.org/sparql.
The endpoint is provided using OpenLink Virtuoso as the
back-end RDF database engine.
        </p>
        <p>DBpedia also defines an ontology to organize its datasets.
The ontology is a shallow, cross-domain ontology and
covers 359 classes that form a subsumption hierarchy and are
described by 1,775 different properties. In this study, we
used one of these classes, http://dbpedia.org/ontology/Disease,
and extracted all instances of the class to obtain textual
definitions.</p>
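        <p>As an illustration of how the instances of this class could be listed through the public endpoint, the following is a minimal Python sketch. It is not the query used in this study: the function names, the LIMIT/OFFSET paging, and the use of the standard SPARQL protocol query parameter are assumptions made for the example, and the sketch only builds the request without sending it.</p>
        <preformat><![CDATA[
```python
from urllib.parse import urlencode

# Public endpoint and class URI as named in the text.
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"
DISEASE_CLASS = "http://dbpedia.org/ontology/Disease"


def build_disease_instances_query(limit=100, offset=0):
    """Return a SPARQL query that lists instances of the Disease class.

    LIMIT/OFFSET paging is an illustrative assumption; public endpoints
    usually cap the number of rows returned per request.
    """
    return (
        "SELECT DISTINCT ?disease WHERE {\n"
        f"  ?disease a <{DISEASE_CLASS}> .\n"
        "}\n"
        "ORDER BY ?disease\n"
        f"LIMIT {int(limit)} OFFSET {int(offset)}"
    )


def build_request_url(query, endpoint=DBPEDIA_ENDPOINT):
    """Encode the query as the SPARQL protocol 'query' URL parameter."""
    return endpoint + "?" + urlencode({"query": query})


if __name__ == "__main__":
    # Print the URL of the first page of results (no request is sent).
    print(build_request_url(build_disease_instances_query(limit=10)))
```
]]></preformat>
        <p>The resulting URL could be fetched with any HTTP client; the result format would then be selected through the Accept header of the request.</p>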
      </sec>
      <sec id="sec-2-4">
        <title>2.4 Semantic Web Technologies</title>
        <p>
          The World Wide Web Consortium (W3C) is the main
standards body for the World Wide Web (
          <xref ref-type="bibr" rid="ref13">13</xref>
          ). The goal of the
W3C is to develop interoperable technologies and tools as
well as specifications and guidelines to lead the web to its
full potential. The Resource Description Framework (RDF),
Web Ontology Language (OWL), and SPARQL (a recursive
acronym for SPARQL Protocol and RDF Query Language)
specifications have all achieved the level of W3C
recommendations, and are becoming generally accepted and
widely used.
        </p>
        <p>
          The SKOS data model views a knowledge organization
system as a concept scheme comprising a set of concepts
(
          <xref ref-type="bibr" rid="ref14">14</xref>
          ). The vocabulary used in the SKOS data model is a set of
URIs that specifies the notion of SKOS concepts, concept
schemes, lexical labels, notations, documentation properties
and semantic relations. SKOS data are expressed as RDF
triples. An increasing number of SKOS datasets in RDF are
publicly available.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 SYSTEM ARCHITECTURE</title>
      <p>Figure 1 shows the system architecture of our approach. The
LexReview service application invokes and integrates
three web services: 1) WHO ICD-11 content
services for retrieving the preferred label and definition of a
target ICD entity; 2) NCBO BioPortal annotation services
for retrieving the MeSH term annotation and its ID; and 3)
DBpedia SPARQL endpoint query services for retrieving
textual definitions by MeSH ID.</p>
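      <p>The chaining of the three services described above can be sketched as follows. This is a minimal Python illustration, not the Java implementation of LexReview: the function name and the three callables passed in are hypothetical stand-ins for the remote services, so the control flow can be read and tested without network access.</p>
      <preformat><![CDATA[
```python
def harvest_definitions(icd_uri, fetch_entity, annotate, query_definitions):
    """Chain the three services for a single ICD entity URI.

    The callables are hypothetical stand-ins for the remote services:
      fetch_entity(uri)          -> {"label": str, "definition": str}
      annotate(label)            -> list of {"mesh_id": str, "uri": str}
      query_definitions(mesh_id) -> list of (language_tag, text) pairs
    """
    # Step 1: preferred label and definition of the target ICD entity.
    entity = fetch_entity(icd_uri)

    matches = []
    # Step 2: MeSH annotations for the preferred label.
    for annotation in annotate(entity["label"]):
        # Step 3: textual definitions looked up by MeSH ID.
        definitions = query_definitions(annotation["mesh_id"])
        matches.append({"mesh": annotation, "definitions": definitions})

    return {"uri": icd_uri, "entity": entity, "matches": matches}


if __name__ == "__main__":
    # Stub services with placeholder data, for illustration only.
    result = harvest_definitions(
        "http://id.who.int/icd/entity/718946808",
        lambda uri: {"label": "Angina pectoris", "definition": "placeholder"},
        lambda label: [{"mesh_id": "D000787", "uri": "placeholder"}],
        lambda mesh_id: [("en", "placeholder text")],
    )
    print(result["matches"])
```
]]></preformat>
      <p>Passing the services in as callables keeps each step replaceable, which is one way to accommodate additional definition sources later.</p>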
    </sec>
    <sec id="sec-4">
      <title>4 PROTOTYPE IMPLEMENTATION</title>
      <p>
        The LexReview service application was implemented using
Jersey (
        <xref ref-type="bibr" rid="ref15">15</xref>
        ), a Java-based JAX-RS API for RESTful web services, and Jena ARQ (
        <xref ref-type="bibr" rid="ref16">16</xref>
        ), a Java-based
query engine for Jena that supports the SPARQL RDF query
language.
      </p>
      <p>The service application accepts the standard URI of a single
ICD entity as input. For example, the URI
http://id.who.int/icd/entity/718946808 represents the ICD
entity Angina pectoris. Figure 2 shows the HTML rendering
of the ICD entity Angina pectoris displayed in a web
browser.</p>
      <p>The content of an ICD-11 entity can be accessed through
content negotiation, a mechanism used by RESTful
services that makes it possible to serve different
representations of a resource at the same URI. The WHO
ICD content services provide the content representation in
HTML, RDF and JSON formats. First, the system
retrieved the title and definition of a target ICD entity from
its RDF rendering, in which the SKOS lexical properties
skos:prefLabel and skos:definition are used to represent the
values.</p>
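      <p>The following minimal Python sketch illustrates this first step under stated assumptions. It requests an RDF representation by setting the HTTP Accept header and reads skos:prefLabel and skos:definition from an RDF/XML document. The media type application/rdf+xml is the standard one for RDF/XML, but whether the WHO service honors it, and the shape of the sample payload, are assumptions; the function names are hypothetical and the placeholder definition text is not real ICD content.</p>
      <preformat><![CDATA[
```python
import urllib.request
import xml.etree.ElementTree as ET

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def build_rdf_request(entity_uri):
    """Build a request asking for the RDF/XML representation of a URI."""
    return urllib.request.Request(
        entity_uri, headers={"Accept": "application/rdf+xml"}
    )


def extract_label_and_definition(rdf_xml):
    """Return (prefLabel, definition) found in an RDF/XML string.

    Either value is None when the corresponding element is absent.
    """
    root = ET.fromstring(rdf_xml)
    label = root.find(f".//{{{SKOS_NS}}}prefLabel")
    definition = root.find(f".//{{{SKOS_NS}}}definition")
    return (
        label.text if label is not None else None,
        definition.text if definition is not None else None,
    )


# Hypothetical payload for illustration; the real service output may differ.
SAMPLE_RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <rdf:Description rdf:about="http://id.who.int/icd/entity/718946808">
    <skos:prefLabel xml:lang="en">Angina pectoris</skos:prefLabel>
    <skos:definition xml:lang="en">Placeholder definition text.</skos:definition>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    print(extract_label_and_definition(SAMPLE_RDF))
```
]]></preformat>
      <p>Against a live service, the request object would be passed to urllib.request.urlopen and the response body handed to the extraction function.</p>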
      <p>
        Second, the system invoked NCBO BioPortal annotation
services using the title of a target ICD entity as the input.
The annotation services were configured to use the ontology
MeSH only and the semantic types within the semantic
group Disorders (
        <xref ref-type="bibr" rid="ref17">17</xref>
        ) (see Table 1). The annotation services
provide a score for each annotation, a weight based
on the annotation context. In this prototype implementation,
we harvested only those annotations with a score of 10, meaning
that the annotation is a direct match with a concept's preferred
name. We then retrieved the MeSH ID, preferred name and
URI of each annotation.
      </p>
      <p>
Third, when the system had a MeSH term annotated for a
target ICD entity, the system invoked the DBpedia
SPARQL endpoint to retrieve the textual definitions of the
DBpedia entry coded with that MeSH ID. Figure 3 shows the
SPARQL query used to retrieve multilingual textual
definitions from the instance entries of the DBpedia class
“Disease” (i.e., http://dbpedia.org/ontology/Disease).
Here, we asserted that the values of the predicate
dbpedia:abstract are candidates for textual definitions. We used
the language tags as a filter to retrieve those textual
definitions in the six official languages adopted by the WHO (
        <xref ref-type="bibr" rid="ref18">18</xref>
        ), i.e.,
“ar” for Arabic, “zh” for Chinese, “en” for English,
“fr” for French, “ru” for Russian, and “es” for Spanish.
      </p>
      <p>
Finally, we represented the MeSH mapping based on
BioPortal annotation services and the multilingual textual
definitions retrieved for a target ICD category in RDF
format, in which the SKOS lexical and mapping properties
(skos:prefLabel, skos:definition, skos:closeMatch,
skos:exactMatch) are used. We then exposed the RDF
rendering through a RESTful service API. Figure 4 shows
an example of RDF rendering of multilingual textual
definitions for a target ICD-11 entity Angina pectoris. As
illustrated in the figure, we used the predicate
skos:closeMatch to represent the relationship between the
target ICD entity and its MeSH annotation
http://purl.bioontology.org/ontology/MSH/D000787. We
used the predicate skos:exactMatch to represent the
relationship between the MeSH annotation and the
DBpedia entry http://dbpedia.org/resource/Angina_pectoris
because they share the same MeSH ID. There are 11
definition entries in 5 languages available for the DBpedia
entry and the predicate skos:definition is used to represent
them. In addition, we included the original title and
definition of the target ICD entity in the RDF rendering
using the predicates skos:prefLabel and skos:definition.
The prototype implementation will be accessible soon
through
http://informatics.mayo.edu/rest/project/icd11/lexreview/definition?uri=http://id.who.int/icd/entity/718946808, in which
the uri parameter can be replaced by any other ICD entity
URI.
      </p>
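      <p>The query and rendering steps described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reproduction of the query in Figure 3 or the rendering in Figure 4: the function names are hypothetical, the dbpedia: prefix used in the text is assumed to denote the DBpedia ontology namespace, and attaching the harvested skos:definition values to the DBpedia entry is an assumption about the output layout.</p>
      <preformat><![CDATA[
```python
# Language tags of the six official WHO languages listed in the text.
WHO_LANGUAGES = ("ar", "zh", "en", "fr", "ru", "es")

# Assumption: the paper's dbpedia: prefix denotes this ontology namespace.
DBO = "http://dbpedia.org/ontology/"


def build_definition_query(mesh_id, languages=WHO_LANGUAGES):
    """Build a SPARQL query for abstracts of Disease entries with a MeSH ID."""
    if not mesh_id.isalnum():
        raise ValueError("unexpected characters in MeSH ID")
    lang_list = ", ".join(f'"{lang}"' for lang in languages)
    return (
        f"PREFIX dbo: <{DBO}>\n"
        "SELECT ?entry ?abstract WHERE {\n"
        "  ?entry a dbo:Disease ;\n"
        "         dbo:meshId ?mesh ;\n"
        "         dbo:abstract ?abstract .\n"
        f'  FILTER (str(?mesh) = "{mesh_id}")\n'
        f"  FILTER (lang(?abstract) IN ({lang_list}))\n"
        "}"
    )


def escape_literal(text):
    """Escape backslashes, quotes and line breaks for a Turtle string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def render_skos_turtle(icd_uri, label, mesh_uri, dbpedia_uri, definitions):
    """Render the mappings and (language, text) definitions as Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f'<{icd_uri}> skos:prefLabel "{escape_literal(label)}" ;',
        f"    skos:closeMatch <{mesh_uri}> .",
        f"<{mesh_uri}> skos:exactMatch <{dbpedia_uri}> .",
    ]
    for lang, text in definitions:
        lines.append(
            f'<{dbpedia_uri}> skos:definition "{escape_literal(text)}"@{lang} .'
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_definition_query("D000787"))
    print(render_skos_turtle(
        "http://id.who.int/icd/entity/718946808",
        "Angina pectoris",
        "http://purl.bioontology.org/ontology/MSH/D000787",
        "http://dbpedia.org/resource/Angina_pectoris",
        [("en", "Placeholder definition text.")],
    ))
```
]]></preformat>
      <p>Comparing the MeSH ID through str() avoids depending on whether the stored literal carries a language tag or datatype, which this sketch does not assume.</p>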
      <p>Table 2 lists examples of ICD-11 entities (n=10) that
have Wikipedia definition matches. The first column
shows the ICD-11 entity URI and its preferred label;
the second column shows the corresponding Wikipedia URI
matched by the system for each ICD-11 entity, together with the
codes of the available languages; the third column shows the
MeSH ID that serves as the anchor between an ICD-11 entity and a
Wikipedia entry. For each ICD-11 entity in Table 2,
Wikipedia definition entries are available in at least two
languages (range: 2 to 5 language codes). The first author of
the paper (GJ) reviewed all definition entries in Chinese
(n=5) available for the 10 ICD-11 entity examples and
concluded that the quality of the Chinese definitions is
reasonably good and that they could be useful for supporting ICD-11
multilingual definition authoring.</p>
    </sec>
    <sec id="sec-5">
      <title>5 DISCUSSION</title>
      <p>In this study, we developed a semantic web service
application that provides a dynamic way to harvest textual disease
definitions from Wikipedia to support ICD-11 textual
definition authoring and its public review. “Dynamic”
means that the service application always retrieves the
most current textual definitions stored in the DBpedia
dataset. We found that MeSH IDs (i.e., dbpedia:meshId) are
used to code the DBpedia entries under the class “Disease”,
which provides a good anchor for accessing the textual
definitions of a DBpedia entry. As of April 14, 2013, there were
5,126 entries under the class “Disease”, of which 2,809
(54.8%) had MeSH IDs annotated (covering 2,505
unique IDs). In total, 19,696 (71.5%) of 27,540 textual
definitions are available for those DBpedia disease entries
with MeSH IDs. In the future, we will build an approach to
match those DBpedia disease entries that do not have MeSH
IDs coded.</p>
      <p>To map a MeSH term to a target ICD entity, we
invoked the BioPortal annotation services. We used a
heuristic configuration that restricts the ontology to MeSH
only and limits the semantic types to those within the semantic
group Disorders. In our previous study, we used the UMLS
CUIs to convert the ICD-10 codes to MeSH IDs.
Considering that ICD-11 covers many new terms beyond the
ICD-10 terms, our approach in this prototype implementation
may provide better coverage, though a rigorous
evaluation would be needed in the future.</p>
      <p>
        In addition, we used SKOS lexical and mapping properties
to represent the annotations and harvested textual
definitions. The main reason is that the SKOS model provides a
set of semantic web-friendly signatures with well-defined
semantics, as we demonstrated in our previous study (
        <xref ref-type="bibr" rid="ref19">19</xref>
        ).
      </p>
      <p>
In summary, we developed a prototype of semantic web
RESTful services that automates the harvesting of multilingual
textual definitions from Wikipedia to support ICD-11 textual
definition authoring and its public review. The LexReview
service application could be extended to integrate textual
definitions from other resources and subsequently be consumed
by a review application to support ICD-11 revision. In the
future, we plan to evaluate the quality and usefulness of the
harvested multilingual definitions in collaboration with
the WHO ICD-11 revision community.
      </p>
    </sec>
    <sec id="sec-6">
      <title>ACKNOWLEDGEMENTS</title>
      <p>This work was supported in part by the SHARP Area 4:
Secondary Use of EHR Data (90TR000201).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1 WHO.
          <article-title>Revision of the International Classification of Diseases (ICD)</article-title>
          .
          <source>[cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.who.int/classifications/icd/ICDRevision/en/index.html
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2 WHO. ICD-11 Beta Browser.
          <source>[cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://apps.who.int/classifications/icd11/browse/f/en
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3
          <string-name>
            <surname>Smith</surname>
            <given-names>B</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ashburner</surname>
            <given-names>M</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosse</surname>
            <given-names>C</given-names>
          </string-name>
          , et al.
          <article-title>The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration</article-title>
          .
          <source>Nature biotechnology</source>
          . 2007 Nov;
          <volume>25</volume>
          (
          <issue>11</issue>
          ):
          <fpage>1251</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4
          <string-name>
            <surname>Jiang</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            <given-names>HR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            <given-names>CG</given-names>
          </string-name>
          .
          <article-title>Using semantic web technology to support ICD-11 textual definitions authoring</article-title>
          . ACM International Conference Proceeding Series;
          <year>2011</year>
          ;
          <year>2011</year>
          . p.
          <fpage>38</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>5 UMLS. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.nlm.nih.gov/research/umls/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>6 SNOMED CT. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.ihtsdo.org/snomed-ct/
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>7 DBpedia. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://dbpedia.org/About
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <source>8 Wikipedia. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://wikipedia.org/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>9 ICD-11 Information Model. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://informatics.mayo.edu/icd11model
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10
          <string-name>
            <surname>Jonquet</surname>
            <given-names>C</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            <given-names>NH</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Musen</surname>
            <given-names>MA</given-names>
          </string-name>
          .
          <article-title>The open biomedical annotator</article-title>
          .
          <source>Summit on translational bioinformatics</source>
          .
          <year>2009</year>
          ;
          <volume>2009</volume>
          :
          <fpage>56</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11 NCBO Annotator.
          <source>[cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.bioontology.org/annotator-service
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>12 MeSH. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.nlm.nih.gov/mesh/
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13 The World Wide Web Consortium (W3C).
          <source>[cited November 26</source>
          ,
          <year>2012</year>
          ]; Available from: http://www.w3.org/
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <source>14 SKOS. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.w3.org/TR/skos-primer/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15 Jersey API.
          <source>[cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://jersey.java.net/
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>16 Jena ARQ API. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://jena.apache.org/documentation/query/
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17
          <source>The UMLS Semantic Groups. [cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://semanticnetwork.nlm.nih.gov/SemGroups/
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18 WHO Multilingualism.
          <source>[cited April 14</source>
          ,
          <year>2013</year>
          ]; Available from: http://www.who.int/about/multilingualism/en/
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19
          <string-name>
            <surname>Jiang</surname>
            <given-names>G</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solbrig</surname>
            <given-names>HR</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chute</surname>
            <given-names>CG</given-names>
          </string-name>
          .
          <article-title>Building Standardized Semantic Web RESTful Services to Support ICD-11 Revision</article-title>
          .
          <source>ACM International Conference Proceeding Series</source>
          <year>2012</year>
          ;
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>