<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Web technologies for a knowledge base of biomedical facts extracted from scientific literature</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maria Biryukov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Valentin Groues</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christophe Trefois</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Venkata Satagopam</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Reinhard Schneider</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg</institution>, <addr-line>Esch-Belval</addr-line>, <country country="LU">Luxembourg</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Biomedical literature, including scientific articles, public health reports and books, is becoming increasingly available thanks to massive digitization. Exploring and analyzing this rich source of data requires the assistance of automatic tools capable of dealing with large volumes of text. We are developing a pipeline for processing publicly available biomedical text (abstracts, full-text articles, conference proceedings and, eventually, books and electronic health records), covering every step from searching the web and downloading raw files to extracting concepts (entities) and semantic relations between them and storing these in a knowledge base. The goal is to create a biomedical knowledge base publicly available for both human and machine access (SPARQL endpoint and REST API).</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Objective</title>
      <p>For the extraction of biomedical entities from the literature, we rely first on
Reflect (http://reflect.ws), a named entity recognition engine, to identify
biomedical concepts in the text. GeniaTagger is used to obtain basic morphological
and syntactic information. The latter is complemented by the
Stanford Syntactic Parser, which converts sentences into syntactic trees representing
dependencies between the words. Combined with a set of rules and dedicated
patterns, this information allows us to derive a semantic interpretation of sentences
and to extract meaningful relationships between the concepts.</p>
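      <p>As a rough illustration of this rule-based step, the following minimal sketch matches subject/object dependency edges against recognized entities. The rule, the entity tags and the example sentence are our own simplified assumptions, not the actual pipeline code:</p>

```python
def extract_relations(dependencies, entities):
    """Find (agent, predicate, target) triples in which both arguments
    are recognized biomedical entities.

    dependencies: list of (relation, head, dependent) tuples, as produced
                  by a dependency parse of the sentence
    entities:     dict mapping word -> entity type, as produced by an NER step
    """
    relations = []
    # For each predicate, collect its nominal subject and direct object.
    subjects = {head: dep for rel, head, dep in dependencies if rel == "nsubj"}
    objects = {head: dep for rel, head, dep in dependencies if rel == "dobj"}
    for verb, subj in subjects.items():
        obj = objects.get(verb)
        if obj and subj in entities and obj in entities:
            relations.append((subj, verb, obj))
    return relations

# Toy parse of "TP53 activates CDKN1A" (entity tags are illustrative):
deps = [("nsubj", "activates", "TP53"), ("dobj", "activates", "CDKN1A")]
ents = {"TP53": "gene", "CDKN1A": "gene"}
print(extract_relations(deps, ents))  # [('TP53', 'activates', 'CDKN1A')]
```

      <p>The real pipeline applies many such patterns over the full dependency trees; this sketch only shows the general shape of the matching.</p>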
    </sec>
    <sec id="sec-2">
      <title>Semantic Web technologies</title>
      <p>An ontology has been created to represent the biomedical events extracted from
the literature. Examples of extracted biomedical events include, among others,
protein interactions, chemical effects and gene expression. Named graphs are
used to add metadata about those events (e.g. scoring). The events are stored in
a triple store (OpenLink Virtuoso). A hierarchy of biomedical relationships has
been defined, and the reasoning capabilities of Virtuoso are used to apply
this hierarchy dynamically at query time. From the 1 million publications processed so far,
11 million events have been extracted, resulting in about 150 million triples. A
first prototype of a web-based visualization tool has been created for browsing this
knowledge base.</p>
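      <p>The effect of using the relationship hierarchy at query time can be sketched as follows: a query for a general relation also matches events stored under its more specific sub-relations, analogous to rdfs:subPropertyOf entailment in the triple store. The relation names and the hierarchy below are illustrative assumptions, not the actual ontology:</p>

```python
# Illustrative subsumption hierarchy: child relation -> parent relation.
HIERARCHY = {
    "phosphorylates": "modifies",
    "ubiquitinates": "modifies",
    "modifies": "interacts_with",
    "binds": "interacts_with",
}

def sub_relations(relation):
    """All relations subsumed by `relation`, including itself."""
    result = {relation}
    changed = True
    while changed:  # propagate until no new sub-relation is added
        changed = False
        for child, parent in HIERARCHY.items():
            if parent in result and child not in result:
                result.add(child)
                changed = True
    return result

# A query for "interacts_with" events also covers the more specific ones:
print(sorted(sub_relations("interacts_with")))
# ['binds', 'interacts_with', 'modifies', 'phosphorylates', 'ubiquitinates']
```

      <p>In the actual system this expansion is performed by Virtuoso's reasoner rather than in application code, so the hierarchy can evolve without re-writing stored triples.</p>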
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>