<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Vocabulary-Independent Generation Framework for DBpedia and beyond</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ben De Meester</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anastasia Dimou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wouter Maroy</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dimitris Kontokostas</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ruben Verborgh</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Lehmann</string-name>
          <email>jens.lehmann@cs.uni-bonn.de</email>
          <email>jens.lehmann@iais.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Erik Mannens</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Hellmann</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fraunhofer IAIS</institution>
          ,
          <addr-line>Sankt Augustin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Ghent University - imec - IDLab, Department of Electronics and Information Systems</institution>
          ,
          <addr-line>Ghent</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Leipzig University - AKSW/KILT</institution>
          ,
          <addr-line>Leipzig</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Bonn, Smart Data Analytics Group</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The DBpedia Extraction Framework, the generation framework behind one of the Linked Open Data cloud's central hubs, has limitations that lead to quality issues in the DBpedia dataset. Therefore, we provide a new take on the Extraction Framework that yields a sustainable and general-purpose Linked Data generation framework by adopting a semantics-driven approach. The proposed approach decouples, in a declarative manner, the execution of extraction, transformation, and mapping rules. Among other benefits, this supports interchanging schema annotations, whereas the current DBpedia Extraction Framework is coupled to a single ontology and can only generate a dataset with a single semantic representation. In this paper, we shed more light on the added value of this aspect. We provide an extracted DBpedia dataset using a different vocabulary, and give users the opportunity to generate a new DBpedia dataset using a custom combination of vocabularies.</p>
      </abstract>
      <kwd-group>
        <kwd>DBpedia</kwd>
        <kwd>FnO</kwd>
        <kwd>Generation</kwd>
        <kwd>Linked Data</kwd>
        <kwd>RML</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The DBpedia Extraction Framework (DBpedia EF) extracts raw data from
Wikipedia and makes it available as Linked Data, forming the well-known and broadly
used DBpedia dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The majority of the DBpedia dataset is derived from
Wikipedia infobox templates, after being annotated with the DBpedia ontology<sup>5</sup> [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
The rules describing the generation of the DBpedia dataset from Wikipedia are
defined by a worldwide crowd-sourcing effort, maintained via the DBpedia
mappings wiki<sup>6</sup>, and executed by the DBpedia EF. Even though DBpedia is one of the central
      </p>
    </sec>
    <sec id="sec-2">
      <title>5 http://dbpedia.org/ontology/</title>
    </sec>
    <sec id="sec-3">
      <title>6 http://mappings.dbpedia.org/index.php/Main_Page</title>
      <p>
        interlinking hubs in the Linked Open Data cloud [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], its generation framework
has limitations that are reflected in the generated dataset [
        <xref ref-type="bibr" rid="ref10 ref8">8, 10</xref>
        ].
      </p>
      <p>A major issue is that schemas other than the DBpedia ontology cannot
be used to annotate Wikipedia pages. The DBpedia EF functions only with the
DBpedia ontology, e.g., the predicate depends on the ontology term used for a
certain attribute of an infobox. This occurs because the DBpedia EF selects the
corresponding parser based on where the mapping template is used and which
ontology term is selected, e.g., dbo:date triggers the Date parser.</p>
      <p>Thus, if an ontology term is not added to the DBpedia ontology, it cannot
be used. For instance, no predicate other than dbo:location may be used to
indicate an entity’s location, because otherwise no triples are generated.</p>
      <p>Other vocabularies, such as the schema.org<sup>7</sup> vocabulary, cannot be used unless
they are imported into the DBpedia ontology or the DBpedia EF is adjusted<sup>8</sup>,
because, otherwise, it will not recognize the vocabulary’s properties.</p>
      <p>Similarly, a different data type can be assigned depending on which mapping
template and ontology term (predicate) are used. For instance, the area in
square kilometers generates an xsd:double, but also a DBpedia datatype
(dbo:areaTotal), depending on the predicate used.</p>
      <p>
        In this work, we use the semantics-driven, general-purpose, and more sustainable
framework that replaces the current DBpedia EF. It decouples extraction,
transformation, and mapping execution, and enables
generating high-quality Linked Data that is not limited to the DBpedia use case, as
presented in detail by Maroy et al. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. We specifically demonstrate how this work enables
us to easily make both small and large schema-level changes to the generated
DBpedia data without influencing the remainder of the generation process. The
demo is available at https://rmlio.github.io/dbpedia-ef-schema-demo/.
      </p>
      <sec id="sec-3-1">
        <title>Sustainable Linked Data Generation</title>
        <p>
          The current DBpedia EF is tightly coupled and custom-built, which hampers maintenance and
limits flexibility with respect to mapping, transformation, and used schema [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
The mapping rules are a custom solution coupled to the DBpedia EF and ontology.
Similarly, the data transformations are hard-coded, executed at different places
within the DBpedia EF, and coupled with the DBpedia ontology.
        </p>
        <p>
          To alleviate these limitations, the following requirements are proposed [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]:
(i) Declarative mapping rules that cover all generated RDF triples and that the
underlying implementation can interpret in each case, whether they refer
to schema or data transformations [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. (ii) Decoupled extraction,
transformation, and mapping, allowing different extraction strategies, transformation
libraries, or mapping rules without requiring adjustments to the underlying
implementation [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. (iii) A vocabulary-independent solution to annotate the
extracted data values, regardless of the preferred vocabulary. (iv)
Machine
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>7 http://schema.org</title>
    </sec>
    <sec id="sec-5">
      <title>8 As done for dcterms (http://purl.org/dc/terms/) and foaf (http://xmlns.com/foaf/0.1/)</title>
      <p>
processable mapping rules allow assessment not only for syntax, but also for
schema validation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] or automated mapping rules generation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        To address these requirements, we developed a solution, explained in detail
in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], that builds on the RDF Mapping
Language (RML) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. RML performs the schema transformations and is aligned
with FnO [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], which performs the data transformations. Both schema and data
transformations are thus covered by declarative, machine-processable rules instead
of coupled, hard-coded logic, while the wikitext extractor is a separate module, allowing for a
decoupled architecture. Most importantly, the rules are independent of
the vocabulary used.
      </p>
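      <p>As a rough illustration of this separation, the following Python sketch keeps extraction, data transformation, and schema annotation in three independent pieces. It is not the RML or FnO implementation: the names extract_infobox, strip_wiki_link, MappingRule, and generate, the ex: prefix, and the sample infobox value are all hypothetical and chosen only for this example.</p>
      <preformat>
```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]


def extract_infobox(raw_fields: Dict[str, str]) -> Dict[str, str]:
    """Extraction step: hypothetical stand-in for the separate wikitext extractor."""
    return {key.strip(): value.strip() for key, value in raw_fields.items()}


def strip_wiki_link(raw: str) -> str:
    """Data transformation (the role FnO-described functions play)."""
    return raw.strip("[]")


@dataclass(frozen=True)
class MappingRule:
    """Schema transformation (the role RML plays): which predicate annotates a field."""
    field: str
    predicate: str
    transform: Callable[[str], str]


def generate(subject: str, record: Dict[str, str], rules: List[MappingRule]) -> List[Triple]:
    """Apply every rule to the extracted record; no step is tied to one vocabulary."""
    return [
        (subject, rule.predicate, rule.transform(record[rule.field]))
        for rule in rules
        if rule.field in record
    ]


# Made-up infobox fragment, used only for illustration.
record = extract_infobox({" birth_place ": " [[London]] "})

# Same extraction and same transformation; only the predicate differs.
dbo_rules = [MappingRule("birth_place", "dbo:birthPlace", strip_wiki_link)]
schema_rules = [MappingRule("birth_place", "schema:birthPlace", strip_wiki_link)]

dbo_triples = generate("ex:Alan_Turing", record, dbo_rules)
schema_triples = generate("ex:Alan_Turing", record, schema_rules)
```
      </preformat>
      <p>In this sketch, switching vocabulary means supplying a different rule list; the extractor and the transformation function are reused unchanged.</p>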
      <sec id="sec-5-1">
        <title>Demo: Interchanging Schemas</title>
        <p>The mapping rules, which are described in RML, are RDF triples themselves. Thus,
they can be updated, automatically or manually, so that other semantic annotations can
be applied or other datasets can be generated from Wikipedia. Taking advantage
of this, and relying on the DBpedia mapping rules and the alignment of the DBpedia
ontology with schema.org<sup>9</sup>, we translated the RML mapping rules for DBpedia to
use schema.org and generate another RDF subgraph. More specifically, within the
original mapping document, predicates and classes from the DBpedia ontology
were replaced by predicates and classes from schema.org. No further changes
were required, neither to the mapping rules nor to the data transformations.</p>
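        <p>The replacement step can be sketched as a rewrite over the mapping triples. The Python example below is a minimal illustration under stated assumptions: the mapping excerpt, the ex: resource names, and the two-entry alignment table are invented for this example, and the triples are held as plain tuples rather than parsed with an RDF library. Only rr:class, rr:predicate, and rml:reference are actual R2RML and RML terms.</p>
        <preformat>
```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical excerpt of a mapping document, written as plain triples.
MAPPING: List[Triple] = [
    ("ex:PersonSubjectMap", "rr:class", "dbo:Person"),
    ("ex:BirthPlaceMap", "rr:predicate", "dbo:birthPlace"),
    ("ex:BirthPlaceMap", "rml:reference", "birth_place"),
]

# Assumed alignment, for illustration only; a real table would be derived
# from the alignment between the DBpedia ontology and schema.org.
ALIGNMENT: Dict[str, str] = {
    "dbo:Person": "schema:Person",
    "dbo:birthPlace": "schema:birthPlace",
}


def swap_vocabulary(mapping: List[Triple], alignment: Dict[str, str]) -> List[Triple]:
    """Replace only the objects of rr:class and rr:predicate statements."""
    schema_level = {"rr:class", "rr:predicate"}
    return [
        (s, p, alignment.get(o, o) if p in schema_level else o)
        for s, p, o in mapping
    ]


SCHEMA_ORG_MAPPING = swap_vocabulary(MAPPING, ALIGNMENT)
```
        </preformat>
        <p>Statements that describe where a value comes from, such as the rml:reference triple, pass through untouched, which mirrors the observation that no other part of the mapping needed to change.</p>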
        <p>We ran a new extraction over all 16,244,162 pages in
the English DBpedia that contained articles, templates, media/file descriptions,
and primary meta-pages. 191,288 Infobox_person instances were found and 1,026,143
RDF triples were generated. Indicatively, 179,037 RDF triples were generated with
the schema:name property, 54,664 with schema:jobTitle, 23,751 with schema:nationality,
144,907 with schema:birthPlace, and 139,488 with schema:birthDate. The RDF
dataset is available at http://mappings.dbpedia.org/person_schema.dataset.ttl.bz2,
and can be interactively queried and compared with the original DBpedia
dataset on https://rmlio.github.io/dbpedia-ef-schema-demo/.</p>
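        <p>Per-property counts of this kind can be obtained by tallying the predicate position of each triple. The following Python sketch assumes a serialization with one triple per line and whitespace-separated terms; the published Turtle file may not be laid out this way, and the sample lines are invented and use prefixed names for brevity. The function name count_predicates is hypothetical.</p>
        <preformat>
```python
from collections import Counter
from typing import Iterable

# Invented sample, one triple per line; not taken from the published dataset.
SAMPLE_LINES = [
    'ex:Alan_Turing schema:name "Alan Turing" .',
    'ex:Alan_Turing schema:birthPlace ex:London .',
    'ex:Ada_Lovelace schema:name "Ada Lovelace" .',
    '',
    '# comment line',
]


def count_predicates(lines: Iterable[str]) -> Counter:
    """Count how often each predicate occurs, skipping blanks and comments."""
    counts: Counter = Counter()
    for line in lines:
        if line.startswith("#"):
            continue
        # Split off subject and predicate; the rest (object and dot) stays together.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


predicate_counts = count_predicates(SAMPLE_LINES)
```
        </preformat>
        <p>Running the same tally over a dataset generated with the DBpedia ontology and one generated with schema.org gives a quick way to compare the two outputs property by property.</p>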
        <p>Furthermore, at https://rmlio.github.io/dbpedia-ef-schema-demo/, we
provide an interactive Web application that allows users to easily apply small and
large changes to the DBpedia mapping document to change the schema of the
resulting data. Presets are available to easily switch between annotations using
the DBpedia ontology or schema.org, but users are encouraged to make their own
changes and create hybrid solutions, or even use completely different ontologies
and vocabularies. Users can trigger the generation of RDF data based on their
applied changes to review their adjustments. The application demonstrates that
changes in the schema remain localized, e.g., changing the predicate does not
influence which data type is used or which function is executed, i.e., schema
transformations are decoupled from vocabulary and data transformations.</p>
        <p>In this demo paper, we showcase a generic and semantics-driven approach to
improve the Linked Data generation of the current DBpedia EF, as described in</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>9 http://schema.org</title>
      <p>
        detail in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], and provide a proof of concept to show the extended possibilities with
respect to schema transformations and how their changes remain decoupled from
the remainder of the generation framework. Most importantly, the generation
occurs independently of the vocabulary used to semantically annotate the RDF
dataset. This is evidenced by a new DBpedia dataset that has all person
resources mapped into RDF using schema.org instead of the DBpedia
ontology. This dataset can be interactively compared to the original DBpedia
dataset. Furthermore, an interactive application allows users to make small and
large changes in the schema when generating DBpedia data. They can easily
switch between different vocabularies used to generate DBpedia data, and create
custom schema transformations.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>An Ontology to Semantically Declare and Describe Functions</article-title>
          .
          <source>In The Semantic Web: ESWC 2016 Satellite Events</source>
          , volume
          <volume>9989</volume>
          <source>of LNCS</source>
          , pages
          <fpage>46</fpage>
          -
          <lpage>49</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>B.</given-names>
            <surname>De Meester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Maroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Mannens</surname>
          </string-name>
          .
          <article-title>Declarative data transformations for Linked Data generation: the case of DBpedia</article-title>
          .
          <source>In The Semantic Web - Latest Advances and New Domains (ESWC</source>
          <year>2017</year>
          ). Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Freudenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Mannens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          , and R. Van de Walle.
          <article-title>Assessing and Refining Mappings to rdf to Improve Dataset Quality</article-title>
          .
          <source>In The Semantic Web - ISWC</source>
          <year>2015</year>
          , volume
          <volume>9367</volume>
          <source>of LNCS</source>
          , pages
          <fpage>133</fpage>
          -
          <lpage>149</lpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Vander</given-names>
            <surname>Sande</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>RML: A Generic Language for Integrated rdf Mappings of Heterogeneous Data</article-title>
          .
          <source>In Proceedings of the 7th Workshop on Linked Data on the Web</source>
          , volume
          <volume>1184</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P.</given-names>
            <surname>Heyvaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-L.</given-names>
            <surname>Herregodts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurman</surname>
          </string-name>
          , E. Mannens, and R. Van de Walle.
          <article-title>RMLEditor: A Graph-based Mapping Editor for Linked Data Mappings</article-title>
          .
          <source>In The Semantic Web - Latest Advances and New Domains (ESWC</source>
          <year>2016</year>
          ), volume
          <volume>9678</volume>
          <source>of LNCS</source>
          , pages
          <fpage>709</fpage>
          -
          <lpage>723</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Isele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jakob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jentzsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          , P. van Kleef,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia</article-title>
          .
          <source>Sem Web</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>W.</given-names>
            <surname>Maroy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dimou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          , B. De Meester,
          <string-name>
            <given-names>R.</given-names>
            <surname>Verborgh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , E. Mannens, and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>Sustainable linked data generation: The case of DBpedia</article-title>
          .
          <source>In Proceedings of the 16th International Semantic Web Conference: In-Use Track</source>
          , Vienna, Austria, Oct.
          <year>2017</year>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Gangemi</surname>
          </string-name>
          .
          <article-title>Serving DBpedia with DOLCE - More than Just Adding a Cherry on Top</article-title>
          , pages
          <fpage>180</fpage>
          -
          <lpage>196</lpage>
          . Springer International Publishing,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schmachtenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the Linked Data Best Practices in Different Topical Domains</article-title>
          , pages
          <fpage>245</fpage>
          -
          <lpage>260</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>A.</given-names>
            <surname>Zaveri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kontokostas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Sherif</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bühmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Morsey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          .
          <article-title>User-driven Quality Evaluation of DBpedia</article-title>
          .
          <source>In Proceedings of the 9th International Conference on Semantic Systems</source>
          , pages
          <fpage>97</fpage>
          -
          <lpage>104</lpage>
          . ACM,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>