<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Linked Data Fusion in ODCleanStore</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Michelfeit</string-name>
          <email>michelfeit.jan@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tomas Knap</string-name>
          <email>tomas.knap@mff.cuni.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Charles University in Prague, Dept. of Software Engineering</institution>
          <addr-line>Malostranske nam. 25, 118 00 Prague</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As part of the LOD2 project and the OpenData.cz initiative, we are developing the ODCleanStore framework enabling management of Linked Data. In this paper, we focus on query-time data fusion in ODCleanStore, which provides data consumers with integrated views on Linked Data; the fused data (1) has conflicts resolved according to the preferred conflict resolution policies and (2) is accompanied by provenance and quality scores, so that consumers can judge the usefulness and trustworthiness of the data for their task at hand.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>* The work presented in this article has been funded in part by EU ICT FP7 under
No. 257943 (LOD2 project), the Czech Science Foundation (GACR, grant number
201/09/H057), and GAUK 3110.
1 http://richard.cyganiak.de/2007/10/lod/
2 http://opendata.cz, http://lod2.eu
3 To download the code, please visit http://sourceforge.net/p/odcleanstore
4 RDF triples can be extended to quads (s, p, o, g) where g is the named graph [3] to
which the data belongs. When talking about "data in the named graph g", we mean
all the quads (*, *, *, g).
Data feeds are sent to ODCS by any application registered in ODCS, e.g. by various extractors. Based on the
identifier of the feed, the appropriate transforming pipeline is launched; the pipeline
successively executes a defined (and customizable) set of transformers ensuring
that data in the processed feed is cleaned, resources are deduplicated and linked to
already existing resources in the clean database or in the Linked Open Data cloud,
data is enriched with new resources, arbitrarily transformed, and the quality of
the feed (graph score) is assessed. When the pipeline finishes, the augmented
RDF feed is populated to the clean database together with any auxiliary data
and metadata created during the pipeline execution, such as links to other
resources or metadata about the feed's graph score.</p>
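      <p>To make the pipeline concrete, the following is a minimal sketch (illustrative only, not the actual ODCS API; all names are assumed) of how a feed of quads can be passed through a customizable sequence of transformers:</p>
      <preformat>
```python
from typing import Callable, List, Set, Tuple

# A quad: (subject, predicate, object, named graph) -- see footnote 4.
Quad = Tuple[str, str, str, str]
Transformer = Callable[[Set[Quad]], Set[Quad]]

def run_pipeline(feed: Set[Quad], transformers: List[Transformer]) -> Set[Quad]:
    """Successively apply each transformer of the pipeline to the feed."""
    for transform in transformers:
        feed = transform(feed)
    return feed

# Example transformer: a trivial cleaner stripping whitespace from object literals.
def clean_literals(feed: Set[Quad]) -> Set[Quad]:
    return {(s, p, o.strip(), g) for (s, p, o, g) in feed}

feed = {("ex:Berlin", "rdfs:label", " Berlin ", "ex:feedGraph1")}
cleaned = run_pipeline(feed, [clean_literals])
```
      </preformat>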
      <p>Data consumers can query (via third-party applications) the clean database
to obtain data about a certain resource (e.g. a city, such as the German city
"Berlin"). Since the same resource can be described by various sources (feeds),
conflicts may arise when integrating data about that city. To solve this, ODCS
applies certain conflict resolution policies in the data fusion algorithm, which
resolve data conflicts in the resulting RDF data; these policies can be
customized by the consumer. Furthermore, the resulting integrated RDF data is
supplemented with provenance metadata (data origin) and quality scores of the
integrated quads, so that data consumers can judge the usefulness and
trustworthiness of the resulting data for their task at hand; the quality score is influenced
by the quality of the feed the triples originate from (graph score) and by the
applied conflict resolution policy [4]. The data fusion algorithm runs at query
time, because consumers in different situations can have different requirements
on the data.</p>
      <p>This paper briefly describes the data fusion algorithm in ODCS in Section 1;
the algorithm is fully described in [4]. The practical demonstration5 based on
the illustrative examples in Section 1 gives further insight into the workings of the
data fusion algorithm.</p>
      <p>To the best of our knowledge, there is just one other Linked Data fusion
software, Sieve, currently under development [5]. Sieve is part of the Linked Data
Integration Framework6. Differently from our approach, Sieve fuses data while
it is being stored to the clean database and not during the execution of queries; thus,
it provides no data fusion customization during data querying.
5 http://www.ksi.mff.cuni.cz/~knap/iswc12
6 http://www4.wiwiss.fu-berlin.de/bizer/ldif/</p>
    </sec>
    <sec id="sec-2">
      <title>Linked Data Fusion</title>
      <p>Suppose that the clean database of ODCS contains data about the German
city Berlin coming from multiple sources: DBpedia, GeoNames, and Freebase7.
Let us assume that Alice, a data consumer, is an investigative journalist who is
writing a story about Berlin; thus, she submits the keyword "Berlin" to the query
execution component of ODCS, and she would like to get all the information the
framework knows about Berlin, fused from the available sources.</p>
      <p>When fusing data, the data fusion algorithm in ODCS has to deal with data
conflicts, which happen when two quads have inconsistent object values for a
certain subject s and predicate p; such quads are called o-conflicting quads, and
the conflicting object values of these o-conflicting quads are called conflicting
values. The solution of the conflicts is prescribed by the conflict resolution
policies, which may be specified globally or per predicate. We distinguish two types
of conflict resolution policies: deciding and mediating. Deciding policies select
one or more values from the conflicting values, e.g., an arbitrary value (ANY),
the maximum value (MAX), the value with the highest quality (BEST), or all
conflicting values (ALL). Mediating policies compute a new value, e.g. an average
(AVG) of the conflicting values. For example, Alice may specify that she would like
to receive in the response all the distinct values for the subject representing
Berlin and the predicate rdf:type (deciding conflict resolution policy ALL). On the
other hand, she may want to compute for the same subject the average value (AVG)
of the values of the predicate geo:lat, select the value with the highest
quality (BEST) for rdfs:label of Berlin, and select the maximum value (MAX)
from the values of the predicate dbprop:populationTotal of Berlin.</p>
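      <p>The policies above can be illustrated with a small sketch (assumed names and data, not the ODCS implementation); each conflicting value carries a quality score, which the BEST policy consults:</p>
      <preformat>
```python
# Deciding vs. mediating conflict resolution policies over the conflicting
# object values of one (subject, predicate) pair.

def resolve(policy, values):
    """values: list of (object_value, quality) pairs."""
    if policy == "ANY":    # deciding: an arbitrary value
        return [values[0][0]]
    if policy == "MAX":    # deciding: the maximum value
        return [max(v for v, _ in values)]
    if policy == "BEST":   # deciding: the value with the highest quality
        return [max(values, key=lambda pair: pair[1])[0]]
    if policy == "ALL":    # deciding: all distinct conflicting values
        return sorted({v for v, _ in values})
    if policy == "AVG":    # mediating: compute a new (average) value
        return [sum(v for v, _ in values) / len(values)]
    raise ValueError(f"unknown policy: {policy}")

# Alice's example: conflicting geo:lat values for Berlin from three sources,
# each with an assumed quality score.
latitudes = [(52.5167, 0.9), (52.52, 0.8), (52.5186, 0.7)]
resolve("AVG", latitudes)   # mediating: one new averaged value
resolve("BEST", latitudes)  # deciding: the value backed by quality 0.9
```
      </preformat>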
      <p>
        When describing the data fusion algorithm within the execution of consumers'
queries in ODCS, we suppose that the typical pre-fusing processes [2], schema
mapping (the detection of equivalent schema elements in different sources) and
duplicate detection (the detection of equivalent resources), have already been done.
Therefore, we suppose that (1) proper mappings between ontology elements
are available in the master data database in Figure 1, e.g. that geo:lat and
fb:location.geocode.latitude are denoted as equivalent predicates holding
the latitude of Berlin, and (2) owl:sameAs links between resources representing the
same entity (the German city Berlin) were created by the proper transformers
(linkers) on the transforming pipeline.
      </p>
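      <p>One way to realize the owl:sameAs step (a sketch under our own assumptions, not the actual linker code) is to collapse the links into a single canonical URI per resource using a union-find structure:</p>
      <preformat>
```python
# Collapse owl:sameAs links into one canonical URI per resource (union-find).
# The choice of the lexicographically smallest URI as canonical is our own
# illustrative convention, not prescribed by ODCS.

def canonical_mapping(same_as_links):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_as_links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller URI becomes canonical
    return {x: find(x) for x in parent}

# The three identifiers of Berlin from footnote 7, linked by owl:sameAs.
links = [
    ("http://dbpedia.org/resource/Berlin", "http://sws.geonames.org/2950159/"),
    ("http://sws.geonames.org/2950159/", "http://rdf.freebase.com/ns/en.berlin"),
]
mapping = canonical_mapping(links)  # all three URIs map to one canonical URI
```
      </preformat>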
      <p>
        The input to the data fusion algorithm is (1) a collection of quads from
the clean database to be fused, i.e. the quads (x,*,*,*), (*,*,x,*), where x is the
URI representing Berlin in some source, (2) owl:sameAs links between URI
resources occurring in the quads (the output of the deduplication and schema mapping
pre-fusion processes), (3) data fusion settings (including the set of selected conflict
resolution policies), and (4) graph scores of the named graphs (feeds) from which
the quads originate. The output is a collection of fused quads enriched with data
quality and source named graphs for each fused quad.
7 Identifiers for the resource Berlin are: http://dbpedia.org/resource/Berlin,
http://sws.geonames.org/2950159/, http://rdf.freebase.com/ns/en.berlin
      </p>
      <p>
        The fusion algorithm firstly replaces URIs of resources representing the same
concept (i.e. connected by owl:sameAs links) with a single URI and removes
duplicate quads8. Consequently, quads are grouped into sets of comparable
quads, i.e. quads having the same subject and predicate; o-conflicting quads
form a subset of the corresponding comparable quads. For each set of comparable
quads, two steps (Step S1 and S2) are executed: Step S1 chooses and applies a
conflict resolution policy determined by the predicate of the comparable quads,
and Step S2 computes the quality of the quads resulting from Step S1. Multiple
real-world cases lead us to three factors influencing the computation of the quality
of the resulting fused quads (in Step S2): (1) graph scores of the source named
graphs containing the processed comparable quads, (2) the number of object values
within the set of comparable quads which agree on the same object value, and
(3) the difference between conflicting values of the comparable and o-conflicting
quads. Details of the quality computation are in [4].
      </p>
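      <p>Steps S1 and S2 can be sketched as follows; this is a simplified stand-in, not the algorithm of [4]: only the ALL and MAX policies are shown, and the quality formula combines just factors (1) and (2), graph scores and agreement, omitting factor (3):</p>
      <preformat>
```python
from collections import defaultdict

def fuse(quads, graph_scores, policies, default_policy="ALL"):
    """quads: iterable of (s, p, o, g); returns (s, p, o, quality, sources)."""
    # Group quads into sets of comparable quads (same subject and predicate).
    comparable = defaultdict(list)
    for s, p, o, g in set(quads):  # set(): duplicate quads already removed
        comparable[(s, p)].append((o, g))

    fused = []
    for (s, p), pairs in comparable.items():
        policy = policies.get(p, default_policy)
        values = [o for o, _ in pairs]
        # Step S1: apply the per-predicate conflict resolution policy.
        chosen = sorted(set(values)) if policy == "ALL" else [max(values)]
        for o in chosen:
            sources = [g for v, g in pairs if v == o]
            # Step S2: quality from (1) source graph scores and
            # (2) the share of object values agreeing on o.
            score = sum(graph_scores[g] for g in sources) / len(sources)
            agreement = values.count(o) / len(values)
            fused.append((s, p, o, score * agreement, sources))
    return fused

# Two sources disagree on the rdf:type of Berlin; ALL keeps both values,
# each scored by its source graph and the agreement among the sources.
quads = [("ex:Berlin", "rdf:type", "ex:City", "g1"),
         ("ex:Berlin", "rdf:type", "ex:Capital", "g2")]
fused = fuse(quads, {"g1": 0.9, "g2": 0.8}, {})
```
      </preformat>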
    </sec>
    <sec id="sec-3">
      <title>Conclusions</title>
      <p>This paper introduces the query-time data fusion algorithm in ODCleanStore,
the framework for managing Linked Data. The practical demonstration9 shows
the maturity of the algorithm and demonstrates its features: the application of
conflict resolution policies and the computation of the quality of the fused quads.
The full theoretical background behind the data fusion algorithm is in [4].
8 Quads having the same subject, predicate, object, and named graph.
9 http://www.ksi.mff.cuni.cz/~knap/iswc12</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems</source>
          ,
          <volume>5</volume>
          (
          <issue>3</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>J.</given-names>
            <surname>Bleiholder</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Naumann</surname>
          </string-name>
          .
          <article-title>Data fusion</article-title>
          .
          <source>ACM Comput. Surv.</source>
          ,
          <volume>41</volume>
          (
          <issue>1</issue>
          ):1:1-1:41, Jan.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Carroll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Hayes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Stickler</surname>
          </string-name>
          .
          <article-title>Named graphs, Provenance and Trust</article-title>
          .
          <source>In WWW '05: Proceedings of the 14th international conference on World Wide Web</source>
          , pages
          <fpage>613</fpage>
          -
          <lpage>622</lpage>
          , New York, NY, USA,
          <year>2005</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>T.</given-names>
            <surname>Knap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Michelfeit</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Necasky</surname>
          </string-name>
          .
          <article-title>Linked Open Data Aggregation: Conflict Resolution and Aggregate Quality</article-title>
          .
          <source>METHOD 2012: The 1st IEEE International Workshop on Methods for Establishing Trust with Open Data, COMPSAC</source>
          (to appear),
          <year>2012</year>
          . http://www.ksi.mff.cuni.cz/~knap/files/method.pdf
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , H. Muhleisen, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          . Sieve:
          <article-title>Linked Data Quality Assessment and Fusion</article-title>
          .
          <source>In 1st International Workshop on Linked Web Data Management (LWDM 2011) at the 15th International Conference on Extending Database Technology, EDBT</source>
          <year>2012</year>
          ,
          <article-title>March</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>