<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Context Aware Source Selection for Linked Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barbara Catania</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanna Guerrini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beyza Yaman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Genoa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>24</fpage>
      <lpage>27</lpage>
      <abstract>
        <p>The traditional Web is evolving into the Web of Data, which gathers huge collections of structured data over distributed, heterogeneous data sources. Live queries are needed to get current information out of this global data space. In live query processing, source selection allows the identification of the sources that most likely contain relevant content. Due to the semantic heterogeneity of the Web of Data, however, it is not always easy to assess relevancy. Context information might help in interpreting the user's information needs. In this paper, we discuss how context information can be exploited to improve source selection.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In the last decade, significant efforts have been devoted to the development of the Web of Data, i.e., a Web in which not only documents, but also data, can be first-class citizens. A huge amount of Linked Data has been published to construct a global data space based on open standards [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. However, it is still difficult to benefit from data in the Web of Data in an effective way. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralized repository. Given that crawling and indexing are time-consuming operations, data in the centralized index may be out of date at query execution time. On the other hand, an ideal query answering system for live querying [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] should return current answers in a reasonable amount of time, even on corpora as large as the Web.
      </p>
      <p>
        Since in the Linked Data context there is a large number of small-size sources, the main problems in processing queries in live query systems become (i) finding the sources containing triples that can contribute to the overall query answer, and (ii) efficiently fetching triples from these sources, possibly in a parallel way. The first process is referred to as source selection [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and requires knowledge of the data source content, with the aim of returning a ranked list of sources, ordered by relevance. Lightweight data summaries for determining relevant sources during query evaluation have been proposed and are exploited for this task [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Such data summaries are approximate multidimensional index structures that store descriptions of the content of data sources.
      </p>
      <p>The diversity of publishers, however, results in data fulfilling an information need being stored in many different sources and described in many different ways. The semantic enrichment of data, indeed, does not solve the problem of identifying relevant information with respect to a given domain; rather, it causes a shift of heterogeneity issues from the syntactic to the semantic level. By exploiting a more abstract notion of context, we can better interpret the request and identify the sources most relevant to the user's information needs.</p>
      <p>
        Context may capture different properties of data, users, or user-data interactions, and may come from the data themselves, the querying device (e.g., ambient information), as well as from the execution environment (e.g., history of previous queries) [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ]. Whatever additional information it captures and relies on, matching data not only against the user request but also against a reference context associated with that request can help in determining the portion of the entire information space that is most relevant to the user. Contexts have been used, e.g., in offering personalized services and recommendations and in improving data selection [
        <xref ref-type="bibr" rid="ref10 ref2 ref8">2, 8, 10</xref>
        ]. Surprisingly, however, as far as we know, no approach has yet been defined that exploits context information for improving the Linked Data source selection process.
      </p>
      <p>
        This paper goes in this direction and faces the problem of effective and efficient source selection for Linked Data, with the aim of proposing techniques that automatically discover the sources providing the most relevant answers, by exploiting a domain-dependent context hierarchy (as part of the more general framework envisioned in [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]). In this paper, we consider RDF data sources and SPARQL queries and present: (i) two distinct extended data summaries enabling context-aware source selection (Section 2); (ii) an instantiation of the approach with a specific context taxonomy (Section 3); (iii) a preliminary experimental evaluation of the proposed structures on such an instantiation (Section 4). Future work directions are discussed in Section 5.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Extended Data Summaries for Context-Aware Source Selection</title>
      <p>
        In this section, we first briefly introduce the data summary our approach relies on, i.e., the QTree [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and the two extensions we propose to enable context-aware source selection. Both approaches rely on: (i) a context taxonomy (C, ⪯), where C is the set of contexts and ⪯ is a subconcept relationship among them; (ii) a distance function d : C × C → [0, 1] measuring the distance between contexts in the taxonomy; (iii) a function γ for associating contexts with triples and queries. (C, ⪯) and γ can come with the data or be (semi)automatically extracted from them, while d is data independent and is provided as an input parameter for the proposed techniques. Additional details on the proposed approaches can be found in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>Data Summaries: QTree</title>
        <p>
          Data summaries [
          <xref ref-type="bibr" rid="ref11 ref6">6, 11</xref>
          ] have been proposed as an approach for efficiently determining which sources may contribute answers to a query in live distributed query systems. They describe the data provided by the sources in a summarized form. We specifically refer to the QTree indexing structure [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], which is a combination of R-trees and multidimensional histograms. The first step is to transform the RDF triples provided by the source into numerical points in a 3-dimensional space. Hash functions are applied to the individual components of RDF triples (i.e., subject, predicate, object) to obtain 3 numerical values for each triple. The employed hash functions, relying on a prefix-hashing approach, are such that triples with similar content are assigned close points in the multidimensional space, so that they are clustered close to each other. This 3-dimensional space is then indexed by a hierarchical data structure (the QTree, shown in Fig. 1b) in which nodes represent multidimensional regions, through minimal bounding boxes (MBBs). The region of a node always covers all the regions of its children. Fig. 1a illustrates the 2-dimensional data points and regions indexed by the QTree (a 2-dimensional space is used for the sake of simplicity). [Fig. 1: (a) data points and regions; (b) region hierarchy and buckets.]
        </p>
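        <p>The prefix-hashing step described above can be sketched as follows; prefix_hash and triple_to_point are hypothetical names, and the actual hash functions of [6] differ in detail, but the sketch shows the key property: strings sharing a prefix map to numerically close values.</p>
        <preformat>
```python
# Illustrative prefix hash (not the exact function of [6]): the value is
# built from the first characters of the string, most-significant first,
# so strings sharing a prefix land close together on the axis.
def prefix_hash(term, width=3):
    """Map a string to an integer using its first `width` characters."""
    h = 0
    for ch in term[:width]:
        h = h * 256 + (ord(ch) % 256)
    # pad short strings so every value occupies the same range
    h *= 256 ** max(0, width - len(term))
    return h

def triple_to_point(s, p, o):
    """Hash subject, predicate, object into a point in the 3D space."""
    return (prefix_hash(s), prefix_hash(p), prefix_hash(o))
```
        </preformat>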
        <p>The leaves of the QTree (referred to as buckets) keep statistical information about the data items contained in the respective region. In Fig. 1b, leaves contain the number of points within the corresponding region. To summarize the sources, the number of triples contributed by each source is maintained inside the bucket as a set of pairs ⟨source, number of triples⟩. This set is kept ordered by number of triples, so that sources contributing the highest number of triples appear first. Thus, the bucket corresponding to a region R contains a pair ⟨s, card({t | hash(t) ∈ R, origin(t) = s})⟩ for each source s containing at least one triple which is hashed to a point in region R. Here, origin(t) denotes the data source (an RDF set of triples available on the Web) containing t, obtained by dereferencing the subject of t.</p>
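        <p>The bucket statistics just described can be sketched as follows; the names and the flat region identifiers are illustrative (the real QTree stores these pairs inside its leaf nodes), but the per-source counting and the ordering by contribution match the description above.</p>
        <preformat>
```python
from collections import defaultdict

# Sketch: count the triples each source contributes per region and keep,
# for every region, the (source, count) pairs ordered largest-first.
def build_bucket_stats(hashed_triples):
    """hashed_triples: iterable of (region_id, source) pairs, one per triple."""
    counts = defaultdict(lambda: defaultdict(int))
    for region, source in hashed_triples:
        counts[region][source] += 1
    return {region: sorted(per_source.items(), key=lambda kv: -kv[1])
            for region, per_source in counts.items()}
```
        </preformat>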
        <p>When a SPARQL query is executed, each basic graph pattern (BGP) is
converted into a set of coordinates by applying the hash functions to the pattern
elements. BGPs containing variables are represented by regions spanning the
whole range of hash values for the respective dimension. Then, the buckets
containing sources contributing to the query are discovered by looking for overlapping
MBBs in the QTree. Buckets returned by each BGP search are then intersected
to identify joins. Sources contained in the resulting buckets are ranked based on
an estimation of the number of triples contributing to the query result (computed
taking into account the bucket region and the size of the intersection between
bucket and query regions).
</p>
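        <p>The overlap search and cardinality-based ranking for a single BGP can be sketched as follows. This is a simplified estimate, assuming axis-aligned regions given as (lo, hi) intervals per dimension; the actual QTree estimation is more refined, but the idea of scaling a source's triple count by the overlapping fraction of the bucket region is the same.</p>
        <preformat>
```python
# Sketch of source ranking for one BGP: find buckets whose region overlaps
# the query region, then estimate each source's contribution by scaling its
# triple count with the overlapping fraction of the bucket volume.
def overlap(r1, r2):
    """Overlap volume of two axis-aligned boxes given as (lo, hi) per dim."""
    v = 1.0
    for (lo1, hi1), (lo2, hi2) in zip(r1, r2):
        side = min(hi1, hi2) - max(lo1, lo2)
        if side > 0:
            v *= side
        else:
            return 0.0  # disjoint in this dimension
    return v

def volume(r):
    v = 1.0
    for lo, hi in r:
        v *= (hi - lo)
    return v

def rank_sources(buckets, query_region):
    """buckets: list of (region, [(source, count), ...]) leaf entries."""
    scores = {}
    for region, pairs in buckets:
        frac = overlap(region, query_region) / volume(region)
        if frac == 0:
            continue
        for source, count in pairs:
            scores[source] = scores.get(source, 0.0) + count * frac
    return sorted(scores.items(), key=lambda kv: -kv[1])
```
        </preformat>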
      </sec>
      <sec id="sec-3-2">
        <title>Context-Aware Source Selection via QTree</title>
        <p>The first approach we propose does not modify QTree internal nodes and exploits context information in the ranking phase only. Indeed, each source in a bucket is associated not only with the number of triples contributed by the source inside the bucket but also with the set of related contexts. Thus, buckets now contain triples ⟨source, set of contexts, number of triples⟩. More specifically, the bucket corresponding to a region R contains a triple ⟨s, ⋃<sub>t: hash(t) ∈ R, origin(t) = s</sub> γ(t), card({t | hash(t) ∈ R, origin(t) = s})⟩ for each source s containing at least one triple which is hashed to a point in region R.</p>
        <p>We assume now that a SPARQL query comes together with a reference context. The query representation and the index search do not change but, once the intersecting buckets are discovered and the contained sources retrieved, they are ranked according to: (i) context distance, computed as the aggregate (based on either the AVG, MIN, or MAX function) of the distances d in the context taxonomy between any context associated with the source and the query context; (ii) number of triples, defined as in Subsection 2.1. These two indicators are normalized in [0, 1] and combined in a linear way.</p>
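        <p>A minimal sketch of this ranking step follows, assuming d already returns values in [0, 1]; the aggregation function and the weight alpha of the linear combination are free parameters, as in the description above, and the names are illustrative.</p>
        <preformat>
```python
# Sketch of the Ext QTree ranking step: combine (normalized) aggregate
# context distance and (normalized) triple count linearly.
def rank_with_context(entries, query_ctx, d, agg=min, alpha=0.5):
    """entries: list of (source, set_of_contexts, triple_count)."""
    dists = [agg(d(c, query_ctx) for c in ctxs) for _, ctxs, _ in entries]
    counts = [n for _, _, n in entries]
    max_n = max(counts) or 1
    scored = []
    for (source, _, n), dist in zip(entries, dists):
        # lower distance and higher count are better; d is already in [0, 1]
        score = alpha * (1 - dist) + (1 - alpha) * (n / max_n)
        scored.append((source, score))
    return sorted(scored, key=lambda kv: -kv[1])
```
        </preformat>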
      </sec>
      <sec id="sec-3-3">
        <title>Modelling the Context as a Fourth Dimension</title>
        <p>The second, alternative, approach models the context as a fourth dimension of the data space (in addition to subject, predicate, and object), thus extending the QTree data structure. This increases the dimensionality of the data summary, but it also increases the pruning potential and thus, potentially, the effectiveness of the indexing structure.</p>
        <p>Each context is mapped to a number, so that close numbers reflect similar contexts. This is achieved by an appropriate node numbering in the tree corresponding to the context taxonomy. This way, similar data will still be clustered into the same bucket. Thus, the hash function, first applied to triples and returning a point in the 3D space, is now extended, leading to function hash4, so that a quadruple consisting of a triple t plus one of its contexts in γ(t) is hashed to a point in the 4D space. The structure of internal nodes does not change with respect to the original QTree, while 4D QTree buckets keep lists of ⟨source, context⟩ pairs with the number of triples they contribute to a bucket. Thus, assuming R is the 4D region associated with a bucket, the bucket contains a set of pairs ⟨⟨s, c⟩, card({t | hash4(⟨t, c⟩) ∈ R, c ∈ γ(t)})⟩.</p>
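        <p>The extension can be sketched as follows; ctx_number stands for the taxonomy node numbering mentioned above, and all names are illustrative rather than the paper's actual implementation.</p>
        <preformat>
```python
from collections import Counter

# Sketch: contexts are numbered so that close numbers mean similar contexts
# (e.g. via an ordered visit of the taxonomy tree); hash4 then extends the
# 3D triple point with the context number as a fourth coordinate.
def hash4(point3d, context, ctx_number):
    """Map a (triple, context) quadruple to a point in the 4D space."""
    return point3d + (ctx_number[context],)

def bucket_pairs(entries):
    """Count triples per (source, context) pair, as kept in 4D QTree buckets.
    entries: iterable of (source, context) pairs, one per indexed quadruple."""
    return dict(Counter(entries))
```
        </preformat>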
        <p>Queries and the associated contexts are mapped to four-dimensional regions, similarly to what happens for BGPs in the original QTree, but taking into account the context as the fourth component. Index search now selects buckets that intersect a region representing BGP+context instead of only the BGP. Ranking is performed similarly to what we discussed in Section 2.2, but ranking each ⟨source, context⟩ pair in a bucket separately.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Context-Modeling Using SKOS</title>
      <p>Different context models [1, ?] lead to different instantiations of the proposed approaches. In the following, as a concrete example, we discuss context modeling by means of a SKOS concept taxonomy.</p>
      <p>
        SKOS. SKOS (Simple Knowledge Organization System) is a W3C recommendation for defining, publishing, and sharing concepts over the Web, and it is used for knowledge organization. Concepts are entity categories identified by URIs. Though not providing semantic and reasoning capabilities as powerful as ontologies, concepts enhance the potential of search capabilities. Moreover, mappings between concepts from different concept schemes and in diverse data sources, and a hierarchical organization of concepts, are provided. The fundamental element of SKOS is the (skos:Concept) class, whose instances, identified by URIs, represent domain concepts. Properties (skos:broader) and (skos:narrower) are the inverse of each other and are used to assert a generalization/specialization relationship between two concepts [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. For instance, the RDF triple (:Palace skos:broader :TouristAttraction) states that the tourist attraction concept is broader (i.e., more general) than the palace concept. SKOS concepts (i.e., resources of type (skos:Concept)) are meant to be associated with resources through the (dct:subject) property.
      </p>
      <p>Context modeling through SKOS. We model each context as a SKOS concept. In order to create a domain-dependent context taxonomy, we started from a portion of the DBpedia English SKOS concepts and their hierarchical relationships as a seed. The portion consists of a subtree of the tree rooted at the concept Tourist attractions in Europe. Starting from the English DBpedia, resources related to contexts in the seed taxonomy through the dct:subject predicate, as well as triples describing them, are retrieved (via SPARQL queries posed to endpoints) and become part of our dataset (and they are associated with the corresponding concept as a context). Then, (owl:sameAs) links, both at the instance and at the concept level, are exploited for retrieving further concepts and instances.</p>
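      <p>A retrieval query of the kind described above might look as follows; the category URI and the exact query shape are illustrative examples, not necessarily the queries used for the dataset construction.</p>
      <preformat>
```sparql
# Fetch resources filed under the seed concept via dct:subject,
# together with the triples describing them.
PREFIX dct: <http://purl.org/dc/terms/>

SELECT ?resource ?p ?o WHERE {
  ?resource dct:subject
      <http://dbpedia.org/resource/Category:Tourist_attractions_in_Europe> .
  ?resource ?p ?o .
}
```
      </preformat>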
      <p>The set of collected SKOS concepts is finally organized into a taxonomy (C, ⪯) containing 1632 nodes (SKOS concepts), related by the (skos:broader) relationship. Nodes are then numbered through a visit of the tree, and the generated numbers are used for hashing. The distance d between two contexts is defined as the length of the shortest path connecting them in the taxonomy, normalized between 0 and 1. Contexts for triples (function γ) are defined as the contexts associated with their subjects or objects. The resulting dataset contains 1,086,874 triples from 13,466 RDF sources.</p>
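      <p>The shortest-path distance d can be sketched as follows, assuming a tree-shaped taxonomy given as a child-to-parent (skos:broader) map and a fixed normalizer max_path (e.g., twice the tree height); the names are illustrative.</p>
      <preformat>
```python
# Sketch of d: length of the shortest path between two concepts in the
# taxonomy tree (up to their lowest common ancestor), normalized to [0, 1].
def path_to_root(node, parent):
    """All ancestors of node, starting from node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def taxonomy_distance(a, b, parent, max_path):
    """parent: child -> parent map (skos:broader); max_path: normalizer."""
    pa = path_to_root(a, parent)
    pb = path_to_root(b, parent)
    depth_a = {node: depth for depth, node in enumerate(pa)}
    for steps_b, node in enumerate(pb):
        if node in depth_a:  # lowest common ancestor found
            return (depth_a[node] + steps_b) / max_path
    raise ValueError("concepts are not in the same taxonomy")
```
      </preformat>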
      <p>The SKOS reference is available at https://www.w3.org/TR/skos-reference/.</p>
    </sec>
    <sec id="sec-4b">
      <title>Experimental Evaluation</title>
      <p>[Table 1: creation parameters for QTree, Ext QTree, and 4D QTree.] The experiments were performed on a machine with an Intel Core i3-2350M 2.30GHz CPU and 6 GB of RAM, running 64-bit Windows 7 Professional. Applications were developed in Java (HotSpot 64-bit Server VM, Java SE 1.8). Starting from the dataset described in Section 3, we created the QTree, Ext QTree, and 4D QTree according to the values reported in Table 1. Notice that the maximum number of buckets (a parameter of the creation process) is higher for the 4D QTree, since the number of tuples to be indexed is higher in this case.</p>
      <p>In this paper, we report experimental results related to the execution of star-shaped queries, i.e., sets of triple patterns sharing the same variable in the subject position. Star-shaped queries with a variable number of joins, ranging from 0 to 7, were randomly generated, guaranteeing at least one result on the considered dataset. For each number of joins, 10 queries were created, each associated with a random context. Each query was executed 10 times.</p>
      <p>Selectivity and Performance. Fig. 2 compares the QTree, Ext QTree, and 4D QTree with respect to the number of returned buckets, sources, and execution time. [Fig. 2: (a) avg. number of buckets returned by BGP search; (b) avg. number of join buckets; (c) avg. number of returned sources; (d) avg. execution time.] From Fig. 2a we note that, thanks to the context information indexed together with the triples, the 4D QTree filters out more buckets during BGP search than the 3D approaches. The same number of buckets is always returned by QTree and Ext QTree, since the two data structures differ only in the information contained at the bucket level. On the other hand, the 4D QTree returns a higher number of buckets as the result of join execution (Fig. 2b) since, as shown in Table 1, setting the maximum number of buckets to 50,000 does not guarantee a good data distribution among buckets (the number of created buckets coincides with the maximum value), so many bucket regions intersect and are returned as results. The number of sources returned by the 4D QTree is, however, much lower compared with QTree and Ext QTree (Fig. 2c), thanks to context filtering during index search. As for average execution time (in milliseconds, Fig. 2d), as expected, it coincides for QTree and Ext QTree and is often higher for the 4D QTree, due to the higher number of bucket intersections to be computed.</p>
      <p>Context Accuracy. Accuracy is computed as the average distance between the contexts returned as results by the Ext QTree and 4D QTree index searches and the context associated with the query. In the Ext QTree, since each source is associated with a set of contexts, we first compute an aggregate distance for each source (based on either the AVG or MIN function) and then we average them all. (Experiments related to path-shaped queries are not reported due to space constraints.) Fig. 3 shows that the 4D QTree average context distance is always lower than the Ext QTree average context distance when using AVG (Fig. 3a). Combined with Fig. 2, this means that the 4D QTree always retrieves fewer sources, with a more relevant context on average, than the Ext QTree. When considering MIN (Fig. 3b), the Ext QTree performs better than the 4D QTree, since each data source returned by the Ext QTree is associated with all its related contexts, including the one leading to the minimum distance. This is not necessarily true for the 4D QTree.</p>
      <p>Impact of Ranking. We considered various ranking functions differing in the weights used in the linear combination. Aggregate context distances in the Ext QTree are computed using the MIN aggregate function. The performed experiments, not reported here for space constraints, show that the 4D QTree average ranking values computed over all the returned sources are lower than the corresponding Ext QTree ranking values. This is coherent with the previously presented results. When we consider the first k ranked sources, the 4D QTree ranking values are higher than the corresponding Ext QTree ones. Thus, the 4D QTree is able to select, among all the returned sources, "better" sources than the Ext QTree.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>In this paper, we proposed two approaches exploiting context information to increase source selection effectiveness and efficiency. This way of exploiting context is, to the best of our knowledge, novel. Experimental results show that the proposed context-based data summaries succeed in selecting the most relevant data sources with respect to the query and the associated context, with a reasonable overhead. Future work includes the design of algorithms supporting top-k search in the QTree and the extension of the proposed data structures with data source quality information, with the aim of selecting not only relevant but also accurate answers, by preferring trusted sources.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bettini</surname>
          </string-name>
          et al.
          <article-title>A survey of context modelling and reasoning techniques</article-title>
          .
          <source>Pervasive and Mobile Computing</source>
          ,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <fpage>161</fpage>
          -
          <lpage>180</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>C.</given-names>
            <surname>Bolchini</surname>
          </string-name>
          et al.
          <article-title>A data-oriented survey of context models</article-title>
          .
          <source>ACM SIGMOD Record</source>
          ,
          <volume>36</volume>
          (
          <issue>4</issue>
          ):
          <fpage>19</fpage>
          -
          <lpage>26</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          et al.
          <article-title>Wearable queries: Adapting common retrieval needs to data and users</article-title>
          . In DBRank,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>B.</given-names>
            <surname>Catania</surname>
          </string-name>
          , G. Guerrini, and
          <string-name>
            <given-names>B.</given-names>
            <surname>Yaman</surname>
          </string-name>
          .
          <article-title>Context-dependent quality-aware source selection for live queries on linked data</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghidini</surname>
          </string-name>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Giunchiglia</surname>
          </string-name>
          .
          <article-title>Local Models Semantics, or contextual reasoning = locality + compatibility</article-title>
          .
          <source>Artif. Intell.</source>
          ,
          <volume>127</volume>
          (
          <issue>2</issue>
          ):
          <fpage>221</fpage>
          -
          <lpage>259</lpage>
          ,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          et al.
          <article-title>Data summaries for on-demand Queries over Linked Data</article-title>
          .
          <source>In WWW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked data: Evolving the web into a global data space</article-title>
          .
          <source>Synthesis lectures on the semantic web: theory and technology</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>136</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>R.</given-names>
            <surname>Meusel</surname>
          </string-name>
          et al.
          <article-title>Towards automatic topical classification of LOD datasets</article-title>
          .
          <source>In CEUR Workshop Proceedings</source>
          . Vol.
          <volume>1409</volume>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>A.</given-names>
            <surname>Miles</surname>
          </string-name>
          et al.
          <article-title>SKOS Core: simple knowledge organisation for the web</article-title>
          .
          <source>In International Conference on Dublin Core and Metadata Applications</source>
          , pages
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>T.</given-names>
            <surname>Rekatsinas</surname>
          </string-name>
          et al.
          <article-title>Finding quality in quantity: The challenge of discovering valuable sources for integration</article-title>
          .
          <source>In CIDR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          et al.
          <article-title>Comparing Data Summaries for Processing Live Queries over Linked Data</article-title>
          .
          <source>World Wide Web</source>
          ,
          <volume>14</volume>
          (
          <issue>5-6</issue>
          ):
          <fpage>495</fpage>
          -
          <lpage>544</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>B.</given-names>
            <surname>Yaman</surname>
          </string-name>
          .
          <article-title>Exploiting Context-Dependent Quality Metadata for Linked Data Source Selection</article-title>
          .
          <source>PhD thesis</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>