<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Approximating Faceted Search for Graph Queries</article-title>
      </title-group>
      <fpage>61</fpage>
      <lpage>76</lpage>
      <abstract>
        <p>We discuss the problem of implementing real-time faceted search interfaces over graph data, speci cally the \value suggestion problem" of presenting the user with options that makes sense in the context of a partially constructed query. For queries that include many object properties, this task is computationally expensive. We show that good approximations to the value suggestion problem can be achieved by only looking at parts of queries, and we present an index structure that supports this approximation and is designed to scale gracefully to both very large datasets and complex queries. In a series of experiments, we show that the loss of accuracy is usually minor.</p>
      </abstract>
      <kwd-group>
        <kwd>Faceted Search</kwd>
        <kwd>Visual Query Interface</kwd>
        <kwd>Index Structure</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Faceted search [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is a popular search and exploration paradigm which allows
users to apply search lters to multiple orthogonal dimensions (facets) of the
data. Filters can be added, removed or modi ed in any order, and every time
this is done, the system immediately updates the list of results, giving the user
instant feedback. To support this functionality, the system needs fast access to
the underlying data. This is often provided by specialised software which provides
better performance for the queries required by faceted search than standard triple
stores and RDB-based implementations.
      </p>
      <p>
        Faceted search has also been extended to semantics-based data, e.g. in
SemFacet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Rhizomer [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In these systems, datatype properties are treated
as facets, but they also include some ability to query the graph structure via
object properties. Combining faceted search with queries for graph patterns
(called \graph queries") results in more expressive queries, but implementing
faceted search also becomes computationally harder. The challenging task is to
update the set of available values for the facets after each user interaction. The
straightforward way of computing this set involves evaluating a query of similar
complexity to the whole query built so far, and that needs to be done for each
facet. For queries with large graph patterns, and large datasets, this can become
too time consuming for an interactive system. The usual approach of using a
search engine style index will not help, since these engines do not support graph
queries.
      </p>
      <p>
        Based on the visual query system OptiqueVQS [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], we have devised a system
that combines faceted search with graph queries, and that uses an indexing
structure for suggesting facet values that can easily be scaled out arbitrarily. In
return, it compromises some accuracy in computing sets of available facet values,
but in a highly con gurable manner.1
      </p>
      <p>In the present paper, we rst discuss the combination of faceted search and
graph queries, and introduce the concept of adaptive value suggestion. (Sect. 2).
In Sect. 3 we summarise our formal framework and formally introduce the value
suggestion problem that is the problem we attempt to solve. Several attempts
at solving this problem are described in Sect. 4: one is optimal from the user's
perspective but slow for large datasets and complex queries; the second is rather
inaccurate, but allows fast implementation; the third is a con gurable family of
intermediate (good enough and fast enough) solutions to the problem, based on
only looking at a part of the constructed query. In Sect. 6, we present a series
of experiments that shows that
1. good approximations to the value suggestion problem can often be reached
by taking into account only relatively small parts of the constructed query.
2. the accuracy of the approximations can often be improved dramatically by
including only the presence of required object properties in the index.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Visual Query Systems and Faceted Search</title>
      <p>2.1</p>
      <sec id="sec-2-1">
        <title>VQSs</title>
        <p>
          A Visual query system (VQS) presents a visual interface to users that allows
them to extract information from a structured data source, based on some
combination of lters and other requirements on the information to be retrieved.
The intention is to provide data access to users without requiring them to learn
a formal query language like e.g. SPARQL. VQSs need to nd a balance
between expressivity and usability: a system that covers the whole expressivity of
SPARQL will hardly be more useful to lay users than a SPARQL editor. This
trade-o di ers depending amongst others on the user group, their information
needs, and the complexity of the data [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Simple information needs will be met
by ltering on some attributes of a single class (e.g. black shoes of size 42), but
more advanced use often involves entities of di erent types (e.g. black shoes from
a small company based in a democratic country). Examples of VQSs designed
for underlying RDF data are Rhizomer [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], SemFacet [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], and OptiqueVQS [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
We have used OptiqueVQS as the starting point for our work, which in uenced
some of our design choices.
1 We rst described the idea of this con gurable index structure in a technical report
[
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and presented our implementation in an ISWC demo [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Faceted Search</title>
        <p>The term \Faceted Search" usually refers to search over datasets containing only
one type of concept (like e.g. shoe), and where several orthogonal attributes
(facets) are associated with each individual in the data. A typical faceted search
interface has a pane on the left side or at the top, where all these facets are
listed. For each facet, the interface allows the user to add or remove lters in
any order. As this is done, the main part of the page updates immediately and
displays a list of all instances satisfying the set of lters.</p>
        <p>Many faceted search interfaces also display a list of facet value suggestions
below each facet. If the user selects a value from this list, a new lter is added
based on the corresponding facet and value. Displaying a good list of value
suggestions is a challenging task: The user should not be overwhelmed by irrelevant
values, but he should also nd the values he expects to see and/or wants to lter
on.</p>
        <p>Simple faceted search systems will suggest a long and static list of values,
containing all the di erent values appearing in the dataset. In this case, the user
will always nd the value he is looking for. However, it is not optimal, because
the list will likely contain values that are incompatible with existing lters. By
that we mean: selecting such a value will lead to a situation where the applied
lters are too restrictive, and no results are returned. This kind of dead end is
not desirable from a user experience perspective.</p>
        <p>More advanced systems, on the other hand, remove or disable values (often
indicated by a gray font color) that are not compatible with the existing lters
{ leaving a shorter list of suggestions to the user. We call this method adaptive
value suggestion:</p>
        <p>Adaptive Value Suggestion: Calculate and suggest the complete set
of values for a facet that are compatible with both the existing lters
and underlying data, in order to not end up in a situation without any
results.</p>
        <p>Calculations needed to support adaptive value suggestion are quite intensive
for large datasets. To solve this problem, it is common to use search engines
like Lucene2 or Sphinx3 or similar software to index the data before use. These
indices are known to scale to large datasets, e.g. by partitioning. This ensures
fast response time, and no delay for the user.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>Faceted Search for Graph Data</title>
        <p>To accommodate complex information needs over structured data, VQSs need a
way to support graph queries; in SPARQL parlance graph patterns with several
variables connected by object properties. How this is supported varies between
VQSs, but essentially, there will be some way of adding edges and nodes to a</p>
        <sec id="sec-2-3-1">
          <title>2 https://lucene.apache.org/ 3 http://sphinxsearch.com/</title>
          <p>graph query. Typically there is some menu of properties to choose from, either
separate or merged with the facets. In an ontology-based system, this list may
be generated from the ontology, from the data, or a combination of both.</p>
          <p>It is desirable to o er users adaptive value suggestions as described in Sect. 2.2.
Unfortunately, the usual indices used for faceted search, like Lucene, do not
support querying graph data. The obvious way of achieving this over the original
graph data (i.e. without an index) requires running the whole partial query once
for every facet (see Sect. 4.1). For very large datasets and queries with many
joins, this will be too slow. Some of the queries constructed with OptiqueVQS
include up to 9 object properties, and are intended to be run over data stores
of several PB. Even with very fast hardware, these queries cannot be executed
within tenths of seconds as required for interactive systems. It becomes clear
that some kind of custom-built solution is needed to calculate the adaptive value
suggestions su ciently fast.</p>
          <p>It turns out that adaptive value suggestion is not possible to achieve for
arbitrarily large graph data and complex queries su ciently fast. Joining over
large amounts of data is inherent in the problem. As we will see however, it is
possible to compute approximations of the set of possible facet values, which may
contain some irrelevant suggestions. The more irrelevant suggestions we tolerate,
the faster we can compute them, so there is a trade-o between computation time
and accuracy. Our proposal is a particular approximation that attempts to strike
a good compromise between these two.
3</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Formal Framework</title>
      <p>
        For the purpose of this paper, we work with a number of simpli ed notions of
schema/ontology, dataset, and query. These are less general than OWL, RDF,
and SPARQL, respectively, but they cover the essential notions for VQSs that
we require. See [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for a more complete exposition including examples.
3.1
      </p>
      <sec id="sec-3-1">
        <title>The Ontology and Navigation Graph</title>
        <p>In some VQSs, there is a xed ontology O that controls the behaviour of the
system, while others analyse the data. In either case, the observed behaviour
will be that based on the type of a variable in the query,
{ some selection of datatype properties is available for ltering;
{ some selection of object properties is available for extending the query graph;
{ if the graph is extended with an object property triple, there will be some
information about the type of the newly introduced query variable.
Instead of considering an ontology, we therefore work with a navigation graph,
a directed graph NO, where each node represents either a concept or datatype
in O, and where each edge represents a (object or datatype) property in O.</p>
        <p>
          Exactly how this navigation graph is built from the data and/or ontology
di ers between VQSs. The navigation graph used in OptiqueVQS is described
in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], Sect. 5.2.1.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Queries</title>
        <p>When we refer to queries in this paper, we mean tree-shaped conjunctive queries
where each variable is associated with a node in NO, that speci es the type of
the query variable. Furthermore, we assume that this typing is homomorphic,
which means that the query obeys the type structure de ned by NO.</p>
        <p>Queries can then be represented as trees, where each node represents a query
variable, and the edges represent properties. We separate the query variables into
two separate groups based on whether they are typed to a concept or a datatype
in NO. We call them concept variables and datatype variables respectively.</p>
        <p>For each datatype variable in the query the user is allowed to add a lter,
specifying which values it is allowed to take. The lter is speci ed by the lter
function F which returns a set of values for each query datatype variable in
Q. We do not include an \optional" operator, i.e. all variables of Q have to be
bound.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Datasets and Query Answers</title>
        <p>In addition to the ontology O and the corresponding navigation graph NO, we
assume that the VQS has access to an underlying dataset (RDF graph) D. This
RDF graph should adhere to the OWL2 DL restrictions of keeping instances,
classes, object properties, and datatype properties separate, in other words it is
a proper description logic ABox. When the query is nished, the user will be
running it over D in order to retrieve the results of interest. However, the goal
of this work is to use utilize the data in D during query construction as well, in
order to provide adaptive value suggestions.</p>
        <p>#
#We will use Ans(Q( x ); D) to denote the result we get by executing the query
Q( x ) over a dataset D. The result is#a set of tuples, where the entries in each
tuple corresponds to the variables in x .
3.4</p>
      </sec>
      <sec id="sec-3-4">
        <title>Focus Variable</title>
        <p>Many VQSs have a concept of focus variable, which is the variable current facet
ltering applies to, and the variable that will be the subject of new object
property triples added to the query. It is usually possible to \refocus" during query
construction, i.e. change to a di erent focus variable.</p>
        <p>During a query session, both the constructed query and the focus variable
changes frequently, but for any given state there is exactly one partial query, and
a corresponding focus variable. Since the queries are tree-shaped, we assume that
the focus variable is the root of this tree, possibly reversing the direction of some
triples. For the reminder of this paper, we will use Q to denote the partial query
with root (and focus variable) v, and for convenience we also let C denote the
type of v.
3.5</p>
      </sec>
      <sec id="sec-3-5">
        <title>Value Suggestion Problem</title>
        <p>During query construction, the user is presented with a list of suggestions for
every local datatype property t4. These suggestions are based on the current
state of the session, i.e Q, in addition to the the underlying dataset D. We
formalize this by presenting what we call suggestion functions :
De nition 1. Suggestion function: A suggestion function is a function S
which takes a dataset D, a query Q, and some datatype property t as input, and
which returns a set of literal values:</p>
        <p>Sugg = S(D; Q; t):</p>
        <p>By selecting values X Sugg, the user modi es the lters related to t, i.e.
Q is updated to Q ^ t(v; w), and F (w) = X.</p>
        <p>Notice that the de nition above does not restrict S on neither its computation
time nor the quality of suggested values. However, it should be clear by now that
we are looking for functions that return adaptive value suggestions (described
in Sect. 2.2) without spending so much time that it ruins the user experience of
the VQS.</p>
        <p>The problem we target in this work is the following:
Value suggestion problem Find a suggestion function S that
1. can be computed e ciently enough for interactive use, even for large</p>
        <p>D and complex Q
2. includes all values, that can be ltered on without making the answer
set empty, i.e.</p>
        <p>Ans(Q ^ t(v; x); D) 6= ; =) x 2 S(D; Q; t)
for all values s in D
3. includes as few values as possible that will make the answer set
empty, i.e. S(D; Q; t) is as small as possible while satisfying
condition 2.</p>
        <p>Condition 1 is necessary because suggestions have to be calculated after every
user interaction with the UI, and the user should not have to wait for the relevant
suggestions. So they have to be calculated e ciently, and scale with respect to
both D and Q.</p>
        <p>The second condition formalizes the idea that all values that are compatible
with the partial query should also be suggested to the user. Otherwise, some
sensible queries could not be constructed. We say that the suggestion function
should have perfect recall.</p>
        <p>Finally, condition 3 re ects that we want to suggest as few options as possible
to the user that are incompatible with the partial query, i.e. that would lead to
an empty result set. We say that the suggestion function should be as precise as
possible.
4 A local datatype property is a datatype property related to the focus variable.</p>
        <p>These three conditions are in con ict, in the sense that a more precise
suggestion function with perfect recall will take more resources to compute. We
consider perfect recall to be non-negotiable, so the problem amounts to
nding a good compromise between accuracy and e ciency. In the next section we
present three di erent suggestions functions. None of them perfectly satis es
all three conditions in the value value suggestion problem, but the family of
approximate suggestion functions Sa are such compromises.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Suggestion Functions</title>
      <p>We now introduce three di erent suggestion functions: The rst (So) is optimal
from the user's perspective but slow for large datasets and complex queries; the
second (Sr) is rather inaccurate, but allows fast implementation; the third (Sa)
is a con gurable family of intermediate (good enough and fast enough) solutions
to the problem, where the idea is to only consider a limited part of the query Q.
Based on standard faceted search systems, we will now de ne what we consider to
be the gold standard for the value suggestion problem (with respect to accuracy),
namely the optimal suggestion function So:</p>
      <p>So(D; Q; t) = Ans(Qo(x); D) where Qo(x) = Q ^ t(v; x):
It considers both the underlying dataset D and the partial query Q, and
calculates suggestions that never lead the user into a combination of lters that are
too restrictive, i.e. it returns adaptive value suggestions (see Sect. 2.2).
Unfortunately, So does not scale for large Q and D, because it has to calculate the
answers to the query Qo(x) = Q ^ t(v; x), which is more complex than Q itself.</p>
      <p>Because So is not guaranteed to satisfy condition 1 of the value suggestion
problem, it is not technically a solution. However, it is important because it
satis es both condition 2 and 3 perfectly. Furthermore, since it returns the optimal
set of values, we will use it to de ne accuracy for the evaluation in Sect. 6.
Another important suggestion function is the function that computes the
suggestion set based only on the value range of t for instances of type C found in
D. We call this function the range-based suggestion function Sr:</p>
      <p>Sr(D; Q; t) = Ans(Qr(x); D) where Qr(x) = C(v) ^ t(v; x):</p>
      <p>Notice that C and t are the only two parameters used by Sr that change
during the query session: D is xed during the session, and except for the focus
concept C, Q is just ignored. This means that we can calculate the suggestion
set for each possible combination of C and t o ine, and providing suggestions is
then just a matter of fetching the correct pre-calculated set. The number of such
combinations is limited, and Qr is a very simple query, so any system based on
Sr is e cient with respect to both time and space.</p>
      <p>Since Q C(v) holds for all possible Q5, we know that Qo Qr, and
therefore So Sr. In other words, Sr will always suggest at least all the values
from So, but it may also include more. This also makes sense intuitively, since
So considers both D and Q, while Sr only considers D.</p>
      <p>We will show in Sect. 6 that this suggestion function often gives many
irrelevant suggestions, that will lead to empty query results.
4.3</p>
      <p>Approximate Suggestion Function Sa
The optimal suggestion function So is too costly to compute in practice, while
the range-based function, on the other hand, is quite inaccurate but can be
pre-computed with reasonable e ort. There is a gap between these two
suggestion functions, and the following solution, the family of approximate suggestion
functions Sa, lls this gap.</p>
      <p>Each member SaZ of Sa is con gured by what we call a facet con guration
Z, which is a function that returns one tree-shaped query (with root of type C)
ZC for every concept C. We call ZC the concept con guration of C.</p>
      <p>Now, given Q and the corresponding focus concept C, SaZ computes a pruned
version of Q: Qpr = intersection(Q; ZC). Qpr is then used together with the
underlying dataset D to calculate value suggestions, i.e.:</p>
      <p>SaZ (D; Q; t) = Ans(Qa(x); D) where</p>
      <p>Qa(x) = Qpr ^ t(v; x) = intersect(Q; ZC) ^ t(v; x):</p>
      <p>Intuitively the concept con guration ZC just de nes which parts of the partial
query to consider when calculating suggestions, and which parts to ignore. Every
part of Q which is not covered6 by ZC is simply ignored, and removed as the
intersection between Q and ZC is calculated.</p>
      <p>The most aggressive member of Sa is the suggestion function de ned by
concept con gurations that only contains the root, i.e. ZC = C(v) for all concepts
C. This member will actually return Qpr = C(v) for any partial query given to
it, regardless of its shape, size and focus concept. But this means that Qa(x) =
C(v) ^ t(v; x) = Qr(x), so this particular member of Sa returns exactly the same
suggestions as Sr. And in fact, the range-based suggestion function is just the
special case of Sa where only the root of the partial query is considered, and
everything else is ignored.</p>
      <p>Now let us consider the opposite case: instances of Sa with very large concept
con gurations. If the partial query is completely covered by the tree de ned
5 Here, and in the following, between queries denotes query containment, de ned
like this: Q1 Q2 if Ans(Q1; D) Ans(Q2; D)
6 When writing that a query Q1 covers another query Q2, we mean Q2 Q1. The
intuitive interpretation of this is that the tree de ned by Q2 is smaller and included
in the tree de ned by Q1.
by the concept con guration i.e. ZC Q, we get Qpr = Q. Hence Qa(x) =
Q^t(v; x) = Qo(x), which shows that So is the limit case of Sa for con gurations
that cover all possible queries.</p>
      <p>In general, for every partial query Q and facet con guration Z, the following
holds:</p>
      <p>Q</p>
      <p>Qpr</p>
      <p>C(v) ) Qo</p>
      <p>Qa</p>
      <p>Qr ) So</p>
      <p>Z
Sa</p>
      <p>Sr:
(1)</p>
      <p>As we focus on a certain query, we can ignore large parts of the facet con
guration Z, since only one concept con guration ZC is needed to calculate value
suggestions. In some cases, we will therefore use SaZC instead of SaZ , as short
hand notation. Similarly, we may only use the word \con guration" if it is clear
whether we mean \facet con guration" or \concept con guration".
5</p>
      <p>Index Structure for Sa
As seen above, Sa reduces the complexity of calculating suggestions by only
considering Qpr instead of Q. This will often reduce the query execution time,
but it is not guaranteed to be good enough for our purpose: If Qpr includes
multiple di erent concepts, the UI response time will su er due to the time
consuming join operations it requires.</p>
      <p>The solution to this problem is to pre-compute all joins covered by the facet
con guration Z, and store the results in an index structure. The system can then
execute Qa over this index structure instead of the original dataset D, in order to
retrieve answers fast enough. It is important though, that the nal constructed
query is executed over the original dataset. Sa and its index should only be used
to support adaptive value suggestion, not to answer the user's nal information
need.</p>
      <p>The index is guaranteed to contain all the data needed to answer Qa since
they both are limited by the variables de ned by ZC. Notice that constructing
such an index would not be possible if we wanted a perfect system described
by So { it is impossible to construct an index that ts all the data needed to
cover any possible query, because there are in nitely many of them. By using
Sa, we only need to consider Qpr, which is limited by ZC, hence pre-computing
and indexing is possible.</p>
      <p>We now describe how to construct and use the index. It will consist of multiple
tables { one for each concept C. Each table is based on the corresponding concept
con guration ZC, and is constructed as follows:
1. One column is added for each variable in the query de ned by ZC.
2. One row is added for each distinct tuple of Ans(ZC0; D), where ZC
0 is a
modi ed version of ZC, where everything except for the root node is made
optional.</p>
      <p>The result is a large denormalized table containing all the data that is covered
by ZC. By using the optional version ZC
0 instead of ZC directly, we ensure that
we also get the data that is just partly covered by ZC.</p>
      <p>Answering Qa over this table is then just a simple table scan, and the query
response time can be reduced to a satisfactory level by indexing the columns
and/or parallellizing the storage and processing, similar to what state of the art
search engines do. Such scaling out is much easier for a single pre-joined table
than for relational or graph storage. But it requires the pruning de ned by a
xed con guration, which is precisely the point of our approximate suggestion
functions.</p>
      <p>With data stored denormalized, we essentially have the same situation as for
standard faceted search with only one concept: We just act like every column
(i.e. variable of the con guration) is a facet. This means that we can achieve
adaptive value suggestions (over the variables included in the con guration)
with the same performance as standard faceted search systems by simply using
the same underlying search engine technology.</p>
      <p>Earlier we stated that ZC = C(v) is the smallest possible concept con
guration one can use for Sa. This is true if we use the original dataset, but not
if we use the index, because we need to answer C(v) ^ t(v; x) for each datatype
property t we want suggestions for, which cannot be done if a data column for t
does not exist. We want to provide suggestions for each local datatype property
t 2 T , hence ZC = C(v) ^ Vti2T ti(v; xi) is the smallest con guration we allow.
5.1</p>
      <sec id="sec-4-1">
        <title>Existential Concept Variables</title>
        <p>With the index construction method described above, we get one column
containing only URIs for each concept variable included in the concept con guration.
This data is often just a waste of space in our context: Users do not need to lter
on URIs, and suggested values of URIs are therefore not needed. However, it is
often interesting to know whether an assignment to the concept variable exists
or not, and that is why we introduce existential columns for concept variables.</p>
        <p>Instead of storing the full URI, we use a boolean value to indicate whether
an instance assignment exists or not. This reduces the index size considerably,
compared to the case where all URIs are stored, because multiple rows where
only one URI di ers can now be collapsed into only one row.</p>
        <p>By using existential concept variable columns it becomes quite cheap to
include concept variables in the con gurations, since it only requires one more
column of boolean values, while the number of rows stays xed. In the experiment
we conducted, we explored how much the accuracy increase by adding another
layer of these existential concepts nodes to the index, which is a comparatively
cheap investment.
6</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Evaluation</title>
      <p>We have implemented a faceted search module for OptiqueVQS based on Sa and
the index structure described in Sect. 5. Furthermore, we implemented both Sr
and So in order to compare them to Sa in our experiment.</p>
      <p>In Sect. 5 we argued that our system is at least as e cient (w.r.t.index
access) as state of the art faceted search engines using only one concept, so we have
not spent any e ort on measuring the time our system uses. We have also not
measured the performance of the index construction process, since it is not as
time crucial as index access. In other words, we do not claim that our
implementation is suited for systems that require real-time update. Instead, we explored
how facet con gurations of di erent size and shape a ect the constructed index
and the corresponding accuracy of value suggestions.</p>
      <p>The goal of our experiment was to answer the following two questions about
Sa:
1. How does the accuracy of Sa increase as the size of ZC increases?
2. How much does the accuracy of Sa increase by adding existential concept
variables to ZC ?</p>
      <p>Section 6.1 gives a detailed description of how the experiment was done, while
the results can be found in section 6.2.
6.1</p>
      <sec id="sec-5-1">
        <title>Experiment setup</title>
        <p>
          Dataset, Ontology and Queries
In the Optique project, user requirements led to queries with up to 9 object
properties [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], but over proprietary data, so we could not use those queries directly
for our experiment. Instead, we used the RDF version of the NPD Factpages7 {
a dataset covering details about oil and gas drilling activities in Norway. This
dataset contains 2.342.597 triples, and it has a corresponding OWL ontology
containing 209 concepts and 375 properties. The NPD Factpages is actually a RDB,
containing information that all oil companies in Norway are legally required to
report to the authorities. This means that the RDF version, which is generated
from this RDB, is fairly complete and homogeneous. This is optimal for
persons who want answers to complex queries. Among the di erent concepts we
considered in our queries, each have on average 14.1 di erent outgoing datatype
properties, and 6.4 outgoing object properties in NPD Factpages. The number of
distinct individuals/literals each such property leads to is 572 on average (with
a median of 12).
        </p>
        <p>The query catalogue distributed with this dataset was not suited for our
experiment: Just a few of the queries had the structure our system required, and
none of them connected more than a few concepts together. So we constructed a
new query catalogue consisting of complex queries of a more suitable size, with
the goal to cover a wide set of possible cases. The catalogue consists of 29 queries
ranging from 5 to 8 concept variables and 0 to 12 datatype variables, and the
corresponding result sets over the NPD dataset range from just 12 tuples, to
over 5 million tuples.</p>
        <p>The complete query catalogue is publicly available on Github8.</p>
        <p>Accuracy Measure
Providing the value suggestions is an information retrieval problem, where So
de nes the set of relevant values. We therefore use the well established measures</p>
        <sec id="sec-5-1-1">
          <title>7 https://gitlab.com/logid/npd-factpages 8 https://github.com/Alopex8064/npd-factpages-experiments</title>
          <p>of precision and recall to measure the accuracy of Sa (and Sr). From Eq. 1 we
know that So \ Sa = So and So \ Sr = So which gives:
pre(Sa) = jSo \ Saj = jSoj
jSaj jSaj
pre(Sr) = jSo \ Srj = jSoj
jSrj jSrj
rec(Sa) = jSo \ Saj = jSoj = 1 rec(Sr) = jSo \ Srj = jSoj = 1:
jSoj jSoj jSoj jSoj
where D, Q and t are all xed (hence not included in the formulas).</p>
          <p>Both Sa and Sr have perfect recall, which is expected from Eq. 1 and the fact
that it was the most important condition to satisfy from the value suggestion
problem. Hence, when evaluating these systems, only the precision matters, so in
the reminder of this paper, we will use and mention precision instead of accuracy.
Furthermore, since the user is exposed to several local datatype properties at
the same time, and we want to do more high-level experiments on the system,
we average the precision over all the local datatype properties.</p>
          <p>pre(Sa) =
1</p>
          <p>X pre(Sa; t)
jT j t2T
pre(Sr) =
1</p>
          <p>X pre(Sr; t)
jT j t2T</p>
          <p>So given Q, D, ZC, we will use this average as the overall measure of precision.
From Eq. 1, we can derive the following relationship between the precision of
our three suggestion functions:
Test Cases and General Setup
In the experiment we ran multiple test cases, where each test case was based on
one of the 29 queries from the query catalogue, and a generated concept con
guration ZC covered by the tree de ned by this query. Since each test case only
considers one query and one concept con guration, the index we get based on
the con guration will only contain one table. Value suggestions are then
calculated by running Qa over the table, and precision are calculated by comparing
to So as described above.</p>
          <p>Notice that a real world scenario would be more complex than this. The
success of a concept con guration and its corresponding table index would not
only depend on the success of one single query, but rather a large set of possibly
very di erent queries. One of our future goals is to develop methods for nding
con gurations that works well for a whole catalogue of queries.</p>
          <p>Con guration Generation</p>
          <p>In our experiment we wanted to show how the accuracy of Sa changes as
con gurations of di erent size and shape are used. To do this, we rst generated
a set of random \con guration cores" c for each query Q in the query catalogue.
Each core consisted of one or more connected concept variables from Q, and was
just used as a basis for generating two other concept con gurations:
{ Dat(c): Every possible datatype property is added to the concept variables
in c.
{ ObjDat(c): Every possible datatype property and object property is added
to the concept variables in c.</p>
          <p>The only di erence between these two con gurations, is that ObjDat(c) contains
one extra layer of concept variables. It is relatively cheap (w.r.t. storage usage)
to add these concept variables, as described in Sect. 5.1, but the precision will
(potentially) increase by doing it. So the split between Dat(c) and ObjDat(c)
was created in order to measure how much the precision increases, and thereby
answering question 2.</p>
          <p>Both Dat(c) and ObjDat(c) were used in one test each, where suggestion
values for each of the four di erent suggestion functions of interest were calculated.
They are given below, and they satisfy:</p>
          <p>After running through every test case, the results was grouped by both the
con guration type (Dat or ObjDat) and the size of the con guration, where
the size of a con guration is de ned by the number of concept variables in the
con guration core c. Finally the average precision of each group was calculated
and the results visualized.
Below are several charts visualizing the results of our experiment. Fig. 1, 2 and 3
display charts for three selected individual queries, while Fig. 4 shows the average
precision for all the queries of size 6 (15 of the 29 queries). Similar charts for
queries of size 5, 7 and 8 are omitted from the paper, but can be found on
Github9 together with charts for every individual query used in the experiment.</p>
          <p>Fig. 4: Average precision of size 6 queries.</p>
          <p>The yellow line in each chart shows the precision of the range-based function
Sr, which is always constant. Since this is the suggestion function with the lowest
precision we consider, it acts as a baseline { marking the worst case scenario for
Sa. The blue and red curves show the precision of SaDat and SaObjDat respectively.
As expected, these two curves are non-decreasing and pre(SaDat) pre(SaObjDat)
for all con guration sizes.</p>
          <p>The precision given by each of the three curves depends on how many of the
important key restrictions of Q they are able to capture. A key restrictions is a
restriction which reduces the number of instances one could assign to the root so
much that it also causes a large reduction in the possible facet values. The query
used in Fig. 1 for example has only one key restriction on the data property
name of the Field concept variable in depth 2 of Q. Since this key restriction
is associated with a datatype variable, both SaDat and SaObjDat perform about
equally well. The slight di erence between SaDat and SaObjDat is caused by other
much less important restrictions, which SaObjDat manages to capture, but SaDat
does not. If this chart had shown the best case scenario, the precision would
have been perfect already at size 2, because that is the point it would reach the
Field concept node. But since we average over multiple di erently shaped
congurations, and the branching factor of Q equals 2, the two lines moves steadily
upwards until they reach size 5. At this point the con guration is guaranteed to
cover the key restriction regardless of its shape.</p>
          <p>The query used in Fig. 2 has two key restrictions. The rst restriction is
associated with a datatype property lter on the root node (wellboreTemperature
190). This is captured by all the con gurations we used in the experiment,
and the di erence between Sr and SaDat at size 1 shows the e ect of capturing it.
The other key restriction is associated with the Field concept variable in depth
2. Since SaObjDat includes one additional layer of concept variables, it captures
this already from size 1, while SaDat on the other hand, needs to be of the correct
shape in order to capture it, hence the steadily rising curve, similar to Fig. 1.</p>
          <p>The query used in Fig. 3 is a linear query (the tree has only one branch), so
there is one possible con guration core for each con guration size. Hence, the
resulting curve only shows that one case of growing con guration. This query
also has two key restrictions: The rst one is an object property restriction in
depth 2 of the query { the e ect of capturing this restriction is shown by the
precision increase of SaObjDat between size 1 and 2. The second restriction is a
data property restriction associated with the only concept variable in depth 6 of
the query. This restriction is very hard to capture for both SaObjDat and SaDat,
but when the con guration reaches size 6, and the whole query is covered by
each of their con gurations, the resulting precision becomes perfect.</p>
          <p>The rules that control SaObjDat and SaDat also apply to Sr: It only performs
well if it is able to capture all of the important key restrictions. But since Sr
never considers Q, it will in fact always perform poorly if one or more such key
restrictions exists. Fig. 1 and 2 both show examples where this happens. For
each of those cases the precision of Sr is only 0.22. This is quite low compared
to its average precision of 0.58, displayed in Fig. 4.</p>
          <p>The chart in Fig. 4 displays the average over all queries of size 6 used in the
experiment, so it acts as a summary over all the queries.</p>
          <p>The rst thing to notice is the relatively high precision of the range-based
function. In our experiment, its precision ranged from 0.22 to 0.96, with an
average of 0.56. This does not sound too bad, but user studies done with OptiqueVQS
show that the users are not always satis ed with Sr, which actually motivated
us to start exploring Sa as an alternative.</p>
          <p>In the cases where key restrictions are associated with object properties,
SaObjDat performs much better than SaDat. In fact, it quite often returns
suggestions with perfect precision, as shown in Fig. 2. The average di erence between
SaObjDat and SaDat, shown in Fig. 4, indicates that it is worth adding this
extra layer of object properties to the con guration, especially since the resulting
increase in the index size is relatively small (one extra boolean column).
7</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>We discussed the combination of visual query systems for graph queries with the
adaptive value suggestions of faceted search. After de ning the \value suggestion
problem", we introduced three suggestion functions: an optimal one that is slow
for large data sets and complex queries; a range based one that is rather
inaccurate, but allows fast implementation; and a con gurable family of intermediate
(precise enough and fast enough) solutions to the problem, based on only
looking at a part of the constructed query. We conducted a series of experiments to
conclude that
1. good approximations to the value suggestion problem can often be reached
by taking into account only relatively small parts of the constructed query.
2. the precision of the approximations can often be improved dramatically by
including the presence of required object properties in the con guration,
rather than only connected datatype properties.</p>
      <p>In future work we intend to study alternative storage formats for the
prejoined index. In particular a document database like MongoDB could be suitable.
A related question is how to share storage space between indices for sub- and
super-classes in the type hierarchy. The viability of our approach depends on
a good choice of the facet con guration: it should be possible to determine an
optimal con guration given a log of previous user queries. Another approach to
reducing the index size is to work with \buckets" that combine ranges of facet
values. Also suitable bucketing strategies can be determined from the query log
and data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>M.</given-names>
            <surname>Arenas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Grau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Marciuska</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheleznyakov</surname>
          </string-name>
          .
          <article-title>Faceted search over rdf-based knowledge graphs</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>37</volume>
          :
          <fpage>55</fpage>
          {
          <fpage>74</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>J. M. Brunetti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Garc</surname>
          </string-name>
          <article-title>a, and</article-title>
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>From overview to facets and pivoting for interactive exploration of semantic web data</article-title>
          .
          <source>International Journal on Semantic Web and Information Systems (IJSWIS)</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>20</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Skj veland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bilidas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lanti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rezk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheleznyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kotidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Koubarakis</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Waaler</surname>
          </string-name>
          .
          <article-title>Ontology based data access in statoil</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>44</volume>
          :3 {
          <fpage>36</fpage>
          ,
          <year>2017</year>
          .
          <article-title>Industry and In-use Applications of Semantic Technologies</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Klungre</surname>
          </string-name>
          .
          <article-title>A faceted search index for graph queries</article-title>
          .
          <source>Technical Report 469</source>
          , Univ. of Oslo, Dept. of Informatics,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>V. N.</given-names>
            <surname>Klungre</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Giese</surname>
          </string-name>
          .
          <article-title>A faceted search index for OptiqueVQS</article-title>
          . In N. Nikitina,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fokoue</surname>
          </string-name>
          , and P. Haase, editors,
          <source>Proceedings of the ISWC</source>
          <year>2017</year>
          <article-title>Posters &amp; Demonstrations and Industry Tracks co-located with 16th International Semantic Web Conference (ISWC</article-title>
          <year>2017</year>
          ), Vienna, Austria, October 23rd - to - 25th,
          <year>2017</year>
          ., volume
          <volume>1963</volume>
          <source>of CEUR Workshop Proceedings. CEUR-WS.org</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheleznyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          .
          <article-title>Ontology-based end-user visual query formulation: Why, what</article-title>
          , who, how, and which?
          <source>Universal Access in the Information Society</source>
          ,
          <volume>16</volume>
          (
          <issue>2</issue>
          ):
          <volume>435</volume>
          {
          <fpage>467</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Jimenez-Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Vega-Gorgojo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I.</given-names>
            <surname>Horrocks</surname>
          </string-name>
          .
          <article-title>Experiencing OptiqueVQS: a multi-paradigm and ontology-based visual query system for end users</article-title>
          .
          <source>Universal Access in the Information Society</source>
          ,
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <volume>129</volume>
          {
          <fpage>152</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>A.</given-names>
            <surname>Soylu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kharlamov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheleznyakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. Jimenez</given-names>
            <surname>Ruiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Skjaeveland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Schlatte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lie</surname>
          </string-name>
          ,
          <string-name>
            <surname>and I. Horrocks.</surname>
          </string-name>
          <article-title>OptiqueVQS: a visual query system over ontologies for industry</article-title>
          .
          <source>Semantic Web</source>
          , (in press),
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>D.</given-names>
            <surname>Tunkelang</surname>
          </string-name>
          .
          <article-title>Faceted search</article-title>
          .
          <source>Synthesis lectures on information concepts</source>
          ,
          <source>retrieval, and services</source>
          ,
          <volume>1</volume>
          (
          <issue>1</issue>
          ):1{
          <fpage>80</fpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>