<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Eviction Strategies for Semantic Flow Processing?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Minh Khoa Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Scharrenbach</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Abraham Bernstein</string-name>
          <email>bernsteing@ifi.uzh.ch</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Zurich</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
      </contrib-group>
      <fpage>66</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>In order to cope with the ever-increasing data volume continuous processing of incoming data via Semantic Flow Processing systems have been proposed. These systems allow to answer queries on streams of RDF triples. To achieve this goal they match (triple) patterns against the incoming stream and generate/update variable bindings. Yet, given the continuous nature of the stream the number of bindings can explode and exceed memory; in particular when computing aggregates. To make the information processing practical Semantic Flow Processing systems, therefore, typically limit the considered data to a (moving) window. Whilst this technique is simple it may not be able to nd patterns spread further than the window or may still cause memory overruns when data is highly bursty. In this paper we propose to maintain bindings (and thus memory) not on recency (i.e., a window) but on the likelihood of contributing to a complete match. We propose to base the decision on the matching likelihood and not creation time ( fo) or at random. Furthermore we propose to drop variable bindings instead of data as do load shedding approaches. Speci cally, we systematically investigate deterministic and the matching-likelihood based probabilistic eviction strategy for dropping variable bindings in terms of recall. We nd that a matching likelihood based eviction can outperform fo and random eviction strategies on synthetic as well as real world data.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Flow Processing</kwd>
        <kwd>Linked Data stream processing</kwd>
        <kwd>eviction strategies</kwd>
        <kwd>load shedding</kwd>
        <kwd>matching likelihood estimation on streams</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Information processing increasingly incorporates query matching on dynamically
changing information (i.e. information ows) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Based on these Information
Flow Processing (IFP) systems, such as C-SPARQL [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], EP-SPARQL [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], or
CQELS [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] emerged that are capable of processing ows of semantically
annotated data. All of these Semantic Flow Processing (SFP) systems allow de ning
? The research leading to these results has received funding from the European Union
      </p>
      <p>Seventh Framework Programme FP7/2007-2011 under grant agreement no 296126.
queries in a language that extends SPARQL with algebra adjustments for
matching queries on ows of time-annotated RDF triples.</p>
      <p>
        Each SFP system works on limited memory resources. They constrain the
number of valid partial variable bindings (i.e., bindings that are not yet complete
solutions to a query) by applying a window on the stream. Such a window limits
the scope of query matching to, for example, a period of time or to a certain
number of triples. On the one hand this limits the scope of answers the system
can nd. If a match requires two data items that do not fall in the same scope the
system will never be able to perform such a match. On the other hand, even the
limited scope can not necessarily guarantee that the system will never exceed its
memory resources. Consider the situation when a popular TV broadcast/show,
such as a football match, starts and many users switch to it at the same time.
In this case we may nd that a sudden burst in the input data could cause the
system to exceed its resources. The situation gets worse when we are not only
interested in the number of viewers for a show but want to track the users { the
literature refer to the latter case as non-shrinking semantics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        To our knowledge no current SFP approach has a solution for freeing memory
in case these exceed the available resources. While proposals have been made for
load-shedding|dropping or ignoring incoming data|in IFP systems, only one of
them investigated load-shedding according to a statistical model of the matching
history [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. We believe that instead of ignoring incoming data one should one
should drop variable bindings based on their likelihood that they lead to a result.
We refer to this as eviction strategies in contrast to load shedding.
      </p>
      <p>Eviction strategies are more exible than load shedding. If we drop a data
item we may drop information too early in the matching process. Eviction allows
us to de ne constraints at which point of the matching process we want to
delete items. While there exist deterministic approaches that allow for dropping
variable bindings, such as rst-in- rst-out ( fo) we propose to use the matching
likelihood as the dropping criterion. In this paper we, hence, investigate the
performance of di erent eviction strategies. Note that an eviction strategy is
not the same as a garbage collection. While the latter removes those variables
bindings that will not contribute to a solution, the former removes variable
bindings that have the potential but cannot be kept due to memory limitation.
As a consequence applying an eviction strategy could lead to incomplete results
for the sake of feasibility.</p>
      <p>
        In this paper we propose to use a probabilistic based eviction strategy that
estimates the likelihood of a binding to contribute to a result in contrast to using
a xed time window for load shedding. Note that this paper intends to show that
a probability based model can outperform fo and random load shedding but
does not provide a model of how to estimate those probabilities on the go. We
investigate the e ect our approach has on the completeness of results compared
with deterministic approaches. We systematically investigated the challenges
arising from the limited memory resources of regular joins for SFP systems. If
the partial bindings for the potential join partners exceed the memory we start
dropping partial results. Our ndings enable an SFP to trade o completeness
versus resources. To ensure methodological correctness we followed the PCKS
approach [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] for de ning stress tests for SFP systems.
      </p>
      <p>As the relevant properties of an SFP system we identify completeness as a
Quality of Service constraint. The corresponding challenge we want to tackle is
the ability to handle bursts in the data. We therefore de ne recall as the key
performance indicator for the evaluations of the tests we performed in the scope
of this paper. In order to test the system we check how the system behaves under
restricted resources and under unrestricted resources.</p>
      <p>In our experiments we compared our probabilistic methods based on the
matching likelihood of partial bindings with the rst-in- rst-out ( fo)
deterministic method as well as random load shedding. We show that our statistical
approach can outperform fo and random load shedding both on simulated data
as well as on a real-world dataset of viewership behavior of IPTV users. The
improvement ranges from 14% to 50% when using synthetic and real world data.</p>
      <p>This remainder of this paper is structured as follows: next we formally specify
the context of SFP and then introduce our load-shedding approach. We then
evaluate our approach on synthetic and real world data in Section 3. After a
discussion of the results and limitations we present related work in Section 5
and conclude in Section 6.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Method: Likelihood-based Eviction of Partial Matches</title>
      <p>This section rst lays the formal groundwork for de ning what eviction strategies
are. Then, it introduces our likelihood-based eviction strategy as well as the
baseline rst-in- rst-out ( fo) and random strategies.
2.1</p>
      <sec id="sec-2-1">
        <title>Semantic Flow Processing</title>
        <p>SFP rely on an algebra that modi es the SPARQL algebra for query matching
on ows of data. The result of an algebra operation is a set of variable bindings
consisting of a nite number of pairs ?var=val, where ?var is a variable name and
val is an RDF term. A query is a tree of algebra expressions where results (i.e.,
bindings) are propagated from the leaves to the root. An SFP system, hence,
consumes time-stamped RDF triples &lt; s; p; o &gt; [time] that are then consumed
(or evaluated) by the algebra operations de ned in the query tree. Hereby the
system updates the bindings for all a ected levels of the query tree and bindings
at the root level will be emitted as results.</p>
        <p>
          In our discussions we rely upon the de nitions of SPARQL and RDF as given
in the SPARQL-query speci cation [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Whilst we limit our discussions to joins
and aggregates our approach is by no means limited to these operations. A
semantic ow F is a nite set of time-stamped RDF triples. We denote algebra
expressions by . The evaluation of an algebra operation and a nite set of
variable bindings on a semantic ow again produces a (possibly new/updated)
set of variable bindings eval( ; ; F ) = 0. In a query tree all algebra
expressions consume variable (a semantic ow) bindings, perform the corresponding
operations, and then emit variable bindings.1
        </p>
        <p>SPARQL query processing systems usually implement an iterator-based
approach: since they know all relevant data beforehand, they can successively
perform the algebra operations recursively and, hence, know a binding was
completely processed by an algebra expression. A SFP system, in contrast, has
to keep its variable bindings to guarantee completeness. Consuming data in a
streaming fashion turns the execution model upside-down. Instead of recursively
processing all available data for each algebra expression we have to continuously
evaluate newly arriving data for all algebra expressions. Consequently, the
algebra expressions in an SFP have never nished evaluating all available data and
have to continuously maintain a list of corresponding variable bindings. Consider
the simple join of two graph patterns ?a ab ?b and ?b bc ?c, for example.
Assuming that the SFP system matched the rst pattern with the binding variable
binding = f?a=a1; ?b=b1g it is not really possible to say if and when a match
for the second graph pattern might happen.</p>
        <p>Keeping all variable bindings, however, will eventually exceed the system's
memory resources and make processing unfeasible. Therefore, SFP systems
typically limit the number of data they process by windowing, which limits the data
considered for computing partial matches to a given period of time or number
of triples. This allows to mark bindings with an an expiration time-stamp. A
garbage collection process can then remove those variable bindings that become
out-of-scope.</p>
        <p>Unfortunately, windowing has its distinct problems. For example, queries
that require to match patterns that unfold over periods longer than a practical
window for example cannot be matched. In those cases there are two options:
First one can apply load shedding { an approach that drops data from the input
stream, for example by only considering every x-th data item for processing.
Load shedding is a well-established approach and we discuss its di erence to
eviction in Section 5 along with related work.</p>
        <p>Second, we may delete variable bindings instead of input data. In this paper
paper we investigate this alternative approach we refer to as eviction. It leaves
the input stream untouched but intermediate variable bindings get deleted to
make place for new ones. In the following we discuss this approach in detail.
2.2</p>
      </sec>
      <sec id="sec-2-2">
        <title>Eviction: Dropping Variable Bindings</title>
        <p>In order to limit memory usage of a SFP system we propose to use eviction, which
deletes partial binding from the processing tree. When deleting partial matches
(or shedding load in the input stream) we potentially sacri ce completeness. As
a consequence, it is imperative to understand what consequence eviction has on
completeness as a Quality of Service (QoS) criteria. In this study we, therefore,
compare the performance of the following eviction functions :
1 Note that the leaves of the tree typically consume F and that the root may emit
RDF data to a new ow F 0.
{ random deletion,
{ time-based deletion (e.g., delete the oldest bindings rst), and
{ data-driven deletion of bindings.</p>
        <p>
          While the rst is self-explanatory, the latter two require further explanation.
Time-Based Deletion Time-based deletion bases on the assumption that the
probability that a binding contributes to a result decreases with the amount of
time it stays in the bindings cache. Whilst we know of no empirical evidence that
this assumption holds for typical SFP benchmarks it is used in systems such as
SASE+ [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], where it is referred to as least-recently-used (LRU). 2
        </p>
        <p>LRU can be implemented as a priority queue, which can be implemented via
a heap. Heap organization has a complexity of O(log n), where n is the number
of elements in the bindings cache.</p>
        <p>Note that the actual performance of LRU can depend on the time model for
the bindings. A time model is either based on system time or on data time or
on both. Furthermore, an LRU eviction function must specify which time-stamp
of a binding it uses. This time-stamp can be the creation time of the binding or
the time it was last updated. In the scope of this work we use data time and the
time-stamp of the last update of the binding.</p>
        <p>Data-Driven Eviction Data-driven eviction|the core contribution of this
paper|bases the eviction decision of a binding on the probability that
contributes to a result. This requires an estimation of the matching probability for
. To that end we introduce the notion of the transactional closure for a binding
in Section 2.3, which permits counting the number of results it contributed to
for a certain part of the stream data.</p>
        <p>Probability estimation can be performed either online or o ine. In the online
case the contribution probability has to be estimated along with the actual query
processing, which is a prerequisite in cases where the data cannot be stored for
o ine analysis. In this paper we assume that we can compute the probabilities
o ine and describe the method for estimating the probabilities in Section 2.4.</p>
        <p>When applying data-driven eviction on each operator's binding locally it
requires the same computational e ort as LRU.3 Akin to the LRU case,
datadriven eviction can be implemented using a priority queue. The only di erence
is the comparison attribute which is now the matching probability rather than
the time-stamp. Global data-driven eviction strategies may require additional
e ort (cf Section 2.5).
2 Note that in our case LRU gives the same result as fo since we only forward fresh
bindings to next operator instead of performing updates on the current binding.
3 Assuming that the probabilities, as in our case, are computed o ine and do not
in uence the online processing e ort.
2.3</p>
      </sec>
      <sec id="sec-2-3">
        <title>The Transactional Closure</title>
        <p>In order to evaluate the performance of any eviction or load shedding operator
we have to know whether deleting a data item or a variable binding impacts
the results. A query result emerges from applying a set of algebra operators. In
order to determine whether deletion of data items or variable bindings has an
impact on the results of a query we hence need to know all sequences in which
the application of a set of data items to the algebra operators of a query leads
to a result.</p>
        <p>We now de ne the basics for computing all possible paths to all query results
for a ow F . Any result of a query is a variable binding root the root of the
tree of the query's algebra expressions T root . As such any of these root variable
bindings emerges from a sequence of application of algebra expressions from
T root . As a result, for each root we have a nite set of sets of variable bindings.
We refer to this set as the transactional closure ( )+;F of the binding . Indeed,
we can de ne such a transactional closure recursively for all bindings for all
algebra expressions in T root .</p>
        <p>Having the de nition of the transactional closure at hand we now investigate
what happens, if we delete some of the variable bindings from the transactional
closure. If we delete all entries from ( )+;F , then the variable binding will
never be created. In this study we want to know what impact any variable
+
binding 0 2 ( ) ;F has on the transactional closure of .</p>
        <p>We assume that each algebra expression maintains a bindings cache: the
nite collection of bindings, which we denote with . If we now delete a variable
binding 0 from we would like to know which potential impact this deletion
has on the query results. We hence de ne the transactional closure with respect
to , , and F by ( )+; ;F . It inductively de nes those variable bindings that
emerge from by applying to hF ; i. Let R ; ;F be the result set of for
and F , then we call R ; ;F = R ; ;F \ ( )+; ;F the result set of .</p>
        <p>The bindings cache of an algebra expression contains di erent sets of variable
bindings. A join with n join partners, for example, has one set ;n for each of the
n join partners. In this study we assume that each of these sets can hold a limited
number of variable bindings, i.e., j ;nj ! ;n, where ! ;n is the memory
available to store bindings. When the application of on the ow F causes
some of these sets to exceed its limit (! ;n), then we apply an eviction strategy
E that removes variable bindings from ;n such that the above condition will
be ful lled again.
2.4</p>
      </sec>
      <sec id="sec-2-4">
        <title>Estimating Matching Probabilities</title>
        <p>We estimate the selectivity of a variable binding by the number of results that
depend on . We de ne the probability that the result set of a variable binding
is non-empty as the above count normalized by the total number of results
with respect to the bindings cache:</p>
        <p>Note that the choice of designates for which biding cache ! ;n this
probability is computed.</p>
        <p>The above formula, hence, computes the probability that there exists a
binding 0 that bases on . If is a query, then the above formula de nes the
probability that contributes to a query's root result.</p>
        <p>In some cases it is possible that a binding 0 originates from two di erent
upstream bindings. Consider, for example, the following operation joining two
identical basic graph patterns BGP (?x P Y). Here, any binding = f?x/Xg
could originate from the rst or the second triple pattern matching:</p>
        <p>J OIN (BGP (?x P Y); BGP (?x P Y))</p>
        <p>For the computation of the probabilities we count both original variable
bindings.
2.5</p>
      </sec>
      <sec id="sec-2-5">
        <title>Eviction Strategies: Local vs Global Approaches</title>
        <p>A local eviction strategy operates locally on the binding cache of each operator.
A global eviction strategy, in contrast, operates on all binding caches together,
attempting to optimize the overall global performance. In other words, for a
local eviction strategy all ! ;n are xed whereas a global eviction strategy has
the additional constraints that P Pn ! ;n ! for all ! ;n. A global eviction
strategy hence requires to also solve (or approximate) a constraint optimization
problem. Since it is our goal to show that data-driven eviction can lead to
superior results over the other strategies we only have to show that one of the two
approaches o ers advantages over the other, non data-driven ones. We, hence,
restrict ourselves to the simpler, local eviction strategy shown in Algorithm 1
and leave the global eviction strategy for future work.</p>
        <p>Algorithm 1 Local eviction strategy for an algebra expression = 1; : : : ; I ,
a ow F , a set of sets of variable bindings i;ni , limits on these ! i;ni
Apply F to h ; i
for all i;n do
if j i;nj &gt; ! i;n then</p>
        <p>eviction( i;n)
end if
end for</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Evaluation</title>
      <p>The main working hypothesis of our paper is that eviction based on the
matching likelihood can outperform the fo and random dropping approach of variable
bindings. We therefore have to estimate the matching likelihood. This in turns
required implementing the transactional closure in order to be able to compute
the counts on which the matching likelihood is de ned. In order to calculate
the matching likelihood, we implemented an "oracle". This oracle allows to
retrieve the cardinality of complete matches in the future. The oracle assumes
that required bindings are not been evicted by other join operators. With this
oracle we can compute the probabilities as proposed in Section 2.4. We will test
probabilistic eviction with probabilities learned from past data in future studies.
Please nd the discussion of this limitation in Section 4.</p>
      <p>In this section we evaluate our approach governed by the above hypothesis.
We therefore compared the performance of our likelihood eviction with the fo
and random strategies with respect to recall as the key performance indicator
(KPI).
3.1</p>
      <sec id="sec-3-1">
        <title>Experimental Setup</title>
        <p>We performed our experiments on a synthetic and a real world data set. We used
the same query that performs the following join:
SELECT ?a ?b ?c ?d
WHERE {
?a ab ?b .
?b bc ?c .</p>
        <p>?c cd ?d .
}</p>
        <p>We considered joins without sequential or temporal constraints, because of
memory constraints. Our goal was to compute the recall for all potential
bindings. Computing the transactional closure for joins requires an exponential
number of traces which matching originated from which variable bindings. We discuss
this limitation in Section 4.</p>
        <p>We performed the experiments considering local eviction. We tested the three
di erent eviction strategies with bindings caches which we increased
exponentially. We chose sizes as multiples of 10. The reason for this is that calculating
matching statistics is currently a time consuming process and therefore we choose
an exponentially increasing size of cache in the evaluation.</p>
        <p>For the comparison we assumed the same time for determining whether an
element was subject to eviction or not. This holds true for fo and statistical
eviction, as they both keep a cache where the elements to be evicted have to
be sorted out. We can achieve this by applying Heap-Sort. We assume that the
number of elements to be dropped by eviction to be constant in relation to the
size of the bindings cache. This way each eviction step is bounded by O(log(n))
where n is the size of the bindings cache.</p>
        <p>All experiments were conducted using the UZH Katts Semantic Flow
Processing engine.4 While we simulated an uniform distribution of results for the
synthetic dataset we used anonymous IPTV viewership logs and joined them
with Electronic Program Guide data. The data was gathered in the scope of the
ViSTA-TV project.5 For our experiments we used a data sample from Aug 1st,
2012. The sample comprised about 300'000 events in total for the user log and
the EPG data. We thereby only considered such properties that also occur in the
query. For the above mentioned exponential explosion for tracing the matching
we limited ourselves to a subset comprising 10'000 events.</p>
        <p>To retrieve the exact matching probabilities for the bindings we stored those
statistics in a sqlite3 in memory database. However, updating the matching
probability for each binding every time unit (every time unit can change the
likelihood of a match for a binding) is a very time expensive process. Therefore
we updated the probability for each binding only once, namely when we added
it to the bindings cache.</p>
        <p>The following con guration was used to run all our experiments: 2 Quad-core
Intel(R) Xeon(R) CPU X5570@ 2.93GHz, 72GB of RAM, 4 TB of disk space,
running Fedora Core 12 kernel 2.6.32.14 64bit.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Results</title>
        <p>As Figures 1-2 show, our likelihood based eviction approach always outperforms
fo and random eviction. The di erence for the synthetic data set decreases
with increasing size of the bindings cache from 50% to 10% di erence. In the
real world data set the di erences in recall increases from 14% to 29% for the
10'000 events data set and for the 1'000 events data set we can observe a decrease
from 32% to 14%.</p>
        <p>We see that for the extreme size of the binding cache, we always obtain 100%
recall for all approaches. We evaluate the overall performance of our approach
by the recall measured for the bindings cache size that can hold up to 10% of
the bindings of the cache for which we get 100% recall. While the matching
likelihood based eviction performs very good on the synthetic data set (up to
91% recall with bindings cache of 10% relative of 100% performance limit), it
performs not as good for the real-world data set (up to 60% recall with bindings
cache of 10% relative of 100% performance limit).
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion</title>
      <p>Experimental Results The results of our study indicates that likelihood based
eviction can outperform fo and random. This was the case for all experiments we
4 KATTS is a recursive acronym for Katts is A Triple Torrent Sieve and is accessible
via GitHub at https://github.com/lorenz scher/katts
5 http://vista-tv.eu
0.8
conducted. The relative di erence between is larger for the synthetic data than
for the real world data set. We think that our approach was able to better predict
the matching likelihood for synthetic dataset as it was generated following some
uniform distributions. Nevertheless an eviction strategy based on the matching
likelihood seems to be a promising approach.</p>
      <p>We also found that there is a relation between the size of the bindings caches
and the recall. The larger the bindings cache the higher the recall. While this
result seems not very surprising it yet indicates that future research will have
to be performed nding out the exact size of the bindings cache such that the
recall becomes 100%. Such research would also investigate at which cost an
improvement of the recall using eviction strategies comes.</p>
      <p>Shortcomings The most important shortcomings of our study are the limited
number of data items we could process and the restriction to a query consisting
of a regular join. The limitation to the small number of data items is caused
by the exponential number of possible traces from variable bindings to results.
Research on more complicated queries and larger data sets has hence to be
performed online. This, in turn makes an exact calculation of the matching
statistics infeasible. Therefore one has to learn the matching statistic which we
tackle in our future work.</p>
      <p>In this study we hence concentrated on investigating how well likelihood
based eviction works in |principle, compared to fo and random eviction. We
are currently investigating how to best estimate the matching likelihood on the
go, i.e., along with the processing of incoming data. Future experiments will
0.8</p>
      <p>Fig. 2: Recall ViSTA-TV Dataset with 1'000 events.
hence systematically investigate di erent types of algebra expressions. These
include aggregates but also joins with sequential and temporal constraints, as
these are more typical for SFP systems. We believe that likelihood based eviction
can help most for such cases where a lot of variable bindings are produced, e.g.,
in the case of aggregates, in particular for non-shrinking semantics.</p>
      <p>
        Our simulation did not investigate response time or throughput but recall.
While we showed that matching likelihood based eviction outperforms
deterministic approaches like fo and random eviction in terms of recall, we will
investigate at which costs this comes for the former two KPIs. As in the previous case
we will have to then estimate the statistics online rather than o ine. Following
the PCKS paradigm [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] are working on implementing tests for determining how
well our approach works for bursts in the data.
      </p>
      <p>In this study we considered a purely local eviction strategy. It would be
interesting to see the e ect of global constraints. We are currently investigating
how to add these to statistical eviction. This requires a well-de ned optimization
strategy carefully balancing the relation between local and global constraints.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Related Work</title>
      <p>Research related to this paper roughly covers three areas: Semantic Flow
Processing, selectivity estimates on ow-data, and load shedding for IFP/SFP systems.
In the sequel we discuss how our work relates to each of these.
0.8</p>
      <p>10'000
100 1'000 10'000
fo 39 (4%) 290 (31%) 943 (100%)
random 27 (3%) 297 (31%) 943 (100%)
prob 164 (17%) 567 (60%) 943 (100%)
ground 943 (100%) 943 (100%) 943 (100%)</p>
      <p>Fig. 3: Recall ViSTA-TV Dataset with 10'000 events.
5.1</p>
      <sec id="sec-5-1">
        <title>Semantic Flow Processing</title>
        <p>
          C-SPARQL [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] performs query matching on subsets of the information ow
de ned by windows. The decidability of SPARQL query processing on such nite
sets of RDF triples causes the number of variable bindings produces to be nite.
Their number may still become prohibitively large, for example, when using
non-shrinking semantics for aggregates [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Since the execution is performed on
traditional SPARQL engines, we may apply our approach, for example as a lter
for iterator-based implementations such as Jena.6
EP-SPARQL [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] is a complex event processing system which extends the ETALIS
system by a ow-ready extension of SPARQL for query processing [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. It
provides a garbage collection facility which can \prune outdated events" or \expired
events by using periodic events, generated by the system". ETALIS is a prolog
engine which seriously hampers the applicability of our approach to ETALIS
engine. Note however, that EP-SPARQL may be implemented on other engines,
e.g., CQUELS, too.
        </p>
        <p>
          CQELS [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] \implements the required query operators natively to avoid the
overhead and limitations of closed system regimes". It optimizes the execution
by dynamically re-ordering operators, because \the earlier we prune the triples
that will not make it to the nal output, the better, since operators will then
process fewer triples". This pruning does, however not make any guarantees
6 https://jena.apache.org/
about the number of variable bindings created by the processors. Our method
should be directly applicable to CQUELS as it provides a native implementation
of the operators and these operators maintain a list of active variable bindings.
5.2
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>Selectivity Estimates</title>
        <p>
          Selectivity estimates for optimizing the execution of SPARQL queries were rst
investigated by [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. These are not directly applicable to our approach. Selectivity
estimates for SPARQL query optimization base on the selectivity of predicates
whereas we are interested in the matching likelihood of variable bindings. More
related to our work is the operator scheduling of CQELS [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. They propose to
choose the next operator to execute according to the likelihood that the operator
leads to a positive result. They, however, base the decision on heuristics. It would
be very interesting to see what e ect applying our matching likelihood would
have on the performance of CQELS.
        </p>
        <p>To the best of our knowledge, no approach estimates the matching likelihood
of bindings to determine their selectivity.
5.3</p>
      </sec>
      <sec id="sec-5-3">
        <title>Load Shedding</title>
        <p>
          Load shedding has been applied to information ow processing. Approaches
like [
          <xref ref-type="bibr" rid="ref13 ref2 ref6">6, 2, 13</xref>
          ] perform load shedding by dropping tuples from the stream, i.e.,
dropping data instead of variable bindings. In contrast to a Complex Event
processing (CEP) system they assume a Data Stream Management System, i.e.,
a set of triples with a relational database execution engine.
        </p>
        <p>
          An approach we found that follows a similar idea is SASE+ [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. SASE is has
an automata based matching approach which can be considered similar to
variable bindings in our case. They do apply some eviction strategy. Yet, they base
their eviction on a deterministic approach that is similar to our implementation
of fo.
        </p>
        <p>
          Another approach that resembles our work is proposed in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. They perform
a simple equi-join on two incoming streams and evict tuples which are less likely
to nd a join partner. However, the approach in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] works with a sliding window
and with a single equi-join of two streams.
        </p>
        <p>Note that applying a window to the ow of RDF triples can be considered
dropping tuples, too. As such all approaches that implement a window on the
incoming data e ectively o er the window as a load shedding strategy.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Outlook</title>
      <p>In this paper we proposed to perform load shedding by eviction, i.e., by
dropping variable bindings rather than by dropping tuples. While the latter is a
well-established technique for Data Stream Management Systems, for CEP our
approach is the rst that applies (i) load shedding by eviction and (ii) bases
the decision on the matching likelihood of a variable binding rather than on a
heuristic.</p>
      <p>Our experiments show that eviction is a promising strategy for regular joins
for event-driven Semantic Flow Processing systems. We outperform the usually
used deterministic approach rst-in- rst-out by up to 51% recall.</p>
      <p>While our approach is currently investigating recall as the only key-performance
indicator (KPI) we are con dent that we will be able to show that the
matching likelihood also performs better than deterministic approaches when di erent
Quality of Service (QoS) constraints require testing a combination of recall with
other KPIs such as response time.</p>
      <p>Future work will concentrate on extending the algebra expressions to joins
with sequential and temporal constraints. We believe that the solution for load
shedding will have to incorporate statistical information from the data ow and
the query log. We will also investigate how the matching likelihood can be learned
from the stream on-the-go.</p>
      <p>
        In the future we will also plan to evaluate our approach with standardized
queries and data as proposed in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>We are convinced that only systematic investigations on the relation between
using windows and likelihood-based eviction for load shedding will reveal this
solution from the data.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Anicic</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fodor</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rudolph</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stojanovic</surname>
          </string-name>
          , N.:
          <article-title>EP-SPARQL: a uni ed language for event processing and stream reasoning</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on World Wide Web</source>
          . pp.
          <volume>635</volume>
          {
          <fpage>644</fpage>
          .
          <string-name>
            <surname>ACM</surname>
          </string-name>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Babcock</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Motwani</surname>
          </string-name>
          , R.:
          <article-title>Load Shedding for Aggregation Queries over Data Streams</article-title>
          .
          <source>In: Proceedings of the 20th International Conference on Data Engineering</source>
          . pp.
          <volume>350</volume>
          |
          <article-title>-</article-title>
          .
          <source>ICDE '04</source>
          , IEEE Computer Society (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Barbieri</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Incremental reasoning on streams and rich background knowledge</article-title>
          . In: Aroyo,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Antoniou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            , Hyvonen, E.,
            <surname>ten Teije</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ,
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Cabral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Tudorache</surname>
          </string-name>
          , T. (eds.)
          <source>The Semantic Web: Research and Applications: 7th Extended Semantic Web Conference, ESWC</source>
          <year>2010</year>
          , Heraklion, Crete, Greece, May 30 - June 2,
          <year>2010</year>
          . LNCS, vol.
          <volume>6088</volume>
          , pp.
          <volume>1</volume>
          {
          <issue>15</issue>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Barbieri</surname>
            ,
            <given-names>D.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Braga</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ceri</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Della Valle</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grossniklaus</surname>
          </string-name>
          , M.:
          <string-name>
            <surname>C-SPARQL</surname>
          </string-name>
          :
          <article-title>A Continuous Query Language for RDF Data Streams</article-title>
          .
          <source>International Journal of Semantic Computing</source>
          <volume>4</volume>
          (
          <issue>1</issue>
          ),
          <volume>3</volume>
          {
          <fpage>25</fpage>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Cugola</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Margara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Processing ows of information</article-title>
          .
          <source>ACM Computing Surveys</source>
          <volume>44</volume>
          (
          <issue>3</issue>
          ),
          <volume>1</volume>
          {
          <fpage>62</fpage>
          (Jun
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gehrke</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riedewald</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Approximate join processing over data streams</article-title>
          .
          <source>In: Proceedings of the 2003 ACM SIGMOD international conference on on Management of data - SIGMOD '03</source>
          . p.
          <fpage>40</fpage>
          . ACM Press (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Diao</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Immerman</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gyllstrom</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Sase+: An agile language for kleene closure over event streams</article-title>
          .
          <source>Tech. rep.</source>
          , University of Massachusetts Amherst, Department of Computer Science (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>SPARQL 1.1 Query Language</article-title>
          .
          <source>W3C</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Le-Phuoc</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dao-Tran</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pham</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Linked stream data processing engines: facts and gures. The Semantic Web{</article-title>
          . . .
          <volume>1380</volume>
          (
          <issue>24761</issue>
          ),
          <volume>1</volume>
          {
          <fpage>12</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Le-phuoc</surname>
            , D., Dao-tran,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parreira</surname>
            ,
            <given-names>J.X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauswirth</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A Native and Adaptive Approach for Uni ed Processing of Linked Streams and Linked Data</article-title>
          . In: Aroyo,
          <string-name>
            <given-names>L.</given-names>
            ,
            <surname>Welty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Alani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Taylor</surname>
          </string-name>
          , J.,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kagal</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Noy</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomqvist</surname>
          </string-name>
          , E. (eds.)
          <source>The Semantic Web { ISWC</source>
          <year>2011</year>
          : 10th International Semantic Web Conference, Bonn, Germany,
          <source>October 23-27</source>
          ,
          <year>2011</year>
          , Proceedings,
          <source>Part I. Lecture Notes in Computer Science</source>
          , vol.
          <volume>7031</volume>
          , pp.
          <volume>370</volume>
          {
          <fpage>388</fpage>
          . Springer Berlin / Heidelberg (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Scharrenbach</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urbani</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Margara</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Valle</surname>
            ,
            <given-names>E.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Seven Commandments for Benchmarking Semantic Flow Processing Systems</article-title>
          .
          <source>In: Proc.ESWC</source>
          <year>2013</year>
          (to appear). pp.
          <volume>1</volume>
          {
          <fpage>15</fpage>
          . No.
          <volume>296126</volume>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Stocker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Seaborne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bernstein</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiefer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>SPARQL Basic Graph Pattern Optimization Using Selectivity Estimation</article-title>
          .
          <source>In: Proceedings of the 17th International World Wide Web Conference (WWW)</source>
          .
          <source>ACM (Apr</source>
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Tatbul</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cetintemel</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zdonik</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cherniack</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stonebraker</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Load Shedding in a Data Stream Manager</article-title>
          .
          <source>In: 29th International Conference VLDB</source>
          . vol.
          <volume>54</volume>
          , pp.
          <volume>309</volume>
          {
          <fpage>320</fpage>
          .
          <string-name>
            <given-names>VLDB</given-names>
            <surname>Endowment</surname>
          </string-name>
          (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Duc</surname>
            ,
            <given-names>P.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corcho</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Calbimonte</surname>
          </string-name>
          , J.p.:
          <article-title>SRBench : A Streaming RDF / SPARQL Benchmark</article-title>
          .
          <source>In: The Semantic Web - ISWC</source>
          <year>2012</year>
          : 11th International Semantic Web Conference,
          <string-name>
            <surname>ISWC</surname>
          </string-name>
          <year>2012</year>
          , Boston, MA, USA, Nov
          <volume>12</volume>
          -
          <issue>14</issue>
          ,
          <year>2012</year>
          ,
          <string-name>
            <surname>Proceedings</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>