<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Smart Cache Management for Ontology Based, History-Aware Stream Reasoning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Rui Yan</string-name>
          <email>yanr2@rpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Brenda Praggastis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>William P. Smith</string-name>
          <email>William.Smithg@pnnl.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deborah L. McGuinness</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Paci c Northwest National Laboratory</institution>
          ,
          <addr-line>Richland, WA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Tetherless World Constellation, Department of Computer Science, Rensselaer Polytechnic Institute</institution>
          ,
          <addr-line>Troy, NY</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Stream reasoning is an exciting multidisciplinary research area that combines stream processing and semantic reasoning. Its goal is to not only process a dynamic data stream, but also to extract explicit and implicit information on-the- y. One of its challenges is managing history awareness: how much and which historical data should be held and for how long as we continuously query and reason on an ever changing stream of linked data? In this paper, we propose an innovative approach to enable history-aware reasoning by utilizing semantic technologies in a data cache with a statistics-based cache management policy.</p>
      </abstract>
      <kwd-group>
        <kwd>semantic web</kwd>
        <kwd>stream reasoning</kwd>
        <kwd>cache management</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Examples of streaming data are ubiquitous on the web. Data streams are
presented in a multitude of physical formats and conceptual models making analysis
di cult. Inspired by the Linked Open Data approach to link static
heterogeneous data on the Web, D. Barbieri et al[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] launched an initiative to publish
data streams as linked data modeled using the Resource Description Framework
(RDF).
      </p>
      <p>
        Existing (semantic) stream processing techniques[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] are capable of quickly
processing large volume RDF streams, yet typically perform super cial
manipulations of information. Reasoning supported by semantic technologies can infer
meaning, but is mostly performed against a static knowledge base. The rising
need to extract high level knowledge from streaming data gave birth to stream
reasoning[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], leveraging the advantages of both stream processing and semantic
reasoning. Several use cases of stream reasoning are described in[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and from
these cases general requirements for e ective stream reasoning are proposed.
Historical data management is among these requirements.
      </p>
      <p>
        Most existing solutions of stream processing and reasoning use a window to
deal with the transient nature of streaming data. The window can be either
time-based or count-based to isolate the most recent segment of the unbounded
RDF streaming data. Both C-SPARQL[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and EP-SPARQL[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] extend the
standard SPARQL to enable a continuous query within the window. The di erence
between them is the former deals with RDF data streams, while the latter
processes complex events. The algorithm of incrementally maintaining ontological
entailments[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] employs time-interval-stamped RDF streams and C-SPARQL to
constantly maintain the transitive property entailments as expired RDF streams
are dropped.
      </p>
      <p>
        A window slides along the RDF data stream in the order the data appear,
inserting and evicting data according to this order. Decisions of which data will
be evicted in a window-based stream reasoning system are made a priori.
Historical data, data which have slid out of the window, are lost regardless of their
semantic importance. In contrast, our approach will decide which data to
remove using semantic importance criteria. A cache holds a portion of the data
to be processed as the data are fed in, maintaining a history of semantically
important data by evicting data according to a cache management algorithm.
For example, the Least Frequently Used (LFU) algorithm counts data reasoning
participation frequency and evicts data with small frequency counts; while the
Least Recently Used (LRU) algorithm evicts data with less recent reasoning
involvements. Caching management algorithms have been well-studied and applied
on the Web over decades[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this paper, we propose an innovative approach
to enable history-aware reasoning by utilizing semantic technologies in a data
cache with a statistics-based cache management policy.
2
      </p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        Our approach is self-adaptive in that it manages the historical data by constantly
updating cached data with a cache management algorithm. We present this
approach based on our Nuclear Magnetic Resonance (NMR) use case[
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], which
exploits stream reasoning to identify chemical compound types in test specimens.
We model the background ontology in the OWL 2 DL pro le, describing thirty
di erent compound categories with their resonance spectra value ranges. The
NMR streaming data contain compound instances with their resonance spectra
frequency values, formatted in Avro1 and then annotated into RDF graphs.
The goal of this use case is to identify categories of streaming instances via DL
reasoning performed by Pellet2. A compound instance is identi ed in a category
if its frequency value ts into an ontology-encoded category range. The issue we
are addressing here is the time-consuming reasoning. For example, in our
nocache experiment it took 19 minutes to produce entailments for 19,000 triples
in the stream, which violated the timely nature of stream reasoning .
      </p>
      <p>The data cache, whose size is xed due to the maintenance feasibility and
reasoning limitations, is a key component for our stream reasoning
architecture. The cache is preloaded with the background ontology and equipped with
a statistics-based cache management algorithm. In general, the choice of these
1 https://avro.apache.org/
2 https://github.com/complexible/pellet
algorithms is dependent on the system requirements; in the NMR use case, our
algorithm will be based on LFU.3 When an instance is identi ed as one of the
desired compounds, it becomes important and its original and entailed graph is
saved in the data cache as a historical data graph. We associate a counter for each
historical data graph, and use a use case de ned predicate NMR:frequency to
pull out frequency values of both streaming and historical data graphs. Identical
instances are identi ed by comparing their frequency equality with SPARQL's
numeric-equal operator. The counter increments by one if the operator returns
true. Otherwise, the streaming instance has to go through the reasoner. The
bigger the counter, the more semantically important the historical data graph.
A historical data graph can either become more semantically important (counter
grows bigger) to stay or less semantically important (counter grows smaller) to
evict as new data are fed in. Using the historical data, new compounds with
identical frequency values will be identi ed without having to go through the
full reasoning process, which saves time.</p>
      <p>We will implement this approach in multi-threading: one thread for a
continuous standard SPARQL query, the other one for cache management. These
two threads are executed in parallel and can access the data in the cache. The
standard SPARQL thread queries the triples held in the cache, returning a list
of identi ed compounds. Continuity is enabled by repeating the query at
regular time intervals. The cache management thread increments the counters and
repeatedly checks the counter values, so as to evict certain compound graphs if
they are not frequently used.
3</p>
    </sec>
    <sec id="sec-3">
      <title>Discussion</title>
      <p>Introducing a data cache into the stream reasoning system is bene cial because of
its ability to maintain historical data; this not only provides extra background
for reasoning, but also preserves important history for potential use, such as
trend identi cation, and anomaly detection. For example, in our NMR use case,
a large statistical count of the historical compound instance data indicates a
greater likelihood that the test specimen contains a large percentage of this
compound. By identifying identical data in historical and streaming data we
decrease overhead of reasoning, as illustrated above. The cache is also suitable
to work with existing window-based semantic stream processing engines such as
C-SPARQL. The data cache can replace the window in such engines for stream
reasoning under those scenarios emphasizing on historical data and fast system
response.</p>
      <p>Multi-threading is a bene cial way to boost system performance because it
splits di erent tasks into threads and executes them in parallel. Assigning one
thread to repeatedly evaluate a standard SPARQL query realizes an easy but
effective continuous query. This is di erent than proposals like C-SPARQL, which
introduce extra syntax and a di erent execution model for query evaluation.
3 Di erent algorithms' performances and e ects on our stream reasoning systems will
be left as future work.</p>
      <p>Last but not least, the cache management algorithm needs to be e cient to
avoid data congestion by spending as little time as possible on deciding data
removal.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Conclusion and Future Work</title>
      <p>To conclude, we have proposed an innovative approach to enable history-aware
reasoning by utilizing semantic technologies in a data cache with a cache
management algorithm. As for our future work, we will explore several e ective cache
management algorithms, impacts of ontology expressiveness on the system
performance, as well as evaluation methods to benchmark our approach.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>The research described in the paper is part of the AIM Initiative at PNNL. It was
conducted under the Laboratory Directed Research and Development (LDRD)
program at PNNL, a multi-program national laboratory operated by Battelle
for the U.S. Department of Energy under contract DE-AC06-76RLO 1830.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Darko</given-names>
            <surname>Anicic</surname>
          </string-name>
          , Paul Fodor,
          <string-name>
            <given-names>Sebastian</given-names>
            <surname>Rudolph</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nenad</given-names>
            <surname>Stojanovic</surname>
          </string-name>
          .
          <article-title>Ep-sparql: a uni ed language for event processing and stream reasoning</article-title>
          .
          <source>In Proceedings of the 20th international conference on World wide web</source>
          , pages
          <volume>635</volume>
          {
          <fpage>644</fpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Abdullah</given-names>
            <surname>Balamash</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marwan</given-names>
            <surname>Krunz</surname>
          </string-name>
          .
          <article-title>An overview of web caching replacement algorithms</article-title>
          .
          <source>Communications Surveys &amp; Tutorials</source>
          , IEEE,
          <volume>6</volume>
          (
          <issue>2</issue>
          ):
          <volume>44</volume>
          {
          <fpage>56</fpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Davide F Barbieri and
          <string-name>
            <given-names>ED</given-names>
            <surname>Valle</surname>
          </string-name>
          .
          <article-title>A proposal for publishing data streams as linked data</article-title>
          .
          <source>In Linked Data on the Web Workshop</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Davide</given-names>
            <surname>Francesco</surname>
          </string-name>
          <string-name>
            <surname>Barbieri</surname>
          </string-name>
          , Daniele Braga, Stefano Ceri, Emanuele Della Valle, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grossniklaus</surname>
          </string-name>
          .
          <article-title>C-sparql: Sparql for continuous querying</article-title>
          .
          <source>In Proceedings of the 18th international conference on World wide web</source>
          , pages
          <volume>1061</volume>
          {
          <fpage>1062</fpage>
          . ACM,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Davide</given-names>
            <surname>Francesco</surname>
          </string-name>
          <string-name>
            <surname>Barbieri</surname>
          </string-name>
          , Daniele Braga, Stefano Ceri, Emanuele Della Valle, and
          <string-name>
            <given-names>Michael</given-names>
            <surname>Grossniklaus</surname>
          </string-name>
          .
          <article-title>Incremental reasoning on streams and rich background knowledge</article-title>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Emanuele</given-names>
            <surname>Della</surname>
          </string-name>
          <string-name>
            <surname>Valle</surname>
          </string-name>
          , Stefano Ceri, Davide Francesco Barbieri, Daniele Braga, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Campi</surname>
          </string-name>
          .
          <article-title>A rst step towards stream reasoning</article-title>
          . Springer,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Gorawski</surname>
          </string-name>
          , Anna Gorawska, and
          <string-name>
            <given-names>Krzysztof</given-names>
            <surname>Pasterak</surname>
          </string-name>
          .
          <article-title>A survey of data stream processing tools</article-title>
          .
          <source>In Information Sciences and Systems</source>
          <year>2014</year>
          , pages
          <fpage>295</fpage>
          {
          <fpage>303</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Margara</surname>
          </string-name>
          , Jacopo Urbani, Frank van Harmelen, and Henri Bal.
          <article-title>Streaming the web: Reasoning over dynamic data</article-title>
          .
          <source>Web Semantics: Science, Services and Agents on the World Wide Web</source>
          ,
          <volume>25</volume>
          :
          <fpage>24</fpage>
          {
          <fpage>44</fpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Smith</surname>
            <given-names>WP</given-names>
          </string-name>
          and
          <string-name>
            <given-names>MI</given-names>
            <surname>Borkum</surname>
          </string-name>
          .
          <article-title>Semantically enabling stream-reasoning architectural frameworks</article-title>
          .
          <source>Abstract submitted to The 24th ACM International Conference on Information and Knowledge Management (CIKM</source>
          <year>2015</year>
          ), Melbourne, Australia.,
          <year>2015</year>
          . PNNL-SA-
          <volume>110228</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>