<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards a Window-based Diverse Entity Summarisation Engine in Publish/Subscribe Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Niki Pavlopoulou</string-name>
          <email>niki.pavlopoulou@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edward Curry</string-name>
          <email>edward.curry@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Centre for Data Analytics, National University of Ireland Galway</institution>
          ,
          <addr-line>Galway</addr-line>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The rise of Smart Homes, Smart Cities and Internet of Things results in the creation of a wide range of entity-based data streams and users interested in the real-time analysis of these streams. These smart environments possess characteristics, like dynamism, continuity, heterogeneity and high volume of data and users. A suitable data dissemination paradigm is needed that can overcome these challenges and at the same time, provide expressive notifications to users, but not at the expense of usability or resources. Publish/Subscribe systems can efficiently realise some of these requirements; however, they need additional support when applied in smart environments to overcome assumptions related to usability and redundancy-awareness. Therefore, the key question of the paper is: Can we define an entity-centric Publish/Subscribe system that provides expressive user notifications along with high usability and limited resource usage?</p>
        <p>In this work, we explore this question and propose a Publish/Subscribe system with windowing, data fusion, and top-k diverse ranking that can result in the creation of expressive entity summaries using limited resources. Our results show that sending a top-k fused diverse summary as a notification is better than sending all the separate notifications or the fused ones without top-k filtering. Specifically, top-k fused diverse summarisation results in 50% to 80% reduction of forwarded messages and redundancy-awareness with an F-score ranging from 0.35 to 0.73 depending on the k. Nevertheless, these results are achieved at the expense of a slightly higher latency; therefore, there is some trade-off between latency, the number of forwarded messages, and expressiveness.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>These entity-based data streams present a high level of data heterogeneity [19], either schematic or semantic, as well as duplication. For example, semantics could involve the use of different words describing conceptually similar things. Duplicates and conceptually similar things result in redundant information. On the other hand, data consumers may have different levels of expressibility. Expressibility [8] refers to users' level of prior understanding of their needs to create queries with specific filters or their level of technical ability to use complex query languages. Sometimes a user might find it difficult to create an appropriate query or one might need to create separate complex queries or join queries to bring together the information needed from multiple sources.</p>
      <p>The challenges above, when combined with dynamism (deletion or addition of data producers or consumers), continuity (unbounded data streams) and the high volume of producers and consumers, characteristics that exist in smart environments, may result in inefficient and ineffective processing of streaming data. Specifically, high data volume and redundancy may lead to significant propagation, as well as storage overheads of unnecessary data within a network and slower processing time [2]. At the same time, low user expressibility may lead to abstract user queries that result in redundant answers and high volumes that might present the user with unnecessary information [25].</p>
      <p>Publish/Subscribe systems provide a suitable interaction scheme for dynamic large-scale applications, where subscribers (users) express their interest in an event or pattern of events, and they are notified when a suitable event was generated by a publisher [7]. These systems are characterised by space decoupling (publishers and subscribers do not need to know each other), time decoupling</p>
      <p>Therefore, the key question of the paper is: Can we define an
entity-centric Publish/Subscribe system that provides expressive
(nonredundant) user notifications along with high usability (no assumption
of high user expressibility) while using limited resources?</p>
      <p>
        To address the key question above, we propose in this paper a
window-based diverse entity summarisation engine in Publish/Subscribe
systems, as approximate solutions [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] are acceptable as quick
answers [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] within a small error range with high probability while
using limited resources. These summaries, when derived from the
fusion of multiple publishers that contain complex entity-based
semantic data and when combined with diversity (and not only
relatedness) will result in expressive subscription notifications.
Nevertheless, there is a trade-off between latency, number of forwarded
messages, and expressiveness, which we also examine.
      </p>
    </sec>
    <sec id="sec-2">
      <title>PROBLEM ANALYSIS</title>
      <p>The problem introduced above is analysed in more detail below.</p>
    </sec>
    <sec id="sec-3">
      <title>Motivational Scenario</title>
      <p>
        Sensors generate a large volume of streaming data at high
sampling rates. Therefore, they might produce many unchanged or
identical values for a period of time [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. When users create
abstract, non-sophisticated queries to gain knowledge on these data
streams, they might be presented with undesired duplication.
      </p>
      <p>For example, imagine Houston is a smart city, and a user is
interested in information concerning Rice University. The user has
no other information apart from the name of the university and
needs to gain more knowledge without knowing exactly what to
look for. A wide range of sensor readings contain information
about the university, ranging from the temperature of the university
to the city it is located in, some of which might be redundant. The
user would like to quickly gain knowledge about the university,
but not to be overwhelmed, especially with duplicate data. This
scenario is illustrated in Fig. 1.</p>
    </sec>
    <sec id="sec-4">
      <title>Problem Challenges</title>
      <p>The aforementioned motivational scenario faces a number of
challenges:</p>
      <list list-type="bullet">
        <list-item>
          <p>Redundancy awareness: Multiple publishers create
heterogeneous data about the entity. Some of this data, like
temperature and city, results in redundant information due
to duplication.</p>
        </list-item>
        <list-item>
          <p>Low user expressibility: The user has limited
information about the entity and has no prior knowledge of what
they are looking for. The user is unable to create a complex
filtering query and is not an expert in query languages. For
example, a SPARQL-like query that notifies the user when
the energy usage exceeds 4kWh would be the following:</p>
          <preformat>SELECT ?energy_value
FROM STREAM
WHERE {
  Rice_University energy_usage ?energy_value;
  FILTER (?energy_value &gt; 4kWh).
}</preformat>
          <p>This query assumes a priori knowledge from the user of
the publication semantics and schematics, such as the use of
"energy_usage" instead of synonyms like "energy_consumption",
of "kWh" instead of "Wh", or which stream or streams
produce energy usage readings. On the other hand, if the user
creates an abstract query like the keyword-based one "Rice
University", it may lead to redundant or undesired
information.</p>
        </list-item>
      </list>
    </sec>
    <sec id="sec-5">
      <title>BACKGROUND</title>
      <p>Some concepts and definitions concerning knowledge graphs, entity
summarisation, and Publish/Subscribe systems are described below.</p>
    </sec>
    <sec id="sec-7">
      <title>Knowledge Graphs and Entity Summarisation</title>
      <p>
        Knowledge graphs contain information regarding entities, which
are real-world or abstract things [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Within knowledge graphs,
the nodes represent the entities, and the directed labelled arcs
constitute relations among them. Fig. 2 shows the part of the knowledge
graph that supports the motivational scenario, where
Rice University, 15°C, 5kWh, United States, Houston, Texas and
Division I (NCAA) are entities or literals and temperature, energyUsage,
country, city, state and athletics are relations that connect
entities or literals through directed arcs. The Resource Description
Framework (RDF) is a data modelling language that represents this
information as triples &lt;subject, property, object&gt;, where
subjects are entities, objects are entities or literals and properties are their
relations. RDF triples with the same subject form an RDF star-like
graph.
      </p>
      <p>
        In knowledge graphs, though, there might be some redundant
information. This could be addressed by summarising the triples of
an entity. A summarisation of an entity e that is represented by a
node v in a knowledge graph G is a subgraph of G that surrounds
v [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>
        By adopting and adapting definitions that were introduced in
Cheng et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we provide some definitions for completeness.
      </p>
      <p>Let E be the set of all entities, L the set of all literals and P the
set of all properties.</p>
      <p>Definition 1 (Data Graph). A data graph is a digraph G =
⟨V, A, Lbl<sub>V</sub>, Lbl<sub>A</sub>⟩, where V is a finite set of nodes, A is a finite set of
directed edges where each a ∈ A has a source node Src(a) ∈ V and a
target node Tgt(a) ∈ V, and Lbl<sub>V</sub> : V ↦ E ∪ L and Lbl<sub>A</sub> : A ↦ P are
labelling functions that map nodes and edges to entities or literals,
and properties, respectively.</p>
      <p>Definition 2 (Triple). A triple tr is a sequence of &lt;subject,
property, object&gt; defined as tr = ⟨sub(tr), p(tr), obj(tr)⟩, where
sub(tr) ∈ E, p(tr) ∈ P and obj(tr) ∈ E ∪ L.</p>
      <p>Definition 3 (Triple Set). Given a data graph G, the triple set
of an entity e, denoted by Tr(e), is the set of all unique triples of e
that can be found in G.</p>
      <p>Definition 4 (Diverse Entity Summarisation). Given Tr(e)
and a positive integer k &lt; |Tr(e)|, the problem of diverse
entity summarisation is to select DivSumm(e) ⊂ Tr(e) such that
|DivSumm(e)| = k. DivSumm(e) is called a diverse summary of e
and it contains a set of unique triples.</p>
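      <p>Definitions 2 to 4 can be illustrated with a minimal Python sketch. It assumes that triples are plain string tuples; the names Triple, triple_set and diverse_summary are hypothetical, and the selection rule (one triple per distinct property first) is only an illustrative stand-in for a diversity ranking, not the method evaluated in this paper.</p>
      <preformat><![CDATA[
```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Definition 2: a <subject, property, object> triple."""
    sub: str
    p: str
    obj: str


def triple_set(graph: list[Triple], entity: str) -> set[Triple]:
    """Definition 3: all unique triples of `entity` found in the graph."""
    return {tr for tr in graph if tr.sub == entity}


def diverse_summary(triples: set[Triple], k: int) -> list[Triple]:
    """Definition 4: select k triples, with 0 < k < |Tr(e)|.

    Hypothetical selection rule: prefer triples whose property has not
    been chosen yet, then fill any remaining slots.
    """
    if not 0 < k < len(triples):
        raise ValueError("k must satisfy 0 < k < |Tr(e)|")
    ordered = sorted(triples)            # deterministic order for the sketch
    chosen, seen_props = [], set()
    for tr in ordered:                   # first pass: one triple per property
        if tr.p not in seen_props and len(chosen) < k:
            chosen.append(tr)
            seen_props.add(tr.p)
    for tr in ordered:                   # second pass: fill remaining slots
        if len(chosen) < k and tr not in chosen:
            chosen.append(tr)
    return chosen


# Illustrative data only; the values are assumptions.
graph = [
    Triple("Rice_University", "city", "Houston"),
    Triple("Rice_University", "city", "Houston"),       # duplicate
    Triple("Rice_University", "state", "Texas"),
    Triple("Rice_University", "temperature", "15°C"),
    Triple("Rice_University", "temperature", "16°C"),
    Triple("Houston", "country", "United_States"),
]
tr_e = triple_set(graph, "Rice_University")
summary = diverse_summary(tr_e, k=3)
```
]]></preformat>
      <p>In this sketch the duplicate city triple is absorbed by the set in Definition 3, so the summary is drawn from four unique triples.</p>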
    </sec>
    <sec id="sec-8">
      <title>Publish/Subscribe Systems</title>
      <p>
        In a typical Publish/Subscribe system [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], subscribers range from
users to applications that register their interest in an event
or pattern of events. These subscriptions are sent to the Event
Engine where they are stored. Publishers could be sensors, users or
applications generating events or publications and sending them
to the Event Engine. A matcher is contained in the engine that
matches specific events to subscriptions based on their conditions.
When a match occurs, the subscribers receive these events
as notifications. The decoupling of such a system in space, time and
synchronisation makes it a suitable interaction scheme for dynamic
large-scale applications.
      </p>
      <p>
        Publish/Subscribe systems are typically topic-based or
content-based [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In topic-based systems, publishers publish events on specific
topics expressed as keywords (e.g. Sports), and subscribers that have
subscribed to these topics get notified whenever there is a match.
Content-based systems improve on the expressiveness of topic-based ones
by adding event content filtering on the subscription side. This
filtering typically involves comparison operators (=, &lt;, ≤, &gt;, ≥) on
attribute-value pairs derived from the events. Complex subscription
patterns can also be created by logical combinations (and, or, etc.)
of individual constraints. For example, an event could be (gender
= female, age = 20) and a subscription that matches it could be
(gender = female, age &lt; 30).
      </p>
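      <p>The content-based matching described above can be sketched in a few lines of Python. This is an illustration under stated assumptions: events are dictionaries, a subscription is a list of (attribute, operator, value) constraints, and only conjunction is handled; the names OPS and matches are hypothetical.</p>
      <preformat><![CDATA[
```python
import operator

# Comparison operators named in the text: =, <, ≤, >, ≥
OPS = {
    "=": operator.eq,
    "<": operator.lt,
    "<=": operator.le,
    ">": operator.gt,
    ">=": operator.ge,
}


def matches(event: dict, subscription: list) -> bool:
    """Return True if the event satisfies every constraint (conjunction)."""
    for attribute, op, value in subscription:
        if attribute not in event:
            return False                 # a missing attribute cannot match
        if not OPS[op](event[attribute], value):
            return False
    return True


# The example from the text: the event matches the subscription.
event = {"gender": "female", "age": 20}
subscription = [("gender", "=", "female"), ("age", "<", 30)]
is_match = matches(event, subscription)
```
]]></preformat>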
      <p>
        Lately, some attention has been drawn to graph-based
Publish/Subscribe systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that represent publications as graphs.
Within these graphs, points of interest are nodes and the relations
between them are edges. Subscriptions can be SPARQL-like, asking
for specific nodes and the relations among them. The notifications
are those graphs that match the subscriptions.
      </p>
    </sec>
    <sec id="sec-9">
      <title>RELATED WORK</title>
      <p>Related work is analysed below, and it is mainly split into two
categories: Streaming and Non-Streaming.</p>
    </sec>
    <sec id="sec-10">
      <title>Streaming</title>
      <p>4.1.1 Stream Processing Frameworks. There is a plethora of
existing stream processing frameworks, like Apache Spark<sup>1</sup>, Apache
Flink<sup>2</sup> and Apache Kafka<sup>3</sup>, that could be extended to support entity
summarisation techniques, but some of them do not support
Publish/Subscribe. Publish/Subscribe systems, like Apache Kafka, are
topic-based; therefore, they are not capable of supporting entities
that contain complex semantic data. Furthermore, the constraints
of these frameworks in supporting specific data formats or SQL-like
APIs could lead to low usability if the user has low expressibility.</p>
      <p>
        4.1.2 Graphs in Publish/Subscribe Systems. As discussed above,
topic-based and content-based Publish/Subscribe systems are not
capable of supporting entities. Cañas et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduce GraPS, a
graph-based Publish/Subscribe system that can model publications
as graphs, where points of interest are nodes and relations between
them form edges. Subscriptions can be either simple ones, like a
collection of nodes or complex ones, like specific relations among
nodes. Nevertheless, they assume that the subscribers have limited
knowledge of the published graph in order to filter it; therefore, they are
aware of the semantics and schematics of the graph. This could
lead to low usability if the subscribers have low expressibility. Also,
they do not support summarisation of the semantic information.
      </p>
      <p><sup>1</sup>https://spark.apache.org/ <sup>2</sup>https://flink.apache.org/ <sup>3</sup>https://kafka.apache.org/</p>
      <p>
        4.1.3 Diversity in Publish/Subscribe Systems. Diversity of events
has not been considerably explored in Publish/Subscribe systems.
Chen et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] focus on top-k diverse publications in the form of
tweets by calculating their cosine similarity, whereas Drosou et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] emphasise top-k diverse publications in the form of
attribute-value pairs by calculating the commonalities among the events.
However, neither work supports heterogeneity, nor do they
consider entities as publications, which is a more complex problem.
      </p>
      <p>
        4.1.4 Summarisation in Publish/Subscribe Systems.
Summarisation has been examined in Publish/Subscribe systems by several
works. Triantafillou et al. [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] focus on subscription summarisation
in the sense of subscription subsumption, that is an attribute-value
constraint of a subscription is subsumed by that of another
subscription if the values are the same or if they are contained in
the values of the latter subscription. Specifically, each subscription
is split into its attribute-value pairs, which are then merged into
summary structures. Wang et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] emphasise subscription
summaries by partitioning via random, R-tree and K-means
clustering techniques and summary-based routing via R-trees among a set
of servers to achieve high system throughput. These works focus
on subscription subsumption or covering without considering
publication summarisation. They also support simple attribute-value
pairs, so they cannot be used for complex semantic data.
      </p>
      <p>
        4.1.5 Fusion in Publish/Subscribe Systems. Fusion has been used
in Publish/Subscribe systems before. Kolozali et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] fuse
sensor data from heterogeneous sources, translate attribute-value
pairs into time series and then approximate them with dimensionality
reduction. Nevertheless, the approximation is done outside the
Publish/Subscribe system, and it is not related to entity summarisation.
Wun et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] fuse attribute-value pairs that result in semantic
interpretations with the use of ontologies. Nevertheless, they tackle
a different problem from entity summarisation.
      </p>
      <p>
        4.1.6 Approximate Semantic Matching in Publish/Subscribe
Systems. Approximate semantic matching in Publish/Subscribe
systems has been examined by a number of works. These works
introduce an additional layer of decoupling, that of semantic decoupling,
in Publish/Subscribe systems. Hasan et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] create an
approximate semantic single-event processing model for attribute-value
pairs coming from heterogeneous sources. Top-1 and top-k
matchers are created based on Wikipedia ESA and probabilistic models.
Their earlier work [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] focuses on RDF graphs as publications.
STOPSS [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses synonyms, taxonomies and mapping functions
specified by domain experts for creating an approximate matcher.
Although approximate semantic matching could be related to our
work, it is a different problem from entity summarisation.
      </p>
    </sec>
    <sec id="sec-11">
      <title>Non-Streaming</title>
      <p>
        4.2.1 Diverse Entity Summarisation. Top-k diversity in entities
in the form of sophisticated summaries that detect duplication and
conceptual similarity has been tackled by a number of works. These
works consider high usability as they use keyword-based queries.
DIVERSUM [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] focuses on per-property summarisation
based on novelty, importance, popularity and diversity by
adapting document-based Information Retrieval to knowledge
graphs. FACES [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] emphasises summaries based on diversity,
uniqueness, and popularity via hierarchical conceptual clustering
and the use of WordNet for related terms. FACES-E [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] improves
on FACES by considering types in datatype properties instead of
only object properties for entity summarisation. Pouriyeh et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]
emphasise summaries based on topic modelling by considering
predicates as topics and using Word2Vec for related terms. All of
these works contain static methodologies; therefore, they need to
be extended to support a complex dynamic environment.
      </p>
      <p>In conclusion, no existing approach covers the requirements of
our problem. A comparison among the works covered in the different
subsections is shown in Table 1.</p>
    </sec>
    <sec id="sec-12">
      <title>APPROACH</title>
      <p>The approach, which defines the event model, the
subscription model and the architecture of the summarisation
engine, is described below.</p>
    </sec>
    <sec id="sec-13">
      <title>Event Model</title>
      <p>To support complex semantic data, the event payload contains RDF
triples of the form &lt;subject, property, object&gt;. Each event is an
instance of an entity (subject) with one predicate (property) and
one value for this predicate (object). An example of a
publication payload is shown below:</p>
      <p>{&lt;Rice_University&gt; &lt;city&gt; &lt;Houston&gt;}</p>
      <p>Therefore, the definition of the event model is as follows: Let EV
be the set of events, PID the set of publisher IDs, PubID the set of
publication IDs, T the set of timestamps and Tr(e) the triple set of
an entity e, respectively, then:
        <disp-formula><label>(1)</label><tex-math>ev \in EV \Leftrightarrow ev = \{\, pID,\ pubID,\ t,\ tr,\ \ldots \;:\; pID \in PID,\ pubID \in PubID,\ t \in T,\ tr \in Tr(e) \,\}</tex-math></disp-formula>
      </p>
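      <p>As an illustration of Eq. (1), an event can be sketched as an immutable Python record. The class name Event, the field names and all identifier values below are assumptions made for the example; the event model itself does not prescribe them.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One event of Eq. (1): publisher ID, publication ID, timestamp, triple."""
    p_id: str
    pub_id: str
    t: float
    tr: tuple[str, str, str]     # (subject, property, object)


ev = Event(
    p_id="publisher-1",          # hypothetical identifier
    pub_id="publication-42",     # hypothetical identifier
    t=1.0,                       # hypothetical timestamp
    tr=("Rice_University", "city", "Houston"),
)
```
]]></preformat>
      <p>Freezing the record makes events hashable, which is convenient when duplicates have to be detected inside a window.</p>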
    </sec>
    <sec id="sec-14">
      <title>Subscription Model</title>
      <p>
        To support high usability, we do not assume that subscribers are
aware of the semantics and structure of the events or that they are
experts in complex query languages, like SPARQL. Therefore, a
subscription should ideally be in the form of a keyword query [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ].
      </p>
      <p>Subscriptions, therefore, are a set of attribute-value pairs. Only
conjunction has been considered in this work. This means that each
event needs to fulfil all constraints of a subscription so that it can
be considered a match. Each pair consists of an attribute, an equality
operator and a value. An example of a subscription
payload is shown below:</p>
      <p>{entity = "&lt;Rice_University&gt;", k = 5, windowSize = 10, ranking
= "Diversity"}</p>
      <p>In the example above, the subscriber is interested in an event
summary of the entity &lt;Rice_University&gt; with top-5 diverse
information facts derived from the analysis of data taken from
count windows of size 10, that is 10 events.</p>
      <p>Therefore, the definition of the subscription model is as follows:
Let S be the set of subscriptions, SID the set of subscriber IDs,
SubID the set of subscription IDs, T the set of timestamps, ATT the
set of attributes, OP the set of operators and VAL the set of values,
respectively, then:
        <disp-formula><label>(2)</label><tex-math>s \in S \Leftrightarrow s = \{\, sID,\ subID,\ t,\ (att, op, val),\ \ldots \;:\; sID \in SID,\ subID \in SubID,\ t \in T,\ (att, op, val) \in ATT \times OP \times VAL \,\}</tex-math></disp-formula>
      </p>
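      <p>A subscription following Eq. (2) can be sketched in the same way. The class name Subscription, its value helper and the identifier values are hypothetical; only the four attribute-value pairs are taken from the example payload above.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subscription:
    """One subscription of Eq. (2): IDs, timestamp and (att, op, val) constraints."""
    s_id: str
    sub_id: str
    t: float
    constraints: tuple

    def value(self, attribute: str):
        """Return the value bound to `attribute` by an equality constraint."""
        for att, op, val in self.constraints:
            if att == attribute and op == "=":
                return val
        raise KeyError(attribute)


sub = Subscription(
    s_id="subscriber-1",         # hypothetical identifier
    sub_id="subscription-7",     # hypothetical identifier
    t=0.0,                       # hypothetical timestamp
    constraints=(
        ("entity", "=", "<Rice_University>"),
        ("k", "=", 5),
        ("windowSize", "=", 10),
        ("ranking", "=", "Diversity"),
    ),
)
```
]]></preformat>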
    </sec>
    <sec id="sec-15">
      <title>Architecture</title>
      <p>Our architecture is illustrated in Fig. 3. In the architecture, a
Publisher creates a number of entity-based publications and a Subscriber
a number of entity-related subscriptions. All publications and
subscriptions enter the Summarisation Engine, which is the processing
engine of the system.</p>
      <p>The engine contains a boolean Matcher that extracts the matched
entities based on the stored subscriptions and publications. All
publications enter the Window Partitioning component, which is responsible for
creating tumbling Count Windows for each matched entity. The
corresponding window is then populated with events from all
publishers concerning this entity. All events are fused within the window
incrementally, and the Summarisation component checks them
for duplicates. A score is then given to each triple. Triples that
are non-duplicates and the most recent ones have higher
scores. Top-k filtering then selects the diverse top-k most recent
triples. Once the corresponding window reaches its capacity, which is
based on the windowSize defined by the subscriber, the subscriber
is notified by the Notification component, and then the process starts again.</p>
      <p>For example, if a subscriber is interested in an event summary
of the entity &lt;Rice_University&gt; with top-5 diverse notifications
deriving from the analysis of the last 10 events of Fig. 1 in a window,
then a possible notification payload would be:</p>
      <preformat>{&lt;Rice_University&gt; &lt;temperature&gt; &lt;15°C&gt;}
{&lt;Rice_University&gt; &lt;energyUsage&gt; &lt;5kWh&gt;}
{&lt;Rice_University&gt; &lt;city&gt; &lt;Houston&gt;}
{&lt;Rice_University&gt; &lt;state&gt; &lt;Texas&gt;}
{&lt;Rice_University&gt; &lt;country&gt; &lt;UnitedStates&gt;}</preformat>
      <p>as the duplicate information of temperature and city was
discarded and the rest of the triples were the most recent ones based
on their timestamps.</p>
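      <p>The pipeline described in this section (matching, tumbling count window, incremental fusion, duplicate check, recency-based scoring and top-k filtering) can be sketched as follows. This is a simplified single-entity illustration, not the evaluated implementation: the class name SummarisationEngine, the method publish, and the choice to score a triple purely by its latest timestamp are assumptions.</p>
      <preformat><![CDATA[
```python
class SummarisationEngine:
    """Sketch: a tumbling count window for one entity with duplicate-aware top-k."""

    def __init__(self, entity, k, window_size):
        self.entity = entity
        self.k = k
        self.window_size = window_size
        self.count = 0       # events seen in the current window
        self.latest = {}     # fused triples: triple -> latest timestamp

    def publish(self, timestamp, triple):
        """Feed one event; return a summary when the window is full, else None."""
        subject, _, _ = triple
        if subject != self.entity:           # boolean matcher on the entity
            return None
        self.count += 1
        # Incremental fusion: a duplicate triple only refreshes its timestamp.
        self.latest[triple] = timestamp
        if self.count < self.window_size:
            return None
        # Assumed scoring: rank distinct triples by recency, keep the top k.
        ranked = sorted(self.latest, key=self.latest.get, reverse=True)
        summary = ranked[: self.k]
        self.count, self.latest = 0, {}      # tumbling window: start afresh
        return summary


# Illustrative events (timestamp, triple); the values are assumptions.
engine = SummarisationEngine("Rice_University", k=2, window_size=4)
events = [
    (1, ("Rice_University", "city", "Houston")),
    (2, ("Rice_University", "temperature", "15°C")),
    (3, ("Rice_University", "city", "Houston")),       # duplicate
    (4, ("Rice_University", "state", "Texas")),
]
notifications = [engine.publish(t, tr) for t, tr in events]
```
]]></preformat>
      <p>Only the fourth call returns a notification, because the window of size 4 is full at that point; the duplicate city triple is fused into a single entry before the top-2 selection.</p>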
    </sec>
    <sec id="sec-16">
      <title>EVALUATION</title>
      <p>To the best of our knowledge, no one has tackled entity
summarisation in Publish/Subscribe systems. Therefore, we compare our
approach with two baselines: the non-top-k non-fused approach, where all events
are sent separately to the subscriber without being fused or checked
for duplicates, and the non-top-k fused approach, where the
events are fused in the window but are not checked for
duplicates.</p>
    </sec>
    <sec id="sec-17">
      <title>Dataset</title>
      <p>The DBpedia dataset<sup>4</sup> has been selected for our evaluation, as it is
highly popular in the field of entity summarisation. Following the
entity selection of FACES, 50 entities were chosen that belong to
different domains (e.g. politician, actor, country, etc.) and have
an average of 44 distinct direct features per entity. As in FACES, we
filtered out any schema information and dataset-dependent details,
such as dcterms:subject, rdf:type, owl:sameAs, wordnet type and
Wikipedia-related links. We did not consider literals, only
resource-based objects, as they provided richer information.</p>
      <p>To simulate the graph evolution, we extracted information from
different versions of DBpedia, and we started by adding triples from
the oldest version to the newest. All entities and their triples follow
a uniform distribution in the selection process by the publishers.
50 publishers are used, and each one is responsible for
generating events of one entity. A single subscriber generates 50
subscriptions, one for each entity.</p>
      <p>All experiments were run 5 times, and the average was taken.
All runs took place on a laptop with Intel(R) Core(TM) i7-6600U
CPU@2.60GHz 2.80GHz and 16GB of RAM.</p>
    </sec>
    <sec id="sec-18">
      <title>Metrics</title>
      <p>
        6.2.1 Redundancy-aware F-score. We are using the metrics of
redundancy precision and redundancy recall defined in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ], and
through these, we calculate the redundancy-aware F-score. For our
work, we define duplicate triples as "redundant". The scores are
defined as:
      </p>
      <p>
        <disp-formula><label>(3)</label><tex-math>Red\_precision = \frac{R^{-}}{R^{-} + N^{-}}</tex-math></disp-formula>
        <disp-formula><label>(4)</label><tex-math>Red\_recall = \frac{R^{-}}{R^{-} + R^{+}}</tex-math></disp-formula>
        <disp-formula><label>(5)</label><tex-math>Red\_F\text{-}score = 2 \times \frac{Red\_precision \times Red\_recall}{Red\_precision + Red\_recall}</tex-math></disp-formula>
where R− is the set of non-delivered redundant triples, N− is
the set of non-delivered non-redundant ones and R+ is the set of
delivered redundant ones.</p>
      <p>6.2.2 End-to-End Latency. For the non-fused events, the
end-to-end latency is the time between the publication of an
event and its delivery. For the fused events, it is the time
between the earliest triple in the fusion and the fusion’s
delivery.</p>
      <p>6.2.3 Number of Messages. This metric is split between the
number of forwarded messages, that is the number of triples that
are sent upstream, and the number of redundant messages, that is
the number of duplicates within the event set.</p>
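      <p>The redundancy-aware F-score of Section 6.2.1 can be computed from three counts, as in the following Python sketch. The function name redundancy_f_score is hypothetical, the sizes of the sets R−, N− and R+ are passed as integers, and returning 0 when no redundant triple is withheld is an assumption made to avoid a division by zero.</p>
      <preformat><![CDATA[
```python
def redundancy_f_score(r_minus: int, n_minus: int, r_plus: int) -> float:
    """Redundancy-aware F-score from the counts used in Eqs. (3)-(5).

    r_minus: number of non-delivered redundant triples
    n_minus: number of non-delivered non-redundant triples
    r_plus:  number of delivered redundant triples
    """
    if r_minus == 0:
        return 0.0                        # assumed convention, avoids 0/0
    precision = r_minus / (r_minus + n_minus)     # Eq. (3)
    recall = r_minus / (r_minus + r_plus)         # Eq. (4)
    return 2 * precision * recall / (precision + recall)   # Eq. (5)


# Hypothetical counts: all 20 duplicates are withheld, together with
# 20 non-redundant triples that top-k filtering also discarded.
score = redundancy_f_score(r_minus=20, n_minus=20, r_plus=0)
```
]]></preformat>
      <p>With these hypothetical counts, recall is 1 and precision is 0.5, which illustrates how discarding non-redundant triples lowers the score even when every duplicate is removed.</p>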
    </sec>
    <sec id="sec-19">
      <title>Results</title>
      <p>The results are illustrated in Fig. 4. We selected the k value to range
from 5 to 30 and the window size to be either 50 or 100 events.</p>
      <p>In Fig. 4(a) we observe that in terms of end-to-end latency, all
approaches have higher latencies for larger windows. This is
expected: although the fusion and top-k diversity are incremental
within the window, the notification is sent after the window is
populated; therefore, the population time is also considered. The non-fused
non-top-k approach performs slightly better in terms of latency, as
once the window is populated all events are sent separately and
their latencies are not dependent on the earliest event in the
window as in the fused case. The fused non-top-k and fused top-k
approaches behave similarly in terms of latency, although the
top-k filtering might cause it to fluctuate according to the k, as we observe
a slight rise with k.</p>
      <p>Fig. 4(b) shows that the top-k approach reduces the number of forwarded messages
by 50% to 80%, depending on the k, compared to the baselines
(both of them had similar
results, so only one is shown). For higher values of k, more messages
are forwarded upstream. Therefore, the power of top-k filtering
is more evident for lower k, assuming not much loss of valuable
information occurs. Of the messages forwarded by the baselines,
22% and 42% were duplicates for windowSize = 50 and windowSize
= 100, respectively (Fig. 4(c)). The top-k approach can discard this
duplicate information, thereby reducing the overall number of forwarded
messages. The top-k approach forwards more messages
for smaller windows. This happens because, even though
the number of forwarded messages depends on k for all window
sizes, more notifications are produced over the same duration
for smaller windows than for bigger ones; therefore, more
messages are sent in total. This also depends on the number of
events produced by the publishers.</p>
      <p>On the other hand, top-k filtering results in the elimination not only
of duplicate information but also of possibly
valuable information. This is depicted in Fig. 4(d) by the
redundancy-aware F-score that ranges from 0.35 to 0.73. A lower F-score occurs
for lower k as stricter content filtering is taking place, whereas
a higher F-score is observed as the window size increases, as
the bigger the window, the more likely it is that redundant information
exists.</p>
      <p>Therefore, we observe a trade-off between latency, forwarded
messages, and expressiveness. Specifically, although top-k filtering
reduces the number of duplicates and overall messages sent to the
subscriber compared to the baselines with comparable end-to-end
latencies, some non-redundant information will be lost.</p>
    </sec>
    <sec id="sec-20">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>
        In this paper, we introduce the first window-based diverse entity
summarisation in Publish/Subscribe systems that provides high
usability and expressive notifications of data deriving from
heterogeneous sources in environments like the Internet of Things.
We examine the trade-off between latency, number of forwarded
messages, and expressiveness. Future work will focus on diversity
not only based on duplicates but also on conceptual similarity [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
and more sophisticated ranking methodologies to boost
expressiveness. This work will be further evaluated by adapting static entity
summarisation techniques in streaming environments. More
personalised subscriptions will also be explored that give opportunities
for the subscribers to define which information might be more
interesting to them. Finally, more types of windows apart from count
ones will be implemented to determine their performance.
      </p>
    </sec>
    <sec id="sec-21">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the European Union’s Horizon 2020
research programme Big Data Value ecosystem (BDVe) grant No
732630 and in part by Science Foundation Ireland (SFI) under Grant
Number SFI/12/RC/2289_P2, co-funded by the European Regional
Development Fund.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Charu C</given-names>
            <surname>Aggarwal</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <source>Data streams: models and algorithms</source>
          . Vol.
          <volume>31</volume>
          . Springer Science &amp; Business Media.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Ejaz</given-names>
            <surname>Ahmed</surname>
          </string-name>
          and Mubashir Husain Rehmani.
          <year>2017</year>
          .
          <article-title>Mobile edge computing: opportunities, solutions, and challenges</article-title>
          . (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>César</given-names>
            <surname>Cañas</surname>
          </string-name>
          , Eduardo Pacheco, Bettina Kemme, Jörg Kienzle, and
          <string-name>
            <given-names>Hans-Arno</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Graps: A graph publish/subscribe middleware</article-title>
          .
          <source>In Proceedings of the 16th Annual Middleware Conference. ACM</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Lisi</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>Gao</given-names>
            <surname>Cong</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Diversity-aware top-k publish/subscribe for text stream</article-title>
          .
          <source>In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM</source>
          ,
          <fpage>347</fpage>
          -
          <lpage>362</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Gong</given-names>
            <surname>Cheng</surname>
          </string-name>
          , Thanh Tran, and
          <string-name>
            <given-names>Yuzhong</given-names>
            <surname>Qu</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Relin: relatedness and informativeness-based centrality for entity summarization</article-title>
          .
          <source>In International Semantic Web Conference</source>
          . Springer,
          <fpage>114</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Marina</given-names>
            <surname>Drosou</surname>
          </string-name>
          , Kostas Stefanidis, and
          <string-name>
            <given-names>Evaggelia</given-names>
            <surname>Pitoura</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Preference-aware publish/subscribe delivery with diversity</article-title>
          .
          <source>In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems. ACM</source>
          ,
          <volume>6</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Patrick Th</given-names>
            <surname>Eugster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pascal A</given-names>
            <surname>Felber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Rachid</given-names>
            <surname>Guerraoui</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anne-Marie</given-names>
            <surname>Kermarrec</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>The many faces of publish/subscribe</article-title>
          .
          <source>ACM computing surveys (CSUR)</source>
          <volume>35</volume>
          ,
          <issue>2</issue>
          (
          <year>2003</year>
          ),
          <fpage>114</fpage>
          -
          <lpage>131</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>George W.</given-names>
            <surname>Furnas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Thomas K.</given-names>
            <surname>Landauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Louis M.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Susan T.</given-names>
            <surname>Dumais</surname>
          </string-name>
          .
          <year>1987</year>
          .
          <article-title>The vocabulary problem in human-system communication</article-title>
          .
          <source>Commun. ACM</source>
          <volume>30</volume>
          ,
          <issue>11</issue>
          (
          <year>1987</year>
          ),
          <fpage>964</fpage>
          -
          <lpage>971</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Kalpa</given-names>
            <surname>Gunaratna</surname>
          </string-name>
          , Krishnaprasad Thirunarayan, Amit Sheth, and Gong Cheng.
          <year>2016</year>
          .
          <article-title>Gleaning types for literals in RDF triples with application to entity summarization</article-title>
          .
          <source>In European Semantic Web Conference</source>
          . Springer,
          <fpage>85</fpage>
          -
          <lpage>100</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Kalpa</given-names>
            <surname>Gunaratna</surname>
          </string-name>
          , Krishnaprasad Thirunarayan, and Amit P Sheth.
          <year>2015</year>
          .
          <article-title>FACES: Diversity-Aware Entity Summarization Using Incremental Hierarchical Conceptual Clustering</article-title>
          .
          <source>In AAAI</source>
          .
          <fpage>116</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Souleiman</given-names>
            <surname>Hasan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Curry</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Approximate semantic matching of events for the internet of things</article-title>
          .
          <source>ACM Transactions on Internet Technology (TOIT) 14</source>
          ,
          <issue>1</issue>
          (
          <year>2014</year>
          ),
          <fpage>2</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Souleiman</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sean</given-names>
            <surname>O'Riain</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Curry</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>Approximate semantic matching of heterogeneous events</article-title>
          .
          <source>In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems. ACM</source>
          ,
          <fpage>252</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Sefki</given-names>
            <surname>Kolozali</surname>
          </string-name>
          , Maria Bermudez-Edo, Daniel Puschmann, Frieder Ganz, and
          <string-name>
            <given-names>Payam</given-names>
            <surname>Barnaghi</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>A knowledge-based approach for real-time iot data stream annotation and processing</article-title>
          .
          <source>In 2014 IEEE International Conference on Internet of Things (iThings), and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom)</source>
          . IEEE,
          <fpage>215</fpage>
          -
          <lpage>222</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>K Prasanna</given-names>
            <surname>Lakshmi</surname>
          </string-name>
          and
          <string-name>
            <given-names>CRK</given-names>
            <surname>Reddy</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>A survey on different trends in data streams</article-title>
          .
          <source>In Networking and Information Technology (ICNIT), 2010 International Conference on. IEEE</source>
          ,
          <fpage>451</fpage>
          -
          <lpage>455</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Shobharani</given-names>
            <surname>Pacha</surname>
          </string-name>
          , Suresh Ramalingam Murugan, and
          <string-name>
            <given-names>R</given-names>
            <surname>Sethukarasi</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Semantic annotation of summarized sensor data stream for effective query processing</article-title>
          .
          <source>The Journal of Supercomputing</source>
          (
          <year>2017</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Niki</given-names>
            <surname>Pavlopoulou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Edward</given-names>
            <surname>Curry</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Streams</article-title>
          .
          <source>In 2019 First International Conference on Graph Computing (GC)</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Milenko</given-names>
            <surname>Petrovic</surname>
          </string-name>
          , Ioana Burcea, and
          <string-name>
            <given-names>Hans-Arno</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          .
          <year>2003</year>
          .
          <article-title>S-topss: Semantic toronto publish/subscribe system</article-title>
          .
          <source>In Proceedings 2003 VLDB Conference. Elsevier</source>
          ,
          <fpage>1101</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Seyedamin</given-names>
            <surname>Pouriyeh</surname>
          </string-name>
          , Mehdi Allahyari, Krys Kochut, Gong Cheng, and Hamid Reza Arabnia.
          <year>2018</year>
          .
          <article-title>Combining word embedding and knowledge-based topic modeling for entity summarization</article-title>
          .
          <source>In 2018 IEEE 12th International Conference on Semantic Computing (ICSC)</source>
          . IEEE,
          <fpage>252</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Yongrui</given-names>
            <surname>Qin</surname>
          </string-name>
          , Quan Z Sheng, Nickolas JG Falkner, Schahram Dustdar,
          <string-name>
            <given-names>Hua</given-names>
            <surname>Wang</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Athanasios V</given-names>
            <surname>Vasilakos</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>When things matter: A survey on data-centric internet of things</article-title>
          .
          <source>Journal of Network and Computer Applications</source>
          <volume>64</volume>
          (
          <year>2016</year>
          ),
          <fpage>137</fpage>
          -
          <lpage>153</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Marcin</given-names>
            <surname>Sydow</surname>
          </string-name>
          , Mariusz Pikuła, and
          <string-name>
            <given-names>Ralf</given-names>
            <surname>Schenkel</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>DIVERSUM: Towards diversified summarisation of entities in knowledge graphs</article-title>
          .
          <source>In Data Engineering Workshops (ICDEW), 2010 IEEE 26th International Conference on. IEEE</source>
          ,
          <fpage>221</fpage>
          -
          <lpage>226</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Triantafillou</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Economides</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Subscription summarization: A new paradigm for efficient publish/subscribe systems</article-title>
          .
          <source>In 24th International Conference on Distributed Computing Systems</source>
          ,
          <year>2004</year>
          . Proceedings. IEEE,
          <fpage>562</fpage>
          -
          <lpage>571</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Yi-min</given-names>
            <surname>Wang</surname>
          </string-name>
          , Lili Qiu, Chad E Verbowski, Demetrios Achlioptas,
          <string-name>
            <given-names>Gautam</given-names>
            <surname>Das</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Per-Ake</given-names>
            <surname>Larson</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>Summary-based routing for content-based event distribution networks</article-title>
          .
          (April 3
          <year>2007</year>
          ).
          <source>US Patent 7,200,675</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Alex</given-names>
            <surname>Wun</surname>
          </string-name>
          , Milenko Petrovic, and
          <string-name>
            <given-names>Hans-Arno</given-names>
            <surname>Jacobsen</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>A system for semantic data fusion in sensor networks</article-title>
          .
          <source>In Proceedings of the 2007 inaugural international conference on Distributed event-based systems. ACM</source>
          ,
          <fpage>75</fpage>
          -
          <lpage>79</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>Yi</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Jamie Callan, and
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Minka</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Novelty and redundancy detection in adaptive filtering</article-title>
          .
          <source>In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM</source>
          ,
          <fpage>81</fpage>
          -
          <lpage>88</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Cai-Nicolas</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sean M</given-names>
            <surname>McNee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Joseph A</given-names>
            <surname>Konstan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Georg</given-names>
            <surname>Lausen</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Improving recommendation lists through topic diversification</article-title>
          .
          <source>In Proceedings of the 14th international conference on World Wide Web. ACM</source>
          ,
          <fpage>22</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>