<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Information-theoretic Analysis of Entity Dynamics on the Linked Open Data Cloud</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Chifumi Nishioka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ansgar Scherp</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kiel University</institution>
          ,
          <addr-line>Christian-Albrechts-Platz 4, 24118 Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leibniz Information Centre for Economics</institution>
          ,
          <addr-line>Dusternbrooker Weg 120, 24105 Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The Linked Open Data (LOD) cloud is expanding continuously. Entities appear, change, and disappear over time. However, relatively little is known about the dynamics of the entities, i. e., the characteristics of their temporal evolution. In this paper, we employ clustering techniques over the dynamics of entities to determine common temporal patterns. We de ne an entity as RDF resource together with its attached RDF types and properties. The quality of the clusterings is evaluated using entity features such as the entities' properties, RDF types, and pay-level domain. In addition, we investigate to what extend entities that share a feature value change together over time. As dataset, we use weekly LOD snapshots over a period of more than three years provided by the Dynamic Linked Data Observatory. Insights into the dynamics of entities on the LOD cloud has strong practical implications to any application requiring fresh caches of LOD. The range of applications is from determining crawling strategies for LOD, caching SPARQL queries, to programming against LOD, and recommending vocabularies for reusing LOD vocabularies.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The Linked Open Data (LOD) cloud is a global information space to structurally
represent and connect data. Since its advent in 2007, it has been continuously
evolving and is covering a wide range of domains [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], today. Understanding
the evolution of the LOD cloud is important for di erent applications such as
LOD caching [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ], indexing of distributed data sources [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and query
optimization [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. For instance, Umbrich et al. [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ] developed a query execution engine
taking into account an analysis whether a dataset is static or dynamic, in order
to automatically decide whether data should be retrieved from caches or from
the original LOD cloud. The authors have compared two snapshots of a LOD
dataset captured at two di erent points in time and analyzed which triples have
been preserved, deleted, and added. Kafer et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] quanti ed changes with
respect to set of triples, set of links, and schema signature. Dividino et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
? Copyright held by the authors.
measured changes with respect to the schema information of a dataset. While
the aforementioned works focused on evaluating changes between two di erent
points in time, Dividino et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] also analyzed the temporal dynamics of a
dataset, i. e., the temporal evolution of the LOD cloud over the entire period.
They presented detailed analyses of LOD evolution for thirteen LOD sources
and represent the degree of evolution by a non-negative, single value. The result
of the analyses enables to crawl LOD documents more e ciently [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        In the realm of web search, knowledge bases contain information about
different real-world objects or concepts commonly referred to as entities. Since the
most popular type of queries contain entities [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ], it is important to understand
temporal dynamics of entities, in order to keep information of entities up to date
and accurate. To best of our knowledge, few works have conducted an analysis of
temporal dynamics in the entity level such as the early work of Ding et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In
terms of applications, triple stores are accessed by various web data applications
via SPARQL queries. Martin et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] showed that SPARQL query caching
improve the performance of query engines. While they assumed that they could be
aware of all updates of data, in practice it is not easy. In addition, recent pro ling
methods utilize entities from the LOD cloud. Schuhmacher et al. [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] proposed
a document pro ling method using DBpedia. Their method annotates a given
document with DBpedia entities using a snapshot of DBpedia. It is important
to take into account the temporal dynamics of entities, because DBpedia has
been updated continuously and a part of information in a snapshot of DBpedia
can quickly get stale.
      </p>
      <p>
        In this paper, we analyze temporal dynamics of entities on the LOD cloud. To
this end, we use snapshots of the LOD cloud over three years. We rst introduce
how to de ne entities on the LOD cloud. Subsequently, we measure the degree of
changes of entities between two snapshots and represent the dynamics of entities
as time series. We use two di erent representations of entities and two di erent
distance measures to compute the degree of changes. In addition, we introduce
di erent triple weighting methods, where each triple in the entity has di erent
importance. We apply k-means++ clustering to nd out temporal patterns of
entity dynamics. We discover dominant temporal patterns in the clusters by
using a periodicity detection algorithm by Elfeky et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Subsequently, we
investigate which features of entities control patterns of entity dynamics.
      </p>
      <p>In Section 2, we review related work. Subsequently, we introduce basic
formalization in Section 3. We compute and determine temporal patterns of entity
dynamics and analyze the periodicities of them in Section 4. Section 5 describes
the data. We provide the observed temporal patterns of entity dynamics and
analyze the e ects from entity features in Section 6, before concluding this work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In this section, we review works tackling with LOD dynamics. Before describing
the related work, we distinguish \change" and \dynamics" along with Dividino
et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Changes of LOD sources are analyzed with regarding to their sets of
triples, sets of entities, or schema signatures. For example, given two snapshots
of a dataset captured at two di erent points in time, the change analysis at the
triple level includes which triples from the previous snapshot have been preserved
in the later snapshot, which triples have been deleted, or which ones have been
added. On the other hand, the dynamics of a dataset involves a notion of how
\ uid" a dataset is, i .e., how it behaves and evolves over a certain period of time.
Thus, the dynamics involves the analysis of its development over more than two
points in time.
      </p>
      <p>
        Kafer et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] provided the comprehensive analysis of the dynamics of LOD
based on monitoring 86; 696 LOD documents for 29 weeks. They found out that
5:0% of documents had gone o ine and 62:2% of documents had no change.
In addition, they conducted the analysis on triple level. The result indicated
that additions of triples are much more frequent than deletions. Furthermore,
they observed that while object literals are the most dynamic element of triples,
predicates (i .e., properties) and RDF types de ned by the predicate rdf:type
are static. They looked into the most dynamic predicates and identi ed that they
were often about trivial time stamps. Dividino et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] attempted to grasp the
temporal dynamics of LOD sources. Beyond recent researches which focused on
changes by comparing two snapshots of a source [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Dividino et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] analyzed
dynamics of LOD sources. They proposed a monotone and positive function to
represent the dynamics as a value. They conducted detailed analyses with respect
to thirteen LOD sources and provided the statistics. Furthermore, Dividino et
al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] evaluated strategies to keep LOD stored in a local repository up to date
e ciently with limited bandwidth. Strategies de ne when to fetch data of which
data source. The experiment revealed that strategies based on dynamics of data
sources [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] performed best, compared to strategies based on the data source's
age, PageRank, or size.
      </p>
      <p>
        Below we describe works that conducted the analysis of entity dynamics.
Umbrich et al. [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] formed a labeled directed graph based on LOD, where a node
is an entity and an entity is represented by a subject URI. They analyzed entity
dynamics using this graph. They applied k-means clustering to group entities
with similar dynamics. After manual inspection, they observed that entities from
same domains were often found in same clusters. However, they considered only
whether there was a change or not and did not take into account the amount of
changes of entities. Popitsch et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] provided statistics about entity changes
between two DBpedia snapshots with respect to four OWL classes (i .e., person,
organization, place, work). In terms of OWL classes, entities belonging to the
class person were active, because a lot of entities were removed and created. The
number of entities belonging to the class location increased the most. The focus
of the work by Popitsch [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] was not to analyze temporal dynamics of entities
but to develop an e ective entity change detection framework to avoid broken
links. Thus, they have not conducted a more ne-grained analysis. Holzmann et
al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] looked into entity evolution, but focused on changes of entity names in
Wikipedia.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Basic Formalization</title>
      <p>In this section, we brie y introduce de nitions and notations. Table 1 summarizes
the symbol notations and Table 2 shows a small example of LOD snapshots.</p>
      <p>Data fetched from the LOD cloud is represented in the form of N-Triples3.
A triple x of a dataset X is represented as x = (s; p; o) where s, p, and o
correspond to a subject, predicate, and object. Please note that we focus on
analyzing triples and do not consider the contexts of the triples in this work,
i. e., the sources where these triples where obtained. Furthermore, we de ne the
sets of all possible URIs U , blank nodes B, and literals L. In a triple x, the
subject s 2 U [ B is a URI or a blank node, the predicate p 2 U a URI, and the
object o 2 U [ B [ L a URI, a blank node or a literal. Functions sub, pred, and
obj return the subject, predicate, and object of a given triple x, respectively. We
assume that the data from the di erent LOD sources is captured at some point
in time t. We de ne Xt as the set of triples (i .e., snapshot) captured at the point
in time t and X = fXt1 ; Xt2 ; : : : ; Xtn g as the collection of the snapshots at the
di erent points in time. Using the example in Table 2, Xt1 contains three triples
and each of Xt2 and Xt3 has ve triples.</p>
      <p>In order to investigate the temporal dynamics of the entities in the LOD
cloud, we represent an entity as set of triples returned by a function which
takes an entity key e 2 U [ B and a point in time t as arguments. Below, we
introduce the de nitions of entities and the entity function. We distinguish two
di erent representations of entities: EeO;t denotes entities that are characterized
by RDF types and outgoing properties and EeI;Ot additionally considers incoming
properties.</p>
      <p>De nition 1. An entity EeO;t (Out) with RDF types and outgoing properties:
For a given entity key e and a point in time t, EeO;t = O(e; t) = fx j x 2
Xt ^ sub(x) = eg, where triples containing e as subject are grouped together as
an entity.</p>
      <p>De nition 2. An entity EeI;Ot (InOut) with RDF types, outgoing properties
and incoming properties: For a given entity key e and a point in time t, EeI;Ot =
O(e; t) = EeO;t [ fx j x 2 Xt ^ obj(x) = eg, where triples containing e as subject
or object are grouped together as an entity.</p>
      <p>We derive entities Ee;t for all unique URI e 2 U [ B with respect to each
snapshot Xt. We use z to describe the entity representation. In this work, z = O or
z = IO. In Table 2, there are three entities db:Anne Smith, db:John Brown, and
db:Green Village in Xt3 when z = O. On the other hand, there are four entities
db:Anne Smith, db:John Brown, db:Green Village, and db:Green University
in Xt3 when z = IO.</p>
      <sec id="sec-3-1">
        <title>3 http://www.w3.org/TR/n-triples/, last access on 11/05/2015</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Computing Temporal Patterns of Entity Dynamics</title>
      <p>In this section, we describe how we determined the temporal patterns of entity
dynamics. First, Section 4.1 describes how to weigh di erent triples in an entity.
In Section 4.2, we explain how to measure a degree of changes of an entity
between two successive snapshots and how to represent entity dynamics as time
series. Section 4.3 provides the employed clustering method and optimization
metric, in order to nd out the most representative temporal patterns of entity
dynamics. Subsequently, we investigate periodicities of the observed temporal
patterns in Section 4.4.
4.1</p>
      <sec id="sec-4-1">
        <title>Triple Weighting</title>
        <p>
          We assume that triples of an entity have di erent importance [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. For
example, a triple (Barack Obama, dbp:vicepresident, Joe Biden) is considered
more important than a triple (Barack Obama, rdf:type, foaf:Person) for an
entity Barack Obama. Because there are a lot of entities whose rdf:type is
foaf:Person, but a few entities which have a property dbp:vicepresident.
Thus, we should give a larger weight on more important triples. For instance, it
is not helpful to update a data cache due to a change of a trivial triple. Because
in practical applications, only important facts of the entities encoded in a
knowledge graph will be shown. Therefore, trivial changes like updating time-stamps
have no e ect on most applications. In fact, Kafer et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] observed that most
changes were updates of time-stamps, which are modeled as literals of a triple.
Therefore, we give a weight w(x) to each triple and analyze the entity dynamics
considering these weights. Below, we introduce methods to weigh triples, starting
from the baseline.
1. Baseline. We give the same weight to all triples.
        </p>
        <p>wbaseline (x) = 1</p>
        <p>
          Schuhmacher et al. [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] developed several methods to weigh semantic
relations (i .e., predicates) between entities for document modeling. Based on their
work, we introduce a method to weigh triples. At the core of the weighting
method lies the information-theoretic notion of information content (IC) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
Then, the IC of a speci c variable v = pred(x) can be computed as shown in
Equation 2, where P (v) is the probability that the random variable V shows the
speci c outcome v.
        </p>
        <p>IC(v) =</p>
        <p>log(P (v))</p>
        <p>
          Based on IC, Combined Information Content (combIC) is proposed to weigh
triples. Although the authors also introduced another method, Joint Information
Content (jointIC), we only provide the results provided by combIC. The results
of jointIC were very similar to those of combIC. We choose combIC, because it
requires less computation and demonstrates better results [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ].
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>2. Combined Information Content (combIC). Weights are computed as</title>
        <p>the sum of IC of a predicate and a object.
(1)
(2)
(3)
wcombIC (x) = IC pred(x) + IC obj(x)
Please note that the information theoretic weights are computed with respect to
each snapshot, i .e., each point in time. Therefore, probabilities such as P (pred(x))
must not be 0 theoretically. Based on this, we can nally de ne an entity as a
vector where each element is a weight of a triple.</p>
        <p>De nition 3. Weighted entity vector E~e;t
An entity for a given entity key e and a point in time t is represented as a vector
described in Equation 4.</p>
        <p>E~e;t =
1e(xt;1) w (xt;1) ; 1e(xt;2) w (xt;2) ; : : : ; 1e(xt;1) w xt;jXtj ;
(4)
where xt;i denotes the i-th triple in all unique triples from Xt, and 1e(x) is an
indicator function which returns 1 if Ee;t contains a triple x and 0 if not.
Using the example shown in Table 2, E~db:Annee Smith;t1 = (1; 1; 0), when using
the baseline weighting method. Because Edb:Annee Smith;t1 contains the rst and
second triple of Xt1 and the indicator function 1db:Annee Smith returns 1 for the
rst and second triples and 0 for the last one.
We rst measure a degree of changes for a given entity key e and an entity
representation z (i .e., Out or InOut shown in Section 3) between two successive
snapshots, using a function . We introduce two variations of using cosine
distance in Equation 5, and Euclidean distance in Equation 6.
(5)
(6)
cosd(e; z; t1; t2) = 1</p>
        <p>E~ez;t1</p>
        <p>E~ez;t2
jjE~ez;t1 jj jjE~ez;t2 jj
s</p>
        <p>X
x2Eez;t1 [Eez;t2
w(xe;t1 )
w(xe;t2 )
2
euclidean(e; z; t1; t2) =
Equation 5 is based on cosine similarity that is widely used. We modify it to
compute distance, since we focus on how much an entity is changed. If a triple
is contained in Eez;t1 and not in Eez;t2 , the weight of that triple in Eez;t2 is 0 and
vice versa. In Equation 6, w(xe;t) denotes a weight of a triple x for an entity
key e at a point in time t. Like Equation 5, if a triple is only contained in Eez;t1 ,
the weight of that triple at the point in time t2 is 0 and vice versa. Finally, we
represent the temporal dynamics of an entity as a series of entity changes as
de ned below.</p>
        <p>De nition 4. Temporal dynamics of an entity (e; z)
The temporal dynamics of an entity are represented as a time series, where each
element denotes the amount of changes between two successive snapshots.
(e; z) = ( (e; z; t1; t2); (e; z; t2; t3); : : : ; (e; z; tn 1; tn))
(7)
We construct such an entity dynamics vector for all entities e. In order to nd
patterns of entity dynamics, we apply a clustering algorithm as described below.
4.3</p>
      </sec>
      <sec id="sec-4-3">
        <title>Time-series Clustering</title>
        <p>
          For clustering, we leave out entity dynamics vectors where all elements are equal
0 (i .e., time series with no changes over the entire observed period). As clustering
method, we employ k-means++ [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] and use Euclidean distance as a distance
measure, as our previous work [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Compared to the traditional k-means,
kmeans++ introduces an improved initial seeding [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. In terms of the distance
measure, an extensive evaluation conducted by Wang et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] demonstrates
that Euclidean distance is the most e cient measure for time series with a
reasonably high accuracy.
        </p>
        <p>
          We nd the optimal number of clusters k between 2 and 10 using Average
Silhouette [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. Due to the computation cost, we randomly pick 0:1% of elements
from each cluster and compute Average Silhouette over them. A higher value
indicates a better clustering. We consider centroids of generated clusters as
representative temporal patterns of entity dynamics.
4.4
        </p>
      </sec>
      <sec id="sec-4-4">
        <title>Periodicity Detection</title>
        <p>
          As results of clustering described in Section 4.3, we obtain time series that show
representative temporal patterns of entity dynamics. We further analyze these
time series by periodicity detection. Periodicity detection is the task to discover
the pattern at which a time series is periodic. For instance, temporal sequences
(1; 3; 2; 1; 3; 2) and (1; 2; 1; 2; 1; 2) have the periodicity of 3 and 2, respectively.
Computing periodicity over the observed entity dynamics ensures
generalizability of the observed temporal patterns. We employ a convolution-based algorithm
proposed by Elfeky et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. The algorithm outputs periodicity candidates with
con dence scores.
5
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Dataset</title>
      <p>
        We use the Dynamic Linked Data Observatory (DyLDO) dataset4. The DyLDO
dataset has been created to monitor a xed set of LOD documents on a weekly
basis. The dataset is composed of 165 weekly snapshots over a period of three
years from May 2012 to July 2015. The DyLDO dataset contains various well
known and large LOD sources (e. g., dbpedia.com, bbc.co.uk) as well as less
commonly known ones (e. g., pokemon.com). For more detailed information about
the dataset, we refer to [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>The dataset contains 27; 788; 902 unique entities (z = Out) and 2; 909; 700
unique entities (z = InOut) over the 165 snapshots. Figure 1 shows the
distributions of entity frequencies for Out and InOut, respectively. According to
Figure 1, almost 75% of the entities appear only in one snapshot in both entity
representations Out and InOut. To conduct the time series clustering properly,
we focus on 2; 521; 617 entities in Out and 2; 950; 533 entities in InOut that
appear at more than 70% of snapshots.
6</p>
    </sec>
    <sec id="sec-6">
      <title>Results and Analyses</title>
      <p>We rst report the observed temporal patterns of entity dynamics. Subsequently,
we investigate entity features that de ne temporal patterns of entity dynamics.
6.1</p>
      <sec id="sec-6-1">
        <title>Temporal Patterns of Entity Dynamics</title>
        <p>
          Table 3 shows the resulting clusters for each condition. Since we use two di erent
entity representations, two di erent distance measures, and two di erent triple
weighting methods, we have in total eight conditions to compute time series
and clustering. In terms of the optimal number of clusters k, generally Average
Silhouette suggests that a lower value of k gives better clustering, as observed
by Yang et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. Since we rst remove all static entities (i .e., entities with
no change) before clustering, the optimal number of clusters shown in Table 3 is
        </p>
        <sec id="sec-6-1-1">
          <title>4 http://swse.deri.org/dyldo/, last access on 12/11/2015</title>
          <p>one larger than the optimized k. We introduce a cluster cs to accommodate the
static entities. In each condition, most entities belong to one cluster and other
clusters have relatively small number of entities.</p>
          <p>Figure 2 provides the representative temporal patterns of entities (i .e.,
centroids of observed clusters) for each condition. We observe that the number
of clusters reduces signi cantly in conditions using information theoretic triple
weighting methods. Especially, when the number of clusters is two, one
cluster represents a large amount of changes consistently over time and the other
shows consistent small amount of changes. Therefore, information theoretic triple
weighting methods distinguish entities which have important changes from
entities whose changes are less important.</p>
          <p>Table 4 shows periodicities of each temporal pattern shown in Figure 2. All
temporal patterns have a periodicity, thus the amount of entity change has some
regularity.</p>
          <p>Out Euclidean combIC
InOut Cosine Baseline
InOut Cosine combIC
InOut Euclidean Baseline
InOut Euclidean combIC
(a) Out Cosine Baseline
(b) Out Cosine combIC
(c) Out Euclidean Baseline
(d) Out Euclidean combIC (e) InOut Cosine Baseline (f) InOut Cosine combIC
(g) InOut Eucl. Baseline
(h) InOut Eucl. combIC
Fig. 2: Temporal patterns of entity dynamics in each condition. The x-axis
denotes a point in time. The y-axis shows , the amount of entity changes, de ned
in Equations 5 and 6.
6.2</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>Features for Entity Dynamics</title>
        <p>
          After nding the temporal patterns of entity dynamics, we investigate which
features of entities more likely control the temporal patterns of entity dynamics.
Umbrich et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] manually analyzed entities from same domains and found
out that they are more likely to have similar temporal dynamics. However, the
observation is based on manual investigation and other possible features have
not been empirically evaluated. In this work, we compare the following four
features. Formally, we denote f (E) ! v as a feature function which returns a
feature value for a given entity. v 2 Vf is a feature value, where Vf denotes a set
of all possible feature values received by f .
{ Type (f1): Most entities have one or more types de ned by the predicate
http://www.w3.org/1999/02/22-rdf-syntax-ns#type (e .g., foaf:Person).
{ Property (f2): Predicates (e .g., dbpedia-owl:foundedBy) in triples
describe properties of an entity.
{ Extended Characteristic Set (ECS) (f3): The combination of types and
properties leads to the de nition of Extended Characteristic Set (ECS) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Thus, in f3, v 2 P(Vf1 [ Vf2 ). This feature function is stricter than f1 and
f2, because while entities can have several values of f1 and f2, they have
only one ECS.
{ Pay level domain (PLD)(f4): An entity has a pay level domain (PLD) in
its URI. For instance, if an entity is de ned by a URI http://dbpedia.org/
resource/Facebook, the PLD of the entity is http://dbpedia.org. PLD
is extracted using Guava5.
        </p>
        <p>Please note that feature functions f1 and f2 can return more than one feature
values, since entities may own several properties and types. In order to nd out
the features that most control the temporal patterns, we compare the clusterings
generated in Section 4.3 with clusterings obtained by using the features. We
denote C as a resulted clustering from Section 4.3 and c 2 C as a cluster.</p>
        <p>
          We use Rand Index R [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] as an evaluation metric, which is de ned as
R = T P +TTNP ++TF NP+F N . It is usually used to evaluate the accuracy of clustering.
However, we use it to evaluate the degree of the agreement between clustering
and classi cations made by one of the four features. T P indicates the number
of entity pairs that belong to the same cluster in C as well as that receive a
same feature value v by a feature function f . For example, if the entities e1 and
e2 belong to a same cluster and have a common feature f4 value dbpedia.org,
we count the entity pair as T P . T N is the number of entity pairs that belong
to di erent clusters in C and that receive di erent feature values by a feature
function f . The denominator denotes the number of entity pairs. Due to the
computation cost, we randomly pick 10; 000 entity pairs from C and compute
Rand Index.
5 https://github.com/google/guava/wiki/Release19, last access on 12/17/2015
Periodicity of temporal patterns. In Table 4, we see that a lot of temporal
patterns have periodicity of 55 and 56 weeks. It indicates that entities change
on a year cycle. Thus, the amount of entity changes at a certain point in
time can be predicted by looking the amount of entity changes at a year ago.
        </p>
      </sec>
      <sec id="sec-6-3">
        <title>Di erence between entity representations. We observe a small di erence</title>
        <p>between two entity representations. A possible reason is a small di erence
between the numbers of triples analyzed at each entity representation. When
z = InOut, we analyzed 16:78% triples more compared to when z = Out.</p>
        <p>Therefore, the e ects from incoming properties is small over all.
Di erence between triple weighting methods. When using Euclidean
distance and combIC, we observe a large cluster where the amount of changes is
consistently small (see Table 3 and Figures 2(d)(h)). It indicates that most
entities are made small changes consistently. On the other hand, a small
portion of entities belong to clusters where the amount of changes is large during
the entire observed period. Therefore, the information-theoretic method
distinguishes entities which have consistently important changes from entities
whose changes are always trivial. When using the cosine distance, the
difference between triple weighting methods is small. A possible reason is the
normalization by the denominator in Equation 5.</p>
      </sec>
      <sec id="sec-6-4">
        <title>Di erence between distance measures. We used the cosine distance and</title>
        <p>
          Euclidean distance. While the cosine distance is a value between 0 and 1
(due to normalization by the denominator in Equation 5), Euclidean distance
takes into account the size of entities and the number of changed triples in
entities. Thus, Euclidean is a value 0. This di erence in uences the number
of clusters. The number of clusters at conditions with Euclidean distance is
always larger than when using the cosine distance. For instance, in Figure
2(c), we see a cluster where the amount of changes is always over 10 (green
line). On the other hand, in the cosine distance, a value close to 1 indicates
that most triples in entities are changed regardless of the number of triples.
Features for entity dynamics. When the baseline is employed as triple
weighting method, the feature PLD (f4) is most likely to determine temporal
patterns of entity dynamics. This is in line with the observation by Umbrich
et al.[
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. When using the cosine distance and information-theoretic triple
weighting methods, the feature ECS (f3) performs best. Since the cosine
distance is proportional to the percentage of changed triples in the entities, the
entity structure is important to determine temporal patterns. Thus, when
using the cosine distance, it is not trivial which speci c properties and types
are appeared in the entities together. For instance, there is an entity with
a property whose object value changes always. For this entity, the cosine
distance would be small, if many of the other properties are static. In
contrast, the cosine distance would be large, if the entity has only a few other
properties that are static.
        </p>
      </sec>
      <sec id="sec-6-5">
        <title>Potential limitation of the analysis. The analysis shown in this paper can</title>
        <p>
          be biased by setting the threshold to 70% appearance of entities in all of the
weekly snapshots. However, we believe this bias is little, since we analyzed
71:92% of all triples that appeared in each snapshot. We think that this
has grasped the most prominent and dominant temporal patterns of entity
dynamics. In addition, the analysis can be biased by the dataset we used.
But the dataset covers both most authoritative and randomly-selected LOD
documents [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ].
        </p>
      </sec>
      <sec id="sec-6-6">
        <title>Co-evolution of the amount of entity changes and feature values. Fea</title>
        <p>
          ture values of f1, f2, and f3 can change over time. Thus, Rand Index may
di er in di erent snapshots. Table 5 reports the mean average and standard
deviations for the index. As one can see, the standard deviations in Table 5
are overall very small. It indicates that feature values of entities are not
frequently changed, as observed by Kafer et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Gottron et al. [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
Furthermore, the standard deviations of the feature Type (f1) are smaller
than other features (f2 and f3). Therefore, the feature Type f1 is the most
static entity feature among them.
        </p>
        <p>
          Applications of this analysis. The analysis provided by this paper is useful
e .g., to predict the temporal patterns and the amount of changes of newborn
entities. For instance, if we get new entities, we can predict the temporal
pattern of this entity dynamics by looking into feature values of entities. In
terms of the information-theoretic triple weighting method, we think that
they are useful to develop cache strategies for triple stores. According to [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ],
a predicate value is xed in over 85% of SPARQL queries asking about
a certain entity. We assume that users issue queries with more important
predicates for searching entity information. However, this is subject to future
research.
8
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion</title>
      <p>In this paper, we analyze the temporal patterns of entity dynamics. This analysis
had not been done so far. As dataset, we use weekly LOD snapshots over a period
of almost three years provided by the Dynamic Linked Data Observatory. The
result demonstrates that entities that share a common PLD are more likely to
change together over time, when using the baseline as triple weighting method.
However, when using Euclidean distance and information-theoretic triple
weighting methods, entities that have a same type are more likely to change together.
Under this condition, clustering produces a large cluster where the amount of
changes is consistently small and a cluster where the amount of changes is overall
small. Therefore, the entity feature Type can distinguish entities that are given
important changes consistently from entities that have less important changes.
In terms of periodicity of observed temporal patterns of entity dynamics, most
entities change on a year-cycle with small amount of changes. The result of the
analysis shown in this paper can be applied to the wide range of the applications,
including determining crawling strategies for LOD, caching SPARQL queries, to
programming against LOD, and recommending vocabularies for reusing LOD
vocabularies.</p>
      <p>Acknowledgement This work was supported by the EU's Horizon 2020
programme under grant agreement H2020-693092 MOVING.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>D.</given-names>
            <surname>Arthur</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Vassilvitskii.</surname>
          </string-name>
          k-means+
          <article-title>+: The advantages of careful seeding</article-title>
          .
          <source>In SODA</source>
          , pages
          <volume>1027</volume>
          {
          <fpage>1035</fpage>
          .
          <string-name>
            <surname>SIAM</surname>
          </string-name>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>L.</given-names>
            <surname>Ding</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Finin</surname>
          </string-name>
          .
          <article-title>Characterizing the semantic web on the web</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>242</volume>
          {
          <fpage>257</fpage>
          . Springer,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>R.</given-names>
            <surname>Dividino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>Strategies for e ciently keeping local linked open data caches up-to-date</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>356</volume>
          {
          <fpage>373</fpage>
          . Springer,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Dividino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          , and
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Groner. From changes to dynamics: dynamics analysis of linked open data sources</article-title>
          .
          <source>In PROFILES</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>R.</given-names>
            <surname>Dividino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Groner, and</article-title>
          <string-name>
            <given-names>T.</given-names>
            <surname>Gottron</surname>
          </string-name>
          .
          <article-title>Change-a-LOD: does the schema on the linked data cloud change or not</article-title>
          ? In COLD,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Elfeky</surname>
          </string-name>
          , W. G.
          <article-title>Aref, and</article-title>
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Elmagarmid</surname>
          </string-name>
          .
          <article-title>Periodicity detection in time series databases</article-title>
          .
          <source>IEEE TKDE</source>
          ,
          <volume>17</volume>
          (
          <issue>7</issue>
          ):
          <volume>875</volume>
          {
          <fpage>887</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Gallego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Mart</surname>
          </string-name>
          nez-Prieto, and P.
          <string-name>
            <surname>de la Fuente</surname>
          </string-name>
          .
          <article-title>An empirical study of real-world SPARQL queries</article-title>
          .
          <source>In USEWOD</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Gottron</surname>
          </string-name>
          and
          <string-name>
            <given-names>Christian</given-names>
            <surname>Gottron</surname>
          </string-name>
          .
          <article-title>Perplexity of index models over evolving linked data</article-title>
          .
          <source>In The Semantic Web: Trends and Challenges - 11th International Conference, ESWC</source>
          <year>2014</year>
          , Anissaras, Crete, Greece, May
          <volume>25</volume>
          -29,
          <year>2014</year>
          . Proceedings, pages
          <volume>161</volume>
          {
          <fpage>175</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>H.</given-names>
            <surname>Holzmann</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Risse</surname>
          </string-name>
          .
          <article-title>Named entity evolution analysis on Wikipedia</article-title>
          . In WebSci, pages
          <volume>241</volume>
          {
          <fpage>242</fpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. T. Kafer,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelrahman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <article-title>O'Byrne, and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          .
          <article-title>Observing linked data dynamics</article-title>
          .
          <source>In ESWC</source>
          , pages
          <volume>213</volume>
          {
          <fpage>227</fpage>
          . Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>T. K</surname>
          </string-name>
          <article-title>afer</article-title>
          , J.
          <string-name>
            <surname>Umbrich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hogan</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          .
          <article-title>Towards a dynamic linked data observatory</article-title>
          .
          <source>In LDOW. CEUR</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>M. Konrath</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Gottron</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Staab</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>SchemEX-E cient construction of a data catalogue by stream-based indexing of linked data</article-title>
          .
          <source>Web Semantics</source>
          ,
          <volume>16</volume>
          :
          <fpage>52</fpage>
          {
          <fpage>58</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>M. Martin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Unbehauen</surname>
            , and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>Improving the performance of semantic web applications with SPARQL query caching</article-title>
          .
          <source>In ESWC</source>
          . Springer,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>T.</given-names>
            <surname>Neumann</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Moerkotte</surname>
          </string-name>
          .
          <article-title>Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins</article-title>
          .
          <source>In ICDE</source>
          , pages
          <volume>984</volume>
          {
          <fpage>994</fpage>
          . IEEE,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>C.</given-names>
            <surname>Nishioka</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Scherp</surname>
          </string-name>
          .
          <article-title>Temporal patterns and periodicity of entity dynamics in the Linked Open Data cloud</article-title>
          .
          <source>In K-CAP, page No. 22. ACM</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>N.</given-names>
            <surname>Popitsch</surname>
          </string-name>
          and
          <string-name>
            <surname>B. Haslhofer.</surname>
          </string-name>
          <article-title>DSNotify{a solution for event detection and link maintenance in dynamic datasets</article-title>
          .
          <source>Web Semantics</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <volume>266</volume>
          {
          <fpage>283</fpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>W. M. Rand</surname>
          </string-name>
          .
          <article-title>Objective criteria for the evaluation of clustering methods</article-title>
          .
          <source>J. of the American Statistical Association</source>
          ,
          <volume>66</volume>
          (
          <issue>336</issue>
          ):
          <fpage>846850</fpage>
          ,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Rousseeuw</surname>
          </string-name>
          .
          <article-title>Silhouettes: a graphical aid to the interpretation and validation of cluster analysis</article-title>
          .
          <source>Journal of computational and applied mathematics</source>
          ,
          <volume>20</volume>
          ,
          <year>1987</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>M. Schmachtenberg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Bizer</surname>
            , and
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Paulheim</surname>
          </string-name>
          .
          <article-title>Adoption of the linked data best practices in di erent topical domains</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>245</volume>
          {
          <fpage>260</fpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>M.</given-names>
            <surname>Schuhmacher</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          .
          <article-title>Knowledge-based graph document modeling</article-title>
          .
          <source>In WSDM</source>
          , pages
          <volume>543</volume>
          {
          <fpage>552</fpage>
          . ACM,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Shannon</surname>
          </string-name>
          .
          <article-title>A mathematical theory of communication</article-title>
          .
          <source>Bell System Technical Journal</source>
          ,
          <volume>27</volume>
          :
          <fpage>379</fpage>
          {
          <fpage>423</fpage>
          ,
          <issue>623</issue>
          {
          <fpage>656</fpage>
          ,
          <year>1948</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>J. Umbrich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Karnstedt</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hogan</surname>
            , and
            <given-names>J. X.</given-names>
          </string-name>
          <string-name>
            <surname>Parreira</surname>
          </string-name>
          .
          <article-title>Hybrid SPARQL queries: fresh vs. fast results</article-title>
          .
          <source>In ISWC</source>
          , pages
          <volume>608</volume>
          {
          <fpage>624</fpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>J. Umbrich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Karnstedt</surname>
            , and
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Land</surname>
          </string-name>
          .
          <article-title>Towards understanding the changing web: Mining the dynamics of linked-data sources and entities</article-title>
          .
          <source>In KDML</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mueen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          , G. Trajcevski,
          <string-name>
            <given-names>P.</given-names>
            <surname>Scheuermann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>E.</given-names>
            <surname>Keogh</surname>
          </string-name>
          .
          <article-title>Experimental comparison of representation methods and distance measures for time series data</article-title>
          .
          <source>Data Mining and Knowledge Discovery</source>
          ,
          <volume>26</volume>
          (
          <issue>2</issue>
          ):
          <volume>275</volume>
          {
          <fpage>309</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Leskovec</surname>
          </string-name>
          .
          <article-title>Patterns of temporal variation in online media</article-title>
          .
          <source>In WSDM</source>
          , pages
          <volume>177</volume>
          {
          <fpage>186</fpage>
          . ACM,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <given-names>X.</given-names>
            <surname>Yin</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Shah</surname>
          </string-name>
          .
          <article-title>Building taxonomy of web search intents for name entity queries</article-title>
          .
          <source>In WWW</source>
          , pages
          <volume>1001</volume>
          {
          <fpage>1010</fpage>
          . ACM,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>