<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Capturing the Currency of DBpedia Descriptions and Getting Insight into Their Validity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anisa Rula</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luca Panziera</string-name>
          <email>luca.panziera@open.ac.uk</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Palmonari</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Maurino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Milano-Bicocca</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>An increasing amount of data is published and consumed on the Web according to the Linked Open Data (LOD) paradigm. In this scenario, capturing the age of data can provide insight into their validity, under the hypothesis that the more up-to-date data are, the more likely they are to be true. In this paper we present a model and a framework for assessing the currency of the data represented in one of the most important LOD datasets, DBpedia. Existing currency metrics are based on the notion of date of last modification, but such information is often not explicitly provided by data producers. The proposed framework extrapolates such temporal metadata from time-labeled revisions of Wikipedia pages (from which the data have been extracted). Experimental results demonstrate the usefulness of the framework and the effectiveness of the currency evaluation model in providing a reliable indicator of the validity of facts represented in DBpedia.</p>
      </abstract>
      <kwd-group>
        <kwd>Data Quality</kwd>
        <kwd>DBpedia</kwd>
        <kwd>Information Extraction</kwd>
        <kwd>Currency</kwd>
        <kwd>Validity</kwd>
        <kwd>Linked Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Linked Open Data (LOD) can be seen as a Web-scale knowledge base consisting
of named resources described and interlinked by means of RDF statements [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
The extraction of structured information from semi-structured content available
in Wikipedia has enabled several knowledge bases to be published within the DBpedia
project (http://dbpedia.org/). Information in LOD datasets changes over time to reflect changes in
the real world; new statements are added and old statements are deleted [
        <xref ref-type="bibr" rid="ref16 ref5">16, 5</xref>
        ],
with the consequence that entity documents consumed by users can soon become
outdated. Out-of-date documents can reflect inaccurate information. Moreover,
more up-to-date information should be preferred over less up-to-date information
in data integration and fusion applications [
        <xref ref-type="bibr" rid="ref10 ref12">10, 12</xref>
        ].
      </p>
      <p>
        The problems of dealing with change, providing up-to-date information,
and making users aware of how up-to-date information is, have been
acknowledged in the Semantic Web. A resource versioning mechanism for linked
data has been proposed [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which allows data providers to publish time series of
descriptions changing over time; however, this mechanism has been adopted only
by a limited number of datasets, and not by DBpedia. DBpedia Live [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] provides
a model to continuously update RDF descriptions and to publish data that are as
up-to-date as they are in Wikipedia, the original source. This approach provides
a step forward in the delivery of up-to-date and reliable information. However,
even this approach is based on batch update strategies and, although data is
nearly synchronized with Wikipedia, there is still some delay in the propagation
of changes that occur in Wikipedia. As an example, the publication of the new
album "Get Up!" by the singer Ben Harper on January 29th, 2013 is
represented in Wikipedia, but not in DBpedia Live as of February 15th, 2013.
Moreover, DBpedia Live is not as complete as English DBpedia, the DBpedia
Live approach has not been applied to localized datasets, and several applications
still need to use local DBpedia dumps to implement scalable systems.
      </p>
      <p>
        Data currency (currency, for short) is a quality dimension proposed in the data
quality domain to describe and measure the age of (relational) data and the speed
at which a system is capable of recording changes occurring in the real world [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
The computation of these currency measures needs versioning metadata [
        <xref ref-type="bibr" rid="ref13 ref16">16, 13</xref>
        ],
which represent the date when RDF statements or documents are created or
modified. Unfortunately, the use of versioning metadata in LOD datasets, and in
particular in DBpedia, is not a frequent practice, as shown in [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. To overcome
these problems, in this paper we address the issue of supporting the
consumer in evaluating the freshness of explored data by defining:
– a model to measure currency in LOD that encompasses two currency measures,
namely basic currency, based on the age of data, and system currency, based
on the delay with which data is extracted from a web source;
– a method for estimating the last modification date of entity documents in
DBpedia, starting from the corresponding Wikipedia pages;
– an evaluation method for estimating the quality of the proposed currency
measures, based on the validity of facts.</p>
      <p>The paper is organized as follows: Section 2 discusses related work on the
assessment of time-related data quality dimensions; Section 3 introduces the
definitions adopted in the paper and the metrics for measuring and evaluating currency;
our framework for assessing the currency of DBpedia documents is presented in
Section 4; Section 5 shows the results of our experiments on DBpedia; and, in
Section 6, we draw conclusions and outline future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Data currency is a quality dimension well known in the literature of information
quality [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Assessing currency in traditional Database Management Systems
(DBMSs) is straightforward, due to the fact that DBMSs track data updates in
log files. In the LOD domain, the definition of currency needs to be adapted
to a new data model, where resource identifiers are independent of the
statements in which they occur. Further, statements can be added or removed,
and these actions are not represented in the data model.
      </p>
      <p>
        SIEVE is a framework for evaluating quality of LOD, proposed in the context
of, and applied to, a data fusion scenario [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The framework's results show that
data currency is a crucial driver when data coming from heterogeneous sources
have to be integrated and fused. The authors do not investigate the specific
problem of measuring data currency on arbitrary datasets. Our contribution can
be seen as a point of support for approaches such as SIEVE, since currency
evaluation needs temporal meta-information to be available.
      </p>
      <p>
        A recent technique able to assess the currency of facts and documents is
presented in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Given a SPARQL query q over an LOD dataset, it estimates
the currency of the returned result set. Furthermore, to overcome the problem
of the semantic heterogeneities related to the use of different vocabularies in
LOD datasets, the authors propose an ontology which integrates in one
place all the temporal meta-information properties. According to this approach,
currency is measured as the difference between the time of the last modification and
the current time, whereas we provide a further measure that assesses the time
between a change in the real world and a change in the knowledge base. Both
approaches, SIEVE and the one in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] rely on timestamps such as lastUpdate or
lastModificationDate, which are often unreliable or not provided by many
documents. Our proposed method extends the approaches above: it deals
with the incomplete and scattered temporal metadata available in LOD today.
In particular, we estimate the currency of documents for which such metadata
are not available.
      </p>
      <p>
        Other time-related metadata define the temporal validity of RDF statements
according to the Temporal RDF model [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]; the model has been extended to
improve the efficiency of temporal query answering [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] and to link temporal
instants and intervals so that they can be associated with several entities [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
However, only a few datasets, like Timely YAGO [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], associate RDF statements
with a temporal validity defined by a temporal interval. To this end, they
leverage explicit temporal metadata available from Wikipedia's infoboxes,
which have not been considered in DBpedia. Only a subset of DBpedia statements
can be associated with temporal validity following this approach, though. The
goal of our approach is different, because we aim at extracting the last modification
date of a full document. To achieve this goal, and differently from Timely
YAGO, we look into the version history of Wikipedia pages. Similarly to our approach,
the work in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] also detects and extracts, from the revision history of Wikipedia,
the last modification date of each attribute in the infobox. In contrast to our
approach, it gathers the revision history of Wikipedia from data dumps.
To ensure that it gathers updated information, the system in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aggregates
evidence from a large number of crawled Wikipedia documents. This approach
is complementary to ours and can easily be combined with it.
The work in [?] is a recently proposed method that parses the Wikipedia revision
history dumps and, for each DBpedia fact, returns the last modification date.
The approach combines a standard Java SAX parser with the
DBpedia Extraction Framework. Yet, it does not use the most recent version,
since it accesses the dumps of Wikipedia documents rather than the newer
revisions on the Web.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Currency Model</title>
      <sec id="sec-3-1">
        <title>Preliminaries</title>
        <p>
          Linked Open Data describes resources identified by HTTP URIs, representing
their properties and links to other resources using the RDF language. Given an
infinite set $U$ of URIs, an infinite set $B$ of blank nodes, and an infinite set $L$ of
literals, a statement $\langle s, p, o \rangle \in (U \cup B) \times U \times (U \cup B \cup L)$ is called an RDF
triple. As the use of blank nodes is discouraged for LOD [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we will assume that
the subject and the property are URIs, while the object can be either a URI or a
literal.
        </p>
        <p>
          Most resources described in the LOD Cloud represent real-world objects
(e.g., soccer players, places or teams); we use the term entities as a short form for
named individuals as defined in the OWL 2 specification. According to the LOD
principles [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], we assume that each entity $e$ can be dereferenced. The result of
the dereferencing is an RDF document, denoted $d_e$, which represents a description
of the entity $e$. We say that $d_e$ describes $e$ and that $d_e$ is an entity document [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. As
an example, an entity document $d_e$ returned by DBpedia in N-Triples format
contains all the RDF triples where $e$ occurs as a subject. In this work, RDF
triples occurring in entity documents are called facts. A fact represents a relation
between the subject and the object of the triple and, intuitively, it is considered
true when the relation is acknowledged to hold. In this paper, as we are interested
in analyzing different versions of entity documents that are published at different
time points, we regard time as a discrete, linearly ordered domain, as proposed
in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
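        <p>To make dereferencing concrete, the following minimal Python sketch fetches an
entity document and keeps only the triples where the entity occurs as the subject,
as in our definition of an entity document. The Accept header value and the
subject filter are illustrative assumptions, not a prescribed part of our framework.</p>
        <preformat>
import urllib.request

def dereference(entity_uri):
    # Ask the server for an N-Triples serialization of the entity document
    # (assumes the endpoint supports content negotiation for this media type).
    req = urllib.request.Request(
        entity_uri, headers={"Accept": "application/n-triples"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

entity = "http://dbpedia.org/resource/Ben_Harper"
document = dereference(entity)
# Keep only the facts whose subject is the entity itself.
subject = "&lt;" + entity + "&gt;"
facts = [line for line in document.splitlines() if line.startswith(subject)]
</preformat>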
        <p>Let $E = \{e_1, e_2, \ldots, e_n\}$ be a set of entities in a dataset $G$ and $D =
\{d_{e_1}, d_{e_2}, \ldots, d_{e_n}\}$ be a set of entity documents. Let $S$ be a non-structured
or semi-structured web source that consists of a set of web pages
$S = \{p_1, p_2, \ldots, p_n\}$. We assume that each entity $e_i$ represented by an entity
document $d_{e_i}$ has a corresponding web page $p_i \in S$; in the following we use
the notation $p_e$ to refer to the web page that describes an entity $e$. Entity
descriptions and source web pages change over time; different versions of a web
page corresponding to an entity exist. Formally, a version of a web page $p$ can be
represented as a triple $v = \langle id, p, t \rangle$, where $id$ is the version identifier, $p$ is the
target page, and $t$ is a time point that represents the time when the version has
been published.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Currency Measures</title>
        <p>
          Starting from the definition of currency proposed for relational databases [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], we
define two quality dimensions, namely age-based currency and system currency.
Definition 1 (Age-Based Currency). The data currency of a value is defined
as the age of the value, where the age of a value is computed as the difference
between the current time (the observation time) and the time when the value was
last modified.
        </p>
        <p>The above definition can be easily extended to RDF documents. Let $cTime$
and $lmTime(d_e)$ be, respectively, the current time (the assessment time) and the
last modification time of an entity document $d_e$. We first define the age measure
of an entity document, based on the above informal definition of data currency.
The age $age(d_e) : D \to [0, +\infty)$ of an entity document $d_e$, with $e \in E$, can be
measured by considering the time of the last modification of the entity document,
according to the following formula:</p>
        <p>
$$age(d_e) = cTime - lmTime(d_e) \quad (1)$$
We introduce a measure called age-based currency, which depends on the document
age and returns normalized values in the range $[0, 1]$ that are higher when the
document is more up-to-date and lower when the document is old. The age-based
currency $currency_{age}(d_e) : D \to [0, 1]$ of an entity document $d_e$ is defined as:
$$currency_{age}(d_e) = 1 - \frac{age(d_e)}{cTime - startTime} \quad (2)$$
where $startTime$ represents a time point from which currency is measured (as an
example, $startTime$ can identify the time when the first data were published
in the LOD). Observe that this age-based currency measure does not depend
on the particular nature of RDF documents. In fact, the same measure can be
adopted to evaluate the currency of other web documents, such as Wikipedia pages
(and will be used to this aim in Section 5).</p>
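        <p>As a worked example, consider the dates of the Ben Harper example from the
Introduction, with an assumed $startTime$ of January 1st, 2007 (an illustrative
choice; the exact value only rescales the measure):</p>
        <preformat>
import datetime as dt

start_time = dt.datetime(2007, 1, 1)  # assumed startTime (first data published)
lm_time = dt.datetime(2013, 1, 29)    # last modification of the document
c_time = dt.datetime(2013, 2, 15)     # assessment (current) time

age = (c_time - lm_time).total_seconds()            # Equation 1
span = (c_time - start_time).total_seconds()
currency_age = 1.0 - age / span                     # Equation 2
print(round(currency_age, 3))  # 0.992: a recently updated document
</preformat>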
        <p>Age-based currency is based on age, as defined in Equation 1, and strongly
depends on the assessment time (current time). We introduce another measure,
which is less sensitive to the current time (the current time is used only for normalizing
the values returned by the measure). We take inspiration from, and adapt, a
definition of currency proposed for relational databases.</p>
        <p>Definition 2 (System Currency). Currency refers to the speed with which
the information system state is updated after the real-world system changes.</p>
        <p>According to this definition, currency measures the temporal delay between
changes in the real world and the consequent updates to the data. An ideal system
currency measure in our domain should evaluate the time elapsed between a
change affecting a real-world entity and the update of the RDF document that
describes that entity. However, changes in the real world are difficult to track,
because the real world is opaque to formal analysis. Since the data under the scope
of our investigation are extracted from a web source, we can define the system
currency of an RDF document by looking at the time elapsed between the update
of the web page describing an entity and the update of the corresponding RDF
document.</p>
        <p>We first define the notion of system delay with respect to an entity, defined
by a function $systemDelay(d_e) : D \to [0, +\infty)$, as the difference between the
time of last modification of a web page $p_e$ and the time of last modification of its
respective entity document $d_e$, as follows:
$$systemDelay(d_e) = lmTime(p_e) - lmTime(d_e) \quad (3)$$
Based on the measure of system delay, we introduce a new currency measure
called system currency, which returns normalized values in the interval $[0, 1]$ that
are higher when the data are more up-to-date and lower when the data are less
up-to-date with respect to the web source. The system currency
$currency_{sys}(d_e) : D \to [0, 1]$ of an entity document $d_e$ is defined as:
$$currency_{sys}(d_e) = 1 - \frac{systemDelay(d_e)}{cTime - startTime} \quad (4)$$</p>
      </sec>
    </sec>
    <sec id="sec-3a">
      <title>Currency Assessment Framework</title>
      <p>We provide an assessment framework that leverages the temporal metadata
extracted from Wikipedia to compute the data currency of DBpedia entity
documents; the assessment framework follows three basic steps: (1) extract a document
representing an entity from the DBpedia dataset, (2) estimate the last
modification date of the document by looking at the version history of the page that
describes the entity in Wikipedia, and (3) use the estimated date to compute data
currency values for the entity document.</p>
        <p>
          The main problem we have to face, in order to apply the currency model
described in Section 3, concerns the (un)availability of the temporal
meta-information required to compute currency, namely the last modification
date of an RDF document. According to a recent empirical analysis, only 10%
of RDF documents and less than 1% of RDF statements are estimated to
be associated with temporal meta-information [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which means that it is not
possible to extrapolate information about the last modification of a document
from the temporal meta-information associated with statements. To assess the data
currency of RDF documents, we propose an assessment strategy based on the
measures defined in Section 3. Our assessment framework takes a DBpedia entity
as input and returns a data currency value. We propose a method, similar to
the data extraction framework proposed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], which uses the versioning metadata
available in Wikipedia pages, which are associated with time-stamped global
version identifiers, to extract the timestamps to associate with RDF documents as
last modification dates.
        </p>
        <p>The pseudocode of the algorithm that implements this strategy is described in
Algorithm 1. The input of the algorithm is an entity $e$ for which the currency has
to be evaluated using the currency measures proposed in Section 3. To obtain
the estimated last modification date of an entity document, we need to extract
its document description and its corresponding page from the web.</p>
        <p>The description of the entity $d_e$ is obtained by the function $rdfFetcher(e)$,
line 2, which reads and records the statements of an N-Triples entity document.
Among all the statements in the document, we keep only those
that use DBpedia properties; the considered documents represent relational facts and
do not include typing information or links to other datasets (e.g., sameAs links
and other links to categories); in other words, we consider in entity documents the
statements that are more sensitive to changes.</p>
        <p>
          From line 3 to line 4, the algorithm extracts the ID of the last revision of
the web page $p_e$ (i.e., the infobox information) corresponding to the entity $e$ and
builds a structured representation of $p_e$. In order to build structured RDF
content, we need to identify properties and their values in the semi-structured
part of the document ($p_e$). The algorithm creates RDF statements from $p_e$ by
using the same approach and mappings used to extract DBpedia statements from
Wikipedia infoboxes [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>Given $p$ and $d_e$, the algorithm checks whether the structured representation of
$p$, denoted $d_e^v$, matches the entity description $d_e$ (see lines 4-7); we use an exact
match function. In case the last revision of the structured representation of $p$
does not match the entity document, the algorithm checks older versions
and stops only when a matching version is found. At this point, we associate the
timestamp of the version $v$ with the entity document $d_e$ (line 8). In this way we
compute the currency of $d_e$ (see lines 9-11).</p>
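        <p>The following Python sketch summarizes the strategy of Algorithm 1. The helpers
rdf_fetcher, wiki_revisions (which yields revisions from the newest to the oldest), and
to_rdf (the infobox-to-RDF mapping) are hypothetical stand-ins for the components
described above, not the actual implementation.</p>
        <preformat>
import datetime as dt

def estimate_last_modification(e, rdf_fetcher, wiki_revisions, to_rdf):
    d_e = rdf_fetcher(e)                  # line 2: facts of the entity document
    for rev_id, rev_time, source in wiki_revisions(e):  # newest revision first
        d_ev = to_rdf(source)             # structured representation of p_e
        if d_ev == d_e:                   # lines 4-7: exact match of fact sets
            return rev_time               # line 8: estimated lmTime(d_e)
    return None                           # no revision matches the document

def currency_age(e, start_time, rdf_fetcher, wiki_revisions, to_rdf):
    # Lines 9-11: use the estimated date to compute the currency measure (Eq. 2).
    lm_time = estimate_last_modification(e, rdf_fetcher, wiki_revisions, to_rdf)
    if lm_time is None:
        return None
    c_time = dt.datetime.now()
    age = (c_time - lm_time).total_seconds()
    return 1.0 - age / (c_time - start_time).total_seconds()
</preformat>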
    </sec>
    <sec id="sec-4">
      <title>Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>Experimental Setup</title>
        <p>Methodology and Metrics. When multiple metrics are defined, the problem of
evaluating the quality of such dimensions arises. In this section we propose a
method to evaluate the effectiveness of data currency measures addressing entity
documents extracted from semi-structured information available in web sources.
Out-of-date documents may contain facts that are no longer valid when the data
are consumed; intuitively, a description can be considered valid when it accurately
represents a real-world entity. A data currency measure that is highly correlated
with the validity of facts is useful, since it may provide insights about the reliability
of the data. We define two metrics, namely accuracy and completeness, to capture
the intuitive notion of validity by comparison with the web source from which the data
are extracted. These metrics use the semi-structured content available in
the web source - Wikipedia infoboxes in our case - as a gold standard against
which entity documents are compared.</p>
        <p>
          Let us assume we have a mapping function $\mu$ that maps every Wikipedia fact
to an RDF fact. This mapping function can be the one used to extract DBpedia
facts from Wikipedia facts (see http://mappings.dbpedia.org); however, other
techniques to extract RDF facts from
semi-structured content have been proposed [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. A DBpedia fact $s$ is valid at
a time point $t$ iff there exists a Wikipedia fact $w$ in a page version $v = \langle id, p, t' \rangle$
such that $\mu(w) = s$, with $t' \leq t$ and $v$ being the last page version at $t$.
        </p>
        <p>We define the accuracy $A(d_e)$ of an RDF document $d_e$ as the number of
semantically accurate facts $VF(d_e)$ in $d_e$ divided by the total number of facts
$TF(d_e)$ in $d_e$:
$$A(d_e) = \frac{VF(d_e)}{TF(d_e)} \quad (5)$$
We define the completeness $C(d_e)$ of an RDF document $d_e$ as the number of
semantically accurate facts $VF(d_e)$ in $d_e$ divided by the total number of existing
facts $WF(p_e)$ in the original document $p_e$:
$$C(d_e) = \frac{VF(d_e)}{WF(p_e)} \quad (6)$$</p>
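        <p>A minimal sketch of the two metrics, assuming the facts of $d_e$ and the mapped
Wikipedia facts are represented as Python sets of triples (variable names are
illustrative):</p>
        <preformat>
def accuracy(de_facts, gold_facts):
    # A(d_e) = VF(d_e) / TF(d_e): valid facts over all facts of the document.
    valid = de_facts.intersection(gold_facts)
    return len(valid) / len(de_facts)

def completeness(de_facts, gold_facts):
    # C(d_e) = VF(d_e) / WF(p_e): valid facts over all facts of the source page.
    valid = de_facts.intersection(gold_facts)
    return len(valid) / len(gold_facts)
</preformat>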
        <p>Dataset. In order to produce an exhaustive and significant experimentation, we
define a specific subset of DBpedia entities that is a representative sample of
the entire dataset. These entities belong to the Soccer Player category. The main
facts describing a soccer player in the Wikipedia infobox template, such as
appearances, goals, or moves from one club to another, provide evidence
about the changes performed. Changes usually occur every three to seven
days, which implies a high frequency of modifications. Furthermore, after
observing the modifications of the Wikipedia infoboxes of soccer players for several
months, we noticed that end-users show a high interest in keeping the infoboxes
up-to-date. Due to the high frequency of changes in Wikipedia, the new
modifications are often not replicated in DBpedia.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Currency and Validity Correlation</title>
        <p>To realize the experiments, we define ten samples, each composed of thirty
soccer players chosen randomly, where each sample contains a set of entities whose
age-based currency evaluated on DBpedia falls within a specific interval
(e.g., $[0, 0.1]$, $[0.1, 0.2]$, ..., $[0.9, 1]$).</p>
        <p>As reported in Table 1, the results of the computations show that the Pearson's
coefficients are lower than 0.75, which means that there is no linear relationship
between currency and accuracy or between currency and completeness. Even though the
correlation between system currency and the other estimators is lower than 0.75,
we notice that the correlation between system currency and completeness is
higher than the other correlations.</p>
        <p>In addition, we calculated Spearman's rank correlation coefficient, which
is a metric that can be used to verify whether there exists a non-linear (monotonic)
correlation between the quality metrics. Table 2 shows the Spearman's coefficients
computed between the two classes of quality measures on the collected soccer
player entities.</p>
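        <p>Both coefficients can be computed, for instance, with SciPy; the arrays below are
dummy per-entity measurements for illustration, not our experimental data:</p>
        <preformat>
from scipy.stats import pearsonr, spearmanr

# Dummy per-entity measurements, for illustration only.
system_currency = [0.1, 0.3, 0.5, 0.7, 0.9]
completeness_values = [0.40, 0.50, 0.45, 0.70, 0.90]

r, _ = pearsonr(system_currency, completeness_values)     # linear correlation
rho, _ = spearmanr(system_currency, completeness_values)  # rank (monotonic) correlation
</preformat>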
        <p>The evaluation results allow us to assert that there exists a non-linear
correlation between system currency and completeness, because Spearman's coefficient
is higher than 0.75. Although the correlation between the harmonic mean and
system currency is low (even if it shows a value close to the threshold), we can
deduce that there exists a correlation between the two components, which is
driven by completeness. Finally, the complete absence of correlation between the
validity indicators and the age-based currency of both the Wikipedia and DBpedia
documents confirms that accuracy and completeness do not depend on the age of
the document.</p>
        <p>In order to figure out the behaviour of system currency and completeness,
we compute the LOWESS (locally weighted scatterplot smoothing) function. This
function computes a local regression for each subset of entities and
identifies, for each interval of system currency, the trend of the two distributions.</p>
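        <p>For instance, with the LOWESS implementation in statsmodels (the smoothing
fraction is an illustrative choice, not the one used in our experiments):</p>
        <preformat>
import statsmodels.api as sm

# Local regression of completeness on system currency; frac controls the
# fraction of the data used to fit each local estimate.
smoothed = sm.nonparametric.lowess(
    completeness_values, system_currency, frac=0.4)
# smoothed is an array of (system currency, fitted completeness) pairs.
</preformat>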
        <p>Figure 1 shows the LOWESS line between system currency and completeness.
As shown by the graph, the local regression cannot be represented by a
well-known mathematical function (e.g., polynomial, exponential or logarithmic),
because it has different behaviours over three system currency intervals. Notice
that for system currency values greater than 0.6 the two distributions tend to
increase, and for system currency values up to 0.4 the metrics tend to decrease.
A particular case is shown in the interval with system currency values going from
0.4 up to 0.6, where the LOWESS is constant.</p>
        <p>[Figure 1: LOWESS regression of completeness against DBpedia system currency.]</p>
        <p>In order to verify linearity on the three entity subsets, we also compute
the correlation coefficients on the identified system currency intervals. The
results provided in Table 3 show that even for the subsets there does not exist
a linear correlation. In particular, there is no correlation in the central interval,
where the Pearson's coefficient is close to 0 and the distribution is dispersed, as
shown in Figure 1. Furthermore, the entities in the central interval show that
there does not exist a non-linear correlation.</p>
        <p>Instead, the upper-interval distribution increases according to a non-linear
function, because Spearman's coefficient is greater than 0.75 for the entities having
high system currency, and the lower-interval distribution decreases according to a
non-linear function, as indicated by the Spearman's correlation, which is lower
than -0.75.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Conclusions</title>
      <p>Based on the experimentation, we can provide several general observations about
the relation between system currency and completeness. The first one is that
entities with high system currency tend to be more complete. Commonly, there
is an update to the soccer player infobox each week. In particular, the two facts
changing most frequently in the infobox are the appearances and goals associated with
the current club. While these two facts change often, the fact that a soccer
player moves to a new club can change at most twice a year. Thus, we can
deduce that in the soccer player domain only two facts can change in a short time.
High system currency implies a low elapsed time between the last modification
time of a document extracted from DBpedia and the corresponding document
extracted from Wikipedia. When the elapsed time is low, the probability that the
player has changed club is low too, which means that only a few infobox modifications
have occurred. According to Equation 6, completeness increases if the number
of facts changing in the Wikipedia infobox is small.</p>
      <p>
        The second observation concerns the behaviour of entities with low system
currency. A deep analysis of these particular entities shows that the associated
Wikipedia pages have two common characteristics: (i) a low frequency of
modifications (approximately between six and twelve months), and (ii) they refer to ex-soccer
players or athletes with low popularity. A low frequency of changes in a
document, typically known as volatility [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], implies a low system currency. Hence,
the time elapsed between the last modification of the document extracted from
DBpedia and the last modification of the associated document on Wikipedia is
high. Furthermore, ex-soccer players' infoboxes do not change often, because these
players are considered to have finished their career. As a consequence of rare
modifications (few changes) in the infobox attributes, such entity documents exhibit
high completeness.
      </p>
      <p>In the case of entities describing players with low popularity, the scenario is
quite different. The rare infobox modifications for these entities imply high
completeness. Information about unpopular players can change weekly, just as
for famous players. Therefore, a page that is not frequently updated could provide
incorrect information. Consequently, we can infer that DBpedia entities of
unpopular soccer players with low system currency and high completeness refer
to infrequently updated Wikipedia pages and, therefore, may not represent current
real-world information; this is also reflected in DBpedia.</p>
      <p>At the end of this deep analysis of our experiments, we can assert that:
(i) our approach for measuring currency based on the estimated timestamp is
effective; (ii) there exists a non-linear correlation between the system currency and
completeness of entities; (iii) the higher the system currency of an entity,
the higher the completeness of the associated DBpedia document; and (iv) entities
with low system currency that are instances of DBpedia classes whose
information can change frequently (e.g., unpopular soccer players) are associated
with Wikipedia pages that may not provide current real-world information.</p>
      <p>We plan to investigate several research directions in the future: on the one hand,
we will extend the experiments to analyze the currency of every DBpedia entity
and its correlation with the validity of the respective documents; on the other
hand, we will study the correlation between volatility and currency to improve
the estimation of the validity of facts.
The work presented in this paper is supported in part by the EU FP7 project
COMSODE - Components Supporting the Open Data Exploitation (under contract
number FP7-ICT-611358).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Garrido</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Delort</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          .
          <article-title>WHAD: Wikipedia historical attributes data</article-title>
          .
          <source>Language Resources and Evaluation</source>
          , pages
          <fpage>1163</fpage>
          –
          <lpage>1190</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cappiello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Francalanci</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          .
          <article-title>Methodologies for data quality assessment and improvement</article-title>
          .
          <source>ACM Computing Surveys</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Berners-Lee</surname>
          </string-name>
          .
          <article-title>Linked Data - The Story So Far</article-title>
          .
          <source>IJSWIS</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kobilarov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Becker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cyganiak</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          .
          <article-title>DBpedia - a crystallization point for the web of data</article-title>
          .
          <source>Web Semantic</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Correndo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Salvadores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Millard</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Shadbolt</surname>
          </string-name>
          .
          <article-title>Linked timelines: Temporal representation and management in linked data</article-title>
          .
          <source>In COLD</source>
          .
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Van de Sompel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Nelson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Balakireva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shankar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Ainsworth</surname>
          </string-name>
          .
          <article-title>An HTTP-Based Versioning Mechanism for Linked Data</article-title>
          .
          <source>In LDOW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Gutierrez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Hurtado</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Vaisman</surname>
          </string-name>
          .
          <article-title>Temporal RDF</article-title>
          .
          <source>In ESWC</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Heath</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Linked data: Evolving the web into a global data space</article-title>
          .
          <source>Synthesis Lectures on the Semantic Web: Theory and Technology</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stadler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Auer</surname>
          </string-name>
          .
          <article-title>DBpedia live extraction</article-title>
          .
          <source>In OTM</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Mendes</surname>
          </string-name>
          , H. Muhleisen, and
          <string-name>
            <given-names>C.</given-names>
            <surname>Bizer</surname>
          </string-name>
          .
          <article-title>Sieve: Linked Data Quality Assessment and Fusion</article-title>
          .
          <source>In LWDM</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Orlandi</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Passant</surname>
          </string-name>
          .
          <article-title>Modelling provenance of DBpedia resources using wikipedia contributions</article-title>
          .
          <source>Web Semantics</source>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Panziera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Comerio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Paoli</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Batini</surname>
          </string-name>
          .
          <article-title>Quality-driven extraction, fusion and matchmaking of semantic web api descriptions</article-title>
          .
          <source>JWE</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Harth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Stadtmüller</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          .
          <article-title>On the diversity and availability of temporal information in linked open data</article-title>
          .
          <source>In ISWC</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Palmonari</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Maurino</surname>
          </string-name>
          .
          <article-title>Capturing the age of linked open data: Towards a dataset-independent framework</article-title>
          .
          <source>In DQMST</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Tappolet</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Bernstein</surname>
          </string-name>
          .
          <article-title>Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL</article-title>
          .
          <source>In ESWC</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Umbrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hausenblas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hogan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polleres</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Decker</surname>
          </string-name>
          .
          <article-title>Towards dataset dynamics: Change frequency of linked open data sources</article-title>
          .
          <source>In LDOW</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spaniol</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weikum</surname>
          </string-name>
          .
          <article-title>Timely YAGO: harvesting, querying, and visualizing temporal knowledge from Wikipedia</article-title>
          .
          <source>In EDBT</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>