<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Non-Temporal Orderings as Proxies for Extensional Concept Drift</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Albert Meroño-Peñuela</string-name>
          <email>albert.merono@vu.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefan Schlobach</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data Archiving and Networked Services</institution>
          ,
          <addr-line>KNAW, NL</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, VU University Amsterdam</institution>
          ,
          <addr-line>NL</addr-line>
        </aff>
      </contrib-group>
      <fpage>2</fpage>
      <lpage>7</lpage>
      <abstract>
        <p>In census data, concepts are central entities represented by variables and their values. The meaning of these concepts is often assumed to be stable, but in fact it can change over time: we call this concept drift. Extensional concept drift is one type of change of meaning that affects the things the concept extends to, having drastic consequences on longitudinal querying. In this paper we detect extensionally drifted concepts in current Linked Census Data when a time ordering of such concepts is not available. We exploit the Linked Data cloud to obtain meaningful proxies for such orderings.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Drift</kwd>
        <kwd>Semantic Web</kwd>
        <kwd>Linked Census Data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Most linked datasets assume some degree of stability in the concepts (variables,
values) they refer to. But the meaning of these concepts can change over time.
In this paper we detect and report this change of meaning, or
concept drift, in two census datasets. Concept drift can happen at the concept
identifier level (label drift), in the basic properties of the concept (intensional
drift), or in the things the concept refers to (extensional drift) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. This paper
proposes a statistics-based method for detecting the last of these.
      </p>
      <p>Concept drift is often assumed to happen between two time-gapped variants
of a concept. Hence, time is the fundamental ordering of concepts in which
concept drift occurs. But time series are not available for the datasets we work
with. In this paper we propose a set of concept orderings that do not include
time, and we show their usefulness as proxies for concept drift detection. To get
such orderings, we exploit Linked Data to enrich and complement the census
data we already have.</p>
      <p>This paper is organised as follows. In Section 2 we describe the state of the
art in concept drift. In Section 3 we set the formal framework for the study of
concept drift. In Section 4 we describe experiments to detect extensional concept
drift in the Australian and French censuses in the absence of time series. Finally,
in Section 5 we establish some conclusions.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        In Machine Learning, concept drift is defined as the situation in which the
statistical properties of a target variable change over time in unforeseen ways [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
Several concept drift detection algorithms have been developed in this setting
[
        <xref ref-type="bibr" rid="ref2 ref4 ref6">2,4,6</xref>
        ]. On the Semantic Web, concept drift relates to the study of the dynamics
of meaning. This has been addressed in the field of ontology change and evolution
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], in Description Logics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and in knowledge management [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>Concept Drift</title>
      <p>
        As reality changes continuously, concepts also change over time. A concept refers
to different objects, real or abstract, at different points in time. We use the
formalisation framework described by Wang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] in order to study concept
drift over time.
      </p>
      <p>Definition 1. The meaning of a concept C is a triple (label(C), int(C), ext(C)),
where label(C) is a string, int(C) a set of properties (the intension of C), and
ext(C) a subset of the universe (the extension of C).</p>
      <p>
        All the elements of the meaning of a concept can change. To address concept
identity over time, Wang et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] assume that the intension of a concept C is
the disjoint union of a rigid and a non-rigid set of properties, i.e. int<sub>r</sub>(C) ∪
int<sub>nr</sub>(C). Then, a concept is uniquely identified by some essential properties
that do not change. The notion of identity allows the comparison of two variants
of a concept at different points in time, even if its meaning changes.
      </p>
      <p>
        Definition 2. Two concepts C1 and C2 are considered identical if and only if
their rigid intensions are equivalent, i.e., int<sub>r</sub>(C1) = int<sub>r</sub>(C2).
      </p>
      <p>If two variants of a concept at two different times have the same meaning,
there is no concept drift. We define intensional, extensional, and label similarity
functions sim<sub>int</sub>, sim<sub>ext</sub>, and sim<sub>label</sub> to quantify meaning similarity. Each of these
functions has range [0, 1], and a similarity value of 1 indicates equality.</p>
      <p>Definition 3. A concept has extensionally drifted between two of its variants C′ and
C″ if and only if sim<sub>ext</sub>(C′, C″) ≠ 1. Intensional and label drift are defined
similarly.</p>
      <p>To apply this framework, we need to define intension, extension and labelling
functions, together with similarity functions over intensions, extensions and
labels. We define these functions in Section 4.2.</p>
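      <p>As a minimal illustration of Definitions 1 to 3, the following Python sketch models a concept as a label, a rigid and a non-rigid intension, and an extension. It is not part of the original framework: the class name Concept, the helper names, the example values, and the use of Jaccard overlap as an example extension similarity are assumptions made here for illustration only. Section 4.2 defines the similarity function actually used in this paper.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """Meaning of a concept as the triple (label, intension, extension)."""
    label: str
    rigid_intension: frozenset      # int_r(C): essential, unchanging properties
    nonrigid_intension: frozenset   # int_nr(C): properties that may change
    extension: frozenset            # ext(C): the things the concept refers to

    @property
    def intension(self):
        # int(C) is the union of the rigid and non-rigid parts.
        return self.rigid_intension | self.nonrigid_intension


def identical(c1, c2):
    """Definition 2: same concept iff the rigid intensions are equal."""
    return c1.rigid_intension == c2.rigid_intension


def jaccard(a, b):
    """Assumed example similarity in [0, 1]; 1 means the two sets are equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def extensionally_drifted(c1, c2, sim_ext=jaccard):
    """Definition 3: drift iff sim_ext of the two variants differs from 1."""
    if not identical(c1, c2):
        raise ValueError("variants must share the same rigid intension")
    return sim_ext(c1.extension, c2.extension) != 1


# Hypothetical variants of one concept with different extensions.
rigid = frozenset({"unemployed", "aged 15-24"})
variant_a = Concept("youth unemployment", rigid, frozenset(), frozenset({"p1", "p2"}))
variant_b = Concept("youth unemployment", rigid, frozenset(), frozenset({"p1", "p3"}))
print(extensionally_drifted(variant_a, variant_b))
```
]]></preformat>
      <p>Passing the similarity function as a parameter keeps the drift check independent of how similarity is measured, so a statistical similarity could be substituted for the set overlap assumed here.</p>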
    </sec>
    <sec id="sec-4">
      <title>Meaningful Orderings as Concept Drift Proxies</title>
      <p>In this section we apply the concept drift framework presented in Section 3 to
study the change of meaning of concepts in RDF Data Cube versions of the
Australian census of 2011 and the French census of 2010.<sup>3,4,5</sup> More concretely,
we apply the notion of extensional drift to detect extensionally drifted concepts
in these censuses. Concept drift is usually assumed to happen between two
time-gapped variants of a concept. Hence, time is the fundamental dimension to order
such variants. Since time series are not available for these datasets, in this paper
we propose a different set of concept orderings, and we study their applicability.
To get such orderings, we exploit Linked Data to complement the census data
we already have.</p>
      <sec id="sec-4-1">
        <title>Data Retrieval</title>
        <p>
          We query the Australian and French census datasets from the statistical
environment R [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] via the SPARQL R package [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ].<sup>6</sup> We select the variables gender,
age range, location, labour status and population. In the Australian census we
query data at the state level, and in the French census we aggregate results at
the département level.
        </p>
        <p>To extend these variables we query DBpedia.<sup>7</sup> In the Australian case, we ask
for the gross domestic product (GDP) per capita of all states. In the French case,
we ask for the area and total population of all départements, and we derive the
population density for each of them.</p>
      </sec>
      <sec id="sec-4-2">
        <title>Non-Temporal Extensional Concept Drift</title>
        <p>We are interested in detecting extensional concept drift, that is, sim<sub>ext</sub>(C′, C″) ≠
1 for two given variants C′, C″ of a concept C (see Section 3). Intuitively, this
means that the instances of C have changed significantly. We interpret
extensional concept drift in a statistical setting. We define the extension function
ext(C) as the function that returns the number of individuals that belong to C,
and the extension similarity function sim<sub>ext</sub>(C′, C″) as the function that returns
the probability that C′ and C″ have identical populations. We assume that the
extension of C has drifted between C′ and C″ iff the populations of C′ and C″
are non-identical (there is a shift between the populations of C′ and C″).</p>
        <p>We choose the concept of youth unemployment to study its extensional drift
in both censuses. To replace the natural ordering of time in the occurrences of
this concept, we use the variables GDP per capita of the Australian states and
population density of the French départements to order such occurrences.</p>
        <p><sup>3</sup> See http://www.datalift.org/en/event/semstats2013/challenge</p>
        <p><sup>4</sup> SPARQL endpoint serving the datasets at http://lod.cedar-project.nl:8080/sparql/semstats/</p>
        <p><sup>5</sup> Source code at https://github.com/albertmeronyo/ConceptDrift/blob/master/stats/semstats-challenge.R</p>
        <p><sup>6</sup> SPARQL queries at https://github.com/albertmeronyo/ConceptDrift/blob/master/sparql/semstats-challenge.txt</p>
        <p><sup>7</sup> http://dbpedia.org/sparql</p>
        <p>Figure 1. Normal Q-Q plots (two panels). x-axis: Theoretical Quantiles; y-axis: Sample Quantiles.</p>
        <p>As an example, we calculate the extensional drift of youth unemployment
between the Australian states of Western Australia and Tasmania (highest and lowest
GDP per capita, respectively). We want to know whether the population counts of
unemployed young people (15-24 years old) have identical data distributions in
these regions. Without assuming the data to be normally distributed (see
Figure 1), we test this at the 0.05 significance level.</p>
        <p>
          The null hypothesis, H0, is that the young unemployed people from these two
regions are identical populations. To test the hypothesis, we run the Wilcoxon
signed-rank test that comes with the R distribution [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We run the wilcox.test
function using these samples (see Listing 1.1), concluding that the populations of
unemployed people between 15 and 24 in Western Australia and Tasmania are
statistically non-identical (p &lt; 0.05, N = 4, Wilcoxon signed-rank
test). Consequently, there is extensional drift in this case.
        </p>
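        <p>The mechanics of such a test can be illustrated outside R. The sketch below is a pure-Python, exact two-sided Wilcoxon signed-rank test for small paired samples. It is an illustration written for this text, not the wilcox.test implementation and not the authors' code. The function names and the sample counts are hypothetical and do not come from the census data. The sketch assumes paired observations, drops zero differences, and averages ranks over ties.</p>
        <preformat><![CDATA[
```python
from itertools import product


def midranks(values):
    """Rank values in ascending order (1-based), averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def signed_rank_test(x, y):
    """Return (V, p): sum of positive-difference ranks and exact two-sided p."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    if not diffs:
        return 0.0, 1.0
    ranks = midranks([abs(d) for d in diffs])
    v = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected value of V under H0
    observed_dev = abs(v - centre)
    # Under H0 each difference is positive or negative with equal probability,
    # so every sign pattern is equally likely: enumerate all of them.
    extreme = 0
    for signs in product((0, 1), repeat=len(ranks)):
        stat = sum(r for r, s in zip(ranks, signs) if s)
        if abs(stat - centre) >= observed_dev - 1e-12:
            extreme += 1
    return v, extreme / 2 ** len(ranks)


# Hypothetical paired counts for two regions (not census data).
region_a = [120, 95, 140, 110, 130, 105]
region_b = [30, 25, 45, 20, 38, 27]
v, p = signed_rank_test(region_a, region_b)
print(f"V = {v}, p = {p}")
```
]]></preformat>
        <p>With six hypothetical pairs that all differ in the same direction, only the two most extreme sign patterns are at least as extreme as the observed one, so the sketch returns p = 2/64, and H0 would be rejected at the 0.05 level. Since the exact test enumerates 2^n sign patterns, its smallest possible two-sided p-value is 2/2^n, which is why the hypothetical sample uses six pairs.</p>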
        <p>To obtain a complete overview of how youth unemployment evolves as
GDP per capita increases, we run the same test for all Australian region pairs, in
ascending order of GDP per capita. The resulting p-values indicate whether there
is extensional drift between the regions (p &lt; 0.05, see Figure 2) or whether, on the
contrary, the concept remains stable. To view the evolution of extensional drift
on a relative scale, for each drift test k we compute the distance function d<sub>k</sub>.</p>
        <p>Figure 2. (a) Extensional drift per Australian region. P-values below 0.05 denote drift. Regions by ascending GDP per capita. (b) Evolution of relative distances d<sub>k</sub> in Australian regions. A decrease in y-values denotes drift. (c) Extensional drift per French region. P-values below 0.05 denote drift. Regions by ascending population density. (d) Evolution of relative distances d<sub>k</sub> in French regions. A decrease in y-values denotes drift.</p>
        <p>In the Australian case, the population distributions tend to vary in the less
rich regions, and they stabilize as the regions get richer. The top two regions
also differentiate themselves from the rest. In the French case, the distributions
are very stable until a drastic change happens when approaching
the top 20% richest regions, which probably reveals differences in how these
labour markets behave. We consider our selected orderings to be as meaningful
and useful as time for the applicability of our extensional concept drift detection
method.</p>
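        <p>The pairwise procedure along a non-temporal ordering can be sketched as follows. This Python sketch assumes that the test is applied to consecutive regions in the proxy ordering, which is one possible reading of the procedure described above. To keep the example self-contained it swaps in an exact Wilcoxon rank-sum (Mann-Whitney) test for two independent samples. The function names, region names, proxy values and counts are hypothetical.</p>
        <preformat><![CDATA[
```python
from itertools import combinations


def exact_rank_sum_p(x, y):
    """Exact two-sided Wilcoxon rank-sum p-value for two small samples."""
    pooled = list(x) + list(y)
    ordered = sorted(pooled)
    # Midrank of each distinct value (ties share the average of their ranks).
    rank_of = {
        v: ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in set(pooled)
    }
    ranks = [rank_of[v] for v in pooled]
    n = len(x)
    expected = n * (len(pooled) + 1) / 2  # mean rank sum of x under H0
    observed_dev = abs(sum(ranks[:n]) - expected)
    hits = total = 0
    # Under H0 every choice of n pooled observations is equally likely to be x.
    for idx in combinations(range(len(pooled)), n):
        total += 1
        if abs(sum(ranks[i] for i in idx) - expected) >= observed_dev - 1e-12:
            hits += 1
    return hits / total


def drift_along_ordering(samples, proxy, alpha=0.05, p_value=exact_rank_sum_p):
    """Test consecutive regions, ordered by ascending proxy value, for drift."""
    ordered = sorted(samples, key=lambda region: proxy[region])
    results = []
    for left, right in zip(ordered, ordered[1:]):
        p = p_value(samples[left], samples[right])
        results.append((left, right, p, p < alpha))
    return results


# Hypothetical proxy values (e.g. GDP per capita) and population counts.
proxy = {"A": 3.0, "B": 1.0, "C": 2.0}
samples = {
    "A": [9.5, 10.5, 11.5, 12.5],
    "B": [1, 2, 3, 4],
    "C": [10, 11, 12, 13],
}
results = drift_along_ordering(samples, proxy)
for left, right, p, drifted in results:
    print(f"{left} -> {right}: p = {p:.3f}, drift = {drifted}")
```
]]></preformat>
        <p>In this hypothetical data the ordering is B, C, A. The samples of B and C are fully separated, which gives the smallest p-value the exact test can produce for two samples of four (2/70), so drift is flagged between them, whereas C and A overlap and no drift is flagged.</p>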
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>In this paper we present the application of an extensional concept drift
detection method to Linked Census Data when temporal variants of the concepts
are not available. Concretely, we study extensional drift of the concept youth
unemployment in the Australian and French censuses, leveraging Linked Data to
retrieve meaningful orderings of the data in the absence of temporal orderings.</p>
      <p>Acknowledgements. The work on which this paper is based has been partly supported by the
Computational Humanities Programme of the Royal Netherlands Academy of Arts and Sciences,
under the auspices of the CEDAR project. For further information, see http://ehumanities.nl. This
work has also been supported by the Dutch national program COMMIT.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Fanizzi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>d'Amato</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Esposito</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Conceptual Clustering: Concept Formation, Drift and Novelty Detection</article-title>
          .
          <source>In: The Semantic Web: Research and Applications, 5th European Semantic Web Conference. LNCS 5021</source>
          . pp.
          <fpage>318</fpage>
          -
          <lpage>332</lpage>
          . Springer (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Flouris</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manakanatas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kondylakis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Plexousakis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Antoniou</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>Ontology change: classification and survey</article-title>
          .
          <source>The Knowledge Engineering Review</source>
          <volume>23</volume>
          (
          <issue>2</issue>
          ),
          <fpage>117</fpage>
          -
          <lpage>152</lpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gonçalves</surname>
            ,
            <given-names>R.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parsia</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sattler</surname>
            ,
            <given-names>U.</given-names>
          </string-name>
          :
          <article-title>Analysing Multiple Versions of an Ontology : A Study of the NCI Thesaurus</article-title>
          .
          <source>In: Proceedings of the 24th International Workshop on Description Logics (DL 2011)</source>
          . vol.
          <volume>745</volume>
          . CEUR Workshop Proceedings (
          <year>2011</year>
          ), http://ceur-ws.org/Vol-745/
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gulla</surname>
            ,
            <given-names>J.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Solskinnsbakk</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Myrseth</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Haderlein</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cerrato</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          :
          <article-title>Semantic Drift in Ontologies</article-title>
          .
          <source>In: Proceedings of the 6th International Conference on Web Information Systems and Technologies</source>
          . vol.
          <volume>2</volume>
          . INSTICC Press (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. van Hage, W.R., with contributions from: Tomi Kauppinen, Graeler, B., Davis, C., Hoeksema, J., Ruttenberg, A., Bahls, D.:
          <source>SPARQL: SPARQL client</source>
          (
          <year>2013</year>
          ), http://CRAN.R-project.org/package=SPARQL, R package version 1.15
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Change Management for Distributed Ontologies</article-title>
          .
          <source>Ph.D. thesis</source>
          , VU University Amsterdam (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. R Core Team:
          <source>R: A Language and Environment for Statistical Computing</source>
          . R Foundation for Statistical Computing, Vienna, Austria (
          <year>2013</year>
          ), http://www.R-project.org/
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Tsymbal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>The problem of concept drift: definitions and related work</article-title>
          .
          <source>Tech. Rep. TCD-CS-2004-15</source>
          , Computer Science Department, Trinity College Dublin (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>M.C.A.</given-names>
          </string-name>
          :
          <article-title>What Is Concept Drift and How to Measure It?</article-title>
          <source>In: Knowledge Engineering and Management by the Masses - 17th International Conference, EKAW 2010. Proceedings</source>
          . pp.
          <fpage>241</fpage>
          -
          <lpage>256</lpage>
          . Lecture Notes in Computer Science,
          <volume>6317</volume>
          , Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Wilcoxon</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Individual comparisons by ranking methods</article-title>
          .
          <source>Biometrics Bulletin</source>
          <volume>1</volume>
          (
          <issue>6</issue>
          ),
          <fpage>80</fpage>
          -
          <lpage>83</lpage>
          (
          <year>1945</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>