Non-Temporal Orderings as Proxies for
             Extensional Concept Drift

                   Albert Meroño-Peñuela1,2 and Stefan Schlobach1
        1
            Department of Computer Science, VU University Amsterdam, NL
                                albert.merono@vu.nl
               2
                   Data Archiving and Networked Services, KNAW, NL


      Abstract. In census data, concepts are central entities represented by
      variables and their values. The meaning of these concepts is often as-
      sumed to be stable, but in fact it can change over time: we call this
      concept drift. Extensional concept drift is one type of change of mean-
      ing that affects the things the concept extends to, having drastic conse-
      quences on longitudinal querying. In this paper we detect extensionally
      drifted concepts in current Linked Census Data when a time ordering
      of such concepts is not available. We exploit the Linked Data cloud to
      obtain meaningful proxies for such orderings.

      Keywords: Concept Drift, Semantic Web, Linked Census Data


1   Introduction

Most linked datasets assume some degree of stability in the concepts (variables,
values) they refer to. But the meaning of these concepts can change over time.
In this paper we find and report back this change of meaning of concepts, or
concept drift, in two census datasets. Concept drift can happen at the concept
identifier level (label drift), in the basic properties of the concept (intensional
drift), or to the things the concept refers to (extensional drift) [9]. This paper
proposes a statistics-based solution for the latter.
    Concept drift is often assumed to happen between two time gapped variants
of a concept. Hence, time is the fundamental ordering of concepts in which
concept drift occurs. But time series are not available for the datasets we work
with. In this paper we propose a set of concept orderings that do not include
time, and we show their usefulness as proxies for concept drift detection. To get
such orderings, we exploit Linked Data to enrich and complement the census
data we already have.
    This paper is organised as follows. In Section 2 we describe the state of the
art in concept drift. In Section 3 we set the formal framework for the study of
concept drift. In Section 4 we describe experiments to detect extensional concept
drift in the Australian and French censuses in the absence of time series. Finally,
in Section 5 we establish some conclusions.
2    Related Work

In Machine Learning, concept drift is defined as the situation in which the sta-
tistical properties of a target variable change over time in unforeseen ways [8].
Several concept drift detection algorithms have been developed in this setting
[2,4,6]. On the Semantic Web, concept drift relates to the study of the dynamics
of meaning. This has been addressed in the field of ontology change and evolution
[1], in Description Logics [3], and in knowledge management [9].


3    Concept Drift

As reality changes continuously, concepts also change over time. A concept refers
to different objects, real or abstract, at different points in time. We use the
formalisation framework described by Wang et al. [9] in order to study concept
drift over time.

Definition 1. The meaning of a concept C is a triple (label(C),int(C),ext(C)),
where label(C) is a string, int(C) a set of properties (the intension of C), and
ext(C) a subset of the universe (the extension of C).

    All the elements of the meaning of a concept can change. To address concept
identity over time, Wang et al. [9] assume that the intension of a concept C is
the disjoint union of a rigid and a non-rigid set of properties (i.e. (intr (C) ∪
intnr (C))). Then, a concept is uniquely identified by some essential properties
that do not change. The notion of identity allows the comparison of two variants
of a concept at different points in time, even if a change on its meaning occurs.

Definition 2. Two concepts C1 and C2 are considered identical if and only if,
their rigid intension are equivalent, i.e., intr (C1 ) = intr (C2 ).

   If two variants of a concept at two different times have the same meaning,
there is no concept drift. We define intensional, extensional, and label similarity
functions simint , simext , simlabel to quantify meaning similarity. Each of these
functions has range [0, 1], and a similarity value of 1 indicates equality.

Definition 3. A concept has extensionally drifted in two of its variants C’ and
C”, if and only if, simext (C 0 , C 00 ) 6= 1. Intensional and label drift are defined
similarly.

   To apply this framework of concept drift it is required to define intension, ex-
tension and labelling functions, and to define similarity functions over intension,
extension and labels. We define these functions in Section 4.2.


                                          2
4       Meaningful Orderings as Concept Drift Proxies
In this section we apply the concept drift framework presented in Section 3 to
study the change of meaning of concepts in RDF Data Cube versions of the
Australian census of 2011 and the French census of 2010.3,4,5 More concretely,
we apply the notion of extensional drift to detect extensionally drifted concepts
in these censuses. Concept drift is usually assumed to happen between two time
gapped variants of a concept. Hence, time is the fundamental dimension to order
such variants. Since time series are not available for these datasets, in this paper
we propose a different set of concept orderings, and we study their applicability.
To get such orderings, we exploit Linked Data to complement the census data
we already have.

4.1      Data Retrieval
We query the Australian and French census datasets from the statistical envi-
ronment R [7] via the SPARQL R package [5].6 We select the variables gender,
age range, location, labour status and population. In the Australian census we
query data at the state level, and in the French census we aggregate results at
the departement level.
    To extend these variables we query DBPedia7 . In the Australian case, we ask
for the gross domestic product (GDP) per capita of all states. In the French case,
we ask for the area and total population of all departements, and we derive the
population density for each of them.

4.2      Non-Temporal Extensional Concept Drift
We are interested in detecting extensional concept drift, that is, simext (C 0 , C 00 ) 6=
1 for two given variants C 0 , C 00 of a concept C (see Section 3). Intuitively, this
means that the instances of C have changed significantly. We interpret exten-
sional concept drift in a statistical setting. We define the extension function
ext(C) as the function that returns the number of individuals that belong to C,
and the extension similarity function simext (C 0 , C 00 ) as the function that returns
the probablity that C 0 and C 00 have identical populations. We assume that the
extension of C has drifted between C 0 and C 00 iff the populations of C 0 and C 00
are non identical (there is a shift between the populations of C 0 and C 00 ).
    We choose the concept of youth unemployment to study its extensional drift
in both censuses. To replace the natural ordering of time in the occurrences of
    3
        See http://www.datalift.org/en/event/semstats2013/challenge
    4
        SPARQL endpoint serving the datasets at http://lod.cedar-project.nl:8080/
sparql/semstats/
    5
     Source code at https://github.com/albertmeronyo/ConceptDrift/blob/master/
stats/semstats-challenge.R
   6
     SPARQL queries at https://github.com/albertmeronyo/ConceptDrift/blob/
master/sparql/semstats-challenge.txt
   7
     http://dbpedia.org/sparql


                                            3
                                               Normal Q−Q Plot                                                      Normal Q−Q Plot


                            100000


                                                                                                      30000
         Sample Quantiles


                                                                                   Sample Quantiles
                                                                                                      20000
                            60000


                                                                                                      10000
                            0 20000


                                                                                                      0
                                         −2   −1        0              1   2                                  −2   −1        0              1   2
                                               Theoretical Quantiles                                                Theoretical Quantiles


    Fig. 1: Normal QQ-plots of all population counts in Western Australia and Tasmania.
    Both plots reveal non-normality of their distributions.


1   > wilcox.test(x,y)
2

3                                     Wilcoxon rank sum test with continuity correction
4

5   data: x and y
6   W = 16, p-value = 0.02857
7   alternative hypothesis: true location shift is not equal to 0

    Listing 1.1: Wilcoxon test for the population counts of unemployed young people in
    Western Australia and Tasmania


    this concept, we use the variables GDP per capita of the Australian states and
    population density of the French departements to order such occurrences.
        As an example, we calculate the extensional drift of youth unemployment
    in the Australian states of Western Autralia and Tasmania (highest and lowest
    GDP per capita, respectively). We want to know if population counts of unem-
    ployed young people (15-24 years old) have identical data distributions between
    these regions. Without assuming the data to have normal distribution (see Fig-
    ure 1), we want to test at .05 significance level if the population counts for youth
    unemployment have identical data distributions.
        The null hypothesis, H0 , is that the young unemployed people from these two
    regions are identical populations. To test the hypothesis, we run the Wilcoxon
    signed-rank test that comes with the R distribution [10]. We run the wilcox.test
    function using these samples (see Listing 1.1), concluding that the population of
    unemployed people between 15 and 24 in Western Australia and Tasmania are
    statistically non-identical populations (p < 0.05, N = 4, Wilcoxon signed-rank
    test). Consequently, there is extensional drift in this case.
        In order to have a complete overview on how youth unemployment evolves as
    GDP per capita increases, we run the same test for all Australian region pairs, in
    GDP per capita ascending order. The resulting p-values indicate whether there
    is an extensional drift between the regions (p < 0.05, see Figure 2) or, on the
    contrary, the concept remains stable. To view the evolution of extensional drift
    on a relative scale, for each drift test k we compute the distance function


                                                                               4
                   1.0


                                                                                                       1
                   0.8


                                                                                                       0
                   0.6


                                                                                      p.gdp.dist
                                                                                                       −1
    p.gdp
                   0.4


                                                                                                       −2
                   0.2


                                                                                                       −3
                   0.0


                                                                                                       −4
                          1   2        3    4           5        6       7   8                                                             2        4                6         8
                                                Index                                                                                                        Index


 (a) Extensional drift per Australian re-                                             (b) Evolution of relative distances dk
 gion. P-values below 0.05 denote drift.                                              in Australian regions. A decrease in y-
 Regions by ascending GDP per capita.                                                 values denotes drift.


                                                                                                       0.95 0.96 0.97 0.98 0.99 1.00
                   1.0
                   0.8


                                                                                      p.density.dist
    p.density.yu
                   0.6
                   0.4
                   0.2
                   0.0


                         0        20       40               60       80                                                                0       20       40           60   80
                                                Index                                                                                                        Index


 (c) Extensional drift per French region.                                             (d) Evolution of relative distances dk in
 P-values below 0.05 denote drift. Re-                                                French regions. A decrease in y-values
 gions by ascending population density.                                               denotes drift.

Fig. 2: Plots of p-values and dk distances for extensional drift in Australian and French
regions for the concept youth unemployment. The x-axis represents ascending regions
per GDP per capita (a, b) and population density (c, d); the y-axis represents p-values
(a, c) and relative drift distances dk (b, d).


                                                                     (
                                                                         pk−1 − α(pk ) if pk < 0.05
                                                            dk =
                                                                         pk−1          if pk ≥ 0.05
where α is a function that magnifies the distances in case of drift (Figure 2).
    We evaluate the applicability of this method with different data and ordering
criteria. We repeat the youth unemployment experiment, this time on the French
census. We use population density of the departements as the ordering to compare
different variants of the same concept. Results are also shown in Figure 2.

5                        Conclusions
Figure 2 shows meaningful results on the use of GDP per capita and population
density to track the evolution of the extensional drift of youth unemployment.


                                                                                  5
In the Australian case, the population distributions tend to vary in the less
rich regions, and they stabilize as the regions get richer. The top two regions
also differentiate themselves from the rest. In the French case, there is a great
stability of the distributions until a drastic change happens when approaching
the top 20% richest regions, which probably reveals differences in how these
labour markets behave. We consider our selected orderings to be as meaningful
and useful as time for the applicability of our extensional concept drift detection
method.
    In this paper we present the application of an extensional concept drift de-
tection method in Linked Census Data when temporal variants of the concepts
are not available. Concretely, we study extensional drifts of the concept youth
unemployment in the Australian and French censuses, leveraging Linked Data to
retrieve meaningful orderings of the data in the absence of temporal orderings.
    Acknowledgements The work on which this paper is based has been partly supported by the
Computational Humanities Programme of the Royal Netherlands Academy of Arts and Sciences,
under the auspices of the CEDAR project. For further information, see http://ehumanities.nl. This
work has been supported as well by the Dutch national program COMMIT.


References
 1. Fanizzi, N., d’Amato, C., Esposito, F.: Conceptual Clustering: Concept Formation,
    Drift and Novelty Detection. In: The Semantic Web: Research and Applications,
    5th European Semantic Web Conference. LNCS 5021. pp. 318–332. Springer (2008)
 2. Flouris, G., Manakanatas, D., Kondylakis, H., Plexousakis, D., Antoniou, G.: On-
    tology change: classification and survey. The Knowledge Engineering Review 23(2),
    117–152 (2008)
 3. Gonçalves, R.S., Parsia, B., Sattler, U.: Analysing Multiple Versions of an Ontol-
    ogy : A Study of the NCI Thesaurus. In: Proceedings of the 24th International
    Workshop on Description Logics (DL 2011). vol. 745. CEUR Workshop Proceed-
    ings (2011), http://ceur-ws.org/Vol-745/
 4. Gulla, J.A., Solskinnsbakk, G., Myrseth, P., Haderlein, V., Cerrato, O.: Semantic
    Drift in Ontologies. In: Proceedings of the 6th International Conference on Web
    Information Systems and Technologies. vol. 2. INSTICC Press (2010)
 5. van Hage, W.R., with contributions from: Tomi Kauppinen, Graeler, B., Davis,
    C., Hoeksema, J., Ruttenberg, A., Bahls., D.: SPARQL: SPARQL client (2013),
    http://CRAN.R-project.org/package=SPARQL, R package version 1.15
 6. Klein, M.: Change Management for Distributed Ontologies. Ph.D. thesis, VU Uni-
    versity Amsterdam (2004)
 7. R Core Team: R: A Language and Environment for Statistical Computing. R Foun-
    dation for Statistical Computing, Vienna, Austria (2013), http://www.R-project.
    org/
 8. Tsymbal, A.: The problem of concept drift: definitions and related work. Tech. Rep.
    TCD-CS-2004-15, Computer Science Department, Trinity College Dublin (2004)
 9. Wang, S., Schlobach, S., Klein, M.C.A.: What Is Concept Drift and How to Measure
    It? In: Knowledge Engineering and Management by the Masses - 17th International
    Conference, EKAW 2010. Proceedings. pp. 241–256. Lecutre Notes in Computer
    Science, 6317, Springer (2010)
10. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin
    1(6), 80–83 (1945)


                                               6