       Quality Metrics for Optimizing Parameters Tuning in Clustering
       Algorithms for Extraction of Points of Interest in Human Mobility


            Miguel Núñez del Prado Cortez                               Hugo Alatrista-Salas
                     Peru I+D+I                                        GRPIAA Labs., PUCP
                   Technopark IDI                                           Peru I+D+I
           miguel.nunez@peruidi.com                                   halatrista@pucp.pe




                       Abstract

     Clustering is an unsupervised learning technique used to group a set
     of elements into non-overlapping clusters based on some predefined
     dissimilarity function. In our context, we rely on clustering
     algorithms to extract points of interest in human mobility as an
     inference attack for quantifying the impact of the privacy breach.
     Thus, we focus on the selection of the input parameters for the
     clustering algorithm, which is not a trivial task due to the direct
     impact of these parameters on the result of the attack. Namely, if we
     use too relaxed parameters we will find too many points of interest,
     but if we use a too restrictive set of parameters, we will find too
     few groups. Accordingly, to solve this problem, we propose a method
     to select the best parameters to extract the optimal number of POIs
     based on quality metrics.

1   Introduction

The first step in inference attacks over mobility traces is the extraction of the points of interest (POIs) from a trail of mobility traces. Indeed, this phase directly impacts the global accuracy of any inference attack that relies on POI extraction. For instance, if an adversary wants to discover Alice's home and place of work, the result of the extraction must be as accurate as possible; otherwise, he may confuse or simply miss important places. In addition, for a more sophisticated attack such as next place prediction, a mistake when extracting POIs can significantly decrease the global precision of the inference. Most extraction techniques use heuristics and clustering algorithms to extract POIs from location data. On the one hand, heuristics rely on the dwell time, which is the loss of signal when the user enters a building. Another commonly used heuristic is the residence time, which represents the time that a user spends at a particular place. On the other hand, clustering algorithms group nearby mobility traces into clusters.
   In particular, in the context of POI extraction, it is important to find a suitable set of parameters for a specific clustering algorithm in order to obtain a good accuracy as a result of the clustering. The main contribution of this paper is a methodology to find an "optimal" configuration of input parameters for a clustering algorithm based on quality indices. This optimal set of parameters allows us to have the appropriate number of POIs in order to perform further inference attacks. This paper is organized as follows. First, we present some related works on parameter estimation techniques in Section 2. Afterwards, we describe the clustering algorithms used to perform the extraction of points of interest (POIs) as well as the metrics used to measure the quality of the formed clusters in Sections 3 and 4, respectively. Then, we introduce the method to optimize the choice of the parameters in Section 5. Finally, Section 6 summarizes the results and presents the future directions of this work.

2   Related works

Most of the previous works estimate the parameters of the clustering algorithms for point of interest extraction by using empirical approaches or computationally expensive methods. For instance, we use for illustration purposes two classical clustering approaches, K-means (MacQueen et al., 1967) and DBSCAN (Ester et al., 1996).




In the former clustering algorithm, the main issue is how to determine k, the number of clusters, and several approaches have been proposed to address this issue (Hamerly and Elkan, 2003; Pham et al., 2005). For the latter, one approach relies on the OPTICS algorithm (Ankerst et al., 1999), which searches the space of parameters of DBSCAN in order to find the optimal number of clusters. The more parameters the clustering algorithm has, the bigger the combinatorial space of parameters is. Nevertheless, these methods for calibrating clustering algorithm inputs do not guarantee a good accuracy for extracting meaningful POIs. In the next section, we describe the clustering algorithms used in our study.

3   Clustering algorithms for extraction of points of interest

To perform the POI extraction, we rely on the following clustering algorithms:

3.1   Density Joinable Cluster (DJ-Cluster)

DJ-Cluster (Zhou et al., 2004) is a clustering algorithm taking as input a minimal number of points minpts, a radius r and a trail of mobility traces M. This algorithm works in two phases. First, the pre-processing phase discards all the moving points (i.e. those whose speed is above ε, for ε a small value) and then squashes each series of repeated static points into a single occurrence. Next, the second phase clusters the remaining points based on neighborhood density: the number of points in the neighborhood must be equal to or greater than minpts and these points must be within radius r from the medoid of the set, where the medoid is the real point m that minimizes the sum of distances from m to the other points in the cluster. Then, the algorithm merges the new cluster with the already computed clusters that share at least one common point. Finally, during the merging, the algorithm erases the old computed clusters and only keeps the new cluster, which contains all the other merged clusters.
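As an illustration of the two phases, the following Python sketch outlines DJ-Cluster on a trail of (x, y, speed) tuples. It is a minimal sketch of the description above rather than the implementation evaluated in this paper: the euclidean helper, the speed threshold value and the density test (taken around each candidate point instead of the medoid) are our own simplifications.

    import math

    def euclidean(p, q):
        # Plain Euclidean distance between two (x, y) points.
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def dj_cluster(traces, minpts, r, eps=1.0):
        # Phase 1: discard moving points (speed above eps) and squash
        # consecutive repetitions of the same static point.
        static = [(x, y) for x, y, speed in traces if speed <= eps]
        points = [p for i, p in enumerate(static) if i == 0 or p != static[i - 1]]

        clusters = []
        for p in points:
            # Phase 2, density condition: at least minpts points within radius r.
            neighborhood = {q for q in points if euclidean(p, q) <= r}
            if len(neighborhood) < minpts:
                continue
            # Merge with every existing cluster sharing at least one point;
            # the old clusters are absorbed into the new one.
            kept = [c for c in clusters if not (c & neighborhood)]
            for c in clusters:
                if c & neighborhood:
                    neighborhood |= c
            clusters = kept + [neighborhood]
        return clusters

For instance, dj_cluster(traces, minpts=4, r=0.05) would return the candidate POIs of one user as sets of coordinates.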
3.2   Density Time Cluster (DT-Cluster)

DT-Cluster (Hariharan and Toyama, 2004) is an iterative clustering algorithm taking as input a distance threshold d, a time threshold t and a trail of mobility traces M. First, the algorithm builds a cluster C composed of all the consecutive points within distance d from each other. Afterwards, it checks whether the accumulated time of the mobility traces between the youngest and the oldest ones is greater than the threshold t. If it is the case, the cluster is created and added to the list of POIs. Finally, as a post-processing step, DT-Cluster merges the clusters whose medoids are less than d/3 far from each other.
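The following Python sketch mirrors this description, assuming the traces are given as (timestamp, (x, y)) pairs sorted by time; the euclidean and medoid helpers and the iterative merge loop are our own simplifications.

    import math

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        # The real point minimizing the sum of distances to the other points.
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def dt_cluster(traces, d, t):
        clusters, current = [], [traces[0]]

        def flush(group):
            # Keep a group only if it spans at least t time units.
            if group[-1][0] - group[0][0] >= t:
                clusters.append([p for _, p in group])

        for trace in traces[1:]:
            if euclidean(trace[1], current[-1][1]) <= d:
                current.append(trace)  # consecutive points within distance d
            else:
                flush(current)
                current = [trace]
        flush(current)

        # Post-processing: merge clusters whose medoids are closer than d / 3.
        merged = True
        while merged:
            merged = False
            for i in range(len(clusters)):
                for j in range(i + 1, len(clusters)):
                    if euclidean(medoid(clusters[i]), medoid(clusters[j])) < d / 3:
                        clusters[i] += clusters.pop(j)
                        merged = True
                        break
                if merged:
                    break
        return clusters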
3.3   Time Density (TD-Cluster)

Introduced in (Gambs et al., 2011), TD-Cluster is a clustering algorithm inspired by DT-Cluster, which takes as input parameters a radius r, a time window t, a tolerance rate τ, a distance threshold d and a trail of mobility traces M. The algorithm starts by iteratively building clusters from the traces of M that are located within the time window t. Afterwards, for each cluster, if a fraction of the points above the tolerance rate τ are within radius r from the medoid, the cluster is added to the list of clusters outputted, whereas otherwise it is simply discarded. Finally, as for DT-Cluster, the algorithm merges the clusters whose medoids are less than d far from each other.

3.4   Begin-end heuristic

The objective of the begin and end location finder inference attack (Gambs et al., 2010) is to take as meaningful points the first and last ones of a journey. More precisely, this heuristic considers that the beginning and ending locations of a user, for each working day, might convey some meaningful information.
   Since we have introduced the different clustering algorithms used to extract points of interest, we present in the next section the indices that measure the quality of the clusters.

4   Cluster quality indices

One aspect of the POI extraction inference attack is the quality of the obtained clusters, which impacts the precision and recall of the attack. In the following subsections, we describe some metrics to quantify how accurate or "how good" the outcome of the clustering task is. Intuitively, a good clustering is one that identifies groups of clusters that are well separated from each other, compact and representative. Table 1 summarizes the notation used in this section.




  Symbol     Definition
  C          An ensemble of clusters.
  ci         The ith cluster of C.
  nc         The number of clusters in C.
  mi         The medoid point of the ith cluster.
  d(x, y)    The Euclidean distance between x and y.
  |ci|       The number of points in a cluster ci.
  m'         The closest point to the medoid mi.
  m''        The second closest point to the medoid mi.
  |C|        The total number of points in a set of C.

           Table 1: Summary of notations
4.1   Intra-inter cluster ratio

The intra-inter cluster ratio (Hillenmeyer, 2012) measures the relation between compact (Equation 1) and well separated (Equation 3) groups. More precisely, we first take the intra-cluster distance, which is the average distance from each point in a cluster ci to its medoid mi:

    DIC(c_i) = \frac{1}{|c_i| - 1} \sum_{x_j \in c_i, x_j \neq m_i} d(x_j, m_i)    (1)

Then, the average intra-cluster distance (DIC) over the clustering is computed using Equation 2:

    AVG\_DIC(C) = \frac{1}{n_c} \sum_{c_i \in C} DIC(c_i)    (2)

Afterwards, the mean distance among all medoids (DOC) in the clustering C is computed using Equation 3:

    DOC(C) = \frac{1}{n_c (n_c - 1)} \sum_{c_i \in C} \sum_{c_j \in C, i \neq j} d(m_i, m_j)    (3)

Finally, the intra-inter cluster ratio rii is given by Equation 4 as the average intra-cluster distance divided by the inter-cluster distance:

    rii(C) = \frac{AVG\_DIC(C)}{DOC(C)}    (4)

The intra-inter ratio has an approximately linear complexity in the number of points and gives low values to well separated and compact clusters.
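Read as code, Equations 1 to 4 amount to the following sketch, where the clusters are lists of (x, y) points; the euclidean and medoid helpers are our own and singleton clusters are given a zero intra-cluster distance.

    import math
    from itertools import combinations

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        # The real point minimizing the sum of distances to the other points.
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def dic(cluster):
        # Equation 1: average distance of the non-medoid points to the medoid.
        if len(cluster) < 2:
            return 0.0
        m = medoid(cluster)
        return sum(euclidean(x, m) for x in cluster if x != m) / (len(cluster) - 1)

    def rii(clusters):
        # Equation 2: average intra-cluster distance over the n_c clusters.
        avg_dic = sum(dic(c) for c in clusters) / len(clusters)
        # Equation 3: mean distance between medoids over all ordered pairs.
        medoids = [medoid(c) for c in clusters]
        n = len(medoids)
        doc = 2 * sum(euclidean(a, b) for a, b in combinations(medoids, 2)) / (n * (n - 1))
        # Equation 4: low values indicate compact, well separated clusters.
        return avg_dic / doc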
4.2   Additive margin

This metric is inspired by Ben-David and Ackerman's k-additive Point Margin (K-AM) (Ben-David and Ackerman, 2008), which evaluates how well centered the clusters are. We measure the difference between the distances from the medoid mi to its two closest points m' and m'' for a given group ci belonging to a clustering C (Equation 5):

    K\text{-}AM(c_i) = d(m_i, m''_i) - d(m_i, m'_i)    (5)

Once the average of the k-additive point margins over all groups ci in a clustering C is computed, we take the ratio between this average and the minimal intra-cluster distance (Equation 1), as shown in Equation 6:

    AM(C) = \frac{\frac{1}{n_c} \sum_{c_i \in C} K\text{-}AM(c_i)}{\min_{c_i \in C} DIC(c_i)}    (6)

The additive margin method has a linear complexity in the number of clusters. This metric gives a high value for well centered clusters.
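Under the same assumptions (our own helpers, clusters of at least three points so that the two closest points to the medoid exist), Equations 5 and 6 can be sketched as follows; note that dividing the average margin by the minimal DIC value is our reading of Equation 6.

    import math

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def dic(cluster):
        # Equation 1, repeated here so that the sketch is self-contained.
        m = medoid(cluster)
        return sum(euclidean(x, m) for x in cluster if x != m) / (len(cluster) - 1)

    def k_am(cluster):
        # Equation 5: margin between the two points closest to the medoid.
        m = medoid(cluster)
        closest = sorted(euclidean(x, m) for x in cluster if x != m)
        return closest[1] - closest[0]

    def additive_margin(clusters):
        # Equation 6: average point margin over the minimal intra-cluster distance.
        avg_margin = sum(k_am(c) for c in clusters) / len(clusters)
        return avg_margin / min(dic(c) for c in clusters)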
4.3   Information loss

The information loss ratio is a metric inspired by the work of Solé and coauthors (Solé et al., 2012). The basic idea is to measure the percentage of information that is lost when the original data are represented only by a certain number of groups (e.g., when we represent the POIs by the cluster medoids instead of the whole set of points). To evaluate this percentage, we compute the sum of the distances of each point xj to its medoid mi over all clusters ci ∈ C, as shown in Equation 7:

    SSE(C) = \sum_{c_i \in C} \sum_{x_j \in c_i} d(x_j, m_i)    (7)

Then, we estimate the accumulated distance of all the points of a trail of mobility traces in the clustering C to a global centroid (GC) using Equation 8:

    SST(C) = \sum_{x_i \in C} d(x_i, GC)    (8)




Finally, the ratio between the aforementioned distances, computed in Equation 9, gives the information loss ratio:

    IL(C) = \frac{SSE(C)}{SST(C)}    (9)

The computation of this ratio has a linear complexity. The lower the value of this ratio, the more representative the clusters are.
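Equations 7 to 9 translate into the following sketch, again with our own euclidean and medoid helpers and with the global centroid taken, as an assumption, to be the mean of all the points.

    import math

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def information_loss(clusters):
        points = [x for c in clusters for x in c]
        # Equation 7: accumulated distance of each point to its cluster medoid.
        sse = sum(euclidean(x, medoid(c)) for c in clusters for x in c)
        # Equation 8: accumulated distance of every point to the global centroid,
        # taken here as the mean of all points (an assumption of this sketch).
        gc = (sum(x for x, _ in points) / len(points),
              sum(y for _, y in points) / len(points))
        sst = sum(euclidean(x, gc) for x in points)
        # Equation 9: the lower the ratio, the more representative the medoids.
        return sse / sst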
                                                              one maximizing the result of the function Rall (ci ),
4.4 Dunn index
                                                              which is given by Equation 15 for i 6= j.
This quality index (Dunn, 1973; Halkidi et al., 2001)
attempts to recognize compact and well-separated                        Rall (ci ) = maxcj 2C,i6=j R(ci , cj )      (15)
clusters. The computation of this index relies on
a dissimilarity function (e.g. Euclidean distance d)          Finally, the DBI is equal to the average of the simi-
between medoids and the diameter of a cluster (c.f,           larity between clusters in the clustering set C (Equa-
Equation 10) as a measure of dispersion.                      tion 16).
                                                                                                n
        diam(ci ) = maxx,y2ci ,x6=y d(x, y)      (10)                                       1 Xc

                                                                           DBI(C) =              Rall (ci )         (16)
                                                                                            nc
                                                                                              ci 2C
Then, if the clustering C is compact (i.e, the diam-
eters tend to be small) and well separated (distance          Ideally, the clusters ci 2 C should have the mini-
between cluster medoids are large), the result of the         mum possible similarity to each other. Accordingly,
index, given by the Equation 11, is expected to be            the lower is the DB index, the better is the cluster-
large.                                                        ing formed. These indices would be used to maxi-
                                                              mize the number of significant places a cluster algo-
     DIL(C) = minci 2C [mincj 2C,j=i+1                        rithm could find. More precisely, in the next section
                      d(mi , mj )                (11)         we evaluate the cluster algorithm aforementioned as
                [                    ]]
                  maxck 2C diam(ck )                          well as the method to extract the meaningful places
                                                              using the quality indices.
The greater is this index, the better the performance
of the clustering algorithm is assumed to be. The             5     Selecting the optimal parameters for
main drawbacks of this index is the computational                   clustering
complexity and the sensitivity to noise in data.
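A direct transcription of Equations 10 and 11 reads as follows (our own helpers; every cluster is assumed to contain at least two points and the medoids to be pairwise distinct):

    import math
    from itertools import combinations

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def dunn(clusters):
        # Equation 10: the diameter is the largest pairwise distance in a cluster.
        max_diam = max(max(euclidean(x, y) for x, y in combinations(c, 2))
                       for c in clusters)
        # Equation 11: smallest medoid separation over the largest diameter.
        medoids = [medoid(c) for c in clusters]
        return min(euclidean(a, b) for a, b in combinations(medoids, 2)) / max_diam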
4.5   Davies-Bouldin index

The objective of the Davies-Bouldin index (DBI) (Davies and Bouldin, 1979; Halkidi et al., 2001) is to evaluate how well the clustering was performed by using properties inherent to the dataset considered. First, we use a scatter function within the cluster ci of the clustering C (Equation 12):

    S(c_i) = \sqrt{\frac{1}{|c_i|} \sum_{x_j \in c_i} d(x_j, m_i)^2}    (12)

Then, we compute the distance between two different clusters ci and cj, given by Equation 13:

    M(c_i, c_j) = d(m_i, m_j)    (13)

Afterwards, a similarity measure between the two clusters ci and cj, called R-similarity, is estimated based on Equation 14:

    R(c_i, c_j) = \frac{S(c_i) + S(c_j)}{M(c_i, c_j)}    (14)

After that, the most similar cluster cj to ci is the one maximizing the function Rall(ci), which is given by Equation 15 for i ≠ j:

    R_{all}(c_i) = \max_{c_j \in C, i \neq j} R(c_i, c_j)    (15)

Finally, the DBI is equal to the average of the similarities between the clusters in the clustering set C (Equation 16):

    DBI(C) = \frac{1}{n_c} \sum_{c_i \in C} R_{all}(c_i)    (16)

Ideally, the clusters ci ∈ C should have the minimum possible similarity to each other; accordingly, the lower the DB index is, the better the clustering formed.
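Equations 12 to 16 can be sketched under the same assumptions (our own helpers, pairwise distinct medoids):

    import math

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def medoid(points):
        return min(points, key=lambda m: sum(euclidean(m, x) for x in points))

    def scatter(cluster):
        # Equation 12: root mean squared distance of the points to their medoid.
        m = medoid(cluster)
        return math.sqrt(sum(euclidean(x, m) ** 2 for x in cluster) / len(cluster))

    def dbi(clusters):
        s = [scatter(c) for c in clusters]
        m = [medoid(c) for c in clusters]
        n = len(clusters)
        # Equations 13-15: keep, for each cluster, its most similar other cluster.
        r_all = [max((s[i] + s[j]) / euclidean(m[i], m[j])
                     for j in range(n) if j != i)
                 for i in range(n)]
        # Equation 16: a lower average similarity means a better clustering.
        return sum(r_all) / n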

These indices will be used to maximize the number of significant places a clustering algorithm can find. More precisely, in the next section we evaluate the aforementioned clustering algorithms as well as the method to extract the meaningful places using the quality indices.

5   Selecting the optimal parameters for clustering

In order to establish how to select the best set of parameters for a given clustering algorithm, we have computed the precision, recall and F-measure for all the users of the LifeMap dataset (Chon and Cha, 2011). One of the unique characteristics of this dataset is that the POIs have been annotated by the users. Consequently, given a set of clusters ci ∈ C such that C = {c1, c2, c3, ..., cn} and a set of points of interest (POIs) defined by the users Ppoi = {ppoi1, ppoi2, ppoi3, ..., ppoin}, we were able to compute the precision, recall and F-measure as we detail in the next subsection.

5.1   Precision, recall and F-measure
To compute the precision (c.f. Equation 17), we take as input a clustering set C, the ground truth represented by the vector Ppoi (which was defined manually by each user) as well as a radius, and we count all the clusters c ∈ C that are within the radius of some ppoi ∈ Ppoi, which represent the "good clusters". Then, the ratio of the number of good clusters to the total number of found clusters is computed. This measure illustrates the fraction of the extracted clusters that are POIs:

    Precision = \frac{good\ clusters}{total\ number\ of\ extracted\ clusters}    (17)

To compute the recall (c.f. Equation 18), we take as input a clustering set C, a vector of POIs Ppoi as well as a radius, and we count the discovered POIs ppoi ∈ Ppoi that are within the radius of some cluster c ∈ C, which represent the "good POIs". Then, the ratio between the number of good POIs and the total number of POIs is evaluated. This metric represents the percentage of the unique POIs that are extracted:

    Recall = \frac{good\ POIs}{total\ number\ of\ POIs}    (18)

Finally, the F-measure is defined as the weighted average of the precision and recall, as shown in Equation 19:

    F\text{-}measure = 2 \times \frac{precision \times recall}{precision + recall}    (19)
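A sketch of Equations 17 to 19, assuming the extracted clusters are reduced to their medoids and that both the medoids and the annotated POIs are (x, y) points; the radius argument plays the same role as in the text above.

    import math

    def euclidean(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    def precision_recall_f(cluster_medoids, annotated_pois, radius):
        # Equation 17: fraction of extracted clusters lying near an annotated POI.
        good_clusters = sum(1 for c in cluster_medoids
                            if any(euclidean(c, p) <= radius for p in annotated_pois))
        precision = good_clusters / len(cluster_medoids)
        # Equation 18: fraction of annotated POIs recovered by some cluster.
        good_pois = sum(1 for p in annotated_pois
                        if any(euclidean(c, p) <= radius for c in cluster_medoids))
        recall = good_pois / len(annotated_pois)
        # Equation 19: harmonic mean of precision and recall.
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f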




We present the dataset used for our experiments in the next subsection.

5.2   Dataset description

In order to evaluate our approach, we use the LifeMap dataset (Chon et al., 2012), which is composed of the mobility traces of 12 users collected for a year in Seoul, Korea. This dataset comprises locations (latitude and longitude) collected with a frequency between 2 to 5 minutes, together with a user-defined point of interest flag set to true if the mobility trace is considered important or meaningful for the user. Table 2 summarizes the main characteristics of this dataset, such as the collection period, the average number of traces per user, the total number of mobility traces in the dataset, and the minimal and maximal number of mobility traces per user.

   Characteristics                    LifeMap
   Total nb of users                       12
   Collection period (nb of days)         366
   Average nb of traces/user            4 224
   Total nb of traces                  50 685
   Min nb of traces for a user            307
   Max nb of traces for a user          9 473

   Table 2: Main characteristics of the LifeMap dataset.

   Since we have described our dataset, we present the results of our experiments in the next subsection.
5.3   Experimental results

This section is composed of two parts. In the first part, we compare the performance of the previously described clustering algorithms with two baseline clustering algorithms, namely k-means and DBSCAN. In the second part, a method to select the most suitable parameters for a clustering algorithm is presented.
   In order to compare the aforementioned clustering algorithms, we have taken into account the precision, recall and F-measure obtained, the average execution time, the number of input parameters and the time complexity. To evaluate these algorithms, we used the LifeMap dataset with POI annotations and a set of different parameter configurations for each algorithm, which are summarized in Table 3. After running these configurations, we obtained the results shown in Table 4 for the different input values.

   Input parameters      Possible values                     DBSCAN  DJ-Cluster  DT-Cluster  K-means  TD-Cluster
   Tolerance rate (%)    {0.75, 0.8, 0.85, 0.9}                no       no          no         no        yes
   Minpts (points)       {3, 4, 5, 6, 7, 8, 9, 10, 20, 50}     yes      yes         no         no        no
   Eps (Km.)             {0.01, 0.02, 0.05, 0.1, 0.2}          yes      yes         yes        no        yes
   Merge distance (Km.)  {0.02, 0.04, 0.1, 0.2, 0.4}           no       no          yes        no        yes
   Time shift (hour)     {1, 2, 3, 4, 5, 6}                    no       no          yes        no        yes
   K (num. clusters)     {5, 6, 7, 8, 9}                       no       no          no         yes       no

   Table 3: Summary of input parameters for clustering algorithms.

                 Precision  Recall  F-measure  Time(s)  Number of parameters  Complexity
   DBSCAN          0.58      0.54     0.48       316             3              O(n^2)
   DJ-Cluster      0.74      0.52     0.52       429             3              O(n^2)
   DT-Cluster      0.38      0.47     0.39       279             3              O(n^2)
   k-means         0.58      0.51     0.49       299             2              O(n)
   TD-Cluster      0.43      0.54     0.44       362             4              O(n^2)

   Table 4: The characteristics of the clustering algorithms.
[Figure 1 about here: correlation values per clustering algorithm (DBSCAN, DJ Cluster, DT Cluster, k-means, TD Cluster) and on average ("Resume").]

Figure 1: Correlation of the quality indices with the computed F-measure, where A) is the correlation measured between the user annotation and the centroid at a 20 m radius, B) at a 35 m radius, C) at a 50 m radius and D) at a 100 m radius; DBI = Davies-Bouldin index, DI = Dunn index, IL = Information loss, kAM = Additive margin, RII = Ratio intra-inter cluster.

   It is possible to observe that the precision of DJ-Cluster is better than that of the other clustering algorithms. In terms of recall, DBSCAN and TD-Cluster perform the best, but DJ-Cluster is just behind them. Moreover, DJ-Cluster has the best F-measure. Regarding the execution time, DT-Cluster is the fastest one, while DJ-Cluster is the slowest algorithm due to its preprocessing phase. Despite its high computational time, DJ-Cluster performs well in terms of F-measure.
   In the following, we describe our method to choose "optimal" parameters for obtaining a good F-measure. We have used the aforementioned algorithms with different sets of input parameter configurations for the users with POI annotations in the LifeMap dataset (Chon and Cha, 2011). Once the clusters are built, we evaluate the clusters issued from the different configurations of the distinct algorithms using the previously described quality indices. Afterwards, we were able to estimate the precision, recall and F-measure using the manual annotation of POIs by the users in the LifeMap dataset.
   Regarding the relation between the quality indices and the F-measure, we studied the relationship between these factors in order to identify the indices that are highly correlated with the F-measure, as can be observed in Figure 1. We observe that the two best performing indices, except for k-means, are IL and DBI. The former shows a negative correlation with respect to the F-measure, while the latter has a positive dependency on the F-measure. Our main objective is to identify this relationship between quality and F-measure across the previously evaluated clustering algorithms. Accordingly, we discard the intra-inter cluster ratio (RII) and the additive margin (kAM), which only perform well when using the k-means and DJ clustering algorithms. Finally, we observe that the Dunn index has a poor performance. Based on these observations, we were able to propose an algorithm to automatically choose the best configuration of input parameters.

5.4   Parameter selection method

Let us define a vector of parameters pi ∈ P, with P a set of vectors such that P = {p1, p2, p3, ..., pn}, and a trail of mobility traces M of a user. From the previous sections, we have the clustering function C(pi) and the quality metrics Information loss IL(C) and Davies-Bouldin index DBI(C). Thus, for each vector of parameters we have a tuple composed of the trail of mobility traces, the result of the clustering algorithm and the quality metrics (pi, M, Cpi, ILCpi, DBICpi). We compute the clustering algorithm and the quality metrics for each vector of parameters for a given user u and gather the results in a matrix Γu. We also define Γ'u as the matrix Γu sorted by IL in ascending order. Finally, the matrix Γu is of the form:

             p1    M    Cp1    ILCp1    DBICp1
             p2    M    Cp2    ILCp2    DBICp2
      Γu =   p3    M    Cp3    ILCp3    DBICp3
             ...
             pn    M    Cpn    ILCpn    DBICpn
[Figure 2 about here: F-measure per user (u1-u7) for each clustering algorithm (DBSCAN, DJ cluster, DT cluster, k-means, TD cluster) and on average ("Resume"), under each selection heuristic.]

Figure 2: Comparison between the F-measure and the parameter selection based on the schema in Figure ??, where DBI = Davies-Bouldin index, IL = Information loss, IL_DBI = the combination of IL and DBI, and MAX is the maximal computed F-measure (taken as reference to compare with IL_DBI). Resume is the average of all the results using the different clustering algorithms.


Therefore, the parameter selection function S(Γu) can be defined as:

    S(\Gamma_u) = \begin{cases} p_i, & \text{if } \max_{p_i}(DBI) \text{ and } \min_{p_i}(IL) \text{ coincide} \\ p'_i, & \text{if } p'_i \text{ maximizes } DBI \text{ in the first quartile of } \Gamma'_u \end{cases}    (20)

In detail, the function S takes as input a matrix containing, for each parameter vector pi, the trail of mobility traces M, the computed clustering C(pi, M) as well as the quality metrics, namely the Information loss (IL(C)) and the Davies-Bouldin index (DBI(C)). Once all these values have been computed for each evaluated set of parameters, two cases are possible. In the first case, both IL and DBI agree on the same set of input parameters. In the second situation, IL and DBI each point to a different set of parameters; in this case, the algorithm sorts the rows by IL in ascending order (i.e., from the smallest to the largest information loss value) and then chooses the set of parameters with the greatest DBI in the first quartile.
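A minimal sketch of this selection rule, assuming each row of Γu has been reduced to a (parameters, IL, DBI) tuple; the quartile cut below is our own reading of "the first quartile".

    def select_parameters(rows):
        # rows: list of (parameters, IL, DBI) tuples, one per parameter vector.
        best_dbi = max(rows, key=lambda r: r[2])
        best_il = min(rows, key=lambda r: r[1])
        if best_dbi is best_il:
            # First case: IL and DBI agree on the same parameter vector.
            return best_dbi[0]
        # Second case: sort by IL in ascending order and pick the row with
        # the greatest DBI within the first quartile of the sorted matrix.
        ordered = sorted(rows, key=lambda r: r[1])
        first_quartile = ordered[:max(1, len(ordered) // 4)]
        return max(first_quartile, key=lambda r: r[2])[0]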
For the sake of evaluation, our methodology was tested on the LifeMap dataset to check whether the chosen parameters are optimal. We have tested the method with the seven users of LifeMap who have manually annotated their POIs.




Consequently, for every set of settings of each clustering algorithm, we have computed the F-measure, because we have the ground truth, as depicted in Figure 2. The "MAX" bar represents the best F-measure for the given user and it is compared to the F-measures obtained when using "DBI", "IL" or "IL_DBI" as the indicator to choose the best input parameter configuration. Finally, this method has a satisfactory performance, extracting a good number of POIs for maximizing the F-measure and achieving a difference of only 9% with respect to the F-measure computed from the data with the ground truth.
6   Conclusion

In the current paper, we have presented a method to extract an optimal number of POIs. Consequently, based on the method described in this paper, we are able to find an appropriate number of POIs relying only on the quality metrics of the extracted clusters and without any knowledge of the ground truth. Nonetheless, we are aware of the small size of the dataset, but the results encourage us to continue in this direction.
   Therefore, in the future we plan to test our method on a larger dataset and in the presence of noise such as downsampling or random distortion. Another idea is to evaluate the impact of this method on more complex attacks, such as the prediction of future locations or de-anonymization, to verify whether this step can affect the global result of a chain of inference attacks.
References

Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. (1999). OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2):49-60.

Ben-David, S. and Ackerman, M. (2008). Measures of clustering quality: A working set of axioms for clustering. In Proceedings of the Annual Conference on Neural Information Processing Systems, pages 121-128, Vancouver, Canada.

Chon, Y. and Cha, H. (2011). LifeMap: A smartphone-based context provider for location-based services. Pervasive Computing, IEEE, 10(2):58-67.

Chon, Y., Talipov, E., Shin, H., and Cha, H. (2012). CRAWDAD data set yonsei/lifemap (v. 2012-01-03). Downloaded from http://crawdad.cs.dartmouth.edu/yonsei/lifemap.

Davies, D. L. and Bouldin, D. W. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1(2):224-227.

Dunn, J. C. (1973). A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics, 3(3):32-57.

Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. Knowledge Discovery and Data Mining, 2(4):226-231.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2010). GEPETO: A GEoPrivacy-Enhancing TOolkit. In Advanced Information Networking and Applications Workshops, pages 1071-1076, Perth, Australia.

Gambs, S., Killijian, M.-O., and Núñez del Prado Cortez, M. (2011). Show me how you move and I will tell you who you are. Transactions on Data Privacy, 2(4):103-126.

Halkidi, M., Batistakis, Y., and Vazirgiannis, M. (2001). On clustering validation techniques. Journal of Intelligent Information Systems, 17(3):107-145.

Hamerly, G. and Elkan, C. (2003). Learning the k in K-means. In Neural Information Processing Systems, pages 1-8, Vancouver, Canada.

Hariharan, R. and Toyama, K. (2004). Project Lachesis: Parsing and modeling location histories. Lecture Notes in Computer Science - Geographic Information Science, 3(1):106-124.

Hillenmeyer, M. (2012). Intra and inter cluster distance. http://www.stanford.edu/~maureenh/quals/html/ml/node82.html.

MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281-297, Berkeley, CA, USA.

Pham, D. T., Dimov, S. S., and Nguyen, C. D. (2005). Selection of K in K-means clustering. Journal of Mechanical Engineering Science, 205(1):103-119.

Solé, M., Muntés-Mulero, V., and Nin, J. (2012). Efficient microaggregation techniques for large numerical data volumes. International Journal of Information Security, 11(4):253-267.

Zhou, C., Frankowski, D., Ludford, P., Shekhar, S., and Terveen, L. (2004). Discovering personal gazetteers: an interactive clustering approach. In Proceedings of the Annual ACM International Workshop on Geographic Information Systems, pages 266-273, New York, NY, USA.



