=Paper= {{Paper |id=None |storemode=property |title=Classification of Whereabouts Patterns From Large-scale Mobility Data |pdfUrl=https://ceur-ws.org/Vol-621/paper23.pdf |volume=Vol-621 |dblpUrl=https://dblp.org/rec/conf/woa/FerrariM10 }} ==Classification of Whereabouts Patterns From Large-scale Mobility Data== https://ceur-ws.org/Vol-621/paper23.pdf
       Classification of Whereabouts Patterns From Large-scale Mobility Data

                   Laura Ferrari                                                   Marco Mamei
    DISMI, Università di Modena e Reggio Emilia                     DISMI, Università di Modena e Reggio Emilia
     Via Amendola 2, 42122 Reggio Emilia, Italy                       Via Amendola 2, 42122 Reggio Emilia, Italy
             laura.ferrari@unimore.it                                        marco.mamei@unimore.it


                         Abstract

   Classification of users' whereabouts patterns is important for many emerging ubiquitous computing applications. Latent Dirichlet Allocation (LDA) is a powerful mechanism to extract recurrent behaviors and high-level patterns (called topics) from mobility data in an unsupervised manner. One drawback of LDA is that it is difficult to give meaningful and usable labels to the extracted topics. We present a methodology to automatically classify the topics with meaningful labels so as to support their use in applications. This mechanism is tested and evaluated using the Reality Mining dataset, consisting of about 350000 hours of continuous data on human behavior.

1    Introduction

   The recent diffusion of smart phones equipped with localization capabilities makes it possible to collect data about mobility and whereabouts in an economically feasible and unobtrusive way from a large user population [17, 15]. Several institutions, both in academia and industry, are exploiting this technology to create applications that collect logs of people's whereabouts. This information opens new scenarios and possibilities in the development of context-aware services and applications, but several challenges need to be tackled to extract practically useful information from such large mobility datasets. Accordingly, one of the key research goals is to develop and apply pattern-analysis algorithms to such data. Until recently, however, this kind of research has been held back by the limited availability of mobility data from a large user population.
   The Reality Mining dataset is a seminal dataset in this area. It collects data about the daily life of 97 users over 10 months. Some pioneering studies have started to apply pattern-analysis and data-mining algorithms to this mobility dataset in order to extract high-level information and routine behaviors.
   A number of these studies focus on two "similar" techniques: Principal Component Analysis (PCA) [2] and Latent Dirichlet Allocation (LDA) [3]. The goal of both techniques is to discover significant patterns and features in the input data. More precisely, from a maximum-likelihood perspective, both techniques aim at identifying a set of latent variables z and conditional probability distributions p(x|z) for the observed variable x representing users' whereabouts. The latent variables z are typically of a much lower dimensionality than x; thus they encode patterns in a more understandable way and with reduced noise. These techniques have been applied to a variety of people-mobility datasets [7, 8, 4] with similar modalities.
   In the context of the Reality Mining dataset (which is also the target of our work), the approach consists of extracting, for each user and for each day, a 24-slot array indicating where the user was at each hour of the day. Users' locations are expressed as 'Home', 'Work', 'Elsewhere' or 'No Signal', the latter indicating lack of data (in the remainder of this paper we refer to these locations as 'H', 'W', 'E' and 'N' respectively). For example, a typical day of a user could be 'HHHHHHHHHWWWWWWWWWEEEHHH', expressing that the user was at home at night and in the early morning, then went to work until late afternoon, then went somewhere else for three hours, and finally went back home (see the next Section for further details on the dataset).
   Applying PCA or LDA to a set of these arrays makes it possible to extract low-dimensional latent variables (eigenvectors and LDA-topics respectively) representing underlying patterns in the data, and offering conditional probability distributions for the observed arrays (i.e., days). Figure 1 illustrates some eigenvectors and LDA-topics extracted from the Reality Mining dataset.
   Eigenvectors, in the leftmost part of Figure 1, encode the probability of the user being at a given location: 'H', 'W', 'E'. In the picture, the lighter the color, the higher the probability.
   Similarly, LDA-topics encode the probability of the user being at a given location (in a different representation format – see the next Section for details). The rightmost part of Figure 1 shows the most probable days according to the conditional probability distribution of a given topic. These days are thus a representation of the topic itself.

      Figure 1. (left) The top two eigenbehaviors for Subject 4 of the Reality Mining dataset. (right) Exemplary LDA-topics – images respectively taken from [7] and [8].

   Assigning a meaning to the extracted latent variables is a difficult task that has typically been performed by visually inspecting the latent variable itself or the days that strongly correlate with the variable (i.e., the most probable days given the latent variable) [7, 8, 4]. For example, in [7], by visually inspecting the first eigenbehavior represented in Figure 1, the authors conclude that it relates to the typical weekday behavior consisting of being at work from 10:00 to 20:00 and being at home in the remaining part of the day. The second eigenbehavior corresponds to typical weekend behavior, where the subject remains at home past 10:00 in the morning and is out in the town (elsewhere) later that evening. Similarly, the rightmost part of Figure 1 reports some LDA-topics obtained in [8]. By visually inspecting the days strongly correlated with the extracted topics, the authors conclude that topic 20 means "at home in the morning" while topic 46 means "at work in the afternoon until late in the evening". Looking at these examples, it is clear how difficult it is to give a meaning to the extracted patterns and how difficult it is to evaluate the quality of the given meaning (i.e., the label).
   The need to associate meaningful labels to topics has also been considered in text-mining applications. The work presented in [16] tries to classify latent topics extracted from text corpora. Although this work applies to a completely different scenario, the fact that topic understanding is an important research challenge in other communities as well further motivates our work.
   The contribution of this paper is to present a methodology to automatically classify the extracted topics without any visual inspection or user involvement. Once meaningful labels are given to the topics, the extracted patterns become readily understandable and usable in applications. For example, life-log applications [10] could readily use the extracted label to automatically create an entry in the user's blog. Similarly, when analyzing city-wide mobility patterns, applications could identify routine behaviors affecting city life and communicate such information to local governments and city planners. These tasks are simplified once a proper label is assigned to the discovered patterns, while they are very difficult starting from the raw eigenbehaviors and topics of Figure 1.
   The proposed classification methodology is a powerful tool in that it makes it possible to express what is going to happen to the user (i.e., which topic is going to be expressed) with high-level, meaningful labels.
   Although the presented approach generalizes to both PCA and LDA, in the remainder of this paper we focus only on the LDA application. The probabilistic model realized by LDA is better suited to extracting different patterns from complex datasets [3, 8].

2    Data Preprocessing and LDA

   In this section we present the data and algorithms representing the starting point of our work.

2.1    Data Preprocessing

   The work presented in this paper is based on the GSM-localization part of the Reality Mining dataset. This dataset basically consists of two big tables (see Figure 2). For each user, the dataset records a set of time-frames and the GSM towers to which the user was connected. The tables have missing data (time-frames in which no information has been logged) due to data corruption and powered-off devices. On average, the logs cover approximately 85% of the time spanned by the data collection. Another table records the labels given by the users to the different GSM towers. Not all the towers are labeled: the dataset comprises 32628 GSM towers and only 825 are labeled (2.5%). Fortunately, the labeled cells are those in which users spend most of their time, so overall 75% of the dataset happens to fall in labeled cells. Still, identifying where the users have been in the remaining 25% of the time is an important issue to improve the data.

      UserID        Begin               End               CellID
        22     2004-08-27 14:00   2004-08-27 16:00          102
        22     2004-08-27 16:30   2004-08-27 17:00          122
        ...          ...                ...                 ...

      UserID   CellID   Label
        22      102     Home
        22      121     Work
        ...     ...      ...

      Figure 2. Tables used in the Reality Mining dataset.

      Figure 3. Sliding windows approach (a day string such as 'HHHHHHHHHWWWWWWWWWEEEHHH' is scanned by a 3-hour sliding window, producing words such as 1HHH, 3HWW and 8HHH that form the bag of words).
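The word encoding illustrated in Figure 3 (and detailed later in this section) can be captured in a short sketch. This is our reading of the encoding, not the paper's code: we assume the sliding window advances one hour at a time, which the paper does not state explicitly, and the function name is ours.

```python
def day_to_words(day):
    """Encode a 24-slot day string (e.g. 'HHHH...') as the bag of
    words used as LDA input: a 3-hour sliding window runs across
    the day, and each word is the time-period the window starts in
    (1 for 0:00-3:00, 2 for 3:00-6:00, ...) followed by the three
    location labels the window covers."""
    assert len(day) == 24
    words = []
    for start in range(22):          # window starts at hours 0..21
        period = start // 3 + 1      # integer time-period in 1..8
        words.append("%d%s" % (period, day[start:start + 3]))
    return words

words = day_to_words("HHHHHHHHHWWWWWWWWWEEEHHH")
print(words[0], words[8], words[-1])  # 1HHH 3HWW 8HHH
```

With this encoding, the example day of Section 1 yields words such as 1HHH, 3HWW and 8HHH, matching the bag of words shown in Figure 3.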
   Although some works [8] try to extract patterns directly from such data (considering all the unlabeled towers as "Elsewhere"), we opted to run some preprocessing to complete the missing values. In particular, we train an SVM to infer the labels of all the GSM towers. The SVM computations have been performed with the LIBSVM library [5], on the basis of the following procedure:

 1. We create the training and testing sets as:

       Day of Week    Weekend    Hour    CellID    Label
         Tuesday        no        14      150      Work
        Saturday        yes       17      950      Home
       Wednesday        no        15      155        ?

    The table associates the label to be identified with a feature vector consisting of the day of the week when the tower is visited, whether it is a weekend or not, the hour of the visit, and the cell ID.

 2. We conduct a simple scaling on the data, converting all the values to a [0,1] scale. This avoids attributes in greater numeric ranges dominating the others.

 3. Following other examples in the SVM literature [5], we adopt the Radial Basis Function (RBF) kernel, since it applies well to a vast range of classification problems. In addition, we use cross-validation to find the best parameters C and γ of the SVM and the RBF respectively. Basically, we try all the combinations of the parameters with an exponentially-growing grid search. The parameters producing the best cross-validation accuracy are selected for the final model.

 4. We use the best parameters to train the SVM model on the whole training set and to classify the testing set.

   SVM classification produces results with an overall 86% accuracy in cross-validation (accuracy being defined as the proportion of correct results in the population). After SVM classification, the dataset is much more representative of the plausible whereabouts of the users (lacking groundtruth information, we cannot make assertions on their actual whereabouts). For example, without SVM, many days of the users are described as being always "Elsewhere" (i.e., neither at home nor at work). This is rather unrealistic, and in fact SVM corrects this imbalance by restoring, for example, the being-at-home-at-night behavior.
   Following [7, 8], we organize the dataset into a sequence of days, each consisting of 24 time-slots lasting 1 hour. Each time-slot is labeled after the cell where the user spends most of his time. If no information is present for a time-slot, the cell is marked as 'No Signal'.
   To apply the LDA algorithm described in the following section, the dataset has been further processed. Each day is divided into a sequence of words, each representing a 3-hour time-slot. A 3-hour sliding window runs across the day; each word is composed of an integer value between 1 and 8 (we refer to this value as the time-period) and the 3 labels ('H', 'W', 'E' or 'N') in the sliding window. The time-period abstracts the time of the day: it is 1 if the sliding window starts in 0:00am - 3:00am, 2 in 3:00am - 6:00am, and so on (see Figure 3).
   Using time-periods of 3 hours each (in contrast with some previous work [8] in which the time-periods are skewed) improves the resulting dataset in that the number of words for each time-slot is not biased by its length, but better reflects the actual user behavior.
   The resulting bag of words summarizes the original dataset and is the input data structure for the LDA algorithm described in the next subsection.

2.2    LDA Algorithm

   LDA is a probabilistic generative model [3] used to cluster documents according to the topics (i.e., word patterns) they contain. The work in [8] proposes using this model to extract mobility patterns from a mobility dataset.
   LDA is based on the Bayesian network depicted in Figure 4. A word w is the basic unit of data, representing the user location at a given time-period (see the bag of words in Figure 3). A set of N words defines a day of the user. Each user has a dataset consisting of M documents. Each day is viewed as a mixture of topics z, where topics are distributions over words (i.e., each topic can be represented by the list of words associated with the probability p(w|z)). For each day i, the probability of a word wij is given by p(wij) = Σ_{t=1}^{T} p(wij|zit) p(zit), where T is the number of topics. p(wij|zit) and p(zit) are assumed to have Multinomial distributions with hyperparameters α and β respectively. LDA uses the EM-algorithms [2] to learn the model
parameters. In our implementation we use the library Mallet (http://mallet.cs.umass.edu) to perform these computations.

      Figure 4. Plate notation of the LDA model.

   Once the model parameters have been found, Bayesian inference makes it possible to extract the topics that best describe the routines of a given day (ranking z on the basis of p(d|z)). However, as already introduced, since the z are just distributions over words, it is difficult to give them an immediate meaning useful in applications. The next section contains the main contribution of this paper: giving a meaningful label to the topics z.

3    LDA-Topic Classification

3.1    Method

   In brief, our approach consists of identifying a set of labels describing the main trends of a user's typical day. For example, 'Work 9:00 - 18:00' represents a day in which the user is at work from 9:00 to 18:00. We then identify the LDA-topics representing such a label (we refer to these topics as label-defined topics). LDA-topics extracted from the Reality Mining dataset are labeled after the most similar label-defined topics. Thus, rather than being described only in terms of probability distributions over words, topics get a compact description like 'Work 9:00 - 18:00'.
   More in detail, our methodology is based on these key points:

 1. We create a set of 30 predefined labels, each composed of a place ('H', 'W', 'E' or 'N'; we refer to these places as pattern-labels) and a time-frame (we refer to it as the time-frame-label). For example, the label 'W 9:00 - 18:00' represents the pattern where the user is at work from 9:00 to 18:00, while the label 'H 12:00 - 14:00' represents the pattern where the user is at home at lunch time. We chose these labels by visually inspecting the recurrent patterns in the Reality Mining users' days.

 2. For each predefined label, we create a set of 15 sample days representing the corresponding daily behavior (each day is represented as a 24-slot array indicating where the user was at a given time of the day). The different days keep the pattern indicated by the label constant, and change the remaining part of the day.

 3. For each block of 15 days, we compute one LDA-topic. The final result is a set of 30 label-defined topics, each representing one of the predefined labels. For example, recalling that topics are distributions over words, the days associated with the 'W 9:00 - 18:00' pattern will create a topic in which the p(w|z) of words like 3WWW, 4WWW and 5WWW is high compared to the other words' probabilities. We verified that the resulting topics are not strongly affected if they are produced from a number of days smaller or greater than 15.

 4. Topics extracted from the Reality Mining dataset are classified using k-Nearest Neighbor (kNN), with the Kullback-Leibler divergence from the above label-defined topics as the distance metric.

 5. The final result is that the days of a user can be described with easily-understandable labels (e.g., 'W 9:00 - 18:00'), rather than with just probability distributions over words, which are much more complex to interpret.

3.2    Experiments and Discussion

   We conduct some experiments to test the above approach. First, we experiment with an artificially-created dataset for which we have groundtruth information, and thus classification accuracy can be precisely evaluated. Then, we test the system with the Reality Mining data, which lacks groundtruth information.
   The first set of experiments studies classification accuracy. Starting from the 30 predefined labels expressing user patterns described above, we create a testing set on which to extract and classify topics. More in detail, we select L labels. For each label we create 15 days (following the same procedure described above) and we stack the 15 · L days together to create an artificial dataset of a user's whereabouts. The L labels represent the groundtruth user whereabouts patterns.
   We extract L topics from this dataset and classify the topics with the kNN algorithm (in this experiment we just use k=1). The expected result is to classify the extracted topics with the same labels used to create the dataset. For each experiment we compute classification accuracy as:

      |{classified labels} ∩ {groundtruth labels}| / |{classified labels}|
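The classification step of the methodology (kNN over the label-defined topics, with Kullback-Leibler divergence as the distance metric) can be sketched as follows for k=1. The representation of a topic as a dict mapping words to probabilities, the function names, and the ε-smoothing used for words absent from one of the two distributions are our assumptions, not details from the paper.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two topics given as word -> probability
    dicts; missing words are smoothed with a small epsilon."""
    vocab = set(p) | set(q)
    return sum(p.get(w, eps) * math.log(p.get(w, eps) / q.get(w, eps))
               for w in vocab)

def classify_topic(topic, label_defined_topics):
    """1-NN classification: return the predefined label whose
    label-defined topic is closest to `topic` under KL divergence."""
    return min(label_defined_topics,
               key=lambda lbl: kl_divergence(topic, label_defined_topics[lbl]))

# toy label-defined topics (word probabilities are illustrative)
label_topics = {
    'W 9:00 - 18:00':  {'3WWW': 0.5, '4WWW': 0.3, '5WWW': 0.2},
    'H 12:00 - 14:00': {'5HHH': 0.7, '4HHH': 0.3},
}
print(classify_topic({'3WWW': 0.6, '4WWW': 0.4}, label_topics))  # W 9:00 - 18:00
```

Note that KL divergence is asymmetric; we compute it from the extracted topic to each label-defined topic, which is one plausible reading of the paper's "distance metric".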
      Figure 5. Classification accuracy and number of patterns used to create the testing datasets. Minimum and maximum accuracies give a fair measure of the variance in our results.

      Figure 6. Average classification accuracy as a function of the noise introduced (top: 5 patterns; bottom: 30 patterns). The x-axis represents the percentage of hours (with respect to the time-slot length indicated by the label) that has been randomly changed.

   All the results are averaged over 100 runs of the experiment, in which we generate random groundtruth topics and random days containing those topics.
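The generation of synthetic testing days used in these runs (15 sample days per predefined label, as in step 2 of the methodology, plus the noise injection described below) could look like the following sketch. The function signature, the uniform random fill of the non-pattern hours, and the uniform choice of replacement locations are our assumptions.

```python
import random

LOCATIONS = ['H', 'W', 'E', 'N']

def make_day(place, start, end, noise=0.0, rng=random):
    """Create one synthetic 24-slot day for a predefined label such
    as 'W 9:00 - 18:00' (place='W', start=9, end=18): the pattern
    hours are fixed to `place`, the remaining hours are random, and
    `noise` is the fraction of pattern hours corrupted with a
    random other location."""
    day = [rng.choice(LOCATIONS) for _ in range(24)]
    for h in range(start, end):
        day[h] = place
    # corrupt noise% of the pattern time-slots
    corrupted = rng.sample(range(start, end), int(noise * (end - start)))
    for h in corrupted:
        day[h] = rng.choice([l for l in LOCATIONS if l != place])
    return ''.join(day)

# 15 sample days for the label 'W 9:00 - 18:00', no noise
days = [make_day('W', 9, 18) for _ in range(15)]
```

Stacking 15 · L such days for L selected labels reproduces the kind of artificial testing dataset evaluated in Figures 5 and 6.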
   Figure 5 shows the average classification accuracy as a function of the number of patterns used to create the testing dataset. The loss in classification accuracy observed with a small number of patterns is due to the fact that our algorithm tends to misclassify labels representing short time-frame-labels. For example, if the predefined label is 'H 12:00 - 14:00', a testing day could be 'EEEHHHHHWWWHHHWWWWWHHHHH', which is better represented by a topic described by the label 'W 9:00 - 18:00' or 'H 20:00 - 24:00'. This fact is also reflected by the higher variation between the minimum and maximum classification accuracy obtained with a small number of patterns.
   To further test the approach in this experimental setting, we add artificial noise to the testing dataset. The idea is to corrupt the underlying pattern to see to what extent the LDA algorithm and our classification mechanism are able to generalize the pattern. In particular, once a day has been created, we change noise% of the labels in the pattern time-slots to random other labels. For example, if the label expressing the testing-set days is 'E 18:00 - 24:00', a sample day for this experiment could be 'HHHHHHHHWWWWWWWWWEEEEHHH', expressing that the user was somewhere else only until 22:00 and then went back home (noise = 3/7 = 42%).
   Figure 6 shows the classification accuracy obtained at different noise levels in the case of datasets composed of 5 and 30 patterns respectively. As mentioned above, the variation between minimum and maximum classification accuracy is higher with a small number of patterns than with a large one. In addition, a higher number of patterns is less affected by the introduction of artificial noise in the testing dataset.
   In a second group of experiments, we test our classification method on the Reality Mining dataset.
   We experiment with 36 individuals and 121 consecutive days (from 26-08-2004 to 21-12-2004). We chose this subset of days with the goal of analyzing people and days for which the data is reasonably complete, and with the goal of comparing our results with those presented in [8], which takes into consideration the same subset of days. We also complete missing values with the SVM mechanism described in the previous section.
   Since groundtruth information regarding user topics is not available, our experiments on the Reality Mining dataset focus on two main aspects:

 1. We first evaluate whether the predefined labels associated with the user's topics are a good representation of her days. In other words, we verify whether the labels are informative enough to reconstruct the day of the user.

 2. We evaluate whether there are other labels describing user days better than the ones selected by our approach.

   With regard to the former aspect, we extract 100 LDA-topics from all the days of each user taken into consideration. For each day d we rank the topics z according to p(d|z). Starting from the topic z with the highest p(d|z), we reconstruct the day d according to the pattern-label and the time-frame-label associated with z. If a part of the day has already been reconstructed by a previous (more probable) topic, the mechanism leaves it unchanged. Parts of the day that do not appear in the considered topics' labels are not reconstructed.

      Figure 7. Average day reconstruction accuracy computed over all users and days.

      Figure 8. Average day reconstruction accuracy as a function of the number of topics extracted from the user's dataset.

   We then compare the real and the reconstructed day. For each time-slot (hour) we assign an error equal to 1 if the reconstructed label is wrong, and an error equal to 0.5 if that hour is not reconstructed. The idea is that it is better
any other label in 80% of the cases.
   All these results support the use of our classification mechanism. Our classification can notably simplify the comparison of users' routine characteristics, in that it allows direct comparison between the topics' meaningful labels.

4    Related Work
for the algorithm not to reconstruct a part of the day rather
than reconstructing it wrong. Figure 7 shows the distribu-                            The availability of affordable localization mechanisms
tion of days reconstruction accuracy: an average of 80% is                        and the recognition of location as a primary source of con-
obtained.                                                                         text information has stimulated a wealth of work. In particu-
    Figure 8 shows days reconstruction accuracy as a func-                        lar, the core problem addressed in this paper: understanding
tion of the number of LDA-topics extracted from users’                            users’ whereabouts.
days. The low number of LDA-topics necessary to obtain
high accuracy level can be explained by the limited number                        4.1                   Identifying Places
of users’ days available and by their repetitiveness.
    With regard to the latter aspect of the Reality Mining                           Several researches tackle the problem of understanding
experiments, we extract 100 LDA-topics from all the days                          people whereabouts by trying to extract and identify those
of each user taken into consideration. For each day d, we                         places that matter to the user. Mainstream approaches are
want to find the topic that best describes that day. We rank                      either based on segmenting and clustering GPS-traces to in-
topics z according to both p(d|z) and the length of the time-                     fer what are the places relevant to the user [1, 18], or on
frame-label assigned to the topic. The idea is that topics                        detecting places and mobility on the basis of nearby RF-
explaining a bigger part of the user day are to be preferred.                     beacons such as WiFi and GSM towers [13, 7]. These ap-
    The most probable topic ztop is selected for describing                       proaches require the user to run a special software on her
the day. We then evaluate how good the label assigned to                          device to collect and analyze the log of GPS or RF-beacons
ztop describes the day. In particular, for each time-slot in                      available. Thus experiments with these mechanisms are
the time-frame-label we assign an error equals to 1 if the                        usually conducted with a relatively small user population
reconstructed label is wrong. For each time-slot that is not                      (the Reality Mining dataset used also in this paper is by far
in the time-frame-label we give an error of 0.5. The idea is                      one of largest datasets in this category).
to lower the performance of topics associated to short time-                         Starting from these results, another important area of re-
frame-label, thus describing only a fraction of the day. Fi-                      search concerns the problem of converting from places de-
nally, we evaluate with the same error measure if there ex-                       scribed in terms of geographical coordinates or abstract IDs
ists another label, better describing the day. We obtain that                     (e.g., GSM tower ID) to semantically-rich places such as:
the labels associated to the selected topics are better than                      “home” or “favorite pub”. The work described in [1, 14]


                                                                              6
adopt a probabilistic model to automatically attach semantic labels to visited places. These works rely on the fact that semantic information can be added by exploiting the structure of a person’s daily routine. For example, the place where the user usually spends the night can be tagged semantically as “home”, while the place where the user usually goes from 8:00 to 18:00 can be tagged as “work”. In [1] further information is extracted by geocoding the place and mining the Web in search of relevant information.

In summary, these approaches allow representing and understanding users’ mobility patterns as a sequence of places visited at different times of the day. This representation resembles the “Home, Work, Elsewhere, No Signal” representation used in the Reality Mining dataset.

Accordingly, while this paper focuses on higher-level abstractions (the sequence of places being visited is our starting point), the above approaches are the fundamental elements needed to apply our algorithms to other mobility datasets.

4.2   Identifying Routes

A number of related works deal with the problem of identifying the routes the user takes to move from one place to another. These approaches run data mining algorithms to identify recurrent patterns in the GPS tracks of multiple users. Works in this area can be divided into two broad categories. “Geometric-based” approaches [9] apply pattern matching to the sequence of geographical coordinates composing the tracks; we call them geometric in that they use the physical “shape” of the path to compute the matching among routes. “String-based” approaches, instead, create a symbolic representation of the path (e.g., by considering only the names of the areas crossed by the path) and apply pattern matching on that list of symbols [11]. In both cases, the extracted routes can be used to classify the user’s current and past whereabouts.

The work presented in this paper is similar to “string-based” approaches, in that the geographic information about the places visited by the users is lost in favor of the more compact “Home, Work, Elsewhere, No Signal” representation. However, there are two important differences. On the one hand, our representation is even more abstract than the one discussed in [11] and similar works: labels in our dataset (e.g., Home) are completely detached from their physical location and, in fact, different users will label as “Home” completely different places. On the other hand, our classification algorithm could also extend the above related work to obtain a more descriptive label for the extracted routes.

4.3   Identifying Routines

Even more high-level than extracting places and routes is the problem of identifying routine whereabouts behaviors.

The CitySense project (http://www.citysense.com) is based on GPS and WiFi localization and has the goal of monitoring and describing the city’s nightlife. In particular, the application identifies hot-spots in the city (e.g., popular bars and clubs) and compares the number of people located in a given area in real time with past measures, to determine the “activity level” of a given night. In similar work based on extremely large anonymized mobility data coming from telecom operators, the authors were able to extract the spatio-temporal dynamics of the city, highlighting where people usually go during the day. The authors were also able to identify the areas most visited by tourists during the day and the typical time of the visit [4, 12].

In this context, the works that most directly compare to our proposal are [7, 8]. They use PCA and LDA algorithms to extract routine behavior from the Reality Mining dataset. Our work provides a further classification step to give meaningful labels to the extracted routines.

5   Conclusion and Future Work

In this paper we presented a methodology to automatically classify, with meaningful labels, the routine whereabouts extracted from a large mobility dataset.

Our future work in this area will target three main directions:

   1. We will apply the presented approaches to “live” datasets such as those that can be acquired from Google Latitude (http://www.google.com/latitude) and Flickr [12].

   2. We will develop mechanisms to add/modify topic labels at run time, so as to enable the use of the system in a wide range of scenarios.

   3. We will develop Web-based visualization mechanisms to inspect and communicate whereabouts behaviors in an effective way [6].

The ultimate goal will be to create a real live Web application allowing different classes of users to see, understand and predict their own and other users’ whereabouts.

6   Acknowledgements

We thank Katayoun Farrahi for clarifications on their use of the LDA algorithm on the Reality Mining dataset.
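The day-reconstruction and scoring procedure of Section 3 can be sketched in a few lines of code. This is a minimal illustration under stated assumptions, not the authors’ implementation: the topic representation (a pattern label plus the set of hours in its time-frame-label) and all function names are hypothetical.

```python
# Illustrative sketch (not the paper's code) of day reconstruction from
# ranked LDA topics and of the per-slot error measure. A topic is assumed
# to be a (pattern_label, hours) pair, e.g. ('W', range(9, 18)).

def reconstruct_day(topics, n_slots=24):
    """Rebuild a day from topics sorted by decreasing p(d|z).

    Slots already filled by a more probable topic are left unchanged;
    slots covered by no topic stay None (not reconstructed).
    """
    day = [None] * n_slots
    for label, hours in topics:          # most probable topic first
        for h in hours:
            if day[h] is None:           # never overwrite an earlier topic
                day[h] = label
    return day

def reconstruction_accuracy(real_day, rebuilt_day):
    """Per-slot error: 1 if the label is wrong, 0.5 if not reconstructed.

    Not reconstructing a slot is penalized less than getting it wrong.
    """
    error = 0.0
    for real, rebuilt in zip(real_day, rebuilt_day):
        if rebuilt is None:
            error += 0.5
        elif rebuilt != real:
            error += 1.0
    return 1.0 - error / len(real_day)

# Example: a Home/Work/Elsewhere day and two topics ranked by p(d|z).
real = list("HHHHHHHHWWWWWWWWWEEEEHHH")
topics = [("W", range(8, 17)), ("H", list(range(0, 8)) + [21, 22, 23])]
rebuilt = reconstruct_day(topics)
print(round(reconstruction_accuracy(real, rebuilt), 3))  # -> 0.917
```

In this example the four ‘E’ hours are covered by no topic, so they each contribute an error of 0.5 rather than 1, reflecting the preference for leaving a slot unreconstructed over reconstructing it wrongly.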
References

 [1] N. Bicocchi, G. Castelli, M. Mamei, A. Rosi, and F. Zambonelli. Supporting location-aware services for mobile users with the whereabouts diary. In International Conference on MOBILe Wireless MiddleWARE, Operating Systems, and Applications, Innsbruck, Austria, 2008.

 [2] C. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.

 [3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022, 2003.

 [4] F. Calabrese, J. Reades, and C. Ratti. Eigenplaces: analysing cities using the space-time structure of the mobile phone network. IEEE Pervasive Computing, 9(1):78–84, 2010.

 [5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

 [6] A. Clear, R. Shannon, T. Holland, A. Quigley, S. Dobson, and P. Nixon. Situvis: A visual tool for modeling a user’s behaviour patterns in a pervasive environment. In International Conference on Pervasive Computing, Nara, Japan, 2009.

 [7] N. Eagle and A. Pentland. Eigenbehaviors: Identifying structure in routine. Behavioral Ecology and Sociobiology, 63(7):1057–1066, 2009.

 [8] K. Farrahi and D. Gatica-Perez. Learning and predicting multimodal daily life patterns from cell phones. In International Conference on Multimodal Interfaces (ICMI-MLMI), Cambridge (MA), USA, 2009.

 [9] J. Froehlich and J. Krumm. Route prediction from trip observations. In Intelligent Vehicle Initiative, Technology Advanced Controls and Navigation Systems, SAE World Congress and Exhibition, Detroit (MI), USA, 2008.

[10] J. Gemmell, G. Bell, and R. Lueder. MyLifeBits: a personal database for everything. Communications of the ACM, 49(1):88–95, 2006.

[11] F. Giannotti, M. Nanni, D. Pedreschi, and F. Pinelli. Trajectory pattern mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose (CA), USA, 2007.

[12] F. Girardin, J. Blat, F. Calabrese, F. D. Fiore, and C. Ratti. Digital footprinting: Uncovering tourists with user-generated content. IEEE Pervasive Computing, 7(4):36–43, 2008.

[13] D. Kim, J. Hightower, R. Govindan, and D. Estrin. Discovering semantically meaningful places from pervasive RF-beacons. In International Conference on Ubiquitous Computing, Orlando (FL), USA, 2009.

[14] L. Liao, D. Fox, and H. Kautz. Extracting places and activities from GPS traces using hierarchical conditional random fields. International Journal of Robotics Research, 26(1):119–134, 2007.

[15] M. Massimi, K. Truong, D. Dearman, and G. Hayes. Understanding recording technologies in everyday life. IEEE Pervasive Computing, pre-print, 2010.

[16] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In ACM International Conference on Knowledge Discovery and Data Mining, San Jose (CA), USA, 2007.

[17] S. Patel, J. Kientz, G. Hayes, S. Bhat, and G. Abowd. Farther than you may think: An empirical investigation of the proximity of users to their mobile phones. In International Conference on Ubiquitous Computing, Orange County (CA), USA, 2006.

[18] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from GPS trajectories for mobile users. In International World Wide Web Conference, Madrid, Spain, 2009.