<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Begin</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Classification of Whereabouts Patterns From Large-scale Mobility Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laura Ferrari</string-name>
          <email>laura.ferrari@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Mamei</string-name>
          <email>marco.mamei@unimore.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DISMI, Universita` di Modena e Reggio Emilia</institution>
          ,
          <addr-line>Via Amendola 2, 42122 Reggio Emilia</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2004</year>
      </pub-date>
      <volume>200</volume>
      <fpage>4</fpage>
      <lpage>08</lpage>
      <abstract>
        <p>Classification of users' whereabouts patterns is important for many emerging ubiquitous computing applications. Latent Dirichlet Allocation (LDA) is a powerful mechanism to extract recurrent behaviors and high-level patterns (called topics) from mobility data in an unsupervised manner. One drawback of LDA is that it is difficult to give meaningful and usable labels to the extracted topics. We present a methodology to automatically classify the topics with meaningful labels so as to support their use in applications. This mechanism is tested and evaluated using the Reality Mining dataset consisting of about 350000 hours of continuous data on human behavior.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        The recent diffusion of smart phones equipped with
localization capabilities allows to collect data about mobility
and whereabouts in an economically feasible and
unobtrusive way from a large user population [
        <xref ref-type="bibr" rid="ref15 ref17">17, 15</xref>
        ]. Several
institutions, both in academia and industry, are exploiting this
technology to create applications to collect logs of people’s
whereabouts. This information opens new scenarios and
possibilities in the development of context-aware services
and applications, but several challenges need to be
tackled to extract practically useful information from such large
mobility datasets. Accordingly, one of the key research is to
develop and apply pattern analysis algorithms to such data.
However, this kind of research has been impeded until
recently by the sheer availability of mobility data from a large
user population.
      </p>
      <p>The Reality Mining dataset is a seminal dataset in this
area. It collects data about the daily life of 97 users over
10 months. Some pioneering researches started to apply
pattern-analysis and data mining algorithm to such mobility
dataset in order to extract high-level information and routine
behaviors.</p>
      <p>
        A number of these researches focus on two “similar”
techniques: Principal Component Analysis (PCA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
Latent Dirichlet Allocation (LDA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The goal of both
these techniques is to discover significant patterns and
features from the input data. More precisely, from a
maximumlikelihood perspective, both these techniques aim at
identifying a set of latent variables z and conditional probability
distributions p(xjz) for the observed variable x
representing users’ whereabouts. The latent variables z are typically
of a much lower dimensionality than x. Thus they encode
patterns in a more understandable way with reduced noise.
These techniques have been applied to a variety of people
mobility datasets [
        <xref ref-type="bibr" rid="ref4 ref7 ref8">7, 8, 4</xref>
        ] with similar modalities.
      </p>
      <p>In the context of the Reality Mining dataset (that is
also the target of our work), the approach consists of
extracting for each user and for each day a 24-slots
array indicating where the user was at a given time of the
day (24 hours). User’s locations are expressed as either:
‘Home’,‘Work’,‘Elsewhere’ or ‘No Signal’, the latter
indicating lack of data (in the remainder of this paper we refer
to these locations respectively as ‘H’, ‘W’,‘E’,‘N’). For
example, a typical day of a user could be
‘HHHHHHHHHWWWWWWWWWEEEHHH’ expressing the user was at
home at night and early morning, then went to work until
late afternoon, then went to somewhere else for three hours,
and finally went back home (see next Section for further
details on the dataset).</p>
      <p>Applying PCA or LDA to a set of these arrays allows to
extract some low-dimensions latent variables (eigenvectors
and LDA-topics respectively) representing underlying
patterns in the data, and offering conditional probability
distributions for the observed arrays (i.e., days). Figure 1
illustrates some eigenvectors and LDA-topics extracted from
the Reality Mining dataset.</p>
      <p>Eigenvectors, leftmost part of Figure 1, encode the
probability of the user being at a given location: ‘H’, ‘W’,‘E’. In
the picture the lighter the color, the higher the probability.</p>
      <p>Similarly, LDA-topics encode the probability of the user
being at a given location (in a different representation
format – see next Section for details). The rightmost part of
Figure 1 shows the most probable days according to the
conditional probability distribution of a given topic. These days
are thus a representation of the topic itself.</p>
      <p>
        Assigning a meaning to the extracted latent variables is
a difficult task, that has been typically performed by
visually inspecting the latent variable itself or the days that
strongly correlate to the variable (i.e., most probable days
given the latent variable) [
        <xref ref-type="bibr" rid="ref4 ref7 ref8">7, 8, 4</xref>
        ]. For example in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ],
by visually inspecting the first eigenbehavior represented
in Figure 1, authors conclude that it relates to the typical
weekday behavior consisting of being at work from 10:00
to 20:00 and being at home in the remaining part of the day.
The second eigenbehavior corresponds to typical weekend
behavior where the subject remains at home past 10:00 in
the morning and is out in the town (elsewhere) later that
evening. Similarly, the rightmost part of Figure 1 reports
some LDA-topics obtained in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. By visually inspecting
the days strongly correlated to the extracted topics, authors
conclude that topic 20 means “at home in the morning”
while topic 46 means “at work in the afternoon until late
in the evening”. Looking at these examples, it is clear how
difficult it is to give a meaning to the extracted patterns and
how difficult it is to evaluate the quality of the given
meaning (i.e., label).
      </p>
      <p>
        The need to associate meaningful labels to topics have
been also considered in text-mining applications. The work
presented in [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] tries to classify latent topics extracted
from text corpora. Although this work applies to a
completely different scenario, the fact that topic understanding
is an important research challenge also in other
communities further motivates our work.
      </p>
      <p>The contribution of this paper is to present a
methodology to automatically classify the extracted topics without</p>
    </sec>
    <sec id="sec-2">
      <title>CellID</title>
      <p>
        102
121
...
any visual inspection or user involvement. Once
meaningful labels are given to the topics, the extracted pattern
becomes readily understandable and usable in applications.
For example, life-log applications [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] could readily use
the extracted label to automatically create an entry in the
user blog. Similarly, analyzing city-wide mobility patterns,
applications could identify routine behaviors affecting
citylife and communicate such information to local government
and city planners. These tasks are simplified once a proper
label is assigned to the discovered patterns, while they are
very difficult starting from the extracted eigenbehaviors and
topics in Figure 1.
      </p>
      <p>The proposed classification methodology is a powerful
tool in that it allows to express what is going to happen to
the user (i.e., which topic is going to be expressed) with
high-level meaningful labels.</p>
      <p>
        Despite the fact that the presented approach is
generalizable both to PCA and LDA, in the following of this paper
we focus only on the LDA application. The probabilistic
model realized by LDA is better suited at extracting
different patterns from complex datasets [
        <xref ref-type="bibr" rid="ref3 ref8">3, 8</xref>
        ].
2
      </p>
      <sec id="sec-2-1">
        <title>Data Preprocessing and LDA</title>
        <p>In this section we present the data and algorithms
representing the starting point of our work.
2.1</p>
        <sec id="sec-2-1-1">
          <title>Data Preprocessing</title>
          <p>The work presented in this paper is based on the
GSMlocalization part of the Reality Mining dataset. This dataset
basically consists of two big tables (see Figure 2). For each
user are recorded several time-frames and the GSM
towers where the user was connected. Tables have missing
data (time-frames in which no information has been logged)
due to data corruption and powered-off devices. On
average logs account for approximatively 85% of the time that
the data collection lasted. Another table records the labels
given by the users to the different GSM towers. Not all the
towers are labeled. The dataset comprises 32628 GSM
towers and only 825 are labeled (2.5%). Fortunately labeled
cells are those in which users spend most of the time so
overall 75% of the dataset happens to be in labeled cells.
Still, identifying where the users have been in the
remaining 25% of the time is an important issue to improve the
data.</p>
          <p>
            Although some works [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] try to extract patterns directly
from such data (considering as “Elsewhere” all the
unlabeled towers), we opted to run some preprocessing to
complete missing values. In particular, we train a SVM to infer
the labels of all the GSM towers. SVM computations have
been performed with the LIBSVM library [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], on the basis
of the following procedure:
1. We create training and testing set as:
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Day of Week</title>
      <p>Tuesday</p>
      <p>Saturday
Wednesday</p>
    </sec>
    <sec id="sec-4">
      <title>Weekend</title>
      <p>no
yes
no</p>
    </sec>
    <sec id="sec-5">
      <title>Hour</title>
      <p>14
17
15</p>
    </sec>
    <sec id="sec-6">
      <title>CellID</title>
      <p>150
950
155</p>
    </sec>
    <sec id="sec-7">
      <title>Label</title>
      <p>
        Work
Home
?
The table associates the label to be identified to a
feature vector consisting of the day of the week when the
tower is visited, whether it is a weekend or not, the
hour of the visit, and the cell ID.
2. We conduct a simple scaling on the data, converting all
the values to a [
        <xref ref-type="bibr" rid="ref1">0,1</xref>
        ] scale. This is to avoid attributes in
greater numeric ranges dominating the others.
3. Following other examples in the SVM literature [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ],
we consider Radial Basis Function (RBF) kernel since
it well-applies to a vast range of classification
problems. In addition, we use cross-validation to find the
best parameters C and of SVM and RBF
respectively. Basically we try all the combinations of the
parameters with an exponentially-growing grid search.
Parameters producing best cross-validation accuracy
are selected for the final model.
4. We use the best parameters to train the SVM model on
the whole training set and to classify the testing set.
      </p>
      <p>SVM classification produces results with an overall 86%
accuracy in cross validation (accuracy being defined as the
proportion of correct results in the population). After
SVMclassification, the dataset is much more representative of
the plausible whereabouts of the users (missing groundtruth
information, we can not make assertions on their actual
whereabouts). For example, without SVM, a lot of days of
the users are described by being always “Elsewhere” (i.e.,
not at home nor work). This is rather unrealistic and in fact
SVM corrects this unbalance by restoring, for example, the
being-at-home-at-night behavior.</p>
      <p>slid.in..g wind..o.w ... ...</p>
      <p>H H H H H H H H H W W W W W W W W W E E E H H H
bag of words
8HHH</p>
      <p>
        Following [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ], we organize the dataset into a sequence
of days each consisting of 24 time-slots lasting 1 hour. Each
time slot is labeled after the cell where the user spends most
of his time. If no information are present for that time slot,
the cell is marked as ‘No-Signal’.
      </p>
      <p>To apply the LDA algorithm described in the following
section, the dataset has been further processed. Each day
is divided into a sequence of words each representing a 3
hours time-slot. A 3-hours sliding window runs across the
day, each word is composed of an integer value in (1,8) (we
refer to this value as time-period) and the 3 (‘H’,‘W’,‘E’ or
‘N’) labels in the sliding window. The time-period abstracts
the time of the day, it is 1 if the sliding window starts in
0:00am - 3:00am, 2 in 3:00am - 6:00am, and so on (see
Figure 3).</p>
      <p>
        The fact of using time-periods of 3 hours each (in
contrast with some previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] in which different
timeperiods are skewed) improves the resulting dataset in that
the number of words for each time slot is not biased by its
length, but better reflects the actual user behavior.
      </p>
      <p>The resulting bag of words summarizes the original
dataset and is the input data structure for the LDA algorithm
described in the next subsection.
2.2</p>
      <sec id="sec-7-1">
        <title>LDA Algorithm</title>
        <p>
          LDA is a probabilistic generative model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] used to
cluster documents according to the topics (i.e., word patterns)
they contain. The work in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] proposes using this model to
extract mobility patterns from a mobility dataset.
        </p>
        <p>
          LDA is based on the Bayesian network depicted in
Figure 4. A word w is the basic unit of data, representing user
location at a given time-period (see bag of words in
Figure 3). A set of N words defines a day of the user. Each
user has a dataset consisting of M documents. Each day
is viewed as a mixture of topics z, where topics are
distributions over words (i.e., each topic can be represented
by the list of words associated to the probability p(wjz)).
For each day i, the probability of a word wij is given by
p(wij ) = PtT=1 p(wij jzit)p(zit), where T is the number
of topics. p(wij jzit) and p(zit) are assumed to have
Multinomial distributions with hyperparameters and
respectively. LDA uses the EM-algorithms [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] to learn the model
parameters. In our implementation we use the library Mallet
(http://mallet.cs.umass.edu) to perform these computations.
        </p>
        <p>Once the model parameters have been found, Bayesian
deduction allows to extract the topics best describing the
routines of a given day (rank z on the basis of p(djz)).
However, as already introduced, since z are just
distributions over words, it is difficult to give them an immediate
meaning useful in applications. The next section contains
the main contribution of this paper: giving a meaningful
label to topics z.
3
3.1</p>
      </sec>
      <sec id="sec-7-2">
        <title>Method</title>
        <sec id="sec-7-2-1">
          <title>LDA-Topic Classification</title>
          <p>In extreme summary, our approach consists of
identifying a set of labels describing the main trends of a user
typical day. For example, ‘Work 9:00 - 18:00’ represents a
day in which the user is at work from 9:00 to 18:00. We
then identify the LDA-topics representing such a label (we
refer to these topics as label-defined topics). LDA-topics
extracted from the Reality Mining dataset are labeled after
the most similar label-defined topics. Thus, rather than
being described only in terms of probability distributions over
words, topics get a compact description like ‘Work 9:00
18:00’.</p>
          <p>More in detail, our methodology is based on these key
points:
1. We create a set of 30 predefined labels each composed
of a place (‘H’, ‘W’, ‘E’ or ‘N’; we refer to this places
as pattern-label) and a time-frame (we refer to it as
time-frame-label). For example, the label ‘W 9:00
18:00’ represents the pattern where the user is at work
from 9:00 to 18:00 while the label ‘H 12:00 - 14:00’
represents the pattern where the user is at home at
lunch time. We choose these labels by visually
inspecting the recurrent patterns in the Reality Mining users’
days.
2. For each predefined label, we create a set of 15
sample days representing the corresponding daily behavior
(each day is represented as a 24-slots array indicating
where the user was at a given time of the day). The
different days keep the pattern indicated by the label
constant, and change the remaining part of the day.
3. For each block of 15 days, we compute one
LDAtopic. The final result is a set of 30 label-defined
topics, each representing one of the predefined labels.
For example, recalling that topics are distributions over
words, the days associated with the ‘W 9:00 - 18:00’
pattern will create a topic in which the p(wjz) of words
like 3WWW, 4WWW, 5WWW will be high compared to
other words probabilities. We verified that the
resulting topics are not strongly affected if being produced
by a number of days smaller or greater than 15.
4. Topics extracted from the Reality Mining dataset are
classified using k-Nearest Neighbor (kNN) and the
Kullback-Leibler divergence as distance metric from
the above label-defined topics.
5. The final result is that days of a users can be
described as easily-understandable labels (e.g., ‘W 9:00
18:00’), rather than with just probability distributions
over words that are much more complex to be
interpreted.
3.2</p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>Experiments and Discussion</title>
        <p>We conduct some experiments to test the above
approach. First, we experiment with an artificially-created
dataset where we get groundtruth information, and thus
classification accuracy can be precisely evaluated. Then,
we test the system with the Reality Mining data that misses
groundtruth information.</p>
        <p>The first set of experiments studies classification
accuracy. Starting from the above described 30 predefined labels
expressing user patterns, we create a testing set on which to
extract and classify topics. More in detail, we select L
labels. For each label we create 15 days (following the same
procedure described above) and we stack the 15 L days
together to create an artificial dataset of user’s whereabouts.
The L labels represent the groundtruth user’s whereabouts
patterns.</p>
        <p>We extract L topics from this dataset and classify the
topics with the kNN algorithm (in this experiment with just
use k=1). The expected result is to classify the extracted
topics with the same labels used to create the dataset. For
each experiment we compute classification accuracy as:
jfclassi ed labelsg \ fgroundtruth labelgj
jfclassi ed labelsgj
100
90
80
70
cy 60
a
rccu 50
a
% 40
30
20
10
0
5
max
min
10</p>
        <p>20
numberof patterns
30</p>
        <p>All the results are averaged over 100 runs of the
experiment in which we generate random groundtruth topics and
random days containing that topics.</p>
        <p>Figure 5 shows the average classification accuracy as a
function of the number of patterns used to create the
testing dataset. The loss in the accuracy classification
obtained with a little number of patterns is due to the fact
that our algorithm tends to misclassify labels representing
short time-frame-label. For example, if the predefined label
is ‘H 12:00 - 14:00’, a testing day could be
‘EEEHHHHHWWWHHHWWWWWHHHHH’ which is better
represented by a topic described by the label ‘W 9:00 - 18:00’
or ‘H 20:00 - 24:00’.This fact is also reflected by the higher
variation between the minimum and maximum
classification accuracy obtained with a little number of patterns.</p>
        <p>To further test the approach in this experimental setting,
we add artificial noise to the testing dataset. The idea is
to corrupt the underlying pattern to see that what extent
the LDA algorithm and our classification mechanism are
able to generalize the pattern. In particular, once a day has
been created, we change noise% of the labels in the
pattern time-slots with random other labels. For example, if
the label expressing the testing set days is ‘E 18:00 - 24:00’
a sample day for this experiment could be
‘HHHHHHHHWWWWWWWWWEEEEHHH’ expressing the user was
somewhere else only until 22:00 and then went back home
(noise = 3/7 = 42%).</p>
        <p>Figure 6 shows the classification accuracy obtained at
different noise levels in the case of datasets composed of 5
and 30 patterns respectively. As above mentioned, the
variation between minimum and maximum classification
accuracy is higher with a little number of patterns than with a lot
of them. In addition, a higher number of pattern results less
affected by the introduction of artificial noise in the testing
10%
20%
40%
50%
dataset.</p>
        <p>In a second group of experiments, we test our
classification method on the Reality Mining dataset.</p>
        <p>
          We experiment with 36 individuals and 121 consecutive
days (from 26-08-2004 to 21-12-2004). We chose this
subset of days with the goal of analyzing people and days for
which the data is reasonably complete and with the goal
of comparing our results with those presented in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] taking
into consideration the same subset of days. We also
complete missing values with the SVM mechanism described in
the previous section.
        </p>
        <p>Since groundtruth information regarding user topics are
not available, our experiments on the Reality Mining dataset
focus on two main aspects:
1. We first evaluate whether the predefined labels
associated to the user’s topics are a good representation of
her days. In other words we verify whether the labels
are informative enough to reconstruct the day of the
user.
2. We evaluate if there are other labels describing user
days better then the ones selected by our approach.</p>
        <p>With regard to the former aspect, we extract 100
LDAtopics from all the days of each user taken into
consideration. For each day d we rank the topic z according to
p(djz). Starting from he topic z with higher p(djz), we
reconstruct the day d according to the pattern-label and the
time-frame-label associated to z. If a part of the day has
0-10 10-20 20-30 30-40 40-50 50-60 60-70 70-80 80-90 90-100
% accuracy
1
10 20 30 40 50 60 70 80 90 100
number of topics
been already reconstructed by a previous (more probable)
topic, the mechanisms just left it unchanged. Parts of the
day that do not appear in the considered topics labels are
not reconstructed.</p>
        <p>We then compare the real and the reconstructed day. For
each time-slot (hour) we assign an error equals to 1 if the
reconstructed label is wrong. While an error equals to 0.5 if
that hours is not reconstructed. The idea is that it is better
for the algorithm not to reconstruct a part of the day rather
than reconstructing it wrong. Figure 7 shows the
distribution of days reconstruction accuracy: an average of 80% is
obtained.</p>
        <p>Figure 8 shows days reconstruction accuracy as a
function of the number of LDA-topics extracted from users’
days. The low number of LDA-topics necessary to obtain
high accuracy level can be explained by the limited number
of users’ days available and by their repetitiveness.</p>
        <p>With regard to the latter aspect of the Reality Mining
experiments, we extract 100 LDA-topics from all the days
of each user taken into consideration. For each day d, we
want to find the topic that best describes that day. We rank
topics z according to both p(djz) and the length of the
timeframe-label assigned to the topic. The idea is that topics
explaining a bigger part of the user day are to be preferred.</p>
        <p>The most probable topic ztop is selected for describing
the day. We then evaluate how good the label assigned to
ztop describes the day. In particular, for each time-slot in
the time-frame-label we assign an error equals to 1 if the
reconstructed label is wrong. For each time-slot that is not
in the time-frame-label we give an error of 0.5. The idea is
to lower the performance of topics associated to short
timeframe-label, thus describing only a fraction of the day.
Finally, we evaluate with the same error measure if there
exists another label, better describing the day. We obtain that
the labels associated to the selected topics are better than
any other label in the 80% of the cases.</p>
        <p>All these results support the use of our classification
mechanism. Our classification can notably simplify the
comparison of users’ routines characteristic, in that it allows
direct comparison between topics meaningful labels.
4</p>
        <sec id="sec-7-3-1">
          <title>Related Work</title>
          <p>The availability of affordable localization mechanisms
and the recognition of location as a primary source of
context information has stimulated a wealth of work. In
particular, the core problem addressed in this paper: understanding
users’ whereabouts.
4.1</p>
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>Identifying Places</title>
        <p>
          Several researches tackle the problem of understanding
people whereabouts by trying to extract and identify those
places that matter to the user. Mainstream approaches are
either based on segmenting and clustering GPS-traces to
infer what are the places relevant to the user [
          <xref ref-type="bibr" rid="ref1 ref18">1, 18</xref>
          ], or on
detecting places and mobility on the basis of nearby
RFbeacons such as WiFi and GSM towers [
          <xref ref-type="bibr" rid="ref13 ref7">13, 7</xref>
          ]. These
approaches require the user to run a special software on her
device to collect and analyze the log of GPS or RF-beacons
available. Thus experiments with these mechanisms are
usually conducted with a relatively small user population
(the Reality Mining dataset used also in this paper is by far
one of largest datasets in this category).
        </p>
        <p>
          Starting from these results, another important area of
research concerns the problem of converting from places
described in terms of geographical coordinates or abstract IDs
(e.g., GSM tower ID) to semantically-rich places such as:
“home” or “favorite pub”. The work described in [
          <xref ref-type="bibr" rid="ref1 ref14">1, 14</xref>
          ]
adopt a probabilistic model to automatically attach
semantic labels to visited places. These works rely on the fact
that semantic information can be added by exploiting the
structure of a person’s daily routine. For example, the place
where the user usually spends the night can be tagged
semantically as “home”, while the place where the user
usually goes from 8:00 to 18:00 can be tagged as “work”. In
[
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] further information are extracted by geocoding the place
and mining the Web in search for relevant information.
        </p>
        <p>In summary, these approaches allow to represent and
understand users’ mobility patterns as a sequence of places
being visited at different times of the day. This
representation resembles the “Home, Work, Elsewhere, No
Signal”representation used in the Reality Mining dataset.</p>
        <p>Accordingly, while this paper focuses on higher-level
abstractions (the sequence of places being visited is our
starting point), the above approaches are the fundamental
elements to apply our algorithms to other mobility datasets.
4.2</p>
      </sec>
      <sec id="sec-7-5">
        <title>Identifying Routes</title>
        <p>
          A number of related work deals with the problem of
identifying the routes the user takes to move from one place
to another. These approaches run data mining algorithms
to identify recurrent patterns in the GPS tracks from
multiple users. Works in this area can be divided in 2 broad
categories: “geometric-based” approaches [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] apply pattern
matching to the sequence of geographical coordinates
composing the tracks. We call it geometric in that they use
the physical “shape” of the path to compute the matching
among routes. “String-based” approaches, instead, create
a symbolic representation of the path (e.g., by considering
only the names of the areas crosses by the path) and apply
pattern-matching on that list of symbols [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. In both cases,
the extracted routes can be used to classify the user current
and past whereabouts.
        </p>
        <p>
          The work presented in this paper is similar to
‘stringbased” approaches, in that the geographic information
about the places visited by the users are lost in favor of
the more compact “Home, Work, Elsewhere, No
Signal”representation. However, there are two important
differences. On the one hand, our representation is even more
abstract that the one discussed in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and similar works.
Labels in our dataset (e.g., Home) are completely detached
from their physical location and, in fact, different users
will label as “Home” completely different places. On the
other hand our classification algorithm could extend also
the above related work to obtained a more descriptive label
for the extracted routes.
4.3
        </p>
      </sec>
      <sec id="sec-7-6">
        <title>Identifying Routines</title>
        <p>Even more high-level than extracting places and routes,
is the problem of identifying routine whereabouts
behaviors.</p>
        <p>
          The CitySense project (http://www.citysense.com) is
based on GPS and WiFi localization and has the goal of
monitoring and describing the city’s nightlife. In particular,
the application identifies hot-spots in the city (e.g., popular
bars and clubs) and compares the number of people located
in a given area in real time with past measures, to determine
the “activity-level” of a given night. In a similar work based
on extremely large anonymized mobility data coming from
Telecom operators authors were able to extract the
spatiotemporal dynamics of the city, highlighting where people
usually go during the day. Authors were able also to
identify the most visited areas by tourists during the day and the
typical time of the visit [
          <xref ref-type="bibr" rid="ref12 ref4">4, 12</xref>
          ].
        </p>
        <p>
          In this context the works that most directly compare to
our proposal are [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. They use PCA and LDA algorithms
to extract routine behavior from the Reality Mining dataset.
Our work provides a further classification step to give
meaningful labels to the extracted routines.
5
        </p>
        <sec id="sec-7-6-1">
          <title>Conclusion and Future Work</title>
          <p>In this paper we presented a methodology to
automatically classify the routine whereabouts extracted from a large
mobility dataset with meaningful labels.</p>
          <p>
            Our future work in this area will target 3 main directions:
1. We will apply the presented approaches to “live”
datasets such as those that can be acquired from
Google Latitude (http://www.google.com/latitude) and
Flickr [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
2. We will develop mechanisms to add/modify topic
labels at run time, so as to enable the use of the system
in a wide range of scenarios.
3. We will develop Web-based visualization mechanisms
to inspect and communicate whereabouts behaviors in
an effective way [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
          <p>The ultimate goal will be to create a real live Web
application allowing different classes of users to see, understand
and predict their own and other users’ whereabouts.
6</p>
        </sec>
        <sec id="sec-7-6-2">
          <title>Acknowledgements</title>
          <p>We thank Katayoun Farrahi for clarifications on their use
of the LDA algorithm on the Reality Mining dataset.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Bicocchi</surname>
          </string-name>
          , G. Castelli,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mamei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rosi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Zambonelli</surname>
          </string-name>
          .
          <article-title>Supporting location-aware services for mobile users with the whereabouts diary</article-title>
          .
          <source>In International Conference on MOBILe Wireless MiddleWARE, Operating Systems, and Applications</source>
          , Innsbruck, Austria,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bishop</surname>
          </string-name>
          .
          <source>Pattern Recognition and Machine Learning</source>
          . Springer Verlag,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Blei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ng</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Jordan</surname>
          </string-name>
          .
          <article-title>Latent dirichlet allocation</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>993</fpage>
          -
          <lpage>1022</lpage>
          ,
          <year>2003</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>F.</given-names>
            <surname>Calabrese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Reades</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ratti</surname>
          </string-name>
          .
          <article-title>Eigenplaces: analysing cities using the space-time structure of the mobile phone network</article-title>
          .
          <source>IEEE Pervasive Computing</source>
          ,
          <volume>9</volume>
          (
          <issue>1</issue>
          ):
          <fpage>78</fpage>
          -
          <lpage>84</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>C.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          and
          <string-name>
            <given-names>C.-J.</given-names>
            <surname>Lin</surname>
          </string-name>
          .
          <article-title>LIBSVM: a library for support vector machines</article-title>
          ,
          <year>2001</year>
          . Software available at http://www.csie.ntu.edu.tw/ cjlin/libsvm.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Clear</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shannon</surname>
          </string-name>
          , T. Holland,
          <string-name>
            <given-names>A.</given-names>
            <surname>Quigley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dobson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Nixon</surname>
          </string-name>
          .
          <article-title>Situvis: A visual tool for modeling a user's behaviour patterns in a pervasive environment</article-title>
          .
          <source>In International Conference on Pervasive Computing</source>
          , Nara, Japan,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Eagle</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Pentland</surname>
          </string-name>
          . Eigenbehaviors:
          <article-title>Identifying structure in routine</article-title>
          .
          <source>Behavioral Ecology and Sociobiology</source>
          ,
          <volume>63</volume>
          (
          <issue>7</issue>
          ):
          <fpage>1057</fpage>
          -
          <lpage>1066</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Farrahi</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Gatica-Perez</surname>
          </string-name>
          .
          <article-title>Learning and predicting multimodal daily life patterns from cell phones</article-title>
          .
          <source>In International Conference on Multimodal Interfaces (ICMI-MLMI)</source>
          , Cambridge (MA), USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J.</given-names>
            <surname>Froehlich</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Krumm</surname>
          </string-name>
          .
          <article-title>Route prediction from trip observations</article-title>
          .
          <source>In Intelligent Vehicle Initiative, Technology Advanced Controls and Navigation Systems, SAE World Congress and Exhibition</source>
          , Detriot (MI), USA,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Gemmell</surname>
          </string-name>
          , G. Bell, and
          <string-name>
            <given-names>R.</given-names>
            <surname>Lueder</surname>
          </string-name>
          .
          <article-title>Mylifebits: a personal database for everything</article-title>
          .
          <source>Communications of the ACM</source>
          ,
          <volume>49</volume>
          (
          <issue>1</issue>
          ):
          <fpage>88</fpage>
          -
          <lpage>95</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Giannotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nanni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Pedreschi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          .
          <article-title>Trajectory pattern mining</article-title>
          .
          <source>In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , San Jose (CA), USA,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>F.</given-names>
            <surname>Girardin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Blat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Calabrese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. D.</given-names>
            <surname>Fiore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Ratti</surname>
          </string-name>
          .
          <article-title>Digital footprinting: Uncovering tourists with user-generated content</article-title>
          .
          <source>IEEE Pervasive Computing</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>36</fpage>
          -
          <lpage>43</lpage>
          ,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hightower</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Govindan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Estrin</surname>
          </string-name>
          .
          <article-title>Discovering semantically meaningful places from pervasive rf-beacons</article-title>
          .
          <source>In International Conference on Ubiquitous Computing</source>
          , Orlando (FL) ,USA,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fox</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Kautz</surname>
          </string-name>
          .
          <article-title>Extracting places and activities from gps traces using hierarchical conditional random fields</article-title>
          .
          <source>International Journal of Robotics Research</source>
          ,
          <volume>26</volume>
          (
          <issue>1</issue>
          ):
          <fpage>119</fpage>
          -
          <lpage>134</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Massimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Truong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dearman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Hayes</surname>
          </string-name>
          .
          <article-title>Understanding recording technologies in everyday life</article-title>
          .
          <source>IEEE Pervasive Computing, pre-print</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhai</surname>
          </string-name>
          .
          <article-title>Automatic labeling of multinomial topic models</article-title>
          .
          <source>In ACM International Conference on Knowledge Discovery and Data Mining</source>
          , San Jose (CA), USA,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Patel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kientz</surname>
          </string-name>
          , G. Hayes,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhat</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Abowd</surname>
          </string-name>
          .
          <article-title>Farther than you may think: An empirical investigation of the proximity of users to their mobile phones</article-title>
          .
          <source>In International Conference on Ubiquitous Computing</source>
          , Orange County (CA), USA,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xie</surname>
          </string-name>
          , and W.-Y. Ma.
          <article-title>Mining interesting locations and travel sequences from gps trajectories for mobile users</article-title>
          .
          <source>In International World Wide Web Conference</source>
          , Madrid, Spain,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>