=Paper=
{{Paper
|id=None
|storemode=property
|title=Classification of Whereabouts Patterns From Large-scale Mobility Data
|pdfUrl=https://ceur-ws.org/Vol-621/paper23.pdf
|volume=Vol-621
|dblpUrl=https://dblp.org/rec/conf/woa/FerrariM10
}}
==Classification of Whereabouts Patterns From Large-scale Mobility Data==
Laura Ferrari and Marco Mamei
DISMI, Università di Modena e Reggio Emilia
Via Amendola 2, 42122 Reggio Emilia, Italy
laura.ferrari@unimore.it, marco.mamei@unimore.it

Abstract

Classification of users' whereabouts patterns is important for many emerging ubiquitous computing applications. Latent Dirichlet Allocation (LDA) is a powerful mechanism to extract recurrent behaviors and high-level patterns (called topics) from mobility data in an unsupervised manner. One drawback of LDA is that it is difficult to give meaningful and usable labels to the extracted topics. We present a methodology to automatically classify the topics with meaningful labels so as to support their use in applications. This mechanism is tested and evaluated using the Reality Mining dataset, consisting of about 350000 hours of continuous data on human behavior.

1 Introduction

The recent diffusion of smart phones equipped with localization capabilities allows collecting data about mobility and whereabouts in an economically feasible and unobtrusive way from a large user population [17, 15]. Several institutions, both in academia and industry, are exploiting this technology to create applications that collect logs of people's whereabouts. This information opens new scenarios and possibilities in the development of context-aware services and applications, but several challenges need to be tackled to extract practically useful information from such large mobility datasets. Accordingly, one of the key research directions is to develop and apply pattern-analysis algorithms to such data. However, this kind of research has been impeded until recently by the limited availability of mobility data from a large user population.

The Reality Mining dataset is a seminal dataset in this area. It collects data about the daily life of 97 users over 10 months. Some pioneering research has started to apply pattern-analysis and data-mining algorithms to this mobility dataset in order to extract high-level information and routine behaviors.

A number of these works focus on two "similar" techniques: Principal Component Analysis (PCA) [2] and Latent Dirichlet Allocation (LDA) [3]. The goal of both techniques is to discover significant patterns and features in the input data. More precisely, from a maximum-likelihood perspective, both techniques aim at identifying a set of latent variables z and conditional probability distributions p(x|z) for the observed variable x representing users' whereabouts. The latent variables z are typically of a much lower dimensionality than x; thus they encode patterns in a more understandable way and with reduced noise. These techniques have been applied to a variety of people-mobility datasets [7, 8, 4] with similar modalities.

In the context of the Reality Mining dataset (which is also the target of our work), the approach consists of extracting, for each user and for each day, a 24-slot array indicating where the user was at a given time of the day (24 hours). The user's locations are expressed as either 'Home', 'Work', 'Elsewhere' or 'No Signal', the latter indicating lack of data (in the remainder of this paper we refer to these locations respectively as 'H', 'W', 'E', 'N'). For example, a typical day of a user could be 'HHHHHHHHHWWWWWWWWWEEEHHH', expressing that the user was at home at night and in the early morning, then went to work until late afternoon, then went somewhere else for three hours, and finally went back home (see the next Section for further details on the dataset).

Applying PCA or LDA to a set of these arrays allows extracting some low-dimensional latent variables (eigenvectors and LDA-topics respectively) representing underlying patterns in the data, and offering conditional probability distributions for the observed arrays (i.e., days). Figure 1 illustrates some eigenvectors and LDA-topics extracted from the Reality Mining dataset.

Figure 1. (left) The top two eigenbehaviors for Subject 4 of the Reality Mining dataset. (right) Exemplary LDA-topics – images respectively taken from [7] and [8].

Eigenvectors, in the leftmost part of Figure 1, encode the probability of the user being at a given location: 'H', 'W', 'E'. In the picture, the lighter the color, the higher the probability. Similarly, LDA-topics encode the probability of the user being at a given location (in a different representation format – see the next Section for details). The rightmost part of Figure 1 shows the most probable days according to the conditional probability distribution of a given topic. These days are thus a representation of the topic itself.

Assigning a meaning to the extracted latent variables is a difficult task that has typically been performed by visually inspecting the latent variable itself or the days that strongly correlate with the variable (i.e., the most probable days given the latent variable) [7, 8, 4]. For example, in [7], by visually inspecting the first eigenbehavior represented in Figure 1, the authors conclude that it relates to the typical weekday behavior consisting of being at work from 10:00 to 20:00 and being at home in the remaining part of the day. The second eigenbehavior corresponds to typical weekend behavior, where the subject remains at home past 10:00 in the morning and is out in the town (elsewhere) later that evening. Similarly, the rightmost part of Figure 1 reports some LDA-topics obtained in [8]. By visually inspecting the days strongly correlated with the extracted topics, the authors conclude that topic 20 means "at home in the morning" while topic 46 means "at work in the afternoon until late in the evening". Looking at these examples, it is clear how difficult it is to give a meaning to the extracted patterns and how difficult it is to evaluate the quality of the given meaning (i.e., label).

The need to associate meaningful labels to topics has also been considered in text-mining applications. The work presented in [16] tries to classify latent topics extracted from text corpora. Although that work applies to a completely different scenario, the fact that topic understanding is an important research challenge also in other communities further motivates our work.

The contribution of this paper is a methodology to automatically classify the extracted topics without any visual inspection or user involvement. Once meaningful labels are given to the topics, the extracted patterns become readily understandable and usable in applications. For example, life-log applications [10] could readily use the extracted label to automatically create an entry in the user's blog. Similarly, when analyzing city-wide mobility patterns, applications could identify routine behaviors affecting city life and communicate such information to local governments and city planners. These tasks are simplified once a proper label is assigned to the discovered patterns, while they are very difficult starting from the raw eigenbehaviors and topics in Figure 1.

The proposed classification methodology is a powerful tool in that it allows expressing what is going to happen to the user (i.e., which topic is going to be expressed) with high-level meaningful labels.

Although the presented approach generalizes to both PCA and LDA, in the rest of this paper we focus only on the LDA application. The probabilistic model realized by LDA is better suited to extracting different patterns from complex datasets [3, 8].

2 Data Preprocessing and LDA

In this section we present the data and algorithms representing the starting point of our work.

2.1 Data Preprocessing

The work presented in this paper is based on the GSM-localization part of the Reality Mining dataset. This dataset basically consists of two big tables (see Figure 2). For each user, several time-frames are recorded, together with the GSM tower to which the user was connected. The tables have missing data (time-frames in which no information has been logged) due to data corruption and powered-off devices. On average, the logs account for approximately 85% of the time that the data collection lasted. Another table records the labels given by the users to the different GSM towers.
Not all the towers are labeled. The dataset comprises 32628 GSM towers and only 825 of them are labeled (2.5%). Fortunately, the labeled cells are those in which users spend most of their time, so overall 75% of the dataset happens to fall in labeled cells. Still, identifying where the users have been in the remaining 25% of the time is an important issue to improve the data.

UserID   Begin              End                CellID
22       2004-08-27 14:00   2004-08-27 16:00   102
22       2004-08-27 16:30   2004-08-27 17:00   122
...      ...                ...                ...

UserID   CellID   Label
22       102      Home
22       121      Work
...      ...      ...

Figure 2. Tables used in the Reality Mining dataset.

Although some works [8] try to extract patterns directly from such data (considering as "Elsewhere" all the unlabeled towers), we opted to run some preprocessing to complete the missing values. In particular, we train an SVM to infer the labels of all the GSM towers. SVM computations have been performed with the LIBSVM library [5], on the basis of the following procedure:

1. We create the training and testing sets as:

   Day of Week   Weekend   Hour   CellID   Label
   Tuesday       no        14     150      Work
   Saturday      yes       17     950      Home
   Wednesday     no        15     155      ?

   The table associates the label to be identified with a feature vector consisting of the day of the week when the tower is visited, whether it is a weekend or not, the hour of the visit, and the cell ID.

2. We conduct a simple scaling of the data, converting all the values to a [0,1] scale. This avoids attributes in greater numeric ranges dominating the others.

3. Following other examples in the SVM literature [5], we adopt the Radial Basis Function (RBF) kernel, since it applies well to a vast range of classification problems. In addition, we use cross-validation to find the best parameters C and γ of the SVM and the RBF kernel respectively. Basically, we try all the combinations of the parameters with an exponentially-growing grid search. The parameters producing the best cross-validation accuracy are selected for the final model.

4. We use the best parameters to train the SVM model on the whole training set and to classify the testing set.

SVM classification produces results with an overall 86% accuracy in cross-validation (accuracy being defined as the proportion of correct results in the population). After SVM classification, the dataset is much more representative of the plausible whereabouts of the users (lacking groundtruth information, we cannot make assertions about their actual whereabouts). For example, without the SVM, many of the users' days are described as always "Elsewhere" (i.e., neither at home nor at work). This is rather unrealistic, and in fact the SVM corrects this imbalance by restoring, for example, the being-at-home-at-night behavior.

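As a concrete illustration of steps 1-4, the sketch below reproduces a grid-searched RBF-SVM with scikit-learn's SVC (a wrapper around LIBSVM). The feature rows, parameter grid and pipeline are our own illustrative assumptions, not the authors' exact setup.

```python
# Hypothetical sketch of the RBF-SVM procedure above (scikit-learn's SVC wraps LIBSVM).
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy training rows: [day_of_week (0-6), weekend (0/1), hour, cell_id] -> tower label.
X_train = np.array([[1, 0, 14, 150],
                    [5, 1, 17, 950],
                    [2, 0, 9, 150],
                    [6, 1, 23, 950]], dtype=float)
y_train = ["Work", "Home", "Work", "Home"]

pipeline = Pipeline([
    ("scale", MinMaxScaler()),   # step 2: rescale every attribute to [0, 1]
    ("svm", SVC(kernel="rbf")),  # step 3: RBF kernel
])

# Exponentially-growing grid over C and gamma, selected by cross-validation accuracy.
param_grid = {"svm__C": 2.0 ** np.arange(-5, 16, 2),
              "svm__gamma": 2.0 ** np.arange(-15, 4, 2)}
search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=2)
search.fit(X_train, y_train)

# Step 4: the best model labels the towers whose label is unknown ('?').
X_unlabeled = np.array([[2, 0, 15, 155]], dtype=float)
print(search.best_params_, search.predict(X_unlabeled))
```
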
Following [7, 8], we organize the dataset into a sequence of days, each consisting of 24 time-slots lasting 1 hour. Each time slot is labeled after the cell where the user spends most of his time. If no information is present for that time slot, the cell is marked as 'No-Signal'.

To apply the LDA algorithm described in the following section, the dataset has been further processed. Each day is divided into a sequence of words, each representing a 3-hour time-slot. A 3-hour sliding window runs across the day; each word is composed of an integer value in (1,8) (we refer to this value as the time-period) and the 3 labels ('H', 'W', 'E' or 'N') in the sliding window. The time-period abstracts the time of the day: it is 1 if the sliding window starts in 0:00am - 3:00am, 2 in 3:00am - 6:00am, and so on (see Figure 3).

Figure 3. Sliding windows approach.

Using time-periods of 3 hours each (in contrast with some previous work [8] in which the different time-periods are skewed) improves the resulting dataset in that the number of words for each time slot is not biased by its length, but better reflects the actual user behavior.

The resulting bag of words summarizes the original dataset and is the input data structure for the LDA algorithm described in the next subsection.

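To make the encoding concrete, here is a minimal sketch (our own reading of the description above, not the authors' code) that turns a 24-character day string into the bag of 3-hour words; the one-hour stride of the sliding window is an assumption.

```python
# Hypothetical sketch of the sliding-window encoding: a 3-hour window slides over the
# 24 H/W/E/N slots and each position yields a word "<time-period><3 labels>".
from collections import Counter

def day_to_words(day: str, window: int = 3) -> list[str]:
    """Turn a 24-character day string (e.g. 'HHH...EEE') into LDA words."""
    assert len(day) == 24
    words = []
    for start in range(0, len(day) - window + 1):   # one-hour stride (our assumption)
        time_period = start // window + 1           # 1 for 0:00-3:00, 2 for 3:00-6:00, ...
        words.append(f"{time_period}{day[start:start + window]}")
    return words

day = "HHHHHHHHH" + "WWWWWWWWW" + "EEE" + "HHH"
bag_of_words = Counter(day_to_words(day))           # e.g. {'1HHH': 3, '3HWW': 1, ...}
print(bag_of_words)
```
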
2.2 LDA Algorithm

LDA is a probabilistic generative model [3] used to cluster documents according to the topics (i.e., word patterns) they contain. The work in [8] proposes using this model to extract mobility patterns from a mobility dataset.

LDA is based on the Bayesian network depicted in Figure 4. A word w is the basic unit of data, representing the user location at a given time-period (see the bag of words in Figure 3). A set of N words defines a day of the user. Each user has a dataset consisting of M documents. Each day is viewed as a mixture of topics z, where topics are distributions over words (i.e., each topic can be represented by the list of words associated with the probability p(w|z)). For each day i, the probability of a word w_ij is given by p(w_ij) = Σ_{t=1}^{T} p(w_ij|z_it) p(z_it), where T is the number of topics. p(w_ij|z_it) and p(z_it) are assumed to have Multinomial distributions with hyperparameters α and β respectively. LDA uses the EM algorithm [2] to learn the model parameters. In our implementation we use the library Mallet (http://mallet.cs.umass.edu) to perform these computations.

Figure 4. Plate notation of the LDA model.

Once the model parameters have been found, Bayesian deduction allows extracting the topics best describing the routines of a given day (ranking the z on the basis of p(d|z)). However, as already introduced, since the z are just distributions over words, it is difficult to give them an immediate meaning useful in applications. The next section contains the main contribution of this paper: giving a meaningful label to the topics z.

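Mallet is a Java toolkit; as a rough stand-in for this step, the sketch below fits LDA on per-day bags of words with gensim in Python, reusing the hypothetical day_to_words() helper from the previous sketch. The toy days, topic count and parameters are our own assumptions, not the paper's configuration.

```python
# Rough Python stand-in for the Mallet-based step: fit LDA on per-day bags of words
# and inspect p(w|z) and p(z|d). Assumes day_to_words() from the earlier sketch.
from gensim import corpora
from gensim.models import LdaModel

days = [
    "HHHHHHHHH" + "WWWWWWWWW" + "EEE" + "HHH",   # typical workday
    "HHHHHHHHHHHH" + "EEEEEEEE" + "HHHH",        # typical weekend day
] * 10                                            # repeated so the patterns are learnable

docs = [day_to_words(d) for d in days]            # one document per day
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20, random_state=0)

print(lda.show_topic(0, topn=5))                  # words with highest p(w|z) for topic 0
print(lda.get_document_topics(corpus[0]))         # topic mixture p(z|d) for the first day
```
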
3 LDA-Topic Classification

3.1 Method

In summary, our approach consists of identifying a set of labels describing the main trends of a user's typical day. For example, 'Work 9:00 - 18:00' represents a day in which the user is at work from 9:00 to 18:00. We then identify the LDA-topics representing such a label (we refer to these topics as label-defined topics). LDA-topics extracted from the Reality Mining dataset are labeled after the most similar label-defined topics. Thus, rather than being described only in terms of probability distributions over words, topics get a compact description like 'Work 9:00 - 18:00'.

More in detail, our methodology is based on these key points:

1. We create a set of 30 predefined labels, each composed of a place ('H', 'W', 'E' or 'N'; we refer to these places as pattern-labels) and a time-frame (we refer to it as the time-frame-label). For example, the label 'W 9:00 - 18:00' represents the pattern where the user is at work from 9:00 to 18:00, while the label 'H 12:00 - 14:00' represents the pattern where the user is at home at lunch time. We choose these labels by visually inspecting the recurrent patterns in the Reality Mining users' days.

2. For each predefined label, we create a set of 15 sample days representing the corresponding daily behavior (each day is represented as a 24-slot array indicating where the user was at a given time of the day). The different days keep the pattern indicated by the label constant, and change the remaining part of the day.

3. For each block of 15 days, we compute one LDA-topic. The final result is a set of 30 label-defined topics, each representing one of the predefined labels. For example, recalling that topics are distributions over words, the days associated with the 'W 9:00 - 18:00' pattern will create a topic in which the p(w|z) of words like 3WWW, 4WWW, 5WWW will be high compared to the other words' probabilities. We verified that the resulting topics are not strongly affected if produced from a number of days smaller or greater than 15.

4. Topics extracted from the Reality Mining dataset are classified using k-Nearest Neighbor (kNN) with the Kullback-Leibler divergence as the distance metric from the above label-defined topics (a minimal sketch of this step follows the list).

5. The final result is that the days of a user can be described with easily-understandable labels (e.g., 'W 9:00 - 18:00'), rather than with just probability distributions over words, which are much more complex to interpret.

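The classification step in point 4 can be sketched as follows; the KL-divergence helper, the toy vocabulary and the label vectors are our own illustration, not the paper's implementation.

```python
# Minimal sketch: classify an extracted topic by the nearest label-defined topic under
# the Kullback-Leibler divergence (k = 1, as in the synthetic experiments below).
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """D_KL(p || q) with a small epsilon to avoid division by zero."""
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def classify_topic(topic: np.ndarray, label_defined: dict[str, np.ndarray]) -> str:
    """Return the label whose label-defined topic is closest in KL divergence."""
    return min(label_defined, key=lambda lbl: kl_divergence(topic, label_defined[lbl]))

# Toy example over a 3-word vocabulary, e.g. ('3WWW', '4WWW', '1HHH').
label_defined = {
    "W 9:00 - 18:00": np.array([0.45, 0.45, 0.10]),
    "H 0:00 - 9:00":  np.array([0.05, 0.05, 0.90]),
}
extracted_topic = np.array([0.40, 0.50, 0.10])
print(classify_topic(extracted_topic, label_defined))   # -> 'W 9:00 - 18:00'
```
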
3.2 Experiments and Discussion

We conduct some experiments to test the above approach. First, we experiment with an artificially-created dataset for which we have groundtruth information, and thus classification accuracy can be precisely evaluated. Then, we test the system with the Reality Mining data, which lacks groundtruth information.

The first set of experiments studies classification accuracy. Starting from the above-described 30 predefined labels expressing user patterns, we create a testing set on which to extract and classify topics. More in detail, we select L labels. For each label we create 15 days (following the same procedure described above) and we stack the 15 · L days together to create an artificial dataset of a user's whereabouts. The L labels represent the groundtruth user's whereabouts patterns.

We extract L topics from this dataset and classify the topics with the kNN algorithm (in this experiment we just use k = 1). The expected result is to classify the extracted topics with the same labels used to create the dataset. For each experiment we compute classification accuracy as:

accuracy = |{classified labels} ∩ {groundtruth labels}| / |{classified labels}|

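Read literally, this measure is the fraction of classified labels that also appear among the groundtruth labels; a tiny illustration with made-up label sets:

```python
# Sketch of the accuracy measure above; the label sets are illustrative.
classified = {"W 9:00 - 18:00", "H 0:00 - 9:00", "E 18:00 - 24:00"}
groundtruth = {"W 9:00 - 18:00", "H 0:00 - 9:00", "H 12:00 - 14:00"}

accuracy = len(classified & groundtruth) / len(classified)
print(accuracy)   # 2/3 ≈ 0.67
```
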
All the results are averaged over 100 runs of the experiment, in which we generate random groundtruth topics and random days containing those topics.

Figure 5 shows the average classification accuracy as a function of the number of patterns used to create the testing dataset. The loss in classification accuracy obtained with a small number of patterns is due to the fact that our algorithm tends to misclassify labels representing a short time-frame-label. For example, if the predefined label is 'H 12:00 - 14:00', a testing day could be 'EEEHHHHHWWWHHHWWWWWHHHHH', which is better represented by a topic described by the label 'W 9:00 - 18:00' or 'H 20:00 - 24:00'. This fact is also reflected by the higher variation between the minimum and maximum classification accuracy obtained with a small number of patterns.

Figure 5. Classification accuracy and number of patterns used to create the testing datasets. Minimum and maximum accuracies give a fair measure of the variance in our results.

To further test the approach in this experimental setting, we add artificial noise to the testing dataset. The idea is to corrupt the underlying pattern to see to what extent the LDA algorithm and our classification mechanism are able to generalize the pattern. In particular, once a day has been created, we change noise% of the labels in the pattern time-slots to random other labels. For example, if the label expressing the testing set days is 'E 18:00 - 24:00', a sample day for this experiment could be 'HHHHHHHHWWWWWWWWWEEEEHHH', expressing that the user was somewhere else only until 22:00 and then went back home (noise = 3/7 = 42%).

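A minimal sketch of this noise-injection step, assuming the corrupted hours are drawn uniformly at random from the labeled time frame; the helper and its defaults are ours, not the authors' code.

```python
# Hypothetical sketch: given a day built around a label (here 'E 18:00 - 24:00'),
# replace noise% of the hours inside the labeled time frame with other locations.
import random

def add_noise(day: str, start: int, end: int, noise: float) -> str:
    """Corrupt noise% of the hours in [start, end) with random other labels."""
    slots = list(day)
    frame = list(range(start, end))
    for hour in random.sample(frame, k=round(noise * len(frame))):
        slots[hour] = random.choice([c for c in "HWE" if c != slots[hour]])
    return "".join(slots)

random.seed(0)
clean_day = "H" * 18 + "E" * 6          # a day matching the label 'E 18:00 - 24:00'
print(add_noise(clean_day, start=18, end=24, noise=0.42))
```
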
Figure 6 shows the classification accuracy obtained at different noise levels in the case of datasets composed of 5 and 30 patterns respectively. As mentioned above, the variation between minimum and maximum classification accuracy is higher with a small number of patterns than with a large one. In addition, a higher number of patterns is less affected by the introduction of artificial noise in the testing dataset.

Figure 6. Average classification accuracy as a function of the noise introduced. The x-axis represents the percentage of hours (with respect to the time slot length indicated by the label) that has been randomly changed.

In a second group of experiments, we test our classification method on the Reality Mining dataset. We experiment with 36 individuals and 121 consecutive days (from 26-08-2004 to 21-12-2004). We chose this subset of days with the goal of analyzing people and days for which the data is reasonably complete, and with the goal of comparing our results with those presented in [8], which takes into consideration the same subset of days. We also complete missing values with the SVM mechanism described in the previous section.

Since groundtruth information regarding user topics is not available, our experiments on the Reality Mining dataset focus on two main aspects:

1. We first evaluate whether the predefined labels associated to the user's topics are a good representation of her days. In other words, we verify whether the labels are informative enough to reconstruct the day of the user.

2. We evaluate if there are other labels describing the user's days better than the ones selected by our approach.

With regard to the former aspect, we extract 100 LDA-topics from all the days of each user taken into consideration. For each day d we rank the topics z according to p(d|z). Starting from the topic z with the highest p(d|z), we reconstruct the day d according to the pattern-label and the time-frame-label associated to z.
If a part of the day has already been reconstructed by a previous (more probable) topic, the mechanism just leaves it unchanged. Parts of the day that do not appear in the considered topics' labels are not reconstructed.

We then compare the real and the reconstructed day. For each time-slot (hour), we assign an error equal to 1 if the reconstructed label is wrong, and an error equal to 0.5 if that hour is not reconstructed. The idea is that it is better for the algorithm not to reconstruct a part of the day than to reconstruct it wrongly.

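A minimal sketch of this scoring, under the assumption that hours left unreconstructed are marked with a placeholder character ('-'); the helper is our own illustration, not the authors' code.

```python
# Per-day reconstruction score: each hour costs 1 if reconstructed wrongly and
# 0.5 if left unreconstructed (marked here with '-').
def reconstruction_accuracy(real_day: str, reconstructed_day: str) -> float:
    error = 0.0
    for real, rec in zip(real_day, reconstructed_day):
        if rec == "-":          # hour not covered by any selected topic label
            error += 0.5
        elif rec != real:       # hour reconstructed with the wrong location
            error += 1.0
    return 1.0 - error / len(real_day)

real = "HHHHHHHHH" + "WWWWWWWWW" + "EEE" + "HHH"
reconstructed = "HHHHHHHHH" + "WWWWWWWWW" + "---" + "HHH"   # 'E' hours not reconstructed
print(reconstruction_accuracy(real, reconstructed))          # 1 - 1.5/24 ≈ 0.94
```
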
Figure 7 shows the distribution of day reconstruction accuracy: an average of 80% is obtained.

Figure 7. Average day reconstruction accuracy computed over all users and days.

Figure 8 shows the day reconstruction accuracy as a function of the number of LDA-topics extracted from the users' days. The low number of LDA-topics needed to obtain a high accuracy level can be explained by the limited number of users' days available and by their repetitiveness.

Figure 8. Average day reconstruction accuracy as a function of the number of topics extracted from the user's dataset.

With regard to the latter aspect of the Reality Mining experiments, we extract 100 LDA-topics from all the days of each user taken into consideration. For each day d, we want to find the topic that best describes that day. We rank topics z according to both p(d|z) and the length of the time-frame-label assigned to the topic. The idea is that topics explaining a bigger part of the user's day are to be preferred. The most probable topic z_top is selected for describing the day. We then evaluate how well the label assigned to z_top describes the day. In particular, for each time-slot in the time-frame-label we assign an error equal to 1 if the reconstructed label is wrong. For each time-slot that is not in the time-frame-label we give an error of 0.5. The idea is to lower the performance of topics associated to short time-frame-labels, which describe only a fraction of the day. Finally, we evaluate with the same error measure whether there exists another label better describing the day. We find that the labels associated to the selected topics are better than any other label in 80% of the cases.

All these results support the use of our classification mechanism. Our classification can notably simplify the comparison of users' routine characteristics, in that it allows direct comparison between topics' meaningful labels.

4 Related Work

The availability of affordable localization mechanisms and the recognition of location as a primary source of context information have stimulated a wealth of work, in particular on the core problem addressed in this paper: understanding users' whereabouts.

4.1 Identifying Places

Several researches tackle the problem of understanding people's whereabouts by trying to extract and identify those places that matter to the user. Mainstream approaches are either based on segmenting and clustering GPS traces to infer which places are relevant to the user [1, 18], or on detecting places and mobility on the basis of nearby RF-beacons such as WiFi and GSM towers [13, 7]. These approaches require the user to run special software on her device to collect and analyze the log of available GPS or RF-beacons. Thus experiments with these mechanisms are usually conducted with a relatively small user population (the Reality Mining dataset used also in this paper is by far one of the largest datasets in this category).

Starting from these results, another important area of research concerns the problem of converting from places described in terms of geographical coordinates or abstract IDs (e.g., GSM tower IDs) to semantically-rich places such as "home" or "favorite pub". The works described in [1, 14] adopt a probabilistic model to automatically attach semantic labels to visited places. These works rely on the fact that semantic information can be added by exploiting the structure of a person's daily routine. For example, the place where the user usually spends the night can be tagged semantically as "home", while the place where the user usually goes from 8:00 to 18:00 can be tagged as "work". In [1], further information is extracted by geocoding the place and mining the Web in search of relevant information.

In summary, these approaches allow representing and understanding users' mobility patterns as a sequence of places visited at different times of the day. This representation resembles the "Home, Work, Elsewhere, No Signal" representation used in the Reality Mining dataset.

Accordingly, while this paper focuses on higher-level abstractions (the sequence of places being visited is our starting point), the above approaches are the fundamental elements to apply our algorithms to other mobility datasets.

4.2 Identifying Routes

A number of related works deal with the problem of identifying the routes the user takes to move from one place to another. These approaches run data-mining algorithms to identify recurrent patterns in the GPS tracks of multiple users. Works in this area can be divided into 2 broad categories. "Geometric-based" approaches [9] apply pattern matching to the sequence of geographical coordinates composing the tracks; we call them geometric in that they use the physical "shape" of the path to compute the matching among routes. "String-based" approaches, instead, create a symbolic representation of the path (e.g., by considering only the names of the areas crossed by the path) and apply pattern-matching on that list of symbols [11]. In both cases, the extracted routes can be used to classify the user's current and past whereabouts.

The work presented in this paper is similar to "string-based" approaches, in that the geographic information about the places visited by the users is lost in favor of the more compact "Home, Work, Elsewhere, No Signal" representation. However, there are two important differences. On the one hand, our representation is even more abstract than the one discussed in [11] and similar works: labels in our dataset (e.g., Home) are completely detached from their physical location and, in fact, different users will label as "Home" completely different places. On the other hand, our classification algorithm could also extend the above related work to obtain a more descriptive label for the extracted routes.

4.3 Identifying Routines

Even more high-level than extracting places and routes is the problem of identifying routine whereabouts behaviors.

The CitySense project (http://www.citysense.com) is based on GPS and WiFi localization and has the goal of monitoring and describing the city's nightlife. In particular, the application identifies hot-spots in the city (e.g., popular bars and clubs) and compares the number of people located in a given area in real time with past measures, to determine the "activity level" of a given night. In a similar work based on extremely large anonymized mobility data coming from Telecom operators, the authors were able to extract the spatio-temporal dynamics of the city, highlighting where people usually go during the day. The authors were also able to identify the areas most visited by tourists during the day and the typical time of the visit [4, 12].

In this context, the works that most directly compare to our proposal are [7, 8]. They use the PCA and LDA algorithms to extract routine behavior from the Reality Mining dataset. Our work provides a further classification step to give meaningful labels to the extracted routines.

5 Conclusion and Future Work

In this paper we presented a methodology to automatically classify, with meaningful labels, the routine whereabouts extracted from a large mobility dataset.

Our future work in this area will target 3 main directions:

1. We will apply the presented approaches to "live" datasets such as those that can be acquired from Google Latitude (http://www.google.com/latitude) and Flickr [12].

2. We will develop mechanisms to add/modify topic labels at run time, so as to enable the use of the system in a wide range of scenarios.

3. We will develop Web-based visualization mechanisms to inspect and communicate whereabouts behaviors in an effective way [6].

The ultimate goal will be to create a real live Web application allowing different classes of users to see, understand and predict their own and other users' whereabouts.

6 Acknowledgements

We thank Katayoun Farrahi for clarifications on their use of the LDA algorithm on the Reality Mining dataset.

References

[1] N. Bicocchi, G. Castelli, M. Mamei, A. Rosi, and F. Zambonelli. Supporting location-aware services for mobile users with the whereabouts diary. In International Conference on MOBILe Wireless MiddleWARE, Operating Systems, and Applications, Innsbruck, Austria, 2008.

[2] C. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006.

[3] D. Blei, A. Ng, and M. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022, 2003.

[4] F. Calabrese, J. Reades, and C. Ratti. Eigenplaces: analysing cities using the space-time structure of the mobile phone network. IEEE Pervasive Computing, 9(1):78–84, 2010.

[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[6] A. Clear, R. Shannon, T. Holland, A. Quigley, S. Dobson, and P. Nixon. Situvis: A visual tool for modeling a user's behaviour patterns in a pervasive environment. In International Conference on Pervasive Computing, Nara, Japan, 2009.

[7] N. Eagle and A. Pentland. Eigenbehaviors: Identifying structure in routine. Behavioral Ecology and Sociobiology, 63(7):1057–1066, 2009.

[8] K. Farrahi and D. Gatica-Perez. Learning and predicting multimodal daily life patterns from cell phones. In International Conference on Multimodal Interfaces (ICMI-MLMI), Cambridge (MA), USA, 2009.

[9] J. Froehlich and J. Krumm. Route prediction from trip observations. In Intelligent Vehicle Initiative, Technology Advanced Controls and Navigation Systems, SAE World Congress and Exhibition, Detroit (MI), USA, 2008.

[10] J. Gemmell, G. Bell, and R. Lueder. MyLifeBits: a personal database for everything. Communications of the ACM, 49(1):88–95, 2006.

[11] F. Giannotti, M. Nanni, D. Pedreschi, and F. Pinelli. Trajectory pattern mining. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Jose (CA), USA, 2007.

[12] F. Girardin, J. Blat, F. Calabrese, F. D. Fiore, and C. Ratti. Digital footprinting: Uncovering tourists with user-generated content. IEEE Pervasive Computing, 7(4):36–43, 2008.

[13] D. Kim, J. Hightower, R. Govindan, and D. Estrin. Discovering semantically meaningful places from pervasive RF-beacons. In International Conference on Ubiquitous Computing, Orlando (FL), USA, 2009.

[14] L. Liao, D. Fox, and H. Kautz. Extracting places and activities from GPS traces using hierarchical conditional random fields. International Journal of Robotics Research, 26(1):119–134, 2007.

[15] M. Massimi, K. Truong, D. Dearman, and G. Hayes. Understanding recording technologies in everyday life. IEEE Pervasive Computing, pre-print, 2010.

[16] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In ACM International Conference on Knowledge Discovery and Data Mining, San Jose (CA), USA, 2007.

[17] S. Patel, J. Kientz, G. Hayes, S. Bhat, and G. Abowd. Farther than you may think: An empirical investigation of the proximity of users to their mobile phones. In International Conference on Ubiquitous Computing, Orange County (CA), USA, 2006.

[18] Y. Zheng, L. Zhang, X. Xie, and W.-Y. Ma. Mining interesting locations and travel sequences from GPS trajectories for mobile users. In International World Wide Web Conference, Madrid, Spain, 2009.