<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Disseminating Synthetic Smart Home Data for Advanced Applications</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Andrea Masciadri, Fabio Veronese, Sara Comai; Politecnico di Milano</institution>
          ,
          <addr-line>Como 22100</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2015</year>
      </pub-date>
      <abstract>
        <p />
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Abstract</title>
      <p>Research in the IoT and Smart Home fields
is growing rapidly, leading to the emergence
of new services that improve people's health and
lifestyle based on the analysis of the data
they produce while performing their daily
activities. However, researchers report a lack of
high-quality publicly-available datasets:
conducting experiments to gather such data is
long and expensive, especially if the
annotation of meaningful information (environment,
person's activity, health status) is required.
Moreover, there are even more specific
settings (e.g., dementia detection) where data
must be related to a change in the inhabitants'
behavior. We present a collection of new
publicly-available datasets generated with the
SHARON simulator. Thanks to this software,
researchers can obtain synthetic data
suiting their specific requirements. Two classes
of datasets are described: one extends
existing datasets while preserving the original
statistical properties; the other is composed of
simulations of virtual inhabitant-environment
systems. Moreover, we induced behavioral drifts
compatible with dementia symptoms,
generating further datasets. We believe that these
resources may help the progress of research
as long as new real-life high-quality datasets
are not available.</p>
      <p>Copyright © CIKM 2018 for the individual papers by the papers'
authors. Copyright © CIKM 2018 for the volume as a collection
by its editors. This volume and its papers are published under
the Creative Commons License Attribution 4.0 International (CC
BY 4.0).</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>The possibility of gathering large amounts of data from
Smart Home environments is a valuable opportunity
for the development of numerous applications, such as
security, home automation, and remote monitoring.</p>
      <p>Data are collected by different types of
sensors, connected to a (usually wireless) home network,
and stored in a central database. The localization of the
inhabitants, the state of the house (brightness,
temperature, humidity, opening of doors and windows), as
well as the activation of household appliances can be
a source of knowledge for advanced analytics.</p>
      <p>Moreover, in addition to the mentioned data, there
is often a need for extensive descriptions of the
context in which the data were collected: the so-called
"ground truth". For example, much attention has
been dedicated to research in the Activity
Recognition (AR) field, that is, the task of identifying the
ongoing Activity of Daily Living (ADL) from sensor
data. As highlighted by Sprint et al. [SCFSE16], in
order to access the Health Events related to a person
living in a Smart Environment, supervised
machine-learning algorithms are commonly used. Usually, AR
requires a set of labels related to the performed ADLs:
these labels are provided by external annotators
(often called oracles), who look at the data and use
extra information (such as videos, the house floor plan,
the resident profile, etc.) to generate the corresponding
ground-truth labels.</p>
      <p>It is clear that creating Smart Home datasets
with ground-truth information related to the
inhabitant's activities and well-being status is a long and
costly operation that often slows down the progress of
research and advanced applications. In the next
section we provide an overview of the currently
publicly-available datasets, highlighting the strengths and
weaknesses of the various resources. In Section 3 we
describe a new set of resources, their peculiarities and
how they have been generated. Finally, in Section 4 we
conclude this work by discussing future challenges in
the dissemination of Smart Home datasets.</p>
    </sec>
    <sec id="sec-3">
      <title>Background</title>
      <p>In recent years, many papers have discussed the
importance of continuously monitoring a
person's behavior as a source of information concerning
his/her well-being [RBC+15, PKL+05, PLJ+15].
According to Saives et al. [SPF15], improving the life of
the inhabitant with new technological services makes
a house "Smart"; those applications cover several
interesting research fields, all of them sharing the same
need to collect Home Automation datasets. A
literature review by Rashidi et al. from 2013 reports 18
noticeable projects in Ambient Assisted Living, and
confirms that "one of the most important component
is Human Activity Recognition" [RM13]. Despite the
great interest in research concerning Activity
Recognition (AR) and Behavioral Drift Detection (BDD),
the number of publicly-available high-quality datasets
is particularly small. Indeed, the collection of Home
Automation data in controlled settings, with good
annotation, is a hard and resource-demanding task.</p>
      <p>Table 2 summarizes the features of the most
widely used datasets in the literature to evaluate AR
and BDD research; as reported by Benmansour et
al. [BBF16], AR and BDD with multiple residents
introduce complexity in identifying the dwellers and
disassociating data and activities.</p>
      <p>ARAS (Activity Recognition with Ambient
Sensing) is a project aimed at ADL
recognition [AEIE13]. The authors have published their
datasets, which comprise data collected from two houses
with two inhabitants, for a duration of one month each.
The deployed sensor set was composed of 20 boolean
sensors, and data were annotated with 27 different
ADLs. However, the dataset reports an erratic routine
of the inhabitants (unusual meal times, unexpected
behavior during the ADL, etc.), specifies only one
activity at a time (even when two happen concurrently),
and reports ADLs that cannot be identified because the
necessary sensors are missing (e.g., no sensor was present
to detect the "using internet" and "reading" activities).</p>
      <p>CASAS (Center for Advanced Studies in Adaptive
Systems) is a research project and a department of
Washington State University that is very active in AR
studies. Their focus is to design a smart home "small
in form, lightweight in infrastructure, extendable with
minimal effort, and ready to perform key capabilities
out of the box", through their Smart Home in a Box
project [CCTK13]. The success of this project enabled
the collection and publication of several datasets, so
that many AR research studies have used CASAS
data [CSEC+09]. Nonetheless, the annotation of the
datasets is restricted to a small subset of the freely
available data, and in most of these cases it was
obtained through an automatic labeling method rather
than a personal diary or an oracle. Finally, the
variety of the installed sensors is often restricted to
two different types (motion and temperature),
reducing the possibility of advanced data analysis.</p>
      <p>Tapia et al. [TIL04] presented two datasets, collected
by MIT, related to two houses with a single resident each.
They comprise data collected from many
boolean sensors (up to 85) for two weeks each.
Activity annotation was achieved by asking the inhabitant to
use a Personal Digital Assistant (PDA). Every 15
minutes the participants were reminded by the PDA to record
the performed activities. Even if this methodology is
less intrusive and less demanding than spontaneous
annotation, it proved to be less accurate, probably
because it is not spontaneous. Moreover, the short
duration makes it less relevant for traditional machine
learning methods.</p>
      <p>T. van Kasteren [VKNEK08], working on an
Activity Recognition project at the University of Amsterdam,
collected a dataset concerning two houses with a
single inhabitant each. The volunteers' houses were
instrumented with 20 boolean sensors, collecting data for
28 days. The annotation was done directly by the
inhabitant, but it contains some inaccurate entries, as
well as some unexpected data (e.g., sensors always
on/off).</p>
      <p>Referring to the reported projects, we can summarize
the weak points of publicly-available datasets as
follows:</p>
      <p>Limited Sensor Variety: many projects use few
sensors or a limited variety of sensed quantities;
Limited Extension: projects involving several
volunteers have a short duration per person;
conversely, long-lasting collections refer to only a few
participants;
Limited Annotation Reliability: inhabitants and
automatic methods could lead to insufficient
results in terms of accuracy and, in some cases, the
single-activity annotation is not sufficient to
properly describe the experimental settings;
Heterogeneity: every project defines its own set of
activities, sensors, standards, and protocols,
resulting in non-comparable datasets;
Specificity and Applicability: most of the projects
report data collected with a specific intent, not
necessarily matching the aim of other research
groups; conversely, if a dataset is collected in generic
settings, it might not contain some specific
situations required by other research groups.</p>
      <p>Moreover, we would like to emphasize the lack of
attention devoted to the annotation of behavioral change.
Indeed, all the mentioned datasets cover too short a
time span and/or have no annotation concerning
such modifications in the inhabitant's behavior.</p>
      <p>Alternative approaches for the dataset collection
phase consist in replacing the real-world
system with simulation software that produces synthetic
data [Mas, MN06, AR07]. In this paper we present a
collection of datasets generated with the SHARON
simulator, which can be tuned to produce highly customized
synthetic home automation data for advanced
applications.</p>
    </sec>
    <sec id="sec-4">
      <title>Synthetic Smart Home Datasets</title>
      <p>The datasets we present have been obtained by exploiting
SHARON's sensor data generation algorithms, with
different environments and inhabitant behaviors.</p>
      <p>The resources are accessible at the persistent
URL http://www.purl.org/synthetic sh dataset, and
are available under the Creative Commons
Attribution 3.0 CC-BY License; when using the
included data, please cite the work of Veronese et
al. [VMT+16]. The resources and the software to
generate further data are also available at the
institutional website of the Assistive Technology Group
(ATG) [b1115].</p>
      <sec id="sec-4-1">
        <title>SHARON simulator</title>
        <p>SHARON is a tool developed in the BRIDGE
project [MSV+15] to address the lack of data for advanced
Smart Home applications such as Activity
Recognition. The simulator is structured in two main layers:
the top layer generates the daily activity schedule,
and the bottom layer translates it into sensor
activations. The software can be tuned by designing the
dwelling characteristics, the virtual sensor models,
and a set of parameters representing the inhabitant's
response to needs (e.g., hunger, tiredness, boredom,
stress, etc.). The activity schedule attempts to satisfy
the person's needs in relation to the time of the day,
the weekly cycle, the weather conditions and other
non-deterministic components. The bottom layer
relies either on a virtual agent, performing the scheduled
action in the environment and activating the sensors
following a set of alternative patterns, or on a
statistical module, reproducing the activations of sensors
given an activity as performed in an available
training dataset. Finally, it is possible to program a change
in the simulation parameters so that the inhabitant's
behavior is affected accordingly.</p>
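        <p>The two-layer structure described above can be illustrated with a deliberately small sketch. It is not the SHARON implementation: the need names are taken from the text, but the growth rates, the need-to-activity mapping, the sensor names and the greedy "serve the strongest need" rule are all hypothetical, and the time-of-day, weekly, weather and non-deterministic components are omitted.</p>
        <preformat>
```python
# Hypothetical needs and how fast they grow per simulated hour (illustrative values).
GROWTH = {"hunger": 0.20, "tiredness": 0.08, "boredom": 0.12}

# Hypothetical mapping from each need to the activity that satisfies it.
SATISFIES = {"hunger": "Lunch", "tiredness": "Sleeping", "boredom": "Watching TV"}

# Hypothetical sensor patterns used by the bottom layer (sensor names are made up).
PATTERNS = {
    "Lunch": ["kitchen_motion", "fridge_door", "chair_pressure"],
    "Sleeping": ["bedroom_motion", "bed_pressure"],
    "Watching TV": ["living_motion", "sofa_pressure"],
}


def top_layer(hours, needs=None):
    """Top layer: return one activity per hour, serving the strongest need."""
    needs = dict(needs or {name: 0.0 for name in GROWTH})
    schedule = []
    for _ in range(hours):
        # Every need grows with time.
        for name, rate in GROWTH.items():
            needs[name] += rate
        # Greedy choice (an assumption): perform the activity for the largest need.
        strongest = max(needs, key=needs.get)
        schedule.append(SATISFIES[strongest])
        needs[strongest] = 0.0  # performing the activity satisfies the need
    return schedule


def bottom_layer(schedule):
    """Bottom layer: translate each scheduled activity into sensor activations."""
    return [(activity, PATTERNS[activity]) for activity in schedule]


if __name__ == "__main__":
    for activity, sensors in bottom_layer(top_layer(4)):
        print(activity, sensors)
```
        </preformat>
        <p>Changing the values in GROWTH over the course of a run is one simple way to picture the programmed parameter change mentioned above, since the resulting schedule shifts accordingly.</p>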
        <p>All the details about the data generation model
implemented in SHARON to produce Synthetic Smart
Home Data are available in the work of Veronese et
al. [VPC+14]. The evaluation of the simulator has
already been performed through a cross-validation
process applied to a real-world dataset (ARAS [AEIE13]);
the work in Veronese et al. [VMT+16] reports the
results for both layers of the SHARON simulator:</p>
        <p>Top layer validation (ADL scheduling):
three different validation metrics (Bhattacharyya
distance [Bha46], Earth Mover's Distance [Hit41]
and Kullback-Leibler divergence [KL51]) have
been used to evaluate the difference between the
activity distributions in the generated dataset and
those in a test set extracted from the original
dataset. The same distances have been computed
between a training set of the above-mentioned
real-world dataset (the original dataset) and the test
set; Figure 1 shows that the ADL scheduling
generated by the SHARON simulator is compatible
with the schedule of the original dataset.</p>
        <p>Bottom layer validation (Sensor
activations): the Bhattacharyya distance has been
computed to compare the sensor activation
distributions in the ARAS dataset with respect to
the generated datasets (using both the agent
module and the statistical module). Table 2 reports
the results for three relevant activities: Cleaning
(where the sequence of sensor activations is
almost random), Lunch (where the executions
differ from one another while keeping an overall procedure),
and Having Shower (where the procedural
connotation is strong).</p>
        <p>[Figure 1: training data (vertical axis) versus simulated data (horizontal axis), both ranging from 0 to 500, with labelled activities: Dinner, Toilet, Other, Reading, Watching TV, Going Out, Shower, Internet, Lunch, Sleeping, Breakfast, Conversation, Snack, Music, Napping, Laundry, Cleaning.]</p>
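        <p>The three measures named in the validation above can be illustrated on a pair of toy histograms. The following is a minimal sketch and not the evaluation code of [VMT+16]: the function names, the toy counts and the eps guard are assumptions, and the Earth Mover's Distance is shown only in its one-dimensional form, which assumes that both histograms share the same ordered, unit-spaced bins (for instance, hours of the day).</p>
        <preformat>
```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability vector."""
    total = float(sum(counts))
    if total == 0:
        raise ValueError("empty histogram")
    return [c / total for c in counts]


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p_i * q_i))."""
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if bc == 0:
        return float("inf")  # the two distributions do not overlap at all
    return -math.log(min(bc, 1.0))  # clamp to absorb rounding error


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in nats.

    eps is an assumption of this sketch: it avoids a division by zero
    when q has an empty bin where p does not.
    """
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def earth_mover_distance_1d(p, q):
    """Earth Mover's Distance for histograms on the same ordered bins."""
    carried = 0.0  # mass that still has to be moved to the next bin
    work = 0.0
    for a, b in zip(p, q):
        carried += a - b
        work += abs(carried)
    return work


if __name__ == "__main__":
    real = normalize([1, 1, 2])       # toy counts, not taken from any dataset
    simulated = normalize([2, 1, 1])  # toy counts, not taken from any dataset
    print(bhattacharyya_distance(real, simulated))
    print(kl_divergence(real, simulated))
    print(earth_mover_distance_1d(real, simulated))
```
        </preformat>
        <p>All three values are zero for identical distributions and grow as the generated distribution departs from the reference one; the Kullback-Leibler divergence is not symmetric, so the order of its arguments matters.</p>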
        <p>The generated dataset has been obtained using the
SHARON simulator; every day of simulation is
represented by two text files: one describing the ADL
scheduling and one describing sensor activations. The
former contains all the performed activities, one
activity per line, with the starting time, the activity
identifier, and the activity name in a comma-separated
format. The latter contains 86400 lines, one for
every second of the day, reporting the boolean
status of every sensor of the house separated by blank
characters.</p>
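        <p>A reader for the two per-day files described above could look like the following sketch. Several details are assumptions, since the text does not fix them: the start time and the activity identifier are kept as plain strings, the boolean status is assumed to be written as "0" or "1", and the function names and sample contents are purely illustrative.</p>
        <preformat>
```python
def parse_schedule(text):
    """Parse 'start time, activity identifier, activity name' lines.

    All three fields are kept as strings because their exact encoding
    is not specified here (an assumption of this sketch).
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        start, activity_id, name = [field.strip() for field in line.split(",", 2)]
        rows.append((start, activity_id, name))
    return rows


def parse_sensor_day(text):
    """Parse one line per second into a list of per-sensor booleans.

    Assumes each status is the token "1" (on) or "0" (off).
    """
    return [[token == "1" for token in line.split()]
            for line in text.splitlines() if line.strip()]


def activation_counts(samples):
    """Count off-to-on transitions for every sensor over the parsed samples."""
    if not samples:
        return []
    counts = [0] * len(samples[0])
    previous = [False] * len(samples[0])
    for row in samples:
        for index, is_on in enumerate(row):
            if is_on and not previous[index]:
                counts[index] += 1
        previous = row
    return counts


if __name__ == "__main__":
    # Tiny made-up excerpts; a real sensor file would hold 86400 lines.
    schedule_text = "0,12,Sleeping\n25200,3,Breakfast\n"
    sensor_text = "0 0 1\n1 0 1\n0 0 0\n1 1 0\n"
    print(parse_schedule(schedule_text))
    print(activation_counts(parse_sensor_day(sensor_text)))
```
        </preformat>
        <p>A complete day can be checked by verifying that parse_sensor_day returns 86400 rows, each with as many entries as there are sensors in the simulated house.</p>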
        <p>The proposed datasets refer to three different house
models. Each dataset comprises 90 days of the virtual
inhabitant's life, and has an alternative version
comprising an injected behavioral drift compatible with
dementia symptoms, which can be used for comparison.
In the following, the characteristics of the different classes
of datasets are described; they are summarized in
Tables 3 and 4.</p>
        <p>The first group comprises four synthetic
home automation datasets (their names start with
A-*) based on a virtual reproduction of the ARAS
project test environment [AEIE13]. Two of them
(A-ext-*) have been obtained by training SHARON on
the behavior of one of the original ARAS project
inhabitants, resulting in an extension of the original
data. The other two (A-agn-*) have been obtained
using the same ADL scheduling but with an
agent-based simulation. Two variants with behavioral drift
due to dementia (*-dem) are also available.</p>
        <sec id="sec-4-1-1">
          <title>Environment</title>
          <p>The house environment used for the simulation
comprises 20 binary home automation sensors. The
location is a simple apartment with four main spaces:
bedroom, bathroom, and an open space with kitchen and
living room. The most common sensors are motion
detectors, but in this environment there are also tap, toilet,
and shower sensors, as well as pressure detectors on chairs, sofa
and bed.</p>
        </sec>
        <sec id="sec-4-1-2">
          <title>Inhabitant</title>
          <p>The inhabitant routine comprises two different
patterns for weekdays and weekends. During the
weekdays the inhabitant spends part of each day
outside the dwelling (for working activities), while
during the weekend leisure is the main occupation (relaxing,
reading, internet, etc.). There are 13 performed
activities, as described in Table 4, plus an unqualified
activity "Other".</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Van Kasteren dataset extension</title>
        <p>The second dataset group (K-*) is related to
the home used in the research project of Van Kasteren et
al. [VKNEK08]. In this case the virtual environment
reproduces the experimental house, as well as the
sensor activations, which are produced after training on
the original data. The result is two datasets: one
with the extension of the real dataset (K-ext-norm),
the other with the superimposed behavioral drift
(K-ext-dem).</p>
        <sec id="sec-4-2-1">
          <title>Environment</title>
          <p>The house environment used for the simulation
comprises 21 binary home automation sensors. The
location is a two-storey apartment: on the first floor
there are a bathroom and an open space with kitchen
and living room; the second floor is composed of two
bedrooms, a bathroom, and a study room. Installed
sensors include motion sensors to detect the opening of doors,
drawers and cupboards, tap and shower sensors,
sensors to detect the use of appliances, and pressure detectors
on chairs, sofa and bed.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Inhabitant</title>
          <p>The dataset describes 12 activities, the same as in the
ARAS datasets, except for the "Working" and "Internet"
activities, which are missing (Table 4). The inhabitant
routine comprises two different patterns for weekdays
and weekends, which mainly differ in the time and
duration of meals.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>V-Home dataset</title>
        <p>This last group of datasets is fully virtual (V-*). The
authors designed a simple four-room house, and
programmed an easy routine for a virtual inhabitant. The
obtained datasets are based on an agent-based sensor
activation simulation: one with a plain routine
(V-agn-norm), the other with the injected drift (V-agn-dem).</p>
        <sec id="sec-4-3-1">
          <title>Environment</title>
          <p>The virtually designed environment includes 11 binary
sensors. The house is designed with four main rooms:
kitchen, bedroom, bathroom, and living room. Most
devices are movement sensors, with open-close
detectors on the main door and the bathroom cabinet.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>Inhabitant</title>
          <p>The inhabitant routine represents a remote worker,
working 8 hours at home on weekdays and relaxing
at weekends. There are 14 activities, plus
an unqualified "Other".</p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>Behavioral Drift Description</title>
        <p>Alzheimer's Disease (AD) is becoming widespread, as
reported by Alzheimer's Disease International [WJB+13]: there will
be up to 65.7 million people living with dementia
worldwide by 2030 and up to 115.4 million by 2050.
The typical symptoms of AD affect the daily
routine and include forgetfulness, difficulty performing
ADLs, incontinence, speech problems, wandering and
getting lost, depression, and sleep disorders. In the
provided datasets (*-dem) this condition is simulated by
replicating part of the symptoms. The time taken to
perform complex tasks such as "Take a shower" is
increased by 20%, while the rate at which they are performed
is decreased by 15%. The
duration of nighttime sleep goes from an average of
8 uninterrupted hours to 4.5 hours fragmented up to
5 times, while naps appear during the day. The
frequency of activities such as "Dinner" and "Going out"
slightly decreases.</p>
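        <p>The arithmetic of the drift described above can be made concrete with a small sketch. It only restates the figures given in the text (20% longer showers, 15% lower shower rate, night sleep reduced from 8 to 4.5 hours in up to 5 fragments, daytime naps); the profile keys, the baseline shower values and the function name are hypothetical, and this is not how SHARON itself encodes the drift.</p>
        <preformat>
```python
def apply_dementia_drift(profile):
    """Return a drifted copy of a behavior profile; the input is left untouched."""
    drifted = dict(profile)
    # Complex task "Take a shower": 20% longer, performed 15% less often.
    drifted["shower_minutes"] = profile["shower_minutes"] * 1.20
    drifted["showers_per_week"] = profile["showers_per_week"] * 0.85
    # Night sleep: from 8 uninterrupted hours to 4.5 hours in up to 5 fragments.
    drifted["night_sleep_hours"] = 4.5
    drifted["max_sleep_fragments"] = 5
    # Naps appear during the day.
    drifted["daytime_naps"] = True
    return drifted


if __name__ == "__main__":
    baseline = {
        "shower_minutes": 15.0,     # hypothetical baseline value
        "showers_per_week": 7.0,    # hypothetical baseline value
        "night_sleep_hours": 8.0,   # baseline stated in the text
        "max_sleep_fragments": 1,
        "daytime_naps": False,
    }
    print(apply_dementia_drift(baseline))
```
        </preformat>
        <p>The slight decrease in the frequency of "Dinner" and "Going out" is not quantified in the text, so it is deliberately left out of the sketch.</p>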
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion and Future Work</title>
      <p>The presented datasets, generated with SHARON, are
a support resource for research groups working on
smart home data processing for advanced applications.
Despite some limitations, the proposed data are
a resource to foster such research, avoiding the costs
of creating a real-world testbed. Moreover, the
SHARON software is publicly available, making it possible to
generate further data with particular conditions
and behavioral drifts, and to overcome the lack of
high-quality real-world datasets. The quality of the data
generated by the simulator has been discussed in the
work of Veronese et al. [VMT+16], which has already
attracted the attention of the scientific community,
which has expressed a willingness to access the data. We
believe that this could be used as a tool to provide
early tests in the development of new methods (e.g.,
Activity Recognition and Behavioral Drift Detection),
before allocating time and financial resources. The
provided datasets are only one possible application of the
simulation software, whose next releases will include
further features and a user-friendly web interface to
allow the generation of high-quality synthetic datasets.</p>
      <p>[AEIE13]</p>
      <p>[CSEC+09] Diane Cook, M Schmitter-Edgecombe,
Aaron Crandall, Chad Sanders, and
Brian Thomas. Collecting and
disseminating smart home sensor data in the
CASAS project. In Proceedings of the CHI
Workshop on Developing Shared Home
Behavior Datasets to Advance HCI and
Ubiquitous Computing Research, 2009.</p>
      <p>[Hit41] Frank L Hitchcock. The distribution of
a product from several sources to
numerous localities. J. Math. Phys, 20(2):224-230, 1941.</p>
      <p>[KL51] Solomon Kullback and Richard A
Leibler. On information and sufficiency.
The annals of mathematical statistics,
pages 79-86, 1951.</p>
      <p>[Mas] Mason project website.
http://cs.gmu.edu/eclab/projects/mason.
Accessed: 2015.</p>
      <p>[MN06] Miquel Martin and Petteri Nurmi. A
generic large scale simulator for
ubiquitous computing. In 2006 Third Annual
International Conference on Mobile and
Ubiquitous Systems: Networking &amp;
Services, pages 1-3. IEEE, 2006.</p>
      <p>[MSV+15] Simone Mangano, Hassan Saidinejad,
Fabio Veronese, Sara Comai, Matteo
Matteucci, and Fabio Salice. Bridge:
Mutual reassurance for autonomous and
independent living. Intelligent Systems,
IEEE, 30(4):31-38, 2015.</p>
      <p>[PKL+05] Paula Paavilainen, Ilkka Korhonen, Jyrji
Lotjonen, Luc Cluitmans, Marja Jylha,
Antti Sarela, and Markku Partinen.
Circadian activity rhythm in demented
and non-demented nursing-home
residents measured by telemetric actigraphy.
Journal of sleep research, 14(1):61-68,
2005.</p>
      <p>[PLJ+15] Kirsten KB Peetoom, Monique AS Lexis,
Manuela Joore, Carmen D Dirksen, and
Luc P De Witte. Literature review
on monitoring technologies and their
outcomes in independently living
elderly people. Disability and
Rehabilitation: Assistive Technology, 10(4):271-294, 2015.</p>
      <p>[RBC+15] Daniele Riboni, Claudio Bettini,
Gabriele Civitarese, Zaffar Haider
Janjua, and Viola Bulgari. From lab to
life: Fine-grained behavior monitoring
in the elderly's home. In Pervasive
Computing and Communication
Workshops (PerCom Workshops), 2015 IEEE
on, pages</p>
      <p>[RM13] Parisa Rashidi and Alex Mihailidis. A
survey on ambient-assisted living tools
for older adults. IEEE journal of
biomedical and health informatics, 17(3):579-590, 2013.</p>
      <p>[SCFSE16] Gina Sprint, Diane Cook, Roschelle
Fritz, and Maureen
Schmitter-Edgecombe. Detecting health and
behavior change by analyzing smart
home sensor data. In Smart Computing
(SMARTCOMP), 2016 IEEE
International Conference on, pages 1-3. IEEE,
2016.</p>
      <p>[SPF15] Jeremie Saives, Clement Pianon, and
Gregory Faraut. Activity discovery and
detection of behavioral deviations of an
inhabitant from binary sensors. IEEE
Transactions on Automation Science and
Engineering, 12(4):1211-1224, 2015.</p>
      <p>[TIL04] Emmanuel Munguia Tapia, Stephen S
Intille, and Kent Larson. Activity
recognition in the home using simple and
ubiquitous sensors. Springer, 2004.</p>
      <p>[VMT+16] Fabio Veronese, Andrea Masciadri,
Anna A Trofimova, Matteo Matteucci,
and Fabio Salice. Realistic human
behaviour simulation for quantitative
ambient intelligence studies. Technology
and Disability, 28(4):159-177, 2016.</p>
      <p>[VPC+14] F Veronese, D Proserpio, S Comai,
M Matteucci, and F Salice. Sharon: a
simulator of human activities, routines
and needs. Studies in health technology
and informatics, 217:560-566, 2014.</p>
      <p>[WJB+13] Anders Wimo, Linus Jonsson, John
Bond, Martin Prince, Bengt Winblad,
and Alzheimer Disease International.
The worldwide economic impact of
dementia 2010. Alzheimer's &amp; Dementia,
9(1):1-11, 2013.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>