Disseminating Synthetic Smart Home Data for Advanced Applications

Andrea Masciadri, Politecnico di Milano, Como 22100, Italy, andrea.masciadri@polimi.it
Fabio Veronese, Politecnico di Milano, Como 22100, Italy, fabio.veronese@polimi.it
Sara Comai, Politecnico di Milano, Como 22100, Italy, sara.comai@polimi.it
Ilaria Carlini, Politecnico di Milano, Como 22100, Italy, ilaria.carlini@mail.polimi.it
Fabio Salice, Politecnico di Milano, Como 22100, Italy, fabio.salice@polimi.it

Abstract

Research in the IoT and Smart Home fields is growing rapidly, leading to the emergence of new services that improve people's health and lifestyle by analyzing the data they produce while performing their daily activities. However, researchers report a lack of high-quality publicly-available datasets: conducting experiments to gather such data is long and expensive, especially if the annotation of meaningful information (environment, person's activity, health status) is required. Moreover, there are even more specific settings (e.g., dementia detection) where data must be related to a change in the inhabitants' behavior. We present a collection of new publicly-available datasets generated with the SHARON simulator. Thanks to this software, researchers can obtain synthetic data suiting their specific requirements. Two classes of datasets are described: one extends existing datasets while preserving the original statistical properties, the other is composed of simulations of virtual inhabitant-environment systems. Moreover, we induced behavioral drifts compatible with dementia symptoms, generating further datasets. We believe that these resources may help the progress of research as long as new real-life high-quality datasets are not available.

Copyright © CIKM 2018 for the individual papers by the papers' authors. Copyright © CIKM 2018 for the volume as a collection by its editors. This volume and its papers are published under the Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

The possibility of gathering large amounts of data from Smart Home environments is a valuable opportunity for the development of numerous applications, such as security, home automation, and remote monitoring.

Data are collected by different types of sensors, connected to a home network (usually wireless) and stored in a central database. The localization of the inhabitants, the state of the house (brightness, temperature, humidity, opening of doors and windows), and the activation of household appliances can all be a source of knowledge for advanced analytics.

In addition to these data, there is often a need for extensive descriptions of the context in which the data were collected: the so-called “ground-truth”. For example, much attention has been dedicated to research in the Activity Recognition (AR) field, that is, the task of identifying the ongoing Activity of Daily Living (ADL) from sensor data. As highlighted by Sprint et al. [SCFSE16], supervised machine-learning algorithms are commonly used to assess the health events related to a person living in a Smart Environment. Usually, AR requires a set of labels for the performed ADLs: these labels are provided by external annotators (often called oracles), who look at the data and use extra information (such as videos, the house floor-plan, the resident profile, etc.) to generate the corresponding ground-truth labels.

It is clear that creating Smart Home datasets with ground-truth information about the inhabitant's activities and well-being status is a long and costly operation that often slows down the progress of research and advanced applications.
In the next section we provide an overview of the currently publicly-available datasets, highlighting the strengths and weaknesses of the various resources. In Section 3 we describe a new set of resources, their peculiarities, and how they have been generated. Finally, in Section 4 we conclude this work by discussing future challenges in the dissemination of Smart Home datasets.

Table 1: Datasets comparison (y: yes, n: no; m: months, w: weeks).

Dataset     # Houses   Multiple residents   Duration   # Sensors   # Activities
Aras        2          y                    2 m        20          27
Casas       >38        y                    2-8 m      20-86       11
MIT         2          n                    2 w        77-84       13
Kasteren    3          n                    >1 m       21          27

2 Background

In recent years, many papers have discussed the importance of continuously monitoring a person's behavior as a source of information about his/her well-being [RBC+15, PKL+05, PLJ+15]. According to Saives et al. [SPF15], improving the life of the inhabitant with new technological services is what makes a house “Smart”; those applications cover several interesting research fields, all of them sharing the same need to collect Home Automation datasets. A 2013 literature review by Rashidi et al. reports 18 noticeable projects in Ambient Assisted Living, and confirms that “one of the most important component is Human Activity Recognition” [RM13]. Despite the great interest in research on Activity Recognition (AR) and Behavioral Drift Detection (BDD), the number of publicly-available high-quality datasets is particularly small. Indeed, collecting Home Automation data in controlled settings, with good annotation, is a hard and resource-demanding task.

Table 1 summarizes the features of the datasets most widely used in the literature to evaluate AR and BDD research; as reported by Benmansour et al. [BBF16], AR and BDD with multiple residents introduce complexity in identifying the dwellers and disassociating data and activities.

ARAS (Activity Recognition with Ambient Sensing) is a project aimed at ADL recognition [AEIE13]. The authors have published their datasets, which comprise data collected from two houses with two inhabitants, for a duration of one month each. The deployed sensor set was composed of 20 boolean sensors, and data were annotated with 27 different ADLs. The dataset, however, reports an erratic routine of the inhabitants (unusual meal times, unexpected behavior during the ADL, etc.), specifies only one activity at a time (even when two happen concurrently), and reports ADLs which cannot be identified due to the lack of suitable sensors (e.g., no sensor was present to detect the “using internet” and “reading” activities).

CASAS (Center for Advanced Studies in Adaptive Systems) is a research project and a department of Washington State University that is very active in AR studies. Their focus is to design a smart home “small in form, lightweight in infrastructure, extendable with minimal effort, and ready to perform key capabilities out of the box”, through their Smart Home in a Box project [CCTK13]. The success of this project enabled the collection and publication of several datasets, so that many AR research studies have used CASAS data [CSEC+09]. Nonetheless, the annotation is restricted to a small subset of the freely available data, and in most of these cases it was obtained with an automatic labeling method rather than a personal diary or an oracle. Finally, the variety of the installed sensors is often restricted to two types (motion and temperature), reducing the possibility of advanced data analysis.

Tapia et al. [TIL04] presented two datasets collected by MIT, related to two houses with a single resident each. They comprise data collected from many Boolean sensors (up to 85) for two weeks each. Activity annotation was achieved by asking the inhabitant to use a Personal Digital Assistant (PDA): every 15 minutes the PDA reminded participants to record the performed activities. Even though this methodology is less intrusive and less demanding than spontaneous annotation, it turned out to be less accurate, probably because it is not spontaneous. Moreover, the short duration makes the datasets less relevant for traditional machine learning methods.

T. Van Kasteren [VKNEK08], working on an Activity Recognition project at the University of Amsterdam, collected a dataset concerning two houses with a single inhabitant. The volunteers' houses were instrumented with 20 boolean sensors, collecting data for 28 days. The annotation was done directly by the inhabitant, but it contains some inaccurate entries, as well as some unexpected data (e.g., sensors always on/off).

Referring to the reported projects, we can summarize the weak points of publicly-available datasets as follows:

• Limited Sensor Variety: many projects use few sensors or a limited variety of sensed quantities;

• Limited Extension: projects involving several volunteers present a short duration per person; conversely, long-lasting collections refer to only a few participants;

• Limited Annotation Reliability: annotation by inhabitants and by automatic methods can lead to insufficient accuracy and, in some cases, a single activity annotation is not sufficient to properly describe the experimental settings;

• Heterogeneity: every project defines its own set of activities, sensors, standards, and protocols, resulting in non-comparable datasets;

• Specificity and Applicability: most of the projects report data collected with a specific intent, not necessarily matching the aim of other research groups; dually, if a dataset is collected in generic settings, it might not contain some specific situations required by other research groups.

Moreover, we would like to emphasize the lack of attention devoted to the annotation of behavioral change. Indeed, all the mentioned datasets are too short in duration and/or have no annotation concerning such modifications in the inhabitant's behavior.

Alternative approaches to dataset collection consist in replacing the real-world system with simulation software that produces synthetic data [Mas, MN06, AR07]. In this paper we present a collection of datasets generated with the SHARON simulator, which can be tuned to produce highly customized synthetic home automation data for advanced applications.

3 Synthetic Smart Home Datasets

The datasets we present have been obtained by exploiting SHARON's sensor data generation algorithms, with different environments and inhabitant behaviors.

The resources are accessible at the persistent URL http://www.purl.org/synthetic sh dataset, and are available under the Creative Commons Attribution 3.0 CC-BY License; when using the data included here, please cite the work of Veronese et al. [VMT+16]. The resources and the software to generate further data are also available at the institutional website of the Assistive Technology Group (ATG) [b1115].

3.1 SHARON simulator

SHARON is a tool developed in the BRIDGE project [MSV+15] to address the lack of data for advanced Smart Home applications such as Activity Recognition. The simulator is structured in two main layers: the top layer generates the daily activity schedule, and the bottom layer translates it into sensor activations. The software can be tuned by designing the dwelling characteristics, the virtual sensor models, and a set of parameters representing the inhabitant's response to needs (e.g., hunger, tiredness, boredom, stress, etc.). The activity schedule attempts to satisfy the person's needs in relation to the time of the day, the weekly cycle, the weather conditions, and other non-deterministic components. The bottom layer relies either on a virtual agent, which performs the scheduled action in the environment and activates the sensors following a set of alternative patterns, or on a statistical module, which reproduces the sensor activations for a given activity as performed in an available training dataset. Finally, it is possible to program a change in the simulation parameters so that the inhabitant's behavior is affected accordingly.

All the details about the data generation model implemented in SHARON to produce Synthetic Smart Home Data are available in the work of Veronese et al. [VPC+14]. The simulator has already been evaluated through a cross-validation process applied to a real-world dataset (ARAS [AEIE13]); the work of Veronese et al. [VMT+16] reports the results for both layers of the SHARON simulator:

• Top layer validation (ADL scheduling): three different validation metrics (Bhattacharyya distance [Bha46], Earth Mover Distance [Hit41] and Kullback-Leibler divergence [KL51]) have been used to evaluate the difference between the activity distributions in the generated dataset and those in a test set extracted from the original dataset. The same distances have been computed between a training set of the original real-world dataset and the test set; Figure 1 shows that the ADL scheduling generated by the SHARON simulator is compatible with the schedule of the original dataset.

• Bottom layer validation (Sensor activations): the Bhattacharyya distance has been computed to compare the sensor activation distributions in the ARAS dataset with those in the generated datasets (using both the agent module and the statistical module). Table 2 reports the results for three relevant activities: Cleaning (where the sequence of sensor activations is almost random), Lunch (where executions differ from one another but keep an overall procedure), and Having Shower (where the procedural connotation is strong).

Table 2: Bhattacharyya distance of the sensor activation distributions in the generated datasets with respect to the original dataset; smaller values represent closer distributions. A: Agent, S: Statistical.

             Lunch          Shower         Cleaning
Sensor       A      S       A      S       A      S
Couch        0.34   0.06    -      -       0.79   0.43
Chair 1      0.38   0.29    -      -       -      -
Chair 2      0.25   0.47    -      -       0.47   0.59
Fridge       0.66   0.41    -      -       0.54   0.54
K. Drawer    0.74   0.60    -      -       -      -
B. Door      -      -       0.16   0.11    -      -
Shower       -      -       0.63   0.34    -      -
Hall         0.83   0.89    -      -       0.36   0.77
K. Mov.      0.22   0.20    -      -       0.33   0.21
Tap          -      -       -      -       0.94   0.54
K. Temp.     0.18   0.19    -      -       -      -

[Figure 1: scatter plot, x-axis “Simulated data”, y-axis “Training data”, both ranging from 0 to 500; one point per activity: Laundry, Cleaning, Music, Napping, Conversation, Snack, Toilet, Reading, Other, Watching TV, Dinner, Going Out, Shower, Internet, Lunch, Sleeping, Breakfast.]

Figure 1: Comparison of the Earth Mover Distance between the activity distributions in simulated data and training data as reported by Veronese et al. [VMT+16].
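The three validation metrics named for the top layer can be illustrated with a minimal sketch. This is not the evaluation code of Veronese et al. [VMT+16]: the function names are hypothetical, the distributions are assumed to be discrete histograms over the same bins, and the Earth Mover Distance is restricted to the one-dimensional case with unit spacing between adjacent bins (an assumption made here for simplicity).

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive value")
    return [c / total for c in counts]


def bhattacharyya_distance(p, q):
    """D_B = -ln(sum_i sqrt(p_i * q_i)); close to 0 for identical distributions."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return math.inf if coefficient == 0 else -math.log(coefficient)


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) = sum_i p_i * ln(p_i / q_i); eps guards against q_i = 0."""
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


def emd_1d(p, q):
    """Earth Mover Distance for 1-D histograms with unit bin spacing.

    Assumption: bins are ordered (e.g., time slots), so the distance is the
    sum of absolute differences between the cumulative distributions.
    """
    carried = 0.0
    work = 0.0
    for a, b in zip(p, q):
        carried += a - b
        work += abs(carried)
    return work


# Illustrative counts only (not taken from any dataset): occurrences of one
# activity in four consecutive time slots, simulated vs. original.
simulated = normalize([1, 3, 4, 2])
original = normalize([2, 2, 4, 2])
distances = {
    "bhattacharyya": bhattacharyya_distance(simulated, original),
    "kl": kl_divergence(simulated, original),
    "emd": emd_1d(simulated, original),
}
```

Smaller values indicate closer distributions for all three metrics; note that the Kullback-Leibler divergence is not symmetric, so the order of the arguments matters.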
3.2 Dataset description

The datasets have been generated with the SHARON simulator; every day of simulation is represented by two text files: one describing the ADL scheduling and one describing the sensor activations. The former contains all the performed activities, one activity per line, with the starting time, the activity identifier, and the activity name in a comma-separated format. The latter contains 86400 lines, one for every second of the day, each reporting the boolean status of every sensor of the house separated by blank characters.

The proposed datasets refer to three different house models. Each dataset comprises 90 days of the virtual inhabitant's life, and has an alternative version with an injected behavioral drift compatible with dementia symptoms, which can be used for comparison. In the following, the characteristics of the different classes of datasets are described; they are summarized in Tables 3 and 4.

Table 3: New datasets comparison. A: ARAS, K: Van Kasteren, V: V-House.

DATASET      House Map   Type          Drift   Days   Annot.
A-ext-norm   A           Statistical   no      90     yes
A-ext-dem    A           Statistical   yes     90     yes
A-agn-norm   A           Agent-based   no      90     yes
A-agn-dem    A           Agent-based   yes     90     yes
K-ext-norm   K           Statistical   no      90     yes
K-ext-dem    K           Statistical   yes     90     yes
V-agn-norm   V           Agent-based   no      90     yes
V-agn-dem    V           Agent-based   yes     90     yes

Table 4: New datasets performed ADLs.
Datasets (rows): A-*, K-*, V-*.
Activities (columns): Watching TV, Going Out, Breakfast, Cleaning, Working, Sleeping, Reading, Internet, Shower, Dinner, Lunch, Other, Toilet, Relax.
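As an illustration of how the two per-day files described above might be read, the following sketch parses an ADL schedule and a sensor activation file and derives one activity label per second. Several details are assumptions rather than facts about the released files: the starting time is taken to be an integer number of seconds since midnight, the activity identifier an integer, and each sensor status a "0" or "1" token. All function names are hypothetical.

```python
def parse_schedule(text):
    """Parse 'start,activity_id,activity_name' lines into sorted tuples.

    Assumption: 'start' is an integer number of seconds since midnight.
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        start, activity_id, name = (f.strip() for f in line.split(",", 2))
        entries.append((int(start), int(activity_id), name))
    return sorted(entries)


def parse_sensor_states(text):
    """Parse one line per second; each line holds blank-separated 0/1 flags."""
    return [
        tuple(token == "1" for token in line.split())
        for line in text.splitlines()
        if line.strip()
    ]


def label_per_second(schedule, seconds_per_day=86400):
    """Expand the schedule into one activity name per second of the day.

    Each activity is assumed to last until the next one starts; seconds
    before the first scheduled activity stay unlabeled (None).
    """
    labels = [None] * seconds_per_day
    for i, (start, _activity_id, name) in enumerate(schedule):
        end = schedule[i + 1][0] if i + 1 < len(schedule) else seconds_per_day
        for second in range(start, min(end, seconds_per_day)):
            labels[second] = name
    return labels


# Tiny made-up example (not real dataset content).
schedule = parse_schedule("0,1,Sleeping\n25200,2,Breakfast\n27000,3,Going Out\n")
labels = label_per_second(schedule)
states = parse_sensor_states("0 1 0\n1 1 0\n")
```

Pairing `labels[t]` with the sensor vector at line `t` of the activation file yields the supervised (sensor state, activity) samples that Activity Recognition methods typically train on.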
3.2.1 ARAS dataset extension

This first group comprises four synthetic home automation datasets (their names start with A-*) based on a virtual reproduction of the ARAS project test environment [AEIE13]. Two of them (A-ext-*) have been obtained by training SHARON on the behavior of one of the original ARAS project inhabitants, resulting in an extension of the original data. The other two (A-agn-*) have been obtained using the same ADL scheduling but with an agent-based simulation. Two variants with behavioral drift due to dementia (*-dem) are also available.

Environment
The house environment used for simulation comprises 20 binary home automation sensors. The location is a simple apartment with four main spaces: bedroom, bathroom, and an open space with kitchen and living room. The most common sensors are motion detectors, but in this environment there are also tap, toilet, and shower sensors, and pressure detectors on chairs, sofa and bed.

Inhabitant
The inhabitant routine comprises two different patterns for weekdays and weekends. During the weekdays the inhabitant spends a daily amount of time outside the dwelling (for working activities), while during the weekend leisure is the main occupation (relax, reading, internet, etc.). There are 13 performed activities, as described in Table 4, plus an unqualified activity “Other”.

3.2.2 Van Kasteren dataset extension

The second dataset group (K-*) is related to the research project home by Van Kasteren et al. [VKNEK08]. In this case the virtual environment reproduces the experimental house, as well as the sensor activations, which are produced after training on the original data. The results are two datasets: one with the extension of the real dataset (K-ext-norm), the other with the superimposed behavioral drift (K-ext-dem).

Environment
The house environment used for simulation comprises 21 binary home automation sensors. The location is a two-storey apartment: on the first floor there are a bathroom and an open space with kitchen and living room; the second floor is composed of two bedrooms, a bathroom, and a study room. Installed sensors include motion sensors to detect the opening of doors, drawers and cupboards, tap and shower sensors, sensors to detect the use of appliances, and pressure detectors on chairs, sofa and bed.

Inhabitant
The dataset describes 12 activities, the same as those of the ARAS datasets except for the “Working” and “Internet” activities, which are missing (Table 4). The inhabitant routine comprises two different patterns for weekdays and weekends, which differ mainly in the time and duration of meals.

3.2.3 V-Home dataset

This last group of datasets is fully virtual (V-*). The authors designed a simple four-room house, and programmed an easy routine for a virtual inhabitant. The obtained datasets are based on an agent-based sensor activation simulation, one with a plain routine (V-agn-norm), the other with the injected drift (V-agn-dem).

Environment
The virtual environment includes 11 binary sensors. The house is designed with four main rooms: kitchen, bedroom, bathroom, and living room. Most devices are movement sensors, with open-close detectors on the main door and the bathroom cabinet.

Inhabitant
The inhabitant routine represents a remote worker, working 8 hours at home on weekdays and relaxing on weekends. There are 14 activities, with the addition of an unqualified “Other”.

3.3 Behavioral Drift Description

Alzheimer's Disease (AD) is becoming widespread, as reported by AD International [WJB+13]: there will be up to 65.7 million people living with dementia worldwide by 2030 and up to 115.4 million by 2050. The typical symptoms of AD affect the daily routine: forgetfulness, difficulty performing ADLs, incontinence, speech problems, wandering and getting lost, depression, and sleep disorders. In the provided datasets (*-dem) this condition is simulated by replicating part of the symptoms. The time taken to perform complex tasks such as “Take a shower” is increased by 20%, and their rate of occurrence is decreased by 15%. The duration of nighttime sleep goes from an average of 8 uninterrupted hours to 4.5 hours fragmented up to 5 times, while naps appear during the day. The frequency of activities such as “Dinner” and “Going out” slightly decreases.
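The drift parameters quoted above can be turned into a small worked example. The sketch below is not SHARON's implementation: `ActivityProfile`, `drift_complex_task` and `fragment_sleep` are hypothetical names, the profile fields are an assumed parameterization, and “fragmented up to 5 times” is read here as “split into at most 5 sleep segments”, which is one possible interpretation.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ActivityProfile:
    """Hypothetical per-activity parameters of a simulated inhabitant."""
    name: str
    mean_duration_min: float
    daily_rate: float  # expected occurrences per day


def drift_complex_task(profile, duration_gain=0.20, rate_loss=0.15):
    """Lengthen a complex task by 20% and make it 15% less frequent."""
    return replace(
        profile,
        mean_duration_min=profile.mean_duration_min * (1 + duration_gain),
        daily_rate=profile.daily_rate * (1 - rate_loss),
    )


def fragment_sleep(total_hours=4.5, max_fragments=5, rng=None):
    """Split the reduced night sleep into 1..max_fragments segments (hours)."""
    rng = rng or random.Random()
    n_fragments = rng.randint(1, max_fragments)
    cuts = sorted(rng.uniform(0, total_hours) for _ in range(n_fragments - 1))
    bounds = [0.0] + cuts + [total_hours]
    return [end - start for start, end in zip(bounds, bounds[1:])]


# Made-up baseline values, used only to show the arithmetic.
shower = ActivityProfile("Shower", mean_duration_min=15.0, daily_rate=1.0)
drifted_shower = drift_complex_task(shower)  # 18.0 minutes, 0.85 per day
night = fragment_sleep(rng=random.Random(0))  # segments summing to 4.5 hours
```

With a 15-minute baseline shower taken once a day, the drifted profile lasts 15 × 1.20 = 18 minutes and occurs 1 × 0.85 = 0.85 times per day on average.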
The inhabitant erate further different data with particular conditions and behavioral drifts, and overcoming the lack of high- casas project. In Proceedings of the CHI quality real world datasets. The quality of the data Workshop on Developing Shared Home generated by the simulator has been discussed in the Behavior Datasets to Advance HCI and work of Veronese et al. [VMT+ 16], which has already Ubiquitous Computing Research, 2009. attracted the attention of the scientific community that has expressed a willingness to access data. We [Hit41] Frank L Hitchcock. The distribution of believe that this could be used as a tool to provide a product from several sources to numer- early tests for new methods development (e.g., Activ- ous localities. J. Math. Phys, 20(2):224– ity Recognition and Behavioral Drift Detection), be- 230, 1941. fore allocating time and financial resources. The pro- [KL51] Solomon Kullback and Richard A vided datasets are only a possible application of the Leibler. On information and sufficiency. simulation software, whose next releases will include The annals of mathematical statistics, further features and a user friendly web interface to pages 79–86, 1951. allow the generation of high quality synthetic datasets. [Mas] Mason project website. References http://cs.gmu.edu/eclab/projects/mason. Accessed: 2015. [AEIE13] Hande Alerndar, Halil Ertan, Ozlem Durmaz Incel, and Cem Er- [MN06] Miquel Martin and Petteri Nurmi. A soy. Aras human activity datasets in generic large scale simulator for ubiqui- multiple homes with multiple residents. tous computing. In 2006 Third Annual In Pervasive Computing Technologies International Conference on Mobile and for Healthcare (PervasiveHealth), 2013 Ubiquitous Systems: Networking & Ser- 7th International Conference on, pages vices, pages 1–3. IEEE, 2006. 232–235. IEEE, 2013. [MSV+ 15] Simone Mangano, Hassan Saidinejad, [AR07] Ibrahim Armac and Daniel Retkowitz. 
Fabio Veronese, Sara Comai, Matteo Simulation of smart environments. In Matteucci, and Fabio Salice. Bridge: IEEE International Conference on Per- Mutual reassurance for autonomous and vasive Services, pages 257–266. IEEE, independent living. Intelligent Systems, 2007. IEEE, 30(4):31–38, 2015. [b1115] Assistive Technology Group [PKL+ 05] Paula Paavilainen, Ilkka Korhonen, Jyrji (ATG) of Politecnico di Milano. Lötjönen, Luc Cluitmans, Marja Jylhä, http://www.atg.deib.polimi.it, 2015. Antti Särelä, and Markku Partinen. Circadian activity rhythm in demented [BBF16] Asma Benmansour, Abdelhamid and non-demented nursing-home resi- Bouchachia, and Mohammed Feham. dents measured by telemetric actigraphy. Multioccupant activity recognition in Journal of sleep research, 14(1):61–68, pervasive smart home environments. 2005. ACM Computing Surveys (CSUR), [PLJ+ 15] Kirsten KB Peetoom, Monique AS Lexis, 48(3):34, 2016. Manuela Joore, Carmen D Dirksen, and [Bha46] Anil Bhattacharyya. On a measure of di- Luc P De Witte. Literature review vergence between two multinomial popu- on monitoring technologies and their lations. Sankhyā: The Indian Journal of outcomes in independently living el- Statistics, pages 401–406, 1946. derly people. Disability and Rehabili- tation: Assistive Technology, 10(4):271– [CCTK13] Diane J Cook, Aaron S Crandall, Brian L 294, 2015. Thomas, and Narayanan C Krishnan. Casas: A smart home in a box. Com- [RBC+ 15] Daniele Riboni, Claudio Bettini, puter, 46(7), 2013. Gabriele Civitarese, Zaffar Haider Jan- jua, and Viola Bulgari. From lab to [CSEC+ 09] Diane Cook, M Schmitter-Edgecombe, life: Fine-grained behavior monitoring Aaron Crandall, Chad Sanders, and in the elderly’s home. In Pervasive Brian Thomas. Collecting and dissem- Computing and Communication Work- inating smart home sensor data in the shops (PerCom Workshops), 2015 IEEE International Conference on, pages 342–347. IEEE, 2015. [RM13] Parisa Rashidi and Alex Mihailidis. 
A survey on ambient-assisted living tools for older adults. IEEE journal of biomed- ical and health informatics, 17(3):579– 590, 2013. [SCFSE16] Gina Sprint, Diane Cook, Roschelle Fritz, and Maureen Schmitter- Edgecombe. Detecting health and behavior change by analyzing smart home sensor data. In Smart Computing (SMARTCOMP), 2016 IEEE Interna- tional Conference on, pages 1–3. IEEE, 2016. [SPF15] Jérémie Saives, Clément Pianon, and Gregory Faraut. Activity discovery and detection of behavioral deviations of an inhabitant from binary sensors. IEEE Transactions on Automation Science and Engineering, 12(4):1211–1224, 2015. [TIL04] Emmanuel Munguia Tapia, Stephen S Intille, and Kent Larson. Activity recog- nition in the home using simple and ubiq- uitous sensors. Springer, 2004. [VKNEK08] Tim Van Kasteren, Athanasios Noulas, Gwenn Englebienne, and Ben Kröse. Ac- curate activity recognition in a home set- ting. In Proceedings of the 10th interna- tional conference on Ubiquitous comput- ing, pages 1–9. ACM, 2008. [VMT+ 16] Fabio Veronese, Andrea Masciadri, Anna A Trofimova, Matteo Matteucci, and Fabio Salice. Realistic human behaviour simulation for quantitative ambient intelligence studies. Technology and Disability, 28(4):159–177, 2016. [VPC+ 14] F Veronese, D Proserpio, S Comai, M Matteucci, and F Salice. Sharon: a simulator of human activities, routines and needs. Studies in health technology and informatics, 217:560–566, 2014. [WJB+ 13] Anders Wimo, Linus Jönsson, John Bond, Martin Prince, Bengt Winblad, and Alzheimer Disease International. The worldwide economic impact of de- mentia 2010. Alzheimer’s & Dementia, 9(1):1–11, 2013.