Towards an Interactive Personal Care System driven by Sensor Data

Stefano Bragaglia, Paola Mello, and Davide Sottara
DEIS, Facoltà di Ingegneria, Università di Bologna
Viale Risorgimento 2, 40136 Bologna (BO), Italy
stefano.bragaglia@unibo.it, paola.mello@unibo.it, davide.sottara2@unibo.it

Abstract. This demo presents an integrated application exploiting technologies derived from various branches of Artificial Intelligence. The prototype, when triggered by an appropriate event, interacts with a person and expects them to perform a simple acknowledgement gesture; the following actions then depend on the actual recognition of this gesture. In order to achieve this goal, a combination of data stream processing and rule-based programming is used. While the more quantitative components exploit techniques such as Clustering or (Hidden) Markov Models, the higher, more qualitative and declarative levels are based on well-known frameworks such as the Event Calculus and Fuzzy Logic.

Keywords: Complex Event Processing, Event Calculus, Sensor Data Fusion, Activity Recognition

1 Introduction

Intelligent beings need senses to interact with the environment they live in: the same principle holds for artificially intelligent entities when they are applied outside of purely virtual contexts, in the real world. Much effort is being spent on emulating smell (processing signals acquired through electronic noses), taste (using electronic tongues) and touch, although hearing and especially sight are arguably the subject of an even larger amount of research and application development. Focusing on computer vision, observing the position or the shape of people and objects is extremely relevant when dealing with problems such as pathfinding, tracking, planning or monitoring. Moreover, analysing the visual information is often the prelude to a decisional process, where actions are scheduled depending on the sensed inputs and the current goals.

In our case study, part of the DEPICT project, we are addressing the monitoring of an elderly person during their daily activity, recognizing, where possible, the onset of undesired short- and long-term conditions, such as frailty, and ultimately trying to prevent severely detrimental events such as falls. Should a fall or similar event actually happen, however, it is essential to recognize it as soon as possible, in order to take the appropriate actions. To this end, we are planning to use a mobile platform equipped with optical and audio sensors and a decisional processing unit: the device would track or locate the person as needed, using the sensors to acquire information on their current status and capabilities.

In this paper, however, we will not discuss the mobile platform, but focus on the sensors it is equipped with and the information they collect. This information will be analyzed to identify and isolate, in ascending order of abstraction, poses, gestures, actions and activities [1]. A pose is a specific position assumed by one or more parts of the body, such as "sitting", "standing" or "arms folded"; a gesture is a simple, atomic movement of a specific body part (e.g. "waving hand", "turning head"); actions are more complex interactions such as "walking" or "standing up", while activities are composite actions, usually performed with a goal in mind. Recognizing patterns directly at each level of abstraction is a relevant research task of its own, but it may also be necessary to share and correlate information between the levels; a minimal sketch of this hierarchy is given below.
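As a purely illustrative aid (the type names are ours and do not correspond to the actual DEPICT object model), these abstraction levels could be represented in Java along the following lines, with each level aggregating elements of the level below:

import java.util.List;

// Hypothetical object model for the abstraction levels discussed above.
abstract class Observation {
    final String label;    // e.g. "sitting", "waving hand", "standing up"
    final long start, end; // validity interval, in milliseconds
    Observation(String label, long start, long end) {
        this.label = label; this.start = start; this.end = end;
    }
}

/** A pose: a position assumed by one or more body parts. */
class Pose extends Observation {
    Pose(String label, long start, long end) { super(label, start, end); }
}

/** A gesture: a simple, atomic movement, i.e. a short sequence of poses. */
class Gesture extends Observation {
    final List<Pose> poses;
    Gesture(String label, List<Pose> poses, long start, long end) {
        super(label, start, end); this.poses = poses;
    }
}

/** An action (e.g. "walking"): a more complex combination of gestures. */
class Action extends Observation {
    final List<Gesture> gestures;
    Action(String label, List<Gesture> gestures, long start, long end) {
        super(label, start, end); this.gestures = gestures;
    }
}

/** An activity: a composite of actions, usually performed with a goal in mind. */
class Activity extends Observation {
    final List<Action> actions;
    Activity(String label, List<Action> actions, long start, long end) {
        super(label, start, end); this.actions = actions;
    }
}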
From a monitoring perspective, the more general information may provide the necessary context for a proper analysis of the more detailed one: for example, one might be interested in a postural analysis of the spine when a person is walking or standing up from a chair, but not when a person is sleeping.

2 AI Techniques

In order to analyse a person's movements in a flexible and scalable way, we propose the adoption of a hybrid architecture, combining low-level image and signal processing techniques with a higher-level reasoning system. The core of the system is developed using the "Knowledge Integration Platform" Drools (http://www.jboss.org/drools), an open source suite which is particularly suitable for this kind of application. It is based on a production rule engine, which allows knowledge to be encoded in the form of if-then rules. A rule's premise is normally a logic, declarative construction stating the conditions for the application of some consequent action; the consequences, instead, are more operational and define which actions should be executed (including the generation of new data) when a premise is satisfied. Drools then builds additional capabilities on top of its core engine: it supports, among other things, temporal reasoning, a limited form of functional programming and reasoning under uncertainty and/or vagueness [6]. Being object oriented and written in Java, it is platform independent and can easily be integrated with other existing components. Using this platform, we built our demo application, implementing and integrating other well-known techniques.

Vision sub-system. At the data input level, we used a Kinect hardware sensor due to its robustness, availability and low price. It combines a traditional camera with an infrared depth camera, making it possible to reconstruct both 2D and 3D images. We used it in combination with the open source OpenNI middleware (http://www.OpenNI.org), in particular exploiting its tracking component, which identifies and traces the position of humanoid figures in a scene. For each figure, the coordinates of its "joints" (neck, elbow, hip, etc.) are estimated and sampled with a frequency of 30 Hz.

Vocal sub-system. We rely on the open source projects FreeTTS (http://freetts.sourceforge.net/docs/index.php) and Sphinx-4 (http://cmusphinx.sourceforge.net/sphinx4/) for speech synthesis and recognition, respectively.

Semantic domain model. Ontologies are formal descriptions of a domain, defining the relevant concepts and the relationships between them. An ontology is readable by domain experts, but can also be processed by a (semantic) reasoner to infer and make additional knowledge explicit, check the logical consistency of a set of statements and recognize (classify) individual entities given their properties. For our specific case, we are developing a simple ontology of the body parts (joints) and a second ontology of poses and acts. The ontologies are then converted into an object model [5], composed of appropriate classes and interfaces, which can be used to model facts and write rules.

Event Calculus. The Event Calculus (EC) [2] is another well-known formalism, used to represent the effects of actions and changes on a domain. The base formulation of the EC consists of a small set of simple logical axioms, which correlate the happening of events with fluents. An event is a relevant state change in a monitored domain, taking place at a specific point in time; fluents, instead, denote relevant domain properties and the time intervals during which they hold. From a more reactive perspective, the fluents define the overall state of a system through its relevant properties; the events, instead, mark the transitions between those states.
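For orientation, a classical simplified formulation of these axioms (not necessarily the exact reactive axiomatization adopted in [2]) relates events and fluents as follows:

\begin{align*}
HoldsAt(F,T) \;&\leftarrow\; Initially(F) \,\land\, \lnot Clipped(0,F,T)\\
HoldsAt(F,T) \;&\leftarrow\; Happens(E,T_1) \,\land\, Initiates(E,F,T_1) \,\land\, T_1 < T \,\land\, \lnot Clipped(T_1,F,T)\\
Clipped(T_1,F,T_2) \;&\leftrightarrow\; \exists E,T \,[\, Happens(E,T) \,\land\, T_1 \le T < T_2 \,\land\, Terminates(E,F,T) \,]
\end{align*}

Intuitively, a fluent F holds from the moment an event initiates it until some later event terminates ("clips") it.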
Complex Event Processing. In addition to flipping fluents, events may become relevant because of their relation to other events: typical relations include causality, temporal sequencing or aggregation [4]. Causality indicates that an event is a direct consequence of another; sequencing imposes constraints on the order in which events may appear; aggregation allows higher-level, more abstract events to be defined from a set of simpler ones. Exploiting these relations is important as the number and frequency of events increase, in order to limit the amount of information, filtering and preserving only what is relevant.

Fuzzy Logic. When complex or context-specific conditions are involved, it may be difficult to define when a property is definitely true or false. Instead, it may be easier to provide a vague definition, allowing the property to hold up to some intermediate degree, defined on a scale of values. Fuzzy logic [3] deals with this kind of graduality, extending traditional logic to support formulas involving graded predicates. In this kind of logic, "linguistic" predicates such as old or tall can replace crisp, quantitative constraints with vague (and smoother) qualitative expressions. Likewise, logical formulas evaluate to a degree rather than being either valid or not.

3 Demo

Outline. The techniques listed in the previous section have been used as building blocks for a simple demo application, demonstrating a user/machine interaction based on vision and supported by the exchange of some vocal messages. The abstract system architecture is depicted in Figure 1.

Fig. 1. Architectural outline

The simple protocol we propose as a use case, instead, is a simplified interaction pattern where a monitoring system checks that a person has at least some degree of control over their cognitive and physical capabilities. This is especially important when monitoring elderly patients, who might be suffering from (progressively) debilitating conditions such as Parkinson's disease and require constant checks. Similarly, in case of falls, estimating the person's level of consciousness may be extremely important to determine the best emergency rescue plan.

To this end, for each person tracked by the Kinect sensor, we generate an object model (defined in the ontology) describing their skeleton with its joints and their position in space. The actual coordinates are averaged over a window of 10 samples to reduce noise: every time a new average is computed, an event is generated to notify the updated coordinates. The coordinates are then matched to predetermined patterns in order to detect specific poses: each snapshot provides a vector of 63 features (3 coordinates and 1 certainty factor for each of the 15 joints, plus the 3 coordinates of the center of mass) which can be used for the classification. The definition of a pose may involve one or more joints, up to the whole skeleton, and can be expressed in several ways. In this work, we adopted a "semantic" [1] approach, providing definitions in the form of constraints (rules) on the joints' coordinates. To be more robust, we used fuzzy predicates (e.g. leftHand.Y is high) rather than numeric thresholds, but the infrastructure could easily support other types of classifiers (e.g. neural networks or clusterers). An illustrative sketch of such a fuzzy predicate is given below.
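As a purely illustrative example (the membership function, the span parameter and the class names are ours, not those of the actual rule base), a fuzzy predicate of the kind "leftHand.Y is high" could be evaluated in Java along these lines:

// Hypothetical fuzzy evaluation of a "hand is high" predicate: the degree grows
// linearly from 0 (hand at shoulder level) to 1 (hand at least `span` above it).
class FuzzyPosePredicates {

    /** Degree in [0,1] to which a hand is "high" with respect to the shoulder. */
    static double handIsHigh(double handY, double shoulderY, double span) {
        double delta = handY - shoulderY;        // positive if the hand is above the shoulder
        if (delta <= 0)    return 0.0;           // at or below the shoulder: not "high" at all
        if (delta >= span) return 1.0;           // well above the shoulder: fully "high"
        return delta / span;                     // intermediate degrees in between
    }

    /** Several fuzzy predicates can be combined with a t-norm (here: minimum). */
    static double bothHandsHigh(double leftDegree, double rightDegree) {
        return Math.min(leftDegree, rightDegree);
    }

    public static void main(String[] args) {
        // Example: left hand halfway between shoulder and full extension (span = 0.30 m).
        double left  = handIsHigh(1.45, 1.30, 0.30);   // = 0.5
        double right = handIsHigh(1.65, 1.30, 0.30);   // = 1.0
        System.out.printf("left=%.2f right=%.2f both=%.2f%n",
                          left, right, bothHandsHigh(left, right));
    }
}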
Regardless of the classification technique, we assume that the state of being in a given pose can be modelled using a fuzzy fluent, i.e. a fluent with a gradual degree of truth in the range [0, 1]. A pose, in fact, is usually maintained for a limited amount of time and, during that period, the actual body position may not fit the definition perfectly. In particular, we consider the maximum degree of compatibility (similarity) between the sampled positions and the pose definition over the minimal interval during which the compatibility is greater than 0 (i.e. it is not impossible that the body is assuming the considered pose). Technically, we keep a fluent for each person and pose, declipping it when the compatibility between its joints and the pose is strictly greater than 0 and clipping it as soon as it becomes 0. Given the poses and their validity intervals and degrees, denoted by the fluents, it is possible to define gestures and actions as sequences of poses and/or gestures. While we are planning to consider techniques such as (semi-)Hidden Markov Models, we currently adopt a "semantic", CEP-like approach at all levels of abstraction. Gestures and actions, then, are defined using rules involving lower-level fluents. Their recognition is enabled only if they are relevant given the current context: the event sequencing rules, in fact, are conditioned by other state fluents, whose state in turn depends on other events generated by the environment.

Usage. The user is expected to launch the application and wait until the tracking system has completed its calibration phase, which is notified by a vocal message. From this point on, until the user leaves the camera scope, the position of their joints in 3D space is continuously estimated. When the application enters a special recognition state, triggered by pressing a button or uttering the word "Help", the user has 60 seconds to execute a specific sequence of minor actions, in particular raising their left and right hand in sequence, or stating "Fine". With this example, we simulate a possible alert condition, where the system believes that something bad might have happened to the person and wants some feedback on their health and their interaction capabilities.

If the user responds with gestures, the system observes the vertical coordinates of the hands: a hand is considered raised to a degree which grows the further the hand is above the shoulder level. A partially lifted hand still allows the goal to be reached, but the final result will be less than optimal. At any given time, the recognition process considers the highest point a hand has reached so far, unless the hand is lowered below the shoulder (in which case the score is reset to 0). If both hands have been completely lifted, the user has completed their task and the alert condition is cancelled; otherwise, the system checks the current score when the timeout expires. In the worst-case scenario, at least one of the hands has not been raised at all, leading to a complete failure; in intermediate situations, at least one hand has not been lifted completely, leading to a partial score. The outcome is indicated using a color code: green for complete success, red for complete failure and shades of orange for the intermediate cases, in addition to a vocal response message. If the user answers vocally, instead, the alarm is cleared and no further action is taken. A minimal sketch of this scoring logic is given below.
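The following Java fragment is a minimal sketch of the scoring logic just described, under our own simplifying assumptions (the class name, the fuzzy "raised" degree and the color thresholds are illustrative and are not taken from the actual prototype):

// Hypothetical sketch of the acknowledgement scoring: for each hand we track the
// best "raised" degree seen so far, resetting it if the hand drops below the shoulder.
class AcknowledgementCheck {

    private double bestLeft = 0.0;
    private double bestRight = 0.0;

    /** Degree in [0,1]: 0 at shoulder level, 1 when the hand is `span` above it. */
    private static double raisedDegree(double handY, double shoulderY, double span) {
        return Math.max(0.0, Math.min(1.0, (handY - shoulderY) / span));
    }

    /** Called on every smoothed skeleton update during the 60 s recognition window. */
    void onSkeletonUpdate(double leftHandY, double leftShoulderY,
                          double rightHandY, double rightShoulderY) {
        double span = 0.30; // assumed "fully raised" distance above the shoulder, in metres
        double left  = raisedDegree(leftHandY,  leftShoulderY,  span);
        double right = raisedDegree(rightHandY, rightShoulderY, span);
        // Keep the best degree reached so far, but reset it if the hand goes below the shoulder.
        bestLeft  = (leftHandY  < leftShoulderY)  ? 0.0 : Math.max(bestLeft,  left);
        bestRight = (rightHandY < rightShoulderY) ? 0.0 : Math.max(bestRight, right);
    }

    /** True as soon as both hands have been completely lifted: the alert can be cancelled. */
    boolean completed() {
        return bestLeft >= 1.0 && bestRight >= 1.0;
    }

    /** Outcome at timeout: green = full success, red = full failure, orange shades otherwise. */
    String outcomeColor() {
        double score = Math.min(bestLeft, bestRight);
        if (score >= 1.0) return "green";
        if (score <= 0.0) return "red";
        return "orange(" + Math.round(score * 100) + "%)";
    }
}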
4 Conclusions

We have shown an example of a tightly integrated application, leveraging both quantitative and qualitative AI-based tools. Despite its simplicity, it proves the feasibility of an interactive, intelligent care system with sensor fusion and decision support capabilities.

Acknowledgments

This work has been funded by the DEIS project DEPICT.

References

1. J.K. Aggarwal and M.S. Ryoo. Human activity analysis: A review. ACM Comput. Surv., 43(3):16:1–16:43, April 2011.
2. F. Chesani, P. Mello, M. Montali, and P. Torroni. A logic-based, reactive calculus of events. Fundamenta Informaticae, 105(1):135–161, 2010.
3. P. Hájek. Metamathematics of Fuzzy Logic, volume 4 of Trends in Logic: Studia Logica Library. Kluwer Academic Publishers, Dordrecht, 1998.
4. D.C. Luckham. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise Systems. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
5. G. Meditskos and N. Bassiliades. A rule-based object-oriented OWL reasoner. IEEE Transactions on Knowledge and Data Engineering, 20(3):397–410, 2008.
6. D. Sottara, P. Mello, and M. Proctor. A configurable Rete-OO engine for reasoning with different types of imperfect information. IEEE Trans. Knowl. Data Eng., 22(11):1535–1548, 2010.