Exploiting Time Series Data for Task Prediction and Diagnosis in an Intelligent Guidance System

Hayley Borck, Steven Johnston, Mary Southern, and Mark Boddy
Adventium Labs, 111 Third Ave South Suite 100, Minneapolis, MN 55401

Abstract. Time series data has been exploited for use with Case Based Reasoning (CBR) in many applications. We present a novel application of CBR that combines intelligent tutoring using Augmented Reality (AR) with prediction. The MonitAR system, presented in this paper, is intended for use as an intelligent guidance system for astronauts conducting complex procedures during periods of communication time delay or blackout from Earth. Our approach takes advantage of the relational nature of time series data to detect the task that the user is completing and to diagnose the issue when the user is about to make a mistake.

1 Introduction

Astronauts are trained in a myriad of different procedures ranging from maintenance to emergency medicine. To reduce mistakes during stressful situations, guidance during such procedures is advantageous. However, longer spaceflight missions bring delays or blackouts in communication with Earth. Such situations would benefit from a training and guidance system that is able to direct the user during the procedure and guide them away from potential mistakes before one is committed. Throughout this paper, a procedure refers to a NASA procedure for a complex task, and a plan is the representation of a procedure in a planning language. A task is the smallest step in a plan. Further, a case in MonitAR comprises a problem, represented by a series of time step features, and a solution, which is the aforementioned task.

MonitAR is a training and guidance system that predicts the task the user is currently completing. If MonitAR predicts the user will make a mistake, it guides the user back to the correct task using visual cues. The system monitors a user's activity while they complete a task, collecting data at set intervals. The data is collected through the camera of an Augmented Reality (AR) smart glasses device. Features are created using the positions of objects in relation to other objects within the view of the AR device. Over time, the relationships between the objects indicate the task the user is completing. MonitAR uses partial data to predict the task the user is completing so as to eliminate potential mistakes. A diagnosis of how the user is completing the task incorrectly, for example completing tasks in the wrong order, is used to create visual cues. The sequential nature of time series data is exploited during diagnosis, which aids in determining the mistake. Learning is employed when the user indicates to the system that they are completing the task in a way previously unknown to the system.

The MonitAR system aims to aid astronauts while they complete procedures in which they are not expert, or when they are performing under stress and would normally be given precise instruction by experts on Earth. This can be generalized to aid in any situation where the user is not an expert in the procedure. The remainder of the paper is organized as follows.
Section 2 discusses related work, Section 3 gives an overview of the MonitAR system architecture, and Section 4 describes how time series data is represented in our system. Section 5 describes the prediction component of the system, Section 6 details the diagnosis component, and finally the experiment and conclusion are discussed in Sections 7 and 8.

2 Related Work

Prediction and recognition of users and opponents has been a well-researched area in recent years. Less well studied, however, is prediction of the task the user is completing. The intelligent tutoring community has made great strides in modeling the user and determining how best to help them through a task. The AR community has been doing guided procedures for some time in numerous domains. We believe our system, which combines intelligent tutoring and prediction using AR, is the first of its kind. Given the current research and state of technology, this area seems likely to flourish in the coming years.

2.1 Prediction and Recognition

Prediction and recognition of human activity using visual data is an active area of research. Our approach to prediction leans on this existing body of work. Pei et al. [8] and Auslander et al. [4] in particular have created recognition systems from visual data. Our problem is easier than the usual plan or intent recognition setting because in our domain we know which task the user should be completing. Our new contribution is prediction coupled with diagnosis using CBR, delivered through AR, in the low- to no-communication space domain.

Synnaeve and Bessiere [9] presented a Bayesian programming approach to predicting an opponent's opening strategy in real-time strategy (RTS) games. We show in our experiment that a CBR approach to predicting the current task is better than a straightforward Bayesian approach in our domain. The prediction work most similar to our own is that of White et al. [14]. They describe CAMP-BDI, a Capability-Aware, Maintaining Plans approach built on a Beliefs, Desires, and Intentions architecture that preempts anticipated failures. Their work, however, focuses on failure due to outside issues rather than on issues relating to the user's own confusion, stress, or misunderstanding of the current task. Antwarg et al. [3] showed that adding a user profiling component to an intent prediction system increases the accuracy of the prediction. We believe a user profiling component would aid our system as well, and we intend to implement one in future work.

2.2 Training and Tutoring

The intelligent tutoring community has published a substantial body of work on guiding users towards better learning. Early systems such as those presented by Anderson et al. [2] have shown the usefulness of AI in training. The MonitAR system does not fall under the standard definition of an intelligent tutoring system (ITS), because we do not provide tutoring or training services; rather, we guide the user when a mistake is made. The guidance our system provides, however, uses principles similar to those of an ITS. The visual cues MonitAR presents to guide the user back to the correct task have qualities similar to those of a traditional constraint-based modeling ITS, as defined by Ma et al. [7]. The visual cues we provide qualify as 'a feedback message that, when the solution state fails the satisfaction condition, advises the student of the error...'. Additionally, in their survey Ma et al.
[7] found that ITSs were associated with positive effects across a wide range of domains, from the humanities to the sciences, indicating the potential of the MonitAR system across a wide range of procedures and domains. Graesser et al. [6] created the AutoTutor system, which helps college students learn computer literacy through a conversational tutor. This suggests that ITSs can be helpful even to users with a high level of education. Aleven et al. [1] suggest that users are reluctant to seek help and that users at a medium level of mastery benefit from hints given without their asking. Admittedly, our target audience, astronauts, is at a higher level of education and mastery than ITSs are generally geared toward. We still believe an intelligent guidance system such as MonitAR will be beneficial, and we plan to conduct user studies in the future to corroborate this hypothesis.

2.3 Augmented Reality

A survey by Betancourt et al. [5] describes the state of the art (as of 2015) in first-person activity recognition through video, paying special attention to AR and wearable devices. They categorize activity recognition through AR devices into object-based and motion-based approaches, which our system combines to both predict and diagnose errors. Additionally, they highlight that none of the surveyed approaches work in a closed-loop fashion by continuously learning from users, which we attempt to address. Others have used AR devices for training and guidance. Wacker et al. [11] presented an AR guidance system for image-guided needle biopsy. Similarly, Vosburgh et al. [10] use AR for guidance during laparoscopic surgery using CT or MRI images. AR guidance for maintenance and assembly tasks has been done by Webel et al. [12]. The MonitAR system aims to generalize to many different types of procedures encompassing the previously mentioned domains. The system of Westerfield et al. [13] incorporates intelligent tutoring techniques with AR, similar to MonitAR; our system, however, goes one step further by predicting mistakes and alerting the user.

3 System Overview

The full MonitAR system can be seen in Fig. 1. Procedures are taken from the International Space Station (ISS). While the user executes a task in a plan, the AR Interface component collects features from the camera using an object recognition library. At each time step, features are passed to the CBR Task Prediction component. During training, this component collects features until the task is complete and then writes the case to file. During execution, a partial case is compiled and retrieval is executed at each time step. If a case found during retrieval is over a prediction similarity threshold, the partial case, the predicted case, and a case determined to represent the current task (the 'correct case') are given to the Diagnoser component. The Diagnoser merges the partial and predicted cases and calculates the difference between this merged case and the correct case using delta cases, which are discussed in later sections. This difference is used to create visual cues within the AR Interface.

Fig. 1. Architecture of the MonitAR system: ISS procedures and plans populate the case base; the AR Interface (object recognition, object tracking, feature gathering) feeds the CBR Task Prediction component, whose output is passed to the Diagnoser (difference calculator, visual explanation), which produces visual cues.
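To make the flow in Fig. 1 concrete, the sketch below shows one way the per-time-step loop described above could be organized. It is a minimal sketch only: the objects and method names (ar_interface, predictor, diagnoser, time_steps, retrieve, diagnose, show_cues) are hypothetical stand-ins for the boxes in Fig. 1, not MonitAR's actual API. The two thresholds correspond to the retrieval and prediction thresholds t_sim and t_pred discussed in Sections 5 and 7.

```python
# A minimal sketch of the per-time-step loop behind Fig. 1.
# All names below are hypothetical stand-ins for MonitAR's components.

T_SIM = 0.4    # retrieval similarity threshold (t_sim in Sections 5 and 7)
T_PRED = 0.6   # prediction threshold (t_pred in Sections 5 and 7)

def monitor_task(ar_interface, predictor, diagnoser, correct_case):
    """Gather features, predict the current task, and diagnose mistakes."""
    partial_case = []                                   # time step features so far
    for time_step_feature in ar_interface.time_steps():
        partial_case.append(time_step_feature)

        # Retrieve candidate (similarity, case) pairs above t_sim.
        candidates = predictor.retrieve(partial_case, threshold=T_SIM)
        if not candidates:
            continue

        # Promote the best candidate to a prediction if it clears t_pred.
        best_sim, predicted_case = max(candidates, key=lambda pair: pair[0])
        if best_sim < T_PRED:
            continue

        # If the predicted task differs from the task the user should be
        # completing, diagnose the difference and display visual cues.
        if predicted_case.task != correct_case.task:
            cues = diagnoser.diagnose(partial_case, predicted_case, correct_case)
            ar_interface.show_cues(cues)
```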
4 Representation of Time Series Data

Time series data is represented in the MonitAR system in two ways. During prediction of a task, the data is represented as distance relationships between recognized objects in view. At each time step, the distance from each object in view to each other object in view, averaged over the time step length, is recorded as a distance feature. A time step feature is comprised of multiple distance features. A short time step length of 500 ms was chosen in order to collect enough data to quickly and correctly predict during relatively short tasks. The position of the hand at the beginning and end of the time step is also annotated and added to the time step feature. The annotation of hand position enables the system to reason about how the hand is moving within the length of the time step. A sample case using partial information, indicating the hand reaching towards obj1 and away from obj2 over two time steps, can be seen in Fig. 2.

TimeStepFeature @ time t:   HandPosBeg: [0,0,0], HandPosEnd: [25,25,0]; DistanceFeature Hand:Obj1, Dist: 100; DistanceFeature Hand:Obj2, Dist: 50
TimeStepFeature @ time t+1: HandPosBeg: [25,25,0], HandPosEnd: [50,50,0]; DistanceFeature Hand:Obj1, Dist: 75; DistanceFeature Hand:Obj2, Dist: 75

Fig. 2. A sample partial case showing two time step features

During diagnosis of the predicted mistake, a delta case is created by merging adjacent time step features. To merge time step features, the distance in each distance feature of the time step feature at time t is subtracted from the matching (containing the same objects) distance feature of the time step feature at time t+1. Using the case in Fig. 2 as an example, the merged time step feature for the delta case would have two distance features. The distance feature containing the hand and obj1 would have a distance of -25. The distance feature containing the hand and obj2 would have a distance of 25. Delta cases represent the movement of recognized objects between time steps and provide a way of determining the relationship between objects over time. See Section 6 for more detail on delta cases and the diagnosis process.

To handle faulty sensors we employ filters using heuristics based on the way the physical world works. When an object that was previously recognized is not recognized in the current time step, distance features are added to the time step feature at the same position as previously seen. In some instances, for example when a hand grabs an object and occludes it, this heuristic fails. To combat this, when a missing object is recognized in a different location than it was previously, and near an object that can move it, such as a hand, distance features are added to each time step feature where that object was missing, using the same distance relations as the object that presumably moved it. Lastly, the user can introduce camera jitter due to slight movements even when standing 'still'. We ran a short experiment and found that a typical user will sway up to 15 mm, so we accounted for this possible distance change in the similarity function. These input filters solve the most significant issues found with the camera, occlusion, and the object recognition library.
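A minimal sketch of the data structures just described, and of delta-case construction, is given below. The class names and fields mirror Fig. 2 but are our own illustrative choices rather than MonitAR's actual implementation; the example at the bottom reproduces the -25 and +25 deltas from the Fig. 2 case.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative data structures mirroring Fig. 2. The class names and fields
# are our own choices for this sketch, not MonitAR's actual implementation.

@dataclass
class TimeStepFeature:
    hand_pos_beg: Tuple[float, float, float]
    hand_pos_end: Tuple[float, float, float]
    # Distance features keyed by object pair, e.g. ("Hand", "Obj1") -> 100.
    distances: Dict[Tuple[str, str], float]

@dataclass
class Case:
    time_steps: List[TimeStepFeature]
    task: str  # the solution: the task being completed

def delta_time_step(earlier: TimeStepFeature,
                    later: TimeStepFeature) -> Dict[Tuple[str, str], float]:
    """Merge two adjacent time step features: the distance at time t is
    subtracted from the matching distance at time t+1, so a negative delta
    means the two objects moved closer together."""
    shared = earlier.distances.keys() & later.distances.keys()
    return {pair: later.distances[pair] - earlier.distances[pair] for pair in shared}

def delta_case(case: Case) -> List[Dict[Tuple[str, str], float]]:
    """Build a delta case by merging each adjacent pair of time step features."""
    return [delta_time_step(a, b) for a, b in zip(case.time_steps, case.time_steps[1:])]

if __name__ == "__main__":
    # The two time step features of the partial case in Fig. 2.
    t0 = TimeStepFeature((0, 0, 0), (25, 25, 0),
                         {("Hand", "Obj1"): 100, ("Hand", "Obj2"): 50})
    t1 = TimeStepFeature((25, 25, 0), (50, 50, 0),
                         {("Hand", "Obj1"): 75, ("Hand", "Obj2"): 75})
    print(delta_case(Case([t0, t1], task="example")))
    # Deltas: Hand:Obj1 -> -25 (approaching), Hand:Obj2 -> +25 (receding).
```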
5 Prediction

During execution of a task, time step features are created by the AR system and handed to the CBR Task Prediction component. After a time step feature has gone through the input filters, a set of candidate cases is retrieved from the case base using the similarity function. The similarity function is comprised of two parts. The first part is a weighted sum of distances between objects. The second part consists of the distance from the current and projected hand positions of the partial case q to the current and next hand positions of the case base case c. The sequential nature of the time steps enables a projection of hand positions, which gives the similarity function more information to use, allowing a quicker prediction. These two parts are weighted and added together to create the similarity score. The full equation is shown in Eqn. 1. For the weighted sum of distances we choose to weight time step features using a linearly ascending weight, γ, to model that later time steps better indicate what the user is trying to accomplish. In the following equation, m is the number of time step features in the case base case c, n is the number of matching distance features between c and q, Q_df and C_df indicate the matching distance features in cases q and c, and finally c_hf and n_hf are the current and next hand positions.

\mathrm{sim}(q, c) = \alpha \cdot \frac{\sum_{m} \gamma \left( \frac{\sum_{n} \bigl( 1 - (Q_{df} - C_{df}) \bigr)}{n} \right)}{m} + \beta \bigl( \zeta(c_{hf}) + \eta(n_{hf}) \bigr)    (1)

The top l cases with a similarity over a threshold t_sim are brought back from the case base. If any of the l cases has a similarity greater than a prediction threshold t_pred, the top case is handed to the Diagnoser component as a predicted case.
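The sketch below gives one plausible reading of Eqn. 1 in Python, reusing the TimeStepFeature structure from the earlier sketch. The weights alpha and beta, the linearly ascending gamma, and the hand_term stand-in for ζ and η are illustrative assumptions; the paper does not specify the exact weight values or the precise form of the hand-position terms.

```python
import math

def hand_term(pos_a, pos_b):
    """Illustrative stand-in for the hand-position terms zeta and eta in Eqn. 1:
    closer positions contribute a larger value."""
    return 1.0 / (1.0 + math.dist(pos_a, pos_b))

def similarity(query_steps, case_steps, q_hand_now, q_hand_projected,
               c_hand_now, c_hand_next, alpha=0.7, beta=0.3):
    """One plausible reading of Eqn. 1 (an assumption, not MonitAR's exact code).

    query_steps / case_steps are sequences of TimeStepFeature objects from the
    earlier sketch. The first term averages, over the m time step features of
    the case base case, a gamma-weighted agreement between matching distance
    features; the second rewards agreement between the partial case's current
    and projected hand positions and the case's current and next positions."""
    m = len(case_steps)
    weighted_sum = 0.0
    for t, (q_step, c_step) in enumerate(zip(query_steps, case_steps), start=1):
        gamma = t / m                                  # linearly ascending weight
        pairs = q_step.distances.keys() & c_step.distances.keys()
        n = len(pairs)
        if n == 0:
            continue
        agreement = sum(1.0 - (q_step.distances[p] - c_step.distances[p])
                        for p in pairs) / n
        weighted_sum += gamma * agreement
    distance_part = weighted_sum / m

    hand_part = hand_term(q_hand_now, c_hand_now) + hand_term(q_hand_projected, c_hand_next)
    return alpha * distance_part + beta * hand_part
```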
6 Diagnosis

The Diagnoser component is responsible for determining the difference between the predicted task and the task the user should be completing. Visual cues are created during diagnosis that show the user, via the AR Interface, the deviation from the correct task. The Diagnoser conducts the reuse phase of the CBR workflow, adapting the predicted case to the current situation. To do this we first merge the predicted case and the partial case to create a complete case, by taking time steps t_1 through t_n of the partial case and adding the remaining time steps t_(n+1) through t_m from the predicted case. The cases are merged in order to give the Diagnoser the most grounded information possible, rather than relying solely on the predicted case to be similar to reality. Delta cases are created from this merged predicted case and from the correct case for the current task.

From the delta cases we are able to determine the target object of each case, by which we mean the object that the hand is moving towards. Two assumptions are made here: first, that there is exactly one hand represented in the case; and second, that whatever the hand is doing is central to the task. In future iterations of this system we will generalize this to a generic type of object. To determine the target object, the sum of the distance-feature deltas between the hand and each other object is found. The sum showing the greatest decrease in distance represents the object that the hand traveled towards fastest, and is therefore the target of that delta case. Visual cues are created by finding the target objects of the correct case and of the merged predicted case. In the instance that the target objects are different, a visual cue of a highlighted green box is drawn over the target object of the correct case, while a red box is drawn over the target object of the merged predicted case (Fig. 3).

The difference routine reuses the time step portion of the similarity function with the delta cases. Since a delta case represents the movement of objects between time steps, the similarity function here indicates the similarity of that movement. This is in contrast to prediction, where the calculation between the partial and case base cases represents the similarity between the locations of static objects. The time steps whose similarity falls below a difference threshold (i.e., those with a high difference) are compiled to create the visual cues; the largest difference is used to create the final visual cue.

Fig. 3. MonitAR indicated the vacuum (top right) as the correct target object within the task and the crayon box (middle left) as the incorrect target object by highlighting the objects in green and red respectively. The hand (middle bottom) is highlighted in gray to indicate it is a recognized object.

We have created a series of distinct image targets (2D images) to label objects such as the hand or vacuum shown in Fig. 3 to make the object recognition task easier. Development of improved object recognition algorithms is not within the scope of this project.
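Continuing the earlier sketches, the snippet below illustrates the target-object determination and cue selection just described. It assumes the delta-case representation from the Section 4 sketch (a list of per-step dictionaries of distance deltas); the function names and the (color, object) cue format are illustrative, not MonitAR's actual API.

```python
from typing import Dict, List, Tuple

DeltaStep = Dict[Tuple[str, str], float]   # (objA, objB) -> change in distance

def target_object(delta_steps: List[DeltaStep], hand: str = "Hand") -> str:
    """Return the object the hand moved towards fastest: the object whose
    summed hand-object distance deltas show the greatest decrease."""
    totals: Dict[str, float] = {}
    for step in delta_steps:
        for (a, b), delta in step.items():
            if hand in (a, b):
                other = b if a == hand else a
                totals[other] = totals.get(other, 0.0) + delta
    # Most negative total = largest decrease in distance = fastest approach.
    return min(totals, key=totals.get)

def visual_cues(correct_delta: List[DeltaStep],
                predicted_delta: List[DeltaStep]) -> List[Tuple[str, str]]:
    """If the correct case and the merged predicted case point at different
    target objects, highlight the correct target green and the predicted
    (incorrect) target red, as in Fig. 3."""
    correct = target_object(correct_delta)
    predicted = target_object(predicted_delta)
    if correct == predicted:
        return []
    return [("green", correct), ("red", predicted)]
```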
7 Experiment

The results of our prototype MonitAR system are encouraging. For the experiment we tested whether the MonitAR system could correctly predict the task the user was completing. Our initial tests were compared against a naive Bayes task prediction. The CBR system was trained on four plans containing a cumulative thirteen tasks. One hundred cases were created during training, encompassing two users (with different handedness) completing the plans. The experiment was run using k-fold cross validation with k = 10.

The naive Bayes prediction is calculated during the retrieval phase, replacing Eqn. 1 with Eqn. 2. In Eqn. 2, c_τ is a case with task τ, and q represents the current partial case. The conditional p(q | c_τ) is the probability that a case c encompasses the partial case q and has a solution of task τ. p(q) is the probability that a case encompasses the partial case q. Finally, p(c_τ) is the probability that a case c has a solution of task τ. The top ten cases are returned during the retrieval step. The case with the highest probability, if it is over the prediction threshold t_pred, is the prediction. Both methods used the same prediction threshold t_pred.

p(c_\tau \mid q) = \frac{p(q \mid c_\tau) \cdot p(c_\tau)}{p(q)}    (2)

There was a significant difference in the percentage of correct predictions for MonitAR using the similarity function of Eqn. 1 and the Bayesian approach of Eqn. 2 (p < 0.0001 using a paired t-test). MonitAR gave on average 148 more predictions than its Bayesian counterpart, with an average correct prediction rate of 81% when t_pred = 0.6 and t_sim = 0.4. Even though the propensity to report false positives is higher using MonitAR due to the sheer number of predictions made, the gains over Bayesian retrieval, which had an average correct prediction rate of just 43%, are significant. The average earliest correct prediction was better using Bayesian retrieval: 1.04 seconds versus 1.2 seconds for Bayesian retrieval and MonitAR respectively. The experiment was rerun with t_pred = 0.8 and t_sim = 0.4; the results again showed MonitAR correctly predicting the task at a significantly higher percentage. Future experiments will be run to determine the best values for t_pred and t_sim.

Looking deeper into the results, we can see that certain tasks were easier to predict than others (Fig. 4). In particular, T3 did very poorly; this can be explained by the nature of the task, which asks the user to remove a battery from a power tool. Doing so means the user's hand is reaching toward both the battery and the power tool for most of the case. Instances such as this will be addressed in the future with the addition of more fine-grained features. Task T2 also did poorly using either method, which we believe is due to the very short length of the task. We surmise that using the Bayesian probability as a confidence score in conjunction with Eqn. 1 will improve the overall correctness and timeliness of the prediction. This will be explored in future work.

Fig. 4. Percent correct prediction by task (T1–T13) for MonitAR and Bayesian retrieval using t_pred = 0.6 and t_sim = 0.4

8 Conclusion

This paper presented early work on the MonitAR system for task prediction and mistake diagnosis using visual cues. The system is a novel application of CBR to monitor a user's activity and give visual feedback upon the prediction of a deviation from a plan. Our system leans on previous work in plan prediction and recognition and has wide applications within training and procedure guidance domains. The MonitAR system has shown promising results in prediction time and correctness when compared to other approaches. Future work will encompass a full experimental study to determine the best thresholds and weights to employ, as well as the addition of more fine-grained features.

9 Acknowledgments

This material is based upon work supported by the National Aeronautics and Space Administration under Contract Number NNX16CJ22P. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration. Copyright, 2016, Adventium Labs - All rights reserved.

References

1. Vincent Aleven, Ido Roll, Bruce M McLaren, and Kenneth R Koedinger. Help helps, but only so much: Research on help seeking with intelligent tutoring systems. International Journal of Artificial Intelligence in Education, 26(1):205–223, 2016.
2. John R Anderson, C Franklin Boyle, and Brian J Reiser. Intelligent tutoring systems. Science (Washington), 228(4698):456–462, 1985.
3. Liat Antwarg, Lior Rokach, and Bracha Shapira. Attribute-driven hidden Markov model trees for intention prediction. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6):1103–1119, 2012.
4. Bryan Auslander, Kalyan Moy Gupta, and David W Aha. Maritime threat detection using plan recognition. In 2012 IEEE Conference on Technologies for Homeland Security (HST), pages 249–254. IEEE, 2012.
5. Alejandro Betancourt, Pietro Morerio, Carlo S Regazzoni, and Matthias Rauterberg. The evolution of first person vision methods: A survey. 2015.
6. Arthur C Graesser, Kurt VanLehn, Carolyn P Rosé, Pamela W Jordan, and Derek Harter. Intelligent tutoring systems with conversational dialogue. AI Magazine, 22(4):39, 2001.
7. Wenting Ma, Olusola O Adesope, John C Nesbit, and Qing Liu. Intelligent tutoring systems and learning outcomes: A meta-analysis. Journal of Educational Psychology, 106(4):901, 2014.
8. Mingtao Pei, Zhangzhang Si, Benjamin Z Yao, and Song-Chun Zhu. Learning and parsing video events with goal and intent prediction. Computer Vision and Image Understanding, 117(10):1369–1383, 2013.
9. Gabriel Synnaeve and Pierre Bessiere. A Bayesian model for opening prediction in RTS games with application to StarCraft. In 2011 IEEE Conference on Computational Intelligence and Games (CIG'11), pages 281–288. IEEE, 2011.
10. Kirby G Vosburgh and R San Jose Estepar. Natural orifice transluminal endoscopic surgery (NOTES): An opportunity for augmented reality guidance. Studies in Health Technology and Informatics, 125:485, 2006.
11.
Frank K Wacker, Sebastian Vogt, Ali Khamene, John A Jesberger, Sherif G Nour, Daniel R Elgort, Frank Sauer, Jeffrey L Duerk, and Jonathan S Lewin. An augmented reality system for MR image-guided needle biopsy: Initial results in a swine model. Radiology, 238(2):497–504, 2006.
12. Sabine Webel, Uli Bockholt, Timo Engelke, Nirit Gavish, Manuel Olbrich, and Carsten Preusche. An augmented reality training platform for assembly and maintenance skills. Robotics and Autonomous Systems, 61(4):398–403, 2013.
13. Giles Westerfield, Antonija Mitrovic, and Mark Billinghurst. Intelligent augmented reality training for assembly tasks. In Artificial Intelligence in Education, pages 542–551. Springer, 2013.
14. Alan White, Austin Tate, and Michael Rovatsos. CAMP-BDI: A pre-emptive approach for plan execution robustness in multiagent systems. In PRIMA 2015: Principles and Practice of Multi-Agent Systems, pages 65–84. Springer, 2015.