Conversational Agent for Daily Living Assessment Coaching

Aditya Gaydhani1⇤ , Raymond Finzel2 , Sheena Dufresne3 , Maria Gini1 and Serguei Pakhomov2
             1
               Department of Computer Science and Engineering, University of Minnesota
          2
            Department of Pharmaceutical Care & Health Systems, University of Minnesota
         3
           Department of Experimental and Clinical Pharmacology, University of Minnesota
                     {gaydh001, finze006, gahmx008, gini, pakh0002}@umn.edu

                                  Abstract                                                 rheumatology, among other health disciplines. It is also cen-
                                                                                           tral to several non-clinical domains including disability and
        We present preliminary work-in-progress results of                                 human services. One’s ability to perform day-to-day activi-
        a project focused on developing a conversational                                   ties independently relies on unimpaired cognitive, motor, and
        agent system to help with training certified asses-                                perceptual abilities. Significant impairment in these abilities
        sors in conducting assessments of functioning in                                   typically results in a need for assistive devices or external
        activities of daily living. To date, we have de-                                   supervision and/or assistance. In the United States, signif-
        signed a modular task-based conversational agent                                   icant public resources are dedicated to providing assistance
        system and collected hypothetical dialogue data re-                                to those in need. In Minnesota, that assistance in allocated
        quired for training system components as well as                                   based on specific needs. The certified assessors perform as-
        a knowledge base needed to generate a wide vari-                                   sessments by conducting extensive face-to-face verbal inter-
        ety of synthetic profiles of “individuals” being as-                               views with the individuals referred for services and make rec-
        sessed. One of the key components of the system is                                 ommendations for the level of support required to meet the
        the topic tracking module that determines the cur-                                 person’s needs. The interviews cover a broad range of ar-
        rent topic of the conversation. We report the results                              eas including activities of daily living (ADLs: e.g., dressing,
        of experiments with several machine learning ap-                                   toileting, bathing, mobility, etc.) and instrumental activities
        proaches to topic/domain classification. The high-                                 of daily living (IADLs: e.g., preparing meals, managing fi-
        est accuracy of 83% was achieved with a bidirec-                                   nances, etc.). One of the desired goals of these assessments is
        tional long short-term memory (BiLSTM) model                                       to determine the degree of independence to which the person
        with pre-trained GloVe embeddings. In addition                                     being assessed is able to preform ADLs and IADLs and to do
        to these results, we also discuss some of the other                                so as consistently and uniformly as possible across multiple
        challenges that we have encountered so far and po-                                 assessors. CA technology offers a potential for standardizing
        tential solutions that we are currently pursuing.                                  the training of certified assessors by simulating the interac-
                                                                                           tions between assessors and persons being assessed in a uni-
                                                                                           form and reproducible fashion.
1       Introduction                                                                          The long-term objective of our ongoing project is to de-
The use of Artificial Intelligence (AI) technology in the form                             velop a conversational agent system and infrastructure to sup-
of conversational agents (CA) has now expanded far beyond                                  port training of certified assessors in conducting the assess-
popular intelligent in-home assistants that are capable of an-                             ment of needs for social services. The purpose for develop-
swering basic questions about weather, trivia, driving direc-                              ing a conversational agent is to a) assist in shifting the mode
tions, or music selection [Sciuto et al., 2018]. For example,                              of conducting assessments from a questionnaire/survey style
despite significant barriers to its adoption in healthcare, CA                             to a more free-form conversational/narrative style, and b) to
technology (mostly rule-based) is being actively investigated                              standardize assessment outcomes across individual assessors.
as a tool to assist patients and clinicians across multiple clin-                          Towards this long-term objective, we have developed a pro-
ical contexts including diagnostic, prognostic, and treatment                              totype of the Conversational Agent for Daily Living Assess-
scenarios [Laranjo et al., 2018]. Specific to the domain of                                ment Coaching (CADLAC) that relies on a database of his-
functioning, the use of CA technology is also being investi-                               torical assessments, conducted by Minnesota Department of
gated in the context of patient care and monitoring after the                              Human Services, of ADLs and IADLS in order to generate
patient has been discharged from the hospital [Fadhil, 2018].                              synthetic profiles of individuals with varying levels of inde-
Assessment of functioning and functional status is a key tar-                              pendence and needs. In this paper, we describe the high-level
get in multiple clinical contexts such as nursing, physical                                system architecture and its components, and report the results
and occupational therapy, geriatric medicine, neurology, and                               of experiments with machine learning approaches to maxi-
                                                                                           mizing the accuracy of the domain classification component.
    ⇤
        Contact Author                                                                     We also discuss the challenges encountered during the devel-
Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).


                                                                                       8
opment of natural language understanding (NLU) and natural                            Example dialogue in the Dressing domain
language generation (NLG) components and possible solu-
tions with which we are currently experimenting.                              Assessor:      Tell me about how you get dressed after
                                                                                             you are done in the bath.
                                                                         Participant:        I can dress myself.
                                                                           Assessor:         Including putting your shoes and socks
2       Methodology                                                                          on?
                                                                         Participant:        It can be tough.
The high-level architecture of CADLAC system is shown in                   Assessor:         What about putting them on is hard?
Figure 1. We followed the traditional modular CA system                  Participant:        It is hard for me to bend that far. But I
design [Ultes et al., 2017] vs. an end-to-end design [Wen                                    take it slow and I get it done. I sit on my
et al., 2017] because the modular design is more suitable                                    lift chair while I do it.
in the current early stage of the development when large
amounts of training data needed for the end-to-end design
are not yet available. Our modular design includes stan-                   Based on these hypothetical dialogues, we have defined
dard components such as the Topic Tracker (Domain Clas-                 and are currently continuing to refine an annotation schema to
sifier), NLU and NLG modules, a Dialogue Manager con-                   manually label semantic frames and their elements that may
sisting of the dialogue state tracking and policy components.           be useful for this application. For example, we defined the
In the current early stage of the project, we have been able            following frames for the Heavy Housekeeping domain using
to generate enough data to use machine learning in order to             BRAT annotation schema format:
train some of the CA system components including the Topic
Tracker and the NLU module designed to identify user intent                                    !Housekeeping heavy
and recognize named entities needed to match the input ut-               vacuum         Place-Arg?:Home location,
terance/question to the database containing historical records                          Helper-Arg?:Person
from which we generated synthetic profiles to represent a va-                 scrub     Artifact-Arg?:Home location,
riety of levels of functioning. The remaining components in-                            Place-Arg?:Home location,
cluding the Dialogue policy are currently rule-based. This                              Device-Arg?:Instrument, Helper-Arg?:Person
architecture is implemented using the open source MindMeld                 shovel       Place-Arg?:Home location,
platform for conversational AI 1 .                                                      Helper-Arg?:Person

2.1       Data                                                             An annotation of a short hypothetical dialogue using this
                                                                        schema focused on Heavy Housekeeping is shown in Fig-
We designed a survey to collect the data required to model the          ure 2.
CA. The survey asked the assessors to recall some of their                 We also collected de-identified historical assessment data
past assessments and provide hypothetical and anonymous                 for approximately 12,000 individuals. These data comprise
examples based on verbal interactions they have had during              a mix of structured and unstructured fields. The structured
those assessments focused on specific domains of function-              fields refer to the age, gender, communication style, and abil-
ing. The survey was administered to approximately 1,700 cer-            ity level of the person being assessed corresponding to the
tified assessors. The resulting data consists of 2,900 short dia-       independence scale mentioned earlier. Unstructured fields
logues (up to 3 turns: see the example dialogue below) cover-           capture free-text notes made by assessors during assessments
ing 18 domains within ADLs and IADLs: Dressing, Groom-                  consisting of brief descriptions of the challenges, preferences,
ing, Bathing, Toileting, Incontinence Management, Heavy                 and any assistive equipment for each ADLs and IADLs do-
Housekeeping, Light Housekeeping, Laundry, Financial Ac-                main.
tivities, Mobility, Transfers, Mode of Transfer, Positioning,
Mode of Positioning, Food Consumption, Meal Preparation,                2.2     Synthetic Profiles
Meal Planning, Fine Motor Skills. Each turn consists of a               The CA is given a synthetic profile for every session of in-
question by the assessor and the response to that question              teraction. The synthetic profile gives a personality to the CA
provided by the person being assessed. Additionally, we                 by defining its characteristics such as age, gender, communi-
collected characteristics of the person being assessed such             cation style, and the degree of independence to which it can
as approximate age, gender, communication style (open vs.               perform activities for all domains within ADLs and IADLs.
closed), and the degree of independence to which they are               The synthetic profile also holds information about the chal-
able to perform activities on the following scale: a) com-              lenges, preferences, and assistive equipment used across all
pletely independent, b) requiring intermittent supervision, c)          domains. The responses of the CA are based on the underly-
requiring supervision throughout the activity, d) requiring in-         ing synthetic profile.
termittent physical assistance, e) requiring physical assistance           The characteristics of the synthetic profile, particularly the
throughout the activity, f) completely dependent.                       independence levels, need to be consistent with each other.
                                                                        For example, a person who is unable to walk independently is
                                                                        most likely unable to do housekeeping independently. To en-
    1
        https://www.mindmeld.com/                                       sure consistency, we use historical assessment data to gener-


                                                                    9
                                                      Figure 1: System architecture.


                                                                           2.4   Dialogue State Tracking
                                                                           Conversational interaction consists of dialogue states, where
                                                                           each state is responsible for generating a particular type of re-
                                                                           sponse. Dialogue state tracking refers to mapping of incom-
                                                                           ing queries to appropriate dialogue states. We use an effective
      Figure 2: Example dialogue annotated for semantic frames.            rule-based and pattern matching procedure in the CA for dia-
                                                                           logue state tracking. The rules defined by this procedure rely
                                                                           on the domain, intent, or entities identified for an incoming
ate synthetic profiles. At every session of interaction, a record          query, as well as profile characteristics such as communica-
is randomly sampled from the historical data and the fields of             tion style. A dialogue state is determined by a combination
the synthetic profile are populated using this record.                     of these attributes.
                                                                              One of the challenges in modeling the CA is handling
2.3     Natural Language Understanding                                     generic follow-up questions because such questions refer to
The NLU module of the CA consists of domain classification,                the previous utterances of the conversation. We create a sep-
intent classification, and named entity recognition. The do-               arate domain for generic follow-up questions using the asses-
main classifier or topic tracker determines the target domain              sor’s questions from the 2nd and 3rd turn of the dialogues in
for an input query. It performs a first-pass categorization of             the data. Whenever the system classifies an incoming query
the incoming query and assigns it to one of the pre-defined                as a generic follow-up question, the domain of the previ-
domains. Each domain can have one or more intents that                     ous turn is carried over to the current turn. Moreover, if the
specify the task that the user wants to accomplish. The in-                follow-up question does not consist of any entities of its own,
tent classifier identifies such intents for an input query. In this        then the entities from the previous turn are also carried over.
case, the input query is the question asked by the assessor to                Communication style of the person being interviewed is
the synthetic profile of the CA. The question may consist of               one of the characteristics that we incorporate in the synthetic
zero or more words or phrases, referred to as “entities”, that             profile of the CA. Profiles with closed communication style
need to be identified to generate an appropriate response. The             are intended to generate brief responses that do not reveal de-
named entity recognizer identifies such entities in the ques-              tails at the first utterance. It is important to track the questions
tion.                                                                      corresponding to such utterances so that a detailed response
                                                                           can be generated after the assessor asks follow-up questions
   One of the approaches to text classification is to use sim-
                                                                           to the CA.
ple rule-based algorithms. These algorithms detect certain
keywords in the incoming query and classify it into an appro-
priate class. However, such rule-based algorithms often have
                                                                           2.5   Natural Language Generation
limited capabilities and do not generalize well. Moreover, the             The NLG module generates responses to the input queries.
complexity of the rules increases with more variation in the               One of the common approaches used in NLG is delexi-
type of input queries, hence these approaches are not scal-                calization [Wen et al., 2015], which is the process of us-
able. In this paper, we explore more sophisticated machine                 ing placeholders to represent slots in a sentence, which are
learning and deep learning approaches to text classification.              then populated using the actual values of entities identified


                                                                      10
from the input sentences. Recent studies [Xing et al., 2017;                 Model                 Acc.    F1-Score     F1-Weighted
Cai et al., 2019] have also shown promising results using
sequence-to-sequence models for dialogue generation.                         LR                   0.797        0.773           0.793
   One of the challenges in NLG for this application is that the             SVM                  0.780        0.744           0.772
responses are based on the identified attributes from the input              Decision Tree        0.706        0.670           0.700
query such as domain, intent, and entities, as well as the char-             Random Forest        0.710        0.669           0.699
acteristics of the synthetic profile. Our current approach relies            BiLSTM               0.808        0.780           0.806
on using the unstructured text of the assessor notes contained               BiLSTM + GloVe       0.830        0.801           0.827
in the historical database to generate responses to assessor
questions that would match the topic and intent of the ques-                          Table 1: Domain Classification Results
tion and also would provide information consistent with the
selected synthetic profile. For example, the first question in           that has capabilities of learning long-term dependencies. It is
“Example dialogue for the Dressing domain” described above               widely used in sequential learning problems like language.
would be categorized as belonging to Dressing with the intent            The model architecture is shown in Figure 3. In the net-
to elicit challenges that the person experiences in this domain.         work, we used 20% spatial and recurrent dropout regulariza-
In this case, the question would be mapped to a specific syn-            tion [Srivastava et al., 2014] to prevent overfitting. We set
thetic profile in which the synthetic “person” is marked as              batch size to 64, and used ADAM [Kingma and Ba, 2015]
independent in this ADL. The database entry for this profile             optimizer and categorical cross-entropy loss.
would also contain assessor notes regarding challenges with
                                                                         Feature Extraction. The baseline models use n-gram fea-
dressing that may say “Able to dress on her own.” The chal-
                                                                         tures that are extracted from the data corpus. In particular,
lenge for the NLG module is to “translate” this note into a
                                                                         we extract uni-gram, bi-gram, and tri-gram features. In re-
natural language response such as “I can dress myself.” In
                                                                         cent years, distributed word representations [Mikolov et al.,
order to address this challenge we are currently experiment-
                                                                         2013], or word embeddings, have shown impressive perfor-
ing with sequence-to-sequence machine translation modeling
                                                                         mance in various natural language processing tasks. In this
trained on manually generated data. This work is currently in
                                                                         paper, we make use of pre-trained GloVe embeddings [Pen-
progress.
                                                                         nington et al., 2014] for our BiLSTM model. We also exper-
                                                                         iment with training the embeddings from scratch using the
3     Experiments                                                        dataset.
3.1    Domain Classification                                             Results. The results of the models are shown in Table 1.
Classification of text data using machine learning involves              We use accuracy, f1-score, and weighted f1-score as our per-
two tasks: transforming text into a numerical representa-                formance metrics for evaluation. The results show that the
tion and feeding this representation into a classifier. We               BiLSTM models outperform the traditional machine learn-
perform comparative analysis of various classification algo-             ing baseline models. Moreover, using pre-trained GloVe em-
rithms ranging from traditional machine learning approaches              beddings further improves the result of the BiLSTM model
to modern neural networks for this task. We also explore tech-           with embeddings trained from scratch. The BiLSTM model
niques for extracting features from text.                                achieves 80.1% f1-score, 82.7% weighted f1-score, and 83%
Data Preparation. The dataset used to train the models                   accuracy over a fairly balanced data. Analyzing the confusion
was created from the data collected from the surveys. It com-            matrix shows some level of misclassification among similar
prises queries belonging to domains that fall under the cate-            domains, e.g., planning meals and preparing meals, due to the
gories of personal cares, household management, eating and               similar nature of dialogues between these classes. Merging
meal preparation, and movement. We divided the conversa-                 such domains increases the accuracy of this model to 94.2%.
tion snippets from the surveys into turns and labeled them
according to their domain. We also added data for small talk,            4    Discussion
in particular, a collection of phrases for greeting, interrogat-         In this paper we presented some of the preliminary results
ing and ending the conversation. We created a separate do-               of a work-in-progress project aimed at developing a conver-
main for generic follow-up questions. The resultant dataset              sational agent system for training certified assessors in con-
consists of 20 domains and 2885 examples, and it is fairly               ducting assessments for human services eligibility. The fo-
balanced across the domains. 20% of this data was randomly               cus of the experiments reported here was on topic tracking
sampled for testing and the remaining 80% was used for train-            for which we experimented with a range of machine learn-
ing the models.                                                          ing approaches to text categorization. So far, we found that
Models. We included Logistic Regression, Support Vector                  the best accuracy for domain categorization was achieved
Machines (SVM), Decision Trees, and Random Forests mod-                  with a bidirectional LSTM model with pre-trained GloVe em-
els as baselines. We tuned the hyperparameter settings of                beddings. Our modeling results also show that some of the
these models by performing an exhaustive grid search us-                 distinctions between functional categories (e.g., Positioning,
ing 5-fold cross-validation. We compared the performance of              Mobility, Transfers, Mode of Positioning, and Mode of Trans-
these models with a Bidirectional Long Short-Term Memory                 fer) are not supported by the currently available data and may
(BiLSTM) neural network. LSTM [Hochreiter and Schmid-                    require further data collection efforts in order to increase the
huber, 1997] is a type of Recurrent Neural Network (RNN)                 accuracy of the topic tracker at a higher granularity.


                                                                    11
                                          Figure 3: An illustration of the BiLSTM architecture


   Our experiments with topic classification have a number               References
of limitations. First, the data used for training and evalu-
ation were collected as part of a survey in which assessors              [Cai et al., 2019] Hengyi Cai, Hongshen Chen, Cheng
were asked to recall prior assessments, resulting in realistic             Zhang, Yonghao Song, Xiaofang Zhao, and Dawei Yin.
but still hypothetical dialogues. The models developed on                  Adaptive parameterization for neural dialogue genera-
these data would need to be further evaluated on actual in-                tion. In Proceedings of the 2019 Conference on Empir-
terviews between assessors and the persons being assessed,                 ical Methods in Natural Language Processing and the
which is something we plan to do in future steps. Another                  9th International Joint Conference on Natural Language
potential limitation of the current CA system as a whole in                Processing (EMNLP-IJCNLP), pages 1793–1802, Hong
the context of training certified assessors is that information            Kong, China, November 2019. Association for Computa-
gained by assessors through verbal interactions is only a part             tional Linguistics.
of what drives their assessments. Much of the additional in-             [Fadhil, 2018] Ahmed Fadhil. Beyond patient monitor-
formation comes from non-verbal cues such as direct obser-                  ing: Conversational agents role in telemedicine and
vation of the individual being assessed and the observation                 healthcare support for home-living elderly individuals.
of the environment. Currently, our system is not designed as                arXiv:1803.06000 [cs.CY], 2018.
an embodied CA and does not provide non-verbal informa-
tion about the physical environment in which the assessment              [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and
is taking place.                                                           Jürgen Schmidhuber. Long short-term memory. Neural
   Our next most immediate steps include training an intent                Computation, 9(8):1735–1780, 1997.
classifier to recognize intents for all domains. Additionally,           [Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba.
we intend to experiment with transformer based models to                   ADAM: A method for stochastic optimization. In 3rd
train a named entity recognizer to identify entities in the input          International Conference for Learning Representations,
queries, and use sequence-to-sequence models for the NLG                   2015.
component. We are also working on a strategy to provide
feedback to the assessors regarding their conduct of the in-             [Laranjo et al., 2018] Liliana Laranjo, Adam G Dunn,
terviews and consistency of their assessments with synthetic                Huong Ly Tong, Ahmet Baki Kocaballi, Jessica Chen, Ra-
profiles.                                                                   bia Bashir, Didi Surian, Blanca Gallego, Farah Magrabi,
                                                                            Annie Y S Lau, and Enrico Coiera. Conversational agents
                                                                            in healthcare: a systematic review. Journal of the Amer-
                                                                            ican Medical Informatics Association, 25(9):1248–1258,
Acknowledgements                                                            July 2018.
                                                                         [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai
The work on this project was supported by funding from the                 Chen, Greg Corrado, and Jeffrey Dean. Distributed rep-
Minnesota Department of Human Services. We would like                      resentations of words and phrases and their composition-
to thank the people at DSD and MNIT for help with project                  ality. In Proceedings of the 26th International Confer-
specifications, gathering of historical data, and expert guid-             ence on Neural Information Processing Systems - Volume
ance on domain-specific aspects of the project.                            2, NIPS’13, page 3111–3119, 2013.


                                                                    12
[Pennington et al., 2014] Jeffrey Pennington,         Richard
   Socher, and Christopher Manning. GloVe: Global vectors
   for word representation. In Proceedings of the 2014
   Conference on Empirical Methods in Natural Language
   Processing (EMNLP), pages 1532–1543, Doha, Qatar,
   October 2014. Association for Computational Linguistics.
[Sciuto et al., 2018] Alex Sciuto, Arnita Saini, Jodi Forlizzi,
   and Jason Hong. “Hey Alexa, what’s up?”: A mixed-
   methods studies of in-home conversational agent usage.
   In DIS ’18: Proceedings of the 2018 Designing Interac-
   tive Systems Conference, pages 857–868, June 2018.
[Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton,
   Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi-
   nov. Dropout: A simple way to prevent neural networks
   from overfitting. Journal of Machine Learning Research,
   15(56):1929–1958, 2014.
[Ultes et al., 2017] Stefan Ultes, Lina M. Rojas Bara-
   hona, Pei-Hao Su, David Vandyke, Dongho Kim, Iñigo
   Casanueva, Paweł Budzianowski, Nikola Mrkšić, Tsung-
   Hsien Wen, Milica Gasic, and Steve Young. PyDial: A
   multi-domain statistical dialogue system toolkit. In Pro-
   ceedings of ACL 2017, System Demonstrations, pages 73–
   78, Vancouver, Canada, July 2017. Association for Com-
   putational Linguistics.
[Wen et al., 2015] Tsung-Hsien Wen, Milica Gasic, Nikola
   Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young.
   Semantically conditioned LSTM-based natural language
   generation for spoken dialogue systems. Proceedings of
   the 2015 Conference on Empirical Methods in Natural
   Language Processing, 2015.
[Wen et al., 2017] Tsung-Hsien Wen, David Vandyke,
   Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona,
   Pei-Hao Su, Stefan Ultes, and Steve Young. A network-
   based end-to-end trainable task-oriented dialogue system.
   In Proceedings of the 15th Conference of the European
   Chapter of the Association for Computational Linguistics:
   Volume 1, Long Papers, pages 438–449, Valencia, Spain,
   April 2017. Association for Computational Linguistics.
[Xing et al., 2017] Chen Xing, Wei Wu, Yu Wu, Jie Liu,
   Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic
   aware neural response generation. In Proceedings of the
   Thirty-First AAAI Conference on Artificial Intelligence,
   AAAI’17, page 3351–3357. AAAI Press, 2017.


                                                              13