Conversational Agent for Daily Living Assessment Coaching Aditya Gaydhani1⇤ , Raymond Finzel2 , Sheena Dufresne3 , Maria Gini1 and Serguei Pakhomov2 1 Department of Computer Science and Engineering, University of Minnesota 2 Department of Pharmaceutical Care & Health Systems, University of Minnesota 3 Department of Experimental and Clinical Pharmacology, University of Minnesota {gaydh001, finze006, gahmx008, gini, pakh0002}@umn.edu Abstract rheumatology, among other health disciplines. It is also cen- tral to several non-clinical domains including disability and We present preliminary work-in-progress results of human services. One’s ability to perform day-to-day activi- a project focused on developing a conversational ties independently relies on unimpaired cognitive, motor, and agent system to help with training certified asses- perceptual abilities. Significant impairment in these abilities sors in conducting assessments of functioning in typically results in a need for assistive devices or external activities of daily living. To date, we have de- supervision and/or assistance. In the United States, signif- signed a modular task-based conversational agent icant public resources are dedicated to providing assistance system and collected hypothetical dialogue data re- to those in need. In Minnesota, that assistance in allocated quired for training system components as well as based on specific needs. The certified assessors perform as- a knowledge base needed to generate a wide vari- sessments by conducting extensive face-to-face verbal inter- ety of synthetic profiles of “individuals” being as- views with the individuals referred for services and make rec- sessed. One of the key components of the system is ommendations for the level of support required to meet the the topic tracking module that determines the cur- person’s needs. The interviews cover a broad range of ar- rent topic of the conversation. We report the results eas including activities of daily living (ADLs: e.g., dressing, of experiments with several machine learning ap- toileting, bathing, mobility, etc.) and instrumental activities proaches to topic/domain classification. The high- of daily living (IADLs: e.g., preparing meals, managing fi- est accuracy of 83% was achieved with a bidirec- nances, etc.). One of the desired goals of these assessments is tional long short-term memory (BiLSTM) model to determine the degree of independence to which the person with pre-trained GloVe embeddings. In addition being assessed is able to preform ADLs and IADLs and to do to these results, we also discuss some of the other so as consistently and uniformly as possible across multiple challenges that we have encountered so far and po- assessors. CA technology offers a potential for standardizing tential solutions that we are currently pursuing. the training of certified assessors by simulating the interac- tions between assessors and persons being assessed in a uni- form and reproducible fashion. 1 Introduction The long-term objective of our ongoing project is to de- The use of Artificial Intelligence (AI) technology in the form velop a conversational agent system and infrastructure to sup- of conversational agents (CA) has now expanded far beyond port training of certified assessors in conducting the assess- popular intelligent in-home assistants that are capable of an- ment of needs for social services. The purpose for develop- swering basic questions about weather, trivia, driving direc- ing a conversational agent is to a) assist in shifting the mode tions, or music selection [Sciuto et al., 2018]. For example, of conducting assessments from a questionnaire/survey style despite significant barriers to its adoption in healthcare, CA to a more free-form conversational/narrative style, and b) to technology (mostly rule-based) is being actively investigated standardize assessment outcomes across individual assessors. as a tool to assist patients and clinicians across multiple clin- Towards this long-term objective, we have developed a pro- ical contexts including diagnostic, prognostic, and treatment totype of the Conversational Agent for Daily Living Assess- scenarios [Laranjo et al., 2018]. Specific to the domain of ment Coaching (CADLAC) that relies on a database of his- functioning, the use of CA technology is also being investi- torical assessments, conducted by Minnesota Department of gated in the context of patient care and monitoring after the Human Services, of ADLs and IADLS in order to generate patient has been discharged from the hospital [Fadhil, 2018]. synthetic profiles of individuals with varying levels of inde- Assessment of functioning and functional status is a key tar- pendence and needs. In this paper, we describe the high-level get in multiple clinical contexts such as nursing, physical system architecture and its components, and report the results and occupational therapy, geriatric medicine, neurology, and of experiments with machine learning approaches to maxi- mizing the accuracy of the domain classification component. ⇤ Contact Author We also discuss the challenges encountered during the devel- Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 8 opment of natural language understanding (NLU) and natural Example dialogue in the Dressing domain language generation (NLG) components and possible solu- tions with which we are currently experimenting. Assessor: Tell me about how you get dressed after you are done in the bath. Participant: I can dress myself. Assessor: Including putting your shoes and socks 2 Methodology on? Participant: It can be tough. The high-level architecture of CADLAC system is shown in Assessor: What about putting them on is hard? Figure 1. We followed the traditional modular CA system Participant: It is hard for me to bend that far. But I design [Ultes et al., 2017] vs. an end-to-end design [Wen take it slow and I get it done. I sit on my et al., 2017] because the modular design is more suitable lift chair while I do it. in the current early stage of the development when large amounts of training data needed for the end-to-end design are not yet available. Our modular design includes stan- Based on these hypothetical dialogues, we have defined dard components such as the Topic Tracker (Domain Clas- and are currently continuing to refine an annotation schema to sifier), NLU and NLG modules, a Dialogue Manager con- manually label semantic frames and their elements that may sisting of the dialogue state tracking and policy components. be useful for this application. For example, we defined the In the current early stage of the project, we have been able following frames for the Heavy Housekeeping domain using to generate enough data to use machine learning in order to BRAT annotation schema format: train some of the CA system components including the Topic Tracker and the NLU module designed to identify user intent !Housekeeping heavy and recognize named entities needed to match the input ut- vacuum Place-Arg?:Home location, terance/question to the database containing historical records Helper-Arg?:Person from which we generated synthetic profiles to represent a va- scrub Artifact-Arg?:Home location, riety of levels of functioning. The remaining components in- Place-Arg?:Home location, cluding the Dialogue policy are currently rule-based. This Device-Arg?:Instrument, Helper-Arg?:Person architecture is implemented using the open source MindMeld shovel Place-Arg?:Home location, platform for conversational AI 1 . Helper-Arg?:Person 2.1 Data An annotation of a short hypothetical dialogue using this schema focused on Heavy Housekeeping is shown in Fig- We designed a survey to collect the data required to model the ure 2. CA. The survey asked the assessors to recall some of their We also collected de-identified historical assessment data past assessments and provide hypothetical and anonymous for approximately 12,000 individuals. These data comprise examples based on verbal interactions they have had during a mix of structured and unstructured fields. The structured those assessments focused on specific domains of function- fields refer to the age, gender, communication style, and abil- ing. The survey was administered to approximately 1,700 cer- ity level of the person being assessed corresponding to the tified assessors. The resulting data consists of 2,900 short dia- independence scale mentioned earlier. Unstructured fields logues (up to 3 turns: see the example dialogue below) cover- capture free-text notes made by assessors during assessments ing 18 domains within ADLs and IADLs: Dressing, Groom- consisting of brief descriptions of the challenges, preferences, ing, Bathing, Toileting, Incontinence Management, Heavy and any assistive equipment for each ADLs and IADLs do- Housekeeping, Light Housekeeping, Laundry, Financial Ac- main. tivities, Mobility, Transfers, Mode of Transfer, Positioning, Mode of Positioning, Food Consumption, Meal Preparation, 2.2 Synthetic Profiles Meal Planning, Fine Motor Skills. Each turn consists of a The CA is given a synthetic profile for every session of in- question by the assessor and the response to that question teraction. The synthetic profile gives a personality to the CA provided by the person being assessed. Additionally, we by defining its characteristics such as age, gender, communi- collected characteristics of the person being assessed such cation style, and the degree of independence to which it can as approximate age, gender, communication style (open vs. perform activities for all domains within ADLs and IADLs. closed), and the degree of independence to which they are The synthetic profile also holds information about the chal- able to perform activities on the following scale: a) com- lenges, preferences, and assistive equipment used across all pletely independent, b) requiring intermittent supervision, c) domains. The responses of the CA are based on the underly- requiring supervision throughout the activity, d) requiring in- ing synthetic profile. termittent physical assistance, e) requiring physical assistance The characteristics of the synthetic profile, particularly the throughout the activity, f) completely dependent. independence levels, need to be consistent with each other. For example, a person who is unable to walk independently is most likely unable to do housekeeping independently. To en- 1 https://www.mindmeld.com/ sure consistency, we use historical assessment data to gener- 9 Figure 1: System architecture. 2.4 Dialogue State Tracking Conversational interaction consists of dialogue states, where each state is responsible for generating a particular type of re- sponse. Dialogue state tracking refers to mapping of incom- ing queries to appropriate dialogue states. We use an effective Figure 2: Example dialogue annotated for semantic frames. rule-based and pattern matching procedure in the CA for dia- logue state tracking. The rules defined by this procedure rely on the domain, intent, or entities identified for an incoming ate synthetic profiles. At every session of interaction, a record query, as well as profile characteristics such as communica- is randomly sampled from the historical data and the fields of tion style. A dialogue state is determined by a combination the synthetic profile are populated using this record. of these attributes. One of the challenges in modeling the CA is handling 2.3 Natural Language Understanding generic follow-up questions because such questions refer to The NLU module of the CA consists of domain classification, the previous utterances of the conversation. We create a sep- intent classification, and named entity recognition. The do- arate domain for generic follow-up questions using the asses- main classifier or topic tracker determines the target domain sor’s questions from the 2nd and 3rd turn of the dialogues in for an input query. It performs a first-pass categorization of the data. Whenever the system classifies an incoming query the incoming query and assigns it to one of the pre-defined as a generic follow-up question, the domain of the previ- domains. Each domain can have one or more intents that ous turn is carried over to the current turn. Moreover, if the specify the task that the user wants to accomplish. The in- follow-up question does not consist of any entities of its own, tent classifier identifies such intents for an input query. In this then the entities from the previous turn are also carried over. case, the input query is the question asked by the assessor to Communication style of the person being interviewed is the synthetic profile of the CA. The question may consist of one of the characteristics that we incorporate in the synthetic zero or more words or phrases, referred to as “entities”, that profile of the CA. Profiles with closed communication style need to be identified to generate an appropriate response. The are intended to generate brief responses that do not reveal de- named entity recognizer identifies such entities in the ques- tails at the first utterance. It is important to track the questions tion. corresponding to such utterances so that a detailed response can be generated after the assessor asks follow-up questions One of the approaches to text classification is to use sim- to the CA. ple rule-based algorithms. These algorithms detect certain keywords in the incoming query and classify it into an appro- priate class. However, such rule-based algorithms often have 2.5 Natural Language Generation limited capabilities and do not generalize well. Moreover, the The NLG module generates responses to the input queries. complexity of the rules increases with more variation in the One of the common approaches used in NLG is delexi- type of input queries, hence these approaches are not scal- calization [Wen et al., 2015], which is the process of us- able. In this paper, we explore more sophisticated machine ing placeholders to represent slots in a sentence, which are learning and deep learning approaches to text classification. then populated using the actual values of entities identified 10 from the input sentences. Recent studies [Xing et al., 2017; Model Acc. F1-Score F1-Weighted Cai et al., 2019] have also shown promising results using sequence-to-sequence models for dialogue generation. LR 0.797 0.773 0.793 One of the challenges in NLG for this application is that the SVM 0.780 0.744 0.772 responses are based on the identified attributes from the input Decision Tree 0.706 0.670 0.700 query such as domain, intent, and entities, as well as the char- Random Forest 0.710 0.669 0.699 acteristics of the synthetic profile. Our current approach relies BiLSTM 0.808 0.780 0.806 on using the unstructured text of the assessor notes contained BiLSTM + GloVe 0.830 0.801 0.827 in the historical database to generate responses to assessor questions that would match the topic and intent of the ques- Table 1: Domain Classification Results tion and also would provide information consistent with the selected synthetic profile. For example, the first question in that has capabilities of learning long-term dependencies. It is “Example dialogue for the Dressing domain” described above widely used in sequential learning problems like language. would be categorized as belonging to Dressing with the intent The model architecture is shown in Figure 3. In the net- to elicit challenges that the person experiences in this domain. work, we used 20% spatial and recurrent dropout regulariza- In this case, the question would be mapped to a specific syn- tion [Srivastava et al., 2014] to prevent overfitting. We set thetic profile in which the synthetic “person” is marked as batch size to 64, and used ADAM [Kingma and Ba, 2015] independent in this ADL. The database entry for this profile optimizer and categorical cross-entropy loss. would also contain assessor notes regarding challenges with Feature Extraction. The baseline models use n-gram fea- dressing that may say “Able to dress on her own.” The chal- tures that are extracted from the data corpus. In particular, lenge for the NLG module is to “translate” this note into a we extract uni-gram, bi-gram, and tri-gram features. In re- natural language response such as “I can dress myself.” In cent years, distributed word representations [Mikolov et al., order to address this challenge we are currently experiment- 2013], or word embeddings, have shown impressive perfor- ing with sequence-to-sequence machine translation modeling mance in various natural language processing tasks. In this trained on manually generated data. This work is currently in paper, we make use of pre-trained GloVe embeddings [Pen- progress. nington et al., 2014] for our BiLSTM model. We also exper- iment with training the embeddings from scratch using the 3 Experiments dataset. 3.1 Domain Classification Results. The results of the models are shown in Table 1. Classification of text data using machine learning involves We use accuracy, f1-score, and weighted f1-score as our per- two tasks: transforming text into a numerical representa- formance metrics for evaluation. The results show that the tion and feeding this representation into a classifier. We BiLSTM models outperform the traditional machine learn- perform comparative analysis of various classification algo- ing baseline models. Moreover, using pre-trained GloVe em- rithms ranging from traditional machine learning approaches beddings further improves the result of the BiLSTM model to modern neural networks for this task. We also explore tech- with embeddings trained from scratch. The BiLSTM model niques for extracting features from text. achieves 80.1% f1-score, 82.7% weighted f1-score, and 83% Data Preparation. The dataset used to train the models accuracy over a fairly balanced data. Analyzing the confusion was created from the data collected from the surveys. It com- matrix shows some level of misclassification among similar prises queries belonging to domains that fall under the cate- domains, e.g., planning meals and preparing meals, due to the gories of personal cares, household management, eating and similar nature of dialogues between these classes. Merging meal preparation, and movement. We divided the conversa- such domains increases the accuracy of this model to 94.2%. tion snippets from the surveys into turns and labeled them according to their domain. We also added data for small talk, 4 Discussion in particular, a collection of phrases for greeting, interrogat- In this paper we presented some of the preliminary results ing and ending the conversation. We created a separate do- of a work-in-progress project aimed at developing a conver- main for generic follow-up questions. The resultant dataset sational agent system for training certified assessors in con- consists of 20 domains and 2885 examples, and it is fairly ducting assessments for human services eligibility. The fo- balanced across the domains. 20% of this data was randomly cus of the experiments reported here was on topic tracking sampled for testing and the remaining 80% was used for train- for which we experimented with a range of machine learn- ing the models. ing approaches to text categorization. So far, we found that Models. We included Logistic Regression, Support Vector the best accuracy for domain categorization was achieved Machines (SVM), Decision Trees, and Random Forests mod- with a bidirectional LSTM model with pre-trained GloVe em- els as baselines. We tuned the hyperparameter settings of beddings. Our modeling results also show that some of the these models by performing an exhaustive grid search us- distinctions between functional categories (e.g., Positioning, ing 5-fold cross-validation. We compared the performance of Mobility, Transfers, Mode of Positioning, and Mode of Trans- these models with a Bidirectional Long Short-Term Memory fer) are not supported by the currently available data and may (BiLSTM) neural network. LSTM [Hochreiter and Schmid- require further data collection efforts in order to increase the huber, 1997] is a type of Recurrent Neural Network (RNN) accuracy of the topic tracker at a higher granularity. 11 Figure 3: An illustration of the BiLSTM architecture Our experiments with topic classification have a number References of limitations. First, the data used for training and evalu- ation were collected as part of a survey in which assessors [Cai et al., 2019] Hengyi Cai, Hongshen Chen, Cheng were asked to recall prior assessments, resulting in realistic Zhang, Yonghao Song, Xiaofang Zhao, and Dawei Yin. but still hypothetical dialogues. The models developed on Adaptive parameterization for neural dialogue genera- these data would need to be further evaluated on actual in- tion. In Proceedings of the 2019 Conference on Empir- terviews between assessors and the persons being assessed, ical Methods in Natural Language Processing and the which is something we plan to do in future steps. Another 9th International Joint Conference on Natural Language potential limitation of the current CA system as a whole in Processing (EMNLP-IJCNLP), pages 1793–1802, Hong the context of training certified assessors is that information Kong, China, November 2019. Association for Computa- gained by assessors through verbal interactions is only a part tional Linguistics. of what drives their assessments. Much of the additional in- [Fadhil, 2018] Ahmed Fadhil. Beyond patient monitor- formation comes from non-verbal cues such as direct obser- ing: Conversational agents role in telemedicine and vation of the individual being assessed and the observation healthcare support for home-living elderly individuals. of the environment. Currently, our system is not designed as arXiv:1803.06000 [cs.CY], 2018. an embodied CA and does not provide non-verbal informa- tion about the physical environment in which the assessment [Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and is taking place. Jürgen Schmidhuber. Long short-term memory. Neural Our next most immediate steps include training an intent Computation, 9(8):1735–1780, 1997. classifier to recognize intents for all domains. Additionally, [Kingma and Ba, 2015] Diederik P. Kingma and Jimmy Ba. we intend to experiment with transformer based models to ADAM: A method for stochastic optimization. In 3rd train a named entity recognizer to identify entities in the input International Conference for Learning Representations, queries, and use sequence-to-sequence models for the NLG 2015. component. We are also working on a strategy to provide feedback to the assessors regarding their conduct of the in- [Laranjo et al., 2018] Liliana Laranjo, Adam G Dunn, terviews and consistency of their assessments with synthetic Huong Ly Tong, Ahmet Baki Kocaballi, Jessica Chen, Ra- profiles. bia Bashir, Didi Surian, Blanca Gallego, Farah Magrabi, Annie Y S Lau, and Enrico Coiera. Conversational agents in healthcare: a systematic review. Journal of the Amer- ican Medical Informatics Association, 25(9):1248–1258, Acknowledgements July 2018. [Mikolov et al., 2013] Tomas Mikolov, Ilya Sutskever, Kai The work on this project was supported by funding from the Chen, Greg Corrado, and Jeffrey Dean. Distributed rep- Minnesota Department of Human Services. We would like resentations of words and phrases and their composition- to thank the people at DSD and MNIT for help with project ality. In Proceedings of the 26th International Confer- specifications, gathering of historical data, and expert guid- ence on Neural Information Processing Systems - Volume ance on domain-specific aspects of the project. 2, NIPS’13, page 3111–3119, 2013. 12 [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar, October 2014. Association for Computational Linguistics. [Sciuto et al., 2018] Alex Sciuto, Arnita Saini, Jodi Forlizzi, and Jason Hong. “Hey Alexa, what’s up?”: A mixed- methods studies of in-home conversational agent usage. In DIS ’18: Proceedings of the 2018 Designing Interac- tive Systems Conference, pages 857–868, June 2018. [Srivastava et al., 2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdi- nov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(56):1929–1958, 2014. [Ultes et al., 2017] Stefan Ultes, Lina M. Rojas Bara- hona, Pei-Hao Su, David Vandyke, Dongho Kim, Iñigo Casanueva, Paweł Budzianowski, Nikola Mrkšić, Tsung- Hsien Wen, Milica Gasic, and Steve Young. PyDial: A multi-domain statistical dialogue system toolkit. In Pro- ceedings of ACL 2017, System Demonstrations, pages 73– 78, Vancouver, Canada, July 2017. Association for Com- putational Linguistics. [Wen et al., 2015] Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015. [Wen et al., 2017] Tsung-Hsien Wen, David Vandyke, Nikola Mrkšić, Milica Gašić, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, and Steve Young. A network- based end-to-end trainable task-oriented dialogue system. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 438–449, Valencia, Spain, April 2017. Association for Computational Linguistics. [Xing et al., 2017] Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, and Wei-Ying Ma. Topic aware neural response generation. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, AAAI’17, page 3351–3357. AAAI Press, 2017. 13