Evaluation of Conversational Agents for Aerospace Domain

Ying-Hsang LIU∗ (yingliu@sdu.dk), University of Southern Denmark, Kolding, Denmark
Alexandre ARNOLD†, Gérard DUPONT∗, Catherine KOBUS∗, François LANCELOT∗ (.@airbus.com), AIRBUS AI Research, Toulouse, France
∗ Also with The Australian National University. † Authors in alphabetical order.
"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

ABSTRACT
The use of conversational agents within the aerospace industry offers quick and concise answers to complex situations. The aerospace domain is characterized by products and systems that are built over decades of engineering to reach high levels of performance within complex environments. Current development in conversational agents can leverage the latest retrieval and language models to refine the system's question-answering capabilities. However, evaluating the added value of such a system in the context of industrial applications such as pilots in a cockpit is complex. This paper describes how a conversational agent is implemented and evaluated, with particular references to how state-of-the-art technologies can be adapted to the domain specificity. Preliminary findings of a controlled user experiment suggest that user perception of the usefulness of the system in completing the search task and of the relevance of the system's responses to the topic are good predictors of user search performance. User satisfaction with the system's responses may not be a good predictor of user search performance.

CCS CONCEPTS
• Human-centered computing → Laboratory experiments; Natural language interfaces; • Information systems → Search interfaces.

KEYWORDS
Enterprise search; Conversational search; Aerospace industry; Conversational agent; Question answering; Evaluation protocol
1 INTRODUCTION
The aerospace industry relies on massive collections of documents covering system descriptions, manuals or procedures. Most of these are subjected to dedicated regulation and/or have to be used in the context of safety-of-life scenarios such as cockpit procedures for pilots. A user looking for specific information in response to a given situation in this large corpus often spends a significant amount of precious time navigating through the documents. Even experienced pilots who are familiar with the structure of the documents can sometimes have difficulties in finding known items in a constrained time.

The dedicated structure of the information helps to quickly target a specific piece of information. The search system helps in any other case. However, such systems come with their limitations. Most of the time, it is the user's responsibility to adapt their search needs using specific keywords and/or syntax, known as the difficulty of articulating information needs [30, 51]. For simple queries that have a ready-made answer in the document, this is not always a difficult problem. However, for the understanding of complex procedures or troubleshooting system errors, it can lead to multiple queries and thus a cumbersome experience for the user.

Various types of systems are associated with conversational agents. A recent survey of different types of dialogue systems has identified three main types: task-oriented dialogue systems, conversational agents, and interactive question-answering [14]. From the perspectives of human-computer interaction (HCI), user experience (UX), and information retrieval (IR), issues associated with the voice-based user interface, such as recognition errors, user experience, and voice queries, have gained traction recently [20, 32, 33].

In this study, our "Smart Librarian" (SL) mixed a task-oriented dialogue system with a conversational agent and an interactive question-answering component (with/without a voice-based interface). Specifically, the assistant is envisioned as a task-oriented system in a restricted domain, with mixed system/user initiatives and a multimodal interface to support situation awareness in a cockpit. Therefore, the evaluation objective is to assess the benefit of smart search and conversational search for cockpit documentation.

One of the primary objectives of conversational search systems is to enable the provision of information services in interactive styles, similar to human-human interactions in information-seeking conversations. User interfaces for conversational search systems are ideally similar to natural dialogue interactions [21] in which users' questions can be clarified during conversations. This thread of research has received much attention from the research communities of natural language processing, information retrieval, and human-computer interaction, to name a few [2, 31, 34, 55].

This paper describes how a conversational agent is implemented and evaluated, with particular references to how state-of-the-art technologies can be adapted to the domain specificity in aerospace. We propose a user-centered approach to the design and evaluation of conversational search user interfaces (SUIs), termed Smart Librarian, to support the pilot in cockpits. The skills of assistants are intended to translate into the requirements of conversational search systems to support the tasks performed by the pilot. Our preliminary findings suggest that there were significant interaction effects between the task difficulty and the types of system; the Smart Librarian system performed well for difficult search tasks. Future directions for research and development of conversational agents are suggested.
2 RELATED WORK

2.1 Aviation Cockpits and Controls
In the aviation cockpits and controls environment, research has focused on the consideration of cognitive strategies and cognitive processing in a stressful environment for the design of automated support systems. For example, the role of cognitive processes inherent in the tasks, and specific considerations of cognitive strategies adopted by the pilots for the design of automated support systems, i.e., the automated cockpit, have been emphasized [10]. In a study of how procedures such as the Quick Reference Handbook (QRH) are used in a cockpit for emergencies, the results showed that "pilots employed strategies that interleaved a range of resources, often consulting fragments of the QRH checklists rather than following them from start to finish" [10, p. 147].

From the perspectives of human-computer interaction (HCI), an observational and interview study of cockpit activities for tangible design noted that "Pilots mention the value of having tools separated from the aircraft systems. Speaking about the physical QRH (Quick Reference Handbook), that you can hold in your hands, a pilot valued it in case of degraded contexts, when there is no longer control available." [27, p. 663]. And the usability of an information visualization system in flight to improve aviation safety has been evaluated in a flight simulator setting [3].

Overall, these studies suggest that systems designed for aviation cockpits and controls need to consider the issues regarding the role of cognitive processes inherent in tasks and cognitive strategies employed by pilots. The role of context in designing user interfaces and usability is also emphasized.

2.2 Information Seeking Conversation
Informed by theories of human-human communication and linguistics, IIR (interactive information retrieval) research has attempted to identify the purposes and communicative functions of elicitations (i.e., questions to request information) in information-seeking conversations. Users' elicitation behavior was found to be affected by individual differences, such as status, age, and experience, and interacted with situational variables, such as interaction time and the number of utterances [52]. Further studies have developed the concept of elicitation styles, characterized by linguistic forms, utterance purposes, and communicative functions, with particular references to user satisfaction [53, 54]. However, these findings have not been directly applied to the design of conversational search systems.

Recent studies have focused on developing system design guidelines from user studies. For example, in a study that observed people's interactions in a laboratory setting, researchers compared human-human interactions with well-established search models to inform the design of spoken conversational search systems [44]. And the system requirements for intelligent conversational assistants for improving user experience have been explored [50].

Following the paradigm of computers as social actors, a taxonomy of social cues of conversational agents based on interpersonal communication theories was built [18]. Using ethnomethodology and conversational analysis, a study of the role of conversational interfaces in everyday life revealed voice user interface design implications for the request and response design in embedded social interactions [35]. Together with IIR research, this thread of research contributes to our current discussions regarding the theories that can be borrowed from other disciplines and/or re-conceptualized to design conversational search systems.
2.3 Conversational Search System
System requirements for conversational search systems were defined as "a system for retrieving information that permits a mixed-initiative back and forth between a user and agent, where the agent's actions are chosen in response to a model of current user needs within the current conversation, using both short- and long-term knowledge of the user" [37, p. 160]. An evaluation framework for conversational agents in the aerospace domain was proposed [4]. Research has also been conducted to identify conversational styles for building computational models at scale for speech-based conversational agents [43]. These studies illustrate the existing methods used to explore conversational search systems from the perspective of system design.

From technical perspectives, research on conversational search systems has focused on identifying user intent in information-seeking conversations, designing user interfaces for different modes of interaction, and the provision of clarification questions. For example, structural features (i.e., the position of an utterance in a dialogue) contribute the most to the identification of user intents, using neural classifiers [36]. Work on generating clarification questions from community question-answering websites formulated the task as a noun phrase ranking problem [9]. Neural models were used to generate clarification questions by considering sequences of purposes of interaction [1]. A formal model of information-seeking dialogues that consists of the query, request, feedback, and answer for identifying the frequency of sequence patterns was proposed [48].

Overall, research and practice in conversational search systems have received much attention recently, but the usefulness of these systems has not been rigorously evaluated in the system design process from user perspectives.

2.4 Evaluation of Conversational Search System
A recent approach to the evaluation of conversational search systems, such as chatbots, has intended to enhance user experience and thus selects user satisfaction as the main evaluation criterion for success. For example, the Alexa Prize Socialbot Grand Challenge (https://developer.amazon.com/alexaprize) was designed as a research competition to advance our understanding of human interactions with socialbots, with the support of large amounts of user data from Amazon.com. This evaluation approach was derived from computer science research and an AI perspective. Within the NLP community, one of the key distinctions of evaluation approaches is the intrinsic and extrinsic evaluation of machine outputs [40]. The intrinsic evaluation focuses on the internal outputs from the system, whereas the extrinsic evaluation is concerned with how the use of the system contributes to external outputs, such as task completion. In the IR community, the evaluation efforts have focused on the creation of test collections to compare system performance, using appropriate evaluation metrics for different types of question-answering tasks. In this study, we take a holistic approach to understanding user experience and user performance to bridge the gap between system-centric evaluation (i.e., automatic metrics) and human evaluation, using crowdsourcing platforms.
3 USER EXPERIMENT

3.1 Evaluation Objective
The alignment between system design requirements and evaluation objectives is important for a user-centered approach to system design and evaluation. Our evaluation objective is to determine the relationship between the search tasks in typical flight operation scenarios and the perceived usefulness of the system for task completion.

3.2 Research Hypothesis
Research on user information seeking suggests that people's levels of domain expertise and experience, work roles, tasks, and procedures affect their information-seeking strategies and perceived usefulness of information resources [17, 19, 22, 28]. In the context of a safety-critical environment, information behavior research reveals that "Overly conditioned information behaviors, which would correspondingly limit methodical information behaviors, can lead crews to miss crucial steps in the process of projecting the future state of the aircraft and suitably planning ahead" [49, p. 1567]. Therefore, our proposed research hypotheses are as follows:
H1. Types of search systems and user perceptions will affect user search performance.
H2. Perceived search task difficulty and user perceptions will affect user search performance.

3.3 Research Design
In this study, since we focus on the design and evaluation of conversational search systems from user perspectives, we are concerned with user interactions with a prototype system in a laboratory setting. This approach has been adopted because we can 1) determine the relationship among the variables in a laboratory environment and 2) transfer the findings into specific system design decisions. This approach is scientifically rigorous when the experiment is conducted properly. However, it is very resource-intensive and time-consuming and requires different sets of expertise. The results may also be affected considerably by the variability of individuals [e.g. 41, 42, 51]. Specific examples include a flight simulator experiment with pilots co-designing the system [3] and a turbulent touch design experiment with students [12].

The experiment protocol has been approved by the Toulouse University research ethics committee. The participant was presented with an informed consent form to sign before the experiment started.
3.4 Experiment Setting
The experiment was conducted in the environment of a flight simulator (ENAC BIGONE A320/A330 cockpit simulator) within the ACHIL platform. The setting was intended to create an environment that can elicit the information needs of participants, as suggested in simulated work task situations [8].

The subjects were given access to a tablet - similar to the ones used by the pilot in flight - to access the Flight Crew Operating Manual (FCOM) through one of the two systems: Smart Librarian (SL) and electronic flight bag (FB). This source document incorporates aircraft manufacturer guidance on how to use the systems onboard the aircraft for enhanced operational safety, as well as for increased efficiency. Overall, it consists of several PDF documents totalling several thousand pages.

3.5 Search Task
In designing search tasks we have considered the complexity of tasks from the perspective of search as learning, by classifying the search scenarios as easy and complex [46]. User-perceived search task complexity after using the system [28] was assessed by a questionnaire.

Specifically, the easy task involves fact-finding, while the hard task requires a higher level of understanding of the problems and/or some cognitive reasoning for answering the questions. In easy search tasks, the problem description contains relevant words that can be used to craft the "best question" pointing to a unique procedure (or document unit) that contains the solution. By contrast, in hard search tasks, the problem description does not contain any words matching the "best question", so the subject needs to reformulate the problem with new words or questions. Moreover, the user needs to explore at least two document units to find the answer, and several successive questions are needed to identify the solution (see Table 1).

For each task, the ground truth has been defined by a set of domain experts by pointing to the exact expected answer(s) and the exact procedure(s) in which these can be found in the FCOM document. The very narrow scope of the aeronautical domain and the particular form of the documents allowed us to ensure that the answers are unique for each task and that their location in the documentation is unique.

Table 1: Difficulty level of search tasks, with description and key aspects (FL means 'Flight Level')

Label    | Level | Title                      | Initial Trigger/Message          | Flight Condition
Tutorial | Easy  | Captain's duty             | N/A                              | cruise
Task A   | Easy  | Cockpit windshield cracked | Bird strike/window crack         | cruise FL370
Task B   | Easy  | Bomb on board              | N/A                              | cruise
Task C   | Hard  | ALL ENGINE FAILURE         | ALL ENGINE FAILURE over the sea  | cruise FL350, flying over the ocean, >70NM from coast
Task D   | Hard  | Air too hot in the cockpit | Air too hot                      | cruise FL370
Bonus    | Hard  | Engine fire over Alps      | ENG 1 FIRE                       | climbing over the mountain in FL350

Notes: The first task familiarizes the participant with the experiment setup, whereas the final task introduces the participant to the setup of a flight simulator.

3.6 Arrangement of Experimental Conditions
Tasks were presented to subjects following a traditional Graeco-Latin square design [24, 25]. This study is a 2 × 2 design with two types of search systems (SL and FB) and two types of search tasks (easy and hard), to minimize the effect of presentation order of treatments [25].
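A minimal sketch of one way to generate such a counterbalanced presentation order is given below. This is an illustration only, not the schedule actually used in the experiment: the paper reports a Graeco-Latin square design, whereas the sketch uses a simple cyclic Latin-square rotation over the four system × difficulty conditions, and the participant indices are hypothetical.

```python
from itertools import product

# The four conditions of the 2 x 2 design: system (SL/FB) crossed with task difficulty.
CONDITIONS = list(product(["SL", "FB"], ["easy", "hard"]))

def presentation_order(participant_index):
    """Cyclic Latin-square rotation: across each block of four participants,
    every condition appears in every serial position exactly once."""
    k = participant_index % len(CONDITIONS)
    return CONDITIONS[k:] + CONDITIONS[:k]

for p in range(4):
    print(p, presentation_order(p))
```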
3.7 Metrics
Search performance: The tasks defined are pure goal-oriented search tasks: the user is asked to find the exact answer and to locate the procedure used. Classic precision and recall metrics used in information retrieval do not apply in this context, and the score used can be seen as a Boolean success metric based on the expert ground truth (one could note it is similar to precision@1, but measured based on the user's response). Thus performance was evaluated through [0;1] scores for each step in the tasks (finding the right procedure, finding the right answer to the situation in the procedure, finding the next procedure, etc.). Since hard tasks had more steps, the step scores were averaged into a single task score in [0;1] for each task (1 being the maximum score). This scoring strategy relies primarily on the task, which can be understood from the user's point of view as a fact-finding problem. The classic precision and recall measures of the system are only taken into account through the lens of the user's selection in this interactive experimentation. This can be seen as a quantitative metric with 1 data point per user per search task.

User's perception of the problem and system: Other metrics were collected through post-search questionnaires after each task and a final exit questionnaire. They were designed using 5-point Likert scales to collect the subject's perception of task difficulty and familiarity as well as system relevance and usefulness. These questionnaires were submitted by each user after each task and each system usage.
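A minimal sketch of the step-averaged task scoring described above follows; the step names are illustrative, and the actual grading was done against the expert ground truth defined in Section 3.5.

```python
def task_score(step_successes):
    """Average Boolean per-step outcomes into a single task score in [0, 1]."""
    return sum(step_successes) / len(step_successes)

# Hypothetical hard task with three graded steps, two of which the participant got right.
steps = {
    "found the right procedure": True,
    "found the right answer in the procedure": True,
    "found the next procedure": False,
}
print(round(task_score(list(steps.values())), 2))  # 0.67
```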
3.8 Prototype System
In our user experiment, we have developed a prototype Smart Librarian system to address the evaluation objective of determining the relationship between the types of search tasks and the perceived usefulness of search. The system was built around three main components:
• A dialog engine (based on the RASA platform [7]) handling the conversation and identifying the user's intents;
• A search engine (based on Solr [45]) where the document collection is indexed following the BM25F relevance framework [39];
• A QA engine, based on a BERT large model [15], fine-tuned using the FARM framework (https://github.com/deepset-ai/FARM). A multi-task setup was used for the fine-tuning: one task is the classical QA task (detecting the span of text) on the SQuAD 2.0 dataset [38]; the other is a classification task (i.e. whether the answer to the question is contained or not in the document extract).

On top of these, additional capabilities to process speech inputs and produce speech outputs are available as an alternative to the traditional textual input. Figure 1 offers an overview of the whole architecture.

Figure 1: Overview of the prototype architecture.

The whole system is made available through a reactive web interface enabling conversation and document exploration (see Figure 2). It was deployed in a cloud environment and made available to users through a tablet.

Figure 2: Screenshot of the prototype showing conversation on the left, search result panel in the center and document view on the right.

The reference system, electronic flight bag (FB), was the Navblue software used by many pilots in commercial flight. It is distributed on tablets and customized for each aircraft. Only the library features (for access to documents and procedures) were used in the experiment.
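A rough sketch of the retrieve-then-read flow behind the search and QA components is shown below. This is a toy illustration under stated assumptions, not the authors' implementation: the Solr core name and field names are placeholders for the non-public FCOM index, and a public SQuAD 2.0 checkpoint stands in for the in-house FARM fine-tuned BERT-large reader.

```python
import pysolr
from transformers import pipeline

# Assumed local Solr core and schema; the real FCOM index is not public.
solr = pysolr.Solr("http://localhost:8983/solr/fcom", timeout=10)

# Public SQuAD 2.0 checkpoint as a stand-in for the in-house fine-tuned reader.
reader = pipeline("question-answering",
                  model="deepset/bert-large-uncased-whole-word-masking-squad2")

def answer(question, top_k=3):
    """Retrieve candidate document units (BM25-style ranking), then read each one."""
    candidates = []
    for hit in solr.search(question, rows=top_k):
        context = hit.get("content", "")
        if isinstance(context, list):  # Solr multi-valued fields come back as lists
            context = " ".join(context)
        pred = reader(question=question, context=context,
                      handle_impossible_answer=True)  # allow "no answer in this extract"
        if pred["answer"]:
            candidates.append((pred["score"], pred["answer"], hit.get("id")))
    return max(candidates) if candidates else None

print(answer("Which procedure applies after a bird strike cracks the windshield?"))
```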
3.9 Participants
We recruited students from an aviation school, currently in their early years of training for becoming commercial aircraft pilots. These students have a good understanding of aircraft technologies and flying physics. At the time of the experiments, they had not yet followed a particular course on a specific aircraft, such as the A320.

3.10 Evaluation Protocols
We have developed realistic scenarios for our user experiment by engaging with engineers and consulting with an ergonomic expert with specialties in designing systems for pilots (see Table 1). The simulated work task situations toolkit has been followed for triggering user information needs and the evaluation of user search behavior and system performance [8]. In designing search tasks we have considered the complexity of tasks from the perspective of search as learning [46]. Several questionnaires, including a demographic questionnaire, post-search, and exit questionnaires, have been developed to assess the user perceptions during the search process as well as overall perceptions about the whole interaction process. Finally, to ensure that the training for each participant is consistent across all sessions, experimental guidelines have been developed and used in our experiment.

4 DATA ANALYSIS
We construct mixed-effects models for determining the effects of system and user perceptions on search performance. Mixed-effects models distinguish between fixed effects that are due to experimental conditions and random effects that are due to individual differences in a sample. We are concerned with both fixed effects of system and user perception and random effects of individual differences. We chose mixed-effects models because they are useful for the examination of the random effects of subjects and search tasks [5]. Examples of mixed-effects models in information retrieval research have included modeling of search topic effects [11], analysis of eye gaze behavior [23], and user characteristics [29]. We primarily use the lme4 package in the R statistical computing software for model fitting [6].

We find that there was no significant relationship between the task order and the time spent (R = 0.012, p = 0.92). In addition to considering the fixed effects of system and user perception, the random effects of search task and user were considered in our full model construction and data fitting. To fit the data, we performed an automatic backward model selection of the fixed and random parts of the linear mixed model [26]. Since the random intercepts for task and user were significant for time spent, with p < .001 and p < .01 respectively, we chose a mixed-effects model with search task and user as random effects. Model assessments based on diagnostic checks for non-normality of residuals and outliers, distribution of random effects, and heteroscedasticity were conducted. Since the random intercepts for task were significant for task score, with p < .001, we chose a mixed-effects model with search task as random effects.
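To make the model specifications reported in the next section concrete, an lme4-style formula such as "sys_usefulness + (1 | user)" (Model 1 in Table 2) corresponds to the random-intercept model below; the symbols are standard mixed-model notation rather than the authors' own.

\[
\text{score}_{ij} = \beta_0 + \beta_1\,\text{sys\_usefulness}_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2),
\]

where score_{ij} is the task score of user j on task i, \beta_1 is the fixed effect of perceived system usefulness, u_j is the random intercept for user j capturing individual differences, and \varepsilon_{ij} is the residual error.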
5 RESULTS

5.1 Participant Characteristics
A total of 16 pilot school students participated in the study. Most students were between 18 and 25 years old. Almost all students had flying experience as an amateur or student pilot (less than 70 flight hours on average), and three students had general aviation experience of more than 5 years. None of them had commercial flying experience. Most participants used search engines every day or several times a day or more, whereas they had limited experience using virtual assistants. Overall, the participants are homogeneous by experience and age.

5.2 Search Performance by Search Task Difficulty
The overall results suggest that there was no significant difference in search performance by task score and time spent. However, there were very significant differences in search performance by search task. Figures 3 and 4 indicate that the proposed SL (Smart Librarian) system enhanced the task score for hard search tasks, but there was no significant difference by time spent. As expected, task difficulty had significant effects on search performance. The SL system performed particularly well for hard search tasks.

Figure 3: Boxplot of the types of systems and task score by search task difficulty.
Figure 4: Boxplot of the types of systems and time spent by search task difficulty.

5.3 Effect of System and Perception on Task Score
Table 2 presents the results of model selection for the mixed effects of system and user perception on task scores. It shows that the system alone did not significantly affect the task score. The random effect of the user was present in the perceived usefulness of the system, whereas the random effect of the task appeared in the relevance of the system's responses to the topic and user satisfaction with the search process. The random effect of both user and task was present in the usefulness of the system's responses to find answers and user satisfaction with system responses.

Table 2: Model selection of fixed and random effects for user perception measures.

Model   | Fixed and Random Effects
Model 1 | sys_usefulness + (1 | user)
Model 2 | topic_relevance + (1 | task)
Model 3 | utility + (1 | task) + (1 | user)
Model 4 | sys_satisfaction + (1 | task) + (1 | user)
Model 5 | process_satisfaction + (1 | task)

Notes: sys_usefulness refers to how useful the system was in completing the task; topic_relevance is how relevant to the topic the system's responses were; utility refers to how useful the system's responses were to find answers; sys_satisfaction is how satisfied the user was with the system's responses; process_satisfaction refers to how satisfied the user was with the search process; random intercepts for task and user are specified with (1|task) and (1|user) respectively.

Table 3 reveals that all the user perception measures made significant differences in the task score. The best model based on the Akaike information criterion (AIC) was Model 1, in which the user-perceived usefulness of the system accounted for 65% of the variance, followed by the relevance of the system's responses to the topic in Model 2, with 41% of the variance. Interestingly, the user's satisfaction with the system's responses represented an effect size of 21% in Model 4. Therefore, the results suggest that the system design should focus on the user-perceived usefulness of the design features (related to usability issues) and the relevance of the system's responses to the topic (related to effectiveness issues). User satisfaction may not be the best predictor of user search performance.

Table 3: Effect of system and user perception on task score.

Task_Score                         | Model 1         | Model 2         | Model 3         | Model 4         | Model 5
sys_usefulness                     | 0.25∗∗∗ (0.02)  |                 |                 |                 |
topic_relevance                    |                 | 0.19∗∗∗ (0.03)  |                 |                 |
utility                            |                 |                 | 0.13∗∗∗ (0.03)  |                 |
sys_satisfaction                   |                 |                 |                 | 0.12∗∗∗ (0.03)  |
process_satisfaction               |                 |                 |                 |                 | 0.08∗∗∗ (0.03)
Constant                           | −0.20∗∗ (0.09)  | −0.01 (0.15)    | 0.27∗ (0.15)    | 0.28∗ (0.15)    | 0.44∗∗ (0.18)
N                                  | 64              | 64              | 64              | 64              | 64
Log Likelihood                     | −1.90           | −2.20           | −5.79           | −6.92           | −11.99
AIC (Akaike information criterion) | 11.79           | 12.39           | 21.57           | 23.85           | 31.99
ICC (Intraclass correlation)       | 0.21            | 0.33            | 0.56            | 0.55            | 0.55
R² (fixed)                         | 0.65            | 0.41            | 0.22            | 0.21            | 0.06
R² (total)                         | 0.72            | 0.60            | 0.66            | 0.65            | 0.58

∗∗∗ p < .01; ∗∗ p < .05; ∗ p < .1

Therefore, our research hypothesis H1 (Types of search systems and user perceptions will affect user search performance) is partially supported. Specifically, the search system is not correlated with user search performance; user perception of the usefulness of the system in completing the search task and of the relevance of the system's responses to the topic are good predictors of user search performance.

5.4 Effect of Perceived Difficulty and Perception on Task Score
Table 4 shows that the best model was perceived search task difficulty and its interaction effect with the relevance of the system's responses to the topic, which accounts for 52% of the variance. In other words, when a search task was considered difficult, participants had more problems judging the relevance of the system's responses to the topic. The user perceptions about the system utility and satisfaction with the system had significant effects on the task score, together with significant interaction effects. It is worth noting that both Tables 3 and 4 suggest that user perception of how useful the system was in completing the task was the best predictor of task score, and that there was no correlation between the user-perceived search task difficulty and the task score. Importantly, our constructed mixed-effects models have relatively large effect sizes, suggesting that participants in the study are very good at judging their performance.

Table 4: Effect of perceived search task difficulty and user perception on task score.

Task Score                            | Model 1         | Model 2         | Model 3
perceived_difficulty                  | −0.33∗∗∗ (0.10) | −0.33∗∗∗ (0.08) | −0.35∗∗∗ (0.09)
topic_relevance                       | −0.11 (0.10)    |                 |
perceived_difficulty:topic_relevance  | 0.06∗∗∗ (0.02)  |                 |
utility                               |                 | −0.16∗∗ (0.08)  |
perceived_difficulty:utility          |                 | 0.06∗∗∗ (0.02)  |
sys_satisfaction                      |                 |                 | −0.18∗∗ (0.09)
perceived_difficulty:sys_satisfaction |                 |                 | 0.06∗∗∗ (0.02)
Constant                              | 1.46∗∗∗ (0.46)  | 1.66∗∗∗ (0.37)  | 1.80∗∗∗ (0.41)
N                                     | 64              | 64              | 64
Log Likelihood                        | −2.16           | −5.50           | −6.71
AIC (Akaike information criterion)    | 16.31           | 23.01           | 25.42
ICC (Intraclass correlation)          | 0.30            | 0.36            | 0.34
R² (fixed)                            | 0.52            | 0.41            | 0.41
R² (total)                            | 0.66            | 0.63            | 0.61

∗∗∗ p < .01; ∗∗ p < .05; ∗ p < .1

Therefore, our research hypothesis H2 (Perceived search task difficulty and user perceptions will affect user search performance) is supported. Specifically, perceived search task difficulty and its interaction effect with the relevance of the system's responses to the topic make a significant difference in user search performance.
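As a reading aid for the model comparisons in Tables 3 and 4 (an arithmetic check of our own, not part of the reported analysis), the AIC values follow the usual definition

\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\]

where k is the number of estimated parameters and \ln\hat{L} is the maximized log-likelihood. For Model 1 in Table 3, counting two fixed-effect coefficients plus the user random-intercept variance and the residual variance gives k = 4, so AIC = 2(4) − 2(−1.90) = 11.8, matching the reported 11.79 up to rounding; the lowest AIC therefore favors Model 1 over the alternatives.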
6 DISCUSSION
This study is concerned with the design and evaluation of conversational search systems to support the pilot in cockpits, with particular references to the system evaluation issues from user-centered perspectives. Our findings suggest that the system alone cannot predict search performance and search efficiency; participants in the study are very good at judging their performance. Specifically, their perceptions about the usefulness of the system in completing the task and the relevance of the system's responses to the topic are good predictors of search performance.

Our findings reveal that user satisfaction with the system's responses may not be a good predictor of user search performance. Since the Alexa Prize Socialbot Grand Challenge is designed as a research competition to advance our understanding of human interactions with socialbots and to enhance the user experience, specifically user satisfaction when interacting with Alexa, it is not surprising that user satisfaction has been selected as the main evaluation criterion for success. Its evaluation criteria consist of automatic metrics from the system and human evaluation with Amazon's Mechanical Turk. Since the objective is to judge the system performance based on approximations of user satisfaction, it is found that there was a discrepancy between automatic metrics and human evaluation results [16]. Our findings suggest that if the goal is to enhance user search performance, the perceived usefulness of the system and the relevance of the system's responses to the topic are better predictors than user satisfaction with the system.

Our holistic approach to understanding user experience and user performance is intended to bridge the gap between system-centric evaluation (i.e., automatic metrics) and human evaluation. This approach is in line with the extrinsic evaluation that is concerned with how the use of the system contributes to external outputs, such as task completion [40]. The user's judgment of usefulness has been proposed and used as an evaluation criterion for IIR (interactive information retrieval) studies [13, 47]. Our finding that user perceptions about the usefulness of the system in completing the task and the relevance of the system's responses to the topic are good predictors of search performance suggests that user-perceived usefulness and relevance of the system's responses to the topic can be used for the evaluation of current conversational search systems. It demonstrates the applicability of the holistic approach adopted in previous studies of information-seeking conversations [52, 54] to the design and evaluation of conversational search systems in a specific domain.

Given these findings, future research and development work needs to focus on the design of system support features for the relevance judgment, such as the snippets in the search engine results page and system feedback. This work involves both the usability and effectiveness issues in system development. Future research on the correlations between system-centric metrics and the user task score is suggested. Since the participants are homogeneous by age and experience in a specific domain, the generalizability of the research findings to other settings may be limited. A larger sample size would also enhance the validity of the results.

7 CONCLUSION
In this paper, we demonstrate a user-centered approach to the design and evaluation of conversational search user interfaces through a collaborative research project between academia and industry. It presents an approach for developing conversational search systems from the user perspective by considering the user search behavior as well as individual differences when interacting with the proposed conversational search system. The controlled user experiment suggests that user perception of the usefulness of the system in completing the search task and of the relevance of the system's responses to the topic are good predictors of user search performance. User satisfaction with the system's responses may not be a good predictor of user search performance.

8 ACKNOWLEDGMENTS
This study was funded by Airbus Central Research & Technology with the support of the Aeronautical Computer Human Interaction Lab (ACHIL) of the Ecole Nationale de l'Aviation Civile (ENAC), and in particular the following researchers' assistance with the experimentation: Alexandre DUCHEVET, Géraud GRANGER, Jean-Paul IMBERT, Nadine MATTON and Yves ROUILLARD.
REFERENCES
[1] Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W. Bruce Croft. 2019. Asking Clarifying Questions in Open-Domain Information-Seeking Conversations. In Proceedings of SIGIR '19. ACM, New York, 475–484. https://doi.org/10.1145/3331184.3331265
[2] Avishek Anand, Lawrence Cavedon, Hideo Joho, Mark Sanderson, and Benno Stein. 2020. Conversational Search (Dagstuhl Seminar 19461). Dagstuhl Reports 9, 11 (2020), 34–83. https://doi.org/10.4230/DagRep.9.11.34
[3] Cecilia R. Aragon and Marti A. Hearst. 2005. Improving Aviation Safety with Information Visualization: A Flight Simulation Study. In Proceedings of CHI '05. ACM, New York, 441–450. https://doi.org/10.1145/1054972.1055033
[4] Alexandre Arnold, Gérard Dupont, Catherine Kobus, and François Lancelot. 2019. Conversational agent for aerospace question answering: A position paper. In Proceedings of the 1st Workshop on Conversational Interaction Systems (WCIS at SIGIR). Paris.
[5] R. H. Baayen, D. J. Davidson, and D. M. Bates. 2008. Mixed-effects modeling with crossed random effects for subjects and items. J. Mem. Lang. 59, 4 (2008), 390–412. https://doi.org/10.1016/j.jml.2007.12.005
[6] Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting linear mixed-effects models using lme4. J. Stat. Softw. 67, 1 (2015), 51. https://doi.org/10.18637/jss.v067.i01
[7] Tom Bocklisch, Joey Faulkner, Nick Pawlowski, and Alan Nichol. 2017. Rasa: Open source language understanding and dialogue management. (2017). arXiv:1712.05181
[8] Pia Borlund. 2016. A study of the use of simulated work task situations in interactive information retrieval evaluations: A meta-evaluation. J. Doc. 72, 3 (2016), 394–413. https://doi.org/10.1108/JD-06-2015-0068
[9] Pavel Braslavski, Denis Savenkov, Eugene Agichtein, and Alina Dubatovka. 2017. What do you mean exactly? Analyzing clarification questions in CQA. In Proceedings of CHIIR '17 (Oslo, Norway). ACM, New York, 345–348. https://doi.org/10.1145/3020165.3022149
[10] Guido C. Carim, Tarcisio A. Saurin, Jop Havinga, Andrew Rae, Sidney W. A. Dekker, and Éder Henriqson. 2016. Using a procedure doesn't mean following it: A cognitive systems approach to how a cockpit manages emergencies. Saf. Sci. 89 (2016), 147–157. https://doi.org/10.1016/j.ssci.2016.06.008
[11] Ben Carterette, Evangelos Kanoulas, and Emine Yilmaz. 2011. Simulating simple user behavior for system effectiveness evaluation. In Proceedings of CIKM '11 (Glasgow, Scotland, UK). ACM, New York, 611–620. https://doi.org/10.1145/2063576.2063668
[12] Andy Cockburn, Carl Gutwin, Philippe Palanque, Yannick Deleris, Catherine Trask, Ashley Coveney, Marcus Yung, and Karon MacLean. 2017. Turbulent touch: Touchscreen input for cockpit flight displays. In Proceedings of CHI '17 (Denver, Colorado, USA). ACM, New York, 6742–6753. https://doi.org/10.1145/3025453.3025584
[13] Michael Cole, Jingjing Liu, Nicholas Belkin, Ralf Bierig, Jacek Gwizdka, C. Liu, Jin Zhang, and X. Zhang. 2009. Usefulness as the criterion for evaluation of interactive information retrieval. In Proceedings of HCIR '09. 1–4. http://cuaslis.org/hcir2009/HCIR2009.pdf
[14] Jan Deriu, Álvaro Rodrigo, Arantxa Otegi, Guillermo Echegoyen, Sophie Rosset, Eneko Agirre, and Mark Cieliebak. 2019. Survey on evaluation methods for dialogue systems. (2019). arXiv:1905.04071
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019. Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
[16] Emily Dinan, Varvara Logacheva, Valentin Malykh, Alexander Miller, Kurt Shuster, Jack Urbanek, Douwe Kiela, Arthur Szlam, Iulian Serban, Ryan Lowe, Shrimai Prabhumoye, Alan W. Black, Alexander Rudnicky, Jason Williams, Joelle Pineau, Mikhail Burtsev, and Jason Weston. 2020. The Second Conversational Intelligence Challenge (ConvAI2). Springer International Publishing, Cham, 187–208.
[17] Ralph H. Earle, Mark A. Rosso, and Kathryn E. Alexander. 2015. User preferences of software documentation genres. In Proceedings of the Annual International Conference on the Design of Communication (SIGDOC '15). ACM, New York, 46:1–46:10. https://doi.org/10.1145/2775441.2775457
[18] Jasper Feine, Ulrich Gnewuch, Stefan Morana, and Alexander Maedche. 2019. A taxonomy of social cues for conversational agents. Int. J. Hum. Comput. Stud. 132 (2019), 138–161. https://doi.org/10.1016/j.ijhcs.2019.07.009
[19] Luanne Freund. 2013. A cross-domain analysis of task and genre effects on perceptions of usefulness. Inf. Process. Manage. 49, 5 (2013), 1108–1121. https://doi.org/10.1016/j.ipm.2012.08.007
[20] Ido Guy. 2018. The characteristics of voice search: Comparing spoken with typed-in mobile web search queries. ACM Trans. Inf. Syst. 36, 3 (2018), 30:1–30:28. https://doi.org/10.1145/3182163
[21] Marti A. Hearst. 2011. 'Natural' search user interfaces. Commun. ACM 54, 11 (2011), 60–67. https://doi.org/10.1145/2018396.2018414
[22] Morten Hertzum and Jesper Simonsen. 2019. How is professionals' information seeking shaped by workplace procedures? A study of healthcare clinicians. Inf. Process. Manage. 56, 3 (2019), 624–636. https://doi.org/10.1016/j.ipm.2019.01.001
[23] Katja Hofmann, Bhaskar Mitra, Filip Radlinski, and Milad Shokouhi. 2014. An eye-tracking study of user interactions with query auto completion. In Proceedings of CIKM '14 (Shanghai, China). ACM, New York, 549–558. https://doi.org/10.1145/2661829.2661922
[24] Diane Kelly. 2009. Methods for evaluating interactive information retrieval systems with users. Found. Trends Inf. Retr. 3, 1–2 (2009), 1–224.
[25] Roger E. Kirk. 2013. Experimental Design: Procedures for the Behavioral Sciences (4th ed.). Brooks/Cole, Pacific Grove, CA.
[26] Alexandra Kuznetsova, Per Brockhoff, and Rune Christensen. 2017. lmerTest package: Tests in linear mixed effects models. J. Stat. Softw. 82, 13 (2017), 1–26. https://doi.org/10.18637/jss.v082.i13
[27] Catherine Letondal, Jean-Luc Vinot, Sylvain Pauchet, Caroline Boussiron, Stéphanie Rey, Valentin Becquet, and Claire Lavenir. 2018. Being in the sky: Framing tangible and embodied interaction for future airliner cockpits. In Proceedings of TEI '18. ACM, New York, 656–666. https://doi.org/10.1145/3173225.3173229
[28] Yuelin Li and Nicholas J. Belkin. 2008. A faceted approach to conceptualizing tasks in information seeking. Inf. Process. Manage. 44, 6 (2008), 1822–1837. https://doi.org/10.1016/j.ipm.2008.07.005
[29] Chang Liu, Ying-Hsang Liu, Tom Gedeon, Yu Zhao, Yiming Wei, Fan Yang, and Fan Zhangs. 2019. The effects of perceived chronic pressure and time constraint on information search behaviors and experience. Inf. Process. Manage. 56, 5 (2019), 1667–1679. https://doi.org/10.1016/j.ipm.2019.04.004
[30] Ying-Hsang Liu and Nicholas J. Belkin. 2008. Query reformulation, search performance, and term suggestion devices in question-answering tasks. In Proceedings of IIiX '08. ACM, New York, 21–26. https://doi.org/10.1145/1414694.1414702
[31] Varvara Logacheva, Valentin Malykh, Aleksey Litinsky, and Mikhail Burtsev. 2020. ConvAI2 dataset of non-goal-oriented human-to-bot dialogues. In The NeurIPS '18 Competition. The Springer Series on Challenges in Machine Learning, Sergio Escalera and Ralf Herbrich (Eds.). Springer International Publishing, Cham, 277–294.
[32] Robert J. Moore and Raphael Arar. 2019. Conversational UX Design: A Practitioner's Guide to the Natural Conversation Framework. ACM, New York.
[33] Christine Murad and Cosmin Munteanu. 2019. "I Don't Know What You're Talking About, HALexa": The case for voice user interface guidelines. In Proceedings of CUI '19. ACM, New York, 9:1–9:3. https://doi.org/10.1145/3342775.3342795
[34] Chelsea M. Myers. 2019. Adaptive suggestions to increase learnability for voice user interfaces. In Proceedings of the IUI '19 Companion. ACM, New York, 159–160. https://doi.org/10.1145/3308557.3308727
[35] Martin Porcheron, Joel E. Fischer, Stuart Reeves, and Sarah Sharples. 2018. Voice interfaces in everyday life. In Proceedings of CHI '18 (Montreal QC, Canada). ACM, New York, 1–12. https://doi.org/10.1145/3173574.3174214
[36] Chen Qu, Liu Yang, W. Bruce Croft, Yongfeng Zhang, Johanne R. Trippas, and Minghui Qiu. 2019. User intent prediction in information-seeking conversations. In Proceedings of CHIIR '19 (Glasgow, Scotland UK). ACM, New York, 25–33. https://doi.org/10.1145/3295750.3298924
[37] Filip Radlinski and Nick Craswell. 2017. A theoretical framework for conversational search. In Proceedings of CHIIR '17 (Oslo, Norway). ACM, New York, 117–126. https://doi.org/10.1145/3020165.3020183
[38] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the Annual Meeting of the ACL. Association for Computational Linguistics, Melbourne, Australia, 784–789. https://doi.org/10.18653/v1/P18-2124
[39] Stephen Robertson and Hugo Zaragoza. 2009. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr. 3, 4 (2009), 333–389. https://doi.org/10.1561/1500000019
[40] Anne Schneider, Ielka Van Der Sluis, and Saturnino Luz. 2010. Comparing intrinsic and extrinsic evaluation of MT output in a dialogue system. In Proceedings of the 7th International Workshop on Spoken Language Translation. 329–336.
[41] Ben Steichen, Cristina Conati, and Giuseppe Carenini. 2014. Inferring visualization task properties, user performance, and user cognitive abilities from eye gaze data. ACM Trans. Interact. Intell. Syst. 4, 2 (2014), 1–29. https://doi.org/10.1145/2633043
[42] Muh-Chyun Tang, Ying-Hsang Liu, and Wan-Ching Wu. 2013. A study of the influence of task familiarity on user behaviors and performance with a MeSH term suggestion interface for PubMed bibliographic search. Int. J. Med. Informatics 82, 9 (2013), 832–843. https://doi.org/10.1016/j.ijmedinf.2013.04.005
[43] Paul Thomas, Mary Czerwinski, Daniel McDuff, Nick Craswell, and Gloria Mark. 2018. Style and alignment in information-seeking conversation. In Proceedings of CHIIR '18 (New Brunswick, NJ, USA). ACM, New York, 42–51. https://doi.org/10.1145/3176349.3176388
[44] Johanne R. Trippas, Damiano Spina, Lawrence Cavedon, Hideo Joho, and Mark Sanderson. 2018. Informing the design of spoken conversational search. In Proceedings of CHIIR '18. ACM, New York, 32–41. https://doi.org/10.1145/3176349.3176387
[45] Doug Turnbull and John Berryman. 2016. Relevant Search: With Applications for Solr and Elasticsearch. Manning Publications, Shelter Island, NY.
[46] Kelsey Urgo, Jaime Arguello, and Robert Capra. 2019. Anderson and Krathwohl's two-dimensional taxonomy applied to task creation and learning assessment. In Proceedings of ICTIR '19 (Santa Clara, CA, USA). ACM, New York, 117–124. https://doi.org/10.1145/3341981.3344226
[47] Pertti Vakkari, Michael Völske, Martin Potthast, Matthias Hagen, and Benno Stein. 2019. Modeling the usefulness of search results as measured by information use. Inf. Process. Manage. 56, 3 (2019), 879–894. https://doi.org/10.1016/j.ipm.2019.02.001
[48] Svitlana Vakulenko, Kate Revoredo, Claudio Di Ciccio, and Maarten de Rijke. 2019. QRFA: A data-driven model of information-seeking dialogues. In Advances in Information Retrieval. ECIR 2019. Lecture Notes in Computer Science, vol. 11437, Leif Azzopardi, Benno Stein, Norbert Fuhr, Philipp Mayr, Claudia Hauff, and Djoerd Hiemstra (Eds.). Springer International Publishing, Cham, 541–557.
[49] Terry L. von Thaden. 2008. Distributed information behavior: A study of dynamic practice in a safety critical environment. J. Am. Soc. Inf. Sci. Technol. 59, 10 (2008), 1555–1569. https://doi.org/10.1002/asi.20842
[50] Alexandra Vtyurina, Denis Savenkov, Eugene Agichtein, and Charles L. A. Clarke. 2017. Exploring conversational search with humans, assistants, and wizards. In Proceedings of CHI EA '17. ACM, New York, 2187–2193. https://doi.org/10.1145/3027063.3053175
[51] Peter Wittek, Ying-Hsang Liu, Sándor Darányi, Tom Gedeon, and Ik Soo Lim. 2016. Risk and ambiguity in information seeking: Eye gaze patterns reveal contextual behavior in dealing with uncertainty. Front. Psychol. 7 (2016), 1790. https://doi.org/10.3389/fpsyg.2016.01790
[52] Mei-Mei Wu. 2005. Understanding patrons' micro-level information seeking (MLIS) in information retrieval situations. Inf. Process. Manage. 41, 4 (2005), 929–947. https://doi.org/10.1016/j.ipm.2004.08.007
[53] Mei-Mei Wu and Ying-Hsang Liu. 2003. Intermediary's information seeking, inquiring minds, and elicitation styles. J. Am. Soc. Inf. Sci. Technol. 54, 12 (2003), 1117–1133. https://doi.org/10.1002/asi.10323
[54] Mei-Mei Wu and Ying-Hsang Liu. 2011. On intermediaries' inquiring minds, elicitation styles, and user satisfaction. J. Am. Soc. Inf. Sci. Technol. 62, 12 (2011), 2396–2403. https://doi.org/10.1002/asi.21644
[55] Hamed Zamani, Susan Dumais, Nick Craswell, Paul Bennett, and Gord Lueck. 2020. Generating clarifying questions for information retrieval. In Proceedings of The Web Conference 2020 (Taipei, Taiwan) (WWW '20). ACM, New York, 418–428. https://doi.org/10.1145/3366423.3380126