Towards Explainable Question Answering (XQA)

Saeedeh Shekarpour,1 Faisal Alshargi,2 Mohammadjafar Shekarpour
1 University of Dayton, Dayton, United States
2 University of Leipzig, Leipzig, Germany
sshekarpour1@udayton.org, alshargi@informatik.uni-leipzig.de, mj.shekarpour@gmail.com

AAAI Fall 2020 Symposium on AI for Social Good. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

The increasing rate of information pollution on the Web requires novel solutions. Question Answering (QA) interfaces are simplified, user-friendly interfaces for accessing information on the Web. However, similar to other AI applications, they are black boxes that do not manifest the details of the learning or reasoning steps behind an answer. Explainable Question Answering (XQA) systems can alleviate the pain of information pollution by providing transparency into the underlying computational model and by exposing an interface that enables the end-user to access and validate the provenance, validity, context, circulation, interpretation, and feedback of information. This position paper sheds light on the core concepts, expectations, and challenges around the following questions: (i) What is an XQA system? (ii) Why do we need XQA? (iii) When do we need XQA? (iv) How do we represent explanations? (v) How do we evaluate XQA systems?

Introduction

The increasing rate of information pollution [1-4] on the Web requires novel solutions. In fact, there are major deficiencies in the areas of computation, information, and Web science, as follows. (i) Information disorder on the Web: content is shared and spread on the Web without any accountability (e.g., bots [6-9] or manipulative politicians [10] post fake news), and misinformation spreads easily on social networks [11]. Although tech companies try to identify misinformation using AI techniques, this is not sufficient [12-14]. The root of this problem lies in the fact that the Web infrastructure might need newer standards and protocols for sharing, organizing, and managing content. (ii) The incompetence of Information Retrieval (IR) and Question Answering (QA) models and interfaces: IR systems are limited to bag-of-words semantics, and QA systems mostly deal with factoid questions. They fail to take into account other aspects of content such as provenance, context, temporal and locative dimensions, and feedback from the crowd during the spread of content. In addition, they fail to 1) provide transparency about their exploitation and ranking mechanisms, 2) discriminate trustworthy content and sources from untrustworthy ones, 3) identify manipulative or misleading context, and 4) reveal provenance.

Question Answering (QA) applications are a subcategory of Artificial Intelligence (AI) applications in which, for a given question, an adequate answer (or answers) is provided to the end-user, regardless of concerns related to the structure and semantics of the underlying data. The spectrum of QA implementations varies from statistical approaches (Shekarpour, Ngomo, and Auer 2013; Shekarpour et al. 2015) and deep learning models (Xiong, Merity, and Socher 2016; Shekarpour, Ngomo, and Auer 2013) to simple rule-based (i.e., template-based) approaches (Unger et al. 2012; Shekarpour et al. 2011). Also, the underlying data sets from which the answer is exploited might range from Knowledge Graphs (KGs), which hold solid semantics and structure, to unstructured corpora (free text), or a consolidation of both. Apart from the implementation details and the background data, roughly speaking, the research community has introduced the following categories of QA systems:

• Ad-hoc QA: advocates simple and short questions and typically relies on one single KG or corpus.
• Hybrid QA: requires federating knowledge from heterogeneous sources (Bast et al. 2007).
• Complex QA: deals with complex questions, which are long and ambiguous. Typically, answering such questions requires exploiting answers from a hybrid of KGs and textual content (Asadifar, Kahani, and Shekarpour 2018).
• Visualized QA: answers textual questions from images (Li et al. 2018).
• Pipeline-based QA: provides automatic integration of state-of-the-art QA implementations (Singh et al. 2018b,a).

Figure 1: The existing QA systems are black boxes which do not provide any explanation for their inference. (The user sends "what is the side effect of antibiotics?" to a QA system or search engine backed by corpora and models such as deep learning, graphical models, Bayesian belief nets, statistical models, ensemble methods, Markov models, and interlinked knowledge graphs, but is left asking: Why not something else? Why do you fail? Why do you succeed? When can I trust you? How do I correct an error?)
A missing point in all types of QA systems is that, in case of either success or failure, they are silent on the question of why. Why was a particular answer chosen? Why were the rest of the candidates disregarded? Why did the QA system fail to answer, and is a failure the fault of the model, the quality of the data, or a lack of data? The truth is that the existing QA systems, similar to other AI applications, are black boxes (see Figure 1), meaning they do not provide any supporting fact (explanation) about the represented answer with respect to the trustworthiness of the source of information, the confidence/reliability of the chosen answer, and the chain of reasoning or learning steps that led to the final answer. For example, Figure 1 shows the user sending the question 'what is the side effect of antibiotics?' to the QA system. If the answer is represented in a way similar to the interface of Google, the end-user might have mixed feelings as to whether s/he can rely on this answer, or how and why such an answer was chosen among numerous candidates.

The rising challenges regarding the credibility, reliability, and validity of state-of-the-art QA systems are of high importance, especially in critical domains such as the life sciences, which involve human life. Explainable Question Answering (XQA) systems are an emerging area that tries to address the shortcomings of the existing QA systems. A recent article (Yang et al. 2018) published a data set containing question/answer pairs along with supporting facts from the corpus, where an inference mechanism over them leads to the answer. Figure 2 shows an example taken from the original article (Yang et al. 2018). The assumption behind this data set is that the questions require multiple hops to conclude the answer, which is not the case all the time. Besides, this kind of representation might not be an ideal form for XQA: for example, is representing solely the supporting facts sufficient? How reliable are the supporting facts? Who published them, and how credible is the publisher? Furthermore, regarding the interface, is the end-user not overwhelmed if s/he wants to go through all the supporting facts? Is there not a more user-friendly approach to representation?

Figure 2: An example from (Yang et al. 2018) where the supporting facts necessary to answer the given question Q are listed. (Q: "What was the former band of the member of Mother Love Bone who died just before the release of 'Apple'?" A: Malfunkshun. Supporting facts: sentences 1, 2, 4, 6, 7 drawn from the Wikipedia paragraphs "Return to Olympus" and "Mother Love Bone".)

XQA, similar to all applications of Explainable AI (XAI), is expected to be transparent, accountable, and fair (Sample 2017). If a QA system is biased (bad QA), it will come up with discriminating information that is biased on race, gender, age, ethnicity, religion, or the social or political rank of the publisher and the targeted user (Buranyi 2017). The XAI program (Gunning 2017) raises six fundamental competency questions regarding XAI, as follows:

1. Why did the AI system do that?
2. Why did the AI system not do something else?
3. When did the AI system succeed?
4. When did the AI system fail?
5. When does the AI system give enough confidence in the decision that you can trust it?
6. How can the AI system correct an error?

In the area of XQA, we adopt these questions; however, we apply modifications as follows:

1. Why did the QA system choose this answer?
2. Why did the QA system not answer something else?
3. When did the QA system succeed?
4. When did the QA system fail?
5. When does the QA system give enough confidence in the answer that you can trust it?
6. How can the QA system correct an error?

This visionary paper introduces the core concepts, expectations, and challenges around the questions (i) What is an Explainable Question Answering (XQA) system? (ii) Why do we need XQA? (iii) When do we need XQA? (iv) How do we represent explanations? (v) How do we evaluate XQA systems? In the following sections, we address each question respectively.
What is XQA?

To answer the question of what XQA is, we feature two layers, i.e., model and interface, for XQA, similar to XAI (Gunning 2017). Figure 3 shows our envisioned plan for XQA, where, at the end, the end-user confidently concludes that s/he can or cannot trust the answer. In the following, we present a formal definition of XQA.

Figure 3: The explainable question answering system exposes explainable models and an explainable interface; the user can then make a decision as to whether or not to trust the answer. (The interface offers features such as access to context, access to circulation, information history, fact-checking, source-checking, detecting information disorder, and reporting mis-, dis-, and mal-information.)

Definition 1 (Explainable Question Answering) XQA is a system relying on an explainable computational model for exploiting the answer, which then utilizes an explainable interface to represent the answer(s) along with the explanation(s) to the end-user.

This definition highlights the two major components of XQA: (i) an explainable computational model and (ii) an explainable interface. In the following, we discuss these two components in more detail.

Explainable Computational Model. Whatever computational model is employed in an XQA system (e.g., a learning-based model, schema-driven approach, reasoning approach, heuristic approach, rule-based approach, or a mixture of various models), it has to explain all intermediate and final choices, meaning the rationale behind its decisions should be transparent, fair, and accountable (Sample 2017). A responsible QA system distinguishes among misinformation, disinformation, mal-information, and true facts (Wardle and Derakhshan 2017). Furthermore, it cares about the trustworthiness or untrustworthiness of the data publisher, the information representation, updated or outdated information, accurate or inaccurate information, and also the interpretations that the answer might raise. A fair QA system is not biased by particular characteristics of the data publisher or the targeted end-user (e.g., region, race, social or political rank). Finally, the transparency of a QA system refers to the availability and accessibility of the reasons behind its decisions at each step, upon the request of the individuals involved (e.g., end-user, developer, data publisher, policymakers).

Explainable Interface. The explainable interface introduced in (Gunning 2017) contains two layers: (i) a cognitive layer and (ii) an explanation layer. The cognitive layer represents the implications learned from the computational model in an explainable form (an abstractive or summarized representation), and the explanation layer is then responsible for delivering them to the end-user in an interactive mode. We introduce several fundamental features which the future generation of XQA systems has to launch, and we elaborate on our view of the interface in the section on representing explanations.

Why do we need XQA?

We showcase the importance of having XQA using the two following arguments.

Information Disorder Era. The growth rate of mis-, dis-, and mal-information on the Web is worsening dramatically (Wardle 2018). Still, the existing search engines fail to identify misinformation even where it is highly crucial (Kata 2010). Information retrieval systems (either keyword-based search engines or QA systems) are expected to discriminate mis-, dis-, and mal-information from reliable and trustworthy information.

Human Subject Area. Having XQA for areas that involve lives, particularly human subjects, is highly important. For example, the bio-medical and life-science domains require discriminating among hypothetical facts, resulting facts, methodological facts, and goal-oriented facts. Thus, XQA has to infer the answer to an informational question based on the context of the question, i.e., whether it is asking about resulting facts, hypothetical facts, etc.

When do we need XQA?

Typically, XQA matters in the domains where the user wants to make a decision based on the given answer, since it enables the end-user to decide with trust. There are domains where traditional QA does not hurt; for example, if the end-user is looking for the 'nearby Italian restaurant', QA systems suffice. On the contrary, in the health domain, explanations are in demand; otherwise, health care providers cannot entirely rely on the answers disposed by the system.

How to represent explanations?

We illustrate the life cycle of information on the Web in Figure 4, which can be published as a stack of metadata. Each piece of information has a publishing source. Further, genuine information might be framed or manipulated in a context. Then, the information might be spread on social media. Concerning its circulation on social media or the Web, it might be annotated or commented on by the crowd.

Figure 4: The life cycle of information on the Web (publishing, framing in context, manipulating, circulating on the Web and social media, and annotating by the crowd).

We feature the explainable QA interface with respect to this life cycle, as it should enable the end-user to 1) access context, 2) find the provenance of information, 3) do fact-checking, 4) do source-checking, 5) check the credibility of the source, 6) detect manipulated information, 7) report mis-, dis-, and mal-information, 8) access annotations (feedback) of the crowd, and 9) reveal the circulation history.
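The stack of metadata described above could, for illustration only, be carried alongside each answer as a structured record. The following is a minimal sketch; all field names here are hypothetical assumptions, since the paper prescribes what an explanation must expose (provenance, context, circulation, crowd feedback, confidence), not a concrete schema. The small helper at the end shows one possible way to quantify the preciseness of an explanation against gold supporting facts, in the spirit of the supporting-fact annotations of (Yang et al. 2018):

```python
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class SupportingFact:
    """One piece of evidence, carried with its (hypothetical) life-cycle metadata."""
    text: str                         # the supporting statement itself
    source: str                       # publishing source (provenance)
    credibility: float                # assumed credibility score of the source in [0, 1]
    context: str = ""                 # the context the statement was framed in
    circulation: int = 0              # e.g., number of shares on social media
    crowd_feedback: List[str] = field(default_factory=list)  # crowd annotations

@dataclass
class ExplainedAnswer:
    """An answer bundled with its explanation, as XQA envisions."""
    question: str
    answer: str
    confidence: float                 # reliability rate of the chosen answer
    supporting_facts: List[SupportingFact]

def explanation_precision(predicted: Set[str], gold: Set[str]) -> float:
    """One possible reading of the 'preciseness' of an explanation:
    the fraction of predicted supporting facts that are actually relevant."""
    if not predicted:
        return 0.0
    return len(predicted & gold) / len(predicted)
```

For instance, explanation_precision({"f1", "f2", "f4"}, {"f1", "f2", "f3"}) yields 2/3. An explanation recall could be defined symmetrically; the point of the sketch is merely that the life-cycle metadata makes the explanation itself an object that can be inspected and scored.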
How to evaluate XQA systems?

The evaluation of an XQA system has to check its performance from both qualitative and quantitative perspectives. The Human-Computer Interaction (HCI) community has already targeted various aspects of the human-centered design and evaluation challenges of black-box systems; however, QA systems have received the least attention compared to other AI applications such as recommender systems. Regarding XQA, the qualitative measures can be (i) adequate justification: the end-user feels that she is aware of the reasoning steps of the computational model; (ii) confidence: the user can trust the system and is willing to continue interacting with it; (iii) understandability: the system educates the user as to how it infers, or what the causes of failures and unexpected answers are; and (iv) user involvement: the system encourages the user to engage in the QA process, for instance through question rewriting. On the other hand, the quantitative measures are concerned with questions such as "How effective is the approach for generating explanations?"; for example, they measure effectiveness in terms of the preciseness of the explanations. However, this is still an open research area that requires the research community to introduce metrics, criteria, and benchmarks for evaluating the various features of XQA systems.

Conclusion

In this paper, we discussed the concepts, expectations, and challenges of XQA. The expectation is that the future generation of QA systems (or search engines) will rely on explainable computational models and interact with the end-user via an explainable user interface. The explainable computational models are transparent, fair, and accountable. Also, the explainable interfaces enable the end-user to interact with features for source-checking, fact-checking, and accessing context and circulation history. In addition, the explainable interfaces allow the end-user to report mis-, dis-, and mal-information.

We are at the beginning of a long-term agenda to mature this vision and, furthermore, to provide standards and solutions. The phenomenon of information pollution is a dark side of the Web which endangers our society, democracy, justice system, and health care. We hope that XQA will receive the attention of the research community in the next couple of years.
References

Asadifar, S.; Kahani, M.; and Shekarpour, S. 2018. HCqa: Hybrid and Complex Question Answering on Textual Corpus and Knowledge Graph. CoRR abs/1811.10986.

Bast, H.; Chitea, A.; Suchanek, F.; and Weber, I. 2007. ESTER: Efficient Search on Text, Entities, and Relations. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

Buranyi, S. 2017. Rise of the racist robots – how AI is learning all our worst impulses. https://www.theguardian.com/inequality/2017/aug/08/rise-of-the-racist-robots-how-ai-is-learning-all-our-worst-impulses.

Gunning, D. 2017. Explainable Artificial Intelligence (XAI). Defense Advanced Research Projects Agency (DARPA), nd Web.

Kata, A. 2010. A postmodern Pandora's box: anti-vaccination misinformation on the Internet. Vaccine 28(7): 1709–1716.

Li, Q.; Fu, J.; Yu, D.; Mei, T.; and Luo, J. 2018. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

Sample, I. 2017. Computer says no: why making AIs fair, accountable and transparent is crucial. https://www.theguardian.com/science/2017/nov/05/computer-says-no-why-making-ais-fair-accountable-and-transparent-is-crucial. Accessed: 2017-11-05.

Shekarpour, S.; Auer, S.; Ngomo, A. N.; Gerber, D.; Hellmann, S.; and Stadler, C. 2011. Keyword-Driven SPARQL Query Generation Leveraging Background Knowledge. In Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence, WI 2011, Lyon, France, August 22–27, 2011.

Shekarpour, S.; Marx, E.; Ngomo, A. N.; and Auer, S. 2015. SINA: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics.

Shekarpour, S.; Ngomo, A. N.; and Auer, S. 2013. Question answering on interlinked data. In 22nd International World Wide Web Conference, WWW '13, Rio de Janeiro, Brazil, May 13–17, 2013.

Singh, K.; Both, A.; Radhakrishna, A. S.; and Shekarpour, S. 2018a. Frankenstein: A Platform Enabling Reuse of Question Answering Components. In The Semantic Web - 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, Proceedings.

Singh, K.; Radhakrishna, A. S.; Both, A.; Shekarpour, S.; Lytra, I.; Usbeck, R.; Vyas, A.; Khikmatullaev, A.; Punjani, D.; Lange, C.; Vidal, M.; Lehmann, J.; and Auer, S. 2018b. Why Reinvent the Wheel: Let's Build Question Answering Systems Together. In Proceedings of the 2018 World Wide Web Conference, WWW 2018, Lyon, France.

Unger, C.; Bühmann, L.; Lehmann, J.; Ngomo, A. N.; Gerber, D.; and Cimiano, P. 2012. Template-based question answering over RDF data. In Proceedings of the 21st World Wide Web Conference, WWW 2012, Lyon, France, April 16–20, 2012.

Wardle, C. 2018. Disinformation gets worse. https://www.niemanlab.org/2017/12/disinformation-gets-worse/.

Wardle, C.; and Derakhshan, H. 2017. Information Disorder: Toward an Interdisciplinary Framework for Research and Policymaking. https://shorensteincenter.org/information-disorder-framework-for-research-and-policymaking/.

Xiong, C.; Merity, S.; and Socher, R. 2016. Dynamic memory networks for visual and textual question answering. In International Conference on Machine Learning.

Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W. W.; Salakhutdinov, R.; and Manning, C. D. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.