ExDocS: Evidence based Explainable Document Search

Sayantan Polley

Atin Janki

Marcus Thiel

Juliane Hoebel-Mueller

Andreas Nuernberger

Otto von Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany - first authors with

We present an explainable document search system (ExDocS), based on a re-ranking approach, that uses textual and visual explanations to explain document rankings to non-expert users. ExDocS attempts to answer questions such as “Why is document X ranked at Y for a given query?” and “How do we compare multiple documents to understand their relative rankings?”. The contribution of this work lies in re-ranking methods based on various interpretable facets of evidence such as term statistics, contextual words, and citation-based popularity. From the user interface perspective, the contribution consists of intuitive, accessible explanations such as “document X is at rank Y because of matches found like Z”, along with visual elements designed to compare the evidence and thereby explain the rankings. The quality of our re-ranking approach is evaluated on benchmark data sets in an ad-hoc retrieval setting. Because no ground truth exists for explanations, we evaluate their interpretability and completeness in a user study. ExDocS is compared with a recent baseline, the explainable search system EXS, which uses a popular post-hoc explanation method called LIME. In line with the “no free lunch” theorem, we find statistically significant results showing that ExDocS provides explanations for rankings that are understandable and complete, but that these explanations come at the cost of a drop in ranking quality.

Keywords: Explainable Rankings, XIR, XAI, Re-ranking

1. Introduction

Explainability in Artificial Intelligence (XAI) is currently a vibrant research topic that attempts to make AI systems transparent and trustworthy to the concerned stakeholders. Research in the XAI domain is interdisciplinary but is primarily led by the development of methods from the machine learning (ML) community. From the classification perspective, e.g., in a diagnostic setting, a doctor may be interested to know how the prediction for a disease is made by the AI-driven solution. XAI methods in ML are typically based on exploiting features associated with a class label, or on the development of add-on model-specific methods like LRP [2], model-agnostic methods such as LIME [3], or causality-driven methods [4]. The explainability problem in IR is inherently different from a classification setting. In IR, the user may be interested to know how a certain document is ranked for the given query or why a certain document is ranked higher than others [5]. Often an explanation is an answer to a why question [6].

In this work, Explainable Document Search (ExDocS), we focus on a non-web ad-hoc text retrieval setting and aim to answer the following research questions:

1. Why is a document X ranked at Y for a given query?
2. How do we compare multiple documents to understand their relative rankings?
3. Are the explanations provided interpretable and complete?

There have been works [5], [7] in the recent past that attempted to address related questions such as "Why is a document relevant to the query?" by adapting XAI methods such as LIME [3], primarily for neural rankers. We argue that the idea of relevance has deeper connotations related to the semantic and syntactic notion of similarity in text. Hence, we try to tackle the XAI problem from a ranking perspective. Based on interpretable facets, we provide a simple re-ranking method that is agnostic of the retrieval model. ExDocS provides local textual explanations for each document (Part D in Fig. 1). The re-ranking approach enables us to display the “math behind the rank” for each of the retrieved documents (Part E in Fig. 1). Besides, we also provide a global explanation in the form of a comparative view of multiple retrieved documents (Fig. 4).

We discuss relevant work on explainable rankings in section two. We describe our contribution to the re-ranking approach and the methods to generate explanations in section three. Next, in section four, we discuss the quantitative evaluation of rankings on benchmark data sets and a comparative qualitative evaluation with an explainable search baseline in a user study. To our knowledge, this is one of the first works comparing two explainable search systems in a user study. In section five, we conclude that ExDocS provides explanations that are interpretable and complete. The results are statistically significant in the Wilcoxon signed-rank test. However, the explanations come at a cost of reduced ranking performance, paving the way for future work. The ExDocS system is online¹ and the source code is available on request for reproducible research.

The 1st International Workshop on Causality in Search and Recommendation (CSR’21), July 15, 2021, Online
sayantan.polley@ovgu.de (S. Polley*); atin.janki@ovgu.de (A. Janki*); marcus.thiel@ovgu.de (M. Thiel); juliane.hoebel@ovgu.de (J. Hoebel-Mueller); andreas.nuernberger@ovgu.de (A. Nuernberger)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

2. Related Work

The earliest attempts at making search results explainable can be seen in the visualization paradigms [8, 9, 10] that aimed at explaining term distribution and statistics. Mi and Jiang [11] noted that IR systems were among the earliest research fields to offer interpretations of system decisions and outputs, through search result summaries. The areas of product search [12] and personalized professional search [13] have explored explanations for search results by creating knowledge graphs based on users' logs. In [14], Melucci made a preliminary study and suggested that structural equation models from the causal perspective can be used to generate explanations for search systems. Related to explainability, the perspective of ethics and fairness [15, 16] is also often encountered in IR, whereby the retrieved data may be related to disadvantaged people or groups. In [17] a categorization of fairness in rankings is devised based on the use of pre-processing, in-processing, or

Figure 2: Contribution of Query Terms for relevance

¹ https://tinyurl.com/ExDocSearch

3. Concept: Re-ranking via Interpretable facets

The concept behind ExDocS is based on the re-ranking of interpretable facets of evidence such as term statistics, contextual words, and citation-based popularity. Each of these facets is also a selectable search criterion in the search interface. We have a motivation to provide a

• Contextual and Synonym Search: ‘contextual words’ (term-count of query words + expanded contextual words). Contextual words are word-embeddings + synonyms in this case.
• Keyword Search with Popularity score: ‘citation-based popularity’ (popularity score of a document)

Based on benchmark ranking performance, we empirically determine a weighted combination of these facets, which is also available as a search criterion in the interface. Additionally, we provide local and global visual explanations: local ones visualize the contribution of features (expanded query terms) for each document, and global ones compare these contributions across multiple documents (refer to the Evidence Graph in the lower part of Fig. 4).

Algorithm 1: Re-ranking algorithm

input : q = {w1, w2, ..., wn}, D = {d1, d2, ..., dm}, facet
output: A re-ranked doc list

Select top-k docs from D using cosine similarity, such as {d′1, d′2, ..., d′k} ∈ Dk
for i ← 1 to k do
    if facet == ‘term statistics’ or ‘contextual words’ then
        evidence(di) ← Σ_{w ∈ q} count(w, di)    // count(w, di) is count of term w in di
    end
    if facet == ‘citation-based popularity’ then
        evidence(di) ← popularityScore(di)    // popularityScore(di) could be inLinks count, PageRank or HITS score of di
    end
end
Rerank all docs in Dk using evidence
return Dk

4. Evaluation

We have two specific focus areas in evaluation. The first one is related to the quality of the rankings and the second one is related to the explainability aspect. We leave out evaluation of the popularity score model for future work.

4.1. Evaluation of re-ranking algorithm

We evaluated the re-ranking algorithm on the TREC Disk 4 & 5 (-CR) dataset. The evaluations were carried out using the trec_eval [20] package. We used TREC-6 ad-hoc queries (topics 301-350) and used only the ‘Title’ of the topics as the query. We noticed that the Keyword Search, Contextual Search, Synonym Search, and Contextual Synonym Search systems were unable to beat the ‘Baseline ExDocS’ (OOTB Apache Solr) on metrics such as MAP, R-Precision, and NDCG (refer to Table 1). We benchmark our retrieval performance by comparing with [21] and confirm that our ranking approach needs improvement to at least match the baseline performance metrics.

4.2. Evaluation of explanations

We performed a user study to qualitatively evaluate the explanations. To compare ExDocS’s explanations with those of EXS, we integrated EXS’s explanation model into our interface. By keeping the look and feel of both systems alike, we tried to reduce users’ bias towards either system.

4.2.1. User study setup

A total of 32 users participated in a lab-controlled user study. 30 users were from a computer science background, while 26 users had a fair knowledge of information retrieval systems. Each user was asked to test both systems, and the questionnaire was formatted in a Latin-block design. The names of the systems were masked as System-A (EXS) and System-B (ExDocS).

4.2.2. Metrics for evaluation

We use the existing definitions ([6] and [22]) of Interpretability, Completeness and Transparency in the community with respect to evaluation in XAI. The following factors are used for evaluating the quality and effectiveness of explanations:

• Interpretability: describing the internals of a system in human-understandable terms [6].
• Completeness: describing the operation of a system accurately and allowing the system’s behavior to be anticipated in future [6].
• Transparency: an IR system should be able to demonstrate to its users and other interested parties, why and how the proposed outcomes were achieved [22].

4.3. Results and Discussion

We discuss the results of our experiments and draw conclusions to answer the research questions.

RQ1. Why is a document X ranked at Y for a given query? We answer this question by providing the individual textual explanation for every document (refer to Part D of Fig. 1) on the ExDocS interface. The “math behind the rank” (refer to Part E of Fig. 1) of a document is explained as a percentage of the evidence with respect to the best matching document.
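The re-ranking by term-count evidence and the percentage shown as the “math behind the rank” can be illustrated with a minimal Python sketch. It is not the ExDocS implementation: the whitespace tokenizer, the function names, and the toy documents are assumptions made for illustration, and the candidate documents are assumed to be the top-k list already selected by the retrieval model.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; the paper does not specify one)."""
    return text.lower().split()


def term_evidence(query_terms, doc_text):
    """Sum of count(w, d) over all query terms w (term-statistics facet)."""
    counts = Counter(tokenize(doc_text))
    return sum(counts[w] for w in query_terms)


def rerank_with_percentages(query, top_k_docs):
    """Re-rank candidate docs by evidence and express each score as a
    percentage of the evidence of the best-matching document."""
    # Treat the query as a set of distinct terms, keeping their order.
    query_terms = list(dict.fromkeys(tokenize(query)))
    scored = [(doc_id, term_evidence(query_terms, text))
              for doc_id, text in top_k_docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Guard against an empty list or an all-zero evidence column.
    best = scored[0][1] if scored and scored[0][1] > 0 else 1
    return [(doc_id, ev, round(100.0 * ev / best, 2)) for doc_id, ev in scored]


# Hypothetical toy collection standing in for the top-k retrieved documents.
docs = {
    "d1": "wildlife extinction threatens wildlife habitats",
    "d2": "extinction of species",
    "d3": "stock markets today",
}
ranking = rerank_with_percentages("wildlife extinction", docs)
for doc_id, evidence, percent in ranking:
    print(f"{doc_id}: evidence={evidence}, {percent}% of best match")
```

In this toy example the best-matching document receives 100% and every other document is reported relative to it, which is the kind of statement a textual explanation can be built from.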

RQ2. How do we compare multiple documents to understand their relative rankings? We provide an option to compare multiple documents through visual and textual paradigms (refer to Fig. 4). The evidence for each document can be compared and contrasted, which helps the user understand why a document is ranked higher or lower than others.
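A comparative view of this kind needs the evidence broken down per query term and per document. The following Python sketch shows one way such a breakdown could be assembled and turned into a one-line textual comparison; the data layout, function names, and example documents are hypothetical and do not reflect the actual ExDocS interface code.

```python
from collections import Counter


def evidence_table(query_terms, docs):
    """Per-document match counts for every query term, plus a total."""
    table = {}
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        per_term = {term: counts[term] for term in query_terms}
        table[doc_id] = {"terms": per_term, "total": sum(per_term.values())}
    return table


def compare(table, doc_a, doc_b):
    """Textual side-by-side comparison of two documents' evidence."""
    a, b = table[doc_a], table[doc_b]
    parts = [f"'{t}': {a['terms'][t]} vs {b['terms'][t]}" for t in a["terms"]]
    return (f"{doc_a} (evidence {a['total']}) vs {doc_b} (evidence {b['total']}): "
            + "; ".join(parts))


# Hypothetical toy documents and query terms.
docs = {
    "d1": "wildlife extinction threatens wildlife habitats",
    "d2": "extinction of species",
}
table = evidence_table(["wildlife", "extinction"], docs)
summary = compare(table, "d1", "d2")
print(summary)
```

Keeping the per-term counts separate from the total makes it possible to show both which terms drive the difference between two documents and by how much.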

RQ3. Are the generated explanations interpretable and complete? We evaluate the quality of the explanations in terms of their interpretability and completeness.

Empirical evidence from the user study on Interpretability:

1. 96.88% of the users understood the textual explanations of ExDocS
2. 71.88% of the users understood the relation between the query term and features (synonyms or contextual words) shown in the explanation
3. Users gave a mean rating of 4 out of 5 (standard deviation = 1.11) to ExDocS on the understandability of the percentage calculation for rankings, shown as part of the explanations

When users were explicitly asked whether they could “gather an understanding of how the system functions based on the given explanations”, they gave a positive response with a mean rating of 3.84 out of 5 (standard deviation = 0.72). The above-mentioned empirical evidence indicates that the ranking explanations provided by ExDocS can be deemed interpretable.

Empirical evidence from the user study on Completeness:

1. All users found the features shown in the explanation of ExDocS to be reasonable (i.e. sensible or fairly good)
2. 90.63% of the users understood through the comparative explanations of ExDocS why a particular document was ranked higher or lower than other documents

Moreover, 78.13% of total users claimed that they could anticipate ExDocS behavior in the future based on the understanding gathered through explanations (individual and comparative). Based on the above empirical evidence we argue that the ranking explanations generated by ExDocS can be assumed to be complete.

Transparency: We investigate if the explanations make ExDocS more transparent [22] to the user. Users gave ExDocS a mean rating of 3.97 out of 5 (standard deviation = 0.86) on ‘Transparency’ based on the individual (local) explanations. In addition to that, 90.63% of the total users indicated that ExDocS became more transparent after reading the comparative (global) explanations. This indicates that explanations make ExDocS more transparent to the user.

Comparison of explanations between ExDocS and EXS: Both systems performed similarly in terms of transparency and completeness. However, users found ExDocS explanations to be more interpretable than those of EXS (refer to Fig. 5), and this comparison was statistically significant in the Wilcoxon signed-rank (WSR) test (|W| < W_critical(α = 0.05, n = 10) = 10, where |W| = 5.5).

5. Conclusion and Future Work

In this work, we present an Explainable Document Search (ExDocS) system that attempts to explain document rankings to a non-expert user using a combination of textual and visual elements. We make use of word embeddings and the WordNet thesaurus to expand the user query. We use various interpretable facets such as term statistics, contextual words, and citation-based popularity. Re-ranking the results of a simple vector space model with such interpretable facets helps us to explain the “math behind the rank” to an end-user. We evaluate the explanations by comparing ExDocS with another explainable search baseline in a user study. We find statistically significant results that ExDocS provides interpretable and complete explanations. However, it was difficult to find a clear winner between the two systems in all aspects. In line with the “no free lunch” theorem, the results show that comprehensible explanations come at the cost of a drop in ranking quality on benchmark data sets. This paves the way for ongoing research to include user feedback to adapt the rankings and explanations. ExDocS is currently being evaluated in domain-specific search settings like law search, where explainability is a key factor to gain user trust.

References

[1] E. L. Mencia, J. Fürnkranz, Efficient multilabel classification algorithms for large-scale problems in the legal domain, in: Semantic Processing of Legal Texts, Springer, 2010, pp. 192–215.
[2] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PloS one 10 (2015) e0130140.
[3] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144.
[4] J. Pearl, et al., Causal inference in statistics: An overview, Statistics surveys 3 (2009) 96–146.
[5] J. Singh, A. Anand, EXS: Explainable Search Using Local Model Agnostic Interpretability, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 770–773.
[6] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2018, pp. 80–89.
[7] Z. T. Fernando, J. Singh, A. Anand, A Study on the Interpretability of Neural Retrieval Models Using DeepSHAP, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1005–1008.
[8] M. A. Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’95, ACM Press/Addison-Wesley Publishing Co., USA, 1995, pp. 59–66.
[9] O. Hoeber, M. Brooks, D. Schroeder, X. D. Yang, TheHotMap.Com: Enabling Flexible Interaction in Next-Generation Web Search Interfaces, in: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT ’08, IEEE Computer Society, USA, 2008, pp. 730–734.
[10] M. A. Soliman, I. F. Ilyas, K. C.-C. Chang, URank: Formulation and Efficient Evaluation of Top-k Queries in Uncertain Databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 1082–1084.
[11] S. Mi, J. Jiang, Understanding the Interpretability of Search Result Summaries, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 989–992.
[12] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable Product Search with a Dynamic Relation Embedding Model, ACM Trans. Inf. Syst. 38 (2019).
[13] S. Verberne, Explainable IR for personalizing professional search, in: ProfS/KG4IR/Data: Search@SIGIR, 2018.
[14] M. Melucci, Can Structural Equation Models Interpret Search Systems?, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR’19, Association for Computing Machinery, New York, NY, USA, 2019. URL: https://ears2019.github.io/Melucci-EARS2019.pdf.
[15] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR conference on Research & Development in Information Retrieval, 2018, pp. 405–414.
[16] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-Aware Ranking in Search and Recommendation Systems with Application to LinkedIn Talent Search, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2221–2231. URL: https://doi.org/10.1145/3292500.3330691. doi:10.1145/3292500.3330691.
[17] C. Castillo, Fairness and Transparency in Ranking, SIGIR Forum 52 (2019) 64–71.
[18] V. Chios, Helping results assessment by adding explainable elements to the deep relevance matching model, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2020. URL: https://ears2020.github.io/accept_papers/2.pdf.
[19] D. Roy, S. Saha, M. Mitra, B. Sen, D. Ganguly, I-REX: A Lucene Plugin for EXplainable IR, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2949–2952.
[20] C. Buckley, et al., The trec_eval evaluation package, 2004.
[21] D. K. Harman, E. Voorhees, The Sixth Text REtrieval Conference (TREC-6), US Department of Commerce, Technology Administration, National Institute of Standards and Technology (NIST), 1998.
[22] A. Olteanu, J. Garcia-Gathright, M. de Rijke, M. D. Ekstrand, Workshop on Fairness, Accountability, Confidentiality, Transparency, and Safety in Information Retrieval (FACTS-IR), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1423–1425.