=Paper=
{{Paper
|id=Vol-2911/paper1
|storemode=property
|title=ExDocS: Evidence based Explainable Document Search
|pdfUrl=https://ceur-ws.org/Vol-2911/paper1.pdf
|volume=Vol-2911
|authors=Sayantan Polley,Atin Janki,Marcus Thiel,Juliane Hoebel-Mueller,Andreas Nuernberger
}}
==ExDocS: Evidence based Explainable Document Search==
ExDocS: Evidence based Explainable Document Search
Sayantan Polley*1, Atin Janki*1, Marcus Thiel1, Juliane Hoebel-Mueller1 and Andreas Nuernberger1
1 Otto von Guericke University Magdeburg, Universitätsplatz 2, 39106 Magdeburg, Germany – authors marked with * contributed equally
The 1st International Workshop on Causality in Search and Recommendation (CSR'21), July 15, 2021, Online
sayantan.polley@ovgu.de (S. Polley*); atin.janki@ovgu.de (A. Janki*); marcus.thiel@ovgu.de (M. Thiel); juliane.hoebel@ovgu.de (J. Hoebel-Mueller); andreas.nuernberger@ovgu.de (A. Nuernberger)
Abstract
We present an explainable document search system (ExDocS), based on a re-ranking approach, that uses textual and visual explanations to explain document rankings to non-expert users. ExDocS attempts to answer questions such as "Why is document X ranked at Y for a given query?" and "How do we compare multiple documents to understand their relative rankings?". The contribution of this work lies in re-ranking methods based on various interpretable facets of evidence such as term statistics, contextual words, and citation-based popularity. The contribution from the user-interface perspective consists of providing intuitive, accessible explanations such as "document X is at rank Y because of matches found like Z", along with visual elements designed to compare the evidence and thereby explain the rankings. The quality of our re-ranking approach is evaluated on benchmark data sets in an ad-hoc retrieval setting. Due to the absence of ground truth for explanations, we evaluate the interpretability and completeness of the explanations in a user study. ExDocS is compared with a recent baseline, the explainable search system EXS, which uses the popular post-hoc explanation method LIME. In line with the "no free lunch" theorem, we find statistically significant results showing that ExDocS provides explanations for rankings that are understandable and complete, but the explanations come at the cost of a drop in ranking quality.
Keywords
Explainable Rankings, XIR, XAI, Re-ranking
1. Introduction

Explainability in Artificial Intelligence (XAI) is currently a vibrant research topic that attempts to make AI systems transparent and trustworthy to the concerned stakeholders. Research in XAI is interdisciplinary but is primarily led by the development of methods in the machine learning (ML) community. From the classification perspective, e.g. in a diagnostic setting, a doctor may be interested to know how the AI-driven solution arrived at a prediction for a disease. XAI methods in ML are typically based on exploiting features associated with a class label, on add-on model-specific methods like LRP [2], on model-agnostic ways such as LIME [3], or on causality-driven methods [4]. The explainability problem in IR is inherently different from a classification setting. In IR, the user may be interested to know how a certain document is ranked for the given query or why a certain document is ranked higher than others [5]. Often an explanation is an answer to a why question [6].

In this work, Explainable Document Search (ExDocS), we focus on a non-web ad-hoc text retrieval setting and aim to answer the following research questions:

1. Why is a document X ranked at Y for a given query?
2. How do we compare multiple documents to understand their relative rankings?
3. Are the explanations provided interpretable and complete?

There have been works [5], [7] in the recent past that attempted to address related questions such as "Why is a document relevant to the query?" by adapting XAI methods such as LIME [3], primarily for neural rankers. We argue that the idea of relevance has deeper connotations related to the semantic and syntactic notions of similarity in text. Hence, we try to tackle the XAI problem from a ranking perspective. Based on interpretable facets, we provide a simple re-ranking method that is agnostic of the retrieval model. ExDocS provides local textual explanations for each document (Part D in Fig. 1). The re-ranking approach enables us to display the "math behind the rank" for each of the retrieved documents (Part E in Fig. 1). Besides, we also provide a global explanation in the form of a comparative view of multiple retrieved documents (Fig. 4).

We discuss relevant work on explainable rankings in section two. We describe our contribution to the re-ranking approach and the methods to generate explanations in section three. Next, in section four, we discuss the quantitative evaluation of rankings on benchmark data sets and a comparative qualitative evaluation against an explainable search baseline in a user study. To our knowledge, this is one of the first works comparing two explainable search systems in a user study. In section five, we conclude that ExDocS provides explanations that are interpretable and complete; the results are statistically significant in a Wilcoxon signed-rank test. However, the explanations come at the cost of reduced ranking performance, paving the way for future work. The ExDocS system is online (https://tinyurl.com/ExDocSearch) and the source code is available on request for reproducible research.

Figure 1: The ExDocS Search Interface. Local textual explanation, marked (D), explains the rank of a document with a simplified mathematical score (E) used for re-ranking. A query-term bar, marked (C), for each document signifies the contribution of each query term. Other facets of local explanation can be seen in Fig. 2 & 3. A running column on the left, marked (B), shows a gradual fading of the color shade with decreasing rank. Global explanation via document comparison, marked here as (A), is shown in Fig. 4. Search results are shown for a sample query - 'wine market' on the EUR-Lex [1] dataset.
2. Related Work

The earliest attempts at making search results explainable can be seen in visualization paradigms [8, 9, 10] that aimed at explaining term distributions and statistics. Mi and Jiang [11] noted that IR systems were one of the earliest among other research fields to offer interpretations of system decisions and outputs, through search result summaries. The areas of product search [12] and personalized professional search [13] have explored explanations for search results by creating knowledge graphs based on users' logs. In [14] Melucci made a preliminary study and suggested that structural equation models from the causal perspective can be used to generate explanations for search systems. Related to explainability, the perspective of ethics and fairness [15, 16] is also often encountered in IR, whereby the retrieved data may be related to disadvantaged people or groups. In [17] a categorization of fairness in rankings is devised based on the use of pre-processing, in-processing, or post-processing strategies.
Recently there has been a rise in the study of the interpretability of neural rankers [5, 7, 18]. While [5] uses LIME, [7] uses DeepSHAP for generating explanations, and both of them differ considerably. Neural ranking can be thought of as an ordinal classification problem, thereby making it easier to leverage XAI concepts from the ML community to generate explanations. Moreover, [18] generates explanations through visualization using term statistics and by highlighting important passages within the retrieved documents. Apart from this, [19] offers a tool built upon Lucene to explain the internal workings of the vector space model, BM25, and the language model, but it is aimed at assisting researchers and is still far from an end user's understanding. ExDocS also focuses on explaining the internal operations of the search system, similar to [19]; however, it uses a custom ranking approach.

Singh and Anand's EXS [5] comes closest to ExDocS in terms of the questions it aims to answer through explanations, such as "Why is a document relevant to the query?" and "Why is a document ranked higher than the other?". EXS uses DRMM (Deep Relevance Matching Model), a pointwise neural ranking model that uses a deep architecture at the query-term level for relevance matching. For generating explanations it employs LIME [3]. We consider the explanations from EXS as a fair baseline and compare them with ExDocS in a user study.

3. Concept: Re-ranking via Interpretable facets

The concept behind ExDocS is based on re-ranking with interpretable facets of evidence such as term statistics, contextual words, and citation-based popularity. Each of these facets is also a selectable search criterion in the search interface. Our motivation is to provide a simple, intuitive mathematical explanation of each rank with reproducible results. Hence, we start with a common TF-IDF based vector space model (VSM, as provided out of the box by Apache Solr) with cosine similarity (ClassicSimilarity). The VSM lets us separate the contributions of the query terms, enabling us to analytically explain the ranks. BM25 was not deemed suitable for explaining the rankings to a user, since it cannot be interpreted completely analytically. On receiving a user query, we expand the query and search the index. The top hundred results are passed to the re-ranker (refer to Algorithm 1) to get the final results. Term count is taken as the first facet of evidence, since we assume that it is relatively easy to explain analytically to a non-expert end user as: "document X has P% relative occurrences ... compared to the best matching document" (refer to Part E in Fig. 1). The assumption on term count is also in line with a recent work [18] on explainable rankings.

Figure 2: Contribution of Query Terms for relevance
Figure 3: Coverage of matched terms in a document
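To make the "P% relative occurrences" explanation concrete, the following minimal Python sketch (not the ExDocS implementation; the documents, whitespace tokenization, and function names are illustrative assumptions) computes the term-count evidence per document and expresses it as a percentage of the best matching document's evidence:

<pre>
from collections import Counter

def term_count_evidence(query_terms, doc_tokens):
    """Sum of raw counts of each query term in the document ('term statistics' facet)."""
    counts = Counter(doc_tokens)
    return sum(counts[t] for t in query_terms)

def relative_scores(query, docs):
    """Return evidence and the 'P% relative occurrences' value per document,
    measured against the best matching document."""
    query_terms = query.lower().split()
    evidence = {doc_id: term_count_evidence(query_terms, text.lower().split())
                for doc_id, text in docs.items()}
    best = max(evidence.values()) or 1          # guard against division by zero
    return {doc_id: (ev, 100.0 * ev / best) for doc_id, ev in evidence.items()}

# Hypothetical mini-collection for the sample query used in Fig. 1
docs = {
    "doc_A": "The wine market regulation covers wine producers in the common market",
    "doc_B": "Market access rules for agricultural products",
}
for doc_id, (ev, pct) in relative_scores("wine market", docs).items():
    print(f"{doc_id}: evidence={ev}, {pct:.0f}% relative occurrences vs. best match")
</pre>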
Skip-gram word embeddings are used to determine contextual words. About two to three nearest-neighbor words are used to expand the query. Additionally, the WordNet thesaurus is used to detect synonyms. The optimal ratio of word-embedding neighbors versus synonyms is determined empirically by ranking performance. Re-ranking is performed based on the proportion of co-occurring words. This enables us to provide local explanations such as "document X is ranked at position Y because of matches found for synonyms like A and contextual words like B".
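A sketch of this query-expansion step, assuming a pre-trained skip-gram model in word2vec format loaded with gensim and WordNet via NLTK (the model path and the neighbor/synonym cut-offs are illustrative assumptions, not the ExDocS configuration):

<pre>
from gensim.models import KeyedVectors
from nltk.corpus import wordnet  # requires a prior nltk.download('wordnet')

# Hypothetical path to a skip-gram model in word2vec text format
vectors = KeyedVectors.load_word2vec_format("skipgram.vec", binary=False)

def expand_query(query_terms, n_context=3, n_synonyms=2):
    """Expand each query term with nearest-neighbor context words and WordNet synonyms."""
    expanded = set(query_terms)
    for term in query_terms:
        # contextual words: two to three nearest neighbors in embedding space
        if term in vectors:
            expanded.update(w for w, _ in vectors.most_similar(term, topn=n_context))
        # synonyms from the WordNet thesaurus
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(term) for lemma in syn.lemmas()}
        expanded.update(list(synonyms - {term})[:n_synonyms])
    return expanded

print(expand_query(["wine", "market"]))
</pre>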
Citation analysis is performed by combining weighted in-links, PageRank, and HITS scores for each document. Citation analysis was selected and deemed an interpretable facet that we named "document popularity". We argue that this can be used to generate understandable explanations such as: "document X is ranked at Y because of its popularity".
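The "document popularity" facet can be derived from the citation graph. A minimal sketch using networkx follows; the edges and the weighting of in-links, PageRank, and HITS are illustrative assumptions only, since the paper determines the combination empirically:

<pre>
import networkx as nx

# Hypothetical citation graph: an edge (a, b) means document a cites document b
citations = nx.DiGraph([("doc_B", "doc_A"), ("doc_C", "doc_A"), ("doc_C", "doc_B")])

def popularity_scores(graph, w_in=0.4, w_pr=0.3, w_hits=0.3):
    """Combine normalized in-link count, PageRank, and HITS authority into one popularity score."""
    max_in = max(dict(graph.in_degree()).values()) or 1
    in_links = {n: d / max_in for n, d in graph.in_degree()}   # normalized in-link count
    pagerank = nx.pagerank(graph)
    _, authorities = nx.hits(graph)                            # HITS authority scores
    return {n: w_in * in_links[n] + w_pr * pagerank[n] + w_hits * authorities[n]
            for n in graph.nodes()}

print(popularity_scores(citations))  # usable as popularityScore(di) in Algorithm 1
</pre>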
Finally, we re-rank using the following facets:

• Keyword Search: 'term statistics' (term count)
• Contextual Search: 'context words' (term count of query words + contextual words obtained by word-embedding expansion)
• Synonym Search: 'contextual words' (term count of query words + expanded contextual words); the contextual words are synonyms in this case, obtained via WordNet
• Contextual and Synonym Search: 'contextual words' (term count of query words + expanded contextual words); the contextual words are word-embedding neighbors plus synonyms in this case
• Keyword Search with Popularity score: 'citation-based popularity' (popularity score of a document)

Based on benchmark ranking performance, we empirically determine a weighted combination of these facets, which is also available as a search criterion in the interface. Additionally, we provide local and global visual explanations: local ones in the form of visualizing the contribution of the features (expanded query terms) for each document, and global ones by comparing them across multiple documents (refer to the Evidence Graph in the lower part of Fig. 4).

Algorithm 1: Re-ranking algorithm
  input:  q = {w1, w2, ..., wn}, D = {d1, d2, ..., dm}, facet
  output: a re-ranked document list

  Select the top-k docs from D using cosine similarity, giving {d'1, d'2, ..., d'k} ∈ Dk
  for i ← 1 to k do
      if facet == 'term statistics' or facet == 'contextual words' then
          evidence(di) ← Σ_{w ∈ q} count(w, di)      // count(w, di) is the count of term w in di
      if facet == 'citation-based popularity' then
          evidence(di) ← popularityScore(di)         // e.g. the in-link count, PageRank, or HITS score of di
  Rerank all docs in Dk using evidence
  return Dk
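As a complement to the pseudocode, here is a minimal runnable Python sketch of the re-ranking step (a sketch only, not the ExDocS source; the tokenized top-k documents and a precomputed popularity dictionary are assumed as inputs, along the lines of the earlier sketches):

<pre>
def rerank(query_terms, top_k_docs, facet, popularity=None):
    """Re-rank the top-k documents retrieved by cosine similarity using one interpretable facet.

    top_k_docs: dict mapping doc_id -> token list (e.g. the top-100 VSM results)
    facet:      'term statistics', 'contextual words' or 'citation-based popularity'
    popularity: dict doc_id -> popularity score (only needed for the citation facet)
    """
    evidence = {}
    for doc_id, tokens in top_k_docs.items():
        if facet in ("term statistics", "contextual words"):
            # for 'contextual words', query_terms would be the expanded query
            evidence[doc_id] = sum(tokens.count(t) for t in query_terms)
        elif facet == "citation-based popularity":
            evidence[doc_id] = popularity.get(doc_id, 0.0)
    # documents with more evidence move up; ties keep their original order
    return sorted(top_k_docs, key=lambda d: evidence[d], reverse=True)

top_k = {"doc_A": "wine market regulation for wine producers".split(),
         "doc_B": "market access rules".split()}
print(rerank(["wine", "market"], top_k, facet="term statistics"))
</pre>

For the combined search criterion, the same sort can in principle be applied to an empirically weighted sum of the normalized facet scores.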
4. Evaluation

We have two specific focus areas in the evaluation. The first is related to the quality of the rankings and the second is related to the explainability aspect. We leave the evaluation of the popularity score model for future work.

4.1. Evaluation of the re-ranking algorithm

We tested the re-ranking algorithm on the TREC Disks 4 & 5 (-CR) dataset. The evaluations were carried out using the trec_eval [20] package. We used the TREC-6 ad-hoc queries (topics 301-350) and used only the 'Title' field of the topics as the query. We noticed that the Keyword Search, Contextual Search, Synonym Search, and Contextual and Synonym Search systems were unable to beat the 'Baseline ExDocS' (out-of-the-box Apache Solr) on metrics such as MAP, R-Precision, and NDCG (refer to Table 1). We benchmark our retrieval performance by comparing with [21] and confirm that our ranking approach needs improvement to at least match the baseline performance metrics.
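The paper reports these metrics via the trec_eval package; a roughly equivalent check in Python with the pytrec_eval bindings could look like the following sketch (the qrels and run entries are placeholders, not TREC data):

<pre>
import pytrec_eval

# qrels: topic -> doc -> relevance label; run: topic -> doc -> system score (placeholder values)
qrels = {"301": {"DOC-0001": 1, "DOC-0002": 0}}
run   = {"301": {"DOC-0001": 12.3, "DOC-0002": 7.1}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"map", "Rprec", "ndcg"})
per_topic = evaluator.evaluate(run)
for topic, measures in per_topic.items():
    print(topic, {m: round(v, 4) for m, v in measures.items()})
</pre>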
Figure 4: Global Explanation by comparison of evidence for multiple documents (increasing ranks from left to right). A title-body image is provided, marked (A), to indicate whether the query term was found in the title and/or body. The column marked (B) represents the attributes for comparison.
Table 1
MAP, R-Precision, and NDCG values for ExDocS search systems against TREC-6 benchmark values*[21]
IR Systems MAP R-Precision NDCG
csiro97a3* 0.126 0.1481 NA
DCU97vs* 0.194 0.2282 NA
mds603* 0.157 0.1877 NA
glair61* 0.177 0.2094 NA
Baseline ExDocS 0.186 0.2106 0.554
Keyword Search 0.107 0.1081 0.462
Contextual Search 0.080 0.0955 0.457
Synonym Search 0.078 0.0791 0.411
Contextual and Synonym Search 0.046 0.0526 0.405
4.2. Evaluation of explanations

We performed a user study to qualitatively evaluate the explanations. To compare ExDocS's explanations with those of EXS, we integrated EXS's explanation model into our interface. By keeping the look and feel of both systems alike, we tried to reduce users' bias towards either system.

4.2.1. User study setup

A total of 32 users participated in a lab-controlled user study. 30 users were from a computer science background, while 26 users had a fair knowledge of information retrieval systems. Each user was asked to test both systems, and the questionnaire was formatted in a Latin-block design. The names of the systems were masked as System-A (EXS) and System-B (ExDocS).

4.2.2. Metrics for evaluation

We use the existing definitions ([6] and [22]) of interpretability, completeness, and transparency in the community with respect to evaluation in XAI. The following factors are used for evaluating the quality and effectiveness of explanations:

• Interpretability: describing the internals of a system in human-understandable terms [6].
• Completeness: describing the operation of a system accurately and allowing the system's behavior to be anticipated in the future [6].
• Transparency: an IR system should be able to demonstrate to its users and other interested parties why and how the proposed outcomes were achieved [22].

4.3. Results and Discussion

We discuss the results of our experiments and draw conclusions to answer the research questions.

RQ1. Why is a document X ranked at Y for a given query?
We answer this question by providing an individual textual explanation for every document (refer to Part D of Fig. 1) on the ExDocS interface. The "math behind the rank" (refer to Part E of Fig. 1) of a document is explained as a percentage of the evidence with respect to the best matching document.

RQ2. How do we compare multiple documents to understand their relative rankings?
We provide an option to compare multiple documents through visual and textual paradigms (refer to Fig. 4). The evidence can be compared and contrasted, so that the user can understand why a document's rank is higher or lower than that of others.

RQ3. Are the generated explanations interpretable and complete?
We evaluate the quality of the explanations in terms of their interpretability and completeness. Empirical evidence from the user study on interpretability:

1. 96.88% of the users understood the textual explanations of ExDocS.
2. 71.88% of the users understood the relation between the query term and the features (synonyms or contextual words) shown in the explanation.
3. Users gave a mean rating of 4 out of 5 (standard deviation = 1.11) to ExDocS on the understandability of the percentage calculation for rankings, shown as part of the explanations.

When users were explicitly asked whether they could "gather an understanding of how the system functions based on the given explanations", they gave a positive response with a mean rating of 3.84 out of 5 (standard deviation = 0.72). The above-mentioned empirical evidence indicates that the ranking explanations provided by ExDocS can be deemed interpretable.

Empirical evidence from the user study on completeness:

1. All users found the features shown in the explanations of ExDocS to be reasonable (i.e. sensible or fairly good).
2. 90.63% of the users understood from the comparative explanations of ExDocS why a particular document was ranked higher or lower than other documents.

Moreover, 78.13% of the users claimed that they could anticipate ExDocS's behavior in the future based on the understanding gathered through the explanations (individual and comparative). Based on the above empirical evidence, we argue that the ranking explanations generated by ExDocS can be assumed to be complete.

Transparency: We investigate whether the explanations make ExDocS more transparent [22] to the user. Users gave ExDocS a mean rating of 3.97 out of 5 (standard deviation = 0.86) on 'Transparency' based on the individual (local) explanations. In addition, 90.63% of the users indicated that ExDocS became more transparent to them after reading the comparative (global) explanations. This indicates that the explanations make ExDocS more transparent to the user.
more transparent to the user. [1] E. L. Mencia, J. Fürnkranz, Efficient multilabel clas-
sification algorithms for large-scale problems in
the legal domain, in: Semantic Processing of Legal
Texts, Springer, 2010, pp. 192–215.
[2] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R.
Müller, W. Samek, On pixel-wise explanations for
non-linear classifier decisions by layer-wise rele-
vance propagation, PloS one 10 (2015) e0130140.
[3] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I
Trust You?": Explaining the Predictions of Any Clas-
sifier, in: Proceedings of the 22nd ACM SIGKDD
International Conference on Knowledge Discovery
and Data Mining, KDD ’16, Association for Com-
puting Machinery, New York, NY, USA, 2016, p.
1135–1144.
Figure 5: Comparison of explanations from EXS and ExDocS
on different XAI metrics. All the values shown here are scaled [4] J. Pearl, et al., Causal inference in statistics: An
between [0-1] for simplicity. overview, Statistics surveys 3 (2009) 96–146.
[5] J. Singh, A. Anand, EXS: Explainable Search Using
Local Model Agnostic Interpretability, in: Proceed-
Comparison of explanations between ExDocS ings of the Twelfth ACM International Conference
and EXS: on Web Search and Data Mining, WSDM ’19, Asso-
Both the systems performed similarly in terms of ciation for Computing Machinery, New York, NY,
𝑇 𝑟𝑎𝑛𝑠𝑝𝑎𝑟𝑒𝑛𝑐𝑦 and 𝐶𝑜𝑚𝑝𝑙𝑒𝑡𝑒𝑛𝑒𝑠𝑠. However, users USA, 2019, p. 770–773.
found ExDocS explanations to be more interpretable com- [6] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter,
pared to that of EXS (refer to Fig. 5), and this compar- L. Kagal, Explaining explanations: An overview of
ison was statistically significant in WSR test (|𝑊 | < interpretability of machine learning, in: 2018 IEEE
𝑊𝑐𝑟𝑖𝑡𝑖𝑐𝑎𝑙(𝛼 = 0.05,𝑁𝑟 = 10) = 10, where |𝑊 | = 5.5).
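For reference, comparing paired per-user ratings of the two systems with a Wilcoxon signed-rank test can be reproduced with scipy; a minimal sketch, with made-up placeholder ratings rather than the study data:

<pre>
from scipy.stats import wilcoxon

# Paired interpretability ratings per user (System-A = EXS, System-B = ExDocS); placeholder values
exs_ratings    = [3, 4, 2, 3, 3, 4, 2, 3, 4, 3]
exdocs_ratings = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4]

statistic, p_value = wilcoxon(exs_ratings, exdocs_ratings)
print(f"W = {statistic}, p = {p_value:.4f}")  # compare W against the critical value for the chosen alpha
</pre>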
5. Conclusion and Future Work

In this work, we present an Explainable Document Search (ExDocS) system that attempts to explain document rankings to a non-expert user using a combination of textual and visual elements. We make use of word embeddings and the WordNet thesaurus to expand the user query. We use various interpretable facets such as term statistics, contextual words, and citation-based popularity. Re-ranking results from a simple vector space model with such interpretable facets helps us to explain the "math behind the rank" to an end user. We evaluate the explanations by comparing ExDocS with another explainable search baseline in a user study. We find statistically significant results that ExDocS provides interpretable and complete explanations, although it was difficult to find a clear winner between both systems in all aspects. In line with the "no free lunch" theorem, the results show a drop in ranking quality on benchmark data sets as the cost of obtaining comprehensible explanations. This paves the way for ongoing research on including user feedback to adapt the rankings and explanations. ExDocS is currently being evaluated in domain-specific search settings like law search, where explainability is a key factor in gaining user trust.

References

[1] E. L. Mencia, J. Fürnkranz, Efficient multilabel classification algorithms for large-scale problems in the legal domain, in: Semantic Processing of Legal Texts, Springer, 2010, pp. 192–215.
[2] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, W. Samek, On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation, PLoS ONE 10 (2015) e0130140.
[3] M. T. Ribeiro, S. Singh, C. Guestrin, "Why Should I Trust You?": Explaining the Predictions of Any Classifier, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '16, Association for Computing Machinery, New York, NY, USA, 2016, pp. 1135–1144.
[4] J. Pearl, et al., Causal inference in statistics: An overview, Statistics Surveys 3 (2009) 96–146.
[5] J. Singh, A. Anand, EXS: Explainable Search Using Local Model Agnostic Interpretability, in: Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, WSDM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 770–773.
[6] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, L. Kagal, Explaining explanations: An overview of interpretability of machine learning, in: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), IEEE, 2018, pp. 80–89.
[7] Z. T. Fernando, J. Singh, A. Anand, A Study on the Interpretability of Neural Retrieval Models Using DeepSHAP, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 1005–1008.
[8] M. A. Hearst, TileBars: Visualization of Term Distribution Information in Full Text Information Access, in: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI '95, ACM Press/Addison-Wesley Publishing Co., USA, 1995, pp. 59–66.
[9] O. Hoeber, M. Brooks, D. Schroeder, X. D. Yang, TheHotMap.Com: Enabling Flexible Interaction in Next-Generation Web Search Interfaces, in: Proceedings of the 2008 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology - Volume 01, WI-IAT '08, IEEE Computer Society, USA, 2008, pp. 730–734.
[10] M. A. Soliman, I. F. Ilyas, K. C.-C. Chang, URank: Formulation and Efficient Evaluation of Top-k Queries in Uncertain Databases, in: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD '07, Association for Computing Machinery, New York, NY, USA, 2007, pp. 1082–1084.
[11] S. Mi, J. Jiang, Understanding the Interpretability of Search Result Summaries, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 989–992.
[12] Q. Ai, Y. Zhang, K. Bi, W. B. Croft, Explainable Product Search with a Dynamic Relation Embedding Model, ACM Trans. Inf. Syst. 38 (2019).
[13] S. Verberne, Explainable IR for personalizing professional search, in: ProfS/KG4IR/Data:Search@SIGIR, 2018.
[14] M. Melucci, Can Structural Equation Models Interpret Search Systems?, in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '19, Association for Computing Machinery, New York, NY, USA, 2019. URL: https://ears2019.github.io/Melucci-EARS2019.pdf.
[15] A. J. Biega, K. P. Gummadi, G. Weikum, Equity of attention: Amortizing individual fairness in rankings, in: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, 2018, pp. 405–414.
[16] S. C. Geyik, S. Ambler, K. Kenthapadi, Fairness-Aware Ranking in Search and Recommendation Systems with Application to LinkedIn Talent Search, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2221–2231. URL: https://doi.org/10.1145/3292500.3330691. doi:10.1145/3292500.3330691.
[17] C. Castillo, Fairness and Transparency in Ranking, SIGIR Forum 52 (2019) 64–71.
[18] V. Chios, Helping results assessment by adding explainable elements to the deep relevance matching model, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA, 2020. URL: https://ears2020.github.io/accept_papers/2.pdf.
[19] D. Roy, S. Saha, M. Mitra, B. Sen, D. Ganguly, I-REX: A Lucene Plugin for EXplainable IR, in: Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 2949–2952.
[20] C. Buckley, et al., The trec_eval evaluation package, 2004.
[21] D. K. Harman, E. Voorhees, The Sixth Text REtrieval Conference (TREC-6), US Department of Commerce, Technology Administration, National Institute of Standards and Technology (NIST), 1998.
[22] A. Olteanu, J. Garcia-Gathright, M. de Rijke, M. D. Ekstrand, Workshop on Fairness, Accountability, Confidentiality, Transparency, and Safety in Information Retrieval (FACTS-IR), in: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2019, pp. 1423–1425.