Validating Correctness of Textual Explanations with Complete Discourse Trees

Boris Galitsky, Oracle Corp., Redwood Shores, CA, USA
Dmitry Ilvovsky, National Research University Higher School of Economics, Moscow, Russia

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract
We explore how to validate the soundness of textual explanations in a domain-independent manner. We further assess how people perceive the explanations of their opponents and what factors determine whether explanations are acceptable or not. We discover that what we call a complete discourse tree (complete DT) determines the acceptability of an explanation. A complete DT is the sum of a traditional DT for a paragraph of actual text and an imaginary DT for a text about entities used but not explicitly defined in the actual text.

1 Introduction

Providing explanations of decisions for human users, and understanding how human agents explain their decisions, are important features of intelligent decision making and decision support systems. A number of complex forms of human behavior are associated with attempts to provide acceptable and convincing explanations. In this paper, we propose a computational framework for assessing the soundness of explanations and explore how such soundness correlates with discourse-level analysis.

The importance of explanation-aware computing has been demonstrated in multiple studies and systems. Walton (2007) argued for replacing the older model of explanation as a chain of inferences with a pragmatic and communicative model that structures an explanation as a dialog exchange. The field of explanation-aware computing is now actively contributing to such areas as legal reasoning, natural language processing and multi-agent systems (Dunne and Bench-Capon, 2006). It has been shown (Walton, 2008) how the argumentation methodology implements the concept of explanation by transforming an example of an explanation into a formal dialog structure. Galitsky (2008) differentiated between explaining as a chain of inference of facts mentioned in a dialogue, and meta-explaining as dealing with the formal dialog structure represented as a graph. Both levels of explanation are implemented as argumentation: explanation operates with individual claims communicated in a dialogue, and meta-explanation relies on the overall argumentation structure of scenarios.

In this paper we explore how a good explanation in text can be computationally differentiated from a bad one. Intuitively, a good explanation convinces the addressee that a communicated claim is right; it involves valid argumentation patterns and is logical, complete and thorough. A bad explanation is unconvincing and detached from the beliefs of the addressee; it includes flawed argumentation patterns and omits necessary entities. In this work we differentiate between good and bad explanations based on the human response to them. Whereas users are satisfied with a good explanation by a system or a human, bad explanations usually lead to dissatisfaction, embarrassment and complaints.

2 Validating Explanations with Discourse Trees

2.1 Classes of explanation

To systematically treat the classes of explanation, we select an environment where customers receive explanations from customer service regarding problems they have encountered. If these customers are not satisfied with the explanations, they frequently submit detailed complaints to consumer advocacy sites. In some of these complaints the customers explain why they are right and why the company's explanation is wrong.
From these training sets we select the good/bad explanation pairs and define the respective explanation classes via learning to recognize them.

Another way to consider a bad explanation is what we call an explanation attempt: a logical chain is built, but it has omissions and inconsistencies that make the explanation bad. An absence of a logical chain means an absence of explanation; otherwise, if such a chain obeys certain logical properties, it can be interpreted as something other than explanation: argumentation, clarification, confirmation or another mental or epistemic state.

2.2 Explanation and Argumentation

Explanations are correlated with argumentation and sentiments. A request to explain is usually associated with certain arguments and a negative sentiment. For an arbitrary statement S, a person may have little or no prior reason for believing this statement to be true. In this case the cognitive response is doubt, which is articulated with a request for evidence. Evidence is a kind of reason, and the attempt to provide evidence in support of a conclusion is normally called an argument. Argument reasoning is represented at the top of Fig. 1. On the other hand, a person may already know S and require no further evidence for the truth of S. But she still may not understand why S holds (occurred, happened, etc.). In this case she would request a cause. Explanation is defined as an attempt to provide a cause in support of a conclusion. Explanation reasoning is represented at the bottom of Fig. 1.

Fig. 1: Relationship between argumentation and explanation

2.3 Hybrid discourse trees

In the banking domain, the non-sufficient funds (NSF) fee is a major problem that banks have difficulty communicating to customers. An example of a brief, informal explanation follows:

It's not always easy to understand overdraft fees. When a transaction drops your checking account balance below zero, what happens next is up to your bank. A bank or credit union might pay for the transaction or decline it and, either way, could charge you a fee.

Fig. 2: Discourse tree of the explanation text, with the imaginary part shown in the top-right for the nucleus 'transaction'

The concept of transaction is not tackled in this text explaining the non-sufficient funds fee. An ontology could specify that transaction = {wiring, purchasing, sending money}, but it is hard for an ontology to be complete. Instead, one can complement the notion of transaction with additional text that elaborates on transaction, providing more details on it. Hence an Elaboration relation for the nucleus transaction is not in the actual DT but is assumed by a recipient of this explanation text. We refer to such rhetorical relations as imaginary: they are not produced from the text but are instead induced by the context of the explanation. Multiple such imaginary RRs form additional nodes of the actual DT for the text being communicated. We refer to the extended DT as complete: it combines the actual DT and its imaginary parts. Naturally, the latter can depend on the recipient: different people keep in mind distinct instances of transactions. We formalize this intuition using the discourse structure of the text expressed by a DT. Arcs of this tree correspond to rhetorical relations (RRs) connecting text blocks called elementary discourse units (EDUs).
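To make the data structure behind a complete DT concrete, the following minimal Python sketch shows one possible representation. The DTNode class and its field names are our own illustration, not the API of any particular RST parser, and attaching imaginary subtrees directly under the root is a simplification: in practice they would extend the nucleus they elaborate, such as 'transaction'.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DTNode:
    """A discourse tree node: an internal node labeled with a rhetorical
    relation, or a leaf holding the text of an elementary discourse unit."""
    relation: Optional[str] = None   # e.g. 'Elaboration', 'Cause'; None for a leaf
    edu: Optional[str] = None        # EDU text for a leaf node
    children: List["DTNode"] = field(default_factory=list)
    imaginary: bool = False          # True if induced from background knowledge

def complete_dt(actual: DTNode, imaginary_subtrees: List[DTNode]) -> DTNode:
    """Form a complete DT by marking mined subtrees as imaginary and
    attaching them to the actual DT (here, directly under its root)."""
    for subtree in imaginary_subtrees:
        subtree.imaginary = True
        actual.children.append(subtree)
    return actual

For the NSF example, an imaginary Elaboration subtree spelling out kinds of transactions (wiring, purchasing, sending money) would be attached in this way.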
We rely on Rhetorical Structure Theory (RST; Mann and Thompson, 1988) when constructing and describing the discourse structure of a text. When people explain things, they do not have to enumerate all premises: some of them occur implicitly in the explanation chain and are assumed by the person providing the explanation to be known or believed by the addressee. However, a DT for a text containing an explanation only includes EDUs from the actual text; the assumed, implicit parts, with their entities and phrases (which are supposed to enter the explanation sequence), are absent. How can we cover these implicit entities and phrases? As introduced above, we cover them with imaginary rhetorical relations, induced by the context of the explanation rather than produced from the text itself. The complete discourse tree for the example is shown in Fig. 2. Complete discourse trees also have communicative actions attached to their edges in the form of VerbNet verb signatures (Galitsky and Parnis, 2018).

2.4 Semantic representation

Fig. 3: Frame semantic parse for the explanation

A frame semantic parse for the same text is shown in Fig. 3. The reader can observe that it is hard to tag entities and determine context properly. Bank is tagged as Placing (not disambiguated properly), and 'credit union might' is determined to be a hypothetical event, since union is represented literally, as an organization, separately from credit. Overall, the main expression being explained, 'transaction drops your checking account balance below zero', is not represented as the cause of a problem by semantic analysis, since higher-level considerations involving a banking-related ontology would be required. Instead of relying on semantic-level analysis to classify explanations, we propose discourse-level machinery. This machinery allows the explanation structure to include not only what comes from the explanation text itself but also what comes from accompanying texts mined from various sources, so as to obtain a complete logical structure of the entities involved in the explanation.

2.5 Discourse tree of explanations

Valid explanations in text follow certain rhetorical patterns. In addition to the default relation of Elaboration, a valid explanation relies on Cause, Condition and domain-specific Comparison (Fig. 4). As an example, we provide an explanation for why the sound of thunder comes after the lightning: 'We see the lightning before we hear the thunder. This is because light travels faster than sound. The light from the lightning comes to our eyes much quicker than the sound from the lightning. So we hear it later than we see it.'

joint
  elaboration (LeftToRight)
    cause (LeftToRight)
      temporal (LeftToRight)
        TEXT: We see the lightning
        TEXT: before we hear the thunder.
      TEXT: This is because light travels faster than sound.
    elaboration (LeftToRight)
      TEXT: The light from the lightning travels to our eyes much quicker than the sound from the lightning.
      comparison (LeftToRight)
        TEXT: so we hear it later
        TEXT: than we see it.

Fig. 4: A discourse tree for an explanation of lightning

The clause we need to obtain for an implication in the explanation chain is verb-group-for-moving {moves, travels, comes} faster → verb-group-for-moving-result {earlier}. This clause can easily be obtained by web mining, searching for the expression 'if <noun> <verb-group-for-moving> faster then <noun> <verb-group-for-moving-result> earlier'. What would make this DT look like one for an invalid explanation? If any RR under the top-level Elaboration turns into Joint, the explanation chain is interrupted.
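This interruption test is easy to state over the illustrative DTNode structure sketched in Section 2.3. The following is a loose formalization under our own assumptions, not the paper's reference implementation.

def top_level_elaboration(root: DTNode):
    """Locate the top-level Elaboration: the root itself or one of its
    immediate children (as in the parse of Fig. 4, where the root is Joint)."""
    if root.relation and root.relation.lower() == "elaboration":
        return root
    for child in root.children:
        if child.relation and child.relation.lower() == "elaboration":
            return child
    return None

def contains_joint(node: DTNode) -> bool:
    """True if a Joint relation occurs anywhere in this subtree."""
    if node.relation and node.relation.lower() == "joint":
        return True
    return any(contains_joint(child) for child in node.children)

def explanation_chain_intact(root: DTNode) -> bool:
    """The chain is interrupted if any relation under the top-level
    Elaboration turns into Joint (Section 2.5)."""
    top = top_level_elaboration(root)
    return top is not None and not any(contains_joint(c) for c in top.children)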
We explore an argumentation structure example from (Toulmin, 1958; Kennedy et al., 2006). We show two visualizations of the discourse tree and the explanation chain (in the middle) in Fig. 5.

elaboration
  TEXT: Harry was born in Bermuda.
  explanation (LeftToRight)
    attribution (LeftToRight)
      TEXT: A person born in Bermuda is a British subject.
      TEXT: It is on account of the following statutes 123.
    condition
      TEXT: So, presumably, Harry is a British subject,
      joint
        TEXT: unless both his parents were aliens,
        TEXT: or he has become a naturalized American.

Discourse relations: Elaboration, Explanation, Attribution, Condition, Joint
Toulmin roles: Datum, Warrant, Backing, Claim, Rebuttal, Rebuttal

Fig. 5: Toulmin's argument structure (in the middle) and its rhetorical representation via EDUs (on the top) and via discourse relations (on the bottom)

An interesting application of Toulmin's model is the argumentative grammar of Lo Cascio (1991), a work that, by defining associative rules for argumentative acts, is naturally applicable, and indeed has been applied, to the analysis of discourse structure in pre-DT times.

2.6 Logical Validation of Explanation via Discourse Trees

Logically, an explanation of a text S is a chain of premises P1, …, Pm which imply S. S is frequently referred to as the subject of the explanation. In this chain P1, …, Pm, each element Pi is implied by its predecessors: P1, …, Pi-1 ⇒ Pi. In terms of a discourse tree, there should be a path in it where these implications are realized via rhetorical relations. We intend to define a mapping between the EDUs of a DT and the entities Pi occurring in these EDUs which form the explanation chain. In terms of the underlying text, the Pi are entities or phrases which can be represented as logical atoms or terms. The implication-focused rhetorical relations rr are: 1) Elaboration: Pi can be an elaboration of Pi-1; 2) Attribution: Pi can be attributed to Pi-1; 3) Cause: the most straightforward case. Hence Pi ⇒ Pj if rr(EDUi, EDUj), where Pi ∈ EDUi and Pj ∈ EDUj. We refer to this condition as "explainability via a discourse tree".

The actual sequence P1, …, Pm for S is not known, but for each S we have a set of good explanations Pg1, …, Pgm and a set of bad explanations Pb1, …, Pbm. Good explanation sequences obey the explainability-via-DT condition and bad ones do not (Galitsky, 2018). Bad explanation sequences might obey the explainability-via-DT condition for some Pbi. If a DT for a text is such that the explainability-via-DT condition does not hold for any Pbi, then this DT does not include any explanation at all. The reader can observe that to define good and bad explanations via a DT, one needs a training set covering all involved entities and phrasings Pi occurring in both the positive and negative training sets.
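The explainability condition can be checked mechanically over a DT. The sketch below again uses the illustrative DTNode class from Section 2.3; it reduces "Pi occurs in EDUj" to case-insensitive substring matching purely for illustration, approximates each relation argument by its leftmost EDU, and ignores nucleus/satellite directionality.

IMPLICATION_RELATIONS = {"elaboration", "attribution", "cause"}

def first_edu(node: DTNode) -> str:
    """Text of the leftmost EDU in a subtree (empty string if none)."""
    if node.edu is not None:
        return node.edu
    for child in node.children:
        text = first_edu(child)
        if text:
            return text
    return ""

def implication_edges(node: DTNode):
    """Yield (EDU_i, EDU_j) text pairs linked by an implication-focused
    rhetorical relation rr anywhere in the tree."""
    if (node.relation and node.relation.lower() in IMPLICATION_RELATIONS
            and len(node.children) >= 2):
        texts = [first_edu(child) for child in node.children]
        for head, dep in zip(texts, texts[1:]):
            yield head.lower(), dep.lower()
    for child in node.children:
        yield from implication_edges(child)

def explainable_via_dt(chain: list, root: DTNode) -> bool:
    """Check that each consecutive pair (Pi, Pi+1) of the chain is realized
    by some rr(EDUi, EDUj) with Pi in EDUi and Pi+1 in EDUj."""
    edges = list(implication_edges(root))
    return all(
        any(p.lower() in head and q.lower() in dep for head, dep in edges)
        for p, q in zip(chain, chain[1:])
    )

For the lightning DT of Fig. 4, a chain linking 'see the lightning' to 'light travels faster' would be accepted only if a Cause or Elaboration edge connects EDUs containing those phrases; a good chain satisfies every link, a bad one fails at least one.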
2.7 Constructing the Imaginary Part of a Discourse Tree

By our definition, imaginary DTs are the ones not obtained from actual text but instead built on demand to augment the actual ones. For a given chain P1, …, Pi', …, Pm, let Pi' be an entity which is not explicitly mentioned in the text but instead is assumed to be known to the addressee. This Pi' should occur in other texts in a training dataset. To make the explainability-via-DT condition applicable, we need to augment the actual DTactual with an imaginary DTimaginary such that Pi' ∈ EDU of this DTimaginary. We denote DTactual ∪ DTimaginary as DTcomplete. Suppose we have two textual explanations T1 and T2 in the positive set of good explanations for the same S:

T1: P1, …, Pm ⇒ S
T2: P1, Pi', …, Pm ⇒ S

Then we can assume that Pi' should occur in a complete explanation for S, and since it does not occur in T1, DT(T1) should be augmented with a DTimaginary such that Pi' ∈ EDU of this DTimaginary.

3 Learning Framework and Evaluation

In this section we automate our validation of text convincingness, including the description of a training dataset and a learning framework. We conduct our evaluation in two steps. First, we try to distinguish between texts with and without explanations; this task can be accomplished without the involvement of imaginary (virtual) DTs. Second, once we confirm that this can be done reasonably well, we drill into the more specific task of differentiating between good and bad explanation chains within the dataset of the first task.

3.1 Building a Dataset of Good/Bad Explanation Chains

We form the positive explanation dataset from the following sources:
1. Customer complaints;
2. Paragraphs from physics and biology textbooks;
3. Yahoo! Answers for Why/How-to questions.

The negative training dataset includes sources of a totally different nature:
1. Definition/factoid paragraphs from Wikipedia, usually first paragraphs;
2. First paragraphs of news articles introducing new events;
3. Political news from the Functional Text Dimensions dataset.

We formed balanced components of the positive and negative datasets for both tasks: each component includes 240 short texts of 5-8 sentences (250-400 words). We now comment on each source. The purpose of the customer complaint dataset is to collect texts whose authors do their best to get their points across, employing all means to show that they are right and their opponents are wrong. Complaints are emotionally charged texts providing an explanation of the problems customers encountered with a financial service, of how they tried to explain their viewpoint to the company, and of how they attempted to solve the problem (Galitsky et al., 2008; GitHub Customer Complaints dataset, 2019).

Also, to select types of text with and without explanations, we adopt the genre system and the corpora from (Lee, 2001). The genre system is constructed relying on Functional Text Dimensions. These are genre annotations which reflect judgments as to what extent a text can be interpreted as belonging to a generalized functional category; a genre is a combination of several dimensions. For the positive dataset, we select the genres with the highest density of explanation, such as scientific textbooks. For the negative dataset, we focus on the genres least likely to contain explanations, such as advertisements, fiction prose, instruction manuals and political news. The last is chosen since it has the least likelihood of containing an explanation.

For the positive dataset for the second task, as good explanation chains, we rely on the following sources:
1. Customer complaints with valid argumentation patterns;
2. Paragraphs from a physics textbook explaining certain phenomena, which are neither factoid nor definitional;
3. Yahoo! Answers for Why/How-to questions.

We form the negative dataset from the following sources:
1. Customer complaints with invalid argumentation patterns: these complaints are inconsistent, illogical and rely on emotions to get their points across;
2. Paragraphs from a physics textbook formulating longer questions and problems;
3. Yahoo! Answers for Why (not How-to) questions which are reduced to break the explanation flow: sentences are deleted or re-shuffled to produce an incohesive, non-systematic explanation.

3.2 Crawling Information for Imaginary Discourse Tree Construction

Imaginary DTs can be found by employing background knowledge in a domain-independent manner: no offline ontology construction is required. Documents found on the web can be the basis for constructing imaginary DTs following the algorithm described in Section 2.7. Given the actual part of the text A, we outline a top-level search strategy for finding a source of imaginary DTs (background knowledge) B:

1) Build a DT for A;
2) Obtain pairs of entities from A that are not linked in the DT (e.g., thunder, eye);
3) Obtain a set of search queries based on these pairs of entities;
4) For each query:
   a) Find a short list of candidate text fragments on the web using a search engine API (such as Bing);
   b) Build a DT for each text fragment;
   c) Select the fragments which contain a rhetorical relation (Elaboration, Attribution, Cause) linking the pair of entities;
   d) Choose the fragment with the highest relevance score.

The entity mentioned in the algorithm can be interpreted in a few possible ways: it can be a named entity, the head of a noun phrase or a keyword extracted from a dataset. The relevance score can be based on the score provided by the search engine; another option is to compute the score based on structural discourse and syntactic similarity (Galitsky, 2017).
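A minimal sketch of this strategy follows, again over the illustrative DTNode class from Section 2.3. Here web_search, parse_dt and relevance_score are hypothetical stand-ins (stubbed so the sketch runs) for a search-engine API, an RST parser and a scoring function; they are not real library calls.

IMPLICATION_RELATIONS = {"elaboration", "attribution", "cause"}

def web_search(query: str) -> list:       # stub for a search-engine API call
    return []

def parse_dt(text: str) -> "DTNode":      # stub for an RST parser
    return DTNode()

def relevance_score(text: str) -> float:  # stub: engine score or DT similarity
    return 0.0

def mentions(node: "DTNode", entity: str) -> bool:
    """True if the entity occurs in some EDU of the subtree."""
    if node.edu is not None:
        return entity in node.edu.lower()
    return any(mentions(child, entity) for child in node.children)

def links_entities(dt: "DTNode", e1: str, e2: str) -> bool:
    """True if some implication-focused relation subtree covers both entities."""
    if (dt.relation and dt.relation.lower() in IMPLICATION_RELATIONS
            and mentions(dt, e1) and mentions(dt, e2)):
        return True
    return any(links_entities(child, e1, e2) for child in dt.children)

def find_imaginary_sources(unlinked_pairs):
    """Steps 3-4 of the algorithm: for each unlinked entity pair, pick the
    highest-scoring web fragment whose DT links the pair rhetorically."""
    sources = {}
    for e1, e2 in unlinked_pairs:
        best = None
        for fragment in web_search(f"{e1} {e2}"):
            if links_entities(parse_dt(fragment), e1, e2):
                candidate = (relevance_score(fragment), fragment)
                if best is None or candidate > best:
                    best = candidate
        if best is not None:
            sources[(e1, e2)] = best[1]
    return sources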
3.3 Learning Approaches and Pipelines

Discourse tree construction. A number of RST parsers constructing the discourse tree of a text are available at the moment. For instance, in our previous studies we used the tools provided by (Surdeanu et al., 2015) and (Joty and Moschitti, 2014).

Nearest neighbor learning. To predict the label of a text once its complete DT is built, one needs to compute its similarity with the DTs of the positive class and verify that this similarity is higher than the similarity to the set of DTs of the negative class. Similarity between complete DTs is defined by means of maximal common sub-DTs. Formal definitions of labeled graphs and the domination relation on them, used for the construction of this operation, can be found, e.g., in (Ganter and Kuznetsov, 2001).

SVM tree kernel learning. A DT can be represented by a vector of integer counts of each sub-tree type (without taking into account its ancestors). For elementary discourse units (EDUs) as labels of terminal nodes, only the phrase structure is retained: we label the terminal nodes with the sequence of phrase types instead of parse tree fragments. For evaluation purposes, the Tree Kernel builder tool (Moschitti, 2006) can be used.

3.4 Detecting Explanations and Valid Explanation Chains

We first focus on the first task, detecting paragraphs of text which contain an explanation, and estimate the detection rate in Table 1. We apply two different learning techniques, nearest neighbor (KNN, left three result columns) and SVM TK (right three result columns), to the same discourse-level and syntactic data.

Table 1: Explanation detection rate

Source           P(KNN)  R(KNN)  F1(KNN)  P(SVM)  R(SVM)  F1(SVM)
1+ vs 1-          77.3    80.8    79.0     80.9    82.0    81.4
2+ vs 2-          78.6    76.4    77.5     74.6    74.8    74.7
3+ vs 3-          75.0    77.6    76.3     76.6    77.1    76.8
1..3+ vs 1..3-    76.8    78.9    77.8     74.9    75.4    75.1

The highest recognition accuracy, reaching 80%, is achieved for the first pair of dataset components, complaints vs Wikipedia factoids: the most distinct case of 'intense' explanation vs an enumeration of facts with the fewest explanations. The other datasets deliver a 2-3% drop in recognition performance. These accuracies are comparable with those for various tasks in genre classification (the one-against-all setting in Galitsky et al., 2016).

Table 2 shows the results of differentiating between good and bad explanations. The accuracy is about 12% lower than for the first task, since the difference between good and bad explanations in text is fairly subtle.

Table 2: Recognizing good and bad explanation chains

Source           P(virtual)  R(virtual)  F1(virtual)  P     R     F1
1+ vs 1-          64.3        60.8        62.5         72.9  74.0  73.4
2+ vs 2-          68.2        65.9        67.0         74.6  74.8  74.7
3+ vs 3-          63.7        67.4        65.5         76.6  77.1  76.8
1..3+ vs 1..3-    66.4        64.6        65.5         74.9  75.4  75.1

Nevertheless, validation of explanation chains is an important task in decision support. A low accuracy can still be leveraged by processing a large number of documents and detecting a burst of problematic explanations in a corpus of texts.

4 Discussion and Conclusions

In this work we considered a new approach to validating the convincingness of textual explanations. We introduced the notion of a complete discourse tree (complete DT) comprising actual and imaginary parts; the imaginary DT is constructed for text about entities used but not explicitly defined in the actual text. We outlined an algorithm for building an imaginary discourse tree and described a possible strategy for crawling the background knowledge which is the source of the imaginary part. We also introduced a new dataset of good and bad explanations made by complainants in the financial domain. Finally, we outlined the learning framework used for the automated detection of good and bad explanations; it is based on RST parsing and learning on the complete discourse trees provided by the parser.

Both professional and non-professional writers provide explanations in texts, but detection of invalid explanations is significantly harder in the former case than in the latter. Professional writers in such domains as politics and business are capable of explaining "anything", whereas in user-generated content errors are visible. Detecting faulty explanations in user-generated content is important in automated customer relationship management systems, where a response to a user request with a valid explanation should differ from a response to one with an invalid explanation.

It is important to combine rule-based learning frameworks with those relying on implicit feature engineering, such as statistical and deep learning. The recent history of applications of statistical techniques sheds light on their limitations for the systematic exploration of a given domain. Once statistical learning delivered satisfactory results for discourse parsing, interest in automated discourse analysis faded away. Since researchers in statistical ML for discourse parsing were mainly interested in recognition accuracies rather than in the interpretability of the obtained DTs, no further attempts at leveraging the obtained DTs were made.
However, a number of studies, including the present one, demonstrate that DTs provide insights in domains where keyword statistics do not help.

On the basis of work by Austin, Searle, Grice and Lorenzen, the discipline of pragma-dialectics provides a comprehensive analysis of argumentative dialogues. This discipline combines a formalism for representing data, drawn from modern logic, with empirical observations, drawn from descriptive linguistics, for the analysis of argumentative dialogues, modeled by dialectics and seen as sets of speech acts. The model proposes rules for argumentative dialogues, but it does not provide a dialogue generation algorithm.

Acknowledgements

The work of Dmitry Ilvovsky was supported by the Russian Science Foundation under grant 17-11-01294.

References

Dunne, P.E. and Bench-Capon, T.J.M. 2006. Computational Models of Argument: Proceedings of COMMA 2006. IOS Press.

Explanation on Wikipedia. 2018. https://en.wikipedia.org/wiki/Explanation#Meta-explanation.

Galitsky, B. and Kuznetsov, S.O. 2008. Learning communicative actions of conflicting human agents. Journal of Experimental & Theoretical Artificial Intelligence, 20(4):277-317.

Galitsky, B., Ilvovsky, D. and Kuznetsov, S.O. 2016. Style and genre classification by means of deep textual parsing. Computational Linguistics and Intellectual Technologies: DIALOG, Moscow, Russia.

Galitsky, B. 2017. Matching parse thickets for open domain question answering. Data & Knowledge Engineering, 107:24-50.

Galitsky, B. 2018. Customers' retention requires an explainability feature in machine learning systems they use. AAAI Spring Symposium Series.

Galitsky, B., Ilvovsky, D. and Kuznetsov, S.O. 2018. Detecting logical argumentation in text via communicative discourse tree. Journal of Experimental & Theoretical Artificial Intelligence, 30(5):637-663.

Galitsky, B. and Parnis, A. 2018. Accessing validity of argumentation of agents of the Internet of Everything. In: Artificial Intelligence for the Internet of Everything, 187-216.

Ganter, B. and Kuznetsov, S.O. 2001. Pattern structures and their projections. In: International Conference on Conceptual Structures, 129-142. Springer.

Jansen, P., Surdeanu, M. and Clark, P. 2014. Discourse complements lexical semantics for non-factoid answer reranking. ACL.

Joty, S. and Moschitti, A. 2014. Discriminative reranking of discourse parses using tree kernels. EMNLP.

Kennedy, X.J., Kennedy, D.M. and Aaron, J.E. 2006. Reasoning. In: The Bedford Reader, 9th ed., 519-522. New York: Bedford/St. Martin's.

Lee, David YW. 2001. Genres, registers, text types, domains and styles: Clarifying the concepts and navigating a path through the BNC jungle.

Lo Cascio, V. 1991. Grammatica dell'Argomentare: strategie e strutture [A grammar of arguing: strategies and structures]. Firenze: La Nuova Italia.

Mann, William and Sandra Thompson. 1988. Rhetorical structure theory: Towards a functional theory of text organization. Text - Interdisciplinary Journal for the Study of Discourse, 8(3):243-281.

Moschitti, A. 2006. Efficient convolution kernels for dependency and constituent syntactic trees. In: Proceedings of the 17th European Conference on Machine Learning, Berlin, Germany.

Surdeanu, M., Hicks, T. and Valenzuela-Escarcega, M.A. 2015. Two practical Rhetorical Structure Theory parsers. In: Proceedings of NAACL HLT: Software Demonstrations.

Toulmin, S. 1958. The Uses of Argument. Cambridge: Cambridge University Press.

Walton, D. 2007. Dialogical models of explanation. In: Explanation-Aware Computing: Papers from the 2007 AAAI Workshop, Technical Report WS-07-06, AAAI Press, 1-9.

Walton, D., Reed, C. and Macagno, F. 2008. Argumentation Schemes. Cambridge University Press.