=Paper=
{{Paper
|id=Vol-3005/sample-3col
|storemode=property
|title=Knowledge Augmented Language Models for Causal Question Answering
|pdfUrl=https://ceur-ws.org/Vol-3005/03paper.pdf
|volume=Vol-3005
|authors=Dhairya Dalal
}}
==Knowledge Augmented Language Models for Causal Question Answering==
Proceedings of the Doctoral Consortium at ISWC 2021 (ISWC-DC 2021)

Dhairya Dalal [0000-0003-0279-234X]
SFI Centre for Research Training in Artificial Intelligence, Data Science Institute, National University of Ireland Galway
d.dalal1@nuigalway.ie

Abstract. The task of causal question answering broadly involves reasoning about causal relations and causality over a provided premise. Causal question answering can be expressed across a variety of tasks, including commonsense question answering, procedural reasoning, reading comprehension, and abductive reasoning. Transformer-based pretrained language models have shown great promise across many natural language processing (NLP) applications. However, these models rely on distributional knowledge learned during the pretraining process and are limited in their causal reasoning capabilities. Causal knowledge, often represented as cause-effect triples in a knowledge graph, can be used to augment and improve the causal reasoning capabilities of language models. There is limited work exploring the efficacy of causal knowledge for question answering tasks. We consider the challenge of structuring causal knowledge in language models and developing a unified model that can solve a broad set of causal question answering tasks.

Keywords: causal reasoning · causal question answering · language models · causal knowledge graphs

With generous support from the Science Foundation Ireland Centre for Research Training in Artificial Intelligence. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===1 Problem Statement===

Historically, research on causal reasoning in natural language processing (NLP) has primarily focused on causal relation identification and extraction. Recently, there has been emerging interest in more complex applications of causal reasoning, especially around question answering, and several new benchmark tasks have been released. The task of causal question answering can be expressed across a variety of tasks. Broadly, these tasks involve reasoning about causality and causal relations over a provided premise. Pretrained transformer-based language models such as BERT [4] and RoBERTa [13] have been found to be generally effective for these tasks. However, the distributional knowledge contained in these models is opaque and reliant on the quality and scope of the pretraining corpus. Additionally, it is unclear to what extent language models support causal reasoning. We hypothesize that causal knowledge can help language models represent causality and identify the causal relations necessary for downstream question answering. Causal facts, often extracted from causal descriptions and expressed as cause-effect triples, can succinctly express causal knowledge. We consider the challenge of structuring causal knowledge in language models and developing a unified causal language model that can be effective across all causal reasoning tasks.

===2 Importance===

Causal reasoning has a long and rich history rooted in philosophy, psychology, and many other academic disciplines. Psychologists and philosophers have posited that causal reasoning is critical to our mental models of reality and that our knowledge is defined through identifying causal chains over our observations of the world [5].
In the context of natural language applications, causal reasoning can allow us to produce new knowledge from disparate observations and explore various hypotheses. For example, causal search engines in the clinical domain aim to identify causal factors that can help develop new drugs and diagnose rare medical conditions. Causal question answering systems can be used to better understand the causes of observed events and to explore counterfactuals. Language models augmented with causal knowledge can be used to develop more transparent and explainable AI systems; at inference time, these models could produce causal explanations of the predicted answer.

===3 Related Work===

There is limited historical work on causal question answering with external causal knowledge. Hassanzadeh et al. [7] and Kayesh et al. [11] consider the problem of binary question answering, which poses questions about causes and effects as yes/no questions. Extracted cause-effect pairs are scored using a mixture of co-occurrence statistics and cosine-similarity scores over BERT embeddings. These scores are then compared against a threshold to answer yes/no for an input question (see the illustrative sketch at the end of this section). Sharp et al. [17] and Xie and Mu [21] consider the task of answer re-ranking for open-ended causal question answering. Both papers are evaluated on a set of causal questions extracted from the Yahoo! Answers corpus, which follow the patterns "What causes ..." and "What is the result of ..." [17]. Sharp et al. present three distributional similarity models (adapted Skipgram, monolingual alignment, and a convolutional neural network) to model the contextual relationship between cause and effect phrases. The answer choices are re-ranked based on the cosine similarity between extracted cause and effect vectors. Our CausalSkipgram model for representing causal knowledge expands upon the adapted Skipgram model presented by Sharp et al.

Next, we summarize the current causal question answering benchmark datasets. ROPES (Reasoning Over Paragraph Effects in Situations) [12] is a reading comprehension dataset where the goal is to use causal relationships expressed in a background passage to answer questions about a hypothetical premise. CosmosQA [9] is a multiple-choice reading comprehension challenge where the aim is to answer questions concerning likely causes or effects of events, requiring commonsense knowledge outside of the provided context. COPA (Choice of Plausible Alternatives) [6] is a multiple-choice question answering task where the goal is to identify which alternative is the likely cause or effect of a provided premise. WIQA (What If Question Answering) [19] is another multiple-choice task that aims to reason about the magnitude of the effects of perturbations to procedural descriptions of events. aNLI (Abductive Natural Language Inference) [1] is a multiple-choice task where the goal is to identify which of the provided hypotheses best explains a provided context. Our preliminary work has primarily focused on the COPA and WIQA datasets, as they allow the most direct evaluation of causal knowledge in the context of multiple-choice causal question answering.

Finally, we summarize sources of causal knowledge. CauseNet [8] is currently the largest publicly available knowledge graph of claimed causal facts. It contains over 11 million relations and 12 million concepts that were extracted from Wikipedia and ClueWeb [8]. ConceptNet [18], a public knowledge graph, consists of 36 relations and includes a causes relation. The ATOMIC knowledge graph [16] focuses on knowledge for commonsense inference; it is organized around if-then relations that primarily describe the relations and interactions around human-centric activities. We use CauseNet as the primary source of causal knowledge in our experiments.
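To make the threshold-based binary approach concrete, below is a minimal sketch in the spirit of these systems, not their actual implementations. It assumes mean-pooled bert-base-uncased embeddings, a toy co-occurrence statistic, and an arbitrary threshold and mixing weight; none of these specifics come from the cited papers.

```python
# Sketch of threshold-based binary causal QA in the spirit of
# Hassanzadeh et al. [7]: score a candidate cause-effect pair by
# combining a corpus co-occurrence statistic with the cosine
# similarity of BERT phrase embeddings, then compare the combined
# score against a threshold. All constants here are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled BERT token embeddings for a short phrase."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

def answer_binary_causal(cause: str, effect: str, cooccurrence: float,
                         threshold: float = 0.6, alpha: float = 0.5) -> str:
    """Answer 'does <cause> cause <effect>?' as yes/no."""
    cos = torch.cosine_similarity(embed(cause), embed(effect), dim=0).item()
    score = alpha * cooccurrence + (1 - alpha) * cos
    return "yes" if score >= threshold else "no"

# Toy co-occurrence statistic, e.g. a normalized count of the pair
# appearing in causal patterns such as "X causes Y" in a corpus.
print(answer_binary_causal("smoking", "lung cancer", cooccurrence=0.9))
```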
===4 Research Questions===

RQ1: Does incorporating structured causal knowledge into language models improve performance on causal question answering tasks?

Current model-based solutions have converged on fine-tuning pretrained language models on task-specific datasets. These approaches rely on the transferability of distributional knowledge learned during the pretraining process. To the best of our knowledge, there is no empirical research that demonstrates the efficacy of external causal knowledge in the context of causal question answering. Our work aims to establish those baselines.

RQ2: What is the most effective way of representing causal knowledge in language models for causal question answering tasks?

RQ2.1: How can language models be augmented with external knowledge from causal knowledge graphs for downstream causal question answering tasks?

RQ2.2: How can causal knowledge be injected into the language model during the pretraining process such that it is available as transferable distributional knowledge for downstream causal question answering tasks?

Augmenting language models with structured knowledge is an emerging area of research. Our research aims to provide a methodology for representing causal knowledge that can be used by language models in causal question answering tasks. We consider the strategies of knowledge augmentation and knowledge injection. The knowledge augmentation approach (RQ2.1) trains the language model to consider external knowledge provided as input features at prediction time. The knowledge injection approach (RQ2.2) converts causal knowledge into structured distributional knowledge during the pretraining process to produce causality-aware language models (CALM), which would ideally support any downstream causal question answering task.

Fig. 1. Architecture of Causality Enhanced RoBERTa. The end-to-end architecture takes as input the multiple-choice question input and relevant causal facts selected from CauseNet.

RQ3: How do we evaluate the causal reasoning capabilities of language models in the context of question answering?

To date, most research on causal reasoning applications in NLP focuses on task-specific model implementations. There is no comprehensive definition of causal question answering, nor a unified way to evaluate a language model's causal reasoning capabilities. We hope to contextualize causal question answering as an extension of fundamental NLP problems and to produce a unified benchmark, similar in spirit to the GLUE (General Language Understanding Evaluation) benchmark [20].

===5 Preliminary Results===

Our experiments explore the efficacy of augmenting RoBERTa [13] with causal knowledge for multiple-choice question answering on the COPA and WIQA benchmark tasks. Causal facts are extracted from CauseNet [8] and selected based on the lexical overlap between the cause-effect concepts and the question text. Additional details on knowledge selection and experiment results can be found in [3].
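As an illustration of this selection step, the sketch below ranks cause-effect pairs by token overlap with the question and keeps at most five facts, matching the per-input limit described later in this section. The tiny in-memory fact list stands in for CauseNet, and the exact scoring details are assumptions rather than our released implementation.

```python
# Illustrative sketch of lexical-overlap fact selection: rank
# cause-effect triples by how many of their concept tokens appear
# in the question, and keep up to five facts per input.
# The tiny triple list below stands in for CauseNet.
from typing import List, Tuple

CAUSAL_FACTS: List[Tuple[str, str]] = [
    ("rain", "flooding"),
    ("smoking", "lung cancer"),
    ("deforestation", "soil erosion"),
]

def select_facts(question: str, k: int = 5) -> List[Tuple[str, str]]:
    q_tokens = set(question.lower().split())
    scored = []
    for cause, effect in CAUSAL_FACTS:
        fact_tokens = set(cause.split()) | set(effect.split())
        score = len(fact_tokens & q_tokens)
        if score > 0:  # keep only facts sharing vocabulary with the question
            scored.append((score, (cause, effect)))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [fact for _, fact in scored[:k]]

print(select_facts("What can heavy rain lead to?"))
# -> [('rain', 'flooding')]
```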
COPA consists of a premise and two alternatives. The task is to identify which alternative is most likely the cause or effect of the provided premise. Background commonsense causal knowledge is required to answer questions, as there is limited lexical overlap between the premise and the alternatives. WIQA consists of multiple-choice questions where the answer options (more, less, and no effect) describe the magnitude of the effect of a proposed perturbation to a procedural event. Each question has an associated procedural description consisting of a sequence of events. The question proposes a perturbation to a specific event and asks what impact that perturbation would have on another event in the procedural description.

Next, we present three strategies for representing causal knowledge to a language model. The most direct way to incorporate causal information is to append it to the end of the input text, which we call the InputAugmentation method. Relevant causal tuples are converted into causal statements which follow the pattern "C causes E". CausalSkipgram adapts the skip-gram word embedding approach [14] to model causal pairs. The last method is CausalKGE, which represents causal knowledge as a knowledge graph embedding. We adapt the TransE model presented by Bordes et al. [2]. To model our causal tuples as a knowledge graph, we add the explicit relation "cause-effect" to each tuple. The modeling goal of TransE is thus to predict an effect E, given a cause C and the "cause-effect" relation CR, such that C + CR ≈ E. A causal triple is represented by a single vector, generated by mean pooling the head, tail, and relation vectors.

To incorporate causal embeddings with RoBERTa, we propose the Causality Enhanced RoBERTa neural architecture (Figure 1). This architecture is used with both the CausalSkipgram and CausalKGE embeddings. The first layer is the causality-enhanced input layer, which combines the pooled embedding output of RoBERTa with the causal knowledge embeddings. For inputs where we were able to extract causal facts, the causal embedding vector is generated by concatenating and flattening all relevant causal embeddings. Up to five causal facts are selected per input. The RoBERTa pooled output is then concatenated with the causal embeddings. This input is passed into a feed-forward network (FFN) with a hidden layer and a classifier.
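A minimal PyTorch sketch of this causality-enhanced input layer is given below. It assumes roberta-base, 100-dimensional causal fact embeddings (which could come from either CausalSkipgram or CausalKGE), zero-padding when fewer than five facts are available, and a per-choice scoring head; these dimensions and details are illustrative assumptions, not the exact published configuration.

```python
# Sketch of the causality-enhanced input layer from Figure 1:
# the pooled RoBERTa output is concatenated with up to five
# (zero-padded) causal fact embeddings and passed through a
# feed-forward network with one hidden layer and a classifier.
import torch
import torch.nn as nn
from transformers import RobertaModel

class CausalityEnhancedRoberta(nn.Module):
    def __init__(self, causal_dim=100, max_facts=5, hidden=256, num_choices=2):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        fused = self.roberta.config.hidden_size + causal_dim * max_facts
        self.ffn = nn.Sequential(           # FFN with one hidden layer
            nn.Linear(fused, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        self.num_choices = num_choices

    def forward(self, input_ids, attention_mask, causal_embeddings):
        # input_ids: (batch * num_choices, seq_len), one sequence per choice.
        # causal_embeddings: (batch * num_choices, max_facts, causal_dim),
        # zero-padded when fewer than max_facts facts were selected.
        pooled = self.roberta(input_ids, attention_mask=attention_mask).pooler_output
        flat = causal_embeddings.flatten(start_dim=1)  # concat + flatten facts
        logits = self.ffn(torch.cat([pooled, flat], dim=-1))
        return logits.view(-1, self.num_choices)       # one score per choice
```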
Table 1. Accuracy on the COPA test set and the COPA-Balanced Hard set. CausalKGE improves accuracy over the RoBERTa baseline by +6.20pp on the COPA test set and +3.86pp on the COPA-Balanced Hard set.

{| class="wikitable"
! Model !! COPA Test !! COPA-Balanced Hard
|-
| RoBERTa baseline || 53.00 || 58.39
|-
| +CausalSkipgram || 57.80 || 58.38
|-
| +CausalKGE || 59.20 (+6.20pp) || 62.25 (+3.86pp)
|-
| +InputAugmentation || 59.00 || 62.29
|}

Table 2. Accuracy of causal augmentation methods on the WIQA dataset. InputAugmentation achieves higher accuracy in the In-Paragraph (+3.3pp) and Out-of-Paragraph (+2.0pp) sub-categories than the current state of the art, QUARTET.

{| class="wikitable"
! Model !! Overall !! In-Para. !! Out-of-Para. !! No Effect
|-
| BERT baseline [19] || 73.80 || 79.68 || 56.10 || 89.38
|-
| QUARTET (SOTA) [15] || 82.07 || 73.49 || 65.65 || 95.30
|-
| RoBERTa baseline || 67.00 || 64.00 || 42.10 || 92.50
|-
| +CausalSkipgram || 65.00 || 53.96 || 41.38 || 92.29
|-
| +CausalKGE || 74.00 || 71.70 || 55.17 || 93.78
|-
| +InputAugmentation || 80.00 || 76.79 || 67.65 || 92.43
|}

===6 Evaluation===

Table 1 provides the results of our experiments on the COPA test set and the COPA-Balanced Hard set. Recent pretrained models such as BERT and RoBERTa have achieved improved performance on the COPA dataset. However, Kavumba et al. [10] found that these models exploit superficial cues such as token frequency in the correct answers. To mitigate this effect, Kavumba et al. expanded the development set with mirrored instances that balance the lexical distribution between correct and incorrect answers. This new dataset, called COPA-Balanced, also categorizes the test set into easy and hard groups. The easy group consists of the 190 questions that RoBERTa-Large and BERT-Large could answer correctly without the provided premise, and the hard group is the remaining 310 questions. We use the COPA-Balanced development set for training and the hard category (which we refer to as COPA-Balanced Hard) for evaluation.

For the COPA test set, we were able to extract causal information from CauseNet for 32% of the questions. All three causal augmentation methods outperform the RoBERTa baseline. CausalKGE and InputAugmentation perform similarly, improving accuracy over the RoBERTa baseline by an average of +6.0pp on the COPA test set and +3.9pp on the COPA-Balanced Hard set.

Table 2 provides the results of our experiments on the WIQA dataset. The current state of the art for WIQA is the QUARTET model presented by Rajagopal et al. [15]. QUARTET modifies the WIQA task to include an explanation structure that identifies the supporting events from the procedural description that best explain the proposed perturbation. The supporting events come from an influence graph selected by human annotators for each question in the WIQA dataset. QUARTET models the explanation task as a multi-task learning problem where the model must predict both the gold supporting sentences and the associated impact of the perturbation for each supporting event. While our approach is 2pp below QUARTET's overall accuracy, we outperform QUARTET in the In-Paragraph and Out-of-Paragraph subcategories.

We were able to select causal information for 55% (1,661) of the questions in the test set, with an average of one causal tuple extracted per question; 37% of the questions had two or more extracted causal tuples. The CausalSkipgram method was the least successful, performing worse than the RoBERTa baseline across all categories. The CausalKGE and InputAugmentation methods both improved accuracy over the RoBERTa baseline in all categories. The InputAugmentation method was competitive with QUARTET and outperformed it in both the In-Paragraph (+3.3pp) and Out-of-Paragraph (+2.0pp) categories. We do, however, see a -3.0pp decrease in accuracy in the No Effect category, likely due to extraneous or irrelevant causal tuples being selected. Future work can explore improving the precision of the causal extraction process.

===7 Discussion and Future Work===

Our initial work validates RQ1 by demonstrating the efficacy of causality-enhanced language models on the COPA and WIQA question answering benchmarks. Further work will explore improving recall in causal fact selection from CauseNet and more sophisticated techniques to reduce the selection of irrelevant facts. We also plan to explore the knowledge injection techniques described in RQ2.2. We are investigating adapting the masked-language modeling objective to predict masked causal concepts across sentence-level descriptions of causal events, as illustrated in the sketch below.
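To illustrate the kind of objective we have in mind, the sketch below masks one causal concept in a sentence-level causal description so that a masked-language model must recover it during pretraining. The example sentence, concept spans, and single-span masking are simplifying assumptions about an approach that is still under investigation, not a finished method.

```python
# Illustrative sketch of the proposed injection objective: replace the
# cause or effect concept in a causal sentence with a mask token,
# producing (masked input, target concept) pairs for masked-language
# model pretraining.
import random

def mask_causal_concept(sentence: str, concepts: list, mask_token: str = "<mask>"):
    """Mask one causal concept; return the masked input and its label."""
    target = random.choice(concepts)
    return sentence.replace(target, mask_token, 1), target

sentence = "Heavy rain causes flooding in low-lying areas."
masked, label = mask_causal_concept(sentence, ["Heavy rain", "flooding"])
print(masked)  # e.g. "Heavy rain causes <mask> in low-lying areas."
print(label)   # e.g. "flooding"
```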
Broadly, our goal is to develop a unified causal knowledge-enhanced language model that can be effective across all causal reasoning tasks. To that end, we need to be able to define and measure the causal reasoning capabilities of the language model (RQ3). While COPA and WIQA are both multiple-choice question answering tasks, the causal reasoning requirements are distinct for each application. Yet we find our simple augmentation strategies are effective in both cases. This raises interesting questions about how language models use causal knowledge and whether our current tasks accurately represent causal reasoning. We hope to create a more meaningful definition of causal reasoning by exploring the causal knowledge needs of existing NLP tasks, and to develop new probing methods to better understand the causal reasoning capabilities of these language models. We hope this work is a stepping stone towards the more ambitious goal of general AI with causal reasoning capabilities. In order to make that jump, language models need to be able to reason across semantic knowledge found on the web and in causal graphs.

===8 Acknowledgments===

This work has been funded with the financial support of the Science Foundation Ireland Centre for Research Training in Artificial Intelligence under Grant No. 18/CRT/6223 and is supervised by Dr. Paul Buitelaar and Dr. Mihael Arcan.

===References===

1. Bhagavatula, C., Bras, R.L., Malaviya, C., Sakaguchi, K., Holtzman, A., Rashkin, H., Downey, D., Yih, S.W., Choi, Y.: Abductive commonsense reasoning. CoRR (2019)
2. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., Yakhnenko, O.: Translating embeddings for modeling multi-relational data. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
3. Dalal, D., Arcan, M., Buitelaar, P.: Enhancing multiple-choice question answering with causal knowledge. In: Proceedings of Deep Learning Inside Out (DeeLIO): The Second Workshop on Knowledge Extraction and Integration for Deep Learning Architectures. Association for Computational Linguistics (2021)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, Volume 1 (Long and Short Papers) (2019)
5. Goldman, A.I.: A causal theory of knowing. The Journal of Philosophy 64(12), 357–372 (1967)
6. Gordon, A.S., Kozareva, Z., Roemmele, M.: SemEval-2012 task 7: Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In: SemEval@NAACL-HLT (2012)
7. Hassanzadeh, O., Bhattacharjya, D., Feblowitz, M., Srinivas, K., Perrone, M., Sohrabi, S., Katz, M.: Answering binary causal questions through large-scale text mining: An evaluation using cause-effect pairs from human experts. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19 (2019)
8. Heindorf, S., Scholten, Y., Wachsmuth, H., Ngomo, A.C.N., Potthast, M.: CauseNet: Towards a causality graph extracted from the web. In: CIKM (2020)
9. Huang, L., Bras, R.L., Bhagavatula, C., Choi, Y.: Cosmos QA: Machine reading comprehension with contextual commonsense reasoning. CoRR (2019)
10. Kavumba, P., Inoue, N., Heinzerling, B., Singh, K., Reisert, P., Inui, K.: When choosing plausible alternatives, Clever Hans can be clever. In: Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing (2019)
11. Kayesh, H., Saiful Islam, M., Wang, J., Anirban, S., Kayes, A.S.M., Watters, P.: Answering binary causal questions: A transfer learning based approach. In: 2020 International Joint Conference on Neural Networks (IJCNN) (2020)
12. Lin, K., Tafjord, O., Clark, P., Gardner, M.: Reasoning over paragraph effects in situations. In: MRQA@EMNLP (2019)
13. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. CoRR (2019)
14. Mikolov, T., Chen, K., Corrado, G.S., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
15. Rajagopal, D., Tandon, N., Clark, P., Dalvi, B., Hovy, E.: What-if I ask you to explain: Explaining the effects of perturbations in procedural text. In: Findings of the Association for Computational Linguistics: EMNLP 2020 (2020)
16. Sap, M., LeBras, R., Allaway, E., Bhagavatula, C., Lourie, N., Rashkin, H., Roof, B., Smith, N.A., Choi, Y.: ATOMIC: An atlas of machine commonsense for if-then reasoning. In: AAAI (2019)
17. Sharp, R., Surdeanu, M., Jansen, P., Clark, P., Hammond, M.: Creating causal embeddings for question answering with minimal supervision. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (2016)
18. Speer, R., Chin, J., Havasi, C.: ConceptNet 5.5: An open multilingual graph of general knowledge. In: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. AAAI Press (2017)
19. Tandon, N., Mishra, B.D., Sakaguchi, K., Bosselut, A., Clark, P.: WIQA: A dataset for "what if..." reasoning over procedural text. In: EMNLP-IJCNLP (2019)
20. Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: ICLR (2019)
21. Xie, Z., Mu, F.: Distributed representation of words in cause and effect spaces. In: AAAI (2019)