Economics Assistant for Robustness Checks (EconARC): Identifying Confounders from Causal Knowledge Graphs Fiona Anting Tan, See-Kiong Ng Institute of Data Science, National University of Singapore Abstract In Economics, authors conduct robustness checks, such as accounting for potential confounders, to avoid drawing misleading conclusions from their causal analyses. To assist in this process, we propose EconARC, a tool to automatically identify confounders from the literature relevant to a Cause and Effect pair. Our methodology involves extracting cause-and-effect arguments using a fine-tuned sequence- to-sequence model, clustering semantically similar arguments into topics, and utilizing the backdoor criterion on the causal graph to detect confounders. Our study is the first to employ text mining techniques to generate confounders in Economics, with implications for advancing Artificial Intelligence towards human-level capabilities like engaging in academic discourse. Keywords causal text mining, confounder detection, knowledge graphs, backdoor criterion 1. Introduction Figure 1: Overview of EconARC Causal inference relies on addressing confounders, which are variables affecting both the dependent and independent variables. In Economics, consideration of confounders is important, and often a critical part of referee reports. However, staying abreast of confounders in this field requires an extensive knowledge of the literature, which is a non-trivial task given the lengthy and vast number of Economics papers. ISWC 2023 Posters and Demos: 22nd International Semantic Web Conference, November 6–10, 2023, Athens, Greece Envelope-Open tan.f@u.nus.edu (F. A. Tan); seekiong@nus.edu.sg (S. Ng) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings Existing literature on confounder detection from causal knowledge graphs prioritizes uncov- ering latent relationships using quantitative variables [1]. We offer a novel approach by focusing solely on text, thereby bridging the fields of causal text mining and causal identification. We propose the Economics Assistant for Robustness Checks (EconARC), overview shown in Figure 1, that automates robustness check proposal by identifying confounders related to a Cause and Effect pair. To our knowledge, we are the first work to use causal text mining techniques to gen- erate confounders for Economics. We believe that EconARC will be a useful tool for Economics authors to review their paper prior to submission, and reviewers to obtain an unbiased, initial assessment of a paper. Our work also has implications for advancing Artificial Intelligence (AI) towards human-level understanding and inference tasks like engaging in academic discourse. 2. Our Approach In this section, we outline our methodology, and provide additional details in the Appendix.1 2.1. Dataset Our study experiments on 177 papers from 23 issues of the Journal of Labor Economics (JOLE).2 (1) Annotating Causal Relations: One of our authors, who is an Econometrics graduate, annotated 5 papers (2,223 sentences) for training and 1 paper (560 sentences) for testing with causal relations. We restricted our annotations to causal relations that appear across 5-sentences in the same section of the paper, and for cause and effect arguments to be consecutive spans. We mainly adapted the annotation guidelines from the Causal News Corpus (CNC) [2, 3] with the key difference being that our annotated causal relations must be helpful to an Economics academic. A consequence of this rule means that we differ from CNC in areas like our arguments need not contain events, and we do not annotate: (1) Purpose relations, (2) justifications for a data or methodology choice, etc. In total, 522 and 76 causal relations were annotated in the training and test set respectively. (2) Annotating Argument Topics: For two papers from the training set, the same annotator assigned open-ended topic labels to each argument. 356 arguments were annotated to 119 topics. Topic labels were as general as “education”, to more specific labels like “greater upward mobility”, “areas where fathers tend to be richer within the bottom half of households and whose sons did better accordingly”, etc. Sequences without annotations will be referred to as our Out-of-Sample (OOS) set. The OOS set comprises of 83,676 sentences from 171 papers. 2.2. Extraction of Causal Relations We fine-tune SOTA sequence-to-sequence (S2S) pre-trained language models (PTMs), like t 5 - b a s e [4], b a r t - b a s e [5] and p e g a s u s - l a r g e [6]. Given the i n p u t _ t e x t , the model learned to generate the t a r g e t _ t e x t . These texts are described below: 1. i n p u t _ t e x t : An input sequence that is 5-sentences long with a “ s u m m a r i z e : ” prefix. 1 Our repository is available at https://github.com/tanfiona/EconARC. 2 https://www.journals.uchicago.edu/toc/jole/current 2. t a r g e t _ t e x t : If no causal relations were annotated within the i n p u t _ t e x t , return “No key causal relations”. Else, return a line-separated list of causal relations in the format of “Key causal relations:\ n 1 . C a u s e : < F I R S T _ C A U S E _ S P A N > \ t E f f e c t : < F I R S T _ E F F E C T _ S P A N > \ n ...” 2.3. Knowledge Graph Creation To prevent a sparse graph, we grouped arguments with similar meaning into a topic. Similar to [7, 8], we approached this task by (1) generating word embeddings and (2) clustering the embeddings. We concatenated the annotated causal relations from the training set and the inferred causal relations from the OOS set together when performing clustering. For (1), to encode arguments into embeddings, we experimented with PTMs like the supervised pre-trained language model by SimCSE [9] and the encoder portion of our fine-tuned T5 extraction model. For (2), we condensed our embeddings into 400 components3 using Principal Components Analysis (PCA) and used Mini-Batch K-Means [10] to perform our clustering. We explored various levels of K (5000 to 15000, jumping by gaps of 2500). We removed relations where the cause and effect have the same topic to avoid nodes with self-loops. Our knowledge graph (KG) 𝐺 = (𝑉 , 𝐸) is a collection of nodes 𝑉 = {(𝑣1 , 𝑣2 , ..., 𝑣𝑛 )} and directed edges 𝐸 = {(𝑣1 , 𝑣2 ), (𝑣2 , 𝑣3 ), ...}. A directed edge (𝑣𝑥 , 𝑣𝑦 ) represents the presence of causality between the two nodes, where 𝑣𝑥 is the cause argument and 𝑣𝑦 is the effect argument. The edges are also weighted by support 𝑠, indicating the count of relations expressing causality from 𝑣𝑥 to 𝑣𝑦 in the dataset. 2.4. Confounder Detection Given an ordered pair of variables (𝑋 , 𝑌 ) in a directed acyclic graph 𝐺, a set of variables 𝑍 satisfies the backdoor criterion relative to (𝑋 , 𝑌 ) if no node in 𝑍 is a descendant of 𝑋, and 𝑍 blocks every path between 𝑋 and 𝑌 that contains an arrow into 𝑋 [1]. Backdoor paths may make 𝑋 and 𝑌 dependent despite lacking causal influences from 𝑋. To estimate the causal relationship of 𝑋 on 𝑌, we need to condition on a set of nodes 𝑍 such that 𝑍 (1) blocks all spurious paths between 𝑋 and 𝑌, (2) leaves directed paths between 𝑋 and 𝑌 unchanged, and (3) creates no new spurious paths. In other words, the causal effect of 𝑋 on 𝑌 is given by the formula: 𝑃(𝑌 = 𝑣𝑦 |𝑑𝑜(𝑋 = 𝑣𝑥 )) = ∑𝑣𝑐 𝑃(𝑌 = 𝑣𝑦 |𝑋 = 𝑣𝑥 , 𝑍 = 𝑣𝑧 )𝑃(𝑍 = 𝑣𝑧 ). This formula describes the distribution of Y given an intervention (𝑑𝑜(𝑋 = 𝑣𝑥 )) that sets X to the value 𝑣𝑥 , thereby removing X’s dependence on Z. Given a source (Cause) and target (Effect), we automatically identify potential confounders by adapting the backdoor criterion scripts from DoWhy [11], a Python package for causal inference. A depth-first search algorithm to explore paths between the Cause and Effect pair and determines the variables that need to be conditioned on to block all paths between them. Since our whole graph is too large, we had to restrict our search space to improve run times: For each node in the graph 𝐺, we designated it as a central node and obtained a subgraph (𝑠𝐺) containing nodes located within a 2-step radius. For each node in 𝑠𝐺 that is not the central node, we designated it as the target node, while the central node was fixed as the source. The benefit of this setup is that we could search for backdoor variables within a feasible run time. However, our methodology fails to identify backdoor variables that lie outside of each subgraph. 3 With 400 components, only 5.005−05 of variance is dropped. (A) Extraction (Seq2Seq Model) (B) Clustering (MiniBatch K-Means Model) PTM ROUGE1 ROUGE2 ROUGEL ROUGELsum PTM K ARI FMI NMI T5 79.90 77.65 79.25 79.65 SimCSE 7500 23.03 32.95 82.47 Pegasus 66.27 63.07 65.51 65.73 SimCSE 10000 18.32 27.51 80.54 BART 76.86 73.76 75.97 76.48 T5 7500 15.81 21.53 77.64 T5 10000 12.19 21.50 80.11 Table 1 Performance metrics for (A) cause-effect extraction in a S2S framework and (B) argument clustering using Mini-Batch K-Means. Scores are reported in percentages (%). Top score per column is in bold. Explanations for evaluation metrics are available in the Appendix. 3. Results & Conclusion Panel A of Table 1 reports scores for extraction. Across all metrics, our best model was the S2S model that fine-tuned T5, scoring 79.90% for ROUGE1 and 76.25% for ROUGEL.4 Hence, we used this best model on our OOS set to obtain predicted causal relations. Panel B of Table 1 reports scores for clustering. Across all metrics, our best model uses SimCSE embeddings and performs K-Means clustering for 7500 topics, scoring 23.03% for ARI and 82.47% for NMI. Using the SimCSE embeddings consistently supercedes using T5’s, suggesting benefits in clustering arguments that were converted to embeddings that convey semantic similarity. For our best model, our KG comprises of 7498 unique nodes, 37557 edges, and an edge support ranging from 1 to 10 and averaging at 1.207.5 Finally, we apply the backdoor criterion detection algorithm to identify confounders. For our dataset of 176 papers (train + OOS), we identified 152 papers and 676 confounders for authors to consider reviewing. These confounders lie 1 to 4 steps away from either the cause or effect argument of the main relation. We also detected 161 papers and 1408 confounders that the authors themselves describe within their paper, which reveal that confounders and robustness checks are definitely a key concern and covered by most authors. In conclusion, EconARC successfully applies causal text mining techniques to automatically identify confounders. EconARC will be a useful tool for Economics authors and referees to critically evaluate the validity of a causal identification strategy. This tool will also help mitigate reviewers’ unconscious bias by standardizing the review process. In the future, we hope to expand the coverage of our work to more journals and to more branches of Economics, and to evaluate our system with Economic academics. We also hope to design tools to identify other threats to validity to provide a more comprehensive review. References [1] J. Pearl, M. Glymour, N. P. Jewell, Chapter 3: The effects of interventions, in: Causal inference in statistics: A primer, John Wiley & Sons, 2016. [2] F. A. Tan, A. Hürriyetoğlu, T. Caselli, N. Oostdijk, T. Nomoto, H. Hettiarachchi, I. Ameer, O. Uca, F. F. Liza, T. Hu, The causal news corpus: Annotating causal relations in event 4 We used ROUGE evaluation metrics since the task is a S2S open-ended generation task. 5 Due to limited space, we provide experimental details, explanation of our evaluation metrics, ablation studies and provide qualitative examples of confounders in the Appendix. sentences from news, in: Proceedings of the Thirteenth Language Resources and Evalua- tion Conference, European Language Resources Association, Marseille, France, 2022, pp. 2298–2310. URL: https://aclanthology.org/2022.lrec-1.246. [3] F. A. Tan, H. Hettiarachchi, A. Hürriyetoğlu, N. Oostdijk, T. Caselli, T. Nomoto, O. Uca, F. F. Liza, S.-K. Ng, RECESS: Resource for extracting cause, effect, and signal spans, in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Bali, Indonesia, 2023. [4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html. [5] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle- moyer, BART: Denoising sequence-to-sequence pre-training for natural language gen- eration, translation, and comprehension, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Lin- guistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 0 . a c l - m a i n . 7 0 3 . [6] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: pre-training with extracted gap-sentences for abstractive summarization, in: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 11328–11339. URL: http://proceedings.mlr. press/v119/zhang20ae.html. [7] S. Sia, A. Dalmia, S. J. Mielke, Tired of topic models? clusters of pretrained word embed- dings make for fast and good topics too!, in: Proceedings of the 2020 Conference on Empiri- cal Methods in Natural Language Processing (EMNLP), Association for Computational Lin- guistics, Online, 2020, pp. 1728–1736. URL: https://aclanthology.org/2020.emnlp-main.135. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 0 . e m n l p - m a i n . 1 3 5 . [8] Z. Zhang, M. Fang, L. Chen, M. R. Namazi Rad, Is neural topic modelling better than clustering? an empirical study on clustering with contextual embeddings for top- ics, in: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Associa- tion for Computational Linguistics, Seattle, United States, 2022, pp. 3886–3893. URL: https://aclanthology.org/2022.naacl-main.285. doi:1 0 . 1 8 6 5 3 / v 1 / 2 0 2 2 . n a a c l - m a i n . 2 8 5 . [9] T. Gao, X. Yao, D. Chen, SimCSE: Simple contrastive learning of sentence embeddings, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6894–6910. URL: https://aclanthology.org/2021.emnlp-main.552. doi:1 0 . 1 8 6 5 3 / v1/2021.emnlp- main.552. [10] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830. [11] A. Sharma, E. Kiciman, Dowhy: An end-to-end library for causal inference, CoRR abs/2011.04216 (2020). URL: https://arxiv.org/abs/2011.04216. a r X i v : 2 0 1 1 . 0 4 2 1 6 .