=Paper=
{{Paper
|id=Vol-3180/paper-20
|storemode=property
|title=BioTABQA: Instruction Learning for Biomedical Table Question Answering
|pdfUrl=https://ceur-ws.org/Vol-3180/paper-20.pdf
|volume=Vol-3180
|authors=Man Luo,Sharad Saxena,Swaroop Mishra,Mihir Parmar,Chitta Baral
|dblpUrl=https://dblp.org/rec/conf/clef/LuoSMPB22
}}
==BioTABQA: Instruction Learning for Biomedical Table Question Answering==
BioTABQA: Instruction Learning for Biomedical Table Question Answering

Man Luo, Sharad Saxena, Swaroop Mishra, Mihir Parmar and Chitta Baral
Arizona State University, Tempe, Arizona, 85281, United States
mluo26@asu.edu (M. Luo); ssaxen18@asu.edu (S. Saxena); smishr1@asu.edu (S. Mishra); mparmar3@asu.edu (M. Parmar); chitta@asu.edu (C. Baral)
CLEF 2022: Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy

Abstract
Table Question Answering (TQA) is an important but under-explored task. Most existing QA datasets are in unstructured text format, and only a few of them use tables as the context. To the best of our knowledge, no TQA dataset exists in the biomedical domain, where tables are frequently used to present information. In this paper, we first curate a table question answering dataset, BioTabQA, using 22 templates and context from a biomedical textbook on differential diagnosis. BioTabQA can not only be used to teach a model how to answer questions from tables but also to evaluate how a model generalizes to unseen questions, an important scenario for biomedical applications. To enable this generalization evaluation, we divide the templates into 17 training templates and 5 cross-task evaluation templates. We then develop two baselines using single- and multi-task learning on BioTabQA. Furthermore, we explore instruction learning, a recent technique showing impressive generalization performance. Experimental results show that our instruction-tuned model outperforms the single- and multi-task baselines on average by ∼23% and ∼6% across various evaluation settings and, more importantly, outperforms the baselines by ∼5% on cross tasks.

Keywords: Table question answering, biomedical question answering, instruction learning, prompt learning

1. Introduction

Neural language models have achieved state-of-the-art performance on popular reading comprehension (RC) tasks such as SQuAD [1, 2], DROP [3] and ROPES [4]. Unlike in popular RC, where the context contains information in natural language, a significant amount of real-world information is stored in unstructured or semi-structured web tables [5]. In particular, much clinical information is provided in tabular format [6]. Past attempts have been made at TQA in general-domain Natural Language Processing (NLP) [7, 8, 9]; however, this task has not been well studied in the biomedical domain.

This work takes the first step toward studying the TQA task in the biomedical domain. To this end, we first curate a table question answering dataset, BioTabQA, using 22 templates without heavy and expensive human annotation. The dataset also serves to evaluate the generalization of a model, a well-known weakness of many language models even though they outperform humans on many popular benchmarks [10, 11]. Recently, instruction learning [12, 13, 14] has improved model performance on unseen tasks. Inspired by this, we leverage instruction-tuning to build a model and verify whether instruction learning also shows stronger generalization on BioTabQA.
Our contributions can be summarized as follows: (1) to the best of our knowledge, this is the first attempt to study biomedical TQA and also the first attempt to incorporate instruction learning in this task; (2) we reformulate differential diagnosis as a TQA problem and introduce a new dataset, BioTabQA; and (3) experimental results show that our instruction-tuned model outperforms the single-task and multi-task baselines by ∼23% and ∼6%, respectively, and outperforms the multi-task model by ∼5% in the cross-task (generalization to unseen tasks) setting. Finally, our analysis shows that, at inference time, instructions are more important and useful for cross tasks than for in-domain tasks.

2. Related Work

Table Question Answering. Past attempts have been made at TQA, such as TabMCQ [7], WikiTableQuestions [8], Sequential Q&A [9], Spider [15], and WikiSQL [16]. These approaches can handle large-scale tables from Wikipedia efficiently. However, such QA systems can only answer a question when a strong signal for identifying the type of answer is provided explicitly in the table. To overcome this limitation, TabFact [17] was proposed, which enables TQA when the answer is not explicitly available in the table. However, none of the above datasets are in the biomedical domain, a domain that is not only essential to human life but also one in which tables have wide application (e.g., much biomedical information is presented in tables). Some table datasets do exist in the biomedical domain, such as PubTabNet [18], a medical table dataset widely used in information retrieval tasks. Other datasets are designed for biomedical question answering, such as [19, 20]. Nevertheless, these are not biomedical TQA datasets, leaving biomedical TQA an under-explored task. This work aims to take the first step toward table question answering in the biomedical domain, which is known to differ from the general domain [21, 22].

Instruction Learning. Recently, the paradigm in ML/DL has shifted toward prompt-based learning. Liu et al. [23] provide a comprehensive survey of prompt-based methods for various tasks. Prompts enable generalization across tasks and achieve considerable zero-shot performance. The T0 model of Sanh et al. [24] shows effective performance on multi-tasking and zero-shot task generalization using a prompt-based approach. Mishra et al. [12] introduced natural language instructions to improve the cross-task performance of LMs such as BART and GPT-3. Following this, FLAN [13] was proposed, which uses instructions to achieve generalization across unseen tasks. Recently, Parmar et al. [14] proposed instruction learning for biomedical multi-task learning. Along with that, Mishra et al. [25] show that reframing instructional prompts can boost both few-shot and zero-shot model performance. Min et al. [26] study the performance of in-context learning on a large set of training tasks. The InstructGPT model is fine-tuned with human feedback [27]. An instruction-based multi-task framework for few-shot Named Entity Recognition (NER) has been developed by Wang et al. [28]. Puri et al. [29] introduced instruction augmentation, and Prasad et al. [30] introduced Gradient-free Instructional Prompt Search (GrIPS) for improving model performance. Recently, Parmar et al. [31] argued that instruction bias in existing Natural Language Understanding (NLU) datasets can affect instruction learning; nevertheless, many approaches using instructions to improve model performance have been proposed recently [32, 33, 34, 35, 36]. Motivated by the effectiveness of instruction learning, in this work we explore the potential of instructional prompts for biomedical TQA.
3. Task and Dataset

Task Formulation. Each data point is a tuple (T, Q, A), where T is a table, Q is a question, and A is the answer to Q in T. In particular, Q describes some symptoms/signs and asks what potential disease they indicate, e.g., "I have joint pain and swelling on my face, what's wrong with me?". A is the corresponding disease (or diagnosis) in T. The task is to predict A given (T, Q) as input.

Dataset Source. We use the medical textbook "Differential Diagnosis in Primary Care" [37] as the source of our dataset; it contains information on how to diagnose a patient by observing their disease symptoms. The book is in tabular format with five columns: (1) diagnosis, (2) key symptoms, (3) key signs¹, (4) background, and (5) additional information. We only use the first three columns to create the dataset. We divide the textbook into 513 tables.

¹ According to the JAMA Network, a symptom is a manifestation of disease that appears to the patient himself, while a sign is a manifestation of the disease that the physician perceives.

Dataset Creation. To create large-scale training/evaluation datasets (i.e., BioTabQA) without laborious human annotation, we design a wide range of templates to semi-automate the process of dataset generation. We use the key symptoms and/or key signs of a diagnosis in the question templates, and the diagnosis as the answer. In addition, we design a corresponding prompt for each template to enable instruction learning. In total, we design 22 templates. Table 1 shows four templates and their corresponding prompts as examples (all templates and prompts are given in Appendix A). Templates contain one, two, three, or four symptoms/signs (e.g., IDs 1, 2, 11, 22), and some contain negation (e.g., ID 22). Once the templates are pre-defined, given a table and a template, for each row we randomly select symptoms/signs according to the template and replace the placeholders in the template with the chosen symptoms/signs; a sketch of this procedure is given after Table 1.

Table 1: Examples of four question templates for BioTabQA dataset creation and the corresponding prompts.
ID | Question Template | Prompt
1 | I have symptom A, what disease do I have? | If symptom A is in symptom list, report corresponding disease.
2 | I have symptom A and sign A, what is my diagnosis? | If symptom A is in symptom list, and sign A is in sign list, report corresponding disease.
11 | The patient has symptom A, symptom B and symptom C, what disease can cause these symptoms? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
22 | I have symptom A, symptom B, symptom C but no symptom D, what is causing this? | If symptom A, symptom B and symptom C are in symptom list, but symptom D is not in symptom list, report corresponding disease.
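To make the generation procedure concrete, below is a minimal Python sketch of how one question/answer pair could be instantiated from a template and a single table row. The row contents, the {A}/{B}/{C} placeholder syntax, and the helper name `instantiate` are illustrative assumptions; the paper describes the procedure but does not publish its generation code.

```python
import random

# One row of a differential-diagnosis table; only the first three textbook
# columns (diagnosis, key symptoms, key signs) are kept in BioTabQA.
# The contents of this row are made up for illustration.
row = {
    "diagnosis": "Migraine",
    "key_symptoms": ["throbbing headache", "nausea", "photophobia"],
    "key_signs": ["normal neurological examination"],
}

# Template ID 11 from Table 1 and its instruction prompt,
# rewritten with format-style placeholders.
question_template = ("The patient has {A}, {B} and {C}, "
                     "what disease can cause these symptoms?")
prompt_template = ("If {A}, {B} and {C} are in symptom list, "
                   "report corresponding disease.")

def instantiate(row, question_template, prompt_template, k=3, seed=0):
    """Fill the template placeholders with k randomly chosen key symptoms."""
    rng = random.Random(seed)
    chosen = rng.sample(row["key_symptoms"], k)
    slots = dict(zip("ABC", chosen))
    return {
        "question": question_template.format(**slots),
        "prompt": prompt_template.format(**slots),
        "answer": row["diagnosis"],  # the diagnosis is always the answer
    }

print(instantiate(row, question_template, prompt_template))
```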
Table 2: Statistics of BioTabQA Split 1 for the training (Train), in-domain testing (IID Test), and cross-task testing (Cross Task Test) sets.
Statistic | Train | IID Test | Cross Task Test
# of Samples | 9,126 | 19,590 | 2,463
Question Length | 240 | 20 | 16
Table Length | 255 | 256 | 246
Prompt Length | 18 | 18 | 14
# Tasks with 1 sym/sign | 0 | 0 | 3
# Tasks with 2 sym/sign | 9 | 9 | 2
# Tasks with 3 sym/sign | 7 | 7 | 0
# Tasks with 4 sym/sign | 1 | 1 | 0
# Tasks with negation | 2 | 2 | 0

Three Splits in BioTabQA. For experimental purposes, we created three training/testing/cross-task splits of the data. Each split includes 17 templates for in-domain training and testing, with non-overlapping tables between training and testing. The remaining 5 templates are used for cross-task evaluation. For each split, the templates in the training set are similar to each other and less similar to the templates in the evaluation set (cross-task setting), in order to test the generalization capability of a model. Similarity is defined as having the same number of symptoms/signs in the templates or similar phrases in the templates. Table 2 shows the statistics of Split 1; the other Splits and the division of the Splits are given in Appendix A.

4. Experiments and Results

In our dataset, each question type (i.e., template) is considered an individual task, so we have 22 different tasks in total. We design two baselines, a single-task model (STM) and a multi-task model (MTM), and compare the performance of the instruction-tuned model (In-MTM) with these two baselines on the in-domain test set, on cross tasks, and with respect to robustness [38, 39]. We use DistilBERT [40] as the backbone model for all experiments; other details of the experimental setup can be found in Appendix B. In the following, we describe the table linearization technique, followed by our instructional multi-task learning model, and present the results and analysis at the end of this section. The Exact Match (EM) score is used as the evaluation metric; a minimal sketch of the scorer follows.
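The paper does not detail any answer normalization for EM, so the lowercasing and whitespace normalization below are assumptions rather than the authors' exact implementation; the sketch only illustrates the metric.

```python
def exact_match(prediction: str, gold: str) -> int:
    """Return 1 if the predicted answer equals the gold answer, else 0.

    The normalization (lowercasing, collapsing whitespace) is an assumption;
    the paper only states that EM is the evaluation metric.
    """
    normalize = lambda s: " ".join(s.lower().strip().split())
    return int(normalize(prediction) == normalize(gold))

def em_score(predictions, golds):
    """Average exact match over parallel lists of predictions and gold answers."""
    return sum(exact_match(p, g) for p, g in zip(predictions, golds)) / len(golds)

print(em_score(["Migraine", "asthma"], ["migraine", "Pneumonia"]))  # 0.5
```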
4.1. Table Linearization

Since the input to the language model is text, we need to linearize the table context from BioTabQA. We use a simple yet effective linearization method suggested by [17] to convert the table context into a string of text, with the pre-defined format "Row 1 is: Diagnosis is _, Key symptoms are _, Key signs are _; ...; Row N is: Diagnosis is _, Key symptoms are _, Key signs are _". A sketch of this linearization is given below.
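The following is a minimal sketch of the linearization, under the assumption that each BioTabQA table is a list of rows holding the three retained columns; the row contents and the joining of multiple symptoms with commas are illustrative choices, not specified by the paper.

```python
def linearize_table(rows):
    """Convert a BioTabQA table into the pre-defined
    'Row i is: Diagnosis is ..., Key symptoms are ..., Key signs are ...' string."""
    parts = []
    for i, row in enumerate(rows, start=1):
        parts.append(
            f"Row {i} is: Diagnosis is {row['diagnosis']}, "
            f"Key symptoms are {', '.join(row['key_symptoms'])}, "
            f"Key signs are {', '.join(row['key_signs'])}"
        )
    return "; ".join(parts)

# Example with a two-row table (contents are illustrative).
table = [
    {"diagnosis": "Migraine",
     "key_symptoms": ["throbbing headache", "nausea"],
     "key_signs": ["normal neurological examination"]},
    {"diagnosis": "Tension headache",
     "key_symptoms": ["band-like pain", "neck stiffness"],
     "key_signs": ["pericranial tenderness"]},
]
print(linearize_table(table))
```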
4.2. Instructional Multi-task Learning Model

Apart from the prompt designed for each template (see §3), one additional example is also given in the instruction. The example consists of a question and its answer, without the context table, due to the input length restriction of the language model. We also use special words to denote the beginning of the prompt, the question, and the answer; the instruction takes the form {Prompt: p. Question: q. Answer: a}. The input to our instruction learning model is {[CLS] Question: Q, Context: C, Instruction: I}, where [CLS] is the special token of the DistilBERT model, Q is the input question, and C is the input table after linearization. As mentioned in §3, we create multiple templates, and we term the data created by an individual template a task. A single-task model (STM) is trained on one task, and a multi-task model (MTM) is trained on multiple tasks. A sketch of the input construction is given below.
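This sketch assembles the instruction and the model input in the format described above. The exemplar question/answer pair and the exact separator characters are assumptions; the paper specifies the overall format ({Prompt: p. Question: q. Answer: a} and the [CLS] Question/Context/Instruction layout) but not these details.

```python
def build_instruction(prompt, example_question, example_answer):
    """Instruction = task prompt plus one in-context example (no table)."""
    return (f"Prompt: {prompt}. "
            f"Question: {example_question} "
            f"Answer: {example_answer}.")

def build_model_input(question, linearized_table, instruction):
    """Textual input for DistilBERT; the [CLS] token itself is added later
    by the tokenizer, so only the text after it is assembled here."""
    return (f"Question: {question}, "
            f"Context: {linearized_table}, "
            f"Instruction: {instruction}")

# Illustrative usage with template ID 1 from Table 1.
instruction = build_instruction(
    prompt="If symptom A is in symptom list, report corresponding disease",
    example_question="I have throbbing headache, what disease do I have?",
    example_answer="Migraine",
)
model_input = build_model_input(
    question="I have band-like pain, what disease do I have?",
    linearized_table=("Row 1 is: Diagnosis is Tension headache, "
                      "Key symptoms are band-like pain, neck stiffness, "
                      "Key signs are pericranial tenderness"),
    instruction=instruction,
)
print(model_input)
```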
4.3. Main Results

We evaluate our proposed In-MTM model with respect to several aspects, including in-domain testing, the cross-task setting, and robustness. All results are presented in Table 3. In the following, we present the main findings from our experiments. The performance of MTM and In-MTM varies across splits since each split consists of different tasks.

Table 3: EM (exact match) scores of the three models on BioTabQA. In the original paper, green denotes cross-task performance and bold denotes the best performance for each task.
Task ID | # Training | STM | Split 1 MTM | Split 1 In-MTM | Split 2 MTM | Split 2 In-MTM | Split 3 MTM | Split 3 In-MTM
1 | 667 | 0.53 | 0.84 | 0.85 | 0.88 | 0.91 | 0.88 | 0.88
2 | 3023 | 0.55 | 0.83 | 0.93 | 0.88 | 0.94 | 0.87 | 0.93
3 | 3082 | 0.62 | 0.87 | 0.93 | 0.89 | 0.96 | 0.90 | 0.95
4 | 3170 | 0.64 | 0.80 | 0.92 | 0.88 | 0.93 | 0.87 | 0.94
5 | 47561 | 0.90 | 0.88 | 0.93 | 0.92 | 0.95 | 0.90 | 0.93
6 | 10991 | 0.86 | 0.86 | 0.94 | 0.89 | 0.98 | 0.91 | 0.96
7 | 3082 | 0.60 | 0.87 | 0.93 | 0.89 | 0.97 | 0.89 | 0.95
8 | 3082 | 0.60 | 0.87 | 0.92 | 0.89 | 0.92 | 0.89 | 0.95
9 | 10324 | 0.82 | 0.80 | 0.95 | 0.83 | 0.96 | 0.83 | 0.96
10 | 3082 | 0.63 | 0.87 | 0.93 | 0.89 | 0.96 | 0.90 | 0.95
11 | 10991 | 0.88 | 0.86 | 0.94 | 0.89 | 0.97 | 0.91 | 0.95
12 | 3082 | 0.60 | 0.87 | 0.92 | 0.89 | 0.95 | 0.89 | 0.94
13 | 10991 | 0.90 | 0.86 | 0.93 | 0.89 | 0.98 | 0.90 | 0.95
14 | 10991 | 0.71 | 0.86 | 0.93 | 0.89 | 0.98 | 0.90 | 0.96
15 | 667 | 0.51 | 0.84 | 0.85 | 0.89 | 0.90 | 0.87 | 0.90
16 | 3082 | 0.63 | 0.87 | 0.92 | 0.89 | 0.96 | 0.89 | 0.92
17 | 10991 | 0.80 | 0.87 | 0.94 | 0.89 | 0.98 | 0.90 | 0.95
18 | 3082 | 0.68 | 0.88 | 0.93 | 0.89 | 0.96 | 0.89 | 0.94
19 | 3082 | 0.60 | 0.87 | 0.93 | 0.89 | 0.96 | 0.90 | 0.95
20 | 3082 | 0.61 | 0.87 | 0.91 | 0.89 | 0.95 | 0.90 | 0.94
21 | 667 | 0.54 | 0.83 | 0.85 | 0.87 | 0.89 | 0.85 | 0.90
22 | 14639 | 0.88 | 0.89 | 0.93 | 0.93 | 0.97 | 0.92 | 0.94
Avg. Split 1 | 9127 | 0.72 | 0.86 | 0.93 | - | - | - | -
Avg. Split 2 | 7349 | 0.68 | - | - | 0.89 | 0.95 | - | -
Avg. Split 3 | 8525 | 0.71 | - | - | - | - | 0.89 | 0.94
Avg. cross Split 1 | - | - | 0.84 | 0.88 | - | - | - | -
Avg. cross Split 2 | - | - | - | - | 0.88 | 0.97 | - | -
Avg. cross Split 3 | - | - | - | - | - | - | 0.89 | 0.92

Finding 1: The multi-task model performs better than the single-task model. From Figure 1, we observe that MTM outperforms STM in the majority of cases, yielding average improvements of 14%, 21%, and 18% on Splits 1, 2, and 3, respectively. We also observe that multi-task learning is especially helpful on tasks with little training data. Consider tasks 1, 15, and 21, which have only 667 training examples each (see the first block of Table 4): the STM achieves less than 0.60 EM, while the MTM trained on Split 2 achieves at least 0.85 EM². Even on Split 1, where the MTM is not trained on tasks 1, 15, and 21, it shows superior performance compared to the STM. This indicates that multi-task learning is effective in low-resource settings for TQA. Moreover, for tasks 5 and 13, which have more than 10k instances, the STM obtains a 0.90 EM score, and the MTM trained on Split 2 achieves similar performance. This finding is aligned with the literature showing that multi-task learning improves over single-task learning [41, 42, 43].

² We compare STM with MTM only on the Split 2 results in this scenario because tasks 1, 15, and 21 are used for training in Split 2.

Table 4: EM (exact match) scores of STM, MTM, and In-MTM on the low-resource tasks (first block) and high-resource tasks (second block).
Task ID | STM | Split 1 MTM | Split 1 In-MTM | Split 2 MTM | Split 2 In-MTM
1 | 0.53 | 0.84 | 0.85 | 0.88 | 0.91
15 | 0.51 | 0.84 | 0.85 | 0.89 | 0.90
21 | 0.54 | 0.83 | 0.85 | 0.87 | 0.89
5 | 0.90 | 0.88 | 0.93 | 0.92 | 0.95
13 | 0.90 | 0.86 | 0.93 | 0.89 | 0.98

Figure 1: The average performance of the three models (STM, MTM, In-MTM) on the in-domain testing sets of the different Splits (y-axis: EM score).

Finding 2: Instructions further improve the multi-task model. We observe from Figure 1 that In-MTM further improves over MTM, yielding average improvements of 6%, 6%, and 5% on Splits 1, 2, and 3, respectively. These results indicate that instructional prompts increase question-answering performance both consistently and significantly.

Finding 3: Multi-task learning and instruction learning improve the generalization capacity of the model. For each split, we hold out 5 tasks for cross-task evaluation. This is similar to out-of-domain evaluation, in which a model has not seen such question types at training time; performance on the cross tasks therefore reflects the generalization capacity of a model. From the results shown in Figure 2, we make two observations. First, both MTM and In-MTM generalize well. On Split 3, In-MTM achieves an average 0.94 EM on in-domain tasks (see Figure 1) and an average 0.92 EM on cross tasks (see Figure 2), a marginal drop of ∼2%; on the same split, MTM achieves the same performance on in-domain tasks and cross tasks. More importantly, both MTM and In-MTM achieve higher performance than the STM on every cross task even though neither model was trained on these tasks. This demonstrates the benefits of multi-task learning. Second, for each Split, In-MTM achieves better performance than MTM on every cross task, which shows that instruction learning can further improve generalization.

Figure 2: The average performance of the three models (STM, MTM, In-MTM) on the cross tasks of the different Splits (y-axis: EM score).

Finding 4: Instructions are more useful for cross tasks than for in-domain tasks. We evaluate the In-MTM on two cross tasks and two in-domain tasks with different types of instructions to analyze the change in model performance. First, we use mismatched instructions, i.e., the instruction of one task applied to another task, which is still instructive. From Table 5, we can see that the model performs similarly with a mismatched instruction as with the original instruction on both in-domain and cross tasks. The reason might be that the instructions for the different tasks share many similar words (see Appendix A), and previous studies [44, 45] have shown that a model can perform well if the instructions have similar words. Moreover, we also construct three types of meaningless instructions that have no linguistic meaning: random strings (e.g., 'ashlksadkl'), random words (e.g., 'hello bye you east'), and repeated characters (e.g., 'AAAAA'). These meaningless instructions hamper performance on cross tasks more significantly than on in-domain tasks. For in-domain tasks, the model has already been exposed to the same type of instances at training time, whereas the unseen cross tasks rely heavily on the instructions [12]. In summary, instructions are more important and helpful in cross-task settings.

Table 5: Performance of In-MTM (trained on Split 1) using four variants of instructions on two cross tasks (first block) and two in-domain tasks (second block).
Task ID | Correct | Mismatched | Repeat | Random String | Random Words
2 | 0.93 | 0.92 | 0.89 | 0.90 | 0.89
4 | 0.92 | 0.91 | 0.85 | 0.86 | 0.85
20 | 0.91 | 0.92 | 0.93 | 0.94 | 0.93
22 | 0.93 | 0.95 | 0.96 | 0.95 | 0.96

5. Future Work and Conclusion

In this work, we take the first step toward studying table question answering in the biomedical domain. We first create a dataset, BioTabQA, based on templates applied to a primary care textbook. We then experiment with three models on BioTabQA and find that multi-task learning is better than single-task learning, especially in low-resource scenarios. Furthermore, instruction learning significantly improves over the model without instructions on both in-domain and cross tasks. This suggests that instruction learning benefits table question answering, and exploring its role in other, general-domain table question answering datasets is an interesting direction for future work. The questions in the current dataset are based on the formal symptoms or signs given in the textbook, which makes some questions unnatural; using more natural terms to generate the questions could produce a dataset closer to real-life scenarios.
References

[1] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, SQuAD: 100,000+ questions for machine comprehension of text, arXiv preprint arXiv:1606.05250 (2016).
[2] P. Rajpurkar, R. Jia, P. Liang, Know what you don't know: Unanswerable questions for SQuAD, arXiv preprint arXiv:1806.03822 (2018).
[3] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, arXiv preprint arXiv:1903.00161 (2019).
[4] K. Lin, O. Tafjord, P. Clark, M. Gardner, Reasoning over paragraph effects in situations, arXiv preprint arXiv:1908.05852 (2019).
[5] W. Chen, M.-W. Chang, E. Schlinger, W. Wang, W. W. Cohen, Open question answering over tables and text, arXiv preprint arXiv:2010.10439 (2020).
[6] C. G. Durbin, Effective use of tables and figures in abstracts, presentations, and papers, Respiratory Care 49 (2004) 1233–1237.
[7] S. K. Jauhar, P. Turney, E. Hovy, Tables as semi-structured knowledge for question answering, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2016, pp. 474–483.
[8] P. Pasupat, P. Liang, Compositional semantic parsing on semi-structured tables, arXiv preprint arXiv:1508.00305 (2015).
[9] M. Iyyer, W.-t. Yih, M.-W. Chang, Search-based neural structured learning for sequential question answering, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1821–1831.
[10] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, D. Song, Robust physical-world attacks on deep learning visual classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1625–1634.
[11] R. Le Bras, S. Swayamdipta, C. Bhagavatula, R. Zellers, M. Peters, A. Sabharwal, Y. Choi, Adversarial filters of dataset biases, in: International Conference on Machine Learning, PMLR, 2020, pp. 1078–1088.
[12] S. Mishra, D. Khashabi, C. Baral, H. Hajishirzi, Cross-task generalization via natural language crowdsourcing instructions, ACL (2022).
[13] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned language models are zero-shot learners, arXiv preprint arXiv:2109.01652 (2021).
[14] M. Parmar, S. Mishra, M. Purohit, M. Luo, M. H. Murad, C. Baral, In-BoXBART: Get instructions into biomedical multi-task learning, NAACL 2022 Findings (2022).
[15] T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, et al., Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task, arXiv preprint arXiv:1809.08887 (2018).
[16] V. Zhong, C. Xiong, R. Socher, Seq2SQL: Generating structured queries from natural language using reinforcement learning, arXiv preprint arXiv:1709.00103 (2017).
[17] W. Chen, H. Wang, J. Chen, Y. Zhang, H. Wang, S. Li, X. Zhou, W. Y. Wang, TabFact: A large-scale dataset for table-based fact verification, arXiv preprint arXiv:1909.02164 (2019).
[18] X. Zhong, E. ShafieiBavani, A. Jimeno Yepes, Image-based table recognition: data, model, and evaluation, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI, Springer, 2020, pp. 564–580.
[19] G. Tsatsaronis, M. Schroeder, G. Paliouras, Y. Almirantis, I. Androutsopoulos, E. Gaussier, P. Gallinari, T. Artieres, M. R. Alvers, M. Zschunke, et al., BioASQ: A challenge on large-scale biomedical semantic indexing and question answering, in: AAAI Fall Symposium: Information Retrieval and Knowledge Discovery in Biomedical Text, Citeseer, 2012.
[20] Q. Jin, B. Dhingra, Z. Liu, W. Cohen, X. Lu, PubMedQA: A dataset for biomedical research question answering, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 2567–2577.
[21] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2020) 1234–1240.
[22] M. Luo, A. Mitra, T. Gokhale, C. Baral, Improving biomedical information retrieval with neural retrievers (2022).
[23] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, G. Neubig, Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing, arXiv preprint arXiv:2107.13586 (2021).
[24] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al., Multitask prompted training enables zero-shot task generalization, arXiv preprint arXiv:2110.08207 (2021).
[25] S. Mishra, D. Khashabi, C. Baral, Y. Choi, H. Hajishirzi, Reframing instructional prompts to GPTk's language, ACL Findings (2022).
[26] S. Min, M. Lewis, L. Zettlemoyer, H. Hajishirzi, MetaICL: Learning to learn in context, arXiv preprint arXiv:2110.15943 (2021).
[27] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., Training language models to follow instructions with human feedback, Preprint (2022).
[28] L. Wang, R. Li, Y. Yan, Y. Yan, S. Wang, W. Wu, W. Xu, InstructionNER: A multi-task instruction-based generative framework for few-shot NER, arXiv preprint arXiv:2203.03903 (2022).
[29] R. S. Puri, S. Mishra, M. Parmar, C. Baral, How many data samples is an additional instruction worth?, arXiv preprint arXiv:2203.09161 (2022).
[30] A. Prasad, P. Hase, X. Zhou, M. Bansal, GrIPS: Gradient-free, edit-based instruction search for prompting large language models, arXiv preprint arXiv:2203.07281 (2022).
[31] M. Parmar, S. Mishra, M. Geva, C. Baral, Don't blame the annotator: Bias already starts in the annotation instructions, arXiv preprint arXiv:2205.00415 (2022).
[32] T. Wu, M. Terry, C. J. Cai, AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts, arXiv preprint arXiv:2110.01691 (2021).
[33] T. Wu, E. Jiang, A. Donsbach, J. Gray, A. Molina, M. Terry, C. J. Cai, PromptChainer: Chaining large language model prompts through visual programming, arXiv preprint arXiv:2203.06566 (2022).
[34] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, et al., Few-shot learning with multilingual language models, arXiv preprint arXiv:2112.10668 (2021).
[35] K. Kuznia, S. Mishra, M. Parmar, C. Baral, Less is more: Summary of long instructions is better for program synthesis, arXiv preprint arXiv:2203.08597 (2022).
[36] Y. Wang, S. Mishra, P. Alipoormolabashi, Y. Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al., Benchmarking generalization via in-context instructions on 1,600+ language tasks, arXiv preprint arXiv:2204.07705 (2022).
[37] N. Rasul, M. Syed, Differential Diagnosis in Primary Care, Wiley, 2009. URL: https://books.google.com/books?id=r5cTAQAAMAAJ.
[38] H. Kitano, Biological robustness, Nature Reviews Genetics 5 (2004) 826–837.
[39] T. Gokhale, S. Mishra, M. Luo, B. Sachdeva, C. Baral, Generalized but not robust? Comparing the effects of data modification methods on out-of-domain generalization and adversarial robustness, in: Findings of the Association for Computational Linguistics: ACL 2022, 2022, pp. 2705–2718.
[40] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter, arXiv preprint arXiv:1910.01108 (2019).
[41] B. McCann, N. S. Keskar, C. Xiong, R. Socher, The natural language decathlon: Multitask learning as question answering, arXiv preprint arXiv:1806.08730 (2018).
[42] A. Fisch, A. Talmor, R. Jia, M. Seo, E. Choi, D. Chen, MRQA 2019 shared task: Evaluating generalization in reading comprehension, in: Proceedings of the 2nd Workshop on Machine Reading for Question Answering, 2019, pp. 1–13.
[43] M. Luo, K. Hashimoto, S. Yavuz, Z. Liu, C. Baral, Y. Zhou, Choose your QA model wisely: A systematic study of generative and extractive readers for question answering, Spa-NLP 2022 (2022) 7.
[44] A. Webson, E. Pavlick, Do prompt-based models really understand the meaning of their prompts?, arXiv preprint arXiv:2109.01247 (2021).
[45] T. Schick, H. Schütze, True few-shot learning with prompts: a real-world perspective, arXiv preprint arXiv:2111.13440 (2021).
[46] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, A. M. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.

A. Details of BioTabQA

Table 6 shows the 22 templates and the corresponding prompts for the BioTabQA dataset. Table 9 shows the division of the three Splits, including which tasks are used for training and which for cross-task evaluation. Tables 7 and 8 show the statistics of Splits 2 and 3, respectively.

B. Experimental Setup

We use DistilBERT [40] as the backbone model and load the pretrained model distilbert-base-uncased from the Huggingface library [46]. All models are optimized with AdamW using a learning rate of 5e-5 for 4 epochs and a batch size of 16. The maximum input length for every model is 512. All models are trained on a Tesla V100 machine with one GPU. A sketch of this setup is given below.
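The following is a minimal sketch of the fine-tuning setup described above, using the Huggingface Transformers library. The checkpoint name and the hyperparameters (AdamW, learning rate 5e-5, 4 epochs, batch size 16, maximum length 512) come from the paper; the extractive question-answering head and the dataset fields used in the collate function are assumptions, since the paper does not describe the output layer or the data loading code.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

# Hyperparameters stated in the paper.
MODEL_NAME = "distilbert-base-uncased"
LR, EPOCHS, BATCH_SIZE, MAX_LEN = 5e-5, 4, 16, 512

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# An extractive QA head is an assumption; the paper only states that
# DistilBERT is the backbone and that EM is the metric.
model = AutoModelForQuestionAnswering.from_pretrained(MODEL_NAME)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

def collate(batch):
    """batch: list of dicts with 'input_text' (question + linearized table +
    instruction) and gold answer span positions (assumed preprocessing)."""
    enc = tokenizer([b["input_text"] for b in batch], truncation=True,
                    max_length=MAX_LEN, padding=True, return_tensors="pt")
    enc["start_positions"] = torch.tensor([b["start"] for b in batch])
    enc["end_positions"] = torch.tensor([b["end"] for b in batch])
    return enc

def train(dataset):
    """Fine-tune for the number of epochs and batch size given in the paper."""
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True,
                        collate_fn=collate)
    model.train()
    for _ in range(EPOCHS):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```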
Table 6: The 22 question templates and the corresponding prompts for the BioTabQA dataset.
ID | Question Template | Prompt
1 | I have symptom A, what disease do I have? | If symptom A is in symptom list, report corresponding disease.
2 | I have symptom A and sign A, what is my diagnosis? | If symptom A is in symptom list, and sign A is in sign list, report corresponding disease.
3 | I have symptom A and symptom B, what is wrong with me? | If symptom A and symptom B are in symptom list, report corresponding disease.
4 | I have sign A and sign B, what disease do you think I have? | If sign A is in sign list, and sign B is in sign list, report corresponding disease.
5 | I have symptom A and symptom B but not symptom C, what is my potential diagnosis? | If symptom A and symptom B are in symptom list, but symptom C is not in symptom list, report corresponding disease.
6 | A patient is showing symptom A, symptom B and symptom C, what could be causing this? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
7 | A patient is exhibiting symptom A and sign A, diagnose her. | If symptom A is in symptom list, and sign A is in sign list, report corresponding disease.
8 | What disease can cause symptom A and symptom B? | If symptom A and symptom B are in symptom list, report corresponding disease.
9 | What disease causes symptom A, symptom B and sign A? | If symptom A and symptom B are in symptom list, and sign A is in sign list, report corresponding disease.
10 | If my friend has symptom A and symptom B, then what is his potential diagnosis? | If symptom A and symptom B are in symptom list, report corresponding disease.
11 | The patient has symptom A, symptom B and symptom C, what disease can cause these symptoms? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
12 | Which disease is associated with symptom A and symptom B? | If symptom A and symptom B are in symptom list, report corresponding disease.
13 | A patient is complaining about symptom A, symptom B and symptom C, diagnose him. | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
14 | What disease is responsible for symptom A, symptom B and symptom C? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
15 | I am experiencing symptom A, what is wrong with me? | If symptom A is in symptom list, report corresponding disease.
16 | Why am I experiencing symptom A and symptom B? | If symptom A and symptom B are in symptom list, report corresponding disease.
17 | I have symptom A, symptom B and symptom C, why is this happening? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
18 | A patient is showing symptom A and symptom B, what illness is associated with these symptoms? | If symptom A, symptom B and symptom C are in symptom list, report corresponding disease.
19 | I have symptom A, and symptom B, what disease may I have? | If symptom A and symptom B are in symptom list, report corresponding disease.
20 | I have symptom A and symptom B, what possible disease could I have? | If symptom A and symptom B are in symptom list, report corresponding disease.
21 | What is causing my symptom A? | If symptom A is in symptom list, report corresponding disease.
22 | I have symptom A, symptom B, symptom C but no symptom D, what is causing this? | If symptom A, symptom B and symptom C are in symptom list, but symptom D is not in symptom list, report corresponding disease.

Table 7: Statistics of BioTabQA Split 2 for the training (Train), in-domain testing (IID Test), and cross-task testing (Cross Task Test) sets.
Statistic | Train | IID Test | Cross Task Test
# of Samples | 7,349 | 15,566 | 16,145
Question Length | 19 | 21 | 17
Table Length | 239 | 255 | 259
Prompt Length | 17 | 17 | 17
# Tasks with 1 sym/sign | 3 | 3 | 0
# Tasks with 2 sym/sign | 9 | 9 | 2
# Tasks with 3 sym/sign | 4 | 4 | 3
# Tasks with 4 sym/sign | 1 | 1 | 0
# Tasks with negation | 2 | 2 | 0

Table 8: Statistics of BioTabQA Split 3 for the training (Train), in-domain testing (IID Test), and cross-task testing (Cross Task Test) sets.
Statistic | Train | IID Test | Cross Task Test
# of Samples | 8,524 | 18,278 | 6,924
Question Length | 240 | 21 | 254
Table Length | 19 | 256 | 18
Prompt Length | 18 | 18 | 14
# Tasks with 1 sym/sign | 1 | 1 | 2
# Tasks with 2 sym/sign | 9 | 9 | 2
# Tasks with 3 sym/sign | 6 | 6 | 1
# Tasks with 4 sym/sign | 1 | 1 | 0
# Tasks with negation | 2 | 2 | 0
Table 9: BioTabQA provides three Splits; each Split has 17 tasks for training and the remaining 5 for cross-task evaluation.
Split | Train/Test | Cross-Task Test
1 | 2, 3, 5, 6, 8, 9, 10, 11, 12, 13, 14, 16, 17, 18, 19, 20, 22 | 1, 4, 7, 15, 21
2 | 1, 2, 3, 4, 5, 6, 7, 10, 13, 15, 16, 17, 18, 19, 20, 21, 22 | 8, 9, 11, 12, 14
3 | 2, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 18, 19, 20, 21, 22 | 1, 3, 15, 16, 17