KU-DMIS at BioASQ 9: Data-centric and
model-centric approaches for biomedical question
answering
Wonjin Yoon1 , Jaehyo Yoo1 , Sumin Seo1 , Mujeen Sung1 , Minbyul Jeong1 ,
Gangwoo Kim1 and Jaewoo Kang*1
1 Korea University, Seoul, 02841, South Korea
* Corresponding Author


Abstract
In this paper, we present the approaches for our participation in the 9th BioASQ challenge (Task b - Phase B). Our systems are based on transformer models, with both model-centric and data-centric approaches. For factoid-type questions, we modify the dataset to increase label consistency, and for list-type questions, we apply a sequence tagging model, which is a more natural model design for the multi-label task. Our experimental results suggest two main points: better model design can be achieved by reflecting data characteristics such as the number of labels for a data point, and scarce resources such as BioQA datasets can greatly benefit from a data-centric approach with relatively little effort. Our submissions achieve competitive results with top or near-top performance in the challenge.

Keywords
BioNLP, Biomedical Natural Language Processing, BioASQ, Biomedical Question Answering




1. Introduction
Question answering (QA) is the task of finding information relevant to a given question in a given document. Biomedical question answering (BioQA) is a specific category of QA tasks in which the questions and/or the related documents, namely passages, are in the biomedical domain.
   The BioASQ challenge [1] is a large-scale annual competition on biomedical literature, which encompasses document classification, document retrieval, and QA tasks, and is one of the richest sources for BioQA research. The questions of the challenge fall into four categories: factoid-type, list-type, yes/no-type, and summary-type questions. Answers to the QA tasks are given in two formats: exact answer and ideal answer. For an exact answer, the output of the model is


CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania
Email: wjyoon@korea.ac.kr (W. Yoon); jaehyoyoo@korea.ac.kr (J. Yoo); suminseo@korea.ac.kr (S. Seo);
mujeensung@korea.ac.kr (M. Sung); minbyuljeong@korea.ac.kr (M. Jeong); kgw8965@gmail.com (G. Kim);
kangj@korea.ac.kr (J. Kang*)
Homepage: http://wonjin.info/ (W. Yoon); https://minstar.github.io/ (M. Jeong)
ORCID: 0000-0002-6435-548X (W. Yoon); 0000-0002-3600-6362 (J. Yoo); 0000-0001-8703-0322 (S. Seo);
0000-0002-7978-8114 (M. Sung); 0000-0002-1346-730X (M. Jeong); 0000-0003-4581-0384 (G. Kim);
0000-0001-6798-9106 (J. Kang*)
                                       © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
Table 1
Data cleaning operations for the answers.

Operation          | Question                                                                 | Original Answer                                                  | Normalized Answer
Sentence to phrase | What family do mDia proteins belong in?                                  | [‘mDia proteins are members of the formin family’]               | [‘formin family’]
Sentence to phrase | What does polyadenylate-binding protein 4 (PABP4) bind to?               | [‘PABP4 binds mRNA poly(A) tails.’]                              | [‘mRNA poly(A) tails’]
Sentence to phrase | Where are the orexigenic peptides synthesized?                           | [‘The orexigenic peptides are sythesized in the hypothalamus.’]  | [‘the hypothalamus’]
Punctuation marks  | What is the effect of CRD-BP on the stability of c-myc mRNA?             | [‘To protect c-myc CRD from endonucleolytic attack.’]            | [‘To protect c-myc CRD from endonucleolytic attack’]
Punctuation marks  | What is gingipain?                                                       | [‘A keystone periodontal pathogen. ’]                            | [‘A keystone periodontal pathogen’]
Punctuation marks  | What is the role of the UBC9 enzyme in the protein sumoylation pathway?  | [‘SUMO-conjugating enzyme’]                                      | [‘SUMO-conjugating enzyme’, ‘SUMO conjugating enzyme’]


Table 2
Data cleaning operation for the questions. Misspelled words are corrected.

Original Question                                                      | Cleaned Question
Which trancription factor activates the betalain pathway?             | Which transcription factor activates the betalain pathway?
What happens to H2AX upon DNA bouble strand breaks?                   | What happens to H2AX upon DNA double strand breaks?
What is a popular mesaure of gene expression in RNA-seq experiments?  | What is a popular measure of gene expression in RNA-seq experiments?



a word or a short phrase, whereas, for an ideal answer, the output should be one or multiple
sentences. Participants can choose to answer in both formats or in only one of them.
  In this paper, we describe the approaches of our participation in the QA tasks of the challenge (Task
9b - Phase B) [2]. Our submissions cover both the exact and ideal answer formats. Our approaches
vary depending on the type of questions and answers. For exact answers, we utilize both a model-centric
approach (list-type questions) and a data-centric approach (factoid-type questions) to improve our
previous systems [3] for BioASQ 8b. For ideal answers, we apply an abstractive summarization
method using the large-scale language model BART [4].


2. Methods
In this section, we describe the details of our approaches. In Sections 2.1, 2.2, and 2.3, we describe
our approaches for the exact answers. In Section 2.4, we describe our unified model for ideal
answers, which can answer all four types of questions: the three aforementioned types and summary-
type questions. For all types, we mainly fine-tune BioBERT [5], which has proven to be
effective on various NLP tasks in the biomedical and clinical domains [6, 7, 8, 9, 10, 11].

2.1. Data-centric approach; Factoid Questions
Recently, pre-trained models [12, 13] have achieved dramatic improvements on downstream
tasks in both the general and biomedical domains by harnessing large-scale models with transfer-
learning methods [10, 13, 14, 15]. BioQA models have also benefited from transfer learning
[3, 11, 16]. However, making the most of a scarce resource leaves models susceptible to even rare errors
in the training samples, as opposed to relatively rich datasets where a few erroneous samples
can be ignored by the model. Moreover, Jeong et al. [3] measured the proportion of questions
in the BioASQ 8b test dataset that become unanswerable when transformed to the extractive QA
setting. Hence, models trying to solve the task under the extractive QA setting suffer
from unexpected noise.
   In this section, we introduce our data-centric approach for factoid-type questions of the
BioASQ 9b challenge. The data-centric approach is a concept of improving the quality of the training
data so that it better fits the model. The term data-centric is contrasted with the model-centric
approach, which focuses on improving the model itself to achieve better performance. The concept
was introduced by Ng [17] in a seminar, where he argued for the benefits of data-centric approaches
and showed that, for some datasets, larger performance improvements can be made with a
data-centric approach than with a model-centric one.
   The main aim of our data-centric approach is to increase labeling consistency and to exclude or
clean noisy data points. Tables 1 and 2 show the data cleaning operations and examples
of them. The answers in the BioASQ 9b training samples are mostly in the form of noun
phrases. However, some data points have answers in sentence form. We manually rewrite
such sentence answers as noun phrases. Additional normalization processes are applied
to correct misspelled words and to remove punctuation marks. Word correction and minor
normalization processes are also applied to the questions. We do not modify the grammatical structure of
the questions, and we count both British and American spellings as correct.
   In total, 152 changes were made to the 9b training dataset, including 22 question corrections and 24
dropped data points. We do not apply the normalization steps to the evaluation data.
Our data cleaning resources are available at https://github.com/dmis-lab/bioasq9b-dmis.
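
To illustrate the automatic part of the normalization (trimming whitespace and trailing punctuation from exact answers), we give a minimal sketch below. It assumes the standard BioASQ JSON layout (a "questions" list with "exact_answer" fields) and is illustrative rather than our released code; the manual operations (sentence-to-phrase rewrites and question spell corrections) are not covered by it.

```python
# Minimal sketch of the automatic normalization (assumed BioASQ JSON layout);
# sentence-to-phrase rewrites and question spell-checking were done manually.
import json

def normalize_answer(answer: str) -> str:
    # e.g. 'A keystone periodontal pathogen. ' -> 'A keystone periodontal pathogen'
    return answer.strip().rstrip(".").strip()

def clean_exact_answers(path_in: str, path_out: str) -> None:
    with open(path_in) as f:
        data = json.load(f)
    for question in data["questions"]:
        if question.get("type") not in ("factoid", "list"):
            continue
        cleaned = []
        for entry in question.get("exact_answer", []):
            if isinstance(entry, list):          # list questions: list of synonym lists
                cleaned.append([normalize_answer(a) for a in entry])
            else:                                # factoid questions: list of synonyms
                cleaned.append(normalize_answer(entry))
        question["exact_answer"] = cleaned
    with open(path_out, "w") as f:
        json.dump(data, f, indent=2)
```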

2.2. Model-centric approach; List Questions
Extractive question answering is the task of finding the answer spans of a question in a given
passage. List-type questions are questions with multiple answers, whereas factoid questions are
questions that can be answered with a single phrase. For list-type questions, the number of answers
to a given question is not provided (i.e., it is not available as metadata). Hence, deciding how many
answers to return remains a challenging and key operation for participating systems.
   Previous works utilize factoid models with additional steps to decide the number of
answers for a question. Factoid models are designed to predict a single answer span, and
thus, they cannot be trained directly in a multi-label setup. In other words, for a training data
point of a list-type question, only one answer span is trained per training step, and the other answers
act as noise since they are considered non-answer tokens. We call this setting
"start-end span prediction"; it is commonly used in biomedical extractive QA [3, 11, 16].
   Following the approach of Yoon et al. [18], our systems for list-type questions are based
on the sequence tagging approach. Specifically, a question and its corresponding passage are
concatenated to construct a sequence, which serves as a training data point. Our systems adopt either
the BIO or the IO scheme to annotate answer spans. Each token in the passage is tagged with a B, I, or
O tag, which stands for Beginning, Inside, and Outside, respectively.
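
As an illustration of this scheme, the sketch below converts a question-passage pair with multiple gold answers into token-level B/I/O labels. It is a simplified sketch rather than our training code; the tokenizer checkpoint, the label ids, and the exact-match span search are assumptions made for illustration.

```python
# Illustrative sketch of BIO label construction for the sequence tagging model.
# The checkpoint name, label ids, and exact-match span search are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
LABEL2ID = {"O": 0, "B": 1, "I": 2}

def make_bio_labels(question: str, passage: str, answers: list):
    # Question and passage are concatenated into one input sequence.
    enc = tokenizer(question, passage, return_offsets_mapping=True, truncation=True)
    labels = [LABEL2ID["O"]] * len(enc["input_ids"])
    seq_ids = enc.sequence_ids()          # 1 marks tokens that belong to the passage
    for answer in answers:                # all answers are labeled simultaneously
        start_char = passage.find(answer)
        if start_char == -1:
            continue                      # answer not found in this passage
        end_char = start_char + len(answer)
        first = True
        for idx, (start, end) in enumerate(enc["offset_mapping"]):
            if seq_ids[idx] != 1 or start == end:
                continue                  # skip question and special tokens
            if start >= start_char and end <= end_char:
                labels[idx] = LABEL2ID["B"] if first else LABEL2ID["I"]
                first = False
    return enc, labels
```

The IO variant simply collapses the B and I tags into a single inside tag.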
Figure 1: Overview of our systems for ideal answers. The question, the passage, and the exact answer
form an input sequence. The candidate selection module scores the candidate sentences and selects the
best candidate as the ideal answer for a given question.

   The sequence tagging approach has two significant benefits over the previous models. First, the
model can be trained on multiple answers simultaneously, which is a more natural model design
for the multi-answer task. As all available training labels (i.e., answers) are used for training
rather than acting as training noise, the model learns the maximum from the dataset. Second, the
approach is an end-to-end model that finds all the answers in the passage, whereas the previous
models require complex post-processing steps to decide the number of answers. Models for
the previous BioASQ challenges [3, 11] decided the number of answers by a threshold-based
answer decision process, where the threshold value is a hyperparameter that needs to be tuned.
Additionally, rule-based answer number detection is adopted under the assumption that if
a numeric value exists in the question, the value is highly likely to be the required number
of answers (e.g., Question: List 3 apoE mimetics). In contrast, our approach does not require
additional hyperparameter searching, nor does it have to rely on this weak assumption.

2.3. Yes/No Questions
Following the systems from our participation in BioASQ 8b [3], we use BioBERT [5] with an
additional pre-training step on the MNLI dataset [19]. For the BioASQ 9b task, our systems are based on
BioBERTLARGE , which has more model parameters than the BioBERTBASE model used in our previous
participation. For yes/no-type questions, we adjust the ratio of yes questions to no questions
in the training set to 1:1. The original training dataset is heavily skewed towards questions with
yes answers. As shown in Table 3, the original 8b training set consists of 10,284 yes and 1,691 no
question-passage pair samples. After our pre-processing steps, approximately 85.8% of the
answers to the yes/no-type questions are yes. Our systems are trained on the down-sampled
training dataset to alleviate this class imbalance problem.
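
A minimal sketch of this down-sampling step is shown below; the 'exact_answer' field name and the fixed random seed are illustrative assumptions, not our exact implementation.

```python
# Minimal sketch of balancing yes/no question-passage pairs to a 1:1 ratio.
# The 'exact_answer' field name and fixed seed are illustrative assumptions.
import random

def downsample_yes(samples: list, seed: int = 42) -> list:
    yes = [s for s in samples if s["exact_answer"].lower() == "yes"]
    no = [s for s in samples if s["exact_answer"].lower() == "no"]
    rng = random.Random(seed)
    rng.shuffle(yes)
    balanced = yes[: len(no)] + no   # keep as many yes pairs as there are no pairs
    rng.shuffle(balanced)
    return balanced
```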

2.4. Ideal answers
Figure 1 shows an overview of our approach for the ideal answers. Our model utilizes the
predicted exact answer as an input for generating an ideal answer. In detail, we generate all the
combinations of triples of an exact answer (𝐴) and all the passages (𝑃1 , . . . , 𝑃𝑛 ) available for
a question (𝑄). Then, using the triples, we generate candidate sentences (𝐶1 , . . . , 𝐶𝑛 ) for the
question. We then select one ideal answer from the candidates using a candidate selection process.
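
The sketch below illustrates how one candidate sentence can be generated per (question, passage, exact answer) triple with a BART model from the transformers library. The checkpoint name and the separator-based input formatting are assumptions; the off-the-shelf checkpoint is used here only for illustration.

```python
# Illustrative sketch: one ideal-answer candidate per (Q, P_i, A) triple with BART.
# The checkpoint and the separator-based input format are assumptions.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def generate_candidates(question: str, passages: list, exact_answer: str) -> list:
    candidates = []
    for passage in passages:
        source = f"{question} </s> {exact_answer} </s> {passage}"
        inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=512)
        output_ids = model.generate(**inputs, num_beams=4, max_length=64)
        candidates.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
    return candidates
```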
Table 3
Proportion of question samples having yes or no as its answers. Samples are pre-processed from original
questions and consist of question and passage pairs.
                                                              Yes              No
                           Original Training 8b        10,284 (85.8%)      1,691 (14.1%)
                         Down-sampled Training 8b       1,691 (50.0%)      1,691 (50.0%)


Table 4
Performance of our factoid model trained on the original and the cleaned BioASQ 8B training datasets.
The scores are macro averages over the 5 test batches. We only report results for the full-abstract
setting.
                                                       SAcc         LAcc   MRR
                                Original Training 8b   37.75    57.62      45.66
                                Cleaned Training 8b    43.71    58.94      49.14


Candidate Selection Our model is designed to generate one answer candidate per input
sequence. In the original dataset, multiple articles/snippets are provided as the corresponding
passages for a question. Hence, multiple answer candidates are generated for a question. We
need a candidate selection process to submit one answer, and the quality of this process can
largely impact the performance of the model. Our candidate selection process is composed of a
scoring function and a syntax checker.

Score Function We score the generated ideal answer candidates (𝐶1 , . . . , 𝐶𝑛 )
by checking for the presence of the (candidate) exact answer(s). To check the presence, we
employ BERN [9], a BioBERT-based named entity recognition (NER) and linking system, to
detect all the entities in the ideal answer candidates. For each candidate sentence for factoid
and list questions, we calculate the F1-score between the tentative exact answer(s) and the entities
recognized in the candidate sentence. If one of the tentative answer(s) exists in the generated
sentence, we add it to the set of recognized entities even if it is not detected by BERN.
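
Read this way, the scoring step can be sketched as an F1 overlap between the tentative exact answers and the entities recognized in each candidate. Entity recognition (e.g., by BERN) is assumed to be done externally, and the function below is our illustrative reading of the procedure rather than the exact implementation.

```python
# Illustrative sketch of the candidate scoring: F1 between tentative exact answers
# and entities recognized in the candidate (entities come from an external NER step).
def score_candidate(candidate: str, tentative_answers: list, recognized_entities: list) -> float:
    entities = {e.lower() for e in recognized_entities}
    for ans in tentative_answers:
        # An answer that literally appears in the candidate counts as recognized,
        # even if the NER system missed it.
        if ans.lower() in candidate.lower():
            entities.add(ans.lower())
    answers = {a.lower() for a in tentative_answers}
    overlap = len(answers & entities)
    if overlap == 0:
        return 0.0
    precision = overlap / len(entities)
    recall = overlap / len(answers)
    return 2 * precision * recall / (precision + recall)
```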

Syntax Checker We use the language-check Python package 1 , an automated syntax checker,
to correct or filter out grammatically incorrect candidate sentences. From the list of ideal answer
candidates, we select the one with the highest score and check it with the syntax checker. If
grammar errors are detected, we try to correct them with the checker. However, if the checker
finds it impossible to fix, we skip the candidate and move on to the candidate with the next highest score.
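
A minimal sketch of this step with the language-check package is given below; the candidate-iteration logic is an illustrative assumption of how the checker could be wired into the selection process.

```python
# Illustrative sketch of the syntax-check step using the language-check package.
import language_check

tool = language_check.LanguageTool("en-US")

def select_with_syntax_check(candidates_by_score: list) -> str:
    """candidates_by_score: ideal-answer candidates sorted from highest to lowest score."""
    for candidate in candidates_by_score:
        matches = tool.check(candidate)
        if not matches:
            return candidate                   # grammatically clean candidate
        corrected = language_check.correct(candidate, matches)
        if not tool.check(corrected):
            return corrected                   # correction fixed all errors
        # otherwise skip to the next highest-scoring candidate
    return candidates_by_score[0]              # fall back to the top-scoring candidate
```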


3. Results and Discussion
Table 4 presents our experimental results on the factoid dataset. Owing to our data-centric
approach (excluding noisy data points and increasing label consistency), our model
    1
        https://github.com/myint/language-check
Table 5
Experiments for list-type questions under different settings. The scores are macro averages over the
5 batches of the BioASQ 8B test sets.

                             Model          | Setting       | Prec  | Recall | F1
                             Start-End      | Snippet       | 40.72 | 60.14  | 43.20
                             Start-End      | Full Abstract | 38.63 | 54.03  | 40.54
                             BIO Tagging    | Full Abstract | 40.10 | 55.25  | 42.62
                             IO Tagging     | Full Abstract | 40.79 | 55.46  | 43.82


Table 6
Experiments for yes/no-type questions with different sampling strategies. The scores are macro averages
over the 5 batches of the BioASQ 8B test sets. We only report results for the snippet setting.
                                                          Acc     Macro F1      Yes F1    No F1
                         Original Training 8b            0.82       0.80         0.87     0.73
                       Down-sampled Training 8b          0.94       0.93         0.95     0.92


performance improves by 3.48 points in mean reciprocal rank (MRR) and by 5.96 points in
strict accuracy (SAcc).
   Table 5 shows the experimental results on the list dataset for the sequence tagging models and
the baseline model, namely the Start-End model [3] (for both the full-abstract and snippet datasets).
Our model outperforms the baseline model, which we used last year, by a large margin.
Our model with IO tagging even outperforms the Start-End model that uses the snippet dataset as
input, without using complex post-processing and rule-based processing steps. Please note
that snippets are more concentrated sources of information than full-abstract documents: the
answer should exist in either, but a snippet is a sentence-length document whereas a full
abstract is a paragraph-length document. In the challenge, we also submitted an ensemble system
consisting of the sequence tagging model and the Start-End model. However, the ensemble system
did not show significant improvement over the single models.
   For yes/no-type questions, we introduce a down-sampling method in which we balance the
number of yes and no questions in the training set by sampling 1,691 yes questions out of the
10,284 yes questions (i.e., approximately 16% of the yes questions). We have empirically shown that the
down-sampled dataset alleviates the underlying class imbalance issue of yes/no-type questions.
The results of the yes/no model trained on the down-sampled and the original training data are described
in Table 6. For the model trained on the down-sampled data, the macro F1 score improves from 0.80 to
0.93 (a 0.13 improvement). We find that the down-sampling method achieves better model
performance for the imbalanced yes/no-type questions. We conduct the experiments with the
same hyperparameter settings, including the learning rate and the number of learning steps.
   The official evaluation metric for the ideal-answer models is a human evaluation score, which is
assigned by human experts based on the following four criteria: recall, precision, repetition, and readability.
Since the human evaluation scores had not been released at the time of writing2 , we provide a qualitative
    2
        The official results (human evaluation) will be available at: http://participants-area.bioasq.org/results/9b/phaseB/
Table 7
Qualitative analysis of ideal answers on BioASQ 9B test sets. Tentative exact answers, which are the
outputs of our exact models, are denoted as Exact Answer in the table.
  Type                                   Question - Exact Answer - Ideal Answer
           Question: Which disease is monitored in the BIOCURA cohort?
           Exact Answer: rheumatoid arthritis
 Factoid
           Ideal Answer: Baseline disease activity in patients with rheumatoid arthritis is monitored in the
           BIOCURA cohort.
           Question: What are the years of the initiation and completion of the Human Genome project?
           Exact Answer: 1990, 2003
   List
           Ideal Answer: The Human Genome Project (HGP) was initiated in 1990, and the completion of the
           genome project was in 2003.
           Question: Is Tocilizumab (Actemra) used to block/antagonize the IL-6 receptor?
           Exact Answer: yes
 Yes/No
           Ideal Answer: Yes, Tocilizumab (Actemra) is a humanized monoclonal antibody that binds to and
           inhibits the IL-6 receptor. It is used for treatment of type 1 diabetes.


Table 8
BioASQ 9B results for our systems. The scores are obtained from the leaderboard. Scores in bold font
denote that our system achieved the highest score on the leaderboard.
                                 Factoid (MRR)         List (F1)      Yes/No (Macro F1)
                  # of Batch
                                 Ours      Top-1     Ours    Top-1    Ours        Top-1

                  9B Batch 1     38.79     46.32         53.39                92.58
                  9B Batch 2     52.94     55.39     46.44   48.92    88.54       94.54
                  9B Batch 3     42.34     61.49     54.21   58.87    90.23       95.32
                  9B Batch 4     57.26     69.29         70.61                94.80
                  9B Batch 5     47.45     58.80     45.60   51.75            84.21


analysis of the model in Table 7. As described in Section 2.4, tentative exact answers are
used as one of the inputs to generate the ideal answer candidates and to select the best candidate for
each question. By utilizing the exact answer as one of the inputs to the ideal answer generation
model, the model is encouraged to consider the given exact answer in the sentence generation step
and to include it in the generated sentence. The examples in Table 7
show that the generated ideal answer sentences successfully include the given exact answers
(potential exact answers). We did not exclude the snippets that do not contain the exact answer,
since our model is expected to include the given exact answer in the generated sentence even
if it does not exist in the passage.
    Finally, Table 8 shows the results of our participation and of the best-performing system in the
challenge. If our system scored the highest performance for a given batch, we mark it in
bold font.
4. Conclusion
In this paper, we report on our participation with data-centric and model-centric approaches, and
on the results of our systems for the BioASQ 9b task. We introduce a data-centric approach for
factoid-type questions, in which we train the model on a dataset with increased label
consistency, which improves the performance of our model. For list-type questions, we apply
a sequence tagging model and achieve better performance while at the same time lowering the
cost of pre- and post-processing. Furthermore, for yes/no-type questions in BioASQ 9b,
our models show the best performance by utilizing down-sampling. We conclude
that it is beneficial for BioQA models to use data-centric and/or model-centric approaches that
take the characteristics of the questions and answers into consideration.


Acknowledgments
We express our gratitude to Dr. Jihye Kim and Dr. Sungjoon Park from Korea University
for their invaluable insight into our systems’ output. This research is supported by the National
Research Foundation of Korea (NRF-2020R1A2C3010638) and a grant of the Korea Health
Technology R&D Project through the Korea Health Industry Development Institute (KHIDI),
funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HR20C0021).


References
 [1] G. Tsatsaronis, G. Balikas, P. Malakasiotis, I. Partalas, M. Zschunke, M. R. Alvers, D. Weis-
     senborn, A. Krithara, S. Petridis, D. Polychronopoulos, et al., An overview of the bioasq
     large-scale biomedical semantic indexing and question answering competition, BMC
     bioinformatics 16 (2015) 1–28.
 [2] A. Krithara, A. Nentidis, G. Paliouras, M. Krallinger, A. Miranda, Bioasq at clef2021:
     Large-scale biomedical semantic indexing and question answering., in: ECIR (2), 2021, pp.
     624–630.
 [3] M. Jeong, M. Sung, G. Kim, D. Kim, W. Yoon, J. Yoo, J. Kang, Transferability of natural
     language inference to biomedical question answering, arXiv preprint arXiv:2007.00217
     (2020).
 [4] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettle-
     moyer, Bart: Denoising sequence-to-sequence pre-training for natural language generation,
     translation, and comprehension, 2019. arXiv:1910.13461.
 [5] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, Biobert: a pre-trained biomedical
     language representation model for biomedical text mining, Bioinformatics 36 (2020)
     1234–1240.
 [6] E. Alsentzer, J. Murphy, W. Boag, W.-H. Weng, D. Jindi, T. Naumann, M. McDermott,
     Publicly available clinical BERT embeddings, in: Proceedings of the 2nd Clinical Natural
     Language Processing Workshop, Association for Computational Linguistics, Minneapolis,
     Minnesota, USA, 2019, pp. 72–78. URL: https://www.aclweb.org/anthology/W19-1909.
     doi:10.18653/v1/W19-1909.
 [7] I. Beltagy, K. Lo, A. Cohan, Scibert: A pretrained language model for scientific text, 2019.
     arXiv:1903.10676.
 [8] Q. Jin, B. Dhingra, W. W. Cohen, X. Lu, Probing biomedical embeddings from language
     models, 2019. arXiv:1904.02181.
 [9] D. Kim, J. Lee, C. So, H. Jeon, M. Jeong, Y. Choi, W. Yoon, M. Sung, J. Kang, A neural
      named entity recognition and multi-type normalization tool for biomedical text mining,
      IEEE Access 7 (2019) 73729–73740. doi:10.1109/ACCESS.2019.2920708.
[10] Y. Peng, S. Yan, Z. Lu, Transfer learning in biomedical natural language processing: An
     evaluation of bert and elmo on ten benchmarking datasets, 2019. arXiv:1906.05474.
[11] W. Yoon, J. Lee, D. Kim, M. Jeong, J. Kang, Pre-trained language model for biomedical
     question answering, 2019. arXiv:1909.08229.
[12] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep
      contextualized word representations, 2018. arXiv:1802.05365.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional
     transformers for language understanding, 2019. arXiv:1810.04805.
[14] N. Kim, R. Patel, A. Poliak, P. Xia, A. Wang, T. McCoy, I. Tenney, A. Ross, T. Linzen,
     B. Van Durme, S. R. Bowman, E. Pavlick, Probing what different NLP tasks teach machines
     about function word comprehension, in: Proceedings of the Eighth Joint Conference
     on Lexical and Computational Semantics (*SEM 2019), Association for Computational
     Linguistics, Minneapolis, Minnesota, 2019, pp. 235–249. URL: https://www.aclweb.org/
     anthology/S19-1026. doi:10.18653/v1/S19-1026.
[15] J. Phang, T. Févry, S. R. Bowman, Sentence encoders on stilts: Supplementary training on
     intermediate labeled-data tasks, 2019. arXiv:1811.01088.
[16] G. Wiese, D. Weissenborn, M. Neves, Neural domain adaptation for biomedical question
     answering, in: Proceedings of the 21st Conference on Computational Natural Language
     Learning (CoNLL 2017), Association for Computational Linguistics, Vancouver, Canada,
     2017, pp. 281–289. URL: https://www.aclweb.org/anthology/K17-1029. doi:10.18653/v1/
     K17-1029.
[17] A. Y. Ng, A chat with andrew on mlops: From model-centric to data-centric ai, 2021. URL:
     https://www.youtube.com/06-AZXmwHjo.
[18] W. Yoon, R. Jackson, J. Kang, A. Lagerberg, Sequence tagging for biomedical extractive
     question answering, arXiv preprint arXiv:2104.07535 (2021).
[19] A. Williams, N. Nangia, S. Bowman, A broad-coverage challenge corpus for sentence
     understanding through inference, in: Proceedings of the 2018 Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Volume 1 (Long Papers), Association for Computational Linguistics, New Or-
     leans, Louisiana, 2018, pp. 1112–1122. URL: https://www.aclweb.org/anthology/N18-1101.
     doi:10.18653/v1/N18-1101.