<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Answer Type Prediction using BERT</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vinay Setty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krisztian Balog</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Stavanger</institution>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper summarizes our participation in the SMART Task of the ISWC 2020 Challenge. A particular question we are interested in answering is how well neural methods, and specifically transformer models, such as BERT, perform on the answer type prediction task compared to traditional approaches. Our main finding is that coarse-grained answer types can be identified effectively with standard text classification methods, with over 95% accuracy, and BERT can bring only marginal improvements. For fine-grained type detection, on the other hand, BERT clearly outperforms previous retrieval-based approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Answer type prediction</kwd>
        <kwd>answer category classification</kwd>
        <kwd>natural language understanding</kwd>
        <kwd>question answering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        The importance of being able to identify the types or semantic categories of
the requested answers has long been recognized in question answering (QA) research as
a key step towards interpreting the meaning of natural language questions [
        <xref ref-type="bibr" rid="ref4 ref8">4, 8</xref>
        ].
This task may be performed either against a set of coarse-grained types (e.g.,
at the TREC QA track [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]) or against fine-grained type systems of knowledge
bases, such as DBpedia [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ]. The Semantic Answer Type prediction (SMART)
task [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], organized as a challenge at the 2020 International Semantic Web
Conference (ISWC '20), provides a large-scale evaluation platform for assessing
answer type prediction at both coarse-grained and fine-grained taxonomical levels.
      </p>
      <p>Specifically, given a natural language question as input, first a high-level
answer type category is to be predicted, which can be one of resource, literal,
or boolean. If the predicted category is resource, a more specific ontological
class is to be provided, using the type system of DBpedia or Wikidata. If the
predicted category is literal, it also has to be further classified as number,
date, or string. In this paper, we refer to the task of coarse-grained answer
detection as category classification and to the problem of fine-grained prediction
of (resource) types as type prediction. Table 1 shows some examples. As seen
from the examples, answers for the resource category are provided as a ranked
list of types.</p>
      <p>The main research objectives in this work are to assess: (1) How do neural
approaches perform compared to traditional feature-based classification approaches
on the category classification task? (2) How do neural classification approaches
fare against well-established (fusion-based) IR approaches on the type
prediction problem? We find that (1) is essentially a "solved" problem. Our baseline
SVM classifier with word unigrams as features achieved 95% accuracy. Neural
approaches yield only minor improvements. As for (2), type prediction has
previously been approached as a ranking problem, due to the large number of
possible types (∼760 types in DBpedia and ∼50k types in Wikidata) that rendered
classification-based approaches infeasible. We draw on recent work on extreme
multi-label classification and demonstrate substantial gains over the IR baselines.
It appears that fine-grained type detection on Wikidata is more challenging than
on DBpedia. However, the two are not directly comparable due to the different
evaluation measures that are employed, which calls for further analysis.</p>
      <p>Code and resources developed in this work are made publicly available at
https://github.com/iai-group/smart-task.</p>
    </sec>
    <sec id="sec-2">
      <title>Approach</title>
      <p>
        We follow a two-phase approach. In the first phase, we perform category
classification, that is, a supervised classifier predicts the high-level category of the
answer type. Then, in the second phase, we perform type prediction to identify
the top-k types for the questions for which the answer type was predicted to be
a resource. For category classification we use two classifiers: SVM with word
unigrams as features and fine-tuned BERT (Section 2.1). Type prediction has
previously been approached as a ranking task [
        <xref ref-type="bibr" rid="ref1 ref5">1, 5</xref>
        ], due to the large number
of possible types. As an alternative, we cast it as an extreme classification
problem (Section 2.2).
      </p>
      <sec id="sec-2-1">
        <title>Category Classification</title>
        <p>We flatten the high-level categories into the following five categories: boolean,
literal-number, literal-string, literal-date, and resource. Since the
category classification task is the same for both DBpedia and Wikidata, we combine the
training datasets for the two and predict the categories for their respective test
datasets using the combined model.</p>
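        <p>As an illustration, this flattening step can be sketched as follows (function and argument names are hypothetical, not taken from our implementation):</p>
        <preformat>
```python
def flatten_category(category, literal_type=None):
    """Map the two-level (category, literal type) annotation onto a single
    five-way label: boolean, literal-number, literal-string, literal-date,
    or resource."""
    if category == "literal":
        return "literal-" + literal_type  # number, string, or date
    return category  # boolean or resource

print(flatten_category("literal", "date"))  # literal-date
print(flatten_category("boolean"))          # boolean
```
        </preformat>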
        <p>Feature-based classification As a first approach to category classification, we
use TF-IDF-weighted word unigrams as features. The vocabulary construction
and IDF computations are based only on the training portion of the dataset,
to avoid any assumptions on the test data. Our implementation is based on the
CountVectorizer and TfidfVectorizer classes from the scikit-learn library (https://scikit-learn.org/) with
default parameters. We then train an SVM classifier with a linear kernel. We
also experimented with a Naive Bayes classifier, but decided to exclude
it after observing inferior performance.</p>
        <p>
          Neural approach As a second approach, we fine-tune a pre-trained BERT
model (RoBERTa) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] for a sequence classification task to classify the category.
Our implementation uses the HuggingFace API (https://huggingface.co/) for fine-tuning and category
classification.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Type Prediction</title>
        <p>
          IR-based methods We employ two ranking-based approaches from [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ], which
were introduced for the task of identifying target types of (entity-bearing) search
queries. These approaches are representatives of two main families of object
ranking strategies, which have been termed early and late fusion design patterns
in [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. According to the type-centric (TC, a.k.a. early fusion) approach, first a
textual representation is built for each type by concatenating the descriptions of
entities that are assigned that type. Then, these type description (pseudo)
documents can be ranked using standard IR models. Specifically, we use the DBpedia
short abstracts of entities and rank the type documents using BM25. The
second strategy is termed entity-centric (EC, a.k.a. late fusion). There, the top-k
most relevant entities from the underlying knowledge base are retrieved using
the question as a query. Then, the relevance score of a given type is computed
by aggregating the relevance scores of entities with that type. We use BM25 as
the underlying retrieval model and a "catch-all" entity representation, following
the settings in [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. The cut-off parameter k is chosen empirically based on the
training set (k = 20).
        </p>
        <p>
          Neural method Due to the large number of possible labels, using standard
Transformer models is not feasible. Instead, we cast the type prediction task
as an extreme multi-label text classification (XMC) problem: given a question
as input text, return the top-k most relevant types from a large collection of
possible types. Vanilla Transformer models such as BERT [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], RoBERTa [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], and
XLNet [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] are ineffective in this scenario due to the memory and computation
requirements imposed by the large number of possible labels. This was also
confirmed by our experiments: fine-tuning the above-mentioned Transformer
models using the HuggingFace framework exhausted all the memory on a 32GB
Nvidia Tesla V100 GPU. While this may work on a GPU with larger memory, we
could not verify this, since we do not have access to such a GPU, and it may still be
computationally very expensive to train these models. In addition to the computational
limitations, as we show in Section 4, the types are very sparse, with most of
them having only a few training instances. To address these challenges,
a model designed for XMC is essential. For this purpose, we use X-Transformers [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], a recent solution that extends
Transformer models to XMC, which we refer to as XBERT in the rest of the paper.
        </p>
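        <p>To make the entity-centric (late fusion) scoring concrete, here is a minimal sketch of the aggregation step, with hypothetical retrieval scores standing in for BM25 output:</p>
        <preformat>
```python
from collections import defaultdict

def entity_centric_type_scores(retrieved, entity_types, k=20):
    """Late fusion: aggregate the retrieval scores of the top-k entities
    over the types assigned to them, and rank types by aggregated score."""
    type_scores = defaultdict(float)
    for entity, score in retrieved[:k]:
        for t in entity_types.get(entity, []):
            type_scores[t] += score
    return sorted(type_scores.items(), key=lambda kv: -kv[1])

# Hypothetical BM25 scores for entities retrieved for a question.
retrieved = [("Paris", 9.1), ("Lyon", 7.4), ("Seine", 3.2)]
entity_types = {
    "Paris": ["dbo:City", "dbo:Place"],
    "Lyon": ["dbo:City", "dbo:Place"],
    "Seine": ["dbo:River", "dbo:Place"],
}
print(entity_centric_type_scores(retrieved, entity_types))
```
        </preformat>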
        <p>XBERT consists of three components:
1. Semantic Label Indexing (SLI), which performs hierarchical clustering on
the labels to reduce the label space.
2. Deep Neural Matching (DNM), which fine-tunes the Transformer models for each
of the label clusters identified by SLI.
3. Ensemble Ranking (ER), which ranks the instances within the label clusters
by training a linear ranker conditioned on the label clusters and the DNM
Transformer's output.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experimental Evaluation</title>
      <p>In this section, we discuss our experimental setup, introduce the evaluation
measures, and present our results.</p>
      <sec id="sec-3-1">
        <title>Data</title>
      </sec>
      <sec id="sec-3-2">
        <title>Methods</title>
        <p>The following methods are compared:
- SVM: Support Vector Machine for category classification
- BERT: RoBERTa for category classification
- XBERT: X-Transformers for type prediction
- IR/TC: Type-centric IR approach for type prediction
- IR/EC: Entity-centric IR approach for type prediction
We train all neural models on a single Nvidia Tesla V100 GPU with 32GB
memory.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Evaluation Metrics</title>
        <p>
          Category classification is evaluated in terms of classification accuracy. Type
prediction is cast as a ranking task and is evaluated using rank-based metrics. It,
however, considers only those questions that fall into the literal or resource
answer categories. Furthermore, evaluation is performed differently for DBpedia
and for Wikidata, given the nature of their respective type taxonomies. Types in
the DBpedia Ontology are organized hierarchically, up to 7 levels deep. There, a
graded evaluation metric, Normalized Discounted Cumulative Gain (NDCG@k),
is used. Specifically:
- For literal answer types, only a single predicted type is considered, which
can be either correct (NDCG=1) or incorrect (NDCG=0).
- For resource answer types, a ranked list of top-k ontology classes is
considered and evaluated in terms of lenient NDCG@k with linear decay [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. The
gain for a given predicted type is 0 if it is not on the same path as any of
the gold types, and otherwise it is 1 − d(t, t_q)/h, where d(t, t_q) is the distance
between the predicted type and the closest matching gold type in the type
hierarchy, and h is the maximum depth of the type hierarchy.
In the case of Wikidata, the type hierarchy is rather flat. Therefore, type prediction
is evaluated using a binary notion of relevance, with Mean Reciprocal Rank
(MRR) as the metric.
        </p>
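        <p>The linear-decay gain can be sketched as follows; the distance function over the type hierarchy is a hypothetical toy stand-in (returning None when the predicted type shares no path with a gold type):</p>
        <preformat>
```python
def linear_decay_gain(pred_type, gold_types, distance, h):
    """Gain is 1 - d(t, t_q)/h for the closest gold type on the same path,
    and 0 if the predicted type shares no path with any gold type."""
    dists = [distance(pred_type, g) for g in gold_types]
    dists = [d for d in dists if d is not None]  # None: not on the same path
    if not dists:
        return 0.0
    return 1.0 - min(dists) / h

# Toy hierarchy distances (hypothetical); an exact match has distance 0.
dist = lambda t, g: {("dbo:City", "dbo:City"): 0,
                     ("dbo:City", "dbo:Place"): 2}.get((t, g))
print(linear_decay_gain("dbo:City", ["dbo:Place"], dist, h=7))
```
        </preformat>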
        <p>We report results on the training dataset, using 5-fold cross-validation. For
our official submissions, we also report the performance on the test set.</p>
      </sec>
      <sec id="sec-3-4">
        <title>Results</title>
        <p>Category Classification It can be seen from the results in Table 3 that both
feature-based and neural approaches perform quite well for category
classification. BERT has a slight advantage over SVM. We hypothesize that, due to the
clear patterns which the models can learn, high-level category classification
is a fairly easy task, hence the high accuracy scores. However, most mistakes
occur for the resource class, which is the majority class in both datasets.</p>
        <p>[Table 3: Category classification accuracy (Train/Test) of SVM and BERT on the DBpedia and Wikidata datasets.]</p>
        <p>Type Prediction Since different metrics are used for DBpedia and Wikidata,
we report results on the two datasets separately, in Tables 4 and 5, respectively.
Recall that (stage-two) type prediction is applied on top of (stage-one) category
classification (SVM or BERT) and is only carried out when the predicted
category is resource. We thus prefix the method names in the result tables with
SVM- or BERT- to indicate how category classification was performed.</p>
        <p>
          On DBpedia (Table 4), XBERT clearly outperforms the IR approaches. We
attribute this to the fact that XBERT is tailored for the XMC problem, which allows it to
deal with the large number of types and with the sparsity of tail resource types. The slight
difference between SVM-XBERT and BERT-XBERT is due to the mistakes
made by SVM in category classification. Given the large advantage of XBERT
over the IR approaches, our official submissions on Wikidata (Table 5) only
considered the former. It should nevertheless be noted that the IR approaches are
unsupervised methods that do not need any training data. Supervised
alternatives have been shown to perform significantly better [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We leave that comparison
to future work.
        </p>
        <p>
In this section, we analyze the errors made by our best performing
approach, BERT-XBERT. First, we look at resource types where most errors
occur, that is, types which are present in the gold labels but are missing from the
predicted labels. Table 6 shows the top-10 errors in type prediction for
DBpedia and Wikidata, together with their total instance counts. Ideally, we would
expect the number of mistakes to be directly proportional to the total
frequency of the resource type. In DBpedia, some types, such as dbo:State,
dbo:Activity, dbo:Band, and dbo:Profession, break this pattern. Similarly, in
Wikidata, natural person, political territorial entity, and big city
are some of the types with which the BERT-XBERT model struggles.
        </p>
        <p>In Table 7, we show anecdotal examples of the mistakes made by the
BERT-XBERT approach. Most of these errors are due to irrelevant types returned in
the result list. In several cases, the predicted labels do contain the gold label
but place it at lower ranks, which affects the NDCG and MRR scores. In some
cases the predicted labels are appropriate, even though they do not exactly match
the gold labels. For example, for the last question in Table 7, publication is one
of the gold labels, which is not predicted, but written work and periodical
are still relevant among the predicted labels. We also spotted several instances
with double questions, such as "What conflict occurred in Philoctetes and who
was involved?", and questions with grammatical errors and typos.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusions</title>
      <p>In this paper, we presented our solution for the SMART Task challenge of ISWC
2020, which was the best performing approach on both datasets and tasks,
across all evaluation metrics. Our findings suggest that for coarse-grained
category prediction, simple feature-based approaches are quite effective, with over
95% accuracy, while sophisticated neural Transformer architectures only improve
marginally. For fine-grained type prediction, on the other hand, Transformer
models for extreme multi-label classification clearly outperform retrieval-based
approaches.</p>
      <p>Our future work concerns an in-depth analysis of the results on DBpedia vs.
Wikidata, to understand the differences and modeling requirements for small
and hierarchical (DBpedia) vs. large and shallow (Wikidata) type taxonomies.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          and
          <string-name>
            <given-names>R.</given-names>
            <surname>Neumayer</surname>
          </string-name>
          .
          <article-title>Hierarchical target type identification for entity-oriented queries</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Information and knowledge management</source>
          ,
          <source>CIKM '12</source>
          , pages
          <fpage>2391</fpage>
          –
          <fpage>2394</fpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.-C.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.-F.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>and I. S.</given-names>
            <surname>Dhillon</surname>
          </string-name>
          .
          <article-title>Taming pretrained transformers for extreme multi-label text classification</article-title>
          .
          <source>In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery &amp; Data Mining</source>
          , pages
          <volume>3163</volume>
          –
          <fpage>3171</fpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          . BERT:
          <article-title>Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers), pages
          <fpage>4171</fpage>
          –
          <fpage>4186</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Ferrucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. W.</given-names>
            <surname>Brown</surname>
          </string-name>
          , J.
          <string-name>
            <surname>Chu-Carroll</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Gondek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kalyanpur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lally</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          <string-name>
            <surname>Murdock</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Nyberg</surname>
            ,
            <given-names>J. M.</given-names>
          </string-name>
          <string-name>
            <surname>Prager</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schlaefer</surname>
            , and
            <given-names>C. A.</given-names>
          </string-name>
          <string-name>
            <surname>Welty</surname>
          </string-name>
          .
          <article-title>Building Watson: An overview of the DeepQA project</article-title>
          .
          <source>AI Magazine</source>
          ,
          <volume>31</volume>
          (
          <issue>3</issue>
          ):
          <volume>59</volume>
          –
          <fpage>79</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Garigliotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hasibi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          .
          <article-title>Target type identification for entity-bearing queries</article-title>
          .
          <source>In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '17</source>
          , pages
          <fpage>845</fpage>
          –
          <fpage>848</fpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>RoBERTa: A robustly optimized BERT pretraining approach</article-title>
          . arXiv preprint arXiv:1907.11692,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>N.</given-names>
            <surname>Mihindukulasooriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gliozzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lehmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.-C. N.</given-names>
            <surname>Ngomo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Usbeck. SeMantic AnsweR</surname>
          </string-name>
          <article-title>Type prediction task (SMART) at ISWC 2020 Semantic Web Challenge</article-title>
          . CoRR, abs/2012.00555,
          <year>2020</year>
          . URL https://arxiv.org/abs/2012.00555.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Long</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          .
          <article-title>Multi-task learning for conversational question answering over a large-scale knowledge base</article-title>
          .
          <source>In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLPIJCNLP)</source>
          , pages
          <fpage>2442</fpage>
          –
          <fpage>2451</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Voorhees</surname>
          </string-name>
          .
          <article-title>The TREC question answering track</article-title>
          .
          <source>Nat. Lang. Eng.</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <volume>361</volume>
          –
          <fpage>378</fpage>
          , Dec.
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , J. Carbonell,
          <string-name>
            <given-names>R. R.</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          . XLNet:
          <article-title>Generalized autoregressive pretraining for language understanding</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <volume>5753</volume>
          –
          <fpage>5763</fpage>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Balog</surname>
          </string-name>
          .
          <article-title>Design patterns for fusion-based object retrieval</article-title>
          .
          <source>In Proceedings of the 39th European conference on Advances in Information Retrieval, ECIR '17</source>
          , pages
          <fpage>684</fpage>
          –
          <fpage>690</fpage>
          . Springer,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>