<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Viability of Synthetic Query Generation for Relevance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aditi Chaudhary</string-name>
          <email>aditichaud@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karthik Raman</string-name>
          <email>karthikraman@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Srinivasan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kazuma Hashimoto</string-name>
          <email>kazumah@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Bendersky</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Najork</string-name>
          <email>najork@google.com</email>
        </contrib>
        <aff>Google Research</aff>
      </contrib-group>
      <abstract>
        <p>Query-document relevance prediction is a critical problem in Information Retrieval systems. This problem has increasingly been tackled using (pretrained) transformer-based models which are finetuned using large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain data. To address this paucity, recent methods leverage these powerful models to generate high-quality task and domain-specific synthetic data. Prior work has largely explored synthetic data generation or query generation (QGen) for Question Answering (QA) and binary (yes/no) relevance prediction, where for instance, the QGen models are given a document, and trained to generate a query relevant to that document. However, in many problems we have a more fine-grained notion of relevance than a simple yes/no label. Thus, in this work, we conduct a detailed study into how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that – contrary to claims from prior works – current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning three public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches – including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models which incorporate knowledge about the different relevance grades. While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space, and as a result the generated queries are not faithful to the desired relevance label.</p>
      </abstract>
      <kwd-group>
        <kwd>Synthetic query generation</kwd>
        <kwd>Relevance prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The task of modeling how relevant a document is to a query is among the most central problems
in Information Retrieval, and a key component of many IR systems. The e-commerce domain
is no exception, with improved relevance models leading to higher consumer engagement and
user satisfaction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. That said, the e-commerce domain offers additional challenges for relevance
modeling – specifically due to its fluidity, with new products appearing every day coupled with
the ever-evolving interests of the user base.
      </p>
      <p>
        [Figure 1: Overview of our setup. QGen models (Vanilla QGen and LabelCond QGen) are
trained on labeled data (ESCI / MS-MARCO) to generate queries (e.g. “hiking pack”), and are
then applied zero-shot to test product corpora (WANDS / HomeDepot) to create synthetic data
for finetuning the downstream task model.]
      </p>
      <p>
        The advent of Large Language Models (LLMs) such as GPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], T5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PaLM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and LLaMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], has unlocked new opportunities for potent relevance modeling. However leveraging
LLMs comes with a key requirement: data! As in other IR verticals, e-commerce (relevance)
labeled training datasets – that are large enough to train these LLMs – are rare1. The proprietary
nature of user logs, coupled with the increasing privacy expectations of users and the exorbitant
costs of collecting high-quality relevance ratings, limit the availability of such data. To tackle
this issue, the predominant solution in the IR community has been to leverage large-scale
general-purpose IR datasets and perform (zero-shot / few-shot) transfer learning. In particular
the MS-MARCO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] dataset – mined from Bing search logs – is the largest publicly available
dataset (with millions of query-document pairs labeled) and most commonly used to train LLMs
to understand query-document relevance.
      </p>
      <p>
        Recently, an alternative paradigm has emerged to overcome the lack of query logs –
synthetically generated query logs i.e., Query Generation (QGen). Recent works have successfully
demonstrated the use of such techniques across different verticals and IR problems, including
Question Answering [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Passage Ranking [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Retrieval [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] – with some recent
results [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] even outperforming transfer learning from MS-MARCO. Beyond improving relevance
prediction (the focus of this paper), these synthetically generated query logs can also be used as
a substitute for real logs in different IR technologies and problems. For instance, applications
like training query suggestions systems or automatically creating FAQs for consumer-facing
applications [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] could all be performed with such logs.
      </p>
      <p>
        Thus, our first contribution is to provide the first detailed empirical understanding of QGen
approaches in the e-commerce domain. Using data from three different e-commerce benchmarks,
we study the performance of the two major families of QGen approaches (finetuning-based vs.
prompt-based) popular in the literature. Our results also demonstrate that models trained using smaller
in-domain labeled datasets can outperform those trained on larger general-purpose datasets, thus reinforcing the
promise of generating high-quality in-domain synthetic data.
1The one notable exception is the recently released ESCI dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] – which we use and discuss later.
      </p>
      <p>Our second contribution involves experiments and analyses that demonstrate that (unlike
claims reported in prior works) QGen approaches are outperformed by the more conventional
(cross-domain) transfer learning style approaches. Via detailed analyses, we identify a set of key
reasons (that we have not seen discussed – or perhaps identified – in prior works) explaining
why QGen approaches fall short. For example, we observe that the best existing QGen baseline
produces at least one problematic (from the lens of faithfulness / correctness) query for 80+% of
products.</p>
      <p>
        Per our study, a key reason for the shortcomings of existing QGen techniques is
their simplification of the label space. More specifically, QGen techniques simplify the problem
of query-document (product) relevance into a simple binary one i.e., relevant or not. In fact, most
existing approaches only use the relevant query-document pairs, by training the model to produce
the associated (relevant) query given the document. This yes/no binarization is unfortunately
a gross over-simplification of the complex relationship between queries and documents. For
example, TREC relevance judgments are often rated on a 4-point Likert scale. Thus ignoring this
nuance seems sub-optimal – as evidenced in our results. Additionally, as noted in Reddy et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], nuanced relevance judgements are important for training a high quality product ranker for a
better user search experience. For instance, they define four-class relevance judgements ranging
from highly relevant to not relevant. A high quality product ranker should be able to rank the
highly relevant product over the next relevance class and so on. Binarizing this would lead to
a loss in nuance and thereby the ranking quality. Thus as our third contribution, we present
modifications to both families of existing QGen approaches (finetuning-based and prompt-based)
that recognize and leverage the nuance in the relevance label space. Interestingly, while the
finetuning variant leads to the overall best QGen models, we find that the prompt-only methods
struggle to understand nuance – indicating potential for future improvements in these pretrained
prompt models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background: Vanilla QGen</title>
      <p>
        Powerful transformer-based models like GPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], T5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], have shown their prowess in generating
high-quality text, owing to their ability to attend to long contexts. These models have now
become a starting point for generating synthetic data for training further downstream models. In
this work, we explore two existing paradigms of QGen approaches – Finetune-Based where a
QGen model is trained on a subset of training data, and Prompt-Based where a large language
model (LLM) is leveraged using only few-shot examples. We refer to these existing approaches
as Vanilla QGen variants as they use information from only the highest relevance label. Below,
we briefly describe them.
      </p>
      <p>
        Finetune-Based Typically, such a QGen model [
        <xref ref-type="bibr" rid="ref13 ref9">13, 9</xref>
        ] is given an input text d (e.g. a passage
or document for question generation) and is trained to generate an output question q which is
relevant to that passage or document. Throughout the paper, the terms ‘product’, ‘document’
and ‘passage’ are used interchangeably; they all refer to an input context which is used for
generating the query. Only the relevant query-document pairs from these datasets (e.g.
MS-MARCO, Yahoo Answers, Stack Exchange) are used for training such a QGen model. The QGen
model is then applied to documents (from the task of interest) to generate synthetic relevant
questions. For training QA models, these new question-document pairs are directly used for data
augmentation [
        <xref ref-type="bibr" rid="ref14 ref8 ref9">9, 8, 14</xref>
        ]. For training neural retrieval models, an additional retriever (e.g. BM25)
is then used to retrieve negative documents for every synthetic relevant question [
        <xref ref-type="bibr" rid="ref10 ref15">15, 10</xref>
        ].
Prompt-Based Instead of training a full QGen model, recent works such as PROMPTAGATOR
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and INPARS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] leverage large language models (LLMs) as query generators. For instance,
PROMPTAGATOR concatenates 8 relevant question-document pairs
{(q0, d0) · · · (q7, d7)} with the target document of interest (d) and prompts the LLM to generate
a new question (q) that is relevant to d. Then, a retriever is used on the generated new query to
construct hard negatives to train a new model on the downstream ranking task. INPARS uses 3
question-document pairs followed by a BM25 retriever to train a T5-reranker model.
      </p>
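      <p>To make the two vanilla paradigms concrete, here is a minimal sketch in Python. The serialization formats (“document: … query: …”) are illustrative assumptions, not the exact templates used by the cited works.</p>

```python
# Illustrative sketch of the two vanilla QGen paradigms (hypothetical formats).

def vanilla_finetune_example(document: str, relevant_query: str) -> dict:
    """Finetune-Based: train the model to emit the relevant query given the document."""
    return {"input": document, "target": relevant_query}

def vanilla_prompt(examples: list, target_document: str) -> str:
    """Prompt-Based (PROMPTAGATOR-style): k relevant (query, document) pairs
    in-context, followed by the target document; the LLM completes with a new query."""
    lines = [f"document: {d} query: {q}" for (q, d) in examples]
    lines.append(f"document: {target_document} query:")
    return "\n".join(lines)
```

      <p>Either output is then fed to the generator: the first as a supervised training pair, the second as a few-shot prompt.</p>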
      <p>In this work, we explore the application of these existing QGen approaches to a much harder
relevance prediction task, which has multiple, nuanced relevance classes as opposed
to only binary relevance (e.g. in MS-MARCO). In the next section, we describe our
adaptations to the above QGen approaches, which condition on all labels.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed: Label-Conditioned QGen for Fine-Grained Relevance Prediction</title>
      <p>As mentioned above, the task of relevance prediction for e-commerce entails – given a
user-issued query and a product, predict the degree of relevance (e.g. highly relevant, partially
relevant, irrelevant) between them. Consider an example from Table 1, where we can see the
fine-grained difference in queries across different relevance labels for the same ESCI product.
Simply binarizing this task, or only considering queries from one relevance label, as done in the
above strategies, risks losing this nuance. Therefore, we extend the above described vanilla
QGen techniques to our nuanced relevance prediction task by conditioning the query generation
on the relevance label. Below we describe our adaptations:
• Finetune-Based-LabelCond: we use the entire training portion of the available data, and
not just the relevant query-document portion, to train the QGen model. Specifically, each
annotated query-document-label triple is transformed such that the label l is prepended to
the document d and the model is trained to output the query q, as shown in Table 1.
• Prompt-Based-LabelCond: we follow PROMPTAGATOR and, instead of using all 8
examples from just the relevant label, we use 2 examples per relevance label, where again, the
label is prepended to the respective example as shown in Table 1.</p>
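      <p>The label-conditioning transformation can be sketched as follows. The field names (“label:”, “product:”) are hypothetical placeholders; Table 1 shows the actual input/output format.</p>

```python
# Sketch of the label-conditioned input transformation (field names are assumptions).

def labelcond_finetune_example(label: str, document: str, query: str) -> dict:
    """Finetune-Based-LabelCond: prepend the relevance label to the document."""
    return {"input": f"label: {label} product: {document}", "target": query}

def labelcond_prompt(examples: list, target_document: str, target_label: str) -> str:
    """Prompt-Based-LabelCond: 8 in-context examples (2 per relevance label),
    each prefixed with its label, then the target product and desired label."""
    lines = [f"label: {l} product: {d} query: {q}" for (l, d, q) in examples]
    lines.append(f"label: {target_label} product: {target_document} query:")
    return "\n".join(lines)
```

      <p>At inference time, varying the label prefix lets the same model generate queries for every relevance grade of a given product.</p>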
      <p>As before, the QGen model is applied to the product corpora of the target domain, which generates
query-product examples for all labels, on which a downstream task model is then trained. In the
next section, we describe in detail this entire process.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Setup</title>
      <p>
        We conduct experiments for the zero-shot setting, where we assume that we do not have any
training data for our dataset of interest. We use two e-commerce datasets as our target, namely,
WANDS and HomeDepot, both described below. These datasets were selected to fulfill the
following desiderata – a) they provide significantly-sized test sets in the e-commerce domain
and have real-world impact, and b) they have fine-grained nuance in the relevance judgements. To
understand the effectiveness of QGen over the more conventional transfer learning approaches,
we compare the cross-domain transfer learning approach (which is non-QGen) with two QGen
approaches (vanilla vs label-conditioned). For the zero-shot cross-domain transfer learning, we
train a downstream relevance prediction model on existing datasets, namely, ESCI and
MS-MARCO, where MS-MARCO is more general-purpose while ESCI is e-commerce focussed,
albeit much smaller in size. The QGen models are similarly trained on ESCI and MS-MARCO,
and applied to the two target datasets to create training data for the downstream task (Figure 1).
We now present the datasets in more detail.
4.1. Data
MS-MARCO Bajaj et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] first introduced the MS MARCO dataset which is constructed
from Bing search logs having 8 million passages extracted from general-purpose web documents.
Over the years this dataset has been updated and subsets of it have been used for many shared
tasks (e.g. TREC2). In this paper, we use the same MS-MARCO data as used by Zhuang et al.
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] which comprises 530,000 queries and a passage corpus of 8 million, each query being
annotated with binary relevance judgements (0 for not relevant and 1 for relevant). Furthermore,
Zhuang et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] retrieve 35 hard negatives for each relevant query and upsample the relevant
examples to match the irrelevant examples; we refer the reader to the paper for more details.
      </p>
      <p>
        ESCI Reddy et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced ESCI, which comprises 2.6 million manually labeled query-product relevance
judgements obtained from the Amazon Search pool. To the best of our knowledge, this is the
largest publicly available shopping queries dataset, comprising 130k unique queries
covering three languages, namely English, Spanish and Japanese. The query-product pairs are
rated for four relevance labels: Exact (E) when the product is exactly relevant to the query,
Substitute (S) when the product is somewhat relevant but it fails to satisfy all requirements of the
query (e.g. showing a ‘red sweater’ product for a ‘blue sweater’ query), Complement (C) when
the item doesn’t satisfy the query but could be used in combination with the query (e.g. showing
‘hydration pack’ for a ‘hiking bag’ query), and Irrelevant (I) when the product is completely
irrelevant to the central aspect of the query (e.g. ‘harry potter book’ for a ‘telescope’ query).
WANDS Chen et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] released WANDS, a product-search relevance dataset from Wayfair3, which
primarily focusses on home improvement. It comprises 233,448 human-annotated relevance
judgements covering 480 unique queries and 42,994 unique products. Unlike ESCI, WANDS
has been labeled with three relevance labels, namely, Exact-Match where the product fully
matches the user query, Partial-Match where the product somewhat matches the query in terms
of the target entity but does not satisfy the modifiers, and Irrelevant where the product is not
relevant to the user query. We consider all 233k examples as our test set for evaluation.4
HomeDepot The Home Depot Product Search Relevance dataset5, released by the Home Depot6 retailer,
comprises 73,789 training examples7 and 166k test examples, focussing on home improvement
e-commerce. However, the relevance labels for the test split are not released publicly, so we use
the entire train portion for our zero-shot evaluation. It comprises 54,470 unique products
with relevance labels scored from 1 (not relevant) to 3 (highly relevant).
      </p>
      <p>Table 2 shows some examples for all these datasets and in Table 3 we describe the statistics for
each dataset.
3https://www.wayfair.com/
4There is no designated train/test split provided so for reproducibility we use the entire data as our test set.
5https://www.kaggle.com/c/home-depot-product-search-relevance/
6http://www.homedepot.com.
7The original dataset had 74,068 examples but 279 of those had parsing issues.</p>
      <sec id="sec-5-1">
        <title>4.2. QGen Setup</title>
        <p>
          We use the pretrained mT5-XXL [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] model (13B parameters) as our starting point for all
Finetune-Based* models, which has been trained for 1M steps on multilingual corpora, giving
our subsequent models the ability to generate in many languages inherently. For
Prompt-Based* models, we use the same setup as PROMPTAGATOR [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] which uses FLAN-137B as the
large language model (LLM) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We use the t5x code base https://github.com/google-research/
t5x to train all models.
        </p>
        <p>Train QGen We finetune a QGen model using the training portion of MS-MARCO and ESCI.
We use the same train/dev/test splits as provided with the respective datasets and transform the
input/output as shown in Table 1. We finetune the QGen model for 100k additional steps with a
constant learning rate of 1e-4, Adafactor optimizer, batch size 128, input sequence length 256,
and target length 32.8 The best checkpoint for subsequent steps was selected using BLEU
performance on the validation set.</p>
        <p>Apply QGen Next, we apply the above trained models to generate query-product pairs on
WANDS and HomeDepot. For the label-conditioned models, the input text is a concatenation
of the desired label and the product, whereas for its vanilla QGen counterparts the input text is
simply the product information. Since *-LabelCond QGen models have the ability to generate
queries for different relevance labels, unlike their vanilla counterparts which can only generate
queries for one relevance label, we generate queries for all relevance labels for a given product.
Similar to the training setup, we use an input sequence length of 256 with target length 32. As
an additional filtration step, we remove duplicate queries, i.e. if the same query is generated for
different labels of the same product, we only retain the query-product-label triple which has the
highest model probability.9
8100k steps amount to approximately 8 epochs, which we deemed sufficient given the computational and time
requirements.</p>
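      <p>The filtration step can be sketched as follows; the tuple layout (product id, label, query, log-probability) is an assumed data shape, not the paper’s actual implementation.</p>

```python
# Sketch of the duplicate-query filtration: when the same query is generated
# for several labels of the same product, keep only the triple with the
# highest model probability.

def filter_duplicates(triples: list) -> list:
    """triples: (product_id, label, query, log_prob) tuples."""
    best = {}
    for product, label, query, log_prob in triples:
        key = (product, query)
        # Keep the highest-scoring triple for each (product, query) pair.
        if key not in best or log_prob > best[key][3]:
            best[key] = (product, label, query, log_prob)
    return list(best.values())
```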
      </sec>
      <sec id="sec-5-2">
        <title>4.3. Evaluate QGen for Utility</title>
        <p>We automatically evaluate the generated synthetic data for its utility to the downstream task. To
do that, we evaluate models trained on the above-generated data on the respective test sets. We
split the resulting filtered QGen data into a train and validation set with a 90:10 ratio such that
there is no product overlap across the two sets. We experiment with two styles of downstream
models, classification and ranking.
classification We use a pretrained mt5-XXL based encoder-only model to perform
multi-class classification, and report NDCG.10 For ESCI-based QGen models, this becomes a
four-class classification task. We finetune the mt5-encoder for an additional 25,000 steps with a
constant learning rate of 1e-4.11 We use a batch size of 64 with an input sequence length of 608.
We chose NDCG, a ranking metric, instead of accuracy because there is a label
mismatch between ESCI and WANDS; a ranking metric avoids an oversimplified deterministic
mapping across the two label sets. This helps evaluate whether the model is correctly ranking
exactly-relevant over partially-relevant over irrelevant. In order to compute NDCG, we need a
relevance score output for each query-document pair. So, from a downstream model based on
the four-way ESCI classification model, we output the prediction probability P(l|x), where x
is the concatenation of the input query, product title and product description, and l is the output label
(E/S/C/I). We then compute a final score by taking an expectation of the prediction probabilities,
multiplying each by its label weight:</p>
      <p>score(x) = ∑_{l ∈ {E, S, C, I}} P(l|x) * w_l,
where w_E = 3.0, w_S = 2.0, w_C = 1.0, w_I = 0.0</p>
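      <p>This expected-relevance score can be computed directly from the classifier’s label probabilities; a minimal sketch with the label weights from the text (E = 3.0, S = 2.0, C = 1.0, I = 0.0):</p>

```python
# Expected-relevance score: weight each ESCI class probability by its
# relevance grade and sum (label weights as given in the text).

LABEL_WEIGHTS = {"E": 3.0, "S": 2.0, "C": 1.0, "I": 0.0}

def relevance_score(probs: dict) -> float:
    """probs: P(label | query + product) over the four ESCI labels."""
    return sum(probs[label] * w for label, w in LABEL_WEIGHTS.items())
```

      <p>Queries for a product are then ranked by this score when computing NDCG.</p>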
        <p>
          An astute reader may wonder why we go through the trouble of training a multi-class model as
opposed to using a ranking model. The reason is that we want the ability to generate new queries
for different relevance labels. This is important for search engines where queries/products that
have had more user-clicks are often indexed and served with priority (to avoid latency). However,
rare or new products often are not covered as they do not have any query associated with them, so
having the ability to generate queries across relevance labels for such products becomes crucial
to increase coverage. However, for completeness, we do report results from a neural re-ranker
and find it to underperform the classification model (details in section 5).
9see Table 6 for number of duplicate queries.
10We had also tried an encoder-decoder model for the classification task but found the encoder-only model to slightly
outperform it.
11We tried two other learning rates of 1e-3 and 5e-5 but found them to be under-performing.
ranking In the neural re-ranker setup, we use the RankT5 model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] which uses T5 encoder
with pointwise ranking loss wherein the loss for each query-document pair is independently
computed. The authors train the RankT5 model on MS-MARCO which has binary relevance
judgements. We follow the same modeling setup as them, with the main difference being that we
use mt5-XL as our starting point instead of T5-Large, as used by them.12 Input sequence length is
256 with constant learning rate of 1e-4. This ranking model is used in the FINETUNE-BASED and
PROMPT-BASED QGen baselines to evaluate the downstream performance. In these baselines,
as you recall, the QGen models are trained to generate only relevant queries. To create training
data for the ranking downstream model, we need to create negative query-document pairs as
well i.e. documents which are not relevant to a query. For this, we use a dual-encoder T5-based
retriever [
          <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
          ]13 to retrieve top-35 documents for every generated query. We use all 35 as our
hard negative query-document pairs and upsample the relevant documents to have an equal label
distribution and train a RankT5 model. This model ranks the target query-product pairs so we
directly use that to compute NDCG.
        </p>
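      <p>The assembly of the pointwise ranking data described above can be sketched as follows; the data shapes and the deduplication of the positive document are simplifying assumptions.</p>

```python
# Sketch of building pointwise ranking data for RankT5: 35 retrieved hard
# negatives per generated query, with the positive upsampled to yield an
# equal label distribution.

def build_ranking_pairs(query: str, positive_doc: str,
                        retrieved_docs: list, n_neg: int = 35) -> list:
    """Returns (query, document, label) tuples with label 1 = relevant."""
    negatives = [d for d in retrieved_docs if d != positive_doc][:n_neg]
    pairs = [(query, d, 0) for d in negatives]
    # Upsample the single positive to match the number of negatives.
    pairs += [(query, positive_doc, 1)] * len(negatives)
    return pairs
```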
        <p>Below, we briefly summarize all the model variants we experiment with.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.4. All Model Variants</title>
        <p>
          First, we describe the baselines which do not use QGen:
• Random where for the target datasets the documents for a given query are randomly
ranked.
• Zero-shot (ESCI) where we train a downstream model for multi-class classification on all
of the ESCI training data and apply it directly to WANDS and Homedepot test data.
• Zero-shot (MS-MARCO) where we train a ranking model using RankT5 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] with the
pointwise loss function on the MS-MARCO training data and apply it directly to the
WANDS and Homedepot test data.
        </p>
        <p>Next, we describe the baselines which use existing QGen approaches:
• Prompt-Based (ESCI) where we randomly sample 8 query-product pairs from ESCI
having Exact (E) relevance label and similar to PROMPTAGATOR prompt FLAN-137B to
generate one relevant query for a new WANDS/Homedepot product. For the downstream
application, we follow the ranking setup described in subsection 4.3.
• FineTune-Based (MS-MARCO) where we finetune the QGen model on only those
query-passage pairs from MS-MARCO that have the Relevant label. For every new target product,
we generate one relevant query and use the retriever to retrieve 35 documents as negative
examples following the ranking setup.</p>
        <p>
          Finally, we describe our adaptations of the above QGen approaches:
• Finetune-Based-LabelCond (ESCI) where we finetune the QGen model on all ESCI
examples, and for every new target product generate queries for all four relevance labels.
12Due to hardware restrictions we could not train the mt5-XXL model variant with their code setup.
13We used the mt5-BASE model finetuned with the unsupervised objective proposed by Izacard et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], based on
the t5x-retrieval code base: https://github.com/google-research/t5x_retrieval.
        </p>
        <p>For the subsequent downstream model, we initialize it with the multi-class classification
model trained on all ESCI data (which we had used in our zero-shot setting), and further
finetune it on the synthetic data, following the classification setup.
• FineTune-Based-LabelCond (MS-MARCO) where we finetune the QGen model on all
MS-MARCO examples, and for every new product generate two queries, one for each of
the two relevance labels. We use the ranking setup to train the downstream model and
initialize it with the MS-MARCO-finetuned-ranking model (used in the zero-shot setting).
• Prompt-Based-LabelCond (ESCI) where we prompt FLAN-137B with 8 ESCI examples
comprising 2 examples per relevance label. For every new target product, we
generate queries for all four relevance labels, and follow ranking setup to train the downstream
model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Discussion</title>
      <p>In this section, we present the results of the two major QGen families (finetune-based vs
prompt-based), comparing them with the cross-domain transfer learning approach. Since WANDS is a more
recent and challenging dataset in comparison to HomeDepot14, we focus on WANDS for our
discussion. We report results for WANDS in Table 4 and for HomeDepot in Table 5. Here are our
main findings:</p>
      <p>Zero-shot Transfer Learning wins over any QGen! Overall, we find that zero-shot transfer
learning outperforms all QGen approaches, both vanilla and label-conditioned. This is unlike
existing works such as INPARS and PROMPTAGATOR, where QGen approaches give the
best downstream performance. This could be attributed to the difficulty of the downstream task,
which in this case is a nuanced relevance prediction task, while the existing works focus on binary
relevance which is much simpler.</p>
      <p>Label-conditioned QGen wins over vanilla QGen! Within the QGen approaches, we find
that our adaptation of conditioning on all relevance labels outperforms the vanilla versions which
do not. From the results of FINETUNE-BASED and FINETUNE-BASED-LABELCOND trained on
MS-MARCO, we find that exposing the QGen models to all labels (in the case of MS-MARCO
they are binary) performs better by +3.3 NDCG@10 points. Therefore, we finetune with all labels
on a related dataset (ESCI) for WANDS and find that it outperforms even the MS-MARCO-based
QGen models. For prompt-based QGen models, we find that its label-conditioned counterpart
underperforms its vanilla variant. However, the prompt-based vanilla variant is far behind (-8.3
NDCG@10 points) the finetune-based vanilla variant to begin with.</p>
      <p>
        In-domain training is important! We find that for both transfer learning and QGen
approaches, transferring from a related domain is important for downstream performance. For
instance, within the transfer learning models, the model trained on ESCI (zero-shot (ESCI))
gives the best downstream performance, even outperforming the model trained on MS-MARCO
(zero-shot (MS-MARCO)), which is trained on nearly 10 times larger training data than ESCI.
This again emphasizes that having a related dataset to transfer from is essential for
downstream performance, similar to Gururangan et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Similarly, within QGen approaches, the
label-conditioned model trained on ESCI (Finetune-Based-LabelCond (ESCI)) outperforms its
MS-MARCO counterpart. Clearly, relatedness of the target dataset to the training dataset is also
important for QGen model training.
14We refer the reader to Chen et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] for more information.
      </p>
      <p>Below we discuss the probable reasons for the shortcomings of QGen approaches. We
inspect three QGen models which have been trained with all labels, namely,
PROMPT-BASED-LABELCOND (ESCI), FINETUNE-BASED-LABELCOND (ESCI) and
FINETUNE-BASED-LABELCOND (MS-MARCO), for the number of duplicate queries generated by each model.
Specifically, a duplicate query here refers to the QGen model producing the same query across
different relevance labels for the same product. In Table 6 we report the results for WANDS.
Recall that for each of the 42,994 WANDS products, the QGen models trained on ESCI
generated 171,976 queries, one for each of the four relevance labels. For QGen models using
MS-MARCO, we generate 85,988 queries, one for each of the two relevance classes. In Table 6
we find that the FINETUNE-BASED-LABELCOND (ESCI) QGen model produces duplicate
queries for 81% of the products, which suggests that simply prepending label information in
the input context is insufficient for the model to learn how to generate discriminative queries.
We would also like to highlight that this happens despite exposing the QGen model
to the entire ESCI training data of 1.6 million examples, in which only 5 of the 1.1
million products have duplicate queries. In Table 7 we report the distribution of generated queries
across different labels, after applying the filtration step (described in subsection 4.2) where
we remove the duplicate queries. Clearly, noise in the synthetic queries causes errors in the
subsequent downstream models. Interestingly, despite the PROMPT-BASED-LABELCOND (ESCI)
and FINETUNE-BASED-LABELCOND (MS-MARCO) models having more valid
queries, they still underperform FINETUNE-BASED-LABELCOND (ESCI).</p>
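The duplicate-query statistic above can be computed with a single scan over the generated (product, label, query) triples; the data layout below is illustrative, not the exact format of our pipeline.

```python
from collections import defaultdict

def duplicate_product_fraction(generated):
    """generated: iterable of (product_id, label, query) triples
    produced by a label-conditioned QGen model.

    A product counts as having duplicates when the model emits the
    same query string for two or more different relevance labels."""
    queries_by_product = defaultdict(list)
    for product_id, _label, query in generated:
        queries_by_product[product_id].append(query.strip().lower())
    duplicated = sum(
        1 for qs in queries_by_product.values() if len(set(qs)) < len(qs)
    )
    return duplicated / len(queries_by_product)

# Toy example: one of two products repeats its query across labels.
rows = [
    ("p1", "E", "acacia wood bed frame"),
    ("p1", "S", "acacia wood bed frame"),   # duplicate across labels
    ("p2", "E", "storage bed"),
    ("p2", "S", "wooden bed"),
]
print(duplicate_product_fraction(rows))  # → 0.5
```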
      <p>
        The reason why FINETUNE-BASED-LABELCOND (MS-MARCO) underperforms its
ESCI counterpart could be attributed to a) the difference in domain and b) the style of queries. For
instance, queries from MS-MARCO-trained QGen models are more formal what-style questions,
while queries from ESCI-trained QGen models are more informal and similar in style to the
gold queries. Although PROMPT-BASED-LABELCOND (ESCI) has far fewer duplicate queries,
it severely underperforms, probably because of poor overall quality. In Table 8 we present the
generated queries from the different QGen models for a product. We also provide the user-issued, or
gold, query from the WANDS test set for the same product.15 For PROMPT-BASED-LABELCOND
(ESCI) we see that the query for the highest relevance label, i.e. ‘E’, focuses on the entity bed
frame with free storage plans, while from the product description we know that it is mainly
about a bed frame which is made from acacia wood and additionally has storage. Nowhere
does the product mention storage plans. In fact, the query for the next relevance label ‘S’
is more relevant than the one for ‘E’. Clearly, exposing the models to only 8 examples, as
proposed by PROMPTAGATOR [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], is insufficient in comparison to the 1.6 million examples
used by FINETUNE-BASED-LABELCOND (ESCI), especially for the WANDS dataset. On the
other hand, the PROMPTAGATOR work found that exposing models to only 8 task-specific
examples for QGen outperformed finetuned models which were trained on (100)
MS-MARCO examples. We note that Dai et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also apply an additional
consistency filtration step to the generated queries, wherein they retain only those queries which
are answerable from the passage from which they were generated. They find that adding this
round-trip consistency adds 2.5 points (avg.), but for smaller datasets it negatively impacts
the downstream performance. Therefore, we experimented with round-trip consistency for the
FINETUNE-BASED-LABELCOND (ESCI) model for WANDS, which is the best among all QGen
variants. Specifically, we use the downstream relevance prediction model trained on ESCI (i.e.
the model used for zero-shot transfer learning) and re-label the generated queries.16 We first
find that the predicted label for 49% of the generated queries does not match the label which was
used to generate the query (i.e. the desired label). We then use the predicted label as the final
label for each query and train a downstream model as before. We find this results in only a +1 point
improvement.17
15Note that not all products in the test set have queries for each relevance label; we simply sampled from those
products which do, for qualitative evaluation purposes.
      </p>
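The round-trip consistency step described above amounts to re-scoring each synthetic (query, product) pair with a relevance model trained on real data and adopting the predicted label. The sketch below is a hypothetical illustration: the field names and the `relevance_model` callable are assumptions, not our exact implementation.

```python
def round_trip_relabel(synthetic, relevance_model):
    """Re-label synthetic queries via round-trip consistency.

    synthetic: list of dicts with 'query', 'product', 'desired_label'
    relevance_model: callable (query, product) -> predicted label,
    e.g. a relevance predictor trained on real (here, ESCI) data.

    Returns the re-labeled examples and the fraction of queries whose
    predicted label disagrees with the label used to generate them."""
    relabeled, mismatches = [], 0
    for ex in synthetic:
        predicted = relevance_model(ex["query"], ex["product"])
        if predicted != ex["desired_label"]:
            mismatches += 1
        relabeled.append({**ex, "label": predicted})
    return relabeled, mismatches / len(synthetic)

# Dummy relevance model for illustration: always predicts 'E'.
synthetic = [
    {"query": "acacia bed frame", "product": "p1", "desired_label": "E"},
    {"query": "metal desk", "product": "p2", "desired_label": "S"},
]
relabeled, mismatch_rate = round_trip_relabel(synthetic, lambda q, p: "E")
print(mismatch_rate)  # → 0.5
```

In our experiment this mismatch rate was 49%, i.e. nearly half of the generated queries were not faithful to the label they were conditioned on.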
      <p>
        This highlights that even though QGen techniques offer a promising solution for adapting
models to new domains, they need further investigation and analysis to make them more effective
across different tasks.
16Dai et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use the downstream model trained on synthetic data; instead, we use a model trained on good-quality
ESCI data.
      </p>
      <p>17Given that, for WANDS in Table 4, PROMPT-BASED-LABELCOND (ESCI) is almost 21 points behind its
finetuned counterparts, we did not apply this additional step, which would require additional model training.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Related Work</title>
      <p>
        Synthetic Question Generation has come a long way from relying on simple but rigid heuristics
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] to using neural-network approaches, specifically seq-to-seq model [
        <xref ref-type="bibr" rid="ref27 ref28 ref29 ref30">27, 28, 29, 30</xref>
        ], to now
even leveraging large language models (LLMs) through prompting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Much of the work in
this area has focused on question generation in the context of QA systems. Below we describe
some of the representative works in this area.
      </p>
      <p>
        QGen for QA In the pre-transformer era, seq-to-seq models trained with attention read an
input sentence and generated a question with respect to an answer contained in that
sentence, e.g. for factoid QA [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ]. Du and Cardie [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] go beyond using single sentence context
(as Du et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] note that 30% of SQuAD questions span answers beyond a single sentence) for
generating questions. Transformers [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] subsequently changed the game: with their ability to attend to specific parts of the text,
QGen models have further improved. For instance,
Lopez et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] use a GPT-2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] language model to train a question-generation model using
the passage as input. They also train an answer-aware variant where they mark start and end
of the answer span with special tokens in the context. However, they find that the answer-aware
variant underperforms the answer-unaware model on question generation (in terms of the
BLEU metric). They hypothesize that this is because there is no explicit mechanism to
inform the model on how to use the answer information, somewhat similar to what we find in our
label-conditioned models, which also seem not to use the label information effectively.
Ünlü Menevşe et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explore question generation for the spoken QA task. More recently, Ko
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Chakrabarty et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Cao and Wang [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] propose approaches to generate more
open-ended questions, whose answers often span multiple sentences and could be long-form.
Cao and Wang [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] create a question-type ontology to guide the model to generate a particular
type of question. They essentially concatenate the question-type with the multi-sentence input to
generate the question. In the hope of controlling the question generation, they train it jointly with
question focus prediction which uses semantic graphs. In principle the question focus and label
conditioning are related as in our case, question focus is the conditioned label, however, their
main goal of work is to generate questions which are diverse and illicit complex reasoning or
curiosity [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It is not evaluated on improving any downstream tasks.
      </p>
      <p>
        Label-Conditioned QGen Some previous works have looked at label conditioning in QGen
models for classification tasks. Kumar et al. [33], Yang et al. [34] find that prepending
class labels to the input text is quite effective for class-conditional text generation and thereby data
augmentation. They show the effectiveness of this approach for classification tasks (e.g. SST-2
with binary sentiment, SNIPS with 7 intents, TREC with six classes, SNLI, and commonsense
reasoning) across different pretrained LMs including auto-encoder LM (BERT Devlin et al. [35]),
auto-regressive LM (GPT-2 Radford et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and pretrained seq-to-seq LM (BART [36]). In
this work, we look at fine-grained relevance prediction, where the task is more difficult in that the
multiple classes have an inherent ordering, and therefore it is harder for QGen models to produce
discriminative queries across such fine-grained labels.
      </p>
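The label conditioning discussed above amounts to serializing the relevance label together with the product text before feeding it to the QGen model. A minimal sketch follows; the field names and separators are illustrative assumptions, not the paper's exact serialization format.

```python
def build_qgen_input(label, title, description):
    """Prepend the relevance label to the product text, as in
    label-conditioned QGen (the 'relevance:'/'title:'/'description:'
    markers here are illustrative assumptions, not the exact format)."""
    return f"relevance: {label} title: {title} description: {description}"

# One conditioned input per (product, label) pair; e.g. ESCI's
# four labels would yield four inputs for the same product.
example = build_qgen_input(
    "E", "Acacia Bed Frame", "Solid acacia wood bed with under-bed storage."
)
print(example)
```

The QGen model is then trained (or prompted) to emit a query whose relevance to the product matches the prepended label; as the duplicate-query analysis in section 5 shows, this signal alone is often too weak for the model to produce label-discriminative queries.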
    </sec>
    <sec id="sec-8">
      <title>7. Limitations and Next Steps</title>
      <p>From the above results, it is apparent that QGen approaches, although offering a promising
direction especially for zero-shot settings, need considerable work to outperform transfer learning.
Clearly, simply adding label information in the input context does not provide a sufficient signal
for the model to generate discriminative queries. We need to explicitly enforce this signal
throughout the QGen training process. In this work, we only generate one query, but using beam
search we could generate multiple queries for a given product-label combination, resulting in
a diverse collection. Another challenge in working with QGen approaches is that the typical
strategy for evaluating the synthetic data is to evaluate it on a downstream task, requiring two
additional steps after training a QGen model: applying the QGen model for generating queries
and then training a downstream task model, to understand the effect of the synthetic data. So if a
researcher wanted to experiment with multiple QGen models, they would have to run three times
the number of experiments to determine which QGen model is best, which is a waste of
resources and time. This means that we need an intrinsic evaluation metric that
correlates well with the downstream task performance. Our next steps are focused on addressing these
issues.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We would like to thank the anonymous reviewers for the valuable feedback and suggestions. We
would also like to thank Honglei Zhuang and Rolf Jagerman for helping us run and adapt the RankT5
model for our experiments.</p>
      <p>[32] S. Cao, L. Wang, Controllable open-ended question generation with a new question type
ontology, in: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), 2021, pp. 6424–6439. URL: https://aclanthology.org/2021.acl-long.502. doi:10.18653/v1/2021.acl-long.502.
[33] V. Kumar, A. Choudhary, E. Cho, Data augmentation using pre-trained transformer models,
in: Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems,
2020, pp. 18–26. URL: https://aclanthology.org/2020.lifelongnlp-1.3.
[34] Y. Yang, C. Malaviya, J. Fernandez, S. Swayamdipta, R. L. Bras, J.-P. Wang, C. Bhagavatula,
Y. Choi, D. Downey, Generative data augmentation for commonsense reasoning, arXiv
preprint arXiv:2004.11546 (2020).
[35] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[36] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov,
L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Katariya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          , Anthem:
          <article-title>Attentive hyperbolic entity model for product search</article-title>
          ,
          <source>in: Proceedings of the 15th ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2022</year>
          , p.
          <fpage>161</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          . URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>Palm: Scaling language modeling with pathways</article-title>
          ,
          <source>arXiv preprint arXiv:2204.02311</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Valero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <article-title>Shopping queries dataset: A large-scale ESCI benchmark for improving product search</article-title>
          , arXiv (2022). arXiv:2206.06588.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng, MS MARCO:
          <article-title>A human generated machine reading comprehension dataset</article-title>
          ,
          <source>in: Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches</source>
          <year>2016</year>
          , volume
          <volume>1773</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2016</year>
          . URL: https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ünlü Menevşe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Manav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arisoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Özgür</surname>
          </string-name>
          ,
          <article-title>A framework for automatic generation of spoken question-answering data</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>4659</fpage>
          -
          <lpage>4666</lpage>
          . URL: https://aclanthology.org/2022.findings-emnlp.342.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Andor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M. Collins,
          <article-title>Synthetic QA corpora generation with roundtrip consistency</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6168</fpage>
          -
          <lpage>6173</lpage>
          . URL: https://aclanthology.org/P19-1620. doi:10.18653/v1/P19-1620.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , I. Korotkov,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>Zero-shot neural passage retrieval via domain-targeted synthetic question generation</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <year>2021</year>
          , pp.
          <fpage>1075</fpage>
          -
          <lpage>1088</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.92. doi:10.18653/v1/2021.eacl-main.92.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , Promptagator:
          <article-title>Few-shot dense retrieval from 8 examples</article-title>
          , arXiv preprint arXiv:2209.11755 (2022).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          , CONSISTENT:
          <article-title>Open-ended question generation from news articles</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>6954</fpage>
          -
          <lpage>6968</lpage>
          . URL: https://aclanthology.org/2022.findings-emnlp.517.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. B.</given-names>
            <surname>Cruz</surname>
          </string-name>
          , C. Cheng,
          <article-title>Transformer-based end-to-end question generation</article-title>
          ,
          <source>arXiv preprint arXiv:2005.01107 4</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Durrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Inquisitive question generation for high level text comprehension</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6544</fpage>
          -
          <lpage>6555</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.530. doi:10.18653/v1/2020.emnlp-main.530.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bonifacio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>InPars: Data augmentation for information retrieval using large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2202.05144</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>MS MARCO: A human generated machine reading comprehension dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1611.09268</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jagerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>RankT5: Fine-tuning T5 for text ranking with ranking losses</article-title>
          ,
          <source>arXiv preprint arXiv:2210.10634</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <article-title>WANDS: Dataset for product search relevance assessment</article-title>
          ,
          <source>in: Proceedings of the 44th European Conference on Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>mT5: A massively multilingual pre-trained text-to-text transformer</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2109.01652</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Large dual encoders are generalizable retrievers</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.07899.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>SentenceT5: Scalable sentence encoders from pre-trained text-to-text models</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2022</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>1864</fpage>
          -
          <lpage>1874</lpage>
          . URL: https://aclanthology.org/2022.findings-acl.146. doi:10.18653/v1/2022.findings-acl.146.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Towards unsupervised dense information retrieval with contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2112.09118</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          . URL: https://aclanthology.org/2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <article-title>Automatic factual question generation from text</article-title>
          ,
          <source>Technical Report CMU-LTI-11-004</source>
          , Carnegie Mellon University, Language Technologies Institute,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Serban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>598</lpage>
          . URL: https://aclanthology.org/P16-1056. doi:10.18653/v1/P16-1056.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Neural question generation from text: A preliminary study</article-title>
          ,
          <source>in: Natural Language Processing and Chinese Computing</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Learning to ask: Neural question generation for reading comprehension</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>1342</fpage>
          -
          <lpage>1352</lpage>
          . URL: https://aclanthology.org/P17-1123. doi:10.18653/v1/P17-1123.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Harvesting paragraph-level question-answer pairs from Wikipedia</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1907</fpage>
          -
          <lpage>1917</lpage>
          . URL: https://aclanthology.org/P18-1177. doi:10.18653/v1/P18-1177.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Controllable open-ended question generation with a new question type</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>