<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Exploring the Viability of Synthetic Query Generation for Relevance Prediction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aditi Chaudhary</string-name>
          <email>aditichaud@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Karthik Raman</string-name>
          <email>karthikraman@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Krishna Srinivasan</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kazuma Hashimoto</string-name>
          <email>kazumah@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mike Bendersky</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Najork</string-name>
          <email>najork@google.com</email>
        </contrib>
        <aff>Google Research</aff>
      </contrib-group>
      <abstract>
        <p>Query-document relevance prediction is a critical problem in Information Retrieval systems. This problem has increasingly been tackled using (pretrained) transformer-based models which are finetuned using large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain data. To address this paucity, recent methods leverage these powerful models to generate high-quality task and domain-specific synthetic data. Prior work has largely explored synthetic data generation or query generation (QGen) for Question Answering (QA) and binary (yes/no) relevance prediction, where for instance, the QGen models are given a document, and trained to generate a query relevant to that document. However, in many problems we have a more fine-grained notion of relevance than a simple yes/no label. Thus, in this work, we conduct a detailed study into how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that – contrary to claims from prior works – current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning three public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches – including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models which incorporate knowledge about the different relevance grades. While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space, and as a result the generated queries are not faithful to the desired relevance label.</p>
      </abstract>
      <kwd-group>
        <kwd>Synthetic query generation</kwd>
        <kwd>Relevance prediction</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The task of modeling how relevant a document is to a query is among the most central problems
in Information Retrieval, and a key component of many IR systems. The e-commerce domain
is no exception, with improved relevance models leading to higher consumer engagement and
user satisfaction [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. That said, the e-commerce domain offers additional challenges for relevance
modeling – specifically due to its fluidity, with new products appearing every day coupled with
the ever-evolving interests of the user base.
      </p>
      <p>
        [Figure 1: Overview of our setup. QGen models (Vanilla QGen and LabelCond QGen) are
trained on labeled data (ESCI / MS-MARCO) to generate queries (e.g. “hiking pack”), and are
then applied zero-shot to test product corpora (WANDS / HomeDepot) to create synthetic data
for finetuning the downstream task model.]
      </p>
      <p>
        The advent of Large Language Models (LLMs) such as GPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], T5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PaLM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and LLaMA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], has unlocked new opportunities for potent relevance modeling. However leveraging
LLMs comes with a key requirement: data! As in other IR verticals, e-commerce (relevance)
labeled training datasets – that are large enough to train these LLMs – are rare1. The proprietary
nature of user logs, coupled with the increasing privacy expectations of users and the exorbitant
costs of collecting high-quality relevance ratings, limit the availability of such data. To tackle
this issue, the predominant solution in the IR community has been to leverage large-scale
general-purpose IR datasets and perform (zero-shot / few-shot) transfer learning. In particular
the MS-MARCO [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] dataset – mined from Bing search logs – is the largest publicly available
dataset (with millions of query-document pairs labeled) and most commonly used to train LLMs
to understand query-document relevance.
      </p>
      <p>
        Recently, an alternative paradigm has emerged to overcome the lack of query logs –
synthetically generated query logs i.e., Query Generation (QGen). Recent works have successfully
demonstrated the use of such techniques across different verticals and IR problems, including
Question Answering [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], Passage Ranking [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and Retrieval [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] – with some recent
results [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] even outperforming transfer learning from MS-MARCO. Beyond improving relevance
prediction (the focus of this paper), these synthetically generated query logs can also be used as
a substitute for real logs in different IR technologies and problems. For instance, applications
like training query suggestions systems or automatically creating FAQs for consumer-facing
applications [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] could all be performed with such logs.
      </p>
      <p>
        Thus, our first contribution is to provide the first detailed empirical understanding of QGen
approaches in the e-commerce domain. Using data from three different e-commerce benchmarks,
we study the performance of the two major families of QGen approaches (finetuning-based vs.
prompt-based) popular in the literature. Our results also demonstrate that models trained using smaller
in-domain labeled datasets can outperform those trained on larger general-purpose datasets, thus reinforcing the
promise of generating high-quality in-domain synthetic data.
1The one notable exception is the recently released ESCI dataset [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] – which we use and discuss later.
      </p>
      <p>Our second contribution involves experiments and analyses that demonstrate that (unlike
claims reported in prior works) QGen approaches are outperformed by the more conventional
(cross-domain) transfer learning style approaches. Via detailed analyses, we identify a set of key
reasons (that we have not seen discussed – or perhaps identified – in prior works) explaining
why QGen approaches fall short. For example, we observe that the best existing QGen baseline
produces at least one problematic (from the lens of faithfulness / correctness) query for 80+% of
products.</p>
      <p>
        Per our study, a key reason for the shortcomings of existing QGen techniques is
their simplification of the label space. More specifically, QGen techniques simplify the problem
of query-document (product) relevance into a simple binary one i.e., relevant or not. In fact, most
existing approaches only use the relevant query-document pairs, by training the model to produce
the associated (relevant) query given the document. This yes/no binarization is unfortunately
a gross over-simplification of the complex relationship between queries and documents. For
example, TREC relevance judgments are often rated on a 4-point Likert scale. Thus ignoring this
nuance seems sub-optimal – as evidenced in our results. Additionally, as noted in Reddy et al.
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], nuanced relevance judgements are important for training a high quality product ranker for a
better user search experience. For instance, they define four-class relevance judgements ranging
from highly relevant to not relevant. A high quality product ranker should be able to rank the
highly relevant product over the next relevance class and so on. Binarizing this would lead to
a loss in nuance and thereby the ranking quality. Thus as our third contribution, we present
modifications to both families of existing QGen approaches (finetuning-based and prompt-based)
that recognize and leverage the nuance in the relevance label space. Interestingly, while the
finetuning variant leads to the overall best QGen models, we find that the prompt-only methods
struggle to understand nuance – indicating potential for future improvements in these pretrained
prompt models.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background: Vanilla QGen</title>
      <p>
        Powerful transformer-based models like GPT [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], T5 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], have shown their prowess in generating
high-quality text, owing to their ability to attend to long contexts. These models have now
become a starting point for generating synthetic data for training further downstream models. In
this work, we explore two existing paradigms of QGen approaches – Finetune-Based where a
QGen model is trained on a subset of training data, and Prompt-Based where a large language
model (LLM) is leveraged using only few-shot examples. We refer to these existing approaches
as Vanilla QGen variants as they use information from only the highest relevance label. Below,
we briefly describe them.
      </p>
      <p>
        Finetune-Based Typically, such a QGen model [
        <xref ref-type="bibr" rid="ref13 ref9">13, 9</xref>
        ] is given an input text d (e.g. a passage
or document for question generation) and is trained to generate an output question q which is
relevant to that passage or document. Throughout the paper, the terms ‘product’, ‘document’
and ‘passage’ are used interchangeably; they all refer to an input context which is used for
generating the query. Only the relevant query-document pairs from these datasets (e.g.
MS-MARCO, Yahoo Answers, Stack Exchange) are used for training such a QGen model. The QGen
model is then applied to documents (from the task of interest) to generate synthetic relevant
questions. For training QA models, these new question-document pairs are directly used for data
augmentation [
        <xref ref-type="bibr" rid="ref14 ref8 ref9">9, 8, 14</xref>
        ]. For training neural retrieval models, an additional retriever (e.g. BM25)
is then used to retrieve negative documents for every synthetic relevant question [
        <xref ref-type="bibr" rid="ref10 ref15">15, 10</xref>
        ].
Prompt-Based Instead of training a full QGen model, recent works such as PROMPTAGATOR
[
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and INPARS [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] leverage large language models (LLMs) as query generators. For instance,
PROMPTAGATOR concatenates 8 relevant question-document pairs
{(q0, d0) · · · (q7, d7)} with the target document of interest (d) and prompts the LLM to generate
a new question (q) that is relevant to d. Then, a retriever is used on the generated new query to
construct hard negatives to train a new model on the downstream ranking task. INPARS uses 3
question-document pairs followed by a BM25 retriever to train a T5-reranker model.
      </p>
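      <p>To make the two vanilla paradigms concrete, here is a minimal sketch in Python. The serialization formats (“document: … query: …”) are illustrative assumptions, not the exact templates used by the cited works.</p>

```python
# Illustrative sketch of the two vanilla QGen paradigms (hypothetical formats).

def vanilla_finetune_example(document: str, relevant_query: str) -> dict:
    """Finetune-Based: train the model to emit the relevant query given the document."""
    return {"input": document, "target": relevant_query}

def vanilla_prompt(examples: list, target_document: str) -> str:
    """Prompt-Based (PROMPTAGATOR-style): k relevant (query, document) pairs
    in-context, followed by the target document; the LLM completes with a new query."""
    lines = [f"document: {d} query: {q}" for (q, d) in examples]
    lines.append(f"document: {target_document} query:")
    return "\n".join(lines)
```

      <p>Either output is then fed to the generator: the first as a supervised training pair, the second as a few-shot prompt.</p>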
      <p>In this work, we explore the application of these existing QGen approaches to a much harder
relevance prediction task, which has multiple, nuanced relevance classes as opposed
to only binary relevance (e.g. in MS-MARCO). In the next section, we describe our
adaptations to the above QGen approaches, which condition on all labels.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Proposed: Label-Conditioned QGen for Fine-Grained Relevance Prediction</title>
      <p>As mentioned above, the task of relevance prediction for e-commerce entails – given a
user-issued query and a product, predict the degree of relevance (e.g. highly relevant, partially
relevant, irrelevant) between them. Consider an example from Table 1, where we can see the
fine-grained difference in queries across different relevance labels for the same ESCI product.
Simply binarizing this task, or only considering queries from one relevance label, as done in the
above strategies, risks losing this nuance. Therefore, we extend the above described vanilla
QGen techniques to our nuanced relevance prediction task by conditioning the query generation
on the relevance label. Below we describe our adaptations:
• Finetune-Based-LabelCond: we use the entire training portion of the available data, and
not just the relevant query-document portion, to train the QGen model. Specifically, each
annotated query-document-label triple is transformed such that the label l is prepended to
the document d and the model is trained to output the query q, as shown in Table 1.
• Prompt-Based-LabelCond: we follow PROMPTAGATOR and, instead of using all 8
examples from just the relevant label, we use 2 examples per relevance label, where again, the
label is prepended to the respective example as shown in Table 1.</p>
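      <p>The label-conditioning transformation can be sketched as follows. The field names (“label:”, “product:”) are hypothetical placeholders; Table 1 shows the actual input/output format.</p>

```python
# Sketch of the label-conditioned input transformation (field names are assumptions).

def labelcond_finetune_example(label: str, document: str, query: str) -> dict:
    """Finetune-Based-LabelCond: prepend the relevance label to the document."""
    return {"input": f"label: {label} product: {document}", "target": query}

def labelcond_prompt(examples: list, target_document: str, target_label: str) -> str:
    """Prompt-Based-LabelCond: 8 in-context examples (2 per relevance label),
    each prefixed with its label, then the target product and desired label."""
    lines = [f"label: {l} product: {d} query: {q}" for (l, d, q) in examples]
    lines.append(f"label: {target_label} product: {target_document} query:")
    return "\n".join(lines)
```

      <p>At inference time, varying the label prefix lets the same model generate queries for every relevance grade of a given product.</p>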
      <p>As before, the QGen model is applied to the product corpora of the target domain, which generates
query-product examples for all labels, on which a downstream task model is then trained. In the
next section, we describe in detail this entire process.</p>
    </sec>
    <sec id="sec-5">
      <title>4. Experimental Setup</title>
      <p>
        We conduct experiments for the zero-shot setting, where we assume that we do not have any
training data for our dataset of interest. We use two e-commerce datasets as our target, namely,
WANDS and HomeDepot, both described below. These datasets were selected to fulfill the
following desiderata – a) they provide significantly-sized test sets in the e-commerce domain
and have real-world impact, and b) they have fine-grained nuance in the relevance judgements. To
understand the effectiveness of QGen over the more conventional transfer learning approaches,
we compare the cross-domain transfer learning approach (which is non-QGen) with two QGen
approaches (vanilla vs label-conditioned). For the zero-shot cross-domain transfer learning, we
train a downstream relevance prediction model on existing datasets, namely, ESCI and
MS-MARCO, where MS-MARCO is more general-purpose while ESCI is e-commerce focussed,
albeit much smaller in size. The QGen models are similarly trained on ESCI and MS-MARCO,
and applied to the two target datasets to create training data for the downstream task (Figure 1).
We now present the datasets in more detail.
4.1. Data
MS-MARCO Bajaj et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] first introduced the MS MARCO dataset which is constructed
from Bing search logs having 8 million passages extracted from general-purpose web documents.
Over the years this dataset has been updated and subsets of it have been used for many shared
tasks (e.g. TREC2). In this paper, we use the same MS-MARCO data as used by Zhuang et al.
[
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] which comprises 530,000 queries and a passage corpus of 8 million, each query being
annotated with binary relevance judgements (0 for not relevant and 1 for relevant). Furthermore,
Zhuang et al. [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] retrieve 35 hard negatives for each relevant query and upsample the relevant
examples to match the irrelevant examples; we refer the reader to the paper for more details.
      </p>
      <p>
        ESCI Reddy et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] introduced ESCI, which comprises 2.6 million manually labeled query-product relevance
judgements obtained from the Amazon Search pool. To the best of our knowledge, this is the
largest publicly available shopping queries dataset, comprising 130k unique queries
covering three languages, namely English, Spanish and Japanese. The query-product pairs are
rated for four relevance labels: Exact (E) when the product is exactly relevant to the query,
Substitute (S) when the product is somewhat relevant but it fails to satisfy all requirements of the
query (e.g. showing a ‘red sweater’ product for a ‘blue sweater’ query), Complement (C) when
the item doesn’t satisfy the query but could be used in combination with the query (e.g. showing
‘hydration pack’ for a ‘hiking bag’ query), and Irrelevant (I) when the product is completely
irrelevant to the central aspect of the query (e.g. ‘harry potter book’ for a ‘telescope’ query).
WANDS Chen et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] released WANDS, a product-search relevance dataset from Wayfair3, which
primarily focusses on home improvement. It comprises 233,448 human-annotated relevance
judgements covering 480 unique queries and 42,994 unique products. Unlike ESCI, WANDS
has been labeled with three relevance labels, namely, Exact-Match where the product fully
matches the user query, Partial-Match where the product somewhat matches the query in terms
of the target entity but does not satisfy the modifiers, and Irrelevant where the product is not
relevant to the user query. We consider all 233k examples as our test set for evaluation.4
HomeDepot The Home Depot Product Search Relevance dataset5, released by the Home Depot6 retailer,
comprises 73,789 training examples7 and 166k test examples, focussing on home improvement
e-commerce. However, the relevance labels for the test split are not released publicly, so we use
the entire train portion for our zero-shot evaluation. It comprises 54,470 unique products
with relevance labels scored from 1 (not relevant) to 3 (highly relevant).
      </p>
      <p>Table 2 shows some examples for all these datasets and in Table 3 we describe the statistics for
each dataset.
3https://www.wayfair.com/
4There is no designated train/test split provided so for reproducibility we use the entire data as our test set.
5https://www.kaggle.com/c/home-depot-product-search-relevance/
6http://www.homedepot.com.
7The original dataset had 74,068 examples but 279 of those had parsing issues.</p>
      <sec id="sec-5-1">
        <title>4.2. QGen Setup</title>
        <p>
          We use the pretrained mT5-XXL [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] model (13B parameters) as our starting point for all
Finetune-Based* models, which has been trained for 1M steps on multilingual corpora, giving
our subsequent models the ability to generate in many languages inherently. For
Prompt-Based* models, we use the same setup as PROMPTAGATOR [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] which uses FLAN-137B as the
large language model (LLM) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We use the t5x code base https://github.com/google-research/
t5x to train all models.
        </p>
        <p>Train QGen We finetune a QGen model using the training portion of MS-MARCO and ESCI.
We use the same train/dev/test splits as provided with the respective datasets and transform the
input/output as shown in Table 1. We finetune the QGen model for 100k additional steps with a
constant learning rate of 1e-4, Adafactor optimizer, batch size 128, input sequence length 256,
and target length 32.8 The best checkpoint for subsequent steps was selected using BLEU
performance on the validation set.</p>
        <p>Apply QGen Next, we apply the above trained models to generate query-product pairs on
WANDS and HomeDepot. For the label-conditioned models, the input text is a concatenation
of the desired label and the product, whereas for its vanilla QGen counterparts the input text is
simply the product information. Since *-LabelCond QGen models have the ability to generate
queries for different relevance labels, unlike their vanilla counterparts which can only generate
queries for one relevance label, we generate queries for all relevance labels for a given product.
Similar to the training setup, we use an input sequence length of 256 with target length 32. As
an additional filtration step, we remove duplicate queries, i.e. if the same query is generated for
different labels of the same product, we only retain the query-product-label triple which has the
highest model probability.9
8100k steps amount to approximately 8 epochs, which we deemed sufficient given the computational and time
requirements.</p>
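      <p>The filtration step can be sketched as follows; the tuple layout (product id, label, query, log-probability) is an assumed data shape, not the paper’s actual implementation.</p>

```python
# Sketch of the duplicate-query filtration: when the same query is generated
# for several labels of the same product, keep only the triple with the
# highest model probability.

def filter_duplicates(triples: list) -> list:
    """triples: (product_id, label, query, log_prob) tuples."""
    best = {}
    for product, label, query, log_prob in triples:
        key = (product, query)
        # Keep the highest-scoring triple for each (product, query) pair.
        if key not in best or log_prob > best[key][3]:
            best[key] = (product, label, query, log_prob)
    return list(best.values())
```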
      </sec>
      <sec id="sec-5-2">
        <title>4.3. Evaluate QGen for Utility</title>
        <p>We automatically evaluate the generated synthetic data for its utility to the downstream task. To
do that, we evaluate models trained on the above-generated data on the respective test sets. We
split the resulting filtered QGen data into a train and validation set with a 90:10 ratio such that
there is no product overlap across the two sets. We experiment with two styles of downstream
models, classification and ranking.
classification We use a pretrained mt5-XXL based encoder-only model to perform
multi-class classification, and report NDCG.10 For ESCI-based QGen models, this becomes a
four-class classification task. We finetune the mt5-encoder for an additional 25,000 steps with a
constant learning rate of 1e-4.11 We use a batch size of 64 with an input sequence length of 608.
We chose NDCG, a ranking metric, instead of accuracy because there is a label
mismatch between ESCI and WANDS; a ranking metric avoids an oversimplified deterministic
mapping across the two label sets. This helps evaluate whether the model is correctly ranking
exactly-relevant over partially-relevant over irrelevant. In order to compute NDCG, we need a
relevance score output for each query-document pair. So, from a downstream model based on
the four-way ESCI classification model, we output the prediction probability P(l|x), where x
is the concatenation of the input query, product title and product description, and l is the output label
(E/S/C/I). We then compute a final score by taking an expectation of the prediction probabilities,
multiplying each by its label weight:</p>
      <p>score(x) = ∑_{l ∈ {E, S, C, I}} P(l|x) * w_l,
where w_E = 3.0, w_S = 2.0, w_C = 1.0, w_I = 0.0</p>
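      <p>This expected-relevance score can be computed directly from the classifier’s label probabilities; a minimal sketch with the label weights from the text (E = 3.0, S = 2.0, C = 1.0, I = 0.0):</p>

```python
# Expected-relevance score: weight each ESCI class probability by its
# relevance grade and sum (label weights as given in the text).

LABEL_WEIGHTS = {"E": 3.0, "S": 2.0, "C": 1.0, "I": 0.0}

def relevance_score(probs: dict) -> float:
    """probs: P(label | query + product) over the four ESCI labels."""
    return sum(probs[label] * w for label, w in LABEL_WEIGHTS.items())
```

      <p>Queries for a product are then ranked by this score when computing NDCG.</p>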
        <p>
          An astute reader may wonder why we go through the trouble of training a multi-class model as
opposed to using a ranking model. The reason is that we want the ability to generate new queries
for different relevance labels. This is important for search engines where queries/products that
have had more user-clicks are often indexed and served with priority (to avoid latency). However,
rare or new products often are not covered as they do not have any query associated with them, so
having the ability to generate queries across relevance labels for such products becomes crucial
to increase coverage. However, for completeness, we do report results from a neural re-ranker
and find it to underperform the classification model (details in section 5).
9see Table 6 for number of duplicate queries.
10We had also tried an encoder-decoder model for the classification task but found the encoder-only model to slightly
outperform it.
11We tried two other learning rates of 1e-3 and 5e-5 but found them to be under-performing.
ranking In the neural re-ranker setup, we use the RankT5 model [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] which uses T5 encoder
with pointwise ranking loss wherein the loss for each query-document pair is independently
computed. The authors train the RankT5 model on MS-MARCO which has binary relevance
judgements. We follow the same modeling setup as them, with the main difference being that we
use mt5-XL as our starting point instead of T5-Large, as used by them.12 Input sequence length is
256 with constant learning rate of 1e-4. This ranking model is used in the FINETUNE-BASED and
PROMPT-BASED QGen baselines to evaluate the downstream performance. In these baselines,
as you recall, the QGen models are trained to generate only relevant queries. To create training
data for the ranking downstream model, we need to create negative query-document pairs as
well i.e. documents which are not relevant to a query. For this, we use a dual-encoder T5-based
retriever [
          <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
          ]13 to retrieve top-35 documents for every generated query. We use all 35 as our
hard negative query-document pairs and upsample the relevant documents to have an equal label
distribution and train a RankT5 model. This model ranks the target query-product pairs so we
directly use that to compute NDCG.
        </p>
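      <p>The assembly of the pointwise ranking data described above can be sketched as follows; the data shapes and the deduplication of the positive document are simplifying assumptions.</p>

```python
# Sketch of building pointwise ranking data for RankT5: 35 retrieved hard
# negatives per generated query, with the positive upsampled to yield an
# equal label distribution.

def build_ranking_pairs(query: str, positive_doc: str,
                        retrieved_docs: list, n_neg: int = 35) -> list:
    """Returns (query, document, label) tuples with label 1 = relevant."""
    negatives = [d for d in retrieved_docs if d != positive_doc][:n_neg]
    pairs = [(query, d, 0) for d in negatives]
    # Upsample the single positive to match the number of negatives.
    pairs += [(query, positive_doc, 1)] * len(negatives)
    return pairs
```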
        <p>Below, we briefly summarize all the model variants we experiment with.</p>
      </sec>
      <sec id="sec-5-3">
        <title>4.4. All Model Variants</title>
        <p>
          First, we describe the baselines which do not use QGen:
• Random where for the target datasets the documents for a given query are randomly
ranked.
• Zero-shot (ESCI) where we train a downstream model for multi-class classification on all
of the ESCI training data and apply it directly to WANDS and Homedepot test data.
• Zero-shot (MS-MARCO) where we train a ranking model using RankT5 [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ] with the
pointwise loss function on the MS-MARCO training data and apply it directly to the
WANDS and Homedepot test data.
        </p>
        <p>Next, we describe the baselines which use existing QGen approaches:
• Prompt-Based (ESCI) where we randomly sample 8 query-product pairs from ESCI
having Exact (E) relevance label and similar to PROMPTAGATOR prompt FLAN-137B to
generate one relevant query for a new WANDS/Homedepot product. For the downstream
application, we follow the ranking setup described in subsection 4.3.
• FineTune-Based (MS-MARCO) where we finetune the QGen model on only those
query-passage pairs from MS-MARCO that have the Relevant label. For every new target product,
we generate one relevant query and use the retriever to retrieve 35 documents as negative
examples following the ranking setup.</p>
        <p>
          Finally, we describe our adaptations of the above QGen approaches:
• Finetune-Based-LabelCond (ESCI) where we finetune the QGen model on all ESCI
examples, and for every new target product generate queries for all four relevance labels.
12Due to hardware restrictions we could not train the mt5-XXL model variant with their code setup.
13We used the mt5-BASE model finetuned with the unsupervised objective proposed by Izacard et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], based on
the t5x-retrieval code base: https://github.com/google-research/t5x_retrieval.
        </p>
        <p>For the subsequent downstream model, we initialize it with the multi-class classification
model trained on all ESCI data (which we had used in our zero-shot setting), and further
finetune it on the synthetic data, following the classification setup.
• FineTune-Based-LabelCond (MS-MARCO) where we finetune the QGen model on all
MS-MARCO examples, and for every new product generate two queries, one for each of
the two relevance labels. We use the ranking setup to train the downstream model and
initialize it with the MS-MARCO-finetuned-ranking model (used in the zero-shot setting).
• Prompt-Based-LabelCond (ESCI) where we prompt FLAN-137B with 8 ESCI examples
comprising 2 examples per relevance label. For every new target product, we
generate queries for all four relevance labels, and follow ranking setup to train the downstream
model.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5. Results and Discussion</title>
      <p>In this section, we present the results of the two major QGen families (finetune-based vs
prompt-based), comparing them with the cross-domain transfer learning approach. Since WANDS is a more
recent and challenging dataset in comparison to HomeDepot14, we focus on WANDS for our
discussion. We report results for WANDS in Table 4 and for HomeDepot in Table 5. Here are our
main findings:</p>
      <p>Zero-shot Transfer Learning wins over any QGen! Overall, we find that zero-shot transfer
learning outperforms all QGen approaches, both vanilla and label-conditioned. This is unlike
existing works such as INPARS and PROMPTAGATOR, where QGen approaches give the
best downstream performance. This could be attributed to the difficulty of the downstream task,
which in this case is a nuanced relevance prediction task, while the existing works focus on binary
relevance which is much simpler.</p>
      <p>Label-conditioned QGen wins over vanilla QGen! Within the QGen approaches, we find
that our adaptation of conditioning on all relevance labels outperforms the vanilla versions which
do not. From the results of FINETUNE-BASED and FINETUNE-BASED-LABELCOND trained on
MS-MARCO, we find that exposing the QGen models to all labels (in the case of MS-MARCO
they are binary) performs better by +3.3 NDCG@10 points. Therefore, we finetune with all labels
on a related dataset (ESCI) for WANDS and find that it outperforms even the MS-MARCO-based
QGen models. For prompt-based QGen models, we find that its label-conditioned counterpart
underperforms its vanilla variant. However, the prompt-based vanilla variant is far behind (-8.3
NDCG@10 points) the finetune-based vanilla variant to begin with.</p>
      <p>
        In-domain training is important! We find that for both transfer learning and QGen
approaches, transferring from a related domain is important for downstream performance. For
instance, within the transfer learning models, the model trained on ESCI (zero-shot (ESCI))
gives the best downstream performance, even outperforming the model trained on MS-MARCO
(zero-shot (MS-MARCO)), which is trained on nearly 10 times larger training data than ESCI.
This again emphasizes that having a related dataset to transfer from is essential for
downstream performance, similar to Gururangan et al. [
        <xref ref-type="bibr" rid="ref25">25</xref>
        ]. Similarly, within QGen approaches, the
label-conditioned model trained on ESCI (Finetune-Based-LabelCond (ESCI)) outperforms its
MS-MARCO counterpart. Clearly, relatedness of the target dataset to the training dataset is also
important for QGen model training.
14We refer the reader to Chen et al. [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ] for more information.
      </p>
      <p>Below we discuss the probable reasons for the shortcomings of QGen approaches. We
inspect three QGen models which have been trained with all labels, namely,
PROMPT-BASED-LABELCOND (ESCI), FINETUNE-BASED-LABELCOND (ESCI) and
FINETUNE-BASED-LABELCOND (MS-MARCO), for the number of duplicate queries generated by each model.
Specifically, a duplicate query here refers to the QGen model producing the same query across
different relevance labels for the same product. In Table 6 we report the results for WANDS.
Recall that for each of the 42,994 WANDS products, the QGen models trained on ESCI
generated 171,976 queries, one for each of the four relevance labels. For QGen models using
MS-MARCO, we generate 85,988 queries, one for each of the two relevance classes. In Table 6
we find that the FINETUNE-BASED-LABELCOND (ESCI) QGen model produces duplicate
queries for 81% of the products, which suggests that simply prepending label information in
the input context is insufficient for the model to learn how to generate discriminative queries.
We would also like to highlight that this happens despite exposing the QGen model
to the entire ESCI training data of 1.6 million examples, in which only 5 of the 1.1
million products have duplicate queries. In Table 7 we report the distribution of generated queries
across different labels, after applying the filtration step (described in subsection 4.2) where
we remove the duplicate queries. Clearly, noise in the synthetic queries causes errors in the
subsequent downstream models. Interestingly, despite the PROMPT-BASED-LABELCOND (ESCI)
and FINETUNE-BASED-LABELCOND (MS-MARCO) models having more valid
queries, they still underperform FINETUNE-BASED-LABELCOND (ESCI).</p>
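The duplicate-query statistic above can be computed with a single scan over the generated (product, label, query) triples; the data layout below is illustrative, not the exact format of our pipeline.

```python
from collections import defaultdict

def duplicate_product_fraction(generated):
    """generated: iterable of (product_id, label, query) triples
    produced by a label-conditioned QGen model.

    A product counts as having duplicates when the model emits the
    same query string for two or more different relevance labels."""
    queries_by_product = defaultdict(list)
    for product_id, _label, query in generated:
        queries_by_product[product_id].append(query.strip().lower())
    duplicated = sum(
        1 for qs in queries_by_product.values() if len(set(qs)) < len(qs)
    )
    return duplicated / len(queries_by_product)

# Toy example: one of two products repeats its query across labels.
rows = [
    ("p1", "E", "acacia wood bed frame"),
    ("p1", "S", "acacia wood bed frame"),   # duplicate across labels
    ("p2", "E", "storage bed"),
    ("p2", "S", "wooden bed"),
]
print(duplicate_product_fraction(rows))  # → 0.5
```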
      <p>
        The reason why FINETUNE-BASED-LABELCOND (MS-MARCO) underperforms its
ESCI counterpart could be attributed to a) the difference in domain and b) the style of queries. For
instance, queries from MS-MARCO-trained QGen models are more formal what-style questions,
while queries from ESCI-trained QGen models are more informal and similar in style to the
gold queries. Although PROMPT-BASED-LABELCOND (ESCI) has far fewer duplicate queries,
it severely underperforms, probably because of poor overall quality. In Table 8 we present the
generated queries from the different QGen models for a product. We also provide the user-issued, or
gold, query from the WANDS test set for the same product.15 For PROMPT-BASED-LABELCOND
(ESCI) we see that the query for the highest relevance label, i.e. ‘E’, focuses on the entity bed
frame with free storage plans, while from the product description we know that it is mainly
about a bed frame which is made from acacia wood and additionally has storage. Nowhere
does the product mention storage plans. In fact, the query for the next relevance label ‘S’
is more relevant than the one for ‘E’. Clearly, exposing the models to only 8 examples, as
proposed by PROMPTAGATOR [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], is insufficient in comparison to the 1.6 million examples
used by FINETUNE-BASED-LABELCOND (ESCI), especially for the WANDS dataset. On the
other hand, the PROMPTAGATOR work found that exposing models to only 8 task-specific
examples for QGen outperformed finetuned models which were trained on (100)
MS-MARCO examples. We note that Dai et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] also apply an additional
consistency filtration step to the generated queries, wherein they retain only those queries which
are answerable from the passage from which they were generated. They find that adding this
round-trip consistency adds 2.5 points (avg.), but for smaller datasets it negatively impacts
the downstream performance. Therefore, we experimented with round-trip consistency for the
FINETUNE-BASED-LABELCOND (ESCI) model for WANDS, which is the best among all QGen
variants. Specifically, we use the downstream relevance prediction model trained on ESCI (i.e.
the model used for zero-shot transfer learning) and re-label the generated queries.16 We first
find that the predicted label for 49% of the generated queries does not match the label which was
used to generate the query (i.e. the desired label). We then use the predicted label as the final
label for each query and train a downstream model as before. We find this results in only a +1 point
improvement.17
15Note that not all products in the test set have queries for each relevance label; we simply sampled from those
products which do, for qualitative evaluation purposes.
      </p>
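The round-trip consistency step described above amounts to re-scoring each synthetic (query, product) pair with a relevance model trained on real data and adopting the predicted label. The sketch below is a hypothetical illustration: the field names and the `relevance_model` callable are assumptions, not our exact implementation.

```python
def round_trip_relabel(synthetic, relevance_model):
    """Re-label synthetic queries via round-trip consistency.

    synthetic: list of dicts with 'query', 'product', 'desired_label'
    relevance_model: callable (query, product) -> predicted label,
    e.g. a relevance predictor trained on real (here, ESCI) data.

    Returns the re-labeled examples and the fraction of queries whose
    predicted label disagrees with the label used to generate them."""
    relabeled, mismatches = [], 0
    for ex in synthetic:
        predicted = relevance_model(ex["query"], ex["product"])
        if predicted != ex["desired_label"]:
            mismatches += 1
        relabeled.append({**ex, "label": predicted})
    return relabeled, mismatches / len(synthetic)

# Dummy relevance model for illustration: always predicts 'E'.
synthetic = [
    {"query": "acacia bed frame", "product": "p1", "desired_label": "E"},
    {"query": "metal desk", "product": "p2", "desired_label": "S"},
]
relabeled, mismatch_rate = round_trip_relabel(synthetic, lambda q, p: "E")
print(mismatch_rate)  # → 0.5
```

In our experiment this mismatch rate was 49%, i.e. nearly half of the generated queries were not faithful to the label they were conditioned on.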
      <p>
        This highlights that even though QGen techniques offer a promising solution for adapting
models to new domains, they need further investigation and analysis to make them more effective
across different tasks.
16Dai et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] use the downstream model trained on synthetic data; instead, we use a model trained on good-quality
ESCI data.
      </p>
      <p>17Given that, for WANDS in Table 4, PROMPT-BASED-LABELCOND (ESCI) is almost 21 points behind its
finetuned counterparts, we did not apply this additional step, which would require additional model training.</p>
    </sec>
    <sec id="sec-7">
      <title>6. Related Work</title>
      <p>
        Synthetic Question Generation has come a long way from relying on simple but rigid heuristics
[
        <xref ref-type="bibr" rid="ref26">26</xref>
        ] to using neural-network approaches, specifically seq-to-seq model [
        <xref ref-type="bibr" rid="ref27 ref28 ref29 ref30">27, 28, 29, 30</xref>
        ], to now
even leveraging large language models (LLMs) through prompting [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Much of the work in
this area has focused on question generation in the context of QA systems. Below we describe
some of the representative works in this area.
      </p>
      <p>
        QGen for QA In the pre-transformer era, seq-to-seq models trained with attention read an
input sentence and generated a question with respect to an answer contained in that
sentence, e.g. for factoid QA [
        <xref ref-type="bibr" rid="ref28 ref29">28, 29</xref>
        ]. Du and Cardie [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ] go beyond using single sentence context
(as Du et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ] note that 30% of SQuAD questions span answers beyond a single sentence) for
generating questions. Transformers [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ] subsequently changed the game: with their ability to attend to specific parts of the text,
QGen models have further improved. For instance,
Lopez et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] use a GPT-2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] language model to train a question-generation model using
the passage as input. They also train an answer-aware variant where they mark start and end
of the answer span with special tokens in the context. However, they find that the answer-aware
variant underperforms the answer-unaware model on question generation (in terms of the
BLEU metric). They hypothesize that this is because there is no explicit mechanism to
inform the model on how to use the answer information, somewhat similar to what we find in our
label-conditioned models, which also seem not to use the label information effectively.
Ünlü Menevşe et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] explore question generation for the spoken QA task. More recently, Ko
et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], Chakrabarty et al. [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Cao and Wang [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] propose approaches to generate more
open-ended questions, whose answers often span multiple sentences and could be long-form.
Cao and Wang [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ] create a question-type ontology to guide the model to generate a particular
type of question. They essentially concatenate the question-type with the multi-sentence input to
generate the question. In the hope of controlling the question generation, they train it jointly with
question focus prediction which uses semantic graphs. In principle the question focus and label
conditioning are related as in our case, question focus is the conditioned label, however, their
main goal of work is to generate questions which are diverse and illicit complex reasoning or
curiosity [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. It is not evaluated on improving any downstream tasks.
      </p>
      <p>
        Label-Conditioned QGen Some previous works have looked at label conditioning in QGen
models for classification tasks. Kumar et al. [33], Yang et al. [34] find that prepending
class labels to the input text is quite effective for class-conditional text generation and thereby data
augmentation. They show the effectiveness of this approach for classification tasks (e.g. SST-2
with binary sentiment, SNIPS with 7 intents, TREC with six classes, SNLI, and commonsense
reasoning) across different pretrained LMs including auto-encoder LM (BERT Devlin et al. [35]),
auto-regressive LM (GPT-2 Radford et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) and pretrained seq-to-seq LM (BART [36]). In
this work, we look at fine-grained relevance prediction, where the task is more difficult in that the
multiple classes have an inherent ordering, and therefore it is harder for QGen models to produce
discriminative queries across such fine-grained labels.
      </p>
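The label conditioning discussed above amounts to serializing the relevance label together with the product text before feeding it to the QGen model. A minimal sketch follows; the field names and separators are illustrative assumptions, not the paper's exact serialization format.

```python
def build_qgen_input(label, title, description):
    """Prepend the relevance label to the product text, as in
    label-conditioned QGen (the 'relevance:'/'title:'/'description:'
    markers here are illustrative assumptions, not the exact format)."""
    return f"relevance: {label} title: {title} description: {description}"

# One conditioned input per (product, label) pair; e.g. ESCI's
# four labels would yield four inputs for the same product.
example = build_qgen_input(
    "E", "Acacia Bed Frame", "Solid acacia wood bed with under-bed storage."
)
print(example)
```

The QGen model is then trained (or prompted) to emit a query whose relevance to the product matches the prepended label; as the duplicate-query analysis in section 5 shows, this signal alone is often too weak for the model to produce label-discriminative queries.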
    </sec>
    <sec id="sec-8">
      <title>7. Limitations and Next Steps</title>
      <p>From the above results, it is apparent that QGen approaches, although offering a promising
direction especially for zero-shot settings, need considerable work to outperform transfer learning.
Clearly, simply adding label information in the input context does not provide a sufficient signal
for the model to generate discriminative queries. We need to explicitly enforce this signal
throughout the QGen training process. In this work, we only generate one query, but using beam
search we could generate multiple queries for a given product-label combination, resulting in
a diverse collection. Another challenge in working with QGen approaches is that the typical
strategy for evaluating the synthetic data is to evaluate it on a downstream task, requiring two
additional steps after training a QGen model: applying the QGen model for generating queries
and then training a downstream task model, to understand the effect of the synthetic data. So if a
researcher wanted to experiment with multiple QGen models, they would have to run three times
the number of experiments to determine which QGen model is best, which is a waste of
resources and time. This means that we need an intrinsic evaluation metric that
correlates well with the downstream task performance. Our next steps are focused on addressing these
issues.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>We would like to thank the anonymous reviewers for the valuable feedback and suggestions. We
would also like to thank Honglei Zhuang and Rolf Jagerman for helping us run and adapt the RankT5
model for our experiments.</p>
      <p>[32] S. Cao, L. Wang, Controllable open-ended question generation with a new question type
ontology, in: Proceedings of the 59th Annual Meeting of the Association for Computational
Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), 2021, pp. 6424–6439. URL: https://aclanthology.org/2021.acl-long.502. doi:10.18653/v1/2021.acl-long.502.
[33] V. Kumar, A. Choudhary, E. Cho, Data augmentation using pre-trained transformer models,
in: Proceedings of the 2nd Workshop on Life-long Learning for Spoken Language Systems,
2020, pp. 18–26. URL: https://aclanthology.org/2020.lifelongnlp-1.3.
[34] Y. Yang, C. Malaviya, J. Fernandez, S. Swayamdipta, R. L. Bras, J.-P. Wang, C. Bhagavatula,
Y. Choi, D. Downey, Generative data augmentation for commonsense reasoning, arXiv
preprint arXiv:2004.11546 (2020).
[35] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[36] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov,
L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language
generation, translation, and comprehension, in: Proceedings of the 58th Annual Meeting
of the Association for Computational Linguistics, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Choudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Katariya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          , Anthem:
          <article-title>Attentive hyperbolic entity model for product search</article-title>
          ,
          <source>in: Proceedings of the 15th ACM International Conference on Web Search and Data Mining</source>
          ,
          <year>2022</year>
          , p.
          <fpage>161</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , et al.,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog 1</source>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          . URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Barham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gehrmann</surname>
          </string-name>
          , et al.,
          <article-title>Palm: Scaling language modeling with pathways</article-title>
          ,
          <source>arXiv preprint arXiv:2204.02311</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          , et al.,
          <article-title>Llama: Open and efficient foundation language models</article-title>
          ,
          <source>arXiv preprint arXiv:2302.13971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Valero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bandyopadhyay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Biswas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Subbian</surname>
          </string-name>
          ,
          <article-title>Shopping queries dataset: A large-scale ESCI benchmark for improving product search</article-title>
          , arXiv (2022). arXiv:2206.06588.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng, MS MARCO:
          <article-title>A human generated machine reading comprehension dataset</article-title>
          ,
          <source>in: Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches</source>
          <year>2016</year>
          , volume
          <volume>1773</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2016</year>
          . URL: https://ceur-ws.org/Vol-1773/CoCoNIPS_2016_paper9.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ünlü Menevşe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Manav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Arisoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Özgür</surname>
          </string-name>
          ,
          <article-title>A framework for automatic generation of spoken question-answering data</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>4659</fpage>
          -
          <lpage>4666</lpage>
          . URL: https://aclanthology.org/2022.findings-emnlp.342.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Andor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pitler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M. Collins,
          <article-title>Synthetic QA corpora generation with roundtrip consistency</article-title>
          ,
          <source>in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6168</fpage>
          -
          <lpage>6173</lpage>
          . URL: https://aclanthology.org/P19-1620. doi:10.18653/v1/P19-1620.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          , I. Korotkov,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>Zero-shot neural passage retrieval via domain-targeted synthetic question generation</article-title>
          ,
          <source>in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:</source>
          Main Volume,
          <year>2021</year>
          , pp.
          <fpage>1075</fpage>
          -
          <lpage>1088</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.92. doi:10.18653/v1/2021.eacl-main.92.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bakalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , Promptagator:
          <article-title>Few-shot dense retrieval from 8 examples</article-title>
          , arXiv preprint arXiv:2209.11755 (2022).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakrabarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Muresan</surname>
          </string-name>
          , CONSISTENT:
          <article-title>Open-ended question generation from news articles</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2022</year>
          ,
          <year>2022</year>
          , pp.
          <fpage>6954</fpage>
          -
          <lpage>6968</lpage>
          . URL: https://aclanthology.org/2022.findings-emnlp.517.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Lopez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. B.</given-names>
            <surname>Cruz</surname>
          </string-name>
          , C. Cheng,
          <article-title>Transformer-based end-to-end question generation</article-title>
          ,
          <source>arXiv preprint arXiv:2005.01107 4</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Durrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Inquisitive question generation for high level text comprehension</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6544</fpage>
          -
          <lpage>6555</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.530. doi:10.18653/v1/2020.emnlp-main.530.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bonifacio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Abonizio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <article-title>InPars: Data augmentation for information retrieval using large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2202.05144</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>MS MARCO: A human generated machine reading comprehension dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1611.09268</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Jagerman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bendersky</surname>
          </string-name>
          ,
          <article-title>RankT5: Fine-tuning T5 for text ranking with ranking losses</article-title>
          ,
          <source>arXiv preprint arXiv:2210.10634</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Baltrunas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <article-title>WANDS: Dataset for product search relevance assessment</article-title>
          ,
          <source>in: Proceedings of the 44th European Conference on Information Retrieval</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>141</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>mT5: A massively multilingual pre-trained text-to-text transformer</article-title>
          ,
          <source>in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>483</fpage>
          -
          <lpage>498</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.41. doi:10.18653/v1/2021.naacl-main.41.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lester</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <article-title>Finetuned language models are zero-shot learners</article-title>
          ,
          <source>arXiv preprint arXiv:2109.01652</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. B.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Large dual encoders are generalizable retrievers</article-title>
          ,
          <year>2021</year>
          . arXiv:2112.07899.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. H.</given-names>
            <surname>Ábrego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Hall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>SentenceT5: Scalable sentence encoders from pre-trained text-to-text models</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: ACL 2022</source>
          , Association for Computational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>1864</fpage>
          -
          <lpage>1874</lpage>
          . URL: https://aclanthology.org/2022.findings-acl.146. doi:10.18653/v1/2022.findings-acl.146.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Caron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hosseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bojanowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Towards unsupervised dense information retrieval with contrastive learning</article-title>
          ,
          <source>arXiv preprint arXiv:2112.09118</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gururangan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marasović</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swayamdipta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Don't stop pretraining: Adapt language models to domains and tasks</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>8342</fpage>
          -
          <lpage>8360</lpage>
          . URL: https://aclanthology.org/2020.acl-main.740. doi:10.18653/v1/2020.acl-main.740.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>N. A.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heilman</surname>
          </string-name>
          ,
          <article-title>Automatic factual question generation from text</article-title>
          ,
          <source>Technical Report CMU-LTI-11-004</source>
          , Carnegie Mellon University, Language Technologies Institute,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>I. V.</given-names>
            <surname>Serban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García-Durán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chandar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Courville</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <article-title>Generating factoid questions with recurrent neural networks: The 30M factoid question-answer corpus</article-title>
          ,
          <source>in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>588</fpage>
          -
          <lpage>598</lpage>
          . URL: https://aclanthology.org/P16-1056. doi:10.18653/v1/P16-1056.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Neural question generation from text: A preliminary study</article-title>
          ,
          <source>in: Natural Language Processing and Chinese Computing</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Learning to ask: Neural question generation for reading comprehension</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vancouver, Canada,
          <year>2017</year>
          , pp.
          <fpage>1342</fpage>
          -
          <lpage>1352</lpage>
          . URL: https://aclanthology.org/P17-1123. doi:10.18653/v1/P17-1123.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>X.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Cardie</surname>
          </string-name>
          ,
          <article-title>Harvesting paragraph-level question-answer pairs from Wikipedia</article-title>
          ,
          <source>in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1907</fpage>
          -
          <lpage>1917</lpage>
          . URL: https://aclanthology.org/P18-1177. doi:10.18653/v1/P18-1177.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Controllable open-ended question generation with a new question type</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>