=Paper=
{{Paper
|id=Vol-2621/CIRCLE20_06
|storemode=property
|title=Using BERT and BART for Query Suggestion
|pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_06.pdf
|volume=Vol-2621
|authors=Agnès Mustar,Sylvain Lamprier,Benjamin Piwowarski
|dblpUrl=https://dblp.org/rec/conf/circle/MustarLP20
}}
==Using BERT and BART for Query Suggestion==
Agnès Mustar, Sorbonne Université, CNRS, LIP6, F-75005 Paris, France, agnes.mustar@lip6.fr

Sylvain Lamprier, Sorbonne Université, CNRS, LIP6, F-75005 Paris, France, sylvain.lamprier@lip6.fr

Benjamin Piwowarski, Sorbonne Université, CNRS, LIP6, F-75005 Paris, France, benjamin.piwowarski@lip6.fr
ABSTRACT

Transformer networks have recently been successfully applied to a very large range of NLP tasks. Surprisingly, they have never been employed for query suggestion, although their sequence-to-sequence architecture makes them particularly appealing for this task. Query suggestion requires modeling behaviors during complex search sessions in order to output useful next queries that help users complete their intent. We show that pre-trained transformer networks exhibit very good performance for query suggestion on a large corpus of search logs, that they are more robust to noise, and that they have a better understanding of complex queries.

CCS CONCEPTS

• Information systems → Query suggestion; Query reformulation; Query representation; Language models; • Computing methodologies → Learning latent representations.

KEYWORDS

Query suggestion, Transformers, User modeling

"Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0)."

1 INTRODUCTION

To explore the space of potentially relevant documents, users interact with search engines through queries. This process can be improved, since when looking for information, users may have difficulty expressing their needs at first, and hence may have to reformulate their queries multiple times to find the documents that satisfy their needs. This problem is exacerbated when the user is accomplishing a complex search task.

Among the different ways to help users in exploring the information space, modern search engines provide a list of query suggestions, which help users by either following their current search direction (e.g. by refining the current query) or by switching to a different aspect of a search task [28] if the proposed queries match a different aspect of the user information need. Another use of query suggestions is to help the search engines by providing ways to diversify the presented information [36].

To suggest useful queries, most models build upon web search logs, where the actions of a user (queries, clicks, and timestamps) are recorded. User sessions are then extracted by segmenting the web search log. The first query suggestion models exploited the query co-occurrence graph extracted from user sessions [14, 15]: if a query is often followed by another one, then the latter is a good potential reformulation. However, co-occurrence based models suffer from data sparsity, for instance when named entities are mentioned, and lack coverage for rare or unseen queries. Moreover, these models are difficult to adapt when using a wider context than the last submitted query [7].

More recently, methods based on recurrent neural networks (RNNs) have been proposed to exploit longer dependencies between queries [1, 2, 7, 37, 40]. RNNs do so by keeping track of the user in a representation/vector space which depends on all the actions performed by the user so far. Such models have improved the quality of suggestions by capturing a broader context, but are limited by the relatively short span of interaction that RNNs are able to capture.

Common NLP tasks [22, 31, 34, 38, 39, 41] have benefited from the recently proposed Transformer architecture [39]. Transformer networks, such as BERT [8], capture long-range dependencies between terms by refining each token representation based on its context before handling the task at hand. They are thus a particularly interesting architecture for query suggestion, since query terms are often repeated throughout a session, and their interaction needs to be captured to build a faithful representation of the current user state.

In this work, we compare and analyse the results of pre-trained transformers for query suggestion to those of RNN-based models.

2 RELATED WORK

A large number of works have focused on the task of query suggestion [29], and related tasks such as query auto-completion [26], based on search logs to extract query co-occurrences [14, 15]. From a given single query formulated by a user, the goal is to identify related queries from logs, and to suggest reformulations based on what follows in the retrieved sessions, assuming subsequent queries are refinements of former ones [33]. These works rely on several methods, such as using term co-occurrence [14], using users' click information [25], using word-level representation [4], capturing higher order collocation in query-document sub-graphs [3], clustering queries from logs [33], or defining hierarchies of related search tasks and sub-tasks [11, 24]. Finally, some methods prevent query sparsity via reformulations using NLP techniques [29]. [15] proposes an end-to-end system to generate synthetic suggestions, based on query-level operations and information collected from available text resources.

However, such log-based methods suffer from data sparsity and are not effective for rare or unseen queries [37]. In addition, these approaches are usually context-agnostic, focusing on matching candidates with a single query. But when the query comes in a session with some previous attempts at finding relevant information, it is crucial to leverage such context to capture the user intent and understand the user's reformulation behavior. Note the approach in [5],
which alleviates the problem by relating the user sessions to paths in a concept tree, also suffers from data sparsity issues.

Instead of trying to directly predict a query, it is possible to learn how to transform it. Most approaches operate at a high level, with term retention, addition and removal as the possible reformulation actions [20, 35]. [20] consider these actions as feedback from the user, e.g. a term that is retained during the whole session should be considered as central for the user intent. Depending on the previous sequence of users' actions, these methods seek to predict the next action. These methods are interesting because they model the user behavior in a session. However, they fail to capture the semantics of words, which is essential.

To cope with the limitations of log-based and action-based methods, some works propose to define probabilistic models for next query prediction [12]. Due to their ability to process sequences of variable size, Recurrent Neural Networks (RNNs) have been widely used for text modeling and generation tasks, with an encoder that processes an input sequence by updating a representation in <math>\mathbb{R}^n</math>, and a decoder that generates the target sequence from the last computed representation. Some works have adapted these ideas to a sequence of queries [7, 16, 37]. HRED [37] proposes to use two encoders: a query-level encoder, which encodes each query of the user session independently, and a session-level encoder, which deals with the sequence of query representations. Instead of using a hierarchical representation, ACG [7] relies on an attention mechanism that gives a different importance to words and queries in the representation. Another improvement of ACG is to deal with Out-Of-Vocabulary (OOV) words through the use of a copy mechanism, which allows the model to pick tokens from the past user queries rather than generating them with the standard RNN decoding.

Other RNN-based approaches have also been recently proposed, such as [40], which leverages user clicks and document representations to specify the user intent [1, 2], or [16], which integrates click-through data into homomorphic term embeddings to capture semantic reformulations. In this work, as a starting point, we restrict ourselves to queries in sessions as input data, but other sources of information can be added to such models.

In parallel, the Transformer architecture, a recent and effective alternative to RNN models introduced in [39], was successfully applied to a broad range of NLP applications, such as Constituency Parsing and Automatic Translation [39], Semantic Role Labeling [38], Machine Reading Comprehension [22], and Abstractive Text Summarization [34].

The Transformer has also been used several times in the field of Information Retrieval. [27] and [10] applied transformers to infer a query from a document. [27] used the pre-trained transformer BERT, and showed that expanding the document with the predicted query improves the ad hoc retrieval results, while [10] presented a more complex seq2seq architecture: the encoder included a Graph Convolutional Network and an RNN, and the decoder is a transformer. Transformers have also been used for ad hoc retrieval [6, 23, 31, 41]. [23] used BERT features in existing neural ranking models, and outperformed state-of-the-art ad hoc ranking scores.

3 TRANSFORMERS FOR QUERY SUGGESTION

In this section, we first present the transformer network architecture and pre-trained transformers before describing how we use them for query suggestion.

3.1 The Transformer architecture

The transformer architecture was introduced by [39]. It is composed of parametric functions that successively refine the representation of sequences, both for the encoder and the decoder. In our case, the encoder is used to represent the session, and the decoder to generate the next query.

Each layer of the encoder or the decoder transforms the sequence <math>x</math> composed of <math>n</math> vectors <math>x_1, \ldots, x_n</math> into a sequence <math>y_1, \ldots, y_n</math> of the same length, through an attention over a context sequence <math>c</math> composed of <math>n</math> vectors <math>c_1, \ldots, c_n</math>. Each time, the central operation is an attention mechanism; other operations are conducted to ensure a stable and efficient learning process, and are detailed in [39]. Special tokens are used to separate texts ([SEP]) or perform classification ([CLS]).

3.2 Pre-trained Transformers

Transformer models have a lot of parameters, and can take a long time to train. Recently, multiple pre-trained models trained on large datasets have been released [8, 21, 32, 42]. We compare the results of a fully trained transformer to those of two pre-trained models that we fine-tune: BERT [8] and BART [21].

BERT. The Bidirectional Encoder Representations from Transformers [8] have been trained on a large dataset, the BooksCorpus [43], on two tasks, namely predicting some masked tokens of the input, and predicting whether one sentence follows another. It is a state-of-the-art model, which is used for different tasks. BERT corresponds to the encoder part only, so we have to train a decoder for our specific task.

BART. The Bidirectional and Auto-Regressive Transformer [21] is made of an encoder and a decoder. It is trained on the same data as BERT, but for multiple tasks: token masking, token deletion, text infilling, sentence permutation, and document rotation. Because it has a decoder and is trained on these tasks, the authors claim that BART is better than BERT for text generation. They also released fine-tuned versions of BART for other tasks. We use the weights of the model fine-tuned on CNN/DM, a news summarization dataset, because as a text generation task it was the closest to the query suggestion task.

3.3 Using Transformer networks for Query Suggestion

3.3.1 Problem Setting. Let us consider a session <math>S = (Q_1, \ldots, Q_{|S|})</math> as a sequence of <math>|S|</math> queries, where every <math>Q_i = (w_{i,1}, \ldots, w_{i,|Q_i|})</math> is a sequence of <math>|Q_i|</math> words. The goal of query suggestion is to suggest the most relevant query for the user intent represented by the session. However, no perfect ground truth can be easily established for such problems: defining the perfect query for a given specific under-defined need, given a sequence of past queries, is
an intractable problem, which requires considering very diverse (in nature and complexity) search tasks, and depends on the user state, the IR system and the available information in the targeted collection. Following other works on model-based query suggestion, we thus focus on predicting the next query within an observed session.

We suppose that our dataset is composed of pairs <math>(S, \check{Q})</math> where <math>\check{Q}</math> is the query following a sequence of queries <math>S</math>. Our aim is thus to find the parameters <math>\theta</math> that maximize the log probability of observing the dataset:

<math display="block">\mathcal{L}(S; \theta) = \sum_{(S,\check{Q})} \log p_\theta(\check{Q} \mid S) = \sum_{(S,\check{Q})} \sum_{t=1}^{|\check{Q}|} \log p_\theta(w_t \mid Q_1, \ldots, Q_{|S|}) \qquad (1)</math>

where <math>(w_1, \ldots, w_{|\check{Q}|})</math> are the tokens of the query <math>\check{Q}</math>. We describe below how we use the transformer. We tried to build different architectures based on the transformer, but the simplest one worked the best throughout all our pilot experiments.

Input. For a session, the input of the transformer is simply the concatenation of all the words of all the queries separated by a token [SEP], i.e. the [SEP] is used to mark the beginning of a new query in the session:

<math display="block">S = [\, \texttt{[SEP]}\ \underbrace{w_{1,1} \ldots w_{1,|Q_1|}}_{Q_1}\ \texttt{[SEP]} \ldots \texttt{[SEP]}\ \underbrace{w_{|S|,1} \ldots w_{|S|,|Q_{|S|}|}}_{Q_{|S|}}\ \texttt{[SEP]} \,]</math>

This sequence is then transformed by using the token embeddings added to positional embeddings (one per distinct position); this is how Transformers recover the sequence order [39].

3.3.2 BERT. We use the pre-trained model BERT [8], and extract the output of each encoder layer. We sum the output of the last layer with the average and the maximum over these layers (a choice based on https://github.com/hanxiao/bert-as-service and on our own preliminary experiments). For each token of the input, we have a contextualized embedding of size 768 given by BERT. For the decoding part, we use a transformer decoder and a feedforward network. At the beginning of the training, the encoder is frozen and the decoder is trained. Then we use the gradual unfreezing method, as recommended by [13]: when the loss stabilizes, we unfreeze the last frozen layer of the encoder, until all the layers are fine-tuned.

3.3.3 BART. The architecture is complete for text generation: it has an encoder and a decoder. We also use gradual unfreezing to fine-tune the model, but starting from the last layer of the pre-trained decoder.

4 EXPERIMENTS

Datasets. We conduct our experiments on the AOL dataset. It consists of 16 million real search log entries from the AOL Web Search Engine for 657,426 users. Following [37], we delimit sessions with a 30-minute timeout. The queries submitted before May 1, 2006, are used as the training set, and the remaining four weeks are split into validation and test sets, as in [37]. The queries are processed by removing all non-alphanumeric characters and lowercasing, following [37]. After filtering, there are 1,708,224 sessions in the training set, 416,450 in the test set and 416,450 in the validation set.

Compared Models. In our experiments we compare RNN-based approaches against fine-tuned transformer models. The RNN models are HRED [37] and ACG [7], described in section 2. The pre-trained models that we fine-tune are BERT [8] and BART [21].

In order to isolate possible causes of performance variations, the optimization of all models is performed on the training sets of sessions with the ADAM optimizer [18]. All hyper-parameters are tuned via grid-search on the validation dataset.

As we want to encourage the models trained with a word tokenizer to generate tokens present in the vocabulary, we follow [17] and apply a penalty on the “OOV” token. To compute the metrics, we ignored the OOV token that can be generated by HRED or ACG; queries with only OOV words are skipped.

Query suggestion metrics. As a metric to evaluate generated queries compared to the target ones, we first use the classical metric BLEU [30], which corresponds to the rate of generated n-grams that are present in the target query. We refer to BLEU-1, BLEU-2, BLEU-3 and BLEU-4 for 1-grams, 2-grams, 3-grams and 4-grams respectively. We also calculate the exact match EM (equal to 1 if the predicted query is exactly the observed one, 0 otherwise).

As EM can be too harsh, we also use a metric, <math>\mathrm{Sim}_{\mathrm{extrema}}</math> [9], which computes the cosine similarity between the representation of the candidate query and that of the target one. The representation of a query <math>q</math> (either target or generated) is a component-wise maximum of the representations of the words making up the query (we use the GoogleNews embeddings, following [37]). The extrema vector method has the advantage of taking into account words carrying information, instead of other common words of the queries.

However, this component-wise maximum method might excessively degrade the representation of a query. As an alternative, we propose to compute <math>\mathrm{Sim}_{\mathrm{pairwise}}</math> as the mean value of the maximum cosine similarity between each term of the target query and all the terms of the generated one.

Finally, as discussed in section 3.3, there is no ground truth on what the best queries to suggest are. Instead, for each generation metric, we consider a standard max-pooling from the top-10 queries generated by the models. More precisely, for each model, we first generate (through a beam search with <math>K = 20</math>) 10 queries to suggest to the user given the context. The reported value for each metric (BLEU, EM, <math>\mathrm{Sim}_{\mathrm{extrema}}</math> and <math>\mathrm{Sim}_{\mathrm{pairwise}}</math>) is the maximum score over the 10 different generated queries. This is usually employed for assessing the performance of a probabilistic model w.r.t. a single target (see e.g., [19]) and corresponds to a fair evaluation of models that try to find a good balance between quality and diversity.

4.1 Results

Table 1 reports the results obtained by all models on generated queries. We added two further indicators, the ratio of new words, and the rank of the prediction in the beam search if the predicted query appears in the context (or 10 if it doesn’t, so values can be averaged).

From a high level point of view, we see that pre-trained transformers perform better than the RNN-based models, HRED and ACG, on all metrics, and that BART performs better than BERT.

We also note that models have different tendencies to copy one of the queries in the session. Note that this is a standard behavior: 6% of the AOL queries to predict are among the previous queries of the session. So it is not surprising that more powerful models
learn to copy; transformer models thus have a stronger tendency to repeat a seen query than ACG or HRED (lower Repetition Rank). This tendency is explained by their ability to retrieve information at arbitrary positions in the input. While BERT generates many new words (0.927) with a repetition rank (6.713) close to that of the RNN models, BART ranks repeated queries much higher in the beam (repetition rank of 2.196) and adds fewer new words (0.684).

{| class="wikitable"
|+ Table 1: Results on the AOL dataset. ★ indicates significant gains (p < 0.05) compared to BERT. † indicates significant gains (p < 0.05) compared to HRED.
! Metric !! ACG !! HRED !! BERT !! BART
|-
| EM || 0.018 || 0.030 || 0.060† || 0.121†★
|-
| BLEU 1 || 0.418 || 0.408 || 0.459† || 0.551†★
|-
| BLEU 2 || 0.126 || 0.122 || 0.193† || 0.313†★
|-
| BLEU 3 || 0.038 || 0.052 || 0.109† || 0.231†★
|-
| BLEU 4 || 0.005 || 0.017 || 0.060† || 0.172†★
|-
| <math>\mathrm{Sim}_{\mathrm{extrema}}</math> || 0.665 || 0.710 || 0.741† || 0.789†★
|-
| <math>\mathrm{Sim}_{\mathrm{pairwise}}</math> || 0.405 || 0.404 || 0.467† || 0.553†★
|-
| New Words || 0.118 || 0.681 || 0.927 || 0.684
|-
| Repetition Rank || 7.109 || 7.789 || 6.713 || 2.196
|}

Figure 1: Difference (in %) between the performance on all the AOL sessions and one of its subsets (potentially modified). Negative values indicate a degradation. Panels: (a) Complex sessions; (b) Concatenated sessions.

Results on complex sessions. We were also interested in how the different models could handle complex sessions. To identify those, we used a simple heuristic, which was empirically validated on a sample of sessions. A complex session (1) consists of at least three queries; (2) contains queries with more than one word; and (3) to discard sessions that only contain spelling corrections, each of its queries must be sufficiently different from the previous one (we use a simple edit distance in characters with 3 as threshold). Figure 1a reports the relative results obtained on this subset of 193,336 complex sessions (compared to the corresponding results from all sessions of the AOL dataset). It emphasizes the good behavior of transformers for query suggestion (and especially pre-trained ones), since on this subset of sessions, they improve over the other models on every metric. Moreover, the transformers perform better on complex search sessions than when considering all sessions (except for EM), which means that this type of model is particularly well suited to support users having a complex information need, and also shows that HRED and ACG are more suited for simple reformulations (spelling, etc.).

Results on concatenated sessions. To assess the robustness of the approaches, we add one random session at the start of each session of the test set. Since the intent of these added sessions is (on average) not the same as the intent driving the user’s behavior when formulating test queries, models must have learned to identify thematic breaks, and to ignore this noisy information. Figure 1b shows the percentages of performance loss for every metric. We can see that, while RNN-based models perform worse, pre-trained transformers are much less affected than the others. This is an important result, since test sessions were arbitrarily split according to a 30-minute timeout, which might not correspond to users’ intent changes. Pre-trained transformers can adapt themselves to a longer history, by efficiently focusing on the relevant part. We believe that this is due to the fact that those models have learned to detect topic changes on much more data. The same observations were made on additional experiments (not shown here), where some context queries were replaced at random. Again, pre-trained transformers showed a lower performance decrease, showing that they were better at ignoring noisy context queries.

5 CONCLUSION

In this paper, inspired by the success of transformer-based models [39] in various NLP and IR tasks, we looked at their application to query generation and found that in this domain again, transformers could better handle this task than RNN-based models, even when the latter incorporate some of the elements at the origin of the transformers' success. The transformers have proven to be more resilient to noise, to be able to detect thematic boundaries in multi-task sessions, and to generate more diverse results than the previously proposed models. Future work will focus on integrating various sources of information besides queries, and on developing architectures able to cope with long sessions (potentially all the user history).
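The evaluation protocol described in Section 4 (exact match, the component-wise maximum behind Sim extrema, the pairwise similarity, and the maximum over the generated candidates) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional `TOY_EMBEDDINGS`, the whitespace tokenization, the choice to skip words without an embedding, and all function names are assumptions made purely for illustration, whereas the paper uses the GoogleNews embeddings.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only;
# the paper relies on the GoogleNews embeddings instead.
TOY_EMBEDDINGS = {
    "cheap": [0.9, 0.1, 0.0],
    "flights": [0.1, 0.9, 0.2],
    "paris": [0.0, 0.3, 0.9],
    "hotels": [0.2, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_vectors(query, embeddings):
    """Vectors of the whitespace-separated words that have an embedding."""
    return [embeddings[w] for w in query.split() if w in embeddings]


def exact_match(candidate, target):
    """EM: 1 if the predicted query is exactly the observed one, else 0."""
    return 1.0 if candidate == target else 0.0


def sim_extrema(candidate, target, embeddings=TOY_EMBEDDINGS):
    """Cosine between the component-wise maxima of the word vectors."""
    cand = word_vectors(candidate, embeddings)
    targ = word_vectors(target, embeddings)
    if not cand or not targ:
        return 0.0
    cand_max = [max(dim) for dim in zip(*cand)]
    targ_max = [max(dim) for dim in zip(*targ)]
    return cosine(cand_max, targ_max)


def sim_pairwise(candidate, target, embeddings=TOY_EMBEDDINGS):
    """Mean, over target terms, of the best cosine with any candidate term."""
    cand = word_vectors(candidate, embeddings)
    targ = word_vectors(target, embeddings)
    if not cand or not targ:
        return 0.0
    best = [max(cosine(t, c) for c in cand) for t in targ]
    return sum(best) / len(best)


def best_over_candidates(metric, candidates, target):
    """Reported score: maximum of the metric over the generated queries."""
    return max(metric(c, target) for c in candidates)


# Example: two generated suggestions scored against one observed query.
suggestions = ["cheap hotels paris", "cheap flights paris"]
observed = "cheap flights paris"
for metric in (exact_match, sim_extrema, sim_pairwise):
    print(metric.__name__, best_over_candidates(metric, suggestions, observed))
```

Taking the maximum over the candidate list means that a model is credited as soon as one of its suggestions is close to the observed query, which matches the stated goal of balancing quality and diversity.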
REFERENCES

[1] Wasi Uddin Ahmad, Kai-Wei Chang, and Hongning Wang. 2019. Context Attentive Document Ranking and Query Suggestion. CoRR abs/1906.02329 (2019). arXiv:1906.02329 http://arxiv.org/abs/1906.02329

[2] Wasi Uddin Ahmad, Kai-Wei Chang, and Hongning Wang. 2018. Multi-task learning for document ranking and query suggestion. (2018).

[3] Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, and Sebastiano Vigna. 2009. Query suggestions using query-flow graphs. In Proceedings of the 2009 workshop on Web Search Click Data. ACM, 56–63.

[4] Francesco Bonchi, Raffaele Perego, Fabrizio Silvestri, Hossein Vahabi, and Rossano Venturini. 2012. Efficient query recommendations in the long tail via center-piece subgraphs. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 345–354.

[5] Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-aware query suggestion by mining click-through and session data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 875–883.

[6] Zhuyun Dai and Jamie Callan. 2019. Deeper Text Understanding for IR with Contextual Neural Language Modeling. arXiv preprint arXiv:1905.09217 (2019).

[7] Mostafa Dehghani, Sascha Rothe, Enrique Alfonseca, and Pascal Fleury. 2017. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM.

[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).

[9] Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In Nips, modern machine learning and natural language processing workshop, Vol. 2.

[10] Fred X Han, Di Niu, Kunfeng Lai, Weidong Guo, Yancheng He, and Yu Xu. 2019. Inferring Search Queries from Web Documents via a Graph-Augmented Sequence to Attention Network. In The World Wide Web Conference. ACM, 2792–2798.

[11] Ahmed Hassan Awadallah, Ryen W White, Patrick Pantel, Susan T Dumais, and Yi-Min Wang. 2014. Supporting complex search tasks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM.

[12] Qi He, Daxin Jiang, Zhen Liao, Steven C. H. Hoi, Kuiyu Chang, Ee-Peng Lim, and Hang Li. 2009. Web Query Recommendation via Sequential Query Prediction. In Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE ’09). IEEE Computer Society, Washington, DC, USA, 12. https://doi.org/10.1109/ICDE.2009.71

[13] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).

[14] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. 2003. Relevant term suggestion in interactive web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology 54, 7 (2003).

[15] Alpa Jain, Umut Ozertem, and Emre Velipasaoglu. 2011. Synthesizing High Utility Suggestions for Rare Web Search Queries. In ACM SIGIR (Beijing, China) (SIGIR ’11). ACM, 10.

[16] Jyun-Yu Jiang and Wei Wang. 2018. RIN: Reformulation Inference Network for Context-Aware Query Suggestion. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM.

[17] Atsuhiko Kai, Yoshifumi Hirose, and Seiichi Nakagawa. 1998. Dealing with out-of-vocabulary words and speech disfluencies in an n-gram based speech understanding system. In Fifth International Conference on Spoken Language Processing.

[18] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

[19] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. 2019. VideoFlow: A Flow-Based Generative Model for Video. CoRR abs/1903.01434 (2019). arXiv:1903.01434 http://arxiv.org/abs/1903.01434

[20] Nir Levine, Haggai Roitman, and Doron Cohen. 2017. An extended relevance model for session search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 865–868.

[21] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).

[22] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2017. Stochastic answer networks for machine reading comprehension. arXiv preprint arXiv:1712.03556 (2017).

[23] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1101–1104.

[24] Rishabh Mehrotra and Emine Yilmaz. 2017. Extracting hierarchies of search tasks & subtasks via a bayesian nonparametric approach. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.

[25] Qiaozhu Mei, Dengyong Zhou, and Kenneth Church. 2008. Query suggestion using hitting time. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM.

[26] Bhaskar Mitra and Nick Craswell. 2015. Query Auto-Completion for Rare Prefixes. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM ’15). ACM, 4.

[27] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).

[28] Umut Ozertem, Olivier Chapelle, Pinar Donmez, and Emre Velipasaoglu. [n.d.]. Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions. In ACM SIGIR (2012).

[29] Umut Ozertem, Olivier Chapelle, Pinar Donmez, and Emre Velipasaoglu. 2012. Learning to suggest: a machine learning framework for ranking query suggestions. In ACM SIGIR. ACM.

[30] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.

[31] Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv preprint arXiv:1904.07531 (2019).

[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).

[33] Eldar Sadikov, Jayant Madhavan, Lu Wang, and Alon Halevy. 2010. Clustering query refinements by user intent. In Proceedings of the 19th international conference on World wide web. ACM, 841–850.

[34] Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers Unite! Unsupervised Metrics for Reinforced Summarization Models. CoRR abs/1909.01610 (2019). arXiv:1909.01610 http://arxiv.org/abs/1909.01610

[35] Marc Sloan, Hui Yang, and Jun Wang. 2015. A term-based methodology for query reformulation understanding. Information Retrieval Journal 18, 2 (2015), 145–165.

[36] Yang Song, Dengyong Zhou, and Li-wei He. [n.d.]. Post-Ranking Query Suggestion by Diversifying Search Results. In ACM SIGIR (New York, NY, USA, 2011) (SIGIR ’11). ACM. https://doi.org/10/b6d6s9

[37] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM.

[38] Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In AAAI.

[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.

[40] Bin Wu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Query suggestion with feedback memory network. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1563–1571.

[41] Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of bert for ad hoc document retrieval. arXiv preprint arXiv:1903.10972 (2019).

[42] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).

[43] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE international conference on computer vision. 19–27.