=Paper= {{Paper |id=Vol-2621/CIRCLE20_06 |storemode=property |title=Using BERT and BART for Query Suggestion |pdfUrl=https://ceur-ws.org/Vol-2621/CIRCLE20_06.pdf |volume=Vol-2621 |authors=Agnès Mustar,Sylvain Lamprier,Benjamin Piwowarski |dblpUrl=https://dblp.org/rec/conf/circle/MustarLP20 }} ==Using BERT and BART for Query Suggestion== https://ceur-ws.org/Vol-2621/CIRCLE20_06.pdf
                          Using BERT and BART for Query Suggestion
                  Agnès Mustar                                         Sylvain Lamprier                          Benjamin Piwowarski
      Sorbonne Université, CNRS, LIP6,                        Sorbonne Université, CNRS, LIP6,               Sorbonne Université, CNRS, LIP6,
                 F-75005                                                   F-75005                                       F-75005
               Paris, France                                            Paris, France                                 Paris, France
           agnes.mustar@lip6.fr                                   sylvain.lamprier@lip6.fr                     benjamin.piwowarski@lip6.fr

ABSTRACT                                                                             models are difficult to adapt when using a wider context than the
Transformer networks have recently been successfully applied to                      last submitted query [7].
a very large range of NLP tasks. Surprisingly, they have never                          More recently, recurrent neural network-based (RNNs) meth-
been employed for query suggestion, although their sequence-to-                      ods have been proposed to exploit longer dependencies between
sequence architecture makes them particularly appealing for this                     queries [1, 2, 7, 37, 40]. RNNs do so by keeping track of the user
task. Query suggestion requires modeling behaviors during complex                    in a representation/vector space which depends on all the actions
search sessions to output useful next queries to help users to com-                  performed by the user so far. Such models have improved the qual-
plete their intent. We show that pre-trained transformer networks                    ity of suggestions by capturing a broader context, but are limited
exhibit a very good performance for query suggestion on a large                      by the relatively short span of interaction that RNNs are able to
corpus of search logs, that they are more robust to noise, and have                  capture.
a better understanding of complex queries.                                              Common NLP tasks [22, 31, 34, 38, 39, 41] have benefited from
                                                                                     the recently proposed Transformers architecture [39]. Transformer
CCS CONCEPTS                                                                         networks, such as Bert [8], capture long-range dependencies be-
                                                                                     tween terms by refining each token representation based on its
• Information systems → Query suggestion; Query reformu-
                                                                                     context before handling the task at hand. They are thus a partic-
lation; Query representation; Language models; • Computing method-
                                                                                     ularly interesting architecture for query suggestion since query
ologies → Learning latent representations.
                                                                                     terms are often repeated throughout a session, and their interac-
KEYWORDS                                                                             tion needs to be captured, to build a faithful representation of the
Query suggestion, Transformers, User modeling                                        current user state.
                                                                                        In this work, we compare and analyse the results of pre-trained
1    INTRODUCTION                                                                    transformers for query suggestion to the ones from RNN-based
                                                                                     models.
To explore the space of potentially relevant documents, users in-
teract with search engines through queries. This process can be
improved, since when looking for information, users may have
                                                                                     2   RELATED WORK
difficulty expressing their needs at first, and hence may have
to reformulate the queries multiple times to find the documents                      A large number of works have focused on the task of query sug-
that satisfy their needs. This process is particularly exacerbated                   gestion [29], and related tasks such as query auto-completion [26],
when the user is accomplishing a complex search task.                                based on search logs to extract query co-occurrences [14, 15]. From
   Among the different ways to help users in exploring the in-                       a given single query formulated by a user, the goal is to identify
formation space, modern search engines provide a list of query                       related queries from logs, and to suggest reformulations based on
suggestions, which help users by either following their current                      what follows in the retrieved sessions, assuming subsequent queries
search direction – e.g. by refining the current query – or by switch-                as refinements of former ones [33]. These works rely on several
ing to a different aspect of a search task [28] if the proposed queries              methods, such as using term co-occurrence [14], using users click
match a different aspect of the user information need. Another use                   information [25], using word-level representation [4], capturing
of query suggestions is to help the search engines by providing                      higher order collocation in query-document sub-graphs [3], clus-
ways to diversify the presented information [36].                                    tering queries from logs [33], or defining hierarchies of related
   To suggest useful queries, most models build upon web search                      search tasks and sub-tasks [11, 24]. Some methods finally prevent
logs, where the actions of a user (queries, clicks, and timestamps) are              query sparsity via reformulations using NLP techniques [29]. [15]
recorded. User sessions are then extracted by segmenting the web                     proposes an end-to-end system to generate synthetic suggestions,
search log. The first query suggestion models exploited the query                    based on query-level operations and information collected from
co-occurrence graph extracted from user sessions [14, 15]: if a query                available text resources.
is often followed by another one, then the latter is a good potential                   However, such log-based methods suffer from data sparsity and
reformulation. However, co-occurrence based models suffer from                       are not effective for rare or unseen queries [37]. In addition, these
data sparsity, for instance when named entities are mentioned,                       approaches are usually context-agnostic, focusing on matching can-
and lack of coverage for rare or unseen queries. Moreover, these                     didates with a single query. But, when the query comes in a session
                                                                                     with some previous attempts for finding relevant information, it is
"Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-   crucial to leverage such context for capturing the user intent and
mons License Attribution 4.0 International (CC BY 4.0)."                             understanding its reformulation behavior. Note the approach in [5],


which alleviates the problem by relating the user sessions to paths         3     TRANSFORMERS FOR QUERIES
in a concept tree also suffers from data sparsity issues.                         SUGGESTION
   Instead of trying to predict directly a query, it is possible to learn
                                                                            In this section, we first present the transformer network architec-
how to transform it. Most approaches operate at a high level, with
                                                                            ture and pre-trained transformers before describing how we use
term retention, addition and removal as the possible reformulation
                                                                            them for query suggestion.
actions [20, 35]. [20] consider these actions as feedback from the
user – e.g. a term that is retained during the whole session should be
considered as central for the user intent. Depending on the previous
                                                                            3.1    The Transformer architecture
sequence of users’ actions, these methods seek to predict the next          The transformer architecture was introduced by [39]. It is composed
action. These methods are interesting because they model the user           of parametric functions that successively refine the representation
behavior in a session. However, they fail at capturing the semantics        of sequences, both for the encoder and the decoder. In our case,
of words, which is essential.                                               the encoder is used to represent the session, and the decoder to
   To cope with limitations of log-based and action-based methods,          generate the next query.
some works propose to define probabilistic models for next query                Each layer of the encoder or the decoder transforms the sequence
prediction [12]. Due to their ability for processing sequences of           𝑥 composed of 𝑛 vectors 𝑥 1, . . . , 𝑥𝑛 into a sequence 𝑦1, . . . , 𝑦𝑛 of
variable size, Recurrent Neural Networks (RNNs) have been widely            the same length, through an attention over a context sequence 𝑐
used for text modeling and generation tasks, with an encoder that           composed of 𝑛 vectors 𝑐 1, . . . , 𝑐𝑛 . Each time, the central mechanism
processes an input sequence by updating a representation in R𝑛 ,            is to use an attention mechanism – other operations are conducted
and a decoder that generates the target sequence from the last              to ensure a stable and efficient learning process, and are detailed in
computed representation. Some works have adapted these ideas                [39]. Special tokens are used to separate texts ([SEP]) or perform
to a sequence of queries [7, 16, 37]. HRED [37] proposes to use             classification ([CLS]).
two encoders: a query-level encoder, which encodes each query of
the user session independently, and a session-level encoder, which          3.2    Pre Trained Transformers
deals with the sequence of query representations. Instead of using a        Transformer models have a lot of parameters, and can be slow
hierarchical representation, ACG [7] relies on attention mechanism          to train. Recently, multiple pre-trained models trained on large
that is used to give a different importance to words and queries            datasets have been released [8, 21, 32, 42]. We compare the results
in the representation. Another improvement of ACG is to deal                of the fully trained transformer, to two pre-trained models that we
with Out-Of-Vocabulary (OOV) words through the use of a copy                finetune: BERT [8] and BART [21].
mechanism, which allows the model to pick tokens from the past
user queries rather than generating them with the standard RNN                 Bert. The Bidirectional Encoder Representations from
decoding.                                                                   Transformers [8] have been trained on a large dataset, the BooksCorpus
   Other RNN based approaches have also been recently proposed,             [43] on two tasks, namely predicting some masked tokens of the
such as [40], which leverages user clicks and document represen-            input, and on predicting whether one sentence follows another. It
tations to specify the user intent [1, 2], or [16] which integrates         is a state-of-the-art model, which is used for different tasks. BERT
click-through data into homomorphic term embeddings to cap-                 corresponds to the encoder part only – we have to train a decoder
ture semantic reformulations. In this work, as a starting point, we         for our specific task.
restrict to queries in sessions as input data, but other sources of
                                                                               Bart. Bidirectional and Auto-Regressive Transformer [21] is
information can be added to such models.
                                                                            made of an encoder and a decoder. It is trained on the same data as
   In parallel, the Transformers architecture, a recent and effective
                                                                            BERT, but for multiple tasks: token masking, token deletion, text
alternative to RNNs models introduced in [39], was successfully
                                                                            infilling, sentence permutation, and document rotation. Because it
applied to a broad range of NLP applications, such as Constituency
                                                                            has a decoder and it is trained on these tasks, the authors claim that
Parsing and Automatic Translation [39], Semantic Role Labeling
                                                                            BART is better than BERT for text generation. They also released
[38], Machine Reading Comprehension [22], and Abstractive Text
                                                                            fine-tuned versions of BART for other tasks. We use the weights of
Summarization [34].
                                                                            the model fine-tuned on CNN/DM, a news summarization dataset,
   The Transformer has also been used several times in the
                                                                            because as a text generation task it was the closest task to the query
field of Information Retrieval. [27] and [10] applied transformers to
                                                                            suggestion task.
infer query from a document. [27] used the pretrained transformer
BERT, and showed that expanding the document with the predicted
query improves the ad hoc retrieval results, while [10] presented a
                                                                            3.3    Using Transformer networks for Query
more complex seq2seq architecture: the encoder included a Graph                    Suggestion
Convolutional Network and a RNN; and the decoder is a transformer.          3.3.1 Problem Setting. Let us consider a session 𝑆 = (𝑄 1, ..., 𝑄 |𝑆 | )
Transformers have also been used for ad hoc retrieval [6, 23, 31, 41].      as a sequence of |𝑆 | queries, where every 𝑄𝑖 = (𝑤𝑖,1, ..., 𝑤𝑖, |𝑄𝑖 | )
[23] used BERT features in existing ranking neural models, and              is a sequence of |𝑄𝑖 | words. The goal of query suggestion is to
outperformed state-of-the-art ad hoc ranking scores.                        suggest the most relevant query for the user intent represented
                                                                            by the session. However, no perfect ground truth can be easily
                                                                            established for such problems: defining the perfect query for a given
                                                                            specific, under-defined need, given a sequence of past queries, is


an intractable problem, which requires considering very diverse (in                                Compared Models. In our experiments we compare RNN-based
nature and complexity) search tasks, depends on the user state, the                             approaches against fine-tuned transformer models. The RNN mod-
IR system and the available information in the targeted collection.                             els are HRED [37] and ACG [7] described in section 2. The pre-
Following other works on model-based query suggestion, we thus                                  trained models that we finetune are BERT [8] and BART [21].
focus on predicting the next query within an observed session.                                     In order to isolate possible causes of performance variations,
    We suppose that our dataset is composed of pairs (𝑆, 𝑄ˇ) where                              models optimization is performed on the training sets of sessions
𝑄ˇ is the query following a sequence of queries 𝑆. Our aim is thus to                           with the ADAM optimizer [18]. All hyper-parameters are tuned via
find the parameters 𝜃 that maximize the log probability of observing                            grid-search on the validation dataset.
the dataset:
                                                                                                   Query suggestion metrics. As a metric to evaluate generated
    \[ \mathcal{L}(S; \theta) = \sum_{(S, \check{Q})} \log p_\theta(\check{Q} \mid S)           queries compared to the target ones, we first use the classical metric
       = \sum_{(S, \check{Q})} \sum_{t=1}^{|\check{Q}|}                                         BLEU [30], which corresponds to the rate of generated n-grams
         \log p_\theta(w_t \mid Q_1, \ldots, Q_{|S|}) \tag{1} \]                                that are present in the target query. We refer to BLEU-1, BLEU-2,
                                                                                                BLEU-3 and BLEU-4 for 1-gram, 2-grams, 3-grams and 4-grams
where (𝑤1, ..., 𝑤|𝑄ˇ|) are the tokens of the query 𝑄ˇ. We describe                              respectively. We also calculate the exact match EM (equals to 1 if
below how we use the transformer – we tried to build different                                  the predicted query is exactly the observed one, 0 otherwise).
architectures based on the transformer, but the simplest one worked                                As EM can be too harsh, we also use a metric, Sim𝑒𝑥𝑡𝑟𝑒𝑚𝑎 [9],
the best throughout all our pilot experiments.                                                  which computes the cosine similarity between the representation
   Input. For a session, the input of the transformer is simply the                             of the candidate query with the target one. The representation of a
concatenation of all the words of all the queries separated by a                                query 𝑞 (either target or generated) is a component-wise maximum
token [SEP], i.e. the [SEP] is used to mark the beginning of a new                              of the representations of the words making up the query (we use
query in the session:                                                                           the GoogleNews embeddings, following [37]). The extrema vector
                                                                                                method has the advantage of taking into account words carrying
                                                                                                information, instead of other common words of the queries.
 𝑆 = [ [𝑆𝐸𝑃 ] 𝑤1,1 . . . 𝑤1,|𝑄 1 | [𝑆𝐸𝑃 ] . . . [𝑆𝐸𝑃 ] 𝑤|𝑆 |,1 . . . 𝑤|𝑆 |,|𝑄 |𝑆 | | [𝑆𝐸𝑃 ] ]
              |      {z        }                       |          {z            }                  However, this component-wise maximum method might exces-
                          𝑄1                                        𝑄 |𝑆 |                      sively degrade the representation of a query. As an alternative, we
                                                                                                propose to compute Sim𝑝𝑎𝑖𝑟 𝑤𝑖𝑠𝑒 as the mean value of the maximum
   This sequence is then transformed by using the token embed-                                  cosine similarity between each term of the target query and all the
dings added to positional embeddings (one per distinct position) –                              terms of the generated one.
this is how Transformers recover the sequence order [39].                                          Finally, as discussed in section 3.3, there is no ground truth on
                                                                                                what the best queries to suggest are. Instead, for each generation
3.3.2 BERT. We use the pre-trained model Bert [8], and extract
                                                                                                metric, we consider a standard max-pooling from the top-10 queries
each layer of the encoder. We sum the last layer with the average and
                                                                                                generated by the models. More precisely, for each model, we first
the max of these layers. For each token of the input, we have a
                                                                                                generate (through a beam search with 𝐾 = 20) 10 queries to suggest
contextualized embedding of size 768 given by Bert. For the decod-
                                                                                                to the user given the context. The reported value for each metric
ing part, we use a transformer decoder and feedforward network.
                                                                                                (BLEU, EM, Sim𝑒𝑥𝑡𝑟𝑒𝑚𝑎 and Sim𝑝𝑎𝑖𝑟 𝑤𝑖𝑠𝑒 ) is the maximum score
At the beginning of the training the encoder is frozen and the de-
                                                                                                over the 10 different generated queries. This is usually employed
coder is trained. Then we use the gradual unfreezing method, as
                                                                                                for assessing the performance of a probabilistic model w.r.t. a single
recommended by [13]: when the loss stabilizes, we unfreeze the
                                                                                                target (see e.g., [19]) and corresponds to a fair evaluation of models
last frozen layer of the encoder, until all the layers are finetuned.
                                                                                                that try to find a good balance between quality and diversity.
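
As an illustration of the two embedding-based metrics and of the best-of-10 reporting described above, the following Python sketch computes Sim_extrema and Sim_pairwise on toy data. The three-dimensional vectors in `EMB`, the helper names and the example queries are assumptions made for this sketch only; the experiments rely on GoogleNews embeddings. The extrema vector is implemented here as the component-wise maximum stated in the text.

```python
import math

# Toy 3-dimensional embeddings (made-up values, for illustration only).
EMB = {
    "cheap":   [0.9, 0.1, 0.0],
    "flights": [0.1, 0.8, 0.2],
    "paris":   [0.0, 0.3, 0.9],
    "tickets": [0.2, 0.7, 0.3],
}

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def extrema(words):
    """Component-wise maximum over the vectors of the known words of a query."""
    vecs = [EMB[w] for w in words if w in EMB]
    return [max(col) for col in zip(*vecs)] if vecs else None

def sim_extrema(generated, target):
    """Cosine similarity between the extrema vectors of the two queries."""
    g, t = extrema(generated), extrema(target)
    if g is None or t is None:
        return 0.0
    return cosine(g, t)

def sim_pairwise(generated, target):
    """Mean, over target terms, of the best cosine match among generated terms."""
    gen = [EMB[w] for w in generated if w in EMB]
    tgt = [EMB[w] for w in target if w in EMB]
    if not gen or not tgt:
        return 0.0
    return sum(max(cosine(t, g) for g in gen) for t in tgt) / len(tgt)

def best_of(metric, candidates, target):
    """Reported score: maximum of the metric over the generated candidates."""
    return max(metric(c, target) for c in candidates)

target = ["cheap", "flights", "paris"]
candidates = [["tickets"], ["cheap", "flights", "paris"]]
print(best_of(sim_extrema, candidates, target))
print(best_of(sim_pairwise, candidates, target))
```

In the experiments the candidate list would hold the 10 queries returned by the beam search; two candidates are used here only to keep the example short.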
3.3.3 BART. The architecture is complete for text generation, it
has an encoder and a decoder. We also use gradual unfreezing to                                 4.1     Results
finetune the model, but starting from the last layer of the pre-trained                         Table 1 reports the results obtained by all models on generated queries.
decoder.                                                                                        We added two further indicators, the ratio of new words, and the
                                                                                                rank of the prediction in the beam search if the predicted query
4      EXPERIMENTS                                                                              appears in the context (or 10 if it doesn’t, so values can be averaged).
   Datasets. We conduct our experiments on the AOL dataset. It                                     From a high-level point of view, we see that pre-trained
consists of 16 million real search log entries from the AOL Web                                 transformers perform better than the RNN-based models, HRED
Search Engine for 657,426 users. Following [37], we delimit sessions                            and ACG, on all metrics and that Bart performs better than Bert.
with a 30-minutes timeout. The queries submitted before May 1,                                     We also note that models have different tendencies to copy one
2006, are used as the training set, the remaining four weeks are split                          of the queries in the session. Note that this is a standard behavior,
into validation and test sets, as in [37]. The queries are processed by                         6% of the AOL queries to predict are among the previous queries
removing all non-alphanumeric characters and lowercasing follow-                                of the session. So it is not surprising that more powerful models
ing [37]. After filtering, there are 1,708,224 sessions in the training
set, 416,450 in the test set and 416,450 in the validation set.                                 As we want to encourage the models trained with a word tokenizer to generate tokens
                                                                                                present in the vocabulary, we follow [17] and apply a penalty on the “OOV” token. To
Based on https://github.com/hanxiao/bert-as-service, and our own preliminary                    compute the metrics, we ignored the OOV token that can be generated by HRED or
experiments                                                                                     ACG – queries with only OOV words are skipped.
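
The preprocessing described in the Datasets paragraph can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the record layout `(user_id, timestamp, query)`, the function names, the choice to replace non-alphanumeric characters by spaces, and the choice to measure the 30-minute timeout between consecutive queries of the same user are all assumptions.

```python
import re
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

def normalize(query):
    """Lowercase the query and drop non-alphanumeric characters.
    Assumption: removed characters act as word separators."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", query.lower())
    return " ".join(cleaned.split())

def split_sessions(log):
    """Group (user_id, timestamp, query) records into sessions.
    A new session starts when the user changes or when the gap since
    the previous kept query of that user exceeds TIMEOUT."""
    sessions = []
    last_user, last_time = None, None
    for user, ts, query in sorted(log, key=lambda r: (r[0], r[1])):
        q = normalize(query)
        if not q:
            continue  # nothing left after normalization
        if user != last_user or ts - last_time > TIMEOUT:
            sessions.append([])
        sessions[-1].append(q)
        last_user, last_time = user, ts
    return sessions

log = [
    ("u1", datetime(2006, 3, 1, 10, 0), "Cheap Flights!"),
    ("u1", datetime(2006, 3, 1, 10, 10), "cheap flights paris"),
    ("u1", datetime(2006, 3, 1, 11, 0), "hotel"),
    ("u2", datetime(2006, 3, 1, 10, 5), "news"),
]
print(split_sessions(log))
```

The split into training, validation and test sets by submission date is not shown.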


Table 1: Results on the AOL dataset. ★ indicates significant              Results on complex sessions. We were also interested in how the
gains (𝑝 < 0.05) compared to BERT. † indicates significant            different models could handle complex sessions. To identify those,
gains (𝑝 < 0.05) compared to HRED.                                    we used a simple heuristic, which was empirically validated on a
                                                                      sample of sessions: A complex session (1) consists of at least three
                         ACG      HRED         BERT      BART         queries; (2) contains queries with more than one word; and (3) to
      EM                 0.018    0.030        0.060†   0.121†★       discard sessions that only contain spelling corrections, each of its
      BLEU 1             0.418    0.408        0.459†   0.551†★       queries must be sufficiently different from the previous one – we
      BLEU 2             0.126    0.122        0.193†   0.313†★       use a simple editing distance in characters with 3 as threshold.
      BLEU 3             0.038    0.052        0.109†   0.231†★           Figure 1a reports the relative results obtained on this subset of
      BLEU 4             0.005    0.017        0.060†   0.172†★       193,336 complex sessions (compared to corresponding results from
      𝑠𝑖𝑚𝑒𝑥𝑡𝑟𝑒𝑚𝑎         0.665    0.710        0.741†   0.789†★       all sessions of the AOL dataset). It emphasizes the good behavior
      𝑠𝑖𝑚𝑝𝑎𝑖𝑟 𝑤𝑖𝑠𝑒       0.405    0.404        0.467†   0.553†★       of transformers for query suggestion (and especially pre-trained
      New Words          0.118    0.681         0.927    0.684        ones), since on this subset of sessions, they improve over the other
      Repetition Rank    7.109    7.789         6.713    2.196        models on every metric. Moreover, the transformers have a better
                                                                      performance on complex search sessions than when considering
                                                                      all sessions (except for EM), which means that this type of model is
                                                                      particularly well suited to support users having a complex informa-
Figure 1: Difference (in %) between the performance on all            tion need, and also shows that HRED and ACG are more suited for
the AOL sessions and one of its subsets (potentially modi-            simple reformulations (spelling, etc.).
fied). Negative values indicate a degradation.
                        (a) Complex sessions                             Results on concatenated sessions. To assess the robustness of the
                                                                      approaches, we add one random session at the start of each session
                                                                      of the test set. Since the intent of these added sessions is (on
                                                                      average) not the same as the intent driving the user’s behavior when
                                                                      formulating test queries, models must have learned to identify
                                                                      thematic breaks, and to ignore this noisy information. Figure 1b shows
                                                                      percentages of performance loss for every metric. We can see that,
                                                                      while RNN-based models perform worse, pre-trained transformers
                                                                      are far less affected than the others. This is an important
                                                                      result, since test sessions were arbitrarily split according to
                                                                      a 30-minute timeout, which might not correspond to users’ intent
                                                                      changes. Pre-trained transformers can adapt themselves to longer
                                                                      history, by efficiently focusing on the relevant part. We believe that
                                                                      this is due to the fact that those models have learned to detect topic
                                                                      changes on much more data. The same observations were made
                      (b) Concatenated sessions                       on additional experiments (not shown here), where some context
                                                                      queries were replaced at random. Again, pre-trained transformers
                                                                      showed a smaller performance decrease, showing that they were
                                                                      better at ignoring noisy context queries.
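
The two session subsets used in these analyses can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code. It assumes that condition (2) of the complex-session heuristic must hold for every query, that "sufficiently different" means a character edit distance strictly greater than 3, and that the function names are free choices. The `to_input` helper follows the [SEP]-separated input format of Section 3.3.1.

```python
import random

def edit_distance(a, b):
    """Character-level Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_complex(session, threshold=3):
    """Heuristic for complex sessions (session = list of query strings)."""
    if len(session) < 3:
        return False
    if not all(len(q.split()) > 1 for q in session):
        return False
    return all(edit_distance(p, q) > threshold
               for p, q in zip(session, session[1:]))

def prepend_random_session(session, pool, rng):
    """Robustness test: put one randomly drawn session before the test session."""
    return rng.choice(pool) + session

def to_input(session, sep="[SEP]"):
    """Concatenate the words of all queries, each query delimited by [SEP]."""
    tokens = [sep]
    for q in session:
        tokens.extend(q.split())
        tokens.append(sep)
    return tokens

session = ["cheap flights paris", "hotel near louvre museum", "paris metro map"]
noisy = prepend_random_session(session, [["weather today"]], random.Random(0))
print(is_complex(session))
print(to_input(noisy))
```

A spelling correction such as "cheap flihgts" followed by "cheap flights" has an edit distance of 2, so a session containing it is not counted as complex under these assumptions.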

                                                                      5   CONCLUSION
In this paper, inspired by the success of transformer-based models
[39] in various NLP and IR tasks, we looked at their application to
query generation and found that, in this domain again, transformers
handle the task better than RNN-based models, even when the latter
try to incorporate some of the elements at the origin of the
transformers' success. The transformers have proven to be more
resilient to noise, able to detect thematic boundaries in multi-task
sessions, and able to generate more diverse results than the previously
proposed models. Future work will focus on integrating various sources
of information besides queries, and on developing architectures able to
cope with long sessions (potentially all the user history).

learn to copy – transformer models thus have a tendency to repeat
a seen query, compared to ACG or HRED (lower Repetition Rank).
This tendency is explained by their ability to retrieve information
at arbitrary positions in the input. While BERT generates many
new words while keeping a low repetition rank, BART has a higher
repetition rank and adds fewer new words.
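
The trade-off between repeating a seen query and adding new words can be illustrated with a small sketch. The two functions below are hypothetical and are not the paper's metric definitions: whitespace tokenization, lowercasing, and the notion of a "new word" as one absent from all context queries are assumptions made for illustration only.

```python
def new_word_ratio(context_queries, suggestion):
    """Fraction of suggestion words that appear in no context query (assumed definition)."""
    seen = {w for q in context_queries for w in q.lower().split()}
    words = suggestion.lower().split()
    if not words:
        return 0.0
    return sum(w not in seen for w in words) / len(words)


def repeats_context(context_queries, suggestion):
    """True if the suggestion is an exact (case-insensitive) copy of a context query."""
    normalized = suggestion.strip().lower()
    return any(q.strip().lower() == normalized for q in context_queries)


context = ["cheap flights paris", "paris hotels"]
ratio = new_word_ratio(context, "paris hotels near louvre")  # "near" and "louvre" are new
copied = repeats_context(context, "Paris Hotels")
```

Under these assumptions, a model that mostly copies yields a low `new_word_ratio` and frequent `repeats_context` hits, while a more generative model shows the opposite pattern.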


REFERENCES
 [1] Wasi Uddin Ahmad, Kai-Wei Chang, and Hongning Wang. 2019. Context Attentive Document Ranking and Query Suggestion. CoRR abs/1906.02329 (2019). arXiv:1906.02329 http://arxiv.org/abs/1906.02329
 [2] Wasi Uddin Ahmad, Kai-Wei Chang, and Hongning Wang. 2018. Multi-task learning for document ranking and query suggestion. (2018).
 [3] Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, and Sebastiano Vigna. 2009. Query suggestions using query-flow graphs. In Proceedings of the 2009 workshop on Web Search Click Data. ACM, 56–63.
 [4] Francesco Bonchi, Raffaele Perego, Fabrizio Silvestri, Hossein Vahabi, and Rossano Venturini. 2012. Efficient query recommendations in the long tail via center-piece subgraphs. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. ACM, 345–354.
 [5] Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-aware query suggestion by mining click-through and session data. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 875–883.
 [6] Zhuyun Dai and Jamie Callan. 2019. Deeper Text Understanding for IR with Contextual Neural Language Modeling. arXiv preprint arXiv:1905.09217 (2019).
 [7] Mostafa Dehghani, Sascha Rothe, Enrique Alfonseca, and Pascal Fleury. 2017. Learning to attend, copy, and generate for session-based query suggestion. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM.
 [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
 [9] Gabriel Forgues, Joelle Pineau, Jean-Marie Larchevêque, and Réal Tremblay. 2014. Bootstrapping dialog systems with word embeddings. In Nips, modern machine learning and natural language processing workshop, Vol. 2.
[10] Fred X Han, Di Niu, Kunfeng Lai, Weidong Guo, Yancheng He, and Yu Xu. 2019. Inferring Search Queries from Web Documents via a Graph-Augmented Sequence to Attention Network. In The World Wide Web Conference. ACM, 2792–2798.
[11] Ahmed Hassan Awadallah, Ryen W White, Patrick Pantel, Susan T Dumais, and Yi-Min Wang. 2014. Supporting complex search tasks. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management. ACM.
[12] Qi He, Daxin Jiang, Zhen Liao, Steven C. H. Hoi, Kuiyu Chang, Ee-Peng Lim, and Hang Li. 2009. Web Query Recommendation via Sequential Query Prediction. In Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE ’09). IEEE Computer Society, Washington, DC, USA, 12. https://doi.org/10.1109/ICDE.2009.71
[13] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
[14] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. 2003. Relevant term suggestion in interactive web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology 54, 7 (2003).
[15] Alpa Jain, Umut Ozertem, and Emre Velipasaoglu. 2011. Synthesizing High Utility Suggestions for Rare Web Search Queries. In ACM SIGIR (Beijing, China) (SIGIR ’11). ACM, 10.
[16] Jyun-Yu Jiang and Wei Wang. 2018. RIN: Reformulation Inference Network for Context-Aware Query Suggestion. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. ACM.
[17] Atsuhiko Kai, Yoshifumi Hirose, and Seiichi Nakagawa. 1998. Dealing with out-of-vocabulary words and speech disfluencies in an n-gram based speech understanding system. In Fifth International Conference on Spoken Language Processing.
[18] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[19] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. 2019. VideoFlow: A Flow-Based Generative Model for Video. CoRR abs/1903.01434 (2019). arXiv:1903.01434 http://arxiv.org/abs/1903.01434
[20] Nir Levine, Haggai Roitman, and Doron Cohen. 2017. An extended relevance model for session search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 865–868.
[21] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. 2019. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461 (2019).
[22] Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2017. Stochastic answer networks for machine reading comprehension. arXiv preprint 1712.03556 (2017).
[23] Sean MacAvaney, Andrew Yates, Arman Cohan, and Nazli Goharian. 2019. CEDR: Contextualized embeddings for document ranking. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1101–1104.
[24] Rishabh Mehrotra and Emine Yilmaz. 2017. Extracting hierarchies of search tasks & subtasks via a bayesian nonparametric approach. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[25] Qiaozhu Mei, Dengyong Zhou, and Kenneth Church. 2008. Query suggestion using hitting time. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM.
[26] Bhaskar Mitra and Nick Craswell. 2015. Query Auto-Completion for Rare Prefixes. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (Melbourne, Australia) (CIKM ’15). ACM, 4.
[27] Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. arXiv preprint arXiv:1904.08375 (2019).
[28] Umut Ozertem, Olivier Chapelle, Pinar Donmez, and Emre Velipasaoglu. [n.d.]. Learning to Suggest: A Machine Learning Framework for Ranking Query Suggestions. In ACM SIGIR (2012).
[29] Umut Ozertem, Olivier Chapelle, Pinar Donmez, and Emre Velipasaoglu. 2012. Learning to suggest: a machine learning framework for ranking query suggestions. In ACM SIGIR. ACM.
[30] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics. Association for Computational Linguistics, 311–318.
[31] Yifan Qiao, Chenyan Xiong, Zhenghao Liu, and Zhiyuan Liu. 2019. Understanding the Behaviors of BERT in Ranking. arXiv preprint 1904.07531 (2019).
[32] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog 1, 8 (2019).
[33] Eldar Sadikov, Jayant Madhavan, Lu Wang, and Alon Halevy. 2010. Clustering query refinements by user intent. In Proceedings of the 19th international conference on World wide web. ACM, 841–850.
[34] Thomas Scialom, Sylvain Lamprier, Benjamin Piwowarski, and Jacopo Staiano. 2019. Answers Unite! Unsupervised Metrics for Reinforced Summarization Models. CoRR abs/1909.01610 (2019). arXiv:1909.01610 http://arxiv.org/abs/1909.01610
[35] Marc Sloan, Hui Yang, and Jun Wang. 2015. A term-based methodology for query reformulation understanding. Information Retrieval Journal 18, 2 (2015), 145–165.
[36] Yang Song, Dengyong Zhou, and Li-wei He. [n.d.]. Post-Ranking Query Suggestion by Diversifying Search Results. In ACM SIGIR (New York, NY, USA, 2011) (SIGIR ’11). ACM. https://doi.org/10/b6d6s9
[37] Alessandro Sordoni, Yoshua Bengio, Hossein Vahabi, Christina Lioma, Jakob Grue Simonsen, and Jian-Yun Nie. 2015. A hierarchical recurrent encoder-decoder for generative context-aware query suggestion. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. ACM.
[38] Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In AAAI.
[39] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
[40] Bin Wu, Chenyan Xiong, Maosong Sun, and Zhiyuan Liu. 2018. Query suggestion with feedback memory network. In Proceedings of the 2018 World Wide Web Conference. International World Wide Web Conferences Steering Committee, 1563–1571.
[41] Wei Yang, Haotian Zhang, and Jimmy Lin. 2019. Simple applications of bert for ad hoc document retrieval. arXiv preprint 1903.10972 (2019).
[42] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. arXiv preprint arXiv:1906.08237 (2019).
[43] Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In IEEE international conference on computer vision. 19–27.