Hybrid First-stage Retrieval Models for Biomedical Literature

Ji Ma, Ivan Korotkov, Keith Hall, and Ryan McDonald
Google Research
{maji,ivankr,kbhall,ryanmcd}@google.com

Abstract. We describe a hybrid first-stage retrieval model evaluated on BioASQ 8 document retrieval. We show that a hybrid model consistently outperforms comparable neural and term-based models. To train both the hybrid and neural models, we rely on data augmentation, specifically question generation over the Pubmed corpus. In addition to reporting the official runs of this model from BioASQ, we also report some post-challenge improvements. With these improvements, our hybrid model is competitive with the top-scoring systems. When adding a simple neural BERT-based reranker, the model outperforms all systems, on average, across all five batches. This highlights the efficacy of hybrid first-stage retrieval models.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

The BioASQ challenge organizes shared-tasks for semantic understanding of biomedical literature, including document and snippet retrieval, semantic indexing, question answering and summarization [22]. Here, we describe the technical details of an entry to the document retrieval sub-task (Task B Phase A). Specifically, our submissions consist of the following contributions:

Hybrid first-stage retrieval. We use a principled approach to create a sparse-dense retrieval model that combines the benefits of both neural and term-based models. Our term-based model is a standard BM25 model [21] and our neural model falls into the category of dense vector retrieval, also known as dual encoder models [1, 19, 5, 2, 7]. We show that, since both of these retrieval paradigms can be cast as vector similarity via nearest neighbor search, a principled hybrid model can be constructed. The neural, term and hybrid models are described in Sections 2–4.

Data Augmentation via Question Generation. Our neural first-stage model requires supervised training data. However, there is a lack of such data for the biomedical domain outside of the few thousand examples from previous BioASQ challenges. To address this we use data augmentation [24]. Specifically, we follow the work of Ma et al. [14] and train a question generator on a community QA dataset. We then apply this generator to Pubmed abstracts to create biomedical-specific pairs of questions and relevant documents. This is described in Section 2.2.

Second-stage reranking. The focus of our contribution was to measure the efficacy of neural first-stage retrieval models for biomedical literature. However, we also experiment with adding a simple BERT-based [4] cross-attention reranker, which has become standard in the IR literature [18, 15, 26], including past BioASQ challenges [20].

2 Neural First-stage Retrieval

Our retrieval model consists of two components. A dense model, which is based on dual encoders [1, 5], aims to capture semantic similarity between queries and relevant documents. A sparse model, which is based on term matching, aims to capture lexical similarity between queries and documents. This section focuses on the dense model; we describe the sparse model in the next section.

2.1 Dual Encoder

Formally, a dual encoder model consists of two encoders, {fQ(), fP()}, and a similarity function, sim().
An encoder is a function f that takes an item x as input and outputs a real-valued vector as the encoding. The similarity function, sim(), takes two encodings, q, p ∈ R^N, and calculates a real-valued score, s = sim(q, p).

For the BioASQ competition, we are interested in encoding natural language texts into real-valued vectors. Thus, following recent success in natural language processing, we implement both encoders with BERT [4]. In particular, our encoder feeds the input query (or document) string to the BERT model and then projects the [CLS] token representation from the BERT outputs to a 768-dimensional vector, which serves as the encoding of that query (or document). In addition, we share parameters between the query and document encoders, a so-called Siamese network [1], which we found consistently improves retrieval performance while reducing the total number of model parameters. We use the dot-product as the similarity function; in our initial experiments, we observed no meaningful difference in retrieval performance between the dot-product and cosine similarity.

We train model parameters using a softmax cross-entropy loss together with in-batch negatives, i.e., given a query in a batch of (query, relevant-passage) pairs, passages from other pairs are considered irrelevant for that query. In-batch negatives have been widely adopted for training neural-network-based retrieval models as they enable efficient training via computation sharing [27, 5, 7].

2.2 Question Generation

A major bottleneck in building a high-accuracy neural retrieval system is the lack of large-scale training data. The problem is exacerbated for specialized domains such as the biomedical domain. To handle this data scarcity issue, we follow the approach proposed by Ma et al. [14], which automatically generates synthetic questions for the target domain. Specifically, a transformer-based [23] encoder-decoder generation model is trained to generate questions specific to a given passage. The training data for the generator comprises question-answer pairs mined from community resources such as StackExchange (archive.org/details/stackexchange) and Yahoo! Answers (webscope.sandbox.yahoo.com/catalog.php?datatype=l). Once training completes, the question generator is applied to documents/passages from the target domain, in this case Pubmed, to generate a large amount of synthetic queries. Finally, each synthetic question is paired with the passage from which it was generated to form a training example for the dual encoder model.

In this work, our implementation of the question generator follows exactly the same setting as the base model in [14], e.g., both the encoder and decoder consist of 3 transformer layers, and parameters between the encoder and decoder are shared and initialized from a RoBERTa [12] checkpoint. We refer the reader to the original paper for more details.

2.3 Nearest Neighbour Inference

To serve the dual-encoder retrieval model over a collection of passages, we first run the encoder over every passage offline to create a distributed lookup-table as a backend. At inference, we only need to run the question encoder on the input query. The query encoding is then used to perform nearest neighbour search against the passage encodings in the backend. Since the total number of passages is on the order of millions and each passage is projected to a 768-dimensional vector, we use distributed brute-force search for exact inference instead of approximate nearest neighbour search [11, 6].
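To make the training objective (Section 2.1) and the exact inference procedure (Section 2.3) concrete, the sketch below illustrates both with NumPy. It is a minimal illustration, not our distributed BERT-based implementation: the random vectors stand in for encoder outputs, and the function names (in_batch_softmax_loss, brute_force_top_k) are ours for this example only.

```python
# Minimal NumPy sketch of (i) the in-batch softmax cross-entropy loss used to
# train the dual encoder and (ii) exact (brute-force) nearest-neighbour
# retrieval over pre-computed passage encodings. Random vectors stand in for
# the 768-dimensional BERT [CLS] projections described in Section 2.1.
import numpy as np

DIM = 768

def in_batch_softmax_loss(q_enc: np.ndarray, p_enc: np.ndarray) -> float:
    """q_enc, p_enc: [batch, DIM] encodings of aligned (query, passage) pairs.

    Row i of the score matrix holds the dot-products of query i against every
    passage in the batch; passage i is the positive, the rest are in-batch
    negatives.
    """
    scores = q_enc @ p_enc.T                         # [batch, batch]
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # positives on the diagonal

def brute_force_top_k(query_enc: np.ndarray, passage_encs: np.ndarray, k: int = 10):
    """Exact retrieval: dot-product of one query against all passage encodings."""
    scores = passage_encs @ query_enc                # [num_passages]
    top = np.argpartition(-scores, k)[:k]            # unordered top-k
    return top[np.argsort(-scores[top])]             # sorted by descending score

# Toy usage with random vectors in place of real encodings.
rng = np.random.default_rng(0)
q = rng.normal(size=(8, DIM)).astype(np.float32)
p = rng.normal(size=(8, DIM)).astype(np.float32)
print("in-batch loss:", in_batch_softmax_loss(q, p))

corpus = rng.normal(size=(1000, DIM)).astype(np.float32)
print("top-10 passage ids:", brute_force_top_k(q[0], corpus, k=10))
```

In the actual system the passage encodings are sharded across machines and the dot-products are computed in a distributed fashion, but the scoring logic is the same.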
3 Term-based Retrieval as Nearest Neighbour Search

Term-based retrieval models, such as BM25 [21], have been extensively studied for document retrieval. In fact, for first-stage retrieval, there is significant evidence that term-based models are extremely effective baselines [10]. Term-based models usually use inverted indexes for inference, taking advantage of lexical sparsity per document to optimize retrieval speed and memory usage [16]. In this section, we show that inference in term-matching based models, specifically BM25, can be cast as vector dot-product similarity, which will enable principled hybrid models (Section 4).

Let Q and P denote a query and a passage, respectively. The BM25 score between Q and P is computed as:

\[
\mathrm{BM25}(Q, P) = \sum_{i=1}^{n} \frac{\mathrm{IDF}(q_i) \cdot \mathrm{cnt}(q_i, P) \cdot (k + 1)}{\mathrm{cnt}(q_i, P) + k \cdot \left(1 - b + b \cdot \frac{m}{m_{avg}}\right)},
\]

where q_i are tokens from Q, cnt(q_i, P) is q_i's term frequency in P, k and b are BM25 hyperparameters, IDF is the term's inverse document frequency from the corpus, n and m are the number of tokens in Q and P, and m_avg is the collection's average passage length.

This can be written as a vector space model. To see this, let q^bm25 ∈ [0, 1]^|V| be a |V|-dimensional binary encoding of Q, i.e., q^bm25[i] is 1 if the i-th entry of vocabulary V is in Q, and 0 otherwise. Furthermore, let p^bm25 ∈ R^|V| be a sparse real-valued vector where

\[
p^{bm25}_i = \frac{\mathrm{IDF}(p_i) \cdot \mathrm{cnt}(p_i, P) \cdot (k + 1)}{\mathrm{cnt}(p_i, P) + k \cdot \left(1 - b + b \cdot \frac{m}{m_{avg}}\right)}.
\]

We can see that

\[
\mathrm{BM25}(Q, P) = \langle q^{bm25}, p^{bm25} \rangle,
\]

where ⟨·, ·⟩ denotes the vector dot-product.

4 Hybrid First-stage Retrieval

Although dual encoder models are good at capturing semantic similarity, e.g., "Theresa May" and "Prime Minister" [3], we observe that lexical matching consistently poses a challenge for first-stage neural retrieval models. For instance, if we consider the question "Which are the additions of the JASPAR 2016 open-access database of transcription factor binding profiles?" from a prior year's BioASQ challenge, our initial neural model retrieved this document as the most relevant (title only),

  JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles.

whereas a BM25 system returns the much more relevant document,

  JASPAR 2016: a major expansion and update of the open-access database of transcription factor binding profiles.

Thus, neural models tend to generalize better than term-based models, but term-based models are advantageous in situations where exact lexical matching is preferable.

In order to build a system that combines the benefits of both neural and term-based retrieval, we combine our neural dual encoder models with BM25 in a principled way. Specifically, leveraging the vector similarity view of BM25 (Section 3) gives rise to a simple hybrid,

\[
\mathrm{sim}(q^{hyb}, p^{hyb}) = \langle q^{hyb}, p^{hyb} \rangle = \langle [\lambda q^{bm25}, q^{nn}], [p^{bm25}, p^{nn}] \rangle = \lambda \langle q^{bm25}, p^{bm25} \rangle + \langle q^{nn}, p^{nn} \rangle,
\]

where q^hyb and p^hyb are the hybrid encodings that concatenate the BM25 encodings (q^bm25/p^bm25) and the neural encodings (q^nn/p^nn, from Section 2); and λ is an interpolation hyperparameter that trades off the relative weight of BM25 versus the neural model. Thus, we can implement BM25 and our hybrid model as nearest neighbor search with a hybrid sparse-dense vector dot-product [25]. Note that this results in exact retrieval and not approximate retrieval through post-hoc rescoring, the latter having been studied previously [17, 13, 7].
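The sketch below illustrates this sparse-dense view under simplifying assumptions: a toy tokenized corpus, illustrative values of k and b, a standard IDF variant, and random vectors in place of the dual encoder outputs. It is not our production system; it only shows that the hybrid score is a weighted sum of a sparse BM25 dot-product and a dense dot-product.

```python
# Toy sketch of BM25-as-dot-product and the hybrid score from Section 4.
# K1, B, the IDF formula, and the random "dense" vectors are illustrative
# stand-ins, not the values used in our system.
import math
from collections import Counter
import numpy as np

K1, B = 1.2, 0.75  # illustrative BM25 hyperparameters (k and b in the text)

def bm25_passage_vector(tokens, idf, avg_len):
    """Sparse BM25 encoding of a passage: one weight per term it contains."""
    tf = Counter(tokens)
    m = len(tokens)
    return {t: idf.get(t, 0.0) * c * (K1 + 1) / (c + K1 * (1 - B + B * m / avg_len))
            for t, c in tf.items()}

def bm25_query_vector(tokens):
    """Binary query encoding over the query's terms."""
    return {t: 1.0 for t in set(tokens)}

def sparse_dot(q_sparse, p_sparse):
    return sum(w * p_sparse.get(t, 0.0) for t, w in q_sparse.items())

def hybrid_score(q_sparse, p_sparse, q_dense, p_dense, lam=1.5):
    # Equivalent to the dot-product of the concatenated vectors
    # [lam * q_bm25, q_nn] and [p_bm25, p_nn].
    return lam * sparse_dot(q_sparse, p_sparse) + float(np.dot(q_dense, p_dense))

# Toy corpus and query.
passages = [["jaspar", "2016", "open", "access", "database"],
            ["jaspar", "2010", "open", "access", "database"]]
avg_len = sum(len(p) for p in passages) / len(passages)

def doc_freq(term):
    return sum(term in p for p in passages)

idf = {t: math.log(1 + (len(passages) - doc_freq(t) + 0.5) / (doc_freq(t) + 0.5))
       for p in passages for t in p}

rng = np.random.default_rng(0)
query_tokens = ["additions", "jaspar", "2016", "database"]
q_sparse = bm25_query_vector(query_tokens)
q_dense = rng.normal(size=768)

for tokens in passages:
    p_sparse = bm25_passage_vector(tokens, idf, avg_len)
    p_dense = rng.normal(size=768)  # stand-in for the dual encoder passage encoding
    print(tokens[1], round(hybrid_score(q_sparse, p_sparse, q_dense, p_dense), 3))
```

In practice the sparse and dense parts are stored together and searched with a hybrid nearest-neighbour backend [25], so no post-hoc rescoring step is needed.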
5 Experiments

Our document collection contains the abstracts of articles from MEDLINE. We discard about 10M abstracts that contain only a title, which leaves us with about 18M abstracts. For the dual encoder model, all passages are truncated at 300 wordpiece tokens with BERT tokenization.

All evaluation is done either by the BioASQ challenge via uploaded results, or subsequently using the official BioASQ evaluation script. As per challenge rules, we returned at most 10 relevant documents per question.

5.1 Systems

BM25 We build a standard BM25 retrieval system based on IDF values computed on the document collection. This is a unigram model using the bioclean tokenization script from BioASQ.

DE This is the dual encoder model described in Section 2.1, which is based on a pretrained BERT model. In this work, we create our own wordpiece vocabulary on Pubmed abstracts with 107,137 entries. Our BERT model consists of 12 transformer [23] layers, each with hidden size 1024 and 16 attention heads. We use the same sentence sampling procedure as reported in the original BERT paper, e.g., the combined sequence has length no longer than 512 tokens, and we uniformly mask 15% of the tokens from each sequence for masked language model prediction. We modify the next sentence prediction task by replacing the original binary cross-entropy loss with the softmax cross-entropy loss described in Section 2.1. We use the same hyper-parameter values for BERT pretraining except that the learning rate is set to 2e-5, and the model is trained for 300,000 steps.

To train the dual encoder model, we use the supervised data provided by BioASQ, as well as synthetic data generated using the method described in Section 2.2. For the supervised data, we use the BioASQ 8B training data, where the last 200 questions are used as a development set. The synthetic data contains 103,635,592 question-passage pairs, where questions are generated from Pubmed abstracts. The dual encoder model is trained with a batch size of 6144. For each batch, 20% of the examples come from synthetic data, and the rest come from supervised data. We train the model for 100,000 steps using Adam [8] with a learning rate of 5e-6, β1 = 0.9, β2 = 0.999. Similar to BERT pretraining, we also apply L2 weight decay of 0.01 and warm up the learning rate over the first 10,000 steps.

Hybrid This is identical to DE, but instead of using the pure neural model, we train the hybrid model of Section 4 with λ = 1.5, which was chosen by a grid search on the development set.

HybridRerank This system applies a reranker on top of the output of the Hybrid system. We cast reranking as a logistic regression problem: given a question-passage pair, the model predicts whether that passage is relevant to the question or not. Here a passage is the concatenation of an article title with the abstract of that article. The reranking model is also based on BERT, i.e., we concatenate the query and passage as the input to BERT and apply an MLP on top of the [CLS] token representation. We use question-passage pairs from BioASQ 8B as positive examples. Negative examples are created using the same queries but with passages returned by the BM25 system. We train the model for 1 epoch, with the same hyper-parameter values as used to train the dual encoder model. For inference, given a query, we sort the top 10 outputs of the Hybrid system in descending order of their reranking scores.
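For illustration, a hedged sketch of how such a BERT-based cross-attention reranker can be scored is shown below, using the Hugging Face transformers library as a stand-in for our internal setup. The checkpoint name, the single-logit classification head, and the hybrid_retrieve call are assumptions made for the example, not the configuration used for our official runs.

```python
# Illustrative reranker scoring: concatenate query and passage, feed them to a
# BERT sequence-classification model, and sort candidates by the output logit.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder checkpoint
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1)  # single relevance logit on top of [CLS]
model.eval()

def rerank(question: str, candidates: list[str], top_k: int = 10):
    """Score each (question, title + abstract) pair and sort by relevance."""
    scores = []
    with torch.no_grad():
        for passage in candidates:
            inputs = tokenizer(question, passage, truncation=True,
                               max_length=512, return_tensors="pt")
            scores.append(model(**inputs).logits.squeeze().item())
    order = sorted(range(len(candidates)), key=lambda i: -scores[i])[:top_k]
    return [(candidates[i], scores[i]) for i in order]

# Usage: rerank the top-10 documents returned by the first-stage Hybrid model.
# docs = hybrid_retrieve(question, k=10)   # hypothetical first-stage call
# ranked = rerank(question, docs)
```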
5.2 Official Results

Official results for our submissions are shown in Table 1. Not all systems were submitted to all batches. We report only Mean Average Precision (MAP) as it is the official metric for the document retrieval challenge. These are the preliminary results before human judgements, which are still outstanding. A number of things can be observed:

1. BM25 is significantly better than our neural DE model. As mentioned previously, BM25 is a very strong baseline. However, we suspect that part of this is due to the nature of the BioASQ data, where the relevance annotators are also the ones who create the questions. This has been shown to bias datasets in favor of term-based results [9]. This is exacerbated for the preliminary results, where relevance judgements are gathered via pre-existing search tools like Pubmed, which themselves are heavily biased towards term matching.

2. The Hybrid model consistently outperforms the BM25 model – by about 2 points on average. This shows that hybrid retrieval is a very viable approach to first-stage retrieval for biomedical literature.

3. Adding a BERT-based cross-attention reranker consistently increases accuracy, by 1–3 points. This is consistent with previous studies in the domain [20].

4. Our final reranking model is competitive with the best scoring systems on the batches in which it was scored. Given the simplicity of the model, we expect that further optimizations will increase accuracy further. E.g., the best scoring system from last year – and one of the top systems from this year – used a joint document-snippet model [20].

                          Batch 2   Batch 3   Batch 4   Batch 5
                          MAP       MAP       MAP       MAP
  BM25                    0.2718    0.3877    0.3631    0.4287
  DE                      0.1173    0.2756    0.2600    0.3190
  Hybrid                  0.2806    0.3995    0.3866    0.4437
  HybridRerank            --        0.4303    0.4121    0.4593
  Best Reporting System   0.3304    0.4510    0.4163    0.4842

Table 1. Mean average precision (MAP) official results for batches 2–5.

                          Batch 1   Batch 2   Batch 3   Batch 4   Batch 5   Average
                          MAP       MAP       MAP       MAP       MAP       MAP
  BM25                    0.3538    0.2955    0.3891    0.3872    0.4286    0.3708
  DE                      0.2661    0.1869    0.3096    0.2771    0.3190    0.2717
  Hybrid                  0.3711    0.3087    0.4141    0.4099    0.4437    0.3895
  HybridRerank            0.3877    0.3226    0.4235    0.4351    0.4592    0.4056
  Best Reporting System   0.3398    0.3304    0.4510    0.4163    0.4842    0.4043

Table 2. Mean average precision (MAP) updated results for batches 1–5.

5.3 Updated Results

While the BioASQ challenge was underway, we updated our models and data to improve them. Here we report results for updated models that incorporate these improvements. These are not official submissions, but they use the BioASQ evaluation script and are thus comparable to official results. We made the following updates:

1. Data fix. After batch 4, we realized that our data pipeline sometimes did not include the full abstract. This was fixed.

2. Bigram BM25. Our original BM25 model was a unigram model that used the bioclean tokenizer supplied by BioASQ. We tried using a BERT-based tokenizer (the same one as used by the DE model) and found that this performed better.

3. Better abstract modeling for DE. For our DE model, we originally truncated abstracts at 300 wordpieces. Instead, we now divide the abstract into blocks of 300 wordpieces each and index each block separately. At inference, if two blocks from the same abstract are returned, we remove the duplicate document.

These updates were incorporated and all the new models were run on all batches.
Table 2 shows the results relative to the best system per batch, as well as the average across all five batches. Compared to Table 1, we can see that all numbers go up, and now the HybridRerank system is the top system on two batches and overall on average. We should note that the 'Best Reporting System' row is not the same submission across batches, as different systems (and teams) performed best depending on the batch. Thus, the average of this row does not represent a single system, but the average over possibly many systems.

6 Conclusions

In this paper we described our submissions to the BioASQ challenge. Specifically, we show that hybrid term-neural models are a viable first-stage retrieval method. For the neural portion, data augmentation techniques as proposed by Ma et al. [14] were required to attain reasonable performance and are likely necessary in cases where there is little supervised data. Overall, our methods were competitive, especially when combined with a reranker. Post-challenge improvements around data quality and minor modeling changes (e.g., bigram BM25) pushed the results near the top of the challenge, highlighting the effectiveness of our models.

References

1. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in neural information processing systems. pp. 737–744 (1994)
2. Chang, W.C., Yu, F.X., Chang, Y.W., Yang, Y., Kumar, S.: Pre-training tasks for embedding-based large-scale retrieval. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=rkg-mA4FDr
3. Cohen, D., Mitra, B., Hofmann, K., Croft, W.B.: Cross domain regularization for neural ranking models using adversarial learning. CoRR abs/1805.03403 (2018)
4. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://www.aclweb.org/anthology/N19-1423
5. Gillick, D., Presta, A., Tomar, G.S.: End-to-end retrieval in continuous space. CoRR abs/1811.08008 (2018)
6. Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017)
7. Karpukhin, V., Oğuz, B., Min, S., Wu, L., Edunov, S., Chen, D., Yih, W.t.: Dense passage retrieval for open-domain question answering. CoRR (04 2020)
8. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014)
9. Lee, K., Chang, M.W., Toutanova, K.: Latent retrieval for weakly supervised open domain question answering. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 6086–6096. Association for Computational Linguistics, Florence, Italy (Jul 2019). https://doi.org/10.18653/v1/P19-1612, https://www.aclweb.org/anthology/P19-1612
10. Lin, J.: The neural hype and comparisons against weak baselines. In: ACM SIGIR Forum. ACM New York, NY, USA (2019)
11. Liu, W., Wang, J., Kumar, S., Chang, S.F.: Hashing with graphs. In: Proceedings of the International Conference on Machine Learning (2011)
12. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692 (2019)
13. Luan, Y., Eisenstein, J., Toutanova, K., Collins, M.: Sparse, dense, and attentional representations for text retrieval. arXiv preprint arXiv:2005.00181 (2020)
14. Ma, J., Korotkov, I., Yang, Y., Hall, K., McDonald, R.: Zero-shot neural retrieval via domain-targeted synthetic query generation (2020)
15. MacAvaney, S., Yates, A., Cohan, A., Goharian, N.: CEDR: Contextualized embeddings for document ranking. In: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (2019)
16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to information retrieval. Cambridge University Press (2008)
17. McDonald, R., Brokos, G., Androutsopoulos, I.: Deep relevance ranking using enhanced document-query interactions. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. pp. 1849–1860. Association for Computational Linguistics, Brussels, Belgium (Oct-Nov 2018). https://doi.org/10.18653/v1/D18-1211, https://www.aclweb.org/anthology/D18-1211
18. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
19. Palangi, H., Deng, L., Shen, Y., Gao, J., He, X., Chen, J., Song, X., Ward, R.: Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24(4), 694–707 (2016)
20. Pappas, D., McDonald, R., Brokos, G.I., Androutsopoulos, I.: AUEB at BioASQ 7: document and snippet retrieval. In: Proceedings of the BioASQ Workshop (2019)
21. Robertson, S., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M.: Okapi at TREC-3. In: Overview of the Third Text REtrieval Conference (TREC-3). pp. 109–126. Gaithersburg, MD: NIST (January 1995), https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/
22. Tsatsaronis, G., Balikas, G., Malakasiotis, P., Partalas, I., Zschunke, M., Alvers, M.R., Weissenborn, D., Krithara, A., Petridis, S., Polychronopoulos, D., Almirantis, Y., Pavlopoulos, J., Baskiotis, N., Gallinari, P., Artiéres, T., Ngomo, A.C.N., Heino, N., Gaussier, E., Barrio-Alvers, L., Schroeder, M., Androutsopoulos, I., Paliouras, G.: An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition. BMC Bioinformatics 16, 138 (April 2015). https://doi.org/10.1186/s12859-015-0564-6
23. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in neural information processing systems, pp. 5998–6008. Curran Associates, Inc. (2017), http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf
24. Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: When to warp? In: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA). pp. 1–6 (2016)
25. Wu, X., Guo, R., Simcha, D., Dopson, D., Kumar, S.: Efficient inner product approximation in hybrid spaces. arXiv preprint arXiv:1903.08690 (2019)
26. Yang, W., Zhang, H., Lin, J.: Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972 (2019)
27. Yih, W.t., Toutanova, K., Platt, J.C., Meek, C.: Learning discriminative projections for text similarity measures.
In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning. pp. 247–256. Association for Computational Linguistics, Portland, Oregon, USA (Jun 2011), https://www.aclweb.org/anthology/W11-0329