BM25 Pseudo Relevance Feedback Using Anserini at Waseda University

Zhaohao Zeng, Waseda University, Tokyo, Japan, zhaohao@fuji.waseda.jp
Tetsuya Sakai, Waseda University, Tokyo, Japan, tetsuyasakai@acm.org

ABSTRACT

We built a Docker image for the BM25PRF (BM25 with Pseudo Relevance Feedback) retrieval model with Anserini. Grid search is also provided in the Docker image for parameter tuning. Experimental results suggest that BM25PRF with default parameters outperforms vanilla BM25 on robust04, but tuning the parameters on 49 topics of robust04 did not further improve its effectiveness.

Image Source: github.com/osirrc/anserini-bm25prf-docker
Docker Hub: hub.docker.com/r/osirrc2019/anserini-bm25prf

1 OVERVIEW

BM25 has been widely used as a baseline model for text retrieval tasks. However, some studies implement only the vanilla form of BM25, without query expansion or parameter tuning. As a result, the performance of BM25 may be underestimated [2]. In our Docker image, we implemented BM25PRF [3], which utilises Pseudo Relevance Feedback (PRF) to expand queries for BM25. We also implemented parameter tuning in the Docker image, because we believe that how the optimised parameters are obtained is also an important part of reproducible research. We built BM25PRF and the parameter tuning with Anserini [5], a toolkit built on top of Lucene for replicable IR research.

2 RETRIEVAL MODELS

Given a query $q$, BM25PRF [3] first ranks the collection with classic BM25, and then extracts $m$ terms with high Offer Weights ($OW$) from the top $R$ ranked documents to expand the query. To calculate the Offer Weight of a term $t_i$, its Relevance Weight ($RW$) is calculated first as follows:

$$RW(t_i) = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \quad (1)$$

where $r$ is the Document Frequency (DF) of term $t_i$ in the top $R$ documents, $n$ is the DF of $t_i$ in the whole collection, and $N$ is the number of documents in the collection. Then, the Offer Weight is

$$OW(t_i) = RW(t_i) \cdot r \quad (2)$$

However, in practice a term with a high $r$ tends to have a high $OW$, so some common words (e.g., be, been) may receive high Offer Weights and be selected as expansion terms. Since such common words are not informative, they may not be helpful for ranking. Thus, in our Docker image, a logarithm is applied to $r$ in the $OW$ calculation to alleviate this problem, following Sakai and Robertson [4]:

$$OW(t_i) = RW(t_i) \cdot \log(r) \quad (3)$$
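To make the term-selection step concrete, here is a minimal Python sketch of equations (1) and (3). It is an illustration, not the actual Anserini Java implementation; the inputs (tokenised top-$R$ documents, a precomputed collection DF table `df`, and the collection size `N`) and the function names are assumptions.

```python
import math
from collections import Counter

def relevance_weight(r: int, n: int, R: int, N: int) -> float:
    """Equation (1): RW of a term with DF r in the top-R documents
    and DF n in a collection of N documents."""
    return math.log((r + 0.5) * (N - n - R + r + 0.5)
                    / ((n - r + 0.5) * (R - r + 0.5)))

def select_expansion_terms(top_docs, df, N, m=20):
    """Return the m terms with the highest Offer Weights, using the
    log-dampened variant of equation (3). top_docs is a list of
    tokenised documents (lists of terms)."""
    R = len(top_docs)
    # r: how many of the top-R documents contain each term.
    r_counts = Counter(t for doc in top_docs for t in set(doc))
    # Note that log(1) = 0, so a term occurring in only one feedback
    # document receives a zero Offer Weight under equation (3).
    ow = {t: relevance_weight(r, df[t], R, N) * math.log(r)
          for t, r in r_counts.items()}
    return sorted(ow, key=ow.get, reverse=True)[:m]
```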
After query expansion, the expanded terms are used for a second search with a BM25 variant: for a term $t_i$ and a document $d_j$, the score $s(t_i, d_j)$ is calculated as follows:

$$s(t_i, d_j) = \begin{cases} s'(t_i, d_j) & \text{if } t_i \in q \\ w \cdot s'(t_i, d_j) & \text{otherwise} \end{cases} \quad (4)$$

$$s'(t_i, d_j) = \frac{RW(t_i) \cdot TF(t_i, d_j) \cdot (K_1 + 1)}{K_1 \cdot ((1 - b) + b \cdot NDL(d_j)) + TF(t_i, d_j)} \quad (5)$$

where $TF(t_i, d_j)$ is the term frequency of term $t_i$ in $d_j$, $NDL(d_j)$ is the normalised document length of $d_j$, i.e., $NDL(d_j) = \frac{N \cdot |d_j|}{\sum_k |d_k|}$, $w$ is the weight of new terms, and $K_1$ and $b$ are the hyper-parameters of BM25. All the tunable hyper-parameters are shown in Table 1.

Table 1: Tunable parameters of BM25PRF and their search spaces in the parameter tuning script.

        Search space         Default   Note
K1      0.1-0.9, step=0.1    0.9       K1 of the first search
b       0.1-0.9, step=0.1    0.4       b of the first search
K1prf   0.1-0.9, step=0.1    0.9       K1 of the second search
bprf    0.1-0.9, step=0.1    0.4       b of the second search
R       {5, 10, 20}          10        number of relevant docs
w       {0.1, 0.2, 0.5, 1}   0.2       weight of new terms
m       {0, 5, 10, 20, 40}   20        number of new terms

3 TECHNICAL DESIGN

Supported Collections: robust04
Supported Hooks: init, index, search, train

Since the BM25PRF retrieval model is not included in the original Anserini library, we forked its repository and added two Java classes, BM25PRFSimilarity and BM25PRFReRanker, by extending the Similarity and ReRanker classes, respectively. Thus, the implemented BM25PRF can be used on any collection supported by Anserini, though we have only tested it on the robust04 collection in this paper. Python scripts are used as hooks to run the necessary commands (e.g., index and search) via jig (https://github.com/osirrc/jig), a tool provided by the OSIRRC organisers to operate Docker images that follow the OSIRRC specification.

Grid search is also provided in the Docker image for parameter tuning, and can be executed using the train hook of jig. It performs search and evaluation for every combination of the specified parameters. To reduce the search space of the grid search, our tuning process consists of two steps. First, it searches a validation set with the original BM25 to find its optimal parameters (i.e., $K_1$ and $b$) based on Mean P@20; these $K_1$ and $b$ values drive the initial search of BM25PRF, where precision may be important for extracting effective expansion terms. Then, the tuned $K_1$ and $b$ are frozen, and the remaining parameters of BM25PRF (i.e., $K1_{prf}$, $b_{prf}$, $R$, $w$, and $m$) are tuned on the validation set based on MAP.
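As an illustration of this two-step procedure, the following Python sketch shows the structure of the tuning loop. The `evaluate_bm25` and `evaluate_bm25prf` callbacks are hypothetical stand-ins for running a search via jig and scoring the run on the validation topics; they are not part of the actual image.

```python
import itertools

# Search spaces from Table 1.
STEPS = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

def two_step_grid_search(evaluate_bm25, evaluate_bm25prf):
    """Step 1: tune K1/b of the first search on Mean P@20.
    Step 2: freeze K1/b and tune the PRF parameters on MAP."""
    # Step 1: evaluate_bm25(k1, b) is assumed to return Mean P@20.
    k1, b = max(itertools.product(STEPS, STEPS),
                key=lambda p: evaluate_bm25(*p))
    # Step 2: evaluate_bm25prf(k1, b, k1_prf, b_prf, R, w, m)
    # is assumed to return MAP.
    prf_space = itertools.product(STEPS, STEPS,       # K1prf, bprf
                                  [5, 10, 20],        # R
                                  [0.1, 0.2, 0.5, 1], # w
                                  [0, 5, 10, 20, 40]) # m
    k1_prf, b_prf, R, w, m = max(
        prf_space, key=lambda p: evaluate_bm25prf(k1, b, *p))
    return dict(k1=k1, b=b, k1_prf=k1_prf, b_prf=b_prf, R=R, w=w, m=m)
```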
4 RESULTS

The parameter tuning was performed on 49 topics of robust04 (the topic ids of the validation set are provided by the OSIRRC organisers in jig: https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids), and the tuned parameters are shown in Table 2. As shown in Table 3, BM25PRF outperforms vanilla BM25, but the tuned hyper-parameters do not improve BM25PRF's performance on robust04. This may be because the validation set used for tuning is too small, so the parameters have been overfitted. Since the goal of this study is to use Docker for reproducible IR research rather than to demonstrate the effectiveness of BM25PRF and grid search, we do not discuss the performance further in this paper.

Table 2: Tuned hyper-parameters.

             K1   b    K1prf  bprf  m   R   w
Tuned value  0.9  0.2  0.9    0.6   40  10  0.1

Table 3: BM25PRF performance on robust04.

Model                          MAP     P@30
BM25 [1]                       0.2531  0.3102
BM25PRF (default parameters)   0.2928  0.3438
BM25PRF (tuned parameters)     0.2916  0.3396

5 OSIRRC EXPERIENCE

Docker has been widely used in industry for delivering software, but we found that using Docker to manage an experimental environment has advantages for research as well. First, it is easier to configure environments with Docker than on a bare-metal server, especially in deep learning scenarios where many packages (e.g., the GPU driver, CUDA, and cuDNN) need to be installed. Moreover, Docker makes experiments more trackable. Research code is usually messy, lacks documentation, and may undergo many changes during an experiment, so even the author may have difficulty remembering the whole change log. Since each Docker tag is an executable archive, it provides a kind of version control over executables. Furthermore, if Docker images follow a common specification, like the ones we built for OSIRRC, running research code written several months ago is no longer a nightmare.

However, the biggest obstacle we faced during development for OSIRRC is that debugging is more difficult with Docker and jig. For example, there is no simple way to attach a debugger to the Docker container when it is launched by jig. Furthermore, the current jig assumes that the index files are built inside the Docker container, and commits the container together with the built index as a new image, which means that the index needs to be rebuilt after every modification of the source code. While this is not a serious problem for small collections like robust04, it may take too much time for large collections. To solve this problem, we think jig should allow users to mount an external index when launching the search hook. Although mounting external data into a Docker container is a standard operation when using Docker's command-line tools directly, OSIRRC expects Docker images to be operated through jig, which currently does not provide such a feature.

REFERENCES

[1] Ryan Clancy and Jimmy Lin. 2019. osirrc/anserini-docker: OSIRRC @ SIGIR 2019 Docker Image for Anserini. https://doi.org/10.5281/zenodo.3246820
[2] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. ACM SIGIR Forum, Vol. 52. 40–51.
[3] Stephen E. Robertson and Karen Spärck Jones. 1994. Simple, proven approaches to text retrieval. Technical Report 356. Computer Laboratory, University of Cambridge.
[4] Tetsuya Sakai and Stephen E. Robertson. 2002. Relative and absolute term selection criteria: a comparative study for English and Japanese IR. In SIGIR 2002. 411–412.
[5] Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In SIGIR 2017. 1253–1256.