BM25 Pseudo Relevance Feedback Using Anserini at Waseda University

Zhaohao Zeng
Waseda University
Tokyo, Japan
zhaohao@fuji.waseda.jp

Tetsuya Sakai
Waseda University
Tokyo, Japan
tetsuyasakai@acm.org

ABSTRACT
We built a Docker image for the BM25PRF (BM25 with Pseudo Relevance Feedback) retrieval model with Anserini. Grid search is also provided in the Docker image for parameter tuning. Experimental results suggest that BM25PRF with default parameters outperforms vanilla BM25 on robust04, but tuning the parameters on 49 topics of robust04 did not further improve its effectiveness.

Image Source: github.com/osirrc/anserini-bm25prf-docker

Docker Hub: hub.docker.com/r/osirrc2019/anserini-bm25prf

Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). OSIRRC 2019 co-located with SIGIR 2019, 25 July 2019, Paris, France.

1    OVERVIEW
BM25 has been widely used as a baseline model for text retrieval tasks. However, some studies only implement the vanilla form of BM25, without query expansion or parameter tuning. As a result, the performance of BM25 may be underestimated [2]. In our Docker image, we implemented BM25PRF [3], which utilises Pseudo Relevance Feedback (PRF) to expand queries for BM25. We also implemented parameter tuning in the Docker image, because we believe that how the optimised parameters are obtained is also an important part of reproducible research. We built BM25PRF and the parameter tuning with Anserini [5], a toolkit built on top of Lucene for replicable IR research.

2    RETRIEVAL MODELS
Given a query q, BM25PRF [3] first ranks the collection with classic BM25, and then extracts m terms that have high Offer Weights (OW) from the top R ranked documents to expand the query. To calculate the Offer Weight of a term t_i, its Relevance Weight (RW) needs to be calculated first:

    RW(t_i) = log [ (r + 0.5)(N - n - R + r + 0.5) / ((n - r + 0.5)(R - r + 0.5)) ]        (1)

where r is the Document Frequency (DF) of term t_i in the top R documents, n is the DF of t_i in the whole collection, and N is the number of documents in the collection. Then, the Offer Weight is

    OW(t_i) = RW(t_i) · r                                                                  (2)

However, in practice a term with a high r will tend to have a high OW, so some common words (e.g., be, been) may obtain high Offer Weights and be selected as expansion terms. Since such common words are not informative, they may not be helpful for ranking. Thus, in our Docker image, a logarithm is applied to r in the OW calculation to alleviate this problem, following Sakai and Robertson [4]:

    OW(t_i) = RW(t_i) · log(r)                                                             (3)
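To make the term-selection step concrete, the following is a minimal Python sketch of Equations (1) and (3): given the tokenised top-R documents from the first search, it computes an Offer Weight for each candidate term and keeps the m highest. This is an illustration only, not the Java code in our image, and the input names (top_r_docs, collection_df, N, m) are assumptions made for the sketch.

    import math
    from collections import Counter

    def select_expansion_terms(top_r_docs, collection_df, N, m):
        """Pick m expansion terms by Offer Weight from the top-R feedback documents.

        top_r_docs    -- list of token lists, one per top-R document of the first search
        collection_df -- dict mapping a term to its document frequency in the whole collection
        N             -- number of documents in the collection
        """
        R = len(top_r_docs)
        # r: document frequency of each term within the top-R feedback documents
        r_counts = Counter(term for doc in top_r_docs for term in set(doc))
        scored = []
        for term, r in r_counts.items():
            n = collection_df[term]
            rw = math.log((r + 0.5) * (N - n - R + r + 0.5) /
                          ((n - r + 0.5) * (R - r + 0.5)))        # Eq. (1)
            ow = rw * math.log(r)                                  # Eq. (3): log(r) instead of r
            scored.append((ow, term))
        scored.sort(reverse=True)
        return [term for _, term in scored[:m]]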
After query expansion, the expanded terms are used for the second search with a BM25 variant: for a term t_i and a document d_j, the score s(t_i, d_j) is calculated as follows:

    s(t_i, d_j) = s'(t_i, d_j)        if t_i ∈ q
    s(t_i, d_j) = w · s'(t_i, d_j)    otherwise                                            (4)

    s'(t_i, d_j) = RW(t_i) · TF(t_i, d_j) · (K1 + 1) / ( K1 · ((1 - b) + b · NDL(d_j)) + TF(t_i, d_j) )   (5)

where TF(t_i, d_j) is the term frequency of term t_i in d_j, NDL(d_j) is the normalised document length of d_j, i.e., NDL(d_j) = N · |d_j| / Σ_k |d_k|, w is the weight of new terms, and K1 and b are the hyper-parameters of BM25. All the tunable hyper-parameters are shown in Table 1.

Table 1: Tunable parameters of BM25PRF and their search spaces in the parameter tuning script.

               Search space            Default     Note
    K1         0.1 - 0.9, step 0.1     0.9         K1 of the first search
    b          0.1 - 0.9, step 0.1     0.4         b of the first search
    K1prf      0.1 - 0.9, step 0.1     0.9         K1 of the second search
    bprf       0.1 - 0.9, step 0.1     0.4         b of the second search
    R          {5, 10, 20}             10          number of relevant docs
    w          {0.1, 0.2, 0.5, 1}      0.2         weight of new terms
    m          {0, 5, 10, 20, 40}      20          number of new terms
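As a companion to the sketch above, the per-term contribution of Equations (4) and (5) can be sketched as a single function; the default argument values simply mirror the defaults in Table 1, and the function is an illustration rather than our actual implementation.

    def term_score(tf, rw, ndl, in_original_query, k1=0.9, b=0.4, w=0.2):
        """Second-pass contribution of one term to one document's BM25PRF score.

        tf  -- term frequency TF(t_i, d_j)
        rw  -- relevance weight RW(t_i) from Eq. (1)
        ndl -- normalised document length NDL(d_j) = N * |d_j| / sum_k |d_k|
        in_original_query -- True if t_i appears in the original query q
        """
        s_prime = rw * tf * (k1 + 1) / (k1 * ((1 - b) + b * ndl) + tf)   # Eq. (5)
        return s_prime if in_original_query else w * s_prime             # Eq. (4)

A document's second-pass score is then the sum of these contributions over the original query terms and the m expansion terms.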
3    TECHNICAL DESIGN
Supported Collections: robust04
Supported Hooks: init, index, search, train

Since the BM25PRF retrieval model is not included in the original Anserini library, we forked its repository and added two Java classes, BM25PRFSimilarity and BM25PRFReRanker, by extending the Similarity class and the ReRanker class, respectively. Thus, the implemented BM25PRF can be used on any collection supported by Anserini, though we only tested it on the robust04 collection in this paper. Python scripts are used as hooks to run the necessary commands (e.g., index and search) via jig,¹ a tool provided by the OSIRRC organisers to operate Docker images that follow the OSIRRC specification.

¹ https://github.com/osirrc/jig
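To give a feel for these hooks, the following is a rough sketch of what a search hook might look like. The JSON fields, container paths, and the omitted BM25PRF flags are illustrative assumptions, not the exact interface defined by jig or used in our image; only the Anserini entry point io.anserini.search.SearchCollection and its -index/-topics/-output options are standard.

    #!/usr/bin/env python3
    # Hypothetical sketch of a "search" hook: jig runs this script inside the
    # container, and the script shells out to Anserini to produce a TREC run file.
    # All JSON fields, paths, and extra flags below are illustrative placeholders.
    import json
    import subprocess
    import sys

    def main():
        config = json.loads(sys.argv[1])   # e.g. {"collection": "robust04", "topic": "...", ...}
        cmd = [
            "java", "-cp", "anserini.jar", "io.anserini.search.SearchCollection",
            "-index", "/index",                       # index built by the index hook
            "-topics", config["topic"],
            "-output", "/output/run.bm25prf.txt",
            # flags selecting the BM25PRF similarity/re-ranker would be added here
        ]
        subprocess.run(cmd, check=True)

    if __name__ == "__main__":
        main()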






Grid search is also provided in the Docker image for parameter tuning, and can be executed using the train hook of jig. It performs search and evaluation for every combination of the specified parameters. To reduce the search space of the grid search, our tuning process consists of two steps. First, it performs search on a validation set using the original BM25 to find its optimal parameters (i.e., K1 and b) based on mean P@20; K1 and b are the parameters of the initial pass of BM25PRF, so precision may be important for extracting effective expansion terms. Then, the tuned K1 and b are frozen, and the other parameters of BM25PRF (i.e., K1prf, bprf, R, w, and m) are tuned on the validation set based on MAP.
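For illustration, the two-step tuning amounts to roughly the following sketch; run_and_evaluate is a hypothetical stand-in for the Anserini search and trec_eval calls that the actual training script performs.

    from itertools import product

    def run_and_evaluate(metric, **params):
        """Hypothetical stand-in: run a search on the validation topics with the
        given parameters and return the requested metric (e.g. P@20 or MAP)."""
        return 0.0  # placeholder; the real script runs Anserini and trec_eval here

    grid = [round(0.1 * i, 1) for i in range(1, 10)]            # 0.1, 0.2, ..., 0.9

    # Step 1: tune K1 and b of the first search by mean P@20.
    best_k1, best_b = max(product(grid, grid),
                          key=lambda p: run_and_evaluate("P_20", k1=p[0], b=p[1]))

    # Step 2: freeze K1 and b, tune the PRF parameters of the second search by MAP.
    prf_space = product(grid, grid,                             # K1prf, bprf
                        [5, 10, 20],                            # R
                        [0.1, 0.2, 0.5, 1],                     # w
                        [0, 5, 10, 20, 40])                     # m
    best_prf = max(prf_space,
                   key=lambda p: run_and_evaluate("map", k1=best_k1, b=best_b,
                                                  k1_prf=p[0], b_prf=p[1],
                                                  r=p[2], w=p[3], m=p[4]))

With the search spaces in Table 1, step 2 alone evaluates 9 × 9 × 3 × 4 × 5 = 4,860 combinations, which is why K1 and b are tuned separately first rather than jointly with the PRF parameters.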
4    RESULTS
The parameter tuning was performed on 49 topics² of robust04, and the tuned parameters are shown in Table 2. As shown in Table 3, BM25PRF outperforms vanilla BM25, but the tuned hyper-parameters do not improve BM25PRF's performance on robust04. This may be because the validation set used for tuning is too small and the parameters have overfitted. Since the goal of this study is to use Docker for reproducible IR research rather than to demonstrate the effectiveness of BM25PRF and grid search, we do not discuss the performance further in this paper.

² The topic ids of the validation set are provided by the OSIRRC organisers in jig: https://github.com/osirrc/jig/tree/master/sample_training_validation_query_ids

                  Table 2: Tuned hyper-parameters.

                           K1      b      K1prf       bprf      m     R       w
       Tuned Value         0.9     0.2    0.9         0.6       40    10      0.1



          Table 3: BM25PRF performance on robust04.

           Model                                       MAP          P@30
           BM25 [1]                                    0.2531       0.3102
           BM25PRF (default parameters)                0.2928       0.3438
           BM25PRF (tuned parameters)                  0.2916       0.3396



5    OSIRRC EXPERIENCE
Docker has been widely used in industry for delivering software, but we found that using Docker to manage an experimental environment has advantages for research as well. First, it is easier to configure environments with Docker than with a bare-metal server, especially in deep learning scenarios where many packages (e.g., the GPU driver, CUDA and cuDNN) need to be installed. Moreover, Docker makes experiments more trackable. Research code is usually messy, lacks documentation, and may need many changes during an experiment, so even the author may have difficulty remembering the whole change log. Since each Docker tag is an executable archive, it provides a kind of version control over executables. Furthermore, if the Docker images follow a common specification like the one we built for OSIRRC, running research code we wrote several months ago is no longer a nightmare.

However, the biggest obstacle we faced during the development for OSIRRC is that debugging is more difficult with Docker and jig. For example, there is no simple way to attach a debugger to the Docker container once it has been launched by jig. Furthermore, the current jig assumes that the index files are built inside the Docker container and commits the container together with the built index as a new image, which means that the index needs to be rebuilt after every modification of the source code. While this is not a serious problem for small collections like robust04, it may take too much time for large collections. To solve this problem, we think jig should allow users to mount an external index when launching the search hook. Although mounting external data into a Docker container is a standard operation when using Docker's command line tools directly, OSIRRC expects Docker images to be operated through jig, which currently does not provide such a feature.

REFERENCES
[1] Ryan Clancy and Jimmy Lin. 2019. osirrc/anserini-docker: OSIRRC @ SIGIR 2019 Docker Image for Anserini. https://doi.org/10.5281/zenodo.3246820
[2] Jimmy Lin. 2019. The Neural Hype and Comparisons Against Weak Baselines. In ACM SIGIR Forum, Vol. 52. ACM, 40–51.
[3] Stephen E. Robertson and Karen Spärck Jones. 1994. Simple, proven approaches to text retrieval. Technical Report 356. Computer Laboratory, University of Cambridge.
[4] Tetsuya Sakai and Stephen E. Robertson. 2002. Relative and absolute term selection criteria: a comparative study for English and Japanese IR. In SIGIR 2002. 411–412.
[5] P. Yang, H. Fang, and J. Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In SIGIR 2017. 1253–1256.



