Diverse Semantics Representation is King: MIRMU and MSM at ARQMath 2022

Martin Geletka, Vojtěch Kalivoda, Michal Štefánik, Marek Toma and Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, 602 00 Brno, Czech Republic

CLEF 2022 – Conference and Labs of the Evaluation Forum, September 5–8, 2022, Bologna, Italy
456576@mail.muni.cz (M. Geletka); 527350@mail.muni.cz (V. Kalivoda); stefanik.m@mail.muni.cz (M. Štefánik); 485275@mail.muni.cz (M. Toma); sojka@fi.muni.cz (P. Sojka)
ORCID: 0000-0001-6325-978X (M. Geletka); 0000-0003-1766-5538 (M. Štefánik); 0000-0002-5768-4007 (P. Sojka)

Abstract
We report on the systems that the Math Information Retrieval group at Masaryk University (mirmu) and the team of Faculty of Informatics students (msm) prepared for Task 1 (Find Answers) of the arqmath lab at the clef conference. To study the effects of different system settings and hyperparameters, we prototyped several diverse math-aware information retrieval (mir) systems: both “old” inverted-index-based ones and new neural ones. By ensembling the results of the “weak” individual systems into committees, we report on the implications, benefits, and drawbacks of system ensembling. We evaluated the proposed individual systems and ensembles, considering their diversity, hyperparameters, and the representations they use, and classified their approaches. Our prototypes have helped us understand the challenging problems of question answering in the stem domain: the key lies in the proper representation of document semantics. Our reproducible evaluation Python library PV211-utils makes it possible to reproduce and further advance mir research.

Keywords
Information retrieval, question answering, math representations, math-aware information retrieval, word embeddings, ensembling, voting, reranking, data fusion, diversity, transformers

“Content is king.” (Bill Gates)
“Properly fused diverse content is king, and context is queen.” (Petr Sojka)

1. Introduction

Math Information Retrieval (mir) and the math-aware representation of the meaning of scientific documents have been researched at the mir laboratory at Masaryk University for decades, as nicely summarized by Novotný [1] in his dissertation. As in the previous year [2], we formed two teams, mirmu and msm. Under the mirmu team, we submitted five different versions of a deep neural information retrieval system that aims to surpass the performance of tf-idf-like systems. Under the msm team, we submitted different versions of the students’ information systems and their ensemble with the best variant from the mirmu submission. Finally, we report that an ensemble of all fine-tuned individual systems combined by reciprocal rank fusion performed best.

Our arqmath reports [3, 4] showed promising directions stemming from the enormous capacity of neural language models, their ensembling [5], their different training sets, hyperparametrization, input preprocessing, and math tokenization.

Our systems were mainly developed as part of the Information Retrieval course PV211 and will allow reproducible research using the Python package PV211-utils [6]. In Section 2 we describe the resources and methods used to train and develop our systems. Section 3 describes the systems and strategies used to prepare our runs. We report and evaluate our results in Section 4.
Section 5 sums up our conclusions and draws possible research plans based on the computed metrics and the availability of the collected systems, ensembling techniques, and ground-truth datasets.

2. Datasets and Methods

This section describes the math representations ingested by our information retrieval systems, the corpora used for training the models that power our systems, and the relevance judgments we used for parameter optimization, model selection, and performance estimation.

2.1. Math Input Representations

We used the most straightforward math representation for all our submitted systems: LaTeX. In one submitted run of the mirmu team, we studied the effect of the LaTeX encoding of math compared to text alone, to see how the presence of the math representation affects the resulting score.

2.2. Datasets and Methods

We described our datasets, ensembling methods, and evaluation measures in detail in our previous reports [5] and [2, Section 2]. We also used datasets from arqmath 2021 and arqmath 2022 [7].

3. Systems Description

The following sections describe the nine systems that students developed as part of their studies. Their diversity brings different ways of representing the meaning of math content and of picking and reranking the answers for the given topics. Table 1 summarizes the ten runs submitted by both MU teams.

3.1. Retriever + ReRanker System

Our systems submitted under the mirmu team consist of the following parts, applied sequentially one after another, inspired by the RE3QA architecture [8]:

• Indexer – assigns each document a dense vector representation;
• Retriever – computes the dense vector representation of the input query, computes the cosine similarity between the query and each document representation, and sorts all documents by this similarity;
• ReRanker – reranks the top-k most relevant documents from the retrieval step.

We achieved the best results with tiered reranking, which takes multiple non-overlapping slices of the top-k results and reranks each slice separately. This part is computationally expensive; therefore, we cannot apply it to the whole dataset in practice.

For implementing our system, we used the Sentence Transformers library, a Python framework for state-of-the-art sentence, text, and image embeddings [9].

3.1.1. Implementation and Hyper-parameters

For the implementation of the Base model, we used a BiEncoder with the pre-trained all-MiniLM-L12-v2 model [10]. For the ReRanking phase, we fine-tuned a CrossEncoder with the pre-trained roberta-large model [11]. In the same way, we experimented with a math-specific CrossEncoder, MathBERTa (https://huggingface.co/witiko/mathberta) [12], which extends generic RoBERTa with math-specific tokenization and fine-tunes the extended model on an arXiv collection. We used the Sentence Transformers library (https://www.sbert.net/docs/pretrained_models.html) for the fine-tuning of all our models.

The architecture of both models and their differences are depicted in Figure 1.

Figure 1: Bi-Encoder vs Cross-Encoder model scheme (image taken from www.sbert.net/examples/applications/cross-encoder/README.html)

In the Base model, only the ReRanker was fine-tuned, as fine-tuning the generic Retriever, which is pre-trained for retrieval on vast and heterogeneous datasets, did not bring measurable quality benefits. The fine-tuned variants were submitted as alternative runs.
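The following is a minimal sketch of such a Retriever + ReRanker pipeline with the Sentence Transformers library. The toy corpus, the tier boundaries, and the publicly available cross-encoder/stsb-roberta-large checkpoint (a stand-in for our fine-tuned RoBERTa and MathBERTa ReRankers) are illustrative assumptions, not the submitted configuration.

```python
# Minimal sketch of the Retriever + ReRanker pipeline: a bi-encoder retrieves by
# cosine similarity, and a cross-encoder reranks non-overlapping tiers of the hits.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L12-v2")          # Indexer + Retriever
reranker = CrossEncoder("cross-encoder/stsb-roberta-large")   # stand-in ReRanker

answers = ["Answer body 1 ...", "Answer body 2 ...", "Answer body 3 ..."]   # toy corpus
answer_embeddings = retriever.encode(answers, convert_to_tensor=True)       # Indexer step

def retrieve_and_rerank(query, top_k=100, tiers=(10, 50, 100)):
    # Retriever: cosine similarity between the query and every answer embedding.
    query_embedding = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, answer_embeddings, top_k=top_k)[0]

    # Tiered ReRanker: re-score each non-overlapping slice of the hit list separately,
    # so only the top of the ranking pays the cross-encoder cost.
    reranked, start = [], 0
    for end in tiers:
        tier = hits[start:end]
        if not tier:
            break
        scores = reranker.predict([(query, answers[hit["corpus_id"]]) for hit in tier])
        reranked.extend(hit for _, hit in sorted(zip(scores, tier), key=lambda pair: -pair[0]))
        start = end
    return reranked + hits[start:]        # keep any unreranked tail in retrieval order

print(retrieve_and_rerank("How do I prove that this limit exists?"))
```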
3.1.2. ReRanker fine-tuning

The ReRanker was fine-tuned using the BCE-with-logits loss and the AdamW optimizer. The training set consisted of the relevance judgments (the train and validation sets from ARQMath 2021), extended by additional samples. The additional positive samples, with the label equal to 1, were pairs of a question and an answer from the same thread, where the answer received at least 50 upvotes. The additional negative samples, with the label equal to 0, were the most similar documents given by the Retriever but with no overlapping Math Stack Exchange tags. Because finding similar documents for all questions would not be feasible, we decided to generate ten negative samples for ten random questions at the start of each epoch. The model was trained on all training queries and the same number of additional samples every epoch. The ratio of positive to negative samples was 0.5. The training process was stopped once there was no further decrease in loss on the test set from ARQMath 2021.

3.1.3. Description of Individual Runs

For the mirmu team, we submitted different versions of the described system to see the effect of its individual components. The variations we used are:

• fine-tuning / not fine-tuning the Retriever model;
• using a ReRanker model pre-trained on math texts (RoBERTa vs MathBERTa) [12];
• using a different representation of the input (text vs text + LaTeX).

We submitted the Base model as the primary run, since no variant outperformed it. The submitted runs are:

• Base – the model described in the previous section;
• Trained only on text – a system using only the text representation of documents and queries;
• MathBERTa ReRanker – a system using the MathBERTa model as the ReRanker;
• Retriever fine-tuned – the Retriever model adjusted using the relevance judgements collected over ARQMath 2020 and ARQMath 2021, using the Negatives Ranking loss [13, 5];
• MathBERTa ReRanker + Retriever fine-tuned – a system with the MathBERTa ReRanker and the fine-tuned Retriever model.

To quantify the effect of the individual changes in the mirmu system, only one parameter was changed at a time for a given run, and all other hyperparameters were left the same.

3.2. tf-idf

As one of the students’ baseline systems, we used the tf-idf model implementation available in the Gensim library [14].

In the preprocessing phase, we removed terms with extreme frequencies (absolute term frequency below 8 and relative document frequency above 0.7); we chose these hyperparameters experimentally on the training subset. Then we removed punctuation and repeated whitespace. Lastly, we tokenized each document by splitting on whitespace and stemmed the individual tokens using the Snowball stemmer available in the snowball_py library (https://github.com/shibukawa/snowball_py).

We used the smart (System for the Mechanical Analysis and Retrieval of Text) lnc tf-idf weighting variant [15]: logarithmic term frequency weighting, no document frequency weighting, and cosine document length normalization.
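As an illustration, here is a minimal sketch of such an lnc tf-idf system, assuming Gensim. The toy corpus, the relaxed no_below threshold (the value of 8 we used on the full collection would empty a toy dictionary), and NLTK's SnowballStemmer standing in for snowball_py are assumptions made only to keep the example runnable.

```python
# Minimal sketch of an "lnc" tf-idf retriever in Gensim: logarithmic term frequency,
# no document-frequency weighting, cosine length normalisation.
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseMatrixSimilarity
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

def preprocess(text):
    # drop punctuation, collapse whitespace, split on whitespace, stem each token
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower())
    return [stemmer.stem(token) for token in cleaned.split()]

documents = ["integral of x squared", "derivative of sin x", "sum of a geometric series"]
tokenized = [preprocess(document) for document in documents]

dictionary = Dictionary(tokenized)
dictionary.filter_extremes(no_below=1, no_above=0.7)   # the full collection used no_below=8
corpus = [dictionary.doc2bow(document) for document in tokenized]

tfidf = TfidfModel(corpus, dictionary=dictionary, smartirs="lnc")   # the SMART "lnc" variant
index = SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))

query = tfidf[dictionary.doc2bow(preprocess("derivative of a function"))]
print(list(index[query]))   # cosine similarity of the query against every document
```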
3.3. BM25

BM25+ is an improvement over BM25 introduced by Lv and Zhai [16]. Together with other alternatives, such as BM25-L, BM25-adapt, and BM25-T, this improvement surpasses the basic BM25 algorithm on trec collections [17]. BM25+ estimates the relevance of a document d for a query q by Formula (1):

    \mathrm{BM25}^{+}(d, q) = \sum_{t \in q} \log\left(\frac{N + 1}{\mathrm{df}_t}\right) \cdot \left(\frac{(k_1 + 1) \cdot \mathrm{tf}_{t,d}}{k_1 \cdot \left((1 - b) + b \cdot \frac{L_d}{L_{\mathrm{avg}}}\right) + \mathrm{tf}_{t,d}} + \delta\right),    (1)

where k1, b, and δ are hyperparameters, N is the number of documents in the collection, df_t is the number of documents containing the term t, tf_{t,d} is the frequency of term t in document d, L_d is the length of document d in words, and L_avg is the expected length of a document in words.

We represented each answer as a concatenation of its body with the title, body, and tags of its parent question. In the preprocessing stage, we first removed punctuation, repeated whitespace, and English stopwords. Then we transformed the text into lowercase and stemmed it using the Porter stemmer. Lastly, we tokenized the text by splitting it on whitespace. We used the implementation of BM25+ in the rank_bm25 Python library [18] with its default hyperparameters.

3.4. BM25 + tf-idf ensemble

As an example of the most straightforward possible ensemble of two unsupervised systems, we ensembled the tf-idf and BM25 models. We constructed the ensemble as a simple sum of the scores given by the individual systems. The BM25 system is configured as described in Section 3.3. The tf-idf system uses the same preprocessing as the BM25 system and the tf-idf implementation available in the Gensim library with the smart ltn weighting variant, which corresponds to logarithmic term frequency weighting, zero-corrected idf, and no document length normalization.

3.5. Compubert

We also submitted the Compubert model prepared and submitted by the mirmu team last year. The model and its hyper-parameters can be found in last year’s report [2, Section 3.4].

3.6. Ensemble Systems

This section describes the ensemble systems we used. Historically, there is a long tradition of boosting [19, 20], ensembling [21], data fusion [22], and voting approaches [23, 24] in information retrieval research. We believe that our systems, reflecting different ‘points of view’ on the search problem, agree on a small portion of the most relevant documents, whereas each individual system, depending on dozens of parameters, misses the great majority of relevant documents. With ensembling and voting techniques, we can combine the strengths of different systems to produce more accurate results [2]. All our ensemble algorithms are agnostic to the scoring functions of the individual systems and only use the ranks of the results.

3.6.1. IBC

ibc is an ensemble technique that we introduced at arqmath 2020 in our paper [5]. The ensemble combines SERPs from the individual systems by the Median Inverse Rank, which equals (1000 − M)/1000, where M is the median rank over the individual systems. For a detailed explanation of the ibc ensemble algorithm, we refer the reader to our arqmath 2020 paper [5].

3.6.2. RRF

We used reciprocal rank fusion (rrf) [25] to construct an ensemble model from the previously described systems. Given the ranks from all individual systems, the ensemble sorts the documents by Formula (2):

    \mathrm{rrf}(d) = \sum_{r \in R} \frac{1}{k + r(d)},    (2)

where R is the set of rankings and r(d) is the rank of document d in ranking r. The hyper-parameter k parameterizes the ensemble. We used the default value of k = 60 suggested in [25].
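Since rrf only needs the ranked lists of the individual systems, it takes a few lines of code. The following is a minimal sketch of Formula (2); the toy system outputs and document identifiers are made up for illustration.

```python
# Minimal sketch of reciprocal rank fusion (Formula 2): each system contributes
# 1 / (k + rank) for every document it ranks, and documents are sorted by the sum.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=1000):
    """rankings: iterable of ranked lists of document identifiers, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, document in enumerate(ranking, start=1):
            scores[document] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy example: three systems that only partially agree on the relevant documents.
system_a = ["d3", "d1", "d7", "d2"]
system_b = ["d1", "d3", "d5"]
system_c = ["d7", "d1", "d4", "d3"]
print(reciprocal_rank_fusion([system_a, system_b, system_c], k=60))
# d1 and d3, ranked highly by several systems, are fused to the top of the list.
```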
3.6.3. RBC

As the rbc model, we refer to a trained regression model that predicts the gain of the training relevance judgements from the ranks assigned by the individual systems. For the performance estimation of rbc, we produced a result list by taking the 1,000 answers with the highest predicted gain for each topic in the test subset.

3.6.4. WIBC

wibc is a weighted variant of the ibc algorithm. Instead of electing the candidate with the highest median rating, wibc elects the candidate with the highest weighted median rating. Instead of breaking ties by selecting a random rating out of a uniform distribution of all ratings, we select a random rating out of a weighted uniform distribution.

4. Evaluation and Results

To compare the submitted systems, we evaluate their performance on the topics from the previous arqmath competitions.

4.1. Submitted runs

For each system, we report the resulting scores in Table 2, in a form similar to the overview arqmath 2022 paper by Mansouri et al. [7]. The performance drop of the mirmu 2 run compared to the other runs indicates that training and adjusting the system on math is a must.

Table 1
Runs submitted by the msm and mirmu teams. The first run of each team was submitted as primary.

Team   Run  Official nick                          System description
msm    1    Ensemble_RRF_auto-both-P_primary       rrf Ensemble
msm    2    TF-IDF-auto-both-A                     tf-idf
msm    3    BM25_system-auto-both-A                BM25
msm    4    BM25_TfIdf_system-auto-both-A          tf-idf + BM25
msm    5    CompuBERT22-auto-both-A                Compubert
mirmu  1    MiniLM+RoBERTa-auto-both-P             Base
mirmu  2    MiniLM+RoBERTa-auto-text-A             Trained only on text
mirmu  3    MiniLM+MathRoBERTa-auto-both-A         MathBERTa ReRanker
mirmu  4    MiniLM_tuned+RoBERTa-auto-both-A       Retriever fine-tuned
mirmu  5    MiniLM_tuned+MathRoBERTa-auto-both-A   MathBERTa ReRanker + Retriever fine-tuned

4.2. Runs with enhanced systems and ensembles

We further fine-tuned our systems, benefiting from more ground-truth data and the experience gained from the previous evaluations:

• MathBERTa ReRanker – the MathBERTa ReRanker fine-tuned with altered preprocessing;
• BM25-based system – the BM25-based system msm 3 described in Section 3.3, where we optimized the hyperparameters k1, b, and δ using grid search (a minimal grid-search sketch appears at the end of this subsection). The values we found to yield the best ndcg′ on our training set are k1 = 1.8, b = 0.75, and δ = 1;
• Improved Base – the Base system with reduced preprocessing, a more precisely trained ReRanker, and improved tiered reranking with slices at indices 3, 7, 12, 16, 20, 50, 100, and 125;
• Retriever only – a system with the reranking phase removed. Document ranking is based only on the cosine similarity of the embeddings obtained from the Retriever.

We report the results of the extended experiments with ensembles in Table 3. rrf ensembles deliver the best results by a clear margin. The more diverse the combined systems are, the more quality metrics such as ndcg′ grow. As math information systems have to cope with genuinely complex problems, it is hard to build one complex system capable of learning everything involved: disambiguation of overloaded math symbols, structured ambiguous notation, deduction, and long causal dependencies.
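As referenced in the list above, the following is a minimal sketch of the grid search over the BM25+ hyperparameters, assuming the rank_bm25 library. The toy corpus, the toy topics, and the top-1 accuracy standing in for the ndcg′ evaluation are illustrative assumptions.

```python
# Minimal sketch of a grid search over the BM25+ hyperparameters k1, b, and delta.
# evaluate() is a toy stand-in: the real search optimised nDCG' on the ARQMath training topics.
from itertools import product
from rank_bm25 import BM25Plus

corpus = ["sum of a geometric series", "derivative of sin x", "proof by induction example"]
tokenized_corpus = [document.lower().split() for document in corpus]
training_topics = [("geometric series sum", {0})]   # (query, indices of relevant documents)

def evaluate(bm25, topics):
    # Toy metric: fraction of topics whose top-ranked document is relevant.
    hits = 0
    for query, relevant in topics:
        scores = bm25.get_scores(query.lower().split())
        hits += int(max(range(len(scores)), key=scores.__getitem__) in relevant)
    return hits / len(topics)

best = None
for k1, b, delta in product((1.2, 1.5, 1.8), (0.6, 0.75, 0.9), (0.5, 1.0)):
    bm25 = BM25Plus(tokenized_corpus, k1=k1, b=b, delta=delta)
    score = evaluate(bm25, training_topics)
    if best is None or score > best[0]:
        best = (score, k1, b, delta)
print(best)   # the search described above arrived at k1 = 1.8, b = 0.75, and delta = 1
```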
Table 2
arqmath 2022 competition results of the runs submitted by the msm team (1 ensemble and 4 diverse systems) and the mirmu team (5 variants of neural-based systems).

                                                     2020                    2021                    2022
System                                        nDCG′  MAP′@10  P′@10   nDCG′  MAP′@10  P′@10   nDCG′  MAP′@10  P′@10

msm runs
msm 1: rrf – Ensemble of all 4 msm + best mirmu  0.422  0.172  0.197   0.381  0.119  0.152   0.511  0.159  0.244
msm 2: tf-idf                                    0.238  0.074  0.117   0.169  0.040  0.076   0.284  0.065  0.082
msm 3: BM25                                      0.332  0.123  0.168   0.285  0.082  0.116   0.401  0.124  0.196
msm 4: tf-idf + BM25                             0.332  0.123  0.168   0.286  0.083  0.116   0.401  0.124  0.196
msm 5: Compubert                                 0.115  0.038  0.099   0.098  0.030  0.090   0.132  0.025  0.060

mirmu runs
mirmu 1: Base                                    0.466  0.246  0.339   0.487  0.233  0.316   0.505  0.186  0.270
mirmu 2: Trained only on text                    0.298  0.124  0.201   0.277  0.104  0.180   0.354  0.109  0.161
mirmu 3: MathBERTa ReRanker                      0.470  0.250  0.338   0.484  0.227  0.310   0.503  0.183  0.277
mirmu 4: Retriever fine-tuned                    0.466  0.246  0.339   0.487  0.233  0.316   0.478  0.167  0.247
mirmu 5: MathBERTa ReRanker + Retriever fine-tuned  0.470  0.248  0.335   0.472  0.221  0.309   0.500  0.180  0.265

Table 3
Results of other (not submitted) runs of fine-tuned systems and ensembles.

                                                     2020                    2021                    2022
System                                        nDCG′  MAP′@10  P′@10   nDCG′  MAP′@10  P′@10   nDCG′  MAP′@10  P′@10

Best systems’ runs prepared ex post
fine 1: MathBERTa ReRanker                       0.465  0.243  0.342   0.480  0.222  0.308   0.510  0.191  0.275
fine 2: BM25-based system best                   0.334  0.124  0.169   0.288  0.083  0.114   0.402  0.123  0.196
fine 3: Improved Base                            0.468  0.249  0.351   0.487  0.229  0.304   0.514  0.194  0.275
fine 4: Retriever only                           0.462  0.241  0.334   0.479  0.221  0.301   0.507  0.186  0.278

Ensembles
ens 1: rrf 60 of 4 fine systems                  0.493  0.253  0.333   0.493  0.217  0.306   0.551  0.207  0.313
ens 2: ibc of all                                0.401  0.197  0.295   0.400  0.170  0.244   0.473  0.177  0.287
ens 3: ibc of all 4 msm                          0.324  0.114  0.148   0.285  0.079  0.114   0.407  0.122  0.190
ens 4: ibc of all 5 mirmu                        0.468  0.247  0.339   0.485  0.229  0.317   0.504  0.188  0.270
ens 5: ibc of all 4 msm + mirmu 1                0.354  0.136  0.181   0.326  0.109  0.156   0.511  0.159  0.245
ens 6: ibc of msm 4 and mirmu 1                  0.459  0.200  0.258   0.326  0.109  0.156   0.543  0.197  0.292
ens 7: rrf 60 of all                             0.480  0.237  0.314   0.471  0.195  0.290   0.570  0.209  0.329
ens 8: rrf 180 of all                            0.486  0.237  0.314   0.467  0.186  0.261   0.576  0.214  0.309
ens 9: rrf 60 of all 4 msm                       0.328  0.125  0.169   0.277  0.078  0.118   0.406  0.114  0.210
ens 10: rrf 60 of all 5 mirmu                    0.465  0.244  0.323   0.473  0.215  0.294   0.521  0.191  0.280
ens 11: rrf 60 of all 4 msm and mirmu 1          0.422  0.172  0.197   0.381  0.119  0.152   0.511  0.159  0.245
ens 12: rrf 60 of msm 4 + mirmu 1                0.465  0.211  0.268   0.455  0.177  0.231   0.544  0.198  0.303
ens 13: rbc of all                               0.476  0.217  0.267   0.442  0.164  0.190   N/A    N/A    N/A
ens 14: rbc of all 4 msm                         0.312  0.115  0.116   0.274  0.074  0.107   N/A    N/A    N/A
ens 15: rbc of all mirmu                         0.468  0.247  0.339   0.423  0.165  0.211   N/A    N/A    N/A
ens 16: rbc of all 4 msm and mirmu 1             0.475  0.220  0.273   0.453  0.171  0.204   N/A    N/A    N/A
ens 17: rbc of msm 4 and mirmu 1                 0.474  0.220  0.286   0.468  0.193  0.245   N/A    N/A    N/A
ens 18: wibc of all                              0.466  0.246  0.339   0.487  0.234  0.316   N/A    N/A    N/A
ens 19: wibc of all 4 msm                        0.332  0.123  0.168   0.285  0.082  0.116   N/A    N/A    N/A
ens 20: wibc of all mirmu                        0.466  0.246  0.339   0.487  0.233  0.316   N/A    N/A    N/A
ens 21: wibc of all 4 msm and mirmu 1            0.488  0.274  0.350   0.285  0.082  0.113   N/A    N/A    N/A
ens 22: wibc of msm 4 and mirmu 1                0.466  0.246  0.339   0.487  0.233  0.316   N/A    N/A    N/A
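Figure 2 below plots how the quality of the rrf ensemble depends on the hyperparameter k. A minimal sketch of such a sweep, reusing the reciprocal_rank_fusion() helper sketched in Section 3.6.2 and assuming a caller-supplied metric function (for instance ndcg′ computed against the relevance judgments), might look as follows.

```python
# Minimal sketch of a sweep over the RRF hyperparameter k (see Figure 2 below).
# Assumes reciprocal_rank_fusion() from the earlier sketch and a metric callable
# that scores a fused run, e.g. nDCG' against the ARQMath relevance judgments.
def sweep_rrf_k(per_topic_rankings, metric, k_values=range(20, 1001, 20)):
    """per_topic_rankings: {topic_id: [ranked list from each individual system]}"""
    curve = {}
    for k in k_values:
        fused_run = {topic: reciprocal_rank_fusion(rankings, k=k)
                     for topic, rankings in per_topic_rankings.items()}
        curve[k] = metric(fused_run)          # one quality score per value of k
    best_k = max(curve, key=curve.get)
    return best_k, curve                      # the best k and the whole curve for plotting
```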
Figure 2: Performance of the rrf ensemble of all 9 submitted individual systems, depending on the hyperparameter k. The best k = 180 reported for arqmath 2022 yields nDCG′ = 0.576.

Figure 2 shows the dependence of the rrf quality on the parameter k. Instead of the k = 60 suggested in the original paper, the best performance is achieved with k = 180. We hypothesize that the more diverse the primary systems are, the higher the optimal value of k. It remains to be studied to what extent the performance depends on the participating individual systems, and how it changes with different hyperparameters, the choice of pre-trained models, and the diversity and initial setup of the individual systems.

For reproducibility, we are going to publish our notebooks, ensemble implementations, and models in our lab and course repository [6].

“You must cultivate activities that you love. You must discover work that you do, not for its utility, but for itself, whether it succeeds or not, whether you are praised for it or not, whether you are loved and rewarded for it or not, whether people know about it and are grateful to you for it or not.” (Anthony de Mello)

5. Conclusion & Future Work

We have developed nine mir systems with as diverse approaches and variants as possible. We have evaluated them on the available arqmath data from the last three years. We have studied how they can be ensembled to gain better performance. We have reported our findings: a) math-aware representations with deep models have started to outperform flat token-level systems; b) ensembling done with expertise and insight into the merits of the individual systems matters.

In the future, we plan to further enhance our neural models and ensembling algorithms by several means:

• evaluation of different ensembling strategies based on the types of the individual systems and the hyperparameter settings of the neural systems;
• evaluation of the initial settings and hyperparameters of the neural Retriever and ReRanker systems;
• adaptation of the deep systems to math specifics using the Adapt𝒪r library [26];
• study of the robustness and out-of-domain performance of mir systems.

Acknowledgments

We thank all PV211 course students and former members of the mir group for their contributions. We thank the two anonymous reviewers for their insightful comments. We extend our gratitude to the arqmath 2022 organizers for keeping the research of math information retrieval aflame. This work has been partly supported by the Ministry of Education of the Czech Republic within the LINDAT/CLARIAH-CZ project LM2018101.

References

[1] V. Novotný, Interpretable Representations for Fast and Accurate Retrieval of Mathematical Information [online], Dissertation, Masaryk University, Faculty of Informatics, Brno, 2022 [cit. 2022-05-26]. URL: https://is.muni.cz/th/o4thd/Revidovana_verze_po_obhajobe_disertace.pdf, supervisor: Petr Sojka.
[2] V. Novotný, M. Štefánik, D. Lupták, M. Geletka, P. Zelina, P. Sojka, Ensembling Ten Math Information Retrieval Systems: MIRMU and MSM at ARQMath 2021, in: Proceedings of the Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, volume 2936, CEUR-WS, Bucharest, Romania, 2021, pp. 82–106. URL: http://ceur-ws.org/Vol-2936/paper-06.pdf.
[3] A. Reusch, M. Thiele, W. Lehner, TU_DBS in the ARQMath Lab 2021, CLEF, in: Proceedings of the Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, volume 2936, CEUR-WS, Bucharest, Romania, 2021, pp. 107–124. URL: http://ceur-ws.org/Vol-2936/paper-07.pdf.
[4] S. Rohatgi, J. Wu, C. L. Giles, Ranked List Fusion and Re-ranking with Pre-trained Transformers for ARQMath Lab, in: Proceedings of the Working Notes of CLEF 2021 – Conference and Labs of the Evaluation Forum, volume 2936, CEUR-WS, Bucharest, Romania, 2021, pp. 125–132. URL: http://ceur-ws.org/Vol-2936/paper-08.pdf.
[5] V. Novotný, P. Sojka, M. Štefánik, D. Lupták, Three is Better than One, in: CEUR Workshop Proceedings: ARQMath task at CLEF conference, volume 2696, CEUR-WS, Thessaloniki, Greece, 2020, pp. 1–30. URL: http://ceur-ws.org/Vol-2696/paper_235.pdf.
[6] M. Štefánik, V. Novotný, M. Geletka, V. Kalivoda, M. Toma, D. Lupták, P. Sojka, PV211 Utils, 2022. URL: https://github.com/MIR-MU/pv211-utils/.
[7] B. Mansouri, V. Novotný, A. Agarwal, D. W. Oard, R. Zanibbi, Overview of ARQMath-3 (2022): Third CLEF lab on Answer Retrieval for Questions on Math (Working Notes Version), in: G. Faggioli, N. Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 – Conference and Labs of the Evaluation Forum, CEUR-WS, 2022.
[8] M. Hu, Y. Peng, Z. Huang, D. Li, Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension, in: Proceedings of the 57th Annual Meeting of the ACL, 2019, pp. 2285–2295. doi:10.18653/v1/P19-1221.
[9] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in NLP and the 9th International Joint Conference on NLP (EMNLP-IJCNLP), ACL, Hong Kong, China, 2019, pp. 3982–3992. doi:10.18653/v1/D19-1410.
[10] W. Wang, H. Bao, S. Huang, L. Dong, F. Wei, MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers, in: Findings of the ACL: ACL-IJCNLP 2021, ACL, 2021, pp. 2140–2151. URL: https://aclanthology.org/2021.findings-acl.188. doi:10.18653/v1/2021.findings-acl.188.
[11] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT Pretraining Approach, ArXiv abs/1907.11692 (2019). URL: https://openreview.net/forum?id=SyxS0T4tvS.
[12] V. Novotný, M. Štefánik, Combining Sparse and Dense Information Retrieval, in: Proceedings of the Working Notes of CLEF 2022, CEUR-WS, 2022. To appear.
[13] M. Henderson, R. Al-Rfou, B. Strope, Y.-H. Sung, L. Lukács, R. Guo, S. Kumar, B. Miklos, R. Kurzweil, Efficient Natural Language Response Suggestion for Smart Reply, ArXiv abs/1705.00652 (2017). doi:10.48550/ARXIV.1705.00652.
[14] R. Řehůřek, P. Sojka, Software Framework for Topic Modelling with Large Corpora, in: Proceedings of LREC 2010 workshop New Challenges for NLP Frameworks, ELRA, Valletta, Malta, 2010, pp. 45–50. doi:10.13140/2.1.2393.1847.
[15] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management 24 (1988) 513–523. doi:10.1016/0306-4573(88)90021-0.
[16] Y. Lv, C. Zhai, A Log-Logistic Model-Based Interpretation of TF Normalization of BM25, in: R. Baeza-Yates, A. P. de Vries, H. Zaragoza, B. B. Cambazoglu, V. Murdock, R. Lempel, F. Silvestri (Eds.), Advances in Information Retrieval, Springer, Berlin, Heidelberg, 2012, pp. 244–255. doi:10.1007/978-3-642-28997-2_21.
[17] A. Trotman, A. Puurula, B. Burgess, Improvements to BM25 and Language Models Examined, in: Proceedings of the 2014 Australasian Document Computing Symposium, ADCS ’14, ACM, New York, NY, USA, 2014, pp. 58–65. doi:10.1145/2682862.2682863.
[18] D. Brown, S. Jain, V. Novotný, nlp4whp, dorianbrown/rank_bm25, 2022. doi:10.5281/zenodo.6106156.
[19] A. Gulin, I. Kuralenok, D. Pavlov, Winning The Transfer Learning Track of Yahoo!’s Learning To Rank Challenge with YetiRank, in: O. Chapelle, Y. Chang, T.-Y. Liu (Eds.), Proceedings of the Learning to Rank Challenge, volume 14 of Proceedings of Machine Learning Research, PMLR, Haifa, Israel, 2011, pp. 63–76. URL: http://proceedings.mlr.press/v14/gulin11a.html.
[20] Q. Wu, C. J. C. Burges, K. M. Svore, J. Gao, Adapting boosting for information retrieval measures, Information Retrieval 13 (2010) 254–270. doi:10.1007/s10791-009-9112-1.
[21] Y. Wang, I.-C. Choi, H. Liu, Generalized Ensemble Model for Document Ranking in Information Retrieval, Computer Science and Information Systems 14 (2017) 123–151. doi:10.2298/csis160229042w.
[22] R. Nuray, F. Can, Automatic Ranking of Information Retrieval Systems Using Data Fusion, Information Processing and Management 42 (2006) 595–614. doi:10.1016/j.ipm.2005.03.023.
[23] M. Mosbah, B. Boucheham, Majority Voting Re-ranking Algorithm for Content Based-Image Retrieval, in: E. Garoufallou, R. J. Hartley, P. Gaitanou (Eds.), Metadata and Semantics Research, Springer International Publishing, Cham, 2015, pp. 121–131.
[24] A. T. Albaham, N. Salim, Quality Biased Thread Retrieval Using the Voting Model, in: Proceedings of the 18th Australasian Document Computing Symposium, ADCS ’13, ACM, New York, NY, USA, 2013, pp. 97–100. doi:10.1145/2537734.2537752.
[25] G. V. Cormack, C. L. A. Clarke, S. Buettcher, Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’09, ACM, New York, NY, USA, 2009, pp. 758–759. doi:10.1145/1571941.1572114.
[26] M. Štefánik, V. Novotný, N. Groverová, P. Sojka, Adapt𝒪r: Objective-Centric Adaptation Framework for Language Models, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, ACL, Dublin, Ireland, 2022, pp. 261–269. URL: https://aclanthology.org/2022.acl-demo.26.