Ranked List Fusion and Re-ranking with Pre-trained Transformers for ARQMath Lab

Shaurya Rohatgi¹, Jian Wu², and C. Lee Giles¹
¹ Pennsylvania State University
² Old Dominion University
szr207@psu.edu (S. Rohatgi); jwu@cs.odu.edu (J. Wu); clg20@psu.edu (C. L. Giles)

CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract
This paper elaborates on our submission to the ARQMath track at CLEF 2021. For this year's submission we use a collection of methods to retrieve and re-rank answers from Math Stack Exchange, in addition to our two-stage model, which was comparable to the best model of last year in terms of nDCG′. We also provide a detailed analysis of what the transformers are learning and why it is hard to train a math language model using transformers. This year's submission to Task-1 includes summarizing long question-answer pairs to augment and index documents, using byte-pair encoding to tokenize formulas and then re-rank them, and extracting important keywords from posts. Using an ensemble of these methods, our approach shows a 20% improvement over our ARQMath-2020 Task-1 submission.

Keywords
Math Information Retrieval, Re-ranking, Math-aware search, Math formula search

1. Introduction
The ARQMath-2 lab at CLEF [1] is an effort to improve math information retrieval and aims to produce better math-aware retrieval systems. The tasks introduced by the lab serve as a benchmark for systems competing in this space. In CLEF-2020, several systems competed in both tasks [2, 3, 4, 5]. There are two tasks in the ARQMath lab. This work presents our submission to Task-1: Answer Retrieval, where participants are given a list of topics and are expected to return a list of relevant posts from the Math Stack Exchange collection. The challenge here is that these topics are multi-modal, in the sense that they include both text and math formulas.

The Intelligent Information Systems Research Laboratory (IISRL) at Pennsylvania State University participated in the CLEF-2021 ARQMath-2 track to contribute to and further push the limits of math-aware information retrieval. In addition to our approach for the ARQMath-1 Lab [5], we add new ways to index by augmenting more data. We also extract keywords for better matching of the posts. We use a combination of approaches, which includes:

• keyword extraction
• augmenting data using post summaries
• BM25
• tf-idf
• transformer-based re-ranking
• byte-pair encoding (BPE) to rank formulas separately

We merge these ranking and indexing methods using reciprocal rank fusion [6], which boosts the rankings of posts that occur in more than one ranked list. This approach improves our Task-1 submission over our previous year's submission by 20%. While keyword extraction and data augmentation help, a major improvement comes from using BPE to index and rank formulas separately, hinting that transformers and their tokenization strategy are worth exploring for math formulas. We also use this opportunity to showcase BPE tokenization performance on another math formula retrieval task, NTCIR-12 [7].
The rest of the paper is structured as follows: Section 2 discusses previous work and other participant submissions from the ARQMath lab at CLEF 2020; Section 3 presents our current methodology and approach; Section 4 reports the results and analysis of our approach on Task-1 2020 as well as NTCIR-12, and compares our 2021 submission to those of other participants, noting where the current system can improve. Finally, we conclude that transformers might not be good at understanding standalone formulas, but they are great at making connections between formulas and related text.

2. Related Work
The previous year's systems by MIRMU, zbMATH, and MathDowsers all submitted competitive results for Task-1. All of these systems brought in unique approaches, and here we discuss some of them.

MIRMU used an ensemble approach mixing novel methods including CompuBERT, Soft Cosine Measure, and Formula2vec. While these approaches did not perform well individually, their ensemble did well on Task-1. We take inspiration from this approach and try different matching strategies in this work.

zbMATH went a more entity-based route, extracting Wikibase items from the collection and then doing manual runs using Google and Math Stack Exchange searches. They also used a mix of Elasticsearch, Doc2vec, and k-nearest neighbors to retrieve relevant formulas. Their main contribution, we think, is an inductive classification study of the ARQMath topics.

MathDowsers submitted the best-performing system for Task-1 of ARQMath Lab-1. They use a two-stage re-ranking approach where the re-ranking model is trained on metadata attributes of the posts, such as upvotes and number of comments. In their first stage, they use Tangent-L for formula retrieval, and, similar to our indexing approach, they concatenate the question and the answers before indexing; this joint post is one unit of retrieval. In our last year's submission, we showed that their system performs better on formula-dependent queries (where answer relevance depends more on the formulas), while our system performs well on text-dependent queries.

3. Methodology
In this section, we describe our multi-stage ranking approach, which fuses ranked lists from different approaches. The following subsections go through each approach in detail.

3.1. Byte Pair Encoding (BPE)
We take inspiration from earlier work on subword-unit tokenization [8]. Here we feed only the formulas to a tokenizer trainer, which identifies frequently co-occurring characters and then tokenizes a string accordingly. This helps it identify otherwise difficult tokens in the text. This way of tokenizing can be exploited for formulas, as there is no strict way of tokenizing them linearly, unlike text. While other methods use Symbol Layout Trees (SLTs) and Operator Trees (OPTs), we stick to a simpler solution of tokenizing the formula linearly. SLTs and OPTs have been shown to perform very well for formula search, but they require MathML representations, while BPE can work directly on LaTeX. Consider an example to better understand BPE for formula retrieval:

$x^{2} - \frac{1}{2}$    (1)

is tokenized as x, ^, {, 2, }, -, frac, {, 1, }, {, 2, }. Notice how frac is a single token.
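This training step can be sketched with the HuggingFace tokenizers library; the corpus file, vocabulary size, and whitespace pre-tokenizer below are illustrative assumptions rather than the exact configuration used in this work:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a BPE vocabulary on LaTeX formulas only, one formula per line.
# "formulas.txt" and vocab_size are placeholders; Section 4.1 reports a
# learned math vocabulary of roughly 250 tokens.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=300, special_tokens=["[UNK]"])
tokenizer.train(files=["formulas.txt"], trainer=trainer)

# Frequent LaTeX commands such as "frac" surface as single tokens,
# matching the tokenization shown for Equation (1).
print(tokenizer.encode(r"x^{2} - \frac{1}{2}").tokens)
```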
", "Please help me understand this concept" etc. These produce noise for the indexing and retrieval systems and need to be removed. Here are an example of summarization and what it can do - Original Text : We began learning this is math analysis a week or so ago along with verifying identities, finding the exact value of trigonometric equations, and writing trigonometric equations as algebraic equations and i’m totally lost. Summary : finding the exact value of trigonometric equations, and writing trigonometric equations as algebraic equations The summarizer neural network uses a document-level encoder which is based on BERT [9] which is correctly able to learn the semantics of the document and obtain vector representations for its sentences. These vector representations go through a decoding step and the summary of the document is produced. We run the summarizer for all question-answer pairs. This pair is then concatenated and indexed as a single unit of retrieval. We give this document the same id as the answer post, thus augmenting a post that is not there in the collection. This provides a question context for each answer in serving as an answer to newly posed topic questions. When this index is queried for relevant posts for the topics these documents are also returned in addition to the ARQMath collection. Ranked Lists Final run Fused BERT Fused A.X. XXXX BM25 Re-ranking re-ranked Rankings Rankings A.X. XXXX tf-idf keyword A.X XXXX cosine similarity augmented posts formula bpe rankings Figure 1: The architecture of the system. Ranked lists are lists of answer posts sorted by relevance returned by each approach. We fuse them using reciprocal rank fusion. 3.3. Keyword extraction We generate keyword samples by automatically extracting noun phrases from the answer and question posts in the collection using a grammar-based chunking method. This lets us highlight the role of important keywords in the posts. We concatenate these keywords and create a separate index, which is queried against each topic. This step is an additional filtering step where we make sure that the important keywords are being matched between the posts which go into the fusion step. 3.4. PSU at ARQMath Lab 2020 Task-1 The approach we submitted to ARQMath’20 is used in addition to the ones mentioned above [5]. To summarize our earlier approach, we developed a two-stage retrieval and ranking system. The first stage was a fusion of the BM25 score and cosine distance using tf-idf. Before the second stage, we learn a language model on the math stack exchange collection. This made the language model more math-aware1 . We then use this model to re-rank the top-1000 posts retrieved by the first stage. This leads to an improvement in the overall ranking of the answers retrieved as demonstrated in the previous year’s working notes. We use this model and add the aforementioned retrieval approaches. Figure 1 demonstrates our fusion and re-ranking architecture for ARQMath Lab-2. 3.5. Reciprocal Rank Fusion After we get a ranked list from all the methods detailed above we merge them using Reciprocal Rank Fusion (RRF) [6] which then ranks the documents using a naive scoring formula. For a set of posts 𝑃 to be ranked and a set of rankings 𝑅 from different scoring schemes, for each permutation on 1 · · · |𝑃 |, we compute 1 https://huggingface.co/shauryr/arqmath-roberta-base-1.5M ∑︁ 1 𝑅𝑅𝐹 𝑠𝑐𝑜𝑟𝑒(𝑝 ∈ 𝑃 ) = , (2) 𝑘 + 𝑟(𝑝) 𝑟∈𝑅 where the original work suggests 𝑘 = 60, a hyper-parameter which we keep constant. 4. 
4. Experiments
Our experiments were conducted on a 24-core machine with an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, 256GB of RAM, and 4 RTX 2080 Ti GPUs. Default configurations were adopted in Elasticsearch (ES) and Anserini (BM25). Anserini runs on a single thread, while ES uses multiple threads. The size of the ES index over question-answer pairs and summarized posts is 4.8GB, whereas the Anserini index is 3.2GB. Each query takes 2.3 seconds on average for retrieval once everything is indexed. It should be noted that we use a single shard in ES, which makes it somewhat slow.

4.1. Fine-tuning a math-aware language model and NTCIR-12
BPE is a subword unit model and an unsupervised text tokenizer and detokenizer, used mainly in neural text generation systems where the vocabulary size is fixed before the neural model is trained [8]. We use it here because math expressions are hard to tokenize: there are no clear boundaries between tokens. It should be noted that BPE is a linear tokenizer and may miss structure that is better captured by tree-based methods. BPE is very cheap to train and requires little preprocessing of the math expressions. In this case, we find that the vocabulary size for math is 250 tokens (excluding cardinal numbers). We feed the tokenizer all the expressions from the LaTeX of the ARQMath collection to train it. It learns patterns and builds a vocabulary from the co-occurrence frequency of characters in the corpus. This vocabulary is used to fine-tune the pre-trained RoBERTa weights used by PSU in ARQMath Lab-1 [5].

We build a nearest-neighbor map of the formulas in the collection and query the topics against it to get a ranked list of math expressions that are visibly similar to the formula in the topic (a retrieval sketch follows Table 2). To understand whether this method works for formulas, we benchmarked it on the NTCIR-12 dataset [7], as it is a more complete and well-studied dataset for formula retrieval. We compare it with the current state-of-the-art methods Tangent-CFT [10] and MathBERT [11] in Table 2. We also fused the rankings of Tangent-CFT and our pre-trained model to see whether our model can complement the tree-based embedding method. On further investigation, we find that BPE suffers on partial matches of formulas but is great for exact matches of expressions.

Table 1
Some examples of retrieval by BPE from the NTCIR-12 query benchmark dataset.

Rank    MathWiki-17                                                MathWiki-13
Query   x − 1 − 1/2 − 1/4 − 1/5 − 1/6 − 1/9 − ⋯ = 1                A ⊕ B = (A^c ⊖ B^s)^c
1       x − 1 − 1/2 − 1/4 − 1/5 − 1/6 − 1/9 − ⋯ = 1                A ⊕ B = (A^c ⊖ B^s)^c
2       1 − 1/2 − 1/4 + 1/3 − 1/6 − 1/8 + 1/5 − 1/10 − 1/12 + ⋯    A ∘ B = (A ⊖ B) ⊕ B
3       π = 4/1 − 4/3 + 4/5 − 4/7 + 4/9 − 4/11 + 4/13 − ⋯          A ∙ B = (A ⊕ B) ⊖ B

Table 2
NTCIR-12 results (avg. bpref@1000). H-Mean is the harmonic mean of the partial relevance (L1) and full relevance (L3) scores. RRF-CFT+BERT is the reciprocal rank fusion of our BERT implementation and Tangent-CFT.

System           Partial   Full     H-Mean
Tangent-CFT      0.7134    0.5963   0.6496
Finetuned-BERT   0.5657    0.5747   0.5701
MathBERT         0.7361    0.6135   0.6692
RRF-CFT+BERT     0.7414    0.6163   0.6730
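A minimal sketch of the nearest-neighbor formula map, using the math-aware checkpoint referenced in Section 3.4; the mean pooling over hidden states and the neighbor settings are illustrative assumptions, as the exact embedding procedure is not spelled out here:

```python
import torch
from sklearn.neighbors import NearestNeighbors
from transformers import AutoModel, AutoTokenizer

NAME = "shauryr/arqmath-roberta-base-1.5M"  # checkpoint from Section 3.4
tokenizer = AutoTokenizer.from_pretrained(NAME)
model = AutoModel.from_pretrained(NAME).eval()

def embed(formulas):
    """Mean-pooled last hidden states as formula vectors (an assumed
    pooling choice; other schemes are possible)."""
    batch = tokenizer(formulas, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # mask out padding
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

# Index the collection's formulas, then look up a topic formula.
corpus = [r"x^{2} - \frac{1}{2}", r"A \oplus B = (A^{c} \ominus B^{s})^{c}"]
knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(embed(corpus))
distances, indices = knn.kneighbors(embed([r"x^2 - \frac{1}{2}"]))
```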
4.2. Results and Observations
We show the results of our system for Task-1 of the ARQMath-1 Lab in Table 3. Here we can see that the new system, which includes data augmentation, BPE, and keyword extraction, works well and shows improvement over the previous year's approach. It should be noted that all the approaches mentioned are unsupervised, and no learning is done on the previous year's ranked lists.

Table 3
Our best runs compared to the baseline systems for ARQMath-1 Lab. The new system achieves better results than the previous year's submission by 20%.

Type        System                         nDCG′   mAP′    p′@10   Run Type
Baselines   linked_results                 0.303   0.210   0.418   Automatic
            arq2020-task1-a0-submission    0.250   0.100   0.186   Manual
            combined_tf_idf_tangents       0.248   0.047   0.073   Automatic
            tf_idf_task1_final             0.204   0.049   0.074   Automatic
            TangentS_Res                   0.158   0.033   0.051   Automatic
Our Runs    BM25+tf+tf.BERT (2020)         0.263   0.082   0.116   Automatic
            Current system                 0.317   0.116   0.165   Automatic

5. Conclusion and Future Work
We presented an automatic system that uses techniques to shorten long posts and extract important keywords for more precise retrieval. Unsupervised ranking methods such as tf-idf, BM25, and BPE with nearest-neighbor search are used here, and they demonstrate robustness. Our last year's submission, fused with these new techniques, improves our overall nDCG′ score.

In the future, we want to improve over this ranking by learning to rank. The collection has an abundance of ground-truth data in the form of accepted answers. We could train a learning-to-rank model on the metadata of the posts, such as the number of comments, the number of upvotes, and the total number of answers, and use it to re-rank the results. Identifying which of the many formulas in a post matters for relevant retrieval is hard. We want to focus our efforts on identifying the significant formulas in a post and then giving a higher score to formulas that match them, rather than treating all formulas equally. Lastly, we want to incorporate a tree-based formula retrieval technique such as Approach0 or Tangent-S into our rankings. This could boost our relevance scores, as these methods have been shown to perform better than BPE on partial relevance when matching formulas.

Acknowledgments
We would like to thank the members of the ARQMath lab at the Department of Computer Science at the Rochester Institute of Technology for organizing this track. Special thanks to Behrooz Mansouri for providing the dataset, an initial analysis of the topics, and starter code to all participants of the task; it made it easier for us to pre-process the data and jump directly to the experiments presented in this work.

References
[1] B. Mansouri, A. Agarwal, D. W. Oard, R. Zanibbi, Advancing math-aware search: The ARQMath-2 lab at CLEF 2021, in: D. Hiemstra, M.-F. Moens, J. Mothe, R. Perego, M. Potthast, F. Sebastiani (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2021, pp. 631–638.
[2] P. Scharpf, M. Schubotz, A. Greiner-Petter, M. Ostendorff, O. Teschke, B. Gipp, ARQMath lab: An incubator for semantic formula search in zbMATH Open?, arXiv preprint arXiv:2012.02413 (2020).
[3] Y. K. Ng, D. J. Fraser, B. Kassaie, G. Labahn, M. S. Marzouk, F. W. Tompa, K. Wang, Dowsing for math answers with Tangent-L, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020.
[4] V. Novotný, P. Sojka, M. Štefánik, D. Lupták, Three is better than one, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020.
[5] S. Rohatgi, J. Wu, C. L. Giles, PSU at CLEF-2020 ARQMath track: Unsupervised re-ranking using pretraining, in: CEUR Workshop Proceedings, Thessaloniki, Greece, 2020.
[6] G. V. Cormack, C. L. Clarke, S. Büttcher, Reciprocal rank fusion outperforms Condorcet and individual rank learning methods, in: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 758–759.
[7] R. Zanibbi, A. Aizawa, M. Kohlhase, I. Ounis, G. Topić, K. Davila, NTCIR-12 MathIR task overview, in: NTCIR, 2016.
[8] R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Berlin, Germany, 2016, pp. 1715–1725. URL: https://www.aclweb.org/anthology/P16-1162. doi:10.18653/v1/P16-1162.
[9] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[10] B. Mansouri, S. Rohatgi, D. W. Oard, J. Wu, C. L. Giles, R. Zanibbi, Tangent-CFT: An embedding model for mathematical formulas, in: Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval, ICTIR '19, Association for Computing Machinery, New York, NY, USA, 2019, pp. 11–18. URL: https://doi.org/10.1145/3341981.3344235. doi:10.1145/3341981.3344235.
[11] S. Peng, K. Yuan, L. Gao, Z. Tang, MathBERT: A pre-trained model for mathematical formula understanding, arXiv abs/2105.00377 (2021).