=Paper=
{{Paper
|id=Vol-2696/paper_121
|storemode=property
|title=PSU at CLEF-2020 ARQMath Track: Unsupervised Re-ranking using Pretraining
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_121.pdf
|volume=Vol-2696
|authors=Shaurya Rohatgi,Jian Wu,C. Lee Giles
|dblpUrl=https://dblp.org/rec/conf/clef/Rohatgi0G20
}}
==PSU at CLEF-2020 ARQMath Track: Unsupervised Re-ranking using Pretraining==
Shaurya Rohatgi (1), Jian Wu (3), C. Lee Giles (1)
(1) Pennsylvania State University, (3) Old Dominion University
(1) {szr207,clg20}@psu.edu, (3) jwu@cs.odu.edu

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

Abstract. This paper elaborates on our submission to the ARQMath track at CLEF 2020. Our primary run for the main Task-1: Question Answering uses a two-stage retrieval technique in which the first stage is a fusion of traditional BM25 scoring and tf-idf with cosine similarity-based retrieval, while the second stage is a finer re-ranking technique using contextualized embeddings. For the re-ranking we use a pre-trained roberta-base model (110 million parameters) to make the language model more math-aware. Our approach achieves a higher NDCG′ score than the baseline, while our MAP′ and P@10 scores are competitive, performing better than the best submission (MathDowsers) for text and text+formula dependent topics.

Keywords: Math Information Retrieval · Math Embeddings · Math-aware search · Math formula search

1 Introduction

The Intelligent Information Systems Research Laboratory at Pennsylvania State University participated in the CLEF-2020 ARQMath track to contribute to and develop new techniques for math-aware information retrieval. One of the main goals [9] of this track was to push mathematical question answering into an informal language scenario, specifically using data from Math Stack Exchange (MSE). We participated in the Task-1: Answer Retrieval track, the goal of which was to return ranked lists of past answers given an actual recent post containing math formulas and text.

Question answering has elements of both relevance matching and semantic matching but remains a different task from document retrieval [16]. We use a two-phase retrieval and re-ranking approach. In the first phase, retrieval is based on two runs: a tf-idf representation using BM25 scoring and cosine similarity. Once we obtain the top 1000 candidates from the first phase, we re-rank them using contextualized BERT-based embeddings from a math-aware language model trained on the MSE dataset. This language model makes these embeddings semantically rich. Finally, we fuse the results from the BM25, cosine, and BERT-based re-ranking runs. Our contributions are:

– Achieve a competitive score that beats the best baseline in terms of the NDCG′ score.
– Achieve better results than the best submission for text and text+math dependent posts.
– Propose a Masked Language Model (https://huggingface.co/shauryr/arqmath-roberta-base-1.5M) used for re-ranking candidate answers containing both text and math formulas.

Our paper has the following sections. Section 2 discusses our two-stage retrieval approach, which includes the indexing and re-ranking phases. Section 3 describes our experimental setup, system configuration, and a detailed comparison of our results with the best submission. Conclusions are in Section 4.

2 Our Approach

Here we describe our two-stage cascade candidate retrieval and ranking approach. tf-idf with cosine similarity and an off-the-shelf BM25-based search platform [15] are used for the first stage. This stage is computationally cheaper than the second, more expensive re-ranking stage. For the second stage, we use a pre-trained language model to obtain semantically rich embeddings for the candidates obtained from the first stage.
We then use these embeddings to rank the candidates by their cosine distance to the topic/query embedding. This ensures that the ranking captures the semantics shared between the query and the documents. We rely only on the content of the posts, which consists of raw text and math formulas, and do not use external information such as votes/scores, user ids, user scores, post tags, or linked duplicate posts.

2.1 First-Phase: Retrieval

The aim of this phase is to get a sufficient number of relevant candidates to be re-ranked in the next stage. Past work has used BM25 scoring for this, but we add cosine similarity ranking as well.

Indexing: We convert the MSE dataset from XML to JSON format so that it can be ingested by the search platforms we use. Because indexing only the answers would potentially lose information in the questions, we generate a document by concatenating each question (Q) post's text (title and body) with the text (body) of its corresponding answer (A) posts. The question body and title concatenated with an answer body thus become a document, one unit of retrieval. This allows us to remove questions without corresponding answers and to represent the relevant posts as one large document, because the relevant information the user is seeking could be in either the answer or the question post. This results in a total of 1,435,643 Q+A posts to be indexed.

We index the data using two off-the-shelf libraries, Elasticsearch (ES) and Anserini. We use two different libraries because each has its strengths and weaknesses. Anserini seems to perform better than ES in terms of relevance ranking and recall, which we observed when searching and evaluating the three training queries for the Question Answering Task provided by the organizers. Elasticsearch has a more scalable implementation of tf-idf with cosine similarity, which is slow on a large dataset when using Anserini. For the formulas, we use the raw LaTeX strings. We later re-rank answers based on contextualized text and formula embeddings using RoBERTa.

Retrieval: Once all posts have been indexed in Elasticsearch and Anserini, we query the topics/queries for Task-1 to get the top-1000 candidates. For each task, each index is queried independently and the results are later fused using Reciprocal Rank Fusion (RRF) [2], which ranks the documents using a naive scoring formula. Given a set of posts P to be ranked and a set of rankings R from the different scoring schemes, each a permutation on 1 · · · |P|, we compute

\[ \mathrm{RRFscore}(p \in P) = \sum_{r \in R} \frac{1}{k + r(p)}, \qquad (1) \]

where the original work [2] suggests k = 60, a hyper-parameter which we keep constant. The fusion of results from the two different search platforms significantly increases the performance numbers, as demonstrated in Section 3.
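As a concrete illustration of this fusion step, below is a minimal sketch of Equation (1) with the default k = 60. The function name and the toy run lists are illustrative and not part of the actual ARQMath pipeline.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists with Reciprocal Rank Fusion (Eq. 1).

    rankings: list of ranked lists, each an ordered list of post ids
              (rank 1 first), e.g. one list from BM25 and one from tf-idf.
    k:        smoothing constant; the original RRF paper suggests 60.
    Returns the fused ranking as (post_id, score) pairs, best first.
    """
    scores = defaultdict(float)
    for ranked_list in rankings:
        for rank, post_id in enumerate(ranked_list, start=1):
            scores[post_id] += 1.0 / (k + rank)  # 1 / (k + r(p))
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Toy example: top results from two hypothetical runs (BM25 and tf-idf cosine).
bm25_run = ["A12", "A7", "A3"]
tfidf_run = ["A7", "A12", "A9"]
print(rrf_fuse([bm25_run, tfidf_run]))
```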
2.2 Second-Phase: Re-Ranking

Here we describe how we pretrain our language model and re-rank the candidates obtained in the previous phase.

BERT [5] is a self-supervised approach for fine-tuning a deep transformer encoder [14]. Given a sequence, BERT learns a contextualized vector representation for each token. The input representations are fed into a stack of multi-layer bidirectional transformer blocks, which use self-attention to compute semantically rich text representations by considering the whole input sequence. BERT-based systems [4][17][11][10] have shown significant performance in the recent TREC-2019 deep learning track [3] with the Microsoft MARCO dataset [1]. Participants in these tasks used query-document pairs from the training data to train transformer-based models to predict relevance. However, such models rely on a massive amount of data, which is not available in our task. Therefore, instead of training for relevance, we leverage an unsupervised model by calculating the semantic similarities of queries and documents and later using them for re-ranking.

We choose the RoBERTa model [8] over BERT because, in our experiments, vanilla RoBERTa achieves better NDCG′ scores than vanilla BERT for the three preliminary training posts provided by the task organizers. RoBERTa also converges in fewer training steps than BERT, as it gets rid of the computationally expensive next-sentence prediction task, and it attains better performance on various NLP tasks than BERT [7]. We start with the initial weights of the roberta-base model and further pretrain the model using MSE data. The language model is trained for a mask prediction task. Once training is done, we use the Masked Language Model (MLM) to get contextualized token embeddings, averaged over the sequence length, to represent each candidate from Phase-1 as a 768-dimensional vector. Similarly, we obtain the topic/query embeddings and rank the candidates using their cosine distance.

3 Experiments and Results

Our experiments were conducted on a 24-core machine with an Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz, 256GB of RAM, and 4 RTX 2080 Ti GPUs. Default configurations were adopted in ES (cosine similarity; tf-idf) and Anserini (BM25). Anserini runs on a single thread while ES uses a multi-threaded function. In Figure 1 we compare the runtimes of BM25 ranking using Anserini and tf-idf-based cosine similarity ranking using ES. The size of the ES Q+A index is 3.6GB, whereas for Anserini the size is 2.2GB.

Fig. 1 (runtime in seconds, ES vs. Anserini): Comparison of run-times for all the topics, averaged over 10 runs. The fastest topics to retrieve the top-1000 for ES and Anserini were A.39 and A.88 respectively, while the slowest topics were A.87 and A.47 respectively.

Pretraining RoBERTa: We use the transformers library (https://github.com/huggingface/transformers) implementation of RoBERTa to train the MLM. We start with the original weights released by the authors for roberta-base and then further pretrain the model on the MSE dataset. Fortunately, the base vocabulary used in the original model was able to cover the whole MSE dataset, so we did not need to train from scratch. Further, we reduce the batch size to 4 per GPU and increase the step size to facilitate gradient accumulation. We kept the maximum sequence length at 512 to accommodate longer posts. Q+A posts are usually longer than 512 tokens, so we had to break the posts into chunks before feeding them to our system.

Once we are done with pretraining our MLM, we use it to extract the embeddings for each token in the candidate posts from the first phase. For longer Q+A posts, we had to find a way of truncating them so that the sequence fits in the 512-token window. We experimented with the head-tail approach [13], but it gave a lower NDCG′ score for the three preliminary training topics. Therefore, we use the head approach, in which we keep the first 510 tokens of the Q+A post and leave two token positions for [CLS] and [SEP]. This gives better results for the training topics. This is likely because our Q+A posts include the question text, so if a question similar to a topic can be found, then the answer to that similar question has a higher chance of being an answer to the topic.
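To make the re-ranking step concrete, the sketch below mean-pools the token embeddings of the pretrained MLM over the (head-truncated) input and orders candidates by cosine distance to the topic embedding. This is a minimal illustration rather than our exact pipeline: it assumes the Hugging Face transformers library and the released checkpoint shauryr/arqmath-roberta-base-1.5M, and the helper names and the use of SciPy for the cosine distance are our own choices.

```python
import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "shauryr/arqmath-roberta-base-1.5M"  # pretrained math-aware MLM (see link above)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()


def embed(text):
    """Return a 768-d vector: token embeddings averaged over the sequence.

    Truncation keeps only the head of a long Q+A post so that it fits the
    512-token window (special tokens included), mirroring the "head" approach.
    """
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()


def rerank(topic, candidates):
    """Order first-phase candidate posts by cosine distance to the topic, closest first."""
    topic_vec = embed(topic)
    scored = [(post, cosine(topic_vec, embed(post))) for post in candidates]
    return sorted(scored, key=lambda pair: pair[1])
```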
Table 1: Results for Task-1 compared with other submissions and the best baselines. tf denotes the tf-idf representation with cosine-similarity-based ranking, and tf.BERT denotes re-ranking the candidates selected by tf using RoBERTa embeddings. Reciprocal rank fusion is used to fuse the runs separated by a + sign. (* marks our best run.)

Group                      Run Tag                      NDCG′   MAP′    P@10   Rel Ret
Baselines                  Linked MSE Posts             0.303   0.210   0.417  395
                           Approach-0                   0.250   0.100   0.062  441
Official Runs              PSU1 (tf+tf.BERT)            0.263   0.082   0.116  761
                           PSU2 (tf)                    0.228   0.054   0.055  761
                           PSU3 (tf.BERT)               0.221   0.046   0.026  761
Other Comparable Systems   MIRMU                        0.238   0.064   0.139  708
                           MathDowsers (Task-1 Best)    0.345   0.139   0.161  804
Unsubmitted Runs           BM25+tf+tf.BERT*             0.314   0.097   0.149  902
                           BM25+tf                      0.304   0.098   0.151  875
                           BM25                         0.246   0.078   0.139  660

We show in Table 1 how our system compares with other submissions and the baselines. We include only the best submissions for the baselines and the other systems. Our system BM25+tf+tf.BERT clearly beats the best baseline in terms of NDCG′. Before the challenge, we had only three training topics provided by the organizers on which we could test our approach. For these three topics tf-idf with cosine similarity performed better than BM25 scoring, so we did not include BM25 in our final submission. Later we added BM25 scoring, which showed a considerable increase in the number of relevant retrieved documents. It can be seen that tf.BERT, which selects candidates using tf-idf similarity and re-ranks them using our pretrained RoBERTa model, does not achieve the best performance. But when the rankings of tf.BERT and tf-idf are fused, the NDCG′ [12] score is substantially boosted. This is because the BERT ranking relies solely on semantic similarity, whereas tf-idf relies only on word frequency. Fusing these two runs seems reasonable since posts that have a higher rank in both runs are boosted even higher in the final ranking list. Note that BM25+tf+tf.BERT achieves the maximum number of relevant retrieved posts. It is important to note that the tf runs are not fully consistent, since ES shards the index and round-robins searches across different shards; we therefore did multiple runs to achieve the above scores.

Comparison with other submissions: Our best unsubmitted run, BM25+tf+tf.BERT, was compared with the other submissions, among which MathDowsers is the system that achieved one of the best results in Task 1. The submissions by Approach0 and MathDowsers leveraged Tangent-S [18] and Tangent-L [6], which are Symbol Layout Tree and Operator Tree-based systems that give more attention to the math formulas in the text. Our post-hoc system does not use a separate representation for formulas and text, and therefore seems to suffer on formula-dependent topics. We believe that representing math content as trees, separately from the text content, could have benefited our scores.
Fig. 2: Comparison of runs for the 77 topics in Task-1 based on the dependence class of the topics: (a) the number of topics for which each system had a higher NDCG′ than the other system, and (b) the average NDCG′ over all topics of each dependence type (Text, Formula, Text + Formula). For the topics that depend on Text and Text+Formula, our system performs better.

In Figure 2 we see that our system is better for the topics that depend on text and text+formula. For the text-dependent topics, MathDowsers had 4 topics for which their NDCG′ score was higher than our best run, while our system performed better on 8 topics. For the 31 text+formula-dependent topics, our system had a better NDCG′ score on 17 topics. MathDowsers achieves a higher (≈ 96.4% higher) NDCG′ score for the topics which are only formula dependent. In contrast, our system is better (≈ 22.42%) at ranking posts containing both formulas and text. We attribute this to the contextualized embeddings which our pretrained MLM produces: it models equations together with their surrounding text and hence performs better on text+formula topics. The difference (≈ 77.2%) is even larger when we compare text-dependent topics.

4 Conclusion

Overall, our participation in the ARQMath Track helped our understanding of how to improve multi-modal search (text+formula) by exploiting state-of-the-art text embedding and information retrieval models. In terms of effectiveness, our most effective post-hoc run, BM25+tf+tf.BERT, was able to outperform the baselines and all submissions except MathDowsers in terms of NDCG′. Our post-hoc system achieved the best results for queries that depend on text and text+formula. Future work would include using the MSE dataset to obtain question-answer post pairs to train a better ranking model. It would also be helpful to investigate how important a formula in a question post is for retrieving relevant answers.

Acknowledgements

We would like to thank the members of the ARQMath lab at the Department of Computer Science at the Rochester Institute of Technology for organizing this track. Special thanks to Behrooz Mansouri for providing the dataset, an initial analysis of the topics, and starter code to all the participants of the task; it made it easier for us to pre-process the data and move directly to the experiments presented in this work.

References

1. Bajaj, P., Campos, D., Craswell, N., Deng, L., Gao, J., Liu, X., Majumder, R., McNamara, A., Mitra, B., Nguyen, T., et al.: MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268 (2016)
2. Cormack, G.V., Clarke, C.L., Buettcher, S.: Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 758-759 (2009)
3. Craswell, N., Mitra, B., Yilmaz, E., Campos, D., Voorhees, E.M.: Overview of the TREC 2019 deep learning track. arXiv preprint arXiv:2003.07820 (2020)
4. Dai, Z., Callan, J.: Deeper text understanding for IR with contextual neural language modeling. In: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 985-988 (2019)
5. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
6. Fraser, D., Kane, A., Tompa, F.W.: Choosing math features for BM25 ranking with Tangent-L. In: Proceedings of the ACM Symposium on Document Engineering 2018. pp. 1-10 (2018)
7. Gururangan, S., Marasović, A., Swayamdipta, S., Lo, K., Beltagy, I., Downey, D., Smith, N.A.: Don't stop pretraining: Adapt language models to domains and tasks. arXiv preprint arXiv:2004.10964 (2020)
8. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
9. Mansouri, B., Agarwal, A., Oard, D., Zanibbi, R.: Finding old answers to new math questions: The ARQMath lab at CLEF 2020. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval. pp. 564-571. Springer International Publishing, Cham (2020)
10. Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
11. Nogueira, R., Yang, W., Cho, K., Lin, J.: Multi-stage document ranking with BERT. arXiv preprint arXiv:1910.14424 (2019)
12. Sakai, T., Kando, N.: On information retrieval metrics designed for evaluation with incomplete relevance assessments. Information Retrieval 11(5), 447-470 (2008)
13. Sun, C., Qiu, X., Xu, Y., Huang, X.: How to fine-tune BERT for text classification? In: China National Conference on Chinese Computational Linguistics. pp. 194-206. Springer (2019)
14. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998-6008 (2017)
15. Yang, P., Fang, H., Lin, J.: Anserini: Enabling the use of Lucene for information retrieval research. In: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 1253-1256 (2017)
16. Yang, W., Zhang, H., Lin, J.: Simple applications of BERT for ad hoc document retrieval. arXiv preprint arXiv:1903.10972 (2019)
17. Yilmaz, Z.A., Wang, S., Yang, W., Zhang, H., Lin, J.: Applying BERT to document retrieval with Birch. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations. pp. 19-24 (2019)
18. Zhong, W., Rohatgi, S., Wu, J., Giles, C.L., Zanibbi, R.: Accelerating substructure similarity search for formula retrieval. In: Jose, J.M., Yilmaz, E., Magalhães, J., Castells, P., Ferro, N., Silva, M.J., Martins, F. (eds.) Advances in Information Retrieval. pp. 714-727. Springer International Publishing, Cham (2020)