NLP-IISERB@Simpletext2022: To Explore the
Performance of BM25 and Transformer Based
Frameworks for Automatic Simplification of Scientific
Texts
Sourav Saha¹, Dwaipayan Roy², B Yuvaraj Goud³, Chethan S Reddy³ and Tanmay Basu³
¹ Computer Vision and Pattern Recognition Unit, Indian Statistical Institute Kolkata, India
² Department of Computational and Data Sciences, Indian Institute of Science Education and Research Kolkata, India
³ Department of Data Science and Engineering, Indian Institute of Science Education and Research Bhopal, India


Abstract
The CLEF SimpleText 2022 lab focuses on developing effective systems to identify relevant passages
from a given set of scientific articles. The lab has organized three tasks this year. Task 1 is focused
on passage retrieval from the given data for a query text. These passages can be complex and hence
require further simplification, which is carried out in Tasks 2 and 3. The BioNLP research group at the
Indian Institute of Science Education and Research Bhopal (IISERB), in collaboration with two
information retrieval research groups at IISER Kolkata and ISI Kolkata, participated only in Task 1 of
this challenge and submitted three runs using three different retrieval models. The paper explores the
performance of these retrieval models for the given task. We used a standard BM25 model as our first
run to identify 1000 relevant passages for each query. The passages for each query were ranked based
on their similarity scores generated by the BM25 model. For our second run, we used a BERT
(Bidirectional Encoder Representations from Transformers) based re-ranking method, called Mono-BERT,
to re-rank the 1000 passages retrieved by our first run for each query. A pre-trained
sequence-to-sequence model based re-ranking method, called MonoT5, was used as our third run to
reorder the 1000 passages ranked by the Mono-BERT model for each query. As the official results of
this task are not yet announced, we cannot explore the performance of our submissions. However, we
have manually checked the retrieved results of many queries for each run, which indicate that the
performance improved from run 1 to run 2 and further to run 3.

Keywords
information retrieval, text simplification


CLEF 2022 – Conference and Labs of the Evaluation Forum, September 05–08, 2022, Bologna, Italy
Email: souravsaha.juit@gmail.com (S. Saha); dwaipayan.roy@iiserkol.ac.in (D. Roy); byuvaraj19@iiserb.ac.in
(B. Y. Goud); chethan19@iiserb.ac.in (C. S. Reddy); welcometanmay@gmail.com (T. Basu)
URL: https://dwaipayanroy.github.io/ (D. Roy); https://sites.google.com/view/tanmaybasu/ (T. Basu)
ORCID: 0000-0001-9536-8075 (T. Basu)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org

1. Introduction
Scientific articles are often hard to understand as they require significant background knowledge
of certain areas and use tricky terminology. Thus, automatic simplification of scientific
text is a challenge, although there have been some recent efforts in this direction [1]. The
SimpleText shared task at CLEF 2022 is an initiative in this spirit [1]. The goal is to create
a concise summary of several scientific documents for a given query, providing a synopsis
of a specific topic. Classical text retrieval methods are typically designed
to identify whole documents that are relevant to a given query [2]. In these techniques, the
relevance of a document to the given query is determined by the presence of
query words and phrases in the document, regardless of their location or proximity
within the document [2, 3]. As classical information retrieval methods identify whole
documents for a query instead of a concise summary of those documents, they may not
be useful for text simplification. An alternative approach is to consider each document as a
set of passages and to compute the similarity of each passage to the query [2, 4, 5], where a
passage is a contiguous block of text in the original document. The passages retrieved in this
way from different documents can then be ranked in order of their relevance to the query.

   Classical document similarity methods such as BM25 or language models can be employed
for passage retrieval as well, where the similarity between a query and a passage
is computed based on lexical matching and the frequency of terms common to the query and
the passage, following a bag of words (BOW) approach [3, 6, 7, 8]. However, this type of BOW
approach may suffer from the traditional vocabulary mismatch problem, which arises when
the query creator and the passage writer use different vocabularies. Further, these models cannot
capture the semantic similarity between a passage and a query: passages that share
few terms with the query but contain terms with the same meaning will get a low
similarity score and hence will be placed lower in the ranked list [6].
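
The vocabulary mismatch problem can be illustrated with a minimal sketch that scores passages purely by term overlap. The tokenizer, the scoring function and the example texts below are illustrative assumptions and are not part of the submitted runs.

```python
import re

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, passage):
    """Count the distinct query terms that also occur in the passage (pure lexical matching)."""
    return len(set(tokenize(query)) & set(tokenize(passage)))

query = "car engine failure"
lexical_match = "The engine of the car showed a failure after ten hours."
paraphrase = "The automobile's motor broke down after ten hours."

print(overlap_score(query, lexical_match))  # 3
print(overlap_score(query, paraphrase))     # 0
```

Both passages describe the same event, yet the paraphrase shares no term with the query and would therefore receive no credit from a purely lexical scorer.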

  As an advancement over BOW models, embedding based models have become popular in the
past few years, and significant improvements over the BOW models have been
observed [9]. More recently, transformer based methods such as BERT (Bidirectional
Encoder Representations from Transformers) [10, 11] leverage the contextualized
representations produced by deep language models to identify semantic relevance
between a query and a passage. A
recent model developed in this spirit is the Mono-BERT model [12], which takes query-passage
pairs as the input of BERT and computes the similarity scores from the contextualized
token representations [6]. Mono-BERT and its variants [12, 11, 6] have demonstrated better
performance than BOW based models on passage ranking tasks.

   The SimpleText lab at CLEF 2022 has organized three shared tasks this year on scientific
text simplification. The first task is focused on passage retrieval from the given corpus
for a query text. These passages can be complex and hence require further simplification, which is
carried out in the second and third tasks. We, the BioNLP research group at the Indian
Institute of Science Education and Research Bhopal (IISERB), in collaboration with two
information retrieval research groups at IISER Kolkata and ISI Kolkata, participated only in
Task 1 of this challenge and submitted three runs using three different retrieval models. The
organizers released the corpus and a set of queries for Task 1.
The standard BM25 model [7] was used as our first run to identify 1000 relevant passages
for each query. The passages retrieved for each query were ranked based on their similarity
scores generated by the BM25 model. The Mono-BERT model [12] was used as the second run,
which re-ranks the 1000 passages retrieved by our first run for each query. As a third
run, we used the MonoT5 model [11], which is a pre-trained sequence-to-sequence model based
re-ranking method. This model reorders the 1000 passages ranked by the Mono-BERT model
for each query and returns the 100 best passages based on the similarity score. As the official
results of this task are not yet announced, we cannot explore or compare the performance of
our submissions. However, we have manually checked the retrieved results for various queries
for each run, and these checks indicate that run 2 improves over run 1 and that run 3,
where the MonoT5 based model was used, improves further.

  The paper is organized as follows. The runs submitted by our team for the first task are
described in Section 2. Section 3 presents the analysis of experimental results. Finally, we
conclude with the scope for further work in Section 4.


2. Proposed Frameworks
We employ a multistage ranking pipeline [11] for this task. As our first run, the BM25 model is
used to retrieve and rank the top 𝑘 documents from the collection. The Mono-BERT model is utilized to
re-rank the top 𝑘 passages returned by the BM25 model, which is submitted as our second run.
The Mono-BERT model [12] encodes the query and the documents with a BERT [13] based language
model. In the Mono-BERT model, BERT is used as a binary classification model, where the output
of the CLS token is passed to a feed-forward neural network to obtain a probability score
for each query-passage pair. Furthermore, the MonoT5 model is applied as our third run
to the ranked passages returned by the Mono-BERT model for each query. MonoT5 is based
on the architecture of T5 [14], which is a sequence-to-sequence model that uses a masked
language modeling objective similar to that of BERT to pre-train its encoder–decoder architecture [15]. As we
do not have explicit relevance information for query-passage pairs, the pre-trained versions of
both models are used here.
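
The multistage pipeline described above can be sketched as a chain of (scorer, cutoff) stages, where each stage re-ranks the output of the previous one. This is a minimal sketch under stated assumptions: the function names are hypothetical, the scorers are left as plain callables standing in for BM25, Mono-BERT and MonoT5, and the two-class softmax only illustrates how a pair of classifier logits can be turned into a relevance probability.

```python
import math
from typing import Callable, List, Sequence, Tuple

# A scorer maps (query, passage) to a relevance score; higher means more relevant.
Scorer = Callable[[str, str], float]

def relevance_probability(logit_not_relevant: float, logit_relevant: float) -> float:
    """Two-class softmax: probability mass assigned to the 'relevant' class."""
    m = max(logit_not_relevant, logit_relevant)  # subtract the max for numerical stability
    e_neg = math.exp(logit_not_relevant - m)
    e_pos = math.exp(logit_relevant - m)
    return e_pos / (e_neg + e_pos)

def rank(query: str, passages: Sequence[str], scorer: Scorer, top_k: int) -> List[str]:
    """Sort passages by descending score and keep the top_k best ones."""
    ordered = sorted(passages, key=lambda p: scorer(query, p), reverse=True)
    return ordered[:top_k]

def multistage(query: str, corpus: Sequence[str],
               stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Apply each (scorer, top_k) stage to the candidates kept by the previous stage."""
    candidates = list(corpus)
    for scorer, top_k in stages:
        candidates = rank(query, candidates, scorer, top_k)
    return candidates

# Toy stand-ins for real models (assumptions for illustration only).
def term_overlap(query: str, passage: str) -> float:
    return float(len(set(query.split()) & set(passage.split())))

def prefer_short(query: str, passage: str) -> float:
    return -float(len(passage))

result = multistage("a b", ["a b c", "a b", "a", "z"],
                    [(term_overlap, 3), (prefer_short, 2)])
print(result)  # ['a', 'a b']
```

With real models, the stage list would hold a first-stage retriever with a cutoff of 1000 followed by the two neural re-rankers, the last of which keeps the 100 best passages.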


3. Experimental Evaluation
3.1. Dataset
The organizers released a DBLP corpus as a JSON file of size 3 GB. The JSON file of
the corpus contains 4894063 entries for different publications. Each entry contains detailed
information about a publication, i.e., title, abstract, author names, year of publication, publisher
name, citations, etc. The abstracts of individual publications were considered as individual
documents. The queries were released in a CSV file, which contains 114 unique queries. The
query id and the link from which the query was collected were also given in that file. The objective
was to identify relevant passages for each query from the documents of the corpus.
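
As an illustration of how abstracts can be treated as individual documents, the following sketch extracts (id, abstract) pairs from parsed corpus entries and reads unique queries from a CSV file. The field names "id", "abstract", "query_id" and "query_text", as well as the in-memory sample data, are assumptions made for this example; the actual schema of the released files may differ.

```python
import csv
import io
import json

def abstracts_as_documents(entries):
    """Yield (doc_id, abstract) pairs, skipping entries without an abstract."""
    for entry in entries:
        abstract = (entry.get("abstract") or "").strip()
        if abstract:
            yield str(entry["id"]), abstract

def read_queries(csv_file):
    """Return unique (query_id, query_text) pairs in file order."""
    seen, queries = set(), []
    for row in csv.DictReader(csv_file):
        key = row["query_id"]
        if key not in seen:
            seen.add(key)
            queries.append((key, row["query_text"]))
    return queries

# Tiny in-memory stand-ins for the real files (hypothetical content).
sample_corpus = json.loads(
    '[{"id": 1, "title": "T1", "abstract": "Deep learning for text."},'
    ' {"id": 2, "title": "T2", "abstract": ""}]'
)
sample_queries = io.StringIO(
    "query_id,query_text,url\n"
    "q1,digital assistant,http://example.org\n"
    "q1,digital assistant,http://example.org\n"
)

docs = list(abstracts_as_documents(sample_corpus))
queries = read_queries(sample_queries)
print(docs)     # [('1', 'Deep learning for text.')]
print(queries)  # [('q1', 'digital assistant')]
```

For a file of several gigabytes, loading everything with a single json.loads call may not be practical, so a streaming parser could be a better fit.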
3.2. Experimental Setup
The collection was indexed with Apache Lucene¹. The standard analyzer² with the default
stopword list was employed. Queries were formulated by concatenating the topic and the query
text. For implementing the BM25 model in the first run, 𝑘, i.e., the number of passages retrieved
per query, was set to 1000. The BM25 parameters 𝑘1 and 𝑏, which respectively control the
term frequency (𝑡𝑓) scaling and the document length normalization, were set to 1.2 and 0.75.
All the other parameters were set to their default values.
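
To show how 𝑘1 and 𝑏 enter the score, the sketch below implements a common textbook form of BM25 over pre-tokenized documents, using the parameter values stated above. It is a simplified illustration and not the Lucene implementation used for the runs, which may differ in details; the helper formulate_query and its topic-first ordering are likewise assumptions.

```python
import math
from collections import Counter

def formulate_query(topic, query_text):
    """Concatenate topic and query text (the topic-first order is an assumption)."""
    return f"{topic} {query_text}".strip()

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenized document in `docs`."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Length normalization: b = 0 disables it, b = 1 applies it fully.
        norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Non-negative idf variant.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls how quickly the term-frequency contribution saturates.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [text.split() for text in (
    "neural ranking model",
    "neural network",
    "graph theory",
)]
query_terms = formulate_query("neural", "ranking model").split()
scores = bm25_scores(query_terms, docs)
print(scores)
```

The first document matches all three query terms and scores highest, the second matches only the frequent term "neural", and the third shares no term with the query and scores zero.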

Table 1
Performance of Different Teams for Task 1

Team            #Topics   Avg #Doc.   NDCG@5   NDCG@10   NDCG@20
CYUT                114         4.9   0.5866    0.5636    0.5536
UAMS                114        95.5   0.3531    0.3776    0.4073
UAMS-MF⋆             69         2.7   0.3494    0.3328    0.3270
NLP@IISERB 1         30        92.5   0.0605    0.0680    0.0819
NLP@IISERB 2        114         100   0.0503    0.0640    0.0815
NLP@IISERB 3        114         100   0.0467    0.0522    0.0722

⋆ Manual run.


3.3. Analysis of Results
The performance of all the teams that participated in Task 1 is reported in Table 1 in terms of
normalized discounted cumulative gain (NDCG) [16]. It can be seen from Table 1 that none of
our runs achieves a place among the top three runs of Task 1. None of our runs applied
text preprocessing techniques such as stemming or lemmatization before generating the inverted
index. This may be one of the reasons for the poor performance. We also used the default parameters of
the Mono-BERT and MonoT5 models and did not tune the relevant parameters of these
models, which may have contributed to the poor performance. In future, we will focus on addressing these
limitations of the proposed approaches.
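
For reference, NDCG at a cutoff 𝑘 can be computed as in the following minimal sketch. It assumes linear gains and a base-2 logarithmic discount, which is one common formulation; the exact variant used in the official evaluation is not specified here, and the relevance grades in the example are made up.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(position + 2) for position, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, all_gains, k):
    """NDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0

judged = [2, 1, 0]           # hypothetical relevance grades of the judged passages
perfect_run = [2, 1, 0]      # passages returned in the ideal order
reversed_run = [0, 1, 2]     # the same passages in the worst order

print(ndcg_at_k(perfect_run, judged, 3))             # 1.0
print(round(ndcg_at_k(reversed_run, judged, 3), 4))  # 0.6199
```

The discount penalizes relevant passages that appear late in the ranking, which is why the reversed run scores well below 1 even though it returns the same passages.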


4. Conclusion
The objective of the SimpleText lab at CLEF 2022 is to engage researchers in developing models
that generate simplified summaries of scientific literature for a given query. We have submitted
three runs for the first shared task of the lab. The runs comprised the classical BOW based
BM25 model and the transformer based Mono-BERT and MonoT5 models, which were intended to further
improve the performance of the BM25 model. Transformer based re-ranking methods have been widely
used for the last few years. Hence, we used these models to re-rank the passages retrieved by
the BM25 model for each given query. However, as already mentioned, we could not achieve good
performance, and this needs to be addressed in future. Moreover, we aim to
explore the performance of other transformer based methods for the given task that have been
widely used for other applications.

¹ https://lucene.apache.org/
² https://lucene.apache.org/core/8_8_1/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html


References
 [1] L. Ermakova, P. Bellot, J. Kamps, D. Nurbakova, I. Ovchinnikova, E. SanJuan, E. Mathurin,
     S. Araújo, R. Hannachi, S. Huet, et al., Automatic simplification of scientific texts:
     SimpleText lab at CLEF-2022, in: European Conference on Information Retrieval, Springer, 2022, pp.
     364–373.
 [2] M. Kaszkiel, J. Zobel, Passage retrieval revisited, in: ACM SIGIR Forum, volume 31, ACM
     New York, NY, USA, 1997, pp. 178–185.
 [3] C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval, Cambridge
     University Press, New York, 2008.
 [4] X. Liu, W. B. Croft, Passage retrieval based on language models, in: Proceedings of the
     eleventh international conference on Information and knowledge management, 2002, pp.
     375–382.
 [5] A. Mallia, O. Khattab, T. Suel, N. Tonellotto, Learning passage impacts for inverted
     indexes, in: Proceedings of the 44th International ACM SIGIR Conference on Research
     and Development in Information Retrieval, 2021, pp. 1723–1727.
 [6] S. Zhuang, G. Zuccon, TILDE: Term independent likelihood model for passage re-ranking,
     in: Proceedings of the 44th International ACM SIGIR Conference on Research and Devel-
     opment in Information Retrieval, 2021, pp. 1483–1492.
 [7] S. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond,
     Found. Trends Inf. Retr. 3 (2009) 333–389. URL: https://doi.org/10.1561/1500000019. doi:10.
     1561/1500000019.
 [8] F. Song, W. B. Croft, A general language model for information retrieval, in: Proceedings of
     the Eighth International Conference on Information and Knowledge Management, CIKM
     ’99, Association for Computing Machinery, New York, NY, USA, 1999, pp. 316–321. URL:
     https://doi.org/10.1145/319950.320022. doi:10.1145/319950.320022.
 [9] K. D. Onal, Y. Zhang, I. S. Altingovde, M. M. Rahman, P. Karagoz, A. Braylan, B. Dang,
     H. Chang, H. Kim, Q. McNamara, A. Angert, E. Banner, V. Khetan, T. McDonnell, A. T.
     Nguyen, D. Xu, B. C. Wallace, M. de Rijke, M. Lease, Neural information retrieval: at
     the end of the early years, Inf. Retr. J. 21 (2018) 111–182. URL: https://doi.org/10.1007/
     s10791-017-9321-y. doi:10.1007/s10791-017-9321-y.
[10] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, in: Proceedings of the Conference of the North
     American Chapter of the Association for Computational Linguistics: Human Language
     Technologies, Minnesota, USA, 2019, pp. 4171–4186.
[11] R. Nogueira, W. Yang, K. Cho, J. J. Lin, Multi-stage document ranking with BERT, ArXiv
     abs/1910.14424 (2019).
[12] R. Nogueira, K. Cho, Passage re-ranking with BERT, CoRR abs/1901.04085 (2019). URL:
     http://arxiv.org/abs/1901.04085.
[13] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional
     transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).
[14] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
     Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of
     Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[15] R. Nogueira, Z. Jiang, J. Lin, Document ranking with a pretrained sequence-to-sequence
     model, arXiv preprint arXiv:2003.06713 (2020).
[16] Y. Wang, L. Wang, Y. Li, D. He, T.-Y. Liu, A theoretical analysis of NDCG type ranking
     measures, in: Conference on learning theory, PMLR, 2013, pp. 25–54.