<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum (CLEF)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Elsevier at SimpleText: Passage Retrieval by Fine-tuning GPL on Scientific Documents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Artemis Capari</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hosein Azarbonyad</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Tsatsaronis</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zubair Afzal</string-name>
        </contrib>
        <aff>Elsevier, Amsterdam</aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>CLEF SimpleText Lab is centered around finding relevant passages from a large collection of scientific documents in response to a lay query, detecting and explaining difficult terminology within those passages, and finally simplifying the passages. The first task is similar to ad-hoc retrieval: given a topic/query, the goal is to retrieve relevant passages, but in addition to relevance, ranking models should assess documents based on their readability/complexity as well. This paper describes our approach to building a ranking model for the first task. We first evaluate the performance of several models on a proprietary test collection constructed from scientific documents across multiple science domains. Then, we fine-tune the best-performing model on a large collection of unlabelled documents using the Generative Pseudo Labeling (GPL) approach. Our key finding is that a bi-encoder model trained on the MS-Marco dataset and further fine-tuned on a large collection of unlabelled scientific passages achieves the highest performance on the proprietary dataset, which is specifically designed for the scientific passage retrieval task. Finally, fine-tuning a model in the same fashion, but using only the Computer Science queries from the test collection, proved successful for SimpleText Task 1.</p>
      </abstract>
      <kwd-group>
        <kwd>Information Retrieval</kwd>
        <kwd>Scientific Documents</kwd>
        <kwd>Domain Adaptation</kwd>
        <kwd>Scholarly Document Processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Scientists and researchers employ specialized language and concepts to effectively communicate
information. Consequently, there is a substantial and growing volume of scientific concepts
and information within any given scientific field, which contributes to the challenges
scientists face in keeping pace with the expanding scope of technical concepts and novel content.
Understanding scientific documents is even more challenging for the general public. It has
been shown that the readability of scientific documents is decreasing over time [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This poses
challenges and opportunities for both researchers and publishers to think about ways to
increase the readability of complex scientific documents for a public audience.
      </p>
      <p>
        SimpleText Lab [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is specifically focused on addressing these challenges. The aim of
this lab is to first find passages relevant to users’ queries, spot and explain difficult terminology
within the relevant passages, and finally simplify each passage by re-writing it in a more readable
way. The first task in this lab’s series of tasks is a passage retrieval
task, namely “What is in (out)”, where the goal is, given a query/topic, to retrieve all passages
relevant to the query/topic that can be used to create a simplified summary around the topic.
In addition to relevance, ranking models should also consider the complexity of passages
when ranking them and prioritize less complex passages.
      </p>
      <p>
        The state-of-the-art ranking models are semantic matching models using either a
cross-encoder or bi-encoder architecture (or a combination of the two) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These models are trained on
publicly available datasets such as MS-Marco [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which do not contain scientific documents.
The retrieval task of the SimpleText lab and its underlying training/evaluation sets are
centered around scientific documents. Therefore, existing ranking models might not perform
very well in this setting, as the language of scientific documents is usually more complex and
such documents often contain highly specialized scientific terminology.
      </p>
      <p>In this paper we build our model on top of existing state-of-the-art ranking models.
To address the domain difference challenge, we use a domain adaptation technique, namely
Generative Pseudo Labeling (GPL), to fine-tune the pre-trained models on a set of unlabelled
scientific documents. To evaluate and fine-tune ranking models, we build a proprietary
test collection containing 5000 query-document pairs annotated with relevance labels. Our results
on this dataset show that a bi-encoder model fine-tuned on a large collection of unlabelled
scientific documents achieves stronger performance than its zero-shot counterpart. We
use this model to re-rank documents ranked by the Elasticsearch system. Our results show
that some of the fine-tuned models achieve better performance than the zero-shot models
on the SimpleText dataset as well. In the remainder of the paper, we briefly review related
work in Section 2, describe the technical details of the designed system in Section 3,
empirically evaluate the models in Sections 4 and 5, and conclude in Section 6 by outlining
some limitations of the current technical solution and providing pointers to future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Dense retrieval models are a type of information retrieval (IR) model that uses fixed-length
dense vector representations for both queries and documents, allowing for efficient
and accurate retrieval of relevant information from a large corpus of text by computing a
similarity score between query and document vectors. These models have been shown to
outperform traditional sparse retrieval models, such as BM25 [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], in a variety of tasks, including
open-domain question answering and document ranking.
      </p>
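      <p>To make this concrete, the following Python sketch scores a toy corpus with a bi-encoder
from the sentence-transformers library; the model name, query, and passages are illustrative
placeholders rather than our experimental setup.</p>
      <preformat>
# Minimal dense-retrieval sketch: encode query and passages into fixed-length
# vectors and rank passages by cosine similarity (toy data, illustrative model).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v4")

query = "domain adaptation for dense retrieval"
passages = [
    "Generative pseudo labeling adapts dense retrievers to new domains.",
    "BM25 is a classic sparse lexical ranking function.",
]

# Encode query and passages into dense vectors.
q_emb = model.encode(query, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)

# Rank passages by the cosine similarity of their vectors to the query vector.
scores = util.cos_sim(q_emb, p_emb)[0].tolist()
for passage, score in sorted(zip(passages, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")
      </preformat>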
      <p>Two popular types of such dense retrieval models are bi-encoders and cross-encoders. Both
have the same objective, i.e. capturing the semantic meaning of queries and
documents in dense vector representations, but they differ in the architecture of the neural network
used to learn those representations.</p>
      <p>
        Bi-encoders use two separate encoders to independently encode the query and the document
into dense vectors, which are then compared using a similarity function to produce a relevance
score. One of the most popular bi-encoders is the Dense Passage Retrieval (DPR) model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. DPR
uses a two-stage retrieval process, in which a large set of candidate passages is first retrieved using sparse
techniques; a dense vector representation of each candidate passage is then computed
using a pre-trained language model such as BERT [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. The query is represented by a similar
dense vector. The passages are then ranked by the cosine similarity
between the query and passage vectors.
      </p>
      <p>
        Cross-encoders, however, use a single encoder to encode the query and document jointly into a
shared representation. Documents are then ranked based on a relevance score computed directly
from this joint representation. Cross-encoders can capture more complex interactions between
query and document. However, they are computationally more expensive, as they require a
forward pass for each query-document pair, while bi-encoders encode queries and documents
separately and can therefore reuse a single set of document embeddings for all queries [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Therefore, they are often only used as re-rankers
[
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref9">9, 10, 11, 12, 13, 14</xref>
        ].
      </p>
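      <p>As a sketch of this re-ranking pattern, the snippet below scores each query-passage pair
jointly with a publicly available cross-encoder; the model, query, and candidate passages are
illustrative placeholders.</p>
      <preformat>
# Cross-encoder re-ranking sketch: the model runs once per (query, passage)
# pair, which is accurate but costly, so it is applied only to a candidate set.
from sentence_transformers.cross_encoder import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "how do dense retrievers work"
candidates = [
    "Dense retrievers embed queries and documents into one vector space.",
    "The 2008 financial crisis reshaped banking regulation.",
]

# Score every pair jointly and sort candidates by descending relevance score.
scores = reranker.predict([(query, c) for c in candidates])
for passage, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.2f}  {passage}")
      </preformat>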
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>To train and fine-tune our models, we first build a test collection using a set of scientific
documents. Then, we fine-tune existing ranking models using this dataset as well as a large
collection of scientific documents to make these models more suitable for retrieving scientific
passages.</p>
      <sec id="sec-3-1">
        <title>3.1. Test Collection</title>
        <p>
          To build a test collection, we select 100 queries spread across 20 different scientific
domains (Genetics and Molecular Biology, Computer Science, Economics, Agricultural and Biological
Sciences, Biochemistry, Econometrics and Finance, Toxicology and Pharmaceutical Science, Chemical
Engineering, Veterinary Science and Veterinary Medicine, Chemistry, Materials Science, Earth and
Planetary Sciences, Engineering, Food Science, Immunology and Microbiology, Mathematics, Nursing
and Health Professions, Medicine and Dentistry, Neuroscience, Pharmacology, Psychology, Physics
and Astronomy, Social Science). We select each query to be a known scientific concept for which
we can collect credible and relevant documents/passages. Once the queries are selected, we use
the well-known pooling mechanism to retrieve candidate documents to be annotated per query. We
select five different models (two lexical matching, two bi-encoders, and one cross-encoder) to
build the pool. These models are selected based on their performance on a small set and to ensure
the diversity of models (and hence the diversity of documents within the pool). We select 50
documents per query using the pooling approach. These documents are then labeled by domain
experts as “relevant”, “partially relevant”, or “non-relevant”. We use this dataset to evaluate
the performance of different ranking models.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. GPL</title>
        <p>
          Generative Pseudo Labeling (GPL) is an unsupervised domain adaptation method first introduced
in [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. The framework leverages a pre-trained generative model to
generate pseudo labels for target-domain data, which are then used to train a retrieval model
in a supervised manner. GPL outperforms existing unsupervised domain adaptation methods on
several benchmark datasets and achieves state-of-the-art performance in unsupervised domain
adaptation of dense retrieval. Considering that we intend to use and experiment with dense-retrieval
models, and that the importance of large amounts of data has often been highlighted in previous
work on dense retrieval methods [
          <xref ref-type="bibr" rid="ref3 ref6 ref7">7, 6, 3</xref>
          ], our manually annotated dataset might not suffice,
as it only consists of 5000 snippets for a set of 100 queries. However, many more
snippets and possible queries can be extracted from a large collection of unlabeled scientific
documents (research articles), which can be labeled for relevance by GPL in order to
fine-tune and adapt the existing ranking models to the scientific document retrieval task.
        </p>
        <p>We adapt GPL to our use-case by first removing the query generation part. Instead, we
select a set of known scientific concepts per domain and then, per concept, find all passages
mentioning the concept.</p>
        <p>An exact mention of a scientific concept in a document can be a strong indicator of the
document’s relevance to that concept. Therefore, per concept, each document mentioning it
is regarded as positive, and a bi-encoder is used to find negative documents per query.</p>
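        <p>The sketch below illustrates this mining step under toy assumptions: exact mentions of a
concept are taken as positives, and a bi-encoder supplies semantically close non-mentions as hard
negatives. The corpus, concepts, and selection logic are our illustration, not the exact
production code.</p>
        <preformat>
# Positive/negative mining sketch for the adapted GPL pipeline (toy data).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("msmarco-distilbert-base-v4")

corpus = [
    "Gradient boosting builds an ensemble of weak decision trees.",
    "Random forests average many decorrelated decision trees.",
    "CRISPR enables targeted genome editing.",
]
concepts = ["gradient boosting"]

corpus_emb = model.encode(corpus, convert_to_tensor=True)

triplets = []
for concept in concepts:
    # Passages that literally mention the concept are treated as positives.
    positives = [p for p in corpus if concept.lower() in p.lower()]
    # The bi-encoder ranks the corpus; close non-mentions become hard negatives.
    hits = util.semantic_search(model.encode(concept, convert_to_tensor=True),
                                corpus_emb, top_k=len(corpus))[0]
    negatives = [corpus[h["corpus_id"]] for h in hits
                 if concept.lower() not in corpus[h["corpus_id"]].lower()]
    for pos in positives:
        triplets.append((concept, pos, negatives[0]))
        </preformat>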
        <p>
          The GPL framework uses a cross-encoder as a teacher model on the collected positive and
negative documents to fine-tune the underlying bi-encoder model, thereby adapting
the bi-encoder to our scientific document ranking setting. For our use-case, we have
fine-tuned two different bi-encoders, msmarco-distilbert-base-v4 [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] (MS-DB-v4) and
msmarco-distilbert-base-tas-b [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] (MS-DB-tas-b), using our whole test collection, spanning 20 different
scientific domains with 5 queries each. We found that msmarco-distilbert-base-tas-b
was most suitable for tasks that require understanding of a wide range of domains.
        </p>
        <p>However, as the SimpleText task aims at finding references in Computer Science, we have
also fine-tuned the aforementioned models on queries and articles from just the Computer
Science and Mathematics domains. Naturally, these models were fine-tuned on far less data
(see Table 1).</p>
        <p>Each of the models was fitted on pseudo labels created with ms-marco-MiniLM-L-6-v2, using
the Adam optimiser [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] with a learning rate of 2e-5 and 1000 warm-up steps.</p>
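        <p>The following sketch illustrates this distillation step with the sentence-transformers
library on a toy triplet. The teacher model, learning rate, and warm-up steps follow the setup
above; the student model and data are placeholders, and the library’s default AdamW optimizer
stands in for Adam.</p>
        <preformat>
# GPL-style distillation sketch: a cross-encoder teacher scores (query, pos)
# and (query, neg) pairs, and the bi-encoder student is trained with MarginMSE
# on the score margins (pseudo labels).
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses
from sentence_transformers.cross_encoder import CrossEncoder

teacher = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
student = SentenceTransformer("msmarco-distilbert-base-tas-b")

triplets = [("graph neural networks",
             "Graph neural networks operate on graph-structured data.",
             "Enzymes catalyse biochemical reactions.")]

examples = []
for query, pos, neg in triplets:
    # Pseudo label: the teacher's score margin between positive and negative.
    margin = float(teacher.predict([(query, pos)])[0]
                   - teacher.predict([(query, neg)])[0])
    examples.append(InputExample(texts=[query, pos, neg], label=margin))

loader = DataLoader(examples, batch_size=32, shuffle=True)
loss = losses.MarginMSELoss(student)
student.fit(train_objectives=[(loader, loss)],
            epochs=1, warmup_steps=1000,
            optimizer_params={"lr": 2e-5})
        </preformat>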
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <p>We applied our models in several settings before selecting the final 10 submitted runs.
Different variations of the best-performing models (on the proprietary test collection) were
selected to make the final submissions. As shown in Table 2, the rankings for runs
1-7 were retrieved by taking the top-k documents found for each of the 29 queries from
Simpletext_2023_task1_train.qrels by the Elasticsearch API. These were then
re-ranked using our fine-tuned models. The rankings for the first 4 runs were obtained with
the model that was fine-tuned only on Computer Science and Mathematics data, while we
used the model fine-tuned on all ScienceDirect domains for runs 5-7. For run 8, the top-500
documents were retrieved by searching for “query, topic”, and then re-ranked using our CS
fine-tuned model, again using “query, topic” as the query input. For run 9, we used the model
that performed best on our own test collection to search the entire corpus for each query, rather
than pre-filtering with Elasticsearch. Finally, for run 10, we used our best CS-trained model
once again, but searched per topic instead of per query.</p>
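      <p>As a sketch, the retrieve-then-re-rank pipeline used for these runs can be expressed as
follows; the Elasticsearch host, index name, field names, and the stand-in bi-encoder are
hypothetical placeholders for our setup.</p>
      <preformat>
# Run pipeline sketch: fetch top-k candidates from Elasticsearch, then re-rank
# them with a (fine-tuned) bi-encoder. Index/field names are placeholders.
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer, util

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("msmarco-distilbert-base-v4")  # stand-in model

def retrieve_and_rerank(query, k=1000):
    # Stage 1: lexical candidate retrieval with Elasticsearch.
    resp = es.search(index="simpletext_passages",
                     query={"match": {"text": query}}, size=k)
    passages = [hit["_source"]["text"] for hit in resp["hits"]["hits"]]
    # Stage 2: semantic re-ranking with the bi-encoder.
    q_emb = model.encode(query, convert_to_tensor=True)
    p_emb = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0].tolist()
    return sorted(zip(passages, scores), key=lambda x: -x[1])
      </preformat>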
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>
        We selected our runs based on our own evaluation, which uses the qrels provided to us.
However, to our knowledge, these qrels are biased towards passages retrieved by Elasticsearch,
which is a lexical search method. Naturally, the recall of our semantic search models may
therefore be limited. As the test qrels used for the official evaluation are based
on pooling the submissions of 2023 participants [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], these qrels include passages from various
types of neural rankers as well as lexical matching models. Hence, the results from our own
evaluation differ from the official results. Nonetheless, they are included as they still provide
insight into our training process and the decisions behind selecting certain runs.
      </p>
      <sec id="sec-5-1">
        <title>5.1. Selecting Best Runs</title>
        <p>In this section, we describe the results of fine-tuning different ranking models on a large
collection of unlabeled documents using the GPL method.</p>
        <p>While ms-marco-distilbert-base-tas-b proved most suitable for fine-tuning on our use-case,
Table 3 shows that it underperforms its zero-shot equivalent on the train set. A possible
explanation could be the pooling bias or the shallow depth of the training set. To explain
this result and draw solid conclusions, we would need to evaluate the
performance of these models on an unseen test set. On the other hand, the fine-tuned
ms-marco-distilbert-base-v4 model outperforms its zero-shot version, which shows the
effectiveness of fine-tuning on the performance of this model.</p>
        <p>Furthermore, Figures 2, 3, and 4 show the performance of the GPL-based fine-tuned model
at different training steps for different configurations. As can be seen, distilbert-base-v4
fine-tuned on CS data and evaluated on the top-100 ES documents achieves significant
improvements with more training steps in the early stages of training, but the model
converges after 2 training steps.</p>
        <p>The converged model has significantly higher performance than the zero-shot version in
terms of most evaluation metrics. While distilbert-base-v4 improves with more training
steps, the same behavior is not observed for the distilbert-base-tas-b model. In fact, this model’s
performance steadily drops with more training steps. A more detailed analysis on a larger test
collection (with more queries and greater pooling depth) is required to explain this behavior of
the model.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Submitted Runs</title>
        <p>Table 5 shows the performance of the submitted runs on the training set using topics.
The performance of the models based on topics is similar to their query-based performance.
However, the model used to re-rank the top 5000 documents of the Elasticsearch system achieves
the highest performance in the topic-based evaluation.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Official Results</title>
        <p>As per Table 6, where the results are sorted by the primary measure, nDCG@10, we see that
our submitted runs (labeled “Elsevier”) dominate the top of the scoreboard.</p>
        <p>In particular, the highest performing result, run 8, was obtained by re-ranking the top-500
passages retrieved by Elasticsearch when searching for “query, topic” with
MS-DB-v4-GPL-CS, again searching with “query, topic”. The selection of configurations (see Table 2) for our
submissions was based on our own evaluation on the set of qrels provided to us, which indicated
that searching only for the query with MS-DB-v4-GPL-CS outperformed our best model for
the KAPR task: MS-DB-tas-b-GPL-all. However, this set might not have been representative of
SimpleText’s official evaluation set, as most of the other high-ranking results were obtained
with MS-DB-tas-b-GPL-all. For instance, run 7 can be directly compared with run 3, as they use
the same type of query input and the same type of corpus (i.e. top-1000 Elasticsearch results).
This also applies to run 5 versus run 2 and run 6 versus run 1. In each of these settings, the
tas-b model fine-tuned on our entire benchmark set outperformed the v4 model fine-tuned on
only the Computer Science portion of our test collection.</p>
        <p>This indicates that even for the SimpleText task, MS-DB-tas-b-GPL-all performs better than
MS-DB-v4-GPL-CS, and that the success of run 8 could thus be partly attributed to the fact that
it was the only run that used “query, topic” as its query input. Using MS-DB-tas-b-GPL-all with
“query, topic” might thus have outperformed our winning run. Nonetheless, these results show
that the model fine-tuned for our specific scientific passage retrieval task still generalizes well
to other datasets.</p>
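        <p>For reference, the primary measure, nDCG@10, can be computed over a run in standard TREC
format with pytrec_eval, as in the sketch below; the qrels and run shown are toy values, not our
data.</p>
        <preformat>
# nDCG@10 evaluation sketch with pytrec_eval (toy qrels and run).
import pytrec_eval

qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}      # graded relevance labels
run = {"q1": {"d1": 1.2, "d3": 0.7, "d2": 0.1}}  # system scores

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut.10"})
print(evaluator.evaluate(run))  # run ranks ideally here: {'q1': {'ndcg_cut_10': 1.0}}
        </preformat>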
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we designed several ranking models to address the document retrieval task of
the SimpleText lab. To this end, we first built a test collection containing 5000 query-document
pairs annotated with relevance labels. The documents in this test collection are extracted from
scientific documents, which makes it suitable for evaluating the performance of ranking models on
the scientific document retrieval task. We then evaluated the performance of existing ranking
models on this test collection and selected a few models based on their performance to build our
ranking models (used to create our SimpleText submissions). Since these models are trained on
generic datasets created for the ad-hoc document retrieval task, they might not perform strongly
on the specific task of scientific document retrieval. To address this issue, we used
a domain adaptation technique, namely Generative Pseudo Labeling (GPL), to fine-tune the
selected ranking models for the scientific document retrieval task by means of a large collection
of unlabeled scientific documents. Our results on the SimpleText training dataset show the
effectiveness of fine-tuning on the performance of our best ranking model. The distilbert-base-v4
model, fine-tuned using GPL on a large collection of documents in the Computer Science domain
and used to re-rank the top-500 documents retrieved by an Elasticsearch system using “query,
topic” as the query input, has the highest performance compared to the other fine-tuned models.
Using the relevance labels from Computer Science-related domains to fine-tune state-of-the-art
ranking models proved successful. However, as only a small portion of our test collection
consisted of Computer Science queries, future work could explore labeling a larger set of queries
in Computer Science-related domains to fine-tune a model in the same fashion.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Plavén-Sigray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. J.</given-names>
            <surname>Matheson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. C.</given-names>
            <surname>Schifler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. H.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>The readability of scientific texts is decreasing over time</article-title>
          ,
          <source>Elife</source>
          <volume>6</volume>
          (
          <year>2017</year>
          )
          <article-title>e27725</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ermakova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>SanJuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Augereau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Azarbonyad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kamps</surname>
          </string-name>
          ,
          <article-title>Overview of SimpleText - CLEF-2023 track on automatic simplification of scientific texts</article-title>
          , in: A. Arampatzis, E. Kanoulas, T. Tsikrika, S. Vrochidis, A. Giachanou, D. Li, M. Aliannejadi, M. Vlachos, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.
          <source>Proceedings of the Fourteenth International Conference of the CLEF Association</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Thakur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rücklé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08663</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rosenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tiwary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , L. Deng,
          <article-title>Ms marco: A human generated machine reading comprehension dataset</article-title>
          , choice
          <volume>2640</volume>
          (
          <year>2016</year>
          )
          <fpage>660</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          ,
          <article-title>Improving zero-shot retrieval using dense external expansion</article-title>
          ,
          <source>Information Processing &amp; Management</source>
          <volume>59</volume>
          (
          <year>2022</year>
          )
          <fpage>103026</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oğuz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          ,
          <source>arXiv preprint arXiv:2004.04906</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-bert: Sentence embeddings using siamese bert-networks</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Passage re-ranking with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1901.04085</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>Document expansion by query prediction</article-title>
          ,
          <source>arXiv preprint arXiv:1904.08375</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Document ranking with a pretrained sequence-to-sequence model</article-title>
          ,
          <source>arXiv preprint arXiv:2003.06713</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <article-title>Cedr: Contextualized embeddings for document ranking</article-title>
          ,
          <source>in: Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1101</fpage>
          -
          <lpage>1104</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>MacAvaney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goharian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Frieder</surname>
          </string-name>
          ,
          <article-title>Efficient document re-ranking for transformers by precomputing term representations</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>49</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] C. Li, A. Yates, S. MacAvaney, B. He, Y. Sun,
          <article-title>Parade: Passage representation aggregation for document reranking</article-title>
          ,
          <source>arXiv preprint arXiv:2008.09093</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] K. Wang, N. Thakur, N. Reimers, I. Gurevych,
          <article-title>Gpl: Generative pseudo labeling for unsupervised domain adaptation of dense retrieval</article-title>
          ,
          <source>arXiv preprint arXiv:2112.07577</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] S. Hofstätter, S.-C. Lin, J.-H. Yang, J. Lin, A. Hanbury,
          <article-title>Efficiently teaching an effective dense retriever with balanced topic aware sampling</article-title>
          ,
          <source>in: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17] D. P. Kingma, J. Ba,
          <article-title>Adam: A method for stochastic optimization</article-title>
          ,
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>