=Paper=
{{Paper
|id=Vol-2936/paper-22
|storemode=property
|title=A Neural Text Ranking Approach for Automatic MeSH Indexing
|pdfUrl=https://ceur-ws.org/Vol-2936/paper-22.pdf
|volume=Vol-2936
|authors=Alastair Rae,James Mork,Dina Demner-Fushman
|dblpUrl=https://dblp.org/rec/conf/clef/RaeMD21
}}
==A Neural Text Ranking Approach for Automatic MeSH Indexing==
Alastair R. Rae, James G. Mork and Dina Demner-Fushman
National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, 20894, USA
alastair.rae@nih.gov (A. R. Rae); jmork@mail.nih.gov (J. G. Mork); ddemner@mail.nih.gov (D. Demner-Fushman)
CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania

Abstract

The U.S. National Library of Medicine (NLM) has been indexing the biomedical literature with MeSH terms since the mid-1960s, and in recent years the library has increasingly relied on AI assistance and automation to curate the biomedical literature more efficiently. Since 2002, the NLM has been using natural language processing algorithms to assist indexers by providing MeSH term recommendations, and we are continually working to improve the quality of these recommendations. This work presents a new neural text ranking approach for automatic MeSH indexing. The domain-specific pretrained transformer model, PubMedBERT, was fine-tuned on MEDLINE data and used to rank candidate main headings obtained from a Convolutional Neural Network (CNN). Pointwise, listwise, and multi-stage ranking approaches are demonstrated, and algorithm performance was evaluated by participating in the BioASQ challenge task 9a on semantic indexing. The neural text ranking approach was found to have very competitive performance in the final batch of the challenge, and the multi-stage ranking method typically boosted the CNN model performance by about 5 percentage points in terms of micro F1-score.

Keywords
Automatic MeSH Indexing, Medical Text Indexing, Neural Text Ranking, Transformers, BERT

1. Introduction

The U.S. National Library of Medicine (NLM) maintains the MEDLINE® bibliographic database to help the biomedical research community find the journal articles that they need. To improve the quality of PubMed search results, all MEDLINE articles are indexed with a controlled vocabulary called Medical Subject Headings (MeSH®, https://www.nlm.nih.gov/mesh/). MeSH indexing is a time-consuming and highly specialized activity. NLM indexers review the full text of an article and then assign MeSH terms that represent the central concepts as well as every other topic that is discussed to a significant extent. This work focuses on the indexing of main headings, which are also known as MeSH descriptors. There are currently over 29,000 main headings in the 2021 MeSH vocabulary, and each main heading describes an important biomedical concept. On average, indexers assign about 11 main headings per article.

Each year close to 1 million articles are indexed for MEDLINE, and the library uses AI assistance and automation to increase the efficiency of the indexing process. Since 2002, indexing assistance has been provided by the Medical Text Indexer (MTI) system[1]. MTI improves productivity by providing a pick list of recommended MeSH terms that can be quickly selected by indexers. Automatic MeSH indexing is a difficult machine learning problem, and the main challenges are the large number of main headings and their highly imbalanced frequency distribution. This work presents a new neural text ranking approach for automatic MeSH indexing.
Pointwise, listwise, and multi-stage ranking approaches are demonstrated using a domain-specific pretrained transformer model called PubMedBERT[2]. To the best of our knowledge, this is the first time that text ranking using pretrained transformers has been applied to the automatic MeSH indexing problem. The performance of the new approach was evaluated by participating in the BioASQ challenge task 9a on semantic indexing[3].

2. Related Work

In recent years, two high-performing approaches for automatic MeSH indexing have emerged: learning-to-rank approaches and neural network multi-label classification approaches. Examples of learning-to-rank based systems are MTI[1] and DeepMeSH[4], and examples of neural network multi-label classification systems are MeSHProbeNet[5] and AttentionMeSH[6]. Recently, You et al. achieved state-of-the-art performance with BERTMeSH[7]. BERTMeSH is a neural network multi-label classification approach that leverages pretrained transformers[8] and the article full text.

Learning-to-rank[9] is a methodology that uses supervised machine learning algorithms to solve ranking problems. Typically, it is applied to automatic MeSH indexing by treating the title and abstract as the query, and candidate main headings as the documents to be ranked. Learning-to-rank algorithms usually make use of hand-crafted features and rank documents by integrating multiple sources of evidence. For example, MTI uses a learning-to-rank algorithm[10] to rank candidate main headings from MetaMap[11], PubMed Related Citations[12], and machine learning algorithms. Sources of evidence include text features such as the fraction of main heading unigrams and bigrams that appear in the title or abstract.

Learning-to-rank algorithms can be classified as pointwise, pairwise, and listwise approaches depending on the loss function that is used. Pointwise approaches compute a loss for individual query-document pairs: the training task is to predict whether individual candidate documents are relevant to a query. At inference time, the model is run on all query-document pairs, and overall rankings are obtained from the predicted relevance scores. Pairwise approaches compute a loss for a query and a pair of documents: the training task is to predict which document is more relevant to the query. At inference time, for each query, pairwise rankings are obtained for all candidate document pairs, and these pairwise rankings are then converted into an overall ranking. Finally, listwise approaches compute a loss for a query and all candidate documents: the training task is to predict the correct overall document ranking for a query. Hence, listwise approaches directly solve the ranking problem.

Recently, neural text ranking using pretrained transformers has proven to be a very effective approach for ad-hoc information retrieval[13]. On the MS MARCO passage ranking dataset, large-scale pretrained transformer models such as BERT[14] have outperformed traditional information retrieval approaches by a considerable margin[15]. Text ranking using pretrained transformers was first demonstrated by Nogueira and Cho[16], who implemented a pointwise approach by training BERT as a relevance classifier on MS MARCO query-passage pairs. Pairwise text ranking using BERT was also demonstrated as part of a multi-stage ranking architecture[17]. To the best of our knowledge, there is no prior work on listwise text ranking using pretrained transformers.
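To make the distinction between these loss families concrete, the toy PyTorch sketch below contrasts a pointwise loss with a pairwise loss on a handful of scored candidates. It is purely illustrative: it is not taken from any of the systems cited above, and the scores and margin value are made up.

```python
import torch
import torch.nn.functional as F

scores = torch.tensor([2.1, -0.3, 0.7])   # model scores for three candidate documents
labels = torch.tensor([1.0, 0.0, 1.0])    # 1 = relevant to the query, 0 = not relevant

# Pointwise: binary cross-entropy on each query-document pair independently.
pointwise_loss = F.binary_cross_entropy_with_logits(scores, labels)

# Pairwise: a margin loss that asks every relevant document to outscore every
# non-relevant document by at least 1.0.
pos, neg = scores[labels == 1], scores[labels == 0]
pairwise_loss = F.relu(1.0 - (pos.unsqueeze(1) - neg.unsqueeze(0))).mean()

# A listwise loss (e.g. a softmax cross-entropy over the whole candidate list)
# would instead score the ranking of all candidates for the query at once.
print(pointwise_loss.item(), pairwise_loss.item())
```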
For a recent review of text ranking with pretrained transformers, the interested reader is referred to "Pretrained Transformers for Text Ranking: BERT and Beyond" by J. Lin et al.[13].

Domain-specific pretraining of transformer models can improve performance on downstream tasks[8, 2], and BioBERT[8] is a popular domain-specific version of BERT that has been pretrained on a biomedical corpus. The BioBERT authors started with the original BERT checkpoint and ran additional pretraining steps on a corpus of PubMed abstracts and PubMed Central article full text. Recently, PubMedBERT was shown to outperform BioBERT on the Biomedical Language Understanding and Reasoning Benchmark (BLURB)[2]. PubMedBERT was also pretrained on PubMed abstracts and PubMed Central article full text; however, unlike BioBERT, it was trained from scratch using a domain-specific vocabulary.

3. Methods

This section describes our automatic MeSH indexing approaches that were evaluated by participating in the BioASQ challenge task 9a on semantic indexing.

3.1. Convolutional Neural Network

The baseline approach was our previously described Convolutional Neural Network (CNN) for automatic MeSH indexing[18]. It is a neural network multi-label classification approach that takes the article title, abstract, journal, publication year, and indexing year as input. The top N results from this model were also used as candidate main headings for the text ranking approaches.

3.2. Pointwise Text Ranking

The neural text ranking approaches were implemented using a domain-specific pretrained transformer model called PubMedBERT[2]. PubMedBERT is a BERT model with a domain-specific vocabulary that has been pretrained from scratch on a biomedical corpus. It was chosen because it was the top performing model on the BLURB benchmark[2], and also because its domain-specific vocabulary was expected to encode biomedical text efficiently. More details about the BERT architecture and fine-tuning configurations can be found in the original BERT paper[14].

For the pointwise text ranking approach, PubMedBERT was configured as a relevance classifier using the text pair classification configuration. The input sequence was:

[[CLS], q, [SEP], d, [SEP]],    (1)

where the query, q, comprises the concatenated tokens of the indexing year, journal name, title, and abstract, and d comprises the tokens of the candidate main heading. [CLS] and [SEP] are the classification and separator special tokens, respectively. In the text pair classification configuration, the [CLS] token is used to represent the input sequence, and the relevance probability was computed by adding a softmax classification head on top of its contextualized embedding (T_[CLS]):

P(Relevant = 1 | d_i, q) = s_i ≜ softmax(T_[CLS] W + b)_1,    (2)

where W and b are the weights and bias of the classification layer, respectively, and softmax(·)_i denotes the i-th element of the softmax output. The training task was to predict whether a candidate main heading is relevant, given the article title, abstract, and other metadata. For the automatic MeSH indexing task, a main heading is considered relevant if it was indexed.
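The minimal sketch below illustrates this pointwise relevance classifier using the Hugging Face Transformers library. The hub model identifier and the exact query formatting are assumptions, and fine-tuning, batching, and other production details are omitted.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed Hugging Face hub ID for the PubMedBERT checkpoint named in the paper.
MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.eval()

def relevance_scores(year, journal, title, abstract, candidate_headings):
    """Score each candidate main heading against one article (cf. Equation 2)."""
    query = f"{year} {journal} {title} {abstract}"   # q in Equation 1 (format assumed)
    scores = []
    for heading in candidate_headings:               # d in Equation 1
        inputs = tokenizer(query, heading, truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits          # shape (1, 2)
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())  # P(relevant)
    return scores
```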
PubMedBERT was fine-tuned on positive and negative article-candidate main heading pairs sampled from the CNN top results using the cross-entropy loss:

L = -∑_{j ∈ J_pos} log(s_j) - ∑_{j ∈ J_neg} log(1 - s_j),    (3)

where J_pos is a set of indexes for article-main heading pairs where the main heading was indexed, and J_neg is a set of indexes for article-main heading pairs where the main heading was not indexed. At inference time, the fine-tuned model was run on all candidate main headings from the CNN top results, and the predicted relevance scores were used to generate a per-article main heading ranking. The final set of predicted main headings for an article was obtained by applying a decision threshold to the ranking scores.

3.3. Listwise Text Ranking

For the listwise text ranking approach, PubMedBERT was configured for text tagging, and the second text input was the shuffled candidate main headings separated by the pipe symbol. The input sequence was therefore:

[[CLS], q, [SEP], |, d_1, |, d_2, ..., |, d_N, [SEP]].    (4)

The pipe symbols allow the model to distinguish between different candidate main headings, and random shuffling was employed to prevent overfitting to the main heading order. Relevance probabilities were computed by feeding the contextualized embedding of the first token of each main heading to the softmax classification head described in Equation 2. Thus, the listwise approach directly creates a ranking by performing relevance classification on all candidate main headings at once. PubMedBERT was fine-tuned on the top N main headings from the CNN model using the cross-entropy loss. Again, the final set of main headings was obtained by applying a decision threshold.

3.4. Multi-Stage Text Ranking

The listwise approach is expected to outperform the pointwise approach because it can consider interactions between main headings. However, a problem with the listwise approach is that the length of the input sequence is proportional to the number of candidate main headings, and this limits the recall of the approach because there is a maximum number of candidate main headings that can fit within BERT's maximum sequence length of 512 tokens. The multi-stage text ranking approach attempts to overcome this limitation by first ranking the candidate main headings using the pointwise approach. The pointwise approach can rank any number of main headings, and it is expected to have higher recall@N than the CNN model. For the multi-stage ranking approach, it was also found to be beneficial to average the ranking scores of the different stages.

More formally, starting with the CNN ranking, R_0, the top N_p results were reranked using the pointwise approach to generate a ranking R_1. R_0 and R_1 scores were then averaged to generate R_2. Next, the top N_l results in R_2 were reranked using the listwise approach to generate ranking R_3. The final ranking was computed by averaging the scores of R_2 and R_3. A decision threshold was applied to generate the final main heading predictions.

3.5. Multi-Stage Text Ranking with COVID-19 Rules

During the BioASQ challenge, it was noticed that our machine learning approaches were performing poorly on COVID-19 articles, and two specific problems were identified:

• The "COVID-19" main heading was always being indexed with the unnecessary additional main headings "Pneumonia, Viral", "Coronavirus Infections", and "Pandemics".
• The "SARS-CoV-2" main heading was always being indexed with the unnecessary additional main heading "Betacoronavirus".

The precision of the multi-stage text ranking approach was improved by removing these unnecessary main headings using manually written COVID-19 rules (a sketch of the multi-stage scoring and rule filter follows below).
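The following sketch illustrates the multi-stage score averaging of Section 3.4 together with a rule filter in the spirit of Section 3.5. The function names, candidate list sizes, threshold, and the exact form of the rules are illustrative assumptions rather than our production settings.

```python
def multi_stage_rank(article, cnn_scores, pointwise_rerank, listwise_rerank,
                     n_p=100, n_l=50, threshold=0.5):
    """cnn_scores: dict mapping main heading -> CNN score (ranking R_0).

    pointwise_rerank / listwise_rerank are placeholders for the fine-tuned
    PubMedBERT rankers; each returns a dict mapping main heading -> score.
    """
    # Stage 1: rerank the CNN top n_p headings pointwise, then average (R_2).
    top_p = sorted(cnn_scores, key=cnn_scores.get, reverse=True)[:n_p]
    r1 = pointwise_rerank(article, top_p)
    r2 = {h: (cnn_scores[h] + r1[h]) / 2 for h in top_p}

    # Stage 2: rerank the top n_l headings of R_2 listwise, then average again.
    top_l = sorted(r2, key=r2.get, reverse=True)[:n_l]
    r3 = listwise_rerank(article, top_l)
    final = {h: (r2[h] + r3[h]) / 2 for h in top_l}

    # Decision threshold, then rule-based cleanup of redundant COVID-19 headings
    # (one possible reading of the manually written rules).
    predictions = {h for h, s in final.items() if s >= threshold}
    if "COVID-19" in predictions:
        predictions -= {"Pneumonia, Viral", "Coronavirus Infections", "Pandemics"}
    if "SARS-CoV-2" in predictions:
        predictions -= {"Betacoronavirus"}
    return predictions
```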
3.6. Hybrid Approach

MTI First Line Indexing (MTIFL) and MTI Review Filtering (MTIR) are used for selected journals to partially automate the NLM indexing process (https://ii.nlm.nih.gov/MTI/MTIFL.shtml). For MTIFL journals, MTI provides the initial indexing, and this is later reviewed (and potentially modified) by human indexers. For MTIR journals the process is the same, except that human curation is only used for critical elements. Empirically, MTI is found to perform very well for semi-automatically indexed journals. In order to achieve the highest possible agreement with human indexing, we therefore implemented a hybrid approach that used MTI First Line Index results for MTIFL journals, Default MTI results for MTIR journals, and multi-stage text ranking results (with COVID-19 rules) for all other journals. Default MTI is configured for balanced precision and recall, and its results were expected to be the most similar to the initial indexing provided for MTIR journals. MTI results are made publicly available for anyone wanting to use them during the challenge.

4. Experiments

4.1. BioASQ Task 9a

The performance of our proposed approaches was evaluated by participating in the large-scale online biomedical semantic indexing task of the 2021 BioASQ challenge (task 9a). The task released 3 batches of 5 test sets, and these contained between 4,000 and 11,000 soon-to-be-indexed MEDLINE articles. Participants had a limited time window to submit results, and this was necessary to ensure that indexing predictions were made before indexer annotations became available. The NLM team used the challenge to evaluate various text ranking approaches and configurations. This paper describes our final best-performing approaches, and these are evaluated on the 5 weekly test sets of batch 3. Results for our final approaches were only submitted to the last two test sets of batch 3, and for a more comprehensive performance evaluation, we have independently generated indexing predictions for the week 1-3 test sets.

4.1.1. Dataset

The dataset was constructed from the MEDLINE/PubMed 2021 annual baseline (https://www.nlm.nih.gov/databases/download/pubmed_medline.html). Fully and semi-automatically indexed articles ("Automated" or "Curated" indexing method) were excluded because we believe that the indexing of these articles may be biased by MTI's predictions. 20,000 randomly selected articles published in 2020/2021 were reserved for a validation set, and another 40,000 randomly selected articles published in 2020/2021 were reserved for our own test set. The remaining 10 million articles published after 2006 were used for the training set. The presented approaches were evaluated on the BioASQ task 9a batch 3 test sets, and independently generated predictions were evaluated using indexer annotations downloaded from the NLM E-Utilities service (https://www.ncbi.nlm.nih.gov/books/NBK25497/) on the 28th of June 2021. The final challenge results were calculated using the indexing available on the 21st of May 2021, and to allow for fair comparisons between systems, indexing completed after this date was excluded.
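A hedged sketch of this dataset construction is shown below. It assumes the baseline has already been parsed into record dictionaries with hypothetical "pmid", "year", "indexing_method", and similar fields; the actual parsing of the baseline XML and the exact split procedure are not published in this paper.

```python
import random

def build_splits(records, seed=42):
    """records: parsed MEDLINE baseline records (field names assumed)."""
    random.seed(seed)
    # Exclude fully and semi-automatically indexed articles.
    manual = [r for r in records
              if r["indexing_method"] not in ("Automated", "Curated")]

    # Validation and test sets are sampled from recent (2020/2021) articles.
    recent = [r for r in manual if r["year"] >= 2020]
    random.shuffle(recent)
    validation, test = recent[:20_000], recent[20_000:60_000]

    # Training set: remaining articles published after 2006.
    held_out = {r["pmid"] for r in validation + test}
    train = [r for r in manual
             if r["year"] > 2006 and r["pmid"] not in held_out]
    return train, validation, test
```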
4.1.2. Evaluation Metrics

The primary evaluation metric used by the semantic indexing task is the micro F1-score (MiF), and this is defined as the harmonic mean of the micro precision (MiP) and the micro recall (MiR):

MiF = (2 · MiP · MiR) / (MiP + MiR),    (5)

where

MiP = (∑_{i=1}^{N_A} ∑_{j=1}^{N_L} y_ij · ŷ_ij) / (∑_{i=1}^{N_A} ∑_{j=1}^{N_L} ŷ_ij),    (6)

MiR = (∑_{i=1}^{N_A} ∑_{j=1}^{N_L} y_ij · ŷ_ij) / (∑_{i=1}^{N_A} ∑_{j=1}^{N_L} y_ij).    (7)

In the above equations, y are the indexer annotations, ŷ are the model predictions, N_A is the number of articles, and N_L is the number of main headings. Model predictions were made after applying a decision threshold to the predicted scores. There is an optimum decision threshold that results in the highest F1-score, and this threshold was determined by a linear search on the validation set.

4.1.3. Configuration

The configuration for the CNN model has previously been described in Rae et al.[18], and the model was retrained on the MEDLINE/PubMed 2021 dataset described in this paper.

The pointwise and listwise ranking models were implemented using the Hugging Face Transformers library (v4.2.2) with a PyTorch (v1.7.1) backend. PubMedBERT pretrained weights were downloaded from the Hugging Face model repository, and the uncased model pretrained on abstracts and full text was selected ("BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext").

The pointwise model was implemented in Hugging Face Transformers using the BertForSequenceClassification class (specifying the number of labels as 2), and the default PubMedBERT configuration was left unchanged. Training was run for approximately 1 epoch on a balanced dataset, and the Adam optimizer was used with L2 weight decay set to 0.01. The learning rate schedule included 10,000 warmup steps, a maximum learning rate of 2e-5, and a linear decay to zero thereafter.

The listwise model was implemented in Hugging Face Transformers using the BertForTokenClassification class, with the number of labels set to 2. All tokens, except for the first token of each main heading, were assigned the masking label of -100. The first token of each main heading was assigned a label of 1 or 0 for indexed and not-indexed main headings, respectively. Again, the PubMedBERT configuration was not altered, and the model was trained on the CNN top 50 results for approximately 10 epochs. Other training settings were the same as for the pointwise approach, except that a lower maximum learning rate of 9e-6 was used.

Both ranking models were trained on the Biowulf cluster (https://hpc.nih.gov/) using NVIDIA V100x 32GB GPUs. The pointwise and listwise models were trained on 4 and 2 GPUs respectively for approximately 10 days. FP16 training was used, and an effective batch size of 128 was achieved using gradient accumulation. Validation set performance of the listwise model had converged after 10 days; however, the performance of the pointwise model was still improving.

For the hybrid approach, MTI results (http://ii.nlm.nih.gov/BioASQ/) and the MTIFL and MTIR journal lists (https://ii.nlm.nih.gov/MTI/MTIFL.shtml), 22nd of September 2020 versions, were downloaded from the NLM website.
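As an illustration of the listwise setup described above (pipe-separated candidates, a label on the first token of each main heading, -100 masking elsewhere), the sketch below builds a single training example for BertForTokenClassification. The hub model identifier and the helper structure are assumptions, not our production code.

```python
import random
from transformers import AutoTokenizer

# Assumed Hugging Face hub ID for the PubMedBERT checkpoint named above.
MODEL = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

def listwise_example(query, candidates, max_length=512):
    """candidates: list of (main_heading, was_indexed) pairs from the CNN top N."""
    random.shuffle(candidates)                     # avoid overfitting to heading order
    pipe = tokenizer.encode("|", add_special_tokens=False)

    input_ids = ([tokenizer.cls_token_id]
                 + tokenizer.encode(query, add_special_tokens=False)
                 + [tokenizer.sep_token_id])
    token_type_ids = [0] * len(input_ids)
    labels = [-100] * len(input_ids)               # query tokens are masked

    for heading, indexed in candidates:
        h_ids = tokenizer.encode(heading, add_special_tokens=False)
        input_ids += pipe + h_ids
        token_type_ids += [1] * (len(pipe) + len(h_ids))
        # Only the first token of each heading carries the 1/0 label.
        labels += [-100] * len(pipe) + [int(indexed)] + [-100] * (len(h_ids) - 1)

    # Truncate to BERT's maximum length and close with the final [SEP].
    input_ids = input_ids[:max_length - 1] + [tokenizer.sep_token_id]
    token_type_ids = token_type_ids[:max_length - 1] + [1]
    labels = labels[:max_length - 1] + [-100]
    return {"input_ids": input_ids, "token_type_ids": token_type_ids,
            "attention_mask": [1] * len(input_ids), "labels": labels}
```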
4.1.4. Results

Table 1 summarizes the micro F1-score performance of the top performing systems in batch 3. For each weekly test set, the table includes the highest micro F1-score achieved by each team, along with the best performing MTI baseline for reference.

Table 1: Micro F1-score performance of top performing systems in batch 3.

System                           Week 1   Week 2   Week 3   Week 4   Week 5   Average
NLM System 3 (hybrid approach)   0.7059   0.6973   0.6966   0.6999   0.7075   0.7014
dmiip_fdu systems                0.7060   0.6976   0.6980   0.6966   0.7013   0.6999
MTI First Line Index             0.6555   0.6445   0.6541   0.6491   0.6508   0.6508
pi_dna                           0.6443   0.6464   0.6503   0.6466   0.6498   0.6475
DeepSys2                         0.5780   0.5674   0.5651   0.5625   –        0.5683
iria-1                           0.4895   0.4778   0.4758   0.4818   0.4729   0.4796

Table 2: Micro F1-score performance of NLM approaches in batch 3.

Approach                         Week 1   Week 2   Week 3   Week 4   Week 5   Average
Hybrid (NLM System 3)            0.7059   0.6973   0.6966   0.6999   0.7075   0.7014
Multi-stage + COVID-19 rules     0.7032   0.6971   0.6953   0.6932   0.7011   0.6980
Multi-stage (NLM System 2)       0.7000   0.6945   0.6931   0.6894   0.6932   0.6940
Listwise (NLM System 4)          0.6931   0.6888   0.6884   0.6836   0.6876   0.6883
Pointwise (NLM System 1)         0.6888   0.6831   0.6820   0.6801   0.6799   0.6828
CNN (NLM CNN)                    0.6482   0.6434   0.6424   0.6381   0.6424   0.6429

Table 1 shows that the performance of the neural text ranking approach is very competitive. Our best performing hybrid approach outperformed the MTI baseline by about 5 percentage points, and it has very similar performance to the state-of-the-art dmiip_fdu systems.

Table 2 shows the micro F1-score performance of the NLM approaches in batch 3. Note that results for the "Multi-stage + COVID-19 rules" approach were not submitted to the challenge because teams were allowed a maximum of 5 systems. Comparing the performance of the multi-stage ranking approach to the CNN model, it can be seen that neural text ranking provided about a 5 percentage point performance boost on average. The table shows that the listwise approach outperformed the pointwise approach and also that multi-stage ranking was beneficial. The COVID-19 rules provided small but consistent performance improvements, and the hybrid approach, which substituted MTI results for semi-automatically indexed journals, was the best performing NLM system in all batch 3 test sets.

4.2. Listwise Model Hyperparameter Search

There is a performance trade-off for the listwise approach: increasing the number of candidate main headings (N) increases the maximum achievable recall, but it also results in more input truncation due to longer input sequence lengths. This section explores this trade-off through a hyperparameter search for the optimum number of candidate main headings for the listwise approach. For the study, the listwise model was trained with four different values of N between 25 and 50, and input truncation percentages and model performance were measured on the validation set. BioBERT input truncation percentages were also measured for comparison. The results of the study are shown in Figure 1.

Figure 1: a) Percentage of truncated inputs vs. number of candidate main headings (N) for PubMedBERT and BioBERT. b) Listwise model micro F1-score and CNN model recall vs. number of candidate main headings.

Figure 1a shows a significant increase in the percentage of truncated inputs as the number of candidate main headings is increased from 25 to 50. For PubMedBERT, 15% of inputs were truncated for 25 candidate main headings, and this rises to 39% of inputs for 50 candidate main headings. The figure also shows that input truncation percentages were much higher for BioBERT than for PubMedBERT, and this is because BioBERT does not have a domain-specific vocabulary.
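One way such truncation percentages can be measured is sketched below: tokenize the listwise input for each validation article with both tokenizers and count how often it exceeds 512 tokens. The hub model identifiers and the exact way the listwise input is assembled here are assumptions.

```python
from transformers import AutoTokenizer

# Assumed hub IDs for the two checkpoints compared in Figure 1a.
tokenizers = {
    "PubMedBERT": AutoTokenizer.from_pretrained(
        "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext"),
    "BioBERT": AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1"),
}

def truncation_rate(articles, n_candidates, name, max_length=512):
    """articles: list of (query_text, candidate_headings) pairs."""
    tok = tokenizers[name]
    truncated = 0
    for query, headings in articles:
        # Second segment: the top n_candidates headings separated by pipes.
        second = " | ".join(headings[:n_candidates])
        ids = tok(query, second, add_special_tokens=True)["input_ids"]
        truncated += len(ids) > max_length
    return 100.0 * truncated / len(articles)
```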
Despite the relatively high input truncation percentages observed in Figure 1a, Figure 1b shows that the listwise model micro F1-score increases with N, and the model trained with 50 candidate main headings is shown to have the highest micro F1-score of 0.7020 on the validation set. As expected, the increase in micro F1-score is correlated with the increase in CNN model recall, but for N = 50 the strength of this correlation appears to be weakening.

5. Discussion

As expected, the results indicate that the listwise text ranking approach outperforms the pointwise text ranking approach, but to confirm this we would need to train the pointwise model to convergence and also optimize the number of candidate main headings. The pointwise approach considers one article-main heading pair per training example, whereas the listwise approach considers 50 article-main heading pairs per training example, and so it makes sense that training of the pointwise model would converge more slowly.

This work has presented a hyperparameter search for the optimum number of candidate main headings for the listwise approach, and increasing the number of candidate main headings from 25 to 50 was shown to result in a 0.51 percentage point improvement in micro F1-score performance on the validation set. Increasing the number of candidate main headings further may result in additional performance improvements; however, for N = 50 there is some evidence that input truncation is starting to limit performance. The study also indicates that PubMedBERT was a good model choice, because it was shown to encode biomedical text more efficiently than BioBERT, resulting in significantly less input truncation. For 50 candidate main headings, the BioBERT tokenizer was shown to truncate about 75% of input sequences, and this would likely have a large negative impact on MeSH indexing performance.

The poor performance of our machine learning models on COVID-19 articles (before applying the COVID-19 rules) can be explained by inconsistent and out-of-date training data. The problem is that the indexing of COVID-19 articles has evolved during the pandemic due to changing indexing rules and the addition of COVID-19-specific main headings. This is an interesting example of how sudden data and concept drift have been problematic for machine learning systems during the COVID-19 pandemic. Finally, substituting MTI predictions for semi-automatically indexed journals was shown to consistently improve performance. An explanation could be that the indexing of MTIFL and MTIR journals is biased by MTI's predictions.

6. Conclusion

This paper has presented a new neural text ranking approach for automatic MeSH indexing. PubMedBERT was fine-tuned on MEDLINE data and used to rank candidate main headings obtained from a CNN model. Pointwise, listwise, and multi-stage text ranking approaches were demonstrated, and their performance was evaluated on batch 3 of the BioASQ 2021 semantic indexing task. The neural text ranking approach was shown to have very competitive performance, and the multi-stage text ranking method was found to boost the CNN model micro F1-score performance by about 5 percentage points.

In the future, we would like to investigate the zero-shot performance of neural text ranking models for automatic MeSH indexing. In particular, it would be interesting to know if they can correctly index a new main heading for a concept that has only been seen during unsupervised pretraining.
It would be very useful if the text ranking models were learning the general concept of "indexing relevance" rather than specific indexing rules for each main heading.

Acknowledgments

This research was supported by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.

References

[1] J. Mork, A. Aronson, D. Demner-Fushman, 12 years on - is the NLM medical text indexer still useful and relevant?, J. Biomed. Semant. 8 (2017) 8.
[2] Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, H. Poon, Domain-specific language model pretraining for biomedical natural language processing, 2021. arXiv:2007.15779.
[3] M. Krallinger, A. Krithara, A. Nentidis, G. Paliouras, M. Villegas, BioASQ at CLEF2020: Large-scale biomedical semantic indexing and question answering, in: Advances in Information Retrieval, Springer International Publishing, 2020, pp. 550–556.
[4] S. Peng, R. You, H. Wang, C. Zhai, H. Mamitsuka, S. Zhu, DeepMeSH: deep semantic representation for improving large-scale MeSH indexing, Bioinformatics 32 (2016) i70–i79.
[5] G. Xun, K. Jha, Y. Yuan, Y. Wang, A. Zhang, MeSHProbeNet: a self-attentive probe net for MeSH indexing, Bioinformatics (2019).
[6] Q. Jin, B. Dhingra, W. Cohen, X. Lu, AttentionMeSH: simple, effective and interpretable automatic MeSH indexer, in: Proceedings of the 6th BioASQ Workshop, Brussels, Belgium, 1 November 2018, ACL, 2018, pp. 47–56.
[7] R. You, Y. Liu, H. Mamitsuka, S. Zhu, BERTMeSH: deep contextual representation learning for large-scale high-performance MeSH indexing with full text, Bioinformatics 37 (2020) 684–692.
[8] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 36 (2019) 1234–1240.
[9] T.-Y. Liu, Learning to rank for information retrieval, Found. Trends Inf. Retr. 3 (2009) 225–331.
[10] I. Zavorin, J. Mork, D. Demner-Fushman, Using learning-to-rank to enhance NLM medical text indexer results, in: Proceedings of the Fourth BioASQ Workshop, Berlin, Germany, 12-13 August 2016, ACL, 2016, pp. 8–15.
[11] A. R. Aronson, F.-M. Lang, An overview of MetaMap: historical perspective and recent advances, J. Am. Med. Inform. Assoc. 17 (2010) 229–236.
[12] J. Lin, J. W. Wilbur, PubMed related articles: a probabilistic topic-based model for content similarity, BMC Bioinformatics 8 (2007) 423.
[13] J. Lin, R. Nogueira, A. Yates, Pretrained transformers for text ranking: BERT and beyond, 2020. arXiv:2010.06467.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ACL, 2019, pp. 4171–4186.
[15] N. Craswell, B. Mitra, E. Yilmaz, D. Campos, Overview of the TREC 2020 deep learning track, 2021. arXiv:2102.07662.
[16] R. Nogueira, K. Cho, Passage re-ranking with BERT, 2020. arXiv:1901.04085.
[17] R. Nogueira, W. Yang, K. Cho, J. Lin, Multi-stage document ranking with BERT, 2019. arXiv:1910.14424.
[18] A. R. Rae, D. O. Pritchard, J. G. Mork, D. Demner-Fushman, Automatic MeSH indexing: revisiting the subheading attachment problem, in: AMIA 2020, American Medical Informatics Association Annual Symposium, Virtual Event, USA, November 14-18, 2020, AMIA, 2020.