<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Pretrained Transformer Models for Citation Recommendation</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Rodrigo</forename><surname>Nogueira</surname></persName>
							<affiliation key="aff0">
								<orgName type="department">Tandon School of Engineering</orgName>
								<orgName type="institution">New York University</orgName>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">David R. Cheriton</orgName>
								<orgName type="department" key="dep2">School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Zhiying</forename><surname>Jiang</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">David R. Cheriton</orgName>
								<orgName type="department" key="dep2">School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Kyunghyun</forename><surname>Cho</surname></persName>
							<affiliation key="aff2">
								<orgName type="department">Courant Institute of Mathematical Sciences</orgName>
								<orgName type="institution">New York University</orgName>
							</affiliation>
							<affiliation key="aff3">
								<orgName type="department">Center for Data Science</orgName>
								<orgName type="institution">New York University</orgName>
							</affiliation>
							<affiliation key="aff4">
								<orgName type="department">Facebook AI Research</orgName>
							</affiliation>
							<affiliation key="aff5">
								<orgName type="institution">CIFAR Azrieli Global Scholar</orgName>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Jimmy</forename><surname>Lin</surname></persName>
							<affiliation key="aff1">
								<orgName type="department" key="dep1">David R. Cheriton</orgName>
								<orgName type="department" key="dep2">School of Computer Science</orgName>
								<orgName type="institution">University of Waterloo</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Pretrained Transformer Models for Citation Recommendation</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">C200933CB0E1750F300B042DAE5742E8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-24T21:40+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Citation recommendation systems for the scientific literature, to help authors find papers that should be cited, have the potential to speed up discoveries and uncover new routes for scientific exploration. We treat this task as a ranking problem, which we tackle with a two-stage approach: candidate generation followed by re-ranking. Within this framework, we adapt to the scientific domain a proven combination based on "bag of words" retrieval followed by re-scoring with a BERT model. We experimentally show the effects of domain adaptation, both in terms of pretraining on in-domain data and exploiting in-domain vocabulary. In addition, we evaluate eleven pretrained transformer models and analyze some unexpected failure cases. On three different collections from different scientific disciplines, our models perform close to or at the state of the art in the citation recommendation task.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The volume of scientific publications is growing at an incredible rate. For example, over 900,000 papers are added per year to MEDLINE, a database of the life sciences and biomedical literature. 1 A recent study estimates that 3M papers are published annually in the English language, with a growth rate of 3-5% per year <ref type="bibr" target="#b17">[18]</ref>. This flood of information has made it nearly impossible for researchers to keep abreast of discoveries and innovations, both in their specific sub-field as well as more broadly. Furthermore, there is an overwhelming amount of material that a scientist entering a new field of study needs to read before becoming familiarized with common concepts, methods, and other foundations.</p><p>A number of tools have come along to help researchers cope with this deluge. For example, keyword-based literature search engines (Google Scholar, Microsoft Academic, PubMed, and Semantic Scholar) and citation recommendation tools <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b1">2,</ref><ref type="bibr" target="#b26">27,</ref><ref type="bibr" target="#b20">21,</ref><ref type="bibr" target="#b13">14]</ref> help scientists find relevant articles, often exploiting citation networks to identify what's important in a particular field. Methods to automatically populate scientific knowledge bases <ref type="bibr" target="#b11">[12,</ref><ref type="bibr" target="#b33">34,</ref><ref type="bibr" target="#b34">35]</ref> form another broad approach to tackling this challenge.</p><p>In this work, we investigate the potential of deep pretrained transformer models such as BERT <ref type="bibr" target="#b6">[7]</ref> and large scientific datasets such as Open Research <ref type="bibr" target="#b0">[1]</ref> to improve scientific search tools. 
More concretely, we tackle the task of scientific literature recommendation, where a paper (title and abstract) is given as a query, and the system's task is to find papers that should be cited. We use a standard keyword search engine (based on inverted indexes) with BM25 ranking <ref type="bibr" target="#b32">[33]</ref> to initially retrieve candidate documents and evaluate various pretrained transformer models as re-rankers.</p><p>We find that this simple pipeline is more effective than previous cluster-based methods <ref type="bibr" target="#b31">[32,</ref><ref type="bibr" target="#b3">4]</ref>. To summarize, our main contributions are as follows:</p><p>-We evaluate eleven pretrained ranking models and find that pretraining on the target domain and using domain-specific vocabulary lead to large improvements over a general-purpose model.</p><p>-We find that despite the effectiveness of the pretrained transformer models as query-document relevance estimators, they perform poorly when the term overlap between the query and candidate documents is low. To address this issue, we train with more query-candidate pairs that have low term overlap, but interestingly, such a model performs poorly, even on the training set (see Section 5.2).</p><p>-Contrary to our expectation given the symmetric nature of query and candidate documents, we find that query terms are more important than candidate document terms for relevance estimation (see Section 5.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Most early methods for scientific literature search and recommendation take advantage of keyword-based retrieval <ref type="bibr" target="#b12">[13,</ref><ref type="bibr" target="#b21">22]</ref>. These techniques suffer from the term mismatch problem, which is common in "bag-of-words" retrieval methods, but the issue is aggravated by the diversity of scientific vocabulary <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b7">8,</ref><ref type="bibr" target="#b28">29]</ref>.</p><p>As the number of users grows, popular search engines can exploit interaction signals to learn better ranking models <ref type="bibr" target="#b27">[28,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b9">10]</ref>. However, the reported gains are relatively small compared to classic ranking methods such as BM25. Another common approach in scientific recommendation systems is collaborative filtering <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b23">24,</ref><ref type="bibr" target="#b5">6]</ref>. These methods typically suffer from the cold-start problem, in which there is not enough evidence about new items (or users) to make accurate predictions.</p><p>More recently, cluster-based methods have started to become competitive with traditional retrieval-based methods in this task. Kanakia et al. <ref type="bibr" target="#b18">[19]</ref> cluster papers based on their word embedding representation and use co-citations to alleviate the cold-start problem. However, they perform human evaluations on a private dataset, which precludes an empirical comparison with our approach.</p><p>Perhaps closest to our work is Eto <ref type="bibr" target="#b8">[9]</ref>, who uses a combination of proximity measures from the graph of co-citations to score candidate documents. The edges in the graph are weighted by the distance between the two citations in the citing document. 
This method requires access to the full text of the citing document, which is often not available (for example, due to paywalled content). Our method, on the other hand, predicts citations using only article abstracts, which are widely available in scientific corpora.</p><p>The methods described so far and our work fall in the category of global methods, which aim at recommending citations for the entire paper. Another category comprises local methods, which aim at recommending citations for a specific sentence or paragraph in the document <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b25">26,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>. We do not compare our method to these as we do not assume access to the full text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head><p>This work tackles the task of citation recommendation: given a partially written paper, the system's task is to return all papers that should be cited in it. The input query q is the title and abstract of a paper (and not the full text). We argue that this assumption is crucial to building a useful tool as authors might desire recommendations of relevant citations prior to writing most of their paper.</p><p>Our method comprises two phases, Retrieval and Ranking. In the first phase, the top-k papers D are retrieved by a keyword search engine when queried with query q. In the second phase, we compute the probability p(d|q) of each paper d ∈ D being relevant to q. For this, we use a BERT <ref type="bibr" target="#b6">[7]</ref> re-ranker model based on Nogueira and Cho <ref type="bibr" target="#b29">[30]</ref>. Using the same notation as Devlin et al., we feed the query tokens as sequence A and the candidate paper tokens as sequence B.</p><p>In our setup, both the query and the candidate are the concatenation of the title and abstract of each paper, resulting in an input sequence that is often longer than the maximum input length allowed by the model (typically 512 tokens). To handle this, we allocate 256 tokens to the query and 256 to the candidate, truncating as necessary. At inference time, we use the model as a binary classifier: we feed the representation of the [CLS] token to a single-layer neural network to obtain p(d|q). The output of our method is a list of papers D ranked by p(d|q). Training details are provided in Section 4.2.</p></div>
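The input construction described in Section 3 can be sketched in a few lines of plain Python. This is an illustration rather than the authors' code: the helper name build_input and the way special tokens are charged against the 512-token budget are our own assumptions.

```python
# Sketch of the re-ranker input: query (sequence A) and candidate (sequence B)
# share the model's 512-token input. Helper name and the handling of
# special tokens within the budget are illustrative assumptions.

def build_input(query_tokens, cand_tokens, max_len=512, query_budget=256):
    q = query_tokens[:query_budget]
    # Reserve room for [CLS] and the two [SEP] tokens within the overall limit.
    d = cand_tokens[:max_len - len(q) - 3]
    tokens = ["[CLS]"] + q + ["[SEP]"] + d + ["[SEP]"]
    # Segment ids: 0 for the query (sequence A), 1 for the candidate (sequence B).
    segment_ids = [0] * (len(q) + 2) + [1] * (len(d) + 1)
    return tokens, segment_ids
```

At inference time, the output at the [CLS] position would then be fed to the single-layer classifier that produces p(d|q).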
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Datasets</head><p>Open Research. We train and evaluate our models on the Open Research corpus <ref type="bibr" target="#b0">[1]</ref>,<ref type="foot" target="#foot_0">2</ref> comprising 7.2M computer science and biomedical paper abstracts and their references. We closely follow the data processing steps from Bhagavatula et al. <ref type="bibr" target="#b3">[4]</ref> to create the training, development, and test sets. We remove papers that do not cite any other paper or that have no year of publication. Finally, we remove citations of papers that are not in the corpus or whose year of publication is later than that of the citing paper. Table <ref type="table" target="#tab_0">1</ref> shows the statistics of the final dataset after all processing steps. Note that although our dataset statistics do not match those reported in Bhagavatula et al. <ref type="bibr" target="#b3">[4]</ref>, they match the output of the evaluation script provided by the authors. <ref type="foot" target="#foot_1">3</ref> The difference is that the authors report statistics before the filtering steps (e.g., removing papers without references). Thus, our corpus and dataset splits match exactly, and our results are comparable.</p><p>DBLP and PubMed. The DBLP and PubMed datasets were introduced by Ren et al. <ref type="bibr" target="#b31">[32]</ref> and comprise papers from computer science and biomedicine, respectively. We apply the same data processing steps from Bhagavatula et al., and the resulting dataset statistics are summarized in Table <ref type="table" target="#tab_0">1</ref>.</p><p>Once processed in the manner described above, the citations within each paper serve as the ground truth for that paper. That is, using a specific paper as a query, the perfect result set comprises the actual citations in that paper.</p><p>When evaluating our method on DBLP and PubMed, we use models trained on Open Research's training set as this yields better results than training on the much smaller DBLP and PubMed training sets. 
To avoid leaking training data into the evaluation sets, we use the following method to remove documents in Open Research's training set that appear in the development and test sets of PubMed and DBLP: We remove special characters from the title and use Jaccard similarity (on unigrams) to calculate the closeness of two documents, filtering with a threshold of 0.7. This method results in approximately half of the papers in the development and test sets of PubMed and DBLP being removed from the training set of Open Research.</p></div>
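The de-duplication test described above is straightforward to sketch. The function names below are ours, and the exact normalization used for "removing special characters" is an assumption:

```python
import re

def jaccard_unigrams(title_a, title_b):
    """Jaccard similarity between the unigram sets of two titles.
    Lowercasing and stripping non-alphanumerics is our assumed
    interpretation of 'removing special characters'."""
    def norm(t):
        return set(re.sub(r"[^0-9a-z]+", " ", t.lower()).split())
    a, b = norm(title_a), norm(title_b)
    if not a or not b:
        return 0.0
    return len(a.intersection(b)) / len(a.union(b))

def is_near_duplicate(title_a, title_b, threshold=0.7):
    # Training-set documents whose titles exceed the 0.7 threshold
    # against a dev/test document are removed.
    return jaccard_unigrams(title_a, title_b) >= threshold
```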
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Re-ranker Training</head><p>To obtain the positive and negative examples used to train our binary classification models, we retrieve the top 10 papers for each query (title + abstract) using the Anserini IR toolkit<ref type="foot" target="#foot_2">4</ref>  <ref type="bibr" target="#b35">[36,</ref><ref type="bibr" target="#b36">37]</ref> with BM25 ranking. Among these, approximately 6% on average are relevant papers (positive examples). We do not balance positive and negative examples; see additional discussions about this decision in Section 5.2.</p><p>Starting with a pretrained BERT model, we fine-tune it to our task using cross-entropy loss:</p><formula xml:id="formula_0">L = − Σ_{j ∈ J_pos} log p(d_j|q) − Σ_{j ∈ J_neg} log(1 − p(d_j|q)),<label>(1)</label></formula><p>where J_pos and J_neg are the index sets of the relevant and non-relevant papers and p(d_j|q) is the relevance probability the model assigns to the j-th paper. We examine several BERT variants, detailed in Section 5.1.</p><p>All models are fine-tuned using Google's TPU v3-8 with a batch size of 128 (128 sequences × 512 tokens = 65,536 tokens/batch) for 300k iterations, which takes approximately three days. This corresponds to training on 38.4M (300k × 128) query-candidate pairs, or 1.1 epochs. We do not see any improvements on the development set when training for another 700k iterations, which is equivalent to 3.8 epochs. We use Adam <ref type="bibr" target="#b19">[20]</ref> with the initial learning rate set to 3 × 10^−6, β_1 = 0.9, β_2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 in all layers.</p></div>
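Equation (1) is an ordinary binary cross-entropy summed over the retrieved candidates. As a sanity check, it can be written out directly; this plain-Python sketch takes the probabilities p(d_j|q) as precomputed inputs, and the function name is ours:

```python
import math

def reranker_loss(pos_probs, neg_probs):
    """Cross-entropy loss of Eq. (1). pos_probs holds p(d_j|q) for the
    relevant papers (J_pos), neg_probs for the non-relevant ones (J_neg)."""
    loss = 0.0
    for p in pos_probs:
        loss -= math.log(p)          # relevant papers should score near 1
    for p in neg_probs:
        loss -= math.log(1.0 - p)    # non-relevant papers should score near 0
    return loss
```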
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Inference and Metrics</head><p>At inference time, we first retrieve the top 1000 candidate documents with the title and abstract as the query using BM25 ranking in Anserini. These documents are further re-ranked with one of the variants of the fine-tuned BERT models (see Section 5.1 for more details). Following Bhagavatula et al. <ref type="bibr" target="#b3">[4]</ref>, we evaluate the results using F1 of the top 20 retrieved papers (F1@20) and Mean Reciprocal Rank (MRR) of the top 1000 retrieved papers. We additionally report Recall@1000 (R@1000) to assess the effectiveness of our keyword search in isolation, which provides an upper bound on re-ranking effectiveness. </p></div>
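For concreteness, the per-query metrics can be sketched as follows (helper names are ours; the reported corpus-level numbers average these over all queries):

```python
def f1_at_k(ranked_ids, relevant_ids, k=20):
    """F1 of the top-k ranked papers against the paper's actual citations."""
    top = set(ranked_ids[:k])
    rel = set(relevant_ids)
    hits = len(top.intersection(rel))
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(rel)
    return 2 * precision * recall / (precision + recall)

def reciprocal_rank(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant paper; MRR is its mean over queries."""
    rel = set(relevant_ids)
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in rel:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=1000):
    """Fraction of relevant papers among the top-k candidates (R@1000)."""
    rel = set(relevant_ids)
    if not rel:
        return 0.0
    return len(set(ranked_ids[:k]).intersection(rel)) / len(rel)
```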
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>Our main results are shown in Table <ref type="table" target="#tab_1">2</ref> with SciBERT-Large as the ranking model, selected based on the experiments in Section 5.1. On the Open Research dataset, our best configuration (BM25 + SciBERT-Large) improves upon the best previous result in terms of both F1@20 and MRR. On the smaller DBLP and PubMed datasets, our method is on par with the state of the art. Note that our BERT-based models are trained only on Open Research as this achieves better results than training on the smaller datasets. Interestingly, our baseline BM25 implementation using Anserini out of the box, denoted "BM25 (Anserini)" in Table <ref type="table" target="#tab_1">2</ref>, is 3-7 points higher in F1@20 than the BM25 implementation of Bhagavatula et al. This is likely due to the choice of the query form that we use for "bag of words" retrieval, which is analyzed in Section 5.3, and perhaps a better implementation of BM25 in Anserini (which is based on Lucene).</p><p>Our method appears to be as effective as, and more scalable than, a cluster-based approach. For example, Bhagavatula et al.'s model requires at least 100 GB of RAM to search the 7M documents in the Open Research corpus, <ref type="foot" target="#foot_3">5</ref> whereas keyword search has far more modest memory requirements.</p><p>In the next sections, we investigate the effectiveness of our method by evaluating various pretrained transformer models, as well as the effects of class imbalance and different query forms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">In- vs. Out-Domain Pretraining</head><p>Here we investigate how different pretraining configurations affect effectiveness on the target task. The results, shown in Table <ref type="table" target="#tab_2">3</ref>, are from fine-tuning the pretrained models on Open Research's training set for 300k iterations with a batch size of 128, which corresponds to approximately 1.1 epochs. In the remainder of this paper, we call an in-domain corpus a collection whose majority of documents are from the same domains as those in Open Research (i.e., biomedicine and computer science), and we call an out-domain corpus a collection whose majority of papers are not from those domains.</p><p>The models pretrained on an in-domain corpus, i.e., BioBERT <ref type="bibr" target="#b22">[23]</ref> (row 7) and SciBERT <ref type="bibr" target="#b2">[3]</ref> (rows 8-11), yield significant improvements in the target task over models pretrained on a corpus of a similar size but a different domain (rows 3-5). Pretraining on an out-domain corpus ten times the size of the in-domain corpus results in lower effectiveness on the target task; compare RoBERTa <ref type="bibr" target="#b24">[25]</ref>, row 6 vs. row 10. We conclude that, at least for the task of citation recommendation, pretraining on a smaller in-domain corpus is more effective than pretraining on a larger out-domain corpus.</p><p>When pretraining settings are kept the same except for the vocabulary, the use of in-domain vocabulary gives a 5-10% improvement over out-domain vocabulary (row 8 vs. 9 and row 10 vs. 11). This makes intuitive sense, and Beltagy et al. <ref type="bibr" target="#b2">[3]</ref> report a similar finding in other tasks as well.</p><p>The NCBI models <ref type="bibr" target="#b30">[31]</ref> (rows 1 and 2) are pretrained on an in-domain corpus but produce worse results than models pretrained on an out-domain corpus of a similar size (rows 3-5). 
They also underperform when compared to SciBERT-Base (row 8), which is pretrained on an in-domain corpus of a similar size that comprises full papers instead of abstracts. As also noted by Beltagy et al. <ref type="bibr" target="#b2">[3]</ref>, this result suggests that pretraining with longer documents improves effectiveness on the target task.</p><p>We find that model size appears to be even more important than document length. Our SciBERT-Large models (rows 10 and 11) have higher effectiveness than the SciBERT-Base models (rows 8 and 9) despite being pretrained on a smaller corpus of 7M paper abstracts (1.4B tokens) as opposed to 1M full-text papers (3.2B tokens).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Class Imbalance</head><p>Because we only use the top 10 papers returned by BM25 as training examples, the BERT-based models in this work are trained with more negative examples than positive ones (94% vs. 6%). In a separate experiment, to balance these classes, we include in the training phase pairs of query and relevant papers not retrieved by BM25, but this results in F1@20 and MRR close to zero on both the training and development sets. We obtain a similar result when adding to the training set negative candidates randomly sampled from the corpus.</p><p>What explains these findings? We hypothesize that although BERT is a strong model for document ranking, it still partly relies on exact term match to learn relevance. Thus, when we sample training documents without an exact term match method such as BM25, fewer terms between the query and the candidate paper match, which makes learning relevance harder. Further studies should investigate if this limitation applies to other tasks as well.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Query Analysis</head><p>In the citation recommendation task, the "query" used for initial retrieval can take many forms, such as the title of the paper, the concatenation of title and abstract, or keywords extracted from the text. Here we investigate how these query forms impact the effectiveness of a keyword-based retrieval method.</p><p>In Table <ref type="table" target="#tab_3">4</ref>, we show the effectiveness of BM25 on the Open Research development set. For Key Terms, we follow Bhagavatula et al. <ref type="bibr" target="#b3">[4]</ref> and use Whoosh<ref type="foot" target="#foot_4">6</ref> to first create an index and then extract key terms from the title and abstract with Whoosh's key_terms_from_text method. Despite being faster due to having fewer query terms, the results show that this method has lower effectiveness than simply concatenating the title and abstract of the paper.</p><p>Fig. <ref type="figure">1</ref>: F1@20 on the development set when varying the number of tokens allocated to the input sequence (whose limit is 512 tokens) for the query (as opposed to the candidate document). The query is the concatenation of the title and abstract.</p><p>One of the limitations of transformer-based models (including BERT) is that memory consumption increases quadratically with the number of tokens in the input sequence. On modern hardware such as TPU v3s or V100 GPUs, the maximum sequence length at which we can efficiently train a BERT-Large model is approximately 512 tokens. In our task, since the concatenation of query and candidate tokens is typically longer than this limit, there is a trade-off in how many tokens we allocate to each sequence.</p><p>In Figure <ref type="figure">1</ref>, we show how effectiveness changes as we allocate more tokens to the query than to the candidate document while limiting the sum of the two sequences to 512 tokens. 
These results are obtained with BM25 + SciBERT-Base (for faster experimental turnaround). The curve shows that query terms are more important to the re-ranker model, as increasing query tokens from 64 to 256 increases F1@20 by 2 points. Decreasing candidate document tokens from 256 to 64 barely changes F1@20. This result is somewhat surprising as one expects the two sequences to have equal importance in the task of query-document relevance estimation. Note that in all previous experiments (Table <ref type="table" target="#tab_1">2</ref>), we used 256 tokens for the query and 256 for the candidate; this suggests that our main results might be even higher had we tuned this hyperparameter as well. Future work should investigate whether this is particular to citation recommendation or whether it also occurs in other retrieval tasks with long queries.</p></div>
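The trade-off studied above amounts to a one-dimensional sweep over the query budget. A sketch of the bookkeeping, assuming (as in Section 3) that [CLS] and the two [SEP] tokens also count against the 512-token limit:

```python
def candidate_budget(query_budget, max_len=512, num_special=3):
    """Tokens left for the candidate document after the query and the
    [CLS]/[SEP] special tokens are charged against the input limit.
    The special-token accounting is our assumption."""
    return max(max_len - query_budget - num_special, 0)

# Query budgets on the x-axis of Fig. 1 and the resulting candidate budgets.
splits = [(q, candidate_budget(q)) for q in (64, 128, 256, 384, 448)]
```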
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>We provide an extensive evaluation of pretrained transformer models for the scientific literature recommendation task. We find that in-domain pretraining and domain-specific vocabulary greatly improve effectiveness. Additionally, we present an unexpected finding: despite the symmetry of the two inputs when estimating the relevance of a candidate article to a query article, terms from the query article are more important than terms from the candidate article when allocating "space" in the BERT input. Future work should investigate this observation in more detail.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1 .</head><label>1</label><figDesc>Statistics of the datasets.</figDesc><table><row><cell></cell><cell cols="3">Open Research DBLP PubMed</cell></row><row><cell>Total # of docs</cell><cell cols="3">6,892,252 50,227 47,347</cell></row><row><cell>Total # of citations</cell><cell cols="3">44,400,729 156,807 825,371</cell></row><row><cell>Avg. # citations per doc</cell><cell>6.45</cell><cell>3.12</cell><cell>17.43</cell></row><row><cell>Avg. len. per doc (char)</cell><cell cols="2">1,391 1,193</cell><cell>1,504</cell></row><row><cell>Queries -Train</cell><cell cols="3">3,343,809 27,322 26,793</cell></row><row><cell>-Dev</cell><cell cols="2">487,582 8,324</cell><cell>2,768</cell></row><row><cell>-Test</cell><cell>464,449</cell><cell>931</cell><cell>8,815</cell></row><row><cell>q/rel. doc pairs -Train</cell><cell cols="3">32,470,673 106,011 558,674</cell></row><row><cell>-Dev</cell><cell cols="3">5,985,787 38,628 66,655</cell></row><row><cell>-Test</cell><cell cols="3">5,944,269 12,168 200,042</cell></row></table><note>Following Bhagavatula et al. <ref type="bibr" target="#b3">[4]</ref>, we create the training, development, and test sets by sorting papers by publication year and using the oldest 80% for training (1991-2014), the next 10% for development (2014-2015), and the most recent 10% for testing (2015-2016). Since the development and test sets are too large (400k+ papers), we randomly sample 20k examples from each set.</note></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 2 .</head><label>2</label><figDesc>Main results on Open Research, DBLP, and PubMed.</figDesc><table><row><cell></cell><cell cols="2">F1@20</cell><cell></cell><cell>MRR</cell><cell cols="2">R@1000</cell></row><row><cell></cell><cell cols="2">Dev Test</cell><cell cols="2">Dev Test</cell><cell cols="2">Dev Test</cell></row><row><cell>Open Research</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>BM25 [4]</cell><cell>-</cell><cell>0.058</cell><cell>-</cell><cell>0.218</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 (Anserini)</cell><cell cols="2">0.082 0.089</cell><cell cols="2">0.279 0.312</cell><cell cols="2">0.424 0.421</cell></row><row><cell>Citeomatic [4]</cell><cell>-</cell><cell>0.125</cell><cell>-</cell><cell>0.330</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 + SciBERT-Large</cell><cell cols="2">0.136 0.132</cell><cell cols="2">0.430 0.431</cell><cell cols="2">0.424 0.421</cell></row><row><cell>DBLP</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>BM25 [4]</cell><cell>-</cell><cell>0.119</cell><cell>-</cell><cell>0.425</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 (Anserini)</cell><cell cols="2">0.105 0.194</cell><cell cols="2">0.352 0.585</cell><cell cols="2">0.669 0.691</cell></row><row><cell>ClusCite [32]</cell><cell>-</cell><cell>0.237</cell><cell>-</cell><cell>0.548</cell><cell>-</cell><cell>-</cell></row><row><cell>Citeomatic [4]</cell><cell cols="2">-0.303</cell><cell>-</cell><cell>0.689</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 + SciBERT-Large</cell><cell cols="2">0.149 0.272</cell><cell cols="2">0.472 0.714</cell><cell cols="2">0.669 0.691</cell></row><row><cell>PubMed</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell>BM25 
[4]</cell><cell>-</cell><cell>0.209</cell><cell>-</cell><cell>0.574</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 (Anserini)</cell><cell cols="2">0.299 0.268</cell><cell cols="2">0.793 0.721</cell><cell cols="2">0.794 0.765</cell></row><row><cell>ClusCite [32]</cell><cell>-</cell><cell>0.274</cell><cell>-</cell><cell>0.578</cell><cell>-</cell><cell>-</cell></row><row><cell>Citeomatic [4]</cell><cell cols="2">-0.329</cell><cell>-</cell><cell>0.771</cell><cell>-</cell><cell>-</cell></row><row><cell>BM25 + SciBERT-Large</cell><cell cols="2">0.326 0.304</cell><cell cols="2">0.835 0.792</cell><cell cols="2">0.794 0.765</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3 .</head><label>3</label><figDesc>Results on Open Research's development set of BERT-based models pretrained under different settings. All models are fine-tuned for approximately one epoch on the training set.</figDesc><table><row><cell cols="2">Pretrained Model Size Pretraining Corpus</cell><cell>Tokens Vocabulary</cell><cell>Cased F1@20 MRR</cell></row><row><cell>(1) NCBI</cell><cell>Base PubMed+MIMIC</cell><cell>4.5B Wiki+Books</cell><cell>0.093 0.315</cell></row><row><cell>(2) NCBI</cell><cell>Large PubMed+MIMIC</cell><cell>4.5B Wiki+Books</cell><cell>0.105 0.352</cell></row><row><cell>(3) Google</cell><cell>Base Wiki+Books</cell><cell>3.3B Wiki+Books</cell><cell>0.113 0.374</cell></row><row><cell>(4) Google</cell><cell>Large Wiki+Books</cell><cell>3.3B Wiki+Books</cell><cell>0.115 0.373</cell></row><row><cell>(5) Google WWM</cell><cell>Large Wiki+Books</cell><cell>3.3B Wiki+Books</cell><cell>0.121 0.399</cell></row><row><cell>(6) RoBERTa</cell><cell>Large Various (Non-Scientific)</cell><cell>33B (Non-Scientific)</cell><cell>0.125 0.409</cell></row><row><cell>(7) BioBERT v1.1</cell><cell>Base Wiki+Books+PubMed+PMC</cell><cell>21.3B PubMed+PMC</cell><cell>0.128 0.417</cell></row><row><cell>(8) SciBERT</cell><cell>Base Open Research (1M Full Papers)</cell><cell>3.2B Wiki+Books</cell><cell>0.125 0.409</cell></row><row><cell>(9) SciBERT</cell><cell>Base Open Research (1M Full Papers)</cell><cell>3.2B Open Research</cell><cell>0.131 0.423</cell></row><row><cell>(10) SciBERT</cell><cell>Large Open Research (7M Abstracts)</cell><cell>1.4B Wiki+Book</cell><cell>0.135 0.420</cell></row><row><cell>(11) SciBERT</cell><cell>Large Open Research (7M Abstracts)</cell><cell>1.4B Open Research</cell><cell>0.137 0.430</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4 .</head><label>4</label><figDesc>BM25 results on Open Research's development set when different query forms are used. BERT-based re-ranking is not applied in these experiments.</figDesc><table><row><cell></cell><cell cols="3">Open Research</cell><cell cols="3">PubMed</cell><cell cols="3">DBLP</cell></row><row><cell>Query Type</cell><cell>F1@20</cell><cell>MRR</cell><cell>R@1000</cell><cell>F1@20</cell><cell>MRR</cell><cell>R@1000</cell><cell>F1@20</cell><cell>MRR</cell><cell>R@1000</cell></row><row><cell>Key Terms (Whoosh)</cell><cell>0.065</cell><cell>0.251</cell><cell>0.282</cell><cell>0.201</cell><cell>0.595</cell><cell>0.604</cell><cell>0.130</cell><cell>0.425</cell><cell>0.510</cell></row><row><cell>Title</cell><cell>0.063</cell><cell>0.244</cell><cell>0.287</cell><cell>0.199</cell><cell>0.584</cell><cell>0.654</cell><cell>0.133</cell><cell>0.424</cell><cell>0.551</cell></row><row><cell>Title and Abstract</cell><cell>0.095</cell><cell>0.351</cell><cell>0.363</cell><cell>0.268</cell><cell>0.720</cell><cell>0.765</cell><cell>0.194</cell><cell>0.585</cell><cell>0.691</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0"><head>Fig.</head><figDesc>F1@20 as a function of the number of query tokens (64, 128, 256, 384, 448).</figDesc></figure>
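The tables above report results in terms of F1@20, MRR, and R@1000. As a minimal illustrative sketch (not code from the paper), these ranking metrics can be computed for a single query as follows, where `ranked` is the system's ordered list of candidate document IDs and `relevant` is the set of true cited documents:

```python
def recall_at_k(ranked, relevant, k=1000):
    # R@k: fraction of the relevant documents found in the top k results.
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)

def f1_at_k(ranked, relevant, k=20):
    # F1@k: harmonic mean of precision@k and recall@k.
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)

def mrr(ranked, relevant):
    # MRR contribution of one query: reciprocal rank of the first
    # relevant document (0 if none is retrieved).
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```

Corpus-level numbers such as those in Tables 2-4 would then be averages of these per-query values over the evaluation set.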
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0">https://s3-us-west-2.amazonaws.com/ai2-s2-research-public/open-corpus/ 2017-02-21/papers-2017-02-21.zip</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1">https://github.com/allenai/citeomatic/blob/master/citeomatic/scripts/ evaluate.py</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2">http://anserini.io/ BIR 2020 Workshop on Bibliometric-enhanced Information Retrieval</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3">https://github.com/allenai/citeomatic#citeomatic-evaluation</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4">https://whoosh.readthedocs.io/en/latest/</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This research was supported in part by the Canada First Research Excellence Fund, the Natural Sciences and Engineering Research Council (NSERC) of Canada, NVIDIA, and eBay. Additionally, we would like to thank Google for computational resources in the form of Google Cloud credits.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Construction of the literature graph in Semantic Scholar</title>
		<author>
			<persName><forename type="first">W</forename><surname>Ammar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Groeneveld</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Crawford</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Downey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dunkelberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Elgohary</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feldman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Ha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Kinney</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kohlmeier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Murray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">H</forename><surname>Ooi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Peters</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Skjonsberg</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wilhelm</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yuan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Van Zuylen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
				<meeting>the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="84" to="91" />
		</imprint>
	</monogr>
	<note>Industry Papers</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Technical paper recommendation: A study in combining multiple information sources</title>
		<author>
			<persName><forename type="first">C</forename><surname>Basu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Hirsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">W</forename><surname>Cohen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Nevill-Manning</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Artificial Intelligence Research</title>
		<imprint>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="231" to="252" />
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Cohan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1903.10676</idno>
		<title level="m">SciBERT: Pretrained contextualized embeddings for scientific text</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Bhagavatula</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Feldman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Power</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Ammar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1802.08301</idno>
		<title level="m">Content-based citation recommendation</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">A system for automatic personalized tracking of scientific literature on the web</title>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">D</forename><surname>Bollacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Fourth ACM conference on Digital Libraries (DL &apos;99)</title>
				<meeting>the Fourth ACM conference on Digital Libraries (DL &apos;99)</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="105" to="113" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Research paper recommender systems on big scholarly data</title>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">T</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Pacific Rim Knowledge Acquisition Workshop</title>
				<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="251" to="260" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of deep bidirectional transformers for language understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Combining global and local semantic contexts for improving biomedical information retrieval</title>
		<author>
			<persName><forename type="first">D</forename><surname>Dinh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tamine</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">European Conference on Information Retrieval</title>
				<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="375" to="386" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Extended co-citation search: Graph-based document retrieval on a cocitation network containing citation context information</title>
		<author>
			<persName><forename type="first">M</forename><surname>Eto</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information Processing &amp; Management</title>
		<imprint>
			<biblScope unit="volume">56</biblScope>
			<biblScope unit="issue">6</biblScope>
			<biblScope unit="page">102046</biblScope>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Best Match: New relevance search for PubMed</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fiorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Canese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Starchenko</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kireev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Osipov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Kholodov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Ismagilov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ostell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">PLoS Biology</title>
		<imprint>
			<biblScope unit="volume">16</biblScope>
			<biblScope unit="issue">8</biblScope>
			<biblScope unit="page">e2005343</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">How user intelligence is improving PubMed</title>
		<author>
			<persName><forename type="first">N</forename><surname>Fiorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Leaman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">J</forename><surname>Lipman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Nature Biotechnology</title>
		<imprint>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="issue">10</biblScope>
			<biblScope unit="page">937</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Swan: A distributed knowledge infrastructure for Alzheimer disease research</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kinoshita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Miller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seaborne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cayzer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Clark</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Web Semantics: Science, Services and Agents on the World Wide Web</title>
		<imprint>
			<biblScope unit="volume">4</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="222" to="228" />
			<date type="published" when="2006">2006</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">First steps towards electronic research communication</title>
		<author>
			<persName><forename type="first">P</forename><surname>Ginsparg</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Computers in Physics</title>
		<imprint>
			<biblScope unit="volume">8</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page" from="390" to="396" />
			<date type="published" when="1994">1994</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Context-aware citation recommendation</title>
		<author>
			<persName><forename type="first">Q</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 19th International Conference on World Wide Web</title>
				<meeting>the 19th International Conference on World Wide Web</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="421" to="430" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Recommending citations: Translating papers into references</title>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kataria</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Caragea</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rokach</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM &apos;12)</title>
				<meeting>the 21st ACM International Conference on Information and Knowledge Management (CIKM &apos;12)</meeting>
		<imprint>
			<date type="published" when="2012">2012</date>
			<biblScope unit="page" from="1910" to="1914" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">A neural probabilistic model for context based citation recommendation</title>
		<author>
			<persName><forename type="first">W</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Mitra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Twenty-Ninth AAAI Conference on Artificial Intelligence</title>
				<imprint>
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Information needs of clinical teams: Analysis of questions received by the clinical informatics consult service</title>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">N</forename><surname>Jerome</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">B</forename><surname>Giuse</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">W</forename><surname>Gish</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Sathe</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Dietrich</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Bulletin of the Medical Library Association</title>
		<imprint>
			<biblScope unit="volume">89</biblScope>
			<biblScope unit="issue">2</biblScope>
			<biblScope unit="page">177</biblScope>
			<date type="published" when="2001">2001</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Watkinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mabe</surname></persName>
		</author>
		<title level="m">The STM report: An overview of scientific and scholarly publishing</title>
				<imprint>
			<publisher>International Association of Scientific, Technical and Medical Publishers</publisher>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">A scalable hybrid research paper recommender system for Microsoft Academic</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kanakia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Eide</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">The World Wide Web Conference</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="2893" to="2899" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<monogr>
		<title level="m" type="main">Adam: A method for stochastic optimization</title>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">P</forename><surname>Kingma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ba</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1412.6980</idno>
		<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Conceptual recommender system for CiteSeerX</title>
		<author>
			<persName><forename type="first">A</forename><surname>Kodakateri Pudhiyaveetil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gauch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Luong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Eno</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third ACM Conference on Recommender Systems</title>
				<meeting>the Third ACM Conference on Recommender Systems</meeting>
		<imprint>
			<date type="published" when="2009">2009</date>
			<biblScope unit="page" from="241" to="244" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Indexing and retrieval of scientific literature</title>
		<author>
			<persName><forename type="first">S</forename><surname>Lawrence</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Bollacker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">L</forename><surname>Giles</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 8th ACM International Conference on Information and Knowledge Management (CIKM &apos;99)</title>
				<meeting>the 8th ACM International Conference on Information and Knowledge Management (CIKM &apos;99)</meeting>
		<imprint>
			<date type="published" when="1999">1999</date>
			<biblScope unit="page" from="139" to="146" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yoon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">H</forename><surname>So</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.08746</idno>
		<title level="m">BioBERT: A pre-trained biomedical language representation model for biomedical text mining</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">Context-based collaborative filtering for citation recommendation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Kong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">M</forename><surname>Bekele</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">IEEE Access</title>
		<imprint>
			<biblScope unit="volume">3</biblScope>
			<biblScope unit="page" from="1695" to="1703" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<monogr>
		<title level="m" type="main">RoBERTa: A Robustly Optimized BERT Pretraining Approach</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Joshi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Levy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lewis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Stoyanov</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">Recommending citations with translation model</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Shan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM &apos;11)</title>
				<meeting>the 20th ACM International Conference on Information and Knowledge Management (CIKM &apos;11)</meeting>
		<imprint>
			<date type="published" when="2011">2011</date>
			<biblScope unit="page" from="2017" to="2020" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">On the recommending of citations for research papers</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Mcnee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Cosley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Gopalkrishnan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">K</forename><surname>Lam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rashid</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">A</forename><surname>Konstan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Riedl</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2002 ACM Conference on Computer Supported Cooperative Work</title>
				<meeting>the 2002 ACM Conference on Computer Supported Cooperative Work</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="116" to="125" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Deep learning for biomedical information retrieval: Learning textual relevance from click logs</title>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Fiorini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">BioNLP</title>
		<imprint>
			<biblScope unit="page" from="222" to="231" />
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<analytic>
		<title level="a" type="main">Improved biomedical term selection in pseudo relevance feedback</title>
		<author>
			<persName><forename type="first">M</forename><surname>Nabeel Asim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Wasim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Usman Ghani Khan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Mahmood</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Database</title>
		<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Nogueira</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Cho</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1901.04085</idno>
		<title level="m">Passage re-ranking with BERT</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Peng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Lu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1906.05474</idno>
		<title level="m">Transfer learning in biomedical natural language processing: An evaluation of BERT and ELMo on ten benchmarking datasets</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">ClusCite: Effective citation recommendation by information network-based clustering</title>
		<author>
			<persName><forename type="first">X</forename><surname>Ren</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Khandelwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Han</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="821" to="830" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Okapi at TREC-3</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">E</forename><surname>Robertson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Walker</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Jones</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hancock-Beaulieu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gatford</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Text REtrieval Conference (TREC-3)</title>
				<meeting>the 3rd Text REtrieval Conference (TREC-3)<address><addrLine>Gaithersburg, Maryland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="1994">1994</date>
			<biblScope unit="page" from="109" to="126" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Automated hypothesis generation based on mining scientific literature</title>
		<author>
			<persName><forename type="first">S</forename><surname>Spangler</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Wilkins</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><forename type="middle">J</forename><surname>Bachman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Nagarajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dayaram</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><forename type="middle">J</forename><surname>Haas</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Regenbogen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><forename type="middle">R</forename><surname>Pickering</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Comer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">N</forename><surname>Myers</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><forename type="middle">R</forename><surname>Stanoi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lelescu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">J</forename><surname>Labrie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Parikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Lisewski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Donehower</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Lichtarge</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2014">2014</date>
			<biblScope unit="page" from="1877" to="1886" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Moliere: Automatic biomedical hypothesis generation system</title>
		<author>
			<persName><forename type="first">J</forename><surname>Sybrandt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Shtutman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Safro</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1633" to="1642" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">Anserini: Enabling the use of Lucene for information retrieval research</title>
		<author>
			<persName><forename type="first">P</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017)</title>
				<meeting>the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017)</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1253" to="1256" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Anserini: Reproducible ranking baselines using Lucene</title>
		<author>
			<persName><forename type="first">P</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Fang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Lin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Journal of Data and Information Quality</title>
		<imprint>
			<biblScope unit="volume">10</biblScope>
			<biblScope unit="issue">4</biblScope>
			<biblScope unit="page">16</biblScope>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
