The Impact of Negative Samples on Learning to Rank

Claudio Lucchese (Ca' Foscari University of Venice, Italy) claudio.lucchese@unive.it
Franco Maria Nardini (ISTI-CNR, Pisa, Italy) f.nardini@isti.cnr.it
Raffaele Perego (ISTI-CNR, Pisa, Italy) r.perego@isti.cnr.it
Salvatore Trani (ISTI-CNR, Pisa, Italy) s.trani@isti.cnr.it

ABSTRACT

Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions predict a score for each document that is in turn exploited to induce a relevance ranking. The effectiveness of these learned functions has been shown to be significantly affected by the data used to learn them. Several analyses and document selection strategies have been proposed in the past to deal with this aspect. In this paper we review the state-of-the-art proposals and report the results of a preliminary investigation of a new sampling strategy aimed at reducing the number of non-relevant query-document pairs, so as to significantly decrease the training time of the learning algorithm and to increase the final effectiveness of the model by reducing noise and redundancy in the training set.

ACM Reference format:
Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, and Salvatore Trani. 2017. The Impact of Negative Samples on Learning to Rank. In Proceedings of the first international Workshop on LEARning Next gEneration Rankers, Amsterdam, The Netherlands, October 1, 2017 (LEARNER'17), 2 pages.

1 INTRODUCTION

Ranking is one of the most important problems in information retrieval. Modern Web search engines exploit and combine hundreds of features modelling the relevance between the user query and the candidate documents. A specific area of machine learning, i.e., learning to rank, has been developed to deal with the ranking problem. Given a training set of feature vectors and relevance judgments, a learning-to-rank algorithm learns how to combine the query and document features so as to optimize a specific effectiveness metric. In recent years, important efforts have been devoted to feature engineering/extraction and to the development of sophisticated learning-to-rank algorithms, while little research has been conducted on how to choose queries and documents for learning-to-rank datasets, or on the effect of these choices on the ability of learning-to-rank algorithms to learn effectively and efficiently.

Yilmaz and Robertson attack the problem by observing that the number of judgments in the training set directly affects the quality of the learned system [5]. Given the expense of obtaining relevance judgments for constructing training data, the main problem is how to best distribute the judgment effort. The authors thus investigate the trade-off between the number of queries and the number of judgments per query when building training sets. In particular, they show that training sets with more queries but fewer judgments per query are more cost-effective than training sets with fewer queries but more judgments per query.

Aslam et al. propose several document selection methodologies, i.e., depth-k pooling, sampling (infAP, statAP), active learning (MTC), and on-line heuristics (hedge) [1]. The authors show that some of the proposed methods, i.e., infAP, statAP, and depth pooling, are better than others (hedge and the LETOR method) for building efficient and effective learning-to-rank collections, and they compare against the document selection methodology used to create the LETOR datasets. The study also deals with both i) the proportion of relevant to non-relevant documents and ii) the similarity between relevant and non-relevant documents in the datasets. Results confirm that both characteristics highly affect the quality of learning-to-rank collections, with the latter having more impact. As a side result, the authors also observed that some learning-to-rank algorithms, i.e., RankNet and LambdaRank, are more robust to document selection methodologies than others, i.e., Regression, RankBoost, and Ranking SVM.

In a later contribution, Kanoulas et al. propose a large-scale study of the effect that the distribution of labels across the different relevance grades in the training set has on the performance of trained ranking functions [3]. The authors propose a methodology to generate a large number of training datasets with different label distributions; the datasets are then used by three learning-to-rank algorithms to fit ranking models. By investigating the effect of these distributions on the accuracy of the obtained ranking functions, the authors characterize how training sets should be constructed. They conclude that the relevance grade distribution in the training set is an important factor for the effectiveness of learning-to-rank algorithms, and they provide qualitative advice about the construction of learning-to-rank datasets: distributions balancing the number of documents in the extreme grades should be favoured, as the middle relevance grades play a less important role than the extreme ones.

In this paper, we investigate a new technique to sample documents in order to improve both the efficiency and the effectiveness of learning-to-rank models. The improved efficiency is a consequence of the reduced size of the sampled dataset, as the ranking algorithm has to learn from a smaller number of query-document pairs. At the same time, an effective sampling technique may improve the effectiveness of the resulting model by filtering out noise and reducing the redundancy of the query-document pairs. We present preliminary experimental results proving the benefits of the newly proposed sampling methodology.

2 METHODOLOGY AND DISCUSSION

We propose a sampling methodology that acts in two steps: i) we train a ranking model on the full dataset and use it to rank the query-document pairs of the same dataset; ii) we select a given fraction of the top-ranked negative samples, i.e., label = 0, together with all the positive ones, i.e., label > 0. The rationale is that the negative samples ranked highest by the learned model are the most likely to be mis-ranked. Consequently, we reduce the dataset to such positive and negative examples to let the learning algorithm focus on the most critical and discriminating examples. Moreover, by sampling only negative documents we indirectly reduce the imbalance between the positive and negative classes and reduce the redundancy in the negative class.
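The following is a minimal sketch of the two-step procedure, using LightGBM's LGBMRanker (a LambdaMART implementation) as a stand-in for the λ-Mart implementation used in the paper. The text does not specify whether the fraction of top-ranked negatives is selected per query or globally, so the sketch selects per query; the function name, data layout (NumPy arrays sorted by query id), and parameters are illustrative.

```python
import numpy as np
import lightgbm as lgb

def sample_negatives(X, y, qid, keep_frac=0.10):
    """Keep all positives (label > 0) plus, for each query, the
    keep_frac top-scored negatives (label == 0). X, y, qid are
    NumPy arrays whose rows are sorted by query id."""
    # Step i): train a reference LambdaMART-style model on the full dataset...
    _, group_sizes = np.unique(qid, return_counts=True)
    model = lgb.LGBMRanker(objective="lambdarank", learning_rate=0.05,
                           num_leaves=64, n_estimators=1000)
    model.fit(X, y, group=group_sizes)
    # ...and use it to score every query-document pair of the same dataset.
    scores = model.predict(X)

    # Step ii): keep the top-ranked negatives, i.e., those the model is
    # most likely to mis-rank, together with all the positive examples.
    keep = y > 0
    for q in np.unique(qid):  # simple per-query loop, kept for clarity
        neg = np.flatnonzero((qid == q) & (y == 0))
        k = int(np.ceil(keep_frac * len(neg)))
        keep[neg[np.argsort(-scores[neg])[:k]]] = True
    return X[keep], y[keep], qid[keep], model
```

A new model trained on the reduced (X, y, qid) triple is then compared against the model trained on the full dataset.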
We evaluate the effectiveness of the proposed technique on the publicly available Istella¹ dataset [2]. It is composed of 33,018 queries, and each of its 10,454,629 query-document pairs is represented by means of 220 features. Training and validation data were used to train a reference algorithm, i.e., λ-Mart, a list-wise algorithm capable of using NDCG in its loss function, resulting in a predictor of the ranking [4]. The λ-Mart algorithm was fine-tuned by sweeping its parameters to maximize NDCG@10; the best performance was obtained with a learning rate of 0.05 and 64 leaves.
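For reference, NDCG@k (here with k = 10) is the standard normalized discounted cumulative gain. The exponential-gain formulation commonly used in learning to rank is given below; the paper does not spell out which gain function its implementation adopts.

\[
\mathrm{DCG@}k \;=\; \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k \;=\; \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},
\]

where rel_i is the relevance label of the document ranked at position i and IDCG@k is the DCG@k of the ideal, descending-relevance ordering of the same documents.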
We built several sampled datasets by varying the fraction of negatives kept in {5%, 10%, 20%, 40%}. We trained a λ-Mart model on each of these sampled datasets and evaluated their performance on the full test set. As shown in Figure 1, the best performing model is the one trained on the dataset where only the top 10% of negatives were selected: it outperforms the model trained on the full training set by a large margin. Moreover, when the fraction of documents selected is lowered to 5%, we observe a drop in performance, although in the first iterations (up to 200 trees) results are in line with those of the other models. This behavior suggests that 5% of the negative instances does not fully represent the negative class of query-document pairs; as a consequence, the resulting model is not able to correctly identify them, and this harms the ranking accuracy.

[Figure 1: Effectiveness of λ-Mart models trained on the full dataset vs. sampled ones. (a) Model trained on the full dataset; (b) model trained on the 10% sample.]

The proposed sampling strategy has thus been shown to provide a significant boost to the effectiveness of the ranking model. To understand the reasons behind this gain, we investigate the similarity distribution between pairs of documents in the negative class, where the similarity between a pair of documents is computed by counting the number of exit leaves the two documents have in common when scored by a given ensemble-based ranking model. Figure 2 reports this analysis: the x-axis corresponds to the number of leaves in common (1,000 is the maximum, reached when two documents exit from the same leaves in all the trees), while the y-axis corresponds to the number of pairs with the given similarity. The negative examples in the full dataset were analyzed with both the model trained on the full dataset and the one trained on the 10% sample. According to the model trained on the sampled dataset, the distribution of document pairs is significantly skewed towards dissimilar pairs, i.e., pairs with few common exit leaves. This not only signifies that the initial dataset was redundant in the negative class, but also implies that the model trained on the sampled dataset is much more effective in discriminating among the negative examples, thus resulting in improved ranking accuracy.

[Figure 2: Similarity distribution of negative documents.]
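The exit-leaf similarity can be computed as sketched below, again assuming a LightGBM ensemble as in the earlier sketch: predict(..., pred_leaf=True) returns, for every document, the index of the exit leaf in each tree. Sampling random pairs instead of enumerating all of them is our shortcut for a dataset with millions of negatives.

```python
import numpy as np

def exit_leaf_similarity(model, X_neg, n_pairs=100_000, seed=13):
    """For random pairs of negative documents, count the trees in which
    the two documents exit from the same leaf (the maximum equals the
    ensemble size, e.g., 1,000 trees)."""
    # One exit-leaf index per (document, tree): shape (n_docs, n_trees).
    leaves = model.predict(X_neg, pred_leaf=True)
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_neg), size=n_pairs)
    j = rng.integers(0, len(X_neg), size=n_pairs)
    # Per pair, count the trees where the exit leaves coincide.
    return (leaves[i] == leaves[j]).sum(axis=1)
```

A histogram of the returned counts yields the kind of distribution plotted in Figure 2.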
The preliminary experimental evaluation confirms the validity of the proposed sampling technique. As future work, we intend to investigate a robust and systematic sampling technique that provides benefits independently of the class distribution of the dataset, and where the best sampling ratio is chosen automatically according to the dataset properties. We believe that sampling strategies can improve both the quality of the generated training data and the effectiveness of the learning algorithms.

¹ http://blog.istella.it/istella-learning-to-rank-dataset/

REFERENCES
[1] Javed A. Aslam, Evangelos Kanoulas, Virgil Pavlu, Stefan Savev, and Emine Yilmaz. 2009. Document Selection Methodologies for Efficient and Effective Learning-to-rank. In Proc. ACM SIGIR'09. 468–475.
[2] D. Dato, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini. 2016. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Trans. Inf. Syst. 35, 2, Article 15 (2016).
[3] Evangelos Kanoulas, Stefan Savev, Pavel Metrikov, Virgil Pavlu, and Javed Aslam. 2011. A Large-scale Study of the Effect of Training Set Characteristics over Learning-to-rank Algorithms. In Proc. ACM SIGIR'11. 1243–1244.
[4] Q. Wu, C.J.C. Burges, K.M. Svore, and J. Gao. 2010. Adapting Boosting for Information Retrieval Measures. Information Retrieval (2010).
[5] Emine Yilmaz and Stephen Robertson. 2009. Deep Versus Shallow Judgments in Learning to Rank. In Proc. ACM SIGIR'09. 662–663.