The Impact of Negative Samples on Learning to Rank

Claudio Lucchese (Ca' Foscari University of Venice, Italy) claudio.lucchese@unive.it
Franco Maria Nardini (ISTI-CNR, Pisa, Italy) f.nardini@isti.cnr.it
Raffaele Perego (ISTI-CNR, Pisa, Italy) r.perego@isti.cnr.it
Salvatore Trani (ISTI-CNR, Pisa, Italy) s.trani@isti.cnr.it

ABSTRACT

Learning-to-Rank (LtR) techniques leverage machine learning algorithms and large amounts of training data to induce high-quality ranking functions. Given a set of documents and a user query, these functions predict a score for each document that is in turn exploited to induce a relevance ranking. The effectiveness of these learned functions has been shown to be significantly affected by the data used to learn them. Several analyses and document selection strategies have been proposed in the past to deal with this aspect. In this paper we review the state-of-the-art proposals and report the results of a preliminary investigation of a new sampling strategy aimed at reducing the number of non-relevant query-document pairs, so as to significantly decrease the training time of the learning algorithm and to increase the final effectiveness of the model by reducing noise and redundancy in the training set.

ACM Reference format:
Claudio Lucchese, Franco Maria Nardini, Raffaele Perego, and Salvatore Trani. 2017. The Impact of Negative Samples on Learning to Rank. In Proceedings of the first international Workshop on LEARning Next gEneration Rankers, Amsterdam, The Netherlands, October 1, 2017 (LEARNER'17), 2 pages.

1 INTRODUCTION

Ranking is one of the most important problems in information retrieval. Modern Web search engines exploit and combine hundreds of features modelling the relevance between the user query and the candidate documents. A specific area of machine learning, i.e., learning to rank, has been developed to deal with the ranking problem. Given a training set of feature vectors and relevance judgments, a learning-to-rank algorithm learns how to combine the query and document features so as to optimize a specific effectiveness metric. In recent years, important efforts have been devoted to feature engineering/extraction and to the development of sophisticated learning-to-rank algorithms, while little research has been conducted on how to choose queries and documents for learning-to-rank datasets, or on the effect of these choices on the ability of learning-to-rank algorithms to learn effectively and efficiently.

Yilmaz and Robertson attack the problem by observing that the number of judgments in the training set directly affects the quality of the learned system [5]. Given the expense of obtaining relevance judgments for constructing training data, the main problem is how to best distribute the judgment effort. The authors thus investigate the trade-off between the number of queries and the number of judgments per query when building training sets. In particular, they show that training sets with more queries but fewer judgments per query are more cost-effective than training sets with fewer queries but more judgments per query.

Aslam et al. propose several document selection methodologies, i.e., depth-k pooling, sampling (infAP, statAP), active learning (MTC), and on-line heuristics (hedge) [1]. The authors show that some of the proposed methods, i.e., infAP, statAP, and depth pooling, are better than others (hedge and the LETOR method) for building efficient and effective learning-to-rank collections, and they compare against the document selection methodology used to create the LETOR datasets. The study also deals with both i) the proportion of relevant to non-relevant documents and ii) the similarity between relevant and non-relevant documents in the datasets. Results confirm that both characteristics highly affect the quality of learning-to-rank collections, with the latter having more impact. As a side result, the authors also observed that some learning-to-rank algorithms, i.e., RankNet and LambdaRank, are more robust to document selection methodologies than others, i.e., Regression, RankBoost, and Ranking SVM.

In a later contribution, Kanoulas et al. propose a large-scale study of the effect that the distribution of labels across the different relevance grades in the training set has on the performance of trained ranking functions [3]. The authors propose a methodology to generate a large number of training datasets with different label distributions; the datasets are then used by three learning-to-rank algorithms to fit ranking models. By investigating the effect of these distributions on the accuracy of the obtained ranking functions, the authors characterize how training sets should be constructed. They conclude that the relevance grade distribution in the training set is an important factor for the effectiveness of learning-to-rank algorithms, and they provide qualitative advice about the construction of learning-to-rank datasets: distributions balancing the number of documents in the extreme grades should be favoured, as the middle relevance grades play a less important role than the extreme ones.

In this paper, we investigate a new technique to sample documents in order to improve both the efficiency and the effectiveness of learning-to-rank models. The improved efficiency is a consequence of the reduced size of the sampled dataset, as the ranking algorithm has to learn from a smaller number of query-document pairs. At the same time, an effective sampling technique may improve the effectiveness of the resulting model by filtering out noise and reducing the redundancy of the query-document pairs. We present preliminary experimental results proving the benefits of the newly proposed sampling methodology.

2 METHODOLOGY AND DISCUSSION

We propose a sampling methodology that acts in two steps: i) we train a ranking model on the full dataset and use it to rank the query-document pairs of the same dataset; ii) we select a given fraction of the top-ranked negative samples, i.e., label = 0, together with all the positive ones, i.e., label > 0. The rationale is that the negative samples ranked highest by the learned model are the most likely to be mis-ranked. Consequently, we reduce the dataset to such positive and negative examples to let the learning algorithm focus on the most critical and discriminating examples. Moreover, by sampling only negative documents we indirectly reduce the imbalance between the positive and negative classes and reduce the redundancy in the negative class.
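The following is a minimal sketch of the two-step procedure, using LightGBM's LGBMRanker (a LambdaMART implementation) as a stand-in for the λ-Mart implementation used in the paper. The text does not specify whether the fraction of top-ranked negatives is selected per query or globally, so the sketch selects per query; the function name, data layout (NumPy arrays sorted by query id), and parameters are illustrative.

```python
import numpy as np
import lightgbm as lgb

def sample_negatives(X, y, qid, keep_frac=0.10):
    """Keep all positives (label > 0) plus, for each query, the
    keep_frac top-scored negatives (label == 0). X, y, qid are
    NumPy arrays whose rows are sorted by query id."""
    # Step i): train a reference LambdaMART-style model on the full dataset...
    _, group_sizes = np.unique(qid, return_counts=True)
    model = lgb.LGBMRanker(objective="lambdarank", learning_rate=0.05,
                           num_leaves=64, n_estimators=1000)
    model.fit(X, y, group=group_sizes)
    # ...and use it to score every query-document pair of the same dataset.
    scores = model.predict(X)

    # Step ii): keep the top-ranked negatives, i.e., those the model is
    # most likely to mis-rank, together with all the positive examples.
    keep = y > 0
    for q in np.unique(qid):  # simple per-query loop, kept for clarity
        neg = np.flatnonzero((qid == q) & (y == 0))
        k = int(np.ceil(keep_frac * len(neg)))
        keep[neg[np.argsort(-scores[neg])[:k]]] = True
    return X[keep], y[keep], qid[keep], model
```

A new model trained on the reduced (X, y, qid) triple is then compared against the model trained on the full dataset.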
We evaluate the effectiveness of the proposed technique on the publicly available Istella¹ dataset [2]. It is composed of 33,018 queries, and each of its 10,454,629 query-document pairs is represented by means of 220 features. Training and validation data were used to train a reference algorithm, i.e., λ-Mart, a list-wise algorithm capable of using NDCG in its loss function, resulting in a predictor of the ranking [4]. The λ-Mart algorithm was fine-tuned by sweeping its parameters to maximize NDCG@10; the best performance was obtained with a learning rate of 0.05 and 64 leaves.
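For reference, NDCG@k (here with k = 10) is the standard normalized discounted cumulative gain. The exponential-gain formulation commonly used in learning to rank is given below; the paper does not spell out which gain function its implementation adopts.

\[
\mathrm{DCG@}k \;=\; \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)},
\qquad
\mathrm{NDCG@}k \;=\; \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k},
\]

where rel_i is the relevance label of the document ranked at position i and IDCG@k is the DCG@k of the ideal, descending-relevance ordering of the same documents.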
We built several sampled datasets by varying the fraction of negatives kept in {5%, 10%, 20%, 40%}. We trained a λ-Mart model on each of these sampled datasets and evaluated their performance on the full test set. As shown in Figure 1, the best performing model is the one trained on the dataset where only the top 10% of negatives were selected: it outperforms the model trained on the full training set by a large margin. Moreover, when the fraction of documents selected is lowered to 5%, we observe a drop in performance, although in the first iterations (up to 200 trees) results are in line with those of the other models. This behavior suggests that 5% of the negative instances does not fully represent the negative class of query-document pairs; as a consequence, the resulting model is not able to correctly identify them, and this harms the ranking accuracy.

[Figure 1: Effectiveness of λ-Mart models trained on the full dataset vs. sampled ones. (a) Model trained on the full dataset; (b) model trained on the 10% sample.]

The proposed sampling strategy has thus been shown to provide a significant boost to the effectiveness of the ranking model. To understand the reasons behind this gain, we investigate the similarity distribution between pairs of documents in the negative class, where the similarity between a pair of documents is computed by counting the number of exit leaves the two documents have in common when scored by a given ensemble-based ranking model. Figure 2 reports this analysis: the x-axis corresponds to the number of leaves in common (1,000 is the maximum, reached when two documents exit from the same leaves in all the trees), while the y-axis corresponds to the number of pairs with the given similarity. The negative examples in the full dataset were analyzed with both the model trained on the full dataset and the one trained on the 10% sample. According to the model trained on the sampled dataset, the distribution of document pairs is significantly skewed towards dissimilar pairs, i.e., pairs with few common exit leaves. This not only signifies that the initial dataset was redundant in the negative class, but also implies that the model trained on the sampled dataset is much more effective in discriminating among the negative examples, thus resulting in improved ranking accuracy.

[Figure 2: Similarity distribution of negative documents.]
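The exit-leaf similarity can be computed as sketched below, again assuming a LightGBM ensemble as in the earlier sketch: predict(..., pred_leaf=True) returns, for every document, the index of the exit leaf in each tree. Sampling random pairs instead of enumerating all of them is our shortcut for a dataset with millions of negatives.

```python
import numpy as np

def exit_leaf_similarity(model, X_neg, n_pairs=100_000, seed=13):
    """For random pairs of negative documents, count the trees in which
    the two documents exit from the same leaf (the maximum equals the
    ensemble size, e.g., 1,000 trees)."""
    # One exit-leaf index per (document, tree): shape (n_docs, n_trees).
    leaves = model.predict(X_neg, pred_leaf=True)
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_neg), size=n_pairs)
    j = rng.integers(0, len(X_neg), size=n_pairs)
    # Per pair, count the trees where the exit leaves coincide.
    return (leaves[i] == leaves[j]).sum(axis=1)
```

A histogram of the returned counts yields the kind of distribution plotted in Figure 2.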
The preliminary experimental evaluation confirms the validity of the proposed sampling technique. As future work, we intend to investigate a robust and systematic sampling technique that provides benefits independently of the class distribution of the dataset, and where the best sampling ratio is chosen automatically according to the dataset properties. We believe that sampling strategies can improve both the quality of the generated training data and the effectiveness of the learning algorithms.

¹ http://blog.istella.it/istella-learning-to-rank-dataset/

REFERENCES
[1] Javed A. Aslam, Evangelos Kanoulas, Virgil Pavlu, Stefan Savev, and Emine Yilmaz. 2009. Document Selection Methodologies for Efficient and Effective Learning-to-rank. In Proc. ACM SIGIR'09. 468–475.
[2] D. Dato, C. Lucchese, F. M. Nardini, S. Orlando, R. Perego, N. Tonellotto, and R. Venturini. 2016. Fast Ranking with Additive Ensembles of Oblivious and Non-Oblivious Regression Trees. ACM Trans. Inf. Syst. 35, 2, Article 15 (2016).
[3] Evangelos Kanoulas, Stefan Savev, Pavel Metrikov, Virgil Pavlu, and Javed Aslam. 2011. A Large-scale Study of the Effect of Training Set Characteristics over Learning-to-rank Algorithms. In Proc. ACM SIGIR'11. 1243–1244.
[4] Q. Wu, C.J.C. Burges, K.M. Svore, and J. Gao. 2010. Adapting Boosting for Information Retrieval Measures. Information Retrieval (2010).
[5] Emine Yilmaz and Stephen Robertson. 2009. Deep Versus Shallow Judgments in Learning to Rank. In Proc. ACM SIGIR'09. 662–663.