A software library for conducting large scale experiments on Learning to Rank algorithms

Nicola Ferro, Paolo Picello, Gianmaria Silvello
Dept. of Information Engineering, University of Padua, Italy
ferro@dei.unipd.it, paolopicelloit@gmail.com, silvello@dei.unipd.it

LEARNER'17, October 1, 2017, Amsterdam, The Netherlands. Copyright ©2017 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
This paper presents an efficient application for driving large scale experiments on Learning to Rank (LtR) algorithms. We designed a software library that exploits caching mechanisms and efficient data structures to make the execution of massive experiments on LtR algorithms as fast as possible, in order to try as many combinations of components as possible. The presented software has been tested on different algorithms as well as on different implementations of the same algorithm in different libraries. The software is highly configurable and extensible, in order to enable the seamless addition of new features, algorithms, and libraries.

CCS CONCEPTS
• Information systems → Evaluation of retrieval results; Test collections;

KEYWORDS
learning to rank; component-based evaluation; grid of points

1 INTRODUCTION
LtR is a branch of Information Retrieval (IR) that employs machine learning techniques to improve the effectiveness of IR systems, by taking as input the ranked result list generated by an IR system and producing as output a new re-ranked list of documents [4]. LtR techniques are extremely popular nowadays in IR and they are used by almost all commercial search engines.

Our long term goal is to study how LtR algorithms behave and interact with other components typically present in an IR system, such as stemmers or different IR models. To this end, we will rely on and extend our methodology based on the use of Grid of Points (GoP) and General Linear Mixed Model (GLMM) [1, 2], where a factorial combination of all the components under experimentation is leveraged to estimate the main and interaction effects of the different components as well as their effect size. Therefore, we basically need to test each LtR algorithm with all the combinations of the other IR system components; if you consider that GoPs combining just different stop lists, stemmers, and IR models typically consist of thousands of IR systems [3], you can imagine the explosion in the number of combinations to be tested when you add alternative LtR algorithms on top of them.

Unfortunately, when it comes to testing LtR algorithms, they are usually evaluated in isolation, i.e. outside of a typical IR system pipeline. Indeed, instead of ranked lists of documents, document features, usually in LETOR format [5], are given as input and performance scores are directly produced as output.
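To make this input format concrete, a LETOR file contains one line per query/document pair: a relevance label, the query identifier, and the numbered feature values, optionally followed by a comment with the document identifier. The line below is only illustrative; the feature values are made up.

```
2 qid:351 1:0.0231 2:1.0000 3:0.4500 ... # docid = FBIS4-33167
```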
Even when an LtR library is directly integrated in an IR system, as happens with JForests (https://github.com/jeromepaul/jForest) and Terrier (http://terrier.org/), it is designed to run a single experiment at a time and, if you have to run thousands of experiments as in our case, you have to re-start from scratch each time, while the same query/document pair is typically found in many different runs, as shown in Figure 1. As a consequence, these approaches compute the same document and query features again and again, with a consequent waste of time and resources.

Figure 1: Repetition of query/document pairs within multiple runs.

Figure 2 shows the number of already found query/document pairs, i.e. the number of feature vectors already computed, as the number of considered runs increases. If, instead of recomputing these features each time we encounter them again, we somehow cache and re-use them, we obtain a significant performance improvement. For example, you can note from Figure 2 that, after just 30 runs, we have already computed the features for almost all the possible query/document pairs.

Figure 2: Number of repeated query/document pairs as the number of runs increases.

Therefore, our objective is to build an application that allows us to evaluate LtR algorithm performance in an end-to-end pipeline, configuring different components and evaluating the re-ranking process as a whole, and that optimizes the costs in terms of computational load and execution time (the code is available as open-source at https://bitbucket.org/tesisti-unipd/picello).

The paper is organized as follows: Section 2 describes the proposed solution; Section 3 shows the performance of the proposed solution in terms of execution costs; finally, Section 4 wraps up the discussion and outlines future work.

2 PROPOSED SOLUTION
The application we propose is modular and presents a logical separation between two successive experimental stages:
• Features Extraction
• Learning To Rank Algorithms Execution

The first module, Features Extraction, is responsible for the computation of the features for each query/document pair in the input runs, enabling fast retrieval when these features are required by the Learning To Rank Algorithms Execution module. This second module is responsible for retrieving the desired features from memory, constructing the required LETOR files, and executing the desired LtR algorithm.

This division of the main tasks allows us to reduce the total execution time and to separate the functions in the software. Indeed, the first phase is time consuming, because we need to compute the features for all the considered input runs. Afterwards, we can execute the second module as many times as we want and with different parameters, without re-computing the features. This is where we improve the efficiency of the process.

2.1 Features Extraction
For each query/document pair we calculate the corresponding features only the first time the pair is processed.

We employ a caching mechanism based on a feature matrix, allowing us to store the features of the already computed query/document pairs. In this matrix each row is a document and the columns contain its features. Figure 3 shows this data structure: when a document is found in an input run for the first time, its features are computed and stored in the matrix. The structure is a map where the key is the document identifier and the value is its associated vector of features. When a document is encountered, we check whether its entry is already in the table; if it is, we simply move on to the next document.

Figure 3: The data structure used to store the calculated features. A new row is added every time a new document is processed.

When we need to compute the features for a given query/document pair, first of all we get the relevance judgment from the pool file and we check whether it is the first document we consider for the given query. Then, we compute every other required feature and save the computed values in a byte buffer to avoid unnecessary memory occupation. Once all the features have been computed, we populate the feature matrix.
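The caching mechanism just described can be summarized with a minimal sketch in Java. Class and method names (FeatureCache, getOrCompute, computeFeatures) are our own illustrative assumptions and not the actual API of the library.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of the feature matrix of Section 2.1: a map whose key is the
 * document identifier and whose value is the document's feature vector.
 */
public class FeatureCache {

    // one row per document, added the first time the document is processed
    private final Map<String, double[]> featureMatrix = new HashMap<>();

    /** Returns the cached vector, computing and storing it only on the first access. */
    public double[] getOrCompute(String docId, String queryId) {
        double[] cached = featureMatrix.get(docId);
        if (cached != null) {
            return cached;                      // document already seen: reuse its features
        }
        double[] features = computeFeatures(docId, queryId);
        featureMatrix.put(docId, features);     // new row added to the matrix
        return features;
    }

    // Placeholder for the actual computation (relevance judgment, term
    // frequencies, TF-IDF, ..., as listed in Table 1); omitted here.
    private double[] computeFeatures(String docId, String queryId) {
        throw new UnsupportedOperationException("feature computation omitted");
    }
}
```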
2.2 Learning To Rank Algorithms Execution
For each run we have several algorithm/library configurations, but all these configurations need the same LETOR files; so, we extract the required features from the previously generated data structure. We parallelize the feature extraction and the LETOR file generation by writing a different LETOR file for each query and then merging these files according to the train/validation/test split specified in the configuration parameters.

We implemented three alternatives for this feature extraction and file generation step: (i) the first one employs no parallelism and each task is processed sequentially; (ii) the second one employs a different thread for each task: one thread is responsible for writing the train file, one for the validation file, and another for the test file; and, (iii) the third one employs a thread pool, which behaves like many different threads while minimizing the overhead due to thread creation. This solution requires a locking mechanism to avoid readers-writers problems when different threads access the same file. An advantage of this last approach is that, if we change the train/validation/test split, we do not need to perform the LETOR extraction phase again, but only the faster merging operation.

In Figure 4 we see the time required for LETOR creation by the three different approaches. In this case the thread pool generates only 4 threads, because it is limited by the architecture of the computer used for testing; if we employ more threads the gain will be even higher. As we can see, the thread pool solution reduces the time by more than 50% w.r.t. the single thread execution.

Figure 4: Time for generating LETOR text files with different approaches.

The execution of this module is repeated for each input run. The steps it follows are: (i) create the LETOR text files according to the desired train/validation/test split; (ii) merge the generated text files; and, (iii) execute the LtR algorithms.
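A rough sketch of the thread pool alternative follows, under the assumption that one task per query writes its own LETOR file and that the files are merged afterwards; all names (LetorGenerator, writeLetorFile, mergeFiles) and the pool size are illustrative, not the actual implementation.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the thread-pool alternative of Section 2.2: one LETOR file is
 * written per query, then the files are merged according to the
 * train/validation/test split.
 */
public class LetorGenerator {

    public void generate(List<String> queryIds) throws InterruptedException {
        // Pool size limited to 4 on the test machine (Hyper-Threaded dual core).
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Each query becomes an independent task writing its own per-query file.
        for (String queryId : queryIds) {
            pool.submit(() -> writeLetorFile(queryId));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Changing the split only requires re-running this (much faster) merge step.
        mergeFiles(queryIds, "train.txt", "vali.txt", "test.txt");
    }

    private void writeLetorFile(String queryId) { /* write <queryId>.letor; omitted */ }

    private void mergeFiles(List<String> queryIds, String train, String vali, String test) {
        /* concatenate per-query files according to the configured split; omitted */
    }
}
```

With this organization, only the merge step has to be repeated when the train/validation/test split changes, which is the advantage noted above for the third alternative.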
3 PERFORMANCE EVALUATION
We conducted the experiments using a MacBook Air (Mid 2013) with a 1.3 GHz Intel Core i5 (3 MB L3 cache, Hyper-Threading up to 4 threads) and 4 GB of DDR3 RAM at 1600 MHz. We employed Terrier v4.1 for extracting the features, and the LtR algorithms reported in Figure 5, where we also report the open-source libraries implementing them.

Figure 5: Summary of the employed algorithms and the tested open-source libraries implementing them.

As we can see, several algorithms like MART or LambdaMART are implemented by all the considered libraries, while others, such as AdaRank or LineSearch, are specific to a single library. We created a single property file where the parameters of the algorithms are specified. Since different implementations of the same algorithm use different nomenclature, we used a map that associates a parameter value with its corresponding parameter in a given library. As an example, let us consider MART, where every library uses a different parameter name to indicate the number of trees: tree for RankLib, num-trees for QuickRank, and boosting.num-trees for JForests. We gathered all these parameters under the same entry in our properties file to simplify the testing phase.
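As an illustration, a unified entry of this kind could look as follows in a Java properties file; the key names are hypothetical, and only the library-specific parameter names (tree, num-trees, boosting.num-trees) come from the actual libraries.

```
# hypothetical unified entry for the number of trees in MART
mart.trees = 500

# library-specific parameter names the same value is mapped to
mart.trees.ranklib   = tree
mart.trees.quickrank = num-trees
mart.trees.jforests  = boosting.num-trees
```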
All the tests have been conducted using the TREC7 corpus (TIPSTER disks 4 and 5, minus CR) and its 50 topics (numbers 351-400). We created a GoP of 1990 runs by using Terrier as detailed in [2]; for each run in this set we performed a re-ranking with all the available LtR algorithms.

3.1 Efficiency
In Figure 6 we see the execution time for about 40 different runs for the same topic. As we could expect from the previous considerations, the execution time decreases as the number of processed runs increases, saturating after about 30 runs. For the first run we have the maximum execution time, since we need to compute the features for all the documents in the run.

Figure 6: Time for extracting features and generating LETOR files for different runs.

In Table 1 we show the main stages involved in features computation and their respective execution times. This example refers to the features computation for document FBIS4-33167 and topic 351.

Action                                  Time
Time to get docid from docno            2.0 ms
Time to compute arrays term frequency   61.0 ms
Time to compute arrays TF-IDF           1.0 ms
Time to compute other features          1.0 ms
Time to write whole byte array          12.3 ms
Total time for features computation     68.0 ms
Table 1: Example of execution times for features computation.

Finally, in Figure 7 we see the total execution time of the Features Extraction phase for the considered topics. There is an average computation time of about 2,000 seconds (33 minutes) per topic. The total execution time for this test was about 99,683 seconds (27.68 hours). We recall from above that this computation has to be performed only once for a given set of runs, while the second phase, where the LtR algorithms are executed, can be repeated many times.

Figure 7: Time for generating LETOR files for different topics. On the x-axis there are topic numbers and on the y-axis the execution time expressed in seconds.

For what concerns the LtR algorithms execution module, the execution time depends on the LtR library. As an example, in Table 2 we report the execution times of the LtR algorithms when employing their default parameter settings.

Algorithm                             Library     Time (s)
MART                                  RankLib     10.29
                                      QuickRank   6.32
                                      JForests    9.54
LambdaMART                            RankLib     37.34
                                      QuickRank   13.81
                                      JForests    11.16
ListNet                               RankLib     67.27
RankBoost                             RankLib     353.79
                                      QuickRank   208.01
Gradient Boosting Binary Classifier   JForests    10.38
RankNet                               RankLib     504.80
Coordinate Ascent                     RankLib     201.42
                                      QuickRank   11.21
AdaRank                               RankLib     69.20
Random Forest                         RankLib     69.78
LineSearch                            QuickRank   12.17
Table 2: Example of execution times of the tested LtR algorithms with default parameters.

3.2 Effectiveness
In this subsection we give an initial glance at the effectiveness of the tested LtR algorithms and we point out where different implementations of the same algorithm lead to different performances; we leave a deeper and more extensive analysis for future work.

In Table 3 we report the MAP and the precision at cutoffs five and ten for a given run re-ranked with the different LtR algorithms. The base run employs the DFIZ ranking model, the standard Terrier stopword list and an 8-grams lexical unit generator.

Algorithm                             Library     MAP     P@5     P@10
Original                              -           0.1387  0.5167  0.4583
MART                                  RankLib     0.1477  0.4333  0.4083
                                      QuickRank   0.1153  0.3500  0.3333
                                      JForests    0.1288  0.3667  0.3417
LambdaMART                            RankLib     0.1388  0.4167  0.3500
                                      QuickRank   0.1323  0.4167  0.4083
                                      JForests    0.1363  0.4500  0.3833
ListNet                               RankLib     0.0443  0.0833  0.1167
RankBoost                             RankLib     0.1299  0.4167  0.3583
                                      QuickRank   0.1449  0.4833  0.4333
Gradient Boosting Binary Classifier   JForests    0.1329  0.4000  0.3333
RankNet                               RankLib     0.0444  0.1167  0.1000
Coordinate Ascent                     RankLib     0.1412  0.4667  0.4083
                                      QuickRank   0.0306  0.0167  0.0083
AdaRank                               RankLib     0.1564  0.5000  0.4500
Random Forest                         RankLib     0.1488  0.4333  0.3750
LineSearch                            QuickRank   0.0331  0.0167  0.0083
Table 3: MAP, P@5, and P@10 for the different algorithms.

As we can see, ListNet, LineSearch, RankNet, and the QuickRank implementation of Coordinate Ascent give the lowest results, suggesting that some improvements are in order or that they need a thorough parameter tuning phase. In this setting and with default parameters, AdaRank gives the best results in terms of MAP.

We analyzed the performance of the same algorithm as implemented by different libraries in terms of DCG values, to understand whether there are differences between the implementations. Again, these are preliminary tests and we report the analysis for a single topic. In particular, we present the results we obtained for LambdaMART, RankBoost, and Coordinate Ascent.
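For reference, the sketch below computes DCG at a rank cutoff using the common formulation DCG@k = sum over i = 1..k of rel_i / log2(i + 1); the paper does not state which DCG variant its figures use, so this is only an illustrative aid for reading Figures 8-10.

```java
/**
 * Minimal sketch of Discounted Cumulated Gain at a rank cutoff k.
 */
public final class Dcg {

    /** gains[i] is the graded relevance of the document at rank i + 1. */
    public static double dcgAt(int k, int[] gains) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, gains.length); i++) {
            dcg += gains[i] / (Math.log(i + 2) / Math.log(2)); // discount by log2(rank + 1)
        }
        return dcg;
    }

    public static void main(String[] args) {
        // toy ranked list: graded relevance of the top 5 documents
        int[] gains = {2, 0, 1, 1, 0};
        System.out.println("DCG@5 = " + dcgAt(5, gains));
    }
}
```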
Figure 8 shows the DCG of the LambdaMART algorithm for topic 390. As we can see, all the implementations have similar performances. In the case of RankLib's implementation we observe some small improvements with respect to the original run; this behaviour is the same for most of the topics.

Figure 8: DCG for LambdaMART runs for topic 390.

RankBoost is implemented by both QuickRank and RankLib; in Figure 9 we show the DCG curves for topic 390, where we also report the original run. We can see how QuickRank slightly outperforms both RankLib and the original run.

Figure 9: DCG for RankBoost runs for topic 390.

In Figure 10 we analyze Coordinate Ascent, where RankLib performs similarly to the original run, while QuickRank is slightly worse.

Figure 10: DCG for Coordinate Ascent runs for topic 390.

4 FINAL REMARKS
In this paper we described a software library that enables us to run large-scale experiments over many LtR algorithms. We designed a library that, by separating the execution into two different modules, avoids repeating unnecessary computations. This allows us to efficiently run batch experiments for studying how different parameters affect the results of the models and how the results differ for the same algorithms implemented by different libraries.

As future work, one of the first improvements is to employ the Hadoop MapReduce implementation of Terrier to index large document collections in a distributed way [6].

REFERENCES
[1] N. Ferro and D. Harman. 2010. CLEF 2009: Grid@CLEF Pilot Track Overview. In Multilingual Information Access Evaluation Vol. I Text Retrieval Experiments – Tenth Workshop of the Cross-Language Evaluation Forum (CLEF 2009). Revised Selected Papers, C. Peters, G. M. Di Nunzio, M. Kurimo, T. Mandl, D. Mostefa, A. Peñas, and G. Roda (Eds.). Lecture Notes in Computer Science (LNCS) 6241, Springer, Heidelberg, Germany, 552–565.
[2] N. Ferro and G. Silvello. 2016. A General Linear Mixed Models Approach to Study System Component Effects. In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), R. Perego, F. Sebastiani, J. Aslam, I. Ruthven, and J. Zobel (Eds.). ACM Press, New York, USA, 25–34.
[3] N. Ferro and G. Silvello. 2017. Towards an Anatomy of IR System Component Performances. Journal of the American Society for Information Science and Technology (JASIST) (2017).
[4] T.-Y. Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval (FnTIR) 3, 3 (March 2009), 225–331. http://dx.doi.org/10.1561/1500000016
[5] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. 2007. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, T. Joachims, H. Li, T.-Y. Liu, and C. Zhai (Eds.).
[6] R. McCreadie, C. Macdonald, and I. Ounis. 2012. MapReduce Indexing Strategies: Studying Scalability and Efficiency. Information Processing & Management 48, 5 (September 2012), 873–888.