A software library for conducting large scale experiments on Learning to Rank algorithms

Nicola Ferro, Paolo Picello, Gianmaria Silvello
Dept. of Information Engineering, University of Padua, Italy
ferro@dei.unipd.it, paolopicelloit@gmail.com, silvello@dei.unipd.it

LEARNER'17, October 1, 2017, Amsterdam, The Netherlands. Copyright ©2017 for this paper by its authors. Copying permitted for private and academic purposes.

ABSTRACT
This paper presents an efficient application for driving large scale experiments on Learning to Rank (LtR) algorithms. We designed a software library that exploits caching mechanisms and efficient data structures to make the execution of massive experiments on LtR algorithms as fast as possible, in order to try as many combinations of components as possible. The presented software has been tested on different algorithms as well as on different implementations of the same algorithm in different libraries. The software is highly configurable and extensible, in order to enable the seamless addition of new features, algorithms, and libraries.

CCS CONCEPTS
• Information systems → Evaluation of retrieval results; Test collections;

KEYWORDS
learning to rank; component-based evaluation; grid of points

1 INTRODUCTION
LtR is a branch of Information Retrieval (IR) that employs machine learning techniques to improve the effectiveness of IR systems, by taking as input the ranked result list generated by an IR system and producing as output a new re-ranked list of documents [4]. LtR techniques are extremely popular nowadays in IR and they are used by almost all commercial search engines.

Our long term goal is to study how LtR algorithms behave and interact with other components typically present in an IR system, such as stemmers or different IR models. To this end, we will rely on and extend our methodology based on the use of Grid of Points (GoP) and General Linear Mixed Model (GLMM) [1, 2], where a factorial combination of all the components under experimentation is leveraged to estimate the main and interaction effects of the different components as well as their effect size. Therefore, we basically need to test each LtR algorithm with all the combinations of the other IR system components; if you consider that GoPs combining just different stop lists, stemmers, and IR models typically consist of thousands of IR systems [3], you can imagine the explosion in the number of combinations to be tested when you add alternative LtR algorithms on top of them.

Unfortunately, when it comes to testing LtR algorithms, they are usually evaluated in isolation, i.e. outside of a typical IR system pipeline. Indeed, instead of ranked lists of documents, document features, usually in LETOR format [5], are given as input and performance scores are directly produced as output.
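To make this input format concrete, a LETOR file contains one line per query/document pair: a relevance label, the query identifier, and the numbered feature values, optionally followed by a comment with the document identifier. The line below is only illustrative; the feature values are made up.

```
2 qid:351 1:0.0231 2:1.0000 3:0.4500 ... # docid = FBIS4-33167
```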
Even when an LtR library is directly integrated in an IR system, as happens with JForests (https://github.com/jeromepaul/jForest) and Terrier (http://terrier.org/), it is designed to run a single experiment at a time and, if you have to run thousands of experiments as in our case, you have to re-start from scratch each time, while the same query/document pair is typically found in many different runs, as shown in Figure 1. As a consequence, these approaches compute the same document and query features again and again, with a consequent waste of time and resources.

Figure 1: Repetition of query/document pairs within multiple runs.

Figure 2 shows the number of already found query/document pairs, i.e. the number of feature vectors already computed, as the number of considered runs increases. If, instead of recomputing these features each time we encounter them again, we somehow cache and re-use them, we obtain a significant performance improvement. For example, you can note from Figure 2 that, after just 30 runs, we have already computed the features for almost all the possible query/document pairs.

Figure 2: Number of repeated query/document pairs as the number of runs increases.

Therefore, our objective is to build an application that allows us to evaluate LtR algorithm performance in an end-to-end pipeline, configuring different components and evaluating the re-ranking process as a whole, and that optimizes the costs in terms of computational load and execution time (the code is available as open-source at https://bitbucket.org/tesisti-unipd/picello).

The paper is organized as follows: Section 2 describes the proposed solution; Section 3 shows the performance of the proposed solution in terms of execution costs; finally, Section 4 wraps up the discussion and outlines future work.

2 PROPOSED SOLUTION
The application we propose is modular and presents a logical separation between two successive experimental stages:
• Features Extraction
• Learning To Rank Algorithms Execution

The first module, Features Extraction, is responsible for the computation of the features for each query/document pair in the input runs, enabling fast retrieval when these features are required by the Learning To Rank Algorithms Execution module. This second module is responsible for retrieving the desired features from memory, constructing the required LETOR files, and executing the desired LtR algorithm.

This division of the main tasks allows us to reduce the total execution time and to separate the functions in the software. Indeed, the first phase is time consuming, because we need to compute the features for all the considered input runs. Afterwards, we can execute the second module as many times as we want and with different parameters, without re-computing the features. This is where we improve the efficiency of the process.

2.1 Features Extraction
For each query/document pair we calculate the corresponding features only the first time the pair is processed.

We employ a caching mechanism based on a feature matrix, allowing us to store the features of the already computed query/document pairs. In this matrix each row is a document and the columns contain its features. Figure 3 shows this data structure: when a document is found in an input run for the first time, its features are computed and stored in the matrix. The structure is a map where the key is the document identifier and the value is its associated vector of features. When a document is encountered, we check whether its entry is already in the table; if it is, we simply move on to the next document.

Figure 3: The data structure used to store the calculated features. A new row is added every time a new document is processed.

When we need to compute the features for a given query/document pair, first of all we get the relevance judgment from the pool file and we check whether it is the first document we consider for the given query. Then, we compute every other required feature and save the computed values in a byte buffer to avoid unnecessary memory occupation. Once all the features have been computed, we populate the feature matrix.
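The caching mechanism just described can be summarized with a minimal sketch in Java. Class and method names (FeatureCache, getOrCompute, computeFeatures) are our own illustrative assumptions and not the actual API of the library.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Minimal sketch of the feature matrix of Section 2.1: a map whose key is the
 * document identifier and whose value is the document's feature vector.
 */
public class FeatureCache {

    // one row per document, added the first time the document is processed
    private final Map<String, double[]> featureMatrix = new HashMap<>();

    /** Returns the cached vector, computing and storing it only on the first access. */
    public double[] getOrCompute(String docId, String queryId) {
        double[] cached = featureMatrix.get(docId);
        if (cached != null) {
            return cached;                      // document already seen: reuse its features
        }
        double[] features = computeFeatures(docId, queryId);
        featureMatrix.put(docId, features);     // new row added to the matrix
        return features;
    }

    // Placeholder for the actual computation (relevance judgment, term
    // frequencies, TF-IDF, ..., as listed in Table 1); omitted here.
    private double[] computeFeatures(String docId, String queryId) {
        throw new UnsupportedOperationException("feature computation omitted");
    }
}
```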
2.2 Learning To Rank Algorithms Execution
For each run we have several algorithm/library configurations, but all these configurations need the same LETOR files; so, we extract the required features from the previously generated data structure. We parallelize the feature extraction and the LETOR file generation by writing a different LETOR file for each query and then merging these files according to the train/validation/test split specified in the configuration parameters.

We implemented three alternatives for this feature extraction and file generation step: (i) the first one employs no parallelism and each task is processed sequentially; (ii) the second one employs a different thread for each task: one thread is responsible for writing the train file, one for the validation file, and another for the test file; and, (iii) the third one employs a thread pool, which behaves like many different threads while minimizing the overhead due to thread creation. This solution requires a locking mechanism to avoid readers-writers problems when different threads access the same file. An advantage of this last approach is that, if we change the train/validation/test split, we do not need to perform the LETOR extraction phase again, but only the faster merging operation.

In Figure 4 we see the time required for LETOR creation by the three different approaches. In this case the thread pool generates only 4 threads, because it is limited by the architecture of the computer used for testing; if we employ more threads the gain will be even higher. As we can see, the thread pool solution reduces the time by more than 50% w.r.t. the single thread execution.

Figure 4: Time for generating LETOR text files with different approaches.

The execution of this module is repeated for each input run. The steps it follows are: (i) create the LETOR text files according to the desired train/validation/test split; (ii) merge the generated text files; and, (iii) execute the LtR algorithms.
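A rough sketch of the thread pool alternative follows, under the assumption that one task per query writes its own LETOR file and that the files are merged afterwards; all names (LetorGenerator, writeLetorFile, mergeFiles) and the pool size are illustrative, not the actual implementation.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Sketch of the thread-pool alternative of Section 2.2: one LETOR file is
 * written per query, then the files are merged according to the
 * train/validation/test split.
 */
public class LetorGenerator {

    public void generate(List<String> queryIds) throws InterruptedException {
        // Pool size limited to 4 on the test machine (Hyper-Threaded dual core).
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // Each query becomes an independent task writing its own per-query file.
        for (String queryId : queryIds) {
            pool.submit(() -> writeLetorFile(queryId));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Changing the split only requires re-running this (much faster) merge step.
        mergeFiles(queryIds, "train.txt", "vali.txt", "test.txt");
    }

    private void writeLetorFile(String queryId) { /* write <queryId>.letor; omitted */ }

    private void mergeFiles(List<String> queryIds, String train, String vali, String test) {
        /* concatenate per-query files according to the configured split; omitted */
    }
}
```

With this organization, only the merge step has to be repeated when the train/validation/test split changes, which is the advantage noted above for the third alternative.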
3 PERFORMANCE EVALUATION
We conducted the experiments using a MacBook Air (Mid 2013) with a 1.3 GHz Intel Core i5 (3 MB L3 cache, Hyper-Threading up to 4 threads) and 4 GB of DDR3 RAM at 1600 MHz. We employed Terrier v4.1 for extracting the features, and the LtR algorithms reported in Figure 5, where we also report the open-source libraries implementing them.

Figure 5: Summary of the employed algorithms and the tested open-source libraries implementing them.

As we can see, several algorithms like MART or LambdaMART are implemented by all the considered libraries, while others, such as AdaRank or LineSearch, are specific to a single library. We created a single property file where the parameters of the algorithms are specified. Since different implementations of the same algorithm use different nomenclature, we used a map that associates a parameter value with its corresponding parameter in a given library. As an example, let us consider MART, where every library uses a different parameter name to indicate the number of trees: tree for RankLib, num-trees for QuickRank, and boosting.num-trees for JForests. We gathered all these parameters under the same entry in our properties file to simplify the testing phase.
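As an illustration, a unified entry of this kind could look as follows in a Java properties file; the key names are hypothetical, and only the library-specific parameter names (tree, num-trees, boosting.num-trees) come from the actual libraries.

```
# hypothetical unified entry for the number of trees in MART
mart.trees = 500

# library-specific parameter names the same value is mapped to
mart.trees.ranklib   = tree
mart.trees.quickrank = num-trees
mart.trees.jforests  = boosting.num-trees
```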
All the tests have been conducted using the TREC7 corpus (TIPSTER disks 4 and 5, minus CR) and its 50 topics (numbers 351-400). We created a GoP of 1990 runs by using Terrier as detailed in [2]; for each run in this set we performed a re-ranking with all the available LtR algorithms.

3.1 Efficiency
In Figure 6 we see the execution time for about 40 different runs for the same topic. As we could expect from the previous considerations, the execution time decreases as the number of processed runs increases, saturating after about 30 runs. For the first run we have the maximum execution time, since we need to compute the features for all the documents in the run.

Figure 6: Time for extracting features and generating LETOR files for different runs.

In Table 1 we show the main stages involved in features computation and their respective execution times. This example refers to the features computation for document FBIS4-33167 and topic 351.

Action                                  Time
Time to get docid from docno            2.0 ms
Time to compute arrays term frequency   61.0 ms
Time to compute arrays TF-IDF           1.0 ms
Time to compute other features          1.0 ms
Time to write whole byte array          12.3 ms
Total time for features computation     68.0 ms
Table 1: Example of execution times for features computation.

Finally, in Figure 7 we see the total execution time of the Features Extraction phase for the considered topics. There is an average computation time of about 2,000 seconds (33 minutes) per topic. The total execution time for this test was about 99,683 seconds (27.68 hours). We recall from above that this computation has to be performed only once for a given set of runs, while the second phase, where the LtR algorithms are executed, can be repeated many times.

Figure 7: Time for generating LETOR files for different topics. On the x-axis there are topic numbers and on the y-axis the execution time expressed in seconds.

For what concerns the LtR algorithms execution module, the execution time depends on the LtR library. As an example, in Table 2 we report the execution times of the LtR algorithms when employing their default parameter settings.

Algorithm                             Library     Time (s)
MART                                  RankLib     10.29
                                      QuickRank   6.32
                                      JForests    9.54
LambdaMART                            RankLib     37.34
                                      QuickRank   13.81
                                      JForests    11.16
ListNet                               RankLib     67.27
RankBoost                             RankLib     353.79
                                      QuickRank   208.01
Gradient Boosting Binary Classifier   JForests    10.38
RankNet                               RankLib     504.80
Coordinate Ascent                     RankLib     201.42
                                      QuickRank   11.21
AdaRank                               RankLib     69.20
Random Forest                         RankLib     69.78
LineSearch                            QuickRank   12.17
Table 2: Example of execution times of the tested LtR algorithms with default parameters.

3.2 Effectiveness
In this subsection we give an initial glance at the effectiveness of the tested LtR algorithms and we point out where different implementations of the same algorithm lead to different performances; we leave a deeper and more extensive analysis for future work.

In Table 3 we report the MAP and the precision at cutoffs five and ten for a given run re-ranked with the different LtR algorithms. The base run employs the DFIZ ranking model, the standard Terrier stopword list and an 8-grams lexical unit generator.

Algorithm                             Library     MAP     P@5     P@10
Original                              -           0.1387  0.5167  0.4583
MART                                  RankLib     0.1477  0.4333  0.4083
                                      QuickRank   0.1153  0.3500  0.3333
                                      JForests    0.1288  0.3667  0.3417
LambdaMART                            RankLib     0.1388  0.4167  0.3500
                                      QuickRank   0.1323  0.4167  0.4083
                                      JForests    0.1363  0.4500  0.3833
ListNet                               RankLib     0.0443  0.0833  0.1167
RankBoost                             RankLib     0.1299  0.4167  0.3583
                                      QuickRank   0.1449  0.4833  0.4333
Gradient Boosting Binary Classifier   JForests    0.1329  0.4000  0.3333
RankNet                               RankLib     0.0444  0.1167  0.1000
Coordinate Ascent                     RankLib     0.1412  0.4667  0.4083
                                      QuickRank   0.0306  0.0167  0.0083
AdaRank                               RankLib     0.1564  0.5000  0.4500
Random Forest                         RankLib     0.1488  0.4333  0.3750
LineSearch                            QuickRank   0.0331  0.0167  0.0083
Table 3: MAP, P@5, and P@10 for the different algorithms.

As we can see, ListNet, LineSearch, RankNet, and the QuickRank implementation of Coordinate Ascent give the lowest results, suggesting that some improvements are in order or that they need a thorough parameter tuning phase. In this setting and with default parameters, AdaRank gives the best results in terms of MAP.

We analyzed the performance of the same algorithm as implemented by different libraries in terms of DCG values, to understand whether there are differences between the implementations. Again, these are preliminary tests and we report the analysis for a single topic. In particular, we present the results we obtained for LambdaMART, RankBoost, and Coordinate Ascent.
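For reference, the sketch below computes DCG at a rank cutoff using the common formulation DCG@k = sum over i = 1..k of rel_i / log2(i + 1); the paper does not state which DCG variant its figures use, so this is only an illustrative aid for reading Figures 8-10.

```java
/**
 * Minimal sketch of Discounted Cumulated Gain at a rank cutoff k.
 */
public final class Dcg {

    /** gains[i] is the graded relevance of the document at rank i + 1. */
    public static double dcgAt(int k, int[] gains) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, gains.length); i++) {
            dcg += gains[i] / (Math.log(i + 2) / Math.log(2)); // discount by log2(rank + 1)
        }
        return dcg;
    }

    public static void main(String[] args) {
        // toy ranked list: graded relevance of the top 5 documents
        int[] gains = {2, 0, 1, 1, 0};
        System.out.println("DCG@5 = " + dcgAt(5, gains));
    }
}
```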
Figure 8 shows the DCG of the LambdaMART algorithm for topic 390. As we can see, all the implementations have similar performances. In the case of RankLib's implementation we observe some small improvements with respect to the original run; this behaviour is the same for most of the topics.

Figure 8: DCG for LambdaMART runs for topic 390.

RankBoost is implemented by both QuickRank and RankLib; in Figure 9 we show the DCG curves for topic 390, where we also report the original run. We can see how QuickRank slightly outperforms both RankLib and the original run.

Figure 9: DCG for RankBoost runs for topic 390.

In Figure 10 we analyze Coordinate Ascent, where RankLib performs similarly to the original run, while QuickRank is slightly worse.

Figure 10: DCG for Coordinate Ascent runs for topic 390.

4 FINAL REMARKS
In this paper we described a software library that enables us to run large-scale experiments over many LtR algorithms. We designed a library that, by separating the execution into two different modules, avoids repeating unnecessary computations. This allows us to efficiently run batch experiments for studying how different parameters affect the results of the models and how the results differ for the same algorithms implemented by different libraries.

As future work, one of the first improvements is to employ the Hadoop MapReduce implementation of Terrier to index large document collections in a distributed way [6].

REFERENCES
[1] N. Ferro and D. Harman. 2010. CLEF 2009: Grid@CLEF Pilot Track Overview. In Multilingual Information Access Evaluation Vol. I Text Retrieval Experiments – Tenth Workshop of the Cross-Language Evaluation Forum (CLEF 2009). Revised Selected Papers, C. Peters, G. M. Di Nunzio, M. Kurimo, T. Mandl, D. Mostefa, A. Peñas, and G. Roda (Eds.). Lecture Notes in Computer Science (LNCS) 6241, Springer, Heidelberg, Germany, 552–565.
[2] N. Ferro and G. Silvello. 2016. A General Linear Mixed Models Approach to Study System Component Effects. In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2016), R. Perego, F. Sebastiani, J. Aslam, I. Ruthven, and J. Zobel (Eds.). ACM Press, New York, USA, 25–34.
[3] N. Ferro and G. Silvello. 2017. Towards an Anatomy of IR System Component Performances. Journal of the American Society for Information Science and Technology (JASIST) (2017).
[4] T.-Y. Liu. 2009. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval (FnTIR) 3, 3 (March 2009), 225–331. http://dx.doi.org/10.1561/1500000016
[5] T.-Y. Liu, J. Xu, T. Qin, W. Xiong, and H. Li. 2007. LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval. In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, T. Joachims, H. Li, T.-Y. Liu, and C. Zhai (Eds.).
[6] R. McCreadie, C. Macdonald, and I. Ounis. 2012. MapReduce Indexing Strategies: Studying Scalability and Efficiency. Information Processing & Management 48, 5 (September 2012), 873–888.