<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A soware library for conducting large scale experiments on Learning to Rank algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Picello</string-name>
          <email>paolopicelloit@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1</volume>
      <issue>2017</issue>
      <abstract>
        <p>is paper presents an ecient application for driving large scale experiments on Learning to Rank (LtR) algorithms. We designed a soware library that exploits caching mechanisms and ecient data structures to make the execution of massime experiments on LtR algorithms as fast as possible in order to try as many combinations of components as possible. is presented soware has been tested on dierent algorithms as well as on dierent implementations of the same algorithm in dierent libraries. is soware is highly congurable and extensible in order to enable the seamless addition of new features, algorithms, and libraries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        LtR is a branch of Information Retrieval (IR) that employs machine
learning techniques to improve the effectiveness of IR systems, by
taking as input the ranked result list generated by an IR system
and producing as output a new re-ranked list of documents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. LtR
techniques are extremely popular nowadays in IR and they are used
by almost all commercial search engines.
      </p>
      <p>
        Our long term goal is to study how LtR algorithms behave and
interact with other components typically present in an IR system,
such as stemmers or different IR models. To this end, we will rely on
and extend our methodology based on the use of Grid of Points (GoP)
and General Linear Mixed Models (GLMM) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], where a factorial
combination of all the components under experimentation is
leveraged to estimate the main and interaction effects of the different
components as well as their effect size. Therefore, we will basically
need to test each LtR algorithm with all the combinations of the
other IR system components; if you consider that GoPs
combining just different stop lists, stemmers, and IR models typically
consist of thousands of IR systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], you can imagine the explosion in
the number of combinations to be tested when you add alternative
LtR algorithms on top of them.
      </p>
      <p>
        Unfortunately, when it comes to testing LtR algorithms, they are
usually evaluated in isolation, i.e. outside of a typical IR system
pipeline. Indeed, instead of ranked lists of documents, document
features, usually in LETOR format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], are given as input and
performance scores are directly produced as output.
      </p>
      <p>Even when an LtR library is directly integrated in an IR system,
as it happens with JForest (https://github.com/jeromepaul/jForest) and
Terrier (http://terrier.org/), they are designed to run a single
experiment at a time and, if you have to run thousands of experiments
as in our case, you have to re-start from scratch each time, while the
same query/document pair is typically found by many different runs, as
shown in Figure 1. As a consequence, these approaches compute the same
document and query features again and again, with a consequent waste of
time and resources.</p>
      <p>Figure 2 shows the number of already found query/document
pairs, i.e. the number of feature vectors already computed, as the
number of considered runs increases. If, instead of recomputing
these features each time we encounter them again, we somehow
cache and re-use them, we obtain a significant performance
improvement. For example, you can note from Figure 2 that, after
just 30 runs, we have already computed the features for almost all
the possible query/document pairs.</p>
      <p>erefore, our objective is to build an application that allows us
to evaluate LtR algorithms performance in an end-to-end pipeline,
conguring dierent components and evaluating the re-ranking
process as a whole, and that optimizes the costs, in terms of
computational load and execution times3.</p>
      <p>e paper is organized as follows: Section 2 describes the
proposed solution; Section 3 shows the performance of the proposed
solution in terms of execution costs; nally, Section 4 wraps up the
discussion and outlooks future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED SOLUTION</title>
      <p>e application we propose is modular and presents a logical
separation between two successive experimental stages:</p>
      <p>Features Extraction</p>
      <p>Learning To Rank Algorithms Execution
e rst module, Features Extraction, is responsible of the
computation of the features for each query/document pair in the input
runs enabling fast retrieval when these features are required by the
Learning To Rank Algorithms Execution module. is second
module is responsible for retrieving the desired features from memory,
constructing the required LETOR les, and executing the desired
LtR algorithm.</p>
      <p>is division of the main tasks allows us to drop down the total
execution time and facilitate the separation of the functions in
the soware. Indeed, the rst phase is time consuming because
we need to compute the features for all the considered input runs.
Aerwards we execute the second module as many times as we
want and with dierent parameters, without re-computing the
features. is is where we improve the eciency of the process.
and save the computed values in a byte buer to avoid unnecessary
memory occupation. Once all the features have been computed we
proceed by populating the feature matrix.
2.2</p>
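      <p>The following is a minimal sketch, in Java, of such a feature
matrix; the class and method names (FeatureMatrix, computeFeatures) are
illustrative assumptions and not the actual API of the library.</p>
      <preformat><![CDATA[
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the features matrix: a <key,value> map where the key is
// the document identifier and the value is its feature vector, serialized
// into a byte array to keep the memory occupation low.
public class FeatureMatrix {

    private final Map<String, byte[]> rows = new ConcurrentHashMap<>();

    // Returns the features of a query/document pair; they are computed only
    // the first time the document is processed, then served from the cache.
    public float[] features(String docId, String queryId) {
        return decode(rows.computeIfAbsent(docId, id -> encode(computeFeatures(id, queryId))));
    }

    private byte[] encode(float[] features) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * features.length);
        for (float f : features) {
            buffer.putFloat(f);
        }
        return buffer.array();
    }

    private float[] decode(byte[] row) {
        ByteBuffer buffer = ByteBuffer.wrap(row);
        float[] features = new float[row.length / Float.BYTES];
        for (int i = 0; i < features.length; i++) {
            features[i] = buffer.getFloat();
        }
        return features;
    }

    // Placeholder for the actual computation (term frequencies, TF-IDF, and
    // the other features obtained through Terrier).
    private float[] computeFeatures(String docId, String queryId) {
        throw new UnsupportedOperationException("hook Terrier feature extraction here");
    }
}
]]></preformat>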
    </sec>
    <sec id="sec-3">
      <title>Learning To Rank Algorithms Execution</title>
      <p>For each run we have several algorithm/library configurations,
but for all these configurations we need the same LETOR files; so, we
extract the required features from the previously generated data
structure. We parallelize the feature extraction process and the LETOR
file generation by writing one different LETOR file for each query and
then merging these files according to the train/validation/test split
specified in the configuration parameters.</p>
      <p>We realized three dierent feature extraction alternatives: (i)
the rst one employs no parallelism and each task is processed
sequentially; (ii) the second one employs a dierent thread for each
dierent task, one thread is responsible for writing train le, one
for the validation le and another for the test le; and, (iii) the third
one employs a read Pool, that works like many dierent threads,
but minimizing the overhead due to thread creation. is solution
requires a locking mechanism to avoid readers-writers problems
when dierent threads access the same le. An advantage of this
last approach is that if we change the train/validation/test split we
do not need to perform the LETOR extraction phase again, but we
only need to perform the faster merging operation.</p>
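      <p>As a sketch of the Thread Pool alternative, under the assumption
of illustrative names (LetorFiles; the actual library may organize this
differently), one can write one LETOR file per query from a fixed-size
pool and keep the merge step synchronized:</p>
      <preformat><![CDATA[
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the third alternative: a fixed-size Thread Pool writes one LETOR
// file per query; the per-query files are then merged according to the
// train/validation/test split.
public class LetorFiles {

    // A pool behaves like many different threads, but without paying the
    // thread-creation overhead for every task.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Phase 1: one LETOR file per query, written in parallel.
    public void writePerQueryFiles(List<String> queryIds) throws InterruptedException {
        for (String qid : queryIds) {
            pool.submit(() -> writeQueryFile(qid));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void writeQueryFile(String qid) {
        // One line per query/document pair, in LETOR format:
        // <label> qid:<qid> 1:<f1> 2:<f2> ... # <docno>
        // with the feature values served by the feature matrix.
    }

    // Phase 2: merge the per-query files according to the chosen split.
    // Synchronized, so that concurrent callers never append to the same
    // target file at the same time (the readers-writers problem above).
    public synchronized void merge(List<String> queryIds, Path target) {
        for (String qid : queryIds) {
            try {
                Files.write(target, Files.readAllBytes(Paths.get(qid + ".letor")),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
]]></preformat>
      <p>With this organization, changing the split only means calling the
merge again with different query partitions, without repeating the
extraction.</p>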
      <p>Figure 4 reports the time required for the LETOR creation by the
three different approaches. In this case the Thread Pool generates only
4 threads, because it was limited by the architecture of the computer
used for testing; with more threads the gain would be even higher. As
we can see, the Thread Pool solution reduces the time by more than 50%
w.r.t. the single thread execution.</p>
      <p>e execution of this module is repeated for each input run. e
steps it follows are: (i) to create the LETOR text les according to
the desired train/validation/test split; (ii) to merge the generated
text les; and, (iii) execute the LTR algorithms.
3</p>
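      <p>A possible driver for a single run, reusing the LetorFiles sketch
above, could look as follows; runLtrAlgorithm stands for the invocation
of a concrete library and is purely illustrative.</p>
      <preformat><![CDATA[
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative glue for the three steps above; LetorFiles is the sketch
// given earlier and runLtrAlgorithm stands for invoking a concrete LtR
// library (RankLib, QuickRank, or JForests).
public class LtrExecution {

    public static void executeForRun(List<String> trainQids,
                                     List<String> validationQids,
                                     List<String> testQids) throws InterruptedException {
        LetorFiles letor = new LetorFiles();

        // (i) create one LETOR text file per query
        List<String> all = new ArrayList<>(trainQids);
        all.addAll(validationQids);
        all.addAll(testQids);
        letor.writePerQueryFiles(all);

        // (ii) merge according to the train/validation/test split
        letor.merge(trainQids, Paths.get("train.letor"));
        letor.merge(validationQids, Paths.get("validation.letor"));
        letor.merge(testQids, Paths.get("test.letor"));

        // (iii) execute the LtR algorithm on the merged files
        runLtrAlgorithm(Paths.get("train.letor"),
                Paths.get("validation.letor"), Paths.get("test.letor"));
    }

    private static void runLtrAlgorithm(Path train, Path validation, Path test) {
        // e.g. spawn the chosen library with the mapped parameter names
    }
}
]]></preformat>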
    </sec>
    <sec id="sec-4">
      <title>PERFORMANCE EVALUATION</title>
      <p>We conducted the experiments by using a MacBook Air (Mid 2013)
with a 1.3 GHz Intel Core i5, 3 MB L3 cache, Hyper-Threading (up to 4
threads), and 4 GB of DDR3 RAM at 1600 MHz. We employed Terrier v4.1
for extracting the features and the LtR algorithms reported in Figure
5, where we also report the open-source libraries implementing these
algorithms.</p>
      <p>As we can see, several algorithms like MART or LambdaMART are
implemented by all the considered libraries, while others, such as
AdaRank or LineSearch, are specific to a single library. We created a
single property file where the parameters of the algorithms are
specified. Since different implementations of the same algorithm use a
different nomenclature, we used a map that associates a parameter value
with its corresponding parameter in a given library (a sketch of this
map is given below). As an example, let us consider MART, where all the
libraries have a different parameter name to indicate the number of
trees: tree for RankLib, num-trees for QuickRank, and
boosting.num-trees for JForests. We gathered all these parameters under
the same entry in our properties file to simplify the testing phase.</p>
      <p>All the tests have been conducted by using the TREC7 corpus
(TIPSTER disks 4 and 5 minus CR) and its 50 topics (numbers 351-400).</p>
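      <p>A minimal sketch of the nomenclature map follows; only the three
MART parameter names quoted above are taken from the libraries, while
the canonical key mart.numTrees and the surrounding API are
illustrative assumptions.</p>
      <preformat><![CDATA[
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch of the nomenclature map: a single canonical entry in the properties
// file is translated into the parameter name expected by each library.
public class ParameterMap {

    private static final Map<String, Map<String, String>> NAMES = new HashMap<>();

    static {
        Map<String, String> numTrees = new HashMap<>();
        numTrees.put("RankLib", "tree");
        numTrees.put("QuickRank", "num-trees");
        numTrees.put("JForests", "boosting.num-trees");
        NAMES.put("mart.numTrees", numTrees);
    }

    // Translates a canonical property name into the library-specific one.
    public static String translate(String canonical, String library) {
        return NAMES.get(canonical).get(library);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("mart.numTrees", "1000");
        // The same canonical entry drives all three libraries.
        for (String library : new String[] {"RankLib", "QuickRank", "JForests"}) {
            System.out.println(library + ": " + translate("mart.numTrees", library)
                    + " = " + props.getProperty("mart.numTrees"));
        }
    }
}
]]></preformat>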
      <p>
        We created a GoP of 1990 runs by using Terrier, as detailed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; for each run in this set we performed a re-ranking with all the
available LtR algorithms.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Eciency</title>
      <p>In Figure 6 we see the execution time for about 40 different runs
for the same topic. As we could expect from the previous
considerations, the execution time decreases as the number of processed
runs increases, saturating after about 30 runs. For the first run we
have the maximum execution time, since we need to compute the features
for all the documents in the run.</p>
      <p>In Table 1, we show the main stages involved in features
computation and the respective execution times. This example is about
the features computation for document FBIS4-33167 for topic 351.</p>
      <p>Finally, in Figure 7 we see the total execution time for the
Features Extraction phase for the considered topics.</p>
      <p>ere is an average computation time of about 2; 000 seconds (33
minutes) for each topic. e total execution time for this test was
about 99683 seconds (27.68 hours). We recall from above that this</p>
      <p>Action Time
Time to get docid from docno 2.0 ms
Time to compute arrays term frequency 61.0 ms
Time to compute arrays TF IDF 1.0 ms
Time to compute other features 1.0 ms
Time to write whole byte array 12.3 ms</p>
      <p>Total time for features computation 68.0 ms
Table 1: Example of execution times for features
computation.
computation has to be performed only once for a given set of runs,
while the second phase where the LtR algorithms are executed is
possibly repeated many times.</p>
      <p>As for the LtR algorithms execution module, the execution time
depends on the LtR libraries. As an example, in Table 2 we report the
execution times of the LtR algorithms employing their default parameter
settings.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Eectiveness</title>
      <p>In this subsection we give an initial glance at the effectiveness
of the tested LtR algorithms and we point out where different
implementations of the same algorithm lead to different performances;
we leave a deeper and more extensive analysis for future work. In Table
3 we report the MAP and the precision at cutoffs five and ten for a
given run re-ranked with different LtR algorithms. The base run employs
the DFIZ ranking model, the standard Terrier stopword list, and an
8-gram lexical unit generator. As we can see, ListNet, LineSearch,
RankNet, and the QuickRank implementation of Coordinate Ascent give the
lowest results, suggesting that some improvements are in order or that
they need a thorough parameter tuning phase. In this setting and with
default parameters, AdaRank gives the best results in terms of MAP.</p>
      <p>We analyzed the performance of the same algorithm implemented by
different libraries in terms of DCG values, to understand if there are
differences between different implementations. Again, these are
preliminary tests and we report the analysis for a single topic. In
particular, we present the results we get for LambdaMART, RankBoost,
and Coordinate Ascent.</p>
      <p>Figure 8 shows the DCG of the LambdaMART algorithm for topic
390.</p>
      <p>As we can see, all the implementations have similar performances.
In the case of RankLib's implementation we have some slight
improvements with respect to the original run. This behaviour is the
same for most of the topics.</p>
      <p>RankBoost is implemented by both QuickRank and RankLib; in Figure
9 we show the DCG curves for topic 390, where we also report the
original run. We can see how QuickRank slightly outperforms both
RankLib and the original run. In Figure 10 we analyze Coordinate
Ascent, where RankLib performs similarly to the original run, while
QuickRank is slightly worse.</p>
    </sec>
    <sec id="sec-8">
      <title>FINAL REMARKS</title>
      <p>In this paper we described a soware library that enables us to
run large-scale experiments over many LtR algorithms. We have
designed a library that, separating the execution in two dierent
modules, avoid to repeat unnecessary computations. is allows
us to eciently run batch experiments for studying how dierent
parameters aect the results of the models and how the results
dier for the same algorithms implemented by dierent libraries.</p>
      <p>
        As future work, one of the first improvements is to employ
the Hadoop MapReduce implementation of Terrier to index large
document collections in a distributed way [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>CLEF 2009: Grid@CLEF Pilot Track Overview</article-title>
          .
          <source>In Multilingual Information Access Evaluation Vol. I Text Retrieval Experiments - Tenth Workshop of the Cross-Language Evaluation Forum (CLEF</source>
          <year>2009</year>
          ). Revised Selected Papers,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurimo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mostefa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Roda</surname>
          </string-name>
          (Eds.).
          <source>Lecture Notes in Computer Science (LNCS) 6241</source>
          , Springer, Heidelberg, Germany,
          <fpage>552</fpage>
          -
          <lpage>565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A General Linear Mixed Models Approach to Study System Component Eects</article-title>
          .
          <source>In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2016</year>
          ),
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ruthven</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Towards an Anatomy of IR System Component Performances</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology (JASIST)</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tie-Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning to Rank for Information Retrieval</article-title>
          .
          <source>Foundations and Trends in Information Retrieval (FnTIR) 3, 3</source>
          (March
          <year>2009</year>
          ),
          <fpage>225</fpage>
          -
          <lpage>331</lpage>
          . hp: //dx.doi.org/10.1561/1500000016
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval</article-title>
          . In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , T.-Y. Liu, and C. Zhai (Eds.).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>McCreadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>MapReduce Indexing Strategies: Studying Scalability and Eciency</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>48</volume>
          ,
          <issue>5</issue>
          (
          <year>September 2012</year>
          ),
          <fpage>873</fpage>
          -
          <lpage>888</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>