<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>A soware library for conducting large scale experiments on Learning to Rank algorithms</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicola Ferro</string-name>
          <email>ferro@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Picello</string-name>
          <email>paolopicelloit@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmaria Silvello</string-name>
          <email>silvello@dei.unipd.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Dept. of Information Engineering, University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <volume>1</volume>
      <issue>2017</issue>
      <abstract>
        <p>is paper presents an ecient application for driving large scale experiments on Learning to Rank (LtR) algorithms. We designed a soware library that exploits caching mechanisms and ecient data structures to make the execution of massime experiments on LtR algorithms as fast as possible in order to try as many combinations of components as possible. is presented soware has been tested on dierent algorithms as well as on dierent implementations of the same algorithm in dierent libraries. is soware is highly congurable and extensible in order to enable the seamless addition of new features, algorithms, and libraries.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        LtR is a branch of Information Retrieval (IR) that employs machine
learning techniques to improve the effectiveness of IR systems, by
taking as input the ranked result list generated by an IR system
and producing as output a new re-ranked list of documents [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. LtR
techniques are extremely popular nowadays in IR and they are used
by almost all commercial search engines.
      </p>
      <p>
        Our long term goal is to study how LtR algorithms behave and
interact with other components typically present in an IR system,
such as stemmers or different IR models. To this end, we will rely on
and extend our methodology based on the use of Grid of Points (GoP)
and General Linear Mixed Models (GLMM) [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], where a factorial
combination of all the components under experimentation is
leveraged to estimate the main and interaction effects of the different
components as well as their effect size. Therefore, we will basically
need to test each LtR algorithm with all the combinations of the
other IR system components; if you consider that GoPs
combining just different stop lists, stemmers, and IR models typically
consist of thousands of IR systems [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], you can imagine the explosion in
the number of combinations to be tested when you add alternative
LtR algorithms on top of them.
      </p>
      <p>
        Unfortunately, when it comes to testing LtR algorithms, they are
usually evaluated in isolation, i.e. outside of a typical IR system
pipeline. Indeed, instead of ranked lists of documents, document
features, usually in LETOR format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], are given as input and
performance scores are directly produced as output.
      </p>
      <p>Even when an LtR library is directly integrated in an IR system,
as it happens with JForest (https://github.com/jeromepaul/jForest) and
Terrier (http://terrier.org/), they are designed to run a single
experiment at a time and, if you have to run thousands of experiments
as in our case, you have to re-start from scratch each time, while the
same query/document pair is typically found by many different runs, as
shown in Figure 1. As a consequence, these approaches compute the same
document and query features again and again, with a consequent waste of
time and resources.</p>
      <p>Figure 2 shows the number of already found query/document
pairs, i.e. the number of feature vectors already computed, as the
number of considered runs increases. If, instead of recomputing
these features each time we encounter them again, we somehow
cache and re-use them, we obtain a significant performance
improvement. For example, you can note from Figure 2 that, after
just 30 runs, we have already computed the features for almost all
the possible query/document pairs.</p>
      <p>erefore, our objective is to build an application that allows us
to evaluate LtR algorithms performance in an end-to-end pipeline,
conguring dierent components and evaluating the re-ranking
process as a whole, and that optimizes the costs, in terms of
computational load and execution times3.</p>
      <p>e paper is organized as follows: Section 2 describes the
proposed solution; Section 3 shows the performance of the proposed
solution in terms of execution costs; nally, Section 4 wraps up the
discussion and outlooks future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>PROPOSED SOLUTION</title>
      <p>e application we propose is modular and presents a logical
separation between two successive experimental stages:</p>
      <p>Features Extraction</p>
      <p>Learning To Rank Algorithms Execution
e rst module, Features Extraction, is responsible of the
computation of the features for each query/document pair in the input
runs enabling fast retrieval when these features are required by the
Learning To Rank Algorithms Execution module. is second
module is responsible for retrieving the desired features from memory,
constructing the required LETOR les, and executing the desired
LtR algorithm.</p>
      <p>is division of the main tasks allows us to drop down the total
execution time and facilitate the separation of the functions in
the soware. Indeed, the rst phase is time consuming because
we need to compute the features for all the considered input runs.
Aerwards we execute the second module as many times as we
want and with dierent parameters, without re-computing the
features. is is where we improve the eciency of the process.
and save the computed values in a byte buer to avoid unnecessary
memory occupation. Once all the features have been computed we
proceed by populating the feature matrix.
2.2</p>
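      <p>The following is a minimal sketch, in Java, of such a feature
matrix; the class and method names (FeatureMatrix, computeFeatures) are
illustrative assumptions and not the actual API of the library.</p>
      <preformat><![CDATA[
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of the features matrix: a <key,value> map where the key is
// the document identifier and the value is its feature vector, serialized
// into a byte array to keep the memory occupation low.
public class FeatureMatrix {

    private final Map<String, byte[]> rows = new ConcurrentHashMap<>();

    // Returns the features of a query/document pair; they are computed only
    // the first time the document is processed, then served from the cache.
    public float[] features(String docId, String queryId) {
        return decode(rows.computeIfAbsent(docId, id -> encode(computeFeatures(id, queryId))));
    }

    private byte[] encode(float[] features) {
        ByteBuffer buffer = ByteBuffer.allocate(Float.BYTES * features.length);
        for (float f : features) {
            buffer.putFloat(f);
        }
        return buffer.array();
    }

    private float[] decode(byte[] row) {
        ByteBuffer buffer = ByteBuffer.wrap(row);
        float[] features = new float[row.length / Float.BYTES];
        for (int i = 0; i < features.length; i++) {
            features[i] = buffer.getFloat();
        }
        return features;
    }

    // Placeholder for the actual computation (term frequencies, TF-IDF, and
    // the other features obtained through Terrier).
    private float[] computeFeatures(String docId, String queryId) {
        throw new UnsupportedOperationException("hook Terrier feature extraction here");
    }
}
]]></preformat>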
    </sec>
    <sec id="sec-3">
      <title>Learning To Rank Algorithms Execution</title>
      <p>For each run we have several algorithm/library configurations,
but for all these configurations we need the same LETOR files; so, we
extract the required features from the previously generated data
structure. We parallelize the feature extraction process and the LETOR
file generation by writing one different LETOR file for each query and
then merging these files according to the train/validation/test split
specified in the configuration parameters.</p>
      <p>We realized three dierent feature extraction alternatives: (i)
the rst one employs no parallelism and each task is processed
sequentially; (ii) the second one employs a dierent thread for each
dierent task, one thread is responsible for writing train le, one
for the validation le and another for the test le; and, (iii) the third
one employs a read Pool, that works like many dierent threads,
but minimizing the overhead due to thread creation. is solution
requires a locking mechanism to avoid readers-writers problems
when dierent threads access the same le. An advantage of this
last approach is that if we change the train/validation/test split we
do not need to perform the LETOR extraction phase again, but we
only need to perform the faster merging operation.</p>
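      <p>As a sketch of the Thread Pool alternative, under the assumption
of illustrative names (LetorFiles; the actual library may organize this
differently), one can write one LETOR file per query from a fixed-size
pool and keep the merge step synchronized:</p>
      <preformat><![CDATA[
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the third alternative: a fixed-size Thread Pool writes one LETOR
// file per query; the per-query files are then merged according to the
// train/validation/test split.
public class LetorFiles {

    // A pool behaves like many different threads, but without paying the
    // thread-creation overhead for every task.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    // Phase 1: one LETOR file per query, written in parallel.
    public void writePerQueryFiles(List<String> queryIds) throws InterruptedException {
        for (String qid : queryIds) {
            pool.submit(() -> writeQueryFile(qid));
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }

    private void writeQueryFile(String qid) {
        // One line per query/document pair, in LETOR format:
        // <label> qid:<qid> 1:<f1> 2:<f2> ... # <docno>
        // with the feature values served by the feature matrix.
    }

    // Phase 2: merge the per-query files according to the chosen split.
    // Synchronized, so that concurrent callers never append to the same
    // target file at the same time (the readers-writers problem above).
    public synchronized void merge(List<String> queryIds, Path target) {
        for (String qid : queryIds) {
            try {
                Files.write(target, Files.readAllBytes(Paths.get(qid + ".letor")),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }
}
]]></preformat>
      <p>With this organization, changing the split only means calling the
merge again with different query partitions, without repeating the
extraction.</p>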
      <p>Figure 4 reports the time required for the LETOR creation by the
three different approaches. In this case the Thread Pool generates only
4 threads, because it was limited by the architecture of the computer
used for testing; with more threads the gain would be even higher. As
we can see, the Thread Pool solution reduces the time by more than 50%
w.r.t. the single thread execution.</p>
      <p>e execution of this module is repeated for each input run. e
steps it follows are: (i) to create the LETOR text les according to
the desired train/validation/test split; (ii) to merge the generated
text les; and, (iii) execute the LTR algorithms.
3</p>
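      <p>A possible driver for a single run, reusing the LetorFiles sketch
above, could look as follows; runLtrAlgorithm stands for the invocation
of a concrete library and is purely illustrative.</p>
      <preformat><![CDATA[
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative glue for the three steps above; LetorFiles is the sketch
// given earlier and runLtrAlgorithm stands for invoking a concrete LtR
// library (RankLib, QuickRank, or JForests).
public class LtrExecution {

    public static void executeForRun(List<String> trainQids,
                                     List<String> validationQids,
                                     List<String> testQids) throws InterruptedException {
        LetorFiles letor = new LetorFiles();

        // (i) create one LETOR text file per query
        List<String> all = new ArrayList<>(trainQids);
        all.addAll(validationQids);
        all.addAll(testQids);
        letor.writePerQueryFiles(all);

        // (ii) merge according to the train/validation/test split
        letor.merge(trainQids, Paths.get("train.letor"));
        letor.merge(validationQids, Paths.get("validation.letor"));
        letor.merge(testQids, Paths.get("test.letor"));

        // (iii) execute the LtR algorithm on the merged files
        runLtrAlgorithm(Paths.get("train.letor"),
                Paths.get("validation.letor"), Paths.get("test.letor"));
    }

    private static void runLtrAlgorithm(Path train, Path validation, Path test) {
        // e.g. spawn the chosen library with the mapped parameter names
    }
}
]]></preformat>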
    </sec>
    <sec id="sec-4">
      <title>PERFORMANCE EVALUATION</title>
      <p>We conducted the experiments by using a MacBook Air (Mid 2013)
with a 1.3 GHz Intel Core i5, 3 MB L3 cache, Hyper-Threading (up to 4
threads), and 4 GB of DDR3 RAM at 1600 MHz. We employed Terrier v4.1
for extracting the features and the LtR algorithms reported in Figure
5, where we also report the open-source libraries implementing these
algorithms.</p>
      <p>As we can see, several algorithms like MART or LambdaMART are
implemented by all the considered libraries, while others, such as
AdaRank or LineSearch, are specific to a single library. We created a
single property file where the parameters of the algorithms are
specified. Since different implementations of the same algorithm use a
different nomenclature, we used a map that associates a parameter value
with its corresponding parameter in a given library (a sketch of this
map is given below). As an example, let us consider MART, where all the
libraries have a different parameter name to indicate the number of
trees: tree for RankLib, num-trees for QuickRank, and
boosting.num-trees for JForests. We gathered all these parameters under
the same entry in our properties file to simplify the testing phase.</p>
      <p>All the tests have been conducted by using the TREC7 corpus
(TIPSTER disks 4 and 5 minus CR) and its 50 topics (numbers 351-400).</p>
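      <p>A minimal sketch of the nomenclature map follows; only the three
MART parameter names quoted above are taken from the libraries, while
the canonical key mart.numTrees and the surrounding API are
illustrative assumptions.</p>
      <preformat><![CDATA[
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Sketch of the nomenclature map: a single canonical entry in the properties
// file is translated into the parameter name expected by each library.
public class ParameterMap {

    private static final Map<String, Map<String, String>> NAMES = new HashMap<>();

    static {
        Map<String, String> numTrees = new HashMap<>();
        numTrees.put("RankLib", "tree");
        numTrees.put("QuickRank", "num-trees");
        numTrees.put("JForests", "boosting.num-trees");
        NAMES.put("mart.numTrees", numTrees);
    }

    // Translates a canonical property name into the library-specific one.
    public static String translate(String canonical, String library) {
        return NAMES.get(canonical).get(library);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("mart.numTrees", "1000");
        // The same canonical entry drives all three libraries.
        for (String library : new String[] {"RankLib", "QuickRank", "JForests"}) {
            System.out.println(library + ": " + translate("mart.numTrees", library)
                    + " = " + props.getProperty("mart.numTrees"));
        }
    }
}
]]></preformat>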
      <p>
        We created a GoP of 1990 runs by using Terrier, as detailed in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]; for each run in this set we performed a re-ranking with all the
available LtR algorithms.
      </p>
    </sec>
    <sec id="sec-6">
      <title>Eciency</title>
      <p>In Figure 6 we see the execution time for about 40 different runs
for the same topic. As we could expect from the previous
considerations, the execution time decreases as the number of processed
runs increases, saturating after about 30 runs. For the first run we
have the maximum execution time, since we need to compute the features
for all the documents in the run.</p>
      <p>In Table 1, we show the main stages involved in features
computation and the respective execution times. This example is about
the features computation for document FBIS4-33167 for topic 351.</p>
      <p>Finally, in Figure 7 we see the total execution time for the
Features Extraction phase for the considered topics.</p>
      <p>ere is an average computation time of about 2; 000 seconds (33
minutes) for each topic. e total execution time for this test was
about 99683 seconds (27.68 hours). We recall from above that this</p>
      <p>Action Time
Time to get docid from docno 2.0 ms
Time to compute arrays term frequency 61.0 ms
Time to compute arrays TF IDF 1.0 ms
Time to compute other features 1.0 ms
Time to write whole byte array 12.3 ms</p>
      <p>Total time for features computation 68.0 ms
Table 1: Example of execution times for features
computation.
computation has to be performed only once for a given set of runs,
while the second phase where the LtR algorithms are executed is
possibly repeated many times.</p>
      <p>As for the LtR algorithms execution module, the execution time
depends on the LtR libraries. As an example, in Table 2 we report the
execution times of the LtR algorithms employing their default parameter
settings.</p>
    </sec>
    <sec id="sec-7">
      <title>3.2 Eectiveness</title>
      <p>In this subsection we give an initial glance at the effectiveness
of the tested LtR algorithms and we point out where different
implementations of the same algorithm lead to different performances;
we leave a deeper and more extensive analysis for future work. In Table
3 we report the MAP and the precision at cutoffs five and ten for a
given run re-ranked with different LtR algorithms. The base run employs
the DFIZ ranking model, the standard Terrier stopword list, and an
8-gram lexical unit generator. As we can see, ListNet, LineSearch,
RankNet, and the QuickRank implementation of Coordinate Ascent give the
lowest results, suggesting that some improvements are in order or that
they need a thorough parameter tuning phase. In this setting and with
default parameters, AdaRank gives the best results in terms of MAP.</p>
      <p>We analyzed the performance of the same algorithm implemented by
different libraries in terms of DCG values, to understand if there are
differences between different implementations. Again, these are
preliminary tests and we report the analysis for a single topic. In
particular, we present the results we get for LambdaMART, RankBoost,
and Coordinate Ascent.</p>
      <p>Figure 8 shows the DCG of the LambdaMART algorithm for topic
390.</p>
      <p>As we can see, all the implementations have similar performances.
In the case of RankLib's implementation we have some slight
improvements with respect to the original run. This behaviour is the
same for most of the topics.</p>
      <p>RankBoost is implemented by both QuickRank and RankLib; in Figure
9 we show the DCG curves for topic 390, where we also report the
original run. We can see how QuickRank slightly outperforms both
RankLib and the original run. In Figure 10 we analyze Coordinate
Ascent, where RankLib performs similarly to the original run, while
QuickRank is slightly worse.</p>
    </sec>
    <sec id="sec-8">
      <title>FINAL REMARKS</title>
      <p>In this paper we described a soware library that enables us to
run large-scale experiments over many LtR algorithms. We have
designed a library that, separating the execution in two dierent
modules, avoid to repeat unnecessary computations. is allows
us to eciently run batch experiments for studying how dierent
parameters aect the results of the models and how the results
dier for the same algorithms implemented by dierent libraries.</p>
      <p>
        As future work, one of the first improvements is to employ
the Hadoop MapReduce implementation of Terrier to index large
document collections in a distributed way [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Harman</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>CLEF 2009: Grid@CLEF Pilot Track Overview</article-title>
          .
          <source>In Multilingual Information Access Evaluation Vol. I Text Retrieval Experiments - Tenth Workshop of the Cross-Language Evaluation Forum (CLEF</source>
          <year>2009</year>
          ). Revised Selected Papers,
          <string-name>
            <given-names>C.</given-names>
            <surname>Peters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Di Nunzio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kurimo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mostefa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Roda</surname>
          </string-name>
          (Eds.).
          <source>Lecture Notes in Computer Science (LNCS) 6241</source>
          , Springer, Heidelberg, Germany,
          <fpage>552</fpage>
          -
          <lpage>565</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>A General Linear Mixed Models Approach to Study System Component Eects</article-title>
          .
          <source>In Proc. 39th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR</source>
          <year>2016</year>
          ),
          <string-name>
            <given-names>R.</given-names>
            <surname>Perego</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sebastiani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aslam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Ruthven</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Zobel</surname>
          </string-name>
          (Eds.). ACM Press, New York, USA,
          <fpage>25</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          and
          <string-name>
            <given-names>G.</given-names>
            <surname>Silvello</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Towards an Anatomy of IR System Component Performances</article-title>
          .
          <source>Journal of the American Society for Information Science and Technology (JASIST)</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Tie-Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning to Rank for Information Retrieval</article-title>
          .
          <source>Foundations and Trends in Information Retrieval (FnTIR) 3, 3</source>
          (March
          <year>2009</year>
          ),
          <fpage>225</fpage>
          -
          <lpage>331</lpage>
          . hp: //dx.doi.org/10.1561/1500000016
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.-Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <year>2007</year>
          .
          <article-title>LETOR: Benchmark Dataset for Research on Learning to Rank for Information Retrieval</article-title>
          . In SIGIR 2007 Workshop on Learning to Rank for Information Retrieval,
          <string-name>
            <given-names>T.</given-names>
            <surname>Joachims</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          , T.-Y. Liu, and C. Zhai (Eds.).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>McCreadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>MapReduce Indexing Strategies: Studying Scalability and Eciency</article-title>
          .
          <source>Information Processing &amp; Management</source>
          <volume>48</volume>
          ,
          <issue>5</issue>
          (
          <year>September 2012</year>
          ),
          <fpage>873</fpage>
          -
          <lpage>888</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>