Agile Information Retrieval Experimentation with Terrier Notebooks∗
                                                 Craig Macdonald, Richard McCreadie, Iadh Ounis
                                                                            University of Glasgow
                                                                           Glasgow, Scotland, UK
                                                                       first.lastname@glasgow.ac.uk

ABSTRACT

Teaching modern information retrieval is greatly benefited by giving students hands-on experience with an open-source search engine that they can experiment with. As such, open source platforms such as Terrier are a valuable resource upon which learning exercises can be built. However, experimentation using such systems can be a laborious process when performed by hand; queries might be rewritten, executed, and model parameters tuned. Moreover, the rise of learning-to-rank as the de-facto standard for state-of-the-art retrieval complicates this further, with the introduction of training, validation and testing (likely over multiple folded datasets representing different query types). Currently, students resort to shell scripting to make experimentation easier; however, this is far from ideal. On the other hand, the introduction of experimental pipelines in platforms like scikit-learn and Apache Spark, in conjunction with notebook environments such as Jupyter, has been shown to markedly reduce the barriers for non-experts in setting up and running experiments. In this paper, we discuss how next-generation information retrieval experimental pipelines can be combined in an agile manner using notebook-style interaction mechanisms. Building upon the Terrier IR platform, we describe how this is achieved using the recently released Terrier-Spark module and other recent changes in Terrier 5.0. Overall, this paper demonstrates the advantages of the agile nature of notebooks for experimental IR environments, from the classroom through to academic and industry research labs.

∗ This is an extended version of a demonstration paper that was published at SIGIR 2018.

DESIRES’18, 2018, Bertinoro, Italy
© 2018 Copyright held by the author(s).

1 INTRODUCTION

Information retrieval (IR) is an important field in computing science, and is taught to undergraduate students at universities worldwide. IR is a dense topic to teach, as it encompasses 30 years of intensive research and development from academia and industry. Moreover, IR is constantly evolving as a field, as new techniques are introduced, tested and adopted by commercial search engines. Indeed, supervised machine learning techniques [9] have systematically replaced traditional theoretically-founded term weighting models (e.g. BM25, language models) over the last decade. As such, taught IR courses need to be flexible in the face of rapid changes in the broader field, so that students are prepared for the challenges they will face in industry.

As with many computing science subjects, hands-on coding experience in IR is very beneficial for grasping the concepts being taught. However, commercial search engines are not publicly available for students to experiment with. For this reason, academic research groups have devoted resources to developing open source search engines [14, 16] that can be used to support learning and teaching in IR, such as the Terrier IR platform [14], while keeping them up-to-date with state-of-the-art techniques.

However, while these platforms provide the core functionality of a search engine and are valuable for supporting hands-on exercises, significant time and effort is required for students to learn the basics of experimenting with such a platform. Furthermore, as the complexity of these platforms grows, the barriers to entry for using them are increasing. For instance, the rise of learning-to-rank as the de-facto standard for state-of-the-art retrieval now requires students to understand and build command pipelines for document and query feature extraction (along with configuration and possibly optimisation); dataset folding; and the subsequent training, validation and testing of the learned models. As a result, students either spend significant time manually running commands or resort to shell scripting to make experimentation easier. In either case, this is an undesirable burden on the students that wastes valuable tuition time.

On the other hand, agile experimentation, particularly for non-IR machine learning applications, is increasingly being facilitated by the use of experimental pipelines in toolkits like scikit-learn and Apache Spark. These toolkits break down the steps involved in machine learning into small discrete operations, which can be dynamically chained together - forming a reusable and customisable pipeline. For example, in the scikit-learn toolkit for Python, each supervised technique exposes a fit() method for training, and a transform() method for applying the trained model. Apache Spark's MLlib API similarly defines fit() and transform() methods for the same purpose. These experimental pipelines are very powerful when combined with recent 'notebook' applications, such as Jupyter, which enable developers to store, edit and re-run portions of their pipelines on demand.

In this paper, we argue that similar experimental pipelines are the next step in enhancing the teaching of IR in the classroom. In particular, experimental pipelines encapsulated as notebooks:

   (1) Provide an agile experimental platform that students/researchers/practitioners can easily edit.
   (2) Enable reproducibility in information retrieval experiments.
   (3) Centralise configuration such that issues are more easily identified.
   (4) Allow for more complex examples to be released as pre-configured notebooks.
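To make the fit()/transform() abstraction concrete, the following minimal sketch uses standard Spark MLlib components (not Terrier-Spark); the toy data and column names are illustrative, and a SparkSession named spark is assumed to be in scope, as in spark-shell:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy training data (id, text, label) - illustrative only
val training = spark.createDataFrame(Seq(
  (0L, "terrier retrieval platform", 1.0),
  (1L, "unrelated document text", 0.0)
)).toDF("id", "text", "label")

// Two Transformers and an Estimator chained into a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() learns a model; transform() applies it to unseen data
val model = pipeline.fit(training)
val test = spark.createDataFrame(Seq((2L, "terrier platform"), (3L, "other text")))
  .toDF("id", "text")
model.transform(test).select("id", "prediction").show()

Terrier-Spark, introduced in Section 5, provides analogous Transformers and Estimators for the stages of an IR experiment.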


Furthermore, we discuss recent advances in the Terrier IR platform with the Terrier-Spark module and additions to Terrier 5.0 that enable experimental pipelines moving forward.

The structure of this paper is as follows: Section 2 details recent feedback in an empirical information retrieval course; Section 3 highlights the main requirements for an experimental IR platform; Section 4 summarises the current Terrier platform; Section 5 introduces Terrier-Spark and how to conduct IR experiments using it; Section 6 highlights other relevant changes in Terrier 5.0; Section 7 discusses the advantages of combining Terrier-Spark with Jupyter notebooks. Concluding remarks follow in Section 8.

2 RECENT EXPERIENCES OF AN INFORMATION RETRIEVAL COURSE USING TERRIER

Information retrieval has been taught in Glasgow as an undergraduate and postgraduate elective since the mid-1980s. Its current incarnation consists of 20 hours of lectures, along with supplementary tutorials and laboratory sessions, allowing students to gain hands-on experience in developing IR technologies and critically assessing their performance. To address this latter point, we set the students two assessed exercises (aka courseworks), of approximately 20 hours in total. These give the students experience with empirical, experimental IR, from core concepts (TF.IDF, document length normalisation) through to learning-to-rank.

Coursework 1 [8 hours]. Create an index of 50% of the .GOV test collection using Terrier; perform retrieval for a variety of standard weighting models, with and without query expansion, on three tasks of the TREC 2004 Web track (homepage finding, named-page finding, topic distillation). The students are also asked to compare and contrast with a “simple” TF.IDF they have implemented themselves. Students then analyse the results, including per-topic analysis, precision-recall graphs, etc. Overall, Coursework 1 is designed to familiarise the students with the core workings of a (web) search engine and with performing IR experiments, as well as analysing their results, critically analysing the attributes of different retrieval techniques, and how these affect performance on different topic sets.

Coursework 2 [12 hours]. Use a provided index for the .GOV test collection that contains field and positional information, along with a number of standard Web features (PageRank, DFR proximity, BM25 on the title). The students are asked to implement two proximity features from a number described in [5], and combine these within a LambdaMART learning-to-rank model. Their analysis must use techniques learned through Coursework 1, e.g. identifying which queries benefitted most from the further proximity features, etc. Overall, this coursework gives students hands-on experience with deploying a learning-to-rank pipeline (training/validation/testing and evaluation), as well as the notion of positional information and the posting list iterators necessary to implement their additional proximity features.

Student feedback on the current courseworks. The positive feedback on the current coursework exercises is that these form an effective vehicle for achieving the intended learning outcomes of the course. In particular, they encompass more than programming implementations, and they force the students to understand the IR concepts already presented in the lectures. Students also value the reinforcement of the empirical nature of IR as a field, and the necessity of experimental evaluation.

However, students also note the difficulties in configuring Terrier (long commandline incantations, and/or awkward editing of the terrier.properties file). Indeed, not all students are familiar with commandline scripting technologies on their chosen operating system. Some students contrasted this with other courses, such as our recent Text-as-Data course on text mining/classification/information extraction, which uses Python notebooks, and lamented the lack of a Jupyter environment for Terrier.

Overall, from the feedback described above, it is clear that moving towards providing a notebook environment for performing IR experiments would aid students. To this end, we have considered how to modernise the experimental environment for conducting experiments using Terrier. The next few sections detail the underlying essential requirements for an experimental IR platform (Section 3), the current Terrier status (Section 4), as well as how to adapt Terrier to support a notebook paradigm leveraging Apache Spark (Section 5).

3 IR PLATFORM REQUIREMENTS FOR CONDUCTING EMPIRICAL EXPERIMENTS

Below, we argue for, in our experience, the main attributes of an experimental IR platform. These are described in terms of required functionalities - in practice, there are also non-functional requirements, such as running experiments efficiently on large corpora such as ClueWeb09.

   R1 Perform an “untrained” run for a weighting model over a set of query topics, retrieving and ranking results from an index.
   R2 Evaluate a run over a set of topics, based on relevance labels.
   R3 Train the parameters of a run, which may require repetitive execution of queries from an index and evaluation.
   R4 Extract a run with multiple features that can be used as input to a learning-to-rank technique.
   R5 Re-rank results based on multiple features and a pre-trained learning-to-rank technique.

R1 concerns the ability of the IR system to be executed in an offline batch mode - to produce the results for a set of query topics. Academic platforms such as Terrier, Indri [16] and Galago [3] offer such functionality out of the box. R2 concerns the provision of evaluation tools that permit a run to be evaluated. Standard tools exist, such as the C-based trec_eval library, but integration in the native language of the system may provide advantages for R3. Other systems such as Lucene/Solr/Elastic may need some scripting or external tools (Azzopardi et al. [1] highlight the lack of empirical tools for IR experimentation and teaching on Lucene, and have made some inroads into addressing this need).

Indeed, R3 represents the early advent of machine learning into the IR platform, where gradient ascent/descent algorithms were used to optimise the parameters of systems by (relatively expensive) repeated querying and evaluation of different parameter settings. Effective techniques such as BM25F [20] & PL2F [10] were facilitated by common use of such optimisation techniques.

Finally, R4 & R5 are concerned with the successful integration of learning-to-rank into the IR system. As with new technologies,


there can be a lag between research-fresh developments and how they are bled into production-ready systems. Of these, for the purposes of experimentation, R4 is the more important - the ability to efficiently extract multiple query-dependent features has received some coverage in the literature [11]. R5 is concerned with taking this a stage further, and applying a learned model to re-rank the results.

In the following, we will describe how Terrier currently meets requirements R1-R5 (Section 4), and how it can be adopted within a Spark environment to meet these in a more agile fashion (Section 5).

4 BACKGROUND ON TERRIER

Terrier [14] is a retrieval platform dating back to 2001 with an experimental focus. First released as open source in 2004, it has been downloaded >50,000 times since. While Terrier provides a Java API that allows extension and/or integration into a number of applications, the typical execution of Terrier is based upon procedural command invocations from the commandline. Listing 1 provides the commandline invocations necessary to fulfil requirements R1 & R2 using Terrier. All requirements R1-R5 listed above are supported by the commandline. Moreover, the use of a rich commandline scripting language (GNU Bash, for instance) permits infinite combinations of different configurations to be evaluated automatically. Moreover, with appropriate cluster management software, such runs can be conducted efficiently in a distributed fashion. This commandline API is also the main method that students learn to use to interact with the IR system.

bin/trec_terrier.sh -r -Dtrec.topics=/path/to/topics \
   -Dtrec.model=BM25
bin/trec_terrier.sh -e -Dtrec.qrels=/path/to/qrels

Listing 1: A simple retrieval run and evaluation using Terrier's commandline interface - c.f. requirements R1 & R2.

However, we have increasingly found that a commandline API was not suited for all purposes. For instance, the chaining of the outcomes between invocations requires complicated scripting. For instance, consider, for each fold of a 5-fold cross validation: training the b length normalisation parameter of BM25, saving the optimal value, and using that as input to a learning-to-rank run, distributed among a cluster environment. Such an example would require creating tedious amounts of shell scripting, for little subsequent empirical benefit. In short, this paper argues that IR experimentation has now reached the stage where we should not be limited by the confines of a shell-scripting environment.

5 TERRIER-SPARK

To address the perceived limitations in the procedural commandline use of Terrier, we have developed a new experimental interface for the Terrier platform, building upon Apache Spark, called Terrier-Spark. Apache Spark is a fast and general engine for large-scale data processing. While Spark can be invoked in Java, Scala and Python, we focus on the Scala environment, which allows for code that is more succinct than the equivalent Java (for instance, through the use of functional programming constructs and automatic type inference). Spark allows relational algebra operations on dataframes (relations) to be easily expressed as function calls, which are then compiled to a query plan that is distributed and executed on machines within the cluster.

Apache Spark borrows the notion of dataframes from Pandas1 (a Python data analysis library), and similarly the notion of machine learning pipeline constructs and interfaces (e.g. fit and transform methods) from scikit-learn2 (a Python machine learning library), namely:

   • DataFrame: a relation containing structured data.
   • Transformer: an object that can transform a data instance from a DataFrame.
   • Estimator: an algorithm that can be fitted to data in a DataFrame. The outcome of an Estimator can be a Transformer - for instance, a machine-learned model obtained from an Estimator will be a Transformer.
   • Pipeline: a series of Transformers and Estimators chained together to create a workflow.
   • Parameter: a configuration option for an Estimator.

In our adaptation of Terrier to the Spark environment, Terrier-Spark, we have implemented a number of Estimators and Transformers. These allow the natural stages of an IR system to be combined in various ways, while also leveraging the existing supervised ML techniques within Spark to permit the learning of ranking models (e.g. Spark contains logistic regression, random forests and gradient boosted regression trees, but notably no support for listwise learning techniques such as LambdaMART [19], which are often the most effective [2, 9]).

Table 1 summarises the main components developed to support the integration of Terrier into Apache Spark, along with their inputs, outputs and key parameters. In particular, QueryingTransformer is the key Transformer, in that it internally invokes Terrier to retrieve the docids and scores of each retrieved document for the queries in the input dataframe. As Terrier is written in Java, and Scala and Java are both JVM-based languages, Terrier can run “in-process”. Furthermore, as discussed in Section 6 below, further changes in Terrier 5.0 permit accessing indices on remotely hosted Terrier servers.

In the following, we provide examples of retrieval experimental listings using Spark through Scala.

5.1 Performing an untrained retrieval run

Listing 1 shows how a simple retrieval run can be made using Terrier's commandline API. The locations of the topics and qrels files, as well as the weighting model, are set on the commandline, although defaults could be set in a configuration file.

In contrast, Listing 2 shows how the exact same run might be achieved from Scala in a Spark environment. Once the topics file is loaded into a two-column dataframe (keyed by "qid", the topic number), these are transformed into a dataframe of result sets, obtained from Terrier (keyed by "⟨qid,docno⟩"). Then a second transformer records the relevant and non-relevant documents within the dataframe, by joining with the contents of the qrels file, before evaluation.

1 http://pandas.pydata.org/
2 http://scikit-learn.org/

Component                    | Inputs                          | Output                                          | Parameters
QueryStringTransformer       | Queries                         | Queries                                         | Lambda function to transform query
QueryingTransformer          | Queries                         | docids, scores for each query                   | number of docs; weighting model
FeaturedQueryingTransformer  | Queries                         | docids, scores of each feature for each query   | + feature set
QrelTransformer              | results with docids             | results with docids and labels                  | qrel file
NDCGEvaluator                | results with docids and labels  | Mean NDCG@K                                     | cutoff K

Table 1: Summary of the primary user-facing components available in Terrier-Spark.



val props = Map("terrier.home" -> "/path/to/Terrier")
TopicSource.configureTerrier(props)
val topicsFile = "/path/to/topics.401-450"
val qrelsFile = "/path/to/qrels.trec8"

//load the topics into a two-column dataframe, keyed by "qid"
val topics = TopicSource.extractTRECTopics(topicsFile)
    .toList.toDF("qid", "query")

val queryTransform = new QueryingTransformer()
    .setTerrierProperties(props)
    .setSampleModel("BM25")

val r1 = queryTransform.transform(topics)
//r1 is a dataframe with results for the queries in topics

val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)

val r2 = qrelTransform.transform(r1)
//r2 is a dataframe as r1, but also includes a label column

val meanNDCG = new NDCGEvaluator().evaluate(r2)

Listing 2: A retrieval run in Scala - c.f. requirements R1 & R2.

While clearly more verbose than the simpler commandline API, Listing 2 demonstrates equivalent functionality, and clearly highlights the needed data and the steps involved in the experiment. Moreover, the use of objects suitable to be built into a Spark pipeline offers the possibility to build and automate pipelines. As we show below, this functionality permits the powerful features of a functional language to allow more complex experimental pipelines.

5.2 Training weighting models

Listing 3 demonstrates the use of Spark's Pipeline and CrossValidator components to create a pipeline that applies a grid-search to determine the most effective weighting model and its corresponding document length normalisation c parameter. Such a grid-search can be parallelised across many Spark worker machines in a cluster. We note that while grid-search is one possibility, it is feasible to consider the use of a gradient descent algorithm to tune the c parameter. However, at this stage we do not yet have a parallelised algorithm implemented that would make best use of a clustered Spark environment.

//assuming various variables as per Listing 2
val pipeline = new Pipeline()
  .setStages(Array(queryTransform, qrelTransform))

val paramGrid = new ParamGridBuilder()
  .addGrid(queryTransform.localTerrierProperties,
      Array(Map("c" -> "1"), Map("c" -> "10"), Map("c" -> "100")))
  .addGrid(queryTransform.sampleModel,
      Array("InL2", "PL2"))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new NDCGEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
val cvModel = cv.fit(topics)

Listing 3: Grid searching the weighting model and document length normalisation c parameter using Spark's CrossValidator - c.f. requirement R3.

5.3 Training learning-to-rank models

val queryTransform = new FeaturesQueryingTransformer()
  .setTerrierProperties(props)
  .setMaxResults(5000)
  .setRetrievalFeatures(List(
     "WMODEL:BM25",
     "WMODEL:PL2",
     "DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier"))
  .setSampleModel("InL2")
val r1 = queryTransform.transform(topics)
//r1 is as per Listing 2, but now also has a column of 3
//feature values for each retrieved document
val qrelTransform = new QrelTransformer()
    .setQrelsFile(qrelsFile)
val r2 = qrelTransform.transform(r1)

//learn a Random Forest model
val rf = new RandomForestRegressor()
   .setLabelCol("label")
   .setFeaturesCol("features")
   .setPredictionCol("newscore")
val rfModel = rf.fit(r2)

Listing 4: Training a Random Forests based learning-to-rank model - c.f. requirements R4 & R5.

Finally, Listing 4 demonstrates the use of Spark's in-built machine learning Random Forest regression technique to learn a learning-to-rank model. In this example, the initial ranking of documents is performed by the InL2 weighting model, with an additional three query-dependent features being calculated for the top 5000 ranked documents for each query. Internally, this uses Terrier's Fat framework


for implementing the efficient calculation of additional query-dependent features [11]. The learned random forests model can be trivially applied to a further set of unseen topics (not shown). The resulting Scala code is markedly more comprehensible than the equivalent complex commandline invocations necessary for Terrier 4.2 [17]. Moreover, we highlight the uniqueness of our offering - while other platforms such as Solr and Elastic have Spark tools, none offer the ability to export a multi-feature representation suitable for conducting learning-to-rank experiments within Spark (c.f. R4 & R5).

Of course, the pipeline framework of Estimators and Transformers is generic, and one can easily imagine further implementations of both to increase the diversity of possible experiments: for instance, new Estimators for increased coverage of learning-to-rank techniques, such as LambdaMART [19]; similarly, Transformers for adapting the query representation, for example by applying query-log based expansions [7] or proximity-query rewriting such as Sequential Dependence models [12]. Once a suitable Pipeline is configured, experiments such as learning-to-rank feature ablations can be conducted in only a few lines of Scala.

6 OTHER CHANGES TO TERRIER 5.0

We have also made a number of other changes to Terrier, which have been incorporated into the recently released version 5.0, and which aid in expanding the possible retrieval concepts that can be easily implemented using Terrier-Spark, while increasing the flexibility offered by the platform.

6.1 Indri-esque matching query language

Terrier 5.0 implements a subset of the Indri/Galago query language [3, Ch. 5], including complex operators such as #syn and #uwN. In particular, Terrier 5.0 provides:

   • #syn(t1 t2): groups a set of terms into a single term for the purposes of matching. This can be used for implementing query-time stemming.
   • #uwN(t1 t2): counts the number of occurrences of t1 & t2 within unordered windows of size N.
   • #1(t1 t2): counts the number of exact matches of the bigram t1 & t2.
   • #combine(t1 t2): allows the weighting of t1 & t2 to be adjusted.
   • #tag(NAME t1 t2): allows a name to be assigned to a set of terms, which can then be formed into a set of features by the later Fat layer. In doing so, such tagged expressions allow various sub-queries to form separate features during learning-to-rank. This functionality is not present in Indri or Galago.

For example, Metzler and Croft's sequential dependence proximity model [12] can be formulated using combinations of #uwN and #owN query operators. Such a query rewriting technique can be easily implemented within Terrier-Spark by applying a QueryStringTransformer that applies a lambda function upon each query, allowing users to build upon the new complex query operators implemented in Terrier 5.0. Figure 1 demonstrates applying sequential dependence to a dataframe of queries, within a Jupyter notebook. This is achieved by instantiating a QueryStringTransformer upon a dataframe of queries, using a lambda function that appropriately rewrites the queries.

6.2 Remote Querying

As discussed in Section 5 above, Terrier-Spark has initially been designed for in-process querying. However, concurrent changes to Terrier for version 5.0 have abstracted away the assumption that all retrieval access occurs within the current process, or against an index accessible from the same machine. Indeed, a reference to an index may refer to an index hosted on another server, and made accessible over a RESTful interface. While this is a conventional facility offered by some other search engine products (and made available through their Spark tools, such as Elastic's3 and Solr's4), it offers a number of advantages for teaching. Indeed, IR test collections can often be too large to provide as downloads - a remote index accessible over a (secured) RESTful HTTP connection would negate the need to provide students with the raw contents of the documents for indexing. Moreover, unlike the conventional Spark tools for Elastic and Solr, the results returned can have various features pre-calculated for applying and evaluating learning-to-rank models.

To make this more concrete, consider the TREC Microblog track, which used an “evaluation-as-a-service” methodology [8]. In this, the evaluation track organisers provided a search API based upon Lucene, through which the collection can be accessed for completing the evaluation task. The advancements described here would allow a Terrier index to be provided for a particular TREC collection, easily accessible through either the conventional Terrier commandline tools, or through Terrier-Spark. A run in that track could then be crafted and submitted to TREC wholly within a Jupyter notebook, facilitating easy run reproducibility.

7 CONDUCTING IR EXPERIMENTS WITHIN A JUPYTER NOTEBOOK ENVIRONMENT

Spark can be used in a number of manners: by compiling Scala source files into executable (Jar) files, which are submitted to a Spark cluster, or through line-by-line execution in spark-shell (a Read-Eval-Print-Loop or REPL tool). However, each has its disadvantages: the former only permits slow development iterations, through the necessity to recompile at each iteration; on the other hand, the REPL spark-shell environment does not easily record the developed code, nor allow parts of the code to be re-executed.

Instead, we note that the use of a Spark environment naturally fits with the use of Scala Jupyter notebooks.5,6 Jupyter is an open-source web application that allows the creation and sharing of documents that contain code, equations, visualisations and narrative text. Increasingly, entire technical report documents, slides and books are being written as Jupyter notebooks, due to the easy integration of text, code and resulting analysis tables or visualisations.

Jupyter notebooks are increasingly used to share the algorithms and analysis conducted in machine learning research papers, significantly aiding reproducibility [15]. Indeed, in their report on the Dagstuhl workshop on reproducibility of IR experiments [6], Ferro, Fuhr et al. note that sharing of code and experimental methods would aid reproducibility in IR, but do not recognise the ability of notebooks to aid in this process.

3 https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
4 https://github.com/lucidworks/spark-solr
5 http://jupyter.org/
6 We note that Jupyter notebooks are extensible through plugins to Scala and other languages, i.e. not limited to Python.




       Figure 1: Example of applying a Transformer to apply sequential dependency proximity to a dataframe of queries.
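Figure 1 is reproduced only as a screenshot. The following sketch approximates the kind of rewriting lambda it demonstrates: a plain Scala function that expands a query into the #1 and #uwN operators of Terrier 5.0. The window size of 8 and the omission of #combine weights are illustrative simplifications of the full model of Metzler and Croft [12], and the exact lambda used in the figure may differ.

// Sketch: rewrite "t1 t2 t3" into a sequential-dependence style query using
// the #1 (exact bigram) and #uwN (unordered window) operators.
// Window size and weighting are simplified for illustration.
val sdRewrite: String => String = { query =>
  val terms = query.trim.split("\\s+").toSeq
  val bigrams = terms.sliding(2).filter(_.size == 2).toSeq
  val ordered = bigrams.map(b => s"#1(${b.mkString(" ")})")
  val unordered = bigrams.map(b => s"#uw8(${b.mkString(" ")})")
  (terms ++ ordered ++ unordered).mkString(" ")
}

sdRewrite("information retrieval experiments")
// => information retrieval experiments #1(information retrieval)
//    #1(retrieval experiments) #uw8(information retrieval) #uw8(retrieval experiments)

Such a function would then be supplied as the query-transforming lambda of a QueryStringTransformer (c.f. Table 1).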


Jupyter notebooks are interactive in manner, in that a code block in a single cell can be run independently of all other cells in the notebook. As a result, Jupyter is also increasingly used for educational purposes - for example, teaching programming within undergraduate degree courses [4, 18], as well as a plethora of data science or machine learning courses [15]. O'Hara et al. [13] described four uses for notebooks in classroom situations, including lectures, flipped classrooms, home/lab work and exams. For instance, the use of notebooks within a lecturing situation easily permits the students to replicate the analysis demonstrated by the lecturer.

We argue that these general advantages of notebooks can be applied to experimental information retrieval education, through the use of a Spark-integrated IR platform, such as that described in this paper. Indeed, we believe that the changes described in Sections 5 & 6 should address this feedback, allowing students to more easily configure the retrieval platform (all configuration of Terrier is presented on the screen), and making more powerful experimentation available to a wider and less experienced audience not wishing to engage in complicated shell-scripting.

We believe that Terrier-Spark can bring these same advantages to conducting modern (e.g. learning-to-rank) IR experiments, combined with the accessible and agile nature of a notebook environment. Moreover, an integrated Jupyter environment also facilitates, for instance in the IR teaching environment, the creation and presentation of inline figures (e.g. created using the Scala vegas-viz library7), such as per-query analyses and interpolated precision-recall graphs. Figures 2-4 provide screenshots from such a notebook.8,9 In particular: Figure 2 demonstrates the querying of Terrier for two different retrieval models; Figure 3 shows analysis of the results, ranking queries based on the difference of their evaluation performance between rankings; finally, Figure 4 demonstrates the same information as a per-query analysis figure.

8 CONCLUSIONS

In this paper, we have described the challenge of teaching a modern undergraduate- and postgraduate-level elective course on information retrieval. We highlight the main requirements of an experimental IR platform, then further describe Terrier-Spark, an extension of the Terrier IR platform to perform IR experimentation within the Spark distributed computing engine, which not only addresses these requirements, but also allows complex experiments to be easily defined within a few lines of Scala code. Terrier-Spark and Terrier 5.0 have been released as open source. In addition, we also argue that Jupyter notebooks for IR can aid not only agile IR experimentation by students, but also research reproducibility in information retrieval, by facilitating easily-distributable notebooks that demonstrate the conducted experiments.

Overall, we believe that notebooks are an important aspect of data science, and that we as an IR community should not fall behind other branches of data science in using notebooks for empirical IR experimentation. The frameworks described here might be extended to other languages (e.g. a Python wrapper for Terrier's RESTful interface), or even to other IR platforms. In doing so, we bring more powerful and agile experimental IR tools into the hands of researchers and students alike. Terrier-Spark has been released as open source, and is available from

    https://github.com/terrier-org/terrier-spark

along with example Jupyter notebooks.

7 https://github.com/vegas-viz/
8 We use the Apache Toree kernel for Jupyter, which allows notebooks to be written in Scala and which automatically interfaces with Apache Spark.
9 The original notebook can be found in the Terrier-Spark repository, see https://github.com/terrier-org/terrier-spark/tree/master/example_notebooks/toree.




    Figure 2: Conducting two different retrieval runs within a Jupyter notebook using a function defined in Terrier-Spark.
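Figure 2 is likewise reproduced only as a screenshot. The following sketch indicates the kind of helper it relies upon: a small function (the name evaluateModel is illustrative) that produces and evaluates a run for a given weighting model, reusing the Terrier-Spark components of Listing 2 and assuming that props, topics and qrelsFile have been defined as in that listing.

// Sketch only: produce and evaluate a run for a given weighting model.
// Assumes props, topics (a qid/query dataframe) and qrelsFile are defined
// as in Listing 2; the helper name is illustrative.
def evaluateModel(weightingModel: String): Double = {
  val results = new QueryingTransformer()
    .setTerrierProperties(props)
    .setSampleModel(weightingModel)
    .transform(topics)
  val labelled = new QrelTransformer()
    .setQrelsFile(qrelsFile)
    .transform(results)
  new NDCGEvaluator().evaluate(labelled)
}

val bm25NDCG = evaluateModel("BM25")
val pl2NDCG  = evaluateModel("PL2")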


REFERENCES
[1] Leif Azzopardi et al. 2017. Lucene4IR: Developing Information Retrieval Evaluation Resources Using Lucene. SIGIR Forum 50, 2 (2017), 18.
[2] Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to Rank Challenge Overview. Proceedings of Machine Learning Research 14 (2011).
[3] Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: Information Retrieval in Practice (1st ed.). Addison-Wesley Publishing Company, USA.
[4] Lynn Cullimore. 2016. Using Jupyter Notebooks to teach computational literacy. (2016). http://www.elearning.eps.manchester.ac.uk/blog/2016/using-jupyter-notebooks-to-teach-computational-literacy/
[5] Ronan Cummins and Colm O'Riordan. 2009. Learning in a Pairwise Term-term Proximity Framework for Information Retrieval. In Proceedings of SIGIR.
[6] Nicola Ferro, Norbert Fuhr, et al. 2016. Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science". SIGIR Forum 50, 1 (2016).
[7] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006. Generating Query Substitutions. In Proceedings of WWW.
[8] Jimmy Lin and Miles Efron. 2013. Evaluation As a Service for Information Retrieval. SIGIR Forum 47, 2 (Jan. 2013), 8–14.
[9] Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009).
[10] Craig Macdonald, Vassilis Plachouras, Ben He, Christina Lioma, and Iadh Ounis. 2006. University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming. In Proceedings of CLEF.
[11] Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis, and Ben He. 2013. About Learning Models with Multiple Query-dependent Features. ToIS 31, 3 (2013).
[12] Donald Metzler and W. Bruce Croft. 2005. A Markov random field model for term dependencies. In Proceedings of SIGIR.
[13] Keith J. O'Hara, Douglas Blank, and James Marshall. 2015. Computational Notebooks for AI Education. In Proceedings of FLAIRS.
[14] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and Christina Lioma. 2006. Terrier: A High Performance and Scalable Information Retrieval Platform. In Proceedings of OSIR.
[15] Fernando Perez and Brian E Granger. 2015. Project Jupyter: Computational narratives as the engine of collaborative data science. Technical Report. http://archive.ipython.org/JupyterGrantNarrative-2015.pdf
[16] Trevor Strohman, Donald Metzler, Howard Turtle, and W Bruce Croft. 2005. Indri: A language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis, Vol. 2. Citeseer, 2–6.
[17] Terrier.org. 2016. Learning to Rank with Terrier. (2016). http://terrier.org/docs/v4.2/learning.html
[18] John Williamson. 2017. CS1P: Running Jupyter from the command line. (2017). https://www.youtube.com/watch?v=hqpuC0YLbpM
[19] Qiang Wu, Chris J. C. Burges, Krysta M. Svore, and Jianfeng Gao. 2008. Ranking, Boosting, and Model Adaptation. Technical Report MSR-TR-2008-109. Microsoft.
[20] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. 2004. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC.




                                     Figure 3: Comparing the results of two different retrieval runs.
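Figure 3 is also shown only as a screenshot. A sketch of the underlying dataframe manipulation might look as follows; it assumes two hypothetical dataframes runA and runB, each with columns (qid, ndcg) holding a per-query evaluation measure for the corresponding run (how such per-query measures are obtained is not shown here).

import org.apache.spark.sql.functions.{abs, col}

// Sketch only: rank queries by the difference in effectiveness between two
// runs. runA and runB are assumed dataframes with columns (qid, ndcg).
val comparison = runA.withColumnRenamed("ndcg", "ndcgA")
  .join(runB.withColumnRenamed("ndcg", "ndcgB"), "qid")
  .withColumn("delta", col("ndcgA") - col("ndcgB"))
  .orderBy(abs(col("delta")).desc)

// the queries whose effectiveness changed most between the two runs
comparison.show(10)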




                  Figure 4: Graphically displaying the per-query differences between different retrieval runs.