=Paper=
{{Paper
|id=Vol-2167/paper12
|storemode=property
|title=Agile Information Retrieval Experimentation with Terrier Notebooks
|pdfUrl=https://ceur-ws.org/Vol-2167/paper12.pdf
|volume=Vol-2167
|authors=Craig Macdonald,Richard McCreadie,Iadh Ounis
|dblpUrl=https://dblp.org/rec/conf/desires/MacdonaldMO18
}}
==Agile Information Retrieval Experimentation with Terrier Notebooks==
Agile Information Retrieval Experimentation with Terrier
Notebooks∗
Craig Macdonald, Richard McCreadie, Iadh Ounis
University of Glasgow
Glasgow, Scotland, UK
first.lastname@glasgow.ac.uk
ABSTRACT As with many computing science subjects, hands-on coding ex-
Teaching modern information retrieval is greatly benefited by giv- perience in IR is very beneficial for grasping the concepts being
ing students hands-on experience with an open-source search en- taught. However, commercial search engines are not publicly avail-
gine that they can experiment with. As such, open source platforms able for students to experiment with. For this reason, academic
such as Terrier are a valuable resource upon which learning exer- research groups have devoted resources to developing open source
cises can be built. However, experimentation using such systems search engines [14, 16] that can be used to support learning and
can be a laborious process when performed by hand; queries might teaching in IR, such as the Terrier IR platform [14], while keeping
be rewritten, executed, and model parameters tuned. Moreover, them up-to-date with state-of-the-art techniques.
the rise of learning-to-rank as the de-facto standard for state-of- However, while these platforms provide the core functionality of
the-art retrieval complicates this further, with the introduction of a search engine and are valuable for supporting hands-on exercises,
training, validation and testing (likely over multiple folded datasets significant time and effort is required for students to learn the
representing different query types). Currently, students resort to basics of experimenting with such a platform. Furthermore, as the
shell scripting to make experimentation easier, however this is far complexity of these platforms grow, the barriers to entry for using
from ideal. On the other hand, the introduction of experimental these platforms is increasing. For instance, the rise of learning-
pipelines in platforms like scikit-learn and Apache Spark in con- to-rank as the de-facto standard for state-of-the-art retrieval now
junction with notebook environments such as Jupyter have been requires students to understand and build command pipelines for
shown to markedly reduce to barriers to non-experts setting up and document and query feature extraction (along with configuration
running experiments. In this paper, we discuss how next genera- and possibly optimisation); dataset folding; and the subsequent the
tion information retrieval experimental pipelines can be combined training, validation and testing of the learned models. As a result,
in an agile manner using notebook-style interaction mechanisms. students either spend significant time manually running commands
Building upon the Terrier IR platform, we describe how this is or resort to shell scripting to make experimentation easier. In either
achieved using a recently released Terrier-Spark module and other case, this is an undesirable burden on the students that wastes
recent changes in Terrier 5.0. Overall, this paper demonstrates the valuable tuition time.
advantages of the agile nature of notebooks to experimental IR On the other hand, agile experimentation, particularly for non-
environments, from the classroom environment, through academic IR machine learning applications, is increasingly being facilitated
and industry research labs. by the use of experimental pipelines in toolkits like scikit-learn
and Apache Spark. These toolkits break-down the steps involved
1 INTRODUCTION in machine learning into small discrete operations, which can be
dynamically chained together - forming a reusable and customis-
Information retrieval (IR) is an important field in computing science, able pipeline. For example, in the scikit-learn toolkit from Python,
and is taught to undergraduate students at universities worldwide. each supervised technique exposes a fit() method for training,
IR is a dense topic to teach, as it encompasses 30 years of intensive and a transform() method for applying the trained model. Apache
research and development from academia and industry. Moreover, Spark’s MLib API similarly defines fit() and transform() meth-
IR is constantly evolving as a field, as new techniques are intro- ods for the same purpose. These experimental pipelines are very
duced, tested and adopted by commercial search engines. Indeed, powerful when combined with recent ‘notebook’ applications, such
supervised machine learning techniques [9] have systematically as Jupyter, which enable developers to store, edit and re-run por-
replaced traditional theoretically-founded term weighting models tions of their pipelines on-demand.
(e.g. BM25, language models) over the last decade. As such, taught In this paper, we argue that similar experimental pipelines are
IR courses need to be flexible in the face of rapid changes in the the next step in enhancing the teaching of IR in the classroom. In
broader field, so that students are prepared for the challenges they particular, experimental pipelines encapsulated as notebooks:
will face in industry.
(1) Provide an agile experimental platform that students/re-
∗ This is an extended version of a demonstration paper that was published at SIGIR searchers/practictioners can easily edit.
2018.
(2) Enable reproducibility in information retrieval experiments.
(3) Centralise configuration such that issues are more easily
DESIRES’18, 2018, Bertinoro, Italy identified.
© 2018 Copyright held by the author(s). (4) Allow for more complex examples released as pre-configured
notebooks.
DESIRES’18, 2018, Bertinoro, Italy Craig Macdonald, Richard McCreadie, Iadh Ounis
Furthermore, we discuss recent advances in the Terrier IR platform the IR concepts already presented in the lectures. Students also
with the Terrier-Spark module and additions to Terrier 5.0 that value the reinforcement of the empirical nature of IR as a field, and
enables experimental pipelines moving forward. the necessity of experimental evaluation.
The structure of this paper is as follows: Section 2 details recent However, students also note the difficulties in configuring Terrier
feedback in an empirical information retrieval course; Section 3 (long commandline incantations, and/or awkward editing of the
highlights the main requirements for an experimental IR platform; terrier.properties files). Indeed, not all students are familiar with
Section 4 summarises the current Terrier platform; Section 5 intro- commandline scripting technologies on their chosen operating sys-
duces Terrier-Spark and how to conduct IR experiment using it; tem. Some students contrasted this with other courses, such as our
Section 6 highlights other relecant changes in Terrier 5.0; Section 7 recent Text-as-Data course on text mining/classification/informa-
discusses the advantages of combining Terrier-Spark with Jupyter tion extraction, which uses Python notebooks, and lamented the
notebooks. Concluding remarks follow in Section 8. lack of a Juypter environment for Terrier.
Overall, from the feedback described above, it is clear that mov-
2 RECENT EXPERIENCES OF AN ing towards providing a notebook environment for performing IR
INFORMATION RETRIEVAL COURSE experiments would aid students. To this end, we have considered
USING TERRIER how to modernise the experimental environment for conducting ex-
periments using Terrier. The next few sections detail the underlying
Information retrieval has been taught in Glasgow as an undergrad-
essential requirements for an experimental IR platform (Section 3),
uate and postgraduate elective since the mid-1980s. Its current
the current Terrier status (Section 4), as well as how adapt Terrier to
incarnation consists of 20 hours of lectures, along with supplemen-
support a notebook paradigm leveraging Apache Spark (Section 5).
tary tutorials and laboratory sessions, allowing students to gain
hands-on experience in developing IR technologies and critically
3 IR PLATFORM REQUIREMENTS FOR
assessing their performance. To address this latter point, we set
students with two assessed exercises (aka courseworks), of approxi- CONDUCTING EMPIRICAL EXPERIMENTS
mately 20 hours in length total. These allow the student experience Below, we argue for, in our experience, the main attributes of an
with empirical, experimental IR, both from core concepts (TF.IDF, experimental IR platform. These are described in terms of required
document length normalisation) through to learning-to-rank. functionalities - in practice, there are non-functional requirements
such as running experiments efficiently on large corpora such as
Coursework 1 [8 hours]. Create an index of 50% of the .GOV test
ClueWeb09.
collection using Terrier; Perform retrieval for a variety of standard
weighting models, with and without query expansion, on three R1 Perform an “untrained” run for a weighting model over a
tasks of the TREC 2004 Web track (homepage finding, named-page set of query topics, retrieving and ranking results from an
finding, topic distillation). They are also asked to compare and con- index.
trast with a “simple” TF.IDF they have implemented themselves. R2 Evaluate a run over a set of topics, based on relevance labels.
Students then analyse the results, including per-topic analysis, pre- R3 Train the parameters of a run, which may require repetitive
cision recall graphs, etc. Overall, Coursework 1 is designed to famil- execution of queries from an index and evaluation.
iarise the students with the core workings of a (web) search engine, R4 Extract a run with multiple features that can be used as input
and performing IR experiments, as well as analysing their results, to a learning-to-rank technique.
critically analysing the attributes of different retrieval techniques, R5 Re-rank results based on multiple features and a pre-trained
and how these affect performance on different topic sets. learning-to-rank technique.
R1 concerns the ability of the IR system to be executed in an
Coursework 2 [12 hours]. Use a provided index for the .GOV
offline batch mode - to produce the results of a set of query topics.
test collection that contains field and positional information, along
Academic-based platforms such as Terrier, Indri [16], Galago [3]
with a number of standard Web features (PageRank, DFR prox-
offer such functionality out of the box. R2 concerns the provision
imity, BM25 on the title). The students are asked to implement
of evaluation tools that permit a run to be evaluated. Standard tools
two proximity features from a number described in [5], and com-
exist such as the C-based trec_eval library, but integration in the
bine these within a LambdaMART learning-to-rank model. Their
native language of the system may provide advantages for R3. Other
analysis must use techniques learned through Coursework 1. e.g.
systems such as Lucene/Solr/Elastic may need some scripting or
identifying which queries were most benefitted by further proxim-
external tools (Azzopardi et al. [1] highlight the lack of empirical
ity features, etc. Overall, this coursework allows student hands-on
tools for IR experimentation and teaching on Lucene, and have
experience with deploying a learning-to-rank pipeline (training/
made some inroads into addressing this need).
validation/testing and evaluation), as well as the notion of posi-
Indeed, R3 represents the early advent of machine learning into
tional information and posting list iterators necessary to implement
the IR platform, where gradient ascent/descent algorithms were
their additional proximity features.
used to optimise the parameters of systems by (relatively expensive)
Student feedback on the current courseworks. The positive feed- repeated querying and evaluation of different parameter settings.
back on the current coursework exercises is that these form an Effective techniques such as BM25F [20] & PL2F [10] were facilitated
effective vehicle for achieving the intended learning outcomes of by common use of such optimisation techniques.
the course. In particular, they encompass more than programming Finally, R4 & R5 are concerned with successful integration of
implementations, and that they force the students to understand learning-to-rank into the IR system. As with new technologies,
Agile Information Retrieval Experimentation with Terrier Notebooks DESIRES’18, 2018, Bertinoro, Italy
there can be a lag between research-fresh developments and how bin/trec_terrier.sh -r -Dtrec.topics=/path/to/topics \
they are bled into production-ready systems. Of these, for the pur- -Dtrec.model=BM25
poses of experimentation, R4 is the more important - the ability to bin/trec_terrier.sh -e -Dtrec.qrels=/path/to/qrels
efficiently extract multiple query dependent features has received
some coverage in the literature [11]. R5 is concerned with taking this Listing 1: A simple retrieval run and evaluation using Ter-
a stage further, and applying a learned model to re-rank the results. rier’s commandline interface - c.f. requirements R1 & R2.
In the following, we will describe how Terrier currently meets
requirements R1-R5 (Section 4), and how it can be adopted within a
Spark environment to meet these in a more agile fashion (Section 5). methods) from scikit-learn2 (Python machine learning library),
namely:
4 BACKGROUND ON TERRIER • DataFrame: a relation containing structured data.
Terrier [14] is a retrieval platform dating back to 2001 with an ex- • Transformer: an object that can transform a data instance
perimental focus. First released as open source in 2004, it has been from a DataFrame.
downloaded >50,000 times since. While Terrier portrays a Java API • Estimator: an algorithm that can be fitted to data in a DataFrame.
that allows extension and/or integration into a number of appli- The outcome of an Estimator can be a Transformer - for in-
cations, the typical execution of Terrier is based upon procedural stance, a machine-learned model obtained from an Estimator
command invocations from the commandline. Listing 1 provides will be a Transformer.
the commandline invocations necessary to fulfil requirements R1 & • Pipeline: A series of Transformer and Estimators chained
R2 using Terrier. All requirements R1-R5 listed above are supported together to create a workflow.
by the commandline. Moreover, the use of a rich commandline • Parameter: A configuration option for an Estimator.
scripting language (GNU Bash, for instance) permits infinite com- In our adaptation of Terrier to the Spark environment, Terrier-
binations of different configurations to be evaluated automatically. Spark, we have implemented a number of Estimators and Trans-
Moreover, with appropriate cluster management software, such formers. These allow the natural stages of an IR system to be com-
runs can be conducted efficiently in a distributed fashion. This bined in various ways, while also leveraging the existing supervised
commandline API is also the main methods that students learn to ML techniques within Spark to permit the learning of ranking mod-
interact with the IR system. els (e.g. Spark contains logistic regression, random forests, gradient
However, we have increasingly found that a commandline API boosted regression trees, but notably no support for listwise based
was not suited for all purposes. For instance, the chaining of the learning techniques such as LambdaMART [19], which are often
outcomes of between invocations requires complicated scripting. the most effective [2, 9]).
For instance, consider, for each fold of a 5-fold cross validation: Table 1 summarises the main components developed to sup-
training the b length normalisation parameter of BM25, saving port the integration of Terrier into Apache Spark, along with their
the optimal value, and using that for input to a learning-to-rank inputs, outputs and key parameters. In particular, QueryingTrans-
run, distributed among a cluster environment. Such an example former is the key Transformer, in that this internally invokes Terrier
would require creating tedious amounts of shell scripting, for little to retrieve the docids and scores of each retrieved document for
subsequent empirical benefit. In short, this paper argues that IR the queries in the input dataframe. As Terrier is written in Java,
experimentation has now reached the stage where we should not and Scala and Java both are JVM-based languages, Terrier can run
be limited by the confines of a shell-scripting environment. “in-process”. Furthermore, as discussed in Section 6 below, further
changes in Terrier 5.0 permit accessing indices on remotely hosted
5 TERRIER-SPARK Terrier servers.
To address the perceived limitations in the procedural commandline In the following, we provide examples of retrieval experimental
use of Terrier, we have developed a new experimental interface listings using Spark through Scala.
for the Terrier platform, building upon Apache Spark, and called
Terrier-Spark. Apache Spark is a fast and general engine for large- 5.1 Performing an untrained retrieval run
scale data processing. While Spark can be invoked in Java, Scala Listing 1 shows how a simple retrieval run can be made using
and Python, we focus on the Scala environment, which allows for Terrier’s commandline API. The location of the topics and qrels
code that is more succinct than the equivalent Java (for instance, files as well as the weighting model, are set on the commandline,
through the use of functional progamming constructs, and auto- although defaults could be set in a configuration file.
matic type inference). Spark allows relational algebra operations In contrast, Listing 2 shows how the exact same run might be
on dataframes (relations) to be easily expressed as function calls, achieved from Scala in a Spark environment. Once the topics files
which are then compiled to a query plan that is distributed and are loaded into a two-column dataframe (keyed by “qid", the topic
executed on machines within the cluster. number), these are transformed into a dataframe of result sets, ob-
Apache Spark borrows the notions of dataframes from Pandas1 tained from Terrier (keyed by “⟨qid,docno⟩”). Then a second trans-
(a Python data analysis library), and similarly the notion of machine former record the relevant and non-relevant documents within the
learning pipeline constructs and interfaces (e.g. fit and transform dataframe, by joining with the contents of the qrels file, before
evaluation.
1 http://pandas.pydata.org/ 2 http://scikit-learn.org/
DESIRES’18, 2018, Bertinoro, Italy Craig Macdonald, Richard McCreadie, Iadh Ounis
Component Inputs Output Parameters
QueryStringTransformer Queries Queries Lambda function to transform query
QueryingTransformer Queries docids, scores for each query number of docs; weighting model
FeaturedQueryingTransformer Queries docids, scores of each feature for each query + feature set
QrelTransformer results with docids results with docids and labels qrel file
NDCGEvaluator results with docids and labels Mean NDCG@K cutoff K
Table 1: Summary of the primary user-facing components available Terrier-Spark.
val props = Map("terrier.home" -> "/path/to/Terrier") //assuming various variables as per Listing 2.
TopicSource.configureTerrier(props) val pipeline = new Pipeline()
val topics = "/path/to/topics.401-450" .setStages(Array(queryTransform, qrelTransform))
val qrels = "/path/to/qrels.trec8"
val paramGrid = new ParamGridBuilder()
val topics = TopicSource.extractTRECTopics(topics) .addGrid(queryTransform.localTerrierProperties,
.toList.toDF("qid", "query") Array(Map["c"->"1"], Map["c"->"10"], Map["c"->"100"]))
.addGrid(queryTransform.sampleModel,
val queryTransform = new QueryingTransformer() Array("InL2", "PL2"))
.setTerrierProperties(props) .build()
.sampleModel.set("BM25")
val cv = new CrossValidator()
val r1 = queryTransform.transform(topics) .setEstimator(pipeline)
//r1 is a dataframe with results for queries in topics .setEvaluator(new NDCGEvaluator)
val qrelTransform = new QrelTransformer() .setEstimatorParamMaps(paramGrid)
.setQrelsFile(qrels) .setNumFolds(5)
val cvModel = cv.fit(topics)
val r2 = qrelTransform.transform(r1)
//r2 is a dataframe as r1, but also includes a label column Listing 3: Grid searching the weighting model and document
length normalisation c parameters using Spark’s CrossVal-
val meanNDCG = new NDCGEvaluator().evaluate(r2)
idator - c.f. requirement R3.
Listing 2: A retrieval run in Scala - c.f. requirements R1 & R2.
val queryTransform = new FeaturesQueryingTransformer()
.setTerrierProperties(props)
While clearly more verbose than the simpler commandline API, .setMaxResults(5000)
Listing 2 demonstrates equivalent functionality, and clearly high- .setRetrievalFeatures(List(
lights the needed data and the steps involved in the experiment. "WMODEL:BM25",
Moreover, the use of objects suitable to be built into a Spark pipeline "WMODEL:PL2",
offers the possibility to build and automate pipelines. As we show "DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier"))
below, this functionality permits the powerful features of a func- .setSampleModel("InL2")
val r1 = queryTransform.transform(topics)
tional language to allow more complex experimental pipelines.
//r1 is as per Listing 2, but now also has a column of 3
//feature values for each retrieved document
5.2 Training weighting models val qrelTransform = new QrelTransformer()
Listing 3 demonstrates the use of Spark’s Pipeline and CrossVal- .setQrelsFile(qrels)
idator components to create a pipeline that applies a grid-search to val r2 = qrelTransform.transform(r1)
determine the most effective weighting model and its corresponding
document length normalisation c parameter. Such a grid-search can //learn a Random Forest model
be parallelised across many Spark worker machines in a cluster. We val rf = new RandomForestRegressor()
.setLabelCol("label")
note that while grid-search is one possibility, it is feasible to consider
.setFeaturesCol("features")
use of a gradient descent algorithm to tune the c parameter. How- .setPredictionCol("newscore")
ever, at this stage we do not yet have a parallelised algorithm imple- rf.fit(r2)
mented that would make best use of a clustered Spark environment.
Listing 4: Training a Random Forests based learning-to-rank
5.3 Training learning-to-rank models model - c.f. requirements R4 & R5.
Finally, Listing 4 demonstrates the use of Spark’s in-built machine
learning Random Forest regression technique to learn a learning-to-
rank model. In this example, the initial ranking of documents is per- dependent features being calculated for the top 5000 ranked docu-
formed by the InL2 weighting model, with an additional three query ments for each query. Internally, this uses Terrier’s Fat framework
Agile Information Retrieval Experimentation with Terrier Notebooks DESIRES’18, 2018, Bertinoro, Italy
for implementing the efficient calculation of additional query depen- 6.2 Remote Querying
dent features [11]. The received random forests model can be triv- As discussed in Section 5 above, Terrier-Spark has initially been
ially applied to a further set of unseen topics (not shown). The result- designed for in-process querying. However, concurrent changes to
ing Scala code is markedly more comprehensible to the equivalent Terrier for version 5.0 have abstracted the notion of all retrieval
complex commandline invocations necessary for Terrier 4.2 [17]. access being within the current process or accessible from the
Moreover, we highlight the uniqueness of our offering - while other same machine. Indeed, a reference to an index may refer to an
platforms such as Solr and Elastic have Spark tools, none offer the index hosted on another server, and made accessible over a RESTful
ability to export a multi-feature representation suitable for conduct- interface. While this is a conventional facility offered by some other
ing learning-to-rank experiments within Spark (c.f. R4 & R5). search engine products (and made available through their Spark
Of course, the pipeline framework of Estimators and Transform- tools, such as for Elastic’s3 and Solr’s4 ), this offers a number of
ers is generic, and one can easily imagine further implementations advantages for teaching. Indeed, often IR test collection can be too
of both to increase the diversity of possible experiments: For in- large to provide as downloads - allowing a remote index accessible
stance, new Estimators for increased coverage of learning-to-rank over a (secured) RESTful HTTP connection would negate the need
techniques, such as LambdaMART [19]; Similarly, Transformers to provide students with the raw contents of the documents for
for adapting the query representation, for example by applying indexing. Moreover, unlike conventional Spark tools for Elastic and
query-log based expansions [7] or proximity-query rewriting such Solr, the results returned can have various features pre-calculated
as Sequential Dependence models [12]. Once a suitable Pipeline for applying and evaluating learning-to-rank models.
is configured, conducting experiments such as learning-to-rank To make this more concrete, consider the TREC Microblog track
feature ablations can be conducted in only a few lines of Scala. which used an “evaluation-as-a-service” methodology [8]. In this,
the evaluation track organisers provided a search API based upon
6 OTHER CHANGES TO TERRIER 5.0 Lucene, through which the collection can be accessed for complet-
We have also made a number of other changes to Terrier, which have ing the evaluation task. The advancements described here would
been incorporated into the recently released version 5.0, which aid allow a Terrier index to be provided for a particular TREC collection,
in the expanding the possible retrieval concepts that can be easily easily accessible through either the conventional Terrier comman-
implemented using Terrier-Spark, while increasing the flexibility dline tools, or through Terrier-Spark. A run in that track could
offered by the platform. then be crafted and submitted to TREC wholly within a Jupyter
notebook, facilitating easy run reproducibility.
6.1 Indri-esque matching query language
Terrier 5.0 implements a subset of the Indri/Galago query lan- 7 CONDUCTING IR EXPERIMENTS WITHIN
guage [3, Ch. 5], including complex operators such as #syn and A JUPYTER NOTEBOOK ENVIRONMENT
#uwN. In particular, Terrier 5.0 provides: Spark can be used in a number of manners: by compiling Scala
• #syn(t1 t2): groups a set of terms into a single term for the source files into executable (Jar) files, which are submitted to a Spark
purposes of matching. This can be used for implementing cluster, or through line-by-line execution in spark-shell (a Read-
query-time stemming. Eval-Print-Loop or REPL tool). However, each has its disadvantages:
• #uwN(t1 t2): counts the number of occurrences of t1 & t2 the former only permits slow development iterations through the
within unordered windows of size N . necessity to recompile at each iteration; on the other hand, the
• #1(t1 t2): counts the number of exact matches of the bigram REPL spark-shell environment does not easily record the developed
t1 & t2. code, nor allow parts of the code to be re-executed.
• #combine(t1 t2): allows the weighting of t1 & t2 to be ad- Instead, we note that the use of a Spark environment naturally
justed. fits with the use of Scala Jupyter notebooks56 . Jupyter is an open-
• #tag(NAME t1 t2): this allows to assign a name to a set of source web application that allows the creation and sharing of
terms, which can be then be formed as a set of features documents that contain code, equations, visualisations and narra-
by the later Fat layer. In doing so, such tagged expression tive text. Increasingly entire technical report documents, slides and
allows various sub-queries to form separate features during books are being written as Jupyter notebooks, due to the easy inte-
learning-to-rank. This functionality is not present in Indri gration of text, code and resulting analysis tables or visualisations.
or Galago. Jupyter notebooks are increasingly used to share the algorithms
For example, Metzler and Croft’s sequential dependence proxim- and analysis conducted in machine learning research papers, sig-
ity model [12] can be formulated using combinations of #uwN nificantly aiding reproducibility [15]. Indeed, in their report on the
and #owN query operators. Such a query rewriting technique can Daghstuhl workshop on reproducibility of IR experiments [6], Ferro,
be easily implemented within Terrier-Spark by applying a Query- Fuhr et al. note that sharing of code and experimental methods
StringTransformer that applies a lambda function upon each query, would aid reproducibility in IR, but do not recognise the ability of
allow users to build upon the new complex query operators imple- notebooks to aid in this process.
mented in Terrier 5.0. Figure 1 demonstrates applying sequential
dependence to a dataframe of queries, within a Jupyter notebook. 3 https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
This is achieved by instantiating a QueryStringTransformer upon 4 https://github.com/lucidworks/spark-solr 5 http://jupyter.org/ 6 We note that
a dataframe of queries, using a lambda function that appropriately Jupyter notebooks are extensible through plugins to Scala and other languages, i.e.
rewrites the queries. not limited to Python.
DESIRES’18, 2018, Bertinoro, Italy Craig Macdonald, Richard McCreadie, Iadh Ounis
Figure 1: Example of applying a Transformer to apply sequential dependency proximity to a dataframe of queries.
Jupyter notebooks are interactive in manner, in that a code block Terrier for two different retrieval models; Figure 3 shows analy-
in a single cell can be run independently of all other cells in the note- sis on the results, by ranking queries based on the difference of
book. As a result, Jupyter is also increasingly used for educational their evaluation performance between rankings; Finally, Figure 4
purposes - for example, teaching programming within undergradu- demonstrates the same information as a per-query analysis figure.
ate degree courses [4, 18], as well as a plethora of data science or
machine learning courses [15]. O’Hara et al. [13] described four uses
for notebooks in classroom situations, including lectures, flipped-
8 CONCLUSIONS
classrooms, home/lab work and exams. For instance, the use of In this paper, we have described the challenge of teaching a modern
notebooks within a lecturing situation easily permits the students undergraduate- and postgraduate-level elective course on infor-
to replicate the analysis demonstrated by the lecturer. mation retrieval. We highlight the main requirements of an ex-
We argue that these general advantages of notebooks can be ap- perimental IR platform, then further describe Terrier-Spark, an
plied to experimental information retrieval education, through the extension of Terrier IR platform to perform IR experimentation
use of a Spark-integrated IR platform, such as that described in this within the Spark distributed computing engine, which not only
paper. Indeed, we believe that the changes described in Sections 5 addresses these requirements, but can allow complex experiments
& 6 should address these these feedbacks, allow students to more to be easily defined within a few lines of Scala code. Terrier-Spark
easily configure the retrieval platform (all configuration of Terrier and Terrier 5.0 have been released as open source. In addition, we
is presented on the screen), make more powerful experimentation also argue that Jupyter notebooks for IR can aid not only agile IR
available to a wider and less experienced audience not wishing to experimentation by students, but also in research reproducibility in
engage in complicated shell-scripting. information retrieval by facilitating easily-distributable notebooks
We believe that Terrier-Spark can bring these same advantages that demonstrate the conducted experiments.
to conducting modern (e.g. learning-to-rank) IR experiments, com- Overall, we believe that notebooks are an important aspect of
bined with the accessible and agile nature of a notebook environ- data science, and that we as an IR community should not fall behind
ment. Moreover, an integrated Juypter environment also facilitates, other branches of data science in using notebooks for empirical
for instance in the IR teaching environment, the creation and pre- IR experimentation. The frameworks described here might be ex-
sentation of inline figures (e.g. created using the Scala vegas-viz tended to other languages (e.g. a Python wrapper for Terrier’s
library7 ), such as per-query analyses and interpolated precision- RESTful interface), or even to other IR platforms. In doing so, we
recall graphs. Figures 2 - 4 provide screenshots from such a note- bring more powerful and agile experimental IR tools into the hands
book89 In particular: Figure 2 demonstrates the querying of the of researchers and students alike. Terrier-Spark has been released
as open source, and is available from
7 https://github.com/vegas-viz/. 8 We use the Apache Toree
kernel for Jupyter, which allows notebooks written in Scala and https://github.com/terrier-org/terrier-spark
which automatically interfaces with Apache Spark. 9 The orig-
inal notebook can be found in the Terrier-Spark repository, see
https://github.com/terrier-org/terrier-spark/tree/master/example_notebooks/toree. along with example Jupyter notebooks.
Agile Information Retrieval Experimentation with Terrier Notebooks DESIRES’18, 2018, Bertinoro, Italy
Figure 2: Conducting two different retrieval runs within a Jupyter notebook using a function defined in Terrier-Spark.
REFERENCES normlisation and language specific stemming. In Proceedings of CLEF.
[1] Leif Azzopardi et al. 2017. Lucene4IR: Developing Information Retrieval Evalua- [11] Craig Macdonald, Rodrygo L.T. Santos, Iadh Ounis, and Ben He. 2013. About
tion Resources Using Lucene. SIGIR Forum 50, 2 (2017), 18. Learning Models with Multiple Query-dependent Features. ToIS 31, 3 (2013).
[12] Donald Metzler and W. Bruce Croft. 2005. A Markov random field model for
[2] Olivier Chapelle and Yi Chang. 2011. Yahoo! Learning to Rank Challenge
term dependencies. In Proceedings of SIGIR.
Overview. Proceedings of Machine Learning Research 14 (2011).
[13] Keith J. O’Hara, Douglas Blank, and James Marshall. 2015. Computational Note-
[3] Bruce Croft, Donald Metzler, and Trevor Strohman. 2009. Search Engines: In-
books for AI Education. In Proceedings of FLAIRS.
formation Retrieval in Practice (1st ed.). Addison-Wesley Publishing Company,
[14] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and
USA.
Christina Lioma. 2006. Terrier: A High Performance and Scalable Information
[4] Lynn Cullimore. 2016. Using Jupyter Notebooks to teach computational
Retrieval Platform. In Proceedings of OSIR.
literacy. (2016). http://www.elearning.eps.manchester.ac.uk/blog/2016/
[15] Fernando Perez and Brian E Granger. 2015. Project Jupyter: Computational
using-jupyter-notebooks-to-teach-computational-literacy/
narratives as the engine of collaborative data science. Technical Report. http:
[5] Ronan Cummins and Colm O’Riordan. 2009. Learning in a Pairwise Term-term
//archive.ipython.org/JupyterGrantNarrative-2015.pdf
Proximity Framework for Information Retrieval. In Proceedings of SIGIR.
[16] Trevor Strohman, Donald Metzler, Howard Turtle, and W Bruce Croft. 2005. Indri:
[6] Nicola Ferro, Norbert Fuhr, et al. 2016. Increasing Reproducibility in IR: Findings
A language model-based search engine for complex queries. In Proceedings of the
from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in
International Conference on Intelligent Analysis, Vol. 2. Citeseer, 2–6.
e-Science". SIGIR Forum 50, 1 (2016).
[17] Terrier.org. 2016. Learning to Rank with Terrier. (2016). http://terrier.org/docs/
[7] Rosie Jones, Benjamin Rey, Omid Madani, and Wiley Greiner. 2006. Generating
v4.2/learning.html
Query Substitutions. In Proceedings of WWW.
[18] John Williamson. 2017. CS1P: Running Jupyter from the command line. (2017).
[8] Jimmy Lin and Miles Efron. 2013. Evaluation As a Service for Information
https://www.youtube.com/watch?v=hqpuC0YLbpM
Retrieval. SIGIR Forum 47, 2 (Jan. 2013), 8–14.
[19] Qiang Wu, Chris J. C. Burges, Krysta M. Svore, and Jianfeng Gao. 2008. Ranking,
[9] Tie-Yan Liu. 2009. Learning to rank for information retrieval. Foundations and
Boosting, and Model Adaptation. Technical Report MSR-TR-2008-109. Microsoft.
Trends® in Information Retrieval 3, 3 (2009).
[20] Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robert-
[10] Craig Macdonald, Vassilis Plachouras, Ben He, Christina Lioma, and Iadh Ou-
son. 2004. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceed-
nis. 2006. University of Glasgow at WebCLEF 2005: Experiments in per-field
ings of TREC.
DESIRES’18, 2018, Bertinoro, Italy Craig Macdonald, Richard McCreadie, Iadh Ounis
Figure 3: Comparing the results of two different retrieval runs.
Figure 4: Graphically displaying the per-query differences between different retrieval runs.