<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Agile Information Retrieval Experimentation with Terrier Notebooks∗</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Craig Macdonald</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richard McCreadie</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Iadh Ounis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Glasgow</institution>
          ,
          <addr-line>Glasgow, Scotland</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <abstract>
        <p>Teaching modern information retrieval benefits greatly from giving students hands-on experience with an open-source search engine that they can experiment with. As such, open source platforms such as Terrier are a valuable resource upon which learning exercises can be built. However, experimentation using such systems can be a laborious process when performed by hand: queries might be rewritten, executed, and model parameters tuned. Moreover, the rise of learning-to-rank as the de-facto standard for state-of-the-art retrieval complicates this further, with the introduction of training, validation and testing (likely over multiple folded datasets representing different query types). Currently, students resort to shell scripting to make experimentation easier; however, this is far from ideal. On the other hand, the introduction of experimental pipelines in platforms like scikit-learn and Apache Spark, in conjunction with notebook environments such as Jupyter, has been shown to markedly reduce the barriers for non-experts setting up and running experiments. In this paper, we discuss how next-generation information retrieval experimental pipelines can be combined in an agile manner using notebook-style interaction mechanisms. Building upon the Terrier IR platform, we describe how this is achieved using the recently released Terrier-Spark module and other recent changes in Terrier 5.0. Overall, this paper demonstrates the advantages that the agile nature of notebooks brings to experimental IR environments, from the classroom through academic and industry research labs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        Information retrieval (IR) is an important field in computing science,
and is taught to undergraduate students at universities worldwide.
IR is a dense topic to teach, as it encompasses 30 years of intensive
research and development from academia and industry. Moreover,
IR is constantly evolving as a field, as new techniques are
introduced, tested and adopted by commercial search engines. Indeed,
supervised machine learning techniques [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] have systematically
replaced traditional theoretically-founded term weighting models
(e.g. BM25, language models) over the last decade. As such, taught
IR courses need to be flexible in the face of rapid changes in the
broader field, so that students are prepared for the challenges they
will face in industry.
      </p>
      <p>∗ This is an extended version of a demonstration paper that was published at SIGIR
2018.</p>
      <p>
        As with many computing science subjects, hands-on coding
experience in IR is very beneficial for grasping the concepts being
taught. However, commercial search engines are not publicly
available for students to experiment with. For this reason, academic
research groups have devoted resources to developing open source
search engines [
        <xref ref-type="bibr" rid="ref14 ref16">14, 16</xref>
        ] that can be used to support learning and
teaching in IR, such as the Terrier IR platform [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], while keeping
them up-to-date with state-of-the-art techniques.
      </p>
      <p>However, while these platforms provide the core functionality of
a search engine and are valuable for supporting hands-on exercises,
significant time and effort is required for students to learn the
basics of experimenting with such a platform. Furthermore, as the
complexity of these platforms grows, the barriers to entry for using
them are increasing. For instance, the rise of learning-to-rank as
the de-facto standard for state-of-the-art retrieval now requires
students to understand and build command pipelines for document
and query feature extraction (along with configuration and possibly
optimisation); dataset folding; and the subsequent training,
validation and testing of the learned models. As a result, students
either spend significant time manually running commands or resort
to shell scripting to make experimentation easier. In either case,
this is an undesirable burden on the students that wastes valuable
tuition time.</p>
      <p>On the other hand, agile experimentation, particularly for
non-IR machine learning applications, is increasingly being facilitated
by the use of experimental pipelines in toolkits like scikit-learn
and Apache Spark. These toolkits break down the steps involved
in machine learning into small discrete operations, which can be
dynamically chained together, forming a reusable and customisable
pipeline. For example, in the scikit-learn toolkit for Python,
each supervised technique exposes a fit() method for training,
and a transform() method for applying the trained model. Apache
Spark’s MLlib API similarly defines fit() and transform()
methods for the same purpose. These experimental pipelines are very
powerful when combined with recent ‘notebook’ applications, such
as Jupyter, which enable developers to store, edit and re-run
portions of their pipelines on demand.</p>
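      <p>To make this fit/transform pattern concrete, the following plain-Scala sketch mimics the contract using toy stand-in types of our own (they are deliberately simplified, and are not the real scikit-learn or Spark MLlib classes):

```scala
// Simplified stand-in types of our own (not the real Spark MLlib API),
// illustrating the fit()/transform() contract described in the text.
trait Model { def transform(data: Seq[Double]): Seq[Double] }
trait Estimator { def fit(train: Seq[Double]): Model }

// A toy Estimator: fit() learns the mean of the training data, and the
// resulting Model centres any new data on that learned mean.
object MeanCentrer extends Estimator {
  def fit(train: Seq[Double]): Model = {
    val mean = train.sum / train.size
    new Model {
      def transform(data: Seq[Double]): Seq[Double] = data.map(_ - mean)
    }
  }
}

object FitTransformDemo {
  def run(): Seq[Double] = {
    val model = MeanCentrer.fit(Seq(1.0, 2.0, 3.0)) // training phase
    model.transform(Seq(4.0, 5.0))                  // application phase
  }
}
```

As in scikit-learn and MLlib, the supervised stage is used in two distinct phases: fit() consumes training data and returns a fitted model, whose transform() can then be applied to unseen data.</p>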
      <p>In this paper, we argue that similar experimental pipelines are
the next step in enhancing the teaching of IR in the classroom. In
particular, experimental pipelines encapsulated as notebooks:
(1) Provide an agile experimental platform that students,
researchers and practitioners can easily edit.
(2) Enable reproducibility in information retrieval experiments.
(3) Centralise configuration, such that issues are more easily
identified.
(4) Allow more complex examples to be released as pre-configured
notebooks.
Furthermore, we discuss recent advances in the Terrier IR platform,
namely the Terrier-Spark module and additions to Terrier 5.0 that
enable experimental pipelines moving forward.</p>
      <p>The structure of this paper is as follows: Section 2 details recent
feedback on an empirical information retrieval course; Section 3
highlights the main requirements for an experimental IR platform;
Section 4 summarises the current Terrier platform; Section 5
introduces Terrier-Spark and how to conduct IR experiments using it;
Section 6 highlights other relevant changes in Terrier 5.0; Section 7
discusses the advantages of combining Terrier-Spark with Jupyter
notebooks. Concluding remarks follow in Section 8.</p>
    </sec>
    <sec id="sec-2">
      <title>RECENT EXPERIENCES OF AN</title>
    </sec>
    <sec id="sec-3">
      <title>INFORMATION RETRIEVAL COURSE</title>
    </sec>
    <sec id="sec-4">
      <title>USING TERRIER</title>
      <p>Information retrieval has been taught in Glasgow as an
undergraduate and postgraduate elective since the mid-1980s. Its current
incarnation consists of 20 hours of lectures, along with
supplementary tutorials and laboratory sessions, allowing students to gain
hands-on experience in developing IR technologies and critically
assessing their performance. To address this latter point, we set the
students two assessed exercises (aka courseworks), totalling
approximately 20 hours. These give the students experience
with empirical, experimental IR, from core concepts (TF.IDF,
document length normalisation) through to learning-to-rank.</p>
      <p>Coursework 1 [8 hours]. Create an index of 50% of the .GOV test
collection using Terrier; perform retrieval for a variety of standard
weighting models, with and without query expansion, on three
tasks of the TREC 2004 Web track (homepage finding, named-page
finding, topic distillation). Students are also asked to compare and
contrast with a “simple” TF.IDF they have implemented themselves.
Students then analyse the results, including per-topic analysis,
precision-recall graphs, etc. Overall, Coursework 1 is designed to
familiarise the students with the core workings of a (web) search engine,
with performing IR experiments and analysing their results, and with
critically analysing the attributes of different retrieval techniques
and how these affect performance on different topic sets.</p>
      <p>
        Coursework 2 [12 hours]. Use a provided index for the .GOV
test collection that contains field and positional information, along
with a number of standard Web features (PageRank, DFR
proximity, BM25 on the title). The students are asked to implement
two proximity features from a number described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and
combine these within a LambdaMART learning-to-rank model. Their
analysis must use techniques learned through Coursework 1, e.g.
identifying which queries benefited most from the further
proximity features. Overall, this coursework gives students hands-on
experience with deploying a learning-to-rank pipeline (training/
validation/testing and evaluation), as well as with the notion of
positional information and the posting list iterators necessary to
implement their additional proximity features.
      </p>
      <p>Student feedback on the current courseworks. The positive
feedback on the current coursework exercises is that they form an
effective vehicle for achieving the intended learning outcomes of
the course. In particular, they encompass more than programming
implementations, and force the students to understand the IR
concepts presented in the lectures. Students also value the
reinforcement of the empirical nature of IR as a field, and the
necessity of experimental evaluation.</p>
      <p>However, students also note the difficulties in configuring Terrier
(long commandline incantations, and/or awkward editing of the
terrier.properties file). Indeed, not all students are familiar with
commandline scripting technologies on their chosen operating
system. Some students contrasted this with other courses, such as our
recent Text-as-Data course on text
mining/classification/information extraction, which uses Python notebooks, and lamented the
lack of a Jupyter environment for Terrier.</p>
      <p>Overall, from the feedback described above, it is clear that
moving towards providing a notebook environment for performing IR
experiments would aid students. To this end, we have considered
how to modernise the experimental environment for conducting
experiments using Terrier. The next few sections detail the underlying
essential requirements for an experimental IR platform (Section 3),
the current Terrier status (Section 4), and how to adapt Terrier to
support a notebook paradigm leveraging Apache Spark (Section 5).</p>
    </sec>
    <sec id="sec-5">
      <title>IR PLATFORM REQUIREMENTS FOR</title>
    </sec>
    <sec id="sec-6">
      <title>CONDUCTING EMPIRICAL EXPERIMENTS</title>
      <p>Below, we set out what, in our experience, are the main attributes of an
experimental IR platform. These are described in terms of required
functionalities - in practice, there are also non-functional requirements,
such as running experiments efficiently on large corpora such as
ClueWeb09.</p>
      <p>R1 Perform an “untrained” run for a weighting model over a
set of query topics, retrieving and ranking results from an
index.</p>
      <p>R2 Evaluate a run over a set of topics, based on relevance labels.</p>
      <p>R3 Train the parameters of a run, which may require repetitive
execution of queries from an index and evaluation.</p>
      <p>R4 Extract a run with multiple features that can be used as input
to a learning-to-rank technique.</p>
      <p>R5 Re-rank results based on multiple features and a pre-trained
learning-to-rank technique.</p>
      <p>
        R1 concerns the ability of the IR system to be executed in an
offline batch mode, to produce results for a set of query topics.
Academic platforms such as Terrier, Indri [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Galago [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]
offer such functionality out of the box. R2 concerns the provision
of evaluation tools that permit a run to be evaluated. Standard tools
exist, such as the C-based trec_eval library, but integration in the
native language of the system may provide advantages for R3. Other
systems such as Lucene/Solr/Elastic may need some scripting or
external tools (Azzopardi et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] highlight the lack of empirical
tools for IR experimentation and teaching on Lucene, and have
made some inroads into addressing this need).
      </p>
      <p>
        Indeed, R3 represents the early advent of machine learning into
the IR platform, where gradient ascent/descent algorithms were
used to optimise the parameters of systems by (relatively expensive)
repeated querying and evaluation of different parameter settings.
Effective techniques such as BM25F [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ] &amp; PL2F [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] were facilitated
by common use of such optimisation techniques.
      </p>
      <p>
        Finally, R4 &amp; R5 are concerned with the successful integration of
learning-to-rank into the IR system. As with any new technology,
there can be a lag between research-fresh developments and their
absorption into production-ready systems. Of these, for the
purposes of experimentation, R4 is the more important - the ability to
efficiently extract multiple query dependent features has received
some coverage in the literature [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. R5 is concerned with taking this
a stage further, and applying a learned model to re-rank the results.
      </p>
      <p>In the following, we describe how Terrier currently meets
requirements R1-R5 (Section 4), and how it can be adapted to a
Spark environment to meet them in a more agile fashion (Section 5).</p>
    </sec>
    <sec id="sec-7">
      <title>BACKGROUND ON TERRIER</title>
      <p>
        Terrier [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a retrieval platform dating back to 2001, with an
experimental focus. First released as open source in 2004, it has been
downloaded &gt;50,000 times since. While Terrier provides a Java API
that allows extension and/or integration into a number of
applications, the typical execution of Terrier is based upon procedural
command invocations from the commandline. Listing 1 provides
the commandline invocations necessary to fulfil requirements R1 &amp;
R2 using Terrier. All of the requirements R1-R5 listed above are supported
by the commandline. Moreover, the use of a rich commandline
scripting language (GNU Bash, for instance) permits infinite
combinations of different configurations to be evaluated automatically.
Furthermore, with appropriate cluster management software, such
runs can be conducted efficiently in a distributed fashion. This
commandline API is also the main method by which students learn to
interact with the IR system.
      </p>
      <p>However, we have increasingly found that a commandline API
is not suited for all purposes. For instance, chaining the
outcomes of successive invocations requires complicated scripting.
Consider, for each fold of a 5-fold cross validation:
training the b length normalisation parameter of BM25, saving
the optimal value, and using that as input to a learning-to-rank
run, distributed among a cluster environment. Such an example
would require a tedious amount of shell scripting, for little
subsequent empirical benefit. In short, this paper argues that IR
experimentation has now reached the stage where we should not
be limited by the confines of a shell-scripting environment.</p>
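      <p>To illustrate the kind of logic involved, the sketch below expresses the per-fold workflow just described in a few lines of plain Scala; here evaluate() is a stub of our own (with an arbitrary optimum at b = 0.75), standing in for the expensive step of executing the fold’s queries and computing the evaluation measure:

```scala
// Plain-Scala sketch of the per-fold workflow described above: split the
// query ids into k folds, then pick the best-scoring b value per fold.
object FoldedTuning {
  // Assign each query id to one of k folds.
  def folds(qids: Seq[Int], k: Int): Map[Int, Seq[Int]] =
    qids.groupBy(_ % k)

  // Stub of our own: a real implementation would run the fold's queries
  // with BM25 at this b value and return e.g. MAP or NDCG. The optimum
  // at 0.75 here is arbitrary, purely for illustration.
  def evaluate(fold: Seq[Int], b: Double): Double =
    1.0 - math.abs(b - 0.75)

  // Grid-search the candidate b values on one fold.
  def bestB(fold: Seq[Int], candidates: Seq[Double]): Double = {
    def score(b: Double): Double = evaluate(fold, b)
    candidates.maxBy(score)
  }
}
```

Expressed this way, the per-fold optimal b values can be passed directly into a subsequent learning-to-rank stage, rather than being shuttled between processes via files and shell variables.</p>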
    </sec>
    <sec id="sec-8">
      <title>TERRIER-SPARK</title>
      <p>To address the perceived limitations in the procedural commandline
use of Terrier, we have developed a new experimental interface
for the Terrier platform, building upon Apache Spark, called
Terrier-Spark. Apache Spark is a fast and general engine for
large-scale data processing. While Spark can be invoked from Java, Scala
and Python, we focus on the Scala environment, which allows for
code that is more succinct than the equivalent Java (for instance,
through the use of functional programming constructs and
automatic type inference). Spark allows relational algebra operations
on dataframes (relations) to be easily expressed as function calls,
which are then compiled to a query plan that is distributed and
executed on machines within the cluster.</p>
      <p>Apache Spark borrows the notion of the dataframe from Pandas1
(a Python data analysis library), and similarly the notion of machine
learning pipeline constructs and interfaces (e.g. fit and transform
methods) from scikit-learn2 (a Python machine learning library),
namely:
• DataFrame: a relation containing structured data.
• Transformer: an object that can transform a data instance
from a DataFrame.
• Estimator: an algorithm that can be fitted to data in a DataFrame.
The outcome of an Estimator can be a Transformer - for
instance, a machine-learned model obtained from an Estimator
will be a Transformer.
• Pipeline: a series of Transformers and Estimators chained
together to create a workflow.
• Parameter: a configuration option for an Estimator.</p>
      <sec id="sec-8-1">
        <p>1 http://pandas.pydata.org/</p>
        <p>bin/trec_terrier.sh -r -Dtrec.topics=/path/to/topics \
-Dtrec.model=BM25
bin/trec_terrier.sh -e -Dtrec.qrels=/path/to/qrels
Listing 1: A simple retrieval run and evaluation using
Terrier’s commandline interface - c.f. requirements R1 &amp; R2.</p>
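        <p>The Pipeline abstraction above can be sketched in a few lines of plain Scala (again using simplified stand-in types of our own, not the real Spark API): a Pipeline is simply the left-to-right composition of its stages.

```scala
// Simplified stand-ins of our own (not the real Spark API) showing how a
// Pipeline chains stages: each stage transforms the output of the previous.
trait Stage { def transform(rows: Seq[String]): Seq[String] }

// Two toy stages: lower-case every row, then drop empty rows.
object Lowercase extends Stage {
  def transform(rows: Seq[String]): Seq[String] = rows.map(_.toLowerCase)
}
object DropEmpty extends Stage {
  def transform(rows: Seq[String]): Seq[String] = rows.filter(_.nonEmpty)
}

// A Pipeline is the left-to-right composition of its stages.
class ToyPipeline(stages: Seq[Stage]) {
  private def step(rows: Seq[String], s: Stage): Seq[String] = s.transform(rows)
  def transform(rows: Seq[String]): Seq[String] = stages.foldLeft(rows)(step)
}

object ToyPipelineDemo {
  def run(): Seq[String] =
    new ToyPipeline(Seq(Lowercase, DropEmpty))
      .transform(Seq("Information", "", "Retrieval"))
}
```

In Spark proper, an Estimator stage is first fitted, and the fitted Transformer takes its place in the pipeline; the composition principle is the same.</p>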
        <p>
          In our adaptation of Terrier to the Spark environment,
Terrier-Spark, we have implemented a number of Estimators and
Transformers. These allow the natural stages of an IR system to be
combined in various ways, while also leveraging the existing supervised
ML techniques within Spark to permit the learning of ranking
models (e.g. Spark contains logistic regression, random forests and
gradient boosted regression trees, but notably no support for listwise
learning techniques such as LambdaMART [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ], which are often
the most effective [
          <xref ref-type="bibr" rid="ref2 ref9">2, 9</xref>
          ]).
        </p>
        <p>Table 1 summarises the main components developed to
support the integration of Terrier into Apache Spark, along with their
inputs, outputs and key parameters. In particular,
QueryingTransformer is the key Transformer, in that it internally invokes Terrier
to retrieve the docids and scores of each retrieved document for
the queries in the input dataframe. As Terrier is written in Java,
and Scala and Java are both JVM-based languages, Terrier can run
“in-process”. Furthermore, as discussed in Section 6 below, further
changes in Terrier 5.0 permit accessing indices on remotely hosted
Terrier servers.</p>
        <p>In the following, we provide example listings of retrieval
experiments using Spark through Scala.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>Performing an untrained retrieval run</title>
      <p>Listing 1 shows how a simple retrieval run can be made using
Terrier’s commandline API. The locations of the topics and qrels
files, as well as the weighting model, are set on the commandline,
although defaults could be set in a configuration file.</p>
      <p>In contrast, Listing 2 shows how the exact same run might be
achieved from Scala in a Spark environment. Once the topics files
are loaded into a two-column dataframe (keyed by “qid”, the topic
number), these are transformed into a dataframe of result sets
obtained from Terrier (keyed by “⟨qid,docno⟩”). Then a second
transformer records the relevant and non-relevant documents within the
dataframe, by joining with the contents of the qrels file, before
evaluation.</p>
      <sec id="sec-9-1">
        <p>2 http://scikit-learn.org/</p>
        <p>Table 1: The main Terrier-Spark components, with their inputs, outputs and key parameters.
QueryStringTransformer: input queries; output queries; parameter: lambda function to transform each query.
QueryingTransformer: input queries; output docids and scores for each query; parameters: number of docs, weighting model.
FeaturedQueryingTransformer: input queries; output docids and scores of each feature for each query; parameters: number of docs, weighting model, feature set.
QrelTransformer: input results with docids; output results with docids and labels; parameter: qrel file.
NDCGEvaluator: input results with docids and labels; output mean NDCG@K; parameter: cutoff K.</p>
        <p>val topics = ... //a two-column dataframe of queries, keyed by qid
val queryTransform = new QueryingTransformer()
  .setTerrierProperties(props)
  .setSampleModel("BM25")
val r1 = queryTransform.transform(topics)
//r1 is a dataframe of results, keyed by (qid, docno)
val qrelTransform = new QrelTransformer()
  .setQrelsFile(qrels)
val r2 = qrelTransform.transform(r1)
//r2 is a dataframe as r1, but also includes a label column
val meanNDCG = new NDCGEvaluator().evaluate(r2)</p>
        <sec id="sec-9-1-1">
          <title>Listing 2: A retrieval run in Scala - c.f. requirements R1 &amp; R2.</title>
          <p>While clearly more verbose than the simpler commandline API,
Listing 2 demonstrates equivalent functionality, and clearly
highlights the needed data and the steps involved in the experiment.
Moreover, the use of objects suitable to be built into a Spark pipeline
offers the possibility to build and automate pipelines. As we show
below, this functionality permits the powerful features of a
functional language to express more complex experimental pipelines.</p>
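          <p>For reference, the measure reported by the NDCGEvaluator in Listing 2 can be computed for a single query as follows; this is a self-contained helper of our own, assuming graded relevance labels, the usual 2^rel - 1 gain and log2 rank discount (the evaluator’s exact implementation may differ in such details):

```scala
// Our own self-contained NDCG@K helper, for one query: labels holds the
// relevance labels of the retrieved documents, in rank order.
object Ndcg {
  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // Discounted gain of one (label, rank-index) pair.
  private def gain(p: (Double, Int)): Double =
    (math.pow(2.0, p._1) - 1.0) / log2(p._2 + 2.0)

  def dcg(labels: Seq[Double], k: Int): Double =
    labels.take(k).zipWithIndex.map(gain).sum

  def ndcgAtK(labels: Seq[Double], k: Int): Double = {
    val ideal = dcg(labels.sorted.reverse, k) // best possible ordering
    if (ideal == 0.0) 0.0 else dcg(labels, k) / ideal
  }
}
```

The mean of ndcgAtK over all the query topics in the dataframe then gives the single figure returned by evaluate().</p>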
        </sec>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Training weighting models</title>
      <p>Listing 3 demonstrates the use of Spark’s Pipeline and
CrossValidator components to create a pipeline that applies a grid-search to
determine the most effective weighting model and its corresponding
document length normalisation c parameter. Such a grid-search can
be parallelised across many Spark worker machines in a cluster. We
note that while grid-search is one possibility, it is feasible to consider
the use of a gradient descent algorithm to tune the c parameter.
However, at this stage we do not yet have a parallelised implementation
that would make best use of a clustered Spark environment.</p>
    </sec>
    <sec id="sec-11">
      <title>Training learning-to-rank models</title>
      <p>Finally, Listing 4 demonstrates the use of Spark’s in-built
Random Forest regression technique to learn a learning-to-rank
model. In this example, the initial ranking of documents is
performed by the InL2 weighting model, with an additional three query</p>
      <p>//assuming various variables as per Listing 2.
val pipeline = new Pipeline()
  .setStages(Array(queryTransform, qrelTransform))
val paramGrid = new ParamGridBuilder()
  .addGrid(queryTransform.localTerrierProperties,
    Array(Map("c"-&gt;"1"), Map("c"-&gt;"10"), Map("c"-&gt;"100")))
  .addGrid(queryTransform.sampleModel,
    Array("InL2", "PL2"))
  .build()
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new NDCGEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)
val cvModel = cv.fit(topics)
Listing 3: Grid searching the weighting model and document
length normalisation c parameters using Spark’s
CrossValidator - c.f. requirement R3.</p>
      <p>val queryTransform = new FeaturesQueryingTransformer()
  .setTerrierProperties(props)
  .setMaxResults(5000)
  .setRetrievalFeatures(List(
    "WMODEL:BM25",
    "WMODEL:PL2",
    "DSM:org.terrier.matching.dsms.DFRDependenceScoreModifier"))
  .setSampleModel("InL2")
val r1 = queryTransform.transform(topics)
//r1 is as per Listing 2, but now also has a column of 3
//feature values for each retrieved document
val qrelTransform = new QrelTransformer()
  .setQrelsFile(qrels)
val r2 = qrelTransform.transform(r1)
//learn a Random Forest model
val rf = new RandomForestRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setPredictionCol("newscore")
rf.fit(r2)</p>
      <sec id="sec-11-1">
        <title>Listing 4: Training a Random Forests based learning-to-rank model - c.f. requirements R4 &amp; R5.</title>
        <p>
          dependent features being calculated for the top 5000 ranked
documents for each query. Internally, this uses Terrier’s Fat framework
for the efficient calculation of additional query
dependent features [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. The learned random forests model can be
trivially applied to a further set of unseen topics (not shown). The
resulting Scala code is markedly more comprehensible than the equivalent
complex commandline invocations necessary for Terrier 4.2 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ].
Moreover, we highlight the uniqueness of our offering - while other
platforms such as Solr and Elastic have Spark tools, none offer the
ability to export a multi-feature representation suitable for
conducting learning-to-rank experiments within Spark (c.f. R4 &amp; R5).
        </p>
        <p>
          Of course, the pipeline framework of Estimators and
Transformers is generic, and one can easily imagine further implementations
of both to increase the diversity of possible experiments: for
instance, new Estimators for increased coverage of learning-to-rank
techniques, such as LambdaMART [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]; similarly, Transformers
for adapting the query representation, for example by applying
query-log based expansions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] or proximity-query rewriting such
as sequential dependence models [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Once a suitable Pipeline
is configured, experiments such as learning-to-rank
feature ablations can be conducted in only a few lines of Scala.
        </p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>OTHER CHANGES TO TERRIER 5.0</title>
      <p>We have also made a number of other changes to Terrier,
incorporated into the recently released version 5.0, which expand
the range of retrieval concepts that can be easily
implemented using Terrier-Spark, while increasing the flexibility
offered by the platform.</p>
    </sec>
    <sec id="sec-13">
      <title>Indri-esque matching query language</title>
      <p>Terrier 5.0 implements a subset of the Indri/Galago query
language [3, Ch. 5], including complex operators such as #syn and
#uwN. In particular, Terrier 5.0 provides:
• #syn(t1 t2): groups a set of terms into a single term for the
purposes of matching. This can be used for implementing
query-time stemming.
• #uwN(t1 t2): counts the number of occurrences of t1 &amp; t2
within unordered windows of size N.
• #1(t1 t2): counts the number of exact matches of the bigram
t1 &amp; t2.
• #combine(t1 t2): allows the weighting of t1 &amp; t2 to be
adjusted.
• #tag(NAME t1 t2): assigns a name to a set of terms, which
can then be formed into a feature by the later Fat layer.
In doing so, such tagged expressions allow various sub-queries
to form separate features during learning-to-rank. This
functionality is not present in Indri or Galago.</p>
      <p>
        For example, Metzler and Croft’s sequential dependence
proximity model [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] can be formulated using combinations of the #uwN
and #1 query operators. Such a query rewriting technique can
be easily implemented within Terrier-Spark by applying a
QueryStringTransformer with a lambda function upon each query,
allowing users to build upon the new complex query operators
implemented in Terrier 5.0. Figure 1 demonstrates applying sequential
dependence to a dataframe of queries, within a Jupyter notebook.
This is achieved by instantiating a QueryStringTransformer upon
a dataframe of queries, using a lambda function that appropriately
rewrites the queries.
      </p>
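      <p>As a concrete illustration, a rewrite function of the kind one might pass to such a QueryStringTransformer can be written as below; SeqDep.rewrite is our own illustrative helper, which augments a query with the #1 (exact bigram) and #uw8 (unordered window) operators in the spirit of the sequential dependence model, omitting the #combine weights for brevity:

```scala
// An illustrative helper of our own: rewrite a query by adding exact-match
// bigram (#1) and unordered-window (#uw8) operators for adjacent terms,
// in the spirit of the sequential dependence model.
object SeqDep {
  private def exact(pair: Seq[String]): String =
    "#1(" + pair.mkString(" ") + ")"
  private def window(pair: Seq[String]): String =
    "#uw8(" + pair.mkString(" ") + ")"

  def rewrite(query: String): String = {
    val terms = query.split(" ").toSeq.filter(_.nonEmpty)
    if (terms.size == 1) query
    else {
      val pairs = terms.sliding(2).toSeq // adjacent term pairs
      (terms ++ pairs.map(exact) ++ pairs.map(window)).mkString(" ")
    }
  }
}
```

For example, rewrite("world cup") yields "world cup #1(world cup) #uw8(world cup)".</p>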
    </sec>
    <sec id="sec-14">
      <title>Remote Querying</title>
      <p>As discussed in Section 5 above, Terrier-Spark has initially been
designed for in-process querying. However, concurrent changes to
Terrier for version 5.0 have abstracted away the assumption that all
retrieval access is within the current process or accessible from the
same machine. Indeed, a reference to an index may refer to an
index hosted on another server, and made accessible over a RESTful
interface. While this is a conventional facility offered by some other
search engine products (and made available through their Spark
tools, such as Elastic’s3 and Solr’s4), it offers a number of
advantages for teaching. Indeed, IR test collections can often be too
large to provide as downloads - making a remote index accessible
over a (secured) RESTful HTTP connection negates the need
to provide students with the raw contents of the documents for
indexing. Moreover, unlike conventional Spark tools for Elastic and
Solr, the results returned can have various features pre-calculated
for applying and evaluating learning-to-rank models.</p>
      <p>
        To make this more concrete, consider the TREC Microblog track,
which used an “evaluation-as-a-service” methodology [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In this,
the track organisers provided a search API based upon
Lucene, through which the collection could be accessed for
completing the evaluation task. The advancements described here would
allow a Terrier index to be provided for a particular TREC collection,
easily accessible through either the conventional Terrier
commandline tools or through Terrier-Spark. A run in that track could
then be crafted and submitted to TREC wholly within a Jupyter
notebook, facilitating easy run reproducibility.
      </p>
    </sec>
    <sec id="sec-15">
      <title>CONDUCTING IR EXPERIMENTS WITHIN</title>
    </sec>
    <sec id="sec-16">
      <title>A JUPYTER NOTEBOOK ENVIRONMENT</title>
      <p>Spark can be used in a number of ways: by compiling Scala
source files into executable (Jar) files, which are submitted to a Spark
cluster, or through line-by-line execution in spark-shell (a
Read-Eval-Print-Loop, or REPL, tool). However, each has its disadvantages:
the former only permits slow development iterations, through the
necessity to recompile at each iteration; on the other hand, the
REPL spark-shell environment does not easily record the developed
code, nor allow parts of the code to be re-executed.</p>
      <p>Instead, we note that the use of a Spark environment naturally
fits with the use of Scala Jupyter notebooks5 6. Jupyter is an
open-source web application that allows the creation and sharing of
documents that contain code, equations, visualisations and
narrative text. Increasingly, entire technical reports, slides and
books are being written as Jupyter notebooks, due to the easy
integration of text, code and resulting analysis tables or visualisations.</p>
      <p>
        Jupyter notebooks are increasingly used to share the algorithms
and analysis conducted in machine learning research papers,
significantly aiding reproducibility [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Indeed, in their report on the
Dagstuhl workshop on reproducibility of IR experiments [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Ferro,
Fuhr et al. note that sharing of code and experimental methods
would aid reproducibility in IR, but they do not recognise the ability of
notebooks to aid in this process.
3 https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
4 https://github.com/lucidworks/spark-solr
5 http://jupyter.org/
6 We note that Jupyter notebooks are extensible through plugins to Scala and other languages, i.e. they are not limited to Python.
      </p>
      <p>
        Jupyter notebooks are interactive in nature, in that a code block
in a single cell can be run independently of all other cells in the
notebook. As a result, Jupyter is also increasingly used for educational
purposes, for example, teaching programming within
undergraduate degree courses [
        <xref ref-type="bibr" rid="ref18 ref4">4, 18</xref>
        ], as well as a plethora of data science or
machine learning courses [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. O’Hara et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] described four uses
for notebooks in classroom situations: lectures,
flipped classrooms, home/lab work, and exams. For instance, the use of
notebooks within a lecture easily permits the students
to replicate the analysis demonstrated by the lecturer.
      </p>
      <p>We argue that these general advantages of notebooks can be
applied to experimental information retrieval education, through the
use of a Spark-integrated IR platform, such as that described in this
paper. Indeed, we believe that the changes described in Sections 5
&amp; 6 address this feedback, allowing students to more
easily configure the retrieval platform (all configuration of Terrier
is presented on the screen), and making more powerful experimentation
available to a wider and less experienced audience not wishing to
engage in complicated shell scripting.</p>
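      <p>To make this concrete, the following is a minimal sketch of how Terrier configuration might be set directly within a notebook cell, so that every setting is recorded alongside the code and its results. The property names are standard Terrier properties, but the index path is hypothetical, and the Terrier core jar is assumed to be on the notebook's classpath:</p>

```scala
// Set Terrier configuration directly from a notebook cell, rather than
// editing a terrier.properties file: every setting is then visible in
// the notebook itself, next to the code that uses it.
import org.terrier.utility.ApplicationSetup

// Hypothetical index location; termpipelines controls the stopword
// removal and stemming applied at indexing and retrieval time.
ApplicationSetup.setProperty("terrier.index.path", "/path/to/index")
ApplicationSetup.setProperty("termpipelines", "Stopwords,PorterStemmer")
```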
      <p>We believe that Terrier-Spark can bring these same advantages
to conducting modern (e.g. learning-to-rank) IR experiments,
combined with the accessible and agile nature of a notebook
environment. Moreover, an integrated Jupyter environment also facilitates,
for instance in the IR teaching environment, the creation and
presentation of inline figures (e.g. created using the Scala vegas-viz
library7), such as per-query analyses and interpolated
precision-recall graphs. Figures 2-4 provide screenshots from such a
notebook8,9. In particular: Figure 2 demonstrates the querying of
Terrier for two different retrieval models; Figure 3 shows
analysis of the results, ranking queries by the difference in
their evaluation performance between the two rankings; finally, Figure 4
presents the same information as a per-query analysis figure.
7 https://github.com/vegas-viz/
8 We use the Apache Toree kernel for Jupyter, which allows notebooks to be written in Scala and which automatically interfaces with Apache Spark.
9 The original notebook can be found in the Terrier-Spark repository, see
https://github.com/terrier-org/terrier-spark/tree/master/example_notebooks/toree.</p>
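      <p>The per-query analysis underlying Figure 3 amounts to only a few lines of notebook Scala. The sketch below is self-contained for illustration: the per-query average precision values are hypothetical stand-ins for the scores that the two retrieval runs would produce:</p>

```scala
// Rank queries by the difference in their per-query effectiveness
// between two retrieval models -- the analysis plotted in Figure 3.

// Hypothetical per-query average precision scores for two models.
val apModelA = Map("q1" -> 0.42, "q2" -> 0.10, "q3" -> 0.55)
val apModelB = Map("q1" -> 0.30, "q2" -> 0.25, "q3" -> 0.55)

// Join on query id, compute the delta, and sort by largest absolute
// difference, surfacing the queries the two models disagree on most.
val deltas = apModelA.keys.toSeq
  .map(qid => (qid, apModelA(qid) - apModelB(qid)))
  .sortBy { case (_, d) => -math.abs(d) }

deltas.foreach { case (qid, d) => println(f"$qid%-4s $d%+.2f") }
```

Such a cell can be re-run after any change to either retrieval pipeline, and its output fed directly to a plotting library such as vegas-viz.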
    </sec>
    <sec id="sec-17">
      <title>CONCLUSIONS</title>
      <p>In this paper, we have described the challenge of teaching a modern
undergraduate- and postgraduate-level elective course on
information retrieval. We highlighted the main requirements of an
experimental IR platform, and then described Terrier-Spark, an
extension of the Terrier IR platform that performs IR experimentation
within the Spark distributed computing engine, which not only
addresses these requirements, but also allows complex experiments
to be easily defined within a few lines of Scala code. Terrier-Spark
and Terrier 5.0 have been released as open source. In addition, we
argue that Jupyter notebooks for IR can aid not only agile IR
experimentation by students, but also research reproducibility in
information retrieval, by facilitating easily-distributable notebooks
that demonstrate the conducted experiments.</p>
      <p>Overall, we believe that notebooks are an important aspect of
data science, and that we as an IR community should not fall behind
other branches of data science in using notebooks for empirical
IR experimentation. The frameworks described here might be
extended to other languages (e.g. a Python wrapper for Terrier’s
RESTful interface), or even to other IR platforms. In doing so, we
bring more powerful and agile experimental IR tools into the hands
of researchers and students alike. Terrier-Spark has been released
as open source, and is available from</p>
      <p>https://github.com/terrier-org/terrier-spark
along with example Jupyter notebooks.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Leif</given-names>
            <surname>Azzopardi</surname>
          </string-name>
          et al.
          <year>2017</year>
          .
          <article-title>Lucene4IR: Developing Information Retrieval Evaluation Resources Using Lucene</article-title>
          .
          <source>SIGIR Forum 50</source>
          ,
          <issue>2</issue>
          (
          <year>2017</year>
          ),
          <fpage>18</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Chapelle</surname>
          </string-name>
          and
          <string-name>
            <given-names>Yi</given-names>
            <surname>Chang</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Yahoo! Learning to Rank Challenge Overview</article-title>
          .
          <source>Proceedings of Machine Learning Research</source>
          <volume>14</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          , Donald Metzler, and
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Strohman</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Search Engines: Information Retrieval in Practice</article-title>
          (1st ed.). Addison-Wesley Publishing Company, USA.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Lynn</given-names>
            <surname>Cullimore</surname>
          </string-name>
          .
          <year>2016</year>
          .
          <article-title>Using Jupyter Notebooks to teach computational literacy</article-title>
          . (
          <year>2016</year>
          ). http://www.elearning.eps.manchester.ac.uk/blog/2016/using-jupyter-notebooks-to-teach-computational-literacy/
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Ronan</given-names>
            <surname>Cummins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Colm</given-names>
            <surname>O'Riordan</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning in a Pairwise Term-term Proximity Framework for Information Retrieval</article-title>
          .
          <source>In Proceedings of SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Nicola</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Norbert</given-names>
            <surname>Fuhr</surname>
          </string-name>
          , et al.
          <year>2016</year>
          .
          <article-title>Increasing Reproducibility in IR: Findings from the Dagstuhl Seminar on "Reproducibility of Data-Oriented Experiments in e-Science"</article-title>
          .
          <source>SIGIR Forum 50</source>
          ,
          <issue>1</issue>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Rosie</given-names>
            <surname>Jones</surname>
          </string-name>
          , Benjamin Rey, Omid Madani, and Wiley Greiner.
          <year>2006</year>
          .
          <article-title>Generating Query Substitutions</article-title>
          .
          <source>In Proceedings of WWW.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Lin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Miles</given-names>
            <surname>Efron</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Evaluation As a Service for Information Retrieval</article-title>
          .
          <source>SIGIR Forum 47</source>
          ,
          <issue>2</issue>
          (Jan.
          <year>2013</year>
          ),
          <fpage>8</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Tie-Yan</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2009</year>
          .
          <article-title>Learning to rank for information retrieval</article-title>
          .
          <source>Foundations and Trends® in Information Retrieval 3</source>
          ,
          <issue>3</issue>
          (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Craig</given-names>
            <surname>Macdonald</surname>
          </string-name>
          , Vassilis Plachouras, Ben He, Christina Lioma, and
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>University of Glasgow at WebCLEF 2005: Experiments in per-field normalisation and language specific stemming</article-title>
          .
          <source>In Proceedings of CLEF.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Craig</surname>
            <given-names>Macdonald</given-names>
          </string-name>
          , Rodrygo L.T. Santos, Iadh Ounis, and
          <string-name>
            <given-names>Ben</given-names>
            <surname>He</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>About Learning Models with Multiple Query-dependent Features</article-title>
          .
          <source>ToIS 31</source>
          ,
          <issue>3</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Donald</given-names>
            <surname>Metzler</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>A Markov random field model for term dependencies</article-title>
          .
          <source>In Proceedings of SIGIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>Keith J.</given-names>
            <surname>O'Hara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Douglas</given-names>
            <surname>Blank</surname>
          </string-name>
          , and
          <string-name>
            <given-names>James</given-names>
            <surname>Marshall</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Computational Notebooks for AI Education</article-title>
          .
          <source>In Proceedings of FLAIRS.</source>
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Iadh</given-names>
            <surname>Ounis</surname>
          </string-name>
          , Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and
          <string-name>
            <given-names>Christina</given-names>
            <surname>Lioma</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Terrier: A High Performance and Scalable Information Retrieval Platform</article-title>
          .
          <source>In Proceedings of OSIR.</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Perez</surname>
          </string-name>
          and
          <string-name>
            <given-names>Brian E</given-names>
            <surname>Granger</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Project Jupyter: Computational narratives as the engine of collaborative data science</article-title>
          .
          <source>Technical Report</source>
          . http://archive.ipython.org/JupyterGrantNarrative-2015.pdf
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Trevor</given-names>
            <surname>Strohman</surname>
          </string-name>
          , Donald Metzler, Howard Turtle, and
          <string-name>
            <given-names>W Bruce</given-names>
            <surname>Croft</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Indri: A language model-based search engine for complex queries</article-title>
          .
          <source>In Proceedings of the International Conference on Intelligent Analysis</source>
          , Vol.
          <volume>2</volume>
          . Citeseer,
          <volume>2</volume>
          -
          <fpage>6</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          Terrier.org.
          <year>2016</year>
          .
          <article-title>Learning to Rank with Terrier</article-title>
          . (
          <year>2016</year>
          ). http://terrier.org/docs/v4.2/learning.html
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>John</given-names>
            <surname>Williamson</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>CS1P: Running Jupyter from the command line</article-title>
          . (
          <year>2017</year>
          ). https://www.youtube.com/watch?v=hqpuC0YLbpM
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>Qiang</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chris J. C.</given-names>
            <surname>Burges</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Krysta M.</given-names>
            <surname>Svore</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Ranking, Boosting, and Model Adaptation</article-title>
          .
          <source>Technical Report MSR-TR-2008-109</source>
          . Microsoft.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>Hugo</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          , Nick Craswell, Michael Taylor, Suchi Saria, and
          <string-name>
            <given-names>Stephen</given-names>
            <surname>Robertson</surname>
          </string-name>
          .
          <year>2004</year>
          .
          <article-title>Microsoft Cambridge at TREC-13: Web and HARD tracks</article-title>
          .
          <source>In Proceedings of TREC.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>