<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>It's messy out there: DBC's journey towards its first test collection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Serena Canu</string-name>
        </contrib>
        <aff>DBC Digital A/S, Tempovej, Ballerup, Denmark</aff>
      </contrib-group>
      <abstract>
        <p>The use of test collections to evaluate the effectiveness of Information Retrieval (IR) systems is widespread, and the literature covers many examples of improvements and ways to solve specific problems. In this abstract, we explore the initial difficulties of building and implementing a test collection outside of academic walls, without easy or immediate access to many of the main tools discussed in academic papers. This is an example of how a test collection can engage a company in finding creative solutions and taking a first step towards the evaluation of IR systems.</p>
      </abstract>
      <kwd-group>
        <kwd>Test collections</kwd>
        <kwd>IR systems evaluation</kwd>
        <kwd>Information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Test collections have been used for decades as
fundamental tools to evaluate IR systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and
they might even look like an easy-to-implement
tool to many experts and scholars. The reality is
that adopting a test collection impacts almost
every aspect of an organization, especially if the
organization was not structured to perform
constant evaluation to begin with.
      </p>
      <p>Nonetheless, there are situations in which
using a test collection is a crucial step to evolve
and improve the company's products. This has
been the case at DBC Digital, a Danish company
whose main task is to develop and maintain the
bibliographic and IT infrastructure of the Danish
public libraries. Among other things, DBC
develops and deploys the search engine used by
the public website bibliotek.dk, which gives
access to the common catalogue of the Danish
public libraries.</p>
      <p>After years of focusing mainly on the efficiency
of the system, finding a new way to evaluate
the search engine became a necessary step forward.
Bringing in a perspective focused on effectiveness
represented a small revolution, and it
brought back to the center the question “what do
the users need?”.</p>
      <p>
        Unfortunately, good intentions alone were not
sufficient, and we faced several challenges in
understanding how to create and use a test collection
without any prior experience. We relied
on the work of Sanderson and colleagues [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ][
        <xref ref-type="bibr" rid="ref3">3</xref>
        ][
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
to guide us, since they present a useful
and practical summary of the basic
components of a representative test collection: a set
of real queries, a set of real, or at least realistic,
narratives, and a way to make relevance
judgments.
      </p>
      <p>
        Query logs and user data analysis are often
considered the foundation for understanding users’
needs and defining the list of queries [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. When
DBC decided to include a test collection among
its tools, there were no comprehensive query logs
to analyze, no recent data about
loans or other user behavior, and no possibility
of directly involving real users. This is where we
had to find creative ways to overcome the
uncertainty. Our sole starting point was a dataset
listing the most searched queries in 2018
in the DDB CMS, the CMS used by the
Danish public libraries. Classifying
and interpreting these queries was then necessary to
make sense of the dataset.
      </p>
      <p>We started our test collection with 77 queries.
These included both some of the most searched
queries and queries that we deemed
challenging for our search engine. The
corresponding narratives, initially written by a
single expert who also acted as assessor, were collectively
reviewed with the help of colleagues with
different expertise and backgrounds. The
descriptions were then narrowed down for
convenience.</p>
      <p>
        This form of brainstorming, even though
unconventional, proved particularly useful
during the process, since it helped us manage the
scarcity of resources and prompted a discussion
about, and a broader understanding of, the
concept of relevance [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. What might look
like trivial questions to evaluation experts were in
fact crucial steps in the definition of DBC’s test
collection.
      </p>
      <p>The documents to be judged were initially selected with
a method other than traditional pooling,
and the relevance judgments were made on a graded
scale. With this first attempt, we were able to get
an idea of how well our current search engine
performed, using the traditional metrics of Precision,
Recall, F-measure, and nDCG.</p>
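      <p>As an illustration of how graded judgments can feed these four metrics, the following is a minimal sketch and not the implementation used at DBC. The 0-3 grade scale, the threshold that turns grades into binary relevance, the cutoff k, the treatment of unjudged documents as non-relevant, and the log2-discounted form of nDCG are all assumptions made for the example, as are the function names and the sample data.</p>
      <preformat>
```python
import math


def precision_recall_f1(ranked_ids, grades, min_grade=1):
    """Binary metrics: a document counts as relevant when its grade >= min_grade."""
    relevant = {doc for doc, grade in grades.items() if grade >= min_grade}
    hits = sum(1 for doc in ranked_ids if doc in relevant)
    precision = hits / len(ranked_ids) if ranked_ids else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_ids, grades, k=10):
    """nDCG at cutoff k; unjudged documents are assumed to have grade 0."""
    gains = [grades.get(doc, 0) for doc in ranked_ids[:k]]
    ideal = sorted(grades.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments for one query: document id -> grade (assumed 0-3 scale).
grades = {"d1": 3, "d2": 0, "d3": 2, "d4": 1}
# Hypothetical ranked output of a search engine; "d5" was never judged.
ranked = ["d2", "d1", "d5", "d3"]

p, r, f1 = precision_recall_f1(ranked, grades)
score = ndcg(ranked, grades, k=4)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f} nDCG@4={score:.3f}")
```
      </preformat>
      <p>In this sketch the binary metrics ignore how relevant a document is, while nDCG rewards placing highly graded documents near the top of the ranking. Averaging each value over all queries in the collection would give one figure per metric for a given system.</p>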
      <p>Equally important, this was only the first step
towards a new perspective that is steadily
taking hold within the company, underlining
the necessity of gathering more data, involving
more experts, and finding ways to reduce some
initially unavoidable biases.</p>
    </sec>
    <sec id="sec-2">
      <title>2. More than just a test collection</title>
      <p>Apart from the evaluation of the current and
new IR systems, the test collection has also been
used in other cross-departmental projects. For
instance, it has served as a baseline for observing possible
differences between indexing strategies based on
curated metadata, full-text indexing, and machine learning (ML).</p>
      <p>It is also a concrete way to communicate with
our customers and QA: to explain what behavior
they can expect from the search engine, and to
define additional functionality.</p>
      <p>Research on IR has evolved significantly,
so much so that it can be hard to appreciate the
real challenges that an average company faces
when introducing evaluation tools into its practices.
We believe that, sometimes, going back to the
basics and dealing with a messy development can
still be an option, especially if the main result is
to ignite a conversation that entirely changes the
company’s understanding of what IR systems
evaluation can mean, despite the
compromises made along the way.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <article-title>On the history of evaluation in IR</article-title>
          .
          <source>Journal of Information Science</source>
          <volume>34</volume>
          (
          <issue>4</issue>
          ) (
          <year>2008</year>
          ),
          <fpage>439</fpage>
          -
          <lpage>456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          .
          <article-title>Evaluating the performance of information retrieval systems using test collections</article-title>
          .
          <source>Information Research</source>
          , (
          <year>2013</year>
          ),
          <volume>18</volume>
          (
          <issue>2</issue>
          ),
          <fpage>18</fpage>
          -
          <lpage>2</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Braschler</surname>
          </string-name>
          ,
          <article-title>Best practices for test collection creation and information retrieval system evaluation</article-title>
          ,
          <source>TrebleCLEF Project</source>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanderson</surname>
          </string-name>
          ,
          <article-title>Test collection based evaluation of information retrieval systems</article-title>
          .
          <source>Foundations and Trends in Information Retrieval</source>
          ,
          <volume>4</volume>
          (
          <issue>4</issue>
          ) (
          <year>2010</year>
          ),
          <fpage>247</fpage>
          -
          <lpage>375</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. B.</given-names>
            <surname>Croft</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Strohman</surname>
          </string-name>
          ,
          <source>Search engines: Information retrieval in practice</source>
          , 1st ed.,
          <publisher-name>Addison-Wesley</publisher-name>
          , Boston, MA,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Saracevic</surname>
          </string-name>
          ,
          <article-title>Relevance reconsidered</article-title>
          ,
          <source>in: Proceedings of the second conference on conceptions of library and information science (CoLIS 2)</source>
          , ACM Press, New York, NY,
          <year>1996</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>