<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>March</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>On Search Topic Variability in Interactive Information Retrieval</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ying-Hsang Liu</string-name>
          <email>yingliu@csu.edu.au</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nina Wacholder</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Communication and Information, Rutgers University</institution>
          ,
          <addr-line>New Brunswick NJ 08901, USA, +1 732 932 7500 ext. 8214</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>School of Information Studies, Charles Sturt University</institution>
          ,
          <addr-line>Wagga Wagga NSW 2678, Australia, +61 2 6933 2171</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <volume>28</volume>
      <issue>2010</issue>
      <abstract>
        <p>This paper describes the research design and methodologies we used to assess the usefulness of MeSH (Medical Subject Headings) terms for different types of users in an interactive search environment. We observed four different kinds of information seekers using an experimental IR system: (1) search novices; (2) domain experts; (3) search experts and (4) medical librarians. We employed a user-oriented evaluation methodology to assess the search effectiveness of automatic and manual indexing methods using the TREC Genomics Track 2004 data set. Our approach demonstrated (1) the reusability of a large test collection originally created for TREC, (2) an experimental design that specifically considers types of searchers, system versions and search topic pairs through a Graeco-Latin square design and (3) that search topic variability can be alleviated by using different sets of equally difficult topics and a well-controlled experimental design for contextual information retrieval evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Information retrieval evaluation</kwd>
        <kwd>Search topics</kwd>
        <kwd>Interactive information retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Categories and Subject Descriptors</title>
      <p>H.3.3 [Information Storage and Retrieval]: Information Search
and Retrieval−query formulation, search process</p>
    </sec>
    <sec id="sec-3">
      <title>1. INTRODUCTION</title>
      <p>
        The creation and refinement of test designs and methodologies for
IR system evaluation has been one of the greatest achievements
of IR research and development. In the second Cranfield project
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], the main purpose was to evaluate the effectiveness of indexing
techniques at a level of abstraction where users are not
specifically considered, in a batch-mode experiment.
      </p>
      <p>
        Test designs and methodologies following the Cranfield
paradigm culminated in the TREC (Text REtrieval Conference)
activities that began in the 1990s. TREC has provided a research forum
for comparing the search effectiveness of different retrieval
techniques across IR systems in a controlled laboratory
environment [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. The very large test collection used in TREC
provided a test bed for researchers to experiment with the scalability of
retrieval techniques, which had not been possible in previous
years. However, taking different aspects of user context into
account within a more realistic test environment has been
challenging, in part because it is difficult to isolate the
effects of user, search topic and system in IR experiments (see
e.g., [
        <xref ref-type="bibr" rid="ref17 ref7">7, 17</xref>
        ] for recent efforts).
      </p>
      <p>
        In batch experiments, the search effectiveness of different
retrieval techniques is assessed by comparing the search
performance of queries. To meet statistical requirements, IR
researchers have widely used the micro-averaging method,
computing statistics over the queries to summarize precision
and recall values when comparing the search effectiveness of
different retrieval techniques (see e.g., [
        <xref ref-type="bibr" rid="ref25 ref27">25, 27</xref>
        ]). The micro-averaging method
is intended to obtain reliable results in comparing the
search performance of different retrieval techniques by giving
equal weight to each query.
      </p>
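      <p>As an illustration only (hypothetical function and data, not the study's code), the equal-weight averaging described above amounts to a simple arithmetic mean over per-query scores:</p>

```python
# Minimal sketch of micro-averaging as described above: each query
# contributes equally to the summary score. The function name and the
# scores are hypothetical illustrations.

def average_over_queries(per_query_scores):
    """Average {query_id: score}, weighting each query equally."""
    if not per_query_scores:
        raise ValueError("no queries to average")
    return sum(per_query_scores.values()) / len(per_query_scores)

# Three hypothetical queries with per-query precision values.
scores = {"q1": 0.50, "q2": 0.20, "q3": 0.80}
print(average_over_queries(scores))  # 0.5
```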
      <p>
        However, within an interactive IR search environment that
involves human searchers, it is difficult to use a large set of search
topics. Empirical evidence has demonstrated that a search topic
set size of 50 is necessary to determine the relative performance
of different retrieval techniques in batch evaluations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], because
the variability of search topics has an overriding effect on search
results. Another possible solution is to use different sets of topics
in a non-matched-pair design [
        <xref ref-type="bibr" rid="ref21 ref22 ref5">5, 21, 22</xref>
        ], but theoretically this
requires a very large sample of independent searches.
      </p>
      <p>
        This problem has been exacerbated by the fact that we have
little theoretical understanding of the nature and properties of
search topics for evaluation purposes [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. From a systems
perspective, recent in-depth failure analyses of the variability in
search topics for reliable and robust retrieval performance (e.g.,
[
        <xref ref-type="bibr" rid="ref11 ref28">11, 28</xref>
        ]) have contributed to our preliminary understanding of
how and why IR systems fail to do well across all search topics. It
remains unclear what kinds of search topics can be used to directly
control the topic effect for IR evaluation purposes.
      </p>
      <p>
        This study was designed to assess the search effectiveness of
MeSH terms for different types of searchers in an interactive
search environment. Using an experimental design that controls
searchers, system versions and search topic pairs, together with a
relatively large number of search topics, we were able to
demonstrate an IR user experiment that specifically controls
search topic variability and assesses the user effect on search
effectiveness within the laboratory IR framework (see e.g., [
        <xref ref-type="bibr" rid="ref14 ref15">14,
15</xref>
        ] for recent discussions).
      </p>
    </sec>
    <sec id="sec-4">
      <title>2. METHOD</title>
      <p>Thirty-two searchers from a major public university and nearby
medical libraries in the northeastern US participated in the
study. Each searcher belonged to one of four groups: (1) Search
Novices (SN), (2) Domain Experts (DE), (3) Search Experts (SE)
and (4) Medical Librarians (ML).</p>
      <p>The experimental task was to conduct a total of eight
searches to help biologists conduct their research. Each participant
conducted four searches with each of two versions of the system: one
in which abstracts and MeSH terms were displayed and participants
could browse a displayed list of MeSH terms (MeSH+), and one in
which they had to formulate their own terms based only on the
display of abstracts (MeSH−). Half the participants used the MeSH+
system first; half used MeSH− first. Each participant thus searched
eight different topics.</p>
      <p>The experimental setting for most searchers was a university
office; for some searchers, it was a medical library. Before they
began searching, participants were briefly trained in how to use
MeSH terms. We kept search logs that recorded search terms, the
ranked list of retrieved documents, and time-stamps.</p>
    </sec>
    <sec id="sec-5">
      <title>2.1 Subjects</title>
      <p>We used purposive sampling to recruit our
subjects, since we were concerned with the impact of specific
searcher characteristics on search effectiveness. The key searcher
characteristics were the level of knowledge in the
biomedical domain and whether the searcher had substantial search
training. The four types of searchers were distinguished by their
levels of domain knowledge and search training.</p>
    </sec>
    <sec id="sec-6">
      <title>2.2 Experimental design</title>
      <p>
        The experiment was a 4×2×2 factorial design with four types of
searchers, two versions of an experimental system and controlled
search topic pairs. The versions of the system, types of searchers
(distinguished by levels of domain knowledge and search training)
and search topic pairs were controlled by a Graeco-Latin square
balanced design [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Possible ordering effects were taken
into account by the design. The requirements for this experimental
design are that the examined variables do not interact and that each
variable has the same number of levels [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The treatment layout
of a 4×4 Graeco-Latin square design is illustrated in Figure 1.
[Figure 1. Treatment layout of the 4×4 Graeco-Latin square design.]
Note. Numbers 1-16 refer to participant ID; SN, DE, SE and ML
refer to types of searchers, SN=Search Novices, DE=Domain
Experts; SE=Search Experts; ML=Medical Librarians; shaded
and non-shaded blocks refer to the MeSH+ and MeSH− versions of
the experimental system; numbers in blocks refer to search topic
ID numbers from the TREC Genomics Track 2004 data set; the 10 search
topic pairs, randomly selected from a pool of 20 selected topics,
are (38, 12), (29, 50), (42, 46), (32, 15), (27, 45), (9, 36), (30,
20), (2, 43), (1, 49) and (33, 23).
      </p>
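      <p>To make the design concrete: a Graeco-Latin square superimposes two orthogonal Latin squares, so that every combination of the two superimposed factors occurs exactly once. The sketch below uses an illustrative 4×4 square (not the paper's actual Figure 1 assignment) and verifies both properties:</p>

```python
# Sketch of a 4x4 Graeco-Latin square: two orthogonal Latin squares
# superimposed. Here one square might assign system-version orderings
# and the other topic-pair blocks; the concrete values are illustrative.

LATIN = [  # first factor (e.g., system-version sequences)
    [0, 1, 2, 3],
    [1, 0, 3, 2],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
]
GREEK = [  # second factor (e.g., topic-pair blocks)
    [0, 1, 2, 3],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
    [1, 0, 3, 2],
]

def is_latin(sq):
    """Each symbol appears exactly once per row and per column."""
    n = len(sq)
    return all(len(set(row)) == n for row in sq) and \
           all(len(set(col)) == n for col in zip(*sq))

def is_orthogonal(a, b):
    """Every (a, b) symbol pair occurs exactly once across all cells."""
    n = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n

assert is_latin(LATIN) and is_latin(GREEK)
assert is_orthogonal(LATIN, GREEK)
```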
      <p>
        Because of the potential interfering effect of search topic
variability on search performance in IR evaluation, we used a
design that included a relatively large number of search topics. In
theory, the effect of topic variability and topic-system interaction
on system performance could be eliminated by averaging the
performance scores over the topics (the micro-averaging method),
together with the use of a very large number of search topics. The
TREC standard ad hoc task evaluation studies ([
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]) and other
proposals for test collections (e.g., [
        <xref ref-type="bibr" rid="ref20 ref21 ref22 ref24 ref29">20-22, 24, 29</xref>
        ]) have been
concerned with the large search topic variability in batch
experiments. However, in a user-centered IR experiment it is not
feasible to use as many as 50 search topics because of human
fatigue.
      </p>
      <p>We controlled search topic pairs by a balanced design in
order to alleviate the overriding effect of search topic variability.
We assumed that all the search topics were equally difficult, since
we do not have a good theory of what makes some search
topics more difficult than others. By design we ensured that each
search topic pair was assigned to all types of searchers and was
searched at least twice by the same type of searcher. This
design required a total of 10 search topic pairs and a minimum of
16 participants.</p>
    </sec>
    <sec id="sec-7">
      <title>2.3 Search tasks and incentive system</title>
      <p>
        The search task was designed to simulate online searching
situations in which professional searchers look for information on
behalf of users. We decided to use this relatively challenging task
for untrained searchers because choosing realistic tasks such as
this one would enhance the external validity of the experiment.
Given the relative difficulty of the tasks, we were concerned that
searchers might have trouble completing all searches. Because the
research literature has suggested that the motivational
characteristics of participants are a possible source of sample bias
[
        <xref ref-type="bibr" rid="ref23">23</xref>
        ], we designed an incentive system to motivate the searchers.
      </p>
      <p>We promised monetary incentives according to each
participant’s search effectiveness. Each subject was paid $20 for
participating and up to $10 more based
on the average number of relevant documents in the top ten search
results across all search topics; on average each participant
received an additional $4.40, with a range of $2.00–$8.00.</p>
    </sec>
    <sec id="sec-8">
      <title>2.4 Experimental procedures</title>
      <p>After signing the consent form, participants filled out a
searcher background questionnaire before the search assignment.
After a brief training session, they were assigned to one of the
arranged experimental conditions and conducted the search tasks.
When they were done with each search topic, they completed a
search perception questionnaire and were asked to indicate the
relevance of two pre-judged documents. A brief interview was
conducted when they finished all search topics. Search logs with
search terms and ranked retrieved documents were recorded.</p>
      <p>
        The MeSH Browser [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], an online vocabulary look-up aid
prepared by the U.S. National Library of Medicine, was designed to
help searchers find appropriate MeSH terms and display the hierarchy
of terms for retrieval purposes. The MeSH Browser was only
available when participants were assigned to the MeSH+ version
of the experimental system; in the MeSH− version, participants
had to formulate their own terms without the assistance of the MeSH
Browser or of MeSH terms displayed in bibliographic records.
      </p>
      <p>
        Because we were concerned that the topics were so hard that
even the medical librarians would not understand them, we
administered a questionnaire on search topic understanding after
each topic. Its test items, two randomly selected pre-judged
documents, one definitely relevant and the other definitely not
relevant, were prepared from the data set [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ].
      </p>
      <p>Each search topic was allocated up to ten minutes. The last
search within the time limit was used for calculating search
performance. To keep the participants motivated and reward their
effort, when a search task was not finished within ten minutes they
were asked to indicate orally which previous search result would be
the best answer.</p>
    </sec>
    <sec id="sec-9">
      <title>2.5 Experimental system</title>
      <p>
        For this study, it was important for participants to conduct their
searches in a carefully controlled environment; our goal was to
offer as much help as possible while still making sure that the help
and search functions did not interfere with our ability to measure
the impact of the MeSH terms. We built an information retrieval
system based on the Greenstone Digital Library Software version
2.70 [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] because it provides reliable search functionality, a
customizable search interface and good documentation [
        <xref ref-type="bibr" rid="ref31">31</xref>
        ].
      </p>
      <p>We prepared two different search interfaces within a single
Greenstone system: the MeSH+ and MeSH− versions. One
interface allowed users to use MeSH terms and displayed MeSH
terms in retrieved bibliographic records; the other required users
to devise their own terms and did not display MeSH terms.
Because we were concerned that participants might respond to
cues signaling the experimenter’s intent, the search interfaces
were labeled ‘System Version A’ and ‘System Version B’ for the
‘MeSH+ Version’ and ‘MeSH− Version’ respectively (see
http://comminfo.rutgers.edu/irgs/gsdl/cgi-bin/library/). The
MeSH− version served as the baseline, an automatically indexed
system, whereas the MeSH+ version represented the performance
of a manual indexing system. That is, MeSH terms
added another layer of document representation to the MeSH+
version.</p>
      <p>
        The experimental system was constructed as a Boolean-based
system with ranking by the TF×IDF weighting rule [
        <xref ref-type="bibr" rid="ref32">32</xref>
        ].
More specifically, MGPP (MG++), a re-implementation of the mg
(Managing Gigabytes) searching and compression algorithms,
was used for indexing and querying. Basic system features,
including fielded searching, phrase searching, Boolean operators,
case sensitivity, stemming and display of search history, were
sufficient to fulfill the search tasks. The display of search history
was necessary because it provided useful feedback regarding the
number of retrieved documents for difficult search tasks that
usually required query reformulations.
      </p>
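      <p>As a rough sketch of how a TF×IDF weighting rule ranks documents (a generic illustration under assumed data shapes, not the MGPP implementation), each document can be scored by summing term frequency times inverse document frequency over the query terms:</p>

```python
import math
from collections import Counter

# Generic TFxIDF ranking sketch (not the MGPP indexer): score each
# document by sum over query terms of tf(term, doc) * log(N / df(term)).

def tfidf_rank(query_terms, docs):
    """docs: {doc_id: list of tokens}. Returns doc_ids sorted by score."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for tokens in docs.values():
        for term in set(tokens):
            df[term] += 1
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)            # term frequency within the doc
        scores[doc_id] = sum(
            tf[t] * math.log(n / df[t]) for t in query_terms if df[t]
        )
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical mini-collection.
docs = {
    "d1": "hypertension risk factors gene".split(),
    "d2": "gene expression profile".split(),
    "d3": "hypertension hypertension treatment".split(),
}
print(tfidf_rank(["hypertension"], docs))  # ['d3', 'd1', 'd2']
```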
      <p>
        Since our goal was specifically to investigate the usefulness
of displayed MeSH terms, we deliberately refrained from
implementing system features that allow users to take
advantage of the hierarchical structure of MeSH, such as
hyperlinked MeSH terms, an explode function that automatically
includes all narrower terms, and automatic query expansion (see
e.g. [
        <xref ref-type="bibr" rid="ref13 ref18">13, 18</xref>
        ]), all available on other online search systems. Although a full
integration of those features would have increased the usefulness of
MeSH terms, their use would have invalidated the results by
introducing other variables at the levels of search interface and
query processing.
      </p>
    </sec>
    <sec id="sec-10">
      <title>2.6 Documents</title>
      <p>
        The experimental system was set up on a server, using
bibliographic records from the TREC Genomics Track 2004 document
set [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. This test collection was a 10-year (1994 to 2003) subset of
MEDLINE with a total of 4,591,108 records. The subset
fed into the system comprised 75.0% of the whole collection, a
total of 3,442,321 records, excluding the records without MeSH
terms or abstracts.
      </p>
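      <p>The exclusion step described above can be sketched as a simple filter over records; the field names below are hypothetical, not the actual MEDLINE record schema:</p>

```python
# Sketch of the filtering step: keep only records that have both MeSH
# terms and an abstract. Field names ('mesh_terms', 'abstract') are
# hypothetical illustrations.

def keep_record(record):
    """record: dict with optional 'mesh_terms' and 'abstract' fields."""
    return bool(record.get("mesh_terms")) and bool(record.get("abstract"))

records = [
    {"pmid": "1", "mesh_terms": ["Hypertension"], "abstract": "text"},
    {"pmid": "2", "mesh_terms": [], "abstract": "text"},      # no MeSH
    {"pmid": "3", "mesh_terms": ["Risk Factors"], "abstract": ""},  # no abstract
]
kept = [r["pmid"] for r in records if keep_record(r)]
print(kept)  # ['1']
```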
      <p>We prepared two sets of documents for setting up the
experimental system: MeSH+ and MeSH− versions. One interface
allowed users to use MeSH terms; the other did not provide this
search option. The difference was also reflected in retrieved
bibliographic records.</p>
    </sec>
    <sec id="sec-11">
      <title>2.7 Search topics</title>
      <p>The search topics used in this study were originally created
for the TREC Genomics Track 2004 for the purpose of evaluating the
search effectiveness of different retrieval techniques (see topic
39 below for an example). They covered a range of genomics topics
typically asked about by biomedical researchers. Besides a unique ID
number, each topic was constructed in a format that
included title, need and context fields. The title field was a
short query. The need field was a short description of the kind of
material the biologists were interested in, whereas the context field
provided background information for judging the relevance of
documents. The need and context fields were designed to provide
more possible search terms for system experimentation purposes.</p>
      <p>ID: 39
Title: Hypertension
Need: Identify genes as potential genetic risk factors
candidates for causing hypertension.</p>
      <p>Context: A relevant document is one which discusses genes
that could be considered as candidates to test in a randomized
controlled trial which studies the genetic risk factors for
stroke.</p>
      <p>Because of the technical nature of genomics topics, we
wondered whether the search topics could be understood by
human searchers, particularly those without advanced training
in the biomedical field. TREC search topics were designed for
machine runs with little or no consideration of searches by real
users. We selected 20 of the 50 topics using the following
procedure:
1. Consulting an experienced professional searcher with a
biology background and a graduate student in
neuroscience to help judge whether the
topics would be comprehensible to participants who
were not domain experts. Topics that used advanced
technical vocabulary, such as specific genes, pathways
and mechanisms, were excluded;
2. Ensuring that major concepts in the search topics could be
mapped to MeSH by searching the MeSH Browser. For
instance, topic 39 could be mapped to the MeSH preferred
terms hypertension and risk factors;
3. Eliminating topics with very low MAP (mean average
precision) and P10 (precision at top 10 documents) scores
in the relevance judgment set, because these topics would
be too difficult.
The selected topics were then randomly ordered to create ten
search topic pairs for the experimental conditions (see Figure 1 for
the search topic pairs).</p>
    </sec>
    <sec id="sec-12">
      <title>2.8 Reliability of relevance judgment sets</title>
      <p>
        We measured search outcome using standard precision and recall
measures for accuracy, and time spent for user effort [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], because
we were concerned with the effect of MeSH terms on search
effectiveness as measured against the TREC assessments [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>
        Theoretically speaking, the calculation of the recall measure
requires relevance judgments for the whole test collection.
However, it is almost impossible to obtain these judgments for a
test collection with more than 3 million documents. For practical
reasons the recall measure used a pooling method that created a
set of unique documents from the top 75 documents submitted by the
27 groups that participated in the TREC 2004 Genomics Track ad hoc
tasks [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Empirical evidence has shown that recall calculated
with a pooling method provides a reasonable approximation,
although recall is likely to be overestimated [
        <xref ref-type="bibr" rid="ref33">33</xref>
        ]. This approach yielded an average pool of 976
documents with relevance judgments per topic, with a range
of 476-1450 [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
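      <p>With a pooled judgment set, recall is necessarily computed against the documents judged relevant in the pool rather than the full collection, which is why it tends to be overestimated. A minimal sketch under assumed data shapes:</p>

```python
# Sketch of pooled recall: the denominator is the set of pooled
# judged-relevant documents for the topic, not all relevant documents
# in the collection. Doc ids below are hypothetical.

def pooled_recall(retrieved, judged_relevant):
    """retrieved: list of doc ids; judged_relevant: set of pooled relevant ids."""
    if not judged_relevant:
        return 0.0
    hits = len(set(retrieved) & judged_relevant)
    return hits / len(judged_relevant)

pool = {"d1", "d2", "d3", "d4"}   # pooled relevant documents for one topic
run = ["d1", "d9", "d3", "d7"]    # a searcher's ranked results
print(pooled_recall(run, pool))   # 0.5
```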
      <p>
        It was quite likely that some of the participants in our
experiment would retrieve documents that had not been judged.
The existence of un-judged relevant documents, known as sampling
bias in the pooling method, depends on the pool depth and the
diversity of retrieval methods, and may affect the reliability of the
relevance judgment set [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The assumption that the pooled
judgment set is a reasonable approximation of the complete relevance
judgment set may become invalid when the test collection is very
large.
      </p>
      <p>To ensure that the TREC pooled relevance judgment set was
sufficiently complete and valid for the current study, we analyzed the
top 10 retrieved documents from each human run (32 searchers ×
8 topics = 256 runs). Cross-tabulation results showed that about
one-third of all documents retrieved in our study had not been
judged in the TREC data set. More specifically, of a total of 2277
analyzed documents, 762 (33.5%) had not been assigned relevance
judgments. There were large variations in the percentage of
unjudged documents across search topics, with a range of 0–59.3%.</p>
      <p>To assess the impact of incomplete relevance judgments, we
compared the top 10 ranked search results between the judged
document set and the pooled document set for each topic. The
judged document set was composed of the documents that
matched the TREC data, i.e., the combination of documents judged
relevant and judged not relevant. The un-judged documents, added
to the pooled document set, were considered ‘not relevant’ in our
calculations of search outcome. We used the precision-oriented
measures MAP (mean average precision), P10 (precision at top 10
documents) and P100 (precision at top 100 documents) to estimate
the impact of incomplete judgments.</p>
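      <p>These measures have standard definitions. As an illustrative sketch (not the study's code), precision at cutoff k and average precision for a single ranked run can be computed as follows; MAP is then the mean of average precision over topics:</p>

```python
# Standard precision-oriented measures: P@k and average precision (AP).
# MAP is the mean of AP over a set of topics. Doc ids are hypothetical.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))  # 0.5
print(average_precision(ranked, relevant))  # (1/1 + 2/3) / 2 ~ 0.833
```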
      <p>The paired t-test results by search topic revealed significant
differences between the two sets in terms of the MAP (t(19) = -3.69, p
&lt; .01), P10 (t(19) = -3.89, p &lt; .001) and P100 (t(19) = -3.95, p &lt;
.001) measures. However, the mean differences for MAP, P10 and
P100 were small, approximately 2.7%, 9.9% and 4.9% respectively,
so we concluded that the TREC relevance judgments are applicable to
this study.</p>
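      <p>For reference, the paired t-test used above divides the mean of the per-topic differences by its standard error, with n − 1 degrees of freedom. A pure-Python sketch with hypothetical per-topic scores (not the study's data):</p>

```python
import math

# Paired t-test sketch: t = mean(differences) / (s / sqrt(n)), df = n - 1,
# where s is the sample standard deviation of the differences.

def paired_t(xs, ys):
    """Return (t statistic, degrees of freedom) for paired samples xs, ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1

# Hypothetical per-topic MAP scores for two conditions.
judged = [0.40, 0.35, 0.50, 0.42, 0.38]
pooled = [0.44, 0.39, 0.52, 0.47, 0.41]
t, df = paired_t(judged, pooled)
print(round(t, 2), df)  # -7.06 4
```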
    </sec>
    <sec id="sec-13">
      <title>2.9 Limitations of the design</title>
      <p>
        This study was designed to assess the impact of MeSH terms
on search effectiveness in an interactive search environment. One
limitation of the design was that participants were a self-selected
group of searchers who may not be representative of the
population. The interaction of selection biases with the
experimental variable, i.e., the displayed MeSH terms, was
another possible factor limiting the generalizability of this study
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The use of relatively technical and difficult search topics in
the interactive search environment posed a threat to external
validity, since those topics might not represent the typical topics
received by medical librarians in practice.
      </p>
      <p>The internal validity of this design was enhanced by
specifically considering several aspects. We devised an incentive
system to address the possible sampling bias of searchers’
motivational characteristics in experimental settings. Besides
levels of education, participants’ domain knowledge was
evaluated by a topic understanding test. The variability of search
topics was alleviated by using a relatively large number of search
topics through the experimental design. Selected search topics were
checked for intelligibility in consultation with a domain expert
and a medical librarian. A concept analysis form was used to help
searchers recognize potentially useful terms. The reliability of the
relevance judgment sets was verified by additional analysis of the
top 10 search results from our human searchers.</p>
    </sec>
    <sec id="sec-14">
      <title>3. DISCUSSION AND CONCLUSION</title>
      <p>
        The Cranfield paradigm has been very useful for comparing the
search effectiveness of different retrieval techniques at a level of
abstraction that simulates user search performance. Putting users
in the loop of IR experiments is particularly challenging because it
is difficult to separate the effects of systems, searchers and topics,
and search topics have had dominating effects [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. To
alleviate search topic variability in interactive IR experiments, we
considered how to increase the topic set size by experimental design
within the laboratory IR framework.
      </p>
      </p>
      <p>
        This study has demonstrated that a total of 20 search topics
can be used in an interactive experiment through a Graeco-Latin square
balanced design and the use of different sets of carefully selected
topics. We assume that the selected topics are equally difficult,
since we do not have a good theory of search topics that can
directly control for topic difficulty for evaluation purposes.
Recent attempts to use reduced topic sets and non-matched
topics (see e.g., [
        <xref ref-type="bibr" rid="ref10 ref5">5, 10</xref>
        ]) indirectly support our experimental
design considerations regarding search topic variability and topic
difficulty. However, an important theoretical question remains:
how can we better control topic effects in batch and user IR
experiments?
      </p>
    </sec>
    <sec id="sec-15">
      <title>4. ACKNOWLEDGMENTS</title>
      <p>This study was funded by NSF grant #0414557, PIs Michael Lesk
and Nina Wacholder. We thank the anonymous reviewers for their
constructive comments.</p>
    </sec>
    <sec id="sec-16">
      <title>5. REFERENCES</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Banks</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Over</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and Zhang, N.-F.
          <year>1999</year>
          .
          <article-title>Blind men and elephants: Six approaches to TREC data</article-title>
          .
          <source>Inform Retrieval</source>
          ,
          <volume>1</volume>
          ,
          <issue>1</issue>
          /2 (April
          <year>1999</year>
          ),
          <fpage>7</fpage>
          -
          <lpage>34</lpage>
          . DOI=http://dx.doi.org/10.1023/A:1009984519381
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dimmick</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Soboroff</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Bias and the limits of pooling for large collections</article-title>
          .
          <source>Inform Retrieval</source>
          ,
          <volume>10</volume>
          ,
          <issue>6</issue>
          (
          <year>December 2007</year>
          ),
          <fpage>491</fpage>
          -
          <lpage>508</lpage>
          . DOI=http://dx.doi.org/10.1007/s10791-007-9032-x
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>Retrieval system evaluation</article-title>
          . In Voorhees,
          <string-name>
            <given-names>E. M.</given-names>
            and
            <surname>Harman</surname>
          </string-name>
          , D. K. (Eds.),
          <source>TREC: Experiment and Evaluation in Information Retrieval</source>
          , The MIT Press, Cambridge, MA,
          <fpage>53</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Campbell</surname>
            ,
            <given-names>D. T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stanley</surname>
            ,
            <given-names>J. C.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Gage</surname>
            ,
            <given-names>N. L.</given-names>
          </string-name>
          <year>1966</year>
          .
          <article-title>Experimental and Quasi-Experimental Designs for Research</article-title>
          . Rand McNally, Chicago.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Cattelan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Mizzaro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>IR evaluation without a common set of topics</article-title>
          .
          <source>In Proceedings of the 2nd International Conference on the Theory of Information Retrieval</source>
          (Cambridge, UK, September 10-12,
          <year>2009</year>
          ).
          <source>ICTIR 2009</source>
          . Springer, Berlin,
          <fpage>342</fpage>
          -
          <lpage>345</lpage>
          . DOI=http://dx.doi.org/10.1007/978-3-642-04417-5_35
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Cleverdon</surname>
            ,
            <given-names>C. W.</given-names>
          </string-name>
          <year>1967</year>
          .
          <article-title>The Cranfield tests on index language devices</article-title>
          .
          <source>Aslib Proc</source>
          ,
          <volume>19</volume>
          ,
          <issue>6</issue>
          (
          <year>1967</year>
          ),
          <fpage>173</fpage>
          -
          <lpage>193</lpage>
          . DOI=http://dx.doi.org/10.1108/eb050097
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Dumais</surname>
            ,
            <given-names>S. T.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Belkin</surname>
            ,
            <given-names>N. J.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>The TREC Interactive Track: Putting the user into search</article-title>
          . In
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D. K.</given-names>
          </string-name>
          (Eds.),
          <source>TREC: Experiment and Evaluation in Information Retrieval</source>
          , The MIT Press, Cambridge, MA,
          <fpage>123</fpage>
          -
          <lpage>152</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Fisher</surname>
            ,
            <given-names>R. A.</given-names>
          </string-name>
          <year>1935</year>
          .
          <article-title>The Design of Experiments</article-title>
          . Oliver and Boyd, Edinburgh.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <source>Greenstone Digital Library Software (Version 2.70)</source>
          .
          <year>2006</year>
          . Department of Computer Science, The University of Waikato, New Zealand. Available at: http://prdownloads.sourceforge.net/greenstone/gsdl-2.70-export.zip
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>Guiver</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mizzaro</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>A few good topics: Experiments in topic set reduction for retrieval evaluation</article-title>
          .
          <source>ACM Trans. Inf. Syst.</source>
          ,
          <volume>27</volume>
          ,
          <issue>4</issue>
          (November
          <year>2009</year>
          ),
          <fpage>1</fpage>
          -
          <lpage>26</lpage>
          . DOI=http://doi.acm.org/10.1145/1629096.1629099
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Buckley</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Overview of the Reliable Information Access Workshop</article-title>
          .
          <source>Inform Retrieval</source>
          ,
          <volume>12</volume>
          ,
          <issue>6</issue>
          (
          <year>December 2009</year>
          ),
          <fpage>615</fpage>
          -
          <lpage>641</lpage>
          . DOI=http://dx.doi.org/10.1007/s10791-009-9101-4
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bhupatiraju</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ross</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roberts</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kraemer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Enhancing access to the Bibliome: The TREC 2004 Genomics Track</article-title>
          ,
          <source>Journal of Biomedical Discovery and Collaboration</source>
          ,
          <volume>1</volume>
          ,
          <issue>3</issue>
          (March
          <year>2006</year>
          ). DOI=http://dx.doi.org/10.1186/1747-5333-1-3
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Hersh</surname>
            ,
            <given-names>W. R.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>Information Retrieval: A Health and Biomedical Perspective</article-title>
          . Springer, New York.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>Ingwersen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Järvelin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>The Turn: Integration of Information Seeking and Retrieval in Context</article-title>
          . Springer, Dordrecht.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <surname>Ingwersen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Järvelin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>On the holistic cognitive theory for information retrieval</article-title>
          .
          <source>In Proceedings of the First International Conference on the Theory of Information Retrieval (ICTIR)</source>
          (Budapest, Hungary,
          <year>2007</year>
          ). Foundation for Information Society.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>Kirk</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          <year>1995</year>
          .
          <article-title>Experimental Design: Procedures for the Behavioral Sciences</article-title>
          . Brooks/Cole, Pacific Grove, CA.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Lagergren</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Over</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment</article-title>
          .
          <source>In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (Melbourne, Australia,
          <year>1998</year>
          ).
          <source>SIGIR '98</source>
          . ACM Press, New York, NY,
          <fpage>164</fpage>
          -
          <lpage>172</lpage>
          . DOI=http://doi.acm.org/10.1145/290941.290986
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wilbur</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Evaluation of query expansion using MeSH in PubMed</article-title>
          .
          <source>Inform Retrieval</source>
          ,
          <volume>12</volume>
          ,
          <issue>1</issue>
          (
          <year>February 2009</year>
          ),
          <fpage>69</fpage>
          -
          <lpage>80</lpage>
          . DOI=http://dx.doi.org/10.1007/s10791-008-9074-8
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <source>MeSH Browser (2003 MeSH)</source>
          .
          <year>2004</year>
          . U.S. National Library of Medicine. Available at: http://www.nlm.nih.gov/mesh/2003/MBrowser.html
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <year>1981</year>
          .
          <article-title>The methodology of information retrieval experiment</article-title>
          . In
          <string-name>
            <surname>Sparck Jones</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          (Ed.),
          <source>Information Retrieval Experiment</source>
          , Butterworth, London,
          <fpage>9</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          <year>1990</year>
          .
          <article-title>On sample sizes for non-matched-pair IR experiments</article-title>
          .
          <source>Inform Process Manag</source>
          ,
          <volume>26</volume>
          ,
          <issue>6</issue>
          (
          <year>1990</year>
          ),
          <fpage>739</fpage>
          -
          <lpage>753</lpage>
          . DOI=http://dx.doi.org/10.1016/0306-4573(90)90049-8
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <surname>Robertson</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thompson</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Macaskill</surname>
            ,
            <given-names>M. J.</given-names>
          </string-name>
          <year>1986</year>
          .
          <article-title>Weighting, ranking and relevance feedback in a front-end system</article-title>
          .
          <source>Journal of Information Science</source>
          ,
          <volume>12</volume>
          ,
          <issue>1/2</issue>
          (
          <year>January 1986</year>
          ),
          <fpage>71</fpage>
          -
          <lpage>75</lpage>
          . DOI=http://dx.doi.org/10.1177/016555158601200112
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <surname>Sharp</surname>
            ,
            <given-names>E. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelletier</surname>
            ,
            <given-names>L. G.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Levesque</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>The double-edged sword of rewards for participation in psychology experiments</article-title>
          .
          <source>Can J Beh Sci</source>
          ,
          <volume>38</volume>
          ,
          <issue>3</issue>
          (Jul
          <year>2006</year>
          ),
          <fpage>269</fpage>
          -
          <lpage>277</lpage>
          . DOI=http://dx.doi.org/10.1037/cjbs2006014
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <surname>Sparck Jones</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          and
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          <year>1976</year>
          .
          <article-title>Information retrieval test collections</article-title>
          .
          <source>J Doc</source>
          ,
          <volume>32</volume>
          ,
          <issue>1</issue>
          (March
          <year>1976</year>
          ),
          <fpage>59</fpage>
          -
          <lpage>75</lpage>
          . DOI=http://dx.doi.org/10.1108/eb026616
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <surname>Tague-Sutcliffe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1992</year>
          .
          <article-title>The pragmatics of information retrieval experimentation, revisited</article-title>
          .
          <source>Inform Process Manag</source>
          ,
          <volume>28</volume>
          ,
          <issue>4</issue>
          (
          <year>1992</year>
          ),
          <fpage>467</fpage>
          -
          <lpage>490</lpage>
          . DOI=http://dx.doi.org/10.1016/0306-4573(92)90005-K
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <article-title>TREC 2004 Genomics Track document set data file</article-title>
          .
          <year>2005</year>
          . Available at http://ir.ohsu.edu/genomics/data/2004/
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <surname>van Rijsbergen</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          <year>1979</year>
          .
          <article-title>Information Retrieval</article-title>
          . Butterworths, London.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>The TREC robust retrieval track</article-title>
          .
          <source>SIGIR Forum</source>
          ,
          <volume>39</volume>
          ,
          <issue>1</issue>
          (
          <year>June 2005</year>
          ),
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          . DOI=http://doi.acm.org/10.1145/1067268.1067272
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>On test collections for adaptive information retrieval</article-title>
          .
          <source>Inform Process Manag</source>
          ,
          <volume>44</volume>
          ,
          <issue>6</issue>
          (November
          <year>2008</year>
          ),
          <fpage>1879</fpage>
          -
          <lpage>1885</lpage>
          . DOI=http://dx.doi.org/10.1016/j.ipm.2007.12.011
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <surname>Voorhees</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Harman</surname>
            ,
            <given-names>D. K.</given-names>
          </string-name>
          <year>2005</year>
          .
          <article-title>TREC: Experiment and Evaluation in Information Retrieval</article-title>
          . The MIT Press, Cambridge, MA.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bainbridge</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>A retrospective look at Greenstone: Lessons from the first decade</article-title>
          .
          <source>In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries</source>
          (Vancouver, Canada, June 18-23,
          <year>2007</year>
          ).
          <source>JCDL '07</source>
          . ACM Press, New York, NY,
          <fpage>147</fpage>
          -
          <lpage>156</lpage>
          . DOI=http://doi.acm.org/10.1145/1255175.1255204
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moffat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>T. C.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Managing Gigabytes: Compressing and Indexing Documents and Images</article-title>
          . Morgan Kaufmann, San Francisco.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <surname>Zobel</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1998</year>
          .
          <article-title>How reliable are the results of large-scale information retrieval experiments?</article-title>
          .
          <source>In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          (Melbourne, Australia,
          <year>1998</year>
          ).
          <source>SIGIR '98</source>
          . ACM Press, New York, NY,
          <fpage>307</fpage>
          -
          <lpage>314</lpage>
          . DOI=http://doi.acm.org/10.1145/290941.291014
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>