On Search Topic Variability in Interactive Information Retrieval

Ying-Hsang Liu
School of Information Studies
Charles Sturt University
Wagga Wagga NSW 2678, Australia
+61 2 6933 2171
yingliu@csu.edu.au

Nina Wacholder
School of Communication and Information
Rutgers University
New Brunswick NJ 08901, USA
+1 732 932 7500 ext. 8214
ninwac@rutgers.edu

ABSTRACT
This paper describes the research design and methodology we used to assess the usefulness of MeSH (Medical Subject Headings) terms for different types of users in an interactive search environment. We observed four different kinds of information seekers using an experimental IR system: (1) search novices, (2) domain experts, (3) search experts and (4) medical librarians. We employed a user-oriented evaluation methodology to assess the search effectiveness of automatic and manual indexing methods using the TREC Genomics Track 2004 data set. Our approach demonstrates (1) the reusability of a large test collection originally created for TREC, (2) an experimental design that specifically controls types of searchers, system versions and search topic pairs through a Graeco-Latin square design and (3) that search topic variability can be alleviated by using different sets of equally difficult topics and a well-controlled experimental design for contextual information retrieval evaluation.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval − query formulation, search process

General Terms
Measurement, Human Factors

Keywords
Information retrieval evaluation, Search topic variability, Interactive information retrieval

Appears in the Proceedings of The 2nd International Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (CIRSE 2010), March 28, 2010, Milton Keynes, UK. http://www.irit.fr/CIRSE/ Copyright owned by the authors.

1. INTRODUCTION
The creation and refinement of test designs and methodologies for IR system evaluation has been one of the greatest achievements of IR research and development. In the second Cranfield project [6], the main purpose was to evaluate the effectiveness of indexing techniques at a level of abstraction where users are not specifically considered, in a batch mode experiment.

The test designs and methodologies following the Cranfield paradigm culminated in the TREC (Text REtrieval Conference) activities that began in the 1990s. TREC has provided a research forum for comparing the search effectiveness of different retrieval techniques across IR systems in a laboratory, controlled environment [30]. The very large test collection used in TREC provided a test bed for researchers to experiment with the scalability of retrieval techniques, which had not been possible in previous years. However, taking specific aspects of user context into account within a more realistic test environment has remained challenging, in part because it is difficult to isolate the effects of user, search topic and system in IR experiments (see e.g. [7, 17] for recent efforts).

In batch experiments, the search effectiveness of different retrieval techniques is assessed by comparing search performance across queries. To meet the statistical requirements, IR researchers have widely used the micro-averaging method, in which statistics are computed over the queries when summarizing precision and recall values for comparing the search effectiveness of different retrieval techniques (see e.g. [25, 27]). Micro-averaging is intended to yield reliable comparisons of the search performance of different retrieval techniques by giving equal weight to each query.
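As an illustration of the equal-weight averaging just described, the following Python sketch computes mean average precision by averaging per-query average precision, with every query counting equally. The run and judgment structures are hypothetical and are not tied to any particular TREC file format.

```python
from statistics import mean

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query: precision at each relevant hit,
    averaged over the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

def mean_average_precision(runs, qrels):
    """Micro-averaging in the sense used above: every query gets equal weight.
    runs: {query_id: ranked list of doc ids}; qrels: {query_id: set of relevant doc ids}."""
    return mean(average_precision(runs[q], qrels[q]) for q in runs)

# Toy example with two hypothetical queries
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]}
qrels = {"q1": {"d1", "d7"}, "q2": {"d9"}}
print(round(mean_average_precision(runs, qrels), 3))  # 0.542
```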
However, within an interactive IR search environment that involves human searchers, it is difficult to use a large set of search topics. Empirical evidence has demonstrated that a search topic set size of 50 is necessary to determine the relative performance of different retrieval techniques in batch evaluations [3], because the variability of search topics has an overriding effect on search results. Another possible solution is to use different sets of topics in a non-matched-pair design [5, 21, 22], but in theory this requires a very large sample of independent searches.

This problem has been exacerbated by the fact that we have little theoretical understanding of the nature and properties of search topics for evaluation purposes [20]. From a systems perspective, recent in-depth failure analyses of variability in search topics for reliable and robust retrieval performance (e.g. [11, 28]) have contributed to our preliminary understanding of how and why IR systems fail to do well across all search topics. It remains unclear what kinds of search topics can be used to directly control the topic effect for IR evaluation purposes.

This study was designed to assess the search effectiveness of MeSH terms by different types of searchers in an interactive search environment. By using an experimental design that controls searchers, system versions and search topic pairs, together with a relatively large number of search topics, we were able to demonstrate an IR user experiment that specifically controls search topic variability and assesses the user effect on search effectiveness within the laboratory IR framework (see e.g. [14, 15] for recent discussions).

2. METHOD
Thirty-two searchers from a major public university and nearby medical libraries in the northeastern US participated in the study. Each searcher belonged to one of four groups: (1) Search Novices (SN), (2) Domain Experts (DE), (3) Search Experts (SE) and (4) Medical Librarians (ML).

The experimental task was to conduct a total of eight searches to help biologists conduct their research. Participants searched using two versions of the system: one in which abstracts and MeSH terms were displayed (MeSH+) and another in which they had to formulate their own terms based only on the display of abstracts (MeSH−). Participants conducted four searches with each version: in one they could browse a displayed list of MeSH terms (MeSH+); in the other they could not (MeSH−). Half the participants used the MeSH+ system first; half used MeSH− first. Each participant conducted searches on eight different topics.

The experimental setting for most searchers was a university office; for some searchers, it was a medical library. Before they began searching, participants were briefly trained in how to use the MeSH terms. We kept search logs that recorded search terms, a ranked list of retrieved documents, and time-stamps.

2.1 Subjects
We used purposive sampling to recruit our subjects, since we were concerned with the impact of specific searcher characteristics on search effectiveness. The key searcher characteristics were level of domain knowledge in the biomedical domain and whether the searcher had substantial search training. The four types of searchers were distinguished by their levels of domain knowledge and search training.
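The exact log format is not specified here; as a purely illustrative sketch, the record below captures the elements mentioned above (search terms, ranked results and time-stamps). The field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class SearchLogEntry:
    """One logged search iteration; field names are hypothetical."""
    participant_id: int          # 1-32
    topic_id: int                # TREC Genomics Track 2004 topic number
    system_version: str          # "MeSH+" or "MeSH-"
    search_terms: str            # query as entered by the searcher
    ranked_doc_ids: List[str]    # retrieved document ids, in rank order
    timestamp: datetime = field(default_factory=datetime.now)
```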
2.2 Experimental design
The experiment was a 4×2×2 factorial design with four types of searchers, two versions of an experimental system and controlled search topic pairs. The system versions, the types of searchers (distinguished by levels of domain knowledge and search training) and the search topic pairs were controlled by a Graeco-Latin square balanced design [8], which also takes possible ordering effects into account. The requirement for this experimental design is that the examined variables do not interact and that each variable has the same number of levels [16]. The treatment layout of a 4×4 Graeco-Latin square design is illustrated in Figure 1.

Figure 1. 4×4 Graeco-Latin square design. Note: numbers 1-16 refer to participant IDs; SN, DE, SE and ML refer to types of searchers (SN = Search Novices, DE = Domain Experts, SE = Search Experts, ML = Medical Librarians); shaded and non-shaded blocks refer to the MeSH+ and MeSH− versions of the experimental system; numbers in blocks refer to search topic ID numbers from the TREC Genomics Track 2004 data set. The 10 search topic pairs, randomly selected from a pool of 20 selected topics, are (38, 12), (29, 50), (42, 46), (32, 15), (27, 45), (9, 36), (30, 20), (2, 43), (1, 49) and (33, 23).

Because of the potentially interfering effect of search topic variability on search performance in IR evaluation, we used a design that included a relatively large number of search topics. In theory, the effect of topic variability and of topic-system interaction on system performance could be eliminated by averaging the performance scores over the topics (the micro-averaging method), together with the use of a very large number of search topics. The TREC standard ad hoc task evaluation studies [1, 3] and other proposals for test collections (e.g. [20-22, 24, 29]) have been concerned with the large search topic variability in batch experiments. However, in a user-centered IR experiment it is not feasible to use as many as 50 search topics because of human fatigue.

We controlled search topic pairs with a balanced design in order to alleviate the overriding effect of search topic variability. We assumed that all the search topics are equally difficult, since we do not have a good theory of what makes some search topics more difficult than others. By design we ensured that each search topic pair was assigned to all types of searchers and was searched at least twice by the same type of searcher. This design required a total of 10 search topic pairs and a minimum of 16 participants.
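To make the design concrete, the Python sketch below builds a 4×4 Graeco-Latin square from two mutually orthogonal Latin squares, one over searcher types and one over blocks of topic pairs, so that every searcher type is crossed with every block exactly once. The particular square, the block-to-topic-pair mapping and the row/column interpretation are illustrative assumptions, not the exact treatment layout of Figure 1.

```python
from itertools import product

# Two mutually orthogonal 4x4 Latin squares: superimposing them yields each
# (searcher type, topic-pair block) combination exactly once.
SEARCHER_SQUARE = [
    ["SN", "DE", "SE", "ML"],
    ["DE", "SN", "ML", "SE"],
    ["SE", "ML", "SN", "DE"],
    ["ML", "SE", "DE", "SN"],
]
BLOCK_SQUARE = [
    [0, 1, 2, 3],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
    [1, 0, 3, 2],
]

# Hypothetical assignment of four of the study's topic pairs to the four blocks.
TOPIC_PAIRS = [(38, 12), (29, 50), (27, 45), (1, 49)]

def is_graeco_latin(a, b, n=4):
    """True if superimposing the two squares yields n*n distinct symbol pairs."""
    return len({(a[r][c], b[r][c]) for r, c in product(range(n), repeat=2)}) == n * n

assert is_graeco_latin(SEARCHER_SQUARE, BLOCK_SQUARE)

for row, col in product(range(4), repeat=2):
    searcher = SEARCHER_SQUARE[row][col]
    pair = TOPIC_PAIRS[BLOCK_SQUARE[row][col]]
    print(f"cell ({row + 1}, {col + 1}): searcher type {searcher}, topic pair {pair}")
```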
2.3 Search tasks and incentive system
The search task was designed to simulate online searching situations in which professional searchers look for information on behalf of users. We decided to use this relatively challenging task for untrained searchers because choosing realistic tasks such as this one would enhance the external validity of the experiment. Given the relatively difficult tasks, we were concerned that searchers might have problems completing all searches.

Because the research literature has suggested that the motivational characteristics of participants are a possible source of sample bias [23], we designed an incentive system to motivate the searchers. We promised monetary incentives according to each participant's search effectiveness. Each subject was paid $20 for participating and was paid up to $10 more based on the average number of relevant documents in the top ten search results across all search topics; on average each participant received an additional $4.40, with a range of $2.00-$8.00.

2.4 Experimental procedures
After signing the consent form, the participant filled out a searcher background questionnaire before the search assignment. After a brief training session, participants were assigned to one of the arranged experimental conditions and conducted the search tasks. When they were done with each search topic, they completed a search perception questionnaire and were asked to indicate the relevance of two pre-judged documents. A brief interview was conducted when they finished all search topics. Search logs with search terms and ranked retrieved documents were recorded.

The MeSH Browser [19], an online vocabulary look-up aid prepared by the U.S. National Library of Medicine, was designed to help searchers find appropriate MeSH terms and to display the hierarchy of terms for retrieval purposes. The MeSH Browser was only available when participants were assigned to the MeSH+ version of the experimental system; in the MeSH− version, participants had to formulate their own terms without the assistance of the MeSH Browser or displayed MeSH terms in bibliographic records.

Because we were concerned that the topics were so hard that even the medical librarians would not understand them, we administered a questionnaire on search topic understanding after each topic. The test items, two randomly selected pre-judged documents, one definitely relevant and the other definitely not relevant, were prepared from the data set [26].

Each search topic was allocated up to ten minutes. The last search within the time limit was used for calculating search performance. To keep the participants motivated and reward their effort, they were asked to orally indicate which previous search result would be the best answer when the search task was not finished within ten minutes.

2.5 Experimental system
For this study, it was important for participants to conduct their searches in a carefully controlled environment; our goal was to offer as much help as possible while still making sure that the help and search functions did not interfere with our ability to measure the impact of the MeSH terms. We built an information retrieval system based on the Greenstone Digital Library Software version 2.70 [9] because it provides reliable search functionality, a customizable search interface and good documentation [31].

We prepared two different search interfaces within a single Greenstone system: the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other required them to devise their own terms. One interface displayed MeSH terms in retrieved bibliographic records and the other did not. Because we were concerned that participants might respond to cues signaling the experimenter's intent, the search interfaces were labeled 'System Version A' and 'System Version B' rather than 'MeSH+ Version' and 'MeSH− Version' (see http://comminfo.rutgers.edu/irgs/gsdl/cgi-bin/library/). The MeSH− version served as the baseline, representing an automatic indexing system, whereas the MeSH+ version represented the performance of a manual indexing system. That is, MeSH terms added another layer of document representation in the MeSH+ version.

The experimental system was constructed as a Boolean-based system with ranking functions based on the TF×IDF weighting rule [32]. More specifically, MGPP (MG++), a re-implementation of the mg (Managing Gigabytes) searching and compression algorithms, was used for indexing and querying. Basic system features, including fielded searching, phrase searching, Boolean operators, case sensitivity, stemming and display of search history, were sufficient to fulfill the search tasks. The display of search history was necessary because it provided useful feedback regarding the magnitude of retrieved documents for difficult search tasks that usually required query reformulations.

Since our goal was specifically to investigate the usefulness of displayed MeSH terms, we deliberately refrained from implementing certain system features that allow users to take advantage of the hierarchical structure of MeSH terms, such as hyperlinked MeSH terms, an explode function that automatically includes all narrower terms, and automatic query expansion (see e.g. [13, 18]), all available on other online search systems. The use of those features would have invalidated the results by introducing other variables at the level of the search interface and query processing, although a full integration of those system features would have increased the usefulness of MeSH terms.
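Ranking itself was handled by Greenstone's MGPP indexer; the Python sketch below only illustrates the general TF×IDF idea with a common log-scaled weighting, not MGPP's exact formula. The toy documents and query are hypothetical.

```python
import math
from collections import Counter

def rank_by_tfidf(query_terms, documents):
    """documents: {doc_id: list of tokens}. Returns doc ids sorted by a simple TF x IDF score."""
    n_docs = len(documents)
    # document frequency of each query term
    df = {t: sum(1 for toks in documents.values() if t in toks) for t in query_terms}
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        score = 0.0
        for t in query_terms:
            if tf[t] and df[t]:
                # log-scaled term frequency times inverse document frequency
                score += (1 + math.log(tf[t])) * math.log(n_docs / df[t])
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: rank three hypothetical abstracts for the query "hypertension genes"
docs = {
    "pmid1": "genes associated with hypertension risk".split(),
    "pmid2": "compression algorithms for text indexing".split(),
    "pmid3": "candidate genes for stroke and hypertension in humans".split(),
}
print(rank_by_tfidf(["hypertension", "genes"], docs))
```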
2.6 Documents
The experimental system was set up on a server, using bibliographic records from the TREC Genomics Track 2004 document set [26]. This document collection was a 10-year (1994 to 2003) subset of MEDLINE with a total of 4,591,108 records. The test collection subset fed into the system comprised 75.0% of the whole collection, a total of 3,442,321 records, after excluding records without MeSH terms or abstracts.

We prepared two sets of documents for setting up the experimental system: the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other did not provide this search option. The difference was also reflected in the retrieved bibliographic records.
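As an illustration of the preprocessing step just described, the sketch below filters out records lacking MeSH terms or abstracts and derives the two document versions. It assumes the records have already been parsed into dictionaries with 'pmid', 'abstract' and 'mesh_terms' keys; the parsing of the raw MEDLINE data is omitted, and these key names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

def build_document_sets(records: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Drop records lacking an abstract or MeSH terms, then derive the two versions:
    MeSH+ records keep the MeSH headings, MeSH- records expose only the abstract."""
    mesh_plus, mesh_minus = [], []
    for rec in records:
        if not rec.get("abstract") or not rec.get("mesh_terms"):
            continue  # excluded from the 75% subset used in the study
        mesh_minus.append({"pmid": rec["pmid"], "abstract": rec["abstract"]})
        mesh_plus.append({"pmid": rec["pmid"], "abstract": rec["abstract"],
                          "mesh_terms": list(rec["mesh_terms"])})
    return mesh_plus, mesh_minus
```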
2.7 Search topics
The search topics used in this study were originally created for TREC Genomics Track 2004 for the purpose of evaluating the search effectiveness of different retrieval techniques (see Figure 2 for an example). They covered a range of genomics topics typically asked by biomedical researchers. Besides a unique ID number, each topic was constructed in a format that included title, need and context fields. The title field was a short query. The need field was a short description of the kind of material the biologists were interested in, whereas the context field provided background information for judging the relevance of documents. The need and context fields were designed to provide more possible search terms for system experimentation purposes.

ID: 39
Title: Hypertension
Need: Identify genes as potential genetic risk factors candidates for causing hypertension.
Context: A relevant document is one which discusses genes that could be considered as candidates to test in a randomized controlled trial which studies the genetic risk factors for stroke.

Figure 2. Sample search topic

Because of the technical nature of genomics topics, we wondered whether the search topics could be understood by human searchers, particularly those without advanced training in the biomedical field. TREC search topics were designed for machine runs with little or no consideration of searches by real users. We selected 20 of the 50 topics using the following procedure:
1. Consulting an experienced professional searcher with a biology background and a graduate student in neuroscience to help judge whether the topics would be comprehensible to participants who were not domain experts. Topics that used advanced technical vocabulary, such as specific genes, pathways and mechanisms, were excluded;
2. Ensuring that major concepts in the search topics could be mapped to MeSH by searching the MeSH Browser. For instance, topic 39 could be mapped to the MeSH preferred terms hypertension and risk factors;
3. Eliminating topics with very low MAP (mean average precision) and P10 (precision at top 10 documents) scores in the relevance judgment set, because these topics would be too difficult.
The selected topics were then randomly ordered to create ten search topic pairs for the experimental conditions (see Figure 1 for the search topic pairs).
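The topic fields described above can be read into a simple structure as sketched below. The sketch assumes the topics file is plain XML with TOPIC elements containing ID, TITLE, NEED and CONTEXT children; the element names mirror the fields above, but the exact file layout and the filename are assumptions.

```python
import xml.etree.ElementTree as ET

def parse_topics(path):
    """Return {topic_id: {'title': ..., 'need': ..., 'context': ...}}."""
    topics = {}
    root = ET.parse(path).getroot()
    for topic in root.iter("TOPIC"):
        tid = (topic.findtext("ID") or "").strip()
        topics[tid] = {
            "title": (topic.findtext("TITLE") or "").strip(),
            "need": (topic.findtext("NEED") or "").strip(),
            "context": (topic.findtext("CONTEXT") or "").strip(),
        }
    return topics

# Hypothetical usage:
# topics = parse_topics("genomics2004.topics.xml")
# topics["39"]["title"]  -> "Hypertension"
```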
2.8 Reliability of relevance judgment sets
We measured search outcome using standard precision and recall measures for accuracy, and time spent for user effort [6], because we were concerned with the effect of MeSH terms on search effectiveness as measured against the TREC assessments [12].

Theoretically speaking, the calculation of the recall measure requires relevance judgments for the whole test collection. However, it is almost impossible to obtain these judgments for a test collection with more than 3 million documents. For practical reasons, the recall measure relied on a pooling method that created a set of unique documents from the top 75 documents submitted by the 27 groups that participated in the TREC 2004 Genomics Track ad hoc task [26]. Empirical evidence has shown that recall calculated with a pooling method provides a reasonable approximation, although recall is likely to be overestimated [33]. This approach yielded an average pool of 976 judged documents per topic, with a range of 476-1450 [12].

It was quite likely that some of the participants in our experiment would retrieve documents that had not been judged. The existence of un-judged relevant documents, called sampling bias in the pooling method, depends on the pool depth and the diversity of retrieval methods, and may affect the reliability of the relevance judgment set [2]. The assumption that the pooled judgment set is a reasonable approximation of the complete relevance judgment set may become invalid when the test collection is very large.

To ensure that the TREC pooled relevance judgment set was sufficiently complete and valid for the current study, we analyzed the top 10 retrieved documents from each human run (32 searchers × 8 topics = 256 runs). Cross-tabulation results showed that about one-third of all documents retrieved in our study had not been judged in the TREC data set. More specifically, of a total of 2,277 analyzed documents, 762 (33.5%) had not been assigned relevance judgments. There were large variations in the percentage of un-judged documents per search topic, with a range of 0-59.3%.

To assess the impact of incomplete relevance judgments, we compared the top 10 ranked search results between the judged document set and the pooled document set for each topic. The judged document set was composed of the documents that matched the TREC data, i.e., the combination of judged not relevant and judged relevant. The un-judged documents, added to the pooled document set, were considered 'not relevant' in our calculations of search outcome. We used the precision-oriented measures MAP (mean average precision), P10 (precision at top 10 documents) and P100 (precision at top 100 documents) to estimate the impact of incomplete judgments.

The paired t-test results by search topic revealed significant differences between the two sets in terms of the MAP (t(19) = -3.69, p < .01), P10 (t(19) = -3.89, p < .001) and P100 (t(19) = -3.95, p < .001) measures. The mean of the differences for MAP, P10 and P100 was approximately 2.7%, 9.9% and 4.9% respectively. We concluded that the TREC relevance judgments are applicable to this study.
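The comparison above can be sketched as follows: score each topic's ranked results once against the judged documents only and once against the pooled set in which unjudged documents count as not relevant, then run a paired t-test over topics. The data structures are hypothetical and the sketch uses P10 only; it is one reading of the procedure, not the exact code used in the study.

```python
from scipy import stats

def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k documents that are judged relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant_doc_ids) / k

def compare_judgment_sets(results, relevant, judged, k=10):
    """results: {topic: ranked doc ids}; relevant: {topic: judged-relevant ids};
    judged: {topic: all judged ids, relevant or not}.
    'Judged only' drops unjudged documents before scoring; 'pooled' keeps them,
    implicitly treating them as not relevant."""
    judged_scores, pooled_scores = [], []
    for topic, ranked in results.items():
        judged_only = [d for d in ranked if d in judged[topic]]
        judged_scores.append(precision_at_k(judged_only, relevant[topic], k))
        pooled_scores.append(precision_at_k(ranked, relevant[topic], k))
    t_stat, p_value = stats.ttest_rel(judged_scores, pooled_scores)
    return t_stat, p_value
```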
2.9 Limitations of the design
This study was designed to assess the impact of MeSH terms on search effectiveness in an interactive search environment. One limitation of the design was that participants were a self-selected group of searchers who may not be representative of the population. The interaction of selection biases with the experimental variable, i.e., the displayed MeSH terms, was another possible factor limiting the generalizability of this study [4]. The use of relatively technical and difficult search topics in the interactive search environment posed a threat to external validity, since those topics might not represent the typical topics received by medical librarians in practice.

The internal validity of this design was enhanced by specifically considering several aspects. We devised an incentive system to address the possible sampling bias arising from searchers' motivational characteristics in experimental settings. Besides levels of education, participants' domain knowledge was evaluated by a topic understanding test. The variability of search topics was alleviated by using a relatively large number of search topics through the experimental design. Selected search topics were checked for intelligibility in consultation with a domain expert and a medical librarian. A concept analysis form was used to help searchers recognize potentially useful terms. The reliability of the relevance judgment sets was ensured by additional analysis of the top 10 search results from our human searchers.

3. DISCUSSION AND CONCLUSION
The Cranfield paradigm has been very useful for comparing the search effectiveness of different retrieval techniques at a level of abstraction that simulates user search performance. Putting users in the loop of IR experiments is particularly challenging because it is difficult to separate the effects of systems, searchers and topics, and search topics have had dominating effects [17]. To alleviate search topic variability in interactive IR experiments, we consider how to increase the topic set size by experimental design within the laboratory IR framework.

This study has demonstrated that a total of 20 search topics can be used in an interactive experiment through a Graeco-Latin square balanced design and the use of different sets of carefully selected topics. We assume that the selected topics are equally difficult, since we do not have a good theory of search topics that can directly control topic difficulty for evaluation purposes. Recent attempts to use reduced topic sets and non-matched topics (see e.g. [5, 10]) indirectly support our experimental design considerations of search topic variability and topic difficulty. However, an important theoretical question remains: how can we better control topic effects in batch and user IR experiments?

4. ACKNOWLEDGMENTS
This study was funded by NSF grant #0414557, PIs Michael Lesk and Nina Wacholder. We thank the anonymous reviewers for their constructive comments.

5. REFERENCES
[1] Banks, D., Over, P. and Zhang, N.-F. 1999. Blind men and elephants: Six approaches to TREC data. Inform Retrieval, 1, 1/2 (April 1999), 7-34. DOI=http://dx.doi.org/10.1023/A:1009984519381
[2] Buckley, C., Dimmick, D., Soboroff, I. and Voorhees, E. 2007. Bias and the limits of pooling for large collections. Inform Retrieval, 10, 6 (December 2007), 491-508. DOI=http://dx.doi.org/10.1007/s10791-007-9032-x
[3] Buckley, C. and Voorhees, E. M. 2005. Retrieval system evaluation. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 53-75.
[4] Campbell, D. T., Stanley, J. C. and Gage, N. L. 1966. Experimental and Quasi-Experimental Designs for Research. R. McNally, Chicago.
[5] Cattelan, M. and Mizzaro, S. 2009. IR evaluation without a common set of topics. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (Cambridge, UK, September 10-12, 2009). ICTIR 2009. Springer, Berlin, 342-345. DOI=http://dx.doi.org/10.1007/978-3-642-04417-5_35
[6] Cleverdon, C. W. 1967. The Cranfield tests on index language devices. Aslib Proc, 19, 6 (1967), 173-193. DOI=http://dx.doi.org/10.1108/eb050097
[7] Dumais, S. T. and Belkin, N. J. 2005. The TREC Interactive Track: Putting the user into search. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 123-152.
[8] Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd, Edinburgh.
[9] Greenstone Digital Library Software (Version 2.70). 2006. Department of Computer Science, The University of Waikato, New Zealand. Available at: http://prdownloads.sourceforge.net/greenstone/gsdl-2.70-export.zip
[10] Guiver, J., Mizzaro, S. and Robertson, S. 2009. A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Trans. Inf. Syst., 27, 4 (November 2009), 1-26. DOI=http://doi.acm.org/10.1145/1629096.1629099
[11] Harman, D. and Buckley, C. 2009. Overview of the Reliable Information Access Workshop. Inform Retrieval, 12, 6 (December 2009), 615-641. DOI=http://dx.doi.org/10.1007/s10791-009-9101-4
[12] Hersh, W., Bhupatiraju, R., Ross, L., Roberts, P., Cohen, A. and Kraemer, D. 2006. Enhancing access to the Bibliome: The TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1, 3 (March 2006). DOI=http://dx.doi.org/10.1186/1747-5333-1-3
[13] Hersh, W. R. 2008. Information Retrieval: A Health and Biomedical Perspective. Springer, New York.
[14] Ingwersen, P. and Järvelin, K. 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Dordrecht.
[15] Ingwersen, P. and Järvelin, K. 2007. On the holistic cognitive theory for information retrieval. In Proceedings of the First International Conference on the Theory of Information Retrieval (ICTIR) (Budapest, Hungary, 2007). Foundation for Information Society.
[16] Kirk, R. E. 1995. Experimental Design: Procedures for the Behavioral Sciences. Brooks/Cole, Pacific Grove, CA.
[17] Lagergren, E. and Over, P. 1998. Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 164-172. DOI=http://doi.acm.org/10.1145/290941.290986
[18] Lu, Z., Kim, W. and Wilbur, W. 2009. Evaluation of query expansion using MeSH in PubMed. Inform Retrieval, 12, 1 (February 2009), 69-80. DOI=http://dx.doi.org/10.1007/s10791-008-9074-8
[19] MeSH Browser (2003 MeSH). 2004. U.S. National Library of Medicine. Available at: http://www.nlm.nih.gov/mesh/2003/MBrowser.html
[20] Robertson, S. E. 1981. The methodology of information retrieval experiment. In Sparck Jones, K. (Ed.), Information Retrieval Experiment, Butterworth, London, 9-31.
[21] Robertson, S. E. 1990. On sample sizes for non-matched-pair IR experiments. Inform Process Manag, 26, 6 (1990), 739-753. DOI=http://dx.doi.org/10.1016/0306-4573(90)90049-8
[22] Robertson, S. E., Thompson, C. L. and Macaskill, M. J. 1986. Weighting, ranking and relevance feedback in a front-end system. Journal of Information and Image Management, 12, 1/2 (January 1986), 71-75. DOI=http://dx.doi.org/10.1177/016555158601200112
[23] Sharp, E. C., Pelletier, L. G. and Levesque, C. 2006. The double-edged sword of rewards for participation in psychology experiments. Can J Beh Sci, 38, 3 (July 2006), 269-277. DOI=http://dx.doi.org/10.1037/cjbs2006014
[24] Sparck Jones, K. and van Rijsbergen, C. J. 1976. Information retrieval test collections. J Doc, 32, 1 (March 1976), 59-75. DOI=http://dx.doi.org/10.1108/eb026616
[25] Tague-Sutcliffe, J. 1992. The pragmatics of information retrieval experimentation, revisited. Inform Process Manag, 28, 4 (1992), 467-490. DOI=http://dx.doi.org/10.1016/0306-4573(92)90005-K
[26] TREC 2004 Genomics Track document set data file. 2005. Available at: http://ir.ohsu.edu/genomics/data/2004/
[27] van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths, London.
[28] Voorhees, E. M. 2005. The TREC robust retrieval track. SIGIR Forum, 39, 1 (June 2005), 11-20. DOI=http://doi.acm.org/10.1145/1067268.1067272
[29] Voorhees, E. M. 2008. On test collections for adaptive information retrieval. Inform Process Manag, 44, 6 (November 2008), 1879-1885. DOI=http://dx.doi.org/10.1016/j.ipm.2007.12.011
[30] Voorhees, E. M. and Harman, D. K. 2005. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge, MA.
[31] Witten, I. H. and Bainbridge, D. 2007. A retrospective look at Greenstone: Lessons from the first decade. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (Vancouver, Canada, June 18-23, 2007). JCDL '07. ACM Press, New York, NY, 147-156. DOI=http://doi.acm.org/10.1145/1255175.1255204
[32] Witten, I. H., Moffat, A. and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.
[33] Zobel, J. 1998. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 307-314. DOI=http://doi.acm.org/10.1145/290941.291014