=Paper=
{{Paper
|id=None
|storemode=property
|title=On Search Topic Variability in Interactive Information Retrieval
|pdfUrl=https://ceur-ws.org/Vol-569/paper6.pdf
|volume=Vol-569
}}
==On Search Topic Variability in Interactive Information Retrieval==
Ying-Hsang Liu
School of Information Studies, Charles Sturt University
Wagga Wagga NSW 2678, Australia
+61 2 6933 2171
yingliu@csu.edu.au

Nina Wacholder
School of Communication and Information, Rutgers University
New Brunswick NJ 08901, USA
+1 732 932 7500 ext. 8214
ninwac@rutgers.edu
ABSTRACT
This paper describes the research design and methodologies we used to assess the usefulness of MeSH (Medical Subject Headings) terms for different types of users in an interactive search environment. We observed four different kinds of information seekers using an experimental IR system: (1) search novices, (2) domain experts, (3) search experts and (4) medical librarians. We employed a user-oriented evaluation methodology to assess the search effectiveness of automatic and manual indexing methods using the TREC Genomics Track 2004 data set. Our approach demonstrates (1) the reusability of a large test collection originally created for TREC, (2) an experimental design that specifically considers types of searchers, system versions and search topic pairs through a Graeco-Latin square design and (3) that search topic variability can be alleviated by using different sets of equally difficult topics and a well-controlled experimental design for contextual information retrieval evaluation.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval - query formulation, search process

General Terms
Measurement, Human Factors

Keywords
Information retrieval evaluation, search topic variability, interactive information retrieval

Appears in the Proceedings of the 2nd International Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (CIRSE 2010), March 28, 2010, Milton Keynes, UK. http://www.irit.fr/CIRSE/ Copyright owned by the authors.

1. INTRODUCTION
The creation and refinement of test designs and methodologies for IR system evaluation has been one of the greatest achievements of IR research and development. In the second Cranfield project [6], the main purpose was to evaluate the effectiveness of indexing techniques at a level of abstraction where users are not specifically considered, in a batch-mode experiment. The test design and methodology following the Cranfield paradigm culminated in the TREC (Text REtrieval Conference) activities that began in the 1990s. TREC has provided a research forum for comparing the search effectiveness of different retrieval techniques across IR systems in a controlled laboratory environment [30]. The very large test collections used in TREC provided a test bed for researchers to experiment with the scalability of retrieval techniques, which had not been possible in previous years. However, taking specific aspects of user context into account within a more realistic test environment has been challenging, in part because it is difficult to isolate the effects of user, search topic and system in IR experiments (see e.g., [7, 17] for recent efforts).

In batch experiments the search effectiveness of different retrieval techniques is established by comparing the search performance of queries. To meet statistical requirements, IR researchers have widely used the micro-averaging method of performing statistics on the queries when summarizing precision and recall values for comparing the search effectiveness of different retrieval techniques (see e.g., [25, 27]). Micro-averaging is intended to obtain reliable results in comparing the search performance of different retrieval techniques by giving equal weight to each query.
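In the sense used here, micro-averaging simply gives every query the same weight when per-query scores are aggregated. For a query set Q and a per-query effectiveness score such as average precision AP(q), the collection-level score (mean average precision) is the unweighted mean; stated in notation not used in the original paper:

\[ \mathrm{MAP} = \frac{1}{|Q|} \sum_{q \in Q} \mathrm{AP}(q) \]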
However, within an interactive IR search environment that involves human searchers, it is difficult to use a large set of search topics. Empirical evidence has demonstrated that a search topic set size of 50 is necessary to determine the relative performance of different retrieval techniques in batch evaluations [3], because the variability of search topics has an overriding effect on search results. Another possible solution is to use different sets of topics in a non-matched-pair design [5, 21, 22], but in theory this requires a very large sample of independent searches.

This problem has been exacerbated by the fact that we have little theoretical understanding of the nature and properties of search topics for evaluation purposes [20]. From a systems perspective, recent in-depth failure analyses of variability in search topics for reliable and robust retrieval performance (e.g., [11, 28]) have contributed to a preliminary understanding of how and why IR systems fail to do well across all search topics. It remains unclear what kinds of search topics can be used to directly control the topic effect for IR evaluation purposes.

This study was designed to assess the search effectiveness of MeSH terms used by different types of searchers in an interactive search environment. Through an experimental design that controls searchers, system versions and search topic pairs, together with the use of a relatively large number of search topics, we were able to demonstrate an IR user experiment that specifically controls search topic variability and assesses the user effect on search effectiveness within the laboratory IR framework (see e.g., [14, 15] for recent discussions).
2. METHOD
Thirty-two searchers from a major public university and nearby medical libraries in the northeastern US participated in the study. Each searcher belonged to one of four groups: (1) Search Novices (SN), (2) Domain Experts (DE), (3) Search Experts (SE) and (4) Medical Librarians (ML).

The experimental task was to conduct a total of eight searches to help biologists conduct their research. Participants searched with two versions of the system: one in which abstracts and MeSH terms were displayed (MeSH+) and one in which they had to formulate their own terms based only on the display of abstracts (MeSH−). Participants conducted four searches with each version: in one they browsed a displayed list of MeSH terms (MeSH+); in the other (MeSH−) they did not. Half the participants used the MeSH+ system first; half used MeSH− first. Each participant conducted searches on eight different topics.

The experimental setting for most searchers was a university office; for some searchers it was a medical library. Before they began searching, participants were briefly trained in how to use the MeSH terms. We kept search logs that recorded search terms, the ranked lists of retrieved documents and time stamps.

2.1 Subjects
We used a purposive sampling method to recruit our subjects, since we were concerned with the impact of specific searcher characteristics on search effectiveness. The key searcher characteristics were the level of domain knowledge in the biomedical domain and whether the searcher had substantial search training. The four types of searchers were distinguished by their levels of domain knowledge and search training.

2.2 Experimental design
The experiment was a 4×2×2 factorial design with four types of searchers, two versions of an experimental system and controlled search topic pairs. The versions of the system, the types of searchers (distinguished by levels of domain knowledge and search training) and the search topic pairs were controlled by a Graeco-Latin square balanced design [8]. Possible ordering effects were taken into account by the design. The requirement for this experimental design is that the examined variables do not interact and that each variable has the same number of levels [16]. The treatment layout of the 4×4 Graeco-Latin square design is illustrated in Figure 1.

Participant      1   2   3   4   5   6   7   8
Searcher type    SN  DE  SE  ML  DE  SN  ML  SE
Search topics    38  12  29  50  38  12  27  45
                 12  38  50  29  12  45  38  27
                 29  50  12  38  27  38  45  12
                 50  29  38  12  45  27  12  38
                 42  46  32  15   9  36  30  20
                 46  42  15  32  36   9  20  30
                 32  15  42  46  30  20   9  36
                 15  32  46  42  20  30  36   9

Participant      9   10  11  12  13  14  15  16
Searcher type    SE  ML  SN  DE  ML  SE  DE  SN
Search topics    29  50  27  45  42  46   9  36
                 50  29  45  27  46  36  42   9
                 27  45  29  50   9  42  36  46
                 45  27  50  29  36   9  46  42
                  2  43   1  49   2  43  33  23
                 43   1  49   2  43   2  23  33
                  1  49   2  43  33  23   2  43
                 49   2  43   1  23  33  43   2

Note. Numbers 1-16 refer to participant IDs. SN, DE, SE and ML refer to types of searchers: SN = Search Novices, DE = Domain Experts, SE = Search Experts, ML = Medical Librarians. In the original figure, shaded and non-shaded blocks distinguish the MeSH+ and MeSH− versions of the experimental system. Numbers in the blocks refer to search topic ID numbers from the TREC Genomics Track 2004 data set. The 10 search topic pairs, randomly selected from a pool of 20 selected topics, are (38, 12), (29, 50), (42, 46), (32, 15), (27, 45), (9, 36), (30, 20), (2, 43), (1, 49) and (33, 23).

Figure 1. 4×4 Graeco-Latin square design

Because of the potential interfering effect of search topic variability on search performance in IR evaluation, we used a design that included a relatively large number of search topics. In theory, the effect of topic variability and topic-system interaction on system performance could be eliminated by averaging the performance scores over the topics (the micro-averaging method), together with the use of a very large number of search topics. The TREC standard ad hoc task evaluation studies ([1, 3]) and other proposals for test collections (e.g., [20-22, 24, 29]) have been concerned with the large search topic variability in batch experiments. However, in a user-centered IR experiment it is not feasible to use as many as 50 search topics because of human fatigue.

We controlled search topic pairs by a balanced design in order to alleviate the overriding effect of search topic variability. We assumed that all the search topics are equally difficult, since we do not have a good theory about what makes some search topics more difficult than others. By design we ensured that each search topic pair was assigned to all types of searchers and was searched at least twice by the same type of searcher. This design required a total of 10 search topic pairs and a minimum of 16 participants.
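A balanced layout of the kind shown in Figure 1 can be generated mechanically by superimposing two orthogonal 4×4 Latin squares, one square per controlled factor. The sketch below is a simplified illustration of that construction, not the scheduling code actually used in the study; the particular squares, the factor names and the sample topic pairs are assumptions chosen for the example.

<pre>
# A minimal sketch of building a 4x4 Graeco-Latin square assignment.
# Two mutually orthogonal 4x4 Latin squares; superimposing them yields a
# square in which every (level_A, level_B) pair occurs exactly once.
LATIN_A = [[0, 1, 2, 3],
           [1, 0, 3, 2],
           [2, 3, 0, 1],
           [3, 2, 1, 0]]
LATIN_B = [[0, 1, 2, 3],
           [2, 3, 0, 1],
           [3, 2, 1, 0],
           [1, 0, 3, 2]]

# Hypothetical factor levels for one block of the design.
SEARCHER_TYPES = ["SN", "DE", "SE", "ML"]                 # one factor
TOPIC_PAIRS = [(38, 12), (29, 50), (27, 45), (9, 36)]      # another factor

def graeco_latin_assignments():
    """Return a 4x4 layout of (searcher type, topic pair) treatments;
    each level of each factor appears exactly once per row and column."""
    layout = []
    for r in range(4):
        row = []
        for c in range(4):
            searcher = SEARCHER_TYPES[LATIN_A[r][c]]
            pair = TOPIC_PAIRS[LATIN_B[r][c]]
            row.append((searcher, pair))
        layout.append(row)
    return layout

if __name__ == "__main__":
    for r, row in enumerate(graeco_latin_assignments()):
        print(f"session block {r}: {row}")
</pre>

Because the two squares are orthogonal, every searcher type meets every topic pair exactly once across the square, which is the balancing property the design relies on.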
2.3 Search tasks and incentive system
The search task was designed to simulate online searching situations in which professional searchers look for information on behalf of users. We decided to use this relatively challenging task for untrained searchers because choosing realistic tasks such as this one would enhance the external validity of the experiment. Given the relative difficulty of the tasks, we were concerned that searchers might have problems completing all searches. Because the research literature has suggested that the motivational characteristics of participants are a possible source of sample bias [23], we designed an incentive system to motivate the searchers.

We promised monetary incentives tied to the participant's search effectiveness. Each subject was paid $20 for participating and was paid up to $10.00 more based on the average number of relevant documents in the top ten search results across all search topics; on average each participant received an additional $4.40, with a range of $2.00 to $8.00.
2.4 Experimental procedures
After signing the consent form, the participant filled out a searcher background questionnaire before the search assignment. After a brief training session, participants were assigned to one of the arranged experimental conditions and conducted the search tasks. They completed a search perception questionnaire and were asked to indicate the relevance of two pre-judged documents when they were done with each search topic. A brief interview was conducted when they had finished all search topics. Search logs with search terms and ranked retrieved documents were recorded.

The MeSH Browser [19], an online vocabulary look-up aid prepared by the U.S. National Library of Medicine, was designed to help searchers find appropriate MeSH terms and display the hierarchy of terms for retrieval purposes. The MeSH Browser was only available when participants were assigned to the MeSH+ version of the experimental system; in the MeSH− version, participants had to formulate their own terms without the assistance of the MeSH Browser or of displayed MeSH terms in bibliographic records.

Because we were concerned that the topics were so hard that even the medical librarians would not understand them, we used a questionnaire regarding search topic understanding after each topic. The test items, two randomly selected pre-judged documents per topic, one definitely relevant and the other definitely not relevant, were prepared from the data set [26].

Each search topic was allocated up to ten minutes. The last search within the time limit was used for calculating search performance. To keep the participants motivated and reward their effort, they were asked to indicate orally which previous search result would be the best answer when the search task was not finished within ten minutes.
2.5 Experimental system
For this study it was important for participants to conduct their searches in a carefully controlled environment; our goal was to offer as much help as possible while still making sure that the help and search functions did not interfere with our ability to measure the impact of the MeSH terms. We built an information retrieval system based on the Greenstone Digital Library Software version 2.70 [9] because it provides reliable search functionality, a customizable search interface and good documentation [31].

We prepared two different search interfaces within a single Greenstone system: the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other required them to devise their own terms. One interface displayed MeSH terms in retrieved bibliographic records and the other did not. Because we were concerned that participants might respond to cues signaling the experimenter's intent, the search interfaces were labeled 'System Version A' and 'System Version B' for the 'MeSH+ Version' and 'MeSH− Version' respectively (see http://comminfo.rutgers.edu/irgs/gsdl/cgi-bin/library/). The MeSH− version was used as the baseline, representing an automatic indexing system, whereas the MeSH+ version served to represent the performance of a manual indexing system. That is, MeSH terms added another layer of document representation to the MeSH+ version.

The experimental system was constructed as a Boolean-based system with ranking functions based on the TF×IDF weighting rule [32]. More specifically, MGPP (MG++), a re-implementation of the mg (Managing Gigabytes) searching and compression algorithms, was used as the indexing and querying engine. Basic system features, including fielded searching, phrase searching, Boolean operators, case sensitivity, stemming and display of search history, were sufficient to fulfill the search tasks. The display of search history was necessary because it provided useful feedback regarding the magnitude of retrieved documents for difficult search tasks that usually required query reformulations.

Since our goal was specifically to investigate the usefulness of displayed MeSH terms, we deliberately refrained from implementing certain system features that allow users to take advantage of the hierarchical structure of MeSH, such as hyperlinked MeSH terms, an explode function that automatically includes all narrower terms, and automatic query expansion (see e.g. [13, 18]), all available on other online search systems. The use of those features would have confounded the results by introducing other variables at the level of the search interface and query processing, although a full integration of those system features would have increased the usefulness of MeSH terms.
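The ranking behaviour described above can be illustrated with a small TF×IDF scorer. This is a generic sketch of the weighting rule in the spirit of [32], not the MGPP implementation used inside Greenstone; the tokenization and the exact weighting variant are assumptions.

<pre>
import math
from collections import Counter

def tfidf_ranking(query, documents):
    """Rank documents for a query with a simple TF x IDF weighting:
    tf = term frequency in the document, idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    q_terms = query.lower().split()
    # Document frequency of each query term.
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for doc_id, terms in enumerate(tokenized):
        tf = Counter(terms)
        score = sum(tf[t] * math.log(n_docs / df[t])
                    for t in q_terms if df[t] > 0)
        scores.append((doc_id, score))
    # Highest-scoring documents first, as in a ranked result list.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

docs = ["hypertension genes and risk factors",
        "managing gigabytes compression algorithms",
        "genetic risk factors for hypertension and stroke"]
print(tfidf_ranking("hypertension risk factors", docs))
</pre>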
2.6 Documents
The experimental system was set up on a server using bibliographic records from the TREC 2004 Genomics document set [26]. The TREC Genomics Track 2004 document test collection was a 10-year (1994 to 2003) subset of MEDLINE with a total of 4,591,108 records. The subset fed into the system used 75.0% of the whole collection, a total of 3,442,321 records, excluding the records without MeSH terms or abstracts.

We prepared two sets of documents for setting up the experimental system, corresponding to the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other did not provide this search option. The difference was also reflected in the retrieved bibliographic records.
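The 75% subset described above is defined by a simple record-level filter: keep a MEDLINE record only if it has both an abstract and at least one MeSH term. A minimal sketch of that filtering step follows; the record structure (a dict with 'abstract' and 'mesh_terms' fields) is an assumed representation, not the actual format of the TREC Genomics files.

<pre>
def keep_record(record):
    """Return True if a bibliographic record qualifies for the test subset:
    it must have a non-empty abstract and at least one MeSH term."""
    return bool(record.get("abstract")) and bool(record.get("mesh_terms"))

def build_subset(records):
    """Filter an iterable of MEDLINE-like records down to the usable subset."""
    return [r for r in records if keep_record(r)]

# Toy records illustrating the rule.
sample = [
    {"pmid": "1", "abstract": "text", "mesh_terms": ["Hypertension"]},
    {"pmid": "2", "abstract": "", "mesh_terms": ["Risk Factors"]},   # dropped
    {"pmid": "3", "abstract": "text", "mesh_terms": []},             # dropped
]
print([r["pmid"] for r in build_subset(sample)])   # -> ['1']
</pre>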
2.7 Search topics
The search topics used in this study were originally created for the TREC Genomics Track 2004 for the purpose of evaluating the search effectiveness of different retrieval techniques (see Figure 2 for an example). They covered a range of genomics topics typically asked by biomedical researchers. Besides a unique ID number, each topic was constructed in a format that included title, need and context fields. The title field was a short query. The need field was a short description of the kind of material the biologists are interested in, whereas the context field provides background information for judging the relevance of documents. The need and context fields were designed to provide more possible search terms for system experimentation purposes.

ID: 39
Title: Hypertension
Need: Identify genes as potential genetic risk factor candidates for causing hypertension.
Context: A relevant document is one which discusses genes that could be considered as candidates to test in a randomized controlled trial which studies the genetic risk factors for stroke.

Figure 2. Sample search topic

Because of the technical nature of genomics topics, we wondered whether the search topics could be understood by human searchers, particularly those without advanced training in the biomedical field. TREC search topics were designed for machine runs, with little or no consideration for searches by real users. We selected 20 of the 50 topics using the following procedure (see the sketch after this list):

1. Consulting an experienced professional searcher with a biology background and a graduate student in neuroscience to help judge whether the topics would be comprehensible to participants who were not domain experts. Topics that used advanced technical vocabulary, such as specific genes, pathways and mechanisms, were excluded;
2. Ensuring that major concepts in the search topics could be mapped to MeSH by searching the MeSH Browser. For instance, topic 39 could be mapped to the MeSH preferred terms hypertension and risk factors;
3. Eliminating topics with very low MAP (mean average precision) and P10 (precision at top 10 documents) scores in the relevance judgment set, because these topics would be too difficult.

The selected topics were then randomly ordered to create ten search topic pairs for the experimental conditions (see Figure 1 for the search topic pairs).
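Steps 2 and 3 of the procedure above amount to filtering a candidate pool on simple criteria and then pairing the survivors at random. The sketch below illustrates only that shape of computation; the thresholds, the per-topic scores and the sets of comprehensible and MeSH-mappable topics are hypothetical, not values from the study.

<pre>
import random

# Hypothetical per-topic baseline scores: topic_id -> (MAP, P10).
baseline = {38: (0.31, 0.6), 12: (0.27, 0.5), 29: (0.22, 0.4), 50: (0.18, 0.3),
            9: (0.28, 0.6), 23: (0.33, 0.7), 33: (0.25, 0.5), 49: (0.19, 0.4),
            7: (0.01, 0.0)}

MIN_MAP, MIN_P10 = 0.05, 0.1   # assumed cut-offs for "very low" scores

def select_topics(scores, comprehensible, mappable):
    """Keep topics that are understandable, MeSH-mappable and not too hard."""
    return [t for t, (ap, p10) in scores.items()
            if t in comprehensible and t in mappable
            and ap >= MIN_MAP and p10 >= MIN_P10]

def make_pairs(topics, seed=0):
    """Randomly order the selected topics and group them into pairs."""
    rng = random.Random(seed)
    shuffled = topics[:]
    rng.shuffle(shuffled)
    return [tuple(shuffled[i:i + 2]) for i in range(0, len(shuffled) - 1, 2)]

selected = select_topics(baseline,
                         comprehensible={38, 12, 29, 50, 9, 23, 33, 49},
                         mappable={38, 12, 29, 50, 9, 23, 33, 49, 7})
print(make_pairs(selected))
</pre>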
2.8 Reliability of relevance judgment sets
We measured search outcome using standard precision and recall measures for accuracy and time spent for user effort [6], because we were concerned with the usefulness of MeSH terms for search effectiveness as measured with the TREC assessments [12].

Theoretically speaking, the calculation of the recall measure requires relevance judgments for the whole test collection. However, it is almost impossible to obtain these judgments for a test collection with more than 3 million documents. For practical reasons the recall measure relied on a pooling method that created a set of unique documents from the top 75 documents submitted by the 27 groups that participated in the TREC 2004 Genomics Track ad hoc task [26]. Empirical evidence has shown that recall calculated with a pooling method provides a reasonable approximation, although recall is likely to be overestimated [33]. As a result of this approach, there was an average pool size of 976 documents with relevance judgments for each topic, with a range of 476-1450 [12].

It was quite likely that some of the participants in our experiment would retrieve documents that had not been judged. The existence of un-judged relevant documents, called sampling bias in the pooling method, is related to the pool depth and the diversity of retrieval methods and may affect the reliability of the relevance judgment set [2]. The assumption that the pooled judgment set is a reasonable approximation of the complete relevance judgment set may become invalid when the test collection is very large.

To ensure that the TREC pooled relevance judgment set was sufficiently complete and valid for the current study, we analyzed the top 10 retrieved documents from each human run (32 searchers × 8 topics = 256 runs). Cross-tabulation results showed that about one-third of all documents retrieved in our study had not been judged in the TREC data set. More specifically, of a total of 2277 analyzed documents, 762 (33.5%) had not been assigned relevance judgments. There were large variations in the percentage of un-judged documents across search topics, with a range of 0–59.3%.

To assess the impact of incomplete relevance judgments, we compared the top 10 ranked search results between the judged document set and the pooled document set for each topic. The judged document set was composed of the documents that matched the TREC data, i.e., the combination of judged not relevant and judged relevant documents. The un-judged documents, added to the pooled document set, were considered 'not relevant' in our calculations of search outcome. We used the precision-oriented measures MAP (mean average precision), P10 (precision at top 10 documents) and P100 (precision at top 100 documents) to estimate the impact of incomplete judgments.
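Treating un-judged documents as not relevant makes the precision-oriented measures straightforward to compute from a ranked list and a set of judged-relevant document IDs. The sketch below shows P@k and average precision under that convention; the data structures (a ranked list of document IDs and a set of relevant IDs per topic) are assumed representations, not the study's actual code.

<pre>
def precision_at_k(ranked_ids, relevant_ids, k):
    """P@k: fraction of the top k results that are judged relevant.
    Documents absent from the judgments simply count as not relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for d in top_k if d in relevant_ids) / k

def average_precision(ranked_ids, relevant_ids):
    """Average precision over the ranked list, again with un-judged
    documents treated as not relevant."""
    hits, ap_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            ap_sum += hits / rank
    return ap_sum / len(relevant_ids) if relevant_ids else 0.0

# Toy example: one run scored against one topic's judged-relevant set.
run = ["d3", "d7", "d1", "d9", "d4", "d2", "d8", "d5", "d6", "d0"]
qrels = {"d1", "d4", "d5"}
print(precision_at_k(run, qrels, 10), average_precision(run, qrels))
</pre>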
The paired t-test results by search topic revealed significant differences between the two sets in terms of the MAP (t(19) = -3.69, p < .01), P10 (t(19) = -3.89, p < .001) and P100 (t(19) = -3.95, p < .001) measures. The mean of the differences for MAP, P10 and P100 was approximately 2.7%, 9.9% and 4.9% respectively. Although the differences were statistically significant, their magnitudes were small, and we concluded that the TREC relevance judgments are applicable to this study.
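The comparison above is a paired test over the 20 topics, with one score per topic from the judged set and one from the pooled set. A minimal sketch of that computation with SciPy follows; the score arrays are illustrative placeholders, not the study's data.

<pre>
from scipy import stats

# Per-topic MAP scores for the two conditions (judged-only vs pooled).
# Illustrative numbers; in the study there were 20 paired values per measure.
map_judged = [0.42, 0.31, 0.55, 0.28, 0.47, 0.39, 0.51, 0.36, 0.44, 0.29]
map_pooled = [0.40, 0.30, 0.52, 0.27, 0.43, 0.38, 0.49, 0.33, 0.41, 0.28]

# Paired (dependent-samples) t-test across topics, as used for MAP, P10 and P100.
result = stats.ttest_rel(map_judged, map_pooled)
mean_diff = sum(j - p for j, p in zip(map_judged, map_pooled)) / len(map_judged)
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}, "
      f"mean difference = {mean_diff:.3f}")
</pre>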
2.9 Limitations of the design
This study was designed to assess the impact of MeSH terms on search effectiveness in an interactive search environment. One limitation of the design was that participants were a self-selected group of searchers that may not be representative of the population. The interaction effects of selection biases and the experimental variable, i.e., the displayed MeSH terms, were another possible factor limiting the generalizability of this study [4]. The use of relatively technical and difficult search topics in the interactive search environment also posed a threat to external validity, since those topics might not represent the typical topics received by medical librarians in practice.

The internal validity of the design was enhanced by specifically considering several aspects. We devised an incentive system to address the possible sampling bias arising from searchers' motivational characteristics in experimental settings. Besides levels of education, participants' domain knowledge was evaluated by a topic understanding test. The variability of search topics was alleviated by using a relatively large number of search topics through the experimental design. Selected search topics were checked for intelligibility in consultation with a domain expert and a medical librarian. A concept analysis form was used to help searchers recognize potentially useful terms. The reliability of the relevance judgment sets was ensured by an additional analysis of the top 10 search results from our human searchers.

3. DISCUSSION AND CONCLUSION
The Cranfield paradigm has been very useful for comparing the search effectiveness of different retrieval techniques at a level of abstraction that simulates user search performance. Putting users in the loop of IR experiments is particularly challenging because it is difficult to separate the effects of systems, searchers and topics, and the search topics have had dominating effects [17]. To alleviate search topic variability in interactive IR experiments, we consider how to increase the topic set size by experimental design within the laboratory IR framework.

This study has demonstrated that a total of 20 search topics can be used in an interactive experiment through a Graeco-Latin square balanced design and the use of different sets of carefully selected topics. We assume that the selected topics are equally difficult, since we do not have a good theory of search topics that can directly control topic difficulty for evaluation purposes. Recent attempts to use reduced topic sets and non-matched topics (see e.g., [5, 10]) indirectly support our experimental design considerations of search topic variability and topic difficulty. However, an important theoretical question remains: how can we better control the topic effects in batch and user IR experiments?

4. ACKNOWLEDGMENTS
This study was funded by NSF grant #0414557, PIs Michael Lesk and Nina Wacholder. We thank the anonymous reviewers for their constructive comments.
5. REFERENCES
[1] Banks, D., Over, P. and Zhang, N.-F. 1999. Blind men and elephants: Six approaches to TREC data. Inform Retrieval, 1, 1/2 (April 1999), 7-34. DOI=http://dx.doi.org/10.1023/A:1009984519381
[2] Buckley, C., Dimmick, D., Soboroff, I. and Voorhees, E. 2007. Bias and the limits of pooling for large collections. Inform Retrieval, 10, 6 (December 2007), 491-508. DOI=http://dx.doi.org/10.1007/s10791-007-9032-x
[3] Buckley, C. and Voorhees, E. M. 2005. Retrieval system evaluation. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 53-75.
[4] Campbell, D. T., Stanley, J. C. and Gage, N. L. 1966. Experimental and Quasi-Experimental Designs for Research. R. McNally, Chicago.
[5] Cattelan, M. and Mizzaro, S. 2009. IR evaluation without a common set of topics. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (Cambridge, UK, September 10-12, 2009). ICTIR 2009. Springer, Berlin, 342-345. DOI=http://dx.doi.org/10.1007/978-3-642-04417-5_35
[6] Cleverdon, C. W. 1967. The Cranfield tests on index language devices. Aslib Proc, 19, 6 (1967), 173-193. DOI=http://dx.doi.org/10.1108/eb050097
[7] Dumais, S. T. and Belkin, N. J. 2005. The TREC Interactive Track: Putting the user into search. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 123-152.
[8] Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd, Edinburgh.
[9] Greenstone Digital Library Software (Version 2.70). 2006. Department of Computer Science, The University of Waikato, New Zealand. Available at: http://prdownloads.sourceforge.net/greenstone/gsdl-2.70-export.zip
[10] Guiver, J., Mizzaro, S. and Robertson, S. 2009. A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Trans. Inf. Syst., 27, 4 (November 2009), 1-26. DOI=http://doi.acm.org/10.1145/1629096.1629099
[11] Harman, D. and Buckley, C. 2009. Overview of the Reliable Information Access Workshop. Inform Retrieval, 12, 6 (December 2009), 615-641. DOI=http://dx.doi.org/10.1007/s10791-009-9101-4
[12] Hersh, W., Bhupatiraju, R., Ross, L., Roberts, P., Cohen, A. and Kraemer, D. 2006. Enhancing access to the Bibliome: The TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1, 3 (March 2006). DOI=http://dx.doi.org/10.1186/1747-5333-1-3
[13] Hersh, W. R. 2008. Information Retrieval: A Health and Biomedical Perspective. Springer, New York.
[14] Ingwersen, P. and Järvelin, K. 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Dordrecht.
[15] Ingwersen, P. and Järvelin, K. 2007. On the holistic cognitive theory for information retrieval. In Proceedings of the First International Conference on the Theory of Information Retrieval (ICTIR) (Budapest, Hungary, 2007). Foundation for Information Society.
[16] Kirk, R. E. 1995. Experimental Design: Procedures for the Behavioral Sciences. Brooks/Cole, Pacific Grove, CA.
[17] Lagergren, E. and Over, P. 1998. Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 164-172. DOI=http://doi.acm.org/10.1145/290941.290986
[18] Lu, Z., Kim, W. and Wilbur, W. 2009. Evaluation of query expansion using MeSH in PubMed. Inform Retrieval, 12, 1 (February 2009), 69-80. DOI=http://dx.doi.org/10.1007/s10791-008-9074-8
[19] MeSH Browser (2003 MeSH). 2004. U.S. National Library of Medicine. Available at: http://www.nlm.nih.gov/mesh/2003/MBrowser.html
[20] Robertson, S. E. 1981. The methodology of information retrieval experiment. In Sparck Jones, K. (Ed.), Information Retrieval Experiment, Butterworth, London, 9-31.
[21] Robertson, S. E. 1990. On sample sizes for non-matched-pair IR experiments. Inform Process Manag, 26, 6 (1990), 739-753. DOI=http://dx.doi.org/10.1016/0306-4573(90)90049-8
[22] Robertson, S. E., Thompson, C. L. and Macaskill, M. J. 1986. Weighting, ranking and relevance feedback in a front-end system. Journal of Information and Image Management, 12, 1/2 (January 1986), 71-75. DOI=http://dx.doi.org/10.1177/016555158601200112
[23] Sharp, E. C., Pelletier, L. G. and Levesque, C. 2006. The double-edged sword of rewards for participation in psychology experiments. Can J Beh Sci, 38, 3 (July 2006), 269-277. DOI=http://dx.doi.org/10.1037/cjbs2006014
[24] Sparck Jones, K. and van Rijsbergen, C. J. 1976. Information retrieval test collections. J Doc, 32, 1 (March 1976), 59-75. DOI=http://dx.doi.org/10.1108/eb026616
[25] Tague-Sutcliffe, J. 1992. The pragmatics of information retrieval experimentation, revisited. Inform Process Manag, 28, 4 (1992), 467-490. DOI=http://dx.doi.org/10.1016/0306-4573(92)90005-K
[26] TREC 2004 Genomics Track document set data file. 2005. Available at: http://ir.ohsu.edu/genomics/data/2004/
[27] van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths, London.
[28] Voorhees, E. M. 2005. The TREC robust retrieval track. SIGIR Forum, 39, 1 (June 2005), 11-20. DOI=http://doi.acm.org/10.1145/1067268.1067272
[29] Voorhees, E. M. 2008. On test collections for adaptive information retrieval. Inform Process Manag, 44, 6 (November 2008), 1879-1885. DOI=http://dx.doi.org/10.1016/j.ipm.2007.12.011
[30] Voorhees, E. M. and Harman, D. K. 2005. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge, MA.
[31] Witten, I. H. and Bainbridge, D. 2007. A retrospective look at Greenstone: Lessons from the first decade. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (Vancouver, Canada, June 18-23, 2007). JCDL '07. ACM Press, New York, NY, 147-156. DOI=http://doi.acm.org/10.1145/1255175.1255204
[32] Witten, I. H., Moffat, A. and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.
[33] Zobel, J. 1998. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 307-314. DOI=http://doi.acm.org/10.1145/290941.291014