On Search Topic Variability in Interactive Information Retrieval

Ying-Hsang Liu
School of Information Studies
Charles Sturt University
Wagga Wagga NSW 2678, Australia
+61 2 6933 2171
yingliu@csu.edu.au

Nina Wacholder
School of Communication and Information
Rutgers University
New Brunswick NJ 08901, USA
+1 732 932 7500 ext. 8214
ninwac@rutgers.edu

ABSTRACT
This paper describes the research design and methodology we used to assess the usefulness of MeSH (Medical Subject Headings) terms for different types of users in an interactive search environment. We observed four different kinds of information seekers using an experimental IR system: (1) search novices, (2) domain experts, (3) search experts and (4) medical librarians. We employed a user-oriented evaluation methodology to assess the search effectiveness of automatic and manual indexing methods using the TREC Genomics Track 2004 data set. Our approach demonstrates (1) the reusability of a large test collection originally created for TREC, (2) an experimental design that specifically controls types of searchers, system versions and search topic pairs through a Graeco-Latin square design and (3) that search topic variability can be alleviated by using different sets of equally difficult topics and a well-controlled experimental design for contextual information retrieval evaluation.

Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval − query formulation, search process

General Terms
Measurement, Human Factors

Keywords
Information retrieval evaluation, Search topic variability, Interactive information retrieval

Appears in the Proceedings of The 2nd International Workshop on Contextual Information Access, Seeking and Retrieval Evaluation (CIRSE 2010), March 28, 2010, Milton Keynes, UK. http://www.irit.fr/CIRSE/ Copyright owned by the authors.

1. INTRODUCTION
The creation and refinement of test designs and methodologies for IR system evaluation has been one of the greatest achievements of IR research and development. In the second Cranfield project [6], the main purpose was to evaluate the effectiveness of indexing techniques at a level of abstraction where users are not specifically considered, in a batch mode experiment.

The test designs and methodologies following the Cranfield paradigm culminated in the TREC (Text REtrieval Conference) activities that began in the 1990s. TREC has provided a research forum for comparing the search effectiveness of different retrieval techniques across IR systems in a laboratory, controlled environment [30]. The very large test collection used in TREC provided a test bed for researchers to experiment with the scalability of retrieval techniques, which had not been possible in previous years. However, taking specific aspects of user context into account within a more realistic test environment has remained challenging, in part because it is difficult to isolate the effects of user, search topic and system in IR experiments (see e.g. [7, 17] for recent efforts).

In batch experiments, the search effectiveness of different retrieval techniques is assessed by comparing search performance across queries. To meet the statistical requirements, IR researchers have widely used the micro-averaging method, in which statistics are computed over the queries when summarizing precision and recall values for comparing the search effectiveness of different retrieval techniques (see e.g. [25, 27]). Micro-averaging is intended to yield reliable comparisons of the search performance of different retrieval techniques by giving equal weight to each query.
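As an illustration of the equal-weight averaging just described, the following Python sketch computes mean average precision by averaging per-query average precision, with every query counting equally. The run and judgment structures are hypothetical and are not tied to any particular TREC file format.

```python
from statistics import mean

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query: precision at each relevant hit,
    averaged over the total number of relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids) if relevant_doc_ids else 0.0

def mean_average_precision(runs, qrels):
    """Micro-averaging in the sense used above: every query gets equal weight.
    runs: {query_id: ranked list of doc ids}; qrels: {query_id: set of relevant doc ids}."""
    return mean(average_precision(runs[q], qrels[q]) for q in runs)

# Toy example with two hypothetical queries
runs = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9"]}
qrels = {"q1": {"d1", "d7"}, "q2": {"d9"}}
print(round(mean_average_precision(runs, qrels), 3))  # 0.542
```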
However, within an interactive IR search environment that involves human searchers, it is difficult to use a large set of search topics. Empirical evidence has demonstrated that a search topic set size of 50 is necessary to determine the relative performance of different retrieval techniques in batch evaluations [3], because the variability of search topics has an overriding effect on search results. Another possible solution is to use different sets of topics in a non-matched-pair design [5, 21, 22], but in theory this requires a very large sample of independent searches.

This problem has been exacerbated by the fact that we have little theoretical understanding of the nature and properties of search topics for evaluation purposes [20]. From a systems perspective, recent in-depth failure analyses of variability in search topics for reliable and robust retrieval performance (e.g. [11, 28]) have contributed to our preliminary understanding of how and why IR systems fail to do well across all search topics. It remains unclear what kinds of search topics can be used to directly control the topic effect for IR evaluation purposes.

This study was designed to assess the search effectiveness of MeSH terms by different types of searchers in an interactive search environment. By using an experimental design that controls searchers, system versions and search topic pairs, together with a relatively large number of search topics, we were able to demonstrate an IR user experiment that specifically controls search topic variability and assesses the user effect on search effectiveness within the laboratory IR framework (see e.g. [14, 15] for recent discussions).

2. METHOD
Thirty-two searchers from a major public university and nearby medical libraries in the northeastern US participated in the study. Each searcher belonged to one of four groups: (1) Search Novices (SN), (2) Domain Experts (DE), (3) Search Experts (SE) and (4) Medical Librarians (ML).

The experimental task was to conduct a total of eight searches to help biologists conduct their research. Participants searched using two versions of the system: one in which abstracts and MeSH terms were displayed (MeSH+) and another in which they had to formulate their own terms based only on the display of abstracts (MeSH−). Participants conducted four searches with each version: in one they could browse a displayed list of MeSH terms (MeSH+); in the other they could not (MeSH−). Half the participants used the MeSH+ system first; half used MeSH− first. Each participant conducted searches on eight different topics.

The experimental setting for most searchers was a university office; for some searchers, it was a medical library. Before they began searching, participants were briefly trained in how to use the MeSH terms. We kept search logs that recorded search terms, a ranked list of retrieved documents, and time-stamps.

2.1 Subjects
We used purposive sampling to recruit our subjects, since we were concerned with the impact of specific searcher characteristics on search effectiveness. The key searcher characteristics were level of domain knowledge in the biomedical domain and whether the searcher had substantial search training. The four types of searchers were distinguished by their levels of domain knowledge and search training.
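The exact log format is not specified here; as a purely illustrative sketch, the record below captures the elements mentioned above (search terms, ranked results and time-stamps). The field names are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class SearchLogEntry:
    """One logged search iteration; field names are hypothetical."""
    participant_id: int          # 1-32
    topic_id: int                # TREC Genomics Track 2004 topic number
    system_version: str          # "MeSH+" or "MeSH-"
    search_terms: str            # query as entered by the searcher
    ranked_doc_ids: List[str]    # retrieved document ids, in rank order
    timestamp: datetime = field(default_factory=datetime.now)
```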
2.2 Experimental design
The experiment was a 4×2×2 factorial design with four types of searchers, two versions of an experimental system and controlled search topic pairs. The system versions, the types of searchers (distinguished by levels of domain knowledge and search training) and the search topic pairs were controlled by a Graeco-Latin square balanced design [8], which also takes possible ordering effects into account. The requirement for this experimental design is that the examined variables do not interact and that each variable has the same number of levels [16]. The treatment layout of a 4×4 Graeco-Latin square design is illustrated in Figure 1.

Figure 1. 4×4 Graeco-Latin square design. Note: numbers 1-16 refer to participant IDs; SN, DE, SE and ML refer to types of searchers (SN = Search Novices, DE = Domain Experts, SE = Search Experts, ML = Medical Librarians); shaded and non-shaded blocks refer to the MeSH+ and MeSH− versions of the experimental system; numbers in blocks refer to search topic ID numbers from the TREC Genomics Track 2004 data set. The 10 search topic pairs, randomly selected from a pool of 20 selected topics, are (38, 12), (29, 50), (42, 46), (32, 15), (27, 45), (9, 36), (30, 20), (2, 43), (1, 49) and (33, 23).

Because of the potentially interfering effect of search topic variability on search performance in IR evaluation, we used a design that included a relatively large number of search topics. In theory, the effect of topic variability and of topic-system interaction on system performance could be eliminated by averaging the performance scores over the topics (the micro-averaging method), together with the use of a very large number of search topics. The TREC standard ad hoc task evaluation studies [1, 3] and other proposals for test collections (e.g. [20-22, 24, 29]) have been concerned with the large search topic variability in batch experiments. However, in a user-centered IR experiment it is not feasible to use as many as 50 search topics because of human fatigue.

We controlled search topic pairs with a balanced design in order to alleviate the overriding effect of search topic variability. We assumed that all the search topics are equally difficult, since we do not have a good theory of what makes some search topics more difficult than others. By design we ensured that each search topic pair was assigned to all types of searchers and was searched at least twice by the same type of searcher. This design required a total of 10 search topic pairs and a minimum of 16 participants.
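To make the design concrete, the Python sketch below builds a 4×4 Graeco-Latin square from two mutually orthogonal Latin squares, one over searcher types and one over blocks of topic pairs, so that every searcher type is crossed with every block exactly once. The particular square, the block-to-topic-pair mapping and the row/column interpretation are illustrative assumptions, not the exact treatment layout of Figure 1.

```python
from itertools import product

# Two mutually orthogonal 4x4 Latin squares: superimposing them yields each
# (searcher type, topic-pair block) combination exactly once.
SEARCHER_SQUARE = [
    ["SN", "DE", "SE", "ML"],
    ["DE", "SN", "ML", "SE"],
    ["SE", "ML", "SN", "DE"],
    ["ML", "SE", "DE", "SN"],
]
BLOCK_SQUARE = [
    [0, 1, 2, 3],
    [2, 3, 0, 1],
    [3, 2, 1, 0],
    [1, 0, 3, 2],
]

# Hypothetical assignment of four of the study's topic pairs to the four blocks.
TOPIC_PAIRS = [(38, 12), (29, 50), (27, 45), (1, 49)]

def is_graeco_latin(a, b, n=4):
    """True if superimposing the two squares yields n*n distinct symbol pairs."""
    return len({(a[r][c], b[r][c]) for r, c in product(range(n), repeat=2)}) == n * n

assert is_graeco_latin(SEARCHER_SQUARE, BLOCK_SQUARE)

for row, col in product(range(4), repeat=2):
    searcher = SEARCHER_SQUARE[row][col]
    pair = TOPIC_PAIRS[BLOCK_SQUARE[row][col]]
    print(f"cell ({row + 1}, {col + 1}): searcher type {searcher}, topic pair {pair}")
```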
2.3 Search tasks and incentive system
The search task was designed to simulate online searching situations in which professional searchers look for information on behalf of users. We decided to use this relatively challenging task for untrained searchers because choosing realistic tasks such as this one would enhance the external validity of the experiment. Given the relatively difficult tasks, we were concerned that searchers might have problems completing all searches.

Because the research literature has suggested that the motivational characteristics of participants are a possible source of sample bias [23], we designed an incentive system to motivate the searchers. We promised monetary incentives according to each participant's search effectiveness. Each subject was paid $20 for participating and was paid up to $10 more based on the average number of relevant documents in the top ten search results across all search topics; on average each participant received an additional $4.40, with a range of $2.00-$8.00.

2.4 Experimental procedures
After signing the consent form, the participant filled out a searcher background questionnaire before the search assignment. After a brief training session, participants were assigned to one of the arranged experimental conditions and conducted the search tasks. When they were done with each search topic, they completed a search perception questionnaire and were asked to indicate the relevance of two pre-judged documents. A brief interview was conducted when they finished all search topics. Search logs with search terms and ranked retrieved documents were recorded.

The MeSH Browser [19], an online vocabulary look-up aid prepared by the U.S. National Library of Medicine, was designed to help searchers find appropriate MeSH terms and to display the hierarchy of terms for retrieval purposes. The MeSH Browser was only available when participants were assigned to the MeSH+ version of the experimental system; in the MeSH− version, participants had to formulate their own terms without the assistance of the MeSH Browser or displayed MeSH terms in bibliographic records.

Because we were concerned that the topics were so hard that even the medical librarians would not understand them, we administered a questionnaire on search topic understanding after each topic. The test items, two randomly selected pre-judged documents, one definitely relevant and the other definitely not relevant, were prepared from the data set [26].

Each search topic was allocated up to ten minutes. The last search within the time limit was used for calculating search performance. To keep the participants motivated and reward their effort, they were asked to orally indicate which previous search result would be the best answer when the search task was not finished within ten minutes.

2.5 Experimental system
For this study, it was important for participants to conduct their searches in a carefully controlled environment; our goal was to offer as much help as possible while still making sure that the help and search functions did not interfere with our ability to measure the impact of the MeSH terms. We built an information retrieval system based on the Greenstone Digital Library Software version 2.70 [9] because it provides reliable search functionality, a customizable search interface and good documentation [31].

We prepared two different search interfaces within a single Greenstone system: the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other required them to devise their own terms. One interface displayed MeSH terms in retrieved bibliographic records and the other did not. Because we were concerned that participants might respond to cues signaling the experimenter's intent, the search interfaces were labeled 'System Version A' and 'System Version B' rather than 'MeSH+ Version' and 'MeSH− Version' (see http://comminfo.rutgers.edu/irgs/gsdl/cgi-bin/library/). The MeSH− version served as the baseline, representing an automatic indexing system, whereas the MeSH+ version represented the performance of a manual indexing system. That is, MeSH terms added another layer of document representation in the MeSH+ version.

The experimental system was constructed as a Boolean-based system with ranking functions based on the TF×IDF weighting rule [32]. More specifically, MGPP (MG++), a re-implementation of the mg (Managing Gigabytes) searching and compression algorithms, was used for indexing and querying. Basic system features, including fielded searching, phrase searching, Boolean operators, case sensitivity, stemming and display of search history, were sufficient to fulfill the search tasks. The display of search history was necessary because it provided useful feedback regarding the magnitude of retrieved documents for difficult search tasks that usually required query reformulations.

Since our goal was specifically to investigate the usefulness of displayed MeSH terms, we deliberately refrained from implementing certain system features that allow users to take advantage of the hierarchical structure of MeSH terms, such as hyperlinked MeSH terms, an explode function that automatically includes all narrower terms, and automatic query expansion (see e.g. [13, 18]), all available on other online search systems. The use of those features would have invalidated the results by introducing other variables at the level of the search interface and query processing, although a full integration of those system features would have increased the usefulness of MeSH terms.
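Ranking itself was handled by Greenstone's MGPP indexer; the Python sketch below only illustrates the general TF×IDF idea with a common log-scaled weighting, not MGPP's exact formula. The toy documents and query are hypothetical.

```python
import math
from collections import Counter

def rank_by_tfidf(query_terms, documents):
    """documents: {doc_id: list of tokens}. Returns doc ids sorted by a simple TF x IDF score."""
    n_docs = len(documents)
    # document frequency of each query term
    df = {t: sum(1 for toks in documents.values() if t in toks) for t in query_terms}
    scores = {}
    for doc_id, tokens in documents.items():
        tf = Counter(tokens)
        score = 0.0
        for t in query_terms:
            if tf[t] and df[t]:
                # log-scaled term frequency times inverse document frequency
                score += (1 + math.log(tf[t])) * math.log(n_docs / df[t])
        scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: rank three hypothetical abstracts for the query "hypertension genes"
docs = {
    "pmid1": "genes associated with hypertension risk".split(),
    "pmid2": "compression algorithms for text indexing".split(),
    "pmid3": "candidate genes for stroke and hypertension in humans".split(),
}
print(rank_by_tfidf(["hypertension", "genes"], docs))
```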
2.6 Documents
The experimental system was set up on a server, using bibliographic records from the TREC Genomics Track 2004 document set [26]. This document collection was a 10-year (1994 to 2003) subset of MEDLINE with a total of 4,591,108 records. The test collection subset fed into the system comprised 75.0% of the whole collection, a total of 3,442,321 records, after excluding records without MeSH terms or abstracts.

We prepared two sets of documents for setting up the experimental system: the MeSH+ and MeSH− versions. One interface allowed users to use MeSH terms; the other did not provide this search option. The difference was also reflected in the retrieved bibliographic records.
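As an illustration of the preprocessing step just described, the sketch below filters out records lacking MeSH terms or abstracts and derives the two document versions. It assumes the records have already been parsed into dictionaries with 'pmid', 'abstract' and 'mesh_terms' keys; the parsing of the raw MEDLINE data is omitted, and these key names are assumptions.

```python
from typing import Dict, Iterable, List, Tuple

def build_document_sets(records: Iterable[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Drop records lacking an abstract or MeSH terms, then derive the two versions:
    MeSH+ records keep the MeSH headings, MeSH- records expose only the abstract."""
    mesh_plus, mesh_minus = [], []
    for rec in records:
        if not rec.get("abstract") or not rec.get("mesh_terms"):
            continue  # excluded from the 75% subset used in the study
        mesh_minus.append({"pmid": rec["pmid"], "abstract": rec["abstract"]})
        mesh_plus.append({"pmid": rec["pmid"], "abstract": rec["abstract"],
                          "mesh_terms": list(rec["mesh_terms"])})
    return mesh_plus, mesh_minus
```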
2.7 Search topics
The search topics used in this study were originally created for TREC Genomics Track 2004 for the purpose of evaluating the search effectiveness of different retrieval techniques (see Figure 2 for an example). They covered a range of genomics topics typically asked by biomedical researchers. Besides a unique ID number, each topic was constructed in a format that included title, need and context fields. The title field was a short query. The need field was a short description of the kind of material the biologists were interested in, whereas the context field provided background information for judging the relevance of documents. The need and context fields were designed to provide more possible search terms for system experimentation purposes.

ID: 39
Title: Hypertension
Need: Identify genes as potential genetic risk factors candidates for causing hypertension.
Context: A relevant document is one which discusses genes that could be considered as candidates to test in a randomized controlled trial which studies the genetic risk factors for stroke.

Figure 2. Sample search topic

Because of the technical nature of genomics topics, we wondered whether the search topics could be understood by human searchers, particularly those without advanced training in the biomedical field. TREC search topics were designed for machine runs with little or no consideration of searches by real users. We selected 20 of the 50 topics using the following procedure:
1. Consulting an experienced professional searcher with a biology background and a graduate student in neuroscience to help judge whether the topics would be comprehensible to participants who were not domain experts. Topics that used advanced technical vocabulary, such as specific genes, pathways and mechanisms, were excluded;
2. Ensuring that major concepts in the search topics could be mapped to MeSH by searching the MeSH Browser. For instance, topic 39 could be mapped to the MeSH preferred terms hypertension and risk factors;
3. Eliminating topics with very low MAP (mean average precision) and P10 (precision at top 10 documents) scores in the relevance judgment set, because these topics would be too difficult.
The selected topics were then randomly ordered to create ten search topic pairs for the experimental conditions (see Figure 1 for the search topic pairs).
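The topic fields described above can be read into a simple structure as sketched below. The sketch assumes the topics file is plain XML with TOPIC elements containing ID, TITLE, NEED and CONTEXT children; the element names mirror the fields above, but the exact file layout and the filename are assumptions.

```python
import xml.etree.ElementTree as ET

def parse_topics(path):
    """Return {topic_id: {'title': ..., 'need': ..., 'context': ...}}."""
    topics = {}
    root = ET.parse(path).getroot()
    for topic in root.iter("TOPIC"):
        tid = (topic.findtext("ID") or "").strip()
        topics[tid] = {
            "title": (topic.findtext("TITLE") or "").strip(),
            "need": (topic.findtext("NEED") or "").strip(),
            "context": (topic.findtext("CONTEXT") or "").strip(),
        }
    return topics

# Hypothetical usage:
# topics = parse_topics("genomics2004.topics.xml")
# topics["39"]["title"]  -> "Hypertension"
```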
2.8 Reliability of relevance judgment sets
We measured search outcome using standard precision and recall measures for accuracy, and time spent for user effort [6], because we were concerned with the effect of MeSH terms on search effectiveness as measured against the TREC assessments [12].

Theoretically speaking, the calculation of the recall measure requires relevance judgments for the whole test collection. However, it is almost impossible to obtain these judgments for a test collection with more than 3 million documents. For practical reasons, the recall measure relied on a pooling method that created a set of unique documents from the top 75 documents submitted by the 27 groups that participated in the TREC 2004 Genomics Track ad hoc task [26]. Empirical evidence has shown that recall calculated with a pooling method provides a reasonable approximation, although recall is likely to be overestimated [33]. This approach yielded an average pool of 976 judged documents per topic, with a range of 476-1450 [12].

It was quite likely that some of the participants in our experiment would retrieve documents that had not been judged. The existence of un-judged relevant documents, called sampling bias in the pooling method, depends on the pool depth and the diversity of retrieval methods, and may affect the reliability of the relevance judgment set [2]. The assumption that the pooled judgment set is a reasonable approximation of the complete relevance judgment set may become invalid when the test collection is very large.

To ensure that the TREC pooled relevance judgment set was sufficiently complete and valid for the current study, we analyzed the top 10 retrieved documents from each human run (32 searchers × 8 topics = 256 runs). Cross-tabulation results showed that about one-third of all documents retrieved in our study had not been judged in the TREC data set. More specifically, of a total of 2,277 analyzed documents, 762 (33.5%) had not been assigned relevance judgments. There were large variations in the percentage of un-judged documents per search topic, with a range of 0-59.3%.

To assess the impact of incomplete relevance judgments, we compared the top 10 ranked search results between the judged document set and the pooled document set for each topic. The judged document set was composed of the documents that matched the TREC data, i.e., the combination of judged not relevant and judged relevant. The un-judged documents, added to the pooled document set, were considered 'not relevant' in our calculations of search outcome. We used the precision-oriented measures MAP (mean average precision), P10 (precision at top 10 documents) and P100 (precision at top 100 documents) to estimate the impact of incomplete judgments.

The paired t-test results by search topic revealed significant differences between the two sets in terms of the MAP (t(19) = -3.69, p < .01), P10 (t(19) = -3.89, p < .001) and P100 (t(19) = -3.95, p < .001) measures. The mean of the differences for MAP, P10 and P100 was approximately 2.7%, 9.9% and 4.9% respectively. We concluded that the TREC relevance judgments are applicable to this study.
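The comparison above can be sketched as follows: score each topic's ranked results once against the judged documents only and once against the pooled set in which unjudged documents count as not relevant, then run a paired t-test over topics. The data structures are hypothetical and the sketch uses P10 only; it is one reading of the procedure, not the exact code used in the study.

```python
from scipy import stats

def precision_at_k(ranked_doc_ids, relevant_doc_ids, k=10):
    """Fraction of the top-k documents that are judged relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant_doc_ids) / k

def compare_judgment_sets(results, relevant, judged, k=10):
    """results: {topic: ranked doc ids}; relevant: {topic: judged-relevant ids};
    judged: {topic: all judged ids, relevant or not}.
    'Judged only' drops unjudged documents before scoring; 'pooled' keeps them,
    implicitly treating them as not relevant."""
    judged_scores, pooled_scores = [], []
    for topic, ranked in results.items():
        judged_only = [d for d in ranked if d in judged[topic]]
        judged_scores.append(precision_at_k(judged_only, relevant[topic], k))
        pooled_scores.append(precision_at_k(ranked, relevant[topic], k))
    t_stat, p_value = stats.ttest_rel(judged_scores, pooled_scores)
    return t_stat, p_value
```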
2.9 Limitations of the design
This study was designed to assess the impact of MeSH terms on search effectiveness in an interactive search environment. One limitation of the design was that participants were a self-selected group of searchers who may not be representative of the population. The interaction of selection biases with the experimental variable, i.e., the displayed MeSH terms, was another possible factor limiting the generalizability of this study [4]. The use of relatively technical and difficult search topics in the interactive search environment posed a threat to external validity, since those topics might not represent the typical topics received by medical librarians in practice.

The internal validity of this design was enhanced by specifically considering several aspects. We devised an incentive system to address the possible sampling bias arising from searchers' motivational characteristics in experimental settings. Besides levels of education, participants' domain knowledge was evaluated by a topic understanding test. The variability of search topics was alleviated by using a relatively large number of search topics through the experimental design. Selected search topics were checked for intelligibility in consultation with a domain expert and a medical librarian. A concept analysis form was used to help searchers recognize potentially useful terms. The reliability of the relevance judgment sets was ensured by additional analysis of the top 10 search results from our human searchers.

3. DISCUSSION AND CONCLUSION
The Cranfield paradigm has been very useful for comparing the search effectiveness of different retrieval techniques at a level of abstraction that simulates user search performance. Putting users in the loop of IR experiments is particularly challenging because it is difficult to separate the effects of systems, searchers and topics, and search topics have had dominating effects [17]. To alleviate search topic variability in interactive IR experiments, we consider how to increase the topic set size by experimental design within the laboratory IR framework.

This study has demonstrated that a total of 20 search topics can be used in an interactive experiment through a Graeco-Latin square balanced design and the use of different sets of carefully selected topics. We assume that the selected topics are equally difficult, since we do not have a good theory of search topics that can directly control topic difficulty for evaluation purposes. Recent attempts to use reduced topic sets and non-matched topics (see e.g. [5, 10]) indirectly support our experimental design considerations of search topic variability and topic difficulty. However, an important theoretical question remains: how can we better control topic effects in batch and user IR experiments?

4. ACKNOWLEDGMENTS
This study was funded by NSF grant #0414557, PIs Michael Lesk and Nina Wacholder. We thank the anonymous reviewers for their constructive comments.

5. REFERENCES
[1] Banks, D., Over, P. and Zhang, N.-F. 1999. Blind men and elephants: Six approaches to TREC data. Inform Retrieval, 1, 1/2 (April 1999), 7-34. DOI=http://dx.doi.org/10.1023/A:1009984519381
[2] Buckley, C., Dimmick, D., Soboroff, I. and Voorhees, E. 2007. Bias and the limits of pooling for large collections. Inform Retrieval, 10, 6 (December 2007), 491-508. DOI=http://dx.doi.org/10.1007/s10791-007-9032-x
[3] Buckley, C. and Voorhees, E. M. 2005. Retrieval system evaluation. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 53-75.
[4] Campbell, D. T., Stanley, J. C. and Gage, N. L. 1966. Experimental and Quasi-Experimental Designs for Research. R. McNally, Chicago.
[5] Cattelan, M. and Mizzaro, S. 2009. IR evaluation without a common set of topics. In Proceedings of the 2nd International Conference on the Theory of Information Retrieval (Cambridge, UK, September 10-12, 2009). ICTIR 2009. Springer, Berlin, 342-345. DOI=http://dx.doi.org/10.1007/978-3-642-04417-5_35
[6] Cleverdon, C. W. 1967. The Cranfield tests on index language devices. Aslib Proc, 19, 6 (1967), 173-193. DOI=http://dx.doi.org/10.1108/eb050097
[7] Dumais, S. T. and Belkin, N. J. 2005. The TREC Interactive Track: Putting the user into search. In Voorhees, E. M. and Harman, D. K. (Eds.), TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, Cambridge, MA, 123-152.
[8] Fisher, R. A. 1935. The Design of Experiments. Oliver and Boyd, Edinburgh.
[9] Greenstone Digital Library Software (Version 2.70). 2006. Department of Computer Science, The University of Waikato, New Zealand. Available at: http://prdownloads.sourceforge.net/greenstone/gsdl-2.70-export.zip
[10] Guiver, J., Mizzaro, S. and Robertson, S. 2009. A few good topics: Experiments in topic set reduction for retrieval evaluation. ACM Trans. Inf. Syst., 27, 4 (November 2009), 1-26. DOI=http://doi.acm.org/10.1145/1629096.1629099
[11] Harman, D. and Buckley, C. 2009. Overview of the Reliable Information Access Workshop. Inform Retrieval, 12, 6 (December 2009), 615-641. DOI=http://dx.doi.org/10.1007/s10791-009-9101-4
[12] Hersh, W., Bhupatiraju, R., Ross, L., Roberts, P., Cohen, A. and Kraemer, D. 2006. Enhancing access to the Bibliome: The TREC 2004 Genomics Track. Journal of Biomedical Discovery and Collaboration, 1, 3 (March 2006). DOI=http://dx.doi.org/10.1186/1747-5333-1-3
[13] Hersh, W. R. 2008. Information Retrieval: A Health and Biomedical Perspective. Springer, New York.
[14] Ingwersen, P. and Järvelin, K. 2005. The Turn: Integration of Information Seeking and Retrieval in Context. Springer, Dordrecht.
[15] Ingwersen, P. and Järvelin, K. 2007. On the holistic cognitive theory for information retrieval. In Proceedings of the First International Conference on the Theory of Information Retrieval (ICTIR) (Budapest, Hungary, 2007). Foundation for Information Society.
[16] Kirk, R. E. 1995. Experimental Design: Procedures for the Behavioral Sciences. Brooks/Cole, Pacific Grove, CA.
[17] Lagergren, E. and Over, P. 1998. Comparing interactive information retrieval systems across sites: The TREC-6 interactive track matrix experiment. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 164-172. DOI=http://doi.acm.org/10.1145/290941.290986
[18] Lu, Z., Kim, W. and Wilbur, W. 2009. Evaluation of query expansion using MeSH in PubMed. Inform Retrieval, 12, 1 (February 2009), 69-80. DOI=http://dx.doi.org/10.1007/s10791-008-9074-8
[19] MeSH Browser (2003 MeSH). 2004. U.S. National Library of Medicine. Available at: http://www.nlm.nih.gov/mesh/2003/MBrowser.html
[20] Robertson, S. E. 1981. The methodology of information retrieval experiment. In Sparck Jones, K. (Ed.), Information Retrieval Experiment, Butterworth, London, 9-31.
[21] Robertson, S. E. 1990. On sample sizes for non-matched-pair IR experiments. Inform Process Manag, 26, 6 (1990), 739-753. DOI=http://dx.doi.org/10.1016/0306-4573(90)90049-8
[22] Robertson, S. E., Thompson, C. L. and Macaskill, M. J. 1986. Weighting, ranking and relevance feedback in a front-end system. Journal of Information and Image Management, 12, 1/2 (January 1986), 71-75. DOI=http://dx.doi.org/10.1177/016555158601200112
[23] Sharp, E. C., Pelletier, L. G. and Levesque, C. 2006. The double-edged sword of rewards for participation in psychology experiments. Can J Beh Sci, 38, 3 (July 2006), 269-277. DOI=http://dx.doi.org/10.1037/cjbs2006014
[24] Sparck Jones, K. and van Rijsbergen, C. J. 1976. Information retrieval test collections. J Doc, 32, 1 (March 1976), 59-75. DOI=http://dx.doi.org/10.1108/eb026616
[25] Tague-Sutcliffe, J. 1992. The pragmatics of information retrieval experimentation, revisited. Inform Process Manag, 28, 4 (1992), 467-490. DOI=http://dx.doi.org/10.1016/0306-4573(92)90005-K
[26] TREC 2004 Genomics Track document set data file. 2005. Available at: http://ir.ohsu.edu/genomics/data/2004/
[27] van Rijsbergen, C. J. 1979. Information Retrieval. Butterworths, London.
[28] Voorhees, E. M. 2005. The TREC robust retrieval track. SIGIR Forum, 39, 1 (June 2005), 11-20. DOI=http://doi.acm.org/10.1145/1067268.1067272
[29] Voorhees, E. M. 2008. On test collections for adaptive information retrieval. Inform Process Manag, 44, 6 (November 2008), 1879-1885. DOI=http://dx.doi.org/10.1016/j.ipm.2007.12.011
[30] Voorhees, E. M. and Harman, D. K. 2005. TREC: Experiment and Evaluation in Information Retrieval. The MIT Press, Cambridge, MA.
[31] Witten, I. H. and Bainbridge, D. 2007. A retrospective look at Greenstone: Lessons from the first decade. In Proceedings of the 7th ACM/IEEE-CS Joint Conference on Digital Libraries (Vancouver, Canada, June 18-23, 2007). JCDL '07. ACM Press, New York, NY, 147-156. DOI=http://doi.acm.org/10.1145/1255175.1255204
[32] Witten, I. H., Moffat, A. and Bell, T. C. 1999. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco.
[33] Zobel, J. 1998. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Melbourne, Australia, 1998). SIGIR '98. ACM Press, New York, NY, 307-314. DOI=http://doi.acm.org/10.1145/290941.291014