=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-all-AgostiEt2006
|storemode=property
|title=Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them?
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-all-AgostiEt2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/AgostiNF06a
}}
==Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them?==
Maristella Agosti, Giorgio Maria Di Nunzio, and Nicola Ferro

Department of Information Engineering – University of Padua
Via Gradenigo, 6/b – 35131 Padova – Italy
{agosti, dinunzio, ferro}@dei.unipd.it

Abstract. This paper examines the current way of keeping the data produced during evaluation campaigns and highlights some of its shortcomings. As a consequence, we propose a new approach for improving the management of evaluation campaigns' data. In this approach, the data are considered as scientific data to be curated and enriched in order to give full support to longitudinal statistical studies and long-term preservation.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.4 [Systems and Software]: Performance evaluation.

General Terms: Experimentation, Performance, Measurement, Algorithms.

Additional Keywords and Phrases: Multilingual Information Access, Cross-Language Information Retrieval, Scientific Data, Data Curation, In-depth evaluation studies.

1 Introduction

The experimental evaluation of an Information Retrieval System (IRS) is a scientific activity whose outcomes are different kinds of scientific data:

– experiments: the primary scientific data produced by the participants of an evaluation campaign;
– performance measurements: the metrics, such as precision and recall [1], used to evaluate the performance of an IRS in a given experiment;
– descriptive statistics: the statistics, such as mean or median, used to summarize the overall performances achieved by an experiment or by the collection of experiments of a track;
– hypothesis tests and other statistical analyses: the different statistical techniques used for performing an in-depth analysis of the experiments.

When we deal with scientific data, "the lineage (provenance) of the data must be tracked, since a scientist needs to know where the data came from [. . . ] and what cleaning, rescaling, or modelling was done to arrive at the data to be interpreted" [2]. In addition, [3] points out how provenance is "important in judging the quality and applicability of information for a given use and for determining when changes at sources require revising derived information". Furthermore, when scientific data are maintained for further and future use, they should be enriched and, besides information about provenance, the changes at sources that have occurred over time also need to be tracked. Sometimes the enrichment of a portion of scientific data can make use of a citation for explicitly mentioning and making reference to useful information.

In this paper we examine whether the current methodology properly deals with the data produced during an evaluation campaign, by recognizing that they are in effect valuable scientific data. Furthermore, we describe the data curation approach [4] which we have undertaken to overcome some of the shortcomings of the current methodology, and which we have applied in designing and developing the infrastructure for the Cross-Language Evaluation Forum (CLEF).

The paper is organized as follows: Section 2 introduces the motivations and the objectives of our research work; Section 3 describes the work carried out in developing the CLEF infrastructure; finally, Section 4 draws some conclusions.
2 Motivations and Objectives

2.1 Experimental Collections

Nowadays, the experimental evaluation of an IRS is carried out in important international evaluation forums which bring research groups together and provide them with the means for measuring the performances of their systems and for discussing and comparing their work. The Text REtrieval Conference (TREC, http://trec.nist.gov/) was the first initiative in this field and has laid the groundwork for the subsequent initiatives; TREC developed a common evaluation procedure in order to compare IRSs by measuring the effectiveness of different techniques, and to discuss how differences between systems affect performances. After TREC, other important international initiatives have been launched, in particular the Cross-Language Evaluation Forum (CLEF), the NII-NACSIS Test Collection for IR Systems (NTCIR) and the INitiative for the Evaluation of XML Retrieval (INEX). CLEF (http://clef.isti.cnr.it/) aims at evaluating Cross-Language Information Retrieval (CLIR) systems that operate on European languages in both monolingual and cross-lingual contexts. NTCIR (http://research.nii.ac.jp/ntcir/index-en.html) is the Asian counterpart of CLEF, where the traditional Chinese, Korean, Japanese, and English languages are the basis of the evaluation of cross-lingual tasks. INEX (http://inex.is.informatik.uni-duisburg.de/) provides participants with evaluation procedures for content-oriented eXtensible Markup Language (XML) retrieval in order to measure the effectiveness of IRSs that manage XML documents. These evaluation forums are usually further organized into tracks, which investigate different facets of the evaluation of the information access components of a Digital Library Management System (DLMS).

All of the previously mentioned initiatives are generally carried out according to the Cranfield methodology, which makes use of experimental collections [5]. An experimental collection is a triple C = (D, Q, J), where: D is a set of documents, also called the collection of documents; Q is a set of topics, from which the actual queries are derived; J is a set of relevance judgements, i.e. for each topic q ∈ Q the documents d ∈ D which are relevant for the topic q are determined. An experimental collection C allows the comparison of two retrieval methods, say X and Y, according to some measurements which quantify the retrieval performances of these methods. An experimental collection both provides a common test-bed to be indexed and searched by the IRSs X and Y and guarantees the possibility of replicating the experiments.

Nevertheless, the Cranfield methodology is mainly focused on creating comparable experiments and evaluating the performances of an IRS rather than on modeling and managing the scientific data produced during an evaluation campaign. As an example, note that the exchange of information between organizers and participants is mainly performed by means of textual files formatted according to the TREC data format, which is the de-facto standard in this field. Note that this information represents a first kind of scientific data produced during the evaluation process. The following is a fragment of the results of an experiment submitted by a participant to the organizers, where the header row is not really present in the exchanged data but serves here as an explanation of the fields:

  Topic  Iter.  Document         Rank  Score           Experiment
  141    Q0     AGZ.950609.0067  0     0.440873414278  IMSMIPO
  141    Q0     AGZ.950613.0165  1     0.305291658641  IMSMIPO
  ...
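To make the structure of such a record concrete, the following is a minimal sketch of how one whitespace-separated line of this format could be parsed; the class and field names are illustrative assumptions of this sketch and are not part of the TREC data format itself.

```java
/** Hypothetical holder for one record of a TREC-format result file (names are illustrative). */
public record RunRecord(String topic, String iteration, String document,
                        int rank, double score, String experiment) {

    /** Parses one whitespace-separated line, e.g.
     *  "141 Q0 AGZ.950609.0067 0 0.440873414278 IMSMIPO". */
    public static RunRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        return new RunRecord(f[0], f[1], f[2],
                Integer.parseInt(f[3]), Double.parseDouble(f[4]), f[5]);
    }

    public static void main(String[] args) {
        RunRecord r = parse("141 Q0 AGZ.950609.0067 0 0.440873414278 IMSMIPO");
        System.out.println(r.document() + " retrieved at rank " + r.rank());
    }
}
```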
As a further example, the following is a fragment of the relevance judgements sent back by the organizers to the participants:

  Topic  Iter.  Document         Relevant
  141    0      AGZ.950606.0013  0
  141    0      AGZ.950609.0067  1
  141    0      AGZ.950613.0165  0
  ...

In the above data, each row represents a record of an experiment or of a relevance judgement, where fields are separated by white spaces. There is a field which specifies the unique identifier of the topic (e.g. 141), a field for the unique identifier of the document (e.g. AGZ.950609.0067), a field which identifies the experiment (e.g. IMSMIPO), a field which specifies whether a document is relevant for a topic (e.g. 1) or not (e.g. 0), and so on, as specified by the header rows.

As can be noted from the above examples, this format is suitable for a simple data exchange between participants and organizers. Nevertheless, this format neither provides any metadata explaining its content, nor does a scheme exist to define the structure of each file, the data type of each field, and the various constraints on the data, such as the numeric floating point precision. Moreover, this format is not very suitable for modelling the information space involved in an evaluation forum, because the relationships among the different entities (documents, topics, experiments, participants) are not modeled and each entity is treated separately from the others.

Furthermore, the present way of keeping collections over time does not permit systematic studies of the improvements achieved by participants over the years, for example in a specific multilingual setting [6].

We argue that the information space implied by an evaluation forum needs an appropriate conceptual model which takes into consideration and describes all the entities involved in the evaluation forum. In fact, an appropriate conceptual model is the necessary basis for making the scientific data produced during the evaluation an active part of all those information enrichments, such as data provenance and citation, which we have described in the previous section. This conceptual model can also be translated into an appropriate logical model in order to manage the information of an evaluation forum by using, for example, database technology. Finally, from this conceptual model we can also derive appropriate data formats for exchanging information among organizers and participants, such as an XML format that complies with an XML Schema [7,8] which describes and constrains the exchanged information.

2.2 Statistical Analysis of Experiments

The Cranfield methodology is mainly focused on how to evaluate the performances of two systems and how to provide a common ground which makes the experimental results comparable. [9] points out that, in order to evaluate retrieval performances, we need not only an experimental collection and measures for quantifying retrieval performances, but also a statistical methodology for judging whether measured differences between retrieval methods X and Y can be considered statistically significant.

To address this issue, evaluation forums have traditionally carried out statistical analyses which provide participants with an overview analysis of the submitted experiments, as in the case of the overview papers of the different tracks at TREC and CLEF; some recent examples of this kind of papers are [10,11].
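As an illustration of the kind of statistical methodology advocated in [9], the following is a minimal sketch of a two-sided paired t-test over per-topic average precision values of two retrieval methods X and Y. The use of the Apache Commons Math library and the sample values are assumptions of this sketch, not tools or data prescribed by the evaluation forums.

```java
import org.apache.commons.math3.stat.inference.TTest;

public class SignificanceSketch {
    public static void main(String[] args) {
        // Hypothetical per-topic average precision values for two retrieval methods X and Y.
        double[] apX = {0.42, 0.31, 0.55, 0.48, 0.20, 0.39, 0.61, 0.27};
        double[] apY = {0.45, 0.29, 0.58, 0.50, 0.22, 0.44, 0.60, 0.33};

        // Two-sided paired t-test: p-value for the null hypothesis of equal mean performance.
        double p = new TTest().pairedTTest(apX, apY);
        System.out.printf("paired t-test p-value = %.4f%n", p);
        System.out.println(p < 0.05
                ? "difference is statistically significant at alpha = 0.05"
                : "difference is not statistically significant at alpha = 0.05");
    }
}
```

The same computation could be carried out with any of the statistical packages mentioned below; the point is that the test, its parameters, and the significance level have to be fixed uniformly for the analyses to be comparable.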
Furthermore, participants may conduct statistical analyses on their own experiments by using either ad-hoc packages, such as IR-STAT-PAK (http://users.cs.dal.ca/~jamie/pubs/IRSP-overview.html), or generally available software tools with statistical analysis capabilities, such as R (http://www.r-project.org/), SPSS (http://www.spss.com/), or MATLAB (http://www.mathworks.com/). However, the choice of whether or not to perform a statistical analysis is left up to each participant, who may not have all the skills and resources needed to perform such analyses. Moreover, when participants perform statistical analyses using their own tools, the comparability among these analyses cannot be fully guaranteed because, for example, different statistical tests can be employed to analyze the data, or different choices and approximations for the various parameters of the same statistical test can be made.

Thus, we can observe that, in general, there is limited support for the systematic employment of statistical analysis by participants. For this reason, we suggest that evaluation forums should support and guide participants in adopting a more uniform way of performing statistical analyses on their own experiments. In this way, participants can not only benefit from standard experimental collections which make their experiments comparable, but they can also exploit standard tools for the analysis of the experimental results, which make the analysis and assessment of their experiments comparable too.

2.3 Information Enrichment and Interpretation

As introduced in Section 1, scientific data, their enrichment and their interpretation are essential components of scientific research. The Cranfield methodology traces out how these scientific data have to be produced, while the statistical analysis of experiments provides the means for further elaborating and interpreting the experimental results. Nevertheless, the current methodologies do not require any particular coordination or synchronization between the basic scientific data and the analyses performed on them, which are treated as almost separate items. On the contrary, researchers would greatly benefit from an integrated vision of them, where the access to a scientific data item could also offer the possibility of retrieving all the analyses and interpretations of it. Furthermore, it should be possible to enrich the basic scientific data in an incremental way, progressively adding further analyses and interpretations to them.
Let us consider what is currently done in an evaluation forum:

– Experimental collections:
  • there are few or no metadata about document collections, the context they refer to, how they have been created, and so on;
  • there are few or no metadata about topics, how they have been created, the problems encountered by their creators, what documents the creators found relevant for a given topic, and so on;
  • there are few or no metadata about how pools have been created and about the relevance assessments, or about the problems faced by the assessors when dealing with difficult topics;
– Experiments:
  • there are few or no metadata about them, such as what techniques have been adopted or what tunings have been carried out;
  • they may not be publicly accessible, making it difficult for other researchers to make a direct comparison with their own experiments;
  • their citation can be an issue;
– Performance measurements:
  • there are no metadata about how a measure has been created, which software has been used to compute it, and so on;
  • often only summaries are publicly available, while the detailed measurements may not be accessible;
  • their format may not be suitable for further computer processing;
  • their modelling and management needs to be dealt with;
– Descriptive statistics and hypothesis tests:
  • they are mainly limited to the task overviews produced by organizers;
  • participants may not have all the skills needed to perform a statistical analysis;
  • participants can carry out statistical analyses only on their own experiments, without the possibility of comparing them with the experiments of other participants;
  • the comparability among the statistical analyses conducted by the participants is not fully guaranteed, due to possible differences in the design of the statistical experiments.

These issues are better faced and framed in the wider context of the curation of scientific data, which plays an important role in the systematic definition of a proper methodology to manage and promote the use of data. The e-Science Data Curation Report gives the following definition of data curation [12]: "the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose".

This definition implies that we have to take into consideration the possibility of information enrichment of scientific data, meant as archiving and preserving scientific data so that experiments, records, and observations will be available for future research, as well as the provenance, curation, and citation of scientific data items. The benefits of this approach include the growing involvement of scientists in international research projects and forums and the increased interest in comparative research activities.
Furthermore, the definition introduced above reflects some of the many reasons why keeping data is important, for example: re-use of data for new research, including collection-based research to generate new science; retention of unique observational data which is impossible to re-create; retention of expensively generated data which is cheaper to maintain than to re-generate; enhancement of existing data available for research projects; validation of published research results.

As a concrete example in the field of information retrieval, consider the data fusion problem [13], where lists of results produced by different systems have to be merged into a single list. In this context, researchers do not start from scratch, but often experiment with their merging algorithms by using the lists of results produced in experiments carried out by other researchers. This is the case, for example, of the CLEF 2005 multilingual merging track [10], which provided participants with some of the CLEF 2003 multilingual experiments as lists of results to be used as input to their merging algorithms. It is now clear that researchers in this field would benefit from a clear data curation strategy, which promotes the re-use of existing data and allows the data fusion experiments to be traced back to the original lists of results and, perhaps, to the analyses and interpretations about them.

Thus, we consider all these points as requirements that should be taken into account when we produce and manage the scientific data that come out of the experimental evaluation of an IRS. In addition, to achieve the full information enrichment discussed in Section 1, both the experimental datasets and their further elaborations, such as their statistical analyses, should be first-class objects that can be directly referenced and cited. Indeed, as recognized by [12], the possibility of citing scientific data and their further elaborations is an effective way of making scientists and researchers an active part of the digital curation process.

3 The CLEF Infrastructure

3.1 Conceptual Model for an Evaluation Forum

As discussed in the previous section, we need to design and develop a proper conceptual model of the information space involved in an evaluation forum. Indeed, this conceptual model provides us with the basis needed to offer all the information enrichment and interpretation features described above. Figure 1 shows the Entity–Relationship (ER) schema which represents the conceptual model we have developed.
The conceptual model is built around five main areas of modelling:

– evaluation forum: deals with the different aspects of an evaluation forum, such as the conducted evaluation campaigns and the different editions of each campaign, the tracks along which a campaign is organized, the subscription of the participants to the tracks, and the topics of each track;
– collection: concerns the different collections made available by an evaluation forum; each collection can be organized into various files and each file may contain one or more multimedia documents; the same collection can be used by different tracks and by different editions of the evaluation campaign;
– experiments: regards the experiments submitted by the participants and the evaluation metrics computed on those experiments, such as precision and recall;
– pool/relevance assessment: is about the pooling method [14], where a set of experiments is pooled and the documents retrieved in those experiments are assessed with respect to the topics of the track the experiments belong to;
– statistical analysis: models the different aspects concerning the statistical analysis of the experimental results, such as the type of statistical test employed, its parameters, the observed test statistic, and so forth.

Fig. 1. Conceptual model for the information space of an evaluation forum.

3.2 Architecture of the Service

Figure 2 shows the architecture of the proposed service. It consists of three layers – data, application and interface logic layers – in order to achieve better modularity and to properly describe the behavior of the service by isolating specific functionalities at the proper layer. In this way, the behavior of the system is designed in a modular and extensible way. In the following, we briefly describe the architecture shown in Figure 2, from bottom to top.

Data Logic. The data logic layer deals with the persistence of the different information objects coming from the upper layers. There is a set of "storing managers" dedicated to storing the submitted experiments, the relevance assessments, and so on.
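As a purely illustrative sketch, one of these storing managers could be exposed to the upper layers through an interface along the following lines; the class, field, and method names are hypothetical assumptions of this sketch and are not taken from the actual DIRECT code.

```java
import java.util.List;

/** Hypothetical value object describing a submitted experiment (illustrative only). */
class Experiment {
    String id;            // unique identifier of the experiment, e.g. "IMSMIPO"
    String trackId;       // track the experiment was submitted to
    String participantId; // group that submitted the experiment
    String submittedFile; // location of the submitted run file
}

/** Hypothetical interface of the storing manager devoted to experiments. */
interface ExperimentStoringManager {
    void store(Experiment experiment);            // persist a newly submitted experiment
    Experiment findById(String experimentId);     // retrieve a single experiment
    List<Experiment> findByTrack(String trackId); // all experiments of a given track
}
```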
We adopt the Data Access Object (DAO) and the Transfer Object (TO) design patterns (http://java.sun.com/blueprints/corej2eepatterns/Patterns/). The DAO implements the access mechanism required to work with the underlying data source, acting as an adapter between the upper layers and the data source. If the underlying data source implementation changes, this pattern allows the DAO to adapt to different storage schemes without affecting the upper layers.

Fig. 2. Architecture of a service for supporting the evaluation of the information access components of a DLMS.

In addition to the other storing managers, there is the log storing manager, which finely traces both system and user events. It captures information such as the user name, the Internet Protocol (IP) address of the connecting host, the action that has been invoked by the user, the messages exchanged among the components of the system in order to carry out the requested action, any error condition, and so on. Thus, besides offering us a log of the system and user activities, the log storing manager allows us to finely trace the provenance of each piece of data, from its entrance into the system to every further processing on it.

Finally, on top of the various "storing managers" there is the Storing Abstraction Layer (SAL), which hides the details of the storage management from the upper layers. In this way, the addition of a new "storing manager" is totally transparent for the upper layers.

Application Logic. The application logic layer deals with the flow of operations within the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT). It provides a set of tools capable of managing high-level tasks, such as experiment submission, pool assessment, and the statistical analysis of an experiment. For example, the Statistical Analysis Management Tool (SAMT) offers the functionalities needed to conduct a statistical analysis on a set of experiments. In order to ensure comparability and reliability, the SAMT makes use of well-known and widely used tools to implement the statistical tests, so that everyone can replicate the same test, even without access to the service. In the architecture, the MATLAB Statistics Toolbox (http://www.mathworks.com/products/statistics/) has been adopted, since MATLAB is a leading application in the field of numerical analysis which employs state-of-the-art algorithms, but other software could have been used as well. In the case of MATLAB, an additional library is needed to allow our service to access MATLAB in a programmatic way; other software could require different solutions. As an additional example aimed at wide comparability and acceptance of the tools, a further library provides an interface for our service towards the trec_eval package (ftp://ftp.cs.cornell.edu/pub/smart/), which was first developed and adopted by TREC and represents the standard tool for computing the basic performance figures, such as precision and recall.
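As a sketch of how a bridge towards trec_eval could be realized, the following launches an external trec_eval binary on a relevance-judgement file and a run file and captures its textual output. The class name, the assumption that the binary is available on the local path, and the absence of command-line options are assumptions of this sketch; it is not a description of the actual Java-Treceval engine used in the architecture.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/** Hypothetical wrapper that invokes an external trec_eval binary (illustrative only). */
public class TrecEvalSketch {
    public static String evaluate(String qrelsPath, String runPath)
            throws IOException, InterruptedException {
        // Assumes a trec_eval executable is reachable on the PATH.
        Process p = new ProcessBuilder("trec_eval", qrelsPath, runPath)
                .redirectErrorStream(true)
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        p.waitFor();
        return output.toString(); // raw measure/value lines, to be parsed and stored
    }
}
```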
Finally, the Service Integration Layer (SIL) provides the interface logic layer with uniform and integrated access to the various tools. As we noticed in the case of the SAL, thanks to the SIL the addition of new tools is also transparent for the interface logic layer.

Interface Logic. This is the highest level of the architecture, and it is the access point for the user to interact with the system. It provides specialised User Interfaces (UIs) for the different types of users, that is the participants, the assessors, and the administrators. Note that, thanks to the abstraction provided by the application logic layer, different kinds of UIs can be provided, either stand-alone applications or Web-based applications.

3.3 DIRECT: the Running Prototype

The proposed service has been implemented in a first prototype, called the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) [15,16,17], and it has been tested in the context of the CLEF 2005 and 2006 evaluation campaigns. The initial prototype takes a first step in the direction of an information access evaluation service for scientific reflection DLMSs, by providing support for:

– the management of an evaluation forum: the track set-up, the harvesting of documents, and the management of the subscription of participants to tracks;
– the management of the submission of experiments, the collection of metadata about experiments, and their validation;
– the creation of document pools and the management of the relevance assessment;
– common statistical analysis tools for both organizers and participants in order to allow the comparison of the experiments;
– common tools for summarizing and producing reports and graphs on the measured performances and the conducted analyses;
– a common XML format for exchanging data between organizers and participants.

DIRECT was successfully adopted during the CLEF 2005 campaign. It was used by nearly 30 participants spread over 15 different nations, who submitted more than 530 experiments; then 15 assessors assessed more than 160,000 documents in seven different languages, including Russian and Bulgarian, which do not have a Latin alphabet. During the CLEF 2006 campaign, it has been used by nearly 75 participants spread over 25 different nations, who have submitted around 570 experiments; 40 assessors assessed more than 198,500 documents in nine different languages. DIRECT was then used for producing reports and overview graphs about the submitted experiments [18,19].

DIRECT has been developed by using the Java (http://java.sun.com/) programming language, which ensures great portability of the system across different platforms. We used the PostgreSQL (http://www.postgresql.org/) DataBase Management System (DBMS) for performing the actual storage of the data. Finally, all kinds of UIs in DIRECT are Web-based interfaces, which make the service easily accessible to end-users without the need of installing any kind of software. These interfaces have been developed by using the Apache STRUTS (http://struts.apache.org/) framework, an open-source framework for developing Web applications.

4 Conclusions

We have discussed a data curation approach that can help to face the test-collection challenge for the evaluation and future development of information access and extraction components of interactive information management systems.
On the basis of the experience gained in keeping and managing the data of interest to the CLEF evaluation campaign, we are testing the identified requirements in order to revise the proposed approach. A prototype carrying out the proposed approach, called DIRECT, has been implemented and widely tested during the CLEF 2005 and 2006 evaluation campaigns. On the basis of the experience gained, we are enhancing the proposed conceptual model and architecture, in order to implement a second version of the prototype to be tested and validated during the next CLEF campaigns.

Acknowledgements

The work reported in this paper has been partially supported by the DELOS Network of Excellence on Digital Libraries, as part of the Information Society Technologies (IST) Program of the European Commission (Contract G038-507618). The authors would like to warmly thank Carol Peters, coordinator of CLEF, for her continuous support and advice.

References

1. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA (1983)
2. Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H.J., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J.D., Weikum, G., Widom, J., Zdonik, S.: The Lowell Database Research Self-Assessment. Communications of the ACM (CACM) 48 (2005) 111–118
3. Ioannidis, Y., Maier, D., Abiteboul, S., Buneman, P., Davidson, S., Fox, E.A., Halevy, A., Knoblock, C., Rabitti, F., Schek, H.J., Weikum, G.: Digital library information-technology infrastructures. International Journal on Digital Libraries 5 (2005) 266–274
4. Agosti, M., Di Nunzio, G.M., Ferro, N.: A Data Curation Approach to Support In-depth Evaluation Studies. In Gey, F.C., Kando, N., Peters, C., Lin, C.Y., eds.: Proc. International Workshop on New Directions in Multilingual Information Access (MLIA 2006), http://ucdata.berkeley.edu/sigir2006-mlia.htm [last visited 2006, August 17] (2006) 65–68
5. Cleverdon, C.W.: The Cranfield Tests on Index Languages Devices. In Sparck Jones, K., Willett, P., eds.: Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc., San Francisco, California, USA (1997) 47–60
6. Agosti, M., Di Nunzio, G.M., Ferro, N.: Evaluation of a Digital Library System. In Agosti, M., Fuhr, N., eds.: Notes of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, http://dlib.ionio.gr/wp7/workshop2004_program.html [last visited 2006, February 28] (2004) 73–78
7. W3C: XML Schema Part 1: Structures – W3C Recommendation 28 October 2004. http://www.w3.org/TR/xmlschema-1/ [last visited 2006, September 4] (2004)
8. W3C: XML Schema Part 2: Datatypes – W3C Recommendation 28 October 2004. http://www.w3.org/TR/xmlschema-2/ [last visited 2006, September 4] (2004)
9. Hull, D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In Korfhage, R., Rasmussen, E., Willett, P., eds.: Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), ACM Press, New York, USA (1993) 329–338
10. Di Nunzio, G.M., Ferro, N., Jones, G.J.F., Peters, C.: CLEF 2005: Ad Hoc Track Overview.
In Peters, C., Gey, F.C., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B., Müller, H., de Rijke, M., eds.: Accessing Multilingual Information Repositories: Sixth Workshop of the Cross-Language Evaluation Forum (CLEF 2005). Revised Selected Papers, Lecture Notes in Computer Science (LNCS) 4022, Springer, Heidelberg, Germany (2006) 11–36
11. Voorhees, E.M.: Overview of the TREC 2005 Robust Retrieval Track. In Voorhees, E.M., Buckland, L.P., eds.: The Fourteenth Text REtrieval Conference Proceedings (TREC 2005), http://trec.nist.gov/pubs/trec14/t14_proceedings.html [last visited 2006, August 4] (2005)
12. Lord, P., Macdonald, A.: e-Science Curation Report. Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision. The JISC Committee for the Support of Research (JCSR). http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf [last visited 2006, February 28] (2003)
13. Croft, W.B.: Combining Approaches to Information Retrieval. In Croft, W.B., ed.: Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval. Kluwer Academic Publishers, Norwell (MA), USA (2000) 1–36
14. Harman, D.K.: Overview of the First Text REtrieval Conference (TREC-1). In Harman, D.K., ed.: The First Text REtrieval Conference (TREC-1), National Institute of Standards and Technology (NIST), Special Publication 500-207, Washington, USA. http://trec.nist.gov/pubs/trec1/papers/01.txt [last visited 2006, February 28] (1992)
15. Di Nunzio, G.M., Ferro, N.: DIRECT: a Distributed Tool for Information Retrieval Evaluation Campaigns. In Ioannidis, Y., Schek, H.J., Weikum, G., eds.: Proc. 8th DELOS Thematic Workshop on Future Digital Library Management Systems: System Architecture and Information Access, http://ii.umit.at/research/delos_website/delos-dagstuhl-handout-all.pdf [last visited 2006, June 21] (2005) 58–63
16. Di Nunzio, G.M., Ferro, N.: DIRECT: a System for Evaluating Information Access Components of Digital Libraries. In Rauber, A., Christodoulakis, S., Min Tjoa, A., eds.: Proc. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Lecture Notes in Computer Science (LNCS) 3652, Springer, Heidelberg, Germany (2005) 483–484
17. Di Nunzio, G.M., Ferro, N.: Scientific Evaluation of a DLMS: a Service for Evaluating Information Access Components. In: Proc. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany (in print) (2006)
18. Di Nunzio, G.M., Ferro, N.: Appendix A. Results of the Core Tracks and Domain-Specific Tracks. In Peters, C., Quochi, V., eds.: Working Notes for the CLEF 2005 Workshop, http://www.clef-campaign.org/2005/working_notes/workingnotes2005/appendix_a.pdf [last visited 2006, February 28] (2005)
19. Di Nunzio, G.M., Ferro, N.: Appendix A. Results of the Core Tracks. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006 Workshop, Published Online (2006)