=Paper=
{{Paper
|id=Vol-1172/CLEF2006wn-all-AgostiEt2006
|storemode=property
|title=Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them?
|pdfUrl=https://ceur-ws.org/Vol-1172/CLEF2006wn-all-AgostiEt2006.pdf
|volume=Vol-1172
|dblpUrl=https://dblp.org/rec/conf/clef/AgostiNF06a
}}
==Scientific Data of an Evaluation Campaign: Do We Properly Deal With Them?==
Maristella Agosti, Giorgio Maria Di Nunzio, and Nicola Ferro

Department of Information Engineering – University of Padua
Via Gradenigo, 6/b – 35131 Padova – Italy
{agosti, dinunzio, ferro}@dei.unipd.it

Abstract. This paper examines the current way of keeping the data produced during evaluation campaigns and highlights some of its shortcomings. As a consequence, we propose a new approach for improving the management of evaluation campaigns' data. In this approach, the data are considered as scientific data to be curated and enriched in order to give full support to longitudinal statistical studies and long-term preservation.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval; H.3.4 [Systems and Software]: Performance evaluation.

General Terms: Experimentation, Performance, Measurement, Algorithms.

Additional Keywords and Phrases: Multilingual Information Access, Cross-Language Information Retrieval, Scientific Data, Data Curation, In-depth evaluation studies.

1 Introduction

The experimental evaluation of an Information Retrieval System (IRS) is a scientific activity whose outcomes are different kinds of scientific data:

– experiments: the primary scientific data produced by the participants of an evaluation campaign;
– performance measurements: the metrics, such as precision and recall [1], used to evaluate the performance of an IRS in a given experiment;
– descriptive statistics: the statistics, such as mean or median, used to summarize the overall performances achieved by an experiment or by the collection of experiments of a track;
– hypothesis tests and other statistical analyses: the different statistical techniques used for performing an in-depth analysis of the experiments.

When we deal with scientific data, "the lineage (provenance) of the data must be tracked, since a scientist needs to know where the data came from [. . . ] and what cleaning, rescaling, or modelling was done to arrive at the data to be interpreted" [2]. In addition, [3] points out how provenance is "important in judging the quality and applicability of information for a given use and for determining when changes at sources require revising derived information". Furthermore, when scientific data are maintained for further and future use, they should be enriched and, besides information about provenance, the changes at sources that have occurred over time also need to be tracked. Sometimes the enrichment of a portion of scientific data can make use of a citation for explicitly mentioning and making reference to useful information.

In this paper we examine whether the current methodology properly deals with the data produced during an evaluation campaign, by recognizing that they are in effect valuable scientific data. Furthermore, we describe the data curation approach [4] which we have undertaken to overcome some of the shortcomings of the current methodology, and which we have applied in designing and developing the infrastructure for the Cross-Language Evaluation Forum (CLEF).

The paper is organized as follows: Section 2 introduces the motivations and the objectives of our research work; Section 3 describes the work carried out in developing the CLEF infrastructure; finally, Section 4 draws some conclusions.
2 Motivations and Objectives

2.1 Experimental Collections

Nowadays, the experimental evaluation of an IRS is carried out in important international evaluation forums which bring research groups together and provide them with the means for measuring the performances of their systems and for discussing and comparing their work. The Text REtrieval Conference (TREC, http://trec.nist.gov/) was the first initiative in this field and has laid the groundwork for the subsequent initiatives; TREC developed a common evaluation procedure in order to compare IRSs by measuring the effectiveness of different techniques, and to discuss how differences between systems affect performances. After TREC, other important international initiatives have been launched, in particular the Cross-Language Evaluation Forum (CLEF), the NII-NACSIS Test Collection for IR Systems (NTCIR) and the INitiative for the Evaluation of XML Retrieval (INEX). CLEF (http://clef.isti.cnr.it/) aims at evaluating Cross-Language Information Retrieval (CLIR) systems that operate on European languages in both monolingual and cross-lingual contexts. NTCIR (http://research.nii.ac.jp/ntcir/index-en.html) is the Asian counterpart of CLEF, where the traditional Chinese, Korean, Japanese, and English languages are the basis of the evaluation of cross-lingual tasks. INEX (http://inex.is.informatik.uni-duisburg.de/) provides participants with evaluation procedures for content-oriented eXtensible Markup Language (XML) retrieval in order to measure the effectiveness of IRSs that manage XML documents. These evaluation forums are usually further organized into tracks, which investigate different facets of the evaluation of the information access components of a Digital Library Management System (DLMS).

All of the previously mentioned initiatives are generally carried out according to the Cranfield methodology, which makes use of experimental collections [5]. An experimental collection is a triple C = (D, Q, J), where: D is a set of documents, also called the collection of documents; Q is a set of topics, from which the actual queries are derived; J is a set of relevance judgements, i.e. for each topic q ∈ Q the documents d ∈ D which are relevant for the topic q are determined. An experimental collection C allows the comparison of two retrieval methods, say X and Y, according to some measurements which quantify the retrieval performances of these methods. An experimental collection both provides a common test-bed to be indexed and searched by the IRSs X and Y and guarantees the possibility of replicating the experiments.

Nevertheless, the Cranfield methodology is mainly focused on creating comparable experiments and evaluating the performances of an IRS rather than on modeling and managing the scientific data produced during an evaluation campaign. As an example, note that the exchange of information between organizers and participants is mainly performed by means of textual files formatted according to the TREC data format, which is the de-facto standard in this field. Note that this information represents a first kind of scientific data produced during the evaluation process. The following is a fragment of the results of an experiment submitted by a participant to the organizers, where the header row is not really present in the exchanged data but serves here as an explanation of the fields:

  Topic  Iter.  Document         Rank  Score           Experiment
  141    Q0     AGZ.950609.0067  0     0.440873414278  IMSMIPO
  141    Q0     AGZ.950613.0165  1     0.305291658641  IMSMIPO
  ...
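To make the structure of such a record concrete, the following is a minimal sketch of how one whitespace-separated line of this format could be parsed; the class and field names are illustrative assumptions of this sketch and are not part of the TREC data format itself.

```java
/** Hypothetical holder for one record of a TREC-format result file (names are illustrative). */
public record RunRecord(String topic, String iteration, String document,
                        int rank, double score, String experiment) {

    /** Parses one whitespace-separated line, e.g.
     *  "141 Q0 AGZ.950609.0067 0 0.440873414278 IMSMIPO". */
    public static RunRecord parse(String line) {
        String[] f = line.trim().split("\\s+");
        return new RunRecord(f[0], f[1], f[2],
                Integer.parseInt(f[3]), Double.parseDouble(f[4]), f[5]);
    }

    public static void main(String[] args) {
        RunRecord r = parse("141 Q0 AGZ.950609.0067 0 0.440873414278 IMSMIPO");
        System.out.println(r.document() + " retrieved at rank " + r.rank());
    }
}
```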
As a further example, the following is a fragment of the relevance judgements sent back by the organizers to the participants:

  Topic  Iter.  Document         Relevant
  141    0      AGZ.950606.0013  0
  141    0      AGZ.950609.0067  1
  141    0      AGZ.950613.0165  0
  ...

In the above data, each row represents a record of an experiment or of a relevance judgement, where fields are separated by white spaces. There is a field which specifies the unique identifier of the topic (e.g. 141), a field for the unique identifier of the document (e.g. AGZ.950609.0067), a field which identifies the experiment (e.g. IMSMIPO), a field which specifies whether a document is relevant for a topic (e.g. 1) or not (e.g. 0), and so on, as specified by the header rows.

As can be noted from the above examples, this format is suitable for a simple data exchange between participants and organizers. Nevertheless, this format neither provides any metadata explaining its content, nor does a scheme exist to define the structure of each file, the data type of each field, and the various constraints on the data, such as the numeric floating point precision. Moreover, this format is not very suitable for modelling the information space involved in an evaluation forum, because the relationships among the different entities (documents, topics, experiments, participants) are not modeled and each entity is treated separately from the others.

Furthermore, the present way of keeping collections over time does not permit systematic studies of the improvements achieved by participants over the years, for example in a specific multilingual setting [6].

We argue that the information space implied by an evaluation forum needs an appropriate conceptual model which takes into consideration and describes all the entities involved in the evaluation forum. In fact, an appropriate conceptual model is the necessary basis for making the scientific data produced during the evaluation an active part of all those information enrichments, such as data provenance and citation, which we have described in the previous section. This conceptual model can also be translated into an appropriate logical model in order to manage the information of an evaluation forum by using, for example, database technology. Finally, from this conceptual model we can also derive appropriate data formats for exchanging information among organizers and participants, such as an XML format that complies with an XML Schema [7,8] which describes and constrains the exchanged information.

2.2 Statistical Analysis of Experiments

The Cranfield methodology is mainly focused on how to evaluate the performances of two systems and how to provide a common ground which makes the experimental results comparable. [9] points out that, in order to evaluate retrieval performances, we need not only an experimental collection and measures for quantifying retrieval performances, but also a statistical methodology for judging whether measured differences between retrieval methods X and Y can be considered statistically significant.

To address this issue, evaluation forums have traditionally carried out statistical analyses which provide participants with an overview analysis of the submitted experiments, as in the case of the overview papers of the different tracks at TREC and CLEF; some recent examples of this kind of papers are [10,11].
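As an illustration of the kind of statistical methodology advocated in [9], the following is a minimal sketch of a two-sided paired t-test over per-topic average precision values of two retrieval methods X and Y. The use of the Apache Commons Math library and the sample values are assumptions of this sketch, not tools or data prescribed by the evaluation forums.

```java
import org.apache.commons.math3.stat.inference.TTest;

public class SignificanceSketch {
    public static void main(String[] args) {
        // Hypothetical per-topic average precision values for two retrieval methods X and Y.
        double[] apX = {0.42, 0.31, 0.55, 0.48, 0.20, 0.39, 0.61, 0.27};
        double[] apY = {0.45, 0.29, 0.58, 0.50, 0.22, 0.44, 0.60, 0.33};

        // Two-sided paired t-test: p-value for the null hypothesis of equal mean performance.
        double p = new TTest().pairedTTest(apX, apY);
        System.out.printf("paired t-test p-value = %.4f%n", p);
        System.out.println(p < 0.05
                ? "difference is statistically significant at alpha = 0.05"
                : "difference is not statistically significant at alpha = 0.05");
    }
}
```

The same computation could be carried out with any of the statistical packages mentioned below; the point is that the test, its parameters, and the significance level have to be fixed uniformly for the analyses to be comparable.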
Furthermore, participants may conduct statistical analyses on their own experiments by using either ad-hoc packages, such as IR-STAT-PAK (http://users.cs.dal.ca/~jamie/pubs/IRSP-overview.html), or generally available software tools with statistical analysis capabilities, such as R (http://www.r-project.org/), SPSS (http://www.spss.com/), or MATLAB (http://www.mathworks.com/). However, the choice of whether or not to perform a statistical analysis is left up to each participant, who may not have all the skills and resources needed to perform such analyses. Moreover, when participants perform statistical analyses using their own tools, the comparability among these analyses cannot be fully guaranteed because, for example, different statistical tests can be employed to analyze the data, or different choices and approximations for the various parameters of the same statistical test can be made.

Thus, we can observe that, in general, there is limited support for the systematic employment of statistical analysis by participants. For this reason, we suggest that evaluation forums should support and guide participants in adopting a more uniform way of performing statistical analyses on their own experiments. In this way, participants can not only benefit from standard experimental collections which make their experiments comparable, but they can also exploit standard tools for the analysis of the experimental results, which make the analysis and assessment of their experiments comparable too.

2.3 Information Enrichment and Interpretation

As introduced in Section 1, scientific data, their enrichment and their interpretation are essential components of scientific research. The Cranfield methodology traces out how these scientific data have to be produced, while the statistical analysis of experiments provides the means for further elaborating and interpreting the experimental results. Nevertheless, the current methodologies do not require any particular coordination or synchronization between the basic scientific data and the analyses performed on them, which are treated as almost separate items. On the contrary, researchers would greatly benefit from an integrated vision of them, where the access to a scientific data item could also offer the possibility of retrieving all the analyses and interpretations of it. Furthermore, it should be possible to enrich the basic scientific data in an incremental way, progressively adding further analyses and interpretations to them.
Let us consider what is currently done in an evaluation forum:

– Experimental collections:
  • there are few or no metadata about document collections, the context they refer to, how they have been created, and so on;
  • there are few or no metadata about topics, how they have been created, the problems encountered by their creators, what documents the creators found relevant for a given topic, and so on;
  • there are few or no metadata about how pools have been created and about the relevance assessments, or about the problems faced by the assessors when dealing with difficult topics;
– Experiments:
  • there are few or no metadata about them, such as what techniques have been adopted or what tunings have been carried out;
  • they may not be publicly accessible, making it difficult for other researchers to make a direct comparison with their own experiments;
  • their citation can be an issue;
– Performance measurements:
  • there are no metadata about how a measure has been created, which software has been used to compute it, and so on;
  • often only summaries are publicly available, while the detailed measurements may not be accessible;
  • their format may not be suitable for further computer processing;
  • their modelling and management needs to be dealt with;
– Descriptive statistics and hypothesis tests:
  • they are mainly limited to the task overviews produced by organizers;
  • participants may not have all the skills needed to perform a statistical analysis;
  • participants can carry out statistical analyses only on their own experiments, without the possibility of comparing them with the experiments of other participants;
  • the comparability among the statistical analyses conducted by the participants is not fully guaranteed, due to possible differences in the design of the statistical experiments.

These issues are better faced and framed in the wider context of the curation of scientific data, which plays an important role in the systematic definition of a proper methodology to manage and promote the use of data. The e-Science Data Curation Report gives the following definition of data curation [12]: "the activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and re-use. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose".

This definition implies that we have to take into consideration the possibility of information enrichment of scientific data, meant as archiving and preserving scientific data so that experiments, records, and observations will be available for future research, as well as the provenance, curation, and citation of scientific data items. The benefits of this approach include the growing involvement of scientists in international research projects and forums and the increased interest in comparative research activities.
Furthermore, the definition introduced above reflects some of the many reasons why keeping data is important, for example: re-use of data for new research, including collection-based research to generate new science; retention of unique observational data which is impossible to re-create; retention of expensively generated data which is cheaper to maintain than to re-generate; enhancement of existing data available for research projects; validation of published research results.

As a concrete example in the field of information retrieval, consider the data fusion problem [13], where lists of results produced by different systems have to be merged into a single list. In this context, researchers do not start from scratch, but often experiment with their merging algorithms by using the lists of results produced in experiments carried out by other researchers. This is the case, for example, of the CLEF 2005 multilingual merging track [10], which provided participants with some of the CLEF 2003 multilingual experiments as lists of results to be used as input to their merging algorithms. It is now clear that researchers in this field would benefit from a clear data curation strategy, which promotes the re-use of existing data and allows the data fusion experiments to be traced back to the original lists of results and, perhaps, to the analyses and interpretations about them.

Thus, we consider all these points as requirements that should be taken into account when we produce and manage the scientific data that come out of the experimental evaluation of an IRS. In addition, to achieve the full information enrichment discussed in Section 1, both the experimental datasets and their further elaborations, such as their statistical analyses, should be first-class objects that can be directly referenced and cited. Indeed, as recognized by [12], the possibility of citing scientific data and their further elaborations is an effective way of making scientists and researchers an active part of the digital curation process.

3 The CLEF Infrastructure

3.1 Conceptual Model for an Evaluation Forum

As discussed in the previous section, we need to design and develop a proper conceptual model of the information space involved in an evaluation forum. Indeed, this conceptual model provides us with the basis needed to offer all the information enrichment and interpretation features described above. Figure 1 shows the Entity–Relationship (ER) schema which represents the conceptual model we have developed.
The conceptual model is built around five main areas of modelling:

– evaluation forum: deals with the different aspects of an evaluation forum, such as the conducted evaluation campaigns and the different editions of each campaign, the tracks along which a campaign is organized, the subscription of the participants to the tracks, and the topics of each track;
– collection: concerns the different collections made available by an evaluation forum; each collection can be organized into various files and each file may contain one or more multimedia documents; the same collection can be used by different tracks and by different editions of the evaluation campaign;
– experiments: regards the experiments submitted by the participants and the evaluation metrics computed on those experiments, such as precision and recall;
– pool/relevance assessment: is about the pooling method [14], where a set of experiments is pooled and the documents retrieved in those experiments are assessed with respect to the topics of the track the experiments belong to;
– statistical analysis: models the different aspects concerning the statistical analysis of the experimental results, such as the type of statistical test employed, its parameters, the observed test statistic, and so forth.

Fig. 1. Conceptual model for the information space of an evaluation forum.

3.2 Architecture of the Service

Figure 2 shows the architecture of the proposed service. It consists of three layers – data, application and interface logic layers – in order to achieve better modularity and to properly describe the behavior of the service by isolating specific functionalities at the proper layer. In this way, the behavior of the system is designed in a modular and extensible way. In the following, we briefly describe the architecture shown in Figure 2, from bottom to top.

Data Logic. The data logic layer deals with the persistence of the different information objects coming from the upper layers. There is a set of "storing managers" dedicated to storing the submitted experiments, the relevance assessments, and so on.
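As a purely illustrative sketch, one of these storing managers could be exposed to the upper layers through an interface along the following lines; the class, field, and method names are hypothetical assumptions of this sketch and are not taken from the actual DIRECT code.

```java
import java.util.List;

/** Hypothetical value object describing a submitted experiment (illustrative only). */
class Experiment {
    String id;            // unique identifier of the experiment, e.g. "IMSMIPO"
    String trackId;       // track the experiment was submitted to
    String participantId; // group that submitted the experiment
    String submittedFile; // location of the submitted run file
}

/** Hypothetical interface of the storing manager devoted to experiments. */
interface ExperimentStoringManager {
    void store(Experiment experiment);            // persist a newly submitted experiment
    Experiment findById(String experimentId);     // retrieve a single experiment
    List<Experiment> findByTrack(String trackId); // all experiments of a given track
}
```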
We adopt the Data Access Object (DAO) and the Transfer Object (TO) design patterns (http://java.sun.com/blueprints/corej2eepatterns/Patterns/). The DAO implements the access mechanism required to work with the underlying data source, acting as an adapter between the upper layers and the data source. If the underlying data source implementation changes, this pattern allows the DAO to adapt to different storage schemes without affecting the upper layers.

Fig. 2. Architecture of a service for supporting the evaluation of the information access components of a DLMS.

In addition to the other storing managers, there is the log storing manager, which finely traces both system and user events. It captures information such as the user name, the Internet Protocol (IP) address of the connecting host, the action that has been invoked by the user, the messages exchanged among the components of the system in order to carry out the requested action, any error condition, and so on. Thus, besides offering us a log of the system and user activities, the log storing manager allows us to finely trace the provenance of each piece of data, from its entrance into the system to every further processing on it.

Finally, on top of the various "storing managers" there is the Storing Abstraction Layer (SAL), which hides the details of the storage management from the upper layers. In this way, the addition of a new "storing manager" is totally transparent for the upper layers.

Application Logic. The application logic layer deals with the flow of operations within the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT). It provides a set of tools capable of managing high-level tasks, such as experiment submission, pool assessment, and the statistical analysis of an experiment. For example, the Statistical Analysis Management Tool (SAMT) offers the functionalities needed to conduct a statistical analysis on a set of experiments. In order to ensure comparability and reliability, the SAMT makes use of well-known and widely used tools to implement the statistical tests, so that everyone can replicate the same test, even without access to the service. In the architecture, the MATLAB Statistics Toolbox (http://www.mathworks.com/products/statistics/) has been adopted, since MATLAB is a leading application in the field of numerical analysis which employs state-of-the-art algorithms, but other software could have been used as well. In the case of MATLAB, an additional library is needed to allow our service to access MATLAB in a programmatic way; other software could require different solutions. As an additional example aimed at wide comparability and acceptance of the tools, a further library provides an interface for our service towards the trec_eval package (ftp://ftp.cs.cornell.edu/pub/smart/), which was first developed and adopted by TREC and represents the standard tool for computing the basic performance figures, such as precision and recall.
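As a sketch of how a bridge towards trec_eval could be realized, the following launches an external trec_eval binary on a relevance-judgement file and a run file and captures its textual output. The class name, the assumption that the binary is available on the local path, and the absence of command-line options are assumptions of this sketch; it is not a description of the actual Java-Treceval engine used in the architecture.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

/** Hypothetical wrapper that invokes an external trec_eval binary (illustrative only). */
public class TrecEvalSketch {
    public static String evaluate(String qrelsPath, String runPath)
            throws IOException, InterruptedException {
        // Assumes a trec_eval executable is reachable on the PATH.
        Process p = new ProcessBuilder("trec_eval", qrelsPath, runPath)
                .redirectErrorStream(true)
                .start();
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append(System.lineSeparator());
            }
        }
        p.waitFor();
        return output.toString(); // raw measure/value lines, to be parsed and stored
    }
}
```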
Finally, the Service Integration Layer (SIL) provides the interface logic layer with uniform and integrated access to the various tools. As we noticed in the case of the SAL, thanks to the SIL the addition of new tools is also transparent for the interface logic layer.

Interface Logic. This is the highest level of the architecture, and it is the access point for the user to interact with the system. It provides specialised User Interfaces (UIs) for the different types of users, that is the participants, the assessors, and the administrators. Note that, thanks to the abstraction provided by the application logic layer, different kinds of UIs can be provided, either stand-alone applications or Web-based applications.

3.3 DIRECT: the Running Prototype

The proposed service has been implemented in a first prototype, called the Distributed Information Retrieval Evaluation Campaign Tool (DIRECT) [15,16,17], and it has been tested in the context of the CLEF 2005 and 2006 evaluation campaigns. The initial prototype takes a first step in the direction of an information access evaluation service for scientific reflection DLMSs, by providing support for:

– the management of an evaluation forum: the track set-up, the harvesting of documents, and the management of the subscription of participants to tracks;
– the management of the submission of experiments, the collection of metadata about experiments, and their validation;
– the creation of document pools and the management of the relevance assessment;
– common statistical analysis tools for both organizers and participants in order to allow the comparison of the experiments;
– common tools for summarizing and producing reports and graphs on the measured performances and the conducted analyses;
– a common XML format for exchanging data between organizers and participants.

DIRECT was successfully adopted during the CLEF 2005 campaign. It was used by nearly 30 participants spread over 15 different nations, who submitted more than 530 experiments; then 15 assessors assessed more than 160,000 documents in seven different languages, including Russian and Bulgarian, which do not have a Latin alphabet. During the CLEF 2006 campaign, it has been used by nearly 75 participants spread over 25 different nations, who have submitted around 570 experiments; 40 assessors assessed more than 198,500 documents in nine different languages. DIRECT was then used for producing reports and overview graphs about the submitted experiments [18,19].

DIRECT has been developed by using the Java (http://java.sun.com/) programming language, which ensures great portability of the system across different platforms. We used the PostgreSQL (http://www.postgresql.org/) DataBase Management System (DBMS) for performing the actual storage of the data. Finally, all kinds of UIs in DIRECT are Web-based interfaces, which make the service easily accessible to end-users without the need of installing any kind of software. These interfaces have been developed by using the Apache STRUTS (http://struts.apache.org/) framework, an open-source framework for developing Web applications.

4 Conclusions

We have discussed a data curation approach that can help to face the test-collection challenge for the evaluation and future development of information access and extraction components of interactive information management systems.
On the basis of the experience gained in keeping and managing the data of interest to the CLEF evaluation campaign, we are testing the identified requirements in order to revise the proposed approach. A prototype carrying out the proposed approach, called DIRECT, has been implemented and widely tested during the CLEF 2005 and 2006 evaluation campaigns. On the basis of the experience gained, we are enhancing the proposed conceptual model and architecture, in order to implement a second version of the prototype to be tested and validated during the next CLEF campaigns.

Acknowledgements

The work reported in this paper has been partially supported by the DELOS Network of Excellence on Digital Libraries, as part of the Information Society Technologies (IST) Program of the European Commission (Contract G038-507618). The authors would like to warmly thank Carol Peters, coordinator of CLEF, for her continuous support and advice.

References

1. Salton, G., McGill, M.J.: Introduction to Modern Information Retrieval. McGraw-Hill, New York, USA (1983)
2. Abiteboul, S., Agrawal, R., Bernstein, P., Carey, M., Ceri, S., Croft, B., DeWitt, D., Franklin, M., Garcia-Molina, H., Gawlick, D., Gray, J., Haas, L., Halevy, A., Hellerstein, J., Ioannidis, Y., Kersten, M., Pazzani, M., Lesk, M., Maier, D., Naughton, J., Schek, H.J., Sellis, T., Silberschatz, A., Stonebraker, M., Snodgrass, R., Ullman, J.D., Weikum, G., Widom, J., Zdonik, S.: The Lowell Database Research Self-Assessment. Communications of the ACM (CACM) 48 (2005) 111–118
3. Ioannidis, Y., Maier, D., Abiteboul, S., Buneman, P., Davidson, S., Fox, E.A., Halevy, A., Knoblock, C., Rabitti, F., Schek, H.J., Weikum, G.: Digital library information-technology infrastructures. International Journal on Digital Libraries 5 (2005) 266–274
4. Agosti, M., Di Nunzio, G.M., Ferro, N.: A Data Curation Approach to Support In-depth Evaluation Studies. In Gey, F.C., Kando, N., Peters, C., Lin, C.Y., eds.: Proc. International Workshop on New Directions in Multilingual Information Access (MLIA 2006), http://ucdata.berkeley.edu/sigir2006-mlia.htm [last visited 2006, August 17] (2006) 65–68
5. Cleverdon, C.W.: The Cranfield Tests on Index Languages Devices. In Sparck Jones, K., Willett, P., eds.: Readings in Information Retrieval, Morgan Kaufmann Publishers, Inc., San Francisco, California, USA (1997) 47–60
6. Agosti, M., Di Nunzio, G.M., Ferro, N.: Evaluation of a Digital Library System. In Agosti, M., Fuhr, N., eds.: Notes of the DELOS WP7 Workshop on the Evaluation of Digital Libraries, http://dlib.ionio.gr/wp7/workshop2004_program.html [last visited 2006, February 28] (2004) 73–78
7. W3C: XML Schema Part 1: Structures – W3C Recommendation 28 October 2004. http://www.w3.org/TR/xmlschema-1/ [last visited 2006, September 4] (2004)
8. W3C: XML Schema Part 2: Datatypes – W3C Recommendation 28 October 2004. http://www.w3.org/TR/xmlschema-2/ [last visited 2006, September 4] (2004)
9. Hull, D.: Using Statistical Testing in the Evaluation of Retrieval Experiments. In Korfhage, R., Rasmussen, E., Willett, P., eds.: Proc. 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 1993), ACM Press, New York, USA (1993) 329–338
10. Di Nunzio, G.M., Ferro, N., Jones, G.J.F., Peters, C.: CLEF 2005: Ad Hoc Track Overview.
In Peters, C., Gey, F.C., Gonzalo, J., Jones, G.J.F., Kluck, M., Magnini, B., Müller, H., de Rijke, M., eds.: Accessing Multilingual Information Repositories: Sixth Workshop of the Cross-Language Evaluation Forum (CLEF 2005). Revised Selected Papers, Lecture Notes in Computer Science (LNCS) 4022, Springer, Heidelberg, Germany (2006) 11–36
11. Voorhees, E.M.: Overview of the TREC 2005 Robust Retrieval Track. In Voorhees, E.M., Buckland, L.P., eds.: The Fourteenth Text REtrieval Conference Proceedings (TREC 2005), http://trec.nist.gov/pubs/trec14/t14_proceedings.html [last visited 2006, August 4] (2005)
12. Lord, P., Macdonald, A.: e-Science Curation Report. Data curation for e-Science in the UK: an audit to establish requirements for future curation and provision. The JISC Committee for the Support of Research (JCSR). http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf [last visited 2006, February 28] (2003)
13. Croft, W.B.: Combining Approaches to Information Retrieval. In Croft, W.B., ed.: Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval. Kluwer Academic Publishers, Norwell (MA), USA (2000) 1–36
14. Harman, D.K.: Overview of the First Text REtrieval Conference (TREC-1). In Harman, D.K., ed.: The First Text REtrieval Conference (TREC-1), National Institute of Standards and Technology (NIST), Special Publication 500-207, Washington, USA. http://trec.nist.gov/pubs/trec1/papers/01.txt [last visited 2006, February 28] (1992)
15. Di Nunzio, G.M., Ferro, N.: DIRECT: a Distributed Tool for Information Retrieval Evaluation Campaigns. In Ioannidis, Y., Schek, H.J., Weikum, G., eds.: Proc. 8th DELOS Thematic Workshop on Future Digital Library Management Systems: System Architecture and Information Access, http://ii.umit.at/research/delos_website/delos-dagstuhl-handout-all.pdf [last visited 2006, June 21] (2005) 58–63
16. Di Nunzio, G.M., Ferro, N.: DIRECT: a System for Evaluating Information Access Components of Digital Libraries. In Rauber, A., Christodoulakis, S., Min Tjoa, A., eds.: Proc. 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Lecture Notes in Computer Science (LNCS) 3652, Springer, Heidelberg, Germany (2005) 483–484
17. Di Nunzio, G.M., Ferro, N.: Scientific Evaluation of a DLMS: a Service for Evaluating Information Access Components. In: Proc. 10th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2006), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany (in print) (2006)
18. Di Nunzio, G.M., Ferro, N.: Appendix A. Results of the Core Tracks and Domain-Specific Tracks. In Peters, C., Quochi, V., eds.: Working Notes for the CLEF 2005 Workshop, http://www.clef-campaign.org/2005/working_notes/workingnotes2005/appendix_a.pdf [last visited 2006, February 28] (2005)
19. Di Nunzio, G.M., Ferro, N.: Appendix A. Results of the Core Tracks. In Nardi, A., Peters, C., Vicedo, J.L., eds.: Working Notes for the CLEF 2006 Workshop, Published Online (2006)