First Participation of the University and Hospitals of Geneva in the Domain-Specific Track at
CLEF 2008
Julien Gobeill, Patrick Ruch
University and Hospitals of Geneva, Switzerland
julien.gobeill@sim.hcuge.ch
Abstract
In 2008 we participate in our first Domain-Specific Track, with the aim of establishing a baseline
for our Information Retrieval engine in a domain that is new to us. We specialize in Natural
Language Processing in the biomedical domain, and we have participated in the ImageCLEF medical
retrieval track and in TREC Genomics for four years with textual strategies such as query expansion
with controlled vocabularies, pattern recognition and vector space models. The technical component
of our cross-language search engine is a generic toolkit, EasyIR, with which we can perform Text
Categorization and Information Retrieval. The strategy applied for the 2008 Domain-Specific Track is
as simple as possible, as we only want to establish a baseline for EasyIR in a new track. For the
English monolingual task, we choose to index documents with the title, the descriptive text and some
types of classification terms. For the bilingual task (German queries against the English
collection), we choose on the one hand to perform a simple retrieval on the German collection, and
on the other hand to collect the descriptors of the retrieved documents in order to perform
cross-lingual query expansion. Unfortunately, our results cannot be considered satisfactory, as we
achieve a MAP of 0.171 for the monolingual task and a MAP of 0.132 for the bilingual task.
Nevertheless, compared with several baseline runs of other participants in DS CLEF 2007, our
baseline run achieves equivalent performance. Possible improvements for the next DS CLEF are a
better tuning of our system on the benchmark and an efficient use of the controlled vocabularies.
Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and
Retrieval; H.3.4 Systems and Software; H.3.7 Digital Libraries; H.2.3 [Database Management]: Languages -
Query Languages
General terms
Measurement, Performance, Experimentation
Keywords
Image Retrieval, Text categorization, multimodal retrieval
1 Introduction
The Cross Language Evaluation Forum (CLEF) is a challenge that has taken place each year since 2000.
The goal of this challenge is to evaluate the participants on a common multilingual task, to
establish the state of the art of the techniques used in a domain, and to build a benchmark for
future evaluations. The Domain-Specific (DS) Track has run since 2000 with the goal of retrieving
relevant documents in a structured collection of scientific documents. For a few years, the DS task
has focused on bibliographic databases in the social sciences domain. The goal of this task is to
retrieve documents relevant to a query in a multilingual collection, using titles, abstracts and
human-assigned descriptors (1).
Our team specializes in Natural Language Processing in the biomedical domain, and we regularly
participate in the TREC Genomics Track (2; 3) and in the ImageCLEF medical retrieval Track (4; 5).
In these challenges, we usually compose our runs with simple textual strategies based on thesaural
resources. The technical component of our cross-language search engine is a generic toolkit, EasyIR,
which can perform Text Categorization with high precision at high ranks (6), above 90% for Medical
Subject Headings terms, and Information Retrieval. Our first participation in the 2008 DS Track is
motivated by the aim of establishing a baseline for our Information Retrieval engine in a domain
that is new to us, where controlled vocabularies can be used for query expansion and more efficient
retrieval.
We participate in the English monolingual task and in the bilingual task with German queries against
the English collection. As the aim of our first participation is only to obtain a baseline
evaluation of our engine in this track, we submit only one run per task, with the simplest possible
strategy.
2 Data and Strategies
The 2008 collection is the same as in 2007. The collection used for the tasks in which we
participate (the English monolingual task and the bilingual task with German queries against the
English collection) comprises documents from two different sources. On the one hand, the German
Indexing and Retrieval Testdatabase in its fourth version (GIRT-4 German) contains 151,319 German
documents dealing with social science and covering the years 1990-2000; a pseudo-parallel English
version of this collection, GIRT-4 English, contains the same documents translated into English.
On the other hand, the social science database Sociological Abstracts from Cambridge Scientific
Abstracts (CSA-SA) contains 20,000 documents, covering the years 1994-1996.
A typical document contains several features useful for indexing, such as title, author names,
document type and publication date. An abstract is present for 96% of the GIRT-4 German documents
and 94% of the CSA-SA documents, but for only 17% of the translated GIRT-4 English documents.
Additional thesaurus descriptors and classification codes belonging to controlled vocabularies are
manually assigned to each document. For the GIRT-4 collections, descriptors are taken from the
GESIS IZ Thesaurus; for the CSA-SA collection, they are taken from the CSA Thesaurus of
Sociological Indexing Terms. Figures 1-3 show an example document from each collection.
CSASA-1-EN-9600289
Structural Tightness and Social Conformity: Varying the Source of
External Influence
Roberts, Lance W.
Boldt, Edward D.
Guest, Anne
Dept Sociology U Manitoba, Winnipeg R3T 2N2
Abstract of Journal Article
1990
US
Hutterites
Conformity
Manitoba
College Students
social psychology; personality and social roles
(individual traits, social identity, adjustment, conformism, and
deviance)
social conformity, structural tightness thesis; test data;
Hutterites/undergraduates, Manitoba;
Structural tightness is defined as the capacity to impose collective
role expectations on community members. An attempt is made to reconceptualize this
term so that the findings in a cross-cultural conformity study may be brought into
a different light. Theoretical considerations are made in order to break down an
ecocultural model provided by others working in the field. It is this
conceptualization that puts forth the original definition of structural tightness
that is debated. To test these notions, test data were obtained from ethnic
Hutterites and 51 undergraduates in Manitoba. Findings suggest that the
theoretical rationale put forth is plausible and support the proposed
reconceptualization.
Figure 1: example of a document from the CSA-SA collection.
GIRT-DE19909343
GIRT-DE19909343
Die sozioökonomische Transformation einer Region : Das Bergische Land
von 1930 bis 1960
Henne, Franz J.
Geyer, Michael
1990
DE
Rheinland
historische Entwicklung
regionale Entwicklung
sozioökonomische Faktoren
historisch
Aktenanalyse
Sozialgeschichte
Die Arbeit hat das Ziel, anhand einer regionalen Studie die
Entstehung des "modernen" fordistischen Wirtschaftssystems und des sozialen
Systems im Zeitraum zwischen 1930 und 1960 zu beleuchten; dabei geht es auch um
das Studium des "Sozial-imaginären", der Veränderung von Bewußtsein und Selbst-
Verständnis von Arbeitern durch das Erlebnis und die Erfahrung der Depression, des
Nationalsozialismus und der Nachkriegszeit, welches sich in den 1950er Jahren
gemeinsam mit der wirtschaftlichen Veränderung zu einem neuen "System"
zusammenfügt.
Figure 2: example of a document from the GIRT-4 German collection.
GIRT-EN19901932
GIRT-EN19901932
The Socio-Economic Transformation of a Region : the Bergische Land from
1930 to 1960
Henne, Franz J.
Geyer, Michael
1990
EN
Rhenish Prussia
historical development
regional development
socioeconomic factors
historical
document analysis
Social History
Figure 3: the document corresponding to Figure 2 in the GIRT-4 English collection.
Strategy for the English monolingual task. We choose to perform a simple Information Retrieval
process for this task. For the GIRT-4 English collection, the title, abstract, controlled terms and
classification texts are concatenated into a bag of words in order to index each document. For the
CSA-SA collection, the title, text and classification texts are used in the same way. The keystone
of our strategy in ImageCLEF and TREC Genomics is the automatic assignment of descriptors to
documents and queries, in order to summarize the concepts of a document in a kind of intermediate
language (7). As human-assigned keywords are already associated with each document in the DS Track
collection, as we have no expertise in these bibliographic controlled vocabularies, and as the
submitted runs are supposed to establish a baseline and to be as simple as possible, we choose not
to work in depth with controlled vocabulary terms for document indexing. Moreover, after studying
the Working Notes of the previous DS Track, we choose not to use the controlled vocabularies for
query expansion, as several participating teams report that this technique leads to no significant
improvement (1; 8; 9). Therefore, this run is as basic as we could make it.
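The field concatenation described above can be illustrated with a minimal Python sketch. It is not the EasyIR implementation: the field names (title, abstract, controlled_terms, classification_text, text) are hypothetical labels chosen for the example, and the tokenization is a plain lowercase word split with no stemming.

```python
import re
from collections import Counter

# Hypothetical field names; the actual tags in the collection files may differ.
GIRT_FIELDS = ("title", "abstract", "controlled_terms", "classification_text")
CSA_FIELDS = ("title", "text", "classification_text")


def bag_of_words(doc, fields):
    """Concatenate the selected fields of a document and count lowercase tokens."""
    parts = []
    for field in fields:
        value = doc.get(field, "")  # a missing field (e.g. no abstract) is skipped
        # A field may hold one string or a list of strings (several descriptors).
        if isinstance(value, (list, tuple)):
            parts.extend(value)
        else:
            parts.append(value)
    text = " ".join(parts)
    return Counter(re.findall(r"\w+", text.lower()))


# Example built from the GIRT-4 English document of Figure 3 (no abstract).
girt_doc = {
    "title": "The Socio-Economic Transformation of a Region",
    "controlled_terms": ["regional development", "historical development"],
    "classification_text": "Social History",
}
bag = bag_of_words(girt_doc, GIRT_FIELDS)
```

Because only 17% of the GIRT-4 English documents have an abstract, the sketch treats every field as optional, so that a document is still indexed through its title and controlled terms.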
Strategy for the German queries to English collection bilingual task. For this task, our strategy is
slightly more sophisticated. As GIRT-4 offers translated versions of its documents in German and
English, we first perform a simple Information Retrieval process on the GIRT-4 German collection
with the German queries, in order to obtain a first ranking. We do not translate the queries,
although translation seems to have been an effective strategy in the previous DS Track (1). Then, we
use this ranking for query expansion: for each query, we select the 10 top-ranked documents
retrieved in GIRT-4 German, and we parse their corresponding documents and descriptors in GIRT-4
English in order to extract the 5 most frequent English descriptors. These English descriptors are
added to the queries in order to perform a second retrieval on the CSA-SA corpus, which yields a
second ranking. The two rankings are then normalized and merged into a final ranking, with weights
of 75% for the first ranking and 25% for the second one. At no point do we use the provided
vocabulary mappings.
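The expansion and merging steps can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code used for the submitted run: the function names, the de_to_en lookup between German and English document identifiers, and the min-max normalization are all assumptions, since the text above does not specify how the rankings are normalized. Only the values 10, 5, 75% and 25% come from the description.

```python
from collections import Counter


def top_descriptors(ranked_de_ids, de_to_en, en_descriptors, n_docs=10, n_terms=5):
    """Return the most frequent English descriptors of the top German documents."""
    counts = Counter()
    for de_id in ranked_de_ids[:n_docs]:
        en_id = de_to_en.get(de_id)  # hypothetical German-to-English id lookup
        if en_id is None:
            continue
        counts.update(en_descriptors.get(en_id, []))
    return [term for term, _ in counts.most_common(n_terms)]


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; min-max is an assumed normalization."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def merge_rankings(first, second, w_first=0.75, w_second=0.25):
    """Merge two score dictionaries into one ranking, best document first."""
    a, b = min_max_normalize(first), min_max_normalize(second)
    merged = {
        doc_id: w_first * a.get(doc_id, 0.0) + w_second * b.get(doc_id, 0.0)
        for doc_id in set(a) | set(b)
    }
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


# Toy data: three retrieved German documents and their English descriptors.
de_to_en = {"de1": "en1", "de2": "en2", "de3": "en3"}
en_descriptors = {
    "en1": ["regional development", "social history"],
    "en2": ["regional development"],
    "en3": ["social history", "regional development"],
}
expansion = top_descriptors(["de1", "de2", "de3"], de_to_en, en_descriptors, n_terms=1)
final = merge_rankings({"girt1": 2.0, "girt2": 1.0}, {"csa1": 4.0, "csa2": 2.0})
```

With these weights, a document that appears only in the second ranking can score at most 0.25, so it never outranks a document whose normalized score in the first ranking exceeds 0.25.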
3 Methods
Two main modules constitute the skeleton of EasyIR, our Information Retrieval engine: the regular
expression component and the vector space component. Each of these basic classifiers implements a
known approach to document retrieval. The first tool is based on a regular expression pattern
matcher (10). The second classifier is based on a vector space engine. This second tool is expected
to provide high recall, in contrast to the regular expression-based tool, which should favor
precision. The former component uses tokens as indexing units and can be merged with a thesaurus,
while the latter uses stems (Porter). See (11) for more details about our engine.
The mean average precision (MAP) is the main measure for evaluating ad hoc retrieval tasks (for both
monolingual and bilingual runs). Following (12), we also use this measure to tune the Information
Retrieval system. We use the parameters obtained by a previous tuning on a small set of OHSUMED
abstracts: 1200 randomly selected abstracts were used to select the weighting parameters of the
vector space classifier and the best combination of these parameters with the regular
expression-based classifier.
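For reference, the two measures reported in the results section can be computed as in the following Python sketch. It follows the standard definitions of average precision and R-precision and is not the official evaluation tool; the function names and the toy identifiers are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document found."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that are never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(runs, qrels):
    """Average of the per-query average precision over all judged queries."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


def r_precision(ranked_ids, relevant_ids):
    """Precision at rank R, where R is the number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    r = len(relevant)
    return sum(1 for doc_id in ranked_ids[:r] if doc_id in relevant) / r


# Toy example with two queries.
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(runs, qrels)
```

For query q1, the relevant documents are found at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2, and the R-precision is 1/2 because only one of the top two documents is relevant.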
4 Results and Discussion
We describe the results for each task separately.
4.1 English monolingual task
For this task, our run achieves an R-precision of 22.69% and a MAP of 17.14%. These performances
place us last in both rankings and are relatively far from the best ones (around 40% for
R-precision and 38% for MAP, respectively). This could be considered relatively weak, but once
again, the aim of our participation is only to establish a baseline with simple methods in this DS
Track. Nevertheless, a closer look at the Working Notes of the previous DS Track shows that several
teams participating in the DS Track this year submitted runs with equivalent performance last year
(13; 14), even if the two Tracks cannot be directly compared, as the queries have changed. We
consider that the performance of our run is fair relative to our expertise and background in this
domain, and that we will be able to submit more effective runs in future DS Tracks.
4.2 German queries to English collection bilingual task
The results for this task are quite similar. Our run achieves an R-precision of 18.80%, which is not
the worst R-precision of the ranking, and a MAP of 13.2%. As for the English monolingual task, we
find several runs with equivalent performance in the previous DS Track. As we did not tune our
system and did not make strong use of the controlled vocabularies and their mappings, we assume once
again that we have a lot of room for improvement in future evaluations.
5 Conclusion and Future Work
For the next DS Track, we need to invest more time in an efficient tuning of our engine on the
previous benchmark. A more in-depth review of the successful techniques used this year, followed by
a more efficient use of the controlled vocabularies for query expansion and by automatic translation
of the queries, should be planned too.
References
1. The Domain-Specific Track at CLEF 2007. V Petras, S Baerisch, M Stempfhuber. CLEF 2007 Proceedings.
2. TREC 2007 Genomics Track Overview. W Hersh, A Cohen, L Ruslen, P Roberts. TREC 2007 Proceedings.
3. Vocabulary-driven Passage Retrieval for Question-Answering in Genomics. J Gobeill, F Ehrler, I Tbahriti,
P Ruch. TREC 2007 Proceedings.
4. Overview of the ImageCLEF 2007 Medical Retrieval and Annotation Tasks. H Muller, T Deselaers, E Kim,
J Kalpathy-Cramer, T M Deserno, W Hersh. ImageCLEF 2007 Proceedings.
5. University and Hospitals of Geneva at ImageCLEF 2007. X Zhou, J Gobeill, P Ruch and H Muller. CLEF
2007 Working notes.
6. Automatic Assignment of Biomedical Categories: Toward a Generic Approach. P Ruch. Bioinformatics,
22(6), 2006, pp. 658-664.
7. Query and Document Translation by Automatic Text Categorization: A Simple Approach to Establish a Strong
Textual Baseline for ImageCLEFmed 2006. J Gobeill, H Muller and P Ruch. 2006. ImageCLEF.
8. Experiments in Classification Clustering and Thesaurus Expansion for Domain Specific Cross-Language
Retrieval. R R Larson. CLEF 2007 Proceedings.
9. Domain-Specific IR for German, English and Russian Languages. C Fautsch, L Dolamic, S Abdou, J Savoy.
CLEF 2007 Proceedings.
10. A tool to search through entire file systems. U Manber and S Wu. Proceedings of the USENIX Winter
1994 Technical Conference, San Francisco, pp. 23-32.
11. Learning-Free Text Categorization. P Ruch, R Baud, and A Geissbuhler. 2003, LNAI 2780, pp. 199-208.
12. Combining classifiers in text categorization. L Larkey and W Croft. 1996, SIGIR, ACM Press, New York,
pp. 289-297.
13. Domain-Specific Cross Language Retrieval: Comparing and Merging Structured and Unstructured Indices.
J Kursten and M Eibl. CLEF 2007 Proceedings.
14. XRCE's Participation to CLEF 2007 Domain-specific Track. S Clinchant and JM Renders. CLEF 2007
Proceedings.