=Paper=
{{Paper
|id=Vol-1173/CLEF2007wn-DomainSpecific-PetrasEt2007
|storemode=property
|title=The Domain-Specific Track at CLEF 2007
|pdfUrl=https://ceur-ws.org/Vol-1173/CLEF2007wn-DomainSpecific-PetrasEt2007.pdf
|volume=Vol-1173
|dblpUrl=https://dblp.org/rec/conf/clef/PetrasBS07a
}}
==The Domain-Specific Track at CLEF 2007==
Vivien Petras, Stefan Baerisch, Maximilian Stempfhuber
GESIS Social Science Information Centre, Lennéstr. 30, 53113 Bonn, Germany
{vivien.petras | stefan.baerisch | max.stempfhuber@gesis.org}

Abstract
The domain-specific track uses test collections from the social science domain to test monolingual and cross-language retrieval in structured bibliographic databases. Special attention is given to the existence of controlled vocabularies for content description and their potential usefulness in retrieval. Test collections and topics are provided in German, English and Russian. This year, a new English test collection (from the CSA Sociological Abstracts database) was added. We present an overview of the CLEF domain-specific track, including a description of the tasks, collections, topic preparation and relevance assessments, as well as the contributions to the track. A summary of results is given. The track participants experimented with different retrieval models, ranging from classic vector-space to probabilistic to language models. The controlled vocabularies were used for query expansion or as bilingual dictionaries for query translation.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval

General Terms
Measurement, Performance, Experimentation

Keywords
Information Retrieval, Evaluation, Controlled Vocabularies

1 Introduction
The CLEF domain-specific track evaluates mono- and cross-language information retrieval on structured scientific data. A point of emphasis in this track is research on leveraging the structure of data in collections (i.e. controlled vocabularies and other metadata) to improve search. In recent years, the focus of the domain-specific data collections has been on bibliographic databases in the social science domain.
The domain-specific track was established at the inception of CLEF in 2000 and was funded by the European Union from 2001-2004 (Kluck & Gey, 2001; Kluck, 2004). It is now continued at the GESIS German Social Science Information Centre (Bonn) in cooperation with the DELOS Network of Excellence on Digital Libraries.
The GIRT databases (now in version 4) are extracts from the German Social Science Information Centre's SOLIS (Social Science Literature) and SOFIS (Social Science Research Projects) databases from 1990-2000. In 2005, the Russian Social Science Corpus (RSSC) was added as a Russian-language test collection (94,581 documents); it was replaced in 2006 by the INION ISISS corpus covering the social sciences and economics in Russian. This year, another English-language social science collection was added: an extract from CSA's Sociological Abstracts, contributing more documents and another thesaurus to the test bed. In addition to the four test collections, various controlled vocabularies and mappings between vocabularies were made available. As is standard for the domain-specific track, 25 topics were prepared in German and then translated into English and Russian.

2 The Domain-Specific Task
The domain-specific track includes three subtasks:
• Monolingual retrieval against the German GIRT collection, the English GIRT and CSA Sociological Abstracts collections, or the Russian INION ISISS collection;
• Bilingual retrieval from any of the source languages to any of the target languages;
• Multilingual retrieval from any source language to all collections/languages.
2.1 The Test Collections
In recent years, pseudo-parallel collections in German and English (GIRT) and one or two Russian test collections were provided (Kluck & Stempfhuber, 2005; Stempfhuber & Baerisch, 2006). This year, one Russian and two English collections were provided. Every test collection is in the format of a bibliographic database (records include title, author, abstract and source information) with the addition of subject metadata from controlled vocabularies.

German
The German GIRT collection (the social science German Indexing and Retrieval Testdatabase) is now used in its fourth version (GIRT Description, 2007), with 151,319 documents covering the years 1990-2000 and indexed with the German version of the Thesaurus for the Social Sciences. Almost all documents (145,941) contain an abstract.

English
The English GIRT collection is a pseudo-parallel corpus to the German GIRT collection, providing translated versions of the German documents. It also contains 151,319 documents indexed with the English version of the Thesaurus for the Social Sciences, but only 17% (26,058) of the documents contain an abstract.
New this year are 20,000 documents from the social science database Sociological Abstracts from Cambridge Scientific Abstracts (CSA), 94% of which contain an abstract. The documents were taken from the SA database covering the years 1994, 1995 and 1996. In addition to title and abstract, each document contains subject-describing keywords from the CSA Thesaurus of Sociological Indexing Terms and classification codes from the Sociological Abstracts classification.

Russian
For Russian retrieval, the INION ISISS corpus with bibliographic data from the social sciences and economics (145,802 documents) was used again. ISISS documents contain authors, titles, abstracts (for 27% of the test collection, or 39,404 documents) and keywords from the INION Thesaurus.

2.2 Controlled Vocabularies
The GIRT collections carry descriptors from the GESIS IZ Thesaurus for the Social Sciences in German or English, depending on the collection language. The CSA Sociological Abstracts documents contain descriptors from the CSA Thesaurus of Sociological Indexing Terms, and the Russian ISISS documents are provided with Russian INION Thesaurus terms. GIRT documents also contain classification codes from the GESIS IZ classification, and CSA SA documents carry codes from the Sociological Abstracts classification. Table 1 shows the distribution of subject-describing terms per document in each collection.

Collection                     Thesaurus descriptors / document   Classification codes / document
GIRT-4 (German or English)     10                                 2
CSA Sociological Abstracts     6.4                                1.3
INION ISISS                    3.9                                n/a
Table 1. Distribution of subject-describing terms per collection

Vocabulary mappings
In addition to the "mapping table" between the German and English terms of the GESIS IZ Thesaurus for the Social Sciences, which is effectively a translation, a bidirectional mapping between the GIRT and CSA thesauri was provided. Vocabulary mappings are one-directional, intellectually created term transformations between two controlled vocabularies. They can be used to switch from the subject metadata terms of one knowledge organization system to those of another, enabling a retrieval system to treat the subject descriptions of two or more different collections as one and the same. This year's mappings were equivalence transformations, showing only term mappings that were found to be equivalent between the two controlled vocabularies.
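As an illustration, a minimal sketch of how such an equivalence mapping could be applied, for example to carry query descriptors from one collection's vocabulary over to another's, is given below. The mapping entries and query terms are hypothetical and only illustrate the idea; they are not taken from the provided mapping tables.

    # A minimal sketch (not the track's actual software): applying an equivalence
    # mapping between two controlled vocabularies so that query descriptors from
    # one thesaurus can be used against a collection indexed with another.
    # The mapping entries and query terms below are hypothetical examples.
    from typing import Dict, Iterable, List

    # Hypothetical one-directional equivalence mapping
    # (source thesaurus term -> target thesaurus term).
    TSS_TO_CSA: Dict[str, str] = {
        "rural area": "rural areas",             # singular/plural difference
        "labour market": "labor market policy",  # difference in technical language
    }

    def map_descriptors(terms: Iterable[str], mapping: Dict[str, str]) -> List[str]:
        """Replace each descriptor by its equivalent in the target vocabulary.

        Terms without an equivalence entry are kept unchanged so they can still
        be matched against the free-text fields of the target collection.
        """
        return [mapping.get(term.lower(), term) for term in terms]

    if __name__ == "__main__":
        print(map_descriptors(["Rural area", "family policy"], TSS_TO_CSA))
        # -> ['rural areas', 'family policy']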
We provided mappings between the German Thesaurus for the Social Sciences and the English CSA Thesaurus of Sociological Indexing Terms. Since the German Thesaurus for the Social Sciences also exists in an English version, we additionally provided the mapping from the English Thesaurus for the Social Sciences to the English CSA Thesaurus of Sociological Indexing Terms for monolingual retrieval. An example of such a mapping from the English Thesaurus for the Social Sciences to the English CSA Thesaurus of Sociological Indexing Terms shows that a mapping can overcome differences in technical language and in the treatment of singular and plural between controlled vocabularies.

2.3 Topic Preparation
As is standard for the CLEF domain-specific track, 25 topics were prepared. For topic preparation we were supported by our colleagues from the GESIS Social Science Information Centre. As a special service to the social science community in Germany, the Information Centre biannually publishes updates on new entries in the SOLIS and SOFIS databases (from which the GIRT collections were generated). The specialized updates are prepared in 28 subject categories by subject specialists working at the Centre. Topics range from general sociology, family research, women's and gender studies, international relations and research on Eastern Europe to social psychology and environmental research. An overview of the service, including the 28 subject categories, can be found at the following URL: http://www.gesis.org/en/information/soFid/index.htm.
We asked our colleagues to suggest two to five topics related to their subject area and potentially relevant in the years 1990-2000 (the coverage of our test collections). The suggestions from 15 different colleagues were then checked for breadth, variance from previous years' topics and coverage in the test collections. 25 topics were selected and edited into the CLEF topic XML format. Figure 1 shows an example topic. All topics were created in German and subsequently translated into English and Russian.

Figure 1. Example topic in English

Table 2 lists all 25 topic titles in English to give an impression of the variance in topics.

Sibling relations
Unemployed youths without vocational training
Economic elites in Eastern Europe and Russia
System change and family planning in East Germany
Partnership and desire for children
Torture in the constitutional state
Family policy and national economy
Women and income level
Lifestyle and environmental behaviour
Unstable employment situations
Value change in Eastern Europe
Migration pressure
Quality of life of elderly persons
Class-specific leisure behaviour
Mortality rate
German-French relations after 1945
Multinational corporations
Gender and career chances
Ecological standards in emerging or developing countries
Integration policy
Tourism industry in Germany
Promoting health in the workplace
Economic situations of families
European climate policy
Economic support in the East
Table 2. Topic titles for the domain-specific CLEF track 2007

To date, 200 topics have been created for the domain-specific track.
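The selected topics were edited into the CLEF topic XML format mentioned above. A minimal sketch of reading such a topic file and building simple title-plus-description queries follows; the tag names (num, EN-title, EN-desc), the single wrapping root element and the file name are assumptions here, as the exact format of the distributed files may differ in detail.

    # A minimal sketch: reading topics in a CLEF-style XML format and building
    # title+description query strings. Tag names, the wrapping root element and
    # the file name are assumptions, not the track's actual distribution format.
    import xml.etree.ElementTree as ET
    from typing import Dict

    def load_topics(path: str, lang: str = "EN") -> Dict[str, str]:
        """Return a mapping from topic number to a title+description query."""
        queries: Dict[str, str] = {}
        root = ET.parse(path).getroot()
        for top in root.iter("top"):
            num = (top.findtext("num") or "").strip()
            title = (top.findtext(f"{lang}-title") or "").strip()
            desc = (top.findtext(f"{lang}-desc") or "").strip()
            queries[num] = f"{title} {desc}".strip()
        return queries

    if __name__ == "__main__":
        for num, query in sorted(load_topics("topics_ds2007_en.xml").items()):
            print(num, query)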
3 Overview of the 2007 Domain-Specific Track
More details of the individual runs and methods employed can be found in the corresponding articles by the participating groups.

3.1 Participants
Although 10 groups had registered for the domain-specific task, only 5 groups submitted runs. Four groups have submitted descriptions to the working notes so far (Clinchant and Renders, 2007; Fautsch et al., 2007; Kürsten and Eibl, 2007; Larson, 2007). Table 3 lists the participants.

Abbreviation   Group / Institution                                      Country
Chemnitz       Media Informatics, Chemnitz University of Technology     Germany
Cheshire       School of Information, UC Berkeley                       USA
Xerox          Xerox Research Centre Europe, Data Mining Group          France
Moscow         Moscow State University                                  Russia
Unine          Computer Science Department, University of Neuchatel     Switzerland
Table 3. Domain-specific track 2007 - participants

3.2 Submitted Runs
Experiments for all tasks (monolingual, bilingual and multilingual retrieval) were submitted to the track. Monolingual and bilingual experiments were attempted about equally often, whereas multilingual retrieval runs were submitted by only 2 groups. Russian remains slightly less popular than the other two languages. Table 4 lists the number of submitted runs per task; Table 5 gives an overview of the submitted runs per task and participant.

Task                      Runs
Monolingual
 - against German         13
 - against English        15
 - against Russian        11
Bilingual
 - against German         14
 - against English        15
 - against Russian         9
Multilingual               9
Table 4. Submitted runs per task in the domain-specific track

Task                      Participants (runs)
Monolingual
 - against German         Chemnitz (3), Cheshire (2), Unine (4), Xerox (4)
 - against English        Chemnitz (3), Cheshire (2), Moscow (2), Unine (4), Xerox (4)
 - against Russian        Chemnitz (3), Cheshire (2), Moscow (2), Unine (4)
Bilingual
 - against German         Chemnitz (4), Cheshire (4), Xerox (6)
 - against English        Chemnitz (3), Cheshire (4), Moscow (2), Xerox (6)
 - against Russian        Chemnitz (3), Cheshire (4), Moscow (2)
Multilingual              Chemnitz (3), Cheshire (6)
Table 5. Submitted runs per task and participant

3.3 Relevance Assessments
In previous years, the domain-specific relevance assessments were administered and overseen at least partly in-house at the Social Science Information Centre (using a self-developed Java Swing program). This year, all relevance assessments were administered and processed in the DIRECT system (Distributed Information Retrieval Evaluation Campaign Tool) provided by Giorgio M. Di Nunzio and Nicola Ferro from the Information Management Systems (IMS) Research Group at the University of Padova, Italy. This was a tremendous help for the CLEF group at the Information Centre and was well received by the five assessors. Some bandwidth and execution problems occurred, but overall the assessment stage went smoothly.
Documents were pooled using the top 100 ranked documents from each submission. Table 6 shows the pool sizes for each language.

Language   Pool size
German     16,288
English    17,867
Russian    14,473
Table 6. Pool sizes in the domain-specific track

For the German assessments, 652 documents per topic were judged on average and about 22% were found relevant. However, the assessments vary from topic to topic. Figure 2 shows the German assessments per topic.

Figure 2. German assessments per topic (judged and relevant documents for topics 176-200)

For the English assessments, 715 documents per topic were judged on average and about 25% were found relevant. For the Russian assessments, 3 topics were found to have no relevant documents in the ISISS collection: 178, 181 and 191. On average, 579 documents per topic were judged and only 10% were found relevant.
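The pools referred to above were formed as the union of the top 100 ranked documents of every submitted run per topic. A minimal sketch of this step follows; it assumes run files in the common TREC result format (topic, Q0, document identifier, rank, score, run tag), documents listed in rank order, and hypothetical file names.

    # A minimal sketch of the pooling step (assumptions: run files in the common
    # TREC result format "topic Q0 docid rank score run_tag", documents listed in
    # rank order, and hypothetical file names).
    from collections import defaultdict
    from typing import Dict, List, Set

    POOL_DEPTH = 100  # top-ranked documents taken from each run per topic

    def build_pools(run_files: List[str], depth: int = POOL_DEPTH) -> Dict[str, Set[str]]:
        """Return, per topic, the union of the top `depth` documents of every run."""
        pools: Dict[str, Set[str]] = defaultdict(set)
        for path in run_files:
            taken: Dict[str, int] = defaultdict(int)
            with open(path, encoding="utf-8") as run:
                for line in run:
                    topic, _q0, docid, *_rest = line.split()
                    if taken[topic] < depth:
                        pools[topic].add(docid)
                        taken[topic] += 1
        return pools

    if __name__ == "__main__":
        pools = build_pools(["run_groupA.txt", "run_groupB.txt"])  # hypothetical runs
        for topic in sorted(pools):
            print(topic, len(pools[topic]))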
Figures 3 and 4 show the English and Russian relevance assessment numbers.

Figure 3. English assessments per topic (judged and relevant documents for topics 176-200)

Figure 4. Russian assessments per topic (judged and relevant documents for topics 176-200)

At first glance, several topics seem to yield particularly many relevant documents across all three languages despite the different collections (e.g. 188, 190, 195), whereas others seem to yield particularly few (e.g. 181, 191). One explanation might be the timeliness and specificity of the topics. The topics yielding many relevant documents (Quality of life of elderly persons, Mortality rate, Integration policy) are rather broad and ongoing themes in the social science literature. The other two topics (Torture in the constitutional state, Economic elites in Eastern Europe and Russia) could be considered more specific and geared towards more recent time frames than others.

4 Domain-Specific Experiments
Every group used the controlled vocabularies and structured data in some form or other. One point of emphasis was query expansion with the help of the subject descriptions provided by the thesauri. However, the translation and mapping tables were also used as bilingual dictionaries for the cross-language experiments.

4.1 Retrieval Models
The Chemnitz group (Kürsten and Eibl, 2007) used a redesigned version of their retrieval system based on the Lucene API and employed two indices in retrieval: a structured index (taking the structure of the documents into account) and a plain index that ignores the document structure. To combine the two indices, a data fusion approach using the z-score, introduced by the Unine group (Savoy, 2004), was employed. They found that the unstructured index outperformed the structured one.
The Berkeley group (Larson, 2007) used a probabilistic model based on a logistic regression algorithm that has been used successfully for cross-language retrieval since TREC-2, implemented in the Cheshire retrieval system.
Unine (Fautsch et al., 2007) used several retrieval models for comparison purposes: the classical tf-idf vector space model, probabilistic retrieval with the Okapi algorithm, four variants of the DFR (Divergence from Randomness) approach, and a language modelling approach. Data fusion with the z-score was applied to combine these different models. They also compared word-based and n-gram indexing for retrieval on the Russian-language corpus.
The Xerox group (Clinchant and Renders, 2007) used a language modelling approach for their retrieval experiments.

4.2 Language Processing for Documents and Queries
Standard language processing for documents and queries, in the form of stopword removal and stemming or normalization, was employed by all groups. The Unine group successfully developed a new lightweight stemmer for the Russian language. For the German language, Unine and Xerox used a decompounding module to split German compounds, whereas Berkeley and Chemnitz did not.
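Both Chemnitz and Unine relied on the z-score data fusion of Savoy (2004) mentioned in Section 4.1 to merge result lists from different indices or retrieval models. A minimal sketch of the idea follows, assuming each run supplies a score per document; details of the published scheme, such as the additive shift that keeps scores positive, are omitted, and the sample scores are hypothetical.

    # A minimal sketch of z-score data fusion (after Savoy, 2004): each run's
    # scores are standardized and then summed per document. Documents missing
    # from a run simply contribute nothing to the fused score.
    from collections import defaultdict
    from statistics import mean, pstdev
    from typing import Dict, List, Tuple

    def zscore_fuse(runs: List[Dict[str, float]]) -> List[Tuple[str, float]]:
        """Fuse several {docid: score} result lists into one ranked list."""
        fused: Dict[str, float] = defaultdict(float)
        for run in runs:
            scores = list(run.values())
            mu = mean(scores)
            sigma = pstdev(scores) or 1.0  # guard against a zero deviation
            for docid, score in run.items():
                fused[docid] += (score - mu) / sigma
        return sorted(fused.items(), key=lambda item: item[1], reverse=True)

    if __name__ == "__main__":
        structured = {"doc1": 12.0, "doc2": 9.5, "doc3": 7.1}    # hypothetical scores
        unstructured = {"doc2": 0.81, "doc4": 0.66, "doc1": 0.52}
        print(zscore_fuse([structured, unstructured]))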
4.3 Query Expansion
Three of the groups focused on query expansion in some form or other.
Berkeley used a version of Entry Vocabulary Indexes (Gey et al., 2001), based on the same logistic regression algorithm as their retrieval system, to associate title and description terms from the topics with controlled vocabulary terms from the documents. Another approach was a thesaurus lookup, where title and description words were looked up in a thesaurus that combined all subject-describing keywords from the different collections. The terms from the controlled vocabularies were added to the query. As part of its standard retrieval process, the Cheshire system also implemented a blind feedback algorithm based on the Robertson and Sparck Jones term weights. Whereas the Entry Vocabulary Index approach worked better for the English target language, the thesaurus lookup worked better for German and Russian.
Unine used the Thesaurus for the Social Sciences to expand queries with thesaurus terms: thesaurus entries were indexed as documents, retrieved in response to the query terms, and then simply added to the query. They also used blind query feedback with Rocchio's formula as well as an idf-based approach described in Abdou and Savoy (2007). The blind feedback approach improved the average precision of the results, whereas the thesaurus expansion did not.
Xerox used lexical entailment for query expansion, whereby a language modelling approach finds terms in the corpus documents that are similar to the query terms. They found that this approach outperformed simple blind feedback, but a combined approach worked best.

4.4 Translation
Another focus of research was query translation, where the provided mapping tables were utilized as bilingual dictionaries.
Berkeley used the commercially available LEC Power Translator program for translation in all languages. Chemnitz implemented a translation plug-in for their Lucene retrieval system, utilizing well-known freely available translation services such as Babel Fish, Google Translate, PROMT and Reverso. They also used the bilingual mapping table from the thesauri for translation. Finally, Xerox compared their statistical machine translation system MATRAX with a language-model-based dictionary adaptation approach. Dictionary adaptation selects one of several translation candidates for a term from a bilingual dictionary by estimating the probability of each target term given the language context of the source query. They found that this approach compared well with the statistical machine translation system tested.

5 Results
The Appendix of this volume lists mean average precision (MAP) figures for each run per task and recall-precision graphs for the top-performing runs of each task.

6 Outlook
This year's experiments have shown that leveraging a controlled vocabulary for query expansion or translation can improve results in structured test collections. A new collection and a new vocabulary (CSA Sociological Abstracts) were added, and a mapping table between the CSA Thesaurus and the GIRT Thesaurus was provided for experiments. As new collections are added and distributed search across several collections becomes more common, seamless switching between controlled vocabularies becomes crucial in order to utilize the expansion and translation techniques developed for individual collections. For this purpose, several resources for terminology mapping have been developed at the German Social Science Information Centre (KoMoHe Project Website, 2007).
Among them are over 40 bidirectional mappings between various controlled vocabularies. A web service to retrieve mapped terms is being developed. Besides the expansion of the test collections, these vocabulary mapping services could be a future branch of research for the domain-specific track within CLEF.

Acknowledgements
We would like to thank Cambridge Scientific Abstracts for providing the documents for the new Sociological Abstracts test collection. We gratefully acknowledge the support of Natalia Loukachevitch and her colleagues from the Research Computing Center of M.V. Lomonosov Moscow State University in translating the topics into Russian as well as providing parts of the Russian relevance assessments. Very special thanks also to Giorgio Di Nunzio and Nicola Ferro from the Information Management Systems (IMS) Research Group at the University of Padova for providing the DIRECT system, for all their help in the assessment process, and for providing the graphs and numbers for the results analysis. Claudia Henning and Jeof Spiro did the German and English assessments; Monika Gonser and Oksana Schäfer provided the rest of the Russian assessments.

References
Samir Abdou and Jacques Savoy (2007). Searching in Medline: Stemming, query expansion, and manual indexing evaluation. Information Processing & Management, to appear.
Stephane Clinchant and Jean-Michel Renders (2007). XRCE's Participation to CLEF 2007 Domain-specific Track. This volume.
Claire Fautsch, Ljiljana Dolamic, Samir Abdou and Jacques Savoy (2007). Domain-Specific IR for German, English and Russian Languages. This volume.
Fredric Gey, Michael Buckland, Aitao Chen and Ray Larson (2001). Entry vocabulary – a technology to enhance digital search. In Proceedings of HLT 2001, First International Conference on Human Language Technology, San Diego, pages 91-95, March 2001.
GIRT Description (2007). GIRT - Mono- and Cross-language Domain-Specific Information Retrieval (GIRT4). http://www.gesis.org/en/research/information_technology/girt4.htm
Michael Kluck and Fredric C. Gey (2001). The Domain-Specific Task of CLEF - Specific Evaluation Strategies in Cross-Language Information Retrieval. In: Carol Peters (ed.): Cross-Language Information Retrieval and Evaluation. Workshop of the Cross-Language Evaluation Forum, CLEF 2000, Lisbon, Portugal, September 21-22, 2000, Revised Papers. Berlin/Heidelberg/New York: Springer, 48-56 (Lecture Notes in Computer Science, 2069).
Michael Kluck (2004). The GIRT Data in the Evaluation of CLIR Systems – from 1997 until 2003. In: Peters, C., Gonzalo, J., Braschler, M., Kluck, M. (Eds.): Comparative Evaluation of Multilingual Information Access Systems. 4th Workshop of the Cross-Language Evaluation Forum, CLEF 2003, Trondheim, Norway, August 21-22, 2003, Revised Selected Papers. Berlin/Heidelberg/New York: Springer, 379-393 (Lecture Notes in Computer Science, 3237).
Michael Kluck and Maximilian Stempfhuber (2005). Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis. Working Notes for the CLEF 2005 Workshop, 21-23 September, Vienna, Austria. http://www.clef-campaign.org/2005/working_notes/workingnotes2005/kluck05.pdf
KoMoHe Project Website (2007). Competence Center Modeling and Treatment of Semantic Heterogeneity. http://www.gesis.org/en/research/information_technology/komohe.htm
Jens Kürsten and Maximilian Eibl (2007). Domain-Specific Cross Language Retrieval: Comparing and Merging Structured and Unstructured Indices. This volume.
Ray Larson (2007). Experiments in Classification Clustering and Thesaurus Expansion for Domain Specific Cross-Language Retrieval. This volume.
Jacques Savoy (2004). Data Fusion for Effective European Monolingual Information Retrieval. Working Notes for the CLEF 2004 Workshop, 15-17 September, Bath, UK. http://www.clef-campaign.org/2004/working_notes/WorkingNotes2004/22.pdf
Maximilian Stempfhuber and Stefan Baerisch (2006). Domain-Specific Track CLEF 2005: Overview of Results and Approaches, Remarks on the Assessment Analysis. Working Notes for the CLEF 2006 Workshop, 20-22 September, Alicante, Spain. http://www.clef-campaign.org/2006/working_notes/workingnotes2006/stempfhuberOCLEF2006.pdf

Topic 192 (English): System change and family planning in East Germany
Find documents describing birth trends and family planning since reunification in East Germany. Of interest are documents on demographic changes which have taken place after 1989 in the territory of the former GDR, as well as the slump in birth numbers and the decline in marriages and divorces.