Océ at CLEF 2003

Roel Brand, Marvin Brünner, Samuel Driessen, Pascha Iljin, Jakob Klok
Océ-Technologies B.V.
P.O. Box 101, 5900 MA Venlo, The Netherlands
{rkbr, mbru, sjdr, qpilj, klok}@oce.nl

Abstract

This report describes the work done at Océ Research for the Cross-Language Evaluation Forum (CLEF) 2003. This year we participated in seven mono-lingual tasks (all languages except Russian). We used the BM25 model, a probabilistic model and (for Dutch only) a statistical model to rank documents. Knowledge Concepts' Content Enabler semantic network was used (for the statistical model only) for stemming and query expansion. Our main goals were to compare the BM25 model and the probabilistic model, and to evaluate the performance of a statistical model that uses 'knowledge' from queries and relevance assessments from previous years. Furthermore, we give some comments on the standard performance measures used by CLEF.

1 Introduction

This is our third participation in the Cross-Language Evaluation Forum (CLEF). In 2001 we participated only in the Dutch mono-lingual task. Last year we participated in all mono-lingual tasks, in some of the cross-lingual tasks and in the multi-lingual task. The goal for this year was to concentrate on the mono-lingual tasks. We aimed to compare the models we constructed during the last two years and to get more insight into the possibilities of using 'knowledge' from queries and relevance assessments of previous years in order to construct information retrieval systems for this year. (Due to problems with indexing and a restriction on the maximal number of mono-lingual runs, the runs for the Russian language were not carried out.)

2 Methods

2.1 Ranking systems

Three different approaches were used to rank the documents:
1. The BM25 model (for seven languages).
2. A probabilistic model (for seven languages).
3. A statistical model (for the Dutch language only).

The BM25 model was used again this year, just like in CLEF 2002 [1]. The probabilistic and statistical models appeared as a result of internal research done last year [2].

2.2 Query

A query was constructed automatically from the title and description tags of a topic; all single characters and stop word terms were removed, and all remaining terms were brought to lower case. The expansion of query terms was carried out for the statistical approach only. For that model, the morphological collapse (dictionary-based stemming) of Knowledge Concepts' Content Enabler semantic network was used to obtain root forms of (not necessarily all) query terms. The root forms were then expanded with the semantic network: the morphological variants of each root form (such as the plural form) were added to the query.

2.3 Indexing

The indexes were built for each of the languages by splitting the documents on non-alphanumerical characters. Single characters were removed from the indexes. We decided to leave the stop words in the indexes for the following reasons. It is very difficult to construct a universal set of stop words. If such a set is based on term frequencies within a document collection, it is highly probable that it will not be the same for two different document collections. If it is based on human decisions, a number of important terms from the document collection and/or the query will be removed. For example, consider the terms 'new' and 'year' as stop words (they are used quite often). After removing these terms from the document collection and from the queries, it becomes difficult to find a set of relevant documents for the query 'A New Year tree'. To show that stop word removal is not always beneficial, consider the query 'Who said "To be or not to be?"': in this case all terms from the query can be defined as stop words.

This year we used the following stop word lists:

• Dutch
  - internally developed stop word list
  - http://www.unine.ch/Info/clef/
• English
  - http://www.searchengineworld.com/spy/stopwords.htm
  - http://www.unine.ch/Info/clef/
• Norwegian
  - http://snowball.tartarus.org/norwegian/stop.txt
• Spanish
  - http://www.unine.ch/Info/clef/
  - ftp://ftp.cs.cornell.edu/pub/smart/spanish.stop
• Finnish
• French
• Swedish
• German
  - http://www.unine.ch/Info/clef/
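As a rough illustration of the preprocessing described in Sections 2.2 and 2.3, the Python sketch below builds a query from a topic's title and description and builds a simple inverted index by splitting text on non-alphanumerical characters. It is a minimal sketch under our own assumptions: the stop word set, the field handling and the index structure are placeholders, not the actual lists or code used for the runs.

```python
import re

# Placeholder stop word set; the real lists came from the sources cited above.
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is", "are"}

def tokenize(text):
    """Split text on non-alphanumerical characters (approximated here by
    non-word characters, which keeps accented letters), lower-case the terms
    and drop single characters, as done for the indexes."""
    terms = re.split(r"\W+", text.lower())
    return [t for t in terms if len(t) > 1]

def build_query(title, description):
    """Build a query from the title and description fields of a topic:
    tokenize, then remove stop words (stop words are removed from queries
    but kept in the indexes)."""
    terms = tokenize(title) + tokenize(description)
    return [t for t in terms if t not in STOP_WORDS]

def build_index(documents):
    """Build a simple inverted index: term -> set of document ids.
    Stop words are deliberately left in the index."""
    index = {}
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index.setdefault(term, set()).add(doc_id)
    return index
```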
3 Activities

3.1 Ranking models

The BM25 model has been described in [1]. Last year we showed that the performance of the BM25 ranking algorithm depends greatly on the choice of the values of the parameters k1 and b. However, estimating those values for optimal performance is only possible when the document collection, the set of queries and the set of relevance assessments are all available beforehand. Since the relevance assessments are not known in advance, a choice has to be made for the parameters. For this year we chose k1 = 1.2 and b = 0.6 for all seven languages; this pair of values was optimal for the Dutch document collection, set of queries and relevance assessments of 2001 (a scoring sketch is given at the end of this section).

The urn model (balls in an urn = terms in a document) was selected as the basis for the probabilistic model and was used for all seven languages. The probabilistic model was selected as the result of theoretical research done in 2002. We aimed to implement a set of clues (which we defined) in a 'mathematically correct' model, that is, a model whose assumptions and ranking rules are valid from a mathematical point of view. Examples of clues are: the presence of terms in the document that are synonyms of terms from the query, the importance of a query field, and the length of a document. We found that the set of clues we defined could not be entirely incorporated into the currently known information retrieval models while remaining mathematically correct. We found the probabilistic model the most appropriate one for implementing several of the defined clues.

Last year we experimented with the Dutch document collection, the sets of queries and the relevance assessments for 2001 and 2002, and obtained statistics for each of these two years. Two runs were submitted for the Dutch language using the statistical model; each run used the statistics of one of the two years. Different statistics were chosen depending on their degree of significance in order to obtain better performance results. We took into consideration features such as: the part of speech of the query terms; whether terms in the document come from the title or from the description of the topic; query terms of a certain document frequency; terms located in the title, lead or description of a document; terms related to the query terms; synonyms of the query terms; and proper names. For each feature we calculated a value expressing the expected 'gain' from its usage. The gain value was defined on the basis of the known relevance assessments for (document, topic) pairs. We will not explain here how the 'gain' values were calculated, because it cannot be done briefly; the interested reader is referred to [2].

The experiments with the statistical model on the CLEF data from 2001 and 2002 resulted in improved performance compared to the BM25 and probabilistic models: a proper choice of features and of their 'gain' values leads to better results. However, this model is strongly dependent on the document collection, the queries and the relevance assessments. Hence, the results for a set of new documents, new queries and new relevance assessments are unpredictable.
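For illustration, the following is a minimal sketch of the standard Okapi BM25 weighting with the parameter values used this year (k1 = 1.2, b = 0.6). It is not the actual engine code; the document statistics passed in are placeholders.

```python
import math

K1 = 1.2  # term-frequency saturation parameter
B = 0.6   # length-normalisation parameter; the pair (1.2, 0.6) was optimal for the Dutch 2001 data

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len):
    """Okapi BM25 score of one document for a query.

    query_terms : list of query terms
    doc_terms   : list of terms of the document (stop words included, as in our indexes)
    doc_freq    : dict term -> number of documents containing the term
    num_docs    : total number of documents in the collection
    avg_doc_len : average document length in the collection
    """
    doc_len = len(doc_terms)
    tf = {}
    for t in doc_terms:
        tf[t] = tf.get(t, 0) + 1

    score = 0.0
    for q in query_terms:
        n_q = doc_freq.get(q, 0)
        if n_q == 0 or q not in tf:
            continue
        idf = math.log((num_docs - n_q + 0.5) / (n_q + 0.5))
        norm = tf[q] * (K1 + 1) / (tf[q] + K1 * (1 - B + B * doc_len / avg_doc_len))
        score += idf * norm
    return score
```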
3.2 Different parts of the topic

Last year we experimented with generating queries from different parts of the topics. All possible combinations of one to three parts (title, description and narrative) were investigated. Both the runs (for the probabilistic and statistical models) and the statistical information indicate that using the narrative makes the performance of the information retrieval engine worse. We suppose that the narrative part contains too many irrelevant terms, which add 'noise' to the query. A clever selection of terms from the narrative would be needed; however, an automatic selection of 'proper' terms is not an easy task, and we did not aim to solve it this year.

3.3 Compound Splitting

For the statistical model, each compound word was split into parts that are independent words themselves. Those words were added to the query as one set, which means that only the appearance of all parts of a compound in a document counts. A recursive algorithm takes a word, tries to split off a left part that is a meaningful word, and proceeds with the remaining tail of the word (a sketch is given below). We observed that words are sometimes divided into groups of words that have no logical relation to each other: 'Frankrijk' (France) = 'Frank' + 'rijk' (Frank + rich), or 'neerslag' (precipitation) = 'neer' + 'slag' (down + knock). Moreover, a number of compounds can be split in several ways; for example, the word 'basketbalkampioenschappen' (basketball championships) can be divided into independent nouns in eight (!) different ways. These observations led us to develop a statistical rule, based on data from the document collection, that reduces senseless splittings.
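The recursive splitting step can be sketched as follows. This is a simplified illustration: how a part is recognised as a meaningful word is abstracted into a plain lexicon set, and the statistical rule that filters senseless splittings is not shown. The toy lexicon and the minimum part length are assumptions made for the example only.

```python
def compound_splits(word, lexicon, min_len=3):
    """Return all ways to split `word` into a sequence of known words.

    The algorithm tries every left part that is itself a word in the lexicon
    and recurses on the remaining tail. A word that is in the lexicon as a
    whole is also returned unsplit.
    """
    splits = []
    if word in lexicon:
        splits.append([word])
    for i in range(min_len, len(word) - min_len + 1):
        head, tail = word[:i], word[i:]
        if head in lexicon:
            for rest in compound_splits(tail, lexicon, min_len):
                splits.append([head] + rest)
    return splits

# Toy Dutch lexicon in which 'basketbalkampioenschappen' already has
# several possible decompositions.
lexicon = {"basketbal", "basket", "bal", "kampioen", "kampioenschappen", "schappen"}
print(compound_splits("basketbalkampioenschappen", lexicon))
# [['basket', 'bal', 'kampioenschappen'], ['basket', 'bal', 'kampioen', 'schappen'],
#  ['basketbal', 'kampioenschappen'], ['basketbal', 'kampioen', 'schappen']]
```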
3.4 Expansion of the query terms

As mentioned in the 'Query' section, query terms were morphologically expanded for the statistical model only, using Knowledge Concepts' Content Enabler semantic network.

3.5 Related terms and synonym expansion

We investigated the use of related terms and synonyms. We found that Knowledge Concepts' Content Enabler is not good enough for generating related terms and synonyms for our models, both from the point of view of logic and from that of the performance of the retrieval systems. A measure of 'similarity' between two terms is needed in order to rank the proposed list of related terms and synonyms; only the most 'similar' terms should be used to extend the query with related terms and synonyms.

4 Runs for 2003

This year we submitted 16 mono-lingual runs:
- for each of the following languages, two runs (using the BM25 and probabilistic ranking principles): Finnish, French, German, Italian, Spanish and Swedish;
- for Dutch, four runs: BM25 ranking, probabilistic ranking, and two statistical rankings (one based on the statistics for 2001 and the other on the statistics for 2002).

Numerical results of Océ at CLEF 2003:

Name of the run       Retrieved relevant documents   Average precision   R-precision
Swedish BM25          729 out of 889                 0.3584              0.3585
Swedish probabil.     633 out of 889                 0.2716              0.2743
Italian BM25          759 out of 809                 0.4361              0.4287
Italian probabil.     731 out of 809                 0.3805              0.3865
French BM25           894 out of 946                 0.4601              0.4273
French probabil.      865 out of 946                 0.4188              0.4044
Finnish BM25          417 out of 483                 0.3570              0.3230
Finnish probabil.     407 out of 483                 0.3031              0.2624
Spanish BM25          2109 out of 2368               0.4156              0.4094
Spanish probabil.     2025 out of 2368               0.3500              0.3696
German BM25           1482 out of 1825               0.3858              0.3838
German probabil.      1337 out of 1825               0.3017              0.3088
Dutch BM25            1438 out of 1577               0.4561              0.4438
Dutch probabil.       1336 out of 1577               0.4049              0.3652
Dutch statist. 2001   1375 out of 1577               0.4253              0.3940
Dutch statist. 2002   1378 out of 1577               0.4336              0.3983

5 Further remarks on the evaluation measures of the information retrieval systems

In Appendix A we give some remarks on the evaluation measures used within CLEF.

6 Conclusions

We have compared the BM25 and the probabilistic models on the basis of the mono-lingual runs. The BM25 model systematically outperforms the probabilistic one. This indicates that striving for mathematical correctness is not the best guideline for obtaining better engine performance. At the same time we have observed that a quite simple implementation of the probabilistic model achieves satisfactory performance. Furthermore, we conclude that the construction of a better retrieval model needs 'knowledge' about the document collection and about the way the user formulates the topics and assesses the documents. All this information is needed in order to tune the statistical retrieval engine.

7 References

[1] Roel Brand, Marvin Brünner. Océ at CLEF 2002. Lecture Notes in Computer Science, to appear in 2003.
[2] Pascha Iljin. Modeling Document Relevancy Clues in Information Retrieval Systems. SAI, to appear in 2004.

Appendix A

The relevance judgements available from the CLEF processing (taking the Dutch data from 2001 as an example) are as follows. There are 190,604 documents in the document collection. For every query, every participating information retrieval system provides a ranking of these documents, and the top 1000 documents are delivered to CLEF as the output of each system. Judges assess the top N (N ≤ 1000) documents from each ranked list. Suppose that there are M participating information retrieval systems. Then at most N*M documents per query are read by the judges and receive relevance judgements; if the same document appears in the top N of at least two participating systems, fewer than N*M documents obtain relevance judgement values.

There are 16,774 relevance judgement values available for the 50 queries, i.e. on average about 335 relevance judgements per query. There are 1,224 documents known to be relevant for the 50 queries, i.e. on average about 25 relevant documents per query.

We think that it is not entirely correct to calculate the statistics for the top 1000 documents in the commonly accepted way, for the simple reason that the relevance judgement values are known for far fewer than 1000 documents. About 60-70% of the documents in the top 1000 have unknown relevance values, and those unread documents are given the value 'irrelevant'. Suppose that a relevant document was not retrieved within the top N documents by any of the participating information retrieval systems. Because no value was assigned to this document, it obtains the value 0 (irrelevant) by definition. The paradox is that a relevant document gets the value of an irrelevant document.

Conclusion: for a robust comparison between s information retrieval systems based on the top N documents, all s systems must obtain relevance judgements for all N documents, for every query from the set of queries.
This requirement can be approximated by having the judges read more documents for every participating system, or by having more systems participate in CLEF that rank as many mutually different documents as possible within the top N.
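To make the effect of unjudged documents concrete, the sketch below computes non-interpolated average precision for a single query, once under the standard convention that unjudged documents count as irrelevant and once with unjudged documents simply skipped. The ranking and judgements are invented for illustration only and do not come from the CLEF data.

```python
def average_precision(ranking, judgements, skip_unjudged=False):
    """Non-interpolated average precision for one query.

    ranking       : list of document ids in ranked order
    judgements    : dict doc_id -> True (relevant) / False (irrelevant);
                    documents missing from the dict are unjudged
    skip_unjudged : if True, unjudged documents are removed from the ranking
                    instead of being counted as irrelevant
    The number of relevant documents is taken to be the number of judged
    relevant documents.
    """
    if skip_unjudged:
        ranking = [d for d in ranking if d in judgements]
    total_relevant = sum(judgements.values())
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if judgements.get(doc, False):  # unjudged -> treated as irrelevant
            hits += 1
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0

# Invented example: 2 relevant, 2 judged irrelevant, 2 unjudged documents.
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
judgements = {"d1": True, "d3": False, "d5": True, "d6": False}
print(average_precision(ranking, judgements))                      # 0.7
print(average_precision(ranking, judgements, skip_unjudged=True))  # ~0.83
```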