Intr oduction

M onolingua l Retr ieval E xper iments wit h a Doma inSpecific Document Cor p us at t he C hem nit z T echnical Univer sity

0 Jens Kürsten, Maximilian Eibl Chemnitz University of Technology Faculty of Computer Science , Media Informatics Strasse der Nationen 62, 09107 Chemnitz , Germany

2006

This article describes the first participation of the Media Informatics Section of the Chemnitz Technical University at the Cross Language Evaluation Forum. A first experimental prototype is described which implements several different methods of optimizing search results. The configuration of the prototype is tested with the GIRT corpus. The results of the DomainSpecific Monolingual German task suggest that combining the suffix stripping stemming and the decompounding approach is very useful. Also, a local document clustering approach used to improve pseudo relevance feedback seems to be quite beneficial. Nevertheless, the evaluation of the English task using the same configuration suggests that the qualities of the results are highly speech dependent. Categor ies and Subject Descr iptor s H.3 [Infor mation Stor age and Retr ieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval I.5 [Patter n Recognition]: I.5.3 Clustering Gener a l Ter ms In our first participation in the Cross Language Evaluation Forum we focused on the development of a retrieval system which provides a good baseline for the Monolingual GIRT tasks. Furthermore, we tried to improve the baseline results with several optimisation approaches. The outline of this paper is organized as follows. Section 2 introduces the concepts used to achieve good baseline results and section 3 describes the concepts used to optimise the baseline results. Section 4 deals with the configuration of our submitted runs and section 5 recapitulates the results. In section 6 the results are discussed with respect to our expectations. Imp r oving baseline r esu lts using the CLEF tr aining data

Intr oduction

At first, we had to design a system which implements the well known algorithms of information retrieval. We used the Apache Lucene API [1] as core to simplify matters. The developed system consists of the following components: (a) filters for indexing: lowercase filter, stop word filter and stem filter, (b) several algorithms for query construction and (c) an evaluation tool, which allows comparison of different configurations based on the training data. The weighting scheme is embedded in the Lucene core and it is defined in its documentation.

In our primary experiments with the training data provided by the CLEF campaign we concentrated on the DomainSpecific Monolingual German task. Unfortunately, the results were quite bad results in those experiments. Thus, we searched the respective components for possible improvements. We decided to improve the query construction process and to try different German stemmers. 2.1

Constr ucting imp r oved quer ies

In a first step only the topic terms (title and description) in all fields of the GIRT corpus were searched. After evaluating our results we analysed the corpus and we observed that the title , abstract , controlledterm and classificationtext fields should be used for searching, because the other fields do not contain information which can successfully be used for indexing and searching. A slight improvement was achieved by adapting the query construction process to search the topic terms only within the fields abovementioned. As we wanted to participate with automatic runs, we have to consider the CLEF guideline [2]. It states that all fields of the GIRT corpus can be used for automatic retrieval, which is still satisfied.

Also, the query construction had to be improved. There are many ways to combine the search terms within a query in the correct manner. For example, one can build phrases, i.e. that two or more terms have to appear besides each other and in the correct order. But phrases are not the only way to specify or generalise a search for two or more terms which could have a semantic connection. Other methods to combine terms in a query formulation procedure are span queries, fuzzy queries and range queries. We experimented with phrase queries without a significant success and with span queries with a slight improvement of precision. The results of our experiments are shown in table 1. query construction standard

spans standard + spans standard + spans + topic analyzer mean average precision average recall 0.3373 0.6731 0.2880 0.3632 0.3515 0.6523 0.3822 0.6920

In our initial experiments we also observed that the topics should be analysed before they are used for query construction. The problem with the topics is that they contain certain terms that are not beneficial for searching the GIRT collection. Thus, we implemented a topic analyzer to remove those terms automatically. We had to be careful by realizing this, since we did not want to eliminate terms that are important for a specific topic. As the described topic analyzer works without any kind of human intervention, we are still constructing our queries automatically according to the CLEF guideline. As shown in table 1, we achieved a significant improvement of both precision and recall. 2.2

Differ ent stemming algor ithms for Ger man

We compared two different stemming approaches for German. The first one is a part of the Snowball project [3] and the second one was implemented at Chemnitz University of Technology by Stefan Wagner [4]. With the development of the second stemmer, we aimed at the improvement of the effectiveness of our retrieval results. Therefore, we used a stemming approach, which is quite different from well known suffix stripping algorithms. In our approach every term is divided into a number of syllables. These syllables are treated by the system as tokens. Each token then can consist of syllables again. Different implementations of the decompounding approach were tested to detect which one would achieve the best retrieval results. Our decompounding algorithm has two options: (1) remove flexion and (2) avoid overstemming. With the first option we try to remove flexions from the original term and with the second one we try to remove some letters to avoid the problem of overstemming.

Tables 2, 3 and 4 compare the results of the two stemming approaches and their algorithms on the basis of the CLEF training data from 2003 to 2005. As one can see in the result tables both stemming approaches outperform the nonstemmed approach. There is also a major improvement with our decompounding approach, which is not very significant at precision. But concerning the recall values we achieve an increase of 11.9% (2004), 14.9% (2003) and 18.7% (2005) compared to the snowball algorithm.

stemming algorithm

none plain german decompounder plain german decompounder

+ overstemming plain german decompounder + remove flexion + overstemming snowball german2 mean average precision average recall

0.3114 0.6122 0.3751 (+20.5%) 0.7789 (+27.2%) 0.4173 (+34.0%) 0.7950 (+29.9%) 0.4089 (+31.3%)

0.7794 (+27.3%) 0.3822 (+22.7%)

0.6920 (+13.0%)

Query expansion is one of the most widely used techniques to improve precision and/or recall of information retrieval systems. There are many different approaches in the query expansion field. The main distinction of them is whether manual (user) intervention is needed or not. In our experiments for automatic retrieval we used the pseudo relevance feedback approach. Pseudo relevance feedback is a form of unsupervised learning, where terms from the top k documents of an initial retrieval are used to reformulate the original query. Unfortunately, the application of pseudo relevance feedback is not useful in all cases. Many researchers tried to figure out how the reformulation of the query has to be done and which of the available features have to be used for it. Thus, various approaches exist in this field. Again, we tried to keep things simple and implemented a pseudo relevance feedback strategy as follows.

We also used the top k documents of an initial retrieval run. From those k documents, where k = 10 turned out to be a good choice, we only took terms that appeared at least n times in our k documents. Thereby, it can be ensured that only terms from those documents are used that are term correlated and related to the topic anyway. In our experiments we found out that n = 3 seems to be a good value for our query expansion strategy.

In further experiments with our feedback algorithm we tried to improve the performance by using the feedback algorithm repeatedly, but we observed that only using it twice could be useful. Higher numbers of iterations did not improve retrieval performance; in fact they even deteriorated performance. The results of our successful experiments with query expansion are summarized in table 5 and compared to a retrieval run without query expansion (first row). We only show the results of the CLEF training data from 2005 to keep clearness. As one can see, we got a strong performance increase with our feedback strategy. The results of the tests with the training data from the other years as well as those with the stemming algorithm from section 2.2 were similar.

This section describes how we applied clustering in the query expansion procedure. We also mention the algorithms implemented for clustering. The goal of the appliance of clustering was again the improvement of precision and/or recall.

At first, a short review of the different clustering techniques is given. Principally, one has to distinguish between nonhierarchical and hierarchical algorithms for clustering. We tested both concepts to find out what advantages and disadvantages they have in an IR application. The nonhierarchical methods can be further subdivided into single pass partitioning and reallocation algorithms. The nonhierarchical algorithms are all heuristic in nature because they require a priori decisions, e.g. about cluster size and cluster number. For that reason, these algorithms are only approximating clusters. Nevertheless, the most commonly used non hierarchical clustering algorithm called kmeans was also implemented.

Although many researchers have been concentrating their forces on hierarchical clustering, some papers on kmeans like algorithms have appeared in the last few years. For example Steinbach et al. compared some of the most widely used approaches with an algorithm they developed [5]. A short overview on nonhierarchical clustering is also given in [6], but the main part of the paper is on hierarchical clustering. In [7] Peter Willett provided a review of research on hierarchical document clustering. As abovementioned, we also implemented hierarchical clustering. We applied the LanceWilliams update formula [6] to get the possibility to compare different hierarchical algorithms. To avoid confusion we do not explain in further detail how our clustering algorithms were implemented.

As mentioned in the beginning of this section we tried to apply clustering in the query expansion procedure. Therefore, we tried the following configurations: (a) document clustering: using the top k documents returned from clustering the top n documents of an initial retrieval, (b) term clustering : using the terms returned from clustering the terms of the top n documents of an initial retrieval, (c) document clustering + PRF : combining (a) with our query expansion from section 3.1 and (d) term clustering + PRF : combining (b) with our query expansion from section 3.1. For document clustering our experiments showed, that clustering the top n = 50 documents of an initial retrieval provides the best results.

The term clustering approach is quite different from the traditional document clustering, because the objects to be clustered are not the top k documents themselves, but the terms of those top k documents. To get more reliable termcorrelations, we used the top k = 20 documents for term clustering. A brief description of some methods for term clustering is given in [8].

Table 6 lists the results of our four clustering configurations by testing each of them with the CLEF training data from 2005. Clustering is compared with our baseline results (first row) and the simple PRF procedure (second row). The value “10 + x” in the first column of the table means, that the top k = 10 documents from the initial retrieval were taken and we added those documents returned from clustering, which were not already in this set of 10 documents, i.e. that the number of documents for PRF can vary between 10 and 20.

This section will give a short review on the result list fusion approaches we implemented to combine the different indexing (stemming) schemes described in section 2.2. The motivation of applying fusion of different result lists is based on the following two effects. First, the relevant documents retrieved by different retrieval systems should have a high rank and a high overlap. Second, the nonrelevant documents of the result lists obtained by different systems are supposed to have a low rank and a low overlap. It is assumed that those hypotheses hold on the result lists we want to fuse.

Merging has also been in the interest of many researchers in the IR field. In the report of Fox and Shaw [9] methods for result list fusion have been compared. In their result summarizing the RSV values performed best. Savoy [10] also compared some data fusion operators and he additionally suggested the zscore operator, which performs best in his study. Another fusion operator called normbytopk was introduced by Lin and Chen in their report on merging mechanisms for multilingual information retrieval [11]. Beside the most widely known round robin and raw score fusion operators we implemented the fusion mechanisms depicted in table 7. name sum RSV

Norm max Norm RSV zscore Norm by topk

formula k åa i ´ RSVk 1 k åai ´ ( 1 k åai ´ 1 k åai ´ ( 1

RSVk

Max i with di = ) (RSVk - Min i ) (M ax i - Min i ) (R SVk - Mean i )

StDev i

Mea ni - Min i

StDev i

k åai ´ ( 1

RSVk RSV top - k )

+di ) , explanation RSVk retrieval status value of document k in list i ai weighting factor for list i Maxi maximum retrieval status value of list i

Mini Meani

StDevi minimum retrieval status value of list i mean retrieval status value of list i standard deviation of mean in list i

RSVtopk average retrieval status

value of the topk documents in list i

We tested the merging strategies with the two results lists obtained by the two different stemming algorithms of table 4 from section 2.2. All the tested fusion operators improved the precision of our primary results as can be seen in table 8. Furthermore, one can observe that zscore performed best and norm RSV performed almost as good as zscore. Another interesting observation is that the simple round robin approach also achieved very good results. The only fusion operator that performed poorly compared to the others was ra w score .

We submitted four runs for this task and their individual configurations were as follows: (1) TUCMIgirtde1

Stemmer: Feedback: Local clustering: Query construction:

Merging: (2) TUCMIgirtde2

Stemmer: Feedback: Local clustering: Query construction:

Merging: (3) TUCMIgirtde3

Stemmer: Feedback: Local clustering: Query construction:

Merging: (4) TUCMIgirtde4

Stemmer: Feedback: Local clustering: Query construction: Merging:

Snowball – German2

Yes, 2step iterative feedback (see section 3.1) No fully automatic from topic title and topic description No

Snowball – German2

Yes, 2step iterative feedback (see section 3.1) Yes, document clustering for PRF (see section 3.2) fully automatic from topic title and topic description No

Snowball – German2

Yes, 2step iterative feedback (see section 3.1) Yes, term clustering for PRF (see section 3.2) fully automatic from topic title and topic description No Snowball – German2, Plain German Decompounder + Overstemming Yes, 2step iterative feedback (see section 3.1) No fully automatic from topic title and topic description

Yes, sum RSV (see section 3.3)

All of our experiments used the weighting scheme embedded in the Lucene core, which is a tf*idf weighting scheme (see [1]). With our submitted experiments we hoped to confirm our observations from our experiments with the training data. Mainly, we hoped that the query expansion with clustering would be superior to the query expansion approach without clustering. Moreover, we expected quite good results from our last experiment where the result lists of the two different stemming approaches (see section 2.2) were fused. 4.2

Domainspecific M onolingual Ret r ieva l – English

We also submitted four runs for this task and their individual configurations were as follows. (1) TUCMIgirten1

Stemmer: Feedback: Local clustering: Query construction:

Merging: (2) TUCMIgirten2

Stemmer: Feedback: Local clustering: Query construction: Merging:

Snowball – Porter

Yes, 2step iterative feedback (see section 3.1) No fully automatic from topic title and topic description No (3) TUCMIgirten3

Stemmer: Feedback: Local clustering: Query construction:

Merging: (4) TUCMIgirten4

Stemmer: Feedback: Local clustering: Query construction: Merging:

Snowball – Porter

Yes, 2step iterative feedback (see section 3.1) Yes, term clustering for PRF (see section 3.2) fully automatic from topic title and topic description No

Snowball – Porter

Yes, simple feedback (see section 3.1) No fully automatic from topic title and topic description

With our submission for this task we also wanted to confirm that the query expansion with clustering performs better than the other query expansion approaches. The last experiment was submitted to test whether the simple query expansion performs better than the iterative 2step method. 5

Result s of submit ted r uns

In this section we present the results of our official experiments. The results for the DomainSpecific Monolingual German task are shown in table 9 and those for the DomainSpecific Monolingual English task are illustrated in table 10. In each row of the tables the value of the best performing configuration is highlighted bold and the worst performing configuration italic.

Generally speaking, the results of our experiments were as expected. The result list fusion approach for the DomainSpecific Monolingual German task clearly outperformed our other experiments. Moreover, the local clustering of the top documents in combination with the PRF approach achieves slight better results than PRF alone or PRF in combination with local term clustering.

In the results of the DomainSpecific Monolingual English task one can see that also PRF with document clustering performs best and PRF with term clustering worked worst. Besides, the differences in the results were very small. Additionally, we have to reinvestigate our experiments for the Monolingual English task, because the gap between those results and the results of the Monolingual German task indicates, that there might be an unknown issue with those experiments, when we keep in mind that the used corpora are pseudo parallel.

Concluding our experiments one can say, that for German text retrieval, combining the suffix stripping stemming and the decompounding approach is very useful and further experiments with other languages should be done to determine if one can achieve similar results with them. Also, the local document clustering approach for improvement of PRF seems to be quite beneficial, but again further experiments, especially with different corpora, should be carried out to detect, if the improvement works only within this (domainspecific) environment or not. 7 [1]

Refer ences

The Apache Software Foundation (19982006). Lucene. Retrieved August 10, 2006 from the World Wide Web: http://lucene.apache.org