
Back to Basics - Again - for Domain Specific Retrieval

Ray R. Larson, School of Information, University of California, Berkeley, USA

In this paper we describe Berkeley's approach to the Domain Specific (DS) track for CLEF 2008. Last year we used Entry Vocabulary Indexes and Thesaurus expansion approaches for DS, but found in later testing that some simple text retrieval approaches had better results than these more complex query expansion approaches. This year we decided to revisit our basic text retrieval approaches and see how they would stack up against the various expansion approaches used by other groups. The results are now in and the answer is clear: they perform pretty badly compared to other groups' approaches. All of the runs submitted were performed using the Cheshire II system. This year the Berkeley/Cheshire group submitted a total of twenty-four runs, including two for each subtask of the DS track. These include six Monolingual runs (two each for English, German, and Russian), twelve Bilingual runs (four X2EN, four X2DE, and four X2RU), and six Multilingual runs (two EN, two DE, and two RU). The overall results include Cheshire runs in the top five participants for each task, but usually as the lowest of the five (and often fewer) groups.

Keywords: Cheshire II, Logistic Regression, Entry Vocabulary Indexes

This paper discusses the retrieval methods and evaluation results for Berkeley's participation in the CLEF 2008 Domain Specific track. In 2007 we focused on query expansion using Entry Vocabulary Indexes (EVIs) [4, 6] and thesaurus lookup of topic terms. Once the relevance judgements for 2007 were released we discovered that these rather complex methods actually did not perform as well as basic text retrieval on the topics without additional query expansion. So, this year for the Domain Specific track we have returned to a basic text retrieval approach using probabilistic retrieval based on Logistic Regression with the inclusion of blind feedback, as used in 2006 [5].


The Retrieval Algorithms

As we have discussed in our other papers for the Adhoc-TEL and GeoCLEF tracks, the basic form and variables of the Logistic Regression (LR) algorithm used for all of our submissions were originally developed by Cooper, et al. [3]. Stated formally, the goal of the logistic regression method is to define a regression model that will estimate (given a set of training data), for a particular query Q and a particular document D in a collection, the value P(R | Q, D), that is, the probability of relevance for that Q and D. This value is then used to rank the documents in the collection, which are presented to the user in order of decreasing values of that probability. To avoid invalid probability values, the usual calculation of P(R | Q, D) uses the "log odds" of relevance given a set of S statistics, $s_i$, derived from the query and database, giving a regression formula for estimating the log odds from those statistics:

\[
\log O(R \mid Q, D) = b_0 + \sum_{i=1}^{S} b_i s_i \qquad (1)
\]

where $b_0$ is the intercept term and the $b_i$ are the coefficients obtained from the regression analysis of a sample set of queries, a collection and relevance judgements. The final ranking is determined by the conversion of the log odds form to probabilities:

\[
P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}} \qquad (2)
\]
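The log-odds model and its conversion to a probability, as described above, can be sketched in a few lines of Python. This is only an illustration: the function names, the intercept, the coefficients and the per-document statistics are all hypothetical and are not the values used in the submitted runs.

```python
import math
from typing import Dict, List, Sequence, Tuple


def log_odds(b0: float, coeffs: Sequence[float], stats: Sequence[float]) -> float:
    """Linear combination b0 + sum_i b_i * s_i over the S statistics."""
    if len(coeffs) != len(stats):
        raise ValueError("need exactly one coefficient per statistic")
    return b0 + sum(b * s for b, s in zip(coeffs, stats))


def probability_of_relevance(log_o: float) -> float:
    """Convert log odds to a probability with the logistic function."""
    # Two algebraically equivalent branches avoid overflow in exp().
    if log_o >= 0:
        return 1.0 / (1.0 + math.exp(-log_o))
    e = math.exp(log_o)
    return e / (1.0 + e)


def rank_documents(
    doc_stats: Dict[str, Sequence[float]],
    b0: float,
    coeffs: Sequence[float],
) -> List[Tuple[str, float]]:
    """Return (doc_id, probability) pairs in decreasing order of probability."""
    scored = [
        (doc_id, probability_of_relevance(log_odds(b0, coeffs, stats)))
        for doc_id, stats in doc_stats.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical statistics for two documents and made-up coefficients.
    stats = {"d1": [0.1, 0.2], "d2": [1.0, 2.0]}
    for doc_id, prob in rank_documents(stats, b0=-1.0, coeffs=[2.0, 0.5]):
        print(doc_id, round(prob, 3))
```

Because the logistic function is monotonic, ranking by probability and ranking by log odds give the same order; the conversion matters only when the values are to be read as probabilities.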

TREC2 Logistic Regression Algorithm

For all of our Domain Specific submissions this year we used a version of the Logistic Regression (LR) algorithm that has been used very successfully in Cross-Language IR by Berkeley researchers for a number of years [1], and which is also used in our GeoCLEF submissions. For the Domain Specific track we used the Cheshire II information retrieval system implementation of this algorithm. One current limitation of this implementation is the lack of decompounding for German documents and query terms. As noted in our other CLEF notebook papers, the Logistic Regression algorithm used was originally developed by Cooper et al. [2] for text retrieval from the TREC collections for TREC2. The basic formula is:

\[
\begin{aligned}
\log O(R \mid C, Q) &= \log \frac{p(R \mid C, Q)}{1 - p(R \mid C, Q)} = \log \frac{p(R \mid C, Q)}{p(\overline{R} \mid C, Q)} \\
&= c_0 + c_1 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \frac{qtf_i}{ql + 35} \\
&\quad + c_2 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \log \frac{tf_i}{cl + 80} \\
&\quad - c_3 \cdot \frac{1}{\sqrt{|Q_c| + 1}} \sum_{i=1}^{|Q_c|} \log \frac{ctf_i}{N_t} \\
&\quad + c_4 \cdot |Q_c|
\end{aligned}
\]

where C denotes a document component (i.e., an indexed part of a document, which may be the entire document) and Q a query, R is a relevance variable, $p(R \mid C, Q)$ is the probability that document component C is relevant to query Q, $p(\overline{R} \mid C, Q)$ is the probability that document component C is not relevant to query Q, which is $1.0 - p(R \mid C, Q)$, $|Q_c|$ is the number of matching terms between a document component and a query, $qtf_i$ is the within-query frequency of the ith matching term, $tf_i$ is the within-document frequency of the ith matching term, $ctf_i$ is the occurrence frequency in a collection of the ith matching term, $ql$ is query length (i.e., number of terms in a query, like |Q| for non-feedback situations), $cl$ is component length (i.e., number of terms in a component), and $N_t$ is collection length (i.e., number of terms in a test collection). The $c_k$ are the k coefficients obtained through the regression analysis.
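A minimal Python sketch of the log-odds computation defined by these variables follows. It assumes the commonly published form of the TREC2 formula: three sums over the matching terms, each scaled by 1/sqrt(|Qc| + 1), plus a term in |Qc|. The constant 80 in the component-length term comes from that published form rather than from the text here, the class and function names are hypothetical, and the coefficient values in the usage are placeholders, not the regression coefficients used for the submitted runs.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MatchingTerm:
    qtf: int  # within-query frequency of the term
    tf: int   # within-component (document) frequency of the term
    ctf: int  # occurrence frequency of the term in the collection


def trec2_log_odds(
    terms: Sequence[MatchingTerm],
    ql: int,
    cl: int,
    nt: int,
    coeffs: Sequence[float],
) -> float:
    """Log odds of relevance for one component/query pair.

    terms:  the |Qc| terms that occur in both query and component
    ql:     query length, cl: component length, nt: collection length
    coeffs: (c0, c1, c2, c3, c4) from the regression analysis
    """
    c0, c1, c2, c3, c4 = coeffs
    n = len(terms)
    norm = 1.0 / math.sqrt(n + 1)
    x1 = norm * sum(t.qtf / (ql + 35) for t in terms)
    x2 = norm * sum(math.log(t.tf / (cl + 80)) for t in terms)
    x3 = norm * sum(math.log(t.ctf / nt) for t in terms)
    # The collection-frequency term is subtracted: rare terms count for more.
    return c0 + c1 * x1 + c2 * x2 - c3 * x3 + c4 * n


if __name__ == "__main__":
    placeholder_coeffs = (-3.0, 30.0, 0.3, 0.2, 0.1)  # illustrative only
    matches = [MatchingTerm(qtf=1, tf=2, ctf=100), MatchingTerm(qtf=1, tf=1, ctf=5000)]
    print(trec2_log_odds(matches, ql=3, cl=100, nt=1_000_000, coeffs=placeholder_coeffs))
```

With no matching terms every sum is empty and the score reduces to the intercept, so such components all tie at the bottom of the ranking.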

More details of this algorithm and the coefficients used with it may be found in our Adhoc-TEL notebook paper, where the same algorithm and coefficients were used. In addition to this primary algorithm we used a version that performs "blind feedback" during the retrieval process. The method used is also described in detail in our Adhoc-TEL paper. Our blind feedback approach uses some number of top-ranked documents from an initial retrieval using the LR algorithm above, and selects some number of terms from the content of those documents, using a version of the Robertson and Sparck Jones probabilistic term relevance weights [7]. Those terms are merged with the original query, new term frequency weights are calculated, and the revised query is submitted to obtain the final ranking. We used different numbers of documents and terms for different collections, based on some tests run on the 2007 data in which we varied these numbers to find the optimal point for the specific collection. For the German collection we selected 20 documents and the 35 top-ranked terms from those documents for feedback. For English we used 14 documents and 16 terms, and for Russian we used 16 documents and the 10 top-ranked terms.
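The term selection step of blind feedback can be sketched as follows. This is a hedged illustration, not the Cheshire II implementation: it uses the standard Robertson/Sparck Jones weight with 0.5 smoothing, treats the top-ranked documents as if they were relevant, and merges the selected terms into the query by simply adding one to each term's frequency, which is an assumption (the exact reweighting is described in the Adhoc-TEL paper, not here). All function names are hypothetical; only the per-language document and term counts come from the text.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

# (feedback documents, feedback terms) per collection, as reported in the text.
FEEDBACK_SETTINGS = {"de": (20, 35), "en": (14, 16), "ru": (16, 10)}


def rsj_weight(r: int, R: int, n: int, N: int) -> float:
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: feedback documents containing the term, R: feedback documents,
    n: collection documents containing the term, N: collection documents.
    """
    return math.log(
        ((r + 0.5) / (R - r + 0.5)) / ((n - r + 0.5) / (N - n - R + r + 0.5))
    )


def select_feedback_terms(
    ranked_docs: Sequence[Sequence[str]],
    doc_freq: Dict[str, int],
    num_docs: int,
    top_docs: int,
    top_terms: int,
) -> List[str]:
    """Pick the highest-weighted terms from the top-ranked documents."""
    feedback = ranked_docs[:top_docs]
    # Count each term once per feedback document.
    in_feedback = Counter(term for doc in feedback for term in set(doc))
    weights = {
        term: rsj_weight(r, len(feedback), doc_freq[term], num_docs)
        for term, r in in_feedback.items()
    }
    # Sort by decreasing weight; ties are broken alphabetically.
    return sorted(weights, key=lambda t: (-weights[t], t))[:top_terms]


def expand_query(query_terms: Sequence[str], feedback_terms: Sequence[str]) -> Counter:
    """Merge feedback terms into the query (assumed: +1 frequency per term)."""
    merged = Counter(query_terms)
    merged.update(feedback_terms)
    return merged
```

For an English run, `top_docs, top_terms = FEEDBACK_SETTINGS["en"]` would supply the 14 documents and 16 terms reported above; the expanded query is then scored again with the same ranking function.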

Approaches for Domain Specific Retrieval

In this section we describe the specific approaches taken for our submitted runs for the Domain Specific track. First we describe the database creation and the indexing and term extraction methods used, and then the search features we used for the submitted runs. Although the Cheshire II system uses the XML structure of documents and extracts selected portions of the record for indexing and retrieval, for the submitted runs this year we used only a single one of these indexes that contains the entire content of the document.

Table 1 lists the indexes created for the Domain Specific database and the document elements from which the contents of those indexes were extracted. The “Used” column in Table 1 indicates whether or not a particular index was used in the submitted Domain Specific runs.

For all indexing we used language-specific stoplists to exclude function words and very common words from the indexing and searching. The German language runs, however, did not use decompounding in the indexing and querying processes to generate simple word forms from compounds.

Search Processing

Searching the Domain Specific collection used Cheshire II scripts to parse the topics and submit the title and description elements from the topics to the "topic" index containing all terms from the documents. For the monolingual search tasks we used the topics in the appropriate language (English, German, or Russian), and for bilingual tasks the topics were translated from the source language to the target language using the LEC Power Translator PC-based program. Overall we have found that this translation program seems to generate good translations between any of the languages needed for this track, but we still intend to do some further testing to compare it to previous approaches (which used web-based translation tools like Babelfish and PROMT). We suspect that, as always, different tools provide a more accurate representation of different topics for some languages, but the LEC Power Translator seemed to produce good (and often better) translations for all of the needed languages.

Table 1: Indexes created for the Domain Specific database

Name      Description             Used
docno     Document ID             no
author    Author name             no
title     Article Title           no
topic     All Content Words       yes
date      Date                    no
subject   Controlled Vocabulary   no

All searches were submitted using the TREC2 Algorithm with blind feedback described above. This year we did no expansion of topics and made no use of the thesaurus or the classification clusters created last year. The differences among the runs for a given language or language pair (for bilingual) in Table 2 are primarily whether only the topic title and description (TD) or the title, description and narrative (TDN) were used.
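The TD and TDN variants differ only in which topic elements are concatenated into the query text. A minimal sketch, assuming a topic has already been parsed into a dictionary: the keys `title`, `desc` and `narr`, the function names and the sample topic text are hypothetical, while the run identifier pattern follows the names listed in Table 2.

```python
from typing import Dict

# Topic elements used by each run type.
FIELDS = {"TD": ("title", "desc"), "TDN": ("title", "desc", "narr")}
# Task codes as they appear in the submitted run names.
TASK_CODES = {"monolingual": "MO", "bilingual": "BI", "multilingual": "MU"}


def build_query(topic: Dict[str, str], fields: str) -> str:
    """Concatenate the selected topic elements into one query string."""
    return " ".join(topic[name].strip() for name in FIELDS[fields])


def run_id(task: str, languages: str, fields: str) -> str:
    """Compose a run name such as BRK-BI-DEEN-TD."""
    return f"BRK-{TASK_CODES[task]}-{languages}-{fields}"


if __name__ == "__main__":
    # Made-up topic text, for illustration only.
    topic = {
        "title": "Youth unemployment",
        "desc": "Find documents on unemployment among young people.",
        "narr": "Relevant documents report studies or statistics.",
    }
    print(run_id("bilingual", "DEEN", "TD"), "->", build_query(topic, "TD"))
    print(run_id("bilingual", "DEEN", "TDN"), "->", build_query(topic, "TDN"))
```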

Results for Submitted Runs

The summary results (as Mean Average Precision) for all of our submitted runs for English, German and Russian are shown in Table 2. The Recall-Precision curves for these runs are shown in Figure 1 (for monolingual), Figure 2 (for bilingual) and Figure 3 (for multilingual). In Figures 1, 2, and 3 the names are abbreviated to the letters and numbers of the full name in Table 2 describing the languages and topic fields used. For example, in Figure 2 DEEN-TD corresponds to run BRK-BI-DEEN-TD in Table 2.
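For readers unfamiliar with the measure, Mean Average Precision can be computed as in the sketch below. It uses the standard non-interpolated definition and is not the official evaluation code; the function names and the tiny rankings in the usage are invented for illustration.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(
    run: Dict[str, Sequence[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Average the per-topic average precision over all judged topics."""
    scores = [average_precision(run.get(topic, []), rel) for topic, rel in qrels.items()]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    run = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d2", "d1"]}
    qrels = {"t1": {"d1", "d3"}, "t2": {"d1"}}
    print(mean_average_precision(run, qrels))
```

In the usage above, topic t1 scores (1/1 + 2/3) / 2 and topic t2 scores 1/2, so the mean over the two topics is 2/3.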

We observe that for the vast majority of our runs, using the narrative tends to degrade instead of improve performance. (We observed the same in other tracks as well.)

It is worth noting that the approaches used in our submitted runs provided the best results when tested with the 2007 data and topics and compared to our official 2007 runs. In fact we may have over-simplified for this track. Although at least one Cheshire run appeared in the top five runs of the overall summary results available on the DIRECT system, none of them were top-ranked, and for many tasks there appeared to be fewer than five participants.

Table 2: Submitted Domain Specific runs

Run Name          Description                  Type
BRK-MO-DE-TD      Monolingual German           TD auto
BRK-MO-DE-TDN     Monolingual German           TDN auto
BRK-MO-EN-TD      Monolingual English          TD auto
BRK-MO-EN-TDN     Monolingual English          TDN auto
BRK-MO-RU-TD      Monolingual Russian          TD auto
BRK-MO-RU-TDN     Monolingual Russian          TDN auto
BRK-BI-ENDE-TD    Bilingual English⇒German     TD auto
BRK-BI-ENDE-TDN   Bilingual English⇒German     TDN auto
BRK-BI-RUDE-TD    Bilingual Russian⇒German     TD auto
BRK-BI-RUDE-TDN   Bilingual Russian⇒German     TDN auto
BRK-BI-DEEN-TD    Bilingual German⇒English     TD auto
BRK-BI-DEEN-TDN   Bilingual German⇒English     TDN auto
BRK-BI-RUEN-TD    Bilingual Russian⇒English    TD auto
BRK-BI-RUEN-TDN   Bilingual Russian⇒English    TDN auto
BRK-BI-DERU-TD    Bilingual German⇒Russian     TD auto
BRK-BI-DERU-TDN   Bilingual German⇒Russian     TDN auto
BRK-BI-ENRU-TD    Bilingual English⇒Russian    TD auto
BRK-BI-ENRU-TDN   Bilingual English⇒Russian    TDN auto
BRK-MU-DE-TD      Multilingual German          TD auto
BRK-MU-DE-TDN     Multilingual German          TDN auto
BRK-MU-EN-TD      Multilingual English         TD auto
BRK-MU-EN-TDN     Multilingual English         TDN auto
BRK-MU-RU-TD      Multilingual Russian         TD auto
BRK-MU-RU-TDN     Multilingual Russian         TDN auto

Since we have not yet had a chance to test alternative approaches on the 2008 topics and relevance judgements, we don't yet have much to report on ways forward. Given that the re-introduction of fusion approaches in our GeoCLEF entry led to very good results, we suspect that the application of selected fusion approaches for this task may also prove valuable.

We are much more curious to see what approaches the other groups in this task used this year, since some very strong results (at least compared to our own) appeared in the overall summary data.

References

[1] Aitao Chen and Fredric C. Gey. Multilingual information retrieval using machine translation, relevance feedback and decompounding. Information Retrieval, 7:149–182, 2004.

[2] W. S. Cooper, A. Chen, and F. C. Gey. Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression. In Text REtrieval Conference (TREC-2), pages 57–66, 1994.

[3] William Cooper, Fredric C. Gey, and Daniel P. Dabney. Probabilistic retrieval based on staged logistic regression. In 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Copenhagen, Denmark, June 21-24, pages 198–210, New York, 1992. ACM.

[4] Fredric Gey, Michael Buckland, Aitao Chen, and Ray Larson. Entry vocabulary - a technology to enhance digital search. In Proceedings of HLT2001, First International Conference on Human Language Technology, San Diego, pages 91–95, March 2001.

[5] Ray Larson. Domain specific retrieval: Back to basics. In Evaluation of Multilingual and Multi-modal Information Retrieval - Seventh Workshop of the Cross-Language Evaluation Forum, CLEF 2006, LNCS, page to appear, Alicante, Spain, September 2006.

[6] Vivien Petras, Fredric Gey, and Ray Larson. Domain-specific CLIR of English, German and Russian using fusion and subject metadata for query expansion. In Cross-Language Evaluation Forum: CLEF 2005, pages 226–237. Springer (Lecture Notes in Computer Science LNCS 4022), 2006.

[7] S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal of the American Society for Information Science, pages 129–146, May-June 1976.