-

The University of Amsterdam at the CLEF 2008 Domain Specific Track

Edgar Meij emeij@science.uva.nl

0 1 0 ISLA, University of Amsterdam 1 Maarten de Rijke

We describe our participation in the CLEF 2008 Domain Specific track. The research questions we address are threefold: (i) what are the effects of estimating and applying relevance models to the domain specific collection used at CLEF 2008, (ii) what are the results of parsimonizing these relevance models, and (iii) what are the results of applying concept models for blind relevance feedback? Parsimonization is a technique by which the term probabilities in a language model may be re-estimated based on a comparison with a reference model, making the resulting model more sparse and to the point. Concept models are term distributions over vocabulary terms, based on the language associated with concepts in a thesaurus or ontology and are estimated using the documents which are annotated with concepts. Concept models may be used for blind relevance feedback, by first translating a query to concepts and then back to query terms. We find that applying relevance models helps significantly for the current test collection, in terms of both mean average precision and early precision. Moreover, parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. Our concept models are able to significantly outperform a baseline query-likelihood run, both in terms of mean average precision and early precision on both title-only and title+narrative queries.

Parsimonious Models Language Models Relevance Feedback

We describe our participation in the 2008 CLEF Domain Specific track. Our main motivation for participating was to evaluate the retrieval models we have developed for another, very similar domain on the CLEF Domain Specific test collection. Our concept models have thus far been developed and evaluated on the TREC Genomics test collections, which also consists of documents which are manually annotated using concepts from a thesaurus [ 5, 6 ].

The main idea behind our approach is to model the language use associated with concepts from a thesaurus or ontology. To this end we use the document annotations as a bridge between vocabulary terms and the concepts in the knowledge source at hand. We model the language use around concepts using a generative language modeling framework, which provides theoretically sound estimation methods and builds upon a solid statistical background.

Our concept models may be used to determine semantic relatedness or to generate navigational suggestions, either in the form of concepts or vocabulary terms. These can then be used as suggestions for the user or for blind relevance feedback [ 13, 14, 18 ]. In order to apply blind relevance feedback using our models, we perform a double translation. First we estimate the most likely concepts given a query and then we use the most distinguishing terms from these concepts to formulate a new query. In a sense we are using the concepts as a pivot language [ 9 ]. To find the most distinguishing terms given a concept, we apply a technique called parsimonization [ 8 ]. Parsimonization is an algorithm based on expectation-maximization (EM) [ 3 ] and may be used to re-estimate probabilities of one model with respect to another. Events that are well-predicted by the latter model will lose probability mass, which in turn will be given to the remaining events. Recently, we have successfully applied parsimonization to the estimation of relevance models on a variety of tasks and collections [ 15 ]. In all of these cases, as well as with our concept models, we find that interpolating the newly found query with the original one yields the best performance—an observation which is in line with the literature [ 10 ].

The research questions we address are threefold: (i) what are the effects of estimating and applying relevance models to the collection used at the CLEF 2008 Domain Specific track [ 12 ], (ii) what are the results of parsimonizing these relevance models, and (iii) what are the results of applying our concept models for blind relevance feedback?

We find that applying relevance models helps significantly for the current test collection, in terms of both mean average precision and early precision. Moreover, we find that parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. Our concept models are able to significantly outperform a baseline query-likelihood run, both in terms of mean average precision and in terms of early precision on both title-only and title+narrative queries.

The remainder of this paper is organized as follows. In Section 2 we introduce the retrieval framework we have used for our submission, i.e., statistical language modeling. In Section 3 and 4 we introduce the specifics of our models and techniques. In Section 5 we describe the experimental setup, our parameter settings, and the preprocessing steps we performed on the collection. In Section 6 we discuss our experimental results and we end with a concluding section. 2

Language Modeling

Language modeling is a relatively new framework in the context of information retrieval [ 7, 16, 20 ]. It is centered around the assumption that a query as issued by a user is a sample generated from some underlying term distribution—the information need. The documents in the collection are modeled in a similar fashion and are usually considered to be a mixture of a document-specific model and a more general background model. At retrieval time, each document is ranked according to the likelihood of having generating the query (query-likelihood).

Lafferty and Zhai [ 11 ] propose to generalize the query likelihood model to the KL-divergence scoring method, in which the query is modeled separately. Scoring documents then comes down to measuring the divergence between a query model P (tj Q) and each document model P (tj D), in which the divergence is negated for ranking purposes. When the query model is generated using the empirical, maximumlikelihood estimate (MLE) on the original query as follows:

P (tj~Q) = n(t; Q) jQj ; (1) where n(t; Q) is the number of occurrences of term t in query Q and jQj the length of the query, it can be shown that documents are ranked in the same order as using the query likelihood model [20]. More formally, the score for each document given a query using the KL-divergence retrieval model is: Score(Q; D) =

KL( Qjj D) =

X P (tj Q) log P (tj D) + X P (tj Q) log P (tj Q); t2V t2V where V denotes the vocabulary. The entropy of the query—Pt2V P (tj Q) log P (tj Q)—remains constant per query and can be ignored for ranking purposes. 2.1

Smoothing

Each document model is estimated as the MLE of each term in the document P (tj D), linearly interpolated with a background language model P (t), which in turn is calculated as the likelihood of observing t in a sufficiently large collection, such as the document collection:

P (tj D) =

P (tj D) + (1 )P (t): We smooth using Bayesian smoothing with a Dirichlet prior and set = jDj+ where is the Dirichlet prior that controls the influence of smoothing [ 2, 22 ]. and (1 ) = jDj Dj+j , 2.2

Query Modeling

Relevance feedback can be applied to better capture a user’s information need [ 1, 12, 19 ]. In the context of statistical language modeling, this is usually performed by estimating a new query model, viz. P (tj Q), in Eq. 2 [16, 21]. Automatically reformulating queries (or blind relevance feedback) entails looking at the terms in some set of (pseudo-)relevant documents and selecting the most informative ones with respect to the set or the collection. These terms may then be reweighed based on information pertinent to the query or to the documents and—in a language modeling setting—be used to estimate a query model.

Relevance modeling is one specific technique by which to estimate a query model given a set of (pseudo-)relevant documents DQ. The query and documents are both taken to samples of an underlying generative model—the relevance model. There are several ways by which to estimate the parameters of this model given the observed data (query and documents), each following a different independence assumption [ 12 ]. For our current experiments we use method 2, which is formulated as: where q1; : : : ; qk are the query terms, D a document, and t a term. Bayes’ rule is used to estimate the term P ( Djt): (2) (3) (4) (5) (6) (7) P (tj^Q) / P (t) Y

P (qij Di )P ( Di jt); qi2Q Di2DQ P ( Djt) =

P (tj D)P ( D) ;

P (t) where we assume the document prior P ( D) to be uniform. Similar to Eq. 3, the term P (tj D) may be interpreted as a way of accounting for the fact that the (pseudo-)relevant documents contain terms related to the information need as well as terms from a more general model. We set it to the following mixture: P (tj D) = + (1

)P (t); n(t; D)

jDj where P (t) is the probability of observing t in a sufficiently large collection such as the entire document collection. Query models obtained using relevance models perform better when they are subsequently interpolated with the initial query using a mixing weight [ 10 ]:

P (tj Q) =

P (tj~Q) + (1 )P (tj^Q)

Concept Models

In order to leverage the explicit knowledge encapsulated in the GIRT/CSASA thesauri, we perform blind relevance feedback using the concepts defined therein. We leverage the dual document representations— concepts and terms—to create a generative language model for each concept, which bridges the gap between terms and concepts. Related work has also used textual representations to represent concepts, see e.g., [ 4, 17 ], however, we use statistical language modeling techniques to parametrize the concept models, by leveraging the dual representation of the documents.

To incorporate concepts in the retrieval process, we propose a conceptual query model which is an interpolation of the initial query with another query model. This model is obtained from a double concept translation. In this translation, concepts are used as a pivot language [ 9 ]; the initial query is translated to concepts and back to expanded query terms:

P (tj Q) = Note that we assume that the probability of selecting a term is no longer dependent on the query once we have selected a concept given that query. Then, two components need to be estimated: the probability of a concept given a query P (cjQ) and of a term given a concept P (tjc). To acquire P (tjc), we will use the assignments of GIRT/CSASA thesaural concepts to the documents in the collection and aggregate over the documents Dc which are labeled with a particular concept c: We drop the conditional dependence of t on c given D, again assume the document prior to be uniform, and apply Bayes’ rule to obtain:

P (tjc) =

X P (tjD; c)P (Djc):

D2Dc P (tjc) =

1 P (c)

X P (tj D)P (cj D);

D2DC P (c) =

Pc0

PD n(c; D)

PD0 n(c0; D0) P (cj D) =

n(c; D) Pc0 n(c0; D)

: P (cjQ) =

X P (cj D)P (DjQ); D2DQ (8) (9) (10) (11) where P (c) is a maximum likelihood (ML) estimate on the collection: and P (cj D) is determined using the ML of the concepts associated with that document Next, we also need need a way of estimating concepts for each query, which means that we are looking for a set of concepts CQ such that c 2 CQ have the highest posterior probability P (cjQ). We approach this by looking at the assignment of concepts to documents ane again consider documents which are related to the original query, by using the top ranked documents DQ from the initial retrieval run: where P (DjQ) is determined using the retrieval scores. Note that we again assume that the probability of observing a concept is independent of the query, once we have selected a document given the query, i.e., P (cjD; Q) = P (cj D). This enables us to directly use Eq. 10. 4

Parsimonization

If P (tj D) and P (cj D) in Eq. 6 and Eq. 10 are estimated based on MLE, general terms and concepts may acquire too much probability mass, simply because they occur more frequently. Parsimonization may be 0.8 0.7 0.6 0.5 n o i ics 0.4 e r P 0.3 0.2 0.1 0.0

UAmsBaseline.res.eval UAmsConceptModels.res.eval

UAmsParsRelModels.eval

UAmsRelModels.eval used to reduce the amount and probability mass of non-specific terms in a language model by iteratively adjusting the individual term probabilities based on a comparison with a large reference corpus, such as the collection [ 8 ]. While one of the introduced models may already contain a way of incorporating a reference corpus, viz. Eq. 6, we propose to make the estimates more sparse. Doing so enables more specific terms to receive more probability mass, thus making the resulting model more to the point. In order to achieve this, we consider both models to be a mixture of a document model P (xj D) and a background model P (x), where x 2 ft; cg, and we “parsimonize” the estimates through applying the following EM algorithm until the estimates do not change significantly anymore:

E-step: M-step:

ex = n(x; D) (1

P (xj D) )P (x) + P (xj D) P (xj D) = Px0 ex0 ex 5

Experimental Setup

We did not perform any preprocessing on the document collection, besides replacing German characters as well as HTML entities. To estimate our concept models, we have used the CONTROLLED-TERM-EN field in the documents. We submitted the following runs: 0.45 0.35 0.25 !pam 00..0155 209 129 201 223 221 210 220 215 214 222 203 216 027 123 212 211 202 224 206 -0.05 -0.15 -0.25 -0.35 Something went wrong with the submitted UAmsParsRelModels run based on parsimonious relevance models, making it identical to the UAmsRelModels run. In this paper we report on the corrected version.

In all runs which use blind relevance feedback, we use the 5 terms with the highest probability from the 10 highest ranked documents to estimate our query models. We have then used the 2007 CLEF Domain Specific topics to find the optimal parameter settings for (Eq. 6) and (Eq. 7 and Eq. 8). For our current experiments we set = 50 and fix = 0:15 [ 8 ]. 6

Results and Discussion

Table 1 lists the results of our runs. On the 2007 data, we found that adding the narrative field of the topics helps retrieval effectiveness. For comparative purposes we have included results for both title-only and title+narrative runs. On the 2008 topics, we do not find the same improvement when adding the narrative field, besides slightly improving precision@10. When looking at the longer topics (title+narrative), applying parsimonization to the relevance models hurts mean average precision, but helps early precision. This precision-enhancing effect is in line with earlier results [ 15 ].

The proposed concept models improve significantly over the query-likelihood baseline, both in terms of mean average precision and precision@10 and for both title-only and title+narrative topics. From the precision-recall plot in Figure 1 (title-only) it is clear that our concept model improves slightly in early precision and that the biggest gain is obtained between recall levels 0.2 and 0.7. It also shows that the relevance modeling approaches mainly help to improve recall and not so much precision.

Figure 2 displays a per-topic comparison between the query-likelihood run and each of the other runs. From these contrastive plots it emerges that topics 205 and 225 are hurt most when using relevance models. Further analysis should indicate which characteristics of these topics are responsible for this result. Interestingly, these two topics are hurt less when we apply our concepts models, whereas topic 219 is hurt most in this run. On the other side of the graph, there are quite a few topics which are helped using either relevance models or concept models. Especially topic 216 is improved when applying parsimonious relevance models (> 0:25 increase in mean average precision). The positive difference when applying concept models is even more distinctive; topic 217 is nearly improved by a 0.5 increase in mean average precision.

Conclusion

We have described our participation in this year’s CLEF Domain Specific track. Our aim was to evaluate blind relevance feedback models as well as concept models on the CLEF Domain Specific test collection. The results of our experiments show that applying relevance modeling techniques has a significant positive effect on the current topics, in terms of both mean average precision and precision@10. Moreover, we find that parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. When we apply concept models for blind relevance feedback, we observe an even bigger as well as significant improvement over the query-likelihood baseline, also in terms of mean average precision and early precision. Moreover, unlike (parsimonious) relevance models, our concept model improves title-only as well as title+narrative queries on both measures. 8

Acknowledgements

This work was carried out in the context of the Virtual Laboratory for e-Science project, which is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ). Maarten de Rijke was supported by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104, and by the Netherlands Organisation for Scientific Research (NWO) under project numbers 220-80-001, 017.001.190, 640.001.501, 640.002.501, 612.066.512, STE-07-012, 612.061.814, 612.061.815. 9 [16] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR ’98, 1998. [17] D. R. Recupero. A new unsupervised method for document clustering by using wordnet lexical and conceptual relations. Inf. Retr., 10(6):563–579, 2007. [18] D. Trieschnigg, E. Meij, M. de Rijke, and W. Kraaij. Measuring concept relatedness using language models. In SIGIR ’08, 2008. [19] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR ’96, pages 4–11, 1996. [20] C. Zhai. Risk Minimization and Language Modeling in Text Retrieval. PhD thesis, Carnegie Mellon

University, 2002. [21] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM ’01, pages 403–410, New York, NY, USA, 2001. ACM Press. [22] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179–214, April 2004.

[1]

Anick . Using terminological feedback for web search refinement: a log-based study . In SIGIR '03 , 2003 .

[2]

S. F.

Chen and J. Goodman. An empirical study of smoothing techniques for language modeling . In ACL , pages 310 - 318 , 1996 .

[3]

A. P.

Dempster ,

N. M.

Laird ,

and D. B.

Rubin . Maximum likelihood from incomplete data via the EM algorithm . Journal of the Royal Statistical Society. Series B (Methodological) , 39 ( 1 ): 1 - 38 , 1977 .

[4]

Gabrilovich and

Markovitch . Computing semantic relatedness using wikipedia-based explicit semantic analysis . In IJCAI'07 , 2007 .

[5]

Hersh ,

Cohen ,

Yang ,

R. T.

Bhupatiraju ,

Roberts , and

Hearst . TREC 2005 Genomics track overview . In Proceedings of the 14th Text Retrieval Conference. NIST , 2005 .

[6]

Hersh ,

Cohen , and

Roberts . TREC 2007 Genomics track overview . In TREC '07 , 2007 .

[7]

Hiemstra . A linguistically motivated probabilistic model of information retrieval . In ECDL '98 , pages 569 - 584 , 1998 .

[8]

Hiemstra ,

Robertson , and

Zaragoza . Parsimonious language models for information retrieval . In SIGIR '04 , 2004 .

[9]

Kraaij and F. de Jong. Transitive probabilistic CLIR models . In Proceedings of RIAO 2004 , 2004 .

[10]

Kurland ,

Lee , and

Domshlak . Better than the real thing?: iterative pseudo-query processing using cluster-based language models . In SIGIR '05 , 2005 .

[11]

Lafferty and

Zhai . Document language models, query models, and risk minimization for information retrieval . In SIGIR '01 , 2001 .

[12]

Lavrenko and

B. W.

Croft . Relevance based language models . In SIGIR '01 , 2001 .

[13]

Meij and M. de Rijke . Thesaurus-based feedback to support mixed search and browsing environments . In ECDL '07 , 2007 .

[14]

Meij ,

Trieschnigg , M. de Rijke, and

Kraaij . Parsimonious concept modeling . In SIGIR '08 , 2008 .

[15]

Meij ,

Weerkamp ,

Balog , and M. de Rijke. Parsimonious relevance models . In SIGIR '08 , 2008 .