The University of Amsterdam at the CLEF 2008 Domain Specific Track
Parsimonious Relevance and Concept Models

Edgar Meij (emeij@science.uva.nl)
Maarten de Rijke (mdr@science.uva.nl)
ISLA, University of Amsterdam

Abstract

We describe our participation in the CLEF 2008 Domain Specific track. The research questions we address are threefold: (i) what are the effects of estimating and applying relevance models to the domain specific collection used at CLEF 2008, (ii) what are the results of parsimonizing these relevance models, and (iii) what are the results of applying concept models for blind relevance feedback? Parsimonization is a technique by which the term probabilities in a language model may be re-estimated based on a comparison with a reference model, making the resulting model more sparse and to the point. Concept models are distributions over vocabulary terms, based on the language associated with concepts in a thesaurus or ontology, and are estimated using the documents annotated with those concepts. Concept models may be used for blind relevance feedback by first translating a query to concepts and then back to query terms. We find that applying relevance models helps significantly for the current test collection, in terms of both mean average precision and early precision. Moreover, parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. Our concept models are able to significantly outperform a baseline query-likelihood run, in terms of both mean average precision and early precision, on both title-only and title+narrative queries.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.7 Digital Libraries

General Terms
Algorithms, Theory, Experimentation, Measurement

Keywords
Parsimonious Models, Language Models, Relevance Feedback

1 Introduction

We describe our participation in the 2008 CLEF Domain Specific track. Our main motivation for participating was to evaluate, on the CLEF Domain Specific test collection, the retrieval models we have developed for another, very similar domain. Our concept models have thus far been developed and evaluated on the TREC Genomics test collections, which also consist of documents that are manually annotated with concepts from a thesaurus [5, 6]. The main idea behind our approach is to model the language use associated with concepts from a thesaurus or ontology. To this end we use the document annotations as a bridge between vocabulary terms and the concepts in the knowledge source at hand. We model the language use around concepts using a generative language modeling framework, which provides theoretically sound estimation methods and builds upon a solid statistical foundation.

Our concept models may be used to determine semantic relatedness or to generate navigational suggestions, either in the form of concepts or vocabulary terms. These can then be used as suggestions for the user or for blind relevance feedback [13, 14, 18]. In order to apply blind relevance feedback using our models, we perform a double translation. First we estimate the most likely concepts given a query, and then we use the most distinguishing terms from these concepts to formulate a new query. In a sense we are using the concepts as a pivot language [9]. To find the most distinguishing terms given a concept, we apply a technique called parsimonization [8].
Parsimonization is an algorithm based on expectation-maximization (EM) [3] and may be used to re-estimate the probabilities of one model with respect to another. Events that are well predicted by the latter model lose probability mass, which in turn is given to the remaining events. Recently, we have successfully applied parsimonization to the estimation of relevance models on a variety of tasks and collections [15]. In all of these cases, as well as with our concept models, we find that interpolating the newly found query with the original one yields the best performance, an observation which is in line with the literature [10].

The research questions we address are threefold: (i) what are the effects of estimating and applying relevance models [12] to the collection used at the CLEF 2008 Domain Specific track, (ii) what are the results of parsimonizing these relevance models, and (iii) what are the results of applying our concept models for blind relevance feedback? We find that applying relevance models helps significantly for the current test collection, in terms of both mean average precision and early precision. Moreover, we find that parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. Our concept models are able to significantly outperform a baseline query-likelihood run, in terms of both mean average precision and early precision, on both title-only and title+narrative queries.

The remainder of this paper is organized as follows. In Section 2 we introduce the retrieval framework used for our submission, i.e., statistical language modeling. In Sections 3 and 4 we introduce the specifics of our models and techniques. In Section 5 we describe the experimental setup, our parameter settings, and the preprocessing steps we performed on the collection. In Section 6 we discuss our experimental results, and we end with a concluding section.

2 Language Modeling

Language modeling is a relatively new framework in the context of information retrieval [7, 16, 20]. It is centered around the assumption that a query as issued by a user is a sample generated from some underlying term distribution: the information need. The documents in the collection are modeled in a similar fashion and are usually considered to be a mixture of a document-specific model and a more general background model. At retrieval time, each document is ranked according to the likelihood of having generated the query (query likelihood).

Lafferty and Zhai [11] propose to generalize the query-likelihood model to the KL-divergence scoring method, in which the query is modeled separately. Scoring documents then comes down to measuring the divergence between a query model $P(t|\theta_Q)$ and each document model $P(t|\theta_D)$, where the divergence is negated for ranking purposes. When the query model is the empirical, maximum-likelihood estimate (MLE) on the original query,

$$P(t|\tilde{\theta}_Q) = \frac{n(t;Q)}{|Q|}, \qquad (1)$$

where $n(t;Q)$ is the number of occurrences of term $t$ in query $Q$ and $|Q|$ the length of the query, it can be shown that documents are ranked in the same order as under the query-likelihood model [20]. More formally, the score for each document given a query under the KL-divergence retrieval model is

$$\mathrm{Score}(Q,D) = -KL(\theta_Q \,\|\, \theta_D) = \sum_{t \in V} P(t|\theta_Q) \log P(t|\theta_D) \;-\; \sum_{t \in V} P(t|\theta_Q) \log P(t|\theta_Q), \qquad (2)$$

where $V$ denotes the vocabulary. The entropy of the query, $-\sum_{t \in V} P(t|\theta_Q) \log P(t|\theta_Q)$, remains constant per query and can be ignored for ranking purposes.
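To make the scoring concrete, the following minimal Python sketch (not the code behind our official runs) implements the empirical query model of Eq. 1 and the rank-equivalent part of Eq. 2. It assumes the document model has already been smoothed as described in Section 2.1 below; the small floor value for unseen terms is an illustrative safeguard only.

```python
import math
from collections import Counter

def empirical_query_model(query_terms):
    """Eq. 1: P(t | theta_Q) = n(t;Q) / |Q|."""
    counts = Counter(query_terms)
    return {t: n / len(query_terms) for t, n in counts.items()}

def kl_score(query_model, doc_model):
    """Rank-equivalent part of Eq. 2: the cross-entropy sum
    sum_t P(t|theta_Q) log P(t|theta_D); the query-entropy term is
    constant per query and therefore dropped."""
    return sum(p * math.log(doc_model.get(t, 1e-12))
               for t, p in query_model.items())

# Example: rank two (already smoothed) document models for a short query.
if __name__ == "__main__":
    q = empirical_query_model(["social", "inequality", "social"])
    d1 = {"social": 0.05, "inequality": 0.01, "the": 0.94}
    d2 = {"social": 0.02, "inequality": 0.002, "the": 0.978}
    ranking = sorted(["d1", "d2"],
                     key=lambda d: kl_score(q, {"d1": d1, "d2": d2}[d]),
                     reverse=True)
    print(ranking)  # d1 ranks above d2
```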
2.1 Smoothing

Each document model $P(t|\theta_D)$ is estimated as the MLE of each term in the document, linearly interpolated with a background language model $P(t)$, which in turn is calculated as the likelihood of observing $t$ in a sufficiently large collection, such as the document collection:

$$P(t|\theta_D) = \beta\,\frac{n(t;D)}{|D|} + (1-\beta)\,P(t). \qquad (3)$$

We smooth using Bayesian smoothing with a Dirichlet prior and set $\beta = \frac{|D|}{|D|+\mu}$ and $1-\beta = \frac{\mu}{|D|+\mu}$, where $\mu$ is the Dirichlet prior that controls the influence of smoothing [2, 22].

2.2 Query Modeling

Relevance feedback can be applied to better capture a user's information need [1, 12, 19]. In the context of statistical language modeling, this is usually done by estimating a new query model $P(t|\theta_Q)$ in Eq. 2 [16, 21]. Automatically reformulating queries (blind relevance feedback) entails looking at the terms in some set of (pseudo-)relevant documents and selecting the most informative ones with respect to that set or the collection. These terms may then be reweighted based on information pertinent to the query or to the documents and, in a language modeling setting, be used to estimate a query model.

Relevance modeling is one specific technique for estimating a query model given a set of (pseudo-)relevant documents $D_Q$. The query and the documents are both taken to be samples from an underlying generative model: the relevance model. There are several ways to estimate the parameters of this model given the observed data (query and documents), each following a different independence assumption [12]. For our current experiments we use method 2, which is formulated as

$$P(t|\hat{\theta}_Q) \propto P(t) \prod_{q_i \in Q} \sum_{D_i \in D_Q} P(q_i|\theta_{D_i})\,P(\theta_{D_i}|t), \qquad (4)$$

where $q_1, \ldots, q_k$ are the query terms, $D$ a document, and $t$ a term. Bayes' rule is used to estimate the term $P(\theta_D|t)$:

$$P(\theta_D|t) = \frac{P(t|\theta_D)\,P(\theta_D)}{P(t)}, \qquad (5)$$

where we assume the document prior $P(\theta_D)$ to be uniform. Similar to Eq. 3, the term $P(t|\theta_D)$ may be interpreted as a way of accounting for the fact that the (pseudo-)relevant documents contain terms related to the information need as well as terms from a more general model. We set it to the following mixture:

$$P(t|\theta_D) = \alpha\,\frac{n(t;D)}{|D|} + (1-\alpha)\,P(t), \qquad (6)$$

where $P(t)$ is the probability of observing $t$ in a sufficiently large collection, such as the entire document collection. Query models obtained using relevance models perform better when they are subsequently interpolated with the initial query using a mixing weight $\lambda$ [10]:

$$P(t|\theta_Q) = \lambda\,P(t|\tilde{\theta}_Q) + (1-\lambda)\,P(t|\hat{\theta}_Q). \qquad (7)$$
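As an illustration only (not the system used for the submitted runs), the sketch below estimates a relevance model in the spirit of Eqs. 4-6 with uniform document priors and interpolates it with the original query as in Eq. 7. The term prior $P(t)$ in Eqs. 4-5 is taken here as the marginal over the feedback documents, one common choice, and the default values for alpha, lambda, and the number of expansion terms are placeholders rather than the tuned settings reported in Section 5.

```python
from collections import Counter

def smoothed_doc_model(doc_terms, background, alpha=0.5):
    """Eq. 6: P(t | theta_D) = alpha * n(t;D)/|D| + (1 - alpha) * P(t),
    defined here over the background vocabulary."""
    counts, dlen = Counter(doc_terms), len(doc_terms)
    return {t: alpha * counts[t] / dlen + (1 - alpha) * p
            for t, p in background.items()}

def relevance_model(query_terms, doc_models, n_terms=5):
    """'Method 2' relevance model (Eqs. 4-5) with a uniform document prior.
    doc_models: e.g. [smoothed_doc_model(d, background) for d in feedback_docs]."""
    vocab = set().union(*doc_models)
    prior = 1.0 / len(doc_models)
    # Marginal P(t) over the feedback set, used in Bayes' rule (Eq. 5).
    p_t = {t: sum(prior * dm.get(t, 0.0) for dm in doc_models) for t in vocab}
    scores = {}
    for t in vocab:
        if p_t[t] == 0.0:
            continue
        s = p_t[t]
        for q in query_terms:
            # sum_D P(q | theta_D) * P(theta_D | t)
            s *= sum(dm.get(q, 0.0) * dm.get(t, 0.0) * prior / p_t[t]
                     for dm in doc_models)
        scores[t] = s
    top = sorted(scores, key=scores.get, reverse=True)[:n_terms]
    z = sum(scores[t] for t in top)
    return {t: scores[t] / z for t in top}

def interpolate(original, expanded, lam=0.5):
    """Eq. 7: lambda * P(t|tilde_theta_Q) + (1 - lambda) * P(t|hat_theta_Q)."""
    return {t: lam * original.get(t, 0.0) + (1 - lam) * expanded.get(t, 0.0)
            for t in set(original) | set(expanded)}
```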
3 Concept Models

In order to leverage the explicit knowledge encapsulated in the GIRT/CSASA thesauri, we perform blind relevance feedback using the concepts defined therein. We leverage the dual document representations (concepts and terms) to create a generative language model for each concept, which bridges the gap between terms and concepts. Related work has also used textual representations to represent concepts, see e.g. [4, 17]; however, we use statistical language modeling techniques to parametrize the concept models, by leveraging the dual representation of the documents.

To incorporate concepts in the retrieval process, we propose a conceptual query model which is an interpolation of the initial query with another query model. This model is obtained from a double concept translation, in which concepts are used as a pivot language [9]: the initial query is translated to concepts and back to expanded query terms:

$$P(t|\theta_Q) = \lambda\,P(t|\tilde{\theta}_Q) + (1-\lambda) \sum_{c \in C} P(t|c)\,P(c|Q). \qquad (8)$$

Note that we assume that the probability of selecting a term no longer depends on the query once we have selected a concept given that query. Two components then need to be estimated: the probability of a concept given a query, $P(c|Q)$, and the probability of a term given a concept, $P(t|c)$.

To acquire $P(t|c)$, we use the assignments of GIRT/CSASA thesaural concepts to the documents in the collection and aggregate over the documents $D_c$ which are labeled with a particular concept $c$:

$$P(t|c) = \sum_{D \in D_c} P(t|D,c)\,P(D|c).$$

We drop the conditional dependence of $t$ on $c$ given $D$, again assume the document prior to be uniform, and apply Bayes' rule to obtain

$$P(t|c) = \frac{1}{P(c)} \sum_{D \in D_c} P(t|\theta_D)\,P(c|\theta_D), \qquad (9)$$

where $P(c)$ is a maximum likelihood (ML) estimate on the collection,

$$P(c) = \frac{\sum_{D} n(c;D)}{\sum_{c'} \sum_{D'} n(c';D')},$$

and $P(c|\theta_D)$ is determined using the ML estimate of the concepts associated with that document:

$$P(c|\theta_D) = \frac{n(c;D)}{\sum_{c'} n(c';D)}. \qquad (10)$$

Next, we also need a way of estimating concepts for each query; that is, we are looking for a set of concepts $C_Q$ such that the $c \in C_Q$ have the highest posterior probability $P(c|Q)$. We approach this by looking at the assignment of concepts to documents and again consider documents which are related to the original query, by using the top-ranked documents $D_Q$ from the initial retrieval run:

$$P(c|Q) = \sum_{D \in D_Q} P(c|\theta_D)\,P(D|Q), \qquad (11)$$

where $P(D|Q)$ is determined using the retrieval scores. Note that we again assume that the probability of observing a concept is independent of the query once we have selected a document given the query, i.e., $P(c|D,Q) = P(c|\theta_D)$. This enables us to directly use Eq. 10.
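The sketch below shows one way to estimate these quantities from an annotated collection; it is illustrative rather than the implementation behind our runs. It normalizes each $P(t|c)$ to sum to one (a slight simplification of the $1/P(c)$ factor in Eq. 9), uses unsmoothed document models for brevity, and assumes the retrieval scores passed in for Eq. 11 have already been mapped to normalized probabilities $P(D|Q)$.

```python
from collections import Counter, defaultdict

def concept_models(annotated_docs):
    """Eqs. 9-10: aggregate document term models, weighted by P(c|theta_D),
    into one term distribution per concept.
    annotated_docs: iterable of (tokens, concepts) pairs."""
    mass = defaultdict(lambda: defaultdict(float))   # concept -> term -> weight
    for tokens, concepts in annotated_docs:
        if not tokens or not concepts:
            continue
        t_counts, c_counts = Counter(tokens), Counter(concepts)
        dlen, clen = len(tokens), len(concepts)
        for c, nc in c_counts.items():
            p_c_d = nc / clen                        # Eq. 10: P(c | theta_D)
            for t, nt in t_counts.items():
                mass[c][t] += (nt / dlen) * p_c_d    # P(t|theta_D) P(c|theta_D)
    # Normalize per concept so that P(t|c) is a proper distribution.
    models = {}
    for c, tw in mass.items():
        z = sum(tw.values())
        models[c] = {t: w / z for t, w in tw.items()}
    return models

def query_concepts(top_docs, doc_concepts):
    """Eq. 11: P(c|Q) as a mixture of the concept distributions of the
    top-ranked documents, weighted by P(D|Q).
    top_docs: list of (doc_id, p_d_given_q); doc_concepts: doc_id -> Counter."""
    p_c_q = defaultdict(float)
    for doc_id, p_d_q in top_docs:
        c_counts = doc_concepts[doc_id]
        clen = sum(c_counts.values())
        for c, nc in c_counts.items():
            p_c_q[c] += (nc / clen) * p_d_q          # Eq. 10 reused inside Eq. 11
    return dict(p_c_q)
```

Eq. 8 then expands the query with $\sum_c P(t|c)\,P(c|Q)$ over the concepts found for the query and interpolates the result with the original query model, exactly as in Eq. 7.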
4 Parsimonization

If $P(t|\theta_D)$ and $P(c|\theta_D)$ in Eq. 6 and Eq. 10 are estimated using the MLE, general terms and concepts may acquire too much probability mass simply because they occur more frequently. Parsimonization may be used to reduce the number and probability mass of non-specific terms in a language model by iteratively adjusting the individual term probabilities based on a comparison with a large reference corpus, such as the collection [8]. While one of the models introduced above already incorporates a reference corpus (viz. Eq. 6), we propose to make the estimates more sparse. Doing so enables more specific terms to receive more probability mass, thus making the resulting model more to the point.

In order to achieve this, we consider both models to be a mixture of a document model $P(x|\theta_D)$ and a background model $P(x)$, where $x \in \{t, c\}$, and we "parsimonize" the estimates by applying the following EM algorithm until the estimates no longer change significantly:

E-step:
$$e_x = n(x;D)\,\frac{\gamma\,P(x|\theta_D)}{(1-\gamma)\,P(x) + \gamma\,P(x|\theta_D)}$$

M-step:
$$P(x|\theta_D) = \frac{e_x}{\sum_{x'} e_{x'}}$$
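A compact sketch of these two steps could look as follows; it is illustrative only, uses a fixed number of iterations instead of an explicit convergence test, and adds a pruning threshold of our own choosing to make the sparsification concrete. The same routine applies to term models ($x = t$) and concept models ($x = c$).

```python
def parsimonize(counts, background, gamma=0.15, iterations=20, threshold=1e-4):
    """Re-estimate P(x|theta_D) with the E- and M-steps above.
    counts: n(x;D) as a dict; background: collection model P(x);
    gamma is fixed at 0.15 as in Section 5. Events that the background
    already explains well lose mass; those dropping below `threshold`
    (an illustrative cut-off) are pruned, keeping the model sparse."""
    total = sum(counts.values())
    p = {x: n / total for x, n in counts.items()}          # MLE initialization
    for _ in range(iterations):
        # E-step: expected counts under the document/background mixture.
        e = {x: counts[x] * gamma * px
                / ((1 - gamma) * background.get(x, 1e-12) + gamma * px)
             for x, px in p.items()}
        # M-step: renormalize and prune near-zero events.
        z = sum(e.values())
        p = {x: v / z for x, v in e.items() if v / z >= threshold}
        z = sum(p.values())
        p = {x: v / z for x, v in p.items()}
    return p
```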
5 Experimental Setup

We did not perform any preprocessing on the document collection, besides replacing German special characters and HTML entities. To estimate our concept models, we used the CONTROLLED-TERM-EN field in the documents. We submitted the following runs:

UAmsBaseline – a baseline run with $P(t|\theta_Q)$ in Eq. 2 set to the empirical estimate on the query (Eq. 1),

UAmsRelModels – a run based on relevance models, viz. Eq. 4,

UAmsParsRelModels – a run based on parsimonious relevance models, viz. Eq. 4 in conjunction with the E- and M-steps described in the previous section,

UAmsConceptModels – a run based on concept models, viz. Section 3 and Section 4.

Something went wrong with the submitted UAmsParsRelModels run based on parsimonious relevance models, making it identical to the UAmsRelModels run. In this paper we report on the corrected version. In all runs which use blind relevance feedback, we use the 5 terms with the highest probability from the 10 highest-ranked documents to estimate our query models. We then used the 2007 CLEF Domain Specific topics to find the optimal parameter settings for $\alpha$ (Eq. 6) and $\lambda$ (Eq. 7 and Eq. 8). For our current experiments we set $\mu = 50$ and fix $\gamma = 0.15$ [8].

6 Results and Discussion

Table 1 lists the results of our runs.

Table 1: Empirical results of our submitted runs, in terms of mean average precision (MAP) and precision@10 (P10). Best scores are marked in boldface. A †/‡ indicates a statistically significant difference compared to the baseline at the 0.05/0.01 level, respectively (tested using a Wilcoxon signed rank test).

                        title                 title+narrative
                        MAP        P10        MAP        P10
  UAmsBaseline          0.2433     0.4560     0.2077     0.4600
  UAmsRelModels         0.2737‡    0.5040     0.2396     0.4400
  UAmsParsRelModels     0.2779†    0.5000     0.2271     0.4800†
  UAmsConceptModels     0.2922†    0.5160†    0.2581†    0.5240†

On the 2007 data, we found that adding the narrative field of the topics helps retrieval effectiveness. For comparative purposes we have included results for both title-only and title+narrative runs. On the 2008 topics, we do not find the same improvement when adding the narrative field, besides a slight improvement in precision@10. When looking at the longer topics (title+narrative), applying parsimonization to the relevance models hurts mean average precision, but helps early precision. This precision-enhancing effect is in line with earlier results [15].

[Figure 1: Precision-recall graph of the various runs (UAmsBaseline, UAmsRelModels, UAmsParsRelModels, UAmsConceptModels); recall on the x-axis, precision on the y-axis.]

The proposed concept models improve significantly over the query-likelihood baseline, in terms of both mean average precision and precision@10, and for both title-only and title+narrative topics. From the precision-recall plot in Figure 1 (title-only) it is clear that our concept model improves slightly in early precision and that the biggest gain is obtained between recall levels 0.2 and 0.7. It also shows that the relevance modeling approaches mainly help to improve recall and not so much precision.

[Figure 2: Per-topic breakdown of the difference in average precision between the baseline run and UAmsRelModels (a), UAmsParsRelModels (b), and UAmsConceptModels (c). Topics are sorted by increasing difference; the labels indicate the respective topic identifiers.]

Figure 2 displays a per-topic comparison between the query-likelihood run and each of the other runs. From these contrastive plots it emerges that topics 205 and 225 are hurt most when using relevance models. Further analysis should indicate which characteristics of these topics are responsible for this result. Interestingly, these two topics are hurt less when we apply our concept models, whereas topic 219 is hurt most in this run. On the other side of the graph, there are quite a few topics which are helped by either relevance models or concept models. Topic 216 in particular improves when applying parsimonious relevance models (an increase of more than 0.25 in average precision). The positive difference when applying concept models is even more pronounced: topic 217 improves by nearly 0.5 in average precision.

7 Conclusion

We have described our participation in this year's CLEF Domain Specific track. Our aim was to evaluate blind relevance feedback models as well as concept models on the CLEF Domain Specific test collection. The results of our experiments show that applying relevance modeling techniques has a significant positive effect on the current topics, in terms of both mean average precision and precision@10. Moreover, we find that parsimonizing the relevance models helps mean average precision on title-only queries and early precision on title+narrative queries. When we apply concept models for blind relevance feedback, we observe an even bigger and also significant improvement over the query-likelihood baseline, again in terms of mean average precision and early precision. Moreover, unlike (parsimonious) relevance models, our concept models improve title-only as well as title+narrative queries on both measures.

8 Acknowledgements

This work was carried out in the context of the Virtual Laboratory for e-Science project, which is supported by a BSIK grant from the Dutch Ministry of Education, Culture and Science (OC&W) and is part of the ICT innovation program of the Ministry of Economic Affairs (EZ). Maarten de Rijke was supported by the E.U. IST programme of the 6th FP for RTD under project MultiMATCH contract IST-033104, and by the Netherlands Organisation for Scientific Research (NWO) under project numbers 220-80-001, 017.001.190, 640.001.501, 640.002.501, 612.066.512, STE-07-012, 612.061.814, 612.061.815.

9 References

[1] P. Anick. Using terminological feedback for web search refinement: a log-based study. In SIGIR '03, 2003.

[2] S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. In ACL, pages 310-318, 1996.

[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.

[4] E. Gabrilovich and S. Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI '07, 2007.

[5] W. Hersh, A. Cohen, J. Yang, R. T. Bhupatiraju, P. Roberts, and M. Hearst. TREC 2005 Genomics track overview. In Proceedings of the 14th Text Retrieval Conference. NIST, 2005.

[6] W. Hersh, A. Cohen, and P. Roberts. TREC 2007 Genomics track overview. In TREC '07, 2007.

[7] D. Hiemstra. A linguistically motivated probabilistic model of information retrieval. In ECDL '98, pages 569-584, 1998.

[8] D. Hiemstra, S. Robertson, and H. Zaragoza. Parsimonious language models for information retrieval. In SIGIR '04, 2004.

[9] W. Kraaij and F. de Jong. Transitive probabilistic CLIR models. In Proceedings of RIAO 2004, 2004.
[10] O. Kurland, L. Lee, and C. Domshlak. Better than the real thing? Iterative pseudo-query processing using cluster-based language models. In SIGIR '05, 2005.

[11] J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In SIGIR '01, 2001.

[12] V. Lavrenko and W. B. Croft. Relevance based language models. In SIGIR '01, 2001.

[13] E. Meij and M. de Rijke. Thesaurus-based feedback to support mixed search and browsing environments. In ECDL '07, 2007.

[14] E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij. Parsimonious concept modeling. In SIGIR '08, 2008.

[15] E. Meij, W. Weerkamp, K. Balog, and M. de Rijke. Parsimonious relevance models. In SIGIR '08, 2008.

[16] J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In SIGIR '98, 1998.

[17] D. R. Recupero. A new unsupervised method for document clustering by using WordNet lexical and conceptual relations. Information Retrieval, 10(6):563-579, 2007.

[18] D. Trieschnigg, E. Meij, M. de Rijke, and W. Kraaij. Measuring concept relatedness using language models. In SIGIR '08, 2008.

[19] J. Xu and W. B. Croft. Query expansion using local and global document analysis. In SIGIR '96, pages 4-11, 1996.

[20] C. Zhai. Risk Minimization and Language Modeling in Text Retrieval. PhD thesis, Carnegie Mellon University, 2002.

[21] C. Zhai and J. Lafferty. Model-based feedback in the language modeling approach to information retrieval. In CIKM '01, pages 403-410, New York, NY, USA, 2001. ACM Press.

[22] C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Transactions on Information Systems, 22(2):179-214, April 2004.