Language Modeling and Document Re-Ranking: Trinity Experiments at TEL@CLEF-2009

Dong Zhou(1,2)  Vincent Wade(1)
1. Centre for Next Generation Localisation, University of Dublin, Trinity College, Dublin 2, Ireland
2. School of Computer and Communication, Hunan University, Changsha, Hunan, China, 410082
dongzhou1979@hotmail.com  Vincent.Wade@cs.tcd.ie

Abstract

This paper reports on our participation in the CLEF-2009 monolingual and bilingual ad hoc TEL@CLEF tasks, which involve three languages: English, French and German. Language modeling is adopted as the underlying information retrieval model. Because the data collection is extremely sparse, smoothing is particularly important when estimating a language model. The main purpose of the monolingual task is to compare different smoothing strategies and investigate the effectiveness of each alternative. This retrieval model is then used alongside a document re-ranking method based on Latent Dirichlet Allocation (LDA), which exploits the implicit structure of the documents with respect to the original queries, for both the monolingual and bilingual tasks. Experimental results demonstrate that the three smoothing strategies behave differently across the test languages, and that the LDA-based document re-ranking method requires further development in order to bring significant improvement over the baseline language modeling systems in the cross-language setting.
Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods; Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms

Experimentation, Measurement

Keywords

Language Modeling, Smoothing, Document Re-Ranking, Latent Dirichlet Allocation, Cross-Language Information Retrieval

1 Introduction

This year's participation in the CLEF 2009 ad hoc monolingual and bilingual track was motivated by a desire to compare different smoothing strategies applied to language modeling for library data retrieval, as well as to test and extend a newly developed document re-ranking method.

Language modeling has been successfully applied to the problem of ad hoc retrieval [4, 5, 6]. It provides an attractive retrieval model due to its theoretical foundations. The basic idea behind this approach is extremely simple: estimate a language model for each document and/or query, and rank documents by the likelihood of the query (with respect to the document language model) or by the distance between the two models. The main objective of smoothing is to adjust the maximum likelihood estimator of a language model so that it is more accurate [7]. However, previous success over news collection data does not necessarily mean the approach will be effective over library data. Firstly, the data is actually multilingual: all collections, to a greater or lesser extent, contain records pointing to documents in other languages. This is not a major problem, however, because the majority of documents in each test collection are written in that collection's main language, and documents written in other languages tend not to match queries in the main language. The main characteristic of the data is that it is very different from the newspaper articles and news agency dispatches previously used in CLEF: the data tends to be very sparse.
Many records contain only title, author and subject heading information; other records provide more detail (see the experiment section for the fields chosen for inclusion). The average document lengths after pre-processing are 14.66 for the BL collection and 24.19 for the BNF collection, respectively.

Recently, there has been a trend of exploiting the hidden structure of documents to re-rank results [2, 3, 8]. We argued in previous work that two important factors should be taken into account when designing any re-ranking algorithm: the original queries and the initial retrieval scores. Based on this observation, we introduced a document re-ranking method based on Latent Dirichlet Allocation (LDA) [1, 9] which exploits the implicit structure of the documents with respect to the original queries. Rather than relying on graph-based techniques as in [2, 3] to identify the internal structure, the approach tries to directly model the latent structure of "topics" or "concepts" in the initial retrieval set. We can then compute the distance between queries and initial retrieval results based on the inferred latent semantic information. Experiments in [9] demonstrated the effectiveness of the proposed method in monolingual retrieval. In this experiment, we try to extend the approach to cross-language information retrieval.

2 Methodology

2.1 Language Modeling

Smoothing a data set typically means creating an approximating function that captures the important patterns in the data while leaving out noise or other fine-scale structure. In language modeling, the simplest reason to use smoothing is to avoid assigning zero probability to unseen words; the accuracy of smoothing is directly related to retrieval performance. Given a text sequence (either a query or a document), its probability distribution can be regarded as a probabilistic language model Md or Mq estimated from each document d or each query q.
In other words, it assumes that there is an underlying language model which "generates" a term (sequence) [1]. The unigram language model is utilized here. There are several ways to estimate the probabilities. Let g(w ∈ d) denote the number of times the term w occurs in a document d (the same idea can be applied to a query). The maximum likelihood estimate (MLE) of w with respect to d is defined as:

    MLE_d(w) = g(w ∈ d) / Σ_{w'} g(w' ∈ d)    (1)

We choose three representative methods that are widely used in previous research and are relatively efficient to implement: the Jelinek-Mercer method, Bayesian smoothing using Dirichlet priors, and the absolute discounting method. Descriptions of these methods can be found in [7].

2.2 Document Re-Ranking

The intuition behind the document re-ranking method is the hidden structural information among the documents: similar documents are likely to share the same hidden information with respect to a query. In other words, if a group of documents are talking about the same topic, and that topic has a strong similarity to a query, our method will allocate them similar rankings, as they are more likely to be relevant to the query. In addition, the refined ranking scores should depend on the initial ranking scores; in the experiments conducted in this paper, the two are combined in a linear fashion.

Table 1: Indexing and searching fields

Indexed for all three languages (English, French, German): dc:language, dc:identifier, dc:rights, dc:type, dc:creator, dc:publisher, dc:date, dc:contributor, mods:location.
Indexed for only a subset of the languages: dc:relation, dcterms:issued, dcterms:extent, dcterms:spatial, dcterms:isPartOf, dcterms:edition, dcterms:available.

The distance between a query and a document in this method is the KL divergence between the query terms and the document terms; the final score is a linear combination of the initial ranking score and the latent document re-ranking score.
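The score combination just described can be sketched as follows. This is a minimal illustration under our own naming, not the exact formulation of [9]: the function names, the epsilon used to avoid log(0), and the interpolation weight `lam` are assumptions made for the sketch.

```python
import math

def kl_divergence(p, q, epsilon=1e-10):
    """KL(p || q) between two term distributions given as {term: prob} dicts.

    epsilon guards against log(0) for terms unseen in q.
    """
    vocab = set(p) | set(q)
    return sum(
        p.get(w, 0.0) * math.log((p.get(w, 0.0) + epsilon) / (q.get(w, 0.0) + epsilon))
        for w in vocab
        if p.get(w, 0.0) > 0.0
    )

def rerank_score(initial_score, query_dist, doc_dist, lam=0.5):
    """Linearly combine the initial retrieval score with a latent score.

    A small query-to-document divergence yields a high latent score,
    so documents close to the query in the latent space move up.
    """
    latent_score = -kl_divergence(query_dist, doc_dist)
    return lam * initial_score + (1.0 - lam) * latent_score
```

A document whose term distribution matches the query exactly gets a latent score of zero (the maximum), so with equal initial scores it outranks any document that diverges from the query.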
This method is described in detail in [9]. We apply it to cross-language re-ranking by concatenating texts from different languages into several dual-language documents and a single dual-language query. An LDA analysis of these texts results in a multilingual semantic space in which terms from both languages are represented. Hence the re-ranking process can be carried out by directly modeling the latent structure of multilingual "topics" or "concepts" in this enriched initial retrieval set. The similarity of the "contexts" in which the terms appear serves to capture the inter-relationship between texts in different languages.

3 Experimental Setup

3.1 Overview of the Experimental Process

All of the documents in the experiment were indexed using the Lemur toolkit(1). Prior to indexing, Porter's stemmer and a stopword list(2) were used for the English documents; a French analyzer(3) and a German analyzer were used to analyze the French and German documents. The query sets consist of 50 topics, all of which were used in the experiment. Each topic is composed of several parts: Title, Description and Narrative. We chose to conduct Title+Description runs, and the queries were processed in the same way as the test collections. The fields used for indexing and searching are shown in Table 1.

3.2 Experimental Runs

In order to investigate the effectiveness of the various techniques, we performed retrieval experiments with several permutations, denoted as follows.

For monolingual retrieval:

LM-DIR: documents were retrieved from the test collection using language modeling with Bayesian smoothing with a Dirichlet prior.
LM-ABS: as above, except that the absolute discounting smoothing method was used.
LM-JM: as above, except that the Jelinek-Mercer smoothing method was adopted.
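The construction of dual-language documents described above can be sketched as follows. The function name and the toy tokenized data are ours; actual LDA inference (e.g. Gibbs sampling over the merged corpus) is omitted from the sketch.

```python
def make_dual_language_docs(docs_l1, docs_l2):
    """Concatenate aligned tokenized documents from two languages into
    dual-language pseudo-documents. An LDA run over these yields topics
    whose word distributions span both vocabularies at once."""
    return [d1 + d2 for d1, d2 in zip(docs_l1, docs_l2)]

# Hypothetical tokenized documents in the two languages
english = [["library", "catalogue", "record"], ["query", "translation"]]
french = [["bibliotheque", "catalogue", "notice"], ["requete", "traduction"]]

dual_docs = make_dual_language_docs(english, french)
# A dual-language query is built the same way from the original and
# translated query terms:
dual_query = ["catalogue", "record"] + ["catalogue", "notice"]

# The shared vocabulary mixes terms from both languages, which is what
# lets a single topic model relate them through co-occurrence.
vocab = sorted({w for d in dual_docs for w in d})
```

Because English and French terms now co-occur inside the same pseudo-documents, topics inferred over this corpus place related terms from both languages in the same topic.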
(1) http://www.lemurproject.org
(2) ftp://ftp.cs.cornell.edu/pub/smart/
(3) http://lucene.apache.org/

Table 2: Retrieval results for monolingual task

Run ID     source  target  description  MAP     bpref   P@5     P@10
TCDENRUN1  EN      EN      LM-DIR       0.2905  0.3001  0.4560  0.4140
TCDENRUN2  EN      EN      LM-ABS       0.4035  0.4054  0.6160  0.5640
TCDENRUN3  EN      EN      LM-JM        0.3696  0.3658  0.5680  0.5060
TCDFRRUN1  FR      FR      LM-DIR       0.1451  0.1570  0.2000  0.1740
TCDFRRUN2  FR      FR      LM-ABS       0.1745  0.1767  0.2320  0.2380
TCDFRRUN3  FR      FR      LM-JM        0.1723  0.1765  0.2520  0.2280
TCDDERUN1  DE      DE      LM-DIR       0.2577  0.2615  0.4480  0.3760
TCDDERUN2  DE      DE      LM-ABS       0.2397  0.2397  0.4280  0.3540
TCDDERUN3  DE      DE      LM-JM        0.2686  0.2653  0.4520  0.3840

For bilingual retrieval:

GOOGLETRANS: documents were retrieved from the test collection using the Google Translator to translate the queries. (Note that, due to the submission restrictions, this is an unofficial measurement.)
GOOGLETRANS-LDA: documents were retrieved using query translations suggested by the Google Translator, and the retrieval results were then re-ranked directly against the translated query with the proposed LDA-based document re-ranking method.
GOOGLETRANS-SLDA: documents were retrieved using query translations suggested by the Google Translator; we then constructed a multilingual corpus containing documents written in both the query and document languages, and re-ranking was performed by applying the LDA-based method in this multilingual space (with the translated and the original query).

4 Results and Discussion

4.1 Monolingual Task

In this section we compare the three smoothing methods across different languages in the library search setting (Table 2). Since we built queries from the title and description fields, they can be considered long, informative queries.
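The three smoothing estimates compared in Table 2 can be sketched as follows, following the formulations surveyed in [7]. The parameter defaults (`lam`, `mu`, `delta`) are illustrative values, not the ones used for these runs; `p_coll` stands for the collection (background) language model.

```python
def p_ml(w, doc):
    """Maximum likelihood estimate of term w in a tokenized document."""
    return doc.count(w) / len(doc)

def jelinek_mercer(w, doc, p_coll, lam=0.7):
    """Linearly interpolate the document MLE with the collection model."""
    return (1 - lam) * p_ml(w, doc) + lam * p_coll(w)

def dirichlet(w, doc, p_coll, mu=2000):
    """Bayesian smoothing with a Dirichlet prior of mass mu."""
    return (doc.count(w) + mu * p_coll(w)) / (len(doc) + mu)

def absolute_discounting(w, doc, p_coll, delta=0.7):
    """Subtract delta from each seen term count and redistribute the
    freed mass to the collection model, weighted by the number of
    unique terms in the document."""
    d_len = len(doc)
    unique = len(set(doc))
    return max(doc.count(w) - delta, 0) / d_len + delta * unique / d_len * p_coll(w)
```

All three assign unseen terms a non-zero probability proportional to `p_coll(w)`, and each sums to one over the vocabulary whenever the collection model does, which is the property smoothing must preserve for the query likelihood model to stay a proper distribution.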
Previous research on news and web data suggested that, on average, Jelinek-Mercer is better than Dirichlet and absolute discounting in metrics such as non-interpolated average precision and precision at 10 and 20 documents, while both Jelinek-Mercer and Dirichlet clearly have better average precision than absolute discounting. The German monolingual runs show the same pattern: Jelinek-Mercer is better than Dirichlet, which is in turn better than absolute discounting. The English and French runs behave differently: absolute discounting is the clear winner among the three smoothing methods, while Jelinek-Mercer still performs better than Dirichlet. In terms of the two different roles smoothing plays in the query likelihood retrieval method [7], the Dirichlet method is well suited to the estimation role, and hence to shorter queries, so with the long queries used here it consistently demonstrates the worst performance across all languages. Jelinek-Mercer, in contrast, performs best for longer queries and should be well suited to the query modeling role. This is the case for the German runs but not for the English and French runs, where absolute discounting takes over Jelinek-Mercer's role in the modeling process. The results suggest that smoothing methods tend to be sensitive to the language and test collection involved.

4.2 Bilingual Task

We now turn to the bilingual tasks to study the LDA-based re-ranking method. The main experimental results are presented in Table 3, for all three languages. The first question of interest is how the re-ranking method performs directly over the bilingual retrieval results (taken as a whole). Our methods bring improvements over the Google Translator baselines in all 6 relevant comparisons, and in many cases the method outperforms the baselines on all of the evaluation metrics.
Table 3: Retrieval results for bilingual task

Run ID       source  target  description       MAP     bpref   P@5     P@10
TCDFRENRUN1  FR      EN      GOOGLETRANS       0.3481  0.3526  0.5760  0.5220
TCDFRENRUN2  FR      EN      GOOGLETRANS-LDA   0.3488  0.3527  0.5720  0.5220
TCDFRENRUN3  FR      EN      GOOGLETRANS-SLDA  0.3500  0.3535  0.5760  0.5140
TCDDEENRUN1  DE      EN      GOOGLETRANS       0.3411  0.3500  0.5700  0.5040
TCDDEENRUN2  DE      EN      GOOGLETRANS-LDA   0.3500  0.3596  0.5760  0.5040
TCDDEENRUN3  DE      EN      GOOGLETRANS-SLDA  0.3505  0.3602  0.5880  0.5040
TCDENFRRUN1  EN      FR      GOOGLETRANS       0.1579  0.1572  0.2520  0.2320
TCDENFRRUN2  EN      FR      GOOGLETRANS-LDA   0.1591  0.1573  0.2520  0.2340
TCDENFRRUN3  EN      FR      GOOGLETRANS-SLDA  0.1576  0.1561  0.2560  0.2320
TCDDEFRRUN1  DE      FR      GOOGLETRANS       0.1618  0.1743  0.2680  0.2300
TCDDEFRRUN2  DE      FR      GOOGLETRANS-LDA   0.1633  0.1752  0.2600  0.2300
TCDDEFRRUN3  DE      FR      GOOGLETRANS-SLDA  0.1624  0.1739  0.2600  0.2260
TCDENDERUN1  EN      DE      GOOGLETRANS       0.1901  0.1923  0.3480  0.2900
TCDENDERUN2  EN      DE      GOOGLETRANS-LDA   0.1910  0.1922  0.3480  0.2920
TCDENDERUN3  EN      DE      GOOGLETRANS-SLDA  0.1935  0.1944  0.3480  0.2920
TCDFRDERUN1  FR      DE      GOOGLETRANS       0.1826  0.2053  0.3480  0.2700
TCDFRDERUN2  FR      DE      GOOGLETRANS-LDA   0.1840  0.2063  0.3520  0.2780
TCDFRDERUN3  FR      DE      GOOGLETRANS-SLDA  0.1839  0.2050  0.3560  0.2760

With respect to the bilingual re-ranking, the method shows some improvement over the Google Translator and the direct re-ranking method in the X2EN and X2DE runs in terms of mean average precision. The performance is somewhat disappointing in the X2FR runs, and although there are some improvements, the differences in MAP are not large. However, in terms of precision at 5 documents the method demonstrates higher performance than simple re-ranking. This shows that the method is a promising direction, but one that needs further investigation.
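The evaluation measures reported in Tables 2 and 3 follow their standard TREC-style definitions; a minimal sketch (the function names are ours):

```python
def precision_at_k(ranked_relevance, k):
    """P@k: fraction of the top-k retrieved documents that are relevant.

    ranked_relevance is a list of 0/1 judgments in ranked order.
    """
    return sum(ranked_relevance[:k]) / k

def average_precision(ranked_relevance, num_relevant):
    """Average of the precision values at each rank where a relevant
    document appears, divided by the total number of relevant documents.
    MAP is the mean of this value over all topics."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0
```

For example, a ranking that places relevant documents at positions 1 and 3, with two relevant documents in total, gives P@2 = 0.5 and an average precision of (1/1 + 2/3) / 2.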
It is worth mentioning that the combination of methods used in this experiment achieved very good overall performance: nearly all of our selected monolingual and bilingual runs were among the top five participants in CLEF 2009 (except in the French monolingual task), including:

TCDENRUN2: absolute discounting, English monolingual
TCDDERUN1: Dirichlet prior, German monolingual
TCDDEENRUN3: Google Translator with SLDA, German-English bilingual
TCDDEFRRUN2: Google Translator with LDA, German-French bilingual
TCDENDERUN3: Google Translator with SLDA, English-German bilingual

5 Conclusion

In this paper we have described our contribution to the CLEF 2009 ad hoc monolingual and bilingual tracks. Our monolingual experiment involved the comparison of three different smoothing strategies applied to the language modeling approach for library data retrieval. We also made a first attempt to extend the previously proposed document re-ranking method to cross-language information retrieval. Experimental results demonstrated that smoothing methods tend to behave differently in the library search setting and across the test languages. They also showed that the LDA-based document re-ranking method requires further development in order to bring significant improvement over the baseline language modeling systems in the cross-language setting.

6 Acknowledgments

This research is supported by Science Foundation Ireland (Grant 07/CE/I1142) as part of the Centre for Next Generation Localisation (www.cngl.ie) at University of Dublin, Trinity College.

References

[1] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993-1022, 2003.
[2] Oren Kurland and Lillian Lee. PageRank without hyperlinks: structural re-ranking using links induced by language models. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 306-313, Salvador, Brazil, 2005. ACM.
[3] Oren Kurland and Lillian Lee. Respect my authority!: HITS without hyperlinks, utilizing cluster-based language models. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 83-90, Seattle, Washington, USA, 2006. ACM.
[4] Jay M. Ponte and W. Bruce Croft. A language modeling approach to information retrieval. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 275-281, Melbourne, Australia, 1998. ACM Press.
[5] Xing Wei and W. Bruce Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 178-185, Seattle, Washington, USA, 2006. ACM.
[6] Chengxiang Zhai and John Lafferty. Model-based feedback in the language modeling approach to information retrieval. In Proceedings of the tenth international conference on Information and knowledge management, pages 403-410, Atlanta, Georgia, USA, 2001. ACM.
[7] Chengxiang Zhai and John Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179-214, 2004.
[8] Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, and Wei-Ying Ma. Improving web search results using affinity graph. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, pages 504-511, Salvador, Brazil, 2005. ACM.
[9] Dong Zhou and Vincent Wade. Latent document re-ranking. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP 2009), pages 1571-1580, Singapore, 2009. Association for Computational Linguistics.