    Pseudo-Relevance Feedback for CLEF-CHiC
                     Adhoc

                                   Ray R. Larson

                               School of Information
                      University of California, Berkeley, USA
                             ray@sims.berkeley.edu



      Abstract. In this paper we will briefly describe the approaches taken by
      the Cheshire (Berkeley) Group for the CLEF CHiC Adhoc tasks (Mono-
      lingual, Bilingual and Multilingual retrieval for English, French and Ger-
      man). We used multiple translations of the topics for searching each of the
      CHiC Europeana English, French and German subcollections, employing
      Google Translate as our translation system. In addition we combined the
      original topics for various multilingual runs.
      Once again this year our approach was to use probabilistic text retrieval
      based on logistic regression and incorporating pseudo relevance feedback
      for all of the runs.
      The overall results, when evaluated using the multilingual qrels based on
      the entire set of languages for the CHiC collection, were not good, while
      the individual monolingual runs, evaluated using only the collection-specific
      qrels, appear to have performed reasonably well. There is some question
      about the qrels for the entire multilingual collection, since they appear to
      contain no relevant documents at all from the English collection.


1   Introduction
The collections used for the Cultural Heritage in CLEF (CHiC) track are derived
from the Europeana Digital Library and contain metadata describing materials
ranging from text documents to photographs, museum objects, historic locations
and persons.
    Each of the collections used in the CLEF Adhoc CHiC track is considered to
be “mainly” in a particular language (at least for the English, French, and Ger-
man collections), according to the language codes specified in the language
attribute of the metadata tag. However, records also include descriptive meta-
data in virtually all other languages as part of the “enrichment:concept label”
elements. Thus, an English record will also include concept names in other
languages. This overlap of languages presents an interesting multilingual search
problem, and we explored it by using translations of topics into each of the
other languages (we used only English, French and German for this) and
combining those translations with the original topics in some of our submissions.
    In this paper we review the retrieval algorithms and evaluation results for
Berkeley’s official submissions for the Adhoc-CHiC 2013 track. All of the runs
were automatic without manual intervention in the queries (or translations). We
submitted ten Multilingual runs (using various combinations of query languages
and searching the English, French and German collections), six Bilingual runs
(two for each target language German, English and French) and three monolin-
gual runs for the three target languages.
    This paper first describes the retrieval algorithms used for our submissions,
followed by a discussion of the processing used for the runs. We then examine the
results obtained for our official runs, and finally present conclusions and future
directions for Adhoc-CHiC participation.


2     The Retrieval Algorithms

Note that this section is virtually identical to one that appears in our papers
from previous CLEF participation and is included here for reference only [7, 8]. The
basic form and variables of the Logistic Regression (LR) algorithm used for all
of our submissions was originally developed by Cooper, et al. [5]. As originally
formulated, the LR model of probabilistic IR attempts to estimate the probabil-
ity of relevance for each document based on a set of statistics about a document
collection and a set of queries in combination with a set of weighting coefficients
for those statistics. The statistics to be used and the values of the coefficients
are obtained from regression analysis of a sample of a collection (or similar test
collection) for some set of queries where relevance and non-relevance has been
determined. More formally, given a particular query and a particular document
in a collection P (R | Q, D) is calculated and the documents or components are
presented to the user ranked in order of decreasing values of that probability. To
avoid invalid probability values, the usual calculation of P (R | Q, D) uses the
“log odds” of relevance given a set of S statistics, si , derived from the query and
database, such that:
                     \log O(R \mid Q, D) = b_0 + \sum_{i=1}^{S} b_i s_i                  (1)

where b0 is the intercept term and the bi are the coefficients obtained from the
regression analysis of the sample collection and relevance judgements. The final
ranking is determined by the conversion of the log odds form to probabilities:

                     P(R \mid Q, D) = \frac{e^{\log O(R \mid Q, D)}}{1 + e^{\log O(R \mid Q, D)}}                (2)
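
   As a concrete illustration of Equations 1 and 2, the following minimal Python
sketch (an illustration only, not part of the Cheshire II system) computes the
probability of relevance from a set of statistics and coefficients; the values in
the example call are made up.

    import math

    def log_odds(b0, coeffs, stats):
        # Equation (1): log O(R|Q,D) = b0 + sum_i b_i * s_i
        return b0 + sum(b * s for b, s in zip(coeffs, stats))

    def prob_relevance(b0, coeffs, stats):
        # Equation (2): convert the log odds back to a probability
        lo = log_odds(b0, coeffs, stats)
        return math.exp(lo) / (1.0 + math.exp(lo))

    # Illustrative values only (not fitted coefficients):
    print(prob_relevance(-3.5, [0.3, 0.2], [1.2, 0.8]))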

2.1   TREC2 Logistic Regression Algorithm

For Adhoc-CHiC we used a version of the Logistic Regression (LR) algorithm that
has been used very successfully in Cross-Language IR by Berkeley researchers
for a number of years [3]. The formal definition of the TREC2 Logistic Regression
algorithm used is:
   \log O(R|C,Q) = \log \frac{p(R|C,Q)}{1 - p(R|C,Q)} = \log \frac{p(R|C,Q)}{\overline{p}(R|C,Q)}
                 = c_0 + c_1 \cdot \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \frac{qtf_i}{ql+35}
                       + c_2 \cdot \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \log \frac{tf_i}{cl+80}
                       - c_3 \cdot \frac{1}{\sqrt{|Q_c|+1}} \sum_{i=1}^{|Q_c|} \log \frac{ctf_i}{N_t}
                       + c_4 \cdot |Q_c|                                                 (3)

where C denotes a document component (i.e., an indexed part of a document,
which may be the entire document), Q denotes a query, and R is a relevance variable,

p(R|C, Q) is the probability that document component C is relevant to query
     Q,
\overline{p}(R|C, Q) is the probability that document component C is not relevant
     to query Q, which is 1.0 − p(R|C, Q),
|Qc | is the number of matching terms between a document component and a
     query,
qtfi is the within-query frequency of the ith matching term,
tfi is the within-document frequency of the ith matching term,
ctfi is the occurrence frequency in a collection of the ith matching term,
ql is query length (i.e., number of terms in a query like |Q| for non-feedback
     situations),
cl is component length (i.e., number of terms in a component), and
Nt is collection length (i.e., number of terms in a test collection).
ck are the coefficients obtained through the regression analysis.

    If stopwords are removed from indexing, then ql, cl, and Nt are the query
length, document length, and collection length, respectively. If the query terms
are re-weighted (in feedback, for example), then qtfi is no longer the original
term frequency, but the new weight, and ql is the sum of the new weight values
for the query terms. Note that, unlike the document and collection lengths, query
length is the “optimized” relative frequency without first taking the log over the
matching terms.
    The coefficients were determined by fitting the logistic regression model spec-
ified in log O(R|C, Q) to TREC training data using a statistical software package.
The coefficients, ck , used for our official runs are the same as those described
by Chen[1]. These were: c0 = −3.51, c1 = 37.4, c2 = 0.330, c3 = 0.1937 and
c4 = 0.0929. Further details on the TREC2 version of the Logistic Regression
algorithm may be found in Cooper et al. [4].
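
   For reference, the following minimal Python sketch shows how a score could be
computed from Equation 3 with the coefficients above. It is an illustration of the
formula, not the actual Cheshire II code, and the per-term statistics passed in are
assumed to come from the system's indexes.

    import math

    # Coefficients c0..c4 as given above (Chen [1]).
    C0, C1, C2, C3, C4 = -3.51, 37.4, 0.330, 0.1937, 0.0929

    def trec2_log_odds(matches, ql, cl, Nt):
        """Equation 3. matches is a list of (qtf, tf, ctf) triples, one per
        term shared by the query and the document component; ql, cl and Nt
        are the query, component and collection lengths."""
        Qc = len(matches)
        if Qc == 0:
            return C0
        norm = 1.0 / math.sqrt(Qc + 1)
        s1 = sum(qtf / (ql + 35.0) for qtf, _, _ in matches)
        s2 = sum(math.log(tf / (cl + 80.0)) for _, tf, _ in matches)
        s3 = sum(math.log(ctf / float(Nt)) for _, _, ctf in matches)
        return C0 + C1 * norm * s1 + C2 * norm * s2 - C3 * norm * s3 + C4 * Qc

    # Documents are ranked by this value (or equivalently by Equation 2).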
2.2   Pseudo Relevance Feedback
Instead of performing direct retrieval of documents using the TREC2 logistic
regression algorithm described above, we have implemented a “pseudo relevance
feedback” step as part of the search processing. The algorithm used for pseudo
relevance feedback was developed by Chen [2] for an earlier search system used in CLEF.
Pseudo relevance feedback has become well-known in the information retrieval
community primarily because of its apparent ability to provide consistent im-
provements over initial search results as seen in TREC, CLEF and other retrieval
evaluations [6]. While the most commonly used algorithm for pseudo relevance
feedback is the Rocchio algorithm originally developed for the SMART system,
we have adopted a probabilistic approach based on the probabilistic term rele-
vance weighting formula developed by Robertson and Sparck Jones [9].
    Pseudo relevance feedback is typically performed in two stages. First, an ini-
tial search using the original topic statement is performed, after which a number
of terms are selected from some number of the top-ranked documents (which
are presumed to be relevant). The selected terms are then weighted and merged
with the initial query to formulate a new query. Finally, the reweighted
and expanded query is submitted against the same collection to produce a fi-
nal ranked list of documents. Obviously there are important choices to be made
regarding the number of top-ranked documents to consider, and the number of
terms to extract from those documents. For CHiC this year, we chose to use the
top 10 terms from the 10 top-ranked documents, since these parameters have
worked well in similar evaluations. The terms were chosen by extracting the
document vectors for each of the 10 top-ranked documents and computing the
Robertson and Sparck Jones term relevance weight for each term. This weight is
based on a contingency table counting the four combinations of (assumed)
relevance and whether or not the term occurs in a document. Table 1 shows this
contingency table.
             Table 1. Contingency table for term relevance weighting

                             Relevant Not Relevant
                  In doc     Rt       Nt − Rt       Nt
                  Not in doc R − Rt N − Nt − R + Rt N − Nt
                             R        N −R          N


   The relevance weight is calculated using the assumption that the first 10
documents are relevant and all others are not. For each term in these documents
the following weight is calculated:
                     w_t = \log \frac{R_t / (R - R_t)}{(N_t - R_t) / (N - N_t - R + R_t)}                        (4)

   The 10 terms (including those that appeared in the original query) with the
highest wt are selected and added to the original query terms. For the terms
not in the original query, the new “term frequency” (qtfi in main LR equation
above) is set to 0.5. Terms that were in the original query, but are not in the
top 10 terms are left with their original qtfi . For terms in the top 10 and in the
original query the new qtfi is set to 1.5 times the original qtfi for the query. The
new query is then processed using the same LR algorithm as shown in Equation 3,
and the ranked results are returned as the response for that topic.
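
   The reweighting just described can be summarized in the following Python
sketch. It is a simplified illustration of the steps above rather than the Cheshire II
implementation; in particular, the 0.5 smoothing added to each contingency-table
cell is a common precaution against zero counts and is an assumption here.

    import math

    def rsj_weight(Rt, R, Nt, N, s=0.5):
        # Equation 4, with an assumed 0.5 correction per cell to avoid zeros.
        return math.log(((Rt + s) / (R - Rt + s)) /
                        ((Nt - Rt + s) / (N - Nt - R + Rt + s)))

    def expand_query(query_qtf, term_Rt, doc_freq, R=10, N=1000000, k=10):
        """query_qtf: term -> original qtf; term_Rt: term -> number of the top-R
        documents containing the term; doc_freq: term -> collection document
        frequency. Returns the expanded query as term -> new qtf."""
        weights = {t: rsj_weight(rt, R, doc_freq[t], N) for t, rt in term_Rt.items()}
        top_terms = sorted(weights, key=weights.get, reverse=True)[:k]
        new_qtf = dict(query_qtf)            # query terms outside the top k keep their qtf
        for t in top_terms:
            if t in query_qtf:
                new_qtf[t] = 1.5 * query_qtf[t]   # in the query and in the top k
            else:
                new_qtf[t] = 0.5                  # newly added expansion term
        return new_qtf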


3     Approaches for CHiC Adhoc

In this section we describe the specific approaches taken for our submitted runs
for the CHiC Adhoc task. First we describe the indexing and term extraction
methods used, and then the search features we used for the submitted runs.


3.1   Indexing and Term Extraction

The Cheshire II system uses the XML structure of the documents to extract
selected portions for indexing and retrieval. Any combination of tags can be
used to define the index contents.
                Table 2. Cheshire II Indexes for Adhoc-CHiC 2013

           Name        Description        Content Tags                  Used
           URI         Europeana URI      europeana:uri, dc:identifier  no
           contributor Contributing Inst. dc:contributor                no
           subject     Topics             dc:subject                    no
           title       Title              dc:title                      no
           topic       Entire record      dc:subject, dc:description,   yes
                                          ims:fields



    Table 2 lists the indexes created by the Cheshire II system for the three
Adhoc-CHiC databases and the document elements from which the contents of
those indexes were extracted. The “Used” column in Table 2 indicates whether or
not a particular index was used in the submitted Adhoc-CHiC runs. As the table
shows, we used only the topic index, which contains most of the content-bearing
parts of the records, for all of our submitted runs. In addition to the databases
for the individual language collections, we used a Cheshire II feature that allows
searching across multiple databases and merging the ranked results for searches
of all three languages.
    For all indexing we used language-specific stoplists to exclude function words
and very common words from the indexing and searching. The German language
runs did not use decompounding to generate simple word forms from compounds
during indexing and querying. The Snowball stemmer was used by
Cheshire for language-specific stemming.
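
   As an illustration of this indexing step, the sketch below tokenizes text, drops
stopwords and applies Snowball stemming. It uses the NLTK Snowball stemmer and
a tiny made-up stoplist purely for demonstration, whereas the actual runs used
Cheshire II's own language-specific stoplist files and stemmer integration.

    from nltk.stem.snowball import SnowballStemmer  # Snowball stemming, as in Cheshire

    # A tiny illustrative stoplist; the real runs used full language-specific stoplists.
    GERMAN_STOPLIST = {"der", "die", "das", "und", "in", "von", "mit", "ein", "eine"}

    def extract_terms(text, language="german", stoplist=GERMAN_STOPLIST):
        """Lowercase, remove stopwords and apply language-specific stemming."""
        stemmer = SnowballStemmer(language)
        tokens = [t for t in text.lower().split() if t.isalpha()]
        return [stemmer.stem(t) for t in tokens if t not in stoplist]

    print(extract_terms("Die Geschichte der alten Bibliotheken in Europa"))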
               Table 3. Mean Average Precision for Different QRels

        RunID              EN      FR     DE     ENFRDE ML
        BerkMonoEN01       0.1842* 0.0000 0.0000 0.0645  0.0375
        BerkMonoFR02       0.0000 0.2014* 0.0000 0.0648  0.0325
        BerkMonoDE03       0.0000 0.0000 0.1757* 0.0857* 0.0440*
        BerkBiDEEN04       0.1942* 0.0000 0.0000 0.0641  0.0371*
        BerkBiFREN05       0.1744 0.0000 0.0000 0.0579   0.0333
        BerkBiFRDE06       0.0000 0.0000 0.1474 0.0679   0.0330
        BerkBiENFR07       0.0000 0.1608* 0.0000 0.0555  0.0284
        BerkBiDEFR08       0.0000 0.1296 0.0000 0.0391   0.0217
        BerkBiENDE09       0.0000 0.0000 0.1785* 0.0726* 0.0331
        BerkMLEN10         0.0000 0.0274 0.0379 0.0350   0.0166
        BerkMLFR11         0.0000 0.0896* 0.0185 0.0474  0.0231
        BerkMLDE12         0.0000 0.0205 0.1260* 0.0734  0.0385
        BerkMLDU13         0.0000 0.0244 0.0368 0.0337   0.0198
        BerkMLFI14         0.0000 0.0369 0.0336 0.0406   0.0273
        BerkMLIT15         0.0000 0.0222 0.0196 0.0207   0.0123
        BerkMLSP16         0.0000 0.0200 0.0254 0.0260   0.0148
        BerkMLALL17        0.0000 0.0485 0.0758 0.0736   0.0357
        BerkMLSPENFRDEIT18 0.0000 0.0597 0.0673 0.0737   0.0353
        BerkMLENFRDE19     0.0000 0.0642 0.0763 0.0803* 0.0393*


3.2   Search Processing

Searching the Adhoc-CHiC collection using the Cheshire II system involved using
TCL scripts to parse the topics and submit the title element from the topics.
For monolingual search tasks we used the topics in the appropriate language
(English, German, and French) and searched the specific language collection.
    For bilingual tasks the topics were translated from the source language to
the target language using Google Translate. This involved converting the original
topics to HTML (as if the set of topics was a web page), asking Google Translate
to translate that web page, and then converting the translated page back to the
topic schema. The database matching the language of the translated topics was
then searched.
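
   A rough sketch of that HTML round trip is given below, using only the Python
standard library. The markup and the id attribute are illustrative choices (not
the actual TCL scripts used for the runs), and the translation itself is performed
externally by Google Translate on the generated page.

    import html
    import re

    def topics_to_html(topics):
        # Wrap each (id, title) pair in a paragraph so the whole topic set can
        # be translated in one pass as a web page.
        items = "\n".join('<p id="%s">%s</p>' % (html.escape(tid), html.escape(title))
                          for tid, title in topics)
        return "<html><body>\n%s\n</body></html>" % items

    def html_to_topics(page):
        # Recover (id, translated title) pairs from the translated page.
        return re.findall(r'<p id="([^"]+)">(.*?)</p>', page)

    page = topics_to_html([("CHIC-001", "medieval manuscripts")])
    # ... the page is translated by Google Translate, then parsed back:
    print(html_to_topics(page))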
    For multilingual search tasks we used the original topics, but searched each
topic (without translation) in all three (EN, FR, DE) databases and combined
the results based on MINMAX normalized scores. The topics used include languages
other than the main language of the database (e.g., Dutch, Finnish and Italian),
searched without translation, as well as combinations of multiple languages for
a given topic.
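
   The result merging based on MINMAX normalized scores can be sketched as
follows. This is one plausible reading of the merging step (the exact combination
rule beyond the normalization is not spelled out above), so the tie handling and
the use of the maximum score for duplicate documents are assumptions.

    def minmax_normalize(results):
        # results: list of (doc_id, score); scale scores into [0, 1].
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return {doc: 1.0 for doc, _ in results}
        return {doc: (s - lo) / (hi - lo) for doc, s in results}

    def merge_runs(*runs):
        # Combine the EN, FR and DE result lists after MINMAX normalization;
        # if a document somehow appears twice, keep its best normalized score.
        merged = {}
        for run in runs:
            for doc, score in minmax_normalize(run).items():
                merged[doc] = max(merged.get(doc, 0.0), score)
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    en = [("en:doc1", 12.3), ("en:doc2", 9.1)]
    de = [("de:doc7", 4.2), ("de:doc3", 3.9)]
    print(merge_runs(en, de))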
    The scripts for each run submitted the topic elements as they appeared in
the topic or expanded topic to the system for TREC2 logistic regression search-
ing with pseudo feedback. Only the “title” topic element was used, and where
appropriate multiple titles in different languages were combined into a single
probabilistic query and searched using the “topic” index as described in Table 2.
                        Fig. 1. Berkeley Monolingual Runs


4   Results for Submitted Runs
The summary results, as Mean Average Precision (MAP), for the submitted
monolingual, bilingual and multilingual runs for the English, German and French
collections are shown in Table 3. In this table each column represents the MAP
results for a particular set of the task qrels. The EN column uses only qrels
for the English subcollection, and similarly for German (DE) and French (FR). The
ENFRDE column combines the qrels for the three languages, while the ML
column is the official MAP calculated using all language collections.
    The Recall-Precision curves for these runs are also shown in Figures 1 (for
monolingual) and 2 (for bilingual) and 3 (for Multilingual). In Figures 1 and 2
the names for the individual runs represent the language codes, which can easily
be compared with full names and descriptions in Table 3 (since each language
combination has only a single run).
    In Table 3, asterisks next to the scores mark the runs that had the highest
overall MAP within a given qrel set and subtask (monolingual, translated bilingual,
and multilingual).
    The results in Table 3 show, as might be expected, that the best results
usually come from the pure monolingual approach for a single language subcollection.
We again saw the interesting case (which has happened before in other CLEF and
NTCIR evaluations for us) where a query translated from another language into
English outperforms the “native” English query on the same collection: the
bilingual German to English translation did better in searching the English
collection than the original English topic.
                          Fig. 2. Berkeley Bilingual Runs


    Also not very surprising was the fact that searching a collection with a query
in a different language typically resulted in a MAP of zero or very near zero. What
is rather strange, however, is that no results appear to have been obtained from
the English collection for any of the multilingual searches (which were supposed
to be searching the English, German and French collections). Since MINMAX
normalization of scores was used, it seems unlikely that this was the result of a
scoring difference. It may be, however, that some system error caused the system
to fail to combine the results from all three language collections. We will be
checking into this later.


5   Conclusions

Our overall results this year compared poorly with the best performing sys-
tems, based on the track summary data on the DIRECT system. However, even
our best runs included only the results from the English, French, and German
subcollections, and no other languages, and this alone apparently was enough to
drastically lower the results. Interestingly, our monolingual run against only the
German collection was our best performing run (when evaluated using the full
multilingual qrels), which seems to indicate a predominance of German docu-
ments in the overall relevant set. It was also interesting that single-language
searches of the three collections using languages outside of the main content of
the collections, including Dutch, Finnish, Italian and Spanish, produced non-zero
results for each of the qrel sets except for English.
                         Fig. 3. Berkeley Multilingual Runs


This would seem to indicate that the multilingual topical metadata included in
the Europeana collections is providing a boost for multilingual search.


References
1. Aitao Chen. Multilingual information retrieval using english and chinese queries.
   In Carol Peters, Martin Braschler, Julio Gonzalo, and Michael Kluck, editors, Eval-
   uation of Cross-Language Information Retrieval Systems: Second Workshop of the
   Cross-Language Evaluation Forum, CLEF-2001, Darmstadt, Germany, September
   2001, pages 44–58. Springer Computer Science Series LNCS 2406, 2002.
2. Aitao Chen. Cross-Language Retrieval Experiments at CLEF 2002, pages 28–48.
   Springer (LNCS #2785), 2003.
3. Aitao Chen and Fredric C. Gey. Multilingual information retrieval using machine
   translation, relevance feedback and decompounding. Information Retrieval, 7:149–
   182, 2004.
4. W. S. Cooper, A. Chen, and F. C. Gey. Full Text Retrieval based on Probabilis-
   tic Equations with Coefficients fitted by Logistic Regression. In Text REtrieval
   Conference (TREC-2), pages 57–66, 1994.
5. William S. Cooper, Fredric C. Gey, and Daniel P. Dabney. Probabilistic retrieval
   based on staged logistic regression. In 15th Annual International ACM SIGIR Con-
   ference on Research and Development in Information Retrieval, Copenhagen, Den-
   mark, June 21-24, pages 198–210, New York, 1992. ACM.
6. Ray R. Larson. Probabilistic retrieval, component fusion and blind feedback for
   XML retrieval. In INEX 2005, pages 225–239. Springer (Lecture Notes in Computer
   Science, LNCS 3977), 2006.
7. Ray R. Larson. Cheshire at GeoCLEF 2007: Retesting text retrieval baselines.
   In 8th Workshop of the Cross-Language Evaluation Forum, CLEF 2007, Budapest,
   Hungary, September 19-21, 2007, Revised Selected Papers, LNCS 5152, pages 811–
   814, Budapest, Hungary, September 2008.
8. Ray R. Larson. Experiments in classification clustering and thesaurus expansion
   for domain specific cross-language retrieval. In 8th Workshop of the Cross-Language
   Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19-21, 2007, Revised
   Selected Papers, LNCS 5152, pages 188–195, Budapest, Hungary, September 2008.
9. S. E. Robertson and K. Sparck Jones. Relevance weighting of search terms. Journal
   of the American Society for Information Science, pages 129–146, May–June 1976.