Hummingbird Ottawa, Ontario, Canada stephen.tomlinson@hummingbird.com http://www.hummingbird.com/

Comparing the Robustness of Expansion Techniques and Retrieval Measures

Stephen Tomlinson


Robust Retrieval, Blind Feedback, First Relevant Score

Measurement, Performance, Experimentation

Hummingbird participated in the monolingual (Bulgarian, French, Hungarian, Portuguese and English) and robust (Dutch, English, French, German, Italian and Spanish) information retrieval tasks of the Ad-Hoc Track of the Cross-Language Evaluation Forum (CLEF) 2006. In all 22 of our experiments with blind feedback (a technique known to impair robustness across topics), the mean scores of the Average Precision, Geometric MAP and Precision@10 measures increased (and most of these increases were statistically significant), implying that these measures are not suitable as robust retrieval measures. In contrast, we found that measures based on just the first relevant item, such as a Generalized Success@10 measure, successfully discerned some robustness gains, particularly the robustness advantage of expanding Title queries by using the Description field instead of blind feedback.


Hummingbird SearchServer¹ is a toolkit for developing enterprise search and retrieval applications. The SearchServer kernel is also embedded in other Hummingbird products for the enterprise.

SearchServer works in Unicode internally [3] and supports most of the world’s major character sets and languages. The major conferences in text retrieval experimentation (CLEF [2], NTCIR [4] and TREC [7]) have provided judged test collections for objective experimentation with SearchServer in more than a dozen languages.

¹ SearchServer™, SearchSQL™ and Intuitive Searching™ are trademarks of Hummingbird Ltd. All other copyrights, trademarks and tradenames are the property of their respective owners.

[Table 1: sizes of the CLEF 2006 Ad-Hoc Track test collections for Portuguese, French, Bulgarian, Hungarian and English]

This paper describes experimental work with SearchServer for the task of finding relevant documents for natural language queries in various European languages using the CLEF 2006 Ad-Hoc Track test collections.

2 Methodology

2.1 Data

The CLEF 2006 Ad-Hoc Track document sets consisted of tagged (SGML-formatted) news articles in 5 different languages: Bulgarian, French, Hungarian, Portuguese and English. Table 1 gives the sizes.

The CLEF organizers created 50 natural language “topics” and translated them into many languages. Some topics were discarded for some languages because of a lack of relevant documents. Table 1 gives the final number of topics for each language and their average number of relevant documents (along with the lowest, median and highest number of relevant documents of any topic). For more information on the CLEF test collections, see the track overview paper.

2.2 Indexing

Our indexing approach was mostly the same as last year [11]. Accents were not indexed except for the combining breve in Bulgarian. The apostrophe was treated as a word separator for the investigated languages (except English). The custom text reader, cTREC, was updated to maintain support for the CLEF guidelines of only indexing specifically tagged fields.

Some stop words were excluded from indexing (e.g. “the”, “by” and “of” in English). For these experiments, the stop word lists for Bulgarian and Hungarian were based on Savoy’s updated lists [6].
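As a rough illustration of the indexing choices described above (accent removal with an exception for the Bulgarian combining breve, apostrophe handling and stop word exclusion), the following Python sketch may help. It is not SearchServer's implementation: the function names, the `keep_marks` parameter and the stop word set passed in by the caller are assumptions made for this example, and upper-casing is assumed here as a simple case-folding step.

```python
import re
import unicodedata

# Kept for Bulgarian so that "й" stays distinct from "и" (assumption of this sketch).
COMBINING_BREVE = "\u0306"


def strip_accents(token, keep_marks=frozenset()):
    """Drop combining marks (accents) except those listed in keep_marks."""
    decomposed = unicodedata.normalize("NFD", token)
    kept = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) or ch in keep_marks
    )
    return unicodedata.normalize("NFC", kept)


def tokenize(text, apostrophe_separates=True):
    """Split text into words; optionally treat the apostrophe as a separator."""
    text = unicodedata.normalize("NFC", text)
    if apostrophe_separates:
        pattern = r"[^\W_]+"
    else:
        # Keep word-internal apostrophes together, e.g. for English.
        pattern = r"[^\W_]+(?:['\u2019][^\W_]+)*"
    return re.findall(pattern, text)


def index_terms(text, stop_words, keep_marks=frozenset(), apostrophe_separates=True):
    """Return the upper-cased, accent-stripped terms that are not stop words."""
    terms = []
    for token in tokenize(text, apostrophe_separates):
        term = strip_accents(token, keep_marks).upper()
        if term not in stop_words:
            terms.append(term)
    return terms
```

For Bulgarian text, a caller would pass `keep_marks={COMBINING_BREVE}`; for English, `apostrophe_separates=False`.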

By default, the SearchServer index supports both exact matching (after some Unicode-based normalizations, such as decompositions and conversion to upper-case) and morphological matching (e.g. inflections, derivations and compounds, depending on the linguistic component used).

2.3 Searching

We experimented with the SearchServer CONTAINS predicate. Our test application specified SearchSQL to perform a boolean-OR of the query words. For example, for English topic 279, whose Title was “Swiss referendums”, a corresponding SearchSQL query would be:

    SELECT RELEVANCE('2:3') AS REL, DOCNO
    FROM CLEF06EN
    WHERE FT_TEXT CONTAINS 'Swiss'|'referendums'
    ORDER BY REL DESC;
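The construction of such a boolean-OR query from a topic Title can be sketched in Python as follows. This is only an illustration of the query shape shown above, not the actual test application: the function name and parameters are hypothetical, words are split on whitespace only, and escaping embedded single quotes by doubling them is an assumption borrowed from standard SQL.

```python
def build_or_query(title, table, relevance_method="2:3"):
    """Build a SearchSQL statement that ORs together the words of a topic Title."""
    words = title.split()
    if not words:
        raise ValueError("title must contain at least one word")
    # Assumption: embedded single quotes are escaped by doubling, as in standard SQL.
    quoted = ["'" + word.replace("'", "''") + "'" for word in words]
    return (
        f"SELECT RELEVANCE('{relevance_method}') AS REL, DOCNO "
        f"FROM {table} "
        f"WHERE FT_TEXT CONTAINS {'|'.join(quoted)} "
        f"ORDER BY REL DESC;"
    )
```

Calling `build_or_query("Swiss referendums", "CLEF06EN")` yields the statement for topic 279 on a single line.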

Most aspects of the SearchServer relevance value calculation are the same as described last year [11]. Briefly, SearchServer dampens the term frequency and adjusts for document length in a manner similar to Okapi [5] and dampens the inverse document frequency using an approximation of the logarithm. These calculations are based on the stems of the terms (roughly speaking) when doing morphological searching (i.e. when SET TERM_GENERATOR 'word!ftelp/inflect' was previously specified). The SearchServer RELEVANCE_METHOD setting was set to '2:3' and RELEVANCE_DLEN_IMP was set to 750 for all experiments in this paper.

2.4 Experimental Runs

For each language, we executed 5 experimental runs in May 2006, though just 3 were allowed to be submitted for official assessment. In the identifiers (e.g. “humBG06tde”), ‘t’, ‘d’ and ‘n’ indicate that the Title, Description and Narrative fields of the topic were used (respectively), and ‘e’ indicates that query expansion from blind feedback on the first 3 rows was used (weight of one-half on the original query, and one-sixth each on the 3 expanded rows). From the Description and Narrative fields for most languages, instruction words such as “find”, “relevant” and “document” were automatically removed (based on looking at some older topic lists, not this year’s topics; this step was skipped for Hungarian because we did not update our lists based on last year’s topics). All runs used inflections and/or derivations from stemming.
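The weighting used for blind feedback (one-half on the original query, one-sixth on each of the 3 expanded rows) can be illustrated with a small Python sketch. How SearchServer combines the queries internally is not described here, so the sketch simply assumes that each query is a bag of weighted terms and that the combination is linear; the function name and the dictionary representation are hypothetical.

```python
from fractions import Fraction


def expand_query(original, feedback_rows, original_weight=Fraction(1, 2)):
    """Linearly combine the original query with queries derived from the top rows.

    original and each entry of feedback_rows map a term to its weight.
    The weight left over after original_weight is split evenly across the
    feedback rows (one-sixth each when there are 3 rows).
    """
    if not feedback_rows:
        return {term: Fraction(w) for term, w in original.items()}
    row_weight = (1 - original_weight) / len(feedback_rows)
    combined = {}
    for term, w in original.items():
        combined[term] = combined.get(term, 0) + original_weight * w
    for row in feedback_rows:
        for term, w in row.items():
            combined[term] = combined.get(term, 0) + row_weight * w
    return combined
```

Exact fractions are used so that the one-half and one-sixth weights can be checked without rounding error.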

The 5 executed runs for each language:
• “t”: Just the Title field of the topic was used.
• “te”: Same as “t” except that blind feedback (based on the first 3 rows of the “t” query) was used to expand the query. (This run was not submitted.)
• “td”: Same as “t” except that the Description field was additionally used.
• “tde”: Same as “td” except that blind feedback (based on the first 3 rows of the “td” query) was used to expand the query.
• “tdn”: Same as “td” except that the Narrative field was additionally used. (This run was not submitted.)

3 Retrieval Measures

Traditionally, different retrieval measures have been used for “ad hoc” tasks, which seek relevant items for a topic, than for “known-item” tasks, which seek a particular known document. However, we argue that the known-item measures are not only applicable to ad hoc tasks, but that they are often preferable. For many ad hoc tasks, e.g. finding answer documents for questions, just one relevant item is needed. Also, the traditional ad hoc measures encourage retrieval of duplicate relevants, which does not correspond to user benefit.

The traditional known-item measures are very coarse, e.g. Success@10 is 1 or 0 for each topic, while reciprocal rank cannot produce a value between 1.0 and 0.5. Last year, we began investigating a new measure, Generalized Success@10 (GS10) (introduced as “First Relevant Score” (FRS) in [11]), which is defined below. This investigation led to the discovery that the blind feedback technique (a commonly used technique at CLEF, NTCIR and TREC, but not known to be popular in real systems) had a downside, namely that it pushes down the first relevant item (on average), as has now been verified not just for our own blind feedback approach, but on 7 other major blind feedback systems [9].

3.1 Primary Recall Measures

“Primary recall” is retrieval of the first relevant item for a topic. Primary recall measures include the following:
• Generalized Success@30 (GS30): For a topic, GS30 is $1.024^{1-r}$ where r is the rank of the first row for which a desired page is found, or zero if a desired page was not found. (This is an experimental new measure introduced in this paper; compared to GS10 (defined below), it further deemphasizes small differences at the top of the list.)
• Generalized Success@10 (GS10): For a topic, GS10 is $1.08^{1-r}$ where r is the rank of the first row for which a desired page is found, or zero if a desired page was not found.
• Success@n (S@n): For a topic, Success@n is 1 if a desired page is found in the first n rows, 0 otherwise. This paper lists Success@1 (S1) and Success@10 (S10) for all runs.
• Reciprocal Rank (RR): For a topic, RR is $1/r$ where r is the rank of the first row for which a desired page is found, or zero if a desired page was not found. “Mean Reciprocal Rank” (MRR) is the mean of the reciprocal ranks over all the topics.
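The primary recall measures listed above are simple functions of the rank r of the first relevant row, so they can be sketched in a few lines of Python. The function names are chosen for this example only, and a rank of `None` is assumed to represent a topic for which no desired page was found.

```python
def generalized_success(first_relevant_rank, base):
    """Return base ** (1 - r), or 0.0 if no relevant row was found."""
    if first_relevant_rank is None:
        return 0.0
    return base ** (1 - first_relevant_rank)


def gs10(first_relevant_rank):
    """Generalized Success@10: 1.08 ** (1 - r)."""
    return generalized_success(first_relevant_rank, 1.08)


def gs30(first_relevant_rank):
    """Generalized Success@30: 1.024 ** (1 - r)."""
    return generalized_success(first_relevant_rank, 1.024)


def success_at_n(first_relevant_rank, n):
    """Success@n: 1 if the first relevant row is within the first n rows, else 0."""
    return 1 if first_relevant_rank is not None and first_relevant_rank <= n else 0


def reciprocal_rank(first_relevant_rank):
    """Reciprocal Rank: 1 / r, or 0.0 if no relevant row was found."""
    return 0.0 if first_relevant_rank is None else 1.0 / first_relevant_rank


def mean_reciprocal_rank(first_relevant_ranks):
    """MRR: the mean of the reciprocal ranks over all topics."""
    return sum(reciprocal_rank(r) for r in first_relevant_ranks) / len(first_relevant_ranks)
```

The bases 1.08 and 1.024 are chosen so that the score is close to 0.5 at ranks 10 and 30 respectively.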

Interpretation of Generalized Success@n: GS30 and GS10 are estimates of the percentage of potential result list reading the system saved the user to get to the first relevant item, assuming that users are less and less likely to continue reading as they get deeper into the result list.

Comparison of GS10 and Reciprocal Rank: Both GS10 and RR are 1.0 if a desired page is found at rank 1. At rank 2, GS10 is just 7 points lower (0.93), whereas RR is 50 points lower (0.50). At rank 3, GS10 is another 7 points lower (0.86), whereas RR is 17 points lower (0.33). At rank 10, GS10 is 0.50, whereas RR is 0.10. GS10 is greater than RR for ranks 2 to 52 and lower for ranks 53 and beyond.
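The figures in this comparison can be reproduced with a short Python sketch. The helper names are hypothetical; the formulas are the GS10 and RR definitions given earlier.

```python
def gs10(rank):
    """Generalized Success@10 for a first relevant item at the given rank."""
    return 1.08 ** (1 - rank)


def reciprocal_rank(rank):
    """Reciprocal Rank for a first relevant item at the given rank."""
    return 1.0 / rank


def comparison_rows(ranks):
    """Return (rank, GS10, RR) tuples with the scores rounded to 2 decimals."""
    return [(r, round(gs10(r), 2), round(reciprocal_rank(r), 2)) for r in ranks]


def crossover_rank(max_rank=1000):
    """Return the first rank >= 2 at which RR exceeds GS10, or None if there is none."""
    for rank in range(2, max_rank + 1):
        if reciprocal_rank(rank) > gs10(rank):
            return rank
    return None
```

`comparison_rows((1, 2, 3, 10))` gives the GS10 and RR values quoted for those ranks, and `crossover_rank()` locates the rank from which RR becomes the larger of the two.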

Connection of GS10 to Success@10: GS10 is considered a generalization of Success@10 because it rounds to 1 for r ≤ 10 and to 0 for r > 10. (Similarly, GS30 is considered a generalization of Success@30 because it rounds to 1 for r ≤ 30 and to 0 for r > 30.)

3.2 Secondary Recall Measures

“Secondary recall” is retrieval of the additional relevant items for a topic (after the first one). Secondary recall measures place most of their weight on these additional relevant items.
• Precision@n: For a topic, “precision” is the percentage of retrieved documents which are relevant. “Precision@n” is the precision after n documents have been retrieved. This paper lists Precision@10 (P10) for all runs.
• Average Precision (AP): For a topic, AP is the average of the precision after each relevant document is retrieved (using zero as the precision for relevant documents which are not retrieved). By convention, AP is based on the first 1000 retrieved documents for the topic. The score ranges from 0.0 (no relevants found) to 1.0 (all relevants found at the top of the list). “Mean Average Precision” (MAP) is the mean of the average precision scores over all of the topics (i.e. all topics are weighted equally).
• Geometric MAP (GMAP): GMAP (introduced in [13]) is the primary measure for the “robust task” this year. It is based on “Log Average Precision” which for a topic is the natural log of the max of 0.00001 and the average precision. GMAP is the exponential of the mean log average precision. (We argue in [9] that primary recall measures better reflect robustness than GMAP.)
• GMAP’: We also define a linearized log average precision measure (denoted GMAP’) which linearly maps the “log average precision” values to the [0, 1] interval. For statistical significance purposes, GMAP’ gives the same results as GMAP, and it has advantages such as that the individual topic differences are in the familiar −1.0 to 1.0 range and are on the same scale as the mean. Table 2 shows examples of the mapping of the AP and GMAP’ scores for a topic; for example, the table shows that for GMAP, an AP increase from 0.00001 to 0.01 is considered more important than an increase from 0.01 to 1.0 (these are differences of 0.6 and 0.4 respectively in GMAP’). (This example illustrates one of our concerns with GMAP, which is that small differences likely to be unimportant to a user can be dramatically amplified.)
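The secondary recall measures can be sketched in Python as follows. The function names and the list-of-booleans representation of a ranked result list are assumptions of this example. The exact linear mapping behind GMAP’ is not spelled out in the text; the sketch assumes the map that sends log(0.00001) to 0 and log(1.0) to 1, which reproduces the 0.6 and 0.4 differences quoted above.

```python
import math

AP_FLOOR = 0.00001  # lower bound applied to AP before taking the log


def precision_at_n(relevance_flags, n):
    """Fraction of the first n retrieved rows that are relevant."""
    return sum(relevance_flags[:n]) / n


def average_precision(relevance_flags, num_relevant, cutoff=1000):
    """Average of the precision at each retrieved relevant row.

    Relevant documents that are not retrieved within the cutoff contribute zero.
    """
    if num_relevant <= 0:
        raise ValueError("num_relevant must be positive")
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags[:cutoff], start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / num_relevant


def log_average_precision(ap):
    """Natural log of the max of AP_FLOOR and the average precision."""
    return math.log(max(AP_FLOOR, ap))


def gmap(ap_scores):
    """Exponential of the mean log average precision."""
    return math.exp(sum(log_average_precision(ap) for ap in ap_scores) / len(ap_scores))


def gmap_prime_score(ap):
    """Linearized log AP (assumed map): AP_FLOOR -> 0.0 and 1.0 -> 1.0."""
    return 1.0 - log_average_precision(ap) / math.log(AP_FLOOR)
```

Under this assumed mapping, `gmap_prime_score(0.01)` is 0.6, so moving from an AP of 0.00001 to 0.01 counts for more than moving from 0.01 to 1.0.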

3.3 Statistical Significance Tables

For tables comparing 2 diagnostic runs (such as Table 4), the columns are as follows:
• “Expt” specifies the experiment. The language code is given, followed by the labels of the 2 runs being compared. The difference is the first run minus the second run. For example, “BG-td-t” specifies the difference of subtracting the scores of the Bulgarian ‘t’ run from the Bulgarian ‘td’ run (of Table 3).
• “ΔGS30” is the difference of the mean GS30 scores of the two runs being compared (and “ΔGS10” is the difference of the mean GS10 scores, etc.).
• “95% Conf” is an approximate 95% confidence interval for the difference (calculated from plus/minus twice the standard error of the mean difference). If zero is not in the interval, the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of neutral impact (on average), though if the average difference is small (e.g. <0.020) it may still be too minor to be considered “significant” in the magnitude sense.
• “vs.” is the number of topics on which the first run scored higher, lower and tied (respectively) compared to the second run. These numbers should always add to the number of topics.
• “3 Extreme Diffs (Topic)” lists 3 of the individual topic differences, each followed by the topic number in brackets. The first difference is the largest one of any topic (based on the
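The mean difference, the “95% Conf” interval and the “vs.” counts described above can be computed from per-topic scores as in the following Python sketch. The function name and the returned dictionary are hypothetical, and the use of the sample standard deviation (n − 1 denominator) for the standard error is an assumption, since the text does not state which estimator was used.

```python
import math
import statistics


def compare_runs(first_scores, second_scores):
    """Compare two runs topic by topic (first run minus second run)."""
    if len(first_scores) != len(second_scores) or len(first_scores) < 2:
        raise ValueError("need the same topics, at least 2 of them, for both runs")
    diffs = [a - b for a, b in zip(first_scores, second_scores)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference (sample standard deviation assumed).
    std_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    interval = (mean_diff - 2 * std_error, mean_diff + 2 * std_error)
    higher = sum(1 for d in diffs if d > 0)
    lower = sum(1 for d in diffs if d < 0)
    tied = len(diffs) - higher - lower
    return {
        "mean_diff": mean_diff,
        "conf_95": interval,
        "vs": (higher, lower, tied),
        # Significant at the 5% level when zero lies outside the interval.
        "significant": not (interval[0] <= 0.0 <= interval[1]),
    }
```

The three “vs.” counts always add up to the number of topics, which gives a quick sanity check on any such table.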

4 Results of Query Expansion Experiments

4.1 Expansion of Title Queries

Table 4 shows that expanding the Title queries by adding the Description field increased the mean score for all investigated measures (GS30, GS10, MRR, P10, GMAP and MAP), including at least one statistically significant increase for each measure. Adding the Description is a “robust” technique that can sometimes improve a poor result from just using the Title field.

Table 5 shows that expanding the Title queries via blind feedback of the first 3 rows did not produce any statistically significant increases for the primary recall measures (GS30, GS10, MRR), even though it produced statistically significant increases for the secondary recall measures (P10, GMAP, MAP). Blind feedback is not a robust technique in that it is unlikely to improve poor results. (In a larger experiment, we would expect the primary recall measures to show statistically significant decreases, as we saw for Bulgarian last year [11].)

Table 6 compares the results of the two title-expansion approaches. For each primary recall measure (GS30, GS10, MRR), there is at least one positive statistically significant difference, reflecting the robustness of using the Description instead of blind feedback. However, there are no statistically significant differences in the secondary recall measures (P10, GMAP, MAP); these measures do not discern the higher robustness of the “td” run compared to the “te” run.

4.2 Expansion of “Title+Desc” Queries

Table 7 shows that expanding the Description queries by adding the Narrative field tended to be beneficial for both primary and secondary recall measures, though not as consistently as was adding the Description to the Title queries. (Sometimes the Narrative field specifies what is not relevant.)

Table 8 shows that blind feedback on the “Title+Desc” queries produced many statistically significant increases for the secondary recall measures (P10, GMAP, MAP). We also see one statistically significant increase for a primary recall measure (for Hungarian), which we suspect is a Type I error, because it does not fit the pattern we have seen over several other experiments [11, 8, 10, 9] (including last year’s Hungarian experiment, for which mean GS10 was down slightly with blind feedback [11]).

Table 9 compares the results of the two expansion approaches for “Title+Desc” queries. The Narrative was modestly beneficial for the primary recall measures compared to blind feedback, reflecting a robustness advantage, even though blind feedback boosted the secondary recall measures a little more.

5 Robust Task Results

The “Robust Task” re-used the old test collections for Dutch, English, French, German, Italian and Spanish from CLEF 2001-2003. Of the 160 old topics, 60 were allowed to be used for new “training”, leaving the other 100 for “testing”. Participants were encouraged to train on the GMAP measure, though we believe primary recall measures better reflect robustness. We actually did not do any new training for this task.

Note that even though the document sets were not always the same for each language in 2001, 2002 and 2003, a fixed document set was used for each language in this task. Hence there may be more unjudged relevant items than usual. Unfortunately, we did not have time to look at metrics on just judged items for this paper.

Table 10 lists the mean scores of our submitted Robust Task runs. For each language, we submitted a “td” run (no blind feedback) and a “tde” run (incorporating blind feedback based on the first 3 rows of “td”). Even though blind feedback is known to tend to make results less robust, the GMAP score was higher with blind feedback in all cases (as were P10 and MAP).

Tables 11 and 12 isolate the impact of blind feedback on each measure. The impact on the primary recall measures tended to be detrimental, including a statistically significant decrease on the Spanish training topics. The increases on the secondary recall measures were mostly statistically significant. While this generally fits the pattern we have seen in other experiments (e.g. [9]), the negative impact on the primary recall measures seems to be less strong than we have seen elsewhere. Perhaps the old CLEF topics tend to be “easier” than, say, the old TREC topics used at RIA [9], providing relatively fewer cases for which blind feedback would be detrimental.

6 Conclusions

For all 22 blind feedback experiments reported in this paper, the mean scores for MAP, GMAP and P10 were up with blind feedback, and most of these increases were statistically significant. As blind feedback is known to be bad for robustness (because of its tendency to “not help (and frequently hurt) the worst performing topics” [12]), we conclude that none of these 3 measures should be used as robustness measures.

Measures based on just the first relevant item (i.e. primary recall measures such as GS30 and GS10) reflect robustness. In this paper, we found in particular that these measures discerned the robustness advantage of expanding Title queries by using the Description field instead of blind feedback, while the secondary recall measures (MAP, GMAP, P10) did not.

These results are consistent with what we have seen elsewhere [11, 8, 10, 9]. For example, in [9], 7 other groups’ blind feedback systems were studied, and it was found that blind feedback was detrimental to the first relevant item (on average), even though it boosted the secondary recall measures.

A paper at the recent SIGIR conference [1] gives a theoretical explanation for why different retrieval approaches are superior when seeking just one relevant item instead of several. In particular, it finds that when seeking just one relevant item, it can theoretically be advantageous to use negative pseudo-relevance feedback to encourage more diversity in the results.

To encourage more research in robust retrieval, probably the simplest thing the organizers of ad hoc tracks could do would be to use a measure based on just the first relevant item (e.g. GS10 or GS30) as the primary measure for the ad hoc task. Participants would then find it detrimental to use the non-robust blind feedback technique, but potentially would be rewarded for finding ways of producing more diverse results.

References

[1] Harr Chen and David R. Karger. Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents. SIGIR 2006, pp. 429-436.

[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/

[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. Sixteenth International Unicode Conference, 2000.

[4] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page. http://research.nii.ac.jp/~ntcadm/index-en.html

[5] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu and M. Gatford. Okapi at TREC-3. Proceedings of TREC-3, 1995.

[6] Jacques Savoy. CLEF and Multilingual information retrieval resource. http://www.unine.ch/info/clef/

[7] Text REtrieval Conference (TREC) Home Page. http://trec.nist.gov/

[8] Stephen Tomlinson. CJK Experiments with Hummingbird SearchServer™ at NTCIR-5. Proceedings of NTCIR-5, 2005.

[9] Stephen Tomlinson. Early Precision Measures: Implications from the Downside of Blind Feedback. SIGIR 2006, pp. 705-706.

[10] Stephen Tomlinson. Enterprise, QA, Robust and Terabyte Experiments with Hummingbird SearchServer™ at TREC 2005. Proceedings of TREC 2005.

[11] Stephen Tomlinson. European Ad Hoc Retrieval Experiments with SearchServer™ at CLEF 2005. Working Notes for the CLEF 2005 Workshop.

[12] Ellen Voorhees. Overview of the TREC 2003 Robust Retrieval Track. Proceedings of TREC 2003.

[13] Ellen Voorhees. Overview of the TREC 2004 Robust Retrieval Track. Proceedings of TREC 2004.