
Experiments in 8 European Languages with Hummingbird SearchServer™ at CLEF 2002

Stephen Tomlinson, Hummingbird, Ottawa, Ontario

Hummingbird submitted ranked result sets for all Monolingual Information Retrieval tasks of the Cross-Language Evaluation Forum (CLEF) 2002. Enabling stemming in SearchServer increased average precision by 16 points in Finnish, 9 points in German, 4 points in Spanish, 3 points in Dutch, 2 points in French and Italian, and 1 point in Swedish and English. Accent-indexing increased average precision by 3 points in Finnish and 2 points in German, but decreased it by 2 points in French and 1 point in Italian and Swedish. Treating apostrophes as word separators increased average precision by 3 points in French and 1 point in Italian. Confidence intervals produced using the bootstrap percentile method were found to be very similar to those produced using the standard method; both were of similar width to rank-based intervals for differences in average precision, but substantially narrower for differences in Precision@10.

Introduction


For the experiments described in this paper, an internal development build of SearchServer 5.3 was used (5.3.500.279).

[Table 1: document set sizes. Languages listed: German, Spanish, Dutch, Swedish, English, Italian, French, Finnish.]

The CLEF 2002 document sets consisted of tagged (SGML-formatted) news articles (mostly from 1994) in 8 different languages: German, French, Italian, Spanish, Dutch, Swedish, Finnish and English. Table 1 gives their sizes. For more information on the CLEF collections, see the CLEF web site [2].

2.2 Text Reader

The custom text reader called cTREC, originally written for handling TREC collections [12], handled expansion of the library files of the CLEF collections and was extended to support the CLEF guidelines of only indexing specific fields of specific documents. The entities described in the DTD files were also converted, e.g. “&equals;” was converted to the equal sign “=”.

The documents were assumed to be in the Latin-1 character set, the code page which, for example, assigns e-acute (é) hexadecimal 0xe9 or decimal 233. cTREC passes through the Latin-1 characters, i.e. does not convert them to Unicode. SearchServer’s Translation Text Reader (nti) was chained on top of cTREC, and the Win_1252_UCS2 translation was specified via its /t option to translate from Latin-1 to the Unicode character set desired by SearchServer.

2.3 Indexing

A separate SearchServer table was created for each language with a SearchSQL statement such as the following:

CREATE SCHEMA CLEF02DE CREATE TABLE CLEF02DE
(DOCNO VARCHAR(256) 128)
TABLE_LANGUAGE 'GERMAN'
STOPFILE 'LANGDE.STP'
PERIODIC
BASEPATH 'e:\data\clef';

The TABLE_LANGUAGE parameter specifies which language to use when performing stemming operations at index time. The STOPFILE parameter specifies a stop file, which typically contains a couple hundred stop words that are not indexed; the stop file also contains instructions on changes to the default indexing rules, for example, to enable accent-indexing or to change the apostrophe to a word separator. Here are the first few lines of the stop file used for the French task:

PST = "'`"
STOPLIST =
a
à
afin

The IAC line of the stop file enables indexing of the specified accents (Unicode combining diacritical marks 0x0300-0x0345). Accent-indexing was enabled for all runs except the Italian and English runs. Accents were known to be specified in the Italian queries but were not consistently used in the Italian documents.

The PST line adds the specified characters (apostrophes in this case) to the list of word separators. Apostrophes were changed to word separators for all runs except the English runs.
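The effect of these two stop-file instructions can be illustrated outside SearchServer. The following Python sketch is only an approximation under stated assumptions: the function `tokenize` and its parameters `index_accents` and `apostrophe_separates` are hypothetical names, tokens are split on whitespace only, and the normalization (upper-case, decomposed form, optional removal of combining marks U+0300 to U+0345) is a simplification of what the index actually does.

```python
import re
import unicodedata

def tokenize(text, index_accents=True, apostrophe_separates=True):
    """Toy index-time normalization; not SearchServer's actual implementation."""
    # Upper-case, then decompose, so that "é" becomes "E" followed by U+0301.
    text = unicodedata.normalize("NFD", text.upper())
    if not index_accents:
        # Drop combining diacritical marks in the range U+0300-U+0345.
        text = "".join(ch for ch in text if not 0x0300 <= ord(ch) <= 0x0345)
    if apostrophe_separates:
        # Treat the apostrophe and backquote as word separators.
        text = re.sub("['`]", " ", text)
    return text.split()

print(tokenize("l'école"))
print(tokenize("l'école", index_accents=False))
print(tokenize("l'école", index_accents=False, apostrophe_separates=False))
```

With accents dropped, “Kaurismäki” and “Kaurismaki” produce the same token; with accents indexed, they do not.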

Into each table, we just needed to insert one row, specifying the top directory of the library files for the language, using an Insert statement such as the following:

INSERT INTO CLEF02DE ( FT_SFNAME, FT_FLIST ) VALUES ('German','cTREC/E/d=128:s!nti/t=Win_1252_UCS2:cTREC/C/@:s');

To index each table, we just executed a Validate Index statement such as the following:

VALIDATE INDEX CLEF02DE VALIDATE TABLE;

By default, the index supports both exact matching (after some Unicode-based normalizations, such as converting to upper-case and decomposed form) and matching on stems.

3 Search Techniques

The CLEF organizers created 50 “topics” (numbered 91-140) and translated them into many languages. Each topic contained a “Title” (subject of the topic), “Description” (a one-sentence specification of the information need) and “Narrative” (more detailed guidelines for what a relevant document should or should not contain). The participants were asked to use the Title and Description fields for at least one automatic submission per task this year to facilitate comparison of results.

We created an ODBC application, called QueryToRankings.c, based on the example stsample.c program included with SearchServer, to parse the CLEF topics files, construct and execute corresponding SearchSQL queries, fetch the top 1000 rows, and write out the rows in the results format requested by CLEF. SELECT statements were issued with the SQLExecDirect API call. Fetches were done with SQLFetch (typically 1000 SQLFetch calls per query).

3.1 Intuitive Searching

For all runs, we used SearchServer’s Intuitive Searching, i.e. the IS ABOUT predicate of SearchSQL, which accepts unstructured text. For example, for the German version of topic 41 (from last year), the Title was “Pestizide in Babykost” (Pesticides in Baby Food), and the Description was “Berichte über Pestizide in Babynahrung sind gesucht” (Find reports on pesticides in baby food). A corresponding SearchSQL query would be:

SELECT RELEVANCE('V2:3') AS REL, DOCNO
FROM CLEF02DE
WHERE FT_TEXT IS ABOUT 'Pestizide in Babykost Berichte über Pestizide in Babynahrung sind gesucht'
ORDER BY REL DESC;

This query would create a working table with the 2 columns named in the SELECT clause: a REL column containing the relevance value of the row for the query, and a DOCNO column containing the document’s identifier. The ORDER BY clause specifies that the most relevant rows should be listed first. The statement “SET MAX_SEARCH_ROWS 1000” was previously executed so that the working table would contain at most 1000 rows.

3.2 Stemming

SearchServer “stems” each distinct word to one or more base forms, called stems. For example, in English, “baby”, “babied”, “babies”, “baby’s” and “babying” all have “baby” as a stem. Compound words in German, Dutch and Finnish produce multiple stems; e.g., in German, “babykost” has “baby” and “kost” as stems. SearchServer 5.3 uses the lexicon-based Inxight LinguistX Platform 3.3.1 for stemming operations.

By default, Intuitive Searching stems each word in the query, counts the number of occurrences of each stem, and creates a vector. Optionally some stems are discarded (secondary term selection) if they have a high document frequency or to enforce a maximum number of stems, but we didn’t discard any for our CLEF runs. The index is searched for documents containing terms which stem to any of the stems of the vector.
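The construction of the stem vector can be sketched in a few lines of Python. This is an illustration only: `TOY_STEMS` and `stem_vector` are hypothetical names, and the toy lexicon stands in for the real lexicon-based stemmer. Only the “babykost” entry comes from the text above; the other entries are assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy lexicon mapping a word to its stems.
# Only "babykost" -> baby + kost is taken from the paper; the rest is assumed.
TOY_STEMS = {
    "babykost": ["baby", "kost"],
    "babynahrung": ["baby", "nahrung"],
    "pestizide": ["pestizid"],
}

def stem_vector(query, lexicon=TOY_STEMS):
    """Stem each query word and count the occurrences of each stem."""
    vector = Counter()
    for word in query.lower().split():
        # Words missing from the toy lexicon are treated as their own stem.
        for stem in lexicon.get(word, [word]):
            vector[stem] += 1
    return vector

print(stem_vector("Pestizide in Babykost Pestizide Babynahrung"))
```

Because a compound contributes each of its parts, “baby” is counted once for “Babykost” and once for “Babynahrung”.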

The VECTOR_GENERATOR set option controls which stemming operations are performed by Intuitive Searching. To enable stemming, we used the same setting for each language except for the /lang parameter. For example, for German, the setting was ‘word!ftelp/lang=german/base/noalt | * | word!ftelp/lang=german/inflect’. To disable stemming, the setting was just ‘’.

Besides linguistic expansion from stemming, we did not do any other kinds of query expansion. For example, we did not use approximate text searching for spell-correction because the queries were believed to be spelled correctly. We did not use row expansion or any other kind of blind feedback technique.

3.3 Statistical Relevance Ranking

SearchServer calculates a relevance value for a row of a table with respect to a vector of stems based on several statistics. The inverse document frequency of the stem is estimated from information in the dictionary. The term frequency (the number of occurrences of the stem in the row, including any term that stems to it) is determined from the reference file. The length of the row (based on the number of indexed characters in all columns of the row, which is typically dominated by the external document) is optionally incorporated. The already-mentioned count of the stem in the vector is also used. To synthesize this information into a relevance value, SearchServer dampens the term frequency and adjusts for document length in a manner similar to Okapi [6] and dampens the inverse document frequency in a manner similar to [8]. SearchServer’s relevance values are always integers in the range 0 to 1000.
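SearchServer’s exact formula is not given here, so the following Python sketch shows only a generic Okapi BM25-style term weight, to make the ideas of a dampened term frequency and a document-length adjustment concrete. The function name `okapi_style_weight`, the parameters `k1` and `b`, and the particular inverse document frequency variant are assumptions for illustration; they are not SearchServer settings.

```python
import math

def okapi_style_weight(tf, doc_len, avg_doc_len, df, num_docs,
                       query_count=1, k1=1.2, b=0.75):
    """Okapi BM25-style term weight; illustrative only, not SearchServer's formula."""
    # Length normalization: b=0 ignores document length, b=1 applies it fully.
    norm = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    # Dampened term frequency: grows with tf but levels off.
    tf_part = tf * (k1 + 1.0) / (tf + norm)
    # Inverse document frequency (a smoothed variant that stays positive).
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # The count of the stem in the query vector scales the contribution.
    return query_count * tf_part * idf

one = okapi_style_weight(tf=1, doc_len=100, avg_doc_len=100, df=10, num_docs=1000)
two = okapi_style_weight(tf=2, doc_len=100, avg_doc_len=100, df=10, num_docs=1000)
print(one, two)
```

Doubling the term frequency raises the weight by less than a factor of two, and a longer-than-average document receives a lower weight for the same term frequency.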

SearchServer’s RELEVANCE_METHOD setting can be used to optionally square the importance of the inverse document frequency (by choosing a RELEVANCE_METHOD of ‘V2:4’ instead of ‘V2:3’). The importance of document length to the ranking is controlled by SearchServer’s RELEVANCE_DLEN_IMP setting (scale of 0 to 1000). For all runs in this paper, RELEVANCE_METHOD was set to ‘V2:3’ and RELEVANCE_DLEN_IMP was set to 750.

3.4 Query Stop Words

Our QueryToRankings program removed words such as “find”, “relevant” and “document” from the topics before presenting them to SearchServer, i.e. words which are not stop words in general but were commonly used in the CLEF topics as general instructions. For the submitted runs, the lists were developed by examining the CLEF 2000 and 2001 topics (not this year’s topics). For the diagnostic runs in this paper, “finde” was added as a query stop word because it was noticed to be common in the German topics this year. An evaluation of the impact of query stop words is provided below.
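As a rough illustration of this preprocessing step, the Python sketch below drops instruction-like words from a topic before it would be passed to the search engine. The list `QUERY_STOP_WORDS` and the function `strip_query_stop_words` are hypothetical; the actual lists used by QueryToRankings are not reproduced in this paper, and the original program was written in C.

```python
# Hypothetical list for illustration; the paper's actual lists are not reproduced here.
QUERY_STOP_WORDS = {"find", "relevant", "document", "documents"}

def strip_query_stop_words(topic_text, stop_words=QUERY_STOP_WORDS):
    """Remove instruction-like words from a topic, keeping the remaining word order."""
    kept = [
        word for word in topic_text.split()
        # Compare case-insensitively and ignore surrounding punctuation.
        if word.strip(".,;:!?").lower() not in stop_words
    ]
    return " ".join(kept)

print(strip_query_stop_words("Find reports on pesticides in baby food."))
```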

[Table 2: columns Run and AvgP; rows for Finnish, German, Spanish, Dutch, French, Italian, Swedish and English.]

The evaluation measures are likely explained in an appendix of this volume. Briefly: “Precision” is the percentage of retrieved documents which are relevant. “Precision@n” is the precision after n documents have been retrieved. “Average precision” for a topic is the average of the precision after each relevant document is retrieved (using zero as the precision for relevant documents which are not retrieved). “Recall” is the percentage of relevant documents which have been retrieved. “Interpolated precision” at a particular recall level for a topic is the maximum precision achieved for the topic at that or any higher recall level. For a set of topics, the measure is the average of the measure for each topic (i.e. all topics are weighted equally).
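The definitions above translate directly into code. The following Python sketch computes the per-topic measures from a ranked list of document identifiers and a set of relevant identifiers; the function names are hypothetical, and the official scores in this paper come from trec_eval, not from this code. The sketch assumes the topic has at least one relevant document.

```python
def precision_at(ranked, relevant, n):
    """Precision after n documents have been retrieved."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

def average_precision(ranked, relevant):
    """Average of the precision after each relevant document is retrieved.

    Relevant documents that are never retrieved contribute zero.
    """
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def interpolated_precision(ranked, relevant, recall_level):
    """Maximum precision achieved at the given or any higher recall level."""
    best, hits = 0.0, 0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            if hits / len(relevant) >= recall_level:
                best = max(best, hits / rank)
    return best

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3", "d9"}
print(precision_at(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(interpolated_precision(ranked, relevant, 0.5))
```

In this example, the relevant document “d9” is never retrieved, so it lowers the average precision without affecting Precision@2.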

The Monolingual Information Retrieval tasks were to run 50 queries against document collections in the same language and submit a list of the top-1000 ranked documents to CLEF for judging (in June 2002). CLEF produced a “qrels” file for each of the 8 tasks: a list of documents judged to be relevant or not relevant for each topic. From these, the evaluation measures were calculated with Chris Buckley’s trec_eval program.

For some topics and languages, no documents were judged relevant. The precision scores are averaged only over the topics for which at least one document was judged relevant.

4.1 Impact of Stemming

Most of the remaining tables will focus on one particular precision measure (usually average precision), comparing the scores when a particular feature (such as stemming) is enabled to when it is disabled. The columns of these tables are as follows:

• “Experiment” is the language and topic fields used (for example, “-td” indicates the Title and Description fields were used).
• “AvgDiff” is the average difference in the precision score. In [9], a difference of at least 2 full points (i.e. >=0.020) is considered “noticeable”, 4 points “material”, 6 points “striking” and 8 points “dramatic”.
• “95% Confidence” is an approximate 95% confidence interval for the average difference calculated using the bootstrap percentile method (described in the last section). If zero is not in the interval, the result is “statistically significant” (at the 5% level), i.e. the feature is unlikely to be of neutral impact, though if the average difference is small (e.g. <0.020) it may still be too minor to be considered “significant” in the magnitude sense.
• “vs.” is the number of topics on which the precision was higher, lower and tied (respectively) with the feature enabled. These numbers should always add to the number of topics for the language (as per Table 2).
• “2 Largest Diffs (Topic)” lists the two largest differences in the precision score (based on the absolute value), with each followed by the corresponding topic number in brackets (the topic numbers range from 91 to 140).
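To make the column definitions concrete, the following Python sketch derives “AvgDiff”, “vs.” and “2 Largest Diffs (Topic)” from two sets of per-topic scores. The function `compare_runs` and the scores in the example are hypothetical; they are not results from this paper. The confidence interval column is omitted here.

```python
def compare_runs(with_feature, without_feature):
    """Summarize per-topic score differences between two runs.

    Both arguments map a topic number to a precision score.
    """
    diffs = {t: with_feature[t] - without_feature[t] for t in with_feature}
    avg_diff = sum(diffs.values()) / len(diffs)
    higher = sum(1 for d in diffs.values() if d > 0)
    lower = sum(1 for d in diffs.values() if d < 0)
    tied = len(diffs) - higher - lower
    # Two largest differences by absolute value, each paired with its topic number.
    largest = sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True)[:2]
    return avg_diff, (higher, lower, tied), [(d, t) for t, d in largest]

# Made-up scores for four topics, chosen only to exercise each column.
enabled = {91: 0.5, 92: 0.25, 93: 0.75, 94: 0.5}
disabled = {91: 0.25, 92: 0.75, 93: 0.75, 94: 0.375}
print(compare_runs(enabled, disabled))
```

The three “vs.” counts always sum to the number of topics compared, as the column definition requires.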

Table 3 shows the impact of stemming on the average precision measure. The benefit for Finnish and German, for which stemming includes compound-breaking, is dramatic. For example, Finnish topic 115, regarding “avioerotilastoja” (divorce statistics), apparently benefits from compound-breaking. Surprisingly, the other investigated language for which compounds are broken, Dutch, does not similarly stand out, unlike last year [11], though its confidence interval still overlaps the one for German.

Table 4 shows the impact of stemming on the shorter (Title-only) queries. It appears the benefits are a little bigger for the shorter queries in most languages, with English the only language without a noticeable benefit on average. Of course, stemming can hurt precision for some queries, as in English topic 139 (EU fishing quotas), so an application probably should make stemming a user-controllable option.

4.2 Impact of Query Stop Words

Table 5 shows the impact of discarding query stop words, such as “find”, “relevant” and “documents”. Query stop words differ from general stop words (such as “the”, “of”, “by”) in that they do not seem to be noise words in general, but their common use in past CLEF topic sets (particularly the Description and Narrative fields) suggests they are likely not useful terms when encountered in CLEF queries. In the table, a positive difference indicates a benefit from removing query stop words from the topics.

Table 5 shows that the impact of discarding query stop words was always minor (the biggest average benefit was just 1.6 points), though some of the differences are “statistically significant” because of the consistency of the minor benefits. This is a case where a “statistically significant” benefit is still not a “significant” benefit.

Sometimes noise words may occur in relevant documents by chance and scores may fall if the noise words are discarded. Apparently that happened in French topics 123 and 132 (regarding “mariage” and “Kaliningrad” respectively) in which excluding “trouver” and “documents” decreased the scores, even though they don’t seem to be meaningful terms for their queries.

4.3 Impact of Stop Words

Tables 6 and 7 show the impact of using stop words on the average precision measure. To do this experiment, two tables were created for each language, one indexed with a stopfile containing typically a couple hundred stop words, the other with no stop words (though other SearchServer stopfile instructions, such as accent-indexing and apostrophes as word separators, were kept the same as used for the submitted runs). For this experiment, query stop words were not discarded for either run, to isolate the impact of the general stop words on precision. In the tables, a positive difference indicates a benefit to specifying stop words.

Table 6 shows the impact of using stop words for Title plus Description queries was very slight on average, and none of the differences were statistically significant. Table 7 shows there was a noticeable benefit for full topic queries (i.e. when additionally including the Narrative) for some languages, and a statistically significant benefit for most of them. Other benefits of specifying stop words are to reduce search time, indexing time and index size. However, there may be cases when what is usually a stop word is meaningful to a query (e.g. find documents containing “to be or not to be”), so it may be better to make stop word elimination an option at search time rather than at index time, depending on the goals of the application.

Stop word lists for many languages are on the Neuchâtel resource page [7]. Our stop word lists may differ from those.

4.4 Impact of Indexing Accents

settings were the same as for the submitted runs; in particular, apostrophes were used as word separators except in English.

Tables 8 and 9 show that topic 98, regarding the Kaurismäki brothers, was strongly affected in many languages by whether or not accents were preserved. Spanish, French and Italian topics 98 included the accent in Kaurismäki, but the documents more often did not include the accent, so accent-indexing hurt precision in those cases. But accent-indexing was helpful for Finnish for this topic, apparently because in Finnish there were variants which required stemming to match (e.g. Kaurismäkien and Kaurismäen), and the stemmer was more effective when given the words with the accents preserved. It appears it would help if the stemmer were modified to be more tolerant of missing accents.

4.5 Impact of Apostrophes as Word Separators

Table 10 shows the impact of treating apostrophes as word separators on the average precision measure. To do this experiment, two tables were created for each language, one treating apostrophes as word separators, the other not. No stop words were used and no query stop words were discarded. Otherwise, the settings were the same as for the submitted runs; in particular, accent-indexing was enabled except in Italian and English.

Table 10 shows that treating apostrophes as word separators had a noticeable benefit for French. For example, French topic 121 may be benefiting from breaking “d’Ayrton” at its apostrophe. The benefit for Italian may have been less because stemming appears to be handling apostrophes. For example, in Italian, if apostrophes are not word separators, “l’ombrello” still matches “ombrello” when stemming is enabled, whereas in French, “l’école” still does not match “école” (again, this difference is moot when apostrophes are treated as word separators). The impact for other languages is slight, including for English.

We submitted 10 monolingual runs (the maximum allowed) in June 2002. Runs humDE02, humFR02, humIT02, humES02, humNL02, humSV02 and humFI02 provided a run for each language using the Title and Description fields as requested by the organizers (note that English monolingual runs were not accepted). For the remaining 3 runs, we submitted an extra run for Finnish, Swedish and Dutch including the Narrative field (runs humFI02n, humSV02n, humNL02n); these languages were expected to have the fewest participants, so additional submissions seemed more likely to be helpful for the judging pools. The precision scores of the submitted runs are expected to be included in an appendix of this volume.

Table 11 shows a comparison of the submitted runs with the median scores of submitted monolingual runs for each language. In all but one case, SearchServer scored higher than the median on more topics than it scored lower. Note that the relative performance on different languages may not be meaningful for several reasons, including that the medians are from a mix of runs where some may have used the Narrative field, multiple runs may be submitted by the same group, and the mixture can vary across languages.

The submitted runs of June used an older experimental build than the one used for the diagnostic runs in August, and there may be minor differences in the scores even when the settings are the same.

5 Confidence Intervals for Precision Differences

The 95% confidence intervals presented in this paper have been produced using Efron’s Bootstrap (percentile method). If there are 50 topics (i.e. 50 precision differences), then precision differences are chosen randomly (with replacement) 50 times, producing a “bootstrap sample”, and a mean (average) is computed from this sample. This step is repeated B times (e.g. B=100,000). The B sample means are sorted, the bottom and top 2.5% are discarded, and the endpoints of the remaining range of sample means are an approximate 95% confidence interval for the average difference in precision (we always rounded so that the listed endpoints are not actually in the produced interval). The bootstrap percentile method is considered to work well in more cases than the standard method of using the mean plus/minus 1.96 times the standard error, though there are more complicated bootstrap methods which are considered even more general [1].
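The procedure just described can be sketched in Python as follows, alongside the standard method for comparison. The function names, the fixed seed and the default of 10,000 iterations (chosen to keep the example fast) are assumptions of this sketch; the endpoint rounding mentioned above is not reproduced, and the per-topic differences in the example are made up.

```python
import random
import statistics

def bootstrap_percentile_ci(diffs, b=10_000, level=0.95, seed=0):
    """Approximate confidence interval for the mean difference (percentile method)."""
    rng = random.Random(seed)
    n = len(diffs)
    # Each bootstrap sample draws n differences with replacement; keep its mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(b))
    # Number of sample means discarded at each end (2.5% each for a 95% interval).
    cut = round(b * (1.0 - level) / 2.0)
    return means[cut], means[b - cut - 1]

def standard_ci(diffs, z=1.96):
    """Mean plus/minus z times the standard error."""
    mean = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / len(diffs) ** 0.5
    return mean - z * std_err, mean + z * std_err

# Made-up per-topic differences, for illustration only.
diffs = [0.1, -0.05, 0.2, 0.0, 0.15, 0.05, -0.1, 0.3, 0.1, 0.05]
print(bootstrap_percentile_ci(diffs))
print(standard_ci(diffs))
```

If zero lies outside the returned interval, the difference would be called statistically significant at the 5% level in the sense used in this paper.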

Table 12 shows the bootstrap confidence intervals produced for the impact of stemming on average precision with different numbers of iterations. Even at just 1000 iterations the values are fairly close to the values at 1 million iterations. When comparing 1,000,000 iterations to 100,000, very few of the endpoints changed, and they only changed by 0.001. For the confidence intervals in this paper, we used B=100,000.

Tables 13 and 14 contain side-by-side comparisons of the approximate 95% confidence intervals produced by the bootstrap percentile method and the standard method. It turns out they are very similar. There is a disagreement on statistical significance (i.e. when zero is not in the interval) in the case of Dutch in Table 13, but it is a borderline case.

Tables 13 and 14 also include an estimator and 95% confidence interval based on the Wilcoxon signed rank test (the 2 rightmost columns). (We implemented an exact computation, including for the case of ties in the absolute values of the differences [4].) For differences in average precision, the widths of the intervals are very similar (the bootstrap intervals are a little smaller than the Wilcoxon intervals for the Finnish and German results, and the Wilcoxon intervals are a little smaller for the others); the methods agree on which differences are statistically significant. However, for differences in Precision@10, the bootstrap intervals are a lot smaller than the Wilcoxon intervals: because the Wilcoxon is based on ranks, it cannot distinguish between a shift of 0.01 and 0.09 (they have the same effect on the ranks because every difference is a multiple of 0.10). The methods still agree on statistically significant results (for the 8 cases listed).
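The paper does not name the rank-based estimator. A common companion to the signed rank test is the Hodges-Lehmann estimator, the median of the pairwise “Walsh averages”, and the Python sketch below assumes that this is the kind of estimator meant. The function name `hodges_lehmann` is hypothetical, and the exact confidence interval computation is not reproduced.

```python
import statistics

def hodges_lehmann(diffs):
    """Median of the Walsh averages (d_i + d_j) / 2 over all pairs with i <= j."""
    walsh = [
        (diffs[i] + diffs[j]) / 2.0
        for i in range(len(diffs))
        for j in range(i, len(diffs))
    ]
    return statistics.median(walsh)

print(hodges_lehmann([1, 3, 2]))
```

Under this assumption, the coarseness of the Precision@10 intervals has a simple explanation: when every difference is a multiple of 0.10, every Walsh average is a multiple of 0.05, so the estimator and the interval endpoints derived from these averages can only fall on that grid.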

References

[1] Michael Chernick. Bootstrap Methods: A Practitioner's Guide. 1999. John Wiley & Sons.

[2] Cross-Language Evaluation Forum web site. http://www.clef-campaign.org/

[3] Andrew Hodgson. Converting the Fulcrum Search Engine to Unicode. In Sixteenth International Unicode Conference, Amsterdam, The Netherlands, March 2000.

[4] Myles Hollander and Douglas A. Wolfe. Nonparametric Statistical Methods. Second Edition, 1999. John Wiley & Sons.

[5] NTCIR (NII-NACSIS Test Collection for IR Systems) Home Page. http://research.nii.ac.jp/~ntcadm/index-en.html

[6] S. E. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford. (City University.) Okapi at TREC-3. In D. K. Harman, editor, Overview of the Third Text REtrieval Conference (TREC-3). NIST Special Publication. http://trec.nist.gov/pubs/trec3/t3_proceedings.html

[7] Jacques Savoy. (Université de Neuchâtel.) CLEF and Multilingual information retrieval resource page. http://www.unine.ch/info/clef/

Fulcrum SearchServer

Proceedings of the Publication 500-249.