
Appendix C Results of the Robust Track

Giorgio Maria Di Nunzio

dinunzio@dei.unipd.it

Nicola Ferro

ferro@dei.unipd.it

Department of Information Engineering, University of Padua, Italy


Introduction

Results for CLEF 2008 Ad-hoc Robust Track

3. Individual Experiment Results and Graphs

This section provides the individual results for each official experiment. For each experiment the following tables and graphs are shown:

- Overall statistics and information
- Interpolated recall vs precision averages plot
- Average precision statistics and box plot
- Average precision comparison to median plot
- Document cutoff levels vs precision at DCL plot
- R-Precision statistics and box plot
- R-Precision comparison to median plot

Topics and experiments are both identified with DOIs. The prefix for the DOI of a topic is 10.2452. The following example shows how to build the DOI for a topic given its number: for topic 200-AH, the corresponding DOI is 10.2452/200-AH.
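The DOI rule stated in the introduction (prefix 10.2452, a slash, then the topic identifier) can be sketched in a few lines of Python. This is only an illustration: the function name `build_topic_doi` and the constant `TOPIC_DOI_PREFIX` are hypothetical, and the sketch covers topic DOIs only, since the text gives the prefix for topics but does not spell out the rule for experiment DOIs.

```python
# Prefix for topic DOIs, as stated in the text above.
TOPIC_DOI_PREFIX = "10.2452"


def build_topic_doi(topic_id: str) -> str:
    """Return the DOI for a topic identifier such as "200-AH".

    Hypothetical helper: it simply joins the prefix and the identifier.
    """
    topic_id = topic_id.strip()
    if not topic_id:
        raise ValueError("topic identifier must not be empty")
    return f"{TOPIC_DOI_PREFIX}/{topic_id}"


if __name__ == "__main__":
    print(build_topic_doi("200-AH"))  # 10.2452/200-AH
```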

[Table: List of Submitted Experiments. Only fragments survive the extraction: topic-field entries (TD, TDN) and participant names (ufrgs, uniba, unine). The rows cannot be reconstructed.]

Track Overview Results and Graphs

[Figure: per-topic plot with a Topic Identifier axis covering 166-AH to 190-AH. A "Recall" axis label and some tick values also survive.]

Ad-Hoc Robust Monolingual English Test Task Top 5 Participants - Comparison to Median Average Precision by Topic (Topics 191-AH to 265-AH)

[Figure: per-topic plot. The surviving topic labels are 191-AH to 194-AH.]

Experiments listed in the figure labels:

- ufrgs [Experiment UFRGS_R_MONO2_TEST; MAP 33.95%; Not Pooled]
- ixa [Experiment EN2ENNOWSD; MAP 35.34%; Not Pooled]
- ufrgs [Experiment UFRGS_R_MONO1_TEST; MAP 31.20%; Not Pooled]
- inaoe [Experiment INAOEF; MAP 28.46%; Not Pooled]
- know-center [Experiment ASSO; MAP 27.72%; Not Pooled]
- inaoe [Experiment INAOEV; MAP 25.82%; Not Pooled]
- uniba [Experiment MONO11NUS2F; MAP 19.24%; Not Pooled]
- uniba [Experiment MONO1TDNUS2F; MAP 16.81%; Not Pooled]
- uniba [Experiment MONO13NUS2F; MAP 15.48%; Not Pooled]
- uniba [Experiment MONO12NUS2FOUT; MAP 14.57%; Not Pooled]
- uniba [Experiment MONO14NUS2F; MAP 6.87%; Not Pooled]

Axis: Average Precision, 0% to 100%.

AH-ROBUST-MONO-EN-TEST-CLEF2008
10.2455/TUKEY_T_TEST.488B4DC45C240AEDD7AED91CF79383BE

Ad-Hoc Robust Monolingual English Test Task - Tukey T test with "top group" highlighted

[Figure: Tukey T test plot. Axis: arcsin(sqrt(Average Precision)), with tick values 0.3, 0.4, 0.5 and 0.6.]
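The axis label of the Tukey T test plot, arcsin(sqrt(Average Precision)), indicates that average precision values were transformed before being plotted. The following is a minimal sketch of that transform, assuming average precision is expressed as a fraction in [0, 1]. The function name `arcsin_sqrt` is hypothetical, and the appendix does not state how the transform was applied in the original analysis (for example per topic or per experiment).

```python
import math


def arcsin_sqrt(average_precision: float) -> float:
    """Arcsine square-root transform of a proportion in [0, 1].

    Assumption: the input is a fraction (0.25), not a percentage (25).
    """
    if not 0.0 <= average_precision <= 1.0:
        raise ValueError("average precision must lie in [0, 1]")
    return math.asin(math.sqrt(average_precision))


if __name__ == "__main__":
    # Illustrative inputs only; these are not values from the appendix.
    for ap in (0.0, 0.25, 0.5, 1.0):
        print(f"AP={ap:.2f} -> {arcsin_sqrt(ap):.4f}")
```

The transformed values range from 0 to pi/2, which is consistent with the tick values between 0.3 and 0.6 that survive on the plot axis.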

[Figure: per-topic plot with a topic axis covering 166-AH to 190-AH.]

Ad-Hoc Robust Monolingual English Test Task Top 5 Participants - Comparison to Median R-Precision by Topic (Topics 191-AH to 265-AH)

[Figure: per-topic plot. The surviving topic labels are 191-AH to 200-AH, 251-AH, 252-AH and 254-AH to 258-AH.]

Top 5 participants (figure legend):

- unine [Experiment UNINEROBUST4; R-Prec 42.99%; Not Pooled]
- geneva [Experiment ISILEMTDN; R-Prec 38.05%; Not Pooled]
- ucm [Experiment BM25_BO1; R-Prec 36.15%; Not Pooled]
- ixa [Experiment EN2ENNOWSDPSREL; R-Prec 36.12%; Not Pooled]
- ufrgs [Experiment UFRGS_R_MONO2_TEST; R-Prec 32.81%; Not Pooled]

Other experiments listed in the figure labels:

- ixa [Experiment EN2ENNOWSD; R-Prec 33.14%; Not Pooled]
- ufrgs [Experiment UFRGS_R_MONO1_TEST; R-Prec 30.43%; Not Pooled]
- know-center [Experiment ASSO; R-Prec 27.45%; Not Pooled]
- inaoe [Experiment INAOEF; R-Prec 27.35%; Not Pooled]
- inaoe [Experiment INAOEV; R-Prec 25.53%; Not Pooled]
- uniba [Experiment MONO11NUS2F; R-Prec 21.20%; Not Pooled]
- uniba [Experiment MONO1TDNUS2F; R-Prec 19.00%; Not Pooled]
- uniba [Experiment MONO13NUS2F; R-Prec 17.01%; Not Pooled]
- uniba [Experiment MONO12NUS2FOUT; R-Prec 16.43%; Not Pooled]
- uniba [Experiment MONO14NUS2F; R-Prec 8.57%; Not Pooled]

Axis: R-Precision, 0% to 100%.

Ad-Hoc Robust Monolingual English Test Task - Tukey T test with "top group" highlighted

[Figure: Tukey T test plot. Experiments on the axis, in the order extracted:]

- UNINEROBUST4
- UNINEROBUST1
- ISILEMTDN
- ISILEMTD
- EN2ENNOWSDPSREL
- BM25_KLD
- BM25_BO1
- BM25_BO1_AVICTF
- BM25_BASELINE
- EN2ENNOWSD
- UFRGS_R_MONO2_TEST
- UFRGS_R_MONO1_TEST
- ASSO
- INAOEF
- INAOEV
- MONO11NUS2F
- MONO1TDNUS2F
- MONO13NUS2F
- MONO12NUS2FOUT
- MONO14NUS2F

ixa

Precision averages (%) for individual queries

Experiment ES2ENUBCDOCSPSREL. Overall information (field labels lost in extraction):

- 4
- AUTOMATIC
- Spanish; Castilian
- title, description
- false
- topics, UBC docs
- best

Figures for this experiment (Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task):

- [Figure: Standard Recall Levels vs Mean Interpolated Precision]
- [Figures: per-topic plots; only fragments of the topic axes survive, e.g. 189-AH, 190-AH, 264-AH, 265-AH, 289-AH, 290-AH]
- [Figure: Comparison to Median R-Precision by Topic (Topics 141-AH to 165-AH)]

[Table: statistics for the experiment; values lost in extraction. Column fragments: retrieved, R_PRECISION. Rows: Maximum, Minimum, First Quartile, Second Quartile, Third Quartile, Interquartile range, Mean, Standard Deviation, Lower Outlier Threshold, Upper Outlier Threshold, Mean With No Outliers, Std With No Outliers.]

ufrgs

Experiment UFRGS_R_BI_WSD1_TEST. Overall information (field labels lost in extraction):

- 1
- AUTOMATIC
- Spanish; Castilian
- title, description
- false

[Figure: Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task - Standard Recall Levels vs Mean Interpolated Precision. Both axes run from 0% to 100%.]

Precision averages (%) for individual queries

[Table: statistics for the experiment; values lost in extraction. Column fragments: retrieved, R_PRECISION. Rows: Maximum, Minimum, First Quartile, Second Quartile, Third Quartile, Interquartile range, Mean, Standard Deviation, Lower Outlier Threshold, Upper Outlier Threshold, Mean With No Outliers, Std With No Outliers.]
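The statistics named in the table rows (quartiles, interquartile range, outlier thresholds, and mean and standard deviation with outliers excluded) can be computed from per-topic scores along the following lines. This is a sketch under stated assumptions, not the procedure used to produce the appendix: the outlier thresholds are assumed to be Q1 - 1.5 x IQR and Q3 + 1.5 x IQR, the quartiles use the "inclusive" interpolation method, and the standard deviation is the sample standard deviation. None of these choices is confirmed by the source. The function name `box_plot_summary` is hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Summary statistics matching the row labels of the statistics table.

    Assumptions: 1.5 x IQR outlier fences, inclusive quartiles,
    sample standard deviation. Needs at least two non-outlier values.
    """
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Values inside the fences are treated as non-outliers.
    inliers = [v for v in values if lower <= v <= upper]
    return {
        "Maximum": max(values),
        "Minimum": min(values),
        "First Quartile": q1,
        "Second Quartile": q2,
        "Third Quartile": q3,
        "Interquartile range": iqr,
        "Mean": statistics.mean(values),
        "Standard Deviation": statistics.stdev(values),
        "Lower Outlier Threshold": lower,
        "Upper Outlier Threshold": upper,
        "Mean With No Outliers": statistics.mean(inliers),
        "Std With No Outliers": statistics.stdev(inliers),
    }


if __name__ == "__main__":
    # Made-up scores for illustration; not data from the appendix.
    scores = [10, 20, 30, 40, 50, 60, 70, 80, 500]
    for name, value in box_plot_summary(scores).items():
        print(f"{name}: {value}")
```

In the made-up sample, the value 500 lies above the upper threshold, so it is excluded from the two "No Outliers" rows.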

[Figure: Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task - Box plot of the Topics of the Experiment, experiment UFRGS_R_BI_WSD1_TEST.]

[Figures: per-topic plots for UFRGS_R_BI_WSD1_TEST; only fragments of the topic axes survive, e.g. 187-AH to 190-AH, 287-AH to 290-AH, 313-AH.]

uniba

Experiment CROSSWSD11NUS2F. Overall information (field labels lost in extraction):

- 2
- AUTOMATIC
- Spanish; Castilian
- title, description
- false
- NLevels (Keyword

Figures for this experiment (Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task):

- [Figure: Standard Recall Levels vs Mean Interpolated Precision]
- [Figure: Box plot of the Topics of the Experiment]
- [Figure: Retrieved documents vs Mean Precision]
- [Figure: Comparison to Median R-Precision by Topic (Topics 141-AH to 165-AH)]
- [Figures: further per-topic plots; only fragments of the topic axes survive, e.g. 164-AH, 165-AH, 264-AH, 265-AH, 314-AH, 315-AH, 340-AH]

Precision averages (%) for individual queries

[Table: statistics for the experiment; values lost in extraction. Column fragments: retrieved, R_PRECISION. Rows: Maximum, Minimum, First Quartile, Second Quartile, Third Quartile, Interquartile range, Mean, Standard Deviation, Lower Outlier Threshold, Upper Outlier Threshold, Mean With No Outliers, Std With No Outliers.]

uniba

Experiment CROSSWSD12NUS2F. Overall information (field labels lost in extraction):

- 3
- AUTOMATIC
- Spanish; Castilian
- title, description
- false
- NLevels (Synset

Figures for this experiment (Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task):

- [Figure: Standard Recall Levels vs Mean Interpolated Precision]
- [Figure: Comparison to Median Average Precision by Topic (Topics 141-AH to 165-AH)]
- [Figure: Retrieved documents vs Mean Precision]
- [Figure: Box plot of the Topics of the Experiment. A surviving axis label reads "Number of Topics of the Experiment", with ticks at 50, 100 and 150.]
- [Figure: Comparison to Median R-Precision by Topic (Topics 141-AH to 165-AH)]
- [Figures: further per-topic plots; only fragments of the topic axes survive, e.g. 189-AH, 190-AH, 289-AH, 290-AH, 314-AH, 315-AH, 340-AH]

Precision averages (%) for individual queries

[Table: statistics for the experiment; values lost in extraction. Column fragments: DCL retrieved, R_PRECISION. Rows: Maximum, Minimum, First Quartile, Second Quartile, Third Quartile, Interquartile range, Mean, Standard Deviation, Lower Outlier Threshold, Upper Outlier Threshold, Mean With No Outliers, Std With No Outliers.]

uniba

Experiment CROSSWSD1NUS2F. Overall information (field labels lost in extraction):

- 1
- AUTOMATIC
- Spanish; Castilian
- title, description
- false
- synset expansion

Figures for this experiment (Ad-Hoc Robust Word Sense Disambiguation Bilingual English Test Task):

- [Figure: Standard Recall Levels vs Mean Interpolated Precision. A surviving axis label reads "Number of Topics of the Experiment", with ticks at 50, 100 and 150.]
- [Figure: Comparison to Median Average Precision by Topic (Topics 141-AH to 165-AH)]
- [Figure: Retrieved documents vs Mean Precision. Axis: Retrieved Documents (logarithmic scale).]
- [Figure: Comparison to Median R-Precision by Topic (Topics 141-AH to 165-AH)]
- [Figures: further per-topic plots; only fragments of the topic axes survive, e.g. 142-AH, 143-AH, 189-AH, 190-AH, 289-AH, 290-AH, 314-AH, 315-AH, 340-AH, 341-AH]

Precision averages (%) for individual queries

[Table: values lost in extraction.]
