<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author role="corresp">
							<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
							<email>tetsuyasakai@acm.org</email>
							<affiliation key="aff0">
								<orgName type="institution">Waseda University</orgName>
							</affiliation>
						</author>
						<title level="a" type="main">Evaluating Evaluation Measures with Worst-Case Confidence Interval Widths</title>
					</analytic>
					<monogr>
						<imprint>
							<date/>
						</imprint>
					</monogr>
					<idno type="MD5">9749A3C1B173487A013A77490CD26689</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2023-03-25T03:46+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>ANOVA</term>
					<term>confidence intervals</term>
					<term>effect sizes</term>
					<term>evaluation measures</term>
					<term>p-values</term>
					<term>sample sizes</term>
					<term>statistical significance</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>IR evaluation measures are often compared in terms of rank correlation between two system rankings, agreement with the users' preferences, the swap method, and discriminative power. While we view the agreement with real users as the most important, this paper proposes to use the Worst-case Confidence interval Width (WCW) curves to supplement it in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. First, we prove that Sakai's ANOVA-based topic set size design tool can be used for discussing WCW instead of his CI-based tool that cannot handle large topic set sizes. We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>IR systems are built to satisfy users' information needs, but it is not practical to make the users evaluate the systems all the time for the purpose of improving them; that would annoy the users, not satisfy them! Hence, we often turn to IR evaluation measures in laboratory experiments. But which IR measures are good?</p><p>In laboratory studies, evaluation measures are often compared in terms of rank correlation between two system rankings (e.g. <ref type="bibr" target="#b8">[9]</ref>), agreement with the users' document preferences (e.g. <ref type="bibr" target="#b6">[7]</ref>), the swap method (e.g. <ref type="bibr" target="#b7">[8]</ref>), and discriminative power (e.g. <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>). Since IR evaluation measures are often regarded as surrogates of user satisfaction or user performance measurements, we view the agreement with users as the most important, although it needs to be said that user preference studies often use hired assessors such as crowd workers instead of real users with an information need. Moreover, studies involving human assessors obviously incur costs.</p><p>To supplement user-based studies of IR evaluation measures, we propose to use Worst-case Confidence interval Width (WCW) curves in test-collection environments. WCW is the worst-case width of a confidence interval (CI) for the difference between any two systems, given a topic set size. We argue that WCW curves are more useful than the swap method and discriminative power, since they provide a statistically well-founded overview of the comparison of measures over various topic set sizes, and visualise what levels of differences across measures might be of practical importance. To this end, we leverage one of the publicly available topic set size design Excel tools of Sakai <ref type="bibr" target="#b5">[6]</ref>. 
First, we prove that Sakai's ANOVA-based topic set size design tool<ref type="foot" target="#foot_0">1</ref> can be used for discussing WCW instead of his CI-based tool<ref type="foot" target="#foot_1">2</ref>, which cannot handle large topic set sizes (see Section 2). We then provide some case studies of evaluating evaluation measures using WCW curves based on the ANOVA-based tool, using data from TREC and NTCIR.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PRIOR ART IN EVALUATING EVALUATION MEASURES</head><p>When a new IR evaluation measure is invented, a system ranking according to this measure (averaged over a set of topics) is often compared with another according to a well-established measure; rank correlation measures such as Kendall's τ or the top-heavy τ_ap <ref type="bibr" target="#b8">[9]</ref> are often used to quantify the similarity between two rankings. However, this approach cannot tell us whether a measure is good or bad, due to the lack of a "correct" system ranking. It merely tells us whether a new measure is similar to an existing one or not; it only serves as a sanity check. In a preference agreement study, for a given query, a user sees two Search Engine Result Pages (SERPs) side by side, and says that SERP_1 is better than SERP_2 ("SERP_1 &gt; SERP_2"). If an evaluation measure also says "SERP_1 &gt; SERP_2," this is a preference agreement; if it says "SERP_1 &lt; SERP_2," this is a preference disagreement. We can count the number of agreements over different queries and SERP pairs, and use it for comparing the "goodness" of evaluation measures. In practice, this approach also has a few limitations: (a) the judges employed in the preference assessments are often not real search engine users with an information need; (b) human assessments can be unreliable and/or inconsistent; and (c) hiring judges comes at a cost, no matter how small. The swap method <ref type="bibr" target="#b7">[8]</ref> may be used to measure the consistency (i.e., "preference agreement with itself") of evaluation measures across different topic sets. Given a set of n topics, the set is split in half, and the number of inconsistent preferences (e.g., SERP_1 &gt; SERP_2 with the first half but SERP_1 &lt; SERP_2 with the second half) is counted, using different systems and different splits. 
As this method can only consider half the original topic set size, Voorhees and Buckley used a simple extrapolation method to estimate what will happen for topic set sizes larger than n. However, estimating the swap rate for (say) n = 100 topics based on observations with (say) n = 10, 25, 50 topics may not be reliable. To directly consider the size n, bootstrap samples <ref type="bibr" target="#b2">[3]</ref> can be used to replace the sampling-without-replacement approach of Voorhees and Buckley, but this method cannot consider topic set sizes larger than n either.</p><p>Given a set of runs and an evaluation measure, a p-value can be obtained for every system pair using an appropriate significance test, and the sorted p-values can be plotted against the system pairs <ref type="bibr" target="#b2">[3,</ref><ref type="bibr" target="#b3">4]</ref>: this is called the discriminative power curve. While highly discriminative measures are useful in the sense that they can obtain more statistically significant results in a given environment with exactly n topics, discriminative power does not provide a view over different choices of topics. Moreover, it is not clear, for example, whether a measure with 90% discriminative power should actually be preferred over one with 80% discriminative power.</p><p>Sakai <ref type="bibr" target="#b5">[6]</ref> released three Excel tools based on topic set size design, which determines the number of topics n to create for a new test collection given a set of statistical requirements. His ANOVA-based tool takes the following as input: α (Type I error probability), β (Type II error probability), m (the number of systems to be compared in one-way ANOVA), σ^2 (an estimate of the within-system variance for a particular evaluation measure), and minD (minimum detectable range); the tool returns the topic set size n that ensures 100(1 − β)% statistical power whenever the true difference between the best and the worst among the m systems is minD or larger. 
In contrast, his CI-based tool takes the following as input: α, σ_t^2 (an estimate of the variance of the between-system differences in terms of a particular evaluation measure), and δ, which is exactly what we call WCW in this study; the tool returns the topic set size n that ensures that the width of the 100(1 − α)% CI for any system pair is no larger than δ. Following Sakai, we simply let σ_t^2 = 2σ^2 for any evaluation measure.</p><p>While the relationship between minD for ANOVA and n can be plotted for different evaluation measures, this seems problematic as a way to compare evaluation measures, since, for example, a minD of 0.1 in terms of one measure is not equivalent to a minD of 0.1 in terms of another. In contrast, if we plot δ against n, this is probably a more valid comparison since, at least for any normalised measures that lie in the [0, 1] score range, we usually want the CI width to be as small as possible. This is why we propose to plot δ against topic set sizes to compare different measures. However, Sakai's CI-based tool cannot handle large topic set sizes: the limitation of his CI-based tool is due to that of Excel's GAMMA function: GAMMA(172) is greater than 10^307 and cannot be computed <ref type="bibr" target="#b5">[6]</ref>. Hence, we start by proving that his ANOVA-based tool can be used instead of the less robust CI-based one, for IR researchers to compare the statistical reliability of evaluation measures based on WCW.</p></div>
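The GAMMA limitation just described is a floating-point overflow rather than a fundamental obstacle. As a minimal, standard-library-only Python sketch (our illustration; the tools discussed in this paper are Excel-based), working with the log-gamma function instead of Γ itself avoids the overflow that Excel's GAMMA(172) hits:

```python
import math

# 171! is about 1.24e309, which exceeds the largest IEEE-754 double
# (~1.8e308), so Gamma(172) overflows in Python just as in Excel.
try:
    math.gamma(172)
    overflowed = False
except OverflowError:
    overflowed = True

# Computing ln(Gamma(172)) = ln(171!) in log space is unproblematic.
log_gamma_172 = math.lgamma(172)
print(overflowed, round(log_gamma_172, 2))
```

Any computation that needs ratios of large Γ values can therefore be carried out at arbitrary topic set sizes by keeping intermediate quantities in log space and exponentiating only at the end.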
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PROOF THAT ANOVA-BASED TOPIC SET SIZE DESIGN CAN BE USED INSTEAD OF CI-BASED ONE</head><p>According to Sakai's CI-based topic set size design, the initial topic set size estimate for ensuring that the CI width for the difference in means for any two systems is no larger than δ (&gt; 0) is given by <ref type="bibr" target="#b5">[6]</ref>:</p><formula xml:id="formula_0">n_CI = 4{z_inv(α/2)}^2 σ_t^2 / δ^2 = 4{z_inv(α/2)}^2 (2σ^2) / δ^2,<label>(1)</label></formula><p>where z_inv(P) is the upper z-value<ref type="foot" target="#foot_2">3</ref> for probability P. Subsequently, this estimate is incremented until it actually satisfies the requirement (α, δ). Thus, while the actual CI relies on a t-distribution, the method starts off with a standard normal distribution by assuming that the variance estimate σ_t^2 is perfectly accurate<ref type="foot" target="#foot_3">4</ref>. This is why Eq. 1 involves a z-value rather than a t-value.</p><p>Meanwhile, according to Sakai's ANOVA-based topic set size design, the initial topic set size estimate for ensuring 100(1 − β)% statistical power whenever the true difference between the best and the worst systems is minD or larger is given by <ref type="bibr" target="#b5">[6]</ref>:</p><formula xml:id="formula_1">n_ANOVA = 2σ^2 λ / minD^2,<label>(2)</label></formula><p>where λ is a noncentrality parameter of a noncentral χ^2 distribution with ϕ = m − 1 degrees of freedom; as discussed below, linear formulae are available for estimating λ from ϕ <ref type="bibr" target="#b1">[2]</ref>. As Eq. 2 is based on a series of approximations, n_ANOVA is then incremented until it actually satisfies the requirement (α, β, minD, m). 
Sakai <ref type="bibr" target="#b5">[6]</ref> observed that, for the data he considered, "the topic set size required based on the CI-based design with α = 0.05 and δ = c is almost the same as the topic set size required based on the ANOVA-based design with (α, β, m) = (0.05, 0.20, 10) and minD = c, for any c." We analytically explain and generalise his observation as follows. From Eqs. 1 and 2, we have:</p><formula xml:id="formula_3">n_ANOVA / n_CI = λ δ^2 / (4{z_inv(α/2)}^2 minD^2) = (λ / 4{z_inv(α/2)}^2) (δ/minD)^2.<label>(3)</label></formula><p>Here, note that 4{z_inv(α/2)}^2 is a constant for a given α; also, λ is a constant given α, β and m. Figure <ref type="figure">1</ref> visualises the relationship between the two constants for α = 0.01, 0.05 and β = 0.10, 0.20, while varying the number of systems m. The linear formulae for approximating λ based on ϕ = m − 1 <ref type="bibr" target="#b5">[6]</ref> are provided in the bottom half of the figure. Figure <ref type="figure">1</ref> shows that </p><formula xml:id="formula_5">λ ≈ 4{z_inv(α/2)}^2<label>(4)</label></formula><formula xml:id="formula_6">n_ANOVA / n_CI ≈ (δ/minD)^2.<label>(5)</label></formula><p>Thus, when one of the four conditions (a)-(d) listed with Figure 1 holds, by letting δ = minD in Eq. 5 we obtain n_ANOVA/n_CI ≈ 1, that is, n_ANOVA ≈ n_CI, regardless of the variance estimate σ^2. Q.E.D.</p><p>Henceforth, we only consider the popular Cohen's five-eighty convention <ref type="bibr" target="#b0">[1]</ref>, i.e., (α, β) = (0.05, 0.20) <ref type="foot" target="#foot_4">5</ref>, and leverage Condition (a). Figure <ref type="figure">1</ref>: The noncentrality parameter λ vs. 4{z_inv(α/2)}^2. Table <ref type="table">1</ref>: σ^2: estimates of within-system variances. md stands for measurement depth (i.e., document cutoff). </p></div>
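The argument above can also be checked numerically. The following Python sketch is our own illustration, not Sakai's Excel tools: it solves the noncentral χ^2 power condition for λ under Condition (a), compares λ with 4{z_inv(α/2)}^2, and evaluates the initial estimates of Eqs. 1 and 2 with δ = minD. The values σ^2 = 0.05 and δ = 0.1 are arbitrary illustrative choices, and the incomplete-gamma series is a textbook implementation, kept standard-library-only.

```python
import math
from statistics import NormalDist

def reg_lower_gamma(s, x):
    """Regularized lower incomplete gamma P(s, x), via its power series."""
    if x <= 0.0:
        return 0.0
    term, total, k = 1.0, 1.0, s + 1.0
    while term > 1e-16 * total:
        term *= x / k
        total += term
        k += 1.0
    return total * math.exp(s * math.log(x) - x - math.lgamma(s + 1.0))

def chi2_cdf(x, df):
    return reg_lower_gamma(df / 2.0, x / 2.0)

def ncx2_power(lam, df, crit):
    """P(noncentral chi^2(df, lam) > crit): Poisson mixture of central chi^2 CDFs."""
    half, w, cdf = lam / 2.0, math.exp(-lam / 2.0), 0.0
    for j in range(500):
        cdf += w * chi2_cdf(crit, df + 2 * j)
        w *= half / (j + 1)
    return 1.0 - cdf

def bisect(f, lo, hi, tol=1e-9):
    """Root of an increasing f on [lo, hi] with f(lo) < 0 < f(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

alpha, beta, m = 0.05, 0.20, 10                 # Condition (a)
phi = m - 1                                     # chi^2 degrees of freedom
crit = bisect(lambda x: chi2_cdf(x, phi) - (1 - alpha), 0.0, 100.0)
lam = bisect(lambda l: ncx2_power(l, phi, crit) - (1 - beta), 0.0, 100.0)

z = NormalDist().inv_cdf(1 - alpha / 2)         # z_inv(alpha/2)
four_z_sq = 4.0 * z * z                         # the constant in Eq. 1

sigma2, delta = 0.05, 0.10                      # illustrative; delta = minD
n_ci = four_z_sq * (2.0 * sigma2) / delta ** 2  # Eq. 1 initial estimate
n_anova = 2.0 * sigma2 * lam / delta ** 2       # Eq. 2 initial estimate
print(round(lam, 2), round(four_z_sq, 2), round(n_ci, 1), round(n_anova, 1))
```

Under Condition (a), λ and 4{z_inv(α/2)}^2 come out close to each other, so the two initial topic set size estimates differ by only a few topics, consistent with the observation quoted above.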
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">WCW-BASED EVALUATION OF EVALUATION MEASURES: CASE STUDIES</head><p>Having proven that the ANOVA-based tool can be used instead of the less robust CI-based tool, we now demonstrate how different evaluation measures can be compared using WCW curves obtained with the ANOVA-based tool.</p><p>Table <ref type="table">1</ref> shows the variance estimates σ^2 of various evaluation measures reported in the literature <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref>. For the purpose of the present study, detailed knowledge of each evaluation measure is not necessary; the measures with a prefix "std-AB" denote standardised versions of the original measures, where the raw score for each topic is transformed based on a set of known systems, to absorb the hardness of that topic as well as its variation across systems <ref type="bibr" target="#b4">[5]</ref>. Given a topic-by-run score matrix for a particular evaluation measure, σ^2 can easily be obtained as the residual variance of ANOVA. While some evaluation measures are substantially less stable across topics than others (e.g., compare nERR and nDCG in Table 1(a)), it is not clear just from this table how such differences will actually impact our evaluation results.</p><p>Figure <ref type="figure" target="#fig_3">3</ref> shows the WCW curves that correspond to the variances shown in Table <ref type="table">1</ref>, for α = 0.05, i.e., 95% CIs. 
The advantages of the proposed WCW-based comparison of evaluation measures are as follows:</p><p>• Unlike discriminative power and the swap method, we can easily consider a wide range of topic set sizes; • For a particular topic set size, we can easily compare across different evaluation measures, since an evaluation measure with a small WCW is usually more desirable than one with a large WCW under the same condition; • The WCW curves can visualise the differences among measures that practically matter. For example, from Figure <ref type="figure" target="#fig_3">3</ref>(b), when the topic set size is n = 50, it is clear that the WCW of nDCG and that of Q are about the same (around 0.16), while those of AP and nERR are substantially larger (around 0.23). Similarly, from Figure <ref type="figure" target="#fig_3">3</ref>(d), while it is clear that the standardised ("std-AB") measures have substantially lower WCW values than the unstandardised ones, the differences within the set of standardised measures are probably not of practical importance, as indicated by the near-perfect overlaps of the curves.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head></head><label></label><figDesc>holds when: Condition (a) α = 0.05, β = 0.20, m = 10; or Condition (b) α = 0.05, β = 0.10, m = 5; or Condition (c) α = 0.01, β = 0.20, m = 18; or Condition (d) α = 0.01, β = 0.10, m = 10. Hence, whenever one of the above four conditions holds true, then from Eqs. 3 and 4 we obtain:</figDesc></figure>
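The pipeline of this section, from a topic-by-run score matrix to a WCW value at a given topic set size, can be sketched as follows. This is a Python sketch under our own simplifying assumptions: σ^2 is taken as the one-way ANOVA residual (pooled within-system) variance, δ is obtained by solving Eq. 1 for δ at a given n (the initial estimate only, ignoring the tools' subsequent increment step), and the 3-system, 4-topic score matrix is made up purely for illustration.

```python
from statistics import NormalDist

def within_system_variance(scores):
    """Residual (within-system) variance of one-way ANOVA;
    scores[i][j] is the score of system i on topic j."""
    m, n = len(scores), len(scores[0])
    ss_within = 0.0
    for row in scores:
        mean = sum(row) / n
        ss_within += sum((x - mean) ** 2 for x in row)
    return ss_within / (m * (n - 1))    # residual degrees of freedom

def wcw_estimate(sigma2, n, alpha=0.05):
    """Eq. 1 solved for delta at a given n: a first-order WCW estimate."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return (4.0 * z * z * 2.0 * sigma2 / n) ** 0.5

# Hypothetical 3-system x 4-topic matrix of some [0, 1]-valued measure.
scores = [
    [0.30, 0.50, 0.40, 0.60],
    [0.35, 0.55, 0.45, 0.65],
    [0.20, 0.40, 0.30, 0.50],
]
sigma2 = within_system_variance(scores)
for n in (25, 50, 100, 200):            # one point per curve position
    print(n, round(wcw_estimate(sigma2, n), 3))
```

Plotting wcw_estimate(σ^2, n) against n for each measure's σ^2 yields, up to the increment step, the kind of WCW curves shown in Figure 3; a measure whose curve lies lower reaches a given CI width with fewer topics.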
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head></head><label></label><figDesc>(a) TREC03-04Robust (md = 1000), from Sakai<ref type="bibr" target="#b5">[6]</ref>; (b) TREC11-12WebAdhoc (md = 10), from Sakai<ref type="bibr" target="#b5">[6]</ref>. Figure 2 compares, for different and quite extreme values of the variance estimate σ^2, the topic set size curve using the CI-based tool with α = 0.05 and one using the ANOVA-based tool with α = 0.05, β = 0.20, m = 10. Due to the aforementioned limitation of the CI-based tool, it was not possible to obtain the entire curves with this tool. On the other hand, it is clear that the ANOVA-based curves can serve as highly accurate surrogates for the CI-based curves and can handle large topic set sizes. In summary, to discuss WCW, we can always use the more robust ANOVA-based tool and treat the minD values as if they are δ values.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_2"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: The actual relationship between δ for CI and minD for ANOVA in topic set size design.</figDesc><graphic coords="3,317.96,495.17,240.94,135.53" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: WCW curves for 95% CIs.</figDesc><graphic coords="4,53.80,503.75,240.94,135.53" type="bitmap" /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">NORM.S.INV(1 − P) with Microsoft Excel.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">Replacing the true population variance of a standard normal distribution with a sample variance constitutes the very definition of a t-distribution.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4">Note that "eighty" refers to the statistical power: 100(1 − β)%.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSIONS AND FUTURE WORK</head><p>We proposed to evaluate evaluation measures by comparing the WCW for various topic set sizes, using an existing ANOVA-based tool instead of the less robust CI-based tool. We proved the relationship between these two topic set size design methods, and demonstrated the advantages of WCW curves over well-known methods such as the swap test and discriminative power. It is hoped that this method will supplement user-based studies of evaluation measures.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<title level="m" type="main">The Essential Guide to Effect Sizes</title>
		<author>
			<persName><forename type="first">Paul</forename><forename type="middle">D</forename><surname>Ellis</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2010">2010</date>
			<publisher>Cambridge University Press</publisher>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<title level="m" type="main">How to Design the Sample Size</title>
		<author>
			<persName><forename type="first">Yasushi</forename><surname>Nagata</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2003">2003</date>
			<publisher>Asakura Shoten</publisher>
		</imprint>
	</monogr>
	<note>in Japanese</note>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Evaluating Evaluation Metrics based on the Bootstrap</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2006">2006</date>
			<biblScope unit="page" from="525" to="532" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Alternatives to Bpref</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2007">2007</date>
			<biblScope unit="page" from="71" to="78" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">The Effect of Score Standardisation on Topic Set Size Design</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of AIRS 2016</title>
				<meeting>AIRS 2016</meeting>
		<imprint>
			<date type="published" when="2016">2016. 9994</date>
			<biblScope unit="page" from="16" to="28" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Topic Set Size Design</title>
		<author>
			<persName><forename type="first">Tetsuya</forename><surname>Sakai</surname></persName>
		</author>
		<ptr target="//link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf" />
	</analytic>
	<monogr>
		<title level="j">Information Retrieval Journal</title>
		<imprint>
			<biblScope unit="volume">19</biblScope>
			<biblScope unit="issue">3</biblScope>
			<biblScope unit="page" from="256" to="283" />
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Do User Preferences and Evaluation Measures Line Up?</title>
		<author>
			<persName><forename type="first">Mark</forename><surname>Sanderson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Monica</forename><forename type="middle">Lestari</forename><surname>Paramita</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Paul</forename><surname>Clough</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Evangelos</forename><surname>Kanoulas</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2010">2010</date>
			<biblScope unit="page" from="555" to="562" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">The Effect of Topic Set Size on Retrieval Experiment Error</title>
		<author>
			<persName><forename type="first">Ellen</forename><forename type="middle">M</forename><surname>Voorhees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Chris</forename><surname>Buckley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="316" to="323" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">A New Rank Correlation Coefficient for Information Retrieval</title>
		<author>
			<persName><forename type="first">Emine</forename><surname>Yilmaz</surname></persName>
		</author>
	<author>
		<persName><forename type="first">Javed</forename><forename type="middle">A</forename><surname>Aslam</surname></persName>
	</author>
	<author>
		<persName><forename type="first">Stephen</forename><surname>Robertson</surname></persName>
	</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of ACM SIGIR</title>
				<meeting>ACM SIGIR</meeting>
		<imprint>
			<date type="published" when="2008">2008</date>
			<biblScope unit="page" from="587" to="594" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
