=Paper=
{{Paper
|id=Vol-1436/Paper73
|storemode=property
|title=Evaluating Search and Hyperlinking: An Example of the Design, Test, Refine Cycle for Metric Development
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper73.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RaccaJ15
}}
==Evaluating Search and Hyperlinking: An Example of the Design, Test, Refine Cycle for Metric Development==
David N. Racca, Gareth J. F. Jones
ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland
{dracca, gjones}@computing.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

Designing meaningful metrics for evaluating MediaEval tasks that are able to capture multiple aspects of system effectiveness and user satisfaction is far from straightforward. A considerable part of the effort in organising such a task must often be devoted to selecting, designing or refining a suitable evaluation metric. We review evaluation metrics from the MediaEval Search and Hyperlinking task, illustrating the motivation behind the metrics proposed for the task, and how reflection on results has led to iterative metric refinement in subsequent campaigns.

1. INTRODUCTION

It is a principle of MediaEval tasks that they should be built around a realistic use-case. This means that it is implicit in a MediaEval task that it should seek to evaluate participant submissions with respect to their effectiveness in performing the task, and, by implication, that this should be related to a user's satisfaction with the actions of the system used in the participant's submission.

The objective of a MediaEval task will vary depending on the task itself. Measuring the success with which a particular system achieves its task objective can be complex, particularly in the case of temporal multimedia content [10]. For example, in conventional text information retrieval (IR) applications, items are often viewed as either relevant or non-relevant to the user's information need. While often much of such a document will not actually be relevant, it is generally deemed reasonable to label a document as either relevant or non-relevant without taking account of the cost of identifying and extracting the relevant information from it. By contrast, in temporal media, the cost of identifying relevant content and extracting relevant information can be very significant. Thus, metrics typically take into consideration the specific points where relevant content begins and ends, and the cost, most often measured as temporal distance, of locating this within a retrieved item. Further, temporal documents may be divided into segments in order to search for units with maximal proportions of relevant content, so as to promote their retrieval rank and improve the efficiency of content access. Measuring the multiple dimensions of relevance, retrieval rank and "cost" to access relevant content in a single metric presents many challenges.

2. EXAMPLE: SEARCH & HYPERLINKING

As a concrete example, let us consider the MediaEval Search & Hyperlinking (S&H) task [3, 5, 4]. We consider only the search sub-task, which requires participants to find relevant video content from within a collection in response to a user query. The system is required to return a list of video segments (video ID, start time, end time), where the start time suggests the beginning of a relevant portion of a video and the end time suggests where this relevant content ends.

The task can be framed as an IR task and evaluated using the widely adopted Cranfield paradigm for evaluating IR systems. In the context of the S&H task, this is implemented by first generating a pool of the top-ranked retrieved segments (video ID, start time, end time) submitted by the participants for each query. Human assessors recruited through Amazon Mechanical Turk then judge the relevance of each individual segment in the pool with respect to its corresponding query. The set of segments judged relevant by the human annotators then forms the ground truth for the task, specifying, for each query, which time spans in the video collection contain some relevant content.

2.1 User Models and Evaluation Metrics

The evaluation metrics used in the S&H search sub-task have all been based on standard Mean Average Precision (MAP). MAP models a user who scans a ranked results list from top to bottom looking for relevant items. MAP is calculated by computing the average of the precision at each rank where a relevant document is found for a query, and then computing the mean over a set of queries.
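To make this baseline user model concrete, the following is a minimal Python sketch of MAP computed from binary relevance judgements. The function names and the toy ranked lists are illustrative assumptions, not part of the official task evaluation tooling.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for one query.

    ranked_relevance: binary judgements in rank order (True = relevant).
    total_relevant:   number of relevant items for the query; if omitted,
                      the relevant items found in the ranking are used.
    """
    precision_sum = 0.0
    relevant_seen = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    denominator = total_relevant if total_relevant else relevant_seen
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precision values."""
    ap_values = [average_precision(r) for r in per_query_relevance]
    return sum(ap_values) / len(ap_values)


# Two toy queries: relevant items found at ranks 1 and 3, and at rank 2.
print(mean_average_precision([[True, False, True], [False, True]]))
```

Standard AP divides by the total number of relevant items for the query; the sketch exposes this as an optional parameter and otherwise falls back to the number of relevant items found in the ranking.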
Standard MAP is not an appropriate measure for tasks like S&H where the cost of finding relevant information within a suggested relevant segment is non-negligible. Thus, various adaptations of MAP have been explored. Most of these take into account segment overlap or the distance to jump-in points in order to compute the precision with which relevant content has been retrieved, and also to reflect the expected user effort to find and extract the relevant information.

Mean Generalized Average Precision (mGAP) [8, 10] is a variation of MAP which replaces simple binary relevance with a continuous function that penalises systems based on the distance from the ideal jump-in point to the beginning of a retrieved segment. In S&H 2014, three additional measures based on MAP were used: overlap MAP (MAP-over), binned MAP (MAP-bin), and tolerance to irrelevance MAP (MAP-tol) [1, 5]. While these metrics were designed carefully to measure performance in the S&H search task, subsequent analysis of results reveals weaknesses in all of them.

MAP-over rewards systems that return segments that overlap with some relevant content. As defined in [1], this measure presents various issues. First, a system receives extra credit if it returns multiple segments overlapping with the same relevant content. The metric therefore fails to acknowledge that most users will generally not want to see the same relevant content more than once. Furthermore, if a system retrieves more relevant items than the number of relevant segments in the ground truth, MAP-over can be ≥ 1 [7].

MAP-bin splits videos into bins of equal length. Bins overlapping with relevant segments are marked as relevant. A segment is considered relevant if its start time falls within a relevant bin. A system is therefore assumed to return a ranked list of bins, and a user is assumed to watch the content of entire bins in the order given by their ranks. In contrast to MAP-over, under MAP-bin, systems that return multiple jump-in points falling in the same relevant bin only get credit for the best-ranked instance. Analogously, systems that retrieve multiple jump-in points falling in the same non-relevant bin are penalised only once, even though checking every extra non-relevant bin may represent additional effort for the user. Thus, a system that retrieves multiple jump-in points in the proximity of the intersection of two relevant bins is likely to obtain a higher MAP-bin score, because doing so increases its chances of hitting more than just one relevant bin without receiving any extra penalty.
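The binning step behind MAP-bin can be illustrated with a short sketch. The bin length, the (start, end) span representation and the function names below are assumptions made for illustration; the deduplication of multiple hits in the same bin discussed above is deliberately left out.

```python
def relevant_bins(relevant_spans, video_length, bin_length):
    """Indices of fixed-length bins that overlap any relevant time span.

    relevant_spans: list of (start, end) times in seconds from the
    ground truth for one video; bin_length in seconds.
    """
    bins = set()
    num_bins = int(video_length // bin_length) + 1
    for b in range(num_bins):
        b_start, b_end = b * bin_length, (b + 1) * bin_length
        if any(start < b_end and end > b_start for start, end in relevant_spans):
            bins.add(b)
    return bins


def segment_is_relevant(segment_start, rel_bins, bin_length):
    """MAP-bin style judgement: a retrieved segment counts as relevant
    if its start time falls inside a bin marked as relevant."""
    return int(segment_start // bin_length) in rel_bins


# Toy example: one relevant span from 95s to 130s, 60-second bins.
bins = relevant_bins([(95.0, 130.0)], video_length=600.0, bin_length=60.0)
print(bins)                                    # bins 1 and 2 overlap the span
print(segment_is_relevant(100.0, bins, 60.0))  # True: starts inside bin 1
```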
MAP-tol [2, 1] is a simplified form of mGAP which only rewards retrieved segments that start within a pre-defined tolerance window of unseen relevant content. In contrast to MAP-over and MAP-bin, MAP-tol successfully reflects the fact that users will not be satisfied if presented with content that they have seen before. However, MAP-tol rewards equally retrieved segments that point to large and to small amounts of relevant content. It is thus more akin to standard MAP and not sufficiently informative of system behaviour.

Moving on from the variants of MAP introduced in 2014, for this year's search sub-task [4] we introduced a measure that estimates the user's effort in checking the relevance of each retrieved item and that does not reward duplicate results. User effort is measured in terms of the number of seconds that the user must spend auditioning content, and user satisfaction in terms of the number of seconds of new relevant content that they can watch starting from a suggested jump-in point. This measure resembles Mean Average Segment Precision (MASP) [6], but differs from it in that precision is computed at fixed recall points rather than at rank levels. Because of this similarity, we refer to it as MAiSP. We introduced two user models for MAiSP. MAiSP-ret assumes that the user watches the entire retrieved segment independently of whether the segment contains any relevant content. MAiSP-rel assumes that the user watches a retrieved segment until the end point suggested by the system, in the case that no new relevant material continues thereafter, or until the last span of new relevant material is complete.
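As an illustration of the two MAiSP user models, the sketch below computes, for a single retrieved segment, the seconds a user audits (effort) and the seconds of new relevant content they see (satisfaction). It is a simplified reading of the description above, not the full MAiSP computation: aggregation at fixed recall points and the bookkeeping of which relevant material has already been seen across the ranked list are omitted, and the caller is assumed to pass only spans of relevant content that have not yet been seen.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def watched_end(seg_start, seg_end, new_relevant_spans, model):
    """Point at which the user stops watching one retrieved segment.

    model = "ret": the whole retrieved segment is watched.
    model = "rel": the user stops at the suggested end point, or at the
    end of overlapping new relevant material that continues past it.
    """
    if model == "ret":
        return seg_end
    ends = [end for start, end in new_relevant_spans
            if overlap(seg_start, seg_end, start, end) > 0 and end > seg_end]
    return max(ends) if ends else seg_end


def effort_and_gain(seg_start, seg_end, new_relevant_spans, model="rel"):
    """Seconds audited (effort) and seconds of new relevant content seen
    (satisfaction) for a single retrieved segment."""
    stop = watched_end(seg_start, seg_end, new_relevant_spans, model)
    effort = stop - seg_start
    gain = sum(overlap(seg_start, stop, s, e) for s, e in new_relevant_spans)
    return effort, gain


# Toy example: segment 100-160s, new relevant material from 140s to 200s.
print(effort_and_gain(100.0, 160.0, [(140.0, 200.0)], model="rel"))  # (100.0, 60.0)
print(effort_and_gain(100.0, 160.0, [(140.0, 200.0)], model="ret"))  # (60.0, 20.0)
```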
2.2 Correlation analysis

To compare the behaviour of these measures, we ran a series of retrieval experiments using the test collection from the S&H 2014 search sub-task and computed the pairwise Pearson's r correlation between MAP-over, MAP-bin, MAP-tol, MAiSP-ret, and MAiSP-rel across 10,000 ranked lists produced with the Terrier IR platform [9]. Since most of the issues relating to the measures are more likely to be present in ranked lists that contain short and/or overlapping segments, we calculated correlation coefficients with the original set of 10,000 ranked lists and also with a modified version of the ranked lists that did not contain any overlapping segments in the results.
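The pairwise comparison itself is straightforward to reproduce in outline. The sketch below uses NumPy's corrcoef to compute Pearson's r between per-ranked-list scores; the score values are toy placeholders, not the 10,000 Terrier runs used in the experiments reported here.

```python
import numpy as np

# scores[measure] holds one value per ranked list (toy data standing in
# for the per-list scores of the 10,000 Terrier runs).
scores = {
    "MAP-over":  [0.31, 0.44, 0.12, 0.58, 0.27],
    "MAP-bin":   [0.29, 0.41, 0.15, 0.52, 0.30],
    "MAP-tol":   [0.18, 0.35, 0.09, 0.47, 0.22],
    "MAiSP-ret": [0.22, 0.19, 0.28, 0.16, 0.25],
    "MAiSP-rel": [0.20, 0.33, 0.11, 0.45, 0.24],
}

measures = list(scores)
matrix = np.corrcoef([scores[m] for m in measures])  # pairwise Pearson's r

# Print an aligned correlation table in the style of Tables 1 and 2.
print(" " * 10 + "".join(f"{m:>11}" for m in measures))
for name, row in zip(measures, matrix):
    print(f"{name:<10}" + "".join(f"{v:>11.2f}" for v in row))
```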
             MAiSP-rel  MAiSP-ret  MAP-tol  MAP-bin  MAP-over
MAiSP-rel    1.0        -0.12      0.83     0.89     0.88
MAiSP-ret    -0.12      1.0        -0.53    0.01     0.08
MAP-tol      0.83       -0.53      1.0      0.78     0.74
MAP-bin      0.89       0.01       0.78     1.0      0.86
MAP-over     0.88       0.08       0.74     0.86     1.0

Table 1: Correlation between measures when overlapping segments are removed from the ranked lists.

             MAiSP-rel  MAiSP-ret  MAP-tol  MAP-bin  MAP-over
MAiSP-rel    1.0        -0.02      0.89     0.45     -0.47
MAiSP-ret    -0.02      1.0        -0.33    0.01     -0.33
MAP-tol      0.89       -0.33      1.0      0.51     -0.27
MAP-bin      0.45       0.01       0.51     1.0      0.39
MAP-over     -0.47      -0.38      -0.27    0.39     1.0

Table 2: Correlation between measures when the ranked lists contain overlapping segments.

Table 1 shows how the measures correlate when overlapping segments are removed from the ranked lists. Most of the measures correlate relatively well in this case. However, MAiSP-ret seems to be orthogonal to MAiSP-rel, MAP-bin and MAP-over, and to correlate negatively with MAP-tol. This is because MAiSP-ret is the only measure that assesses the quality of both the start and end time points of the retrieved segments. Table 2 shows correlations for the ranked lists that contain overlapping segments. MAP-over correlates negatively with most of the other measures, while MAP-bin correlates less strongly with MAP-tol and MAiSP than in Table 1, suggesting that these measures fail to penalise ranked lists containing duplicate results and that they therefore fail to reflect the users' preference against redundancy in the result lists.

3. CONCLUSIONS

Designing evaluation measures for MediaEval tasks is often challenging. In tasks such as S&H, it is important to seek a measure of effectiveness which reflects the system's ability to find the necessary content and to maximise the satisfaction of the user in doing so. In the context of the S&H task, this essentially means minimising the user's effort in satisfying their information need.

This note has shown how task evaluation measures can be refined over multiple editions of a task as the organisers come to better understand their task and reflect on its nature and its evaluation. From our experiences in the S&H task, it is important for task organisers to consider the necessary features of the evaluation metrics of the task, and to be open to reflecting on the strengths and weaknesses of the metric itself, as well as on the calculated results, when evaluating participant submissions.

4. ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Global Intelligent Content (CNGL II) project at DCU.

5. REFERENCES

[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical report, arXiv preprint arXiv:1312.1913, 2013.
[2] A. P. De Vries, G. Kazai, and M. Lalmas. Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In RIAO 2004 Conference Proceedings, pages 463-473, Avignon, France, April 2004.
[3] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[4] M. Eskevich, R. Aly, D. N. Racca, S. Chen, and G. J. F. Jones. SAVA at MediaEval 2015: Search and anchoring in video archives. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 2015.
[5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2014. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, October 2014.
[6] M. Eskevich, W. Magdy, and G. J. F. Jones. New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of ECIR 2012, pages 170-181, Barcelona, Spain, 2012.
[7] P. Galuščáková and P. Pecina. CUNI at MediaEval 2014 search and hyperlinking task: Search task experiments. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, October 2014.
[8] B. Liu and D. W. Oard. One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 673-674. ACM, 2006.
[9] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: A search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Next Generation Web Search, pages 49-56, 2007.
[10] P. Pecina, P. Hoffmannova, G. J. F. Jones, Y. Zhang, and D. W. Oard. Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of CLEF 2007, pages 674-686, 2007.