=Paper=
{{Paper
|id=Vol-1436/Paper73
|storemode=property
|title=Evaluating Search and Hyperlinking: An Example of the Design, Test, Refine Cycle for Metric Development
|pdfUrl=https://ceur-ws.org/Vol-1436/Paper73.pdf
|volume=Vol-1436
|dblpUrl=https://dblp.org/rec/conf/mediaeval/RaccaJ15
}}
==Evaluating Search and Hyperlinking: An Example of the Design, Test, Refine Cycle for Metric Development==
David N. Racca, Gareth J. F. Jones
ADAPT Centre, School of Computing, Dublin City University, Dublin 9, Ireland
{dracca, gjones}@computing.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.

ABSTRACT

Designing meaningful metrics for evaluating MediaEval tasks that are able to capture multiple aspects of system effectiveness and user satisfaction is far from straightforward. A considerable part of the effort in organising such a task must often be devoted to selecting, designing or refining a suitable evaluation metric. We review evaluation metrics from the MediaEval Search and Hyperlinking task, illustrating the motivation behind the metrics proposed for the task, and how reflection on results has led to iterative metric refinement in subsequent campaigns.

1. INTRODUCTION

It is a principle of MediaEval tasks that they should be built around a realistic use-case. This means that it is implicit in a MediaEval task that it should seek to evaluate participant submissions with respect to their effectiveness in performing the task, and, by implication, that this should be related to a user's satisfaction with the actions of the system used in the participant's submission.

The objective of a MediaEval task will vary depending on the task itself. Measuring the success with which a particular system achieves its task objective can be complex, particularly in the case of temporal multimedia content [10]. For example, in conventional text information retrieval (IR) applications, items are often viewed as either relevant or non-relevant to the user's information need. While often much of such a document will not actually be relevant, it is generally deemed reasonable to label a document as either relevant or non-relevant without taking account of the cost of identifying and extracting the relevant information from it. By contrast, in temporal media, the cost of identifying relevant content and extracting relevant information can be very significant. Thus, metrics typically take into consideration the specific points where relevant content begins and ends, and the cost, most often measured as temporal distance, of locating this within a retrieved item. Further, temporal documents may be divided into segments in order to search for units with maximal proportions of relevant content, so as to promote their retrieval rank and improve the efficiency of content access. Measuring the multiple dimensions of relevance, retrieval rank and "cost" to access relevant content in a single metric presents many challenges.

2. EXAMPLE: SEARCH & HYPERLINKING

As a concrete example, let us consider the MediaEval Search & Hyperlinking (S&H) task [3, 5, 4]. We consider only the search sub-task, which requires participants to find relevant video content from within a collection in response to a user query. The system is required to return a list of video segments (video ID, start time, end time), where the start time suggests the beginning of a relevant portion of a video and the end time suggests where this relevant content ends.

The task can be framed as an IR task and evaluated using the widely adopted Cranfield paradigm for evaluating IR systems. In the context of the S&H task, this is implemented by first generating a pool of the top-ranked retrieved segments (video ID, start time, end time) submitted by the participants for each query. Human assessors recruited through Amazon Mechanical Turk then judge the relevance of each individual segment in the pool with respect to its corresponding query. The set of segments judged relevant by the human annotators then forms the ground truth for the task, specifying, for each query, which time spans in the video collection contain some relevant content.

2.1 User Models and Evaluation Metrics

The evaluation metrics used in the S&H search sub-task have all been based on standard Mean Average Precision (MAP). MAP models a user who scans a ranked results list from top to bottom looking for relevant items. MAP is calculated by computing the average of the precision at each rank where a relevant document is found for a query, and then computing the mean over a set of queries.
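To make this baseline user model concrete, the following is a minimal Python sketch of MAP computed from binary relevance judgements. The function names and the toy ranked lists are illustrative assumptions, not part of the official task evaluation tooling.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for one query.

    ranked_relevance: binary judgements in rank order (True = relevant).
    total_relevant:   number of relevant items for the query; if omitted,
                      the relevant items found in the ranking are used.
    """
    precision_sum = 0.0
    relevant_seen = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            relevant_seen += 1
            precision_sum += relevant_seen / rank
    denominator = total_relevant if total_relevant else relevant_seen
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(per_query_relevance):
    """Mean of the per-query average precision values."""
    ap_values = [average_precision(r) for r in per_query_relevance]
    return sum(ap_values) / len(ap_values)


# Two toy queries: relevant items found at ranks 1 and 3, and at rank 2.
print(mean_average_precision([[True, False, True], [False, True]]))
```

Standard AP divides by the total number of relevant items for the query; the sketch exposes this as an optional parameter and otherwise falls back to the number of relevant items found in the ranking.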
Standard MAP is not an appropriate measure for tasks like S&H where the cost of finding relevant information within a suggested relevant segment is non-negligible. Thus, various adaptations of MAP have been explored. Most of these take into account segment overlap or the distance to jump-in points in order to compute the precision with which relevant content has been retrieved, and also to reflect the expected user effort to find and extract the relevant information.

Mean Generalized Average Precision (mGAP) [8, 10] is a variation of MAP which replaces simple binary relevance with a continuous function that penalises systems based on the distance from the ideal jump-in point to the beginning of a retrieved segment. In S&H 2014, three additional measures based on MAP were used: overlap MAP (MAP-over), binned MAP (MAP-bin), and tolerance to irrelevance MAP (MAP-tol) [1, 5]. While these metrics were designed carefully to measure performance in the S&H search task, subsequent analysis of results reveals weaknesses in all of them.

MAP-over rewards systems that return segments that overlap with some relevant content. As defined in [1], this measure presents various issues. First, a system receives extra credit if it returns multiple segments overlapping with the same relevant content. The metric therefore fails to acknowledge that most users will generally not want to see the same relevant content more than once. Furthermore, if a system retrieves more relevant items than the number of relevant segments in the ground truth, MAP-over can be ≥ 1 [7].

MAP-bin splits videos into bins of equal length. Bins overlapping with relevant segments are marked as relevant. A segment is considered relevant if its start time falls within a relevant bin. A system is therefore assumed to return a ranked list of bins, and a user is assumed to watch the content of entire bins in the order given by their ranks. In contrast to MAP-over, under MAP-bin, systems that return multiple jump-in points falling in the same relevant bin only get credit for the best-ranked instance. Analogously, systems that retrieve multiple jump-in points falling in the same non-relevant bin are penalised only once, even though checking every extra non-relevant bin may represent additional effort for the user. Thus, a system that retrieves multiple jump-in points in the proximity of the intersection of two relevant bins is likely to obtain a higher MAP-bin score, because doing so increases its chances of hitting more than just one relevant bin without receiving any extra penalty.
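The binning step behind MAP-bin can be illustrated with a short sketch. The bin length, the (start, end) span representation and the function names below are assumptions made for illustration; the deduplication of multiple hits in the same bin discussed above is deliberately left out.

```python
def relevant_bins(relevant_spans, video_length, bin_length):
    """Indices of fixed-length bins that overlap any relevant time span.

    relevant_spans: list of (start, end) times in seconds from the
    ground truth for one video; bin_length in seconds.
    """
    bins = set()
    num_bins = int(video_length // bin_length) + 1
    for b in range(num_bins):
        b_start, b_end = b * bin_length, (b + 1) * bin_length
        if any(start < b_end and end > b_start for start, end in relevant_spans):
            bins.add(b)
    return bins


def segment_is_relevant(segment_start, rel_bins, bin_length):
    """MAP-bin style judgement: a retrieved segment counts as relevant
    if its start time falls inside a bin marked as relevant."""
    return int(segment_start // bin_length) in rel_bins


# Toy example: one relevant span from 95s to 130s, 60-second bins.
bins = relevant_bins([(95.0, 130.0)], video_length=600.0, bin_length=60.0)
print(bins)                                    # bins 1 and 2 overlap the span
print(segment_is_relevant(100.0, bins, 60.0))  # True: starts inside bin 1
```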
MAP-tol [2, 1] is a simplified form of mGAP which only rewards retrieved segments that start within a pre-defined tolerance window of unseen relevant content. In contrast to MAP-over and MAP-bin, MAP-tol successfully reflects the fact that users will not be satisfied if presented with content that they have seen before. However, MAP-tol rewards equally retrieved segments that point to large and to small amounts of relevant content. It is thus more akin to standard MAP and not sufficiently informative of system behaviour.

Moving on from the variants of MAP introduced in 2014, for this year's search sub-task [4] we introduced a measure that estimates the user's effort in checking the relevance of each retrieved item and that does not reward duplicate results. User effort is measured in terms of the number of seconds that the user must spend auditioning content, and user satisfaction in terms of the number of seconds of new relevant content that they can watch starting from a suggested jump-in point. This measure resembles Mean Average Segment Precision (MASP) [6], but differs from it in that precision is computed at fixed recall points rather than at rank levels. Because of this similarity, we refer to it as MAiSP. We introduced two user models for MAiSP. MAiSP-ret assumes that the user watches the entire retrieved segment independently of whether the segment contains any relevant content. MAiSP-rel assumes that the user watches a retrieved segment until the end point suggested by the system, in the case that no new relevant material continues thereafter, or until the last span of new relevant material is complete.
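As an illustration of the two MAiSP user models, the sketch below computes, for a single retrieved segment, the seconds a user audits (effort) and the seconds of new relevant content they see (satisfaction). It is a simplified reading of the description above, not the full MAiSP computation: aggregation at fixed recall points and the bookkeeping of which relevant material has already been seen across the ranked list are omitted, and the caller is assumed to pass only spans of relevant content that have not yet been seen.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time spans."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def watched_end(seg_start, seg_end, new_relevant_spans, model):
    """Point at which the user stops watching one retrieved segment.

    model = "ret": the whole retrieved segment is watched.
    model = "rel": the user stops at the suggested end point, or at the
    end of overlapping new relevant material that continues past it.
    """
    if model == "ret":
        return seg_end
    ends = [end for start, end in new_relevant_spans
            if overlap(seg_start, seg_end, start, end) > 0 and end > seg_end]
    return max(ends) if ends else seg_end


def effort_and_gain(seg_start, seg_end, new_relevant_spans, model="rel"):
    """Seconds audited (effort) and seconds of new relevant content seen
    (satisfaction) for a single retrieved segment."""
    stop = watched_end(seg_start, seg_end, new_relevant_spans, model)
    effort = stop - seg_start
    gain = sum(overlap(seg_start, stop, s, e) for s, e in new_relevant_spans)
    return effort, gain


# Toy example: segment 100-160s, new relevant material from 140s to 200s.
print(effort_and_gain(100.0, 160.0, [(140.0, 200.0)], model="rel"))  # (100.0, 60.0)
print(effort_and_gain(100.0, 160.0, [(140.0, 200.0)], model="ret"))  # (60.0, 20.0)
```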
2.2 Correlation analysis

To compare the behaviour of these measures, we ran a series of retrieval experiments using the test collection from the S&H 2014 search sub-task and computed the pairwise Pearson's r correlation between MAP-over, MAP-bin, MAP-tol, MAiSP-ret, and MAiSP-rel across 10,000 ranked lists produced with the Terrier IR platform [9]. Since most of the issues relating to the measures are more likely to be present in ranked lists that contain short and/or overlapping segments, we calculated correlation coefficients with the original set of 10,000 ranked lists and also with a modified version of the ranked lists that did not contain any overlapping segments in the results.
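The pairwise comparison itself is straightforward to reproduce in outline. The sketch below uses NumPy's corrcoef to compute Pearson's r between per-ranked-list scores; the score values are toy placeholders, not the 10,000 Terrier runs used in the experiments reported here.

```python
import numpy as np

# scores[measure] holds one value per ranked list (toy data standing in
# for the per-list scores of the 10,000 Terrier runs).
scores = {
    "MAP-over":  [0.31, 0.44, 0.12, 0.58, 0.27],
    "MAP-bin":   [0.29, 0.41, 0.15, 0.52, 0.30],
    "MAP-tol":   [0.18, 0.35, 0.09, 0.47, 0.22],
    "MAiSP-ret": [0.22, 0.19, 0.28, 0.16, 0.25],
    "MAiSP-rel": [0.20, 0.33, 0.11, 0.45, 0.24],
}

measures = list(scores)
matrix = np.corrcoef([scores[m] for m in measures])  # pairwise Pearson's r

# Print an aligned correlation table in the style of Tables 1 and 2.
print(" " * 10 + "".join(f"{m:>11}" for m in measures))
for name, row in zip(measures, matrix):
    print(f"{name:<10}" + "".join(f"{v:>11.2f}" for v in row))
```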
             MAiSP-rel  MAiSP-ret  MAP-tol  MAP-bin  MAP-over
MAiSP-rel    1.0        -0.12      0.83     0.89     0.88
MAiSP-ret    -0.12      1.0        -0.53    0.01     0.08
MAP-tol      0.83       -0.53      1.0      0.78     0.74
MAP-bin      0.89       0.01       0.78     1.0      0.86
MAP-over     0.88       0.08       0.74     0.86     1.0

Table 1: Correlation between measures when overlapping segments are removed from the ranked lists.

             MAiSP-rel  MAiSP-ret  MAP-tol  MAP-bin  MAP-over
MAiSP-rel    1.0        -0.02      0.89     0.45     -0.47
MAiSP-ret    -0.02      1.0        -0.33    0.01     -0.33
MAP-tol      0.89       -0.33      1.0      0.51     -0.27
MAP-bin      0.45       0.01       0.51     1.0      0.39
MAP-over     -0.47      -0.38      -0.27    0.39     1.0

Table 2: Correlation between measures when the ranked lists contain overlapping segments.

Table 1 shows how the measures correlate when overlapping segments are removed from the ranked lists. Most of the measures correlate relatively well in this case. However, MAiSP-ret seems to be orthogonal to MAiSP-rel, MAP-bin and MAP-over, and to correlate negatively with MAP-tol. This is because MAiSP-ret is the only measure that assesses the quality of both the start and end time points of the retrieved segments. Table 2 shows correlations for the ranked lists that contain overlapping segments. MAP-over correlates negatively with most of the other measures, while MAP-bin correlates less strongly with MAP-tol and MAiSP than in Table 1, suggesting that these measures fail to penalise ranked lists containing duplicate results and that they therefore fail to reflect the users' preference against redundancy in the result lists.

3. CONCLUSIONS

Designing evaluation measures for MediaEval tasks is often challenging. In tasks such as S&H, it is important to seek a measure of effectiveness which reflects the system's ability to find the necessary content and to maximise the satisfaction of the user in doing so. In the context of the S&H task, this essentially means minimising the user's effort in satisfying their information need.

This note has shown how task evaluation measures can be refined over multiple editions of a task as the organisers come to better understand their task and reflect on its nature and its evaluation. From our experiences in the S&H task, it is important for task organisers to consider the necessary features of the evaluation metrics of the task, and to be open to reflecting on the strengths and weaknesses of the metric itself, as well as on the calculated results, when evaluating participant submissions.

4. ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Global Intelligent Content (CNGL II) project at DCU.

5. REFERENCES

[1] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical report, arXiv preprint arXiv:1312.1913, 2013.
[2] A. P. De Vries, G. Kazai, and M. Lalmas. Tolerance to irrelevance: A user-effort oriented evaluation of retrieval systems without predefined retrieval unit. In RIAO 2004 Conference Proceedings, pages 463-473, Avignon, France, April 2004.
[3] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[4] M. Eskevich, R. Aly, D. N. Racca, S. Chen, and G. J. F. Jones. SAVA at MediaEval 2015: Search and anchoring in video archives. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany, September 2015.
[5] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2014. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, October 2014.
[6] M. Eskevich, W. Magdy, and G. J. F. Jones. New metrics for meaningful evaluation of informally structured speech retrieval. In Proceedings of ECIR 2012, pages 170-181, Barcelona, Spain, 2012.
[7] P. Galuščáková and P. Pecina. CUNI at MediaEval 2014 search and hyperlinking task: Search task experiments. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, October 2014.
[8] B. Liu and D. W. Oard. One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 673-674. ACM, 2006.
[9] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: A search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Next Generation Web Search, pages 49-56, 2007.
[10] P. Pecina, P. Hoffmannova, G. J. F. Jones, Y. Zhang, and D. W. Oard. Overview of the CLEF 2007 cross-language speech retrieval track. In Proceedings of CLEF 2007, pages 674-686, 2007.