 Pursuing a Moving Target: Iterative Use of Benchmarking
            of a Task to Understand the Task

          Maria Eskevich1 , Gareth J. F. Jones2 , Robin Aly3 , Roeland Ordelman3 , Benoit Huet4
1 Radboud University, The Netherlands; 2 ADAPT Centre, School of Computing, Dublin City University, Ireland;
3 University of Twente, The Netherlands; 4 EURECOM, Sophia Antipolis, France
     m.eskevich@let.ru.nl; gjones@computing.dcu.ie; {r.aly, ordelman}@ewi.utwente.nl; Benoit.Huet@eurecom.fr

ABSTRACT
Individual tasks carried out within benchmarking initiatives, or campaigns, enable direct comparison of alternative approaches to tackling shared research challenges, and ideally promote new research ideas and foster communities of researchers interested in common or related scientific topics. When a task has a clear predefined use case, it might straightforwardly adopt a well established framework and methodology; for example, an ad hoc information retrieval task can adopt the standard Cranfield paradigm. On the other hand, in the case of new and emerging tasks which pose more complex challenges in terms of use scenarios or dataset design, the development of a new task is far from a straightforward process. This letter summarises our reflections on our experiences as task organisers of the Search and Hyperlinking task, from its origins as a Brave New Task at the MediaEval benchmarking campaign (2011–2014) to its current instantiation as a task at the NIST TRECVid benchmark (since 2015). We highlight the challenges encountered in the development of the task over a number of annual iterations, the solutions found so far, and our process for maintaining a vision for the ongoing advancement of the task’s ambition.

1.   INTRODUCTION
Benchmark evaluation campaigns have become a key activity within a broad range of information processing disciplines, having demonstrated their critical impact on these fields’ scientific progress, especially for the information retrieval research community [11]. Individual benchmark tasks within these campaigns facilitate direct comparison of alternative approaches to specific technical challenges, encourage scientific innovation and, perhaps less obviously, enable understanding of what the task actually is. This last point is significant in the sense that the goal of a task can often be viewed as a “moving target” over successive (usually annual) iterations of the task. This situation arises over the period in which the task is active, as the task organisers come to better understand what the task is seeking to achieve as a result of working to address questions raised by the specification of the task itself, the development of task datasets, the task participants’ feedback, and the evaluation and analysis of the task results. In this letter we provide a brief review of our experiences of multiple iterations of the Search and Hyperlinking task developed within the MediaEval benchmark campaigns.

Copyright is held by the author/owner(s).
MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands

2.   SEARCH AND HYPERLINKING AT MEDIAEVAL
Our idea to define and shape an exploration of Search and Hyperlinking (S&H) through a benchmarking activity initially emerged from a combination of reasons. A number of varied and challenging large scale multimedia data archives relevant to such a task were already becoming available, while the constantly increasing and diverse deluge of new multimedia content being produced, stored and shared by non-professional, semi-professional and professional users meant that there was a compelling motivation to explore methods to search and manage this content. At the same time, scientific advances had reached the stage where algorithms with the potential to address more creative tasks, encompassing known-item and ad hoc retrieval of specific parts of content as well as personalised collection exploration, were becoming available. Embarking on this adventure was also appealing since various aspects of the overall S&H task had already been investigated or tested in smaller scale tasks, e.g. the MediaEval 2011 Rich Speech Retrieval (RSR) Task [6] and the VideoCLEF 2009 Linking Task [7].

3.   FROM A BRAVE NEW TASK TO A BRAVE NEW WORLD
From the starting point of a use case for a new task, the development of an actual benchmark activity often appears straightforward. However, this is often not the case, and once the task organisers begin to operationalise their ideas, technical and practical challenges emerge. This means that the task released to the participants is generally a technical and practical compromise, often containing hidden questions that the task organisers are unable to answer based on their current understanding of the user behaviour model or of the technical issues of the task. Thus the current instance of a task can itself be designed to answer these questions, in order to move the task forward towards its ultimate research goals by exploiting better use case definition and representation in a subsequent version of the task.

The S&H tasks were a classic example of this situation. Once we began to examine the scope of what the task required in terms of specification and implementation, we realised that there were many questions to be addressed in order to fully understand the task itself and how it should best be implemented to benchmark the usability of its outputs and the algorithmic contributions of the participants’ solutions. The activity thus began as a relatively small scale Brave New Task at MediaEval 2011 [6]. The key issue addressed in the first iteration was the exploration of the potential of crowdsourcing technologies for the query creation stage for a given collection and for the ground truth definition [4]. In setting up a task that we envisaged as being inspired by users’ potential interests and request creation, we wanted to engage real users in both task definition and evaluation.

In subsequent years the task received the status of a Main Task, meaning that we were able to gather a group of core participants (at least five) who expressed their interest in participating each year. Being a Main Task did not mean that the task definition and evaluation were set in stone, and thus we kept experimenting each year with the collection, the type of users and their requests, and the evaluation metrics.

In 2014, we felt that the innovative Video Hyperlinking subtask within the S&H task had reached a good level of maturity in terms of task infrastructure, i.e., task definition [8], data availability and evaluation procedure [1], but that there were still many unanswered questions in terms of addressing the algorithmic challenges of the task. We therefore sought to increase participation and the range of scientific input by offering the task to TRECVid 2015, where it was subsequently accepted by the TRECVid chairs [9].

Although we took the task to another venue, where most of the evaluation is usually done by NIST experts, we adhered to the crowdsourcing anchor creation and evaluation procedures that were established within our MediaEval activities. This approach preserved our flexibility in terms of the creativity of the task definition, and we kept our commitment to have users involved at all stages of benchmarking.

4.   ITERATIVE TASK EVOLUTION
Traditionally, well established tasks with a straightforward scenario follow a pattern of gradually growing their dataset with each yearly iteration, using the same evaluation metric or set of metrics, and sometimes running the same software on the revised dataset, in order to be able to carry out direct comparison of technology performance over the years. In the case of a more exploratory and innovative task that is developed through collaboration with and feedback from participants, the same broad user scenario can be tested under different conditions, e.g. diverse target users of the potentially developed approaches, different datasets, and evolving evaluation metrics that cover aspects of the task that could not have been foreseen beforehand.

When a task is defined by a clear use case scenario existing within an industrial setting, the task can be promoted by these industrial partners via data provision and help with the on-site evaluation. As our research focus is on large video archives that are not always created and gathered with a clear monetisation strategy in mind, and are often aimed at cultural heritage preservation (without predefined usage scenarios), we had more freedom in defining the framework. The feedback from the crowdworkers helped us to test algorithms addressing the task in a fast iterative way.

Another aspect that has to be taken into account when setting up a task with a large data collection in mind is the copyright question. When the task is in its initial early development stage, it is easier to use a Creative Commons dataset to test the task feasibility. This proof of concept of the task’s viability allows the organisers to demonstrate the soundness of the overall framework, and thus to engage potential industry partners. This was the case for the S&H task, which started with the BlipTV collection [10] and then switched to a BBC dataset [3, 2]. However, the use of professionally created and copyrighted material also makes the task more dependent on external partners and liable to any changes in the legal status of the data. Overall, the opportunity to run the task with different datasets enriches the discussion of the scientific approaches.

On the other hand, crowdsourcing of the task definition and results evaluation keeps the focus of the task on the user, and allows us to relate the scientific methods under test to current users’ technology expectations. This brings a practical insight into the impact of performance improvements in algorithms on the user experience. In a way, the workers become part of the organisers’ team, i.e., the task, although envisaged by the scientists, is finally shaped and vetted by real users.

5.   THE VIEW FROM A NEW HOME
Having already run the task for two years at the TRECVid benchmark, we can compare our experiences and outline the differences. At both venues, at the initial stage of the yearly cycle, the tasks get feedback from the overall benchmark organising committee in terms of task feasibility and interest within the targeted scientific community. However, during the yearly cycle of actually running the task, within the MediaEval campaign the organisers of all the tasks are aware of task progress, raising issues and sharing their solutions via bi-weekly conference calls. This is especially helpful when tasks share datasets, or when they are run for the first time and the organisers lack experience.

As organisers of a creative novel task, we found that interaction within the community of task organisers and with the actual task participants proved very useful in enabling us to react quickly to any issues arising with the task, from the data release to the submissions and the release of evaluation results. However, TRECVid allows organisers to delegate some organisational activities to NIST, thus saving time.

Running a benchmarking task requires a lot of commitment from the organisers, and interest and engagement from the scientific community. In our experience, the period of growing interest and participation in the S&H task coincided with a number of related funded projects, which also meant that the end of the funding cycle affected the number of participants, even though the scientific findings and discussions were still on an upward path. The move to TRECVid allowed us to involve large labs and companies that often participate in this venue.

6.   SUMMARY AND FUTURE OUTLOOK
We have presented the evolution of the S&H task to date. Although the task has now operated at two benchmarking venues, future challenges remain, with the most critical issue being sustainability [5]. While research is often bound to projects of finite length, the organisation of tasks should ideally be able to continue independently of these. This is particularly challenging in terms of human and technical resources.

7.   ACKNOWLEDGMENTS
This work has been partially supported by: ESF Research Networking Programme ELIAS; BpiFrance within the NexGenTV project, grant no. F1504054U; Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (13/RC/2106); EC FP7 project FP7-ICT 269980 (AXES).
8.   REFERENCES
 [1] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and
     S. Chen. Linking inside a video collection: what and
     how to measure? In Proceedings of the 22nd
     International World Wide Web Conference (WWW
     ’13), Companion Volume, pages 457–460, 2013.
 [2] S. Chen, M. Eskevich, G. J. F. Jones, and N. E.
     O’Connor. An Investigation into Feature Effectiveness
     for Multimedia Hyperlinking. In MultiMedia Modeling
     - 20th Anniversary International Conference (MMM
     2014), Proceedings, Part II, pages 251–262, Dublin,
     Ireland, 2014.
 [3] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman,
     S. Chen, and G. J. F. Jones. The Search and
     Hyperlinking Task at MediaEval 2014. In Working
     Notes Proceedings of the MediaEval 2014 Workshop,
     Barcelona, Catalunya, Spain, 2014.
 [4] M. Eskevich, G. J. F. Jones, M. Larson, and
     R. Ordelman. Creating a Data Collection for
     Evaluating Rich Speech Retrieval. In Proceedings of
     the Eighth International Conference on Language
     Resources and Evaluation (LREC 2012), pages
     1736–1743, Istanbul, Turkey, 2012.
 [5] F. Hopfgartner, A. Hanbury, H. Müller, N. Kando,
     S. Mercer, J. Kalpathy-Cramer, M. Potthast,
     T. Gollub, A. Krithara, J. Lin, K. Balog, and I. Eggel.
     Report on the Evaluation-as-a-Service (EaaS) Expert
     Workshop. SIGIR Forum, 49(1):57–65, June 2015.
 [6] M. Larson, M. Eskevich, R. Ordelman, C. Kofler,
     S. Schmiedeke, and G. J. F. Jones. Overview of
     MediaEval 2011 Rich Speech Retrieval Task and
     Genre Tagging Task. In Working Notes Proceedings of
     the MediaEval 2011 Workshop, Santa Croce in
     Fossabanda, Pisa, Italy, 2011.
 [7] M. Larson, E. Newman, and G. J. F. Jones. Overview
     of VideoCLEF 2009: New Perspectives on Speech-based
     Multimedia Content Enrichment. In Proceedings of the
     10th International Conference on Cross-language
     Evaluation Forum: Multimedia Experiments,
     CLEF’09, pages 354–368, Berlin, Heidelberg, 2010.
     Springer-Verlag.
 [8] R. J. F. Ordelman, M. Eskevich, R. Aly, B. Huet, and
     G. J. F. Jones. Defining and Evaluating Video
     Hyperlinking for Navigating Multimedia Archives. In
     A. Gangemi, S. Leonardi, and A. Panconesi, editors,
     Proceedings of the 24th International Conference on
     World Wide Web Companion (WWW 2015),
     Companion Volume, pages 727–732, Florence, Italy,
     2015. ACM.
 [9] P. Over, J. Fiscus, D. Joy, M. Michel, G. Awad,
     W. Kraaij, A. F. Smeaton, G. Quénot, and
     R. Ordelman. TRECVID 2015 – an overview of the
     goals, tasks, data, evaluation mechanisms and metrics.
     In Proceedings of TRECVID 2015. NIST, USA, 2015.
[10] S. Schmiedeke, P. Xu, I. Ferrané, M. Eskevich,
     C. Kofler, M. A. Larson, Y. Estève, L. Lamel, G. J. F.
     Jones, and T. Sikora. Blip10000: a social video dataset
     containing SPUG content for tagging and retrieval. In
     Multimedia Systems Conference 2013 (MMSys ’13),
     pages 96–101, Oslo, Norway, 2013.
[11] C. V. Thornley, A. C. Johnson, A. F. Smeaton, and
     H. Lee. The scholarly impact of TRECVid (2003–2009).
     Journal of the American Society for Information
     Science and Technology, 62(4):613–627, 2011.