=Paper=
{{Paper
|id=Vol-1739/MediaEval_2016_paper_11
|storemode=property
|title=Pursuing a Moving Target: Iterative Use of Benchmarking of a Task to Understand the Task
|pdfUrl=https://ceur-ws.org/Vol-1739/MediaEval_2016_paper_11.pdf
|volume=Vol-1739
|dblpUrl=https://dblp.org/rec/conf/mediaeval/EskevichJAOH16
}}
==Pursuing a Moving Target: Iterative Use of Benchmarking of a Task to Understand the Task==
Maria Eskevich (1), Gareth J. F. Jones (2), Robin Aly (3), Roeland Ordelman (3), Benoit Huet (4)

(1) Radboud University, The Netherlands; (2) ADAPT Centre, School of Computing, Dublin City University, Ireland; (3) University of Twente, The Netherlands; (4) EURECOM, Sophia Antipolis, France

m.eskevich@let.ru.nl; gjones@computing.dcu.ie; {r.aly, ordelman}@ewi.utwente.nl; Benoit.Huet@eurecom.fr

Copyright is held by the author/owner(s). MediaEval 2016 Workshop, Oct. 20-21, 2016, Hilversum, Netherlands.

ABSTRACT

Individual tasks carried out within benchmarking initiatives, or campaigns, enable direct comparison of alternative approaches to tackling shared research challenges, and ideally promote new research ideas and foster communities of researchers interested in common or related scientific topics. When a task has a clear predefined use case, it might straightforwardly adopt a well established framework and methodology, for example an ad hoc information retrieval task adopting the standard Cranfield paradigm. On the other hand, in the case of new and emerging tasks which pose more complex challenges in terms of use scenarios or dataset design, the development of a new task is far from a straightforward process. This letter summarises our reflections on our experiences as task organisers of the Search and Hyperlinking task, from its origins as a Brave New Task at the MediaEval benchmarking campaign (2011–2014) to its current instantiation as a task at the NIST TRECVid benchmark (since 2015). We highlight the challenges encountered in the development of the task over a number of annual iterations, the solutions found so far, and our process for maintaining a vision for the ongoing advancement of the task's ambition.

1. INTRODUCTION

Benchmark evaluation campaigns have become a key activity within a broad range of information processing disciplines, having demonstrated their critical impact on the fields' scientific progress, especially for the information retrieval research community [11]. Individual benchmark tasks within these campaigns facilitate direct comparison of alternative approaches to specific technical challenges, encourage scientific innovation and, perhaps less obviously, enable understanding of what the task actually is. This last point is significant in the sense that the goal of a task can often be viewed as a "moving target" over successive (usually annual) iterations of the task. This situation arises over the period in which the task is active, as the task organisers come to better understand what the task is seeking to achieve as a result of working to address questions raised by the specification of the task itself, the development of task datasets, the task participants' feedback, and the evaluation and analysis of the task results. In this letter we provide a brief review of our experiences of multiple iterations of the Search and Hyperlinking task developed within the MediaEval benchmark campaigns.

2. SEARCH AND HYPERLINKING AT MEDIAEVAL

Our idea to define and shape an exploration of Search and Hyperlinking (S&H) through a benchmarking activity initially emerged from a diverse combination of factors. A number of varied and challenging large scale multimedia data archives relevant to such a task were already becoming available, while the constantly increasing and diverse deluge of new multimedia content being produced, stored and shared by non-professional, semi-professional and professional users meant that there was a compelling motivation to explore methods to search and manage this content. At the same time, scientific advances had reached the stage where algorithms with the potential to address more creative tasks, which could encompass known-item and ad hoc retrieval of specific parts of content as well as personalised collection exploration, were becoming available. Embarking on this adventure was also appealing since various aspects of the overall S&H task had already been investigated or tested in smaller scale tasks, e.g. the MediaEval 2011 Rich Speech Retrieval (RSR) Task [6] and the VideoCLEF 2009 Linking Task [7].

3. FROM A BRAVE NEW TASK TO A BRAVE NEW WORLD

From the starting point of a use case for a new task, the development of an actual benchmark activity often appears straightforward. However, this is often not the case, and once the task organisers begin to operationalise their ideas, technical and practical challenges begin to emerge. This means that the task released to the participants is generally a technical and practical compromise, often containing hidden questions that the task organisers are unable to answer based on their current understanding of the user behaviour model or of the technical issues of the task. Thus the current instance of a task can itself be designed to answer these questions in order to move the task forward towards its ultimate research goals, by exploiting better use case definition and representation in a subsequent version of the task.

The S&H tasks were a classic example of this situation. Once we began to examine the scope of what the task required in terms of specification and implementation, we realised that there were many questions to be addressed in order to fully understand the task itself and how it should best be implemented to benchmark the usability of its outputs and the algorithmic contributions of the participants' solutions. The activity thus began as a relatively small-scale Brave New Task at MediaEval 2011 [6]. The key issue addressed in the first iteration was the exploration of the potential of crowdsourcing technologies for the query creation stage for a given collection and for the ground truth definition [4]. Setting up a task that we envisaged as inspired by users' potential interests and request creation, we wanted to engage real users in both task definition and evaluation.

In subsequent years the task received the status of a Main Task, meaning that we were able to gather a group of core participants (at least five) who expressed their interest in participating each year. Being a Main Task did not mean that the task definition and evaluation were set in stone, and thus we kept experimenting with the collection, the type of users and their requests, and the evaluation metrics each year.

In 2014, we felt that the innovative Video Hyperlinking subtask within the S&H task had reached a good level of maturity in terms of task infrastructure, i.e., task definition [8], data availability and evaluation procedure [1], but there were still many questions unanswered in terms of addressing the algorithmic challenges of the task. We therefore sought the opportunity to increase participation and the range of scientific input by offering the task at TRECVid 2015, where it was subsequently accepted by the TRECVid chairs [9].

Although we took the task to another venue, where most of the evaluation is usually done by NIST experts, we adhered to the crowdsourcing anchor creation and evaluation procedures that were established within our MediaEval activities. This approach preserved our flexibility in terms of the creativity of the task definition, and we kept our commitment to have users involved at all stages of benchmarking.

4. ITERATIVE TASK EVOLUTION

Traditionally, well established tasks with a straightforward scenario follow a pattern of gradually growing their dataset with each yearly iteration, using the same evaluation metric or set of metrics, and sometimes running the same software on the revised dataset, in order to be able to carry out a direct comparison of technology performance over the years. In the case of a more exploratory and innovative task that is being developed through collaboration and feedback with participants, the same broad user scenario can be tested under different conditions, e.g. diverse target users of the potentially developed approaches, different datasets, and evolving evaluation metrics that cover aspects of the task that could not have been foreseen beforehand.

When the task is defined by a clear use case scenario existing within an industrial setup, the task can be promoted by these industrial partners via data provision and help with the on-site evaluation. As our research focus is on large video archives that are not always created and gathered with a clear monetisation strategy in mind, often aiming at cultural heritage preservation (without predefined usage scenarios), we had more freedom in defining the framework. The feedback from the crowdworkers helped us to test algorithms addressing the task in a fast, iterative way.

Another aspect that has to be taken into account when setting up a task with a large data collection in mind is the copyright question. When the task is in its initial early development stage, it is easier to use a Creative Commons dataset to test the task feasibility. This proof of concept of the task's viability allows the organisers to demonstrate the soundness of the overall framework, and thus to engage potential industry partners. This was the case for the S&H task, which started with the BlipTV collection [10] and then switched to a BBC dataset [3, 2]. However, the usage of professionally created and copyrighted material also makes the task more dependent on the external partners and liable to any changes in the legal status of the data. Overall, the opportunity to run the task with different datasets enriches the discussion of the scientific approaches.

On the other hand, crowdsourcing of the task definition and results evaluation keeps the focus of the task on the user, and allows us to relate the scientific methods under test to current users' technology expectations. This brings a practical insight into the impact of performance improvements in algorithms on user experience. In a way, the workers become part of the organisers' team, i.e., the task, although envisaged by the scientists, is finally shaped and vetted by the real users.

5. THE VIEW FROM A NEW HOME

Having already run the task for two years at the TRECVid benchmark, we can compare our experiences and outline the differences. At both venues, at the initial stage of the yearly cycle, the tasks get feedback from the overall benchmark organisers' committee in terms of task feasibility and interest within the targeted scientific community. However, during the yearly cycle of actually running the task, within the MediaEval campaign the organisers of all the tasks are aware of task progress, raising issues and sharing their solutions via bi-weekly conference calls. This is especially helpful when tasks are sharing datasets, or when they are being run for the first time and the organisers lack experience. As organisers of a creative novel task, we found that interaction within the community of task organisers and with the actual task participants proved very useful in enabling us to react quickly to any issues arising with the task, from the data release to submissions and the evaluation release. However, TRECVid allows organisers to delegate some organisational activities to NIST, thus saving time.

Running a benchmarking task requires a lot of commitment from the organisers, and the interest and engagement of the scientific community. In our experience, the growth in interest and participation in the S&H task coincided with a number of related projects funded at the time, which also meant that the end of the funding cycle affected the number of participants, while the actual scientific findings and discussions were still on an upward path. The move to TRECVid allowed us to involve large labs and companies that often participate in this venue.

6. SUMMARY AND FUTURE OUTLOOK

We have presented the evolution of the S&H task to date. Despite operating at two benchmarking venues, future challenges remain, with the most critical issue being sustainability [5]. While research is often bound to projects of finite length, the organisation of tasks should ideally be able to continue independently of these. This is challenging in particular in terms of human and technical resources.

7. ACKNOWLEDGMENTS

This work has been partially supported by: ESF Research Networking Programme ELIAS; BpiFrance within the NexGenTV project, grant no. F1504054U; Science Foundation Ireland (SFI) as a part of the ADAPT Centre at DCU (13/RC/2106); EC FP7 project FP7-ICT 269980 (AXES).

8. REFERENCES

[1] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In Proceedings of the 22nd International World Wide Web Conference (WWW '13), Companion Volume, pages 457–460, 2013.

[2] S. Chen, M. Eskevich, G. J. F. Jones, and N. E. O'Connor. An Investigation into Feature Effectiveness for Multimedia Hyperlinking. In MultiMedia Modeling - 20th Anniversary International Conference (MMM 2014), Proceedings, Part II, pages 251–262, Dublin, Ireland, 2014.

[3] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2014. In Working Notes Proceedings of the MediaEval 2014 Workshop, Barcelona, Catalunya, Spain, 2014.

[4] M. Eskevich, G. J. F. Jones, M. Larson, and R. Ordelman. Creating a Data Collection for Evaluating Rich Speech Retrieval. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pages 1736–1743, Istanbul, Turkey, 2012.

[5] F. Hopfgartner, A. Hanbury, H. Müller, N. Kando, S. Mercer, J. Kalpathy-Cramer, M. Potthast, T. Gollub, A. Krithara, J. Lin, K. Balog, and I. Eggel. Report on the Evaluation-as-a-Service (EaaS) Expert Workshop. SIGIR Forum, 49(1):57–65, June 2015.

[6] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. F. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In Working Notes Proceedings of the MediaEval 2011 Workshop, Santa Croce in Fossabanda, Pisa, Italy, 2011.

[7] M. Larson, E. Newman, and G. J. F. Jones. Overview of VideoCLEF 2009: New Perspectives on Speech-based Multimedia Content Enrichment. In Proceedings of the 10th International Conference on Cross-language Evaluation Forum: Multimedia Experiments, CLEF'09, pages 354–368, Berlin, Heidelberg, 2010. Springer-Verlag.

[8] R. J. F. Ordelman, M. Eskevich, R. Aly, B. Huet, and G. J. F. Jones. Defining and Evaluating Video Hyperlinking for Navigating Multimedia Archives. In A. Gangemi, S. Leonardi, and A. Panconesi, editors, Proceedings of the 24th International Conference on World Wide Web Companion (WWW 2015), Companion Volume, pages 727–732, Florence, Italy, 2015. ACM.

[9] P. Over, J. Fiscus, D. Joy, M. Michel, G. Awad, W. Kraaij, A. F. Smeaton, G. Quénot, and R. Ordelman. TRECVID 2015 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2015. NIST, USA, 2015.

[10] S. Schmiedeke, P. Xu, I. Ferrané, M. Eskevich, C. Kofler, M. A. Larson, Y. Estève, L. Lamel, G. J. F. Jones, and T. Sikora. Blip10000: a social video dataset containing SPUG content for tagging and retrieval. In Multimedia Systems Conference 2013 (MMSys '13), pages 96–101, Oslo, Norway, 2013.

[11] C. V. Thornley, A. C. Johnson, A. F. Smeaton, and H. Lee. The scholarly impact of TRECVid (2003–2009). Journal of the American Society for Information Science and Technology, 62(4):613–627, 2011.