A Framework for Crowdsourced Multimedia Processing and Querying

Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi
Dipartimento di Elettronica e Informazione, Via Ponzio 34/5, Milan, Italy
first.last@polimi.it

ABSTRACT
This paper introduces a conceptual and architectural framework for addressing the design, execution and verification of tasks by a crowd of performers. The proposed framework is substantiated by an ongoing application to a problem of trademark logo detection in video collections. Preliminary results show that the contribution of crowds can improve the recall of state-of-the-art traditional algorithms, with no loss in terms of precision. However, task-to-executor matching, as expected, has an important influence on task performance.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms
Design, Experimentation, Human Factors, Measurement, Performance

Keywords
Human Computation, Crowdsourcing, Multimedia

1. INTRODUCTION
Human computation is an approach to problem solving that integrates the computation power of machines with the perceptual, rational or social contribution of humans [6]. Within human computation, crowdsearching can be defined as the application of the principles and techniques of human computation to information retrieval, so as to promote individual and social participation in search-based applications and improve the performance of information retrieval algorithms with the calibrated contribution of humans. Traditionally, crowdsearching methods in information retrieval have been exploited to address problems where humans outperform machines, most notably common sense knowledge elicitation and content tagging for multimedia search [3]. In the latter field, crowdsourced multimedia content processing exploits the fact that humans have superior capacity for understanding the content of audiovisual materials, and thus replaces the output of automatic content classification with human annotations, and feature extraction with human-made tags.

In this paper, we adopt a different approach: rather than replacing feature extraction algorithms, we aim at improving their performance with well-selected tasks assigned to human executors, thus realizing a more integrated collaboration between human judgement and algorithms.

The contribution of the paper is the illustration of the design, implementation, and preliminary evaluation of a crowdsearching application for trademark logo detection in video collections, in which the help of the crowd is sought for selecting the most appropriate images associated with a brand name, so as to improve the precision and recall of a classical image retrieval approach based on local features (e.g., SIFT) [5].

The paper is organized as follows: prior to introducing the logo detection application, in Section 2 we present a high-level conceptual framework that helps in characterizing the problems that need to be faced in crowdsearching application development. Next, Section 3 explains the design and implementation of the logo detection application, while Section 4 reports on an experimental evaluation that compares three scenarios of image-to-video matching: one completely automated, one with expert support to task execution, and one with the help of generic Facebook users.
Finally, Section 5 concludes with an illustration of the ongoing efforts for implementing and evaluating the utility of a crowdsearch platform capable of assisting the development of a broad variety of crowdsearching solutions.

2. A FRAMEWORK FOR HUMAN COMPUTATION
In any human computation approach to problem solving, and hence also in crowdsearching, a problem is mapped into a set of tasks, which are then assigned to both human and machine executors in a way that optimizes some quality criterion on the problem-solving process, like, e.g., the quality of the found solution or the time or money spent to find it. Figure 1 conceptualizes the steps that lead from the formulation of a complex problem-solving goal to its execution and verification with the help of a crowd of executors.

Figure 1: The CrowdSearch Framework

The entry point is the specification of a problem solving process, defined as a workflow of tasks that leads to a desired goal. Such a notion is purposely broad and embraces both general-purpose cooperative and distributed processes, as found, e.g., in Business Process Management (BPM), and more focused problem instances, like crowd-supported multimedia feature extraction. The common trait is that multiple tasks must be executed respecting precedence constraints and input-output dependencies, and that some of the tasks are executed by machines and some by an open-ended community of workers (the latter are named Crowd Tasks in Figure 1). Unlike in classic BPM, the community of workers is not known a priori, and the task assignment rules are dynamic, based on the task specification and on the characteristics of the potential members of the crowd.
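To make the workflow notion concrete, the sketch below (ours, not part of the framework's formal specification) encodes a problem solving process as a list of tasks with executor types and precedence constraints; the task names anticipate the logo detection process of Section 3.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    executor: str           # "machine" or "crowd"
    depends_on: tuple = ()  # precedence / input-output constraints

# The logo detection process of Section 3, encoded in this toy model:
process = [
    Task("RetrieveLogoImages", "machine"),
    Task("ValidateLogoImages", "crowd", ("RetrieveLogoImages",)),
    Task("MatchLogoImagesInVideos", "machine", ("ValidateLogoImages",)),
    Task("ValidateLowConfidenceResults", "crowd", ("MatchLogoImagesInVideos",)),
    Task("JoinResultsAndEmitReport", "machine", ("ValidateLowConfidenceResults",)),
]

def ready_tasks(completed):
    """Tasks not yet executed whose precedence constraints are all satisfied."""
    return [t for t in process
            if t.name not in completed and all(d in completed for d in t.depends_on)]
```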
A Crowd Task is the subject of Human Task Design, a step that has the objective of defining the modalities for crowdsourced task execution. Human Task Design produces the actual design of the Task Execution GUI, and the specification of Task Deployment Criteria. These criteria can be logically subdivided into two subject areas: Content Affinity Criteria (what topic the task is about) and Execution Criteria (how the task should be executed). The Content Affinity Criteria can be regarded as a query on a representation of the users' capacities: in the simplest case, this can be just a descriptor denoting a desired topical affinity of the user (e.g., image geo-positioning); in more complex cases it can be a semi-structured data query (e.g., a set of attribute-value pairs, like action=translation from=English to=Italian), or a query in a logical language.

The Execution Criteria could specify constraints or desired characteristics of task execution, including: a time budget for completing the work, a monetary budget for incentivizing or paying workers, bounds on the number of executors or on the number of outputs (e.g., a level of redundancy for output verification), and desired demographic properties (e.g., workers' distribution with respect to geographical position or skill level).

Symmetrically to the problem solving process, the crowd of potential performers and their capacities also have to be abstracted. A natural Crowd Abstraction is a bi-partite graph (as shown in Figure 1), where nodes denote either performers or content elements, and edges may connect performers (to denote, e.g., friendship), content elements (to denote semantic relationships), and performers to content elements (to denote, e.g., interest). The bi-partite graph representation can be refined by attaching semantics to both nodes and edges: users can be associated with profile data; content elements can be summarized or classified in topics; performer edges can express explicit friendship or weak ties due to interactions between users; content element edges can express ontological knowledge, like classification, part-of, etc.; user-to-content edges can represent a specific capability (e.g., the ability to review, produce, or judge content elements).

The subsequent Task Deployment step comprises the selection of the candidate performers, by People to Task Matching, and then the Task Assignment to a set of actual performers. The People to Task Matching can be abstracted as a query matching problem, in which the Task Deployment Criteria are used to extract from the crowd abstraction graph a list of potential candidate workers, ranked according to their expected suitability as task executors. In the most general case, the measure of suitability is composed of a part that embodies the Content Affinity Criteria of the candidates to the task (e.g., user-to-task topical similarity) and a part that measures the appropriateness of a candidate, or of a set of candidates, to satisfy the Task Execution Criteria. Evaluating the People to Task Matching under the Execution Criteria can be a complex achievement that requires addressing different aspects, such as the match between task difficulty and skill level, the role and influence of users in the network (which determines their ability to spread the task), and so on. Then, the topical and execution suitability measures should be combined to obtain a globally good set of candidates. This can be regarded as the aggregation of a content-based and of an execution-based ranked list of potential candidates, of which the top-k ones are selected for the actual task assignment.

As an example, task-to-performer assignment can be formulated as a matching problem in a vector space, by representing both the Task Deployment Criteria and the candidate Performers as feature vectors in a space of appropriate dimensionality and then computing the match score using vector similarity measures. The problem could be further modularized by decomposing the task description vector into two components: one denoting the Content Affinity Criteria and one denoting the Execution Criteria. The result sets for these two queries could then be merged with a rank aggregation approach, as in the sketch below.
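As a minimal sketch of this vector-space formulation, assuming purely illustrative two-dimensional profiles ("topics" for content affinity, "exec" for execution features such as skill and availability), the code below ranks candidates by cosine similarity separately on the two components and merges the two rankings with a Borda-style rank aggregation, one of several possible aggregation choices.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranked_by(candidates, task_vector, key):
    # Rank candidates by similarity of one profile component to the task vector.
    return sorted(candidates, key=lambda c: cosine(c[key], task_vector), reverse=True)

def borda_merge(rankings):
    # Each position in each ranking contributes (list length - position) points.
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            scores[cand["id"]] = scores.get(cand["id"], 0) + len(ranking) - pos
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidate profiles and task description vectors:
candidates = [
    {"id": "u1", "topics": [0.9, 0.1], "exec": [0.2, 0.8]},
    {"id": "u2", "topics": [0.4, 0.6], "exec": [0.9, 0.5]},
    {"id": "u3", "topics": [0.7, 0.3], "exec": [0.6, 0.6]},
]
task = {"topics": [1.0, 0.0], "exec": [0.7, 0.7]}

content_ranking = ranked_by(candidates, task["topics"], "topics")
execution_ranking = ranked_by(candidates, task["exec"], "exec")
top_k = borda_merge([content_ranking, execution_ranking])[:2]  # workers to assign
```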
The Task Execution step represents the actual execution of the task by the selected performers, which results in multiple outputs; these are then aggregated to form the final outcome of the Crowd Task. As usual in crowdsourcing, redundancy can be exploited to cope with the uncertain quality of the workers' performance. In this case, multiple outputs for the same task must be merged to obtain a final task output with a high level of confidence. As a result of evaluating the task output, feedback can be generated on the skill level of performers (by the Executor Evaluation phase).

3. THE LOGO DETECTION APPLICATION
In Section 2 we presented a framework for crowdsourced multimedia processing and querying, which requires the crowd to execute tasks during a Problem Solving Process. In this section we illustrate an example of such a process, for which we have designed, deployed and evaluated one Crowd Task: the Logo Detection application.

3.1 Specifications
As a proof of concept, we have designed a problem solving process for trademark logo detection in video collections. The goal is to receive from a user a query consisting of a brand name and to produce a report that identifies all the occurrences of logos of that brand in a given set of video files. The logo detection problem is a well-known challenge in image similarity search, where local features, e.g., SIFT, are normally employed to detect the occurrences of generic objects based on their scale-invariant properties [5]. However, image similarity based on SIFT is largely affected by the quality of the input set of images used to perform the matches, which makes it an interesting case for introducing the contribution of humans.

Therefore, we have designed a crowdsearching application that consists of a sequence of both automated tasks and crowd tasks, illustrated in Figure 2, with the goal of increasing precision and recall with respect to fully automated solutions. Specifically, the crowd contribution is exploited on two levels: for retrieving good images that well represent the brand name to be searched, and for validating matches of the logos in the video collection for which the content-based image retrieval based on SIFT descriptors has reported a low matching score.

Figure 2: The Process of the Logo Detection Application

The process receives as input a textual keyword indicating the name of the brand. The first task (Retrieve Logo Images) performs text-based image retrieval to associate a set of representative logo images to the brand name string; this task can be executed by an automated component (in our implementation, we have used the Google Images APIs, http://images.google.com/). However, as we will see, the output of Google is far from perfect, as the search engine returns images based on the textual content of the page that contains them, which determines the presence of many false positives and low-quality images in the automatically constructed result set.

The next task (Validate Logo Images) is a crowd task: it employs human computing to assist the validation of the images retrieved by Google Images, so as to enhance the performance of the content-based image retrieval.

Then, an automated task (Match Logo Images in Videos) looks for occurrences of the different versions of the logos in a collection of video clips, using a content-based image retrieval component (i.e., the OpenCV library [2] implementing the SIFT algorithm [5]). The output is a list of triples <videoID, frameID, matchingScore>, where matchingScore is a number between 0 and 1 that expresses how well the searched logo has been matched within the frame (identified by frameID) in the video identified by videoID.
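A minimal sketch of such a matching component is shown below, using the SIFT interface of recent OpenCV builds and Lowe's ratio test [5]. The frame sampling rate and the normalization of the matching score to [0, 1] (here, the fraction of logo keypoints surviving the ratio test) are assumptions of this sketch, not necessarily the exact scoring used in our implementation.

```python
import cv2

def match_logo_in_video(logo_path, video_path, video_id, step=25):
    """Emit (videoID, frameID, matchingScore) triples for sampled frames."""
    sift = cv2.SIFT_create()
    logo = cv2.imread(logo_path, cv2.IMREAD_GRAYSCALE)
    kp_logo, des_logo = sift.detectAndCompute(logo, None)  # assumes a feature-rich logo
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    cap = cv2.VideoCapture(video_path)
    triples, frame_id = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % step == 0:  # sample one frame every `step` frames
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            kp_f, des_f = sift.detectAndCompute(gray, None)
            score = 0.0
            if des_f is not None and len(des_f) >= 2:
                good = [m for m, n in matcher.knnMatch(des_logo, des_f, k=2)
                        if m.distance < 0.75 * n.distance]  # Lowe's ratio test
                score = len(good) / max(len(des_logo), 1)   # assumed normalization
            triples.append((video_id, frame_id, min(score, 1.0)))
        frame_id += 1
    cap.release()
    return triples
```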
The process continues with a second crowd task (Validate Low Confidence Results), which dispatches to the crowd those matches that have a matching score lower than a given threshold. Finally, the result report is constructed (Join Results and Emit Report task) by joining the high-score matches found by the algorithms with the low-score matches manually validated by the users.

3.2 An Architecture for Crowd Task Execution
In order to enable the deployment of applications that comprise crowd tasks, such as the one described in Section 3.1, we are building the technical architecture for performer and task management illustrated in Figure 3. At present, we have implemented and deployed the crowd task Validate Logo Images.

The architecture of Figure 3 supports the creation of a task and of the associated GUI, its assignment to a pool of performers through a crowd task execution platform, and the collection of the task output. We envision three major task execution modalities: structured crowdsourcing platforms (e.g., Microtask.com and Mechanical Turk), open social networks (e.g., Facebook and G+), and email invitations to perform the task using a custom Web application. At present, the implementation uses Facebook as a crowd task execution platform.

Figure 3: Architecture for Crowd Task management.

The Task GUI and Criteria Generation is an online application that generates a task GUI and some simple deployment criteria from a Task Template, i.e., an abstract description of a piece of work, characterized by the following dimensions: task type (open/closed question, custom Web application), task description, deployment criteria (none, by location, by skill, by similar work history), and task alternates (i.e., variants of the same task specification that can be associated with specific deployment criteria, e.g., with different skill levels). In the logo detection application, we have designed a task template with two variants. In the base variant, for novice users, the task GUI presents a brand name and a set of images taken from Google Images (see Figure 5), with checkboxes for choosing the images that best match the brand name. In the second variant, aimed at people who have done at least one instance of the basic variant of the task, the challenge is to input the URLs of new (and possibly better) images that match the brand name (see Figure 6).
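The sketch below illustrates how such a task template could be encoded; the schema and field names are hypothetical, chosen to mirror the dimensions listed above rather than to reproduce the platform's actual format.

```python
# Hypothetical encoding of the logo validation task template with its two variants.
logo_task_template = {
    "task_type": "closed_question",      # open/closed question, custom Web application
    "description": "Validate candidate logo images for a brand name",
    "deployment_criteria": "none",       # none | location | skill | work_history
    "alternates": [
        {   # base variant: novices pick the best-matching images via checkboxes
            "name": "logo_selection",
            "task_type": "closed_question",
            "required_history": 0,
            "input": ["brand_name", "candidate_images"],
            "output": "selected_image_ids",
        },
        {   # second variant: experienced workers contribute URLs of new images
            "name": "logo_insertion",
            "task_type": "open_question",
            "required_history": 1,       # at least one completed base task
            "input": ["brand_name"],
            "output": "new_image_urls",
        },
    ],
}
```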
The Task Deployment function consists of:
• Task to People Matching lets the task owner connect to a number of community platforms and collect candidate performers in worker pools. Worker pools can also be edited manually, like a contact list; to ease selection, candidates in a pool can be ranked w.r.t. the deployment criteria of a task, based on the available candidate profile data and work history information. The present implementation of this functionality is a native Facebook application that enables the task owner to assign candidate performers to the worker pool by picking them from his list of friends.
• Task Assignment allows the task owner to create an instance of a task and dispatch it, with the associated GUI, to the Task Execution platform. The task submission occurs differently based on the target platform: it may be a post on the user's space in a social network, a personal invitation email message, or the publication of the task in a crowdsourcing platform. The present implementation is the same Facebook application used for Task to People Matching, which supports the dispatch of a task instance (i.e., a brand name for which images must be validated) in the form of a post on the performer's wall (as shown in Figure 4).

The Crowd Task Execution step is implemented as a native Facebook application that the performer has to install in order to perform the task and, if he wishes, to re-dispatch it to friends.

The Task Output Collection and Aggregation collects the output of the task instances that have been dispatched by the task owner, producing a unified view on the retrieved values. In the logo detection application, for each brand name it returns 1) the new images suggested by the performers, and 2) for each image from Google Images, the number of accorded preferences.

Figure 4: UI of the Facebook task invitation message.

3.3 Status of the implementation
The logo application is currently based on the CrowdSearcher task distribution and execution platform [1]. The system has been configured to 1) create task templates by connecting with the logo detection application to retrieve the set of logos being evaluated, 2) send queries to the Facebook platform, 3) select the set of workers to be involved in the task, and 4) gather the results. Being deployed on a social network platform, CrowdSearcher acts in the context provided by a given Facebook user, who is instrumental to the crowdsourcing process, being responsible for initiating the tasks that are spawned to the crowd and for offering friends and colleagues as workers.

The Facebook application embeds a platform-specific client which communicates with the CrowdSearcher server. The client serves a twofold purpose. On one hand, it allows workers (Facebook users) to perform deployed tasks, as depicted in Figures 5 and 6. On the other hand, the application exploits the native Facebook Graph API to enable a user-defined worker selection, where new workers are explicitly invited by their friends; Figure 4 depicts an example of a task invitation performed on the Facebook wall of a targeted user. The choice of allowing a manual worker selection is supported by the findings in [1], where it is shown how a higher participation can be achieved when workers are manually picked by the user.

Figure 5: UI of the task execution interface: logo selection.

Figure 6: UI of the task execution interface: logo insertion.

Tasks are assumed to have a timeout for completion, specified in the logo identification application, that defines how long the system should wait for human execution. When the timeout triggers, the system automatically aggregates the task results (respectively, the number of preferences for each logo image, and the URLs of the newly provided logo images), feeding the validated logos archive, which is next used for the matching of logo images in the video collection.
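The aggregation step at timeout might look like the following sketch, under the assumption (ours) that each collected answer carries either a set of preferred image IDs or a newly contributed image URL; the acceptance threshold is an illustrative parameter.

```python
from collections import Counter

def aggregate_task_results(answers, min_preferences=2):
    """Merge redundant crowd answers into the validated logos archive.

    `answers` is assumed to be a list of dicts such as
    {"selected": ["img1", "img3"]} or {"new_url": "http://..."}.
    """
    preferences, new_urls = Counter(), []
    for answer in answers:
        preferences.update(answer.get("selected", []))
        if answer.get("new_url"):
            new_urls.append(answer["new_url"])
    # Keep Google Images results that gathered enough preferences,
    # plus every worker-provided image, as the validated logo set.
    validated = [img for img, n in preferences.items() if n >= min_preferences]
    return validated + new_urls
```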
4. EXPERIMENTAL EVALUATION
In this section we present the preliminary experiments that we conducted on a public video collection [4] containing 120 grocery product logos, to compare automated and crowd-supported logo detection. Experiments involving humans have been performed on three medium-sized logos (Aleve, Chunky and Claritin). We tested the performance of the process in Figure 2 using the standard measures of precision and recall on the output of the Match Logo Images in Videos task.

The CrowdSearcher system (available at https://apps.facebook.com/crowdsearch/) has been adopted to support the Validate Logo Images crowd task, by allowing users to 1) select existing logos (Figure 5), to improve the precision of the application by providing the matching component with correct logo instances, and 2) add new logos (Figure 6), with the purpose of increasing the overall recall by adding novel logo examples.

Around 40 people were involved as workers for the selection or provision of logo images, mostly students in our classes, or students' friends who volunteered to be part of the experiment. Some 50 task instances were generated in a time-span of three days, equally distributed on the set of considered logos, resulting in 70 collected answers, 58% of which related to logo image selection tasks.

We tested performance under three experimental settings. (1) No human intervention in the logo validation task: here, the top-4 Google Images result set is used as a baseline for the logo search in the video collection; the result set may contain some irrelevant images, since they did not undergo validation. (2) Logo validation performed by a crowd of domain experts (simulation): the top-32 Google Images results are filtered by experts, thereby deleting the non-relevant logos and choosing three images among the relevant ones. (3) Inclusion of the actual crowd knowledge: filtering and expansion of the set of matched logos is done via the CrowdSearcher application.

The results are shown in Table 1. For each logo, precision and recall are evaluated for the three versions of the application.

Brand name   Test       Precision   Recall
Aleve        No Crowd     0.27       0.27
Aleve        Experts      0.42       0.54
Aleve        Crowd        0.33       0.41
Chunky       No Crowd     0.65       0.19
Chunky       Experts      0.70       0.58
Chunky       Crowd        0.40       0.21
Claritin     No Crowd     0.31       0.09
Claritin     Experts      0.57       0.72
Claritin     Crowd        0.36       0.73

Table 1: Precision and recall of the logo detection application in the three considered experimental settings.
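For reference, the precision and recall values of Table 1 are the standard set-based measures over detected logo occurrences; a minimal sketch follows, assuming frame-level ground truth and detections already thresholded on the matching score.

```python
def precision_recall(detections, ground_truth):
    """Both arguments are sets of (videoID, frameID) pairs:
    frames where the logo is reported / frames where it actually appears."""
    true_positives = len(detections & ground_truth)
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```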
Expert evaluation clearly increases the application performance in both precision and recall, with respect to a fully-automated solution. This is due to the fact that the validation process performed on the set of logo instances eliminates irrelevant logos from the query set, and consequently reduces the number of false positives in the result set. On the other hand, the validation conducted by the crowd generally showed a slight increase in both precision and recall. However, the performance increase is not evenly distributed over all logo brands: we believe that this is due to a different user behavior in the choice of the relevant image set. In particular, when validating the Chunky brand logos, the crowd chose an irrelevant logo within the top-2 images, as shown in Figure 7. Consequently, the performance has been affected, with a heavy decrease both in terms of precision and recall w.r.t. the expert evaluation. This result leads to a consideration about the context of human-enacted executions: the chances of getting good responses depend on the appropriateness of the users' community w.r.t. the task at hand. Both the geographical location and the expertise of the involved users can heavily influence the outcome of human-enacted activities, thus calling for a fine-grained task-to-people matching phase.

Figure 7: An example of a) correct and b) wrong logo for the Chunky brand

5. DISCUSSION
We have presented a framework and an architecture for handling task design, assignment and execution in crowd-empowered settings. We have then described a trademark logo detection application that can be easily accommodated in the presented framework and whose execution can largely benefit from the presence of a crowd of users. Our initial experiments have shown that human-enriched tasks, such as logo validation and insertion, contribute to a non-negligible improvement of both recall and precision in the obtained result set. Yet, such an improvement is unevenly distributed over the different queries we tried, mostly because some users did not have an adequate background to answer the questions that were sent to them. This suggests that users should be more carefully selected during task assignment. Future directions of research therefore include studying how to associate the most suitable request with the most appropriate user, so as to implement a ranking function on worker pools whereby the task owner is aided in the dispatch of the task to the top-k best candidates.

Along the same lines, we are also studying a crowdsearch scenario in which engineering the most suitable task/request plays a crucial role. Here, the end user wants to reconstruct the correct temporal sequence of user-generated videos regarding particular events (e.g., breaking news). At peak moments, there may be a proliferation of such videos by means of reposting and re-editing of their original content, which makes the problem non-trivial. Indeed, the actual creation date of a video clip (as indicated by tags in the file) may be uncertain and thus unreliable, thereby producing ambiguities in the temporal ordering of the clips, much in the same way in which, in top-k queries, uncertain scores determine multiple possible rankings [7]. While finding temporal dependencies between two clips may in some cases be done automatically by detecting near-duplicates of a video, fully reconstructing the temporal sequence will require human intervention. Humans will assist the process by i) resolving conflicts between the creation dates available in the tags and the temporal dependencies inferred by near-duplicate detection, and ii) refining the time interval associated with a video clip's creation date. In our research agenda, emphasis will be placed on the following two aspects. (1) Tasks will be engineered in such a way that their resolution maximizes the expected decrease of the amount of uncertainty associated with the temporal ordering of the clips. (2) We shall give priority to reducing the uncertainty of videos close to a certain date of interest (typically, the date of the event at hand), thereby focusing on the ordering of the "top-k" such videos.

Acknowledgments
This work is partially supported by the FP7 Cubrik Integrating Project (http://www.cubrikproject.eu/) and by the Capacities Research for SMEs project BPM4People (http://www.bpm4people.org/) of the Research Executive Agency of the European Commission.
6. REFERENCES
[1] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with CrowdSearcher. To appear in Proceedings of the 21st International Conference on World Wide Web, WWW 2012, Lyon, France, April 16-20, 2012.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] X. Hu, M. Stalnacke, T. B. Minde, R. Carlsson, and S. Larsson. A mobile game to collect and improve position of images. In Proc. Third International Conference on Next Generation Mobile Applications, Services and Technologies, 2009.
[4] UCSD Computer Vision Laboratory. GroZi-120 database.
[5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[6] A. J. Quinn and B. B. Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, CHI '11, pages 1403–1412, 2011.
[7] M. A. Soliman, I. F. Ilyas, D. Martinenghi, and M. Tagliasacchi. Ranking with uncertain scoring functions: semantics and sensitivity measures. In SIGMOD Conference, pages 805–816, 2011.