A Framework for Crowdsourced Multimedia Processing and Querying

Alessandro Bozzon, Ilio Catallo, Eleonora Ciceri, Piero Fraternali, Davide Martinenghi, Marco Tagliasacchi
Dipartimento di Elettronica e Informazione, Via Ponzio 34/5, Milan, Italy
first.last@polimi.it

ABSTRACT
This paper introduces a conceptual and architectural framework for addressing the design, execution and verification of tasks by a crowd of performers. The proposed framework is substantiated by an ongoing application to a problem of trademark logo detection in video collections. Preliminary results show that the contribution of crowds can improve the recall of state-of-the-art traditional algorithms, with no loss in terms of precision. However, task-to-executor matching, as expected, has an important influence on task performance.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures

General Terms
Design, Experimentation, Human Factors, Measurement, Performance

Keywords
Human Computation, Crowdsourcing, Multimedia

1. INTRODUCTION
Human computation is an approach to problem solving that integrates the computation power of machines with the perceptual, rational or social contribution of humans [6]. Within human computation, crowdsearching can be defined as the application of the principles and techniques of human computation to information retrieval, so as to promote individual and social participation in search-based applications and improve the performance of information retrieval algorithms with the calibrated contribution of humans. Traditionally, crowdsearching methods in information retrieval have been exploited to address problems where humans outperform machines, most notably common sense knowledge elicitation and content tagging for multimedia search [3]. In the latter field, crowdsourced multimedia content processing exploits the fact that humans have superior capacity for understanding the content of audiovisual materials, and thus replaces the output of automatic content classification with human annotations, and feature extraction with human-made tags.

In this paper, we adopt a different approach: rather than replacing feature extraction algorithms, we aim at improving their performance with well-selected tasks assigned to human executors, thus realizing a more integrated collaboration between human judgement and algorithms.

The contribution of the paper is the illustration of the design, implementation, and preliminary evaluation of a crowdsearching application for trademark logo detection in video collections, in which the help of the crowd is sought for selecting the most appropriate images associated with a brand name, so as to improve the precision and recall of a classical image retrieval approach based on local features (e.g., SIFT) [5].

The paper is organized as follows: prior to introducing the logo detection application, in Section 2 we present a high-level conceptual framework that helps in characterizing the problems that need to be faced in crowdsearching application development. Next, Section 3 explains the design and implementation of the logo detection application, while Section 4 reports on an experimental evaluation that compares three scenarios of image-to-video matching: one completely automated, one with expert support to task execution, and one with the help of generic Facebook users.
Finally, Section 5 concludes with an illustration of the ongoing efforts for implementing and evaluating the utility of a crowdsearch platform capable of assisting the development of a broad variety of crowdsearching solutions.

2. A FRAMEWORK FOR HUMAN COMPUTATION
In any human computation approach to problem solving, and hence also in crowdsearching, a problem is mapped into a set of tasks, which are then assigned to both human and machine executors in a way that optimizes some quality criterion on the problem-solving process, like, e.g., the quality of the found solution or the time or money spent to find it. Figure 1 conceptualizes the steps that lead from the formulation of a complex problem-solving goal to its execution and verification with the help of a crowd of executors.

Figure 1: The CrowdSearch Framework

The entry point is the specification of a problem solving process, defined as a workflow of tasks that leads to a desired goal. Such a notion is purposely broad and embraces both general-purpose cooperative and distributed processes, as found, e.g., in Business Process Management (BPM), and more focused problem instances, like crowd-supported multimedia feature extraction. The common trait is that multiple tasks must be executed respecting precedence constraints and input-output dependencies, and that some of the tasks are executed by machines and some by an open-ended community of workers (the latter are named Crowd Tasks in Figure 1). Unlike in classic BPM, the community of workers is not known a priori, and the task assignment rules are dynamic, based on the task specification and on the characteristics of the potential members of the crowd.
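To make the workflow notion concrete, the sketch below (ours, not part of the framework's formal specification) encodes a problem solving process as a list of tasks with executor types and precedence constraints; the task names anticipate the logo detection process of Section 3.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    executor: str           # "machine" or "crowd"
    depends_on: tuple = ()  # precedence / input-output constraints

# The logo detection process of Section 3, encoded in this toy model:
process = [
    Task("RetrieveLogoImages", "machine"),
    Task("ValidateLogoImages", "crowd", ("RetrieveLogoImages",)),
    Task("MatchLogoImagesInVideos", "machine", ("ValidateLogoImages",)),
    Task("ValidateLowConfidenceResults", "crowd", ("MatchLogoImagesInVideos",)),
    Task("JoinResultsAndEmitReport", "machine", ("ValidateLowConfidenceResults",)),
]

def ready_tasks(completed):
    """Tasks not yet executed whose precedence constraints are all satisfied."""
    return [t for t in process
            if t.name not in completed and all(d in completed for d in t.depends_on)]
```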
A Crowd Task is the subject of Human Task Design, a step that has the objective of defining the modalities for crowdsourced task execution. Human Task Design produces the actual design of the Task Execution GUI, and the specification of Task Deployment Criteria. These criteria can be logically subdivided into two subject areas: Content Affinity Criteria (what topic the task is about) and Execution Criteria (how the task should be executed). The Content Affinity Criteria can be regarded as a query on a representation of the users' capacities: in the simplest case, this can be just a descriptor denoting a desired topical affinity of the user (e.g., image geo-positioning); in more complex cases it can be a semi-structured data query (e.g., a set of attribute-value pairs, like action=translation from=English to=Italian), or a query in a logical language.

The Execution Criteria could specify constraints or desired characteristics of task execution, including: a time budget for completing the work, a monetary budget for incentivizing or paying workers, bounds on the number of executors or on the number of outputs (e.g., a level of redundancy for output verification), and desired demographic properties (e.g., workers' distribution with respect to geographical position or skill level).

Symmetrically to the problem solving process, the crowd of potential performers and their capacities also have to be abstracted. A natural Crowd Abstraction is a bi-partite graph (as shown in Figure 1), where nodes denote either performers or content elements, and edges may connect performers (to denote, e.g., friendship), content elements (to denote semantic relationships), and performers to content elements (to denote, e.g., interest). The bi-partite graph representation can be refined by attaching semantics to both nodes and edges: users can be associated with profile data; content elements can be summarized or classified in topics; performer edges can express explicit friendship or weak ties due to interactions between users; content element edges can express ontological knowledge, like classification, part-of, etc.; user-to-content edges can represent a specific capability (e.g., the ability to review, produce, or judge content elements).

The subsequent Task Deployment step comprises the selection of the candidate performers, by People to Task Matching, and then the Task Assignment to a set of actual performers. The People to Task Matching can be abstracted as a query matching problem, in which the Task Deployment Criteria are used to extract from the crowd abstraction graph a list of potential candidate workers, ranked according to their expected suitability as task executors. In the most general case, the measure of suitability is composed of a part that embodies the Content Affinity Criteria of the candidates to the task (e.g., user-to-task topical similarity) and a part that measures the appropriateness of a candidate, or of a set of candidates, to satisfy the Task Execution Criteria. Evaluating the People to Task Matching under the Execution Criteria can be a complex achievement that requires addressing different aspects, such as the match between task difficulty and skill level, the role and influence of users in the network (which determines their ability to spread the task), and so on. Then, the topical and execution suitability measures should be combined to obtain a globally good set of candidates. This can be regarded as the aggregation of a content-based and of an execution-based ranked list of potential candidates, of which the top-k ones are selected for the actual task assignment.

As an example, task-to-performer assignment can be formulated as a matching problem in a vector space, by representing both the Task Deployment Criteria and the candidate Performers as feature vectors in a space of appropriate dimensionality and then computing the match score using vector similarity measures. The problem could be further modularized by decomposing the task description vector into two components: one denoting the Content Affinity Criteria and one denoting the Execution Criteria. The result sets for these two queries could then be merged with a rank aggregation approach, as in the sketch below.
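As a minimal sketch of this vector-space formulation, assuming purely illustrative two-dimensional profiles ("topics" for content affinity, "exec" for execution features such as skill and availability), the code below ranks candidates by cosine similarity separately on the two components and merges the two rankings with a Borda-style rank aggregation, one of several possible aggregation choices.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def ranked_by(candidates, task_vector, key):
    # Rank candidates by similarity of one profile component to the task vector.
    return sorted(candidates, key=lambda c: cosine(c[key], task_vector), reverse=True)

def borda_merge(rankings):
    # Each position in each ranking contributes (list length - position) points.
    scores = {}
    for ranking in rankings:
        for pos, cand in enumerate(ranking):
            scores[cand["id"]] = scores.get(cand["id"], 0) + len(ranking) - pos
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical candidate profiles and task description vectors:
candidates = [
    {"id": "u1", "topics": [0.9, 0.1], "exec": [0.2, 0.8]},
    {"id": "u2", "topics": [0.4, 0.6], "exec": [0.9, 0.5]},
    {"id": "u3", "topics": [0.7, 0.3], "exec": [0.6, 0.6]},
]
task = {"topics": [1.0, 0.0], "exec": [0.7, 0.7]}

content_ranking = ranked_by(candidates, task["topics"], "topics")
execution_ranking = ranked_by(candidates, task["exec"], "exec")
top_k = borda_merge([content_ranking, execution_ranking])[:2]  # workers to assign
```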
The Task Execution step represents the actual execution of the task by the selected performers, which results in multiple outputs; these are then aggregated to form the final outcome of the Crowd Task. As usual in crowdsourcing, redundancy can be exploited to cope with the uncertain quality of the workers' performance. In this case, multiple outputs for the same task must be merged to obtain a final task output with a high level of confidence. As a result of evaluating the task output, feedback can be generated on the skill level of performers (by the Executor Evaluation phase).

3. THE LOGO DETECTION APPLICATION
In Section 2 we presented a framework for crowdsourced multimedia processing and querying, which requires the crowd to execute tasks during a Problem Solving Process. In this section we illustrate an example of such a process, for which we have designed, deployed and evaluated one Crowd Task: the Logo Detection application.

3.1 Specifications
As a proof of concept, we have designed a problem solving process for trademark logo detection in video collections. The goal is to receive from a user a query consisting of a brand name and to produce a report that identifies all the occurrences of logos of that brand in a given set of video files. The logo detection problem is a well-known challenge in image similarity search, where local features, e.g., SIFT, are normally employed to detect the occurrences of generic objects based on their scale-invariant properties [5]. However, image similarity based on SIFT is largely affected by the quality of the input set of images used to perform the matches, which makes it an interesting case for introducing the contribution of humans.

Therefore, we have designed a crowdsearching application that consists of a sequence of both automated tasks and crowd tasks, illustrated in Figure 2, with the goal of increasing precision and recall with respect to fully automated solutions. Specifically, the crowd contribution is exploited on two levels: for retrieving good images that well represent the brand name to be searched, and for validating matches of the logos in the video collection for which the content-based image retrieval based on SIFT descriptors has reported a low matching score.

Figure 2: The Process of the Logo Detection Application

The process receives as input a textual keyword indicating the name of the brand. The first task (Retrieve Logo Images) performs text-based image retrieval to associate a set of representative logo images to the brand name string; this task can be executed by an automated component (in our implementation, we have used the Google Images APIs, http://images.google.com/). However, as we will see, the output of Google is far from perfect, as the search engine returns images based on the textual content of the page that contains them, which determines the presence of many false positives and low-quality images in the automatically constructed result set.

The next task (Validate Logo Images) is a crowd task: it employs human computing to assist the validation of the images retrieved by Google Images, so as to enhance the performance of the content-based image retrieval.

Then, an automated task (Match Logo Images in Videos) looks for occurrences of the different versions of the logos in a collection of video clips, using a content-based image retrieval component (i.e., the OpenCV library [2] implementing the SIFT algorithm [5]). The output is a list of triples <videoID, frameID, matchingScore>, where matchingScore is a number between 0 and 1 that expresses how well the searched logo has been matched within the frame (identified by frameID) in the video identified by videoID.
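A minimal sketch of such a matching component is shown below, using the SIFT interface of recent OpenCV builds and Lowe's ratio test [5]. The frame sampling rate and the normalization of the matching score to [0, 1] (here, the fraction of logo keypoints surviving the ratio test) are assumptions of this sketch, not necessarily the exact scoring used in our implementation.

```python
import cv2

def match_logo_in_video(logo_path, video_path, video_id, step=25):
    """Emit (videoID, frameID, matchingScore) triples for sampled frames."""
    sift = cv2.SIFT_create()
    logo = cv2.imread(logo_path, cv2.IMREAD_GRAYSCALE)
    kp_logo, des_logo = sift.detectAndCompute(logo, None)  # assumes a feature-rich logo
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    cap = cv2.VideoCapture(video_path)
    triples, frame_id = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_id % step == 0:  # sample one frame every `step` frames
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            kp_f, des_f = sift.detectAndCompute(gray, None)
            score = 0.0
            if des_f is not None and len(des_f) >= 2:
                good = [m for m, n in matcher.knnMatch(des_logo, des_f, k=2)
                        if m.distance < 0.75 * n.distance]  # Lowe's ratio test
                score = len(good) / max(len(des_logo), 1)   # assumed normalization
            triples.append((video_id, frame_id, min(score, 1.0)))
        frame_id += 1
    cap.release()
    return triples
```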
The process continues with a second crowd task (Validate Low Confidence Results), which dispatches to the crowd those matches that have a matching score lower than a given threshold. Finally, the result report is constructed (Join Results and Emit Report task) by joining the high-score matches found by the algorithms with the low-score matches manually validated by the users.

3.2 An Architecture for Crowd Task Execution
In order to enable the deployment of applications that comprise crowd tasks, such as the one described in Section 3.1, we are building the technical architecture for performer and task management illustrated in Figure 3. At present, we have implemented and deployed the crowd task Validate Logo Images.

The architecture of Figure 3 supports the creation of a task and of the associated GUI, its assignment to a pool of performers through a crowd task execution platform, and the collection of the task output. We envision three major task execution modalities: structured crowdsourcing platforms (e.g., Microtask.com and Mechanical Turk), open social networks (e.g., Facebook and G+), and email invitations to perform the task using a custom Web application. At present, the implementation uses Facebook as a crowd task execution platform.

Figure 3: Architecture for Crowd Task management.

The Task GUI and Criteria Generation is an online application that generates a task GUI and some simple deployment criteria from a Task Template, i.e., an abstract description of a piece of work, characterized by the following dimensions: task type (open/closed question, custom Web application), task description, deployment criteria (none, by location, by skill, by similar work history), and task alternates (i.e., variants of the same task specification that can be associated with specific deployment criteria, e.g., with different skill levels). In the logo detection application, we have designed a task template with two variants. In the base variant, for novice users, the task GUI presents a brand name and a set of images taken from Google Images (see Figure 5), with checkboxes for choosing the images that best match the brand name. In the second variant, aimed at people who have done at least one instance of the basic variant of the task, the challenge is to input the URLs of new (and possibly better) images that match the brand name (see Figure 6).
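The sketch below illustrates how such a task template could be encoded; the schema and field names are hypothetical, chosen to mirror the dimensions listed above rather than to reproduce the platform's actual format.

```python
# Hypothetical encoding of the logo validation task template with its two variants.
logo_task_template = {
    "task_type": "closed_question",      # open/closed question, custom Web application
    "description": "Validate candidate logo images for a brand name",
    "deployment_criteria": "none",       # none | location | skill | work_history
    "alternates": [
        {   # base variant: novices pick the best-matching images via checkboxes
            "name": "logo_selection",
            "task_type": "closed_question",
            "required_history": 0,
            "input": ["brand_name", "candidate_images"],
            "output": "selected_image_ids",
        },
        {   # second variant: experienced workers contribute URLs of new images
            "name": "logo_insertion",
            "task_type": "open_question",
            "required_history": 1,       # at least one completed base task
            "input": ["brand_name"],
            "output": "new_image_urls",
        },
    ],
}
```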
The Task Deployment function consists of:
• Task to People Matching lets the task owner connect to a number of community platforms and collect candidate performers in worker pools. Worker pools can also be edited manually, like a contact list; to ease selection, candidates in a pool can be ranked w.r.t. the deployment criteria of a task, based on the available candidate profile data and work history information. The present implementation of this functionality is a native Facebook application that enables the task owner to assign candidate performers to the worker pool by picking them from his list of friends.
• Task Assignment allows the task owner to create an instance of a task and dispatch it, with the associated GUI, to the Task Execution platform. The task submission occurs differently based on the target platform: it may be a post on the user's space in a social network, a personal invitation email message, or the publication of the task in a crowdsourcing platform. The present implementation is the same Facebook application used for Task to People Matching, which supports the dispatch of a task instance (i.e., a brand name for which images must be validated) in the form of a post on the performer's wall (as shown in Figure 4).

The Crowd Task Execution step is implemented as a native Facebook application that the performer has to install in order to perform the task and, if he wishes, to re-dispatch it to friends.

The Task Output Collection and Aggregation collects the output of the task instances that have been dispatched by the task owner, producing a unified view on the retrieved values. In the logo detection application, for each brand name it returns 1) the new images suggested by the performers, and 2) for each image from Google Images, the number of accorded preferences.

Figure 4: UI of the Facebook task invitation message.

3.3 Status of the implementation
The logo application is currently based on the CrowdSearcher task distribution and execution platform [1]. The system has been configured to 1) create task templates by connecting with the logo detection application to retrieve the set of logos being evaluated, 2) send queries to the Facebook platform, 3) select the set of workers to be involved in the task, and 4) gather the results. Being deployed on a social network platform, CrowdSearcher acts in the context provided by a given Facebook user, who is instrumental to the crowdsourcing process, being responsible for initiating the tasks that are spawned to the crowd and for offering friends and colleagues as workers.

The Facebook application embeds a platform-specific client which communicates with the CrowdSearcher server. The client serves a twofold purpose. On one hand, it allows workers (Facebook users) to perform deployed tasks, as depicted in Figures 5 and 6. On the other hand, the application exploits the native Facebook Graph API to enable a user-defined worker selection, where new workers are explicitly invited by their friends; Figure 4 depicts an example of a task invitation performed on the Facebook wall of a targeted user. The choice of allowing a manual worker selection is supported by the findings in [1], where it is shown how a higher participation can be achieved when workers are manually picked by the user.

Figure 5: UI of the task execution interface: logo selection.

Figure 6: UI of the task execution interface: logo insertion.

Tasks are assumed to have a timeout for completion, specified in the logo identification application, that defines how long the system should wait for human execution. When the timeout triggers, the system automatically aggregates the task results (respectively, the number of preferences for each logo image, and the URLs of the newly provided logo images), feeding the validated logos archive, which is next used for the matching of logo images in the video collection.
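The aggregation step at timeout might look like the following sketch, under the assumption (ours) that each collected answer carries either a set of preferred image IDs or a newly contributed image URL; the acceptance threshold is an illustrative parameter.

```python
from collections import Counter

def aggregate_task_results(answers, min_preferences=2):
    """Merge redundant crowd answers into the validated logos archive.

    `answers` is assumed to be a list of dicts such as
    {"selected": ["img1", "img3"]} or {"new_url": "http://..."}.
    """
    preferences, new_urls = Counter(), []
    for answer in answers:
        preferences.update(answer.get("selected", []))
        if answer.get("new_url"):
            new_urls.append(answer["new_url"])
    # Keep Google Images results that gathered enough preferences,
    # plus every worker-provided image, as the validated logo set.
    validated = [img for img, n in preferences.items() if n >= min_preferences]
    return validated + new_urls
```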
4. EXPERIMENTAL EVALUATION
In this section we present the preliminary experiments that we conducted on a public video collection [4] containing 120 grocery product logos, to compare automated and crowd-supported logo detection. Experiments involving humans have been performed on three medium-sized logos (Aleve, Chunky and Claritin). We tested the performance of the process in Figure 2 using the standard measures of precision and recall on the output of the Match Logo Images in Videos task.

The CrowdSearcher system (available at https://apps.facebook.com/crowdsearch/) has been adopted to support the Validate Logo Images crowd task, by allowing users to 1) select existing logos (Figure 5), to improve the precision of the application by providing the matching component with correct logo instances, and 2) add new logos (Figure 6), with the purpose of increasing the overall recall by adding novel logo examples.

Around 40 people were involved as workers for the selection or provision of logo images, mostly students in our classes, or students' friends who volunteered to be part of the experiment. Some 50 task instances were generated in a time-span of three days, equally distributed on the set of considered logos, resulting in 70 collected answers, 58% of which related to logo image selection tasks.

We tested performance under three experimental settings. (1) No human intervention in the logo validation task: here, the top-4 Google Images result set is used as a baseline for the logo search in the video collection; the result set may contain some irrelevant images, since they did not undergo validation. (2) Logo validation performed by a crowd of domain experts (simulation): the top-32 Google Images results are filtered by experts, thereby deleting the non-relevant logos and choosing three images among the relevant ones. (3) Inclusion of the actual crowd knowledge: filtering and expansion of the set of matched logos is done via the CrowdSearcher application.

The results are shown in Table 1. For each logo, precision and recall are evaluated for the three versions of the application.

Brand name   Test       Precision   Recall
Aleve        No Crowd     0.27       0.27
Aleve        Experts      0.42       0.54
Aleve        Crowd        0.33       0.41
Chunky       No Crowd     0.65       0.19
Chunky       Experts      0.70       0.58
Chunky       Crowd        0.40       0.21
Claritin     No Crowd     0.31       0.09
Claritin     Experts      0.57       0.72
Claritin     Crowd        0.36       0.73

Table 1: Precision and recall of the logo detection application in the three considered experimental settings.
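For reference, the precision and recall values of Table 1 are the standard set-based measures over detected logo occurrences; a minimal sketch follows, assuming frame-level ground truth and detections already thresholded on the matching score.

```python
def precision_recall(detections, ground_truth):
    """Both arguments are sets of (videoID, frameID) pairs:
    frames where the logo is reported / frames where it actually appears."""
    true_positives = len(detections & ground_truth)
    precision = true_positives / len(detections) if detections else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```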
Expert evaluation clearly increases the application performance in both precision and recall, with respect to a fully-automated solution. This is due to the fact that the validation process performed on the set of logo instances eliminates irrelevant logos from the query set, and consequently reduces the number of false positives in the result set. On the other hand, the validation conducted by the crowd generally showed a slight increase in both precision and recall. However, the performance increase is not evenly distributed over all logo brands: we believe that this is due to a different user behavior in the choice of the relevant image set. In particular, when validating the Chunky brand logos, the crowd chose an irrelevant logo within the top-2 images, as shown in Figure 7. Consequently, the performance has been affected, with a heavy decrease both in terms of precision and recall w.r.t. the expert evaluation. This result leads to a consideration about the context of human-enacted executions: the chances of getting good responses depend on the appropriateness of the users' community w.r.t. the task at hand. Both the geographical location and the expertise of the involved users can heavily influence the outcome of human-enacted activities, thus calling for a fine-grained task-to-people matching phase.

Figure 7: An example of a) correct and b) wrong logo for the Chunky brand

5. DISCUSSION
We have presented a framework and an architecture for handling task design, assignment and execution in crowd-empowered settings. We have then described a trademark logo detection application that can be easily accommodated in the presented framework and whose execution can largely benefit from the presence of a crowd of users. Our initial experiments have shown that human-enriched tasks, such as logo validation and insertion, contribute to a non-negligible improvement of both recall and precision in the obtained result set. Yet, such an improvement is unevenly distributed over the different queries we tried, mostly because some users did not have an adequate background to answer the questions that were sent to them. This suggests that users should be more carefully selected during task assignment. Future directions of research therefore include studying how to associate the most suitable request with the most appropriate user, so as to implement a ranking function on worker pools whereby the task owner is aided in the dispatch of the task to the top-k best candidates.

Along the same lines, we are also studying a crowdsearch scenario in which engineering the most suitable task/request plays a crucial role. Here, the end user wants to reconstruct the correct temporal sequence of user-generated videos regarding particular events (e.g., breaking news). At peak moments, there may be a proliferation of such videos by means of reposting and re-editing of their original content, which makes the problem non-trivial. Indeed, the actual creation date of a video clip (as indicated by tags in the file) may be uncertain and thus unreliable, thereby producing ambiguities in the temporal ordering of the clips, much in the same way in which, in top-k queries, uncertain scores determine multiple possible rankings [7]. While finding temporal dependencies between two clips may in some cases be done automatically by detecting near-duplicates of a video, fully reconstructing the temporal sequence will require human intervention. Humans will assist the process by i) resolving conflicts between the creation dates available in the tags and the temporal dependencies inferred by near-duplicate detection, and ii) refining the time interval associated with a video clip's creation date. In our research agenda, emphasis will be placed on the following two aspects. (1) Tasks will be engineered in such a way that their resolution maximizes the expected decrease of the amount of uncertainty associated with the temporal ordering of the clips. (2) We shall give priority to reducing the uncertainty of videos close to a certain date of interest (typically, the date of the event at hand), thereby focusing on the ordering of the "top-k" such videos.

Acknowledgments
This work is partially supported by the FP7 Cubrik Integrating Project (http://www.cubrikproject.eu/) and by the Capacities Research for SMEs project BPM4People (http://www.bpm4people.org/) of the Research Executive Agency of the European Commission.
6. REFERENCES
[1] A. Bozzon, M. Brambilla, and S. Ceri. Answering search queries with CrowdSearcher. To appear in Proceedings of the 21st International Conference on World Wide Web, WWW 2012, Lyon, France, April 16-20, 2012.
[2] G. Bradski. The OpenCV Library. Dr. Dobb's Journal of Software Tools, 2000.
[3] X. Hu, M. Stalnacke, T. B. Minde, R. Carlsson, and S. Larsson. A mobile game to collect and improve position of images. In Proc. Third International Conference on Next Generation Mobile Applications, Services and Technologies, 2009.
[4] UCSD Computer Vision Laboratory. GroZi-120 database.
[5] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
[6] A. J. Quinn and B. B. Bederson. Human computation: a survey and taxonomy of a growing field. In Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems, CHI '11, pages 1403–1412, 2011.
[7] M. A. Soliman, I. F. Ilyas, D. Martinenghi, and M. Tagliasacchi. Ranking with uncertain scoring functions: semantics and sensitivity measures. In SIGMOD Conference, pages 805–816, 2011.