Discovering User Perceptions of Semantic Similarity in Near-duplicate Multimedia Files

Raynor Vliegendhart, Martha Larson, Johan Pouwelse
R.Vliegendhart@tudelft.nl, M.A.Larson@tudelft.nl, J.A.Pouwelse@tudelft.nl
Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands

Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. CrowdSearch 2012 workshop at WWW 2012, Lyon, France.

ABSTRACT
We address the problem of discovering new notions of user-perceived similarity between near-duplicate multimedia files. We focus on file-sharing, since in this setting users have a well-developed understanding of the available content, but what constitutes a near-duplicate is nonetheless nontrivial. We elicited judgments of semantic similarity by implementing triadic elicitation as a crowdsourcing task and ran it on Amazon Mechanical Turk. We categorized the judgments and arrived at 44 different dimensions of semantic similarity perceived by users. These discovered dimensions can be used for clustering items in search result lists. The challenge in performing elicitations in this way is to ensure that workers are encouraged to answer seriously and remain engaged.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval—Search process
General Terms: Human Factors, Experimentation
Keywords: Near-duplicates, perceived similarity, triadic elicitation, Mechanical Turk

1. INTRODUCTION
Crowdsourcing platforms make it possible to elicit semantic judgments from users. Crowdsourcing can be particularly helpful in cases in which human interpretations are not immediately self-evident. In this paper, we report on a crowdsourcing experiment designed to elicit human judgments on semantic similarity between near-duplicate multimedia files. We use crowdsourcing for this application because it allows us to easily collect a large number of human similarity judgments. The major challenge we address is designing the crowdsourcing task, which we ran on Amazon Mechanical Turk, in a way that ensures that the workers from whom we elicit judgments are both serious and engaged.

Multimedia content is semantically complex. This complexity means that it is difficult to make reliable assumptions about the dimensions of semantic similarity along which multimedia items can resemble each other, i.e., be considered near duplicates. Knowledge of such dimensions is important for designing retrieval systems. We plan ultimately to use this knowledge to inform the development of algorithms that organize search result lists. In order to simplify the problem of semantic similarity, we focus on a particular area of search, namely, search within file-sharing systems. We choose file-sharing because it is a rich, real-world use scenario in which user information needs are relatively well constrained and users have a widely-shared and well-developed understanding of the characteristics of the items that they are looking for.

Our investigation is focused on dimensions of semantic similarity that go beyond what is depicted in the visual channel of the video. In this way, our work differs from other work on multimedia near duplicates that puts its main emphasis on visual content [1]. Specifically, we define a notion of near-duplicate multimedia items that is related to the reasons for which users are searching for them. By using a definition of near duplicates that is related to the function or purpose that multimedia items fulfill for users, we conjecture that we will be able to arrive at a set of semantic similarities that reflects user search goals and is in this way highly suited for use in multimedia retrieval result lists.

The paper is organized as follows. After presenting background and related work in Section 2, we describe the crowdsourcing experiment by which we elicit human judgments in Section 3. The construction of the dataset used in the experiment is given in Section 4. Direct results of the experiment and the derived similarity dimensions are discussed in Section 5. We finish with conclusions in Section 6.
2. BACKGROUND AND RELATED WORK

2.1 Near-duplicates in search results
Well-organized search results give users an easy way to get an overview of a result list. A simple, straightforward method of organization groups together similar results and represents each group with a concise surrogate, e.g., a single representative item. Users can then scan a shorter list of groups, rather than a longer list of individual result items. Hiding near-duplicate items in the interface is a specific realization of near-duplicate elimination, which has been suggested in order to make video retrieval more efficient for users [11]. Algorithms that can identify near duplicates can be used to group items in the interface. One of the challenges in designing such algorithms is being able to base them on similarity between items as it is perceived by users. Clustering items with regard to general overall similarity is a possibility. However, this approach is problematic since items are similar in many different ways at the same time [7]. Instead, our approach, and the ultimate aim of our work, is to develop near-duplicate clustering algorithms that are informed by user perceptions of dimensions of semantic similarity between items. We assume that these algorithms stand to benefit if they draw on a set of possible dimensions of semantic similarity that is as large as possible.
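As a concrete illustration of the surrogate-based grouping described above, the following minimal sketch (our own, not part of the original paper) groups result items by an arbitrary similarity key and keeps a single representative per group. The similarity_key callback stands in for whatever near-duplicate detector ultimately supplies a dimension of similarity; the example key and filenames are purely hypothetical.

```python
from collections import OrderedDict

def group_results(results, similarity_key):
    """Group near-duplicate results and keep one surrogate per group.

    `results` is a list of result items (e.g., filenames); `similarity_key`
    is a placeholder for a near-duplicate detector mapping an item to a
    group identifier.
    """
    groups = OrderedDict()
    for item in results:
        groups.setdefault(similarity_key(item), []).append(item)
    # The surrogate shown in the interface is simply the first item of each group,
    # together with the size of the group it represents.
    return [(members[0], len(members)) for members in groups.values()]

# Hypothetical example: group by a normalized filename prefix.
results = [
    "The Sorcerers Apprentice 2010 BluRay MKV x264",
    "The Sorcerers Apprentice CAM XVID-NDN",
    "Despicable Me 2010 1080p",
]
print(group_results(results, lambda name: name.lower()[:20]))
```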
Our work uses a definition of near duplicates based on the function they fulfill for the user:

Functional near-duplicate multimedia items are items that fulfill the same purpose for the user. Once the user has one of these items, there is no additional need for another.

In [11], one video is deemed to be a near duplicate of another if a user would clearly identify them as essentially the same. However, this definition is not as broad as ours, since only the visual channel is considered. Our work is related to [3], which consults users to find whether particular semantic differences make important contributions to their perceptions of near duplicates. Our work differs because we are interested in discovering new dimensions of semantic similarity rather than testing an assumed list of similarity dimensions.

2.2 Eliciting judgments of semantic similarity
We are interested in gathering human judgments on semantic interpretation, which involves the acquisition of new knowledge on human perception of similarity. Any thoughtful answer given by a human is of potential interest to us. No serious answer can be considered wrong.

The technique we use, triadic elicitation, is adopted from psychology [6], where it is used for knowledge acquisition. Given three elements, a subject is asked to specify in what important way two of them are alike but different from the third [8]. Two reasons make triadic elicitation well suited for our purposes. First, being presented with three elements, the workers have to abstract away from small differences between any two specific items, which encourages them to identify those similarities that are essential. Second, the triadic method has been found to be cognitively more complex than the dyadic method [2], supporting our goal of creating an engaging crowdsourcing task by adding a cognitive challenge.

A crowdsourcing task that involves the elicitation of semantic judgments differs from other tasks in which the correctness of answers can be verified. In this way, our task resembles the one designed in [10], which collects viewer-reported judgments. Instead of verifying answers directly, we design our task to control quality by encouraging workers to be serious and engaged. We adopt the approach of [10] of using a pilot HIT to recruit serious workers. In order to increase worker engagement, we also adopt the approach of [5], which observes that open-ended questions are more enjoyable and challenging.

3. CROWDSOURCING TASK
The goal of our crowdsourcing task is to elicit the various notions of similarity perceived by users of a file-sharing system. This task provides input for a card sort, which we carry out as a next step (Section 5.2) in order to derive a small set of semantic similarity dimensions from the large set of user-perceived similarities we collect via crowdsourcing.

The crowdsourcing task aims to achieve worker seriousness and engagement through judicious design decisions. Our task design places particular focus on ensuring task credibility. For example, the title and description of the pilot make clear the purpose of the task, i.e., research, and that workers should not expect a high volume of work to be offered. Further, we strive to ensure that workers are confident that they understand what is required of them. We explain functional similarity in practical terms, using easy-to-understand phrases such as "comparable", "like", and "for all practical purposes the same". We also give consideration to task awareness by including questions in the recruitment task designed to determine basic familiarity with file-sharing and interest level in the task.

3.1 Task description
The task consists of a question, illustrated by Figure 1, that is repeated three times, once for each of three different triads of files. For each presented triad, we ask the workers to imagine that they have downloaded all three files and to compare the files to each other on a functional level. The file information shown to the workers is taken from a real file-sharing system (see the description of the dataset in Section 4) and is displayed as in a real-world system, with filename, file size and uploader. The worker is not given the option to view the actual files, reflecting the common real file-sharing scenario in which the user does not have the resources (e.g., the time) to download and compare all items when scrolling through the search results.

Figure 1: One of the triads of files and the corresponding question as presented to the workers.

The first section of the question is used to determine whether it is possible to define a two-way contrast between the three files. We use this section to eliminate cases in which files are perceived to be all the same or all different. This follows the advice given in [9] on when not to use triadic elicitation. Specifically, we avoid forcing a contrast in cases where it does not make sense. The following triad is an example of a case in which a two-way contrast should not be forced:

Despicable Me The Game
VA-Despicable Me (Music From The Motion Picture)
Despicable Me 2010 1080p

These files all bear the same title. If workers were forced to identify a two-way contrast, we would risk eliciting differences that are not on the functional level, e.g., "the second filename starts with a V while the other two start with a D". Avoiding nonsense questions also enhances the credibility of our task.

In order to ensure that the workers follow our definition of functional similarity in their judgment, we elaborately define the use case of the three files in the all-same and all-different options. We specify that the three files are the same when someone would never need all of them. Similarly, the three files can be considered to be all different from each other if the worker can think of an opposite situation in which someone would want to download all three files. Note that emphasizing the functional perspective of similarity guides workers away from only matching strings and towards considering the similarity of the underlying multimedia items. Also, we intend the elaborate description to discourage workers from taking the easy way out, i.e., selecting one of the first two options and thereby not having to contrast files.

Workers move on to the second section only if they report that it is possible to make a two-way contrast. Here they are asked to indicate which element of the triad differs from the remaining two and to specify the difference by answering a free-text question.
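To make the two-section question flow concrete, the sketch below models one triad question: the first section asks whether a two-way contrast is possible at all, and only then does the second section record which file differs and why. The option wording and field names are our own paraphrase of the task, not the exact text shown to workers.

```python
FIRST_SECTION_OPTIONS = (
    "all three are for all practical purposes the same",
    "all three are different; someone might want all of them",
    "two are comparable and one is different",
)

def triad_judgment(triad, first_answer, odd_one_out=None, explanation=""):
    """Record a worker's answer to a single triad question.

    A two-way contrast (second section) is only recorded when the worker
    chose the third option in the first section.
    """
    assert first_answer in FIRST_SECTION_OPTIONS
    judgment = {"triad": triad, "first_section": first_answer}
    if first_answer == FIRST_SECTION_OPTIONS[2]:
        judgment["odd_one_out"] = odd_one_out      # index 0, 1 or 2
        judgment["explanation"] = explanation      # free-text justification
    return judgment

# Example from the text: no two-way contrast should be forced for this triad.
example = triad_judgment(
    ("Despicable Me The Game",
     "VA-Despicable Me (Music From The Motion Picture)",
     "Despicable Me 2010 1080p"),
    first_answer=FIRST_SECTION_OPTIONS[1],
)
```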
3.2 Task setup
We ran two batches of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk on January 5th, 2011: a recruitment HIT and the main HIT. The recruitment HIT consisted of the same questions as the regular main HIT (Section 3.1), using three triads, and included an additional survey. In the survey, workers had to tell whether they liked the HIT and whether they wanted to do more HITs. If the latter was the case, they had to supply general demographic information and report their affinity with file-sharing and online media consumption.

The three triads, listed below, were selected from the portion of the dataset (Section 4) reserved for validation. We selected examples for which at least one answer was deemed uncontroversially wrong and the others acceptable.

• Acceptable to consider all different or to consider two the same and one different:
Desperate Housewives s03e17 [nosubs]
Desperate Housewives s03e18 [portugese subs]
Desperate Housewives s03e17 [portugese subs]

Here, we disallowed the option of considering all files to be comparable. For instance, someone downloading the third file would also want to have the second file, as these represent two consecutive episodes of a television series.

• Acceptable to consider all different:
Black Eyed Peas - Rock that body
Black Eyed Peas - Time of my life
Black Eyed Peas - Alive

Here, we disallowed the option of considering all files to be comparable, as one might actually want to download all three files. For the same reason, we also disallowed the option of considering two the same and one different.

• Acceptable to consider all same or to consider two the same and one different:
The Sorcerers Apprentice 2010 BluRay MKV x264 (8 GB)
The Sorcerers Apprentice CAM XVID-NDN (700 MB)
The Sorcerers Apprentice CAM XVID-NDN (717 MB)

Here, we disallowed the option of considering all files different. For instance, someone downloading the second file would not also download the third file, as these represent the same movie of comparable quality.

The key idea here is to check whether the workers understood the task and took it seriously, while at the same time not excluding people who do not share a view of the world similar to ours. To this end, we aimed to choose the least controversial cases and to admit more than one acceptable answer.

We deemed the recruitment HIT to be completed successfully if the following conditions were met:

• No unacceptable answers (listed above) were given in comparing the files in each triad.

• The answer to the free-text question provided evidence that the worker generalized beyond the filename, i.e., they compared the files on a functional level.

• All questions regarding demographic background were answered.
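These completion criteria lend themselves to a simple per-submission check, sketched below as our own illustration. It assumes each validation triad carries a set of acceptable first-section answers; the free-text and demographics checks are reduced to non-emptiness tests, whereas in the actual experiment the free-text answers were judged by hand.

```python
def recruitment_passed(submission, acceptable_answers):
    """Check a recruitment-HIT submission against the three criteria.

    `submission["answers"]` maps triad ids to the chosen first-section answer;
    `submission["explanations"]` holds the free-text justifications;
    `submission["demographics"]` holds the survey answers.
    `acceptable_answers` maps each validation triad id to its set of
    acceptable first-section answers.
    """
    # 1. No unacceptable answer for any validation triad.
    answers_ok = all(
        submission["answers"][tid] in allowed
        for tid, allowed in acceptable_answers.items()
    )
    # 2. Free-text answers go beyond the filename strings (crudely approximated
    #    here by non-empty text; judged manually in the actual experiment).
    freetext_ok = all(text.strip() for text in submission["explanations"])
    # 3. All demographic questions answered.
    demographics_ok = all(v not in (None, "") for v in submission["demographics"].values())
    return answers_ok and freetext_ok and demographics_ok
```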
Workers who completed the recruitment HIT, who expressed interest in our HIT, and who also gave answers that demonstrated affinity with file sharing, were admitted to the main HIT.

The recruitment HIT and the main HIT ran concurrently. This allowed workers who received a qualification to continue without delay. The reward for both HITs was $0.10. The recruitment HIT was open to 200 workers and the main HIT allowed for 3 workers per task and consisted of 500 tasks in total. Each task contained 2 triads from the test set and 1 triad from the validation set. Since our validation set (Section 4) is smaller than our test set, the validation triads were recycled and used multiple times. The order of the questions was randomized to ensure the position of the validation question was not fixed.
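A task layout as described above, two test triads plus one recycled validation triad in a random position, can be assembled as in the following sketch (our own illustration; the triad objects themselves are placeholders).

```python
import itertools
import random

def build_tasks(test_triads, validation_triads, n_tasks=500):
    """Assemble main-HIT tasks: 2 test triads + 1 validation triad each,
    with the validation triads recycled and their position randomized.
    Assumes len(test_triads) >= 2 * n_tasks."""
    validation_cycle = itertools.cycle(validation_triads)
    test_stream = iter(test_triads)
    tasks = []
    for _ in range(n_tasks):
        questions = [next(test_stream), next(test_stream), next(validation_cycle)]
        random.shuffle(questions)  # hide the position of the validation triad
        tasks.append(questions)
    return tasks
```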
4. DATASET
We created a test dataset of 1000 triads based on popular content on The Pirate Bay (TPB, http://thepiratebay.com), a site that indexes content that can be downloaded using the BitTorrent [4] file-sharing system. We fetched the top-100 popular content page on December 14, 2010. From this page and further queried pages, we only scraped content metadata, e.g., filename, file size and uploader. We did not download any actual content for the creation of our dataset.

Users looking for a particular file normally formulate a query based on their idea of the file they want to download. Borrowing this approach, we constructed a query for each of the items from the retrieved top-100 list. The queries were constructed automatically by taking the first two terms of a filename, ignoring stop words and terms containing digits. This resulted in 75 unique derived queries.

The 75 queries were issued to TPB on January 3, 2011. Each query resulted in between 4 and 1000 hits (median 335), and in total 32,773 filenames were obtained. We randomly selected 1000 triads for our test dataset. All files in a triad correspond to a single query. Using the same set of queries and retrieved filenames, we manually crafted a set of 28 triads for our validation set. For each of the triads in the validation set, we determined the acceptable answers.
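The automatic query derivation described above (first two filename terms, skipping stop words and terms containing digits) and the per-query sampling of triads can be sketched as follows. The tokenization and stop-word list are our own simplifications, and the sampling routine is only one plausible reading of the description; the paper does not specify these details further.

```python
import random
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative, not the list used in the paper

def derive_query(filename, n_terms=2):
    """Take the first two usable terms of a filename: no stop words, no digits."""
    terms = re.split(r"[\s._\-()\[\]]+", filename)
    usable = [t.lower() for t in terms
              if t and t.lower() not in STOP_WORDS and not any(c.isdigit() for c in t)]
    return " ".join(usable[:n_terms])

def sample_triads(hits_per_query, n_triads=1000, seed=0):
    """Randomly sample triads such that all three files answer the same query.
    Assumes enough distinct triads exist across the queries with >= 3 hits."""
    rng = random.Random(seed)
    queries = [q for q, hits in hits_per_query.items() if len(hits) >= 3]
    triads = set()
    while len(triads) < n_triads:
        query = rng.choice(queries)
        files = tuple(sorted(rng.sample(hits_per_query[query], 3)))
        triads.add((query, files))
    return list(triads)
```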
5. RESULTS

5.1 Crowdsourcing task
Our crowdsourcing task appeared to be attractive and finished quickly. The main HIT was completed within 36 hours. During the run of the recruitment HIT, we handed out qualifications to 14 workers. This number proved to be more than sufficient and caused us to decide to stop the recruitment HIT prematurely. The total work offered by the main HIT was completed by eight of these qualified workers. Half of the workers were eager and worked on a large volume of assignments (between 224 and 489 each). A quick look at the results did not raise any suspicions that these workers were under-performing compared to their work on the recruitment HIT. We therefore decided not to use the validation questions to reject work. However, we were still curious as to whether the eager workers were answering the repeating validation questions consistently. The repeated answers allowed us to confirm that the large-volume workers were serious and not sloppy. In fact, the highest-volume worker had perfect consistency.

The workers produced free-text judgments for 308 of the 1000 test triads. The other 692 triads consisted of files that were considered either all different or all similar. Workers fully agreed on which file differed from the other two for 68 of the 308 triads. For 93 triads, only two of the three given judgments agreed on which file was different. For the remaining 147 triads, no agreement was reached. Note that whether agreement was reached is not of direct importance to us, since we are mainly interested in the justifications for the workers' answers, which we use to discover the new dimensions of semantic similarity.
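The agreement figures reported above (full, majority, and no agreement over the three judgments per triad) correspond to a simple tally, sketched below under the assumption that each judgment records which of the three files was marked as different; the data structures are hypothetical.

```python
from collections import Counter

def agreement_level(odd_one_out_votes):
    """Classify three workers' 'which file differs' votes for one triad."""
    top_count = Counter(odd_one_out_votes).most_common(1)[0][1]
    if top_count == 3:
        return "full agreement"
    if top_count == 2:
        return "majority agreement"
    return "no agreement"

def tally(triad_votes):
    """Count triads per agreement level, e.g. {'full agreement': 68, ...}.

    `triad_votes` maps a triad id to the list of three votes given for it.
    """
    return Counter(agreement_level(votes) for votes in triad_votes.values())
```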
5.2 Card sorting the human judgments
We applied a standard card sorting technique [9] to categorize the explanations for the semantic similarity judgments that the workers provided in the free-text question. Each judgment was printed on a small piece of paper and similar judgments were grouped together into piles. Piles were iteratively merged until all piles were distinct and further merging was no longer possible. Each pile was given a category name reflecting the basic distinction described by the explanations. To list a few examples: the pile containing the explanations "The third item is a Hindi language version of the movie." and "This is a Spanish version of the movie represented by the other two" was labeled different language; the pile containing "This is the complete season. The other 2 are the same single episode in the season." and "This is the full season 5 while the other two are episode 12 of season 5" was labeled complete season vs. individual episodes; the pile containing "This is a discography while the two are movies" and "This is the soundtrack of the movie while the other two are the movie." was labeled soundtrack vs. corresponding movie.

The list of categories resulting from the card sort is given in Table 1. We found 44 similarity dimensions, many more than we had anticipated prior to the crowdsourcing experiment. The large number of unexpected dimensions we discovered supports the conclusion that user perception of semantic similarity among near duplicates is not trivial. For example, the "commentary document versus movie" dimension, which arose from a triad consisting of two versions of a motion picture and a text document that explained the movie, was particularly surprising, but nonetheless important for the file-sharing setting.

Table 1: Dimensions of semantic similarity discovered by categorizing crowdsourced judgments
Different movie vs. TV show
Different movie
Normal cut vs. extended cut
Movie vs. trailer
Cartoon vs. movie
Comic vs. movie
Movie vs. book
Audiobook vs. movie
Game vs. corresponding movie
Sequels (movies)
Commentary document vs. movie
Soundtrack vs. corresponding movie
Movie/TV show vs. unrelated audio album
Movie vs. wallpaper
Different episode
Complete season vs. individual episodes
Episodes from different season
Graphic novel vs. TV episode
Multiple episodes vs. full season
Different realization of same legend/story
Different songs
Different albums
Song vs. album
Collection vs. album
Album vs. remix
Event capture vs. song
Explicit version
Bonus track included
Song vs. collection of songs+videos
Event capture vs. unrelated movie
Language of subtitles
Different language
Mobile vs. normal version
Quality and/or source
Different codec/container (MP4 audio vs. MP3)
Different game
Crack vs. game
Software versions
Different game, same series
Different application
Addon vs. main application
Documentation (pdf) vs. software
List (text document) vs. unrelated item
Safe vs. X-Rated

Generalizing our findings in Table 1, we can see that most dimensions are based on different instantiations of particular content (e.g., quality and extended cuts), on the serial nature of content (e.g., episodic), or on the notion of collections (e.g., seasons and albums). These findings and generalizations will serve to inform the design of algorithms for the detection of near duplicates in result lists in future work.

6. CONCLUSION
In this work, we have described a crowdsourcing experiment that discovers user-perceived dimensions of semantic similarity among near duplicates. By launching an interesting task with a focus on engagement and on encouraging serious workers, we have been able to quickly acquire a wealth of different dimensions of semantic similarity that we otherwise could not have thought of. Our future work will involve expanding this experiment to encompass a larger number of workers and other multimedia search settings. Our experiment opens up the perspective that crowdsourcing can be used to gain a more sophisticated understanding of user perceptions of semantic similarity among multimedia near-duplicate items.

7. REFERENCES
[1] A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes. Computer Vision and Image Understanding, 110(3):360–377, June 2008.
[2] P. Caputi and P. Reddy. A comparison of triadic and dyadic methods of personal construct elicitation. Journal of Constructivist Psychology, 12(3):253–264, 1999.
[3] M. Cherubini, R. de Oliveira, and N. Oliver. Understanding near-duplicate videos: a user-centric approach. In Proceedings of the 17th ACM International Conference on Multimedia, MM '09, pages 35–44, New York, 2009. ACM.
[4] B. Cohen. Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer Systems, 2003.
[5] C. Eickhoff and A. P. de Vries. Increasing cheat robustness of crowdsourcing tasks. Information Retrieval, 15, 2012. To appear.
[6] F. Fransella and R. Bannister. A Manual for Repertory Grid Technique. Wiley, 2nd edition, 2003.
[7] M. A. Hearst. Search User Interfaces. Cambridge University Press, 1st edition, Sept. 2009.
[8] G. A. Kelly. The Psychology of Personal Constructs, Volume One: Theory and Personality. Norton, New York, 1955.
[9] G. Rugg and P. McGeorge. The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts. Expert Systems, 14(2):80–93, 1997.
[10] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 4–8, 2010.
[11] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th International Conference on Multimedia, MM '07, pages 218–227, New York, 2007. ACM.