Discovering User Perceptions of Semantic Similarity in Near-duplicate Multimedia Files

Raynor Vliegendhart, Martha Larson, Johan Pouwelse
R.Vliegendhart@tudelft.nl, M.A.Larson@tudelft.nl, J.A.Pouwelse@tudelft.nl
Delft University of Technology, Mekelweg 4, 2628 CD Delft, The Netherlands

Copyright © 2012 for the individual papers by the papers' authors. Copying permitted for private and academic purposes. This volume is published and copyrighted by its editors. CrowdSearch 2012 workshop at WWW 2012, Lyon, France.

ABSTRACT
We address the problem of discovering new notions of user-perceived similarity between near-duplicate multimedia files. We focus on file-sharing, since in this setting users have a well-developed understanding of the available content, but what constitutes a near-duplicate is nonetheless nontrivial. We elicited judgments of semantic similarity by implementing triadic elicitation as a crowdsourcing task and ran it on Amazon Mechanical Turk. We categorized the judgments and arrived at 44 different dimensions of semantic similarity perceived by users. These discovered dimensions can be used for clustering items in search result lists. The challenge in performing elicitations in this way is to ensure that workers are encouraged to answer seriously and remain engaged.

Categories and Subject Descriptors: H.3 [Information Storage and Retrieval]: H.3.3 Information Search and Retrieval—Search process
General Terms: Human Factors, Experimentation
Keywords: Near-duplicates, perceived similarity, triadic elicitation, Mechanical Turk

1. INTRODUCTION
Crowdsourcing platforms make it possible to elicit semantic judgments from users. Crowdsourcing can be particularly helpful in cases in which human interpretations are not immediately self-evident. In this paper, we report on a crowdsourcing experiment designed to elicit human judgments on semantic similarity between near-duplicate multimedia files. We use crowdsourcing for this application because it allows us to easily collect a large number of human similarity judgments. The major challenge we address is designing the crowdsourcing task, which we ran on Amazon Mechanical Turk, in a way that ensures that the workers from whom we elicit judgments are both serious and engaged.

Multimedia content is semantically complex. This complexity means that it is difficult to make reliable assumptions about the dimensions of semantic similarity along which multimedia items can resemble each other, i.e., be considered near duplicates. Knowledge of such dimensions is important for designing retrieval systems. We plan ultimately to use this knowledge to inform the development of algorithms that organize search result lists. In order to simplify the problem of semantic similarity, we focus on a particular area of search, namely, search within file-sharing systems. We choose file-sharing because it is a rich, real-world use scenario in which user information needs are relatively well constrained and users have a widely-shared and well-developed understanding of the characteristics of the items that they are looking for.

Our investigation is focused on dimensions of semantic similarity that go beyond what is depicted in the visual channel of the video. In this way, our work differs from other work on multimedia near duplicates that puts its main emphasis on visual content [1]. Specifically, we define a notion of near-duplicate multimedia items that is related to the reasons for which users are searching for them. By using a definition of near duplicates that is related to the function or purpose that multimedia items fulfill for users, we conjecture that we will be able to arrive at a set of semantic similarities that reflects user search goals and is in this way highly suited for use in multimedia retrieval result lists.

The paper is organized as follows. After presenting background and related work in Section 2, we describe the crowdsourcing experiment by which we elicit human judgments in Section 3. The construction of the dataset used in the experiment is given in Section 4. Direct results of the experiment and the derived similarity dimensions are discussed in Section 5. We finish with conclusions in Section 6.
2. BACKGROUND AND RELATED WORK

2.1 Near-duplicates in search results
Well-organized search results give users an easy way to get an overview of a result list. A simple, straightforward method of organization groups together similar results and represents each group with a concise surrogate, e.g., a single representative item. Users can then scan a shorter list of groups, rather than a longer list of individual result items. Hiding near-duplicate items in the interface is a specific realization of near-duplicate elimination, which has been suggested in order to make video retrieval more efficient for users [11]. Algorithms that can identify near duplicates can be used to group items in the interface. One of the challenges in designing such algorithms is being able to base them on similarity between items as it is perceived by users. Clustering items with regard to general overall similarity is a possibility. However, this approach is problematic since items are similar in many different ways at the same time [7]. Instead, our approach, and the ultimate aim of our work, is to develop near-duplicate clustering algorithms that are informed by user perceptions of dimensions of semantic similarity between items. We assume that these algorithms stand to benefit if they draw on a set of possible dimensions of semantic similarity that is as large as possible.
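As a concrete illustration of the surrogate-based grouping described above, the following minimal sketch (our own, not part of the original paper) groups result items by an arbitrary similarity key and keeps a single representative per group. The similarity_key callback stands in for whatever near-duplicate detector ultimately supplies a dimension of similarity; the example key and filenames are purely hypothetical.

```python
from collections import OrderedDict

def group_results(results, similarity_key):
    """Group near-duplicate results and keep one surrogate per group.

    `results` is a list of result items (e.g., filenames); `similarity_key`
    is a placeholder for a near-duplicate detector mapping an item to a
    group identifier.
    """
    groups = OrderedDict()
    for item in results:
        groups.setdefault(similarity_key(item), []).append(item)
    # The surrogate shown in the interface is simply the first item of each group,
    # together with the size of the group it represents.
    return [(members[0], len(members)) for members in groups.values()]

# Hypothetical example: group by a normalized filename prefix.
results = [
    "The Sorcerers Apprentice 2010 BluRay MKV x264",
    "The Sorcerers Apprentice CAM XVID-NDN",
    "Despicable Me 2010 1080p",
]
print(group_results(results, lambda name: name.lower()[:20]))
```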
Our work uses a definition of near duplicates based on the function they fulfill for the user:

Functional near-duplicate multimedia items are items that fulfill the same purpose for the user. Once the user has one of these items, there is no additional need for another.

In [11], one video is deemed to be a near duplicate of another if a user would clearly identify them as essentially the same. However, this definition is not as broad as ours, since only the visual channel is considered. Our work is related to [3], which consults users to find whether particular semantic differences make important contributions to their perceptions of near duplicates. Our work differs because we are interested in discovering new dimensions of semantic similarity rather than testing an assumed list of similarity dimensions.

2.2 Eliciting judgments of semantic similarity
We are interested in gathering human judgments on semantic interpretation, which involves the acquisition of new knowledge on human perception of similarity. Any thoughtful answer given by a human is of potential interest to us. No serious answer can be considered wrong.

The technique we use, triadic elicitation, is adopted from psychology [6], where it is used for knowledge acquisition. Given three elements, a subject is asked to specify in what important way two of them are alike but different from the third [8]. Two reasons make triadic elicitation well suited for our purposes. First, being presented with three elements, the workers have to abstract away from small differences between any two specific items, which encourages them to identify those similarities that are essential. Second, the triadic method has been found to be cognitively more complex than the dyadic method [2], supporting our goal of creating an engaging crowdsourcing task by adding a cognitive challenge.

A crowdsourcing task that involves the elicitation of semantic judgments differs from other tasks in which the correctness of answers can be verified. In this way, our task resembles the one designed in [10], which collects viewer-reported judgments. Instead of verifying answers directly, we design our task to control quality by encouraging workers to be serious and engaged. We adopt the approach of [10] of using a pilot HIT to recruit serious workers. In order to increase worker engagement, we also adopt the approach of [5], which observes that open-ended questions are more enjoyable and challenging.

3. CROWDSOURCING TASK
The goal of our crowdsourcing task is to elicit the various notions of similarity perceived by users of a file-sharing system. This task provides input for a card sort, which we carry out as a next step (Section 5.2) in order to derive a small set of semantic similarity dimensions from the large set of user-perceived similarities we collect via crowdsourcing.

The crowdsourcing task aims to achieve worker seriousness and engagement through judicious design decisions. Our task design places particular focus on ensuring task credibility. For example, the title and description of the pilot make clear the purpose of the task, i.e., research, and that workers should not expect a high volume of work to be offered. Further, we strive to ensure that workers are confident that they understand what is required of them. We explain functional similarity in practical terms, using easy-to-understand phrases such as "comparable", "like", and "for all practical purposes the same". We also give consideration to task awareness by including questions in the recruitment task designed to determine basic familiarity with file-sharing and interest level in the task.

3.1 Task description
The task consists of a question, illustrated by Figure 1, that is repeated three times, once for each of three different triads of files. For each presented triad, we ask the workers to imagine that they have downloaded all three files and to compare the files to each other on a functional level. The file information shown to the workers is taken from a real file-sharing system (see the description of the dataset in Section 4) and is displayed as in a real-world system, with filename, file size and uploader. The worker is not given the option to view the actual files, reflecting the common real file-sharing scenario in which the user does not have the resources (e.g., the time) to download and compare all items when scrolling through the search results.

Figure 1: One of the triads of files and the corresponding question as presented to the workers.

The first section of the question is used to determine whether it is possible to define a two-way contrast between the three files. We use this section to eliminate cases in which files are perceived to be all the same or all different. This follows the advice given in [9] on when not to use triadic elicitation. Specifically, we avoid forcing a contrast in cases where it does not make sense. The following triad is an example of a case in which a two-way contrast should not be forced:

Despicable Me The Game
VA-Despicable Me (Music From The Motion Picture)
Despicable Me 2010 1080p

These files all bear the same title. If workers were forced to identify a two-way contrast, we would risk eliciting differences that are not on the functional level, e.g., "the second filename starts with a V while the other two start with a D". Avoiding nonsense questions also enhances the credibility of our task.

In order to ensure that the workers follow our definition of functional similarity in their judgment, we elaborately define the use case of the three files in the all-same and all-different options. We specify that the three files are the same when someone would never need all of them. Similarly, the three files can be considered to be all different from each other if the worker can think of an opposite situation in which someone would want to download all three files. Note that emphasizing the functional perspective of similarity guides workers away from only matching strings and towards considering the similarity of the underlying multimedia items. Also, we intend the elaborate description to discourage workers from taking the easy way out, i.e., selecting one of the first two options and thereby not having to contrast files.

Workers move on to the second section only if they report that it is possible to make a two-way contrast. Here they are asked to indicate which element of the triad differs from the remaining two and to specify the difference by answering a free-text question.
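To make the two-section question flow concrete, the sketch below models one triad question: the first section asks whether a two-way contrast is possible at all, and only then does the second section record which file differs and why. The option wording and field names are our own paraphrase of the task, not the exact text shown to workers.

```python
FIRST_SECTION_OPTIONS = (
    "all three are for all practical purposes the same",
    "all three are different; someone might want all of them",
    "two are comparable and one is different",
)

def triad_judgment(triad, first_answer, odd_one_out=None, explanation=""):
    """Record a worker's answer to a single triad question.

    A two-way contrast (second section) is only recorded when the worker
    chose the third option in the first section.
    """
    assert first_answer in FIRST_SECTION_OPTIONS
    judgment = {"triad": triad, "first_section": first_answer}
    if first_answer == FIRST_SECTION_OPTIONS[2]:
        judgment["odd_one_out"] = odd_one_out      # index 0, 1 or 2
        judgment["explanation"] = explanation      # free-text justification
    return judgment

# Example from the text: no two-way contrast should be forced for this triad.
example = triad_judgment(
    ("Despicable Me The Game",
     "VA-Despicable Me (Music From The Motion Picture)",
     "Despicable Me 2010 1080p"),
    first_answer=FIRST_SECTION_OPTIONS[1],
)
```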
3.2 Task setup
We ran two batches of Human Intelligence Tasks (HITs) on Amazon Mechanical Turk on January 5th, 2011: a recruitment HIT and the main HIT. The recruitment HIT consisted of the same questions as the regular main HIT (Section 3.1), using three triads, and included an additional survey. In the survey, workers had to tell whether they liked the HIT and whether they wanted to do more HITs. If the latter was the case, they had to supply general demographic information and report their affinity with file-sharing and online media consumption.

The three triads, listed below, were selected from the portion of the dataset (Section 4) reserved for validation. We selected examples for which at least one answer was deemed uncontroversially wrong and the others acceptable.

• Acceptable to consider all different or to consider two the same and one different:
Desperate Housewives s03e17 [nosubs]
Desperate Housewives s03e18 [portugese subs]
Desperate Housewives s03e17 [portugese subs]

Here, we disallowed the option of considering all files to be comparable. For instance, someone downloading the third file would also want to have the second file, as these represent two consecutive episodes of a television series.

• Acceptable to consider all different:
Black Eyed Peas - Rock that body
Black Eyed Peas - Time of my life
Black Eyed Peas - Alive

Here, we disallowed the option of considering all files to be comparable, as one might actually want to download all three files. For the same reason, we also disallowed the option of considering two the same and one different.

• Acceptable to consider all same or to consider two the same and one different:
The Sorcerers Apprentice 2010 BluRay MKV x264 (8 GB)
The Sorcerers Apprentice CAM XVID-NDN (700 MB)
The Sorcerers Apprentice CAM XVID-NDN (717 MB)

Here, we disallowed the option of considering all files different. For instance, someone downloading the second file would not also download the third file, as these represent the same movie of comparable quality.

The key idea here is to check whether the workers understood the task and took it seriously, while at the same time not excluding people who do not share a view of the world similar to ours. To this end, we aimed to choose the least controversial cases and to admit more than one acceptable answer.

We deemed the recruitment HIT to be completed successfully if the following conditions were met:

• No unacceptable answers (listed above) were given in comparing the files in each triad.

• The answer to the free-text question provided evidence that the worker generalized beyond the filename, i.e., they compared the files on a functional level.

• All questions regarding demographic background were answered.
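These completion criteria lend themselves to a simple per-submission check, sketched below as our own illustration. It assumes each validation triad carries a set of acceptable first-section answers; the free-text and demographics checks are reduced to non-emptiness tests, whereas in the actual experiment the free-text answers were judged by hand.

```python
def recruitment_passed(submission, acceptable_answers):
    """Check a recruitment-HIT submission against the three criteria.

    `submission["answers"]` maps triad ids to the chosen first-section answer;
    `submission["explanations"]` holds the free-text justifications;
    `submission["demographics"]` holds the survey answers.
    `acceptable_answers` maps each validation triad id to its set of
    acceptable first-section answers.
    """
    # 1. No unacceptable answer for any validation triad.
    answers_ok = all(
        submission["answers"][tid] in allowed
        for tid, allowed in acceptable_answers.items()
    )
    # 2. Free-text answers go beyond the filename strings (crudely approximated
    #    here by non-empty text; judged manually in the actual experiment).
    freetext_ok = all(text.strip() for text in submission["explanations"])
    # 3. All demographic questions answered.
    demographics_ok = all(v not in (None, "") for v in submission["demographics"].values())
    return answers_ok and freetext_ok and demographics_ok
```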
Workers who completed the recruitment HIT, who expressed interest in our HIT, and who also gave answers that demonstrated affinity with file sharing, were admitted to the main HIT.

The recruitment HIT and the main HIT ran concurrently. This allowed workers who received a qualification to continue without delay. The reward for both HITs was $0.10. The recruitment HIT was open to 200 workers and the main HIT allowed for 3 workers per task and consisted of 500 tasks in total. Each task contained 2 triads from the test set and 1 triad from the validation set. Since our validation set (Section 4) is smaller than our test set, the validation triads were recycled and used multiple times. The order of the questions was randomized to ensure the position of the validation question was not fixed.
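A task layout as described above, two test triads plus one recycled validation triad in a random position, can be assembled as in the following sketch (our own illustration; the triad objects themselves are placeholders).

```python
import itertools
import random

def build_tasks(test_triads, validation_triads, n_tasks=500):
    """Assemble main-HIT tasks: 2 test triads + 1 validation triad each,
    with the validation triads recycled and their position randomized.
    Assumes len(test_triads) >= 2 * n_tasks."""
    validation_cycle = itertools.cycle(validation_triads)
    test_stream = iter(test_triads)
    tasks = []
    for _ in range(n_tasks):
        questions = [next(test_stream), next(test_stream), next(validation_cycle)]
        random.shuffle(questions)  # hide the position of the validation triad
        tasks.append(questions)
    return tasks
```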
4. DATASET
We created a test dataset of 1000 triads based on popular content on The Pirate Bay (TPB, http://thepiratebay.com), a site that indexes content that can be downloaded using the BitTorrent [4] file-sharing system. We fetched the top-100 popular content page on December 14, 2010. From this page and further queried pages, we only scraped content metadata, e.g., filename, file size and uploader. We did not download any actual content for the creation of our dataset.

Users looking for a particular file normally formulate a query based on their idea of the file they want to download. Borrowing this approach, we constructed a query for each of the items from the retrieved top-100 list. The queries were constructed automatically by taking the first two terms of a filename, ignoring stop words and terms containing digits. This resulted in 75 unique derived queries.

The 75 queries were issued to TPB on January 3, 2011. Each query resulted in between 4 and 1000 hits (median 335), and in total 32,773 filenames were obtained. We randomly selected 1000 triads for our test dataset. All files in a triad correspond to a single query. Using the same set of queries and retrieved filenames, we manually crafted a set of 28 triads for our validation set. For each of the triads in the validation set, we determined the acceptable answers.
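The automatic query derivation described above (first two filename terms, skipping stop words and terms containing digits) and the per-query sampling of triads can be sketched as follows. The tokenization and stop-word list are our own simplifications, and the sampling routine is only one plausible reading of the description; the paper does not specify these details further.

```python
import random
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative, not the list used in the paper

def derive_query(filename, n_terms=2):
    """Take the first two usable terms of a filename: no stop words, no digits."""
    terms = re.split(r"[\s._\-()\[\]]+", filename)
    usable = [t.lower() for t in terms
              if t and t.lower() not in STOP_WORDS and not any(c.isdigit() for c in t)]
    return " ".join(usable[:n_terms])

def sample_triads(hits_per_query, n_triads=1000, seed=0):
    """Randomly sample triads such that all three files answer the same query.
    Assumes enough distinct triads exist across the queries with >= 3 hits."""
    rng = random.Random(seed)
    queries = [q for q, hits in hits_per_query.items() if len(hits) >= 3]
    triads = set()
    while len(triads) < n_triads:
        query = rng.choice(queries)
        files = tuple(sorted(rng.sample(hits_per_query[query], 3)))
        triads.add((query, files))
    return list(triads)
```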
5. RESULTS

5.1 Crowdsourcing task
Our crowdsourcing task appeared to be attractive and finished quickly. The main HIT was completed within 36 hours. During the run of the recruitment HIT, we handed out qualifications to 14 workers. This number proved to be more than sufficient and caused us to decide to stop the recruitment HIT prematurely. The total work offered by the main HIT was completed by eight of these qualified workers. Half of the workers were eager and worked on a large volume of assignments (between 224 and 489 each). A quick look at the results did not raise any suspicions that these workers were under-performing compared to their work on the recruitment HIT. We therefore decided not to use the validation questions to reject work. However, we were still curious as to whether the eager workers were answering the repeating validation questions consistently. The repeated answers allowed us to confirm that the large-volume workers were serious and not sloppy. In fact, the highest-volume worker had perfect consistency.

The workers produced free-text judgments for 308 of the 1000 test triads. The other 692 triads consisted of files that were considered either all different or all similar. Workers fully agreed on which file differed from the other two for 68 of the 308 triads. For 93 triads, only two of the three given judgments agreed on which file was different. For the remaining 147 triads, no agreement was reached. Note that whether agreement was reached is not of direct importance to us, since we are mainly interested in the justifications for the workers' answers, which we use to discover the new dimensions of semantic similarity.
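The agreement figures reported above (full, majority, and no agreement over the three judgments per triad) correspond to a simple tally, sketched below under the assumption that each judgment records which of the three files was marked as different; the data structures are hypothetical.

```python
from collections import Counter

def agreement_level(odd_one_out_votes):
    """Classify three workers' 'which file differs' votes for one triad."""
    top_count = Counter(odd_one_out_votes).most_common(1)[0][1]
    if top_count == 3:
        return "full agreement"
    if top_count == 2:
        return "majority agreement"
    return "no agreement"

def tally(triad_votes):
    """Count triads per agreement level, e.g. {'full agreement': 68, ...}.

    `triad_votes` maps a triad id to the list of three votes given for it.
    """
    return Counter(agreement_level(votes) for votes in triad_votes.values())
```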
5.2 Card sorting the human judgments
We applied a standard card sorting technique [9] to categorize the explanations for the semantic similarity judgments that the workers provided in the free-text question. Each judgment was printed on a small piece of paper and similar judgments were grouped together into piles. Piles were iteratively merged until all piles were distinct and further merging was no longer possible. Each pile was given a category name reflecting the basic distinction described by the explanations. To list a few examples: the pile containing the explanations "The third item is a Hindi language version of the movie." and "This is a Spanish version of the movie represented by the other two" was labeled different language; the pile containing "This is the complete season. The other 2 are the same single episode in the season." and "This is the full season 5 while the other two are episode 12 of season 5" was labeled complete season vs. individual episodes; the pile containing "This is a discography while the two are movies" and "This is the soundtrack of the movie while the other two are the movie." was labeled soundtrack vs. corresponding movie.

The list of categories resulting from the card sort is given in Table 1. We found 44 similarity dimensions, many more than we had anticipated prior to the crowdsourcing experiment. The large number of unexpected dimensions we discovered supports the conclusion that user perception of semantic similarity among near duplicates is not trivial. For example, the "commentary document versus movie" dimension, which arose from a triad consisting of two versions of a motion picture and a text document that explained the movie, was particularly surprising, but nonetheless important for the file-sharing setting.

Table 1: Dimensions of semantic similarity discovered by categorizing crowdsourced judgments
Different movie vs. TV show
Different movie
Normal cut vs. extended cut
Movie vs. trailer
Cartoon vs. movie
Comic vs. movie
Movie vs. book
Audiobook vs. movie
Game vs. corresponding movie
Sequels (movies)
Commentary document vs. movie
Soundtrack vs. corresponding movie
Movie/TV show vs. unrelated audio album
Movie vs. wallpaper
Different episode
Complete season vs. individual episodes
Episodes from different season
Graphic novel vs. TV episode
Multiple episodes vs. full season
Different realization of same legend/story
Different songs
Different albums
Song vs. album
Collection vs. album
Album vs. remix
Event capture vs. song
Explicit version
Bonus track included
Song vs. collection of songs+videos
Event capture vs. unrelated movie
Language of subtitles
Different language
Mobile vs. normal version
Quality and/or source
Different codec/container (MP4 audio vs. MP3)
Different game
Crack vs. game
Software versions
Different game, same series
Different application
Addon vs. main application
Documentation (pdf) vs. software
List (text document) vs. unrelated item
Safe vs. X-Rated

Generalizing our findings in Table 1, we can see that most dimensions are based on different instantiations of particular content (e.g., quality and extended cuts), on the serial nature of content (e.g., episodic), or on the notion of collections (e.g., seasons and albums). These findings and generalizations will serve to inform the design of algorithms for the detection of near duplicates in result lists in future work.

6. CONCLUSION
In this work, we have described a crowdsourcing experiment that discovers user-perceived dimensions of semantic similarity among near duplicates. By launching an interesting task with a focus on engagement and on encouraging serious workers, we have been able to quickly acquire a wealth of different dimensions of semantic similarity that we otherwise could not have thought of. Our future work will involve expanding this experiment to encompass a larger number of workers and other multimedia search settings. Our experiment opens up the perspective that crowdsourcing can be used to gain a more sophisticated understanding of user perceptions of semantic similarity among multimedia near-duplicate items.

7. REFERENCES
[1] A. Basharat, Y. Zhai, and M. Shah. Content based video matching using spatiotemporal volumes. Computer Vision and Image Understanding, 110(3):360–377, June 2008.
[2] P. Caputi and P. Reddy. A comparison of triadic and dyadic methods of personal construct elicitation. Journal of Constructivist Psychology, 12(3):253–264, 1999.
[3] M. Cherubini, R. de Oliveira, and N. Oliver. Understanding near-duplicate videos: a user-centric approach. In Proceedings of the 17th ACM International Conference on Multimedia, MM '09, pages 35–44, New York, 2009. ACM.
[4] B. Cohen. Incentives build robustness in BitTorrent. In Workshop on Economics of Peer-to-Peer Systems, 2003.
[5] C. Eickhoff and A. P. de Vries. Increasing cheat robustness of crowdsourcing tasks. Information Retrieval, 15, 2012. To appear.
[6] F. Fransella and R. Bannister. A Manual for Repertory Grid Technique. Wiley, 2nd edition, 2003.
[7] M. A. Hearst. Search User Interfaces. Cambridge University Press, 1st edition, Sept. 2009.
[8] G. A. Kelly. The Psychology of Personal Constructs, Volume One: Theory and Personality. Norton, New York, 1955.
[9] G. Rugg and P. McGeorge. The sorting techniques: a tutorial paper on card sorts, picture sorts and item sorts. Expert Systems, 14(2):80–93, 1997.
[10] M. Soleymani and M. Larson. Crowdsourcing for affective annotation of video: Development of a viewer-reported boredom corpus. In Proceedings of the ACM SIGIR 2010 Workshop on Crowdsourcing for Search Evaluation (CSE 2010), pages 4–8, 2010.
[11] X. Wu, A. G. Hauptmann, and C.-W. Ngo. Practical elimination of near-duplicates from web video search. In Proceedings of the 15th International Conference on Multimedia, MM '07, pages 218–227, New York, 2007. ACM.