A Study of Narrative Creation by Means of Crowds and Niches

Oana Inel, Vrije Universiteit Amsterdam, oana.inel@vu.nl
Sabrina Sauer, University of Groningen, s.c.sauer@rug.nl
Lora Aroyo, Vrije Universiteit Amsterdam, lora.aroyo@vu.nl

Copyright © 2018 for this paper by its authors. Copying permitted for private and academic purposes.

Abstract

Online video constitutes the largest, continuously growing portion of Web content. Web users drive this growth by massively sharing their personal stories on social media platforms as compilations of their daily visual memories, or as animated GIFs and memes based on existing video material. It is therefore crucial to understand the semantics of video stories, i.e., what they capture and how. The remix of visual content is also a powerful way of understanding the implicit aspects of storytelling, as well as the essential parts of audio-visual (AV) material. In this paper we take a digital hermeneutics approach to understand which visual attributes and semantics drive the creation of narratives. We present insights from a nichesourcing study in which humanities scholars remix keyframes and video fragments into micro-narratives, i.e., (sequences of) GIFs. To support narrative creation by humanities scholars, specific video annotations are needed: (1) annotations that consider both literal and abstract connotations of the video material, and (2) annotations that are coarse-grained, i.e., focused on keyframes and video fragments rather than on full-length videos. The main findings of the study are used to facilitate the automatic creation of narratives in the digital humanities exploratory search tool DIVE+ (http://diveplus.beeldengeluid.nl/).

Introduction

Social media provide a mainstream environment to produce, share and comment on video material, which constitutes the largest and still growing portion of Web content (CISCO 2016). An increasingly popular form of shared content is the GIF (Bakhshi et al. 2016) as a micro-story, i.e., a short video fragment that contains a summary or highlight of video content, shared on participatory platforms such as GIPHY and Twitter Vine or on social media platforms such as Facebook and Instagram. Humanities scholars use AV archives (De Jong, Ordelman, and Scagliola 2011) to answer their research questions (Melgar et al. 2017), but they face the challenge of grappling with a vast amount of diverse AV content. The DIVE+ tool (De Boer et al. 2015) is conceived to assist scholars in their exploration of digital content to ultimately create meaningful stories and narratives. DIVE+ extends the digital hermeneutics approach (Van Den Akker et al. 2011) by providing interactive access to multimedia objects enriched with events, people, locations and concepts.
Visualizing, mapping and constructing narratives play a significant role in humanities research, as they help to contextualize historical material (de Leeuw 2012; Mamber 2012). The remix of AV content as animated GIFs (Highfield and Leaver 2016) has gained popularity as an object of study and is considered a powerful way of understanding the implicit aspects of storytelling. However, the availability of metadata and semantic annotations (Maccatrozzo et al. 2013; Aroyo, Nixon, and Miller 2011), such as events, objects depicted in the video and the relevance of the videos, remains a fundamental requirement (Kemman et al. 2013) for scholars to accelerate their narrative-formation process.

The focus of this paper is to understand how niches (De Boer et al. 2012), in this case humanities scholars, interact with AV archives to generate (micro-)narratives. Our research question is: can we model the data and the semantics of AV content to ease the creation of narratives? To answer this question we conduct a nichesourcing study with millennial humanities students in which they use AV content to create stories by means of sequences of GIFs. We analyze the narrative creation process on three levels: (1) data - the remixed videos, to understand how the story is developed; (2) narrative - the micro-story created in and across sequences of GIFs, to understand what drives the creation of a narrative; and (3) semantics - the keywords describing the story, to understand the data enrichment needed to generate narratives.

On the Use of Narratives in Digital Humanities

DIVE+ accommodates the digital hermeneutics approach by means of proto-narratives, i.e., relations between events and their participating entities. To support the creation of such proto-narratives, we gathered events and links between their participating entities in textual AV content (i.e., descriptions) through a hybrid machine-crowd pipeline (de Boer et al. 2017). To further improve narrative exploration and creation in DIVE+, we performed a nichesourcing study with millennial digital humanities master students to understand how this community builds stories using AV material and what its needs are in terms of data representation. While previous studies focused on textual AV content, the current study aims to understand the creation of narratives through visual aspects such as video stills and fragments.

Nine international humanities master students (aged between 21 and 25) enrolled in an interdisciplinary course about urban street visualization in Amsterdam participated in our niche study. Their task was to explore a dataset of archival AV material and to construct overarching micro-stories in the shape of sequences of GIFs. A GIF is composed of three keyframes or of a (set of) short video fragment(s). The students were free to explore the dataset and to create GIFs about topics that drew their attention in relation to the city of Amsterdam, or in relation to the course literature.

The dataset consists of archival video material about Amsterdam, part of the Netherlands Institute for Sound and Vision (NISV, http://www.beeldengeluid.nl) open collections. We retrieved 624 videos created between 1910 and 1989 from the NISV portal using the search keyword "Amsterdam". The dataset consists of news broadcasts, varying in length from 50 seconds to 10 minutes, from which we identified three time periods, as shown in Table 1.

Table 1: Dataset overview

Time Period   Period Interval   #Videos   #Users
P1            1910-1929         60        2
P2            1950-1969         288       3
P3            1970-1989         96        4
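The partition in Table 1 can be made concrete with a minimal sketch of how a video's production year maps onto one of the three periods; the field names and example records below are hypothetical and do not reflect the actual NISV metadata schema.

```python
# Hypothetical sketch: assign each retrieved video to a period from Table 1
# based on its production year. Field names are illustrative, not the NISV schema.
PERIODS = {
    "P1": (1910, 1929),
    "P2": (1950, 1969),
    "P3": (1970, 1989),
}

def period_of(year):
    """Return the Table 1 period a production year falls into, or None."""
    for period, (start, end) in PERIODS.items():
        if start <= year <= end:
            return period
    return None  # years such as 1930-1949 fall outside the three periods

videos = [{"id": "v1", "year": 1923}, {"id": "v2", "year": 1964}, {"id": "v3", "year": 1978}]
by_period = {}
for video in videos:
    by_period.setdefault(period_of(video["year"]), []).append(video["id"])
print(by_period)  # {'P1': ['v1'], 'P2': ['v2'], 'P3': ['v3']}
```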
In the study we asked the students to choose a time period from Table 1 and to watch at least 20 videos from that period. The users had one week to complete the entire task, to log their activity (log file template: http://tinyurl.com/zwgotp7) and, among other things, to: (1) indicate the GIF type, i.e., keyframe- or fragment-based; (2) describe each GIF, keyframe and video fragment with keywords; and (3) provide the timestamps of the keyframes (keyframe-based GIFs) or the interval of the video fragment (fragment-based GIFs). The students were also asked to prepare a short presentation to describe and motivate (1) the videos and the time period they selected, (2) the selection of keyframes and video fragments and (3) the story told in their GIFs.

Nichesourcing Study Results

We present the study results (https://tinyurl.com/alternate-stories) and analyze the data gathered from the participating users by focusing on keyframes, video fragments, GIFs and, finally, the overarching micro-stories.

The Data Level

The users picked a time period as shown in Table 1. Their choice was informed either by (1) feeling unknowledgeable about that period or (2) curiosity about a period when their parents were their own current age. In total, 68 videos were used across all the micro-stories, and seven videos were used in more than one micro-story. All the overlaps occurred for the users that chose period P3, which is explained by the low number of videos in P3 and the fact that the users were asked to watch at least 20 videos. On average, each user used eight videos to generate a story, with a minimum of three and a maximum of 20 videos per story.

Each story was composed of around eight GIFs (stdev of five GIFs), with a minimum of four and a maximum of 20 GIFs. In total, 75 GIFs were generated: seven keyframe-based GIFs and 68 fragment-based GIFs. Only two users generated keyframe-based GIFs, while all nine users generated fragment-based GIFs. The 68 fragment-based GIFs were generated by remixing and combining 89 video fragments, meaning that around 25% of the fragment-based GIFs were composed of more than one video fragment. On average, 10 video fragments (stdev of 10) were used in each micro-story, with a minimum of two and a maximum of 35 video fragments. Furthermore, eight GIFs were generated by remixing keyframes and video fragments from multiple videos (six keyframe-based and two fragment-based GIFs).

In general, mostly keyframes and fragments from the beginning of the videos were picked (55.45%), followed by keyframes and fragments from the middle (24.55%) and then by keyframes and fragments from the end of the video (20%). When multiple keyframes and fragments from the same video were remixed in the same GIF, the order was always preserved, i.e., the keyframes and the fragments were used in chronological order with respect to the video stream. However, when looking at the entire story, we observe that the users break the natural temporal and linear sequence of the videos by starting the story with video fragments or keyframes from the middle or the end part of the videos.
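For this positional analysis, a minimal sketch of how a selected keyframe or fragment could be assigned to the beginning, middle or end of its source video is given below; the paper does not state the exact boundaries used, so equal thirds of the video duration are assumed here.

```python
# Hedged sketch: classify where a selected keyframe or fragment starts within its
# source video, assuming "beginning", "middle" and "end" correspond to equal thirds
# of the video duration (the exact boundaries are not specified in the paper).
def video_part(start_seconds, video_duration_seconds):
    position = start_seconds / video_duration_seconds
    if position < 1 / 3:
        return "beginning"
    if position < 2 / 3:
        return "middle"
    return "end"

# Example: a fragment starting at 0:40 in a 3-minute news broadcast.
print(video_part(40, 180))  # "beginning"
```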
The majority of the GIFs are shorter than six seconds, with only a few longer than 10 seconds. The average length of a story is 43 seconds, with a maximum length of one minute and 48 seconds and a minimum length of 12 seconds. On average, only 3.6% of the videos' length was used to generate each story, but the length of a story is not always proportional to the total length of the videos it draws on.

The Narrative Level

The users focused their micro-stories around themes that were inspired either by the content of the videos or by the course literature (i.e., the visualization of urban spaces). The themes of the stories are: (1) mobility across the city, (2) citizens co-constructing urban spaces, (3) gender relations and (4) how urban routines relate to feelings of alienation in a globalized world. Some users created literal narratives, depicting aeroplanes, trains and bicycles to indicate mobility, while others worked on an abstract level by, for example, juxtaposing fragments of a person in a deep-sea diving suit with shots of a newspaper article lamenting loneliness in the city, to create a story about alienation.

Users reported that creating sequences of GIFs enabled them to develop more elaborate stories. However, moving from GIF to GIF does not denote a sequential development in time; it is used to zoom out spatially, or to create a jarring contrast between GIFs and thus a more abstract story - for example, moving from a GIF about riots in the street, to a deserted, ruined square in the city, to children repainting a building, to create a story about urban decay and ideals. Similarly, the story about gender relations creates a counterpoint between women undergoing beauty procedures while men, in a separate GIF, seemingly loom over them.

The Semantics Level

The users were asked to provide keywords, i.e., tags, for their GIFs, selected keyframes and video fragments. These tags represent the users' interpretation of the multimedia content comprising their narratives; they do not necessarily describe the content, but act as an interpretation medium for the story. To determine the type of keywords, we manually evaluated them using the Panofsky-Shatford model (Panofsky 1962; Shatford 1986) as presented in (Gligorov et al. 2011). We distinguish three levels of keywords: abstract (symbolic or subjective concepts that allow for various interpretations), general (generic words) and specific (having the property of being unique). Further, each level consists of four facets: who (subject), what (object or event), where (location) and when (time).
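A minimal sketch of how such a manual classification could be tallied per level and per facet follows; the example tags and their assignments are invented for illustration and are not taken from the study data.

```python
# Sketch of tallying manually classified tags by Panofsky-Shatford level and facet.
# The levels (abstract, general, specific) and facets (who, what, where, when) follow
# the description above; the example annotations are invented for illustration.
from collections import Counter

LEVELS = {"abstract", "general", "specific"}
FACETS = {"who", "what", "where", "when"}

classified_tags = [
    {"tag": "bicycle", "level": "general", "facet": "what"},
    {"tag": "Dam Square", "level": "specific", "facet": "where"},
    {"tag": "alienation", "level": "abstract", "facet": "what"},
]

assert all(t["level"] in LEVELS and t["facet"] in FACETS for t in classified_tags)
level_counts = Counter(t["level"] for t in classified_tags)
facet_counts = Counter(t["facet"] for t in classified_tags)
print(level_counts)  # Counter({'general': 1, 'specific': 1, 'abstract': 1})
print(facet_counts)  # Counter({'what': 2, 'where': 1})
```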
We classified 207 (168 unique) tags that describe the GIFs and 262 (159 unique) tags that describe the keyframes and fragments composing the GIFs. The majority of the keywords are general, followed by specific and then by abstract keywords. When looking at the facets, we observe that more than 60% of the keywords belong to the what facet. The smallest number of keywords belongs to the when facet, with around 1% in all cases. While the keywords describing the who and where facets are evenly distributed among the keywords describing the GIFs, the number of keywords describing keyframes and fragments that belong to the where facet is much greater than the number belonging to the who facet. While at the abstract and general levels a significant share of keywords belongs to the what facet, at the specific level the users provided more keywords belonging to the where facet and fewer to the what facet, showing that users tend to provide specific locations.

In storytelling, people can refer to concepts, perspectives and opinions that are not physically present in the video but are referred to or expressed. As prior research (Trant 2009; Gligorov et al. 2010) indicates, there is also a gap between professional and lay-user tags describing video content. To understand the semantics of the keywords provided by the users, we look at their overlap with (1) machine-extracted keywords and (2) professional tags. We retrieved the professional tags from the NISV portal, and we extracted the visual tags and concepts from each video fragment and keyframe composing each GIF using the online tool Clarifai (https://www.clarifai.com), which performs both image and video concept recognition.

The overlap between the visual tags and the keywords provided by the users is quite low: 33% with the keywords describing stills and fragments and 49% with the keywords describing the GIFs. At the level of general concepts, however, the tags provided by the scholars overlap for 99% with the visual tags. This suggests that for (micro-)narrative creation, what is visualized, in general terms, steers the narrative contained in the story.
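The paper does not specify the exact matching procedure behind these percentages; the sketch below assumes one plausible measure, namely the share of lowercased, exact-matched user keywords that also occur among the machine-extracted visual tags.

```python
# Hedged sketch of one possible overlap measure between user keywords and
# machine-extracted (visual) tags: the share of user keywords that also occur,
# after lowercasing and trimming, among the visual tags. The exact matching
# procedure used in the study is not specified in the paper.
def keyword_overlap(user_keywords, visual_tags):
    users = {k.strip().lower() for k in user_keywords}
    visuals = {t.strip().lower() for t in visual_tags}
    if not users:
        return 0.0
    return len(users & visuals) / len(users)

# Example with invented tags: two of the three user keywords match a visual tag.
print(keyword_overlap(["Tram", "crowd", "alienation"], ["tram", "street", "crowd"]))  # 0.666...
```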
The overlap between the user and the professional keywords is even lower: 26% for keyframes and fragments and 30% for GIFs. In contrast to the visual tags, the professional tags do contain specific tags, which usually refer to places, i.e., the where facet. For the facet distribution at the general level, the proportion of overlapping what facets is higher at the level of sequences but lower at the level of the GIFs when compared to the visual tags. The gap between professionals and users is most clearly defined at the level of abstract concepts.

Discussion and Future Work

The nichesourcing study aimed to bring insight into storytelling in digital humanities by exploring the interaction with and interpretation of micro-narratives remixed from archival AV content. Overall, users tend to generate GIFs by remixing material positioned in the first part of the videos, disregarding the GIF position in the final produced micro-story. The temporal aspect is even more disrupted when users start their narrative with GIFs that contain keyframes and fragments from the middle and the end part of the videos, or when they finish their story with GIFs containing keyframes and fragments from the beginning of videos. Therefore, the original temporal sequence of the video is not relevant when remixing video footage for creative storytelling.

Users ascribe interpretations and meanings to their micro-narratives similar to those contained in the visual tags, while they tag the chosen sequences more in terms of their function as a narrative building block. Although at the GIF level users ascribe similar meaning to the video material as the professionals do, they engage in scholarly interpretation at the keyframe level. Thus, the interpretation of meaning in storytelling is, to some extent, developed serendipitously and as a user- and context-centric process, driven by humanities research interests. Time seems, as our facet analysis emphasizes, less important than the where or what facets. Hence, people find events and objects the most relevant when building narratives. General keywords referring to events, objects, places and people almost entirely overlap with the visual tags. Thus, an understanding of visual aspects, especially event- and concept-centric ones, is needed to steer the story line.

In summary, humanities scholars need rich enrichments of AV datasets to facilitate the creation of narratives. However, storytelling through video remixing is a creative process that cannot rely only on visual aspects. Deep semantic enrichment is needed to cover both implicit and explicit video concepts and perspectives. For exploration-centric tools such as DIVE+ it is crucial to: (1) provide easy access to already extracted keyframes and video fragments, as opposed to expecting the user to watch full videos; and (2) provide deep semantic enrichment of keyframes and video fragments focusing on specific and general actors or people, locations, time periods, objects and, most importantly, events (a possible shape for such enrichment is sketched below). Events play a central role in narrative development. Since event centrality is already a main aspect of DIVE+, we will focus on also integrating crowd-driven keyframe and video fragment semantics to offer users direct access to relevant information. DIVE+ users should be able to access smaller video granularities of interest and their enrichments, as opposed to watching the entire video and inspecting general video metadata.
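As an illustration of point (2), the sketch below shows one possible record that an enriched keyframe or fragment could carry; it is a hypothetical structure, not the actual DIVE+ data model.

```python
# Illustrative sketch (not the actual DIVE+ data model) of the enrichment a keyframe
# or video fragment could carry to support narrative creation: events, actors/people,
# locations, a time period and objects, plus abstract connotations for more
# interpretative storytelling.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FragmentEnrichment:
    video_id: str
    start_seconds: float
    end_seconds: float
    events: List[str] = field(default_factory=list)             # e.g., "street protest"
    people: List[str] = field(default_factory=list)             # general ("cyclist") or specific names
    locations: List[str] = field(default_factory=list)          # e.g., "Dam Square, Amsterdam"
    time_period: str = ""                                        # e.g., "1970-1989"
    objects: List[str] = field(default_factory=list)            # e.g., "tram", "newspaper"
    abstract_concepts: List[str] = field(default_factory=list)  # e.g., "alienation"
```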
Acknowledgements

The research for this paper was made possible by the CLARIAH-CORE project (www.clariah.nl), financed by NWO, and by the Netherlands Institute for Sound and Vision and NWO under project nr. CI-14-25.

References

[Aroyo, Nixon, and Miller 2011] Aroyo, L.; Nixon, L.; and Miller, L. 2011. NoTube: the television experience enhanced by online social and semantic data. In Consumer Electronics-Berlin (ICCE-Berlin), 2011 IEEE International Conference on, 269-273. IEEE.
[Bakhshi et al. 2016] Bakhshi, S.; Shamma, D. A.; Kennedy, L.; Song, Y.; de Juan, P.; and Kaye, J. J. 2016. Fast, cheap, and good: Why animated GIFs engage us. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16, 575-586. New York, NY, USA: ACM.
[CISCO 2016] CISCO. 2016. Cisco visual networking index: Forecast and methodology, 2015-2020. http://tinyurl.com/hd7gd45.
[De Boer et al. 2012] De Boer, V.; Hildebrand, M.; Aroyo, L.; De Leenheer, P.; Dijkshoorn, C.; Tesfa, B.; and Schreiber, G. 2012. Nichesourcing: harnessing the power of crowds of experts. In International Conference on Knowledge Engineering and Knowledge Management, 16-20. Springer.
[De Boer et al. 2015] De Boer, V.; Oomen, J.; Inel, O.; Aroyo, L.; Van Staveren, E.; Helmich, W.; and De Beurs, D. 2015. DIVE into the event-based browsing of linked historical media. Web Semantics: Science, Services and Agents on the World Wide Web 35:152-158.
[de Boer et al. 2017] de Boer, V.; Melgar, L.; Inel, O.; Ortiz, C. M.; Aroyo, L.; and Oomen, J. 2017. Enriching media collections for event-based exploration. In Research Conference on Metadata and Semantics Research, 189-201. Springer.
[De Jong, Ordelman, and Scagliola 2011] De Jong, F.; Ordelman, R.; and Scagliola, S. 2011. Audio-visual collections and the user needs of scholars in the humanities: a case for co-development.
[de Leeuw 2012] de Leeuw, S. 2012. European television history online: history and challenges. VIEW Journal of European Television History and Culture 1(1):3-11.
[Gligorov et al. 2010] Gligorov, R.; Baltussen, L. B.; van Ossenbruggen, J.; Aroyo, L.; Brinkerink, M.; Oomen, J.; and van Ees, A. 2010. Towards integration of end-user tags with professional annotations.
[Gligorov et al. 2011] Gligorov, R.; Hildebrand, M.; van Ossenbruggen, J.; Schreiber, G.; and Aroyo, L. 2011. On the role of user-generated metadata in audio visual collections. In Proceedings of the Sixth International Conference on Knowledge Capture, 145-152. ACM.
[Highfield and Leaver 2016] Highfield, T., and Leaver, T. 2016. Instagrammatics and digital methods: studying visual social media, from selfies and GIFs to memes and emoji. Communication Research and Practice 2(1):47-62.
[Kemman et al. 2013] Kemman, M.; Scagliola, S.; de Jong, F.; and Ordelman, R. 2013. Talking with scholars: Developing a research environment for oral history collections. In International Conference on Theory and Practice of Digital Libraries, 197-201. Springer.
[Maccatrozzo et al. 2013] Maccatrozzo, V.; Aroyo, L.; Van Hage, W. R.; et al. 2013. Crowdsourced evaluation of semantic patterns for recommendations.
[Mamber 2012] Mamber, S. 2012. Narrative mapping. In Everett, A., and Caldwell, J., eds., New Media: Theories and Practices of Intertextuality. Routledge. 145-158.
[Melgar et al. 2017] Melgar, L.; Koolen, M.; Huurdeman, H.; and Blom, J. 2017. A process model of scholarly media annotation. In Proceedings of the 2017 Conference on Human Information Interaction and Retrieval, 305-308. ACM.
[Panofsky 1962] Panofsky, E. 1962. Studies in Iconology: Humanist Themes in the Art of the Renaissance. Harper & Row.
[Shatford 1986] Shatford, S. 1986. Analyzing the subject of a picture: a theoretical approach. Cataloging & Classification Quarterly 6(3):39-62.
[Trant 2009] Trant, J. 2009. Steve: The art museum social tagging project: A report on the tag contributor experience. In Museums and the Web.
[Van Den Akker et al. 2011] Van Den Akker, C.; Legêne, S.; Van Erp, M.; Aroyo, L.; Segers, R.; van Der Meij, L.; Van Ossenbruggen, J.; Schreiber, G.; Wielinga, B.; Oomen, J.; et al. 2011. Digital hermeneutics: Agora and the online understanding of cultural heritage. In Proceedings of the 3rd International Web Science Conference, 10. ACM.