<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Paris, France
nicolas.ruth@studserv.uni-leipzig.de (N. Ruth); bernhard.liebl@uni-leipzig.de (B. Liebl);
burghardt@informatik.uni-leipzig.de (M. Burghardt)
https://ch.uni-leipzig.de/ (N. Ruth); https://ch.uni-leipzig.de/ (B. Liebl); https://ch.uni-leipzig.de/ (M. Burghardt)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>From Clusters to Graphs - Toward a Scalable Viewing of News Videos</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Nicolas Ruth</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernhard Liebl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manuel Burghardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computational Humanities Group, Institute for Computer Science, Leipzig University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>In this paper, we present a novel approach that combines density-based clustering and graph modeling to create a scalable viewing application for the exploration of similarity patterns in news videos. Unlike most existing video analysis tools that focus on individual videos, our approach allows for an overview of a larger collection of videos, which can be further examined based on their connections or communities. By utilizing scalable reading, specific subgraphs can be selected from the overview and their respective clusters can be explored in more detail on the video frame level.</p>
      </abstract>
      <kwd-group>
        <kwd>scalable viewing</kwd>
        <kwd>hdbscan clustering</kwd>
        <kwd>community detection</kwd>
        <kwd>graph visualization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction: Video Analytics and Scalable Viewing</title>
      <p>
        Prior work has called for a combination of close and distant
analysis perspectives for video material, termed “scalable viewing” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. However, for video
material, a significant challenge lies in providing a comprehensive overview from a distant
perspective in a dynamic medium.
      </p>
      <p>
        In this paper, we propose a novel approach to address this challenge by visualizing
relationships and patterns of similarity within a video collection. The basis for our research stems from
the ongoing FakeNarratives1 project, in which we investigate the use of narrative strategies for
the purpose of disinformation in German news videos [12]. We build on the Zoetrope prototype
[
        <xref ref-type="bibr" rid="ref10 ref11">11, 10</xref>
        ], a tool developed in the project’s early phase, initially designed for the analysis of
individual news videos. Zoetrope relies on a complex multimodal information extraction pipeline
from which we have acquired CLIP embeddings [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] for a sample of 117 German news videos
from “Tagesschau”, including all the videos from 01.01.2022 to 14.03.2022.
      </p>
      <p>
        In the following, we present an approach that allows researchers to interactively navigate
extensive video collections and to explore underlying patterns that are latent in the CLIP
embeddings. Our approach opens up novel opportunities for scalable viewing of visual media in
computational humanities research. The suggested approach enhances the existing landscape
of video analysis tools such as the Distant Viewing Toolkit [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] or VIAN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which are mostly
focused on the analysis of single videos2. In a sense, our interactive visualization can be
compared to recent cultural analytics tools, such as PixPlot3 or the Collection Space Navigator [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
which visualize patterns in large collections of images. However, we extend these existing
tools as we propose a two-fold approach that combines an initial clustering of similar images with
a 3D network visualization of similar image clusters. Also, our approach is optimized for being
used with video frames rather than other types of imagery. Please note that we describe a
novel analytical workflow rather than a ready-to-use tool4. While the use case we present is
based on the analysis of news videos, the workflow can also be adapted for other collections
of video material.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. From Clusters to Graphs – Conceptual Approach</title>
      <p>
        The employed methodology aims to summarize the visual content of news formats and to
construct an exploratory visualization to investigate their connections and interrelationships.
In the following sections, we describe our conceptual approach in some more detail and also
provide information about the utilized clustering and graph algorithms and their
parameterization.
1Project website and further information: https://fakenarratives.github.io/
2For a comprehensive overview of similar video analysis tools see the survey paper by Pustu-Iren, Sittel, Mauer,
Bulgakowa, and Ewerth [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
3Yale DH Lab – PixPlot: https://github.com/YaleDHLab/pix-plot
4The full code is available via https://github.com/Nicolas-le/from-clusters-to-graphs. Due to copyright issues, we
currently cannot share a demo visualization of the news video use case. However, to get a basic idea of what the
interactive visualization actually looks like, you will also find a short demo video (2:05 min) in the above repository.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Feature Extraction via CLIP</title>
        <p>
          First, we are concerned with the extraction of semantic elements from the video material [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]
by means of CLIP embeddings, which were calculated for one frame every 0.5 seconds for
each of the 117 news videos in the dataset, resulting in a total of 300,029 embedding vectors.
Contrastive Language-Image Pre-Training (CLIP) is a neural network that was trained on 400
million image-text pairs with the primary objective of combining the textual description of an
image and the image itself within a common vector space. As a result, CLIP embeddings have
the ability to represent the content description of an image in a high-dimensional vector space.
Therefore, embedding vectors with high similarity indicate a resemblance in the semantic
content of the corresponding images [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
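The notion of similarity described above can be illustrated with a toy cosine-similarity computation. The three-dimensional vectors and frame labels below are invented stand-ins for real CLIP embeddings, which have hundreds of dimensions:

```python
# Toy illustration: frames whose embedding vectors point in similar
# directions are semantically close. The vectors and labels below are
# hypothetical stand-ins for real high-dimensional CLIP embeddings.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

anchor_desk   = [0.9, 0.1, 0.0]  # hypothetical "news anchor" frame
similar_frame = [0.8, 0.2, 0.1]  # a visually similar studio frame
weather_map   = [0.0, 0.2, 0.9]  # a visually different frame

# The two studio frames are far more similar to each other than to the map.
assert cosine_similarity(anchor_desk, similar_frame) > \
       cosine_similarity(anchor_desk, weather_map)
```

In the actual pipeline, this kind of pairwise similarity in the CLIP vector space is what the subsequent clustering step exploits.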
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Dimensionality Reduction and Density-Based Clustering</title>
        <p>
          Next, we aggregate similar content across videos via a clustering of the embedding vectors. To
reduce the dimensionality of the embeddings, the data was standardized using the scikit-learn
standard scaler and then subjected to a Principal Component Analysis (PCA), retaining only
30 principal components5. The parameterization of the PCA was the result of a mutual
adjustment with the parameters of the clustering algorithm, with a focus on the interpretability of
the visualization, and mainly aimed at reducing the computation time of the clustering. In
subsequent extensions of the procedure, this part can be refined by a systematic determination
of the number of principal components, for example via a threshold on the explained variance, and by
tests with further dimensionality reduction procedures. In the early stages of the project, we
experimented with KMeans clustering. However, this approach had certain limitations,
including the clustering of noise and the requirement to define a fixed number of clusters.
KMeans also showed some problems in dealing with outliers. Following that, our transition to
a density-based clustering approach led us to the HDBScan algorithm. The Hierarchical
Density-Based Spatial Clustering of Applications with Noise (HDBScan) extends DBScan with a
complete clustering hierarchy composed of all possible density-based clusters. It thus inherits
the advantages of density-based clustering, among others the recognition of noise and the
non-static number of clusters, and improves on them with the ability to detect clusters of varying
densities [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The algorithm can be parameterized in numerous ways. For instance, the
Euclidean distance was designated as the distance metric. However, a critical parameter
to consider is the minimum number of data points min_cluster_size required
to constitute a cluster. There is typically no upper limit, so the algorithm is not likely to
split bigger clusters into smaller sub-clusters, which can lead to clusters of significantly
varying sizes. The optimization of min_cluster_size poses challenges due to a trade-off: On
the one hand, lower parameter values enable the algorithm to detect fine-grained clean clusters,
often caused by consecutive video frames. On the other hand, higher values may lead to the
complete loss of smaller clusters. Another trade-off is introduced by the second parameter,
the minimum number of neighbors for a point to be a core point. Lowering this parameter
can result in the formation of larger clusters, which, in turn, amplifies the issue of
semantically impure clusters, i.e. clusters in which the topics exhibited within
the frames of the cluster manifest considerable heterogeneity. Semantically pure clusters are
favored by higher values, but these increase the detection of noise and encourage a focus
on overly dense clusters created by successive frames, or clusters with frequently occurring
items, like recurring images. After conducting extensive experiments to compare the results
and to assess usability in the visualization process, the min_cluster_size was set to 100, to
detect fine-grained clusters. The minimum threshold for a core point was left at its default, that is,
the same as the min_cluster_size. However, the fine-tuning of the parameters needs some
more consideration in future work. In our current approach, the detection of small clusters
was given greater weight, which resulted in the detection of altogether 231 clusters of varying
size. The deliberate emphasis on smaller, semantically more distinctive clusters also favors
the incidental classification of substantial quantities of data points as noise. To counteract the
focus on overly similar data points, the points previously classified as noise were fed back into
the clustering, using a function in the HDBScan implementation of [13] to assign new data
points to the computed clustering, which returns the probability of a new point belonging
to the individual clusters. The points were added to the cluster with the highest probability if
it exceeded a threshold of 10%. Out of a total of 300,029 embeddings, 198,520 were classified as
noise, while 10,685 embeddings were reassigned to clusters, following the described process.
However, it became apparent that applying a uniform static threshold across all
clusters leads to the contamination of smaller as well as visually very complex clusters, such
as cluster 162 (also see Sec. 3.1), which represents images in medical contexts. The cluster
manifests an array of distinct scenarios, all semantically interlinked through the presence of
medical themes. However, due to the heightened diversity of represented scenes, the cluster’s
demarcation from other topics becomes less pronounced, and elevating the threshold leads to
a reduction in its semantic distinctiveness. Please note that clusters that only occur in a single
news video will not be displayed in the later graph visualization.
5Scikit-learn standard scaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Bidirectional Graph Modeling</title>
        <p>The next step toward a scalable visualization relates the clusters to each other by mapping them
into a bidirectional graph structure. The nodes of the graph represent the previously
generated clusters. For the creation of the edges, we experimented with two different approaches,
resulting in two different graphs.</p>
        <p>1. Co-appearance frequency: The first approach inserts an edge between clusters if their
constituent elements appear together in news videos, and weights the edges based on the
normalized amount of co-appearance. Frames belonging to a cluster must appear at
least five times in a news video for the cluster to be counted as present in the video, and
clusters must appear at least three times together for the edge to exist.
2. Pearson correlation: The second approach draws an edge if the clusters appear at
least two times together; edges are weighted by the Pearson correlation coefficient
between the clusters. Edges with a weight under 0.3 were deleted.</p>
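The two edge-construction rules can be sketched as follows, assuming a per-video table of cluster frame counts has already been derived from the clustering. Variable names, helper functions, and the max-based normalization are illustrative assumptions, not the authors' code:

```python
import itertools
import networkx as nx
import numpy as np

def co_appearance_graph(counts, min_frames=5, min_co=3):
    """counts: dict video_id -> {cluster_id: frame count}. An edge is drawn
    if two clusters are present together (>= min_frames frames each) in at
    least min_co videos; dividing by the maximum co-appearance count is an
    assumed normalization scheme."""
    pair_counts = {}
    for per_video in counts.values():
        present = sorted(c for c, n in per_video.items() if n >= min_frames)
        for a, b in itertools.combinations(present, 2):
            pair_counts[(a, b)] = pair_counts.get((a, b), 0) + 1
    G = nx.Graph()
    max_co = max(pair_counts.values(), default=1)
    for (a, b), n in pair_counts.items():
        if n >= min_co:
            G.add_edge(a, b, weight=n / max_co)
    return G

def correlation_graph(counts, min_frames=5, min_co=2, min_weight=0.3):
    """Edge if two clusters co-appear in at least min_co videos; the weight
    is the Pearson correlation of their per-video presence vectors, and
    edges below min_weight are dropped."""
    clusters = sorted({c for per_video in counts.values() for c in per_video})
    presence = np.array([[int(per_video.get(c, 0) >= min_frames)
                          for c in clusters] for per_video in counts.values()])
    G = nx.Graph()
    for i, j in itertools.combinations(range(len(clusters)), 2):
        co = int((presence[:, i] & presence[:, j]).sum())
        if co < min_co:
            continue
        r = float(np.corrcoef(presence[:, i], presence[:, j])[0, 1])
        if r >= min_weight:
            G.add_edge(clusters[i], clusters[j], weight=r)
    return G
```

Running both functions on the same `counts` table yields the two graphs discussed below: a frequency-weighted co-appearance graph and a correlation-weighted graph.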
        <p>The graph based on the number of co-appearances aims toward an understanding of the
levels of centrality of motifs in the Tagesschau, so they can be experienced intuitively. Fig. 1
shows that the graph takes a circular form and concentrates around frequent topics or core
motifs in central nodes. The periphery shows the less prominent clusters. A core topic could
be the weather forecast and a core motif the intro sequence of each broadcast. The edges of this
graph exhibit a high degree of intuitive comprehensibility. However, mere co-appearance
does not directly indicate a correlation between clusters.</p>
        <p>Therefore, the second graph aims toward exploring the correlation explicitly. In
doing so, the visualization of this graph reduces the influence of the central topics of the
Tagesschau. It creates a network that also highlights edges between clusters that are less frequent
in the dataset but still have high correlations. These clusters may then represent a specific
interest, because they involve concrete issues and are less likely to be core motifs. This graph
can be seen in Fig. 2.</p>
        <p>Even though the two-fold graph structure is mainly used to create an interactive
visualization, it also offers the possibility of using graph algorithms to further enrich the visualization
with information that can be included as needed. Thus, communities were calculated with
the Clauset-Newman-Moore greedy modularity maximization, taking the edge
weights into account. In this application, the community algorithm can be interpreted as a second
clustering algorithm on the generated graph. In the later visualization, this provides initial
support for the identification of linked subgraphs. The communities, i.e. subgraphs, thus
represent networks of frequently occurring motifs in daily broadcasts, or: networks of strong
correlations. The resolution parameter, which favors larger communities when it is below 1 and
smaller communities when it is above 1, was set to 1.2.</p>
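The community step corresponds to NetworkX's implementation of Clauset-Newman-Moore greedy modularity maximization. The toy graph below is a hypothetical stand-in for the actual cluster graph; only the algorithm call and its `weight`/`resolution` parameters mirror the setup described above:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical cluster graph: two thematic groups, each internally connected.
G = nx.Graph()
G.add_weighted_edges_from([
    ("intro", "weather", 0.9), ("weather", "sports", 0.8),
    ("covid", "opinion", 0.7), ("covid", "interview", 0.6),
])

# Weighted greedy modularity maximization with resolution 1.2, as in the
# paper; each community is returned as a frozenset of cluster nodes.
communities = greedy_modularity_communities(G, weight="weight", resolution=1.2)
```

On this toy input the two connected groups end up in separate communities; on the real cluster graph, the resulting communities are the subgraphs offered in the GUI.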
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Visualization and Interactivity</title>
        <p>
          The created data structure is visualized in an interactive three-dimensional approach. The
frontend of the web app is based on a JavaScript framework built on ThreeJS/WebGL
(https://github.com/vasturiano/3d-force-graph), and the force engine works with D3. The
visualization is guided on a theoretical level by Arnold’s and Tilton’s third demand for distant
viewing [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], with a pronounced emphasis on the scalability of the observational granularity.
Additionally, it follows Shneiderman’s principles: “Overview first, zoom and filter, then
details-on-demand” [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>Zooming is possible via scrolling, allowing for an individually adjustable view magnification
of the interrelationships. The cluster nodes are represented by a randomly selected frame of
the cluster. The diameter of the edges depends on the edge weight, and the graph structure is
laid out in the 3-D space based on a gravitational force influenced by the weight of the edges
(Fig. 1 and Fig. 2). By clicking in the space and moving the cursor, the perspective of the camera
can be changed, which allows the focus to be put on different parts of the graph and, together with
the zoom, provides a certain immersive ‘flight effect’. Individual nodes can be moved via drag
and drop, and the graph reorients itself depending on the force. By hovering over a node, the
ID of the cluster can be explored. For a more detailed insight into the semantics of a cluster,
a scrollable overlay can be opened by right-clicking on a cluster (Fig. 3), in which up to 80
random frames of the cluster are displayed. This feature enables a detailed examination and
analysis of single clusters.</p>
        <p>To reduce the complexity of the graph and to set a focus, it is possible to display only directly
connected clusters by left-clicking on a cluster. Using the GUI, nodes can also be displayed
that are connected via another edge. Another way to view subgraphs is
to display one of the calculated communities in the graph via the GUI (Fig. 4).</p>
        <p>All of these features aim to create an interactive experience for the user that encourages
continuous scaling of the viewing level, breaking down the barrier between micro- and
macroanalytical viewing. Therefore, the developed visualization facilitates a comprehensive
examination of large-scale patterns, pertaining to motifs and topics present in the data source, in
this case specifically news videos. It serves as a crucial entry point for diving deeper into the
detailed analysis of extensive volumes of video data, which would otherwise be impracticable
to achieve manually.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Example Analyses</title>
      <p>The following examples are meant to demonstrate the exploratory analysis of video data with
our methodology. Please note that all of these examples use the correlation coefficient
graph.</p>
      <sec id="sec-3-0">
        <title>3.1. Topic: Covid-19</title>
        <p>During the immersive exploration of the motifs of the Tagesschau in the visualization, cluster
162 stood out prominently. After scrolling through numerous images within the cluster and
exploring their semantics via right-click, the majority of the images were confirmed to
pertain to medical topics, predominantly related to Covid-19. Next, the focus was shifted toward
clusters that showed direct linkage, which were accessed by left-clicking on cluster 162. As
a result, four associated clusters emerged, each displaying distinct characteristics: the first (1)
cluster depicted images from the Bundespressekonferenz, another (2) exhibited an anchor with
a headline discussing Boris Johnson’s Partygate scandal, there was an (3) interview cluster, and,
most notably, a cluster (4) presenting multiple instances of the Meinung format, which
translates to opinion. This exploration revealed a potential object of further analysis: the interaction
between topics related to Covid-19 and the utilization of a specific format that allows
individuals to express subjective opinions. Fig. 5 shows the directly connected nodes of the medical
cluster 162.</p>
      </sec>
      <sec id="sec-3-1">
        <title>3.2. Topic: Ukraine War</title>
        <p>The second example starts by analyzing the communities in the graph. Community 2 exhibits
a sub-graph encompassing topics related to the war in Ukraine and another sub-graph featuring
a core element of the Tagesschau, namely the weather forecast (Fig. 6).</p>
        <p>Both sub-graphs are interconnected through a few edges, thereby forming a cohesive
community together. Within the ‘Ukraine sub-graph,’ three distinct clusters have been identified.
The first cluster contains images of Volodymyr Zelenskyy, the current president of Ukraine.
The second cluster consists of maps of Ukraine, and the last cluster showcases calls for
donations to support Ukraine. At the periphery of the graph, two clusters featuring war reporters
can be observed. Moreover, there is another link connecting to a cluster containing an
anchorman and background reporting on war criminals, and another cluster showcasing Robert
Habeck, the Federal Minister for Economic Affairs and Climate Action. The discovered
connection of motifs appears to be more closely related to the contents of the Tagesschau. Hence, the
visualization enables interactive exploration and presentation of complex themes that co-occur
across multiple videos, allowing for the enrichment of these themes with additional
information as required.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>The force simulation of the graph proved particularly beneficial in gaining an intuitive
understanding of significant clusters. For instance, in the circular graph depicting co-appearance,
it becomes evident that the force simulation places the core topics of the Tagesschau at the
center. Indeed, prominent subjects within the dataset, such as the introductions, weather
forecasts, charts displaying results of the German football league, and the previously described
medical cluster 162, all converge at the center of the circular graph due to the force simulation.
The analysis of the two graph variants should be considered in combination, as their
complementary focus allows us to gain valuable insights into the motif network of the Tagesschau.
Together, they offer a comprehensive perspective on the relationships and patterns within the
dataset.</p>
      <p>In the context of exploratory visualization, another crucial aspect to consider is that the
ability to freely focus on specific elements while having abundant information might lead
individuals to seek confirmation of their own hypotheses rather than openly discovering new
connections between clusters. Similarly, the projection of semantics onto a cluster needs
careful reflection and must be balanced and negotiated against one’s own confirmation bias.
Furthermore, co-appearance in a news video as a unifying element and the resulting correlations must
always be considered in connection with certain basic conditions of the respective broadcast.
This refers, for instance, to external factors that influence the program, such as unpredictable
current events. However, this does not make the interaction of special topics, set focuses,
motifs and special broadcast elements any less important. Finally, we want to highlight that
during the creation of the graph it became evident that the parameters have a strong influence
on the actual visualization. Their adjustment must therefore be critically examined to avoid
confirmation bias toward a desired hypothesis.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion and Future Work</title>
      <p>In this short paper, we present a novel approach that combines density-based clustering and
graph modeling to create a scalable viewing application for the exploration of similarity
patterns in news videos. Unlike most existing video analysis tools that focus on individual videos,
our approach allows for an overview of a larger collection of videos, which can be further
examined based on their connections or communities. By utilizing scalable viewing, specific
subgraphs can be selected from the overview, and their respective clusters can be explored in
more detail on the video frame level. The next step involves connecting the aforementioned
prototype for analyzing individual videos with the current scalable viewing prototype, enabling
analyses of news videos at various levels of granularity.</p>
      <p>Currently, the similarity function between frame clusters relies solely on CLIP embedding
vectors. However, other information, such as written and spoken language, object and face
recognition, color information, etc. can also be easily extracted from videos and will serve as
additional information sources for future visualizations. While our current experiments focus
on news videos, the scalable viewing approach can be applied to other video formats as well.
Depending on the type and genre of the video, some clustering parameters may need to be
adjusted since they are currently optimized for news videos.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Acknowledgements</title>
      <p>This research was made possible through funding provided by the German Federal Ministry of
Education and Research as part of the “FakeNarratives” project (16KIS1516).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “
          <article-title>Distant viewing toolkit: A python package for the analysis of visual culture”</article-title>
          .
          <source>In: Journal of Open Source Software 5</source>
          .45 (
          <year>2020</year>
          ), p.
          <fpage>1800</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnold</surname>
          </string-name>
          and
          <string-name>
            <given-names>L.</given-names>
            <surname>Tilton</surname>
          </string-name>
          . “
          <article-title>Distant viewing: analyzing large visual corpora”</article-title>
          .
          <source>In: Digital Scholarship in the Humanities 34</source>
          .Supplement_
          <volume>1</volume>
          (
          <year>2019</year>
          ), pp.
          <fpage>i3</fpage>
          -
          <lpage>i16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kao</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.-O.</given-names>
            <surname>Walkowski</surname>
          </string-name>
          . “
          <article-title>Scalable MovieBarcodes-An Exploratory Interface for the Analysis of Movies”</article-title>
          .
          <source>In: IEEE VIS Workshop on Visualization for the Digital Humanities</source>
          . Vol.
          <volume>2</volume>
          .
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R. J. G. B.</given-names>
            <surname>Campello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moulavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zimek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Sander</surname>
          </string-name>
          . “
          <article-title>Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection”</article-title>
          .
          <source>In: ACM Trans. Knowl. Discov. Data 10.1</source>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Halter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ballester-Ripoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Flueckiger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Pajarola</surname>
          </string-name>
          . “
          <article-title>VIAN: A visual annotation tool for film analysis”</article-title>
          .
          <source>In: Computer Graphics Forum</source>
          . Vol.
          <volume>38</volume>
          . 3. Wiley Online Library.
          <year>2019</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>129</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jänicke</surname>
          </string-name>
          , G. Franzini,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Cheema</surname>
          </string-name>
          , and
          <string-name>
            <surname>G. Scheuermann.</surname>
          </string-name>
          “
          <article-title>On Close and Distant Reading in Digital Humanities: A Survey and Future Challenges</article-title>
          .”
          <source>In: EuroVis (STARs)</source>
          <year>2015</year>
          (
          <year>2015</year>
          ), pp.
          <fpage>83</fpage>
          -
          <lpage>103</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Jänicke</surname>
          </string-name>
          , G. Franzini,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Cheema</surname>
          </string-name>
          , and
          <string-name>
            <surname>G. Scheuermann.</surname>
          </string-name>
          “
          <article-title>Visual text analysis in digital humanities”</article-title>
          .
          <source>In: Computer Graphics Forum</source>
          . Vol.
          <volume>36</volume>
          . 6. Wiley Online Library.
          <year>2017</year>
          , pp.
          <fpage>226</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Keim</surname>
          </string-name>
          , G. Andrienko,
          <string-name>
            <given-names>J.-D.</given-names>
            <surname>Fekete</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Görg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kohlhammer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Melançon</surname>
          </string-name>
          .
          <article-title>Visual analytics: Definition, process, and challenges</article-title>
          . Springer,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kurzhals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>John</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Heimerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Kuznecov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Weiskopf</surname>
          </string-name>
          . “
          <article-title>Visual movie analytics</article-title>
          ”.
          <source>In: IEEE Transactions on Multimedia 18.11</source>
          (
          <year>2016</year>
          ), pp.
          <fpage>2149</fpage>
          -
          <lpage>2160</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liebl</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          . “
          <article-title>Designing a Prototype for Visual Exploration of Narrative Patterns in News Videos</article-title>
          ”.
          <source>In: INFORMATIK 2023 - Lecture Notes in Informatics (LNI)</source>
          . Ed. by
          <string-name>
            <given-names>M.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krupka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Wohlgemuth</surname>
          </string-name>
          .
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liebl</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          . “
          <article-title>Zoetrope - Interactive Feature Exploration in News Videos</article-title>
          ”.
          <source>In: Annual International Conference of the Alliance of Digital Humanities Organizations, DH 2023, Graz, Austria, July 10-14, 2023, Conference Abstracts</source>
          . Ed. by
          <string-name>
            <given-names>A.</given-names>
            <surname>Baillot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tasovac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Scholger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Vogeler</surname>
          </string-name>
          .
          <year>2023</year>
          . doi: 10.5281/zenodo.8107770. url: https://doi.org/10.5281/zenodo.8107770.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>L.</given-names>
            <surname>Manovich</surname>
          </string-name>
          . Cultural Analytics. MIT Press,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>L.</given-names>
            <surname>McInnes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Healy</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Astels</surname>
          </string-name>
          . “
          <article-title>hdbscan: Hierarchical density based clustering</article-title>
          ”.
          <source>In: The Journal of Open Source Software 2</source>
          .11 (
          <year>2017</year>
          ), p.
          <fpage>205</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Moretti</surname>
          </string-name>
          .
          <article-title>Graphs, maps, trees: abstract models for a literary history</article-title>
          .
          <source>Verso</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>T.</given-names>
            <surname>Ohm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Solà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karjus</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Schich</surname>
          </string-name>
          .
          <article-title>Collection Space Navigator: An Interactive Visualization Interface for Multidimensional Datasets</article-title>
          .
          <year>2023</year>
          . arXiv: 2305.06809 [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pause</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.-O.</given-names>
            <surname>Walkowski</surname>
          </string-name>
          . “
          <article-title>Scalable Viewing in den Filmwissenschaften</article-title>
          ”. In: DHd 2019:
          <article-title>Digital Humanities: multimedial &amp; multimodal</article-title>
          . Konferenzabstracts. Ed. by
          <string-name>
            <given-names>P.</given-names>
            <surname>Sahle</surname>
          </string-name>
          .
          <source>Frankfurt &amp; Mainz</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>201</fpage>
          -
          <lpage>204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>K.</given-names>
            <surname>Pustu-Iren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sittel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bulgakowa</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Ewerth</surname>
          </string-name>
          . “
          <article-title>Automated Visual Content Analysis for Film Studies: Current Status and Challenges</article-title>
          ”.
          <source>In: DHQ: Digital Humanities Quarterly 14.4</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <source>Learning Transferable Visual Models From Natural Language Supervision</source>
          .
          <year>2021</year>
          . arXiv: 2103.00020 [cs.CV].
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          . “
          <article-title>The eyes have it: a task by data type taxonomy for information visualizations</article-title>
          ”.
          <source>In: Proceedings 1996 IEEE Symposium on Visual Languages</source>
          .
          <year>1996</year>
          , pp.
          <fpage>336</fpage>
          -
          <lpage>343</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Shneiderman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Plaisant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jacobs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elmqvist</surname>
          </string-name>
          , and
          <string-name>
            <given-names>N.</given-names>
            <surname>Diakopoulos</surname>
          </string-name>
          .
          <article-title>Designing the user interface: strategies for effective human-computer interaction</article-title>
          .
          <source>Pearson</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Tseng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Liebl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Burghardt</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Bateman</surname>
          </string-name>
          . “
          <article-title>FakeNarratives - First Forays in Understanding Narratives of Disinformation in Public and Alternative News Videos</article-title>
          ”.
          <source>In: 9. Tagung des Verbands Digital Humanities im deutschsprachigen Raum</source>
          ,
          <source>DHd</source>
          <year>2023</year>
          , Belval, Luxembourg and Trier, Germany, March 13 - 17,
          <year>2023</year>
          . Ed. by
          <string-name>
            <given-names>A.</given-names>
            <surname>Busch</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Trilcke</surname>
          </string-name>
          .
          <year>2023</year>
          , p.
          <fpage>138</fpage>
          . doi: 10.5281/zenodo.7715277. url: https://doi.org/10.5281/zenodo.7715277.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>C.</given-names>
            <surname>Ware</surname>
          </string-name>
          .
          <article-title>Information visualization: perception for design</article-title>
          . Morgan Kaufmann,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>T.</given-names>
            <surname>Weitin</surname>
          </string-name>
          . “
          <article-title>Scalable reading</article-title>
          ”.
          <source>In: Zeitschrift für Literaturwissenschaft und Linguistik</source>
          (
          <year>2017</year>
          ), pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>