J. Yaghob (Ed.): ITAT 2015 pp. 150–158 Charles University in Prague, Prague, 2015 Search for Structure in Audiovisual Recordings of Lectures and Conferences Michal Kopp1 , Petr Pulc1 , and Martin Holeňa2 1 Faculty of Information Technology, Czech Technical University in Prague Thákurova 9, 160 00 Prague koppmic2@fit.cvut.cz, petrpulc@gmail.com 2 Institute of Computer Science, Czech Academy of Sciences Pod Vodárenskou věží 2, 182 07 Prague martin@cs.cas.cz Abstract: With the quickly rising popularity of multime- closer the videos represented by their respective vectors in dia, especially of the audiovisual data, the need to under- the map are, the more similar they should be. stand the inner structure of such data is increasing. In The most simple clustering of the original data would this case study, we propose a method for structure discov- be based on individual neurons, i.e., all feature vectors ery in recorded lectures. The method consists in integrat- mapped to the same neuron would form one cluster. How- ing a self-organizing map (SOM) and hierarchical clus- ever, there is more information involved in the SOM than tering to find a suitable cluster structure of the lectures. could be represented by such a simple clustering. To have The output of every SOM is evaluated by various levels a clue what SOM-based clustering would be most suitable, of hierarchical clustering with different number of clus- we complement SOM with hierarchical clustering. ters mapped to the SOM. Within these mapped levels we search for the one with the lowest average within-cluster distance, which we consider the most appropriate number 2 Related Work of clusters for the map. In experiments, we applied the proposed approach, with SOMs of four different sizes, to Studies for utilizing SOM as a tool for clustering multime- nearly 16 000 slides extracted from the recorded lectures. dia data were proposed in [3, 4]. In the PocketSOM [3], the SOM is utilized for a creation of the map of songs. The songs are characterized by their acoustic attributes us- 1 Introduction ing Rhythm Patterns [5]. These attributes are used for the creation of a high dimensional vector of features which In the past few decades, the amount of audiovisual data has represents the original song. The PocketSOM utilization grown rapidly. Every day, modern technology is accessi- of the song’s acoustic attributes leads to the extraction of ble to more and more people. With these modern devices, many of these features. In [3], the extraction of the acous- such as smartphones or cameras, people are able to create tic attributes generated more than thousand features. This large amounts of audiovisual data. Growing popularity of is quite similar to our approach to the multimedia cluster- audiovisual data is the main reason why many new sites ing proposed in this paper. However, we included even containing such data are created. From the consumer’s more features describing the content of multimedia files point of view, the main purpose of such a site is to find because we were able to use more than one kind of fea- a video concerning the topic they are interested in. tures due to the multimodality of our audiovisual data. Usually, any video file that is being uploaded to some There are also some other differences, in particular be- video sharing site is annotated with some keywords by the tween the main purpose of the PocketSOM and our clus- uploader. Beside that, it also contains some metadata such tering method. The main purpose of the PocketSOM is as resolution, frame rate or length of the video. Conse- to provide the user with an ability to build his or her own quently, any search in the site is restricted to the keywords playlists. The user can choose the songs by virtually draw- provided by the uploader and/or some technical data about ing a path in the map. The songs which are mapped to the the video. It does not rely on any relationship between the nodes (neurons) on the path, are selected to be included contents of the videos themselves. This would need to first in the playlist and the user has to decide which of them discover some structure in the data available at the site. will be selected for the final playlist creation. It is quite We investigated the use of a self-organizing map obvious that the user is involved a lot in the generation of (SOM) [1], which is a kind of artificial neural network, for the playlists. However, our structure discovery method is clustering the videos with respect to their content. First, user-independent. some features are extracted from the videos and then the Till the SOM has been trained, PocketSOM and our ap- map is built as a two-dimensional grid of neurons, so it can plication of SOM proceed in a similar way. However, after be easily visualized. According to the training principle of the map has been trained, PocketSOM starts involving the SOMs, the videos which are similar in their features are user, whereas our application integrates the SOM with hi- placed close to each other in the map. Consequently, the erarchical clustering. Search for Structure in Audiovisual Recordings of Lectures and Conferences 151 Our method has also some similarities with the Music and data consistency, only lectures and their related con- Map [4], which also utilizes SOM for the music playlist tent in the Czech language have been chosen. generation. Both, the Music Map and our application uti- All lectures were recorded with Mediasite recorder, lize SOM to create a two dimensional map of similarities a platform maintained by Sonic Foundry, Inc. The idea be- in the input data. However, the purpose of each of both hind the Mediasite platform is to enable streaming of lec- applications is rather different. The Music Map is used tures as easy as possible. The main element of the system for playlist generation from a collection of songs. The is Mediasite server which stores the lectures and streams user’s preferences for the songs that should be included them to the internet. On the side of the lecturer, only audio in the playlist are transformed to an optimization criterion and video sources are connected to the device. for the path planing. Then the algorithm that traverses the The data consist of three main modalities – synchro- map and optimizes the criterion is executed, and the re- nized video with sound and slides from lecturer’s pre- sulting optimal path determines the generated playlist. Al- sentation. The slides are captured directly from the lec- though the user is not involved in the Music Map playlist turer’s presentation through the VGA/DVI card of Media- generation as much as in the PocketSOM playlist genera- site, which is supplied with the same signal as the projec- tion, the main idea remains the same – to provide the user tor. a possibility to create a playlist that corresponds to his or The signal input for the slides is continuously monitored her demands. In the Music Map, the trained SOM is used and any change to it above given threshold is considered as as a background layer which is able to discover and pre- a new slide. Mediasite stores the image alongside with the serve some structure in the data, although the main pur- timestamps of slide transition used for the synchronization pose is to traverse the map and generate the final product during playback. – the playlist. Differently to the Music Map and Pocket- Audio is recorded through Mediasite just as lecturer SOM, our application does not primarily focus on any final speaks. It is synchronized with the video and slides by product generated from the map. It rather tries to discover the use of timestamps of slide transitions. structure in the employed data. The Mediasite platform only provides a video signal in There is also another difference between the Music Map the resolution up to 240 rows and the bitrate of 350Kbps. and our method – the dimensionality of the feature vector. However, for purpose of the lectures the high-definition In the Music Map, every song is described by nine fea- video is not necessary, because we have the slides in tures. However, in our application we use more than five a good resolution synchronized with the videos. Also, dur- thousands features. This also means that the training of the ing the lecture, there are usually not many changes hap- map takes much more time. pening on the scene, since only the lecturer is supposed to Studies utilizing SOM as a clustering tool were also pro- move significantly. posed in [6, 7]. In [6], a SOM is used as a one of the clustering tools for the analysis of embryonic stem cells gene expression data. The important difference compared 4 Proposed Structuring Approach to our approach is that each neuron represents one cluster 4.1 Data Preprocessing with a strict boundaries, so there are as many clusters as neurons in the SOM. Then it is easy to measure the within The multimedia data described in the previous section is cluster distance and between cluster distance to provide an saved by the Mediasite system as audio, video and slides information about the overall clusters’ quality. files in their respective formats. However, for the purposes The study[7] deals with cluster quality evaluation. Sim- of knowledge discovery in that data, including the discov- ilarly as in [6], a SOM is used as a clustering tool where ery of its inner structure, we need all the data to be prepro- every neuron corresponds to a one cluster and it is treated cessed to a suitable form. Because we have three modali- this way when quality of a clustering solution is being ties in the data, we treated each one separately, so we can measured. Thus, the overall number of clusters is deter- apply appropriate feature extraction tools. mined by the size of the 2-D map. First, we used Optical Character Recognition (OCR) Our approach, on the other hand, relies on combining system for the slides. The slides usually contain some text the information from SOM with information from hierar- related to the topic or the subtopic of the lecture, so an chical clustering. OCR can give us a good insight into the content of the slides. The downside of this approach are slides, which 3 Case Study Audiovisual Data contains, for example, just an image. For these slides OCR often returns nothing. However, we still have the other The data in which we would like to search structure has modalities we can rely on. been created over the period of several years during the To this end, we used the OCR system Tesseract, which Weeks of Science and Technology, a two-week science is an open-source software made publicly available by festivals organized by the Czech Academy of Sciences. Google. The problem of the OCR output is that it some- Altogether, the data includes more than 100 hours of mul- times contains a lot of unrecognized characters and mis- timedia content. For the purposes of further preprocessing spelled words, so a basic text processing was applied. All 152 M. Kopp, P. Pulc, M. Holeňa punctuation is removed. To reduce the dimensionality of erarchical clustering. This resulted in 32 final descriptor the data before further processing, only the words found in concepts. the spellchecking dictionary are considered. Also, a stem- The histogram describes each image (taken from the mer is applied. video) by count of pixels with certain color value in each Second, a speech recognition tool was used for the audio color channel. We use the RGB coding with 1 byte depth. files. Because we consider only lectures in the Czech lan- Each color count is relatively scaled to the size of the im- guage, Google’s API for speech recognition, which sup- age. ports this language, was used. The result of the API call All three kinds of features are combined into feature is a recognized speech converted to a text. To reduce mis- vectors, their resulting dimension is 5800 features. spelled words and the dimensionality of the data, the same methods as for OCR are applied, resulting in a text doc- 4.2 Hierarchical Clustering ument for every slide, aggregated from the OCR and the speech recognition. General purpose of clustering is grouping objects that are The text data returned from the speech and slide recog- similar into same group. There are 2 main kinds of clus- nition have to be transformed into term-by-document ma- tering – the number of clusters is a priori known or it is trices. For term weighting on cleaned text data, a tf-idf unknown, in which case the clusters are formed hierarchi- (term frequency – inverse document frequency) scheme cally. Hierarchical clustering can also capture all levels of has been used [9]: similarities, with respect to how the similarity is defined. The result of hierarchical clustering is a multilevel hier- Wt,d = (nt,d /maxt (nt,d )) · log2 (nD /nD,t ) (1) archy of clusters, where two clusters at one level are joined as a cluster at the next level. At the bottom level every sin- where nt,d is the count of term t in document d, nD de- gle cluster corresponds to a single observation. notes the number of documents in the employed corpus Decision, which clusters will be joined on the higher and nD,t is count of documents from the corpus that con- level, is based on a similarity between these clusters, tain at least one occurrence of term t. Logarithmic weight- which is measured in respect of a chosen distance met- ing in idf part is used to not penalize relatively frequent ric between observations / clusters at the previous level of terms as much. the cluster hierarchy and a linkage criterion. The distance As the resulting matrices had tens of thousands of metric determines how similarity between two observation columns and we would like to minimize the impact of the is calculated, while the linkage criterion determines how curse of dimensionality, the Latent Semantic Analysis [8] the distance metric is employed to calculate the similar- has been used. To this end, the k largest eigenvalues of ity between two clusters. The Euclidean distance and the the term-by-document matrix of weights (1), correspond- Ward’s criterion were chosen as the metric and the linkage ing to the most significant concepts, and their associated criterion in all our experiments. The Euclidean distance eigenvectors are found first. Let Uk , Σk and Vk be the ma- was chosen because it the most common distance metric trices resulting from the singular value decomposition of being used. The Ward’s criterion (also called the minimum the term-by-document matrix, and k denotes the number variance criterion or the inner squared distance) was cho- of the largest eigenvalues. sen because it minimizes the total within-cluster variance To obtain dense matrices of significantly lower dimen- after a potential merge of the two clusters. This variance is sions, a concept-by-document matrix [10] is then com- calculated as a weighted Euclidean distance between clus- puted as: ters’ centers. The weight w is calculated as: s Ck = ΣkVk> (2) 2nr ns wr,s = (3) (nr + ns ) Third, the video is used by two different extraction methods – Speeded Up Robust Features (SURF) and color where nr and ns are the number of elements in clusters r histogram. Both methods work with an image, so just one and s. frame from the center of the video’s time span related to a slide is taken. SURF is used to find visual descriptors 4.3 SOM Clustering in the image. We have almost 16 000 slides, therefore the same number of images taken from the videos, and each of Self-organizing map (or Self-organizing feature map)[1] these images produced nearly a hundred SURF descriptors is a kind of an artificial neural network. The main idea – summing up to over a million descriptors in total. To re- behind SOM is that it can preserve topological relations duce the dimensionality, a dictionary of SURF descriptors of the input space, which is typically high-dimensional, in concepts was created by k-means clustering of the origi- a low dimensional map (typically 2-D). That means the nal SURF descriptors with k empirically set to 400. Be- data which is similar in their original high-dimensional cause the matrix of these 400 descriptor concepts was still input space is also similar in the map. Due to this char- very sparse, we decided to cluster it even further, by hi- acteristics, the high-dimensional data whose relationships Search for Structure in Audiovisual Recordings of Lectures and Conferences 153 are often very hard to visualize in the original high- process is completed, each neuron in the map could be in- dimensional space can be relative easily visualized in the terpreted as a cluster which is formed by a subset of the low-dimensional map. SOM does not only learn the dis- input feature vectors, those for which it is a winner. tribution of the input data, but it also learns a topology. In many scenarios, for example in studies [6] and [7], Since in all our experiments we used a 2-D map, which the output of SOM is interpreted in that way. However, is also the most common type of SOM, the following de- this means that every neuron in the map is interpreted as a scription will be restricted just to that specific case. single cluster. SOM consists of one layer of neurons which are orga- In this case study, we use a different approach – an in- nized in some regular topology according to a topology tegration of SOM and hierarchical clustering. First, we function, which determines how the neurons in the map train SOM with all the input feature vectors. The result- are organized. Most common are the grid, hexagonal and ing map serves as a reference for all clustering solutions random topology. Also, a distance function, which mea- obtained by cutting the cluster hierarchy. These solutions sures some kind of distance between neurons needs to be are mapped to that reference SOM and then evaluated. Al- selected. Common distance functions are Euclidean dis- gorithm 1 and the following paragraphs describe how the tance, Manhattan distance or the length of the shortest path average withing cluster distance of a clustering solution is between neurons. calculated. Each neuron in the map is assigned a weight vector which is of the same dimension as the input vectors. The Algorithm 1 An average within-cluster distance weight vectors are first initialized, most easily with sam- map ← trained SOM ples from the input data or with small random values. S ← clustering solution obtained by cutting cluster hier- Training the SOM is an iterative process. In each itera- archy tion, one sample vector X is selected and it is presented to for all c ∈ S do the map. A similarity between the selected feature vector WCD[c] ← W ITHIN C LUSTER D ISTANCE(map,c) and all weight vectors in the map is calculated, usually as end for Euclidean distance. The neuron that has the most similar SolutionAvgWCD ← avg(WCD) //over all clusters c weight vector is selected as a winning neuron c. When the winning neuron is found, the weight vectors function W ITHIN C LUSTER D ISTANCE(map,c) of the winning neuron and its neighbors are updated. They for each pair p of input vectors u, v ∈ c do are moved closer to the presented input vector, with de- Nu ← F IND N EURON(u,map) creasing magnitude of changes according to the growing Nv ← F IND N EURON(v,map) distance from the winning neuron. The update rule for Dist[p] ← dist(Nu , Nv ) a neuron i with weight vector Wi (t) is: end for return avg(Dist) between all pairs Wi (t + 1) = Wi (t) + α(t)θ (c, i,t)(X(t) −Wi (t)) (4) end function where α(t) is learning rate, which is from the interval function F IND N EURON(v, map) (0, 1) and it is decreasing with increasing iterations, and return Neuron N in the map map to which the input θ (c, i,t) is the neighborhood function which depends on vector v is mapped. the distance between the winning neuron c and the neu- end function ron i, and may depend also on the iteration t. Another possibilty for SOM training is the batch After the reference SOM has been created, hierarchi- algorithm[2]. In each epoch, this algorithm presents the cal clustering is performed. The cluster hierarchy is cut whole training data set to the network at once. The win- at many different levels and the clusters at the respective ning neurons for all the input data are selected and each level are used as a starting point to calculate the distance weight vector is modified according to the position of all between their respective vectors positioned in the map. the input vectors for which it is a winner or for which it is When evaluating a solution, all feature vectors that form in the neighborhood of a winner. a cluster by hierarchical clustering are taken, the positions In our experiments with SOM, we used the MATLAB’s of their respective neurons in the map are found and the Neural network toolbox implementation of SOM, which distance between each pair of them is calculated accord- uses the batch algorithm by default. ing to the distance function used in the map. From these distances between all pairs, we calculate the average dis- 4.4 Integrating Hierarchical Clustering with SOM tance, which is interpreted as a within-cluster distance. From those within-cluster distances, an average dis- During each epoch of the batch algorithm that is used in tance of the whole solution is calculated. That average the training process of the SOM, each vector of values of distance is used as a measure of quality of each clustering input features is assigned to the neuron that is the win- solution obtained by cutting the cluster hierarchy. We also ner for that input feature vector. Thus, after the training interpret it as a measure of similarity between the map out- 154 M. Kopp, P. Pulc, M. Holeňa put and the hierarchical clustering output – the smaller the 5.2 Evaluation Results within-cluster distance is, the more similar the outputs of SOM and hierarchical clustering are. We also consider that In this section we are going to use three types of figures to the more similar they are (the smaller the within-cluster illustrate the measured results. Because these figure types distance is), the better the number of clusters produced are used repeatedly, their interpretation will be first shortly from cutting the tree suits the map. described. The first type shows the map topology (which is hexag- onal in all our experiments) with a relative distribution of the input feature vectors. Each neuron is represented as 5 Experimental Evaluation a white hexagon with a blue patch. The bigger the blue patch at each neuron is, the more feature vectors has been 5.1 Evaluation Methodology mapped to that neuron. Also in smaller maps, the numbers representing the total number of feature vectors mapped to the neurons are shown in each hexagon. For our experimental evaluation, we need to train several The second type is a neighbor distances map. It shows differently sized SOMs first. Data from all available lec- the distances between weight vectors of the neurons. The tures has been used for the training to make the SOMs as small regular hexagons are the neurons and the lines con- precise as possible. However, with 5800 features, the data necting them represents their direct neighbor relations. dimensionality is quite high and so is the number of fea- The colored patches between the neurons show how close ture vectors – 15953. Due to the time complexity of SOM each neuron’s weight vector is to the weight vectors of its training and also with respect to the size of our data, only neighbors. Color range varies from yellow to black, where relatively small maps were used. Namely, we considered the darker color means the greater distance between the only the squared maps with sizes of 8 × 8, 12 × 12, 16 × 16 weight vectors. and 20 × 20 neurons. The number of training epochs was The third type is a figure showing an average within- set empirically, taking into account the time taken by the cluster distance depending on the number of clusters training process, to 200 for the three smaller maps and 150 produced from the hierarchical clustering, which were for the largest one. mapped to the SOM as described in Section 4.4. Even though we need outputs from many levels of We started the evaluation with the smallest, 8 × 8 map. the hierarchical clustering, the cluster hierarchy was con- A relative distribution of the input feature vectors onto the structed only once. It is important to realize that the hi- map is shown in Figure 1. erarchy must be constructed from the same set of feature vectors as the SOMs to make the comparison possible. In all considered clustering solutions, the same cluster hier- archy was considered, cut at an appropriate level to obtain a solution with the desired number of clusters. A different number of clustering solutions for different sized SOMs was used. We started comparison with hierar- chical clustering output containing just a few clusters and ended with an output containing three or for times more clusters than the number of neurons present in the respec- tive map. We also introduced different steps for increasing the number of clusters, depending on the map size. In Table 1, the empirically chosen values for of our ex- periments are shown. The number of clusters range is an interval which determines, together with the step size, how many clustering solutions produced by cutting the tree were evaluated. SOM size # of clusters range step # of solutions Figure 1: Relative distribution of the input feature vec- 8×8 16 – 256 4 61 tors alongside with the total numbers of vectors mapped to 12 × 12 16 – 500 4 122 each neuron in the 8 × 8 map. 16 × 16 32 – 800 8 97 20 × 20 32 – 1000 8 122 From the distribution of feature vectors shown in Fig- ure 1 we can see that from the total number of 15 953 fea- Table 1: Settings used in the performed experiments ture vectors, there are 1854 feature vectors mapped to a single neuron. That indicates that there are many quite similar input vectors in the data. This is even more evident Search for Structure in Audiovisual Recordings of Lectures and Conferences 155 when we look at the neurons in its neighborhood, which are also quite a lot populated with the input vectors in con- trast with the lower-left part of the map. The neuron with the most mapped feature vectors has them mapped more than twice as many as any other neuron in this map. When we look at the input data, we find that many different slides are mapped to this neuron. Another reason for the large number of feature vectors mapped to this neuron is that these feature vectors pro- vide just one modality of the available input data. Namely, nearly a half of input data mapped to this neuron seems to be slides created from video playback or they are just images without text. So, even though these slides look different, the OCR produces little to no text, which can re- sult in great similarity between them, depending on other modalities. These feature vectors also have in common that there is little of recognized speech, thus they are very similar in the speech-to-text modality. The only really important difference in feature vectors is due to the SURFs and his- tograms produced from one frame taken from the video, Figure 2: The neighbor distances in the 8 × 8 map. but this difference can also be just a small one. In the case of lectures, the majority of lecture halls seem very similar – white walls and brown desks accompanied by projection screen. However, there are also some inputs mapped to this neu- ron that seem to be typical examples of a lecture – a slide with some text, video from lecture hall and an average speech-recognized audio. Quite interesting is that usually not the whole sequence of similar input data from one lecture is mapped to this neuron. However, the input data from the sequences, which are not present in this neuron, are in the most cases mapped to its direct neighbours. This behaviour supports the idea of large clusters spread over more neighbouring neurons. In Figure 2, which shows the distances between neigh- bors, we can see that the distances are the shortest be- tween the neurons with the highest number of vectors. Conversely, neurons in the lower-left corner are the most Figure 3: A within-cluster distance, measured as the distant from each other. When we look at the input data, length of the shortest path between neurons, dependent we find that the segments of video-recordings and associ- on the number of clusters produced from the hierarchical ated slides projected during that segment indeed substan- clustering, mapped to the 8 × 8 map. tially differ from the remaining segments of all recorded lectures. First, the camcorder during the time span of re- spective segment was set to shoot only the projected slide, an average within-cluster distance drops down to its local instead of the whole lecture hall. Second, the projected minimal value when the cluster hierarchy is cut at an ap- slides in most cases were examples of web pages. More- propriate level to form about a hundred clusters. It means over, in 35 out of 42 slides mapped to the neuron in the that the hierarchical clustering should produce more clus- bottom-left corner, the examples of web pages were taken ters than the number of neurons in the map, which is from the Wikipedia, so they contain a lot of text and hy- only 64. perlinks. The image taken from the video, containing just Let us now pass to the 12 × 12 and 16 × 16 maps. In a slide instead of the whole lecture hall, caused the visual Figures 4 and 7 we can see similar, one highly populated features and histogram to be very different. A lot of text neuron, like in the 8 × 8 map. In bigger maps, the total on a Wikipedia page instead of only few lines of text on number of vectors mapped to that neuron is lower, but in a typical slide caused the main difference in OCR. comparison to other neurons, the relative count is the same Figure 3 shows that in the map of size 8 × 8 neurons, or even higher. 156 M. Kopp, P. Pulc, M. Holeňa Figure 6: A within-cluster distance, measured as the length of the shortest path between neurons, dependent Figure 4: Relative distribution of the input feature vec- on the number of clusters produced from the hierarchical tors alongside with the total numbers of vectors mapped to clustering, mapped to the 12 × 12 map. each neuron in the 12 × 12 map. Figure 7: Relative distribution of the input feature vec- Figure 5: The neighbor distances in the 12 × 12 map. tors alongside with the total numbers of vectors mapped to each neuron in the 16 × 16 map. In Figures 5 and 8, we can see that there are some neurons, mostly on the left side of the map 5 and in the the most features. This can be seen in the map as a darker top-center, lower-left and top-right corners in 8, that have “star” in a lighter neighborhood. a large distance to their respective neighbors. The situation In Figures 6 and 9, which show the within-cluster dis- is quite similar to that of the smaller map in Figure 2. The tance dependence on the number of clusters formed by darkest areas just moved around to another physical loca- cutting the cluster hierarchy, we can not find a clear lo- tion in each of those two maps, but the relationships seems cal minimum. The value of distance drops down quickly to be similar and those neurons tend to stay together. with the increasing number of clusters, forming an elbow It is also interesting, that in those larger maps a few neu- in the figure, which is at just about a few less clusters than rons are more distant from their respective neighbors than is the number of neurons in the map. others in their neighborhoods, which indicates that there Even though the most suitable number of clusters in the is some input data which is quite different from the rest in map could not be define precisely this way when there is Search for Structure in Audiovisual Recordings of Lectures and Conferences 157 Figure 8: The neighbor distances in the 16 × 16 map. Figure 10: Relative distribution of the input feature vec- tors alongside with the total numbers of vectors mapped to each neuron in the 20 × 20 map. Figure 9: A within-cluster distance, measured as the length of the shortest path between neurons, dependent on the number of clusters produced from the hierarchical Figure 11: The neighbor distances in the 20 × 20 map. clustering, mapped to the 16 × 16 map. distant from each other is still present, situated at the top no distinct local minimum, locating an elbow on the curve of the map, almost in the right corner. Alongside with gives us a great possible candidate when we are searching Figures 1, 4 and 7, we can also see that as the map grows, for a solution with a good average within-cluster distance there are more distant neurons and the whole map becomes and the least number of clusters. darker. This is due to the higher number of neurons which Finally, let us pass to the largest, 20×20 map. Figure 10 allows more distinct input feature vectors to form their shows that there is still one neuron with significantly more own clusters. input vectors mapped to it than to any other neuron. A The 20 × 20 map also has other interesting characteris- great part of the map is low populated with feature vectors, tics. In Figure 12, there is once again a local minimum, even though there are some neurons with small neighbor- though in this case it can be found within the interval of hoods which are populated a little more. 400 and 500 clusters. It seems that clustering solutions In Figure 11, we can see that the small area of neurons produced by hierarchical clustering and clusters formed in 158 M. Kopp, P. Pulc, M. Holeňa segments of video-recordings and their associated slides, which substantially differs from the others. In the future, we want to further elaborate our approach, taking into account the results obtained in the case study presented here. We also intend to apply it to further com- plex multimedia data sets, in particular to data nowadays being collected within the project Open Narra at the Film and TV School of the Academy of Performing Arts in Prague. Acknowledgement The research reported in this paper has been supported by the Czech Science Foundation (GAČR) grant 13-17187S. Access to computing and storage facilities owned by par- ties and projects contributing to the National Grid In- frastructure MetaCentrum, provided under the programme “Projects of Large Infrastructure for Research, Develop- Figure 12: A within-cluster distance, measured as the ment, and Innovations” (LM2010005), is greatly appreci- length of the shortest path between neurons, dependent ated. on the number of clusters produced from the hierarchical clustering, mapped to the 20 × 20 map. References the map are very similar in that interval, because there is [1] Kohonen, T.: Self-organized formation of topologically exactly 400 neurons in this map. correct feature maps. Biological Cybernetics 43 (1982), 59–69 [2] Kohonen, T.: Self-organizing maps. Springer Series in In- 6 Conclusion formation Sciences, 1995, 2nd ed. 1997 [3] Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and The objective of this paper was to provide a proof of PocketSOM player: alternative interfaces to large music concept for the possibility to improve structure discov- collections. In: Proceedings of the 6th International Con- ery in complex multimedia data through the integration of ference on Music Information Retrieval, 618–623, 2005 two unsupervised approaches – hierarchical clustering and [4] Hartono, P., Yoshitake, R.: Automatic playlist generation self-organizing maps. For that proof of concept, we have from self-organizing music map. Journal of Signal Process- used a case study of more then 100 hours of audiovisual ing 17 (1) (2013), 11–19 recordings of lectures from several years of a Czech sci- [5] Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic ence festival called Week of Science and Technology. The models and self-organizing maps to create a hierarchical approach we propose relies on the one hand on the flexibil- structuring of music by musical styles. In: Proceedings of ity and universality of hierarchical clustering, on the other the 3rd International Symposium on Music Information Re- trieval, 2002, 71–80, hand on the suitability of SOM for multimedia data, con- vincingly documented by Pitoyo Hartono. Our approach [6] Chen, G., Jaradat, S. A., Banerjee, N., et al.: Evaluation and comparison of clustering algorithms in analyzing ES cell has been much inspired by his paper [4]. However, the gene expression data. Statistica Sinica 12.1 (2002), 241– need to integrate SOM with hierarchical clustering is a 262 consequence of the fact that our data are much more com- [7] Lamirel, J. C., Cuxac, P., Mall, R., et al.: A new efficient plex, both from the point of view of involved modalities, and unbiased approach for clustering quality evaluation. and from the point of view of the number of attributes. [8] Deerwester, S. C., Dumais, S. T., Landauer, T. K., et al.: In- We have tested the proposed approach with 4 differently dexing by latent semantic analysis. JASIS 41 (6) (1990), sized self-organizing maps. Although space limitations 391–407 allowed us to present only two examples of the interpre- [9] Ramos, J.: Using tf-idf to determine word relevance in doc- taion of the obtained results in the context of the original ument queries. In: Proceedings of the First Instructional audio-visual data, even these examples indicate that SOMs Conference on Machine Learning, 2003 indeed help to recognize structure in that data. In the [10] Skopal, T., Moravec, P.: Modified LSI model for efficient first provided example, the SOM organized the most com- search by metric access methods. In: Advances in Informa- mon data into a large cluster spread over few neighbour- tion Retrieval, 2005, 245–259 ing neurons and in the second example, the SOM discov- ered a specific group among the nearly 16000 considered