J. Yaghob (Ed.): ITAT 2015 pp. 150–158
Charles University in Prague, Prague, 2015


    Search for Structure in Audiovisual Recordings of Lectures and Conferences

                                             Michal Kopp1 , Petr Pulc1 , and Martin Holeňa2
                              1   Faculty of Information Technology, Czech Technical University in Prague
                                                       Thákurova 9, 160 00 Prague
                                            koppmic2@fit.cvut.cz, petrpulc@gmail.com
                                      2 Institute of Computer Science, Czech Academy of Sciences

                                                  Pod Vodárenskou věží 2, 182 07 Prague
                                                           martin@cs.cas.cz

Abstract: With the quickly rising popularity of multime-               closer the videos represented by their respective vectors in
dia, especially of the audiovisual data, the need to under-            the map are, the more similar they should be.
stand the inner structure of such data is increasing. In                  The most simple clustering of the original data would
this case study, we propose a method for structure discov-             be based on individual neurons, i.e., all feature vectors
ery in recorded lectures. The method consists in integrat-             mapped to the same neuron would form one cluster. How-
ing a self-organizing map (SOM) and hierarchical clus-                 ever, there is more information involved in the SOM than
tering to find a suitable cluster structure of the lectures.           could be represented by such a simple clustering. To have
The output of every SOM is evaluated by various levels                 a clue what SOM-based clustering would be most suitable,
of hierarchical clustering with different number of clus-              we complement SOM with hierarchical clustering.
ters mapped to the SOM. Within these mapped levels we
search for the one with the lowest average within-cluster
distance, which we consider the most appropriate number                2   Related Work
of clusters for the map. In experiments, we applied the
proposed approach, with SOMs of four different sizes, to               Studies for utilizing SOM as a tool for clustering multime-
nearly 16 000 slides extracted from the recorded lectures.             dia data were proposed in [3, 4]. In the PocketSOM [3],
                                                                       the SOM is utilized for a creation of the map of songs.
                                                                       The songs are characterized by their acoustic attributes us-
1    Introduction                                                      ing Rhythm Patterns [5]. These attributes are used for the
                                                                       creation of a high dimensional vector of features which
In the past few decades, the amount of audiovisual data has            represents the original song. The PocketSOM utilization
grown rapidly. Every day, modern technology is accessi-                of the song’s acoustic attributes leads to the extraction of
ble to more and more people. With these modern devices,                many of these features. In [3], the extraction of the acous-
such as smartphones or cameras, people are able to create              tic attributes generated more than thousand features. This
large amounts of audiovisual data. Growing popularity of               is quite similar to our approach to the multimedia cluster-
audiovisual data is the main reason why many new sites                 ing proposed in this paper. However, we included even
containing such data are created. From the consumer’s                  more features describing the content of multimedia files
point of view, the main purpose of such a site is to find              because we were able to use more than one kind of fea-
a video concerning the topic they are interested in.                   tures due to the multimodality of our audiovisual data.
   Usually, any video file that is being uploaded to some                 There are also some other differences, in particular be-
video sharing site is annotated with some keywords by the              tween the main purpose of the PocketSOM and our clus-
uploader. Beside that, it also contains some metadata such             tering method. The main purpose of the PocketSOM is
as resolution, frame rate or length of the video. Conse-               to provide the user with an ability to build his or her own
quently, any search in the site is restricted to the keywords          playlists. The user can choose the songs by virtually draw-
provided by the uploader and/or some technical data about              ing a path in the map. The songs which are mapped to the
the video. It does not rely on any relationship between the            nodes (neurons) on the path, are selected to be included
contents of the videos themselves. This would need to first            in the playlist and the user has to decide which of them
discover some structure in the data available at the site.             will be selected for the final playlist creation. It is quite
   We investigated the use of a self-organizing map                    obvious that the user is involved a lot in the generation of
(SOM) [1], which is a kind of artificial neural network, for           the playlists. However, our structure discovery method is
clustering the videos with respect to their content. First,            user-independent.
some features are extracted from the videos and then the                  Till the SOM has been trained, PocketSOM and our ap-
map is built as a two-dimensional grid of neurons, so it can           plication of SOM proceed in a similar way. However, after
be easily visualized. According to the training principle of           the map has been trained, PocketSOM starts involving the
SOMs, the videos which are similar in their features are               user, whereas our application integrates the SOM with hi-
placed close to each other in the map. Consequently, the               erarchical clustering.
Search for Structure in Audiovisual Recordings of Lectures and Conferences                                                          151


   Our method has also some similarities with the Music                  and data consistency, only lectures and their related con-
Map [4], which also utilizes SOM for the music playlist                  tent in the Czech language have been chosen.
generation. Both, the Music Map and our application uti-                    All lectures were recorded with Mediasite recorder,
lize SOM to create a two dimensional map of similarities                 a platform maintained by Sonic Foundry, Inc. The idea be-
in the input data. However, the purpose of each of both                  hind the Mediasite platform is to enable streaming of lec-
applications is rather different. The Music Map is used                  tures as easy as possible. The main element of the system
for playlist generation from a collection of songs. The                  is Mediasite server which stores the lectures and streams
user’s preferences for the songs that should be included                 them to the internet. On the side of the lecturer, only audio
in the playlist are transformed to an optimization criterion             and video sources are connected to the device.
for the path planing. Then the algorithm that traverses the                 The data consist of three main modalities – synchro-
map and optimizes the criterion is executed, and the re-                 nized video with sound and slides from lecturer’s pre-
sulting optimal path determines the generated playlist. Al-              sentation. The slides are captured directly from the lec-
though the user is not involved in the Music Map playlist                turer’s presentation through the VGA/DVI card of Media-
generation as much as in the PocketSOM playlist genera-                  site, which is supplied with the same signal as the projec-
tion, the main idea remains the same – to provide the user               tor.
a possibility to create a playlist that corresponds to his or               The signal input for the slides is continuously monitored
her demands. In the Music Map, the trained SOM is used                   and any change to it above given threshold is considered as
as a background layer which is able to discover and pre-                 a new slide. Mediasite stores the image alongside with the
serve some structure in the data, although the main pur-                 timestamps of slide transition used for the synchronization
pose is to traverse the map and generate the final product               during playback.
– the playlist. Differently to the Music Map and Pocket-                    Audio is recorded through Mediasite just as lecturer
SOM, our application does not primarily focus on any final               speaks. It is synchronized with the video and slides by
product generated from the map. It rather tries to discover              the use of timestamps of slide transitions.
structure in the employed data.                                             The Mediasite platform only provides a video signal in
   There is also another difference between the Music Map                the resolution up to 240 rows and the bitrate of 350Kbps.
and our method – the dimensionality of the feature vector.               However, for purpose of the lectures the high-definition
In the Music Map, every song is described by nine fea-                   video is not necessary, because we have the slides in
tures. However, in our application we use more than five                 a good resolution synchronized with the videos. Also, dur-
thousands features. This also means that the training of the             ing the lecture, there are usually not many changes hap-
map takes much more time.                                                pening on the scene, since only the lecturer is supposed to
   Studies utilizing SOM as a clustering tool were also pro-             move significantly.
posed in [6, 7]. In [6], a SOM is used as a one of the
clustering tools for the analysis of embryonic stem cells
gene expression data. The important difference compared
                                                                         4   Proposed Structuring Approach
to our approach is that each neuron represents one cluster
                                                                         4.1 Data Preprocessing
with a strict boundaries, so there are as many clusters as
neurons in the SOM. Then it is easy to measure the within                The multimedia data described in the previous section is
cluster distance and between cluster distance to provide an              saved by the Mediasite system as audio, video and slides
information about the overall clusters’ quality.                         files in their respective formats. However, for the purposes
   The study[7] deals with cluster quality evaluation. Sim-              of knowledge discovery in that data, including the discov-
ilarly as in [6], a SOM is used as a clustering tool where               ery of its inner structure, we need all the data to be prepro-
every neuron corresponds to a one cluster and it is treated              cessed to a suitable form. Because we have three modali-
this way when quality of a clustering solution is being                  ties in the data, we treated each one separately, so we can
measured. Thus, the overall number of clusters is deter-                 apply appropriate feature extraction tools.
mined by the size of the 2-D map.                                           First, we used Optical Character Recognition (OCR)
   Our approach, on the other hand, relies on combining                  system for the slides. The slides usually contain some text
the information from SOM with information from hierar-                   related to the topic or the subtopic of the lecture, so an
chical clustering.                                                       OCR can give us a good insight into the content of the
                                                                         slides. The downside of this approach are slides, which
3    Case Study Audiovisual Data                                         contains, for example, just an image. For these slides OCR
                                                                         often returns nothing. However, we still have the other
The data in which we would like to search structure has                  modalities we can rely on.
been created over the period of several years during the                    To this end, we used the OCR system Tesseract, which
Weeks of Science and Technology, a two-week science                      is an open-source software made publicly available by
festivals organized by the Czech Academy of Sciences.                    Google. The problem of the OCR output is that it some-
Altogether, the data includes more than 100 hours of mul-                times contains a lot of unrecognized characters and mis-
timedia content. For the purposes of further preprocessing               spelled words, so a basic text processing was applied. All
152                                                                                                  M. Kopp, P. Pulc, M. Holeňa


punctuation is removed. To reduce the dimensionality of         erarchical clustering. This resulted in 32 final descriptor
the data before further processing, only the words found in     concepts.
the spellchecking dictionary are considered. Also, a stem-         The histogram describes each image (taken from the
mer is applied.                                                 video) by count of pixels with certain color value in each
   Second, a speech recognition tool was used for the audio     color channel. We use the RGB coding with 1 byte depth.
files. Because we consider only lectures in the Czech lan-      Each color count is relatively scaled to the size of the im-
guage, Google’s API for speech recognition, which sup-          age.
ports this language, was used. The result of the API call          All three kinds of features are combined into feature
is a recognized speech converted to a text. To reduce mis-      vectors, their resulting dimension is 5800 features.
spelled words and the dimensionality of the data, the same
methods as for OCR are applied, resulting in a text doc-        4.2 Hierarchical Clustering
ument for every slide, aggregated from the OCR and the
speech recognition.                                             General purpose of clustering is grouping objects that are
   The text data returned from the speech and slide recog-      similar into same group. There are 2 main kinds of clus-
nition have to be transformed into term-by-document ma-         tering – the number of clusters is a priori known or it is
trices. For term weighting on cleaned text data, a tf-idf       unknown, in which case the clusters are formed hierarchi-
(term frequency – inverse document frequency) scheme            cally. Hierarchical clustering can also capture all levels of
has been used [9]:                                              similarities, with respect to how the similarity is defined.
                                                                   The result of hierarchical clustering is a multilevel hier-
         Wt,d = (nt,d /maxt (nt,d )) · log2 (nD /nD,t )   (1)   archy of clusters, where two clusters at one level are joined
                                                                as a cluster at the next level. At the bottom level every sin-
   where nt,d is the count of term t in document d, nD de-      gle cluster corresponds to a single observation.
notes the number of documents in the employed corpus               Decision, which clusters will be joined on the higher
and nD,t is count of documents from the corpus that con-        level, is based on a similarity between these clusters,
tain at least one occurrence of term t. Logarithmic weight-     which is measured in respect of a chosen distance met-
ing in idf part is used to not penalize relatively frequent     ric between observations / clusters at the previous level of
terms as much.                                                  the cluster hierarchy and a linkage criterion. The distance
   As the resulting matrices had tens of thousands of           metric determines how similarity between two observation
columns and we would like to minimize the impact of the         is calculated, while the linkage criterion determines how
curse of dimensionality, the Latent Semantic Analysis [8]       the distance metric is employed to calculate the similar-
has been used. To this end, the k largest eigenvalues of        ity between two clusters. The Euclidean distance and the
the term-by-document matrix of weights (1), correspond-         Ward’s criterion were chosen as the metric and the linkage
ing to the most significant concepts, and their associated      criterion in all our experiments. The Euclidean distance
eigenvectors are found first. Let Uk , Σk and Vk be the ma-     was chosen because it the most common distance metric
trices resulting from the singular value decomposition of       being used. The Ward’s criterion (also called the minimum
the term-by-document matrix, and k denotes the number           variance criterion or the inner squared distance) was cho-
of the largest eigenvalues.                                     sen because it minimizes the total within-cluster variance
   To obtain dense matrices of significantly lower dimen-       after a potential merge of the two clusters. This variance is
sions, a concept-by-document matrix [10] is then com-           calculated as a weighted Euclidean distance between clus-
puted as:                                                       ters’ centers. The weight w is calculated as:
                                                                                             s
                         Ck = ΣkVk>                       (2)                                     2nr ns
                                                                                      wr,s =                               (3)
                                                                                                (nr + ns )
   Third, the video is used by two different extraction
methods – Speeded Up Robust Features (SURF) and color           where nr and ns are the number of elements in clusters r
histogram. Both methods work with an image, so just one         and s.
frame from the center of the video’s time span related to
a slide is taken. SURF is used to find visual descriptors       4.3 SOM Clustering
in the image. We have almost 16 000 slides, therefore the
same number of images taken from the videos, and each of        Self-organizing map (or Self-organizing feature map)[1]
these images produced nearly a hundred SURF descriptors         is a kind of an artificial neural network. The main idea
– summing up to over a million descriptors in total. To re-     behind SOM is that it can preserve topological relations
duce the dimensionality, a dictionary of SURF descriptors       of the input space, which is typically high-dimensional, in
concepts was created by k-means clustering of the origi-        a low dimensional map (typically 2-D). That means the
nal SURF descriptors with k empirically set to 400. Be-         data which is similar in their original high-dimensional
cause the matrix of these 400 descriptor concepts was still     input space is also similar in the map. Due to this char-
very sparse, we decided to cluster it even further, by hi-      acteristics, the high-dimensional data whose relationships
Search for Structure in Audiovisual Recordings of Lectures and Conferences                                                         153


are often very hard to visualize in the original high-                   process is completed, each neuron in the map could be in-
dimensional space can be relative easily visualized in the               terpreted as a cluster which is formed by a subset of the
low-dimensional map. SOM does not only learn the dis-                    input feature vectors, those for which it is a winner.
tribution of the input data, but it also learns a topology.                 In many scenarios, for example in studies [6] and [7],
   Since in all our experiments we used a 2-D map, which                 the output of SOM is interpreted in that way. However,
is also the most common type of SOM, the following de-                   this means that every neuron in the map is interpreted as a
scription will be restricted just to that specific case.                 single cluster.
   SOM consists of one layer of neurons which are orga-                     In this case study, we use a different approach – an in-
nized in some regular topology according to a topology                   tegration of SOM and hierarchical clustering. First, we
function, which determines how the neurons in the map                    train SOM with all the input feature vectors. The result-
are organized. Most common are the grid, hexagonal and                   ing map serves as a reference for all clustering solutions
random topology. Also, a distance function, which mea-                   obtained by cutting the cluster hierarchy. These solutions
sures some kind of distance between neurons needs to be                  are mapped to that reference SOM and then evaluated. Al-
selected. Common distance functions are Euclidean dis-                   gorithm 1 and the following paragraphs describe how the
tance, Manhattan distance or the length of the shortest path             average withing cluster distance of a clustering solution is
between neurons.                                                         calculated.
   Each neuron in the map is assigned a weight vector
which is of the same dimension as the input vectors. The                 Algorithm 1 An average within-cluster distance
weight vectors are first initialized, most easily with sam-                map ← trained SOM
ples from the input data or with small random values.                      S ← clustering solution obtained by cutting cluster hier-
   Training the SOM is an iterative process. In each itera-                archy
tion, one sample vector X is selected and it is presented to               for all c ∈ S do
the map. A similarity between the selected feature vector                      WCD[c] ← W ITHIN C LUSTER D ISTANCE(map,c)
and all weight vectors in the map is calculated, usually as                end for
Euclidean distance. The neuron that has the most similar                   SolutionAvgWCD ← avg(WCD) //over all clusters c
weight vector is selected as a winning neuron c.
   When the winning neuron is found, the weight vectors                      function W ITHIN C LUSTER D ISTANCE(map,c)
of the winning neuron and its neighbors are updated. They                       for each pair p of input vectors u, v ∈ c do
are moved closer to the presented input vector, with de-                            Nu ← F IND N EURON(u,map)
creasing magnitude of changes according to the growing                              Nv ← F IND N EURON(v,map)
distance from the winning neuron. The update rule for                               Dist[p] ← dist(Nu , Nv )
a neuron i with weight vector Wi (t) is:                                        end for
                                                                                return avg(Dist) between all pairs
      Wi (t + 1) = Wi (t) + α(t)θ (c, i,t)(X(t) −Wi (t))       (4)           end function

where α(t) is learning rate, which is from the interval                      function F IND N EURON(v, map)
(0, 1) and it is decreasing with increasing iterations, and                     return Neuron N in the map map to which the input
θ (c, i,t) is the neighborhood function which depends on                        vector v is mapped.
the distance between the winning neuron c and the neu-                       end function
ron i, and may depend also on the iteration t.
   Another possibilty for SOM training is the batch                         After the reference SOM has been created, hierarchi-
algorithm[2]. In each epoch, this algorithm presents the                 cal clustering is performed. The cluster hierarchy is cut
whole training data set to the network at once. The win-                 at many different levels and the clusters at the respective
ning neurons for all the input data are selected and each                level are used as a starting point to calculate the distance
weight vector is modified according to the position of all               between their respective vectors positioned in the map.
the input vectors for which it is a winner or for which it is            When evaluating a solution, all feature vectors that form
in the neighborhood of a winner.                                         a cluster by hierarchical clustering are taken, the positions
   In our experiments with SOM, we used the MATLAB’s                     of their respective neurons in the map are found and the
Neural network toolbox implementation of SOM, which                      distance between each pair of them is calculated accord-
uses the batch algorithm by default.                                     ing to the distance function used in the map. From these
                                                                         distances between all pairs, we calculate the average dis-
4.4    Integrating Hierarchical Clustering with SOM                      tance, which is interpreted as a within-cluster distance.
                                                                            From those within-cluster distances, an average dis-
During each epoch of the batch algorithm that is used in                 tance of the whole solution is calculated. That average
the training process of the SOM, each vector of values of                distance is used as a measure of quality of each clustering
input features is assigned to the neuron that is the win-                solution obtained by cutting the cluster hierarchy. We also
ner for that input feature vector. Thus, after the training              interpret it as a measure of similarity between the map out-
154                                                                                                  M. Kopp, P. Pulc, M. Holeňa


put and the hierarchical clustering output – the smaller the   5.2 Evaluation Results
within-cluster distance is, the more similar the outputs of
SOM and hierarchical clustering are. We also consider that     In this section we are going to use three types of figures to
the more similar they are (the smaller the within-cluster      illustrate the measured results. Because these figure types
distance is), the better the number of clusters produced       are used repeatedly, their interpretation will be first shortly
from cutting the tree suits the map.                           described.
                                                                  The first type shows the map topology (which is hexag-
                                                               onal in all our experiments) with a relative distribution of
                                                               the input feature vectors. Each neuron is represented as
5      Experimental Evaluation                                 a white hexagon with a blue patch. The bigger the blue
                                                               patch at each neuron is, the more feature vectors has been
5.1     Evaluation Methodology                                 mapped to that neuron. Also in smaller maps, the numbers
                                                               representing the total number of feature vectors mapped to
                                                               the neurons are shown in each hexagon.
For our experimental evaluation, we need to train several
                                                                  The second type is a neighbor distances map. It shows
differently sized SOMs first. Data from all available lec-
                                                               the distances between weight vectors of the neurons. The
tures has been used for the training to make the SOMs as
                                                               small regular hexagons are the neurons and the lines con-
precise as possible. However, with 5800 features, the data
                                                               necting them represents their direct neighbor relations.
dimensionality is quite high and so is the number of fea-
                                                               The colored patches between the neurons show how close
ture vectors – 15953. Due to the time complexity of SOM
                                                               each neuron’s weight vector is to the weight vectors of its
training and also with respect to the size of our data, only
                                                               neighbors. Color range varies from yellow to black, where
relatively small maps were used. Namely, we considered
                                                               the darker color means the greater distance between the
only the squared maps with sizes of 8 × 8, 12 × 12, 16 × 16
                                                               weight vectors.
and 20 × 20 neurons. The number of training epochs was
                                                                  The third type is a figure showing an average within-
set empirically, taking into account the time taken by the
                                                               cluster distance depending on the number of clusters
training process, to 200 for the three smaller maps and 150
                                                               produced from the hierarchical clustering, which were
for the largest one.
                                                               mapped to the SOM as described in Section 4.4.
   Even though we need outputs from many levels of                We started the evaluation with the smallest, 8 × 8 map.
the hierarchical clustering, the cluster hierarchy was con-    A relative distribution of the input feature vectors onto the
structed only once. It is important to realize that the hi-    map is shown in Figure 1.
erarchy must be constructed from the same set of feature
vectors as the SOMs to make the comparison possible. In
all considered clustering solutions, the same cluster hier-
archy was considered, cut at an appropriate level to obtain
a solution with the desired number of clusters.
   A different number of clustering solutions for different
sized SOMs was used. We started comparison with hierar-
chical clustering output containing just a few clusters and
ended with an output containing three or for times more
clusters than the number of neurons present in the respec-
tive map. We also introduced different steps for increasing
the number of clusters, depending on the map size.
   In Table 1, the empirically chosen values for of our ex-
periments are shown. The number of clusters range is an
interval which determines, together with the step size, how
many clustering solutions produced by cutting the tree
were evaluated.

    SOM size     # of clusters range   step   # of solutions
                                                               Figure 1: Relative distribution of the input feature vec-
      8×8             16 – 256           4          61
                                                               tors alongside with the total numbers of vectors mapped to
     12 × 12          16 – 500           4         122
                                                               each neuron in the 8 × 8 map.
     16 × 16          32 – 800           8          97
     20 × 20         32 – 1000           8         122            From the distribution of feature vectors shown in Fig-
                                                               ure 1 we can see that from the total number of 15 953 fea-
      Table 1: Settings used in the performed experiments      ture vectors, there are 1854 feature vectors mapped to
                                                               a single neuron. That indicates that there are many quite
                                                               similar input vectors in the data. This is even more evident
Search for Structure in Audiovisual Recordings of Lectures and Conferences                                                        155


when we look at the neurons in its neighborhood, which
are also quite a lot populated with the input vectors in con-
trast with the lower-left part of the map.
   The neuron with the most mapped feature vectors has
them mapped more than twice as many as any other neuron
in this map. When we look at the input data, we find that
many different slides are mapped to this neuron.
   Another reason for the large number of feature vectors
mapped to this neuron is that these feature vectors pro-
vide just one modality of the available input data. Namely,
nearly a half of input data mapped to this neuron seems
to be slides created from video playback or they are just
images without text. So, even though these slides look
different, the OCR produces little to no text, which can re-
sult in great similarity between them, depending on other
modalities.
   These feature vectors also have in common that there
is little of recognized speech, thus they are very similar
in the speech-to-text modality. The only really important
difference in feature vectors is due to the SURFs and his-
tograms produced from one frame taken from the video,                        Figure 2: The neighbor distances in the 8 × 8 map.
but this difference can also be just a small one. In the case
of lectures, the majority of lecture halls seem very similar
– white walls and brown desks accompanied by projection
screen.
   However, there are also some inputs mapped to this neu-
ron that seem to be typical examples of a lecture – a slide
with some text, video from lecture hall and an average
speech-recognized audio.
   Quite interesting is that usually not the whole sequence
of similar input data from one lecture is mapped to this
neuron. However, the input data from the sequences,
which are not present in this neuron, are in the most cases
mapped to its direct neighbours. This behaviour supports
the idea of large clusters spread over more neighbouring
neurons.
   In Figure 2, which shows the distances between neigh-
bors, we can see that the distances are the shortest be-
tween the neurons with the highest number of vectors.
Conversely, neurons in the lower-left corner are the most                Figure 3: A within-cluster distance, measured as the
distant from each other. When we look at the input data,                 length of the shortest path between neurons, dependent
we find that the segments of video-recordings and associ-                on the number of clusters produced from the hierarchical
ated slides projected during that segment indeed substan-                clustering, mapped to the 8 × 8 map.
tially differ from the remaining segments of all recorded
lectures. First, the camcorder during the time span of re-
spective segment was set to shoot only the projected slide,              an average within-cluster distance drops down to its local
instead of the whole lecture hall. Second, the projected                 minimal value when the cluster hierarchy is cut at an ap-
slides in most cases were examples of web pages. More-                   propriate level to form about a hundred clusters. It means
over, in 35 out of 42 slides mapped to the neuron in the                 that the hierarchical clustering should produce more clus-
bottom-left corner, the examples of web pages were taken                 ters than the number of neurons in the map, which is
from the Wikipedia, so they contain a lot of text and hy-                only 64.
perlinks. The image taken from the video, containing just                   Let us now pass to the 12 × 12 and 16 × 16 maps. In
a slide instead of the whole lecture hall, caused the visual             Figures 4 and 7 we can see similar, one highly populated
features and histogram to be very different. A lot of text               neuron, like in the 8 × 8 map. In bigger maps, the total
on a Wikipedia page instead of only few lines of text on                 number of vectors mapped to that neuron is lower, but in
a typical slide caused the main difference in OCR.                       comparison to other neurons, the relative count is the same
   Figure 3 shows that in the map of size 8 × 8 neurons,                 or even higher.
156                                                                                                    M. Kopp, P. Pulc, M. Holeňa


                                                                  Figure 6: A within-cluster distance, measured as the
                                                                  length of the shortest path between neurons, dependent
Figure 4: Relative distribution of the input feature vec-         on the number of clusters produced from the hierarchical
tors alongside with the total numbers of vectors mapped to        clustering, mapped to the 12 × 12 map.
each neuron in the 12 × 12 map.


                                                                  Figure 7: Relative distribution of the input feature vec-
  Figure 5: The neighbor distances in the 12 × 12 map.            tors alongside with the total numbers of vectors mapped to
                                                                  each neuron in the 16 × 16 map.

   In Figures 5 and 8, we can see that there are some
neurons, mostly on the left side of the map 5 and in the          the most features. This can be seen in the map as a darker
top-center, lower-left and top-right corners in 8, that have      “star” in a lighter neighborhood.
a large distance to their respective neighbors. The situation        In Figures 6 and 9, which show the within-cluster dis-
is quite similar to that of the smaller map in Figure 2. The      tance dependence on the number of clusters formed by
darkest areas just moved around to another physical loca-         cutting the cluster hierarchy, we can not find a clear lo-
tion in each of those two maps, but the relationships seems       cal minimum. The value of distance drops down quickly
to be similar and those neurons tend to stay together.            with the increasing number of clusters, forming an elbow
   It is also interesting, that in those larger maps a few neu-   in the figure, which is at just about a few less clusters than
rons are more distant from their respective neighbors than        is the number of neurons in the map.
others in their neighborhoods, which indicates that there            Even though the most suitable number of clusters in the
is some input data which is quite different from the rest in      map could not be define precisely this way when there is
Search for Structure in Audiovisual Recordings of Lectures and Conferences                                                         157


   Figure 8: The neighbor distances in the 16 × 16 map.                  Figure 10: Relative distribution of the input feature vec-
                                                                         tors alongside with the total numbers of vectors mapped to
                                                                         each neuron in the 20 × 20 map.


Figure 9: A within-cluster distance, measured as the
length of the shortest path between neurons, dependent
on the number of clusters produced from the hierarchical                     Figure 11: The neighbor distances in the 20 × 20 map.
clustering, mapped to the 16 × 16 map.

                                                                         distant from each other is still present, situated at the top
no distinct local minimum, locating an elbow on the curve                of the map, almost in the right corner. Alongside with
gives us a great possible candidate when we are searching                Figures 1, 4 and 7, we can also see that as the map grows,
for a solution with a good average within-cluster distance               there are more distant neurons and the whole map becomes
and the least number of clusters.                                        darker. This is due to the higher number of neurons which
   Finally, let us pass to the largest, 20×20 map. Figure 10             allows more distinct input feature vectors to form their
shows that there is still one neuron with significantly more             own clusters.
input vectors mapped to it than to any other neuron. A                      The 20 × 20 map also has other interesting characteris-
great part of the map is low populated with feature vectors,             tics. In Figure 12, there is once again a local minimum,
even though there are some neurons with small neighbor-                  though in this case it can be found within the interval of
hoods which are populated a little more.                                 400 and 500 clusters. It seems that clustering solutions
   In Figure 11, we can see that the small area of neurons               produced by hierarchical clustering and clusters formed in
158                                                                                                     M. Kopp, P. Pulc, M. Holeňa


                                                                segments of video-recordings and their associated slides,
                                                                which substantially differs from the others.
                                                                   In the future, we want to further elaborate our approach,
                                                                taking into account the results obtained in the case study
                                                                presented here. We also intend to apply it to further com-
                                                                plex multimedia data sets, in particular to data nowadays
                                                                being collected within the project Open Narra at the Film
                                                                and TV School of the Academy of Performing Arts in
                                                                Prague.


                                                                Acknowledgement

                                                                The research reported in this paper has been supported by
                                                                the Czech Science Foundation (GAČR) grant 13-17187S.
                                                                Access to computing and storage facilities owned by par-
                                                                ties and projects contributing to the National Grid In-
                                                                frastructure MetaCentrum, provided under the programme
                                                                “Projects of Large Infrastructure for Research, Develop-
Figure 12: A within-cluster distance, measured as the
                                                                ment, and Innovations” (LM2010005), is greatly appreci-
length of the shortest path between neurons, dependent
                                                                ated.
on the number of clusters produced from the hierarchical
clustering, mapped to the 20 × 20 map.
                                                                References
the map are very similar in that interval, because there is      [1] Kohonen, T.: Self-organized formation of topologically
exactly 400 neurons in this map.                                     correct feature maps. Biological Cybernetics 43 (1982),
                                                                     59–69
                                                                 [2] Kohonen, T.: Self-organizing maps. Springer Series in In-
6 Conclusion                                                         formation Sciences, 1995, 2nd ed. 1997
                                                                 [3] Neumayer, R., Dittenbach, M., Rauber, A.: PlaySOM and
The objective of this paper was to provide a proof of                PocketSOM player: alternative interfaces to large music
concept for the possibility to improve structure discov-             collections. In: Proceedings of the 6th International Con-
ery in complex multimedia data through the integration of            ference on Music Information Retrieval, 618–623, 2005
two unsupervised approaches – hierarchical clustering and        [4] Hartono, P., Yoshitake, R.: Automatic playlist generation
self-organizing maps. For that proof of concept, we have             from self-organizing music map. Journal of Signal Process-
used a case study of more then 100 hours of audiovisual              ing 17 (1) (2013), 11–19
recordings of lectures from several years of a Czech sci-        [5] Rauber, A., Pampalk, E., Merkl, D.: Using psycho-acoustic
ence festival called Week of Science and Technology. The             models and self-organizing maps to create a hierarchical
approach we propose relies on the one hand on the flexibil-          structuring of music by musical styles. In: Proceedings of
ity and universality of hierarchical clustering, on the other        the 3rd International Symposium on Music Information Re-
                                                                     trieval, 2002, 71–80,
hand on the suitability of SOM for multimedia data, con-
vincingly documented by Pitoyo Hartono. Our approach             [6] Chen, G., Jaradat, S. A., Banerjee, N., et al.: Evaluation and
                                                                     comparison of clustering algorithms in analyzing ES cell
has been much inspired by his paper [4]. However, the
                                                                     gene expression data. Statistica Sinica 12.1 (2002), 241–
need to integrate SOM with hierarchical clustering is a              262
consequence of the fact that our data are much more com-
                                                                 [7] Lamirel, J. C., Cuxac, P., Mall, R., et al.: A new efficient
plex, both from the point of view of involved modalities,            and unbiased approach for clustering quality evaluation.
and from the point of view of the number of attributes.
                                                                 [8] Deerwester, S. C., Dumais, S. T., Landauer, T. K., et al.: In-
   We have tested the proposed approach with 4 differently           dexing by latent semantic analysis. JASIS 41 (6) (1990),
sized self-organizing maps. Although space limitations               391–407
allowed us to present only two examples of the interpre-         [9] Ramos, J.: Using tf-idf to determine word relevance in doc-
taion of the obtained results in the context of the original         ument queries. In: Proceedings of the First Instructional
audio-visual data, even these examples indicate that SOMs            Conference on Machine Learning, 2003
indeed help to recognize structure in that data. In the         [10] Skopal, T., Moravec, P.: Modified LSI model for efficient
first provided example, the SOM organized the most com-              search by metric access methods. In: Advances in Informa-
mon data into a large cluster spread over few neighbour-             tion Retrieval, 2005, 245–259
ing neurons and in the second example, the SOM discov-
ered a specific group among the nearly 16000 considered