CUNI at MediaEval 2013 Similar Segments in Social Speech Task

Petra Galuščáková and Pavel Pecina
Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Prague, Czech Republic
{galuscakova,pecina}@ufal.mff.cuni.cz

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

We describe our experiments for the Similar Segments in Social Speech Task at the MediaEval 2013 Benchmark. We mainly focus on segmentation of the recordings into shorter passages, to which we apply standard retrieval techniques. We experiment with machine-learning-based segmentation employing textual (word n-grams, tag n-grams, letter cases, lexical cohesion, etc.) and prosodic features (silence) and compare the results with those obtained by regular segmentation.

1. INTRODUCTION

The main aim of the Similar Segments in Social Speech Task is to find segments similar to given ones (query segments) in a collection of audio-visual recordings containing English dialogues of a university student community. In addition to the human and automatic (ASR) transcripts (both given separately for each speaker), the collection also contains prosodic features and metadata. The training data consists of segments manually assigned to similarity sets of the query segments. The details of the task and data are described in the task description [7].

2. APPROACH DESCRIPTION

In our experiments, the queries are created from the human transcripts of the query segments. The recordings are segmented into overlapping passages (identified by their starting and ending times), which are then indexed using the Terrier IR Platform [6]. The set of potential jump-in points needed in retrieval then consists of the known beginnings of the acquired segments.
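The segmentation into overlapping passages can be sketched as a simple sliding window over each recording. The function below is an illustrative reconstruction (the function name and signature are ours, not from the paper), using the 50-second window and 25-second shift of the regular segmentation described in Section 2.2:

```python
def sliding_passages(duration, length=50.0, shift=25.0):
    """Split a recording of `duration` seconds into equilong,
    overlapping passages given as (start, end) times.  The passage
    starts double as the potential jump-in points for retrieval."""
    passages = []
    start = 0.0
    while start < duration:
        passages.append((start, min(start + length, duration)))
        start += shift
    return passages

# A 120-second recording yields a new 50 s passage every 25 s:
print(sliding_passages(120))
# [(0.0, 50.0), (25.0, 75.0), (50.0, 100.0), (75.0, 120.0), (100.0, 120.0)]
```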
For the indexing, we use the default settings, which outperformed our most successful setting from previous experiments in the Search and Hyperlinking MediaEval Benchmark [3]. We remove stopwords and apply stemming using the Porter stemmer. Ranked lists of retrieved segments are pruned by removing segments overlapping with those ranked higher.

As both transcripts are given in separate tracks for each speaker, we join these tracks into a single one. In the human transcripts, we sort the sentences from both tracks according to their beginnings to acquire a single sequential transcript. Similarly, we sort the speakers' segments given in the ASR transcripts. While in the ASR transcripts the exact playback time is given for each word, in the human transcripts such information is available only at the sentence level, and we therefore approximate it by assuming equal duration of the words in a sentence.

2.1 Query processing

The query segments are specified by their starting and ending times. The queries are constructed by including all words lying within the boundaries of the query segment in both tracks.

We tried to expand the queries by adding words appearing in the vicinity of the query segment (allowing ±5, ±10, ±15, ±20, ±30, and ±60 seconds), but none of these experiments improved the results.

We also attempted to generate the queries from both the human and ASR transcripts and apply them to search in both types of transcripts. The queries created from the human transcripts achieved higher scores when applied to both the human and ASR transcripts; they are therefore used in the experiments presented in this paper.
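The query construction and the equal-duration timing approximation can be sketched together as follows. This is only an illustration under our own assumptions about the data layout (sentences as (start, end, words) triples); the function names are ours:

```python
def word_times(sentence_start, sentence_end, words):
    """Approximate per-word start times by assuming equal word
    duration within a sentence (the human transcripts give times
    only at the sentence level)."""
    step = (sentence_end - sentence_start) / len(words)
    return [(sentence_start + i * step, w) for i, w in enumerate(words)]

def build_query(sentences, seg_start, seg_end):
    """Collect all words whose (approximated) time lies within
    the query segment boundaries."""
    query = []
    for s_start, s_end, words in sentences:
        for t, w in word_times(s_start, s_end, words):
            if seg_start <= t <= seg_end:
                query.append(w)
    return " ".join(query)

sentences = [(0.0, 4.0, ["so", "how", "was", "your", "exam"]),
             (4.0, 6.0, ["it", "went", "fine"])]
print(build_query(sentences, 2.0, 5.0))
# your exam it went
```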
2.2 Segmentation

In this work, we mainly focus on segmentation of the recordings, which appears to be crucial for segment retrieval [2]. We experiment with regular segmentation and two methods based on (supervised) machine learning (ML).

In regular segmentation, the recordings are divided into equilong segments of 50 seconds (which is approximately equal to the average segment length in the collection). The shift between the segments (and thus the overlap) is also regular, set to 25 seconds, since according to our experience from the 2012 Search and Hyperlinking Task, a shift of 10 to 30 seconds achieves optimal results [2].

In the first ML approach, we identify segment boundaries using classification trees [1], implemented in the rpart library in R. For each word in the transcripts, we assume that it belongs to a segment and detect whether it is followed by a segment boundary or whether the segment continues. The class distribution in this task (segment boundary vs. segment continuation) is highly unbalanced, and the corresponding weights must be set accordingly to prevent too short segments. We set the weight of a segment boundary misclassified as segment continuation in the loss matrix to 21, the weight of a segment continuation misclassified as a segment boundary to 11, and the complexity parameter to 0.
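The effect of such a loss matrix can be illustrated by the minimum-expected-loss decision rule it induces at a tree leaf. The sketch below is not the rpart implementation, only a Python illustration of how the asymmetric weights (21 vs. 11) lower the probability threshold for predicting the rare boundary class; all names are ours:

```python
def min_risk_class(p_boundary, loss_fn=21.0, loss_fp=11.0):
    """Cost-sensitive decision: predict 'boundary' when the expected
    loss of predicting 'continuation' (missing a true boundary,
    weight 21) exceeds the expected loss of predicting 'boundary'
    (a spurious boundary, weight 11)."""
    risk_continuation = p_boundary * loss_fn        # cost of a missed boundary
    risk_boundary = (1.0 - p_boundary) * loss_fp    # cost of a false boundary
    return "boundary" if risk_continuation > risk_boundary else "continuation"

# With weights 21/11 the threshold is 11/32 = 0.344, well below 0.5:
print(min_risk_class(0.35))  # boundary
print(min_risk_class(0.30))  # continuation
```

With the more recall-oriented weights of the second approach (61 vs. 1), the same rule fires at a boundary probability of only about 1/62, which matches the aim of finding all possible segment beginnings.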
In the second ML approach, we apply a similar process to detect the beginnings of segments, which are then set to be 50 seconds long (naturally, the segments can overlap). In this case, we aim at a higher recall of the decision process to find all possible segment beginnings while still keeping the number of created segments reasonable. We set the weight of a segment boundary misclassified as segment continuation in the loss matrix to 61, the weight of a segment continuation misclassified as a segment boundary to 1, and the complexity parameter to 0.

For comparison, the classification models trained and tuned on the human transcripts are also applied to the ASR transcripts despite their mutual inconsistency. The transcripts differ in the length of silence (which is in the human transcripts only approximated as the duration between the imprecise word beginnings), tokenization, and letter capitalization. Our future plans therefore include training the classification model on the ASR transcripts as well.

2.3 Features

Our classification model exploits the following features: cue words and cue tags, letter cases, the length of the silence before the word, the division given in the transcripts, and the output of the TextTiling algorithm [4].

The cue words are words that appear frequently at segment boundaries and often do not carry special meaning. Based on the training data, we have identified words which frequently stand at the segment boundary and words which are the most informative for the segment boundary (the mutual information between these words and the segment boundary is high). We have also defined our own set of words which might occur at such a boundary. We created these sets for unigrams, bigrams, and trigrams, for both words and tags (obtained by the Featurama tagger [5]), and for both segment beginnings and ends. The occurrence of each n-gram is captured by a separate feature. An additional feature indicates whether at least one item from the set (n-grams of frequent words, informative words, and defined words, for either beginnings or ends) occurs.

As the TextTiling algorithm is based on calculating the similarity between adjacent regions, utilizing its output allows us to also employ lexical cohesion in our decision process.
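The lexical-cohesion signal behind TextTiling can be sketched as follows. This is a hedged simplification, not the actual algorithm from [4]: the word windows on either side of a candidate position are compared by cosine similarity of their term-frequency vectors, and low similarity suggests a topic boundary. All names and window sizes are our own:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion(words, i, window=20):
    """Lexical cohesion at position i: similarity between the
    preceding and following word windows; low values hint at
    a segment boundary."""
    left = Counter(words[max(0, i - window):i])
    right = Counter(words[i:i + window])
    return cosine(left, right)

words = "we met at the gym ok so about the project deadline".split()
print(cohesion(words, 6, window=6))  # ≈ 0.18: little shared vocabulary
```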
3. RESULTS

We employ three automatic evaluation measures: Normalized Searcher Utility Ratio (SUR), Normalized Recall, and the F-measure (for details, see the task description [7]). The results for the various types of segmentation are displayed in Table 1 for the human transcripts and in Table 2 for the ASR transcripts.

Segmentation of      Normalized  Normalized  F-measure
beginnings   ends    SUR         Recall
REG          REG     0.57        0.78        0.58
ML           REG     0.65        0.90        0.67
ML           ML      0.59        0.80        0.61

Table 1: Retrieval results on the human transcripts.

Segmentation of      Normalized  Normalized  F-measure
beginnings   ends    SUR         Recall
REG          REG     0.87        1.19        0.90
ML           REG     0.70        1.00        0.72
ML           ML      0.65        0.90        0.67

Table 2: Retrieval results on the ASR transcripts.

In the experiments utilizing the human transcripts, the ML-based segmentation outperforms the regular segmentation. However, in the experiments with the ASR transcripts, the regular segmentation wins. In both cases, the ML-based segmentation searching only for segment beginnings outperforms the ML-based segmentation searching for entire segments.

In the overall results, the ASR transcripts surprisingly outperform the human transcripts. This is probably caused by the approximation of word timing and duration in the human transcripts: in the ASR transcripts, we are able to determine precise segment beginning and end times, but the times in the human transcripts are inaccurate.
4. CONCLUSIONS AND FUTURE WORK

The overall best result is achieved using regular segmentation on the ASR transcripts. For the human transcripts, however, the proposed ML-based segmentation outperformed the regular segmentation, which is very promising, and we will attempt to project these results into experiments using the ASR transcripts. In our future work, we would also like to employ a joint model for the identification of both segment beginnings and segment ends.

5. ACKNOWLEDGMENTS

This research is supported by the Charles University Grant Agency (GA UK n. 920913) and the Czech Science Foundation (grant n. P103/12/G084).

6. REFERENCES

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
[2] M. Eskevich, G. J. Jones, R. Aly, R. J. Ordelman, S. Chen, D. Nadeem, C. Guinaudeau, G. Gravier, P. Sébillot, T. de Nies, P. Debevere, R. V. de Walle, P. Galuščáková, P. Pecina, and M. Larson. Multimedia information seeking through search and hyperlinking. In Proc. of ICMR, pages 287–294, Dallas, Texas, USA, 2013.
[3] P. Galuščáková and P. Pecina. CUNI at MediaEval 2012 Search and Hyperlinking Task. In MediaEval 2012 Workshop, volume 927, Pisa, Italy, 2012.
[4] M. A. Hearst. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33–64, 1997.
[5] M. Spousta. Featurama – a library that implements various sequence-labeling algorithms. http://sourceforge.net/projects/featurama/.
[6] Terrier IR Platform. An open source search engine. http://terrier.org/.
[7] N. G. Ward, S. D. Werner, D. G. Novick, E. E. Shriberg, C. Oertel, L.-P. Morency, and T. Kawahara. The Similar Segments in Social Speech Task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.