CUNI at MediaEval 2013 Similar Segments in Social Speech Task

Petra Galuščáková and Pavel Pecina
Charles University in Prague, Faculty of Mathematics and Physics
Institute of Formal and Applied Linguistics
Prague, Czech Republic
{galuscakova,pecina}@ufal.mff.cuni.cz

Copyright is held by the author/owner(s).
MediaEval 2013 Workshop, October 18-19, 2013, Barcelona, Spain

ABSTRACT

We describe our experiments for the Similar Segments in Social Speech Task at the MediaEval 2013 Benchmark. We mainly focus on segmentation of the recordings into shorter passages, to which we apply standard retrieval techniques. We experiment with machine-learning-based segmentation employing textual (word n-grams, tag n-grams, letter cases, lexical cohesion, etc.) and prosodic features (silence) and compare the results with those obtained by regular segmentation.

1. INTRODUCTION

The main aim of the Similar Segments in Social Speech Task is to find segments similar to given ones (query segments) in a collection of audio-visual recordings containing English dialogues of a university student community. In addition to the human and automatic (ASR) transcripts (both given separately for each speaker), the collection also contains prosodic features and metadata. The training data consists of segments manually assigned to similarity sets of the query segments. The details of the task and data are described in the task description [7].

2. APPROACH DESCRIPTION

In our experiments, the queries are created from the human transcripts of the query segments. The recordings are segmented into overlapping passages (identified by their starting and ending times), which are then indexed using the Terrier IR Platform [6]. The set of potential jump-in points needed in retrieval then consists of the known beginnings of the acquired segments.
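The segmentation into overlapping passages can be sketched as a simple sliding window over each recording. The function below is an illustrative reconstruction (the function name and signature are ours, not from the paper), using the 50-second window and 25-second shift of the regular segmentation described in Section 2.2:

```python
def sliding_passages(duration, length=50.0, shift=25.0):
    """Split a recording of `duration` seconds into equilong,
    overlapping passages given as (start, end) times.  The passage
    starts double as the potential jump-in points for retrieval."""
    passages = []
    start = 0.0
    while start < duration:
        passages.append((start, min(start + length, duration)))
        start += shift
    return passages

# A 120-second recording yields a new 50 s passage every 25 s:
print(sliding_passages(120))
# [(0.0, 50.0), (25.0, 75.0), (50.0, 100.0), (75.0, 120.0), (100.0, 120.0)]
```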
For the indexing, we use the default settings, which outperformed our most successful setting from previous experiments in the Search and Hyperlinking MediaEval Benchmark [3]. We remove stopwords and apply stemming using the Porter stemmer. Ranked lists of retrieved segments are pruned by removing segments overlapping with those ranked higher.

As both transcripts are given in separate tracks for each speaker, we join these tracks into a single one. In the human transcripts, we sort the sentences from both tracks according to their beginnings to acquire a single sequential transcript. Similarly, we sort the speakers' segments given in the ASR transcripts. While in the ASR transcripts the exact playback time is given for each word, in the human transcripts such information is available only at the sentence level, and we therefore approximate it by assuming equal duration of the words in a sentence.

2.1 Query processing

The query segments are specified by their starting and ending times. The queries are constructed by including all words lying within the boundaries of the query segment in both tracks.

We tried to expand the queries by adding words appearing in the vicinity of the query segment (allowing ±5, ±10, ±15, ±20, ±30, and ±60 seconds), but none of these experiments improved the results.

We also attempted to generate the queries from both the human and ASR transcripts and apply them to search in both types of transcripts. The queries created from the human transcripts achieved higher scores when applied to both the human and ASR transcripts; they are therefore used in the experiments presented in this paper.
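The query construction and the equal-duration timing approximation can be sketched together as follows. This is only an illustration under our own assumptions about the data layout (sentences as (start, end, words) triples); the function names are ours:

```python
def word_times(sentence_start, sentence_end, words):
    """Approximate per-word start times by assuming equal word
    duration within a sentence (the human transcripts give times
    only at the sentence level)."""
    step = (sentence_end - sentence_start) / len(words)
    return [(sentence_start + i * step, w) for i, w in enumerate(words)]

def build_query(sentences, seg_start, seg_end):
    """Collect all words whose (approximated) time lies within
    the query segment boundaries."""
    query = []
    for s_start, s_end, words in sentences:
        for t, w in word_times(s_start, s_end, words):
            if seg_start <= t <= seg_end:
                query.append(w)
    return " ".join(query)

sentences = [(0.0, 4.0, ["so", "how", "was", "your", "exam"]),
             (4.0, 6.0, ["it", "went", "fine"])]
print(build_query(sentences, 2.0, 5.0))
# your exam it went
```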
2.2 Segmentation

In this work, we mainly focus on segmentation of the recordings, which appears to be crucial for segment retrieval [2]. We experiment with regular segmentation and two methods based on (supervised) machine learning (ML).

In regular segmentation, the recordings are divided into equilong segments of 50 seconds (which is approximately equal to the average segment length in the collection). The shift between the segments (and thus the overlap) is also regular, set to 25 seconds, since according to our experience from the 2012 Search and Hyperlinking Task, a shift of 10 to 30 seconds achieves optimal results [2].

In the first ML approach, we identify segment boundaries using classification trees [1], implemented in the rpart library in R. For each word in the transcripts, we assume that it belongs to a segment and detect whether it is followed by a segment boundary or whether the segment continues. The class distribution in this task (segment boundary vs. segment continuation) is highly unbalanced, and the corresponding weights must be set accordingly to prevent too short segments. We set the weight of a segment boundary misclassified as segment continuation in the loss matrix to 21, the weight of a segment continuation misclassified as a segment boundary to 11, and the complexity parameter to 0.
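The effect of such a loss matrix can be illustrated by the minimum-expected-loss decision rule it induces at a tree leaf. The sketch below is not the rpart implementation, only a Python illustration of how the asymmetric weights (21 vs. 11) lower the probability threshold for predicting the rare boundary class; all names are ours:

```python
def min_risk_class(p_boundary, loss_fn=21.0, loss_fp=11.0):
    """Cost-sensitive decision: predict 'boundary' when the expected
    loss of predicting 'continuation' (missing a true boundary,
    weight 21) exceeds the expected loss of predicting 'boundary'
    (a spurious boundary, weight 11)."""
    risk_continuation = p_boundary * loss_fn        # cost of a missed boundary
    risk_boundary = (1.0 - p_boundary) * loss_fp    # cost of a false boundary
    return "boundary" if risk_continuation > risk_boundary else "continuation"

# With weights 21/11 the threshold is 11/32 = 0.344, well below 0.5:
print(min_risk_class(0.35))  # boundary
print(min_risk_class(0.30))  # continuation
```

With the more recall-oriented weights of the second approach (61 vs. 1), the same rule fires at a boundary probability of only about 1/62, which matches the aim of finding all possible segment beginnings.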
In the second ML approach, we apply a similar process to detect the beginnings of segments, which are then set to be 50 seconds long (naturally, the segments can overlap). In this case, we aim at a higher recall of the decision process to find all possible segment beginnings while still keeping the number of created segments reasonable. We set the weight of a segment boundary misclassified as segment continuation in the loss matrix to 61, the weight of a segment continuation misclassified as a segment boundary to 1, and the complexity parameter to 0.

For comparison, the classification models trained and tuned on the human transcripts are also applied to the ASR transcripts despite their mutual inconsistency. The transcripts differ in the length of silence (which is in the human transcripts only approximated as the duration between the imprecise word beginnings), tokenization, and letter capitalization. Our future plans therefore include training the classification model on the ASR transcripts as well.

2.3 Features

Our classification model exploits the following features: cue words and cue tags, letter cases, the length of the silence before the word, the division given in the transcripts, and the output of the TextTiling algorithm [4].

The cue words are words that appear frequently at segment boundaries and often do not carry special meaning. Based on the training data, we have identified words which frequently stand at the segment boundary and words which are the most informative for the segment boundary (the mutual information between these words and the segment boundary is high). We have also defined our own set of words which might occur at such a boundary. We created these sets for unigrams, bigrams, and trigrams, for both words and tags (obtained by the Featurama tagger [5]), and for both segment beginnings and ends. The occurrence of each n-gram is captured by a separate feature. An additional feature indicates whether at least one item from the set (n-grams of frequent words, informative words, and defined words, for either beginnings or ends) occurs.

As the TextTiling algorithm is based on calculating the similarity between adjacent regions, utilizing its output allows us to also employ lexical cohesion in our decision process.
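The lexical-cohesion signal behind TextTiling can be sketched as follows. This is a hedged simplification, not the actual algorithm from [4]: the word windows on either side of a candidate position are compared by cosine similarity of their term-frequency vectors, and low similarity suggests a topic boundary. All names and window sizes are our own:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cohesion(words, i, window=20):
    """Lexical cohesion at position i: similarity between the
    preceding and following word windows; low values hint at
    a segment boundary."""
    left = Counter(words[max(0, i - window):i])
    right = Counter(words[i:i + window])
    return cosine(left, right)

words = "we met at the gym ok so about the project deadline".split()
print(cohesion(words, 6, window=6))  # ≈ 0.18: little shared vocabulary
```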
3. RESULTS

We employ three automatic evaluation measures: Normalized Searcher Utility Ratio (SUR), Normalized Recall, and the F-measure (for details, see the task description [7]). The results for the various types of segmentation are displayed in Table 1 for the human transcripts and in Table 2 for the ASR transcripts.

Segmentation of      Normalized  Normalized  F-measure
beginnings   ends    SUR         Recall
REG          REG     0.57        0.78        0.58
ML           REG     0.65        0.90        0.67
ML           ML      0.59        0.80        0.61

Table 1: Retrieval results on the human transcripts.

Segmentation of      Normalized  Normalized  F-measure
beginnings   ends    SUR         Recall
REG          REG     0.87        1.19        0.90
ML           REG     0.70        1.00        0.72
ML           ML      0.65        0.90        0.67

Table 2: Retrieval results on the ASR transcripts.

In the experiments utilizing the human transcripts, the ML-based segmentation outperforms the regular segmentation. However, in the experiments with the ASR transcripts, the regular segmentation wins. In both cases, the ML-based segmentation searching only for segment beginnings outperforms the ML-based segmentation searching for entire segments.

In the overall results, the ASR transcripts surprisingly outperform the human transcripts. This is probably caused by the approximation of word timing and duration in the human transcripts: in the ASR transcripts, we are able to determine precise segment beginning and end times, but the times in the human transcripts are inaccurate.
4. CONCLUSIONS AND FUTURE WORK

The overall best result is achieved using regular segmentation on the ASR transcripts. For the human transcripts, however, the proposed ML-based segmentation outperformed the regular segmentation, which is very promising, and we will attempt to project these results into experiments using the ASR transcripts. In our future work, we would also like to employ a joint model for the identification of both segment beginnings and segment ends.

5. ACKNOWLEDGMENTS

This research is supported by the Charles University Grant Agency (GA UK n. 920913) and the Czech Science Foundation (grant n. P103/12/G084).

6. REFERENCES

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
[2] M. Eskevich, G. J. Jones, R. Aly, R. J. Ordelman, S. Chen, D. Nadeem, C. Guinaudeau, G. Gravier, P. Sébillot, T. de Nies, P. Debevere, R. V. de Walle, P. Galuščáková, P. Pecina, and M. Larson. Multimedia information seeking through search and hyperlinking. In Proc. of ICMR, pages 287–294, Dallas, Texas, USA, 2013.
[3] P. Galuščáková and P. Pecina. CUNI at MediaEval 2012 Search and Hyperlinking Task. In MediaEval 2012 Workshop, volume 927, Pisa, Italy, 2012.
[4] M. A. Hearst. TextTiling: Segmenting Text into Multi-paragraph Subtopic Passages. Computational Linguistics, 23(1):33–64, 1997.
[5] M. Spousta. Featurama – a library that implements various sequence-labeling algorithms. http://sourceforge.net/projects/featurama/.
[6] Terrier IR Platform. An open source search engine. http://terrier.org/.
[7] N. G. Ward, S. D. Werner, D. G. Novick, E. E. Shriberg, C. Oertel, L.-P. Morency, and T. Kawahara. The Similar Segments in Social Speech Task. In MediaEval 2013 Workshop, Barcelona, Spain, October 18-19, 2013.