DCU at MediaEval 2011: Rich Speech Retrieval (RSR)

Maria Eskevich, CDVP, School of Computing, Dublin City University, Dublin 9, Ireland, meskevich@computing.dcu.ie
Gareth J. F. Jones, CDVP & CNGL, School of Computing, Dublin City University, Dublin 9, Ireland, gjones@computing.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT
We describe our runs and results for the Rich Speech Retrieval (RSR) Task at MediaEval 2011. Our runs examine the use of alternative segmentation methods on the provided ASR transcripts to locate the beginning of the topic, assuming that this will capture or get close to the starting point of the relevant segment; the combination of various types of queries and the weighting of metadata to move the relevant segment higher in the ranked list; and different ASR transcripts to compare the influence of ASR transcript quality. Our results show that newer versions of the transcripts and the use of metadata produce better results on average. So far we have not used information about the illocutionary act type corresponding to each query, but analysis of the retrieval results shows differences in behaviour for queries associated with certain classes of act.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Measurement, Experimentation

Keywords
Speech search, information retrieval, automatic speech recognition

1. INTRODUCTION
The Rich Speech Retrieval (RSR) Task at MediaEval 2011 seeks to open discussion of a new task in the search of spoken content. The information to be found has a special feature: a certain speaker's intention, or illocutionary act (http://en.wikipedia.org/wiki/Speech_acts). This new way of setting the problem of speech search raises the question of the uniformity of the structures of naturally produced queries for different speech acts, and of how belonging to a certain type of act affects retrieval behaviour. This dataset contains 5 basic speech acts: 'apology', 'definition', 'opinion', 'promise' and 'warning'. Two of these ('definition' and 'opinion') are more neutral and appear more as simple textual requests for information, while the others are more emotional and subjective, and therefore less similar to the usual textual query style. A full description of the task can be found in [3]. The official metric of the RSR task, mGAP, was used to evaluate our results; it reflects how close the predicted jump-in point of the run result is to the manual ground truth within a certain window. The following sections summarise our methods and results.

2. APPROACH DESCRIPTION
The videos in the data set are diverse in their structure, style of language and length. Both ASR transcripts and confusion networks are provided for all videos. This information can be used as input for the retrieval process. We treated both the 2010 transcripts and the 2011 confusion networks in the same way: creating clean text out of the words and punctuation from the transcripts. The next step was to preprocess the data for retrieval. We first automatically segmented the data into topically coherent segments. For this we examined the use of two existing text segmentation algorithms: C99 [1] and TextTiling [2].

Most videos in the collection are accompanied by metadata relating to the whole video, regardless of its length or the number of topics discussed. This metadata tag information was added once ('m1') or 5 times ('m5', to give it more weight) to all of the segments in the file. Segment indexing and retrieval were carried out using the Lemur Indri toolkit (http://www.lemurproject.org/). As queries we used only the naturally formulated full query ('title'), the short query similar to a query for an internet search engine ('google'), and the combination of both ('title + google'). For these experiments, the starting time of the segment was selected as the jump-in point in the results.
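As an illustration of the indexing step just described, the following minimal sketch shows how transcript segments can be wrapped as Indri TRECTEXT documents with the whole-video metadata prepended once ('m1') or five times ('m5'), and how the 'title' and 'google' query variants can be merged into a single Indri query. The file names, the fixed-length stand-in segmenter, and the helper names are illustrative assumptions only; the actual runs segmented the transcripts with C99 or TextTiling and used the Lemur Indri toolkit for indexing and retrieval.

# Sketch of segment document construction (hypothetical file names and helpers;
# a fixed-length window stands in for the C99/TextTiling segmentation).

def segment_transcript(words, seg_len=200):
    # A real run would use C99 or TextTiling; fixed windows are only a stand-in.
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def build_trectext(video_id, segments, metadata, metadata_copies=5):
    # Emit one TRECTEXT document per segment, prepending the whole-video
    # metadata once ('m1') or five times ('m5') to raise its term weight.
    boosted_meta = " ".join([metadata] * metadata_copies)
    docs = []
    for i, seg in enumerate(segments):
        docs.append("<DOC>\n<DOCNO>%s_seg%d</DOCNO>\n<TEXT>\n%s\n%s\n</TEXT>\n</DOC>"
                    % (video_id, i, boosted_meta, seg))
    return "\n".join(docs)

def indri_query(title, google):
    # 'title + google' runs: merge both query variants into one #combine query.
    return "#combine(%s %s)" % (title, google)

if __name__ == "__main__":
    words = open("video123.asr.txt").read().split()              # cleaned ASR text
    meta = open("video123.meta.txt").read().replace("\n", " ")   # whole-video metadata
    print(build_trectext("video123", segment_transcript(words), meta, metadata_copies=5))
    print(indri_query("how does the speaker apologise for the delay", "apology delay"))

The 'm1' runs correspond to metadata_copies=1; the resulting TRECTEXT file can then be indexed and queried with the standard Indri tools (IndriBuildIndex, IndriRunQuery).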
3. RESULTS
Table 1 shows the results of our runs. As could have been anticipated, a larger window size gives better scores, since more of the results have a non-zero GAP; more complicated queries ('title + google') make the request for information more detailed, and consequently relevant segments are found better; and the addition of metadata, especially the allocation of more weight to the metadata, can overcome the problem of some keywords being misrecognized or not uttered at all in the segment, and therefore improves the overall results.

The confusion networks provided for the 2011 dataset have the restriction that the second variant is reported only if its confidence measure is higher than 50%; in most cases this second variant is either the same word written with a capital letter or another grammatical form of the same word. Since we were taking all the words from the confusion networks to prepare our text, these variants do not bring new terms into the document, but increase the weight of the term that has multiple entries.

Table 1: mGAP results on the test set

Transcript type | Segmentation type | Metadata used | Query type     | Window size | Granularity | mGAP
2011            | tt                | + (5)         | title + google | 60          | 10          | 0.2043
2011            | c99               | + (5)         | title + google | 60          | 10          | 0.1622
2011            | c99               | + (1)         | title + google | 60          | 10          | 0.1603
2011            | tt                | + (5)         | title + google | 30          | 10          | 0.1394
2010            | c99               | –             | title          | 60          | 10          | 0.1344
2011            | c99               | + (5)         | title + google | 30          | 10          | 0.1193
2011            | c99               | + (1)         | title + google | 30          | 10          | 0.1192
2010            | c99               | –             | title          | 30          | 10          | 0.1078
2011            | c99               | –             | google         | 60          | 10          | 0.1068
2011            | c99               | –             | google         | 30          | 10          | 0.0686
2011            | tt                | + (5)         | title + google | 10          | 10          | 0.0646
2011            | c99               | + (5)         | title + google | 10          | 10          | 0.0554
2011            | c99               | + (1)         | title + google | 10          | 10          | 0.0554
2010            | c99               | –             | title          | 10          | 10          | 0.0542
2011            | c99               | –             | google         | 10          | 10          | 0.0061
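To make the effect of the window and granularity settings in Table 1 concrete, the sketch below shows one possible graded reward for a predicted jump-in point: full credit at the ground-truth start, decreasing in steps of the granularity, and no credit outside the window. The linear decay and the function names are assumptions for illustration only; the official mGAP evaluation script defines the exact reward and how it is averaged over the ranked results.

# Illustrative (assumed linear) window-based reward behind mGAP; the official
# task script defines the exact form.

def jump_in_reward(predicted_s, ground_truth_s, window_s=60, granularity_s=10):
    # Graded relevance of a predicted jump-in point, in [0, 1].
    distance = abs(predicted_s - ground_truth_s)
    if distance > window_s:
        return 0.0                               # outside the window: no credit
    steps_away = distance // granularity_s       # each granularity step costs credit
    max_steps = window_s // granularity_s
    return 1.0 - steps_away / (max_steps + 1.0)

if __name__ == "__main__":
    # Window 60 s, granularity 10 s, true jump-in point at 120 s:
    for predicted in (120, 105, 65, 190):
        print(predicted, round(jump_in_reward(predicted, 120), 3))

With a reward of this shape, any prediction that lands inside the larger window still contributes to GAP, which is consistent with the larger window sizes in Table 1 producing higher mGAP scores.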
4. ILLOCUTIONARY ACT BREAKDOWN
mGAP over all the queries shows the average performance of a specific combination of system parameters, but it is also interesting to look into the results of the same combinations broken down by illocutionary act type. When simple queries ('title') are used on the 2010 transcript not enriched with metadata information, the results fall into two classes: 'definition' and 'opinion' have scores of the same level for window size 60, while the three other act types have significantly lower scores. In the case of the other simple query type ('google'), the difference between speech act types is not so distinct; however, with the small number of queries for certain types (only 1 for 'apology'), it is hard to argue whether the query type or the dataset itself is the reason for the results achieved.

Our runs enable us to compare the effect of using metadata with different weights (2011 c99 m5 title and google vs. 2011 c99 m1 title and google). In general the 'm1' run has lower scores than the 'm5' run, but in reality the scores are the same for all window sizes for 'apology', 'definition' and 'promise', and higher for 'm5' for 'opinion' and 'warning'.

5. CONCLUSIONS
This investigation has shown that queries with several dimensions, which not only request specific data in the transcript but also a certain emotion or illocution related to it, have to be treated in different ways depending on the type of the speech act. When the illocution is less neutral, more data needs to be combined in order to find the relevant segments. While the distribution of the illocutionary acts in the query set models real life, perhaps we need to create more queries of the specific less popular types in order to develop better ways of processing the different query types.

Preliminary experiments suggested C99 to be the better algorithm for segmenting the data, hence more runs were submitted with C99. However, the results from the full runs show that TextTiling can outperform C99, so more runs with different combinations of transcripts and queries will be carried out in further work.

6. FUTURE WORK
In future work we plan to compare all the possible combinations of query types, use of metadata and transcript segmentation in order to demonstrate our results more solidly. Segmentation algorithms that have been developed for other types of spoken content (e.g. meetings, broadcast news) can be applied to the data in order to examine alternative ways of splitting the transcripts into search units. Since so far we have taken the beginning of the segment as the jump-in point, another potential research direction may be to postprocess the retrieved segment to move the assigned jump-in point closer to the manually assigned position.

7. ACKNOWLEDGMENTS
This work is funded by a grant under the Science Foundation Ireland Research Frontiers Programme 2008, Grant No. 08/RFP/CMS1677.

8. REFERENCES
[1] F. Y. Y. Choi. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33, 2000.
[2] M. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical Report Sequoia 93/24, Computer Science Department, University of California, Berkeley, 1993.
[3] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. F. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In Proceedings of the MediaEval 2011 Workshop, 2011.