DCU at MediaEval 2011: Rich Speech Retrieval (RSR)

Maria Eskevich, CDVP, School of Computing, Dublin City University, Dublin 9, Ireland, meskevich@computing.dcu.ie
Gareth J. F. Jones, CDVP & CNGL, School of Computing, Dublin City University, Dublin 9, Ireland, gjones@computing.dcu.ie

Copyright is held by the author/owner(s). MediaEval 2011 Workshop, September 1-2, 2011, Pisa, Italy.

ABSTRACT
We describe our runs and results for the Rich Speech Retrieval (RSR) Task at MediaEval 2011. Our runs examine the use of alternative segmentation methods on the provided ASR transcripts to locate the beginning of the topic, assuming that this will capture or get close to the starting point of the relevant segment; the combination of various types of queries and the weighting of metadata to move the relevant segment higher in the ranked list; and different ASR transcripts to compare the influence of ASR transcript quality. Our results show that newer versions of the transcripts and the use of metadata produce better results on average. So far we have not used information about the illocutionary act type corresponding to each query, but analysis of the retrieval results shows differences in behaviour for queries associated with certain classes of act.

Categories and Subject Descriptors
H.3 [Information Storage and Retrieval]: H.3.1 Content Analysis and Indexing; H.3.3 Information Search and Retrieval; H.3.4 Systems and Software

General Terms
Measurement, Experimentation

Keywords
Speech search, information retrieval, automatic speech recognition

1. INTRODUCTION
The Rich Speech Retrieval (RSR) Task at MediaEval 2011 seeks to open discussion of a new task in the search of spoken content. The information to be found has a special feature: a certain speaker's intention, or illocutionary act (http://en.wikipedia.org/wiki/Speech_acts). This new way of setting the problem of speech search raises the question of the uniformity of the structures of naturally produced queries for different speech acts, and of how belonging to a certain type of act affects retrieval behaviour. This dataset contains 5 basic speech acts: 'apology', 'definition', 'opinion', 'promise' and 'warning'. Two of these ('definition' and 'opinion') are more neutral and appear more as simple textual requests for information, while the others are more emotional and subjective, and therefore less similar to the usual textual query style. A full description of the task can be found in [3]. The official metric of the RSR task, mGAP, was used to evaluate our results; it reflects how close the predicted jump-in point of the run result is to the manual ground truth within a certain window. The following sections summarise our methods and results.

2. APPROACH DESCRIPTION
The videos in the data set are diverse in their structure, style of language and length. Both ASR transcripts and confusion networks are provided for all videos. This information can be used as input for the retrieval process. We treated both the 2010 transcripts and the 2011 confusion networks in the same way: creating clean text out of the words and punctuation from the transcripts. The next step was to preprocess the data for retrieval. We first automatically segmented the data into topically coherent segments. For this we examined the use of two existing text segmentation algorithms: C99 [1] and TextTiling [2].

Most videos in the collection are accompanied by metadata relating to the whole video, regardless of its length or the number of topics discussed. This metadata tag information was added once ('m1') or 5 times ('m5', to give it more weight) to all of the segments in the file. Segment indexing and retrieval were carried out using the Lemur Indri toolkit (http://www.lemurproject.org/). As queries we used only the naturally formulated full query ('title'), the short query similar to a query for an internet search engine ('google'), and the combination of both ('title + google'). For these experiments, the starting time of the segment was selected as the jump-in point in the results.
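As an illustration of the indexing step just described, the following minimal sketch shows how transcript segments can be wrapped as Indri TRECTEXT documents with the whole-video metadata prepended once ('m1') or five times ('m5'), and how the 'title' and 'google' query variants can be merged into a single Indri query. The file names, the fixed-length stand-in segmenter, and the helper names are illustrative assumptions only; the actual runs segmented the transcripts with C99 or TextTiling and used the Lemur Indri toolkit for indexing and retrieval.

# Sketch of segment document construction (hypothetical file names and helpers;
# a fixed-length window stands in for the C99/TextTiling segmentation).

def segment_transcript(words, seg_len=200):
    # A real run would use C99 or TextTiling; fixed windows are only a stand-in.
    return [" ".join(words[i:i + seg_len]) for i in range(0, len(words), seg_len)]

def build_trectext(video_id, segments, metadata, metadata_copies=5):
    # Emit one TRECTEXT document per segment, prepending the whole-video
    # metadata once ('m1') or five times ('m5') to raise its term weight.
    boosted_meta = " ".join([metadata] * metadata_copies)
    docs = []
    for i, seg in enumerate(segments):
        docs.append("<DOC>\n<DOCNO>%s_seg%d</DOCNO>\n<TEXT>\n%s\n%s\n</TEXT>\n</DOC>"
                    % (video_id, i, boosted_meta, seg))
    return "\n".join(docs)

def indri_query(title, google):
    # 'title + google' runs: merge both query variants into one #combine query.
    return "#combine(%s %s)" % (title, google)

if __name__ == "__main__":
    words = open("video123.asr.txt").read().split()              # cleaned ASR text
    meta = open("video123.meta.txt").read().replace("\n", " ")   # whole-video metadata
    print(build_trectext("video123", segment_transcript(words), meta, metadata_copies=5))
    print(indri_query("how does the speaker apologise for the delay", "apology delay"))

The 'm1' runs correspond to metadata_copies=1; the resulting TRECTEXT file can then be indexed and queried with the standard Indri tools (IndriBuildIndex, IndriRunQuery).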
3. RESULTS
Table 1 shows the results of our runs. As could have been anticipated, a larger window size gives better scores, since more of the results have a non-zero GAP; more complicated queries ('title + google') make the request for information more detailed, and consequently relevant segments are found better; and the addition of metadata, especially the allocation of more weight to the metadata, can overcome the problem of some keywords being misrecognized or not uttered at all in the segment, and therefore improves the overall results.

The confusion networks provided for the 2011 dataset have the restriction that the second variant is reported only if its confidence measure is higher than 50%; in most cases this second variant is either the same word written with a capital letter or another grammatical form of the same word. Since we were taking all the words from the confusion networks to prepare our text, these variants do not bring new terms into the document, but increase the weight of the term that has multiple entries.

Table 1: mGAP results on the test set

Transcript type | Segmentation type | Metadata used | Query type     | Window size | Granularity | mGAP
2011            | tt                | + (5)         | title + google | 60          | 10          | 0.2043
2011            | c99               | + (5)         | title + google | 60          | 10          | 0.1622
2011            | c99               | + (1)         | title + google | 60          | 10          | 0.1603
2011            | tt                | + (5)         | title + google | 30          | 10          | 0.1394
2010            | c99               | –             | title          | 60          | 10          | 0.1344
2011            | c99               | + (5)         | title + google | 30          | 10          | 0.1193
2011            | c99               | + (1)         | title + google | 30          | 10          | 0.1192
2010            | c99               | –             | title          | 30          | 10          | 0.1078
2011            | c99               | –             | google         | 60          | 10          | 0.1068
2011            | c99               | –             | google         | 30          | 10          | 0.0686
2011            | tt                | + (5)         | title + google | 10          | 10          | 0.0646
2011            | c99               | + (5)         | title + google | 10          | 10          | 0.0554
2011            | c99               | + (1)         | title + google | 10          | 10          | 0.0554
2010            | c99               | –             | title          | 10          | 10          | 0.0542
2011            | c99               | –             | google         | 10          | 10          | 0.0061
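To make the effect of the window and granularity settings in Table 1 concrete, the sketch below shows one possible graded reward for a predicted jump-in point: full credit at the ground-truth start, decreasing in steps of the granularity, and no credit outside the window. The linear decay and the function names are assumptions for illustration only; the official mGAP evaluation script defines the exact reward and how it is averaged over the ranked results.

# Illustrative (assumed linear) window-based reward behind mGAP; the official
# task script defines the exact form.

def jump_in_reward(predicted_s, ground_truth_s, window_s=60, granularity_s=10):
    # Graded relevance of a predicted jump-in point, in [0, 1].
    distance = abs(predicted_s - ground_truth_s)
    if distance > window_s:
        return 0.0                               # outside the window: no credit
    steps_away = distance // granularity_s       # each granularity step costs credit
    max_steps = window_s // granularity_s
    return 1.0 - steps_away / (max_steps + 1.0)

if __name__ == "__main__":
    # Window 60 s, granularity 10 s, true jump-in point at 120 s:
    for predicted in (120, 105, 65, 190):
        print(predicted, round(jump_in_reward(predicted, 120), 3))

With a reward of this shape, any prediction that lands inside the larger window still contributes to GAP, which is consistent with the larger window sizes in Table 1 producing higher mGAP scores.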
4. ILLOCUTIONARY ACT BREAKDOWN
mGAP over all the queries shows the average performance of a specific combination of system parameters, but it is also interesting to look into the results of the same combinations broken down by illocutionary act type. When simple queries ('title') are used on the 2010 transcript not enriched with metadata information, the results fall into two classes: 'definition' and 'opinion' have scores of the same level for window size 60, while the three other act types have significantly lower scores. In the case of the other simple query type ('google'), the difference between speech act types is not so distinct; however, with the small number of queries for certain types (only 1 for 'apology'), it is hard to argue whether the query type or the dataset itself is the reason for the results achieved.

Our runs enable us to compare the effect of using metadata with different weights (2011 c99 m5 title and google vs. 2011 c99 m1 title and google). In general the 'm1' run has lower scores than the 'm5' run, but in reality the scores are the same for all window sizes for 'apology', 'definition' and 'promise', and higher for 'm5' for 'opinion' and 'warning'.

5. CONCLUSIONS
This investigation has shown that queries with several dimensions, which not only request specific data in the transcript but also a certain emotion or illocution related to it, have to be treated in different ways depending on the type of the speech act. When the illocution is less neutral, more data needs to be combined in order to find the relevant segments. While the distribution of the illocutionary acts in the query set models real life, perhaps we need to create more queries of the specific less popular types in order to develop better ways of processing the different query types.

Preliminary experiments suggested C99 to be the better algorithm for segmenting the data, hence more runs were submitted with C99. However, the results from the full runs show that TextTiling can outperform C99, so more runs with different combinations of transcripts and queries will be carried out in further work.

6. FUTURE WORK
In future work we plan to compare all the possible combinations of query types, use of metadata and transcript segmentation in order to demonstrate our results more solidly. Segmentation algorithms that have been developed for other types of spoken content (e.g. meetings, broadcast news) can be applied to the data in order to examine alternative ways of splitting the transcripts into search units. Since so far we have taken the beginning of the segment as the jump-in point, another potential research direction may be to postprocess the retrieved segment to move the assigned jump-in point closer to the manually assigned position.

7. ACKNOWLEDGMENTS
This work is funded by a grant under the Science Foundation Ireland Research Frontiers Programme 2008, Grant No. 08/RFP/CMS1677.

8. REFERENCES
[1] F. Y. Y. Choi. Advances in domain independent linear text segmentation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 26–33, 2000.
[2] M. Hearst. TextTiling: A quantitative approach to discourse segmentation. Technical Report Sequoia 93/24, Computer Science Department, University of California, Berkeley, 1993.
[3] M. Larson, M. Eskevich, R. Ordelman, C. Kofler, S. Schmiedeke, and G. J. F. Jones. Overview of MediaEval 2011 Rich Speech Retrieval Task and Genre Tagging Task. In Proceedings of the MediaEval 2011 Workshop, 2011.