=Paper= {{Paper |id=Vol-1263/paper31 |storemode=property |title=DCU Search Runs at MediaEval 2014 Search and Hyperlinking |pdfUrl=https://ceur-ws.org/Vol-1263/mediaeval2014_submission_31.pdf |volume=Vol-1263 |dblpUrl=https://dblp.org/rec/conf/mediaeval/RaccaEJ14 }} ==DCU Search Runs at MediaEval 2014 Search and Hyperlinking== https://ceur-ws.org/Vol-1263/mediaeval2014_submission_31.pdf
                             DCU Search Runs
                 at MediaEval 2014 Search and Hyperlinking

                               David N. Racca, Maria Eskevich, Gareth J.F. Jones
                                         CNGL Centre for Global Intelligent Content
                                                 School of Computing
                                              Dublin City University, Ireland
                                 {dracca, meskevich, gjones}@computing.dcu.ie

ABSTRACT

We describe Dublin City University (DCU)'s participation in the Search sub-task of the Search and Hyperlinking Task at MediaEval 2014. Exploratory experiments were carried out to investigate the utility of prosodic prominence features in the task of retrieving relevant video segments from a collection of BBC videos. Normalised acoustic correlates of loudness, pitch, and duration were incorporated in a standard TF-IDF weighting scheme to increase the weights of terms that were prominent in speech. Prosodic models outperformed a text-based TF-IDF baseline on the training set but failed to surpass the baseline on the test set.

1. INTRODUCTION

Increasing amounts of multimedia content are being produced and stored on a daily basis. In order to make this data useful, computer applications are required that facilitate search, browsing, and navigation through these large data collections. The MediaEval Search and Hyperlinking task seeks to contribute to addressing this problem.

In contrast with previous years, where a known-item task was examined, this year an ad-hoc search task was introduced. The retrieval collection is an extension of last year's collection, comprising 4021 hours of BBC TV broadcast content split into training and test sets of 1335 and 2686 hours respectively. For every video file in the collection, the organizers provided human-generated subtitles, three different automatic speech recognition (ASR) transcripts (LIMSI/Vocapia, LIUM, and NST-Sheffield), prosodic features, shot boundaries, visual concept detection output, and additional metadata associated with each TV show. The training set includes 50 text queries, while the test set comprises 30 queries. More details about the data collection and the task evaluation metrics can be found in [4].

Previous research has demonstrated that prosodic information is useful for a wide range of speech processing tasks [6], including speech search tasks. In [2], Crestani suggests that there might be a direct relationship between the acoustic stress of terms and their TF-IDF score in the OGI Stories Corpus, while Chen reports improvements on a spoken document retrieval task by using energy and durational features [1]. In [5], Guinaudeau and Hirschberg improve a topic tracking system by incorporating intensity and pitch values into the retrieval weighting scheme.

This paper describes an implementation of an approach that incorporates loudness, duration, and pitch into TF-IDF weights in order to examine their potential to improve the retrieval effectiveness of video segments.

2. FEATURE PROCESSING

Following Guinaudeau's method [5], loudness and pitch correlates were extracted from the speech signal, normalised, and aligned to each word occurrence in the transcripts. To perform this alignment, word timestamps were used in the case of the LIMSI/Vocapia and NST-Sheffield transcripts. For the subtitles, word timestamps had to be approximated from each segment's starting and ending timestamps. This was done by dividing the length of a segment by the number of words it contains to obtain the average word duration for that segment. The starting times and durations of words were then approximated as the starting time of the segment plus multiples of its average word duration. In the case of the LIUM transcript, word durations for the test set were approximated by the average word duration of all words in the training set.
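As an illustration of this approximation, the following short Python sketch computes per-word start times and durations from a subtitle segment's boundaries. It is a minimal sketch of the procedure described above, not the actual task code; the function name and input layout are assumptions.

def approximate_word_times(segment_start, segment_end, words):
    """Approximate per-word timestamps for one subtitle segment.

    The average word duration is the segment length divided by the number
    of words; word i is assumed to start at the segment start plus i times
    that average duration.
    """
    if not words:
        return []
    avg_duration = (segment_end - segment_start) / len(words)
    return [(word, segment_start + i * avg_duration, avg_duration)
            for i, word in enumerate(words)]

# Example: a 3-second subtitle segment containing four words.
print(approximate_word_times(12.0, 15.0, ["the", "cat", "sat", "down"]))
# -> [('the', 12.0, 0.75), ('cat', 12.75, 0.75), ('sat', 13.5, 0.75), ('down', 14.25, 0.75)]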
After the alignment was performed, the minimum, maximum, mean, and standard deviation of loudness and pitch were computed for each word. These four statistics were normalised in order to be comparable with those of other words spoken in different acoustic conditions. The final objective was to calculate an acoustic score for each spoken word that represents how salient the word is relative to its surrounding context. With this in mind, two different definitions of the surrounding context of a word were considered:

  • Context given by the words that belong to the same speech segment predicted by the ASR (seg).
  • Full length of the document, that is, all the words spoken in the video (doc).

Finally, two normalisation functions were explored for normalising a feature f_i over a context C (both are sketched in code after this list):

  1. Range: (f_i − min_C) / (max_C − min_C).
  2. Z-score: (f_i − µ_C) / σ_C.
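To make the two normalisation options concrete, here is a minimal Python sketch. It assumes that each word contributes a single feature value (for example, its maximum loudness) and that a context is simply the list of such values within the seg or doc scope; the function and variable names are illustrative rather than taken from the actual system.

import statistics

def range_normalise(value, context):
    """Range normalisation of a feature value over its context."""
    lo, hi = min(context), max(context)
    if hi == lo:
        return 0.0  # degenerate context: all values identical
    return (value - lo) / (hi - lo)

def zscore_normalise(value, context):
    """Z-score normalisation of a feature value over its context."""
    mu = statistics.mean(context)
    sigma = statistics.pstdev(context)
    if sigma == 0:
        return 0.0
    return (value - mu) / sigma

# seg context: the feature values of the words in one ASR speech segment;
# doc context: the feature values of all words spoken in the video.
seg_context = [0.2, 0.5, 0.9, 0.4]
print(range_normalise(0.9, seg_context))   # -> 1.0
print(zscore_normalise(0.9, seg_context))  # -> approx. 1.57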
3. RETRIEVAL FRAMEWORK

Text transcripts were segmented into fixed-time adjacent (non-overlapping) segments of 90 seconds duration. Before indexing, stop words from the standard Terrier list [7] were removed and Porter stemming was applied. Segments were then indexed using a modified version of Terrier-3.5 that associates acoustic features with term occurrences in the inverted index. Note that, due to stemming, multiple words can be mapped to the same stem. In these cases, the acoustic feature vectors associated with each non-stemmed word occurrence in a segment were treated as belonging to the same stem and were thus linked with that term in the inverted index.
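The fixed-window segmentation step can be sketched as follows, assuming the transcript is available as a list of (word, start time) pairs. The 90-second window follows the setup above; the data layout and function name are illustrative assumptions.

def segment_transcript(words, window=90.0):
    """Group time-stamped words into adjacent, non-overlapping windows.

    'words' is a list of (token, start_time_in_seconds) pairs; each output
    segment collects the tokens whose start time falls into one window.
    """
    segments = {}
    for token, start in words:
        index = int(start // window)  # 0 for [0, 90), 1 for [90, 180), ...
        segments.setdefault(index, []).append(token)
    return [segments[i] for i in sorted(segments)]

# Example: the third word starts after 90 seconds and opens a new segment.
print(segment_transcript([("hello", 1.2), ("world", 75.0), ("again", 91.5)]))
# -> [['hello', 'world'], ['again']]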
Retrieval was performed using Terrier's standard implementation of the vector space model (VSM) with a modified TF-IDF weighting function that takes into account the acoustic features stored in the inverted index when computing term weights. The weight of a term t in a segment was computed using Guinaudeau's harmonic mean [5]:

    w(t) = (θ_ir · idf_t · tf_t + θ_ac · ac_t) / (θ_ir + θ_ac)

Different definitions were explored for the acoustic score ac_t. In all cases, ac_t was intended to represent the level of salience of t within its surrounding context. In particular, the product of the maximum loudness and the maximum pitch (G-lp), the pitch range given by the maximum and minimum pitch (G-pr), and the maximum duration with which t was pronounced in the segment (G-dur) were used as definitions for ac_t. Values for the free parameters θ_ir and θ_ac were selected to optimise mean reciprocal rank (MRR), mean generalised average precision (mGAP), and mean average segment precision (MASP) [3] on the training set for each individual ASR transcript and normalisation type. Specifically, the runs G-lp, G-pr, and G-dur were optimised for the LIUM, NST-Sheffield, and LIMSI transcripts respectively.
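The sketch below illustrates how such a combined weight could be computed for a single term. The acoustic-score definitions follow the G-lp, G-pr, and G-dur descriptions above, but the exact formulas used in the submitted runs are assumptions here; the θ values in the example call are taken from the G-pr rows of Table 1.

def acoustic_score(run, max_loudness, max_pitch, min_pitch, max_duration):
    """Acoustic salience score ac_t for one of the three prosodic runs."""
    if run == "G-lp":   # product of maximum loudness and maximum pitch
        return max_loudness * max_pitch
    if run == "G-pr":   # pitch range between maximum and minimum pitch
        return max_pitch - min_pitch
    if run == "G-dur":  # maximum duration with which the term was spoken
        return max_duration
    raise ValueError("unknown run: " + run)

def term_weight(tf, idf, ac, theta_ir=1.0, theta_ac=0.0):
    """Combined weight w(t) = (theta_ir*tf*idf + theta_ac*ac) / (theta_ir + theta_ac).

    With theta_ac = 0 this reduces to plain TF-IDF, i.e. the baseline runs."""
    return (theta_ir * tf * idf + theta_ac * ac) / (theta_ir + theta_ac)

# Example: a G-pr run with theta_ir = 2 and theta_ac = 3.
ac = acoustic_score("G-pr", max_loudness=0.8, max_pitch=1.3, min_pitch=0.4, max_duration=0.5)
print(term_weight(tf=3, idf=2.1, ac=ac, theta_ir=2.0, theta_ac=3.0))  # -> approx. 3.06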
Transcript      Norm. function   Norm. context   Weighting scheme   θ_ir   θ_ac   MAP     MAP-bin   MAP-tol   P@5     P@10    P@20
Subtitles       -                -               TF-IDF             1      0      0.639   0.394     0.293     0.727   0.627   0.478
Subtitles       range            seg             G-pr               2      3      0.599   0.371     0.278     0.727   0.583   0.452
Subtitles       z-score          doc             G-lp               3      1      0.533   0.336     0.265     0.633   0.570   0.433
Subtitles       -                -               G-dur              1      3      0.345   0.206     0.154     0.367   0.327   0.280
NST-Sheffield   -                -               TF-IDF             1      0      0.440   0.295     0.215     0.600   0.513   0.393
NST-Sheffield   range            seg             G-pr               2      3      0.434   0.294     0.214     0.613   0.510   0.393
NST-Sheffield   z-score          doc             G-lp               3      1      0.435   0.294     0.218     0.587   0.503   0.393
NST-Sheffield   -                -               G-dur              1      3      0.404   0.272     0.199     0.567   0.457   0.363
LIMSI           -                -               TF-IDF             1      0      0.525   0.339     0.242     0.620   0.543   0.430
LIMSI           range            seg             G-pr               2      3      0.508   0.331     0.239     0.607   0.543   0.423
LIMSI           z-score          doc             G-lp               3      1      0.428   0.283     0.201     0.460   0.457   0.370
LIMSI           -                -               G-dur              1      3      0.505   0.330     0.237     0.607   0.537   0.413
LIUM            -                -               TF-IDF             1      0      0.451   0.300     0.233     0.693   0.573   0.430
LIUM            range            seg             G-pr               2      3      0.444   0.293     0.222     0.653   0.527   0.418
LIUM            z-score          doc             G-lp               3      1      0.436   0.291     0.215     0.633   0.547   0.412
LIUM            -                -               G-dur              1      3      0.358   0.240     0.186     0.540   0.453   0.357

Table 1: Evaluation results over the test set. Overlap MAP (MAP), binned MAP (MAP-bin), and tolerance-to-irrelevance MAP (MAP-tol) are shown in the Results columns. Precision at the different cut-offs (P@5, P@10, P@20) is based on a "tolerance to irrelevance" definition of relevance.


4. RESULTS AND CONCLUSIONS

Table 1 shows the details of the submitted runs and presents a summary of the evaluation results over the test set for every type of transcript. Runs that made use of prosodic information for computing term weights, namely G-pr, G-lp, and G-dur, clearly underperformed the baseline TF-IDF system in general. Given that the baseline system can be beaten on the training set, the results over the test set suggest that models optimised on MRR, mGAP, and MASP under a known-item evaluation scheme, as is the case for the training set, fail to generalise to an ad-hoc retrieval task performed over the test set.

It is important to note here that the text queries for the training and test sets were produced with different objectives in mind. This could be another reason why the models presented in this work seem to have overfitted the training set.

In future work, an error analysis will be carried out in order to identify queries for which the prosodic-based models could have outperformed the baseline.

5. ACKNOWLEDGMENTS

This work was supported by Science Foundation Ireland (Grant 12/CE/I2267) as part of the Centre for Global Intelligent Content (CNGL II) project at DCU.

6. REFERENCES

[1] B. Chen, H.-M. Wang, and L.-S. Lee. Improved spoken document retrieval by exploring extra acoustic and linguistic cues. In Proceedings of Interspeech'01, pages 299–302, Aalborg, Denmark, 2001.
[2] F. Crestani. Towards the use of prosodic information for spoken document retrieval. In Proceedings of ACM SIGIR'01, pages 420–421, New Orleans, LA, USA, 2001.
[3] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, Barcelona, Spain, 2013.
[4] M. Eskevich, R. Aly, D. N. Racca, R. Ordelman, S. Chen, and G. J. F. Jones. The search and hyperlinking task at MediaEval 2014. In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop, Barcelona, Spain, 2014.
[5] C. Guinaudeau and J. Hirschberg. Accounting for prosodic information to improve ASR-based topic tracking for TV broadcast news. In Proceedings of Interspeech'11, pages 1401–1404, Florence, Italy, 2011.
[6] J. Hirschberg. Communication and prosody: Functional aspects of prosody. Speech Communication, 36(1):31–43, 2002.
[7] I. Ounis, C. Lioma, C. Macdonald, and V. Plachouras. Research directions in Terrier: a search engine for advanced retrieval on the web. Novatica/UPGRADE Special Issue on Next Generation Web Search, pages 49–56, 2007.

Copyright is held by the author/owner(s).
MediaEval 2014 Workshop, October 16-17, 2014, Barcelona, Spain