<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DCLab at MediaEval2014 Search and Hyperlinking Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zsombor Paróczi</string-name>
          <email>paroczi@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bálint Fodor</string-name>
          <email>fodor@aut.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
<string-name>Gábor Szűcs</string-name>
          <email>szucs@tmit.bme.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Inter-University Centre for Telecommunications and Informatics</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2014</year>
      </pub-date>
      <fpage>16</fpage>
      <lpage>17</lpage>
      <abstract>
        <p>The aim of the paper was to support the answer to a query with a ranked list of video segments (search sub-task) and to generate possible hyperlinks (in ranked order) to other video segments in the same collection that provide information about the found segments (linking sub-task). Our solution is based on concept enrichment, i.e. the set of words is extended with synonyms and other conceptually connected words. The other contribution is content mixing, i.e. combining all transcripts and the manual subtitles of the videos.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>
        Our paper is about a user who searches for different
segments of videos within a video collection that address a
certain topic of interest expressed in a query. If the user
finds the segments that are relevant to his initial information
need, he may wish to find additional information connected
to these segments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Our aims were to support the
answer to a query with a ranked list of documents (search
sub-task) and to generate a ranked list of video segments
in the same collection that provide information about the
found segments (linking sub-task). Both sub-tasks represent
an ad-hoc retrieval scenario and were evaluated by the organizers.
      </p>
      <p>
        We used the same collection of BBC videos as the source
for the test set. The BBC collection consists of
video keyframes, audio content, three sets of automatic speech
recognition (ASR) transcripts: LIMSI/Vocapia [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ,
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], LIUM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and NST/Sheffield [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ,
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], as well as one set of manual subtitles,
metadata and prosodic features [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. SYSTEM OVERVIEW</title>
      <p>During the tasks we developed a small system for
processing the data. Our solution is based solely on textual
analysis; we only used the subtitles and the ASR transcripts.
It has five distinct stages: data normalization (2.1), shot
cutting (2.2), concept enrichment (2.3), content mixing
(2.4), and indexing and retrieval (2.5).</p>
    </sec>
    <sec id="sec-3">
      <title>2.1 Normalization</title>
      <p>The data set was given in various forms, so the first step
was to normalize the data formats and to convert all data
to the same scale. We used the time dimension as the scale and
CSV as the common data format.</p>
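      <p>As an illustration, a normalized record could look like the following time-scaled CSV rows; the concrete column layout shown here is only an assumed example, with times given in seconds:</p>
      <preformat>
video_id,start_sec,end_sec,source,text
v20080401_news,12.48,15.90,LIMSI,"welcome to the evening news"
v20080401_news,15.90,19.20,subtitle,"Welcome to the evening news."
      </preformat>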
    </sec>
    <sec id="sec-4">
      <title>2.2 Shot cutting</title>
      <p>Since in the data set each file represented a whole
television program and we wanted to work on 'shot' level,
we created a tool that, based on the provided 'scenecut'
description, cuts each input data file into shots. Using this
method we created more than 300,000 small files, each
representing one shot with only one metric (like the LIMSI
transcript).</p>
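      <p>A minimal sketch of this cutting step is given below; it assumes the time-scaled CSV layout illustrated in 2.1 and a 'scenecut' description already parsed into a sorted list of shot start times (function and column names are hypothetical):</p>
      <preformat>
# Sketch of the shot-cutting tool: assign each time-stamped transcript row to
# the shot whose start time is the closest preceding scene-cut boundary, and
# write one small CSV file per shot.
import csv
from bisect import bisect_right

def cut_into_shots(transcript_csv, shot_starts, out_prefix):
    """shot_starts: sorted list of shot start times in seconds ('scenecut')."""
    shots = {}  # shot index -> list of rows
    with open(transcript_csv, newline="") as f:
        for row in csv.DictReader(f):
            idx = max(bisect_right(shot_starts, float(row["start_sec"])) - 1, 0)
            shots.setdefault(idx, []).append(row)
    for idx, rows in shots.items():
        with open(f"{out_prefix}_shot{idx:04d}.csv", "w", newline="") as out:
            writer = csv.DictWriter(out, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)
      </preformat>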
      <p>Our main goal was to create a concept-enriched, so
called 'shot-document' file for each shot and each metric;
by doing this the content can be found using synonyms in
the search query. For example, if the search query is "dog"
and there is a shot-document which has the word 'puppy' in
it, the aim is to connect them and return the needed result.</p>
    </sec>
    <sec id="sec-5">
      <title>2.3 Concept enrichment</title>
      <p>Our concept enrichment stage consists of three text
transformation steps. First, each word in the
shot-documents is analysed by the phpMorphy morphology
engine (http://sourceforge.net/projects/phpmorphy). This engine can create the normal form (stem)
of each word using basic grammatical rules and a large
dictionary. In our work we replaced every word with the
stem produced by this engine. At this point the shot-document is only
a bag of words.</p>
      <p>After this step we filtered out the stop words; we used
702 different English stop words (http://www.ranks.nl/stopwords) for that, including
words that look like search terms, e.g. 'less' and 'important'. This way we narrowed
down the word list of a shot-document to its core concept.</p>
      <p>For a better match we needed to enrich this list with
synonyms and other conceptually connected items. For this
we used the well-known ConceptNet 5 system
(http://conceptnet5.media.mit.edu/), which can
give us other words and phrases connected to each word in
a shot-document. We experimented with a wider and a
smaller range solution, including 50 and 10 conceptually
connected words for each word, respectively. In the results
(C2) denotes the smaller range runs. We introduced
a weight for each word: the "original" words inside the
shot-document have weight 1, while the weights of connected words
are lower (0.2 for the wide range, 0.1 for the small range).
When aggregating all of the enriched words there can be
duplicates (e.g. both 'home' and 'teacher' are connected
to 'school'); we aggregate them by a simple
weight sum. Using this method the weight represents the
importance of a word in the conceptual graph built from all
words of the shot-document.</p>
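      <p>The following sketch summarizes the enrichment logic under stated assumptions: stem() stands in for the phpMorphy normal-form lookup and related_concepts() for a ConceptNet 5 query (both are placeholders, not real APIs), and the 702-word stop list is abbreviated. The wide-range runs would correspond to limit=50 with weight 0.2, the (C2) runs to limit=10 with weight 0.1.</p>
      <preformat>
# Sketch of the concept-enrichment stage: stem, drop stop words, add
# conceptually connected words with a lower weight, and merge duplicates
# by summing weights.
from collections import Counter

STOP_WORDS = {"the", "a", "less", "important"}  # abbreviated 702-word list

def stem(word):
    return word.lower()          # placeholder for the phpMorphy normal form

def related_concepts(word, limit):
    return []                    # placeholder for a ConceptNet 5 lookup

def enrich(shot_words, limit=10, related_weight=0.1):
    """Weighted bag of words: original words get weight 1.0 per occurrence,
    connected words get 'related_weight'; duplicates are summed."""
    weights = Counter()
    for word in (stem(w) for w in shot_words if stem(w) not in STOP_WORDS):
        weights[word] += 1.0
        for concept in related_concepts(word, limit):
            weights[concept] += related_weight
    return weights
      </preformat>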
    </sec>
    <sec id="sec-6">
      <title>2.4 Content mixing</title>
      <p>We created multiple shot-document types (three transcripts
and the manual subtitles), and furthermore a combined type, so
called "All transcripts and subtitles". This latter type was
created by taking each shot-document with its word weights
and putting them together with the same sum method explained before.
This way we could represent each and every possible word
in our concept file, but on the other hand we added a lot of
conceptual noise to the originally clean document.</p>
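      <p>Reusing the weighted-bag representation from the sketch in 2.3, the mixing rule can be summarized as follows (a sketch, not the exact implementation):</p>
      <preformat>
# "All transcripts and subtitles": merge the weighted bags of one shot coming
# from each transcript source and the subtitles by the same weight-sum rule.
from collections import Counter

def mix(shot_documents):
    combined = Counter()
    for doc in shot_documents:   # each doc is a Counter of word -> weight
        combined.update(doc)     # Counter.update adds the weights together
    return combined
      </preformat>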
    </sec>
    <sec id="sec-7">
      <title>2.5 Indexing and retrieval</title>
      <p>For indexing and retrieval we used Apache Solr; each
shot-document is represented in Solr as a single continuous
text stream in which the order of the words reflects their weights
in the shot-document. An important note is that during the
word reordering we kept concept phrases together as one entity.</p>
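      <p>One possible realization of this weight-to-order encoding is sketched below; the underscore-joining of multi-word concept phrases is only an assumption, as the exact encoding is not specified beyond the ordering rule:</p>
      <preformat>
# Flatten a weighted shot-document into the single text stream indexed by
# Solr: terms (concept phrases kept as one unit) ordered by descending weight.
def to_solr_text(weights):
    ordered = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return " ".join(term.replace(" ", "_") for term, _ in ordered)
      </preformat>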
      <p>In the search sub-task we only included the following
steps: stop word filtering for the query, creating the normal
form for each word in the query, and using the processed query as the search
input for Solr. The result list was limited to 30 retrieved items.</p>
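      <p>A sketch of this query-side processing is shown below, reusing stem() and STOP_WORDS from the sketch in 2.3; the Solr core and field names are hypothetical:</p>
      <preformat>
# Search sub-task query processing: stop-word filtering, stemming, then a
# standard Solr select request limited to 30 rows.
import requests

def search(query, solr_url="http://localhost:8983/solr/shots/select"):
    terms = [stem(w) for w in query.split() if stem(w) not in STOP_WORDS]
    params = {"q": "text:(" + " ".join(terms) + ")", "rows": 30, "wt": "json"}
    return requests.get(solr_url, params=params, timeout=10).json()
      </preformat>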
      <p>In the linking sub-task we used the shot-document
representing the given anchor segment as the search query, but we
removed the concept-enriched words from it.</p>
    </sec>
    <sec id="sec-8">
      <title>3. RESULTS AND CONCLUSIONS</title>
      <p>
        The whole dataset contained more than 3700 hours of video, and
the evaluation was done on a shot-level basis (shots are sometimes
shorter than 5 seconds). Tables 1 and 2 present the results of the search
and linking sub-tasks, where the P@N values are precision-oriented
metrics, i.e. precision at various cutoffs (adjusted for
this task) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
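      <p>For reference, plain precision at a cutoff N is the fraction of relevant items among the first N retrieved, as sketched below; the official evaluation used adjusted variants of such binary metrics [<xref ref-type="bibr" rid="ref1">1</xref>]:</p>
      <preformat>
# Plain precision-at-N; the task evaluation used adapted variants of this.
def precision_at_n(ranked_ids, relevant_ids, n):
    hits = sum(1 for item in ranked_ids[:n] if item in relevant_ids)
    return hits / float(n)
      </preformat>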
      <sec id="sec-8-1">
        <title>Manual subtitles</title>
        <p>LIMSI transcripts</p>
        <p>LIUM transcripts
NST/She eld transcripts
All transcripts and subtitles</p>
        <p>Manual subtitles (C2)
LIMSI transcripts (C2)</p>
        <p>LIUM transcripts (C2)
NST/She eld transcripts (C2)
All transcripts and subtitles (C2)
0.1778
0.1481
0.1630
0.1769
0.1517
0.3407
0.3111
0.3704
0.2846
0.1655
0.2000
0.1667
0.1444
0.1308
0.1345
0.3074
0.2926
0.2815
0.2231
0.1586
0.1407
0.1185
0.1148
0.0981
0.1017
0.2074
0.2204
0.2204
0.1692
0.1190
In the search sub-task we reached a quite stable result
for each subtitle / transcript. Using a manually written
transcript is much better since it can include visual clues,
non-spoken informations (e.g. texts) and it has lower error
rate, on the other hand in the transcripts there can be
'misheard' sentences.</p>
        <p>In the linking sub-task the Manual subtitles gave us the
best result, but it is interesting to note that for 2 of the
anchors we cannot nd any relevant items among all of our
results, that is why the P@N results are so low. These
anchors are anchor 22 and anchor 27.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>4. ACKNOWLEDGMENTS</title>
      <sec id="sec-9-1">
        <title>Manual subtitles</title>
        <p>LIMSI transcripts</p>
        <p>LIUM transcripts
NST/She eld transcripts
All transcripts and subtitles</p>
        <p>Manual subtitles (C2)
LIMSI transcripts (C2)</p>
        <p>LIUM transcripts (C2)
NST/She eld transcripts (C2)
All transcripts and subtitles (C2)
0.0750
0.0444
0.0533
0.0400
0.0370
0.1818
0.0500
0.0526
0.0300
0.0143
0.0500
0.0333
0.0400
0.0467
0.0407
0.1000
0.0625
0.0316
0.0350
0.0250
0.0312
0.0167
0.0200
0.0233
0.0222
0.0500
0.0375
0.0184
0.0175
0.0196
The publication was supported by the
TAMOP-4.2.2.C-11/1/KONV-2012-0001 project. The
project has been supported by the European Union,
co- nanced by the European Social Fund.
5.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks</article-title>
          .
          <source>CoRR</source>
          , abs/1312.1913,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Eskevich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Aly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. N.</given-names>
            <surname>Racca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>The Search and Hyperlinking Task at MediaEval 2014</article-title>
          .
          <source>In Proceedings of the MediaEval 2014 Multimedia Benchmark Workshop</source>
          , Barcelona, Spain,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Eyben</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Weninger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Gross</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Schuller</surname>
          </string-name>
          .
          <article-title>Recent developments in openSMILE, the Munich open-source multimedia feature extractor</article-title>
          .
          <source>In Proceedings of the 21st ACM international conference on Multimedia</source>
          , pages
          <fpage>835</fpage>
          –
          <lpage>838</lpage>
          , Barcelona, Spain,
          <year>2013</year>
          . ACM.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Adda</surname>
          </string-name>
          .
          <article-title>The LIMSI broadcast news transcription system</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>37</volume>
          (
          <issue>1</issue>
          ):
          <fpage>89</fpage>
          –
          <lpage>108</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>El Hannani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. N.</given-names>
            <surname>Wrigley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>V.</given-names>
            <surname>Wan</surname>
          </string-name>
          .
          <article-title>Automatic speech recognition for scientific purposes - webASR</article-title>
          .
          <source>In Interspeech</source>
          , pages
          <fpage>504</fpage>
          –
          <lpage>507</lpage>
          , Brisbane
          , Australia,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          .
          <article-title>Multilingual speech processing activities in QUAERO: Application to multimedia search in unstructured data</article-title>
          .
          <source>In Human Language Technologies – The Baltic Perspective: Proceedings of the Fifth International Conference Baltic HLT</source>
          <year>2012</year>
          , volume
          <volume>247</volume>
          , pages
          <fpage>1</fpage>
          –
          <lpage>8</lpage>
          . IOS Press,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gales</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quinnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Saz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Seigel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Swietojanski</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Woodland</surname>
          </string-name>
          .
          <article-title>Automatic transcription of multi-genre media archives</article-title>
          .
          <source>In First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013)</source>
          , Marseille, France,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rousseau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Deleglise</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Esteve</surname>
          </string-name>
          .
          <article-title>Enhancing the TED-LIUM corpus with selected data for language modeling and more TED Talks</article-title>
          .
          <source>In The 9th edition of the Language Resources and Evaluation Conference (LREC 2014)</source>
          , Reykjavik, Iceland,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>