<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Corpus Generation and Analysis: Incorporating Audio Data Towards Curbing Missing Information</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Atiqah Izzati Masrani</string-name>
          <email>amasrani1@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yoshihiko Gotoh</string-name>
          <email>y.gotoh@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Sheffield</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As video data becomes widely available, it is crucial that these videos are properly annotated for effective search, mining and retrieval purposes. Significant work has been done to explore natural language description as it can provide a better understanding of the video content. Ideally, a summary should be informative and accurate in order for users to gain a good understanding of the video content. An experiment was conducted to evaluate the impact of audio information on natural language summary annotations of video content. The experiment showed that although events and human activities can be captured using visual features alone, key information of the video content would be missing without the audio information. Thus, future work on natural language summary generation should incorporate both visual and audio data to curb missing and erroneous information.</p>
      </abstract>
      <kwd-group>
        <kwd>Corpus generation</kwd>
        <kwd>hand annotation</kwd>
        <kwd>visual features</kwd>
        <kwd>audio data</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Nowadays, there is an abundance of videos accessible online. The
widespread use of the Internet has allowed videos to be accessed easily via video
search engines such as YouTube or Dailymotion. YouTube itself
has more than 1 billion users and is estimated to have 300 hours of video
uploaded every minute. It generates billions of views on a daily basis. Furthermore,
the number of hours people spend watching YouTube each month is up 50%
year over year. 1 This raises the question of how users can be more selective
during video browsing and retrieval. Although some of these videos are
well-organized with manually annotated tags or labels, some have no clear description
of their content. Therefore, users may tend to skim through the video to grasp a
hint of its semantic content.</p>
      <p>Video summarization addresses this issue by providing brief information about
the video. Significant work has been done in this area, a large part of it
optimizing graphical representations. The graphical representations can be further</p>
    </sec>
    <sec id="sec-2">
      <title>1 https://www.youtube.com/yt/press/statistics.html</title>
      <p>
        divided into two classes. The first class focuses on compressing the video into
a shorter representation of the video, also known as video skimming.
This includes works from [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The second class uses image key frames
extracted from the video stream to reflect the content or the highlights of the
video [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        Natural language has also proven to be a popular choice to represent a video.
It is an appealing option as it is less space consuming, has faster processing time
for retrieval and is readable by both humans and machines. Most early research,
such as the works by [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], used representative keywords. Using keywords can
boost the potential for fast video retrieval because it helps efficient video
categorization. However, keywords alone may not capture all the
key points of the video, as keywords tend to be ambiguous. This may affect the
accuracy and effectiveness of video classification due to their ambiguity and lack
of information. Natural language representation in the form of a "summary"
or "abstract" is one way to address this. There has been significant work on
creating natural language summaries that emphasize coherency and
informativeness. However, a human's perspective when watching a video is subjective.
Although presented with the same visual scene, one's interpretation may vary.
This may influence how they write the summary of the video.
      </p>
      <p>In this paper, an experiment was conducted to study the overlapping
similarities of human perspectives and also the impact of incorporating audio data
during summary annotation. This paper aims to show that the dissimilarity lies
in the words used to semantically convey the meaning, and the similarity lies in
the key information that is included in the summary. This paper also aims to
show that both the visual and audio data are important for determining
the key points of the video. Thus, using one without the other in a
natural language generation framework for video data may cause missing or
erroneous information in the summary.</p>
      <sec id="sec-2-1">
        <title>Corpus Generation</title>
        <p>
          As video data becomes widely available, it is crucial that these videos are
properly annotated for effective search, mining and retrieval purposes. Significant
progress has been made in using natural language description, as it can provide
a better understanding of the video content, such as the work by [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Most
of these works crafted their own video corpora, consisting of the video data and
its corresponding hand annotations. Each dataset is specifically designed with
certain prerequisites or constraints to fulfill a specific task or purpose.
        </p>
        <p>
          In [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], the dataset is designed for the task of generating natural language
descriptions of the video content. The work focuses on the natural language
generation phase, which is heavily dependent on the visual features extracted during
the high-level feature (HLF) processing phase. The dataset is crafted from videos that consist of
subjects, objects, actions and scene settings that can be easily identified using
existing visual processing techniques. Therefore, the crafted videos are short and
consist of a single shot or scene with minimal activity. Some other existing video
corpora are more domain-focused, such as football, traffic, surveillance and cooking
videos [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>In this study, video clips from the BBC EastEnders series were selected. 2
It consists of approximately 244 episodes, each associated with its own
metadata and transcripts. This dataset was chosen because of its realistic elements,
with human subjects showing various activities, emotions and interactions with
other objects. In this experiment, 5 episodes were chosen. These episodes were
crosschecked with their metadata and transcripts. Each episode has a
synopsis and description included in its metadata file. Assuming that the synopsis
(summary) describes the highlight of its corresponding episode, these videos were
cropped focusing on the episodes' highlights. The cropped videos range between 4
and 20 minutes of playtime. Figure 1 shows the selected videos with their synopsis,
description and the duration of the cropped version.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2 http://www-nlpir.nist.gov/projects/tvpubs/tv13.papers/tv13overview.pdf</title>
      <p>– Can the hand annotations consist of similarities that focus on the key interest
points of the video?
– Can audio data help to reduce missing information?</p>
      <sec id="sec-3-1">
        <title>Results</title>
        <p>The total number of documents for this corpus was 25 (5 participants each created
1 summary for 5 different videos). The total number of words in the summaries
was 1856, hence the average length of one document was roughly 74 words. The total
number of unique words is 402. 4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Refers to the subclasses as defined in [8]</title>
    </sec>
    <sec id="sec-5">
      <title>4 This statistic is generated using www.linguakit.com</title>
      <p>Human Related Features Figure 5 presents human related information
observed in the hand annotations. The participants are shown to focus on identifying
human presence in the video, because the top three most frequently used words
(nouns) are woman with 41 occurrences, man with 31 occurrences and lady with
19 occurrences. Among the human related features, human gender information has
the highest number of occurrences: female with 77 occurrences and male with
54 occurrences. Related words such as `lady' and `woman' are combined into the
same category `female'. The same goes for `male', which combines related words
such as `man' and `boy'. Age information (e.g., old, young, child), identity (e.g.,
mother, nurse, groom) and grouping (e.g., one, two, crowd) are also often used.
The words used to describe emotions are categorized into the six basic emotions
described by Paul Ekman 5. These six basic emotions are `anger', `disgust',
`fear', `happy', `sad', and `surprise'. The least described features are body parts
and dressing.</p>
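<p>The folding of related words into shared categories described above can be sketched as follows (Python; the small category lexicon here is an illustrative assumption, not the authors' full word list):</p>

```python
from collections import Counter

# Illustrative sketch of the category counting described above: related
# words such as 'lady' and 'woman' are folded into one category ('female').
# The CATEGORY map is a hypothetical fragment, not the authors' lexicon.
CATEGORY = {
    "woman": "female", "lady": "female", "girl": "female",
    "man": "male", "boy": "male",
    "old": "age", "young": "age", "child": "age",
    "mother": "identity", "nurse": "identity", "groom": "identity",
}

def category_counts(tokens):
    """Count occurrences per category, ignoring unmapped words."""
    return Counter(CATEGORY[t] for t in tokens if t in CATEGORY)

print(category_counts(["woman", "lady", "man", "nurse", "car"]))
# 'woman' and 'lady' both count toward 'female'; 'car' is ignored
```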
      <p>Non-human Related Features Figure 6 presents non-human related
information observed in the hand annotations. The participants showed keen interest
in identifying the location of a particular scene such as the hospital, restaurant,</p>
    </sec>
    <sec id="sec-6">
      <title>5 Paul Ekman is a psychologist and a co-discoverer of micro expressions with Friesen,</title>
      <p>Haggard and Isaacs
church, etc. They also showed interest in describing the man-made objects involved
(e.g., car, food, book, etc.) and scene settings (e.g., ceremony, wedding, and
outside). Natural objects and colours are rarely described. No word has been used
to describe size.</p>
      <p>Hand Annotations (With Audio)
The total number of documents for this corpus was 25 (5 participants each created
1 summary for 5 different videos). The total number of words in the summaries
was 1983, hence the average length of one document was roughly 79 words. The total
number of unique words is 426. 6
Human Related Features Figure 7 presents human related information
observed in the hand annotations. The participants are shown to focus on identifying
human presence in the video, because the top three most frequently used words
(nouns) are mother with 21 occurrences, baby with 20 occurrences and woman
with 19 occurrences. Among the human related features, human gender information
has the highest number of occurrences: female with 75 occurrences and male with
75 occurrences. Identity features (e.g., mother, Dawn, nurse) also recorded a high
number of occurrences. Age information, emotions and grouping are described
significantly. The least described features are body parts and dressing.</p>
    </sec>
    <sec id="sec-7">
      <title>6 This statistic is generated using www.linguakit.com</title>
      <p>Non-human Related Features Figure 8 presents non-human related
information observed in the hand annotations. The participants showed keen interest
in identifying the location of a particular scene such as the pub, house, hospital,
etc. They also showed interest in describing the man-made objects involved (e.g.,
rubbish, coffee, car, etc.) and scene settings (e.g., ceremony, wedding, and
outside). Natural objects and size are rarely described. No words have been used to
describe colours.</p>
      <sec id="sec-7-1">
        <title>Analysis and Discussion</title>
        <p>The findings and analysis from this experiment are presented in two
subsections focusing on the two research questions.
Finding Overlapped Key Interest Points The hand annotations are fed
into an automatic summarizer tool7 to identify the sentence relevance and the
best keywords. This automatic summarization tool works in three phases. In
the first phase, it extracts the sentences from the input text. Next, it
identifies the keywords in the text and counts each word's relevance. In the
final phase, it identifies the sentences with the most relevant keywords and
displays them based on the options selected. Table 3 and Table 4 show the
sentences with the highest relevance when the threshold8 is set to 80. Based on
these findings, the overlapped key interest points that have been identified for
Video ID: 5082189274976367100 are: two women are having a conversation
at the hospital; one of the women ran out from the hospital crying after giving
the baby; one of the women argues with the nurse to get out of the hospital.
Using Audio Data to Reduce Erroneous and Missing Information
Table 5 shows the best keywords that were identified.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7 http://www.tools4noobs.com/summarize</title>
    </sec>
    <sec id="sec-9">
      <title>8 The value used to limit the sentences based on their relevance. The relevance is</title>
      <p>determined by the number of relevant words in it</p>
      <p>Summary
There is one patient lying at the hospital bed and is
talking with another woman who is holding a baby.</p>
      <p>In the hospital, the young woman prepares to go back
while a nurse came to talk to her.</p>
      <p>One is a young woman sitting on the bed and one is
middle age woman standing holding the baby.</p>
      <p>A woman fought with nurse possibly about getting
out from the hospital.</p>
      <p>She ran out from the hospital and cried after gave the
baby to that young lady.</p>
      <p>When audio data is present, the participants are keener to identify the
identity of the human subjects (e.g., `Dawn', `mother'). Besides that, the key
information of the video is also identified. In the hand annotations (without
audio), although participants managed to identify that the two women were having
a conversation, information regarding the conversation itself is missing.
Keywords such as `probably', `possibly', and `maybe' were often used. In the
hand annotations (with audio), there is a substantial increase in relevance
for the keyword `baby'. This clearly shows that the participants grasped
the key information of the video: the two women are arguing over the
baby. Therefore, we can conclude that incorporating audio data may reduce
erroneous and missing information.</p>
      <p>This experiment also shows that there are a few challenges to be overcome
when these two types of data are incorporated. The first is to establish the relation
between what is spoken and what is shown visually. The audio information
extracted may or may not be related to the events or activities that are happening
in that particular scene.</p>
      <p>Summary
The mother begs and cries to get the baby and called
the nurse to ask the middle age woman to leave.</p>
      <p>Dawn was furious and demanded her baby but the
lady tried to increase the amount of money.</p>
      <p>May wants to take the newborn baby from Dawn who
is the mother as they agreed before by paying some
amount of money.</p>
      <p>Dawn had just gave birth and a lady was trying to
take her baby away and claimed that they had agreed
on giving the baby to her in return of GBP10, 000.</p>
      <p>Dawn want to leave the hospital but the nurse try to
stop her because she need some rest.</p>
      <p>Dawn wants to leave from the hospital with her baby.</p>
      <p>The baby's mother want to go back home because
she worries the middle age woman will come back
and steal her baby.</p>
      <p>For example, a conversation may be about
the past and thus differ (be non-relevant) from what is visually shown. Future work
should consider a decision-making process to filter non-relevant audio
information by crosschecking it with the visual features and calculating their overlapped
similarities.</p>
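<p>One possible form of this crosschecking step can be sketched as follows, under the assumption that both streams are reduced to keyword sets and compared by set overlap (the Jaccard measure and the threshold value are illustrative choices, not part of the paper):</p>

```python
# Hedged sketch of the proposed filtering step: crosscheck keywords
# spotted in the audio against visually detected features and keep only
# audio segments whose keyword overlap is high enough. The similarity
# measure (Jaccard) and threshold are assumptions for illustration.
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def filter_audio(audio_segments, visual_features, min_sim=0.2):
    """audio_segments: list of keyword lists, one per audio segment."""
    return [seg for seg in audio_segments
            if jaccard(seg, visual_features) >= min_sim]

visual = ["woman", "baby", "hospital", "bed"]
segments = [["baby", "money", "woman"], ["weather", "football"]]
print(filter_audio(segments, visual))  # keeps only the first segment
```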
      <p>Secondly, various audio processing tasks should be incorporated to get
optimum results, for example, detecting a person's identity or relationship. Speech
recognition alone is not sufficient to determine which spotted keyword can be
associated with which detected person. Various cues in audio and
video should be included to determine whether the keyword refers to the person with
whom the speaker is having the conversation or to a third person who may or may not be present in
the video stream. Associating a detected person with a keyword that represents
his or her identity or relationship is a challenge that is yet to be overcome.</p>
      <p>Third, this experiment uses the EastEnders dataset, which has been crafted
to include scenes with human activities and events. Thus, it is "rich" in both
audio and visual information to highlight the key interest points in the video
stream. A different set of guidelines should be given to the participants depending
on the type of the video dataset. For example, a lecture video may include a
person presenting a PowerPoint slide. Although the audio features may differ from
their detected visual counterpart, in this context the information is relevant for
describing the video content. For surveillance videos, the guideline should outline
what is expected to be annotated. Because this type of dataset
has no clear storyline or video highlights, a clear guideline is crucial to minimize
hand annotations that are too diverse or subjective.</p>
      <p>Therefore, in order to incorporate audio data towards curbing missing
information, these are the challenges that need to be taken into consideration to
achieve optimum results.</p>
      <sec id="sec-9-1">
        <title>Conclusion</title>
        <p>This paper has shown that although visual data is sufficient to detect humans,
their interactions with related objects, actions, and scenes, using this information
alone to generate natural language descriptions may not capture the
"key interest points" of the video content. An ideal video summary provides a
brief overview of the video; it does not merely state what is present (detected) in
the video. Therefore, incorporating audio data is crucial towards curbing missing
or erroneous information. Future work should consider the challenges that may
arise when incorporating both of these data, primarily the challenge of filtering
relevant and non-relevant information. The corpus dataset (hand annotations)
can also be used as a means of evaluation for future work on natural language
generation from a video stream.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>M.A.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Kanade</surname>
          </string-name>
          , T.:
          <article-title>Video Skimming and Characterization through the Combination of Image and Language Understanding Techniques</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition</source>
          ,
          <year>1997</year>
          . Proceedings.,
          <source>1997 IEEE Computer Society Conference</source>
          , pp.
          <volume>775</volume>
          –
          <fpage>781</fpage>
          . (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Lienhart</surname>
          </string-name>
          ,
          <article-title>Rainer and Pfeiffer, Silvia and Effelsberg, Wolfgang: Video Abstracting</article-title>
          .
          <source>In: Commun. ACM</source>
          , vol.
          <volume>40</volume>
          , pp.
          <volume>54</volume>
          –
          <fpage>62</fpage>
          . New York (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>He</surname>
          </string-name>
          ,
          <article-title>Liwei and Sanocki, Elizabeth and Gupta, Anoop and Grudin, Jonathan: Autosummarization of Audio-video Presentations</article-title>
          .
          <source>In: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1)</source>
          , pp.
          <volume>489</volume>
          –
          <fpage>498</fpage>
          ., Orlando, Florida (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Uchihashi</surname>
          </string-name>
          , Shingo and Foote, Jonathan and Girgensohn, Andreas and Boreczky, John: Video Manga:
          <article-title>Generating Semantically Meaningful Video Summaries</article-title>
          .
          <source>In: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1)</source>
          , pp.
          <volume>383</volume>
          –
          <fpage>392</fpage>
          ., Orlando, Florida (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yeung</surname>
            ,
            <given-names>M.M.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Boon-Lock</surname>
            <given-names>Yeo</given-names>
          </string-name>
          :
          <article-title>Video Visualization for Compact Presentation and Fast Browsing of Pictorial Content</article-title>
          .
          <source>In: Circuits and Systems for Video Technology, IEEE Transactions</source>
          , pp.
          <volume>771</volume>
          –
          <fpage>785</fpage>
          ., (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Assfalg</surname>
          </string-name>
          ,
          <article-title>Jurgen and Bertini, Marco and Colombo, Carlo and Bimbo, Alberto Del and Nunziati, Walter: Semantic Annotation of Soccer Videos: Automatic Highlights Identi cation</article-title>
          .
          <source>In: Comput. Vis. Image Underst.</source>
          , vol.
          <volume>92</volume>
          , pp.
          <volume>285</volume>
          –
          <fpage>305</fpage>
          . New York (
          <year>2003</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cui</surname>
          </string-name>
          ,
          <article-title>Bin and Pan, Bei and Shen, HengTao and Wang, Ying and Zhang, Ce: Video Annotation System Based on Categorizing and Keyword Labelling</article-title>
          .
          <source>In: Database Systems for Advanced Applications</source>
          , vol.
          <volume>5463</volume>
          , pp.
          <volume>764</volume>
          –
          <fpage>767</fpage>
          . Springer, Heidelberg (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Muhammad Usman Ghani and Nawab, Rao Muhammad Adeel and Gotoh, Yoshihiko: Natural Language Descriptions of Visual Scenes: Corpus Generation and Analysis</article-title>
          .
          <source>In: Proceedings of the Joint Workshop on Exploiting Synergies Between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra)</source>
          , pp.
          <volume>38</volume>
          –
          <fpage>47</fpage>
          ., Avignon, France (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Khan</surname>
          </string-name>
          ,
          <article-title>Muhammad Usman Ghani and Lei Zhang and Gotoh, Yoshihiko: Generating coherent natural language annotations for video streams</article-title>
          .
          <source>In: Image Processing (ICIP)</source>
          ,
          <year>2012</year>
          19th IEEE International Conference, pp.
          <fpage>2893</fpage>
          -
          <lpage>2896</lpage>
          , (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <article-title>Niveda Krishnamoorthy and Girish Malkarnenkar and Raymond Mooney and Kate Saenko and Sergio Guadarrama: Generating Natural-Language Video Descriptions Using Text-Mined Knowledge</article-title>
          , (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kojima</surname>
          </string-name>
          ,
          <article-title>Atsuhiro and Tamura, Takeshi and Fukunaga, Kunio: Natural Language Description of Human Activities from Video Images Based on Concept Hierarchy of Actions</article-title>
          .
          <source>In: Int. J. Comput. Vision</source>
          , vol.
          <volume>50</volume>
          , pp.
          <volume>171</volume>
          –
          <fpage>184</fpage>
          ., Hingham, MA, USA (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Das</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          and Chenliang Xu and
          <string-name>
            <surname>Doell</surname>
            ,
            <given-names>R.F.</given-names>
          </string-name>
          and
          <string-name>
            <surname>Corso</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          :
          <article-title>A Thousand Frames in Just a Few Words: Lingual Description of Videos through Latent Topics and Sparse Object Stitching</article-title>
          .
          <source>In: Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <source>2013 IEEE Conference</source>
          , pp.
          <fpage>2634</fpage>
          -
          <lpage>2641</lpage>
          ., (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>