<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Similar Segments in Social Speech Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elizabeth E. Shriberg</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Catharine Oertel</string-name>
          <email>catha@kth.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Louis-Philippe Morency</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nigel G. Ward</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Steven D. Werner</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David G. Novick</string-name>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tatsuya Kawahara</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>SRI</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>KTH</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>ICT, University of Southern California</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>University of Texas at El Paso</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Kyoto University</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2013</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>Similar Segments in Social Speech was one of the Brave New Tasks at MediaEval 2013. The task involves finding segments similar to a query segment, in a multimedia collection of informal, unstructured dialogs among members of a small community.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>With users’ growing willingness to share personal
activity information, the eventual expansion of social media to
include social multimedia, such as video and audio
recordings of casual interactions, seems inevitable. To unlock the
potential value, we need to develop methods for searching
such records. This requires us to develop good models of the
similarity between dialog-region pairs.</p>
      <p>Our motivating scenario is the following: A new member
has joined an organization or social group that has a small
archive of conversations among members. He starts to
listen, looking for any information that can help him better
understand, participate in, enjoy, find friends in, and
succeed in this group. As he listens to the archive (perhaps at
random, perhaps based on some social tags, perhaps based
on an initial keyword search), he finds something of interest.
He marks this region of interest and requests “more like this”.
The system returns a set of “jump-in” points, places in the
archive to which he could jump and start listening/watching
with the expectation of finding something similar. In this
scenario users may lack specific intentions, and their
behavior may resemble undirected search or even recommendation
requests more than directed search.</p>
      <p>
        Despite the large volume of research in technologies for
audio and multimedia search, as surveyed for example by [
        <xref ref-type="bibr" rid="ref1 ref5">1,
5</xref>
        ], there has been no research addressing such a scenario or,
more generally, search in social multimedia. There is a need for
evaluation support, both for examination of the suitability
for this task of existing techniques and for the exploration
of new techniques. To support both, we provide a task, a
dataset, and an evaluation method.
      </p>
      <p>The task is, given a short audio/video region (segment)
of interest, to return an ordered list of jump-in points for
regions similar to it, where similarity is based on the
perceptions of human searchers. In directly addressing pure
similarity this task is novel; it avoids the need to use any of
the typical simplifications — such as framing the problem as
being topic match, search-term match, or dialog-act match
— which ultimately, we believe, are distorting and limiting.</p>
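      <p>The task interface can be sketched as follows. This is an illustrative sketch only: the feature-vector representation of segments, the toy dot-product similarity, and the function names are all our own assumptions, not part of the task definition.</p>
      <preformat>
```python
def dot(a, b):
    """Toy similarity: dot product of two feature vectors."""
    return sum(x * y for x, y in zip(a, b))

def rank_jump_in_points(query_vec, candidates, similarity):
    """Order candidate jump-in times by similarity to the query segment.

    candidates: list of (time_in_seconds, feature_vector) pairs, one
    per candidate jump-in point in the archive.
    """
    scored = [(similarity(query_vec, vec), t) for t, vec in candidates]
    scored.sort(reverse=True)  # most similar first
    return [t for _, t in scored]
```
      </preformat>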
    </sec>
    <sec id="sec-2">
      <title>DATASET</title>
      <p>
        We audio- and video-recorded two-person dialogs among
members of the computer science community at our
university. They talked about whatever they wanted, for about
10 minutes each [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. They were told that their dialogs
were going to be annotated for later searching, and many
of the conversations turned out to be rich in information
likely to be of interest to fellow CS students, rather than
just personal talk.
      </p>
      <p>The training set is 20 dialogs, 241 minutes in total, mostly
involving undergraduates, with the most common topics
relating to classes and class assignments, interesting new
technologies, career ambitions, games, and movies. The test set
is 6 dialogs, 68 minutes total, involving only research-active
students, with less talk about classes and more about
research, but otherwise fairly similar.</p>
      <p>The annotations are tagsets which indicate regions similar
in some way. These were done by students, mostly members
of the same community, including some who had contributed
dialogs to the collection. The annotators worked mostly
independently. In the first pass each listened to and viewed
a few dialogs and developed a set of tags to use, each tag
associated with some set of somehow-related regions that
some future searcher may potentially be interested in. They
then did a second pass over all the dialogs, and for every
region found that was relevant to some tag, assigned that tag
to the data. Regions could span any fragment of the dialog,
regardless of any notion of topic or utterance boundary. The
average region duration was 50 seconds in the training set and,
after slightly clarifying the instructions to annotators, 31
seconds in the test set. There were 198 tagsets over the training
set, with a total of 1697 tagged regions, and 29 tagsets with
189 tagged regions for the test set.</p>
      <p>While most tags were related to traditional-style
topics, such as #food, #travel, #cars-and-driving,
#planning-class-schedules, #TV-shows, #lack-of-money, and
#family, others related instead, or in addition, to dialog
activity, for example #anecdotes, #problems,
#short-term-future-plans, #advice, #gossip, and
#positive-things-about-classes. While the tags themselves are not relevant
for our purposes, each tag serves to define a “similarity set”
of regions in which every pair is a positive example of
similarity. Task participants can use these examples to hone
their similarity metrics. Those similarity metrics can then
be used in a system to support the search scenario: given
any new query, to return a set of similar regions.</p>
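      <p>Concretely, the positive examples implied by the annotations can be enumerated as follows. The tagset contents shown are made-up illustrations; only the rule itself (every same-tag pair of regions is a positive example) comes from the task.</p>
      <preformat>
```python
from itertools import combinations

# Made-up illustration of one annotator's tagsets: each tag maps to
# the regions, as (start, end) times in seconds, it was assigned to.
tagsets = {
    "#food": [(12.0, 55.0), (310.5, 362.0), (1020.0, 1048.5)],
    "#advice": [(75.0, 140.0), (560.0, 601.0)],
}

def positive_pairs(tagsets):
    """Every pair of regions sharing a tag is a positive similarity example."""
    pairs = []
    for regions in tagsets.values():
        pairs.extend(combinations(regions, 2))
    return pairs
```
      </preformat>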
      <p>Participants were also given human-generated transcripts
and automatic speech recognition output. The latter was
far more errorful, typically having more incorrect words than
correct ones, but was more faithful to the ums and uhs, as
our human transcribers were told not to bother with those.</p>
    </sec>
    <sec id="sec-3">
      <title>EVALUATION OF RESULTS</title>
      <p>For evaluation purposes, an input to the system is a region
from one of the similarity sets of an annotator, and the ideal
result is a set of jump-in points that closely index all the
other regions in that set. As the test-set speakers and topics
differ from those in the training set, systems that performed
well will have demonstrated that their methods generalize,
at least to some extent.</p>
      <p>Our specific performance measures are based on the
scenario, in which the user watches/listens and browses around
the points suggested, rather than passively consuming some
precisely delimited segments. (Despite the title of the task,
the dialogs were not segmented in any way.) For this reason
standard metrics based on accuracy and precision are not
appropriate. Instead, we use a rough model of how searchers
are likely to use the suggested jump-in points. Extending
Liu and Oard’s model [<xref ref-type="bibr" rid="ref2">2</xref>], we define a “Searcher
Utility Ratio”, where the numerator is the estimated value to
the searcher and the denominator the estimated cost, both
measured in seconds.</p>
      <p>Specifically, the value to the searcher is modeled as the
number of seconds of relevant audio/video she can likely
find by using the suggested jump-in points. We assume that
she will find a region if a jump-in point is no earlier than 5
seconds before the region start and no later than 3 seconds
before the region end.</p>
      <p>The estimated cost to a searcher is the number of
seconds needed to peruse the suggested jump-in points. There
are three cases. 1) If the suggested jump-in point does not
correspond to any same-tagset region (a false-positive error),
then the cost is 8 seconds, an estimate of the time a searcher
needs to recognize a false alarm. 2) If the suggested jump-in
point is no more than 5 seconds before the actual region start
point, the cost is the time from that jump-in point to the
end of the actual region, reflecting the time spent to scan
forward to the start of the relevant content and the time
to listen to it. 3) If the suggested jump-in point is within
the region, then the benefit is the remaining duration of the
region, and the cost is the same.</p>
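      <p>The three cost cases, with the value model folded in, can be sketched as follows. The function name and the (value, cost) packaging are our own; the numbers come from the cases above.</p>
      <preformat>
```python
FALSE_ALARM_COST = 8  # seconds to recognize a false alarm

def value_and_cost(jump_in, region):
    """Return (value, cost) in seconds for one suggested jump-in point.

    region is (start, end) of the matching same-tagset region, or None
    if the jump-in point matches no region (case 1, a false positive).
    """
    if region is None:
        return 0.0, float(FALSE_ALARM_COST)   # case 1: false alarm
    start, end = region
    if start >= jump_in:
        # case 2: up to 5 s early; scan forward to the start, then listen
        return end - start, end - jump_in
    # case 3: inside the region; benefit and cost are the remaining duration
    return end - jump_in, end - jump_in
```
      </preformat>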
      <p>We further assume the searcher devotes two minutes to
each search. The total value is accordingly estimated as the
amount of relevant audio she can find and consume in that
time, according to the model above.</p>
      <p>In addition there is a measure of recall, to counter the
possibility of systems doing well by generating only a
handful of jump-in points, just the easiest ones. Thus Recall is
the fraction of obtainable content actually found, where the
obtainable content is the total content in the other regions
in the tagset, up to a two-minute maximum.</p>
      <p>The raw Searcher Utility Ratio and raw Recall are valid
for comparing systems’ ability to find similar regions, but
they significantly understate performance. This is because
regions other than those in the specific similarity set for a
query may in fact be similar to that query in other respects,
but will be counted as false alarms. That is, because each
similarity set is generated by a specific annotator, with his
or her own perspective and interests, no system could be
expected to return the target results exactly. Accordingly we
adjusted the scores by dividing by an estimate of the
best-obtainable performance values. This estimate was obtained
using an algorithm that consults closely-overlapping
other-annotator tags to propose jump-in points (although later we
found that this underestimated the possible performance,
enabling performance results to exceed 1.0). Thus for the
test set the Normalized Searcher Utility Ratio (NSUR) is
obtained by dividing the raw value by 0.159 and the
Normalized Recall by dividing the raw value by 0.211.</p>
      <p>The overall measure is the F-measure, with NSUR
weighted higher than Normalized Recall (specifically by a
factor of 9, based on consideration of how appreciative users
might be of results having various NSUR and NR values):
(10 ∗ U ∗ R) / (U + 9R)   (1)</p>
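      <p>The normalization constants and the overall measure above can be expressed directly in code; the function names are ours.</p>
      <preformat>
```python
def normalize(raw_sur, raw_recall):
    """Divide raw scores by the estimated best-obtainable test-set values."""
    return raw_sur / 0.159, raw_recall / 0.211

def overall_f(nsur, nr):
    """F-measure with NSUR weighted over Normalized Recall by a factor of 9."""
    return 10 * nsur * nr / (nsur + 9 * nr)
```
      </preformat>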
    </sec>
    <sec id="sec-4">
      <title>OUTLOOK</title>
      <p>
        While very challenging, this task will enable researchers
to explore search in dialog archives, the more-like-this task,
pure-similarity models, and the social-speech domain. While
our scenario is for search in social recordings, technologies
developed for this task are likely to be useful also for other
needs [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], such as search of workplace recordings, of
surveillance recordings, of personal recordings, and so on.
      </p>
    </sec>
    <sec id="sec-5">
      <title>ACKNOWLEDGMENTS</title>
      <p>We thank Martha Larson, Steve Renals and Khiet Truong,
and the National Science Foundation for support via a REU
supplement to Award IIS-0914868.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Larson</surname>
          </string-name>
          and
          <string-name>
            <given-names>G. J. F.</given-names>
            <surname>Jones</surname>
          </string-name>
          .
          <article-title>Spoken content retrieval: A survey of techniques and technologies</article-title>
          .
          <source>Foundations and Trends in Info. Retrieval</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          -5):
          <fpage>235</fpage>
          -
          <lpage>422</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          .
          <article-title>One-sided measures for evaluating ranked retrieval effectiveness with spontaneous conversational speech</article-title>
          .
          <source>In 29th SIGIR</source>
          , pages
          <fpage>673</fpage>
          -
          <lpage>674</lpage>
          ,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Thirty-two sample audio search tasks</article-title>
          .
          <source>Technical Report UTEP-CS-12-39, University of Texas at El Paso</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Data collection for the Similar Segments in Social Speech task</article-title>
          .
          <source>Technical Report UTEP-CS-13-58, University of Texas at El Paso</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Ward</surname>
          </string-name>
          and
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Werner</surname>
          </string-name>
          .
          <article-title>Using dialog-activity similarity for spoken information retrieval</article-title>
          .
          <source>In Interspeech</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>