<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards fast browsing of found audio data: 11 presidents</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>KTH Royal Institute of Technology</institution>
          ,
          <addr-line>Speech, Music &amp; Hearing, Stockholm</addr-line>
        </aff>
      </contrib-group>
      <fpage>133</fpage>
      <lpage>142</lpage>
      <abstract>
        <p>Our aim is to rapidly explore prohibitively large audio collections by exploiting the insight that people are able to make fast judgments about lengthy recordings by listening to temporally disassembled audio (TDA) segments played simultaneously. We have previously shown the proof of concept; here we develop the method and corroborate its usefulness. We conduct an experiment with untrained human annotators and show that they are able to place meaningful annotations on a completely unknown 8-hour corpus in a matter of minutes. The audio is temporally disassembled and spread out over a 2-dimensional map. Participants explore the resulting soundscape by hovering over different regions with a mouse. As a corpus, we used a collection of 11 State of the Union addresses given by 11 different US presidents, spread over half a century in time. The results confirm that (a) participants can distinguish between different regions and are able to describe the general contents of these regions; (b) the regions identified serve as labels describing the contents of the original audio collection; and (c) the regions and labels can be used to segment the temporally reassembled audio into categories. We include an evaluation of the last step for completeness.</p>
      </abstract>
      <kwd-group>
        <kwd>found data</kwd>
        <kwd>speech processing</kwd>
        <kwd>self-organizing maps</kwd>
        <kwd>dimensionality reduction</kwd>
        <kwd>visualisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>This paper presents a method to explore prohibitively large audio collections rapidly
using a combination of techniques. The work is motivated by the prevalence of audio
archives that remain unused and unexplored because of their large size.</p>
      <p>For example, the audiovisual archives of the Swedish National Library currently
contain more than ten million hours of data - an amount that would take 100 people
spending 40-hour weeks throughout their entire working lives just to listen through.
The scope of the project motivating the present work is slightly less daunting: 13000
hours of mixed audio recordings gathered over a period of more than a century.</p>
      <p>Previous results show that our method can help differentiate audio segments [1]. It
facilitates browsing and annotation of lengthy recordings in little time by combining a
number of techniques designed to present large quantities of audio simultaneously and
by building experimental setups that allow listeners to judge what they hear quickly.</p>
      <p>The present work extends these findings and corroborates their usefulness through a
study in which eight participants act as annotators of an 8-hour corpus containing
material unfamiliar to them. The study simulates the situation of an archivist faced with
a set of audio recordings they know nothing about and have no labelling for
- a situation that occurs often enough in current archive work. The goal for the
archivist is to explore the data, get a sense of what it contains, and possibly provide
some crude annotations. Rather than listening through the recordings from start to
finish, which is a time-consuming and tiresome task even if the data is sampled sparsely,
the intention is that they use the proposed method to facilitate the process.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>Found data comprises data collections that were not recorded with the specific purpose
of being used in research. As such, typical examples of found resources are political
speeches, radio segments, interviews, audiovisual data such as television and movies,
music, recordings and transcriptions of NASA's Apollo missions [2], and archived
material in general. In many cases, these collections have higher ecological validity
than artificially constructed data sets: there is no risk that unnatural properties
have crept into the data from a poorly constructed lab setting. They also possess
significant cultural value, but their full potential cannot be realised without the
help of automatic processing. Furthermore, the sheer size of found data collections
shows that there is no shortage of data out there, but rather a lack of methods able
to handle the huge quantities at hand. Unsupervised machine learning methods are very
helpful in this domain, but performing conventional supervised classification tasks requires labels.</p>
      <p>
        To tackle this, many initiatives have been started with a focus on national collections
of archive data. In Sweden, the project TillTal [3] aims to organize Swedish archives
that have collections exceeding 10 million hours of audiovisual data, a number that is
increasing significantly every day. [4] describes a software platform for automatic
transcription and indexation of Czech and former Czechoslovakian radio archives,
containing more than 100,000 hours of audio. Furthermore, there are older initiatives with
similar aims: SpeechFind [
        <xref ref-type="bibr" rid="ref6">5</xref>
        ] was intended to serve as an audio index and search engine
for spoken word collections from the 20th century containing 60,000 hours of audio;
[6] considers automatic transcriptions of the INA (Institut National de l’Audiovisuel)
archives in France containing 1.5 million hours of radio and television programs dating
back to 1933 and 1949 respectively. The MALACH (Multilingual Access to Large
Spoken Archives) project [7] addressed the problems in analysis and processing of large
multilingual spoken archives containing 116,000 hours of interviews from Holocaust
survivors while CHoral [8] considered audio indexing tools for Dutch audiovisual
cultural heritage collections.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Method</title>
      <sec id="sec-3-1">
        <title>Data</title>
        <p>In an effort to corroborate the method, we set up an experiment with people with no
prior knowledge of the project or the method.</p>
        <p>The data set was chosen to be manageable from an experimental point of view, yet
representative of archive (found) data. It consists of 11 American State of the Union
addresses recorded over a span of half a century, with the oldest being delivered by
John F. Kennedy the 25th of May in 1961 and the newest by Donald Trump in 2017.
Albeit a restricted domain, the data set exhibits the type of variability one might expect
from unknown archival data: different speakers, different venues, different equipment,
different post-processing, different audio quality, different content, and different times.
As such, we deemed this data suitable for corroborating the method.</p>
        <p>The speeches range from 32 minutes and 38 seconds to 1 hour, 1 minute and 56
seconds in length. Each audio file was converted to one channel with a sample rate of 16 kHz.
Table 1 shows the details of the recordings, which were downloaded from
americanrhetoric.com and youtube.com.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Stimuli generation</title>
        <p>The method we exploit starts by temporally disassembling the audio signal:
chopping it up into small segments that are then reorganized without consideration of
their original temporal order. We have used this technique (TDA) in several
applications where we insert humans in the loop of audio processing. In this case we used
self-organizing maps (SOMs) [9] to organize the sound segments in two-dimensional
maps. We then take advantage of a proof-of-concept technique, massively
multi-component audio environments [10], to present a multitude of sound snippets
simultaneously. The soundscape created in this manner is what our participants explored.</p>
        <p>Each recording was converted into a greyscale spectrogram using the Sound
eXchange library (SOX, sox.sourceforge.net). Apart from adopting a temporal resolution of 1000 pixels per
second and a height of 65 pixels, default settings were used. Both audio and
spectrograms were then cut into equal-sized chunks, resulting in 100 ms long
segment pairs of audio and spectrogram - disassembled audio with connected spectrograms.
Each spectrogram was then converted into a 6500-dimensional vector in which each
element corresponds to the greyscale value of the given pixel. This results in a matrix
where each row is the feature vector of its original audio segment. We
sampled over this matrix, extracting one segment per second, which gave us a training data
set of ~32,000 data points.</p>
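        <p>As a minimal illustration of this preprocessing step (a sketch only, with a randomly generated stand-in for a real SOX spectrogram; the function name is ours, not from the original implementation), the chunking and flattening can be written as:</p>
        <preformat>
```python
import numpy as np

SEG_COLS = 100   # 100 ms at 1000 spectrogram pixels per second
N_ROWS = 65      # spectrogram height in pixels

def spectrogram_to_features(spec):
    """Cut a (65, N) greyscale spectrogram into 100 ms chunks and flatten
    each chunk into a 6500-dimensional feature vector (one row per segment)."""
    n_seg = spec.shape[1] // SEG_COLS
    spec = spec[:, :n_seg * SEG_COLS]                  # drop the ragged tail
    # (65, n*100) -> (n, 65, 100) -> (n, 6500), keeping columns consecutive
    chunks = spec.reshape(N_ROWS, n_seg, SEG_COLS).swapaxes(0, 1)
    return chunks.reshape(n_seg, N_ROWS * SEG_COLS)

# Toy demo: 2.5 s of random 'spectrogram' -> 25 segments x 6500 features
spec = np.random.rand(65, 2500)
feats = spectrogram_to_features(spec)
print(feats.shape)  # (25, 6500)
```
        </preformat>
        <p>Sampling one such row per second of audio then yields the training matrix described above.</p>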
        <p>The data was then used as input to a SOM, projecting each data point onto a 90x90
2-D grid with the goal of forming regions based on the audio characteristics of each
segment. We then visualise the grid as a plot in which each audio segment is represented
by a point at a certain position. By assigning the corresponding audio to each point, so
that the target audio segment is played when a user hovers over it with the mouse, we
get a simple interactive interface that can be used to browse the original audio more
efficiently. The technical details of the proposed method, and the motivation for our
decision to use self-organizing maps, are described in more detail in [1].</p>
      </sec>
      <sec id="sec-3-2b">
        <title>Subjects</title>
        <p>Eight participants (M = 30.38 years, SD = 7.24, range 22-46 years; 6 male, 2 female), all
without known hearing impairments, were given the same task. One participant
misunderstood the task and did not provide labels, and was therefore excluded from the results
by necessity.</p>
        <p>Participants were told to take the role of an archivist tasked with exploring large
amounts of audio recordings. They received no details about the data, only that it
could be anything recorded at any time; the nature of the recordings was thus not
known to them. Their instructions were to interact with and explore the plot and see if they
could find different types of regions based on the type of sound they heard. They were
then asked to give the centroid coordinates and labels for the three most
distinct-sounding regions they found. The participants had to assign labels without any
guidance, so that we would not influence their choice of labels.
The top two labels (by frequency) were used for further analysis.</p>
      </sec>
      <sec id="sec-3-3">
        <title>Analysis</title>
        <p>For each coordinate pair and each label, selected as described above, the Euclidean
distance to each point - each audio segment - on the map was calculated. The audio was
then temporally reassembled, and smoothed graphs showing the relative closeness to a
label centroid for each segment in time were constructed. This was done by adding each
inverted distance to a kernel density estimate (KDE) for each label and speech. The
KDEs were then sampled, and the values of the two top-ranked labels subtracted from
each other at each point in time. The result is a rough, smoothed estimate of which of
the two labels any given segment belongs to.</p>
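        <p>The gist of this analysis can be sketched as follows (a simplification with invented parameters: plain Gaussian smoothing of the inverted distances stands in for the kernel density estimates, and the particular closeness transform is a hypothetical choice):</p>
        <preformat>
```python
import numpy as np

def label_curve(seg_coords, centroid, sigma_segments=20):
    """Smoothed 'closeness to a label centroid' over time for temporally
    ordered segment coordinates on the map."""
    dist = np.linalg.norm(np.asarray(seg_coords) - np.asarray(centroid), axis=1)
    closeness = 1.0 / (1.0 + dist)          # inverted distance: large when close
    half = 3 * sigma_segments               # Gaussian smoothing kernel
    t = np.arange(-half, half + 1)
    kernel = np.exp(-t ** 2 / (2.0 * sigma_segments ** 2))
    kernel /= kernel.sum()
    return np.convolve(closeness, kernel, mode="same")

def classify(seg_coords, centroid_a, centroid_b):
    """Assign every segment to whichever of the two top labels it is
    closer to after smoothing (True = first label)."""
    return label_curve(seg_coords, centroid_a) - label_curve(seg_coords, centroid_b) > 0

# Toy demo: 100 segments scattered near 'label A', then 100 near 'label B'
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal((10, 10), 1, (100, 2)),
                 rng.normal((80, 80), 1, (100, 2))])
labels = classify(pts, (10, 10), (80, 80))
print(labels[:3], labels[-3:])  # first segments lean towards A, last towards B
```
        </preformat>
        <p>Smoothing over neighbouring segments is what turns noisy per-segment distances into the rough region-level estimate the method aims for.</p>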
        <p>For further validation, one person (who did not participate in the previously mentioned
annotation process) performed a crude manual labelling of 5 minutes from each speech
(10:00 - 15:00). Anything that was applause was labelled as such, and anything else
was labelled as speech. This means that silent segments, for example, were labelled as
speech. Temporal granularity was limited to 100 ms, identical to the automatic temporal
disassembly rate used in the preprocessing for the experiment. This resulted in 3000
segments for each recording. The manual labels were then compared, for each recording,
with the labels derived from the participants' annotations. Albeit a crude
comparison, it reflects the purpose of the method: to get a rough idea of the general contents of
unknown audio quickly.</p>
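        <p>The comparison itself is a straightforward per-segment agreement count. A sketch (function name and toy labels are ours, not from the study) of precision, recall and the majority-class baseline referred to in the discussion:</p>
        <preformat>
```python
import numpy as np

def evaluate(pred, gold, positive="applause"):
    """Per-segment precision and recall for the positive class, overall
    accuracy, and the accuracy of a majority-class baseline."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    tp = np.sum((pred == positive) & (gold == positive))
    fp = np.sum((pred == positive) & (gold != positive))
    fn = np.sum((pred != positive) & (gold == positive))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = float(np.mean(pred == gold))
    # Baseline: always guess the most frequent label in the gold annotation
    _, counts = np.unique(gold, return_counts=True)
    majority = counts.max() / len(gold)
    return precision, recall, accuracy, majority

# Toy demo with invented labels for 10 segments
gold = ["speech"] * 8 + ["applause"] * 2
pred = ["speech"] * 7 + ["applause"] * 3
precision, recall, accuracy, majority = evaluate(pred, gold)
print(precision, recall, accuracy, majority)
```
        </preformat>
        <p>Beating the majority baseline is the minimal bar the validation checks; the actual per-speech figures are given in Table 3.</p>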
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Results</title>
      <sec id="sec-4-1">
        <title>Salient areas</title>
        <p>The participants selected coordinates spread out over the entire map (Figure 1). Without
asking them for labels, it would be hard to find any patterns or clusters in their results.
With the labels provided by the participants, however, a clear pattern appears.
</p>
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Labels given by the participants, grouped by the number of times (#) each label was supplied.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>#</th>
                <th>Label</th>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td>3</td>
                <td>applause, speech</td>
              </tr>
              <tr>
                <td>2</td>
                <td>silence</td>
              </tr>
              <tr>
                <td>1</td>
                <td>sports commentary, human speech, news segment, nothing much going on, in big hall, in street, radio talk - tv commercial, people speaking, low pitch, environmental sounds, in plane, background noise, synthesized speech, high pitch high volume environment, some news or political report</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-4-2">
        <title>Labels of disassembled audio</title>
        <p>Inspection of the labels given by our informants revealed that 2 labels had been supplied
on 3 occasions, 1 label on 2 occasions, and the remaining 13 labels only once each (see
Table 2). The labels with 3 mentions were "applause" and "speech", and we contrast
these in our analysis. The label mentioned twice was "silence". Among the remaining
labels with a solitary mention, many seem to point to something very similar to
"speech" or "applause", but for our purposes, the two highest ranking labels will be
sufficient, so we leave the rest out of the discussion. The spatial distribution of the
labels is shown in Fig. 1. Fig. 2 shows the same data, colour coded according
to which audio it originates from.</p>
        <p>The seemingly good match between visual areas in the spectrogram and the automatic
segmentation is corroborated by the results presented in Table 3. The table shows how well
the automatic annotation compares to the manual annotation of 5*11 minutes of the
data.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>The experiment we presented provides good evidence that the combination of
disassembling audio temporally, reorganizing according to a similarity measure, and
displaying interactively to people using massively multi-component audio replay does
provide a window into recordings that would otherwise remain unexplored. Our
participants spent only a few minutes on the task - 10 at most. The participant who took the
longest later stated that "it took much longer because I found listening to the sounds
intriguing". Out of the 21 labels collected, 2 groups of 3 identical words were found, and a
number of similar words could easily have been clustered with these.</p>
      <p>The validation of the segmentation/categorization results showed that we beat the
majority class. That is hardly a feat, however. There are obvious optimizations that could
be made, such as selecting a better threshold for "speech"/"applause" (the optimal
threshold on this data achieves 80+% precision at 40+% recall) or discarding the
poorest-quality audio sources. But good results on a specific, known classification task are
not what we look to achieve here. Rather, the key is that the system did not know what
to look for, nor did the participants providing the labels. We believe that this is a good
starting point when it comes to investigating large quantities of completely unknown
found audio data.</p>
    </sec>
    <sec id="sec-6">
      <title>Future work</title>
      <p>We will pursue this line of inquiry while optimizing usability and improving
robustness. As to the former, the framework presented here currently involves a preliminary
experimental setup that does not offer full comfort of use to an actual investigator of
archives. In parallel with the work presented here, which is aimed at validation and
development, we are placing considerable effort into creating a freely available software
resource for anyone interested in employing these techniques.</p>
      <p>Regarding robustness, the more generalizable framework we aim for, with added
functionality and better facilities for optimization, will help exploit the hidden resources
found in archives and freely available collections.</p>
      <p>On the technological side, this goes hand in hand with exploring alternative methods
for dimensionality reduction, such as auto-encoders, and for feature extraction -
conventional spectrograms are clearly just one of many ways to represent sound.</p>
      <p>The work is funded in full by the Riksbankens Jubileumsfond project TillTal
(SAF16-0917:1). Its results will be made more widely accessible through the national
infrastructure Nationella språkbanken and Swe-Clarin (Swedish Research Council
2017-00626).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Fallgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Malisz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Edlund</surname>
          </string-name>
          , “
          <article-title>Bringing order to chaos: a non-sequential approach for browsing large sets of found audio data,”</article-title>
          <source>in Proc. of the 12th International Conference on Language Resources (LREC2018)</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Sangwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaushik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H. L.</given-names>
            <surname>Hansen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. W.</given-names>
            <surname>Oard</surname>
          </string-name>
          , “
          <article-title>'Houston, we have a solution': Using NASA Apollo program to advance speech and language processing technology</article-title>
          ,”
          <source>in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Berg</surname>
          </string-name>
          et al.,
          <article-title>“TillTal - making cultural heritage accessible for speech research,”</article-title>
          <source>in CLARIN Annual Conference</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Nouza</surname>
          </string-name>
          et al.,
          <article-title>“Making Czech historical Radio archive accessible and searchable for wide public,”</article-title>
          <source>J. Multimed.</source>
          , vol.
          <volume>7</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>169</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>J. H. L.</given-names>
            <surname>Hansen</surname>
          </string-name>
          et al.,
          <article-title>“SpeechFind: Advances in Spoken Document Retrieval for a National Gallery of the Spoken Word,”</article-title>
          <source>IEEE Trans. Speech Audio Process.</source>
          , vol.
          <volume>13</volume>
          , no.
          <issue>5</issue>
          , pp.
          <fpage>712</fpage>
          -
          <lpage>730</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Barras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Allauzen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          , “
          <article-title>Transcribing audio-video archives</article-title>
          ,”
          <source>Language (Baltim.)</source>
          , vol.
          <volume>1</volume>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Psutka</surname>
          </string-name>
          et al.,
          <article-title>“Automatic Transcription of Czech Language Oral History in the MALACH Project: Resources and Initial Experiments,”</article-title>
          <source>in Proceedings of TSD 2002</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>253</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>De Jong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>Heeren</surname>
          </string-name>
          , “
          <article-title>Exploration of Audiovisual Heritage Using Audio Indexing Technology</article-title>
          ,” in
          <source>Proceedings of the 1st ECAI Workshop on Intelligent Technologies for Cultural Heritage Exploitation</source>
          ,
          <year>2006</year>
          , pp.
          <fpage>36</fpage>
          -
          <lpage>39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>T.</given-names>
            <surname>Kohonen</surname>
          </string-name>
          , “
          <article-title>Self-organized formation of topologically correct feature maps</article-title>
          ,”
          <source>Biol. Cybern.</source>
          , vol.
          <volume>43</volume>
          , no.
          <issue>1</issue>
          , pp.
          <fpage>59</fpage>
          -
          <lpage>69</lpage>
          ,
          <year>1982</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Edlund</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gustafson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Beskow</surname>
          </string-name>
          , “
          <article-title>Cocktail - a demonstration of massively multicomponent audio environments for illustration and analysis</article-title>
          ,”
          <source>in SLTC 2010</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>