<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Tool for Exploring Large Amounts of Found Audio Data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Per Fallgren</string-name>
          <email>perfall@kth.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zofia Malisz</string-name>
          <email>malisz@kth.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jens Edlund</string-name>
          <email>edlund@speech.kth.se</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>KTH Royal Institute of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>We demonstrate a method and a set of open-source tools (beta) for non-sequential browsing of large amounts of audio data. The demonstration will present early versions of a set of functionalities and will provide insight into how the method can be used to browse large quantities of audio data efficiently.</p>
      </abstract>
      <kwd-group>
        <kwd>Found data</kwd>
        <kwd>visualization</kwd>
        <kwd>machine learning</kwd>
        <kwd>speech processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In many fields, the absence of data is no longer a pressing issue; instead, there is a lack
of methods that can handle the large collections of data that exist. To this end,
we present an early version of a tool that lets professionals from different fields explore
audio more efficiently. Our aim is to make it possible to utilize found data, meaning
data that was not recorded for the purpose of being used in (speech) research.
Examples of found data include archive data, radio and television speech, and interviews.
These data sets constitute speech found in the wild and are often more valuable for
research than manually constructed speech datasets, as they are not constrained by a
fabricated lab setting. Despite this, such data are rarely used, sometimes because of
legal and ethical issues and sometimes simply because the existence of the data is
unknown to many
        <xref ref-type="bibr" rid="ref1">(Edlund &amp; Gustafson, 2016)</xref>
        .
      </p>
      <p>Large collections of this kind of data abound. In Sweden, the Institute for Language
and Folklore (ISOF) hosts 13 000 hours of digitized speech, and the National Library
(KB) hosts a staggering 10 million hours of audiovisual data. To put it simply, there is
no lack of data, but a lack of generalizable tools and methods that can help make use
of the data. While one could in theory listen through 13 000 hours of data sequentially,
it would take several years, not counting the added time for annotation. Instead,
novel methods that dramatically cut down on the time it takes to explore sound are
required. With this in mind, we present a tool in its early stages that fits well in the
context of digital humanities, since it lets users browse their data in a more
efficient manner. It should prove useful to a wide variety of professionals, not only in
speech and speech technology research, but also to archivists, scholars and others in the
social sciences and humanities.</p>
    </sec>
    <sec id="sec-2">
      <title>Method</title>
      <p>Our approach removes the temporal dimension (the sequential layout of the acoustic
data) and instead organizes the data along a small number (2-3) of spatial dimensions
representing acoustic features of the audio.</p>
      <p>For each audio file, we create a monochrome spectrogram, i.e. a visual representation
of the sound frequencies and their intensity over time. The two representations (audio
and spectrogram) are then segmented in parallel into equal-sized chunks of duration T
(T should sensibly lie in the range from about 50 milliseconds to one or two seconds).
The processing can result in a distribution of spectrogram chunks that form
coherent regions, in the sense that similar sounds are positioned close to
one another. Plotting the data points and adding advanced listening functionality results in
a simple, interactive interface that lets users explore the audio at a
considerably higher pace than its real-time duration.</p>
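      <p>The chunking step can be illustrated with a minimal sketch. It is not the tool's implementation: the function names (<monospace>split_into_chunks</monospace>, <monospace>chunk_features</monospace>), the synthetic test signal and the two hand-picked features (RMS energy and zero-crossing rate) are assumptions made for the example. The two features merely stand in for the spatial dimensions; a layout derived from the spectrogram chunks themselves would more plausibly rely on a dimensionality reduction method such as t-SNE or a self-organizing map.</p>
      <preformat preformat-type="code">
```python
import math


def split_into_chunks(samples, sample_rate, chunk_seconds):
    """Split samples into equal-sized chunks of chunk_seconds each.

    Returns a list of (start_time_in_seconds, chunk_samples) pairs.
    A trailing remainder shorter than one chunk is dropped.
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len == 0:
        raise ValueError("chunk_seconds is too short for this sample rate")
    n_chunks = len(samples) // chunk_len
    return [
        (i * chunk_len / sample_rate, samples[i * chunk_len:(i + 1) * chunk_len])
        for i in range(n_chunks)
    ]


def chunk_features(chunk):
    """Return (rms_energy, zero_crossing_rate) for one chunk.

    These two features are illustrative stand-ins for the map coordinates.
    """
    rms = math.sqrt(sum(x * x for x in chunk) / len(chunk))
    crossings = sum(
        1 for a, b in zip(chunk, chunk[1:]) if (a >= 0) != (b >= 0)
    )
    return rms, crossings / (len(chunk) - 1)


# Synthetic one-second signal at 8 kHz (assumed values):
# 0.5 s of silence followed by 0.5 s of a 440 Hz tone.
SAMPLE_RATE = 8000
signal = [0.0] * 4000 + [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(4000)
]

chunks = split_into_chunks(signal, SAMPLE_RATE, chunk_seconds=0.1)

# Each chunk becomes one point; the start time is kept only as metadata,
# so the points can be laid out by feature value rather than by time.
points = [(start, chunk_features(chunk)) for start, chunk in chunks]

for start, (rms, zcr) in points:
    print(f"t={start:.1f}s  rms={rms:.3f}  zcr={zcr:.3f}")
```
      </preformat>
      <p>In this toy signal the silent chunks and the tone chunks fall into two separate groups in the feature plane, regardless of where they occur in time.</p>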
      <p>Our proof-of-concept experiments show that the method captures characteristics of
audio that are readily interpreted by a human listener. Among the audio types we have
experimented with are speech, where we find crude regions of vowels, consonants and
silence; animal noises, where the sounds of birds, cows and sheep are distinguished
almost perfectly; music, where the algorithm differentiates between a singing voice and
an instrument (guitar or piano); and café buzz (Fig. 1), for which verbal chattering with
and without background noise is distinguished. One might argue that existing
methods could outperform the proposed approach on the tasks mentioned;
however, such methods would most likely lack generalizability. They are
tuned for certain kinds of audio and are not usable when the nature of the data is
unknown, which is often the case when dealing with found data.
</p>
    </sec>
    <sec id="sec-3">
      <title>Future work</title>
      <p>The method is still under development and there are many directions to explore. Our
vision is to turn the tools into a web-based framework that can be used by anyone
regardless of hardware and operating system. Because our target
audience varies in technical expertise, we want the tools to be easy to use
for a wide range of people. The interaction between user and framework is an essential
means of strengthening the link between users and their data. A prerequisite is a set of tools
that provides a smooth and pleasant experience.</p>
      <p>Sound can be represented in a wide variety of ways. Each of these captures some
characteristics better and others worse. In our current version, the SoX library is used
to extract the spectrograms we use for clustering. Although we show promising results
with this approach, there are other techniques that might capture acoustic features
better. Regarding characteristics that are relevant to speech, window size adjustments are
a primary candidate for tweaking. Analysis of small windows (e.g. 25 ms) will create
very different maps compared to larger windows (e.g. 1 s), something that will also
greatly affect the listening experience.</p>
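      <p>The effect of window size can be made concrete with a little arithmetic. The sketch below assumes a 16 kHz sample rate, one hour of audio and non-overlapping windows; none of these values is taken from the tool, and <monospace>window_properties</monospace> is a hypothetical helper introduced for illustration.</p>
      <preformat preformat-type="code">
```python
def window_properties(sample_rate, window_seconds, audio_seconds):
    """Summarize the consequences of one choice of analysis window."""
    window_samples = int(round(sample_rate * window_seconds))
    total_samples = int(round(sample_rate * audio_seconds))
    return {
        "window_samples": window_samples,
        # Spacing between DFT frequency bins for a window of this length.
        "bin_spacing_hz": sample_rate / window_samples,
        # Non-overlapping windows, i.e. the number of points on the map.
        "map_points": total_samples // window_samples,
    }


SAMPLE_RATE = 16000  # assumed sample rate in Hz
ONE_HOUR = 3600      # seconds of audio

short_window = window_properties(SAMPLE_RATE, 0.025, ONE_HOUR)
long_window = window_properties(SAMPLE_RATE, 1.0, ONE_HOUR)

print("25 ms:", short_window)
print("1 s:  ", long_window)
```
      </preformat>
      <p>Under these assumptions, 25 ms windows yield 144 000 points per hour of audio with a frequency bin spacing of 40 Hz, whereas 1 s windows yield 3 600 points with a spacing of 1 Hz. Short windows thus give many brief, spectrally coarse snippets, and long windows give fewer, spectrally finer ones.</p>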
      <p>Furthermore, we will add an annotation function, so that a user interested in a certain
type of sound event can easily locate and tag a region for further exploration.
Additionally, the user will have the option to return to the original audio with this new
information and see when the observed sound events occurred in the sequence.
This would also facilitate a human-in-the-loop approach, in which
a map is retrained based on human feedback.</p>
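      <p>One way to picture the return from a tagged map region to the original timeline is sketched below. It assumes that every point on the map remembers the index of its chunk in the original audio and that all chunks have the same duration; the function name <monospace>tagged_intervals</monospace>, the tag and the selected indices are hypothetical.</p>
      <preformat preformat-type="code">
```python
def tagged_intervals(selected_chunks, chunk_seconds, tag):
    """Turn selected chunk indices into (start, end, tag) time intervals.

    Runs of consecutive chunk indices are merged into a single interval.
    """
    runs = []
    for index in sorted(set(selected_chunks)):
        if runs and index == runs[-1][1] + 1:
            runs[-1][1] = index          # extend the current run
        else:
            runs.append([index, index])  # start a new run
    return [
        (first * chunk_seconds, (last + 1) * chunk_seconds, tag)
        for first, last in runs
    ]


# Chunk indices of the points a user selected on the map (made-up values).
selection = [12, 3, 4, 5, 13, 40]
annotations = tagged_intervals(selection, chunk_seconds=0.5, tag="birdsong")

for start, end, tag in annotations:
    print(f"{start:6.1f}s - {end:6.1f}s  {tag}")
```
      </preformat>
      <p>With half-second chunks, the six selected points collapse into three intervals on the original timeline: 1.5 to 3.0 s, 6.0 to 7.0 s and 20.0 to 20.5 s.</p>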
      <p>The demonstration will contain first versions of these functionalities and will provide
insight into how the method can be used to browse large quantities of
audio data efficiently.
</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>The project described here is funded in full by Riksbankens Jubileumsfond
(SAF160917: 1). Its results will be made more widely accessible through the infrastructure
supported by SWE-CLARIN (Swedish Research Council 2013-02003).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Edlund</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Gustafson</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Hidden Resources ― Strategies to Acquire and Exploit Potential Spoken Language Resources in National Archives</article-title>
          . In N. Calzolari (Conference Chair),
          <string-name>
            <given-names>K.</given-names>
            <surname>Choukri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Declerck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grobelnik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Maegaard</surname>
          </string-name>
          , … S. Piperidis (Eds.),
          <source>Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)</source>
          . Paris, France: European Language Resources Association (ELRA).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Kohonen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          (
          <year>1982</year>
          ).
          <article-title>Self-organized formation of topologically correct feature maps</article-title>
          .
          <source>Biological Cybernetics</source>
          ,
          <volume>43</volume>
          (
          <issue>1</issue>
          ),
          <fpage>59</fpage>
          -
          <lpage>69</lpage>
          . https://doi.org/10.1007/BF00337288
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Van Der Maaten</surname>
            ,
            <given-names>L. J. P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G. E.</given-names>
          </string-name>
          (
          <year>2008</year>
          ).
          <article-title>Visualizing high-dimensional data using t-SNE</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>9</volume>
          ,
          <fpage>2579</fpage>
          -
          <lpage>2605</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>