<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Zero-Cost Speech Recognition Task at MediaEval 2016</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Igor Szoke</string-name>
          <email>szoke@fit.vutbr.cz</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xavier Anguera</string-name>
          <email>xavier@elsanow.io</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Brno University of Technology</institution>
          ,
          <addr-line>Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>ELSA Corp.</institution>
          ,
          <addr-line>Lisboa</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>The main goal of the Zero-Cost Speech Recognition task is to bring researchers together on the topic of training ASR systems using only publicly available data. In particular, the task consists of developing either an LVCSR or a subword speech recognizer for a given target language. For this year we selected Vietnamese as the target language. The organizers provided participants with several sets of publicly available data combined with one proprietary set gathered for this evaluation. Participants are free to find and use other publicly available resources (free for research purposes). These resources must be shared with the other participants until the end of July. The data set for the evaluation is then fixed and no outside data can be used.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. INTRODUCTION</title>
      <p>There are more than a thousand spoken languages. Any
research lab or “speech company” that wants to develop
technology in a new language usually needs to buy a speech
database (audio plus carefully hand-made transcripts) to get
started¹. The cost of such databases ranges from 5k to 20k
EUR (provided by LDC, ELRA, Appen, etc.).
Alternatively, with a large enough budget, one can collect one’s own
data and cover specific topics, acoustic environments,
etc. This creates a huge gap between the “top” labs and
companies that can afford such expenditures and “the other”
small players, who are left to rely on freely available data,
tools, or even complete systems.</p>
      <p>
        The main goal of this task is to challenge participating
teams to come up with, and experiment with, bootstrapping
techniques that allow training an initial ASR system for “free”.
We are interested in the exploration of techniques that
allow researchers to train ASR systems on public multimedia
resources (texts, audio, video, dictionaries, etc.) without
the need to buy expensive (ideally, any) data sets.
Participants may take inspiration from approaches used for under-resourced
languages [
        <xref ref-type="bibr" rid="ref5 ref8">5, 8</xref>
        ]. There have also been other initiatives close to
the topic of this task: the Zero Resource Speech Challenge in
2015 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and the Third Frederick Jelinek Memorial Summer
Workshop in 2016 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
¹Throughout, we understand any speech tokenizer under the
abbreviation ASR, i.e., including phoneme recognizers, word
recognizers, and automatic unit discovery.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. THE ZERO-COST 2016 DATA-SET</title>
      <p>
        The target language selected for this year is Vietnamese.
One of the reasons why we chose Vietnamese is that it was
one of the languages of OpenKWS/BABEL in 2013 [
        <xref ref-type="bibr" rid="ref1 ref7">1, 7</xref>
        ] and
there are many papers reporting ASR results on it [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref6">6, 10, 11,
12, 13, 14, 15</xref>
        ]. BUT (Brno University of Technology), as the task co-organizer, provided
“upper-bound” results using their BABEL system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] to “calibrate” the
results of Zero-Cost participants against the larger worldwide
speech community. The BUT baseline was trained only on
conversational telephone speech, without any adaptation to the
target (Zero-Cost) domain.
      </p>
      <p>Other reasons for choosing Vietnamese are that it is a
low-resourced language with limited resources available on-line
(making it more difficult for participants to “cheat” with data
other than what is provided), that it is a tonal language (with
the inherent difficulties this brings), and that it is a syllabic
language (simpler to handle for zero-resource algorithms, since
clear phoneme sequences can be seen repeatedly in the data).</p>
      <p>Task organizers provided participants with an initial set
of free multimedia resources – a mix of audio data and
imperfect transcripts, such as audios/videos with subtitles:
• Forvo.com – a download of Vietnamese data from the Forvo.com
service. It is composed of a collection of short
recordings, each containing one or more word pronunciations.
• Rhinospike.com – a download of Vietnamese data from the
Rhinospike.com service. It is a collection of recordings
ranging from one short sentence to several long sentences.
• ELSA – proprietary prompted data recorded with a
mobile application by Vietnamese students. It
contains several read sentences obtained from a book of
Vietnamese quotes. This data simulates a case where a
participant is able to collect a small amount of data
themselves.
• Other “surprise” test data – surprise data aiming at
evaluating how robust participants’ systems are to new
data. This data is a download of 35 YouTube videos
(broadcast news, presentations, talks), mostly
containing one speaker. The first 2 minutes of each video
were transcribed and used as ground truth. The rest
of each video was left in the test set to augment the data for
possible unsupervised adaptation.</p>
      <p>Please note that the transcripts may not match the audio
100%. In addition, any audio may contain dropouts or
noise, and some speech may be missing. The data has been
preprocessed, split into Train / Devel / Test, and converted
to 16 kHz WAV audio with STM references.</p>
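      <p>As an illustration of this conversion step, the sketch below resamples a downloaded media file to 16 kHz mono WAV. It assumes the ffmpeg tool is installed; the file names are hypothetical placeholders, and the organizers’ actual pipeline may differ.</p>
      <preformat>
# Minimal sketch: convert a downloaded media file to the 16 kHz,
# mono, 16-bit WAV format used by the Zero-Cost data set.
# Assumes ffmpeg is on PATH; paths are hypothetical placeholders.
import subprocess
from pathlib import Path

def to_16k_wav(src: Path, dst: Path) -> None:
    """Resample `src` to a 16 kHz, single-channel, 16-bit PCM WAV."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-ar", "16000",            # 16 kHz sample rate
         "-ac", "1",                # mono
         "-acodec", "pcm_s16le",    # 16-bit PCM WAV
         str(dst)],
        check=True,
    )

to_16k_wav(Path("surprise_video.mp4"), Path("surprise_video.wav"))
      </preformat>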
      <sec id="sec-2-1">
        <title>Devel-Local</title>
      </sec>
      <sec id="sec-2-2">
        <title>Devel</title>
      </sec>
      <sec id="sec-2-3">
        <title>Forvo.com</title>
        <p>Rhinospike.com
ELSA
Surprise data
SUM</p>
        <p>Train</p>
      <p>In addition to the “official” data sets, several participants
have provided some free data, which we encourage other
participants to use. Apart from these, the use of any other data
is not allowed. The Train, Development, and Test sets are all
available during system training. Participants can use them
and adapt their systems on them (e.g. unsupervised
adaptation on the Test set). However, reference transcripts are
not provided for the development and test data, and it is not
allowed to transcribe or manually analyze them.</p>
      <sec id="sec-2-1">
        <title>2.1 Participants’ data description</title>
        <p>
          • I2R – a list of 890k Vietnamese web page URLs.
• I2R – a Vietnamese word list of 80k words.
• I2R – a raw dump of the Vietnamese Wiktionary, later
cleaned by I2R to 750 MB of text.
• BUT – a download of Vietnamese-English subtitles [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] – 93 MB of text.
• BUT – a set of Vietnamese videos and subtitles –
14 partly subtitled episodes of a Vietnamese telenovela.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. BRIEF SUB-TASKS DESCRIPTION</title>
      <p>Participants of the task are asked to train a speech
tokenizer – LVCSR or subword – on a collection of public data
(see Section 2) in the Vietnamese language. Each participant
must take part in at least one sub-task.</p>
    </sec>
    <sec id="sec-4">
      <title>3.1 Large vocabulary continuous speech recognition (LVCSR) sub-task</title>
      <p>This sub-task targets full speech recognition, where the output
is a sequence of recognized words. Systems are evaluated
with the Word Error Rate (WER) metric (using the NIST sctk
scoring tool). The WER is based on the comparison of transcripts
(reference and generated hypothesis) at the word level. Both
transcripts should be produced in uppercase and without
punctuation, hesitation markers, etc. No other text
normalization is done. This sub-task’s main use-case scenario is
in areas where a full speech transcript is needed.</p>
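      <p>For reference, WER has its standard definition in terms of the minimum number of word edits needed to turn the hypothesis into the reference:</p>
      <disp-formula><tex-math>\mathrm{WER} = \frac{S + D + I}{N}</tex-math></disp-formula>
      <p>where S, D, and I are the numbers of substituted, deleted, and inserted words, and N is the number of words in the reference. A minimal, purely illustrative Python sketch of this computation follows; the official scoring is done with the sctk tool, and the example transcript is hypothetical.</p>
      <preformat>
# Minimal sketch of WER via word-level edit distance (dynamic
# programming). Illustrative only; official scoring uses sctk.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One word deleted out of four reference words -> WER = 0.25
print(wer("XIN CHAO CAC BAN", "XIN CHAO BAN"))
      </preformat>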
    </sec>
    <sec id="sec-5">
      <title>3.2 Subword speech recognition sub-task</title>
      <p>This sub-task aims at building a “lightweight” speech
recognizer. The output is a sequence of subword tokens/units.
We do not define what the tokens should be. They can be
phonemes, graphemes, syllables, triphones, automatically
estimated units, etc. This sub-task’s main use-case scenario
is in areas where speech must be converted to a sequence
of discrete symbols (LID, SID, KWS, topic detection, etc.).
Phoneme units are used as ground truth for this sub-task.</p>
      <p>The evaluation metric used in this sub-task is a
normalized version of mutual information (NMI), also called the
symmetric uncertainty. It ranges between 0 and 1. When
the two variables, X and Y, are independent, meaning that
the discovered units are completely unrelated to the
reference phone labels, the result is zero. The maximum, one, is
achieved when one can fully recover the phone sequence from
the discovered units AND the entropy of the discovered units
equals that of the reference phones. This means
that the metric penalizes systems that have too many units.
The evaluation algorithm used to compute NMI takes into
account the timing of the discovered units. It matches them to
the reference ones (according to time) first and then calculates
the NMI.</p>
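      <p>For reference, the symmetric uncertainty between the sequence of discovered units X and the reference phone sequence Y is the standard quantity</p>
      <disp-formula><tex-math>\mathrm{NMI}(X, Y) = \frac{2\,I(X; Y)}{H(X) + H(Y)}</tex-math></disp-formula>
      <p>where I(X; Y) is the mutual information and H(·) the entropy. The sketch below computes this quantity from two already-aligned label sequences; the time-based matching of discovered units to reference phones performed by the official scorer is not reproduced here.</p>
      <preformat>
# Minimal sketch of the symmetric uncertainty (NMI) between two
# label sequences, assuming the discovered units have already been
# aligned one-to-one with the reference phones (the official scorer
# performs the time-based matching itself).
from collections import Counter
from math import log2

def symmetric_uncertainty(units, phones):
    n = len(units)
    assert n == len(phones) and n > 0
    cu, cp = Counter(units), Counter(phones)
    cup = Counter(zip(units, phones))
    h_u = -sum(c / n * log2(c / n) for c in cu.values())
    h_p = -sum(c / n * log2(c / n) for c in cp.values())
    mi = sum(c / n * log2((c / n) / ((cu[u] / n) * (cp[p] / n)))
             for (u, p), c in cup.items())
    if h_u + h_p == 0:
        return 1.0  # both sequences are constant
    return 2 * mi / (h_u + h_p)

# Three distinct units map cleanly onto three phones -> NMI = 1.0
print(symmetric_uncertainty(["u1", "u2", "u1", "u3"],
                            ["a", "b", "a", "c"]))
      </preformat>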
    </sec>
    <sec id="sec-6">
      <title>EVALUATION, SCORING AND LEADER</title>
    </sec>
    <sec id="sec-7">
      <title>BOARD</title>
      <p>Participants are provided with the Training, Development, and
Test (evaluation) data all at once. However, they do not
have reference transcripts for the Development and Test data.
They can use the on-line leader board to score their systems
and compute development results. When the evaluations are
over, the results on the Test set will be published. To make
development faster and easier, we defined a Devel-Local
subset and provided its ground truth to participants, so
that they can perform initial development of their systems
locally.</p>
      <p>• The Devel-Local is 1/5 subset of Devel. Participants
are provided with references and scoring scripts so
that they can score their system outputs on this
subset. This was done to allow for quick iterations during
training period and to overcome the need to upload
the system outputs to the leader board too often.
• The Devel consists of the full Devel dataset. Once
participants end up with some good enough / improved
enough system, they are encouraged to upload their
results to the leader board and be scored on much more
data. The uploaded scores are available for all
participants to see.
• The Test is ”unseen” data. It partly contains data
similar to training / devel one but also unseen one.
Participants are encouraged to adapt their systems on
this data (in unsupervised ways).</p>
      <p>Each participant has to register and submit their results
to the on-line leader board (http://www.zero-cost.org/).
There is no limit on the number of submissions
per team. For the official final scoring, each participant must
designate one primary submission by adding the P- prefix to
its name, and optionally up to 5 others as
contrastive systems, prefixed C1- to C5-.</p>
      <p>The submission deadline is September 12, 2016.</p>
    </sec>
    <sec id="sec-8">
      <title>5. ACKNOWLEDGMENTS</title>
      <p>We would like to thank the MediaEval organizers for their
support and all the participants for their hard work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] OpenKWS/BABEL 2013 evaluation: https://www.nist.gov/multimodal-information-group/openkws13-evaluation/.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] The Third Frederick Jelinek Memorial Summer Workshop 2016: http://www.clsp.jhu.edu/workshops/16-workshop/building-speech-recognition-system-from-untranscribed-data/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Vietnamese-English subtitles: http://opus.lingfil.uu.se/.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Zero Resource Speech Challenge 2015: http://www.lscp.net/persons/dupoux/bootphon/zerospeech2014/website/.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Besacier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Barnard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Schultz</surname>
          </string-name>
          .
          <article-title>Automatic speech recognition for under-resourced languages: A survey</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>56</volume>
          :
          <fpage>85</fpage>
          -
          <lpage>100</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sivadas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Lim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. G.</given-names>
            <surname>Ngo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. T.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          .
          <article-title>Strategies for Vietnamese keyword search</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          ICASSP
          <year>2014</year>
          , Florence, Italy, May 4-9, pages
          <fpage>4121</fpage>
          -
          <lpage>4125</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          and
          <string-name>
            <given-names>N.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <article-title>Overview of the NIST open keyword search 2013 evaluation workshop</article-title>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Fraga-Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laurent</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-L.</given-names>
            <surname>Gauvain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lamel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. B.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Messaoudi</surname>
          </string-name>
          .
          <article-title>Improving data selection for low-resource STT and KWS</article-title>
          .
          <source>In Proceedings of ASRU</source>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Karafiát</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Grézl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hannemann</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Černocký</surname>
          </string-name>
          .
          <article-title>BUT neural network features for spontaneous Vietnamese in BABEL</article-title>
          .
          <source>In Proceedings of ICASSP 2014</source>
          , pages
          <fpage>5659</fpage>
          -
          <lpage>5663</lpage>
          .
          <source>IEEE Signal Processing Society</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. Q.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nocera</surname>
          </string-name>
          , E. Castelli, and
          <string-name>
            <given-names>V. L.</given-names>
            <surname>Trinh</surname>
          </string-name>
          .
          <article-title>Large vocabulary continuous speech recognition for Vietnamese, an under-resourced language</article-title>
          .
          <source>In Proceedings of SLTU</source>
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaloupka</surname>
          </string-name>
          .
          <article-title>Phoneme set and pronouncing dictionary creation for large vocabulary continuous speech recognition of Vietnamese</article-title>
          ., pages
          <fpage>394</fpage>
          -
          <lpage>401</lpage>
          . Berlin: Springer,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Tsakalidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hsiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ranjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Saikumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Nguyen,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Schwartz</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Makhoul</surname>
          </string-name>
          .
          <article-title>The 2013 BBN Vietnamese telephone speech keyword spotting system</article-title>
          .
          <source>In IEEE International Conference on Acoustics, Speech and Signal Processing</source>
          ,
          ICASSP
          <year>2014</year>
          , Florence, Italy, May 4-9, pages
          <fpage>7829</fpage>
          -
          <lpage>7833</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Vu</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Schultz</surname>
          </string-name>
          .
          <article-title>Vietnamese large vocabulary continuous speech recognition</article-title>
          .
          <source>In 2009 IEEE Workshop on Automatic Speech Recognition &amp; Understanding</source>
          ,
          ASRU
          <year>2009</year>
          , Merano/Meran, Italy, December 13-17, pages
          <fpage>333</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T. T.</given-names>
            <surname>Vu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Luong</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.-P.</given-names>
            <surname>Hosom</surname>
          </string-name>
          .
          <article-title>Vietnamese large vocabulary continuous speech recognition</article-title>
          .
          <source>In Proceedings of INTERSPEECH</source>
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Guo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <article-title>The Vietnamese speech recognition based on rectified linear units deep neural network and spoken term detection system combination</article-title>
          .
          <source>In The 9th International Symposium on Chinese Spoken Language Processing, Singapore, September 12-14</source>
          ,
          <year>2014</year>
          , pages
          <fpage>183</fpage>
          -
          <lpage>186</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>