<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition. CoRR, abs/</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Playing with NeMo for Building an Automatic Speech Recogniser for Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Fabio Tamburini</string-name>
          <email>fabio.tamburini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FICLIT - University of Bologna</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>This paper presents work in progress on the creation of a Large Vocabulary Automatic Speech Recogniser for Italian using NVIDIA NeMo. Thanks to this package, we were able to build a reliable recogniser for adults' speech by fine-tuning the English model provided by NVIDIA and rescoring its output with powerful neural language models, obtaining very good performance. The lack of a standard, reliable and publicly available baseline for Italian motivated this work.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>The advent of the “Deep Learning Revolution”
has also brought astonishing changes to the field of
speech processing, allowing for the development of
brand new tools and devices able to recognise and
synthesise speech with performance never
seen before. It is sufficient to think of the new
virtual assistants that populate our houses and
mobile phones to get an immediate idea of
the improvements in this research field.</p>
      <p>Most big IT companies have developed, in the past
three or four years, solutions well integrated with various
devices that include high-performance tools for
speech processing. However, these solutions are very
often not released freely: sometimes they
require registration and fees and, in the best
cases, the code is free but the models for a
specific language are not available. A notable
exception is NVIDIA NeMo<sup>1</sup>, a conversational
AI toolkit built for researchers working on
Automatic Speech Recognition (ASR), Natural
Language Processing (NLP), and Text-To-Speech
synthesis (TTS). The primary objective of NeMo is
to help researchers from industry and academia
to reuse prior work, namely code and pretrained
models for various languages, and to make it easier
to create new conversational AI models, possibly
adapting tools and models to specific languages or
particular domains.</p>
      <p><sup>1</sup> Other exceptions that also provide multilingual models
including Italian are Facebook Wav2Vec and SpeechBrain.</p>
      <p>Copyright © 2021 for this paper by its author. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>This paper reports an attempt to build a high
performance Large Vocabulary ASR system for
Italian adults’ speech by exploiting all the features
available in NeMo and most of the largest Italian
spoken corpora available to the community.</p>
      <p>Section 2 describes the speech datasets
used for developing the model and
Section 3 reviews the state of the art;
Section 4 describes the NeMo ASR model
used in the experiments, and Section 5 discusses
the experiments and the results obtained. Section
6 draws some provisional conclusions about our
work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Italian Spoken Corpora for ASR</title>
      <p>This section describes the datasets we used for the
creation of the Italian ASR model. These are, of course,
not the only spoken
corpora available, but they are the biggest corpora
commonly used for setting up an ASR system for
Italian. They are typically very big, already
organised and structured for training ASR
systems, or specifically designed to maximise their
impact and usefulness for ASR. As far as we know,
this is the first attempt
to use all of them for ASR training in a single
project.</p>
      <sec id="sec-2-1">
        <title>2.1 Mozilla Common Voice (v7.0)</title>
        <p>
          Common Voice
          <xref ref-type="bibr" rid="ref2">(Ardila et al., 2020)</xref>
          is a
crowdsourcing project started by Mozilla to create a free
database for setting up speech recognition
software. The project is supported by volunteers who
record sample sentences with a microphone and
review recordings of other users. The transcribed
utterances will be collected in a voice database
available under the public domain license CC0.
This license ensures that developers can use the
database for voice-to-text applications without
restrictions or costs.
        </p>
        <p>With regard to the Italian subcorpus, the
current<sup>2</sup> release is version 7.0 (MCV7), containing
6,407 speakers for a total of 160,570 utterances
with the correct transcriptions. In the standard
split provided with the dataset, the training
set contains 131,041 utterances corresponding to
189.50 hours of speech, the validation set 14,764
utterances for 24.41 hours and the test set 14,765
utterances corresponding to 25.74 hours.</p>
        <p>These splits are very important for our
experiments, as discussed in Section 5.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2 Multilingual LibriSpeech</title>
        <p>
          The Multilingual LibriSpeech<sup>3</sup> (MLS) dataset
          <xref ref-type="bibr" rid="ref13">(Pratap et al., 2020)</xref>
          is a large multilingual corpus
suitable for speech research. The dataset is derived
from read audiobooks from LibriVox and covers
8 languages: English, German, Dutch,
Spanish, French, Italian, Portuguese and Polish. The
Italian section contains 42,935 utterances for a
total of 160.06 hours of transcribed speech.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3 VoxForge</title>
        <p>VoxForge<sup>4</sup> is an open speech dataset that was
set up to collect transcribed speech for use with
Free and Open Source Speech Recognition
Engines. The Italian portion of VoxForge contains
10,633 utterances totalling 20.16 hours of
transcribed speech.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4 APASCI</title>
        <p>
          APASCI
          <xref ref-type="bibr" rid="ref1">(Angelini et al., 1994)</xref>
          is an Italian
speech database recorded in an insulated room
with a Sennheiser MKH 416 T microphone. The
speech material, consisting of 2,170 utterances
with a wide phonetic/diphonic coverage and
totalling 2.91 hours of speech, was read by 100
Italian speakers (50 male and 50 female). The
database includes the transcription of each
utterance at both the phonemic and the orthographic
level. In the past, this database made it possible to design,
train and evaluate continuous speech recognition
systems (speaker independent, speaker adaptive,
speaker dependent, multispeaker). It was also
designed for research on acoustic modelling, on
acoustic parameters for speech recognition
and on speaker recognition.
        </p>
        <p><sup>2</sup> July 2021.</p>
        <p><sup>3</sup> http://www.openslr.org/94/</p>
        <p><sup>4</sup> http://www.voxforge.org/</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 State of the Art for Italian ASR</title>
      <p>In order to properly describe the state of the art,
we should first define the typical metrics used for
evaluating ASR systems. Given the system
transcription for an utterance and the correct
transcription extracted from the gold standard, the most
important metric is certainly the Word Error Rate
(WER) defined as
<disp-formula><tex-math>\mathrm{WER} = \frac{\text{Insertions} + \text{Substitutions} + \text{Deletions}}{\text{Gold Number of Words}},</tex-math></disp-formula>
typically expressed in percentage. It compares the
two transcriptions counting all the differences at
word level using the edit distance between them.
We can also define the Phone Error Rate (PER)
and the Character Error Rate (CER) that use the
same principle but applied, respectively, at phone
or character level.</p>
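      <p>As an illustration of these definitions, the following minimal Python sketch
computes WER and CER through the edit distance between two token sequences. The
function names are ours and purely illustrative, and splitting on whitespace is
an assumption about tokenisation; this is not the scoring code used in the
experiments.</p>
      <preformat>
```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate as a percentage (words are whitespace-separated)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("the reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate as a percentage."""
    if not reference:
        raise ValueError("the reference must contain at least one character")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


if __name__ == "__main__":
    # One substitution (gatto/gato) and one deletion (sul) over five gold words.
    print(wer("il gatto dorme sul divano", "il gato dorme divano"))
    print(cer("gatto", "gato"))
```
      </preformat>
      <p>PER follows the same scheme once both transcriptions are given as sequences of phones.</p>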
      <p>
        Examining the literature on the construction of
ASR models for Italian, we immediately recognise
a lack of works devoted to building a general
Large Vocabulary ASR system for adults’ speech. The
only work we found on this topic, presented by
        <xref ref-type="bibr" rid="ref7">Cosi and Hosom (2000)</xref>,
        used a rather old approach to
the problem (a hybrid HMM/ANN architecture)
and measured performance only on phones and
not on words. Using PER instead of the more
common WER is a common trait of all the subsequent
works we found in the literature
        <xref ref-type="bibr" rid="ref6 ref8 ref9">(Cosi and Pellom,
2005; Cosi, 2008; Cosi et al., 2014; Cosi, 2015)</xref>,
        which applied many different system architectures
only to children’s speech. This large body of work
represents the main line of research on building
Italian ASR systems, but the aim of these studies
is completely different from ours and, moreover,
their results are not directly comparable with ours.
      </p>
      <p>
        An exception to the above is the work of
        <xref ref-type="bibr" rid="ref10">Gretter (2014)</xref>:
        first, he built
a large multilingual benchmark corpus, extracting
data from the Euronews portal, consisting of about
100 hours of adults’ speech for each language;
second, he developed some ASR baselines
based on triphone Hidden Markov Models and
n-gram language models, obtaining on Italian a
word recognition accuracy of 83.5%, corresponding to a
WER of 16.5%, a quite remarkable result for
non-neural stochastic systems.
      </p>
      <p>
        More recent studies employing neural
models were able to build other quite reliable
systems. Weibin (2019) trained a system based on
DeepSpeech
        <xref ref-type="bibr" rid="ref11">(Hannun et al., 2014)</xref>
        using the
VoxForge, CLIPS<sup>5</sup>, SI-CALLIOPE
        <xref ref-type="bibr" rid="ref15">(Tedesco et al., 2018)</xref>,
        LibriVox Audiobooks<sup>6</sup> and Mozilla
Common Voice corpora, for a total of 438 hours of
speech, obtaining a WER of 13.8% on a mixed test
set. Pratap et al. (2020) ran some experiments
using wav2letter++<sup>7</sup> followed by 5-gram
rescoring, obtaining a test WER of 28.19%. These studies used
test sets different from the one used in this work;
their results therefore give only a general
indication of the achievable WER and are not directly
comparable to ours.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4 NVIDIA NeMo ASR</title>
      <p>Traditional speech recognition takes a
generative approach, modelling the recognition
of the acoustics of speech sounds (O) as
<inline-formula><tex-math>\hat{W} = \operatorname{argmax}_{W} P(O|W)\,P(W)</tex-math></inline-formula>, where W is a
possible transcription as a sequence of words. The
components involved include a language model
P(W) that estimates the most likely
orderings of words in a given language (e.g. an
n-gram model), a pronunciation model for each
word in that sequence (e.g. a lexicon of
phonetically transcribed words) and an acoustic model
P(O|W) that estimates the probability of
an input sequence of acoustic observations given
each possible word sequence W. When we
receive some spoken input, our goal is to find
the word sequence with the highest probability
given the speech-acoustic input.</p>
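      <p>The decision rule above follows from Bayes' rule. A short derivation is
sketched below, where the hat notation for the selected transcription is ours
and P(O) can be dropped because it does not depend on W:</p>
      <preformat>
```latex
\hat{W} = \operatorname*{argmax}_{W} P(W \mid O)
        = \operatorname*{argmax}_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \operatorname*{argmax}_{W} P(O \mid W)\, P(W)
```
      </preformat>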
      <p>Over time, neural networks advanced to the point
where each component of the traditional speech
recognition model could be replaced by a neural
model with better performance and a
greater potential for generalisation. For example,
we could replace an n-gram model with a neural
language model, a pronunciation
table with a neural pronunciation model, and so on.
However, each of these neural models needs to be
trained individually on a different task, and errors
in any model in the pipeline could throw off the
whole prediction.</p>
      <p><sup>5</sup> http://www.clips.unina.it</p>
      <p><sup>6</sup> https://librivox.org/</p>
      <p><sup>7</sup> https://github.com/flashlight/wav2letter</p>
      <p>Nowadays the field is largely dominated by end-to-end
discriminative ASR architectures: models that simply take a sequence of
audio inputs and give a sequence of textual
outputs, and in which all components of the
architecture are trained jointly towards the same goal.
The model’s encoder
is akin to an acoustic model that extracts
speech features, which are then piped directly
to a decoder that outputs text, as a
sequence of characters, in a given language. If
desired, we can still integrate a language model
to improve the predictions, piping it
after the decoder<sup>8</sup>.</p>
      <p>
        From the NeMo GitHub site<sup>9</sup>
we learn that the base ASR model provided by
NVIDIA is Jasper (“Just Another Speech Recognizer”)
        <xref ref-type="bibr" rid="ref17">(Li et al., 2019)</xref>,
        a deep Time Delay
Neural Network composed of blocks of
1D-convolutional layers. The Jasper family of
models is denoted as “Jasper [BxR]”, where B is the
number of blocks and R is the number of
convolutional sub-blocks within a block. Each sub-block
contains a 1-D convolution, batch normalisation,
ReLU, and dropout.
      </p>
      <p>Most state-of-the-art ASR models are extremely
large; they tend to have on the order of a few
hundred million parameters. This makes them hard to
deploy on a large scale given the current limitations of
edge devices. NeMo also includes another model,
QuartzNet (Kriman et al., 2020), a
version of Jasper with separable convolutions and
larger filters. It can achieve performance similar
to Jasper’s but with an order of magnitude fewer
parameters. As with Jasper, the QuartzNet
family of models is denoted as “QuartzNet [BxR]”,
where B is the number of blocks and R is the
number of convolutional sub-blocks within a block;
these models avoid computationally costly
recurrent layers in favour of more efficient
convolutional layers. Each sub-block contains a 1-D
separable convolution, batch normalisation, ReLU, and
dropout (see Figure 1 for a complete diagram
describing the QuartzNet internal structure). Both
models described above optimise the
Connectionist Temporal Classification (CTC) loss.</p>
      <p>NVIDIA also provides a large number of
pre-trained models<sup>10</sup> for various languages. The two
models for English “STT en Quartznet15x5” and
Italian “STT it Quartznet15x5” (both at version
1.0.0rc1, published on 30 June 2021) are
relevant for our work. The Quartznet 15x5 model
family consists of 79 layers and has a total of 18.9
million parameters, with five blocks that repeat fifteen
times plus four additional convolutional layers.</p>
      <p><sup>8</sup> Partially taken from https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/intro.html</p>
      <p><sup>9</sup> https://github.com/NVIDIA/NeMo</p>
      <p>The NVIDIA checkpoints of the English QuartzNet15x5 encoder and
decoder neural modules
were trained using Multilingual LibriSpeech and
Mozilla’s English Common Voice 6.1 “validated”
set (a huge amount of data containing more than
3,300 hours of speech) with two types of data
augmentation: speed perturbation and
Cutout. Speed perturbation means that additional
training samples were created by slowing down
or speeding up the original audio data by 10%.
Cutout refers to randomly masking small
rectangles out of the spectrogram input as a
regularisation technique. NVIDIA’s Apex/Amp O1
optimisation level was used for training, achieving
4.19% WER on LibriSpeech test-clean.</p>
      <p>The NeMo documentation also describes a
procedure for fine-tuning the English model to adapt it
to other languages, keeping the acoustic encoder
frozen and fine-tuning the decoder to
produce transcriptions for a different language (Huang
et al., 2020). The cited paper also reaches the
relevant conclusion that it is much better, in terms
of performance, to fine-tune the English model
than to train a new model from scratch for a
specific language. The Italian model provided by
NVIDIA was produced following the
suggested procedure, in particular by retraining the
QuartzNet decoder on the training portion of
MCV version 6.1. We will consider this Italian
model as the baseline for our experiments.</p>
      <p><sup>10</sup> https://ngc.nvidia.com/catalog/collections/nvidia:nemo_asr</p>
    </sec>
    <sec id="sec-5">
      <title>5 Model Setup and Results</title>
      <p>The STT it QuartzNet model provided by
NVIDIA was trained using a reduced set of data
and applying an output dictionary that includes
some characters that do not belong to the Italian
alphabet. For these reasons we preferred to restart
the fine-tuning process directly from the original
STT en Quartznet15x5 English model.</p>
      <p>The training set we used to fine-tune the
NVIDIA STT en model to Italian was obtained by
joining the training portion of MCV7 and all files
from MLS, VoxForge and APASCI, and contains
186,778 utterances/speech files totalling 372.62
hours of transcribed speech. From this set, 19,199 utterances/files,
totalling 97.77 hours of speech, were filtered out. This is because
some datasets, mainly MLS
and VoxForge, contain utterances longer
than 16.7 seconds, a time limit hard-coded into
NeMo in order to keep the model
computationally tractable. We also checked that the transcriptions
contain only the 34 standard characters of the
Italian alphabet (26 lowercase letters plus six
accented characters, the apostrophe and the space), as
it is standard practice in ASR to lowercase
transcriptions and to remove any punctuation mark not
strictly useful or relevant for recognition.</p>
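      <p>The two checks described above can be sketched in Python as follows. This is
only an illustration under stated assumptions: the entries are taken to be JSON
lines with "duration" and "text" fields, as in NeMo manifest files; the six
accented characters are assumed to be à, è, é, ì, ò and ù, which the text does
not list; and the function names are hypothetical.</p>
      <preformat>
```python
import json

MAX_DURATION = 16.7  # seconds, the limit reported in the text

# Assumption: the six accented characters are the ones below.
ALLOWED_CHARS = set("abcdefghijklmnopqrstuvwxyz" + "àèéìòù" + "' ")


def normalise(text):
    """Lowercase a transcription and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def keep(entry):
    """True if the utterance is short enough and uses only allowed characters."""
    short_enough = MAX_DURATION >= entry["duration"]
    clean = set(normalise(entry["text"])).issubset(ALLOWED_CHARS)
    return short_enough and clean


def filter_manifest(lines):
    """Yield the manifest entries (parsed from JSON lines) that pass both checks."""
    for line in lines:
        entry = json.loads(line)
        if keep(entry):
            entry["text"] = normalise(entry["text"])
            yield entry


if __name__ == "__main__":
    sample = [
        '{"audio_filepath": "a.wav", "duration": 3.2, "text": "Perché no"}',
        '{"audio_filepath": "b.wav", "duration": 21.0, "text": "troppo lungo"}',
        '{"audio_filepath": "c.wav", "duration": 2.0, "text": "ciao, mondo"}',
    ]
    for entry in filter_manifest(sample):
        print(entry["audio_filepath"], entry["text"])
```
      </preformat>
      <p>Here an utterance containing disallowed characters is discarded rather than
cleaned; stripping the punctuation before the check would be an equally
plausible reading of the procedure.</p>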
      <p>With regard to decoding and rescoring, NeMo
offers various possibilities:</p>
      <list list-type="bullet">
        <list-item>
          <p>Greedy Decoding. This method, also called the “best-path decoder”,
simply computes the most likely sequence of characters given the
audio input.</p>
        </list-item>
        <list-item>
          <p>Beam Search Decoding. Beam Search
Decoding (BSD) is another way of decoding the model
predictions that leads to better results than
greedy search. Instead of always choosing
the best prediction at each step, BSD
considers the top-K hypotheses with the
highest probabilities, where K is the so-called
beam size. For all the subsequent experiments
we used beam_size=1024, beam_alpha=1.0 and
beam_beta=0.5 (see the NeMo documentation).</p>
          <p>Language Models (LMs) have been shown to improve the
accuracy of ASR models when combined with
BSD. NeMo currently supports the following
two approaches to incorporating language models
into ASR models through BSD:</p>
          <list list-type="bullet">
            <list-item>
              <p>N-gram Rescoring. In this approach, an
N-gram Language Model is trained on text data
and then used in fusion with beam search
decoding to find the best candidates. The beam
search decoders in NeMo support language
models trained with the KenLM library
                <xref ref-type="bibr" rid="ref12">(Heafield et al., 2013)</xref>.
                We used this library
to build a 3-gram and a 6-gram
LM using the 165-million-token version of
the CORIS corpus<sup>11</sup>
                <xref ref-type="bibr" rid="ref14">(Rossini Favretti et al., 2002)</xref>,
                specially cleaned and prepared for this
task.</p>
            </list-item>
            <list-item>
              <p>Neural Rescoring. In the neural rescoring
approach, a neural network is used to score
the candidate transcripts predicted
by the decoder of the ASR model. The top K
candidates produced by beam search
decoding are given to a neural language model
to rank them. Its score is usually combined
with the scores from beam search
decoding to produce the final scores and
rankings. NeMo neural LMs are based on the
Transformer sequence-to-sequence
architecture described in
                <xref ref-type="bibr" rid="ref16">Vaswani et al. (2017)</xref>.
                Again, we used the CORIS corpus
described above to train an Italian neural LM
from scratch and, after a month of training,
we reached a perplexity of 29.30.</p>
            </list-item>
          </list>
        </list-item>
      </list>
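      <p>The decoding options listed above can be illustrated with a toy Python
sketch. It is not NeMo code: the function names are hypothetical, the CTC blank
symbol is assumed to be the last index of each frame, and the combined score
(acoustic log-probability plus alpha times the LM log-probability plus beta
times the number of words) is one common formulation that we assume here for
illustration. The candidate scores in the example are invented.</p>
      <preformat>
```python
def greedy_ctc_decode(frame_scores, vocab):
    """Best-path decoding: take the top symbol of each frame, merge repeats, drop blanks.

    frame_scores: one list of scores per frame, of length len(vocab) + 1.
    Assumption: the blank symbol is the last index.
    """
    blank = len(vocab)
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            chars.append(vocab[index])
        previous = index
    return "".join(chars)


def rescore(candidates, alpha=1.0, beta=0.5):
    """Sort beam-search candidates by a combined score, best first.

    Each candidate is a tuple (text, acoustic_logprob, lm_logprob).
    """
    def combined(candidate):
        text, acoustic, lm = candidate
        return acoustic + alpha * lm + beta * len(text.split())

    return sorted(candidates, key=combined, reverse=True)


if __name__ == "__main__":
    vocab = ["c", "i", "a", "o"]
    path = [0, 0, 4, 1, 2, 2, 4, 3]  # 4 is the blank
    frames = [[1.0 if k == i else 0.0 for k in range(5)] for i in path]
    print(greedy_ctc_decode(frames, vocab))

    # Invented scores: the acoustic model slightly prefers the misspelt candidate.
    beam = [("il gato dorme", -4.0, -12.0), ("il gatto dorme", -4.5, -6.0)]
    print(rescore(beam)[0][0])
```
      </preformat>
      <p>In the toy example the acoustic score alone would pick the first candidate,
while the language model score promotes the correctly spelt one, which is the
effect that rescoring is meant to achieve.</p>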
      <p>Given these possibilities, we fine-tuned the
STT en model on a single V100 GPU using the
joined dataset described above, with the MCV7
validation set used for early stopping of
the training process and the MCV7 test set used to evaluate all models.
The hyperparameters we modified w.r.t. the
original English model, which are contained in the model
itself, are listed in Table 1.</p>
      <p>As a notable exception to the procedure suggested by NVIDIA
for fine-tuning a model, we have to
report that we obtained the best results by
unfreezing the encoder and letting it slightly adapt
the extracted speech features to the new language,
namely Italian, which certainly shares most of its
sounds with the language of the starting English model STT en,
but also contains specific sounds (e.g. [ɲ] and [ʎ])
that may require small adaptations.</p>
      <p><sup>11</sup> Corresponding to the brand new 2021 update.</p>
      <table-wrap id="tab-1">
        <label>Table 1</label>
        <table>
          <thead>
            <tr><th>Par.</th></tr>
          </thead>
          <tbody>
            <tr><td>train_ds.batch_size</td></tr>
            <tr><td>validation_ds.batch_size</td></tr>
            <tr><td>optim.lr</td></tr>
            <tr><td>optim.betas</td></tr>
            <tr><td>optim.weight_decay</td></tr>
            <tr><td>optim.warmup_steps</td></tr>
            <tr><td>optim.sched.min_lr</td></tr>
            <tr><td>trainer.precision</td></tr>
            <tr><td>trainer.amp_level</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Table 2 outlines our results after a complete fine-tuning
of the end-to-end ASR model using the
Italian dataset described above and applying
different decoding and rescoring schemes. The
improvement obtained with the fine-tuning process
over the original model delivered by
NVIDIA is relevant but not large, whereas
applying BSD with the rescoring algorithms
improves the WER by roughly 30% (n-gram rescoring) to almost 40%
(neural rescoring) in relative terms w.r.t. the greedy
decoding scheme.</p>
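      <p>The relative improvements mentioned above can be checked with a few lines of
Python. The sketch assumes that the first figure in each cell of Table 2 is the
WER and uses the test-set values; the helper name is ours.</p>
      <preformat>
```python
def relative_reduction(baseline, improved):
    """Relative error reduction, as a percentage of the baseline value."""
    return 100.0 * (baseline - improved) / baseline


# Test-set WER of the retrained model, copied from Table 2.
TEST_WER = {"greedy": 15.82, "3-gram": 10.96, "6-gram": 10.94, "neural": 9.67}

# Test-set WER of the NVIDIA baseline with greedy decoding, from Table 2.
BASELINE_GREEDY_WER = 16.90

if __name__ == "__main__":
    for name in ("3-gram", "6-gram", "neural"):
        gain = relative_reduction(TEST_WER["greedy"], TEST_WER[name])
        print(name, "rescoring vs greedy:", round(gain, 1))
    gain = relative_reduction(BASELINE_GREEDY_WER, TEST_WER["greedy"])
    print("retraining vs baseline (greedy):", round(gain, 1))
```
      </preformat>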
      <sec id="sec-5-1">
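        <table-wrap id="tab-2">
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>System</th><th>Valid.</th><th>Test</th></tr>
            </thead>
            <tbody>
              <tr><td colspan="3">Baseline (NVIDIA STT it)</td></tr>
              <tr><td>Greedy Decoding</td><td>15.64/4.00</td><td>16.90/4.46</td></tr>
              <tr><td>BSD &amp; 3-gram Resc.</td><td>10.79/3.18</td><td>11.59/3.54</td></tr>
              <tr><td>BSD &amp; 6-gram Resc.</td><td>10.77/3.17</td><td>11.57/3.53</td></tr>
              <tr><td>BSD &amp; Neural Resc.</td><td>9.54/</td><td>10.51/</td></tr>
              <tr><td colspan="3">NVIDIA STT en + Our Retraining</td></tr>
              <tr><td>Greedy Decoding</td><td>14.86/3.78</td><td>15.82/4.14</td></tr>
              <tr><td>BSD &amp; 3-gram Resc.</td><td>10.41/2.97</td><td>10.96/3.27</td></tr>
              <tr><td>BSD &amp; 6-gram Resc.</td><td>10.36/2.95</td><td>10.94/3.26</td></tr>
              <tr><td>BSD &amp; Neural Resc.</td><td>9.04/</td><td>9.67/</td></tr>
            </tbody>
          </table>
        </table-wrap>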
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6 Conclusions</title>
      <p>This paper presented work in progress on the
construction of a reliable, high-performance ASR system
for Italian adults’ speech. Thanks to the NVIDIA
NeMo package, we were able to produce a very
strong baseline reaching a WER of 9.67% on the
MCV7 test set.</p>
      <p>This is only the beginning of our work, as any
change in the kind of speech used to train the
system could degrade the overall performance; however,
having used a collection of four different datasets
containing thousands of different speakers and
speech utterances to set up this ASR system,
we believe that the result should be robust enough.
Unfortunately, the lack of a standardised
benchmark for Italian does not allow for a quantitative
and objective evaluation of this statement.</p>
      <p>The end-to-end character-level ASR model, and its
improvement in WER, is only part of the game: the
work on decoding and rescoring procedures
produced much larger improvements. Thus, the most
important “take-home lesson” is certainly to
focus on the development of high-performance LMs
specifically tuned for ASR.</p>
      <p>All the models presented in this paper, as well as
the scripts and additional code for using NeMo
and generating the results, will be made
available<sup>12</sup>.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>We acknowledge the CINECA<sup>13</sup> award no.
HP10C7XVUO (project QT4CLML) under the
ISCRA initiative, for the availability of HPC
resources and support.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>B.</given-names>
            <surname>Angelini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Brugnara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Falavigna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Giuliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gretter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Omologo</surname>
          </string-name>
          .
          <year>1994</year>
          .
          <article-title>Speaker Independent Continuous Speech Recognition Using An Acoustic-Phonetic Italian Corpus</article-title>
          .
          <source>In Proc. of the 3rd International Conference on Spoken Language Processing - ICSLP '94</source>
          , pages
          <fpage>1391</fpage>
          -
          <lpage>1394</lpage>
          , Yokohama, Japan.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Ardila</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Branson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kohler</surname>
          </string-name>
          , J. Meyer, M. Henretty,
          <string-name>
            <given-names>R.</given-names>
            <surname>Morais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saunders</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tyers</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Weber</surname>
          </string-name>
          .
          <year>2020</year>
          . Common 12https://github.com/ftamburin/ItaNeMo
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>ASR 13https://www.cineca.it/en</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>In Proc. of the 12th Language Resources and</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Evaluation</given-names>
            <surname>Conference</surname>
          </string-name>
          , pages
          <fpage>4218</fpage>
          -
          <lpage>4222</lpage>
          , Mar-
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Cosi</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>A KALDI-DNN-based ASR system for Italian</article-title>
          .
          <source>In Proc. 2015 International Joint Conference on Neural Networks (IJCNN)</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Cosi</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.P.</given-names>
            <surname>Hosom</surname>
          </string-name>
          .
          <year>2000</year>
          .
          <article-title>High performance “general purpose” phonetic recognition for Italian</article-title>
          .
          <source>In Sixth International Conference on Spoken Language Processing, ICSLP 2000/Interspeech 2000</source>
          , Beijing, China, October 16-20,
          <year>2000</year>
          , pages
          <fpage>527</fpage>
          -
          <lpage>530</lpage>
          . ISCA.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Cosi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nicolao</surname>
          </string-name>
          , G. Paci, G. Sommavilla, and
          <string-name>
            <given-names>T.</given-names>
            <surname>Tesser</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Comparing open source ASR toolkits on Italian children speech</article-title>
          .
          <source>In Proc. 4th Workshop on Child Computer Interaction (WOCCI</source>
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Cosi</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.L.</given-names>
            <surname>Pellom</surname>
          </string-name>
          .
          <year>2005</year>
          .
          <article-title>Italian children's speech recognition for advanced interactive literacy tutors</article-title>
          .
          <source>In Proc. Interspeech</source>
          <year>2005</year>
          , pages
          <fpage>2201</fpage>
          -
          <lpage>2204</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Gretter</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Euronews: a multilingual benchmark for ASR and LID</article-title>
          .
          <source>In Proc. of the 15th Annual Conference of the International Speech Communication Association - INTERSPEECH</source>
          <year>2014</year>
          , pages
          <fpage>1603</fpage>
          -
          <lpage>1607</lpage>
          , Singapore. ISCA.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Hannun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Case</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Casper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Catanzaro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Diamos</surname>
          </string-name>
          , E. Elsen,
          <string-name>
            <given-names>R.</given-names>
            <surname>Prenger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Satheesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sengupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Coates</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Deep speech: Scaling up end-to-end speech recognition</article-title>
          .
          <source>CoRR, abs/1412.5567</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Heafield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Pouzyrevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.H.</given-names>
            <surname>Clark</surname>
          </string-name>
          , and
          <string-name>
            <given-names>P.</given-names>
            <surname>Koehn</surname>
          </string-name>
          .
          <year>2013</year>
          .
          <article-title>Scalable modified Kneser-Ney language model estimation</article-title>
          .
          <source>In Proc. of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)</source>
          , pages
          <fpage>690</fpage>
          -
          <lpage>696</lpage>
          , Sofia, Bulgaria. Association for Computational Linguistics.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>V.</given-names>
            <surname>Pratap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sriram</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Synnaeve</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>MLS: A large-scale multilingual dataset for speech research</article-title>
          .
          <source>In Proc. of the 21st Annual Conference of the International Speech Communication Association (Interspeech</source>
          <year>2020</year>
          ), pages
          <fpage>2757</fpage>
          -
          <lpage>2761</lpage>
          , Shanghai, China.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Rossini Favretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tamburini</surname>
          </string-name>
          , and
          <string-name>
            <given-names>C.</given-names>
            <surname>De Santis</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model</article-title>
          . In A. Wilson,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rayson</surname>
          </string-name>
          , and T. McEnery, editors,
          <source>A Rainbow of Corpora: Corpus Linguistics and the Languages of the World</source>
          , pages
          <fpage>27</fpage>
          -
          <lpage>38</lpage>
          . Lincom-Europa, Munich.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Tedesco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cenceschi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Sbattella</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Verso il riconoscimento automatico della prosodia</article-title>
          .
          <source>In Proc. AISV</source>
          <year>2018</year>
          , pages
          <fpage>433</fpage>
          -
          <lpage>439</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>In Proceedings of the 31st International Conference on Neural Information Processing Systems - NIPS'17</source>
          , pages
          <fpage>6000</fpage>
          -
          <lpage>6010</lpage>
          , Red Hook, NY, USA. Curran Associates Inc.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>C.</given-names>
            <surname>Weibin</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Phoenix: Deep Speech Based Automatic Speech Recognition System for Italian Language</article-title>
          .
          <source>Master's Thesis</source>
          , Politecnico di Milano.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>