<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards automatic spoken grammatical error correction of L2 learners of English</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Stefano Bannò</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michela Rais</string-name>
          <email>michela.rais@studenti.unitn.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Matassoni</string-name>
          <email>matasso@fbk.eu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Center for Mind/Brain Sciences, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 31, Rovereto (TN), 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Psychology and Cognitive Science, University of Trento</institution>
          ,
          <addr-line>Corso Bettini 84, Rovereto (TN), 38068</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>via Sommarive 18, Trento, 38123</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The demand for learning English as a second language (L2) has been growing consistently over the past decades, as it has become the lingua franca of culture, entertainment, business, and academia. In this regard, mastering grammar is one of the key elements of L2 proficiency. In this paper, we illustrate an approach to spoken grammatical error correction (GEC) in a cascaded fashion using only publicly available training data. Specifically, we start from learners’ utterances, investigate disfluency detection (DD) and removal, and finally explore GEC. Despite using only publicly available data, we achieve promising results that are aligned with previous studies which leveraged a large proprietary dataset. We discuss these results and reflect on some open issues and challenges of spoken GEC.</p>
      </abstract>
      <kwd-group>
        <kwd>computer-assisted language learning</kwd>
        <kwd>spoken grammatical error correction</kwd>
        <kwd>disfluency detection</kwd>
        <kwd>L2 assessment and feedback</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With the rise of English as the global language of culture, entertainment, business, and academia, the ability to speak it fluently has become increasingly valued and the demand for learning English as a second language (L2) has been consistently increasing over the past decades [1]. This has resulted in a growing interest in automated approaches to evaluate spoken language proficiency for applications in Computer-Assisted Language Learning (CALL), both for individual practice and classroom settings, as well as to certify proficiency in language exams.</p>
      <p>In particular, the assessment of learners’ grammar through grammatical error correction (GEC) has attracted considerable attention over the past years. While text-based GEC has become an established area of study [2, 3], spoken GEC is still a relatively new area of research, mainly due to the limited availability of specifically designed and annotated data [4]. Assessing spoken grammar requires several adjustments to standard GEC models, as these tend not to generalize to speech. Spoken GEC (see Table 2) is in fact more challenging than written GEC (see Table 1), as spoken grammar tends to be more flexible and less encoded than written grammar [5]. L2 spoken grammar is often characterized by disfluencies, which occur naturally in speech, and its errors differ from the ones made by L2 learners in written texts. As a result, spoken GEC is typically approached in a cascaded fashion: an automatic speech recognition (ASR) module is used to transcribe the spoken text. This is followed by a disfluency detection (DD) and removal step and, finally, by GEC.</p>
      <p>Ital-IA 2023: 3rd National Conference on Artificial Intelligence, organized by CINI, May 29–31, 2023, Pisa, Italy. ∗Corresponding author. ORCID: 0000-0002-2799-0601 (S. Bannò); 0009-0006-5873-8894 (M. Rais).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>We employed transformer-based models both for DD and spoken GEC. In this section, we describe the data used for training our models, which we tested on the TLT-GEC, a subset of the TLT corpus, a small proprietary corpus of young Italian learners of English presented in [7]. For the DD module training we employed two corpora, the NICT-JLE and the KIT Speaking Test Corpus.</p>
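      <p>The cascaded treatment of spoken GEC described above (ASR, then disfluency detection and removal, then GEC) can be illustrated as a simple pipeline. The three stages below are toy stubs standing in for the actual ASR, BERT-based DD, and T5 GEC models; the filler list and correction table are invented for illustration only.</p>

```python
from typing import List

# Toy stubs for the three cascade stages (ASR -> DD and removal -> GEC).
# Real systems would call an ASR model, a BERT-based disfluency tagger,
# and a T5 GEC model; names and rules here are illustrative only.

def asr_stub(audio: str) -> List[str]:
    # Stand-in for ASR: treat the input string as its own transcription.
    return audio.lower().split()

def remove_disfluencies_stub(words: List[str]) -> List[str]:
    # Toy rule: drop filled pauses and partial words ending in "-".
    fillers = {"uhm", "uh", "er"}
    return [w for w in words if w not in fillers and not w.endswith("-")]

def gec_stub(words: List[str]) -> str:
    # Toy lookup table standing in for a seq2seq GEC model.
    corrections = {"catched": "caught"}
    return " ".join(corrections.get(w, w) for w in words)

def spoken_gec_cascade(audio: str) -> str:
    # ASR, then disfluency removal, then GEC, as in the cascaded setup.
    return gec_stub(remove_disfluencies_stub(asr_stub(audio)))
```

      <p>One practical appeal of the cascaded design is that swapping a stub for a real model changes only one stage at a time.</p>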
      <sec id="sec-2-1">
        <title>2.1. NICT-JLE</title>
        <p>The National Institute of Information and
Communications Technology - Japanese Learner English (NICT-JLE)
corpus, originally introduced in [12], is a collection of
manual transcriptions of approximately 300 hours of oral
interviews of Japanese learners of English which does not
include the original audio recordings.1 A subset of the
corpus was manually annotated with disfluencies as well
as grammatical errors which were corrected.
Furthermore, this subset includes annotations about proficiency
scores ranging from A1 to B2 of the Common European
Framework of Reference (CEFR) [13].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. KIT Speaking Test Corpus</title>
        <p>The Kyoto Institute of Technology (KIT) Speaking Test Corpus, released for public use by [14], consists of manual transcriptions of approximately 4,448 hours of interviews of 574 Japanese undergraduate students.2 As in the case of NICT-JLE, the corpus does not include the original audio recordings. The manual annotations follow the tagging system employed in the NICT-JLE corpus; however, they only include disfluencies, whereas grammatical errors are not annotated. The proficiency level of the students approximately ranges from CEFR level A1 to B2.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. EFCAMDAT</title>
        <p>EFCAMDAT [8, 9, 10] is one of the largest publicly available L2 learner corpora and consists of 1,180,310 scripts written by 174,743 L2 learners.3 The scripts are annotated with POS tags and information on grammatical dependencies, and are partially error-tagged by human experts. After excluding noisy responses and incorrect annotations, we kept 762,475 responses, from which we removed punctuation and capitalisation in order to make them more similar to speech transcriptions. We used spaCy4 to extract pairs of parallel sentences (i.e., original versus corrected), from which we removed sentences shorter than 4 words as well as those containing broken XML tags and manual annotations on word limit. Following [15], we further excluded parallel sentences where the token edit distance is higher than 60% of the length of the original sentence, in order to guarantee consistency between the original sentences and their corrected counterparts.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. BEA-2019</title>
        <p>The corpora from the BEA 2019 shared task [11] are text-based corpora tagged with GEC annotations.5</p>
        <p>CLC-FCE: the Cambridge Learner Corpus - First Certificate in English (CLC-FCE) [16] is a publicly available section of the larger proprietary Cambridge Learner Corpus (CLC) [17], consisting of 1,244 FCE exam scripts.6</p>
        <p>Write &amp; Improve: a dataset derived from Write &amp; Improve with Cambridge, an online platform where L2 learners of English can practise their writing skills [18].7</p>
        <p>LOCNESS: a section of the Louvain Corpus of Native English Essays (LOCNESS), consisting of 100 essays written by L1 English undergraduates from the United Kingdom and the United States [19].</p>
        <p>Lang-8: the Lang-8 Corpus of Learner English is a dataset extracted from the Lang-8 website,8 whose users are encouraged to correct each other’s grammar [20, 21].</p>
        <p>NUCLE: the National University of Singapore Corpus of Learner English (NUCLE) is a collection of 1,400 essays written by Asian undergraduate students at the National University of Singapore [22].</p>
        <p>Including EFCAMDAT, the data used for training the spoken GEC system amount to 2,552,825 sentences, which we randomly split into a training set of 2,527,296 sentences and a development set of 25,529 sentences.</p>
        <p>As a benchmark for assessing the performance of the spoken GEC system, we employed the same test set of the CLC-FCE corpus used in previous studies [23, 4], with punctuation and capitalisation removed.</p>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. TLT-GEC</title>
        <p>The TLT-GEC is a small proprietary dataset of speech utterances of young Italian learners of English, which we have manually annotated with disfluencies and two sets of grammatical error corrections performed by two different human annotators. The dataset is derived from the larger TLT-school corpus presented by [7] and contains 1,127 sentences for a total of 4.96 hours. The CEFR proficiency levels of the speakers are approximately A2 and B1. The data was split into two sets, a development set of 605 sentences and a test set of 522 sentences, with non-overlapping speakers. The ASR transcriptions were obtained through a Conformer model made available by NVIDIA in the popular NeMo toolkit.9 The Conformer architecture [24] effectively combines self-attention layers and convolution blocks to learn global and local correlations simultaneously; this variant uses a decoder based on CTC loss instead of a standard RNN-T/Transducer, substituting the auto-regressive LSTM component with a simpler linear decoder. The word error rate (WER) is 24.72% considering both development and test sets.</p>
        <p>1 alaginrc.nict.go.jp/nict_jle/index_E.html#license; 2 kitstcorpus.jp/; 3 philarion.mml.cam.ac.uk/; 4 spacy.io; 5 cl.cam.ac.uk/research/nl/bea2019st/#data; 6 ilexir.co.uk/datasets/index.html; 7 writeandimprove.com/; 8 lang-8.com/</p>
      </sec>
    </sec>
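      <p>The EFCAMDAT filtering steps described above (lowercasing, punctuation removal, dropping sentences shorter than 4 words, and excluding pairs whose token edit distance exceeds 60% of the original length) can be sketched as follows. The function names are ours, and the paper used spaCy rather than the whitespace tokenisation shown here.</p>

```python
import string

def normalise(sentence: str) -> list:
    # Lowercase and strip punctuation to resemble speech transcriptions.
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def token_edit_distance(a: list, b: list) -> int:
    # Standard Levenshtein distance computed over tokens.
    prev = list(range(len(b) + 1))
    for i, ta in enumerate(a, 1):
        cur = [i]
        for j, tb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ta != tb)))
        prev = cur
    return prev[len(b)]

def keep_pair(original: str, corrected: str, max_ratio: float = 0.6) -> bool:
    src, tgt = normalise(original), normalise(corrected)
    if len(src) < 4:  # drop sentences shorter than 4 words
        return False
    # exclude pairs whose edit distance exceeds 60% of the source length
    return token_edit_distance(src, tgt) <= max_ratio * len(src)
```

      <p>The edit-distance threshold is what guarantees that an "original/corrected" pair really is a correction rather than a rewrite.</p>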
    <sec id="sec-7">
      <title>3. Disfluency detection</title>
      <p>We performed DD as a sequence tagging task using a BERT-based [25] token classifier:</p>
      <p>d1:M = BERT(w1:M),   P(rm | w1:M) = softmax(W dm),</p>
      <p>where rm is a binary tag which indicates whether word wm is fluent or disfluent. Subsequently, all words classified as disfluencies are removed from the transcriptions. Table 3 considers the example previously shown in Table 2 and clarifies each passage once again.</p>
      <p>Specifically, the BERT-based model consists of a BERT layer in the version provided by the HuggingFace Transformers library [26] (bert-base-uncased), a dropout layer, a dense layer of 768 nodes, a dropout layer, another dense layer of 128 nodes, and finally the output layer. The model is trained on the NICT-JLE and KIT Speaking Test corpora and uses an Adam optimiser [27] with batch size 64, learning rate 1e-06, dropout rate 0.2, and negative log likelihood as loss.</p>
      <p>For evaluation, we use precision, recall, and F1 scores. Table 4 shows the results of the DD model on the test and development sets of TLT-GEC in terms of precision, recall and F1 score.</p>
      <p>9 https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_ctc_large</p>
    </sec>
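      <p>Given per-token binary disfluency tags of the kind such a classifier produces, the removal step and the precision/recall/F1 evaluation reduce to a few lines. This is an illustrative sketch with hypothetical helper names, not the paper's implementation.</p>

```python
from typing import List, Tuple

def remove_tagged_disfluencies(words: List[str], tags: List[int]) -> List[str]:
    # tags[m] == 1 marks word m as disfluent; only fluent words are kept.
    return [w for w, t in zip(words, tags) if t == 0]

def precision_recall_f1(gold: List[int], pred: List[int]) -> Tuple[float, float, float]:
    # Scores computed over the positive (disfluent) class.
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```
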
    <sec id="sec-9">
      <title>4. GEC</title>
      <p>For the GEC model, we used a T5 model [28] initialised from the version provided by the HuggingFace Transformers library [26] (t5-base) and trained on EFCAMDAT and BEA-2019, with the exclusion of the CLC-FCE test set, which we used to compare the results on TLT-GEC. We set the maximum sequence length to 64, using an AdamW optimiser [29] with learning rate 1e-5 and batch size 32.</p>
      <p>To evaluate the performance of our model, we use two common metrics for GEC, i.e., the MaxMatch (M2) score [30] and the General Language Evaluation Understanding (GLEU) metric [31]. The former computes the F-score of edits over the optimal phrasal alignment between the hypothesis and the reference sentences, whereas the latter is inspired by BLEU [32] and captures grammatical corrections as well as fluency rewrites.</p>
      <p>In Table 5, we report the results of the spoken GEC system on the TLT-GEC test set in terms of M2 and GLEU. For further comparison, we also report the results of our model on the CLC-FCE test set and compare them to the results of the GEC model described in [4]. We also report the agreement between the two human annotators.</p>
      <p>Considering the performance on the CLC-FCE test set, it can be observed that our proposed model performs moderately better than the model from [4]. These results are quite remarkable, given that we used only publicly available data, whereas [4] employed the entire CLC corpus in addition to the BEA-2019 data.</p>
      <p>For completeness, we report the results on TLT-school considering the performance of the GEC model on the manual transcriptions with disfluencies (dsf), with disfluencies manually removed (flt), and with disfluencies automatically removed (autoflt). As expected, there is a remarkable improvement both in terms of GLEU and M2 when disfluencies are removed from the transcriptions. Finally, we report the performance of our GEC system on ASR transcriptions. It can be observed that also in this case removing disfluencies improves the performance for both metrics. It is also noticeable that the performance on the ASR transcriptions (autoflt) is slightly better than the one on manual transcriptions (dsf) in terms of GLEU.</p>
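      <p>To give an intuition for what GLEU rewards, the toy score below counts hypothesis n-grams matching the reference and penalises n-grams shared only with the uncorrected source, i.e., errors left in place. It is a deliberately simplified illustration with invented function names, not the official GLEU metric of [31].</p>

```python
from collections import Counter

def ngrams(tokens, n):
    # Multiset of the n-grams of a token sequence.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simplified_gleu(source, hypothesis, reference, max_n=2):
    # Reward hypothesis n-grams found in the reference; penalise those
    # that match only the uncorrected source (errors left uncorrected).
    src, hyp, ref = source.split(), hypothesis.split(), reference.split()
    matched, penalty, total = 0, 0, 0
    for n in range(1, max_n + 1):
        h, r, s = ngrams(hyp, n), ngrams(ref, n), ngrams(src, n)
        total += sum(h.values())
        matched += sum((h & r).values())
        penalty += sum(((h & s) - r).values())
    return max(matched - penalty, 0) / total if total else 0.0
```

      <p>Leaving an error in place scores worse than correcting it, which is the property that makes GLEU-style metrics suitable for fluency-oriented GEC.</p>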
      <p>Example from Tables 2 and 3 (Disfluent / Fluent / Corrected versions); disfluent form: he see the thief is catched by policeman the last night.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Conclusions and future works</title>
      <p>In this paper, we explored an approach to automatic spoken grammatical error correction of Italian learners of English using only publicly available training data.</p>
      <p>First, we investigated DD. Our DD module achieved a good performance in terms of precision, recall and F1 score on both the development and test sets of the TLT-GEC.</p>
      <p>The second module of our cascaded framework is a spoken GEC system which achieves results aligned with previous studies. As we expected, we found that disfluency removal has a positive impact on GEC on both manual and ASR transcriptions of the TLT-GEC. Furthermore, we observed that the fully automated system (i.e., ASR+DD+GEC) achieves higher results than the system including manual transcriptions with disfluencies in terms of GLEU.</p>
      <p>Although we identified disfluencies as problematic elements for spoken GEC and we investigated an efficient way to detect and remove them, we acknowledge that there are still several open problems, which are particularly evident in the TLT-GEC data. Specifically, the presence of code-switched words is a challenging issue, as can be seen in the following example drawn from the data (manual transcriptions):10</p>
      <p>hello my name is giovanni uhm and i’m from trento and i live in rovereto uhm rovereto is in nord italien uhm uhm and uhm hobby uhm f- f- my favourite hobby uhm is uhm football and and koch</p>
      <p>As can be observed, not only does the answer feature Italian names and toponyms, but it also contains German code-switched words. The output of the GEC system after automatically removing the disfluencies is the following:</p>
      <p>hello my name is giovanni and i’m from trento and i live in trento it is in north italien my favourite hobby is football and cooking</p>
      <p>It appears to handle the code-switched words nord and koch quite efficiently, but it fails to correct italien.11 Therefore, future works will attempt to address the problem of named entity recognition and code-switching in the framework of spoken GEC.</p>
      <p>Another interesting problem concerns the relevance of learners’ answers to the question prompts. For example, one of the question prompts is: What country would you like to visit in the future? Why? A sample answer drawn from the data is the following: i like to visit turkey because i like speaking the language [...] Although the answer is grammatically correct if considered individually, it does, in fact, contain a verbal error in relation to the question prompt. We also plan to address this issue, starting from concatenating the question prompt with the learner’s answer.</p>
      <p>Finally, we plan to investigate an SSL-based approach (e.g., using wav2vec 2.0 [33] or more recent models such as HuBERT [34] or WavLM [35]) to spoken GEC. Specifically, it would be interesting to generate synthetic audio data using a text-to-speech system on the written learner corpora we used in this paper for training our models.</p>
      <p>10 We only changed the first name and one toponym due to privacy reasons, but the example is still valid. 11 In fact, it also does not correct the agreement error hobby is football and cooking, which should feature hobbies are instead of hobby is.</p>
    </sec>
    <sec id="sec-11">
      <title>References</title>
      <p>[1] P. Howson, The English effect, British Council, London, 2013.</p>
      <p>[2] Y. Wang, Y. Wang, K. Dang, J. Liu, Z. Liu, A comprehensive survey of grammatical error correction, ACM Transactions on Intelligent Systems and Technology (TIST) 12 (2021) 1–51. doi:10.1145/3474840.</p>
      <p>[3] C. Bryant, Z. Yuan, M. R. Qorib, H. Cao, H. T. Ng, T. Briscoe, Grammatical error correction: A survey of the state of the art, arXiv preprint arXiv:2211.05166 (2022). doi:10.48550/arXiv.2211.05166.</p>
      <p>[4] Y. Lu, S. Bannò, M. J. F. Gales, On assessing and developing spoken ’grammatical error correction’ systems, in: Proceedings of the 17th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2022), Association for Computational Linguistics, Seattle, Washington, 2022, pp. 51–60. doi:10.18653/v1/2022.bea-1.9.</p>
      <p>[5] M. McCarthy, R. Carter, Ten criteria for a spoken grammar, in: Explorations in corpus linguistics, Cambridge University Press, 2006, pp. 27–52.</p>
      <p>[6] S. Bannò, M. Matassoni, Proficiency assessment of L2 spoken English using wav2vec 2.0, in: 2022 IEEE Spoken Language Technology Workshop (SLT), 2023, pp. 1088–1095. doi:10.1109/SLT54892.2023.10023019.</p>
      <p>[7] R. Gretter, M. Matassoni, S. Bannò, D. Falavigna, TLT-school: a corpus of non native children speech, in: Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 378–385. URL: https://aclanthology.org/2020.lrec-1.47.</p>
      <p>[8] J. Geertzen, T. Alexopoulou, A. Korhonen, Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCAMDAT), in: Proceedings of the 31st Second Language Research Forum, Cascadilla Proceedings Project, Somerville, 2013, pp. 240–254. URL: http://www.lingref.com/cpp/slrf/2012/paper3100.pdf.</p>
      <p>[9] Y. Huang, J. Geertzen, R. Baker, A. Korhonen, T. Alexopoulou, The EF Cambridge Open Language Database (EFCAMDAT): Information for users, 2017.</p>
      <p>[10] Y. Huang, A. Murakami, T. Alexopoulou, A. Korhonen, Dependency parsing of learner English, International Journal of Corpus Linguistics 23 (2018) 28–54. doi:10.1075/ijcl.16080.hua.</p>
      <p>[11] C. Bryant, M. Felice, Ø. E. Andersen, T. Briscoe, The BEA-2019 shared task on grammatical error correction, in: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics, Florence, Italy, 2019, pp. 52–75. doi:10.18653/v1/W19-4406.</p>
      <p>[12] E. Izumi, K. Uchimoto, H. Isahara, The NICT JLE corpus: Exploiting the language learners’ speech database for research and education, International journal of the computer, the internet and management 12 (2004) 119–125.</p>
      <p>[13] Council of Europe, Common European Framework of Reference for Languages: Learning, Teaching, Assessment, Cambridge University Press, Cambridge, 2001. URL: https://rm.coe.int/1680459f97.</p>
      <p>[14] K. Kanzawa, H. Mitsunaga, G. Edmonds, Y. Hato, Y. Tsubota, M. Mori, Y. Shimizu, Development and administration of a Skype-based English speaking test in a Japanese high school, Bulletin of Kyoto Institute of Technology 14 (2022) 27–47.</p>
      <p>[15] Y.-C. Lo, J.-J. Chen, C. Yang, J. Chang, Cool English: a grammatical error correction system based on large learner corpora, in: Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations, Association for Computational Linguistics, Santa Fe, New Mexico, 2018, pp. 82–85. URL: https://aclanthology.org/C18-2018.</p>
      <p>[16] H. Yannakoudakis, T. Briscoe, B. Medlock, A new dataset and method for automatically grading ESOL texts, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Portland, Oregon, USA, 2011, pp. 180–189. URL: https://aclanthology.org/P11-1019.</p>
      <p>[17] D. Nicholls, The Cambridge Learner Corpus: Error coding and analysis for lexicography and ELT, in: Proceedings of the Corpus Linguistics 2003 Conference, 2003, pp. 572–581.</p>
      <p>[18] H. Yannakoudakis, Ø. E. Andersen, A. Geranpayeh, T. Briscoe, D. Nicholls, Developing an automated writing placement system for ESL learners, Applied Measurement in Education 31 (2018) 251–267. doi:10.1080/08957347.2018.1464447.</p>
      <p>[19] S. Granger, The computer learner corpus: a versatile new source of data for SLA research, in: S. Granger (Ed.), Learner English on computer, Routledge, London, 1998, pp. 3–18. doi:10.4324/9781315841342.</p>
      <p>[20] T. Mizumoto, Y. Hayashibe, M. Komachi, M. Nagata, Y. Matsumoto, The effect of learner corpus size in grammatical error correction of ESL writings, in: Proceedings of COLING 2012: Posters, The COLING 2012 Organizing Committee, Mumbai, India, 2012, pp. 863–872. URL: https://aclanthology.org/C12-2084.</p>
      <p>[21] T. Tajiri, M. Komachi, Y. Matsumoto, Tense and aspect error correction for ESL learners using global context, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Jeju Island, Korea, 2012, pp. 198–202. URL: https://aclanthology.org/P12-2039.</p>
      <p>[22] D. Dahlmeier, H. T. Ng, S. M. Wu, Building a large annotated corpus of learner English: The NUS corpus of learner English, in: Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, 2013, pp. 22–31.</p>
      <p>[23] Y. Fathullah, M. Gales, A. Malinin, Ensemble distillation approaches for grammatical error correction, in: ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 2745–2749. doi:10.1109/ICASSP39728.2021.9413385.</p>
      <p>[24] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., Conformer: Convolution-augmented transformer for speech recognition, arXiv preprint arXiv:2005.08100 (2020).</p>
      <p>[25] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, arXiv e-prints (2018) arXiv:1810.04805. doi:10.48550/arXiv.1810.04805.</p>
      <p>[26] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6.</p>
      <p>[27] D. Kingma, J. Ba, Adam: a method for stochastic optimization, in: International Conference on Learning Representations, 2014.</p>
      <p>[28] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, et al., Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67.</p>
      <p>[29] I. Loshchilov, F. Hutter, Decoupled weight decay regularization, in: International Conference on Learning Representations 2019, 2019.</p>
      <p>[30] D. Dahlmeier, H. T. Ng, Better evaluation for grammatical error correction, in: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Montréal, Canada, 2012, pp. 568–572. URL: https://aclanthology.org/N12-1067.</p>
      <p>[31] C. Napoles, K. Sakaguchi, M. Post, J. Tetreault, Ground truth for grammatical error correction metrics, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Association for Computational Linguistics, Beijing, China, 2015, pp. 588–593. doi:10.3115/v1/P15-2097.</p>
      <p>[32] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.</p>
      <p>[33] A. Baevski, H. Zhou, A. Mohamed, M. Auli, wav2vec 2.0: A framework for self-supervised learning of speech representations, in: NeurIPS 2020, 2020, pp. 1–12.</p>
      <p>[34] W.-N. Hsu, B. Bolte, Y.-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, A. Mohamed, HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021) 3451–3460.</p>
      <p>[35] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing, IEEE Journal of Selected Topics in Signal Processing (2022). doi:10.1109/JSTSP.2022.3188113.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>