Evaluating Transformer Models for Punctuation Restoration in Italian

Alessio Miaschi1,2, Andrea Amelio Ravelli2, and Felice Dell'Orletta2

1 Department of Computer Science, Università di Pisa
2 Istituto di Linguistica Computazionale “Antonio Zampolli” (ILC–CNR), ItaliaNLP Lab, www.italianlp.it
alessio.miaschi@phd.unipi.it, {andreaamelio.ravelli, felice.dellorletta}@ilc.cnr.it

Abstract. In this paper, we propose an evaluation of a Transformer-based punctuation restoration model for the Italian language. Experimenting with a BERT-base model, we perform several fine-tuning runs with different training data and sizes and test the resulting models in in- and cross-domain scenarios. Moreover, we offer a comparison in a multilingual setting with the same model fine-tuned on English transcriptions. Finally, we conclude with an error analysis of the main weaknesses of the model related to specific punctuation marks.

Keywords: punctuation restoration · transformers · speech transcription

1 Introduction

Nowadays, Automatic Speech Recognition (ASR) and Speech-to-Text technologies and services have reached an impressive level of accuracy in transcribing recorded (or live) speech audio streams. A simple but effective test can be run, with any modern smartphone, by using the dictation feature to write a text message.1 However, we can immediately notice that the audio stream is transcribed as a word stream, lacking any punctuation or sentence segmentation, and pieces of text are sometimes difficult to understand without some attempt to mentally insert punctuation marks into the flow of words.2

Lack of punctuation may be a minor problem in everyday short-text messaging, but correctly inserted punctuation is crucial in long speech transcription, live subtitling or any NLP processing of speech data, especially for downstream processes such as parsing, information extraction and dialogue modeling. Many major commercial services such as Google Cloud3 or Microsoft Azure4 offer the option of including automatically generated punctuation. Alternatively, it is possible to train a public ASR model, such as wav2vec [25] or Vosk,5 and then apply a punctuation restoration technique to its output. Both alternatives come at a cost: on one side, commercial services require a fee; on the other, training requires computational power, time and, above all, a sufficient amount of good training data in the form of aligned audio sources, transcriptions and phonetic annotations. Assuming we work on already transcribed data, recent Transformer models could be a convenient way of tackling punctuation restoration in standard language transcriptions, as they can be easily fine-tuned on many tasks, including the insertion of commas, periods and question marks. The objective of this paper is to verify whether it is possible to obtain good results in transcription by post-processing raw text from everyday Speech-to-Text technologies (e.g. dictation on a smartphone) with a Transformer model fine-tuned for punctuation restoration.

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 Obviously, the speech must have a close-to-standard accent without dialectal or slang words.
2 Other than difficult, unpunctuated text can also be ambiguous. Here is an amusing example of two completely different letters, with the same words but different punctuation: https://www.nationalpunctuationday.com/dearjohn.html.
3 https://cloud.google.com/speech-to-text/
4 https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/
5 https://alphacephei.com/vosk/
More specifically, we set our experiments on the Italian language and verify the impact of different domains and sizes of fine-tuning data on the performance of a Transformer-based punctuation restoration model. We then test its performance in in- and cross-domain scenarios, and we also offer a comparison with the same model trained on the English language.

The rest of the paper is organized as follows: in Sec. 2 we present related work, in Sec. 3 we introduce the setting, models and data used for the experiments, in Sec. 4 we discuss the obtained results and in Sec. 5 we conclude the paper.

Contributions In this paper we: i) investigate the impact of different training sizes on the performance of a punctuation restoration model based on the Transformer architecture; ii) test the performance of the model in different scenarios (in- and cross-domain); iii) compare the results obtained in Italian with those obtained with an English model; iv) inspect the most common errors that emerged during the experiments.

2 Related Work

Punctuation restoration is a well-known task, especially in Speech Processing and Machine Translation, where many approaches have been tested to tackle the problem over the past decades. In early attempts, acoustic features were exploited to train finite-state or Hidden Markov Models [10, 5, 12]: the basic idea was to model prosody from speech data and use pauses as cues for sentence boundaries, and thus as signals of full-stop punctuation marks. While prosody is useful in some cases, most of the time it cannot be used to place punctuation in an ASR output, because speakers use pauses in speech not only to shape the rhythm of their communication, but also for physical needs (e.g. breathing) or hesitations. To solve this problem, multimodal models have been proposed, making use of parallel audio and transcripts as training data [26, 11]. Most of these approaches take language model scores, tokens or POS tags of a large window of continuous words as textual features, and exploit pause, pitch contour, energy and prosody as the principal acoustic features [16]. With the rise of Deep Learning techniques, many works reported good performance by training Deep Neural Networks with parallel acoustic and textual features [4, 29, 14, 15].

Obviously, multimodal models need a considerable amount of parallel audio and text, and outside the English-speaking world such data are not at all trivial to find. For this reason, many works have focused on text-only approaches [30, 34, 13]. More recently, the potential of Transformer-based Neural Language Models (NLMs) has been exploited in several studies [32, 33, 19]. For instance, [18] used a pre-trained BERT model [7] with a bidirectional LSTM and a CRF layer to achieve state-of-the-art results on the reference transcriptions of the IWSLT2012 dataset.6 [33], instead, proposed an adversarial multitask learning approach with auxiliary part-of-speech tagging using a pre-trained BERT model. While the vast majority of this research is focused on the English language, relatively little work has been done to inspect the potential of these models on other languages.

6 http://hltc.cs.ust.hk/iwslt/index.php/evaluation-campaign/ted-task.html
[9] proposed a method for Chinese punctuation prediction that combines the BERT model with a BiLSTM, outperforming the baseline by up to 31% absolute in overall micro-F1 on a Chinese news dataset. The study by [1], on which we built our experiments for the Italian language, explored different Transformer-based models and proposed an augmentation strategy for the punctuation restoration task on both high- (English) and low-resource (Bangla) languages.

3 Experimental Setting

We explored the potential of Transformer-based language models for the punctuation restoration task on the Italian language. Specifically, we defined two sets of experiments. The first consists in evaluating the impact of the fine-tuning set size on task performance. For that purpose, we tested the performance of a state-of-the-art Transformer-based architecture for punctuation restoration [1] with incremental fine-tuning sizes.

In the second set of experiments, we compared the performance of two differently fine-tuned models on four test datasets, as explained in Sec. 3.2. Moreover, we propose an error analysis in order to investigate the strengths and weaknesses of the proposed methodology. The model and datasets used for the experiments are described below.

3.1 Model

We relied on the architecture previously defined in [1]. The architecture is based on a Transformer model whose internal representations are used as input to a BiLSTM layer consisting of 768 hidden units. The outputs of the BiLSTM layer are then passed at each time step to a fully connected layer with four output neurons: one for the O (Other) class and three for the punctuation marks of Comma (C), Period (P) and Question (Q). Thus, this model casts punctuation restoration as a token classification problem: the output is a class assigned to each token.

The pre-trained Transformer used in our experiments is the XXL uncased version of the BERT model for the Italian language developed by the MDZ Digital Library Team and available through the Huggingface Transformers library [31].7 The model was trained on Italian Wikipedia and texts from the OPUS [28] and OSCAR [27] corpora. We will refer to the model as BERT-BiLSTM.

7 https://huggingface.co/dbmdz/bert-base-italian-xxl-cased
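To make the architecture concrete, the following is a minimal PyTorch sketch of the BERT-BiLSTM model just described. The 768 BiLSTM hidden units and the four output classes follow the description above; the class name, the default checkpoint identifier and the exact wiring are our own reconstruction, not the reference implementation of [1].

```python
import torch.nn as nn
from transformers import AutoModel

class BertBiLSTM(nn.Module):
    """Sketch of the BERT-BiLSTM punctuation restoration architecture:
    a pre-trained BERT encoder, a BiLSTM over its token representations,
    and a per-token linear classifier with four outputs (O, C, P, Q)."""

    def __init__(self, pretrained="dbmdz/bert-base-italian-xxl-cased",
                 lstm_units=768, num_classes=4):
        super().__init__()
        self.bert = AutoModel.from_pretrained(pretrained)
        self.bilstm = nn.LSTM(self.bert.config.hidden_size, lstm_units,
                              batch_first=True, bidirectional=True)
        # Forward and backward LSTM states are concatenated, hence 2x.
        self.classifier = nn.Linear(2 * lstm_units, num_classes)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids,
                           attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        # One logit vector per token: (batch, seq_len, num_classes).
        return self.classifier(lstm_out)
```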
3.2 Data

The model has been fine-tuned on two corpora, in order to evaluate divergences in the results with respect to the domain variation deriving from different data. The first corpus is a large collection of authentic contemporary Italian texts derived from the web, and it is the de-facto reference corpus for Italian in many NLP applications: the Italian Web as Corpus (ItWaC) [3]. It counts 2 billion words and it was built from the Web by limiting the crawl to the .it domain and using as seeds medium-frequency words from the La Repubblica journalistic corpus [2] and Il Nuovo Vocabolario di Base (NVdB, a list of basic Italian words) [6]. Given its extension and origin, the ItWaC corpus spans many domains. It contains texts whose registers vary from colloquial (i.e. texts derived from forums and social media) to highly formal (i.e. official documents, newspapers, technical descriptions), and the use of punctuation varies accordingly.

The second corpus used for fine-tuning is the Italian sub-corpus of the Opensubtitles Multilingual Corpus [17].8 This huge corpus has been compiled from a large database of movie and TV subtitles collected from the Opensubtitles website,9 and includes a total of 1,689 bi-texts spanning 2.6 billion sentences across 60 languages. The Italian-only subcorpus consists of a total of 769.5 million words. The language of movies and television has often been defined as broadcast-spoken [20, 22], that is, a variety of language that sits midway between written and spoken. More specifically, broadcast-spoken language is characterised by being carefully planned, based on pre-written texts, and performed to mimic the spoken variety. Obviously, it lacks features specific to spontaneous speech, such as hesitations, retractions and fillers, and it shows high regularity, especially in the use of punctuation to mark pauses in the transcription.

By creating two fine-tuned models, we want to investigate whether the language diversity observable in the two corpora (i.e. average written language and multi-register text from ItWaC, close-to-spoken but highly regular text from Opensubtitles) is reflected in the way the models handle punctuation.

Moreover, we considered two other resources for the purpose of evaluating the two models' performance in a cross-domain scenario. The first resource is the Italian part of the ParlaMint Comparable Corpora [8], which contains transcriptions of parliamentary debates from 2015 to mid-2020, counting about 20 million words.10 Given the context of the texts, the language is highly formal, and thus also the use of punctuation in the transcripts is precise and regular.

With the second test-only resource we also introduce a multilingual setting, useful to compare model performance with reference systems available for the English language. We used the Italian-English alignment of the Multilingual TEDx Dataset [24, 23],11 which is a collection of audio recordings from TEDx talks in 8 source languages. The Italian-English alignment derives from transcriptions of Italian TEDx speeches with aligned English translations, and it counts about 18 thousand words in both languages.

Table 1 reports statistics on the datasets described here. These statistics refer to the whole set of processed texts; for all the experiments conducted with different fine-tuning sizes, a random selection of sentences was drawn.

Dataset           Sentences   Tokens              Commas            Periods           Questions
ItWaC             765,491     19,226,715 (25.12)  1,403,527 (1.83)  729,806 (0.95)    35,685 (0.05)
Opensubtitles it  1,505,279   14,468,346 (9.61)   754,951 (0.50)    1,265,306 (0.84)  239,973 (0.16)
ParlaMint it      134,887     3,203,374 (23.75)   238,960 (1.77)    130,386 (0.97)    4,501 (0.03)
TEDx it           1,139       21,667 (19.02)      1,823 (1.60)      1,070 (0.94)      69 (0.06)
TEDx en           1,210       21,383 (17.67)      1,636 (1.35)      1,142 (0.94)      68 (0.06)

Table 1. Statistics on the datasets used for fine-tuning and testing. In parentheses, the average distribution per sentence.

Class      Punctuation Marks
COMMA      , ; — - ( )
PERIOD     . : ! ...
QUESTION   ?

Table 2. Mapping of punctuation marks to reduced classes for model fine-tuning.

8 https://opus.nlpl.eu/OpenSubtitles-v2018.php
9 http://www.opensubtitles.org
10 The complete collection of comparable corpora in 17 languages is available at: https://www.clarin.si/repository/xmlui/handle/11356/1432
11 The full dataset is available at: http://www.openslr.org/100/
Data pre-processing The model implemented in our experiments is trained to classify tokens on the basis of the presence or absence of a punctuation mark immediately after the target token. It is important to remember that punctuation is a feature of the written language modality, and in the transcription of speech it is used to mimic oral pauses: commas are used for short pauses, periods for long pauses at the end of an utterance and question marks for questions. For this reason, we collapsed all the possible punctuation marks into these 3 classes, reducing the complexity of the fine-tuning data.

Table 2 shows how each punctuation mark has been mapped to the corresponding class. The majority of the symbols have been mapped to COMMA because they are normally used to signal parenthetical clauses and do not interrupt the sentence; exclamation marks and suspension points, which signal sentence boundaries, are assimilated to the PERIOD class; question marks, instead, have been kept as a separate class (QUESTION), in order to preserve the distinction between questions and assertions. Along with these 3 classes, the class OTHER (O) has been used to annotate tokens not followed by a punctuation mark. We are aware that this mapping and reduction could be simplistic but, again, we are targeting our experiments towards speech transcription, where no punctuation exists at all, and we need to account for all possible punctuation found in the training data.

To feed the model with data in the correct format, we first POS-tagged all the corpora with Stanza [21], in order to easily recognise punctuation marks and thus correctly label the tokens they follow. As an example, consider the following sentence, taken from the TEDx English test dataset: "During my career, I had responsibilities and many satisfactions." This sentence has been converted as shown in Table 3. During the process, we also lower-cased all tokens to avoid the possibility of predicting full stops (such as periods, exclamation marks and suspension points) on the basis of the casing of the following token.

during  my  career  i  had  responsibilities  and  many  satisfactions
O       O   COMMA   O  O    O                 O    O     PERIOD

Table 3. Example of a pre-processed sentence for the fine-tuning.
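As an illustration of this pipeline, here is a minimal sketch, assuming Stanza's Italian models are installed; the mapping follows Table 2, while the function and variable names are hypothetical and not taken from any released code.

```python
import stanza

# Punctuation-to-class mapping from Table 2; marks not listed there
# (e.g. quotes) are simply ignored in this sketch.
PUNCT2CLASS = {
    ",": "COMMA", ";": "COMMA", "—": "COMMA", "-": "COMMA",
    "(": "COMMA", ")": "COMMA",
    ".": "PERIOD", ":": "PERIOD", "!": "PERIOD", "...": "PERIOD",
    "?": "QUESTION",
}

# stanza.download("it") is needed once before building the pipeline.
nlp = stanza.Pipeline("it", processors="tokenize,pos")

def to_examples(text):
    """Convert raw text into (lower-cased token, label) pairs, where
    each word is labelled with the class of the punctuation mark that
    immediately follows it, or O when none does."""
    pairs = []
    for sentence in nlp(text).sentences:
        words = sentence.words
        for i, word in enumerate(words):
            if word.upos == "PUNCT":
                continue  # punctuation becomes a label, never a token
            label = "O"
            if i + 1 < len(words) and words[i + 1].upos == "PUNCT":
                label = PUNCT2CLASS.get(words[i + 1].text, "O")
            pairs.append((word.text.lower(), label))
    return pairs
```

Applied to Italian text, this produces token/label sequences in the format of Table 3.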
4 Results

We first investigate the impact of different training sizes on the performance of BERT-BiLSTM. In order to do so, we fine-tuned our punctuation restoration models in parallel, with increasingly large portions of the two corpora, from 100k to 10 million tokens, and then tested them on a previously unseen portion of the two datasets consisting of 200k words. Results (in terms of micro F-score) are reported in Figure 1.

As a general remark, we found that, for both models, the curve tends to flatten out when the fine-tuning process is performed with portions larger than 2 million tokens. As regards the differences between the two datasets, we can notice that the model fine-tuned on Opensubtitles performed slightly better than the one trained on ItWaC. For instance, focusing on the results obtained in the last run (10 million tokens), we can observe that the difference between the two models in terms of F-score is about 0.05 points (0.75 vs. 0.70). Moreover, it is interesting to note that while the Opensubtitles model obtained quite good results even with very small portions of the dataset (e.g. 100k), the ItWaC model requires at least one million words to achieve comparable results. This behaviour is quite predictable, due to the fact that Opensubtitles texts are extremely regular and show minimal variation across the whole set of data. On the contrary, with the more heterogeneous ItWaC the system needs more data to start correctly modelling the distribution of punctuation marks.

Fig. 1. Average micro F-scores obtained with increasing ItWaC and Opensubtitles dataset sizes.

To better investigate their performance, we report in Table 4 the results obtained by the two models fine-tuned with 10 million words from Opensubtitles/ItWaC and tested in two different scenarios: i) in-domain, i.e. testing on the same dataset; ii) cross-domain, i.e. testing on the other domain. Moreover, in order to provide a direct comparison between the two models, we tested both on the ParlaMint dataset.

As can be seen by looking at the average scores (column Avg), the in-domain configuration always achieves the best results (ItWaC: 0.65; Opensubtitles: 0.73). Focusing on the cross-domain configurations, it is interesting to notice that the high variability of ItWaC texts strengthens the model and enables it to handle punctuation better than the model fine-tuned on Opensubtitles. Specifically, observing the performance of the ItWaC model on the Opensubtitles test set and vice versa, we notice a difference of 0.05 points. Looking at both models (ItWaC and Opensubtitles) tested on ParlaMint, the gap increases to 0.07 points in favour of the ItWaC model. We can explain this behaviour on the basis of the nature of the ParlaMint dataset, where regularity and formality lead to longer sentences with a punctuation usage closer to average written texts. Thus, the ItWaC model, which is based on a heterogeneous collection of texts larger than Opensubtitles, is capable of predicting punctuation in a more robust way.

ItWaC Fine-tuning
                Other            Comma            Period           Question         Avg (CPQ)
Test Set        P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1
ItWaC           .96  .97  .97    .73  .67  .70    .70  .72  .71    .52  .59  .55    .65  .66  .65
Opensubtitles   .96  .98  .97    .62  .36  .46    .68  .74  .71    .54  .61  .58    .61  .57  .58
ParlaMint       .97  .98  .98    .80  .70  .74    .77  .83  .80    .54  .63  .58    .70  .72  .70

Opensubtitles Fine-tuning
Test Set        P    R    F1     P    R    F1     P    R    F1     P    R    F1     P    R    F1
Opensubtitles   .97  .98  .98    .74  .64  .69    .80  .80  .80    .75  .69  .72    .76  .71  .73
ItWaC           .95  .96  .96    .62  .45  .52    .50  .67  .57    .56  .45  .50    .56  .52  .53
ParlaMint       .97  .98  .98    .75  .59  .66    .65  .80  .72    .48  .55  .51    .63  .64  .63

Table 4. Results (Precision, Recall and F-score) on the Opensubtitles/ItWaC and ParlaMint datasets when fine-tuning is performed on 10 million words of the ItWaC and Opensubtitles datasets. Average scores (Avg column) are computed by averaging the C, P and Q scores.
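For reference, the per-class figures and the Avg (CPQ) column of Table 4 can be reproduced from flat lists of gold and predicted per-token labels; below is a sketch using scikit-learn, which is our choice of library and not necessarily the one used in the experiments.

```python
from sklearn.metrics import precision_recall_fscore_support

PUNCT_CLASSES = ["COMMA", "PERIOD", "QUESTION"]

def punctuation_report(gold, pred):
    """Per-class precision/recall/F1 over the punctuation classes,
    plus their unweighted mean (the Avg (CPQ) column in Table 4)."""
    p, r, f, _ = precision_recall_fscore_support(
        gold, pred, labels=PUNCT_CLASSES, zero_division=0)
    per_class = {c: (p[i], r[i], f[i]) for i, c in enumerate(PUNCT_CLASSES)}
    avg_cpq = (p.mean(), r.mean(), f.mean())
    return per_class, avg_cpq
```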
Looking at per-class scores, it is possible to notice that all systems perform better in predicting the PERIOD class: with the exclusion of the Opensubtitles model tested on ItWaC, all scores are above 0.70. This result is encouraging because periods, exclamation marks and other full stops are used to signal the end of a sentence; a similar model can thus be effectively exploited to segment the continuous flow of speech transcription, enabling better subsequent sentence-based methods of analysis (e.g. part-of-speech tagging, dependency parsing and so on). We register lower figures for the QUESTION class, probably due to the fact that questions in Italian are hard to predict on the basis of the transcribed text alone, without considering intonation. We further investigate this problem in Sec. 4.2.

4.1 Model comparison in a multilingual setting

As already mentioned in Sec. 3.2, we also decided to compare the performance of our fine-tuned models with a reference system available for the English language. Specifically, we compared the results obtained with the ItWaC/Opensubtitles models when tested on the Italian transcriptions of TEDx speeches with those obtained by the system devised in [1] and tested on the aligned English translations of the same talks. Results are reported in Table 5.

Test Set              Other   Comma   Period   Question   Avg (CPQ)
Alam et al. [1] (EN)  0.97    0.65    0.78     0.80       0.74
ItWaC (IT)            0.97    0.60    0.70     0.64       0.64
Opensubtitles (IT)    0.97    0.56    0.69     0.58       0.61

Table 5. F-scores obtained respectively by the [1] model tested on English TEDx translations and by the ItWaC/Opensubtitles models tested on Italian TEDx transcriptions.

As noticed in the previous experiments, the model fine-tuned on ItWaC data achieves better results when tested in a cross-domain scenario. In fact, we can observe a difference of about 0.03 points in terms of average F-score. In this respect, it is interesting to note that the main difference between the performance of the two models is due to the classification of question marks.

Focusing instead on the comparison between the Italian and English models, we can clearly observe that the latter outperforms the Italian ones. Also in this case, the classification of question marks is the one that contributes most to the score difference between the models. This result is quite expected, since a question in English, besides the presence of a question mark at the end of the sentence, is usually characterised by an inversion of the subject and the verb in the main clause. On the contrary, in Italian the punctuation mark is the only discriminating feature of questions in the written modality, while intonation plays the main role in spoken Italian. Therefore, the identification of question marks tends to be much easier for the English language.

4.2 Error Analysis

In order to further deepen our analysis, in this section we investigate in more detail the main errors made by the two models when predicting the different punctuation marks. In Figure 2 we report the confusion matrices (in terms of accuracy) of the results obtained by the model fine-tuned on ItWaC and tested on Opensubtitles, and vice versa. As a general remark, we can highlight that the COMMA class is the most confused in both models. Due to the unbalanced distribution of the O (Other) class with respect to the punctuation classes, the high confusion of every class with this one is easily predictable.

Fig. 2. Confusion matrices of the results obtained by the model fine-tuned on ItWaC and tested on Opensubtitles (Fine-tuning: ItWaC, Test: OPENSUBS) and vice versa (Fine-tuning: Opensubtitles, Test: ITWAC).
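Such matrices can be computed directly from the per-token gold and predicted labels; a minimal sketch, again assuming scikit-learn and reading "in terms of accuracy" as row-normalised counts (our interpretation):

```python
from sklearn.metrics import confusion_matrix

LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]

def punctuation_confusion(gold, pred):
    """Row-normalised confusion matrix: entry (i, j) is the fraction
    of gold tokens of class LABELS[i] predicted as class LABELS[j]."""
    return confusion_matrix(gold, pred, labels=LABELS, normalize="true")
```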
If we exclude the O class from the figure, we can notice that the class with which COMMA is most often confused is PERIOD, for both models tested on the opposite dataset (i.e. the model fine-tuned on ItWaC and tested on Opensubtitles and vice versa). We can ascribe this problem to the average sentence length, which diverges between the two corpora: in ItWaC, the average sentence counts 25.12 tokens with about 1.83 commas per sentence; in Opensubtitles, the average sentence is 9.61 tokens long, with a distribution of 0.5 commas. For this reason, we can assume that the Opensubtitles model tends to create shorter sentences, thus using the full-stop mark more frequently than the ItWaC one.

Fig. 3. Confusion matrices of the results obtained by the models fine-tuned on ItWaC/Opensubtitles and tested on ParlaMint.

Figure 3 reports instead the confusion matrices of the results obtained by the two fine-tuned models (ItWaC and Opensubtitles) on ParlaMint test data. As we have seen previously, the model fine-tuned on ItWaC is the one that achieved better results regardless of the class taken into account. In fact, with the exception of Other (O), in all the other classes we observe a performance gap that goes from 0.2 (PERIOD) to 0.11 (COMMA) accuracy points.

Focusing on the mismatched classes, we can see once again that commas are often mistaken for other tokens (O), while question marks are classified as periods, as in the following example:

Original: Vorrei ricordarvi i fallimenti ai quali siete andati incontro e state continuamente andando incontro con i bonus. Devo ricordarvi, forse il bonus vacanze? [en. I would like to remind you of the failures you have experienced and are continually experiencing with the bonuses. Should I remind you of the holiday bonus?]

ItWaC/Opensubtitles: Vorrei ricordarvi i fallimenti ai quali siete andati incontro e state continuamente andando incontro con i bonus. devo ricordarvi, forse il bonus vacanze.

If we look at the differences between the two models, we can clearly notice that the one fine-tuned on Opensubtitles also tends, in 0.15 of the cases, to wrongly classify a comma as a full stop, as in the following example:

Original: Scusate la digressione: pure io sono un mancato operaio, due braccia rubate all'agricoltura - allora lo si diceva in senso denigratorio, mentre oggi tale definizione si è qualificata un po' di più - e ho potuto permettermi di studiare e di laurearmi. [en. Sorry for the digression: I too am a non-working-class person, two arms stolen from agriculture - at the time this was said in a derogatory sense, whereas today this definition has been requalified a little - and I was able to afford my studies and my degree.]

ItWaC: Scusate la digressione, pure io sono un mancato operaio, due braccia rubate all'agricoltura, allora lo si diceva in senso denigratorio, mentre oggi tale definizione si è qualificata un po' di più e ho potuto permettermi di studiare e di laurearmi

Opensubtitles: Scusate la digressione. pure io sono un mancato operaio. due braccia rubate all'agricoltura.
allora lo si diceva in senso denigratorio, mentre oggi tale definizione si è qualificata un po' di più e ho potuto permettermi di studiare e di laurearmi

From the previous example, we can also highlight that the hyphens were correctly classified as COMMA by the ItWaC model (see the punctuation-mark-to-class mapping in Table 2), while they were identified as full stops by the Opensubtitles one. This could be due to the fact that, since the Opensubtitles dataset is composed of shorter sentences (derived from transcribed dialogic turns), the model tends to extend this behaviour to its inferences. Conversely, the colon was correctly identified as PERIOD by the Opensubtitles model.

5 Conclusions

In this paper we verified whether it is possible to obtain good results in restoring punctuation in raw transcription texts by means of a fine-tuned Transformer model. We chose to exploit two corpora for fine-tuning, namely ItWaC and Opensubtitles, in order to observe the differences emerging from domain variety and their effect on performance.

First, we evaluated the impact of different sizes of the fine-tuning datasets, and we observed that the model fine-tuned on highly regular data (i.e. Opensubtitles) needs less data to start modelling punctuation compared to the model fine-tuned on more heterogeneous data (i.e. ItWaC); for both models, the curve tends to flatten out with fine-tuning portions larger than 2 million tokens. Moreover, the model fine-tuned on ItWaC obtains the best results when tested cross-domain on the ParlaMint dataset, which is used as a neutral testing field for both models.

Finally, we offered a comparison between the Italian models fine-tuned here and the English model originally presented in [1], tested on the parallel it-en part of the TEDx dataset. This comparison made it possible to interpret the errors deriving from the confusion between question marks and periods: this confusion is problematic in Italian due to the lack of strong syntactic cues such as those found in English, and for this reason it is extremely difficult to distinguish between question marks and periods using textual information alone. In conclusion, precise punctuation restoration with Transformer-based models is a difficult task, but considering the good results in predicting the position of periods, we can confirm that it is possible to mark sentence boundaries and thus segment the continuous flow of speech transcription into sentences.

References

1. Alam, T., Khan, A., Alam, F.: Punctuation restoration using transformer models for high- and low-resource languages. In: Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020). pp. 132–142. Association for Computational Linguistics, Online (2020). https://doi.org/10.18653/v1/2020.wnut-1.18, https://aclanthology.org/2020.wnut-1.18
2. Baroni, M., Bernardini, S., Comastri, F., Piccioni, L., Volpi, A., Aston, G., Mazzoleni, M.: Introducing the La Repubblica corpus: A large, annotated, TEI (XML)-compliant corpus of newspaper Italian. In: LREC (2004)
3. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation 43(3), 209–226 (2009)
4. Che, X., Luo, S., Yang, H., Meinel, C.: Sentence boundary detection based on parallel lexical and acoustic models. In: Interspeech. pp. 2528–2532 (2016)
5. Christensen, H., Gotoh, Y., Renals, S.: Punctuation annotation using statistical prosody models (2001)
6. De Mauro, T.: Il nuovo vocabolario di base della lingua italiana. In: Guida all'uso delle parole. Editori Riuniti (1980)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/N19-1423, https://aclanthology.org/N19-1423
8. Erjavec, T., Ogrodniczuk, M., Osenova, P., Ljubešić, N., Simov, K., Grigorova, V., Rudolf, M., Pančur, A., Kopp, M., Barkarson, S., Steingrímsson, S., van der Pol, H., Depoorter, G., de Does, J., Jongejan, B., Haltrup Hansen, D., Navarretta, C., Calzada Pérez, M., de Macedo, L.D., van Heusden, R., Marx, M., Çöltekin, Ç., Coole, M., Agnoloni, T., Frontini, F., Montemagni, S., Quochi, V., Venturi, G., Ruisi, M., Marchetti, C., Battistoni, R., Sebők, M., Ring, O., Darģis, R., Utka, A., Petkevičius, M., Briedienė, M., Krilavičius, T., Morkevičius, V., Diwersy, S., Luxardo, G., Rayson, P.: Multilingual comparable corpora of parliamentary debates ParlaMint 2.1 (2021), http://hdl.handle.net/11356/1432, Slovenian language resource repository CLARIN.SI
9. Fang, M., Zhao, H., Song, X., Wang, X., Huang, S.: Using bidirectional LSTM with BERT for Chinese punctuation prediction. In: 2019 IEEE International Conference on Signal, Information and Data Processing (ICSIDP). pp. 1–5 (2019). https://doi.org/10.1109/ICSIDP47821.2019.9172986
10. Gotoh, Y., Renals, S.: Sentence boundary detection in broadcast speech transcripts. In: ASR2000 - Automatic Speech Recognition: Challenges for the new Millennium, ISCA Tutorial and Research Workshop (ITRW) (2000)
11. Gravano, A., Jansche, M., Bacchiani, M.: Restoring punctuation and capitalization in transcribed speech. In: 2009 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4741–4744. IEEE (2009)
12. Kim, J.H., Woodland, P.C.: A combined punctuation generation and speech recognition system and its performance enhancement using prosody. Speech Communication 41(4), 563–577 (2003)
13. Kim, S.: Deep recurrent neural networks with layer-wise multi-head attentions for punctuation restoration. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 7280–7284. IEEE (2019)
14. Klejch, O., Bell, P., Renals, S.: Punctuated transcription of multi-genre broadcasts using acoustic and lexical approaches. In: 2016 IEEE Spoken Language Technology Workshop (SLT). pp. 433–440. IEEE (2016)
15. Klejch, O., Bell, P., Renals, S.: Sequence-to-sequence models for punctuated transcription combining lexical and acoustic features. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 5700–5704. IEEE (2017)
16. Levy, T., Silber-Varod, V., Moyal, A.: The effect of pitch, intensity and pause duration in punctuation detection. In: 2012 IEEE 27th Convention of Electrical and Electronics Engineers in Israel. pp. 1–4. IEEE (2012)
17. Lison, P., Tiedemann, J.: OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles (2016)
18. Makhija, K., Ho, T.N., Chng, E.S.: Transfer learning for punctuation prediction. In: 2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). pp. 268–273. IEEE (2019)
19. Nagy, A., Bial, B., Ács, J.: Automatic punctuation restoration with BERT models. arXiv preprint arXiv:2101.07343 (2021)
20. Nencioni, G.: Parlato-parlato, parlato-scritto, parlato-recitato. Strumenti critici 29 (1976)
21. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: A Python natural language processing toolkit for many human languages. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. pp. 101–108 (2020)
22. Sabatini, F.: La comunicazione orale, scritta e trasmessa. In: Boccafurni, A.M., Serromani, S. (eds.) Educazione linguistica nella scuola superiore: sei argomenti per un curricolo, pp. 105–27 (1982)
23. Salesky, E., Wiesner, M., Bremerman, J., Cattoni, R., Negri, M., Turchi, M., Oard, D.W., Post, M.: Multilingual TEDx corpus for speech recognition and translation (2021), http://www.openslr.org/100/
24. Salesky, E., Wiesner, M., Bremerman, J., Cattoni, R., Negri, M., Turchi, M., Oard, D.W., Post, M.: The multilingual TEDx corpus for speech recognition and translation. arXiv preprint arXiv:2102.01757 (2021)
25. Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 (2019)
26. Stolcke, A., Shriberg, E., Bates, R.A., Ostendorf, M., Hakkani, D.Z., Plauche, M., Tür, G., Lu, Y.: Automatic detection of sentence boundaries and disfluencies based on recognized words. In: ICSLP. vol. 2, pp. 2247–2250. Citeseer (1998)
27. Suárez, P.J.O., Sagot, B., Romary, L.: Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures. In: Challenges in the Management of Large Corpora (CMLC-7) 2019. p. 9 (2019)
28. Tiedemann, J., Nygaard, L.: The OPUS corpus - parallel and free: http://logos.uio.no/opus. Citeseer (2004)
29. Tilk, O., Alumäe, T.: LSTM for punctuation restoration in speech transcripts. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
30. Tilk, O., Alumäe, T.: Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In: Interspeech. pp. 3047–3051 (2016)
31. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., Rush, A.: Transformers: State-of-the-art natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. Association for Computational Linguistics, Online (Oct 2020). https://doi.org/10.18653/v1/2020.emnlp-demos.6, https://www.aclweb.org/anthology/2020.emnlp-demos.6
32. Yi, J., Tao, J.: Self-attention based model for punctuation prediction using word and speech embeddings. In: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 7270–7274. IEEE (2019)
33. Yi, J., Tao, J., Bai, Y., Tian, Z., Fan, C.: Adversarial transfer learning for punctuation restoration. arXiv preprint arXiv:2004.00248 (2020)
34. Yi, J., Tao, J., Wen, Z., Li, Y., et al.: Distilling knowledge from an ensemble of models for punctuation prediction. In: Interspeech. pp. 2779–2783 (2017)