Participation of HULAT-UC3M in SEPP-NLG 2021 shared task

Jose Manuel Masiello-Ruiz        Jose Luis Lopez Cuadrado        Paloma Martinez
Computer Science Dept.           Computer Science Dept.          Computer Science Dept.
Univ. Carlos III de Madrid       Univ. Carlos III de Madrid      Univ. Carlos III de Madrid
jmasiell@eco.uc3m.es             jllopez@inf.uc3m.es             pmf@inf.uc3m.es

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                      Abstract

This paper introduces the HULAT-UC3M system developed to participate in the SEPP-NLG 2021 shared task. The system is based on the Punctuator framework, a bidirectional recurrent neural network model with an attention mechanism for automatic punctuation, trained on the Europarl dataset provided by the organizers. The best results obtained in Subtask 1 are F1 scores of 84%, 79%, 36% and 83% for the EN, IT, DE and FR languages on the development dataset, respectively. Concerning Subtask 2, F1 scores are 63%, 57%, 69% and 64% for the EN, IT, DE and FR languages on the development dataset, respectively.

1     Introduction

Automatic punctuation is a relevant task when it comes to processing text obtained from transcription systems. When the transcription is produced by Automatic Speech Recognition (ASR) systems, punctuation marks are not always available or, when they are available, they must be reviewed. Detecting the end of sentences, or the punctuation mark to be inserted at a specific position of the text, improves readability and preserves meaning. When the transcriptions are large raw text documents, manual revision is not affordable. This paper presents the HULAT-UC3M system developed to participate in the SEPP-NLG 2021 shared task. The aim of the system is, on the one hand, to detect the full stop marks in the text by training the Punctuator framework (Tilk and Alumäe, 2016) with the Europarl dataset provided for the shared task. On the other hand, in the context of Subtask 2, the trained framework is tested on the detection of the full set of punctuation marks.

The remainder of this paper is organized as follows: Section 2 summarizes the relevant related work for the proposal, Section 3 presents the proposed system, Section 4 describes and discusses the results obtained, and Section 5 presents the conclusions and future work.

2     Background

Automatic generation of punctuation marks from the output of an ASR system has many applications, such as enhancing dictation systems, so that the speaker does not have to verbalize special keywords to add punctuation marks (comma, colon, semicolon, question mark, etc.) to the text, or improving the readability of captions in content broadcasting. Some previous related research concerning automatic punctuation of texts is summarized in this section. The system described in Chen (1999) is based on a method that combines acoustic and lexical evidence. The hypothesis is that, although acoustic pauses do not match one to one with linguistic segmentation, the combination of acoustic and lexical information allows a good prediction of punctuation marks. This system used the IBM speech recognizer, trained on 1,800 speakers and with speaker adaptation, and an N-gram model built using 250 million words. Using four scenarios that consider different types of pauses, the best performance, considering the punctuation mark at the correct place and of the correct type, is 57%. The test dataset was a letter with 333 words and 31 punctuation marks, read by three speakers.

The work described in Matusov et al. (2006) is a similar approach in the context of machine translation, based on the observation that it is easier to predict segment boundaries, taking into account prosodic features and pauses of different lengths, than to predict whether a punctuation mark should be inserted at a given word position. Using an HMM model, the system achieved an F-measure of 70% (results are worse with spontaneous speech).



For the Portuguese language, Batista et al. (2008) used maximum entropy n-grams with lexical features (POS tags, words) and acoustic features (time, speaker change, among others); testing on broadcast news, the system obtained 83% precision and 61% recall for full stop recovery, and worse performance for comma recovery (45% precision and 16% recall).
   More recently, Öktem et al. (2017) proposed using recurrent neural networks trained on TED talks to predict punctuation marks (with features similar to previous works: words, pauses, frequency and intensity values of words, etc.). The best performance of this system is an F-score of 65.7% over all comma, period and question marks. Finally, Sunkara et al. (2020) introduce pretrained BERT language models fine-tuned on medical domain data to improve automatic punctuation and truecasing prediction. This approach was tested using two medical datasets (dictation and conversational), and the best F-score was 93% for full stop when trained on wiki and medical dictation data, and 82% for full stop when trained on wiki and medical conversation data.

   Reviewing these previous related works suggests that approaches combining lexical and acoustic features, integrated in current deep learning architectures, could provide better results when coping with ASR errors and out-of-vocabulary words.

3     System description

To respond to the proposed tasks we have used Punctuator, an implementation of a bidirectional recurrent neural network with an attention mechanism introduced by Ottokar Tilk and Tanel Alumäe (Tilk and Alumäe, 2016) (https://github.com/ottokart/punctuator2).

   Punctuator has been adapted to take into account the set of proposed punctuation marks: ":", "-", ",", "?", "." and "0" (no punctuation). The adaptation of the data format to the one expected by Punctuator has been carried out with a preliminary pre-processing step.

Figure 1: Proposed system for the tasks.

   Eight models have been trained, one for each task and language. All models have been configured with a hidden layer size of 256 and a learning rate of 0.02. The data set sepp_nlg_2021_train_dev_data_v5.zip has been used as training and development data.
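   Concretely, the eight training runs can be launched through punctuator2's two-stage command line (data.py to build the vocabulary files, main.py to train). The sketch below is our reading of that interface, with a hypothetical data/ folder layout; the script names and argument order follow the punctuator2 README and should be checked against the repository.

    import subprocess

    LANGS = ["en", "it", "de", "fr"]
    HIDDEN_LAYER_SIZE = 256
    LEARNING_RATE = 0.02

    for lang in LANGS:
        # Stage 1: convert {lang}.train.txt / {lang}.dev.txt (see Section 4.2)
        # into punctuator2's internal vocabulary/index format.
        subprocess.run(["python", "data.py", f"data/{lang}"], check=True)
        # Stage 2: train one model per language with the configuration
        # of Tables 2 and 7 (hidden layer size 256, learning rate 0.02).
        subprocess.run(["python", "main.py", f"model_{lang}",
                        str(HIDDEN_LAYER_SIZE), str(LEARNING_RATE)], check=True)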
4     Results

4.1    Experiment setup

We have used a Google Cloud server with the following configuration:

   • 4 virtual CPUs, 15 GB memory.

   • 1 GPU NVIDIA Tesla K80.

   • Ubuntu Pro 16.04.

   • Python 3.8.

   • CUDA 10.2.

   • cuDNN 7.6.5.

   • Theano 1.0.5.

For training we have used the data set sepp_nlg_2021_train_dev_data_v5.zip, and for evaluation we have used the data set sepp_nlg_2021_test_data_unlabeled_v5, which contains two data sets: test and surprise test.

4.2    Data pre-processing

For Task 1, both data sets (dev and train, column 1) have been processed in the same way. All the training .tsv files have been merged into a single language.train.txt file, where each sentence is a line and the mark "." has been replaced by ".PERIOD". Likewise, a language.dev.txt file has been generated from the dev .tsv files, and a language.test.txt file from the test data set.

   For Task 2, the marks (column 2 of the data sets) have been mapped as shown in Table 1.

    Mark    Mapped
    ,       ,COMMA
    .       .PERIOD
    ?       ?QUESTIONMARK
    :       :COLON
    -       -DASH

Table 1: Mapping of punctuation marks for Task 2.

   In the same way as for Task 1, language.train.txt, language.dev.txt and language.test.txt files have been generated from the .tsv training, test and dev files.

   For the evaluation, the pre-processing is the same, but we have used the sepp_nlg_2021_test_data_unlabeled_v5 data sets.
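   A minimal sketch of this pre-processing, assuming each row of the shared-task .tsv files holds the token in column 0 followed by the Subtask 1 and Subtask 2 labels (function and folder names are hypothetical):

    import csv
    from pathlib import Path

    # Mapping of Table 1; "0" (no punctuation) is simply skipped.
    MARK_MAP = {",": ",COMMA", ".": ".PERIOD", "?": "?QUESTIONMARK",
                ":": ":COLON", "-": "-DASH"}

    def tsv_to_punctuator(tsv_dir, out_file, label_col):
        """Merge all .tsv files of one language into a single text file in
        the token-stream format expected by Punctuator. label_col is 1 for
        Subtask 1 (0/1 labels) and 2 for Subtask 2 (marks of Table 1)."""
        with open(out_file, "w", encoding="utf-8") as out:
            for tsv_path in sorted(Path(tsv_dir).glob("*.tsv")):
                with open(tsv_path, encoding="utf-8") as f:
                    for row in csv.reader(f, delimiter="\t"):
                        out.write(row[0])
                        label = row[label_col]
                        if label in MARK_MAP:        # Subtask 2 mark
                            out.write(" " + MARK_MAP[label] + " ")
                        elif label == "1":           # Subtask 1 full stop:
                            out.write(" .PERIOD\n")  # one sentence per line
                        else:
                            out.write(" ")

    # e.g. tsv_to_punctuator("train/en", "en.train.txt", label_col=2)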
4.3   Subtask 1 Results

For each language we have trained a model with the characteristics shown in Table 2:

    lang.   hidden layer size   learning rate   train file     dev file
    en      256                 0.02            en.train.txt   en.dev.txt
    it      256                 0.02            it.train.txt   it.dev.txt
    de      256                 0.02            de.train.txt   de.dev.txt
    fr      256                 0.02            fr.train.txt   fr.dev.txt

Table 2: Task 1 training characteristics.

   We have tested each model with its test.txt (or dev.txt) file, and the results are shown in Tables 3, 4, 5 and 6.
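   The tables below follow the layout of scikit-learn's classification_report; a sketch of how such a report can be obtained from a gold .tsv file and the corresponding prediction .tsv file (paths are hypothetical, and the use of scikit-learn is an assumption):

    from sklearn.metrics import classification_report

    def read_labels(path, label_col):
        # One token per line; the label of interest is in column label_col.
        with open(path, encoding="utf-8") as f:
            return [line.rstrip("\n").split("\t")[label_col]
                    for line in f if line.strip()]

    gold = read_labels("gold/en.dev.tsv", label_col=1)   # Subtask 1 labels
    pred = read_labels("pred/en.dev.tsv", label_col=1)
    print(classification_report(gold, pred, digits=2))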
                  prec.   recall   f1-score   support
    0             0.99    0.99     0.99       7422156
    1             0.86    0.81     0.84       321333
    accuracy                       0.99       7743489
    macro avg     0.93    0.90     0.91       7743489
    weighted avg  0.99    0.99     0.99       7743489

Table 3: English. Task 1 results.

                  prec.   recall   f1-score   support
    0             0.99    1.00     0.99       6904100
    1             0.86    0.73     0.79       290089
    accuracy                       0.98       7194189
    macro avg     0.92    0.86     0.89       7194189
    weighted avg  0.98    0.98     0.98       7194189

Table 4: Italian. Task 1 results.

                  prec.   recall   f1-score   support
    0             0.99    0.85     0.92       6067240
    1             0.23    0.90     0.36       291443
    accuracy                       0.85       6358683
    macro avg     0.61    0.87     0.64       6358683
    weighted avg  0.96    0.85     0.89       6358683

Table 5: German. Task 1 results.

                  prec.   recall   f1-score   support
    0             0.99    1.00     0.99       8449263
    1             0.87    0.79     0.83       332330
    accuracy                       0.99       8781593
    macro avg     0.93    0.89     0.91       8781593
    weighted avg  0.99    0.99     0.99       8781593

Table 6: French. Task 1 results.

   For each of the unlabeled files of the data set (selecting column 1 of the .tsv files), a prediction .tsv file has been generated using the model corresponding to its language, as sketched below.
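   A sketch of this post-processing step, assuming Punctuator returns the token stream with the *.PERIOD-style tags of Section 4.2 inserted after the punctuated tokens (the helper name is hypothetical; for Subtask 1 the recovered "." label would be written out as "1" and everything else as "0"):

    TAG_TO_MARK = {",COMMA": ",", ".PERIOD": ".", "?QUESTIONMARK": "?",
                   ":COLON": ":", "-DASH": "-"}

    def tagged_output_to_labels(tagged_text):
        """Turn Punctuator's tagged output back into (token, label) pairs
        for the prediction .tsv: a tag labels the token preceding it, and
        every other token gets the label "0"."""
        tokens, labels = [], []
        for item in tagged_text.split():
            if item in TAG_TO_MARK:
                if labels:
                    labels[-1] = TAG_TO_MARK[item]
            else:
                tokens.append(item)
                labels.append("0")
        return list(zip(tokens, labels))

    # e.g. tagged_output_to_labels("hello world .PERIOD how are you ?QUESTIONMARK")
    # -> [("hello", "0"), ("world", "."), ("how", "0"), ("are", "0"), ("you", "?")]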
4.4   Subtask 2 Results

For each language we have trained a model with the characteristics shown in Table 7:

    lang.   hidden layer size   learning rate   train file     dev file
    en      256                 0.02            en.train.txt   en.dev.txt
    it      256                 0.02            it.train.txt   it.dev.txt
    de      256                 0.02            de.train.txt   de.dev.txt
    fr      256                 0.02            fr.train.txt   fr.dev.txt

Table 7: Task 2 training characteristics.

   We have tested each model with its test.txt (or dev.txt) file, and the results are shown in Tables 8, 9, 10 and 11. Figures 2, 3, 4 and 5 show the confusion matrix for each language. For each of the unlabeled files of the data set (selecting column 2 of the .tsv files), a prediction .tsv file has been generated using the model corresponding to its language.
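   Confusion matrices such as those of Figures 2 to 5 can be produced from the same gold and predicted label columns; a sketch using scikit-learn and matplotlib (the row normalization is an assumption, and the actual figures may differ):

    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix

    LABELS = ["0", ",", ".", "?", ":", "-"]   # Subtask 2 label set

    def plot_confusion(gold, pred, lang):
        # Row-normalized confusion matrix over the Subtask 2 labels.
        cm = confusion_matrix(gold, pred, labels=LABELS, normalize="true")
        fig, ax = plt.subplots()
        im = ax.imshow(cm)
        ax.set_xticks(range(len(LABELS)))
        ax.set_xticklabels(LABELS)
        ax.set_yticks(range(len(LABELS)))
        ax.set_yticklabels(LABELS)
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Gold")
        ax.set_title(f"Task 2 confusion matrix ({lang})")
        fig.colorbar(im)
        fig.savefig(f"confusion_{lang}.png", bbox_inches="tight")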
                  prec.   recall   f1-score   support
    ,             0.73    0.70     0.72       401095
    -             0.53    0.07     0.12       18335
    .             0.85    0.86     0.86       319751
    0             0.98    0.99     0.99       6985003
    :             0.64    0.23     0.34       9815
    ?             0.80    0.72     0.76       9490
    accuracy                       0.96       7743489
    macro avg     0.76    0.60     0.63       7743489
    weighted avg  0.96    0.96     0.96       7743489

Table 8: English. Task 2 results.

Figure 2: Task 2. English. Confusion matrix.

                  prec.   recall   f1-score   support
    ,             0.73    0.63     0.67       385867
    -             0.44    0.05     0.09       13044
    .             0.84    0.83     0.83       290088
    0             0.98    0.99     0.98       6480166
    :             0.58    0.27     0.37       14658
    ?             0.73    0.37     0.49       10366
    accuracy                       0.96       7194189
    macro avg     0.72    0.52     0.57       7194189
    weighted avg  0.95    0.96     0.96       7194189

Table 9: Italian. Task 2 results.

Figure 3: Task 2. Italian. Confusion matrix.

                  prec.   recall   f1-score   support
    ,             0.90    0.89     0.90       489257
    -             0.50    0.09     0.15       17412
    .             0.92    0.92     0.92       287680
    0             0.99    1.00     0.99       5544080
    :             0.63    0.36     0.46       11148
    ?             0.83    0.65     0.73       9106
    accuracy                       0.98       6358683
    macro avg     0.79    0.65     0.69       6358683
    weighted avg  0.98    0.98     0.98       6358683

Table 10: German. Task 2 results.

Figure 4: Task 2. German. Confusion matrix.
                  prec.   recall   f1-score   support
    ,             0.75    0.71     0.73       445852
    -             0.49    0.07     0.11       18321
    .             0.86    0.85     0.86       328795
    0             0.98    0.99     0.99       7964631
    :             0.60    0.32     0.42       12482
    ?             0.82    0.63     0.71       11512
    accuracy                       0.97       8781593
    macro avg     0.75    0.59     0.64       8781593
    weighted avg  0.97    0.97     0.97       8781593

Table 11: French. Task 2 results.

Figure 5: Task 2. French. Confusion matrix.

4.5   Discussion

Regarding Subtask 1, learning rates are the same in the four languages. The evaluation is based on the value 1 (full stop) in each language. The F1-score in English is 0.84, and the framework presents a similar F1-score in French (0.83). In Italian, the result is slightly worse, 0.79, and the worst result is in German, with 0.36. When comparing the results of Subtask 2, with the same learning rates, the value of the F1-score for the full stop is 0.86 in English and French, 0.83 in Italian and 0.92 in German. The difference for German between Subtask 1 and Subtask 2 is remarkable. Regarding the rest of the punctuation marks in Subtask 2, the worst results in all languages are obtained for the dash mark, followed by the colon (:). Remarkably, the proposed framework obtains, for Subtask 2 in the four languages, the best overall measures (accuracy, macro average and weighted average) for German.

5     Conclusions and Future Work

The approach presented in this paper is an exploratory participation in the SEPP-NLG 2021 task. We are interested in automatic segmentation and punctuation for Spanish spontaneous speech. We plan to use BETO, the Spanish version of BERT, and mBERT models, both based on the Transformer architecture (Vaswani et al., 2017), integrating different types of word embeddings to face the out-of-vocabulary problem.

Acknowledgments

This work has been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

References

Fernando Batista, Diamantino Caseiro, Nuno Mamede, and Isabel Trancoso. 2008. Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news. Speech Communication, 50:847–862.

Julian C. Chen. 1999. Speech recognition with automatic punctuation. Sixth European Conference on Speech Communication and Technology, (January):6–9.

Evgeny Matusov, Arne Mauser, and Hermann Ney. 2006. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation (IWSLT) 2006.

Alp Öktem, Mireia Farrús, and Leo Wanner. 2017. Attentional parallel RNNs for generating punctuation in transcribed speech. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10583 LNAI:131–142.

Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and Katrin Kirchhoff. 2020. Robust prediction of punctuation and truecasing for medical ASR. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 53–62.

Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 08-12-Sept(September):3047–3051.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.