<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Participation of HULAT-UC3M in SEPP-NLG 2021 shared task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paloma Martinez</string-name>
          <xref ref-type="aff" rid="aff0" />
        </contrib>
        <aff id="aff0">
          <institution>Computer Science Dept., Univ. Carlos III de Madrid</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper introduces the HULAT-UC3M system developed to participate in the SEPP-NLG 2021 shared task. The system is based on the Punctuator framework, a bidirectional recurrent neural network model with an attention mechanism for automatic punctuation, trained on the Europarl dataset provided by the organizers. The best results obtained in Subtask 1 are F1 scores of 84%, 79%, 36% and 83% for the EN, IT, DE and FR languages on the development dataset, respectively. Concerning Subtask 2, the F1 scores are 63%, 57%, 69% and 64% for the EN, IT, DE and FR languages on the development dataset, respectively.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Automatic punctuation is a relevant task when
processing text obtained from transcription systems.
When transcriptions are produced by Automatic Speech
Recognition (ASR) systems, punctuation marks are not
always available or, when they are, they must be
reviewed. Detecting the end of sentences, or the
punctuation mark to be inserted at a specific position
of the text, improves readability and preserves meaning.
When the transcriptions are large raw text documents,
the process is not affordable for people. This paper
presents the HULAT-UC3M system developed to participate
in the SEPP-NLG 2021 shared task. The aim of the system
is, on the one hand, to detect the full stop marks in the
text by training the Punctuator framework Tilk and
Alumäe (2016) with the Europarl dataset provided for the
shared task. On the other hand, in the context of
Subtask 2, the trained framework is tested on the
detection of the full set of punctuation marks.</p>
      <p>The remainder of this paper is organized as
follows: Section 2 summarizes the relevant related
work, Section 3 presents the proposed system,
Section 4 describes and discusses the results
obtained, and Section 5 presents the conclusions
and the future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Background</title>
      <p>Automatic generation of punctuation marks from
the output of an ASR system has many applications,
such as enhancing dictation systems, so that the
speaker does not have to verbalize special keywords
to add punctuation marks (comma, colon, semicolon,
question mark, etc.) to the text, or improving the
readability of captions in broadcast content. Some
previous research concerning automatic punctuation
of texts is summarized in this section. The system
described in Chen (1999) is based on a method that
combines acoustic and lexical evidence. The
hypothesis is that, although acoustic pauses do not match
one to one with linguistic segmentation, the
combination of acoustic and lexical information allows a
good prediction of punctuation marks. This system
used the IBM speech recognizer, trained on 1,800
speakers and with speaker adaptation, and an n-gram
model built using 250 million words. Using four
scenarios that consider different types of pauses, the
best performance, considering punctuation marks at the
correct place and of the correct type, is 57%. The test
dataset was a letter with 333 words and 31
punctuation marks, read by three speakers.</p>
      <p>The work described in Matusov et al. (2006) followed a
similar approach in the context of machine
translation, considering that it is easier to predict segment
boundaries, taking into account prosodic features
and pauses of different lengths, than to predict whether a
punctuation mark should be inserted at a given word
position. Using an HMM model, the system achieved
an F-measure of 70% (results are worse with
spontaneous speech). For Portuguese, Batista
et al. (2008) used maximum entropy n-grams with
features such as lexical features (POS tags, words)
and acoustic features (time, speaker change, among
others); testing on broadcast news, the system obtained
83% precision and 61% recall for full stop
recovery and worse performance for comma
recovery (45% precision and 16% recall).</p>
      <p>More recently, Öktem et al. (2017) proposed
using recurrent neural networks trained on TED talks
to predict punctuation marks (with features similar
to those of previous works: words, pauses, frequency and
intensity values of words, etc.). The best performance
of this system is an F-score of 65.7% over comma,
period and question marks. Finally, Sunkara et al.
(2020) introduced pretrained BERT language
models fine-tuned on medical domain data to
improve automatic punctuation and truecasing
prediction. This approach was tested using two medical
datasets (dictation and conversational), and the best
F-score was 93% for full stop when trained on wiki and
medical dictation data and 82% for full stop when trained
on wiki and medical conversation data.</p>
      <p>Reviewing these previous works suggests that
approaches that combine lexical and acoustic features,
integrated in current deep learning architectures,
could provide better results to cope with the
problems of ASR errors and out-of-vocabulary words.</p>
    </sec>
    <sec id="sec-3">
      <title>3 System description</title>
      <p>To address the proposed tasks we have
used Punctuator, an implementation of a
bidirectional recurrent neural network with an
attention mechanism introduced by Tilk and
Alumäe (2016)
(https://github.com/ottokart/punctuator2).</p>
      <p>Punctuator has been adapted to take into account
the set of proposed punctuation marks: ":", "-", ",",
"?", "." and "0" (no mark). The data have been
converted to the format expected by Punctuator in a
pre-processing step.</p>
      <p>Eight models have been trained, one for
each task and language. All models have
been configured with a hidden layer size of 256
and a learning rate of 0.02. The dataset
sepp_nlg_2021_train_dev_data_v5.zip has been
used as the training and development data.</p>
    </sec>
    <sec id="sec-4">
      <title>4 Results</title>
      <sec id="sec-4-1">
        <title>4.1 Experiment setup</title>
        <p>We have used a Google Cloud server with the
following configuration:
• 4 virtual CPUs, 15 GB of memory.
• 1 NVIDIA Tesla K80 GPU.
• Ubuntu Pro 16.04.
• Python 3.8.
• CUDA 10.2.
• cuDNN 7.6.5.
• Theano 1.0.5.</p>
        <p>For training we have used the dataset
sepp_nlg_2021_train_dev_data_v5.zip, and
for evaluation we have used the dataset
sepp_nlg_2021_test_data_unlabeled_v5, which contains
two subsets: test and surprise test.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Data pre-processing</title>
        <p>For task 1, both datasets (dev and train, column
1) have been processed in the same way. All the
training .tsv files have been merged into a single
language.train.txt file, where each sentence is a line
and the mark "." has been replaced by ".PERIOD".
Likewise, a language.dev.txt file has been generated
from the dev .tsv files.</p>
        <p>Similarly, the test data are kept as a copy in a
language.test.txt file.</p>
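        <p>A minimal sketch of this pre-processing step is shown below. The
exact .tsv column layout (a token followed by its subtask 1 label) and the
directory and file names are assumptions made for illustration.</p>
        <preformat>
# Hedged sketch of the task 1 pre-processing: merge a language's training
# .tsv files into a single language.train.txt file with one sentence per
# line, inserting ".PERIOD" where the subtask 1 label marks a sentence end.
# The column layout and the paths are assumptions.
import glob

def build_train_file(tsv_dir, out_path):
    with open(out_path, "w", encoding="utf-8") as out:
        for tsv_path in sorted(glob.glob(f"{tsv_dir}/*.tsv")):
            sentence = []
            with open(tsv_path, encoding="utf-8") as tsv:
                for line in tsv:
                    fields = line.rstrip("\n").split("\t")
                    if len(fields) == 1:
                        continue                # skip blank or malformed lines
                    token, full_stop_label = fields[0], fields[1]
                    sentence.append(token)
                    if full_stop_label == "1":  # sentence boundary in subtask 1
                        sentence.append(".PERIOD")
                        out.write(" ".join(sentence) + "\n")
                        sentence = []
            if sentence:                        # flush a trailing partial sentence
                out.write(" ".join(sentence) + "\n")

# Hypothetical paths, e.g. for English.
build_train_file("data/en/train", "en.train.txt")
</preformat>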
        <p>For task 2, the marks (column 2 of the datasets)
have been mapped as shown in Table 1.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Mapping of punctuation marks for task 2.</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th>Mark</th>
                <th>Mapped</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>,</td><td>,COMMA</td></tr>
              <tr><td>.</td><td>.PERIOD</td></tr>
              <tr><td>?</td><td>?QUESTIONMARK</td></tr>
              <tr><td>:</td><td>:COLON</td></tr>
              <tr><td>-</td><td>-DASH</td></tr>
            </tbody>
          </table>
        </table-wrap>
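        <p>Under the same assumptions about the .tsv layout as in the sketch
above, the mapping of Table 1 can be applied with a simple dictionary; the
helper below is hypothetical and only illustrates the conversion.</p>
        <preformat>
# Hedged sketch: apply the Table 1 mapping when building the task 2 training
# text. The label "0" (no mark) leaves the token sequence unchanged; any
# other label is followed by its mapped Punctuator token.
MARK_MAP = {
    ",": ",COMMA",
    ".": ".PERIOD",
    "?": "?QUESTIONMARK",
    ":": ":COLON",
    "-": "-DASH",
}

def append_token(tokens, token, mark_label):
    tokens.append(token)
    if mark_label in MARK_MAP:
        tokens.append(MARK_MAP[mark_label])
</preformat>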
        <p>In the same way as for task 1, language.train.txt,
language.dev.txt and language.test.txt files have
been generated from the training, dev and test .tsv
files.</p>
        <p>We have tested each model with its test.txt (or
dev.txt) file, and the results are shown in Tables 3,
4, 5 and 6.</p>
        <p>For each of the unlabeled files of the dataset
(selecting column 1 of the .tsv files), a prediction
.tsv file has been generated using the corresponding
model for each language.</p>
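        <p>A minimal sketch of how such a prediction file could be produced from
Punctuator's punctuated output is shown below; the alignment rule (a
".PERIOD" token marks the preceding word as sentence-final) and the function
name are assumptions for illustration.</p>
        <preformat>
# Hedged sketch: turn punctuated output back into a subtask 1 prediction
# .tsv file with one tab-separated token and label per row. A ".PERIOD"
# token in the output marks the preceding token with label 1.
def punctuated_to_tsv(punctuated_text, out_path):
    rows = []
    for token in punctuated_text.split():
        if token == ".PERIOD":
            if rows:
                rows[-1] = (rows[-1][0], "1")   # previous token ends a sentence
            continue
        rows.append((token, "0"))
    with open(out_path, "w", encoding="utf-8") as out:
        for token, label in rows:
            out.write(f"{token}\t{label}\n")
</preformat>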
        <p>We have tested each model with its test.txt (or
dev.txt) file, and the results are shown in Tables 8, 9,
10 and 11. Figures 2, 3, 4 and 5 show the confusion
matrices for each language.</p>
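        <p>The reported per-label scores and confusion matrices can be computed
with standard tooling; the following is a minimal sketch using scikit-learn
over token-level gold and predicted labels, with toy label values shown only
as an example.</p>
        <preformat>
# Hedged sketch of the evaluation: per-label precision, recall and F1 plus
# the confusion matrix for one language, computed with scikit-learn.
from sklearn.metrics import classification_report, confusion_matrix

def evaluate(gold_labels, predicted_labels, labels):
    print(classification_report(gold_labels, predicted_labels,
                                labels=labels, digits=2))
    print(confusion_matrix(gold_labels, predicted_labels, labels=labels))

# Toy example with subtask 2 style labels ("0" means no punctuation mark).
evaluate(["0", ".", "0", ","], ["0", ".", "0", "0"],
         labels=["0", ".", ",", "?", ":", "-"])
</preformat>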
        <p>For each of the unlabeled files of the dataset
(selecting column 2 of the .tsv files), a prediction
.tsv file has been generated using the corresponding
model for each language.</p>
        <p>Regarding Subtask 1, the learning rates are the same for
the four languages. The evaluation is based on the
value 1 (full stop) in each language. The F1-score
in English is 0.84, and the framework presents a
similar F-score in French (0.85). In Italian, the result
is slightly worse, 0.79, and the worst result is in
German, with 0.36. When comparing the results of
Subtask 2, with the same learning rates, the F-score
for the full stop is 0.86 in English
and French, 0.83 in Italian and 0.92 in German.
The difference for German between Subtask 1 and
Subtask 2 is remarkable. Regarding the rest of the
punctuation marks in Subtask 2, the worst results
in all languages are obtained for the dash mark,
followed by the colon (:). Remarkably, for Subtask 2
the proposed framework obtains the best overall measures
(accuracy, macro average and weighted average) among the
four languages for German.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions and Future Work</title>
      <p>The approach presented in this paper is an
exploratory participation in the SEPP-NLG 2021 shared task.
We are interested in automatic segmentation and
punctuation for Spanish spontaneous speech. We
plan to use BETO, the Spanish version of BERT
Vaswani et al. (2017), and mBERT models,
integrating different types of word embeddings to address
the out-of-vocabulary problem.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been supported by the Madrid
Government (Comunidad de Madrid-Spain) under the
Multiannual Agreement with UC3M in the line of
Excellence of University Professors (EPUC3M17),
and in the context of the V PRICIT (Regional
Programme of Research and Technological
Innovation).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Batista</surname>
          </string-name>
          , Diamantino Caseiro, Nuno Mamede, and
          <string-name>
            <given-names>Isabel</given-names>
            <surname>Trancoso</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news</article-title>
          .
          <source>Speech Communication</source>
          ,
          <volume>50</volume>
          :
          <fpage>847</fpage>
          -
          <lpage>862</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Julian C.</given-names>
            <surname>Chen</surname>
          </string-name>
          .
          <year>1999</year>
          .
          <article-title>Speech recognition with automatic punctuation</article-title>
          .
          <source>Sixth European Conference on Speech Communication and Technology</source>
          , (January):
          <fpage>6</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Evgeny</given-names>
            <surname>Matusov</surname>
          </string-name>
          , Arne Mauser, and Hermann Ney.
          <year>2006</year>
          .
          <article-title>Automatic sentence segmentation and punctuation prediction for spoken language translation</article-title>
          .
          <source>International Workshop on Spoken Language Translation (IWSLT) 2006</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Alp</given-names>
            <surname>Öktem</surname>
          </string-name>
          , Mireia Farrús, and
          <string-name>
            <given-names>Leo</given-names>
            <surname>Wanner</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attentional parallel RNNs for generating punctuation in transcribed speech</article-title>
          .
          <source>Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)</source>
          , 10583 LNAI:
          <fpage>131</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Monica</given-names>
            <surname>Sunkara</surname>
          </string-name>
          , Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and
          <string-name>
            <given-names>Katrin</given-names>
            <surname>Kirchhoff</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Robust prediction of punctuation and truecasing for medical asr</article-title>
          .
          <source>In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations</source>
          , pages
          <fpage>53</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Ottokar</given-names>
            <surname>Tilk</surname>
          </string-name>
          and Tanel Alumäe.
          <year>2016</year>
          .
          <article-title>Bidirectional recurrent neural network with attention mechanism for punctuation restoration</article-title>
          .
          <source>Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH</source>
          , pages
          <fpage>3047</fpage>
          -
          <lpage>3051</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Ashish</given-names>
            <surname>Vaswani</surname>
          </string-name>
          , Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and
          <string-name>
            <given-names>Illia</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Attention is all you need</article-title>
          .
          <source>arXiv preprint arXiv:1706.03762</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>