=Paper=
{{Paper
|id=Vol-2957/sepp_paper6
|storemode=property
|title=Participation of HULAT-UC3M in SEPP-NLG 2021 shared task (short paper)
|pdfUrl=https://ceur-ws.org/Vol-2957/sepp_paper6.pdf
|volume=Vol-2957
|authors=Jose Manuel Masiello-Ruiz,Jose Luis Lopez Cuadrado,Paloma Martinez
|dblpUrl=https://dblp.org/rec/conf/swisstext/Masiello-RuizCM21
}}
==Participation of HULAT-UC3M in SEPP-NLG 2021 shared task (short paper)==
Jose Manuel Masiello-Ruiz, Jose Luis Lopez Cuadrado, Paloma Martinez
Computer Science Dept., Univ. Carlos III de Madrid
jmasiell@eco.uc3m.es, jllopez@inf.uc3m.es, pmf@inf.uc3m.es

Abstract

This paper introduces the HULAT-UC3M system developed to participate in the SEPP-NLG 2021 shared task. The system is based on the Punctuator framework, a bidirectional recurrent neural network model with an attention mechanism for automatic punctuation, trained on the Europarl dataset provided by the organizers. The best results obtained in Subtask 1 on the development dataset are F1 scores of 84%, 79%, 36% and 83% for the EN, IT, DE and FR languages, respectively. Concerning Subtask 2, the F1 scores on the development dataset are 63%, 57%, 69% and 64% for the EN, IT, DE and FR languages, respectively.

1 Introduction

Automatic punctuation is a relevant task when it comes to processing text obtained from transcription systems. When the transcription is produced by Automatic Speech Recognition (ASR) systems, punctuation marks are not always available or, when they are, they must be reviewed. Detecting the end of a phrase, or the punctuation mark to be inserted at a specific position in the text, improves readability and preserves meaning. When the transcriptions are large raw text documents, the process is not affordable for people. This paper presents the HULAT-UC3M system developed to participate in the SEPP-NLG 2021 shared task. The aim of the system is, on the one hand, to detect the full stop marks in the text by training the Punctuator framework (Tilk and Alumäe, 2016) with the Europarl dataset provided for the shared task. On the other hand, in the context of Subtask 2, the trained framework is tested on the detection of the full set of punctuation marks.

The remainder of this paper is organized as follows: Section 2 summarizes the related work relevant to the proposal, Section 3 presents the proposed system, Section 4 describes and discusses the results obtained, and Section 5 presents the conclusions and future work.

2 Background

Automatic generation of punctuation marks from the output of an ASR system has many applications, such as enhancing dictation systems so that the speaker does not have to verbalize special keywords to add punctuation marks (comma, colon, semicolon, question mark, etc.) to the text, or enhancing the readability of captions in content broadcasting. Some previous related research on automatic punctuation of texts is summarized in this section.

The system described in Chen (1999) is based on a method that combines acoustic and lexical evidence. The hypothesis is that, although acoustic pauses do not match one to one with linguistic segmentation, the combination of acoustic and lexical information allows a good prediction of punctuation marks. This system used the IBM speech recognizer, trained on 1,800 speakers with speaker adaptation, and an N-gram model built using 250 million words. Using four scenarios that consider different types of pauses, the best performance, considering a punctuation mark at the correct place and of the correct type, is 57%. The test dataset was a letter with 333 words and 31 punctuation marks, read by three speakers.

The work described in Matusov et al. (2006) followed a similar approach in the context of machine translation, considering that it is easier to predict segment boundaries from prosodic features and pauses of different lengths than to predict whether a punctuation mark should be inserted at a given word position. Using an HMM model, the system achieved an F-measure of 70% (results are worse with spontaneous speech).
For Portuguese, Batista et al. (2008) used maximum entropy n-grams with lexical features (POS tags, words) and acoustic features (time, speaker change, among others); testing on broadcast news, the system obtained 83% precision and 61% recall for full stop recovery, and worse performance for comma recovery (45% precision and 16% recall).

More recently, Öktem et al. (2017) proposed using recurrent neural networks trained on TED talks to predict punctuation marks (with features similar to those of previous works: words, pauses, frequency and intensity values of words, etc.). The best performance of this system is an F-score of 65.7% over all comma, period and question marks. Finally, Sunkara et al. (2020) introduce pretrained BERT language models fine-tuned on medical domain data to improve automatic punctuation and truecasing prediction. This approach was tested using two medical datasets (dictation and conversational); the best F-score was 93% for full stop when trained on wiki and medical dictation data, and 82% for full stop when trained on wiki and medical conversation data.

Reviewing the previous related work suggests that approaches combining lexical and acoustic features, integrated in current deep learning architectures, could provide better results to cope with the problems of ASR errors and out-of-vocabulary words.

Copyright ©2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

3 System description

To address the proposed tasks we have used Punctuator, an implementation of a bidirectional recurrent neural network with an attention mechanism introduced by Tilk and Alumäe (2016) (https://github.com/ottokart/punctuator2). Punctuator has been adapted to take into account the set of proposed punctuation marks: ": - , ? . 0". The adaptation of the data format to the one expected by Punctuator has been carried out with a previous pre-processing step. Figure 1 shows the proposed system for the tasks.

Eight models have been trained, one for each task and language. All models have been configured with a hidden layer size of 256 and a learning rate of 0.02. The dataset sepp_nlg_2021_train_dev_data_v5.zip has been used as training and development data.

Figure 1: Proposed system for the tasks.

4 Results

4.1 Experiment setup

We have used a Google Cloud server with the following configuration:

• 4 virtual CPUs, 15 GB memory.
• 1 NVIDIA Tesla K80 GPU.
• Ubuntu Pro 16.04.
• Python 3.8.
• CUDA 10.2.
• cuDNN 7.6.5.
• Theano 1.0.5.

For training we have used the datasets in sepp_nlg_2021_train_dev_data_v5.zip, and for evaluation we have used the datasets in sepp_nlg_2021_test_data_unlabeled_v5, which contains two datasets: test and surprise test.

4.2 Data pre-processing

For Task 1, both datasets (dev and train, column 1) have been processed in the same way. All the training .tsv files have been merged into a single language.train.txt file, where each sentence is a line and the mark "." has been replaced by ".PERIOD". Likewise, a language.dev.txt file has been generated from the dev .tsv files. There is also a test dataset, converted into a language.test.txt file.

For Task 2, the marks (column 2 of the datasets) have been mapped as shown in Table 1.

Table 1: Task 2 mark mapping.

Mark  Mapped
,     ,COMMA
.     .PERIOD
?     ?QUESTIONMARK
:     :COLON
-     -DASH

In the same way as in Task 1, language.train.txt, language.dev.txt and language.test.txt files have been generated from the .tsv training, test and dev files. For the evaluation, the pre-processing is the same, but we have used the sepp_nlg_2021_test_data_unlabeled_v5 datasets.
4.3 Subtask 1 Results

For each language we have trained a model with the characteristics shown in Table 2:

Table 2: Task 1 training characteristics.

lang.  hidden layers  learning rate  train file    dev file
en     256            0.02           en.train.txt  en.dev.txt
it     256            0.02           it.train.txt  it.dev.txt
de     256            0.02           de.train.txt  de.dev.txt
fr     256            0.02           fr.train.txt  fr.dev.txt

We have tested each model with its test.txt (or dev.txt) file; the results are shown in Tables 3, 4, 5 and 6.

Table 3: English. Task 1 results.

              prec.  recall  f1-score  support
0             0.99   0.99    0.99      7422156
1             0.86   0.81    0.84      321333
accuracy                     0.99      7743489
macro avg     0.93   0.90    0.91      7743489
weighted avg  0.99   0.99    0.99      7743489

Table 4: Italian. Task 1 results.

              prec.  recall  f1-score  support
0             0.99   1.00    0.99      6904100
1             0.86   0.73    0.79      290089
accuracy                     0.98      7194189
macro avg     0.92   0.86    0.89      7194189
weighted avg  0.98   0.98    0.98      7194189

Table 5: German. Task 1 results.

              prec.  recall  f1-score  support
0             0.99   0.85    0.92      6067240
1             0.23   0.90    0.36      291443
accuracy                     0.85      6358683
macro avg     0.61   0.87    0.64      6358683
weighted avg  0.96   0.85    0.89      6358683

Table 6: French. Task 1 results.

              prec.  recall  f1-score  support
0             0.99   1.00    0.99      8449263
1             0.87   0.79    0.83      332330
accuracy                     0.99      8781593
macro avg     0.93   0.89    0.91      8781593
weighted avg  0.99   0.99    0.99      8781593

4.4 Subtask 2 Results

For each language we have trained a model with the characteristics shown in Table 7:

Table 7: Task 2 training characteristics.

lang.  hidden layers  learning rate  train file    dev file
en     256            0.02           en.train.txt  en.dev.txt
it     256            0.02           it.train.txt  it.dev.txt
de     256            0.02           de.train.txt  de.dev.txt
fr     256            0.02           fr.train.txt  fr.dev.txt

We have tested each model with its test.txt (or dev.txt) file; the results are shown in Tables 8, 9, 10 and 11. Figures 2, 3, 4 and 5 show the confusion matrix for each language. For each of the unlabeled files of the dataset (selecting column 2 of the .tsv files), a prediction .tsv file has been generated using the model corresponding to its language.
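The per-class figures in the results tables (precision, recall, F1 and support per label) are standard token-level classification metrics. A plain-Python sketch of how such a report can be computed; the example label sequences are illustrative, not shared-task data:

```python
# Per-class precision/recall/F1/support, as laid out in the results tables.
from collections import Counter

def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for label t
        else:
            fp[p] += 1      # p predicted where it should not be
            fn[t] += 1      # t missed
    report = {}
    for c in sorted(set(y_true) | set(y_pred)):
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        report[c] = (prec, rec, f1, sum(1 for t in y_true if t == c))
    return report

# Example for Subtask 1 labels: 1 = full stop, 0 = no sentence boundary.
metrics = per_class_metrics([0, 0, 1, 0, 1, 0, 0, 1],
                            [0, 0, 1, 0, 0, 0, 1, 1])
```

Macro and weighted averages over the per-class rows then give the summary lines of the tables.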
Likewise, for Subtask 1, a prediction .tsv file has been generated for each of the unlabeled files of the dataset (selecting column 1 of the .tsv files), using the model corresponding to its language.

Table 8: English. Task 2 results.

              prec.  recall  f1-score  support
,             0.73   0.70    0.72      401095
-             0.53   0.07    0.12      18335
.             0.85   0.86    0.86      319751
0             0.98   0.99    0.99      6985003
:             0.64   0.23    0.34      9815
?             0.80   0.72    0.76      9490
accuracy                     0.96      7743489
macro avg     0.76   0.60    0.63      7743489
weighted avg  0.96   0.96    0.96      7743489

Figure 2: Task 2. English. Confusion matrix.

Table 9: Italian. Task 2 results.

              prec.  recall  f1-score  support
,             0.73   0.63    0.67      385867
-             0.44   0.05    0.09      13044
.             0.84   0.83    0.83      290088
0             0.98   0.99    0.98      6480166
:             0.58   0.27    0.37      14658
?             0.73   0.37    0.49      10366
accuracy                     0.96      7194189
macro avg     0.72   0.52    0.57      7194189
weighted avg  0.95   0.96    0.96      7194189

Figure 3: Task 2. Italian. Confusion matrix.

Table 10: German. Task 2 results.

              prec.  recall  f1-score  support
,             0.90   0.89    0.90      489257
-             0.50   0.09    0.15      17412
.             0.92   0.92    0.92      287680
0             0.99   1.00    0.99      5544080
:             0.63   0.36    0.46      11148
?             0.83   0.65    0.73      9106
accuracy                     0.98      6358683
macro avg     0.79   0.65    0.69      6358683
weighted avg  0.98   0.98    0.98      6358683

Figure 4: Task 2. German. Confusion matrix.

4.5 Discussion

Regarding Subtask 1, the learning rates are the same in the four languages. The evaluation is based on the value 1 (full stop) in each language. The F1-score in English is 0.84, and the framework presents a similar F1-score in French (0.85). In Italian the result is slightly worse, 0.79, and the worst result is in German, with 0.36. When comparing with the results of Subtask 2, with the same learning rates, the F-score for the full stop is 0.86 in English and French, 0.83 in Italian and 0.92 in German.
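The confusion matrices in Figures 2 to 5 tabulate gold labels (rows) against predicted labels (columns) over the Task 2 label set. A minimal sketch; the example sequences are illustrative, not taken from the shared-task data:

```python
# Confusion matrix over the Task 2 label set ("0" = no punctuation).
# Rows are gold labels, columns are predicted labels.
LABELS = ["0", ",", ".", "?", ":", "-"]

def confusion_matrix(y_true, y_pred, labels=LABELS):
    index = {c: i for i, c in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Example: one token of each outcome type.
m = confusion_matrix(["0", ".", ","], ["0", ".", "."])
```

Reading a row of the matrix shows where a gold mark is lost (e.g. a comma predicted as a period lands in the comma row, period column), which is how the weakness on dashes and colons discussed below shows up in the figures.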
Table 11: French. Task 2 results.

              prec.  recall  f1-score  support
,             0.75   0.71    0.73      445852
-             0.49   0.07    0.11      18321
.             0.86   0.85    0.86      328795
0             0.98   0.99    0.99      7964631
:             0.60   0.32    0.42      12482
?             0.82   0.63    0.71      11512
accuracy                     0.97      8781593
macro avg     0.75   0.59    0.64      8781593
weighted avg  0.97   0.97    0.97      8781593

Figure 5: Task 2. French. Confusion matrix.

The difference for German between Subtask 1 and Subtask 2 is remarkable. Regarding the rest of the punctuation marks in Subtask 2, the worst results in all languages are obtained for the dash mark, followed by the colon (:). Remarkably, the proposed framework obtains, for Subtask 2 in the four languages, the best overall measures (accuracy, macro average and weighted average) for German.

5 Conclusions and Future Work

The approach presented in this paper is an exploratory participation in the SEPP-NLG 2021 task. We are interested in automatic segmentation and punctuation for Spanish spontaneous speech. We plan to use BETO, the Spanish version of BERT (Vaswani et al., 2017), and mBERT models, integrating different types of word embeddings to face the out-of-vocabulary problem.

Acknowledgments

This work has been supported by the Madrid Government (Comunidad de Madrid-Spain) under the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

References

Fernando Batista, Diamantino Caseiro, Nuno Mamede, and Isabel Trancoso. 2008. Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news. Speech Communication, 50:847–862.

Julian C. Chen. 1999. Speech recognition with automatic punctuation. Sixth European Conference on Speech Communication and Technology, (January):6–9.

Evgeny Matusov, Arne Mauser, and Hermann Ney. 2006. Automatic sentence segmentation and punctuation prediction for spoken language translation. In International Workshop on Spoken Language Translation (IWSLT) 2006.

Alp Öktem, Mireia Farrús, and Leo Wanner. 2017. Attentional parallel RNNs for generating punctuation in transcribed speech. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 10583 LNAI:131–142.

Monica Sunkara, Srikanth Ronanki, Kalpit Dixit, Sravan Bodapati, and Katrin Kirchhoff. 2020. Robust prediction of punctuation and truecasing for medical ASR. In Proceedings of the First Workshop on Natural Language Processing for Medical Conversations, pages 53–62.

Ottokar Tilk and Tanel Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 08-12-Sept(September):3047–3051.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.