=Paper=
{{Paper
|id=Vol-2932/short1
|storemode=property
|title=Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation
|pdfUrl=https://ceur-ws.org/Vol-2932/short1.pdf
|volume=Vol-2932
|authors=Mikhail Orzhenovskii
|dblpUrl=https://dblp.org/rec/conf/csw/Orzhenovskii21
}}
==Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation==
Mikhail Orzhenovskii

Abstract

We report on our system for aggregating crowdsourced texts for the VLDB 2021 Crowd Science Workshop's shared task. In the task, several crowdsourced transcriptions of each original audio recording need to be combined into a single transcription. We propose a system that uses a pre-trained language model, fine-tuned on an augmented dataset, together with task-specific post-processing of the model's outputs to improve the quality of the results. Our model scored 95.73 AWAcc (45% fewer mistakes compared to the baseline) and achieved 1st place on the shared task leaderboard.

Keywords

Crowdsourcing, Text aggregation, Truth discovery

Source code is available at https://github.com/orzhan/bart-transcription-aggregation

VLDB 2021 Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, August 20, 2021, Copenhagen, Denmark. © 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

The VLDB 2021 Crowd Science Challenge [1] is a shared task on the aggregation of crowdsourced texts. Multiple transcriptions made by people need to be aggregated into a single high-quality transcription. The audio recordings were produced by a voice assistant reading texts from Wikipedia articles. The task is difficult because some annotators can be unskilled or malicious, and different people make mistakes in different parts of a sentence, so the data is very noisy.

The metric used to evaluate the solutions in the shared task was Average Word Accuracy (AWAcc), where Word Accuracy is calculated as

WAcc = 100 × max(1 − WER, 0)

and WER is the Word Error Rate between the submitted transcription and the ground truth.

This aggregation task can be seen as a particular case of multi-document summarization or as mistake correction. Pre-trained language models are widely used for many text-related tasks, including text summarization. Linguistic knowledge is beneficial in this task because it helps to choose plausible word sequences and to replace a misheard word with a word that has high probability in its context. We applied end-to-end training because the available dataset was large enough.

Table 1: Example of data (id 8359)

Transcriptions:
* Her recent
* Her research interests include; number theory, Houdin theory automorphic forms and spectral graph theory.
* Her research interests include number Theory coding Theory Autumn orphic forms and spectral graph Theory
* research interests include number theory coding theory automorphic forms and spectral graph theory
* her research interests include number theory coding theory automorphic forms and spectural graph theory
* Her research interests include ,number theory, padding theory,automorfic forms and spectual graph theory
* ,mnlknk

Ground truth: her research interests include number theory coding theory automorphic forms and spectral graph theory

2. Related work

The ROVER system used dynamic programming to align and combine word transition networks (WTNs). After the networks were joined, a scoring module searched the final WTN to select the best word sequence [2]. In the HRRASA system, multiple crowdsourced sequences were aggregated using global annotator reliability and local question-wise reliability based on text similarities [3].
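As an illustration of the AWAcc metric defined in the Introduction, the following minimal sketch computes Word Accuracy from a word-level edit distance. This is not the organizers' official scoring script, and the function names are only illustrative.

<pre>
# Minimal sketch (not the official scoring script) of the AWAcc metric:
# WAcc = 100 * max(1 - WER, 0), averaged over all task ids.

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def average_word_accuracy(references, hypotheses) -> float:
    """AWAcc over paired lists of references and hypotheses."""
    scores = [100 * max(1 - word_error_rate(r, h), 0)
              for r, h in zip(references, hypotheses)]
    return sum(scores) / len(scores)
</pre>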
3. Dataset

For each of the 9700 task ids, the training dataset contained 7 transcriptions made by the annotators and the ground truth text. The testing dataset contained 4502 task ids with 7 transcriptions each. The ground truth texts were typically 8 to 15 words long. The number of distinct words used in the transcriptions was 1 to 4 times larger than the number of distinct words in the ground truth label, which indicates that some texts were easier for the annotators than others. An example of the data is shown in Table 1.

4. Model and post-processing

Text aggregation can be seen as a sequence-to-sequence task: the input sequence is a concatenation of the crowdsourced transcriptions separated by a delimiter, and the output sequence is the ground truth text. The order of the transcriptions does not matter, and all of them can be treated equally, so we generated four sequences with different orders of the transcriptions for each task id. This method partially helped to regularize the model.

We evaluated two pre-trained language models: T5 [4] and BART [5]. Both models use the same encoder-decoder architecture and are capable of solving sequence-to-sequence problems.

The evaluation metric of the shared task was based on Word Error Rate, which makes, for example, "color" and "colour" different words. In the training dataset's ground truth labels, American English forms were more frequent, so we converted the model's outputs from British English to American English (where applicable) using the vocabulary from the American British English Translator (https://github.com/hyperreality/American-British-English-Translator).

Shuffling the transcriptions helped to regularize the model; however, it was sometimes sensitive to the order of the inputs. To obtain more stable results on the test dataset, for each task id we fed the model 20 concatenations with different orders of the transcriptions and selected the final result by majority vote. For most examples there were only two distinct generated results, one of which was output for most of the 20 concatenations. The input permutations were chosen to maximize the total Kendall tau rank distance between them.

5. Experiments

For the experiments we used the transformers [6] and simpletransformers [7] libraries, which support both BART and T5. The models were pre-trained on different tasks (summarization and translation), so fine-tuning was necessary to use them for the aggregation task. We fine-tuned the pre-trained models on 9400 samples of the training dataset; the remaining 300 samples were used as an evaluation set to choose the training parameters and to select the best model.

The T5 model produced nearly the same results as BART, but its fine-tuning took about 4 times longer, so we chose BART and ran most of the experiments with it. As expected, the larger models outperformed the smaller ones, so BART-large was selected for the final experiments. We selected a relatively small base learning rate of 4 × 10⁻⁶ and followed transformers' default schedule of changing the learning rate during fine-tuning (Fig. 2). The batch size during training and evaluation was set to 8, the maximum that fit on the GPU. We stopped training after the 5th epoch, when the evaluation AWAcc stopped increasing (Fig. 1). The evaluation loss started to rise during the 1st epoch, but further training helped obtain higher AWAcc scores on the evaluation and public test datasets.
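As a rough illustration of the setup described above, the sketch below fine-tunes BART-large with the simpletransformers Seq2Seq wrapper and applies the permutation-plus-majority-vote post-processing at inference time. It is only a minimal sketch under stated assumptions: the author's actual code is in the repository linked above, the delimiter, column names, and toy data here are illustrative, and random shuffling stands in for the Kendall-tau-based permutation selection.

<pre>
# Minimal sketch (not the author's code): fine-tune BART-large on concatenated
# transcriptions and aggregate test-time predictions by majority vote.
import random
from collections import Counter

import pandas as pd
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

DELIM = " | "  # assumed delimiter between transcriptions

def make_input(transcriptions, rng):
    """Concatenate the transcriptions in a shuffled order."""
    order = list(transcriptions)
    rng.shuffle(order)
    return DELIM.join(order)

def build_training_frame(tasks, permutations=4, seed=0):
    """tasks: list of (transcriptions, ground_truth) pairs."""
    rng = random.Random(seed)
    rows = [{"input_text": make_input(t, rng), "target_text": gt}
            for t, gt in tasks for _ in range(permutations)]
    return pd.DataFrame(rows)

# Toy example data; replace with the shared-task training set.
train_tasks = [
    (["her voice caught only once", "her voice caught on the word"],
     "her voice caught on the word"),
]

args = Seq2SeqArgs()
args.num_train_epochs = 5   # training stopped after the 5th epoch
args.learning_rate = 4e-6   # base learning rate from the paper
args.train_batch_size = 8
args.eval_batch_size = 8
args.num_beams = 5          # beam search used for generation

model = Seq2SeqModel(encoder_decoder_type="bart",
                     encoder_decoder_name="facebook/bart-large",
                     args=args)
model.train_model(build_training_frame(train_tasks))

def aggregate(transcriptions, n_orders=20, seed=0):
    """Generate outputs for 20 input orders and keep the most frequent one."""
    rng = random.Random(seed)
    inputs = [make_input(transcriptions, rng) for _ in range(n_orders)]
    outputs = model.predict(inputs)
    return Counter(outputs).most_common(1)[0][0]
</pre>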
The increase of the evaluation loss can indicate over-fitting, but the actual target metric is based on WER and is not always correlated with the evaluation loss, which is based on maximum likelihood rather than error counts. Beam search with 5 beams slightly improved the score compared to greedy decoding; using more beams did not lead to better results.

6. Results

The results of the model on the different datasets are shown in Table 2. The difference between the evaluation score and the public/private test scores is relatively small. The results of the proposed model and the baselines are shown in Table 3. "Majority vote" stands for selecting the most common result among the transcriptions; "Random choice" stands for choosing a random transcription as the answer.

Examples of the model's outputs are displayed in Table 4. The model processed 73.14% of the inputs without any error; the first two examples belong to this group. The other 26.86% of the inputs contained some mistakes, as illustrated by the third example.

Figure 1: Evaluation AWAcc
Figure 2: Learning rate

7. Conclusion

The proposed model outperformed the benchmark and other models, achieving a high score of 95.73 on the shared task. The model used only the texts of the transcriptions (no information about the annotators) to produce the result.

Figure 3: Training loss
Figure 4: Evaluation loss

Possible improvements in quality could be achieved by using information about the annotators (for example, assigning higher weights to accurate annotators), by injecting phonetic knowledge into the model to better match misheard word sequences, or by using a symmetric model architecture that processes the input transcriptions in parallel (removing the need for permutations during training and inference).

Table 2: AWAcc of the final model

Dataset          Score
evaluation set   96.10
public test      95.75
private test     95.73

Table 3: AWAcc on the private test compared to baselines

Model                        AWAcc
Final model                  95.73
Final model (no shuffling)   95.55
ROVER baseline [2]           92.25
HRRASA baseline [3]          91.04
Majority vote                72.42
Random choice                68.75

Table 4: Examples of results

Example of an easy aggregation
Transcriptions:
* the jungle finally offering some protection
* the jungle finally offering some protection
* the jungle finally offering some protection
* the jungle finally offering some protection
* the jungle finally offering some protection
* the jungle finally offering some protection
* the gentil finally offering some protection
Ground truth: the jungle finally offering some protection
Prediction: the jungle finally offering some protection
AWAcc: 100.0

Example of a difficult aggregation done correctly
Transcriptions:
* her voice caught only underwear
* her voice confirmed the word
* her voice caught only once
* a voice coach on the word
* her voice got only word
* her voice called on the word
* he voice caused only worries
Ground truth: her voice caught on the word
Prediction: her voice caught on the word
AWAcc: 100.0

Example of an incorrect aggregation
Transcriptions:
* an anger rich in pain n dissolution
* anfim underated and ojo
* un anger rated in pain and disillusionment
* and i deleted in pain and desolutionment
* an angry richard in pain and disillusionment
* thanks for the heads are
* an anger routed in pain and disillusionment
Ground truth: an anger rooted in pain and disillusionment
Prediction: an anger rich in pain and disillusionment
AWAcc: 85.71

References

[1] D. Ustalov, N. Pavlichenko, I. Stelmakh, D. Kuznetsov, VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions, in: Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, Copenhagen, Denmark, 2021.
[2] J. Fiscus, A post-processing system to yield reduced word error rates: Recognizer Output Voting Error Reduction (ROVER), in: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, 1997, pp. 347–354. doi:10.1109/ASRU.1997.659110.
[3] J. Li, Crowdsourced text sequence aggregation based on hybrid reliability and representation, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1761–1764. URL: https://doi.org/10.1145/3397271.3401239. doi:10.1145/3397271.3401239.
[4] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[5] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019. arXiv:1910.13461.
[6] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, A. Rush, Transformers: State-of-the-art natural language processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38–45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
[7] T. C. Rajapakse, Simple transformers, https://github.com/ThilinaRajapakse/simpletransformers, 2019.