<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Orzhenovskii</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We report on our system for aggregating crowdsourced texts for the VLDB 2021 Crowd Science Workshop's shared task. In the task, several crowdsourced transcriptions of each original audio must be combined into a single transcription. We propose a system that uses a pre-trained language model, fine-tuned on an augmented dataset, with task-specific post-processing of the model's outputs to improve the quality of the results. Our model scored 95.73 (45% fewer mistakes than the baseline) and achieved first place on the shared task leaderboard.</p>
      </abstract>
      <kwd-group>
        <kwd>Crowdsourcing</kwd>
        <kwd>Text aggregation</kwd>
        <kwd>Truth discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The VLDB 2021 Crowd Science Challenge[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a shared task on the aggregation of crowdsourced texts: multiple transcriptions made by people must be aggregated into a single high-quality transcription. The audios were produced with a voice assistant reading Wikipedia articles. The data is very noisy: some annotators may be unskilled or malicious, and different people make mistakes in different parts of a sentence. Solutions were ranked by Average Word Accuracy (AWAcc), a metric derived from the Word Error Rate. This aggregation task can be seen as a particular case of multi-document summarization or as mistake correction. Pre-trained language models are widely used for many text-related tasks, including text summarization. Linguistic knowledge is beneficial in this task because it helps to choose plausible word sequences and to replace a misheard word with a word that has high probability in context. We applied end-to-end training because the available dataset was large enough.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The ROVER system used dynamic programming to align and augment word transition networks (WTNs).
After joining the networks, the final WTN was searched by the scoring module to select the
best sequence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In the HRRASA system, multiple crowdsourced sequences were aggregated using global annotator
reliability and local question-wise reliability based on text similarities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>For each of 9700 task ids, the training dataset contained 7 transcriptions made by the annotators
and the ground truth text. The testing dataset contained 4502 task ids with 7 transcriptions for
each id.</p>
      <p>The ground truth texts were typically 8 to 15 words long. The number of different
words used in the transcriptions was 1 to 4 times larger than the number of different words in the
ground truth label, indicating that some texts were easier for the annotators than others. An
example of the data is shown in Table 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Model and post-processing</title>
      <p>Text aggregation can be seen as a sequence-to-sequence task: the input sequence is a
concatenation of the crowdsourced transcriptions separated by a delimiter, and the output sequence
is the ground truth text. The order of transcriptions does not matter, and all of them can be
treated equally, so we generated four sequences with different orders of transcriptions for each
task id. This method partially helped to regularize the model.</p>
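      <p>A minimal sketch of this input construction, assuming a simple string delimiter (the actual delimiter token is an illustrative choice here):</p>

```python
# Build several training inputs per task id, each concatenating the
# transcriptions in a different shuffled order (delimiter is illustrative).
import random

DELIMITER = " | "

def make_inputs(transcriptions, n_orders=4, seed=0):
    """Return n_orders delimiter-joined concatenations in shuffled orders."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(n_orders):
        order = list(transcriptions)
        rng.shuffle(order)
        inputs.append(DELIMITER.join(order))
    return inputs
```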
      <p>
        We have evaluated two pre-trained language models: T5[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and BART[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Both models
use the same encoder-decoder architecture and are capable of solving sequence-to-sequence
problems.
      </p>
      <p>The evaluation metric in the shared task was based on Word Error Rate, making, for example,
color and colour different words. In the training dataset’s ground truth labels, American English
forms were more frequent, so we converted the model’s outputs from British English to American
English (where applicable) using the vocabulary from American British English Translator1.</p>
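      <p>A minimal sketch of this post-processing step; the dictionary entries below are illustrative stand-ins for the full vocabulary shipped with American British English Translator:</p>

```python
# Word-by-word British-to-American spelling conversion (tiny sample vocabulary).
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "favourite": "favorite",
    "theatre": "theater",
}

def americanize(text):
    """Replace known British spellings, leaving other words unchanged."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in text.split())
```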
      <p>Shuffling the transcriptions helped regularize the model; however, it was sometimes
sensitive to the order of the inputs. To obtain more stable results on the test dataset, for each
task id we inputted 20 concatenations with different orders of transcriptions and selected the
final result by majority vote. For most examples, there were only two different generated
results, one of which was outputted for most of the 20 concatenations. The input permutations were
chosen to maximize the total Kendall tau rank distance between them.</p>
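      <p>The voting step can be sketched as follows; both helpers are illustrative: kendall_tau_distance counts pairwise order disagreements between two permutations, and majority_vote picks the most frequent generated output.</p>

```python
# Majority vote over outputs generated from differently ordered inputs.
from collections import Counter
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Number of item pairs ordered differently by the two permutations."""
    pos = {item: i for i, item in enumerate(perm_b)}
    mapped = [pos[item] for item in perm_a]
    return sum(1 for i, j in combinations(range(len(mapped)), 2)
               if mapped[i] > mapped[j])

def majority_vote(generated_outputs):
    """Most common transcription among the generated outputs."""
    return Counter(generated_outputs).most_common(1)[0][0]
```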
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        For the experiments we used the transformers[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and simpletransformers[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] libraries, which
support both BART and T5 models. The models were pre-trained on different tasks
(summarization and translation), so fine-tuning was necessary for the aggregation task. We
fine-tuned the pre-trained models on 9400 samples of the training dataset. The remaining 300 samples
were used as an evaluation dataset to choose the training parameters and to select the best model.
      </p>
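      <p>A sketch of this setup with the simpletransformers seq2seq wrapper, using the configuration reported in this section (BART-large, learning rate 4e-6, batch size 8, 5 epochs); the input_text/target_text column names follow simpletransformers' seq2seq convention, and the helper functions are illustrative:</p>

```python
# Prepare seq2seq training data and fine-tune BART with simpletransformers.
import pandas as pd

def build_dataframe(tasks):
    """tasks: iterable of (transcription_list, ground_truth) pairs."""
    rows = [{"input_text": " | ".join(transcriptions), "target_text": truth}
            for transcriptions, truth in tasks]
    return pd.DataFrame(rows)

def fine_tune(train_tasks, eval_tasks):
    """Fine-tune BART-large, mirroring the settings reported above."""
    from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

    args = Seq2SeqArgs()
    args.learning_rate = 4e-6
    args.train_batch_size = 8
    args.eval_batch_size = 8
    args.num_train_epochs = 5

    model = Seq2SeqModel("bart", "facebook/bart-large", args=args)
    model.train_model(build_dataframe(train_tasks),
                      eval_data=build_dataframe(eval_tasks))
    return model
```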
      <p>The T5 model produced nearly the same results as BART, but fine-tuning took
about 4 times longer, so we chose BART and ran most of the experiments with it. As
expected, the larger models outperformed the smaller ones, so BART-large was selected for the
final experiments.</p>
      <p>We selected a relatively small base learning rate of 4 × 10<sup>−6</sup> and followed transformers’ default
learning-rate schedule during fine-tuning (Fig. 5). The batch size during training and
evaluation was set to 8, the maximum that fit on the GPU.</p>
      <p>We stopped training after the 5th epoch, when the evaluation AWAcc stopped increasing (Fig. 5).
The evaluation loss started to rise during the 1st epoch, but further training helped obtain better
scores on the evaluation and public test datasets. The increase in evaluation loss could
indicate over-fitting, but the actual target metric, WER, is different and is not always correlated
with the evaluation loss (which is based on maximum likelihood, not error count).</p>
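      <p>Because the target metric differs from the training loss, it is worth monitoring directly. A word-level sketch, assuming the common definition AWAcc = mean of max(0, 1 − WER) × 100 (the challenge's exact formula may differ in details):</p>

```python
# Word-level WER via Levenshtein distance, and the derived average accuracy.
def word_error_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            substitution = prev[j - 1] + (h != r)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def average_word_accuracy(pairs):
    """Mean per-sentence word accuracy (clipped at zero), as a percentage."""
    accs = [max(0.0, 1.0 - word_error_rate(hyp, ref)) for hyp, ref in pairs]
    return 100.0 * sum(accs) / len(accs)
```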
      <p>Beam search with 5 beams slightly improved the score compared to greedy decoding. Using
more beams did not lead to better results.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>The results of the model on the different datasets are shown in Table 2. The difference
between the evaluation score and the public/private scores is relatively small.</p>
      <p>The results of the proposed model and the baselines are shown in Table 3. Majority vote
stands for selecting the most common result from the transcriptions; random choice stands for
choosing a random transcription as the answer.</p>
      <p>Examples of the model’s outputs are displayed in Table 4. The model processed 73.14% of
the inputs without any error; the first two examples belong to this group. The other 26.86% of
the inputs contained some mistakes, as illustrated by the third example.</p>
      <p>1. https://github.com/hyperreality/American-British-English-Translator</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The proposed model outperformed the baseline and other models, achieving a score of
95.73 on the shared task. The model used only the texts of the transcriptions (no information
about the annotators) to produce the result.</p>
      <p>Further quality improvements could be achieved by using information about the
annotators (for example, assigning higher weights to accurate annotators), injecting phonetic
knowledge into the model to better match misheard word sequences, or using a symmetric model
architecture that processes the input transcriptions in parallel (removing the need for permutations
during training and inference).</p>
      <p>Example of an easy aggregation.
Transcriptions: “the jungle finally offering some protection” (6 of 7 annotators) | “the gentil finally offering some protection”
Ground truth: the jungle finally offering some protection
Prediction: the jungle finally offering some protection
AWAcc: 100.0</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pavlichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stelmakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <article-title>VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions</article-title>
          ,
          <source>in: Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale</source>
          , Copenhagen, Denmark,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <article-title>A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover)</article-title>
          ,
          <source>in: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>347</fpage>
          -
          <lpage>354</lpage>
          .
          doi:10.1109/ASRU.1997.659110.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Crowdsourced text sequence aggregation based on hybrid reliability and representation</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>1761</fpage>
          -
          <lpage>1764</lpage>
          . URL: https://doi.org/10.1145/3397271.3401239.
          doi:10.1145/3397271.3401239.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          . URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Rajapakse</surname>
          </string-name>
          , Simple transformers, https://github.com/ThilinaRajapakse/simpletransformers,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>