<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fine-Tuning Pre-Trained Language Model for Crowdsourced Texts Aggregation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mikhail Orzhenovskii</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>We report on our system for aggregating crowdsourced texts for the VLDB 2021 Crowd Science Workshop's shared task. In the task, several crowdsourced transcriptions of each original audio must be combined into a single transcription. We propose a system that uses a pre-trained language model, fine-tuned on an augmented dataset, with task-specific post-processing of the model's outputs to improve the quality of the results. Our model scored 95.73 (45% fewer mistakes than the baseline) and achieved first place on the shared task leaderboard.</p>
      </abstract>
      <kwd-group>
        <kwd>Crowdsourcing</kwd>
        <kwd>Text aggregation</kwd>
        <kwd>Truth discovery</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The VLDB 2021 Crowd Science Challenge[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a shared task on the aggregation of crowdsourced texts: multiple transcriptions made by people must be aggregated into a single high-quality transcription. The audios were produced with a voice assistant reading Wikipedia articles. The data is very noisy: some annotators may be unskilled or malicious, and different people make mistakes in different parts of a sentence. Solutions were ranked by Average Word Accuracy (AWAcc), a metric derived from the Word Error Rate. This aggregation task can be seen as a particular case of multi-document summarization or as mistake correction. Pre-trained language models are widely used for many text-related tasks, including text summarization. Linguistic knowledge is beneficial in this task because it helps to choose plausible word sequences and to replace a misheard word with a word that has high probability in context. We applied end-to-end training because the available dataset was large enough.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        The ROVER system used dynamic programming to align and augment word transition networks (WTNs).
After joining the networks, the final WTN was searched by the scoring module to select the
best sequence [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In the HRRASA system, multiple crowdsourced sequences were aggregated using global annotator
reliability and local question-wise reliability based on text similarities [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>For each of 9700 task ids, the training dataset contained 7 transcriptions made by the annotators
and the ground truth text. The testing dataset contained 4502 task ids with 7 transcriptions for
each id.</p>
      <p>The ground truth texts were typically 8 to 15 words long. The number of different
words used in the transcriptions was 1 to 4 times larger than the number of different words in the
ground truth label, indicating that some texts were easier for the annotators than others. An
example of the data is shown in Table 1.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Model and post-processing</title>
      <p>Text aggregation can be seen as a sequence-to-sequence task: the input sequence is a
concatenation of the crowdsourced transcriptions separated by a delimiter, and the output sequence
is the ground truth text. The order of transcriptions does not matter, and all of them can be
treated equally, so we generated four sequences with different orders of transcriptions for each
task id. This method partially helped to regularize the model.</p>
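      <p>A minimal sketch of this input construction, assuming a simple string delimiter (the actual delimiter token is an illustrative choice here):</p>

```python
# Build several training inputs per task id, each concatenating the
# transcriptions in a different shuffled order (delimiter is illustrative).
import random

DELIMITER = " | "

def make_inputs(transcriptions, n_orders=4, seed=0):
    """Return n_orders delimiter-joined concatenations in shuffled orders."""
    rng = random.Random(seed)
    inputs = []
    for _ in range(n_orders):
        order = list(transcriptions)
        rng.shuffle(order)
        inputs.append(DELIMITER.join(order))
    return inputs
```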
      <p>
        We have evaluated two pre-trained language models: T5[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and BART[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Both models
use the same encoder-decoder architecture and are capable of solving sequence-to-sequence
problems.
      </p>
      <p>The evaluation metric in the shared task was based on Word Error Rate, making, for example,
color and colour different words. In the training dataset’s ground truth labels, American English
forms were more frequent, so we converted the model’s outputs from British English to American
English (where applicable) using the vocabulary from American British English Translator1.</p>
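      <p>A minimal sketch of this post-processing step; the dictionary entries below are illustrative stand-ins for the full vocabulary shipped with American British English Translator:</p>

```python
# Word-by-word British-to-American spelling conversion (tiny sample vocabulary).
BRITISH_TO_AMERICAN = {
    "colour": "color",
    "favourite": "favorite",
    "theatre": "theater",
}

def americanize(text):
    """Replace known British spellings, leaving other words unchanged."""
    return " ".join(BRITISH_TO_AMERICAN.get(word, word) for word in text.split())
```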
      <p>Shuffling the transcriptions helped regularize the model; however, it was sometimes
sensitive to the order of the inputs. To obtain more stable results on the test dataset, for each
task id we inputted 20 concatenations with different orders of transcriptions and selected the
final result by majority vote. For most examples, there were only two different generated
results, one of which was outputted for most of the 20 concatenations. The input permutations were
chosen to maximize the total Kendall tau rank distance between them.</p>
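      <p>The voting step can be sketched as follows; both helpers are illustrative: kendall_tau_distance counts pairwise order disagreements between two permutations, and majority_vote picks the most frequent generated output.</p>

```python
# Majority vote over outputs generated from differently ordered inputs.
from collections import Counter
from itertools import combinations

def kendall_tau_distance(perm_a, perm_b):
    """Number of item pairs ordered differently by the two permutations."""
    pos = {item: i for i, item in enumerate(perm_b)}
    mapped = [pos[item] for item in perm_a]
    return sum(1 for i, j in combinations(range(len(mapped)), 2)
               if mapped[i] > mapped[j])

def majority_vote(generated_outputs):
    """Most common transcription among the generated outputs."""
    return Counter(generated_outputs).most_common(1)[0][0]
```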
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        For the experiments we used the transformers[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and simpletransformers[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] libraries, which
support both BART and T5 models. The models were pre-trained on different tasks
(summarization and translation), so fine-tuning was necessary for the aggregation task. We
fine-tuned the pre-trained models on 9400 samples of the training dataset. The remaining 300 samples
were used as an evaluation dataset to choose the training parameters and to select the best model.
      </p>
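      <p>A sketch of this setup with the simpletransformers seq2seq wrapper, using the configuration reported in this section (BART-large, learning rate 4e-6, batch size 8, 5 epochs); the input_text/target_text column names follow simpletransformers' seq2seq convention, and the helper functions are illustrative:</p>

```python
# Prepare seq2seq training data and fine-tune BART with simpletransformers.
import pandas as pd

def build_dataframe(tasks):
    """tasks: iterable of (transcription_list, ground_truth) pairs."""
    rows = [{"input_text": " | ".join(transcriptions), "target_text": truth}
            for transcriptions, truth in tasks]
    return pd.DataFrame(rows)

def fine_tune(train_tasks, eval_tasks):
    """Fine-tune BART-large, mirroring the settings reported above."""
    from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

    args = Seq2SeqArgs()
    args.learning_rate = 4e-6
    args.train_batch_size = 8
    args.eval_batch_size = 8
    args.num_train_epochs = 5

    model = Seq2SeqModel("bart", "facebook/bart-large", args=args)
    model.train_model(build_dataframe(train_tasks),
                      eval_data=build_dataframe(eval_tasks))
    return model
```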
      <p>The T5 model produced nearly the same results as BART, but fine-tuning took
about 4 times longer, so we chose BART and ran most of the experiments with it. As
expected, the larger models outperformed the smaller ones, so BART-large was selected for the
final experiments.</p>
      <p>We selected a relatively small base learning rate of 4 × 10<sup>−6</sup> and followed transformers’ default
learning-rate schedule during fine-tuning (Fig. 5). The batch size during training and
evaluation was set to 8, the maximum that fit on the GPU.</p>
      <p>We stopped training after the 5th epoch, when the evaluation AWAcc stopped increasing (Fig. 5).
The evaluation loss started to rise during the 1st epoch, but further training helped obtain better
scores on the evaluation and public test datasets. The increase in evaluation loss could
indicate over-fitting, but the actual target metric, WER, is different and is not always correlated
with the evaluation loss (which is based on maximum likelihood, not error count).</p>
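      <p>Because the target metric differs from the training loss, it is worth monitoring directly. A word-level sketch, assuming the common definition AWAcc = mean of max(0, 1 − WER) × 100 (the challenge's exact formula may differ in details):</p>

```python
# Word-level WER via Levenshtein distance, and the derived average accuracy.
def word_error_rate(hypothesis, reference):
    """Word-level edit distance divided by reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            substitution = prev[j - 1] + (h != r)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / max(len(ref), 1)

def average_word_accuracy(pairs):
    """Mean per-sentence word accuracy (clipped at zero), as a percentage."""
    accs = [max(0.0, 1.0 - word_error_rate(hyp, ref)) for hyp, ref in pairs]
    return 100.0 * sum(accs) / len(accs)
```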
      <p>Beam search with 5 beams slightly improved the score compared to greedy decoding. Using
more beams did not lead to better results.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>The results of the model on the different datasets are shown in Table 2. The difference
between the evaluation score and the public/private scores is relatively small.</p>
      <p>The results of the proposed model and the baselines are shown in Table 3. Majority vote
stands for selecting the most common result from the transcriptions; random choice stands for
choosing a random transcription as the answer.</p>
      <p>Examples of the model’s outputs are displayed in Table 4. The model processed 73.14% of
the inputs without any error; the first two examples belong to this group. The other 26.86% of
the inputs contained some mistakes, as illustrated by the third example.</p>
      <p>1. https://github.com/hyperreality/American-British-English-Translator</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>The proposed model outperformed the baseline and other models, achieving a score of
95.73 on the shared task. The model used only the texts of the transcriptions (no information
about the annotators) to produce the result.</p>
      <p>Further quality improvements could be achieved by using information about the
annotators (for example, assigning higher weights to accurate annotators), injecting phonetic
knowledge into the model to better match misheard word sequences, or using a symmetric model
architecture that processes the input transcriptions in parallel (removing the need for permutations
during training and inference).</p>
      <p>Example of an easy aggregation.
Transcriptions: “the jungle finally offering some protection” (6 of 7 annotators) | “the gentil finally offering some protection”
Ground truth: the jungle finally offering some protection
Prediction: the jungle finally offering some protection
AWAcc: 100.0</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pavlichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stelmakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <article-title>VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions</article-title>
          ,
          <source>in: Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale</source>
          , Copenhagen, Denmark,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Fiscus</surname>
          </string-name>
          ,
          <article-title>A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (rover)</article-title>
          ,
          <source>in: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings</source>
          ,
          <year>1997</year>
          , pp.
          <fpage>347</fpage>
          -
          <lpage>354</lpage>
          .
          doi:10.1109/ASRU.1997.659110.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Crowdsourced text sequence aggregation based on hybrid reliability and representation</article-title>
          ,
          <source>in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , SIGIR '20, Association for Computing Machinery, New York, NY, USA,
          <year>2020</year>
          , p.
          <fpage>1761</fpage>
          -
          <lpage>1764</lpage>
          . URL: https://doi.org/10.1145/3397271.3401239.
          doi:10.1145/3397271.3401239.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of Machine Learning Research</source>
          <volume>21</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>67</lpage>
          . URL: http://jmlr.org/papers/v21/20-074.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension</article-title>
          ,
          <year>2019</year>
          . arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          , Transformers:
          <article-title>State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6. doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>T. C.</given-names>
            <surname>Rajapakse</surname>
          </string-name>
          , Simple transformers, https://github.com/ThilinaRajapakse/simpletransformers,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>