<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale, August 2021</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Noisy Text Sequences Aggregation as a Summarization Subtask</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergey Pletenev</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Research University Higher School of Economics (HSE University)</institution>
          ,
          <addr-line>Moscow, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>20</volume>
      <issue>2021</issue>
      <abstract>
        <p>Most speech-driven systems first convert audio to text with an automatic speech recognition (ASR) model and then pass the text to downstream natural language processing (NLP) modules. However, these ASR models can lead to system failures or undesirable output when exposed to natural language perturbation or variation in practice. In this paper, we introduce a simple yet efficient model that improves the understanding of the semantics of the input speech and performs error correction by processing the multiple hypotheses of ASR systems.</p>
      </abstract>
      <kwd-group>
        <kwd>ASR n-best hypotheses integration</kwd>
        <kwd>ASR</kwd>
        <kwd>Seq2Seq</kwd>
        <kwd>NLP</kwd>
        <kwd>Spoken language understanding</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Preprocessing</title>
      <sec id="sec-2-1">
        <title>We experimented on three datasets.</title>
        <p>
          • VLDB 2021 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]: the dataset contains 9500 unique lines, with 7 hypotheses for each example.
• DSTC2/3 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]: the dataset consists of human-computer dialogues in a restaurant domain
collected with Amazon Mechanical Turk. It contains reference texts and ASR hypotheses,
with around 10 hypotheses for each text.
• Stacked DeBERT [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]: the dataset was generated using freely available TTS (text-to-speech)
and STT (speech-to-text) systems, with 6 to 7 hypotheses for each unique line.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>All datasets are shown in Table 1.</title>
        <p>
          We use the JiWER toolkit (https://github.com/jitsi/jiwer/) to clean up our datasets and to calculate the WER (in this case WAcc) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]
metric for each line. WER is the de facto standard metric for ASR system assessment. It is calculated
as the total error count normalized by the reference length. In our work, we use an additional
scoring metric called Phone Edit Rate (PER) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] to evaluate the phoneme-level noisiness of the
generated samples:
PER(x, y) = Levenshtein(phoneme(x), phoneme(y)) / len(phoneme(x)) (1)
PAcc(x, y) = 1 − PER(x, y) (2)
        </p>
        <p>where x is the original text and y is the text with ASR noise, and phoneme is a function that transforms
text into its phoneme sequence. We use the CMU Pronouncing Dictionary (http://www.speech.cs.cmu.edu/cgi-bin/cmudict) to transform our texts.</p>
        <p>The PER metric allows us to measure the accuracy of our models more precisely. Table 2
shows an example of evaluating results with the PER and WER metrics. We can see that in some
cases the WER metric shows no change in quality: the last two rows show the same WER
result, while PER shows the difference between these rows of text. In other cases the WER metric
shows a worse result than is actually the case. In the first two rows of Table 2 the difference between the
predicted result and the correct answer is a single apostrophe. WER counts one whole-word error,
while PER shows only one phoneme error, which is much more accurate.</p>
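        <p>As a minimal sketch of equations (1)-(2), PER can be computed with a standard Levenshtein distance over phoneme sequences; the to_phonemes argument below is a hypothetical stand-in for a CMU Pronouncing Dictionary lookup:</p>

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance over two sequences.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]

def per(reference, hypothesis, to_phonemes):
    # Equation (1): PER = Levenshtein(phoneme(ref), phoneme(hyp)) / len(phoneme(ref)).
    # to_phonemes is a placeholder for a real phonemizer (e.g. a CMU dict lookup).
    ref_ph = to_phonemes(reference)
    hyp_ph = to_phonemes(hypothesis)
    return edit_distance(ref_ph, hyp_ph) / len(ref_ph)

def pacc(reference, hypothesis, to_phonemes):
    # Equation (2): phone accuracy.
    return 1 - per(reference, hypothesis, to_phonemes)
```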
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>3.1. Models
In this work we use several models and baselines.</p>
      <p>• Baseline. As a simple baseline we use majority vote: if some text occurs more than N times in a
corpus, that text is considered correct; otherwise a random text is selected.
• Advanced baselines. As stronger baselines we use two algorithms: ROVER [9] and HRRASA [10].
• T5 [11]. The T5 model is trained on several datasets for 18 different tasks which fall
into 8 broad categories: text summarization, question answering, translation, etc. In our experiments
we use 3 different sizes: t5-small, t5-base, t5-large.
• PEGASUS [12]. The PEGASUS pretraining task is intentionally similar to
summarization: important sentences are removed/masked from an input document and are generated
together as one output sequence from the remaining sentences, similar to an extractive
summary. We use a PEGASUS model trained on the XSum dataset [13].
We use HuggingFace Transformers (https://huggingface.co/transformers/) for model training and prediction. Each model is trained
with the following parameters: encoder length 512, decoder length 64, batch size 3, 8 epochs,
learning rate 5e-05; after every 1000 steps we evaluate our models with beam size 12.
3.2. Data</p>
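      <p>The majority-vote baseline described above can be sketched as follows (a minimal illustration; the threshold parameter n, the tie-breaking, and the seeding are assumptions):</p>

```python
import random
from collections import Counter

def majority_vote(hypotheses, n=1, seed=0):
    # If some text occurs more than n times among the hypotheses,
    # take it as the aggregated answer; otherwise fall back to a
    # random hypothesis (seeded here only for reproducibility).
    text, freq = Counter(hypotheses).most_common(1)[0]
    if freq > n:
        return text
    return random.Random(seed).choice(hypotheses)
```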
      <sec id="sec-3-1">
        <title>We use a pipeline to clean up and prepare our datasets:</title>
      </sec>
      <sec id="sec-3-2">
        <title>1. Remove punctuation marks (except apostrophes) and numbers</title>
      </sec>
      <sec id="sec-3-3">
        <title>2. Convert texts to lowercase</title>
      </sec>
      <sec id="sec-3-4">
        <title>3. Remove unnecessary spaces in the sentence</title>
      </sec>
      <sec id="sec-3-5">
        <title>4. Limit the number of hypotheses for each of the unique texts to 7</title>
      </sec>
      <sec id="sec-3-6">
        <title>5. Concatenate hypotheses into a single text with the token "|" for T5 and with the token "." for PEGASUS.</title>
      </sec>
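      <p>The five steps above can be sketched as a single preparation function (a minimal illustration; the regular expression and the separator defaults are assumptions):</p>

```python
import re

def clean_text(text):
    # Steps 1-3: strip punctuation (except apostrophes) and digits,
    # lowercase, and collapse repeated spaces.
    text = re.sub(r"[^A-Za-z' ]", " ", text)
    return " ".join(text.lower().split())

def prepare_example(hypotheses, sep=" | ", max_hyp=7):
    # Steps 4-5: keep at most 7 hypotheses and concatenate them with
    # the model-specific separator ("|" for T5, "." for PEGASUS).
    return sep.join(clean_text(h) for h in hypotheses[:max_hyp])
```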
      <sec id="sec-3-7">
        <title>The test set contains 1400 examples for 200 unique texts, and was taken from VLDB 2021 only.</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>[Results table comparing baseline (N&gt;1), baseline (N&gt;2), HRRASA, ROVER, T5-small, T5-base, T5-large, and PEGASUS-xsum, with the T5 and PEGASUS models evaluated both off-the-shelf and finetuned.]</p>
    </sec>
    <sec id="sec-5">
      <title>5. Error Analysis</title>
      <p>The first problem we had with summarization models was the limited control over the generated
output. We can only partly control text generation: all the models were pretrained on tasks that
generate from a paragraph down to a few sentences, while our task requires only one sentence as the
output. Therefore, in some cases the model generated multiple sentences, which had a negative
impact on quality. We tried to counter this by replacing the "." token with the "|" token in the T5
model.</p>
      <sec id="sec-5-1">
        <title>The second problem is that almost any additional data gives worse scores. This is probably because the original data is of very good quality (being human crowd-sourced), while DSTC2/3 and Stacked DeBERT were machine-generated.</title>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This paper presents our approach to noisy text sequence aggregation, which ranked second
in the VLDB 2021 Crowd Science Challenge. Our paper shows the effectiveness of the
method. The error analysis also shows that the proposed approach can perform better with
additional datasets.</p>
      <sec id="sec-6-1">
        <title>In the future, we plan to adapt our model to speech in other domains. We also plan to train the model to generate texts with ASR noise.</title>
        <p>[9] J. Fiscus, A post-processing system to yield reduced word error rates: Recognizer output
voting error reduction (ROVER), in: 1997 IEEE Workshop on Automatic Speech Recognition
and Understanding Proceedings, 1997, pp. 347–354. doi:10.1109/ASRU.1997.659110.
[10] J. Li, Crowdsourced Text Sequence Aggregation Based on Hybrid Reliability and
Representation, Association for Computing Machinery, New York, NY, USA, 2020, pp. 1761–1764.
URL: https://doi.org/10.1145/3397271.3401239.
[11] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
Exploring the limits of transfer learning with a unified text-to-text transformer, Journal of
Machine Learning Research 21 (2020) 1–67. URL: http://jmlr.org/papers/v21/20-074.html.
[12] J. Zhang, Y. Zhao, M. Saleh, P. J. Liu, PEGASUS: Pre-training with extracted gap-sentences
for abstractive summarization, 2020. arXiv:1912.08777.
[13] S. Narayan, S. B. Cohen, M. Lapata, Don't give me the details, just the summary!
Topic-aware convolutional neural networks for extreme summarization, in: Proceedings of the
2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium,
2018.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Egonmwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chali</surname>
          </string-name>
          ,
          <article-title>Transformer and seq2seq model for paraphrase generation</article-title>
          ,
          <source>in: Proceedings of the 3rd Workshop on Neural Generation and Translation</source>
          , Association for Computational Linguistics, Hong Kong,
          <year>2019</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>255</lpage>
          . URL: https://www.aclweb.org/anthology/D19-5627. doi:10.18653/v1/D19-5627.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghazvininejad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          , L. Zettlemoyer, Bart:
          <article-title>Denoising sequence-to-sequence pre-training for natural language generation, translation</article-title>
          , and comprehension,
          <year>2019</year>
          . arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          , Attention is all you need,
          <year>2017</year>
          . arXiv:1706.03762.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pavlichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stelmakh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kuznetsov</surname>
          </string-name>
          ,
          <article-title>VLDB 2021 Crowd Science Challenge on Aggregating Crowdsourced Audio Transcriptions</article-title>
          ,
          <source>in: Proceedings of the 2nd Crowd Science Workshop: Trust, Ethics, and Excellence in Crowdsourced Data Management at Scale</source>
          , Copenhagen, Denmark,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Henderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Thomson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <article-title>The second dialog state tracking challenge, in: Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), Association for Computational Linguistics</article-title>
          , Philadelphia, PA, U.S.A.,
          <year>2014</year>
          , pp.
          <fpage>263</fpage>
          -
          <lpage>272</lpage>
          . URL: https://www.aclweb.org/anthology/W14-4337. doi:10.3115/v1/W14-4337.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Cunha Sergio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Stacked debert: All attention in incomplete data for text classification</article-title>
          ,
          <source>Neural Networks</source>
          <volume>136</volume>
          (
          <year>2021</year>
          )
          <fpage>87</fpage>
          -
          <lpage>96</lpage>
          . URL: http://dx.doi.org/10.1016/j.neunet.2020.12.018. doi:10.1016/j.neunet.2020.12.018.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Morris</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Maier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. D.</given-names>
            <surname>Green</surname>
          </string-name>
          ,
          <article-title>From wer and ril to mer and wil: improved evaluation measures for connected speech recognition</article-title>
          , in: INTERSPEECH, ISCA,
          <year>2004</year>
          . URL: http: //dblp.uni-trier.de/db/conf/interspeech/interspeech2004.html#MorrisMG04.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>An approach to improve robustness of nlp systems against asr errors</article-title>
          ,
          <year>2021</year>
          . arXiv:2103.13610.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>