<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Constructing a Multimodal, Multilingual Translation and Interpreting Corpus: A Modular Pipeline and an Evaluation of ASR for Verbatim Transcription</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alice Fedotova</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adriano Ferraresi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maja Miličević Petrović</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Barrón-Cedeño</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>DIT, Università di Bologna</institution>
          ,
          <addr-line>Corso della Repubblica 136, 47121, Forlì</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents a novel pipeline for constructing multimodal and multilingual parallel corpora, with a focus on evaluating state-of-the-art automatic speech recognition tools for verbatim transcription. The pipeline was developed during the process of updating the European Parliament Translation and Interpreting Corpus (EPTIC), leveraging recent NLP advancements to automate challenging tasks like multilingual alignment and speech recognition. Our findings indicate that current technologies can streamline corpus construction, with fine-tuning showing promising results in terms of transcription quality compared to out-of-the-box Whisper models. The lowest overall WER achieved for English was 0.180, using a fine-tuned Whisper-small model. As for Italian, the lowest WER (0.152) was obtained by the Whisper-large-v2 model, with the fine-tuned Whisper-small model still outperforming the baseline (0.201 vs. 0.219).</p>
      </abstract>
      <kwd-group>
<kwd>multimodal corpora construction</kwd>
        <kwd>translation and interpreting corpora</kwd>
        <kwd>verbatim automatic speech recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The present paper introduces a pipeline for the construction of multimodal and multilingual parallel corpora that could be used for translation and interpreting studies (TIS), among others. The construction of such resources has been acknowledged as a “formidable task” [<xref ref-type="bibr" rid="ref1 ref10">1</xref>], which, if automated, as we propose, involves a number of subtasks such as automatic speech recognition (ASR), multilingual sentence alignment, and forced alignment, each of which poses its own challenges. Yet tackling these subtasks also offers a unique way to evaluate state-of-the-art natural language processing (NLP) tools against a unique, multilingual benchmark. In this paper we discuss the development of a modular pipeline adaptable for each of these subtasks and address the issue of whether performing ASR with OpenAI’s Whisper [2] could be suitable for verbatim transcription.</p>
      <p>We showcase the utility of this pipeline by expanding the European Parliament Translation and Interpreting Corpus (EPTIC), a multimodal parallel corpus comprising speeches delivered at the European Parliament along with their official interpretations and translations [<xref ref-type="bibr" rid="ref1 ref10">1, 3</xref>]. The transcription conventions adopted for the compilation of EPTIC were developed ad hoc and aim at reproducing minimal prosodic features, but can still be considered an instance of verbatim transcription [<xref ref-type="bibr" rid="ref1 ref10">3, 1</xref>]; the issue of what truly constitutes verbatimness is still an object of debate and will be further discussed. There is fairly widespread agreement on the statement that every transcription system reflects a certain methodological approach [4, 5], and that by “choosing not to transcribe a particular dimension, the researcher has implicitly decided that the dimension plays no role in the phenomenon in question” [4]. To investigate the characteristics of Whisper’s [2] transcriptions in English and Italian, we formulate the following two research questions: RQ1 Is it possible to use fine-tuning to adapt the transcription style to that of an expert annotator? RQ2 What is the impact of speech type (native, non-native, interpreted) on transcription quality?</p>
      <p>We find that satisfactory results can be achieved with automatic speech recognition, although challenges remain, especially with regard to the verbatimness of the transcription, a crucial factor in corpora intended for TIS. Fine-tuning Whisper-small on English data obtains a lower word error rate (WER) of 0.180 compared to Whisper-large-v2 (0.194), potentially indicating that fine-tuning Whisper models holds promise for improving their performance in terms of adhering to a certain transcription style. However, this was not the case in the experiments based on Italian. In the Italian scenario, Whisper-large-v2 obtained a WER of 0.152 compared to a WER of 0.201 obtained by the fine-tuned Whisper-small model. It should be noted, however, that this constituted an improvement over the baseline Whisper-small model, which obtained a higher WER of 0.219. A significant limitation for fine-tuning in Italian was the smaller amount of data available for tuning compared to English. Lastly, we find that sentence alignment can be facilitated through state-of-the-art embedding-based tools, whereas forced alignment can be considered a largely solved problem. This makes the construction of corpora such as EPTIC more streamlined and less dependent on human intervention, with wider implications for multilingual corpus construction in the field of TIS and beyond.</p>
      <p>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy. * Corresponding author: alice.fedotova2@unibo.it (A. Fedotova); adriano.ferraresi@unibo.it (A. Ferraresi); maja.milicevic2@unibo.it (M. Miličević Petrović); a.barron@unibo.it (A. Barrón-Cedeño). ORCID: 0009-0001-4850-0974 (A. Fedotova); 0000-0002-6957-0605 (A. Ferraresi); 0000-0003-4137-1898 (M. Miličević Petrović); 0000-0003-4719-3420 (A. Barrón-Cedeño). © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <sec id="sec-1-1">
        <title>2. Related Work</title>
        <p>Recent advancements in the field of corpus linguistics have led to a multitude of complex multilingual and multimodal corpora, as well as novel approaches to corpus construction. Transcribing spoken data, identifying prosodic features, and aligning parallel texts are some of the tasks that are commonly involved. In this sense, a particularly representative case in point is constituted by interpreting corpora, such as EPIC [6], DIRSI [7], and EPTIC [<xref ref-type="bibr" rid="ref1 ref10">3, 1</xref>], the latter also including translated texts. Based on data obtained from the European Parliament, these complex corpora require multi-step approaches for gathering and processing parallel, multilingual texts and multimodal data. Though the construction of translation and interpreting corpora has been largely carried out manually, it can also constitute a unique opportunity for developing new tools and benchmarking recent advancements in the fields of NLP and ASR. ASR, in particular, has garnered increasing attention due to the time-consuming nature of spoken data transcription.</p>
        <p>A related research strand in the field of ASR concerns the level of detail of the transcriptions produced by ASR systems, as the task is usually not only to transcribe the speech but to make sure that prosodic features, such as disfluencies, are maintained. [8] conducted a comprehensive comparison of different ASR systems and acoustic models for disfluency detection and categorization, examining Wav2Vec [9], HuBERT [<xref ref-type="bibr" rid="ref11 ref4">10</xref>], WavLM [11], Whisper [2], and Azure [12]. Their findings indicate that fine-tuned models generally outperform their off-the-shelf counterparts. [13] evaluated pre-trained models, revealing that Whisper-Large achieved the best overall WER and chrF (character n-gram F-measure [14]) scores. [15] demonstrated the potential of Whisper for adaptation in spoken language assessment with limited training data. In the realm of commercial ASR services, [16] explored IBM’s offering for transcribing English source speeches and their interpretation, reporting an impressively low error rate of 4.7%. [17] conducted a systematic comparison of automatic transcription tools, evaluating factors such as data protection, accuracy, time efficiency, and costs for English and German interviews, and found that Whisper performs best overall among the tools considered.</p>
        <p>Despite these advancements, several limitations persist in the current research. First, most studies focus primarily on English, with only some including other languages such as Chinese [16]. Furthermore, the field of speech disfluency research faces challenges due to the scarcity of publicly available benchmarking datasets, attributed to high annotation costs, the clinical nature of some tasks, and the use of proprietary datasets [18]. The choice between Wav2Vec and Whisper remains a point of debate, with [8] finding similar results for both after fine-tuning, while Azure off-the-shelf performed best, followed by Whisper off-the-shelf. Still, [17] did not explore fine-tuning, and [8] suggests that fine-tuned models generally perform better. The requirement for punctuation marks in some corpora, such as EPTIC, introduces another consideration in model selection: Wav2Vec does not output punctuation, while Whisper does, potentially influencing its suitability for certain applications. Additionally, while [13] used a large corpus, [15] indicated that Whisper can perform well with less data, highlighting the need for further investigation into optimal data requirements.</p>
      </sec>
      <sec id="sec-1-1b">
        <title>3. Corpus Construction</title>
        <p>The present work is based on the European Parliament Translation and Interpreting Corpus (EPTIC), a multimodal parallel corpus comprising speeches delivered at the European Parliament (EP) along with their official interpretations and translations.1 Within EPTIC, the corpus construction process revolves around individual speech events, where edited verbatim reports published by the EP and transcriptions of the speeches are accompanied by transcriptions of interpretations and official translations into other languages. These components form a multi-parallel corpus, i.e. a corpus containing verbatim transcriptions of source speeches, official verbatim reports and corresponding target translations and interpretations (quasi-parallel at the intermodal level [3]). The English partition consists of source English texts and their translations into various languages. Corpora containing translations in both possible directions (e.g., from English to French and vice versa) are referred to as bidirectional, while those with translations in only one direction are referred to as unidirectional. Table 1 shows the languages included and the size of the latest version, EPTIC v2, planned for release by the end of 2024.</p>
        <p>Our approach to corpus expansion began with a review of previous guidelines for developing EPTIC [<xref ref-type="bibr" rid="ref1 ref10">1, 19</xref>]. The former procedure first involved obtaining data by either scraping texts from the EP website2 or by manually downloading videos and then transcribing them. Transcripts of the original speeches and interpretations were manually adapted following editing conventions to annotate features of orality such as disfluencies, and were timestamped using Aegisub.3 Then, the texts were automatically segmented into sentences and aligned across languages and modalities, for instance between transcriptions and verbatim reports, with the help of the Intertext Editor alignment tool.4</p>
        <p>The creation of the new workflow started with the previous procedure as a basis. It was first subdivided into separate tasks, the main ones being automatic speech recognition, multilingual sentence alignment, and forced alignment. Software selection was based on criteria such as ease of use and setup, compatibility with the Python programming language, linguistic coverage, and compatibility with Sketch Engine, an established corpus query tool for teaching and research [20, 21]. Python v. 3.11.5 was used along with the Poetry5 package manager for portability.6 Next, we discuss the tasks and the considerations made when designing the pipeline.</p>
        <p>Sentence Alignment involves identifying and aligning parallel sentences, both mono- and multilingually. For this task, we use Bertalign [24]. Unlike predecessors such as Hunalign8 that rely on lexical translation probabilities, Bertalign employs sentence embeddings to identify parallel sentences, providing a more robust approach for handling semantic similarities. We used a version of the tool that has been extended to produce outputs in the Sketch Engine format for corpus indexing [20, 21].</p>
        <p>Forced Alignment, the task of automatically aligning audio with transcriptions, is the most mature task for spoken corpora. Although WhisperX performs timestamping during transcription, we experimented with forced alignment on an existing portion of spoken EPTIC data, using the aeneas library, which supports more than thirty languages.9</p>
        <p>1. https://corpora.dipintra.it/eptic/ 2. https://www.europarl.europa.eu/plenary/en/debates-video.html</p>
      </sec>
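      <p>Bertalign itself couples multilingual sentence embeddings with a dynamic-programming search over candidate alignments; the core embedding-based idea can be illustrated with a minimal sketch. This is our own toy version, not Bertalign's API: vectors are precomputed stand-ins for sentence embeddings, and the matching is greedy one-to-one rather than Bertalign's 1-to-many dynamic programming.</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align_greedy(src_vecs, tgt_vecs):
    """Pair each source sentence with its most similar target sentence."""
    pairs = []
    for i, u in enumerate(src_vecs):
        scores = [cosine(u, v) for v in tgt_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best))
    return pairs
```

      <p>With real data, the vectors would come from a multilingual sentence encoder, so that a sentence and its translation land close together in the embedding space regardless of language.</p>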
      <sec id="sec-1-2">
        <p>The pipeline is structured in a modular fashion so as to maximize reusability. The process begins with the extraction of text and video data from the EP website, using ad-hoc scripts that partially automate the scraping. Transcription is then performed using WhisperX. To ensure adherence to the transcription guidelines, the transcripts undergo manual review to incorporate disfluencies and rectify potential mistranscriptions. Once the texts have been transcribed, they undergo sentence splitting and sentence alignment using Bertalign. Relevant metadata, encompassing session topics, are automatically retrieved from the EP website. The only item requiring manual input is the speech type, which can be defined as impromptu, read out, or mixed. After exporting the alignments in the Intertext format and performing part-of-speech tagging with Sketch Engine, the texts and metadata are converted to the vertical format required for indexing in Sketch Engine [20, 21].</p>
      </sec>
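      <p>As a rough sketch, this modular design can be thought of as a chain of interchangeable steps applied to a speech-event record. The step names and record fields below are hypothetical stand-ins for the ad-hoc scripts described above, purely to illustrate the modularity.</p>

```python
def run_pipeline(event, steps):
    """Apply each modular processing step to a speech event, in order."""
    for step in steps:
        event = step(event)
    return event

# Hypothetical stand-ins for the stages described above; each step
# takes and returns a dict describing one speech event.
steps = [
    lambda e: {**e, "transcript": "..."},    # WhisperX transcription
    lambda e: {**e, "reviewed": True},       # manual review of disfluencies
    lambda e: {**e, "sentences": ["..."]},   # sentence splitting
    lambda e: {**e, "alignment": [(0, 0)]},  # Bertalign sentence alignment
    lambda e: {**e, "topic": "debate"},      # metadata retrieval
]

event = run_pipeline({"video": "speech.mp4"}, steps)
```

      <p>Because each stage only consumes and produces the shared record, individual tools (e.g., the ASR system or the aligner) can be swapped without touching the rest of the workflow.</p>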
      <sec id="sec-1-3">
        <title>3https://aegisub.org</title>
        <p>4https://wanthalf.saga.cz/intertext
5https://python-poetry.org
6The code is available at https://github.com/TinfFoil/eptic_v2_
pipeline
7https://github.com/m-bain/whisperX
Automatic Speech Recognition has seen recent
advancements, with the introduction of Whisper [2] and
Wav2Vec 2.0 [9]. However, achieving a reasonable level of
transcription quality is complex and context-dependent,
as it can be interpreted and evaluated diferently
depending on the domain, task, and application [22]. We decided
to employ the WhisperX7 variant of Whisper, given its
documented reliable performance for long-form
transcription, which is oftentimes needed when dealing with
parliamentary speech [23].</p>
        <p>We require an ASR system to produce a verbatim
transcription where all words are transcribed, along with
disfluencies and extra-linguistic information. However,
verbatimness is a broad concept, given the variety of
transcription conventions existing in linguistics [17].
Whisper has been observed to produce transcripts “often
almost comparable to the final read through of a manual
Sentence Alignment involves identifying and align- (verbatim to gisted) transcript” [17], where gisted refers
ing parallel sentences, both mono- and multilingually. to a transcription that “omits non-essential information
(e.g., filler words, word fragments, repetition of words),
and summarizes or grammatically correctly rephrases
the audio content” [17]. Hereby, we define a verbatim</p>
      </sec>
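      <p>To make the verbatim-versus-gisted distinction concrete, the following toy function (our own illustration; the filler list and matching rules are assumptions, not Whisper internals) applies the kind of cleanup a gisted transcript performs and a verbatim one must avoid:</p>

```python
# Illustrative filler inventory; real conventions vary by transcription scheme.
FILLERS = {"uh", "um", "ehm", "eh"}

def gist(verbatim: str) -> str:
    """Toy 'gisting': drop filler words and collapse immediate repetitions,
    mimicking the inverse text normalization a verbatim transcript must avoid."""
    kept = []
    for word in verbatim.split():
        if word.lower().strip(",.") in FILLERS:
            continue  # remove filled pauses
        if kept and kept[-1] == word:
            continue  # collapse word repetitions
        kept.append(word)
    return " ".join(kept)
```

      <p>An ASR system trained on gisted output would tend to produce the normalized string, which is exactly the behavior our fine-tuning experiments try to counteract.</p>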
    </sec>
    <sec id="sec-2">
      <title>4. ASR for Verbatim Transcription:</title>
    </sec>
    <sec id="sec-3">
      <title>Evaluating Whisper</title>
      <sec id="sec-3-1">
        <p>8. https://github.com/danielvarga/hunalign 9. https://www.readbeyond.it/aeneas/</p>
        <p>As part of our experiments, we tested the HuggingFace release10 of the Whisper models. The test set included English, Italian, French, and Slovenian, though further experiments were conducted exclusively with English and Italian due to dataset limitations. We used 7 hours of audio for English, 5 for Italian, 1.5 hours for French and 1.5 hours for Slovenian. Besides evaluating the models on the whole set of held-out data, we computed word error rates (WERs) for different speech types: native speech, non-native speech, and interpreted speech.11 In addition to experimenting with the out-of-the-box versions of Whisper, we explored fine-tuning Whisper-small for English and Italian. To train and test the models, we used 80% of the data for training, 10% for validation, and 10% for testing. The training parameters for the Whisper-small model were set to a batch size of 16, a learning rate of 1e-5, mixed-precision training enabled, and a maximum of 5,000 training steps. Evaluation and checkpoint saving were enabled every 1,000 steps, optimizing for WER.</p>
        <p>The Whisper models we experimented with showed robust performance across languages and speech types. Our findings suggest that satisfactory results can be achieved for Italian, which exhibits a low WER of 0.152, and English, with a WER as low as 0.194. The full set of results is presented in Table 2, where the fine-tuned model is referenced as Small-FT. This fine-tuned model obtained the lowest WER for English, performing better than Whisper-large-v2, which could indicate that the model is learning to produce a more verbatim transcription. In the case of Italian, the fine-tuned model obtains a lower WER than the baseline Whisper-small model (0.201 compared to 0.219). However, the lowest WER of 0.152 is obtained by Whisper-large-v2, which could be attributed to the lower amount of data available for fine-tuning compared to English.</p>
        <p>Lastly, to address RQ2, we evaluated whether factors such as nativeness influenced the WER. Findings for these experiments are presented in Table 3, and indicate a WER of 0.104 for native English speakers, 0.110 for non-native speakers, and a notably higher WER of 0.222 for interpreted speech. Similar results were also obtained for Italian, with a WER of 0.131 for native speakers and 0.188 for interpreted speech, which provides further evidence for the finding that interpreted speech is more challenging to transcribe [16].</p>
        <p>10. https://huggingface.co/docs/transformers/en/model_doc/whisper 11. Which can be both into the interpreter’s A or B language.</p>
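        <p>The WER figures reported here follow the standard definition: the token-level Levenshtein distance between hypothesis and reference, divided by the reference length. The paper does not name the evaluation library it used, so the following is a minimal illustrative implementation rather than the actual evaluation code.</p>

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: token-level edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] holds the edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # deletion
                          d[i][j - 1] + 1,           # insertion
                          d[i - 1][j - 1] + substitution)
    return d[len(ref)][len(hyp)] / len(ref)
```

        <p>Note that with a verbatim reference, a fluent but gisted hypothesis is penalized for every omitted disfluency, which is why WER against such references also reflects verbatimness.</p>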
        <p>whisper
11Which can be both into the interpreter’s A or B language.
ity of training data for English are likely more
extensive and varied compared to Italian, especially when it
comes to examples of spontaneous speech. As for
repetitions, the example in Table 4 shows both a repetition
and a truncation, a common occurrence due to disfluent
speech often comprising a combination of both. In the
example, the fine-tuned Whisper-small model accurately
Table 4 transcribes both disfluencies, while Whisper-large-v2
Transcription examples by disfluency type. For each example, rephrases them into a corrected transcription. Overall,
we include (a) the reference transcription, (b) the transcription the baseline Whisper-large-v2 model always omitted
repproduced by Whisper-small-FT and (c) by Whisper-large-v2. etitions both in English and Italian. This could be due to
Example Transcription Rec EN Rec IT the powerful language model used by Whisper, which
has been observed to correct such errors [13].</p>
        <p>Contractions The last examples in Table 4 illustrate transcriptions of
(a) I’m encouraged that the interim 100.00 – empty and filled pauses. Whereas Whisper-small-FT
of(b) lIe’madeernschoipur.a. .ged that the interim 95.40 – ten captures them, the baseline model does not. However,
leadership . . . the fine-tuned model’s performance is not consistent, and
(c) I am encouraged that the interim 86.30 – occasionally non-existent empty pauses are transcribed
leadership . . . by the model. As in the case of truncations, pauses are
never transcribed by Whisper-large-v2, likely due to the
models having been trained on data processed with ITN.</p>
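        <p>The per-category recall reported in Table 4 can be understood as the share of annotated markers in the reference that survive in the ASR output. The paper does not detail its matching procedure, so the token-membership check in this sketch is an assumption of ours.</p>

```python
def marker_recall(reference_markers, hypothesis_text):
    """Percentage of gold disfluency markers that resurface in the hypothesis."""
    if not reference_markers:
        return 0.0
    hyp_tokens = hypothesis_text.lower().split()
    hits = sum(1 for m in reference_markers if m.lower() in hyp_tokens)
    return 100.0 * hits / len(reference_markers)
```

        <p>Unlike WER, this view rewards a model only for preserving the specific markers of verbatimness, regardless of how fluent the rest of the output is.</p>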
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Conclusions and Future Work</title>
      <p>This paper presented a novel pipeline for constructing
multimodal and multilingual parallel corpora, with a
focus on evaluating state-of-the-art automatic speech
recognition tools for verbatim transcription. Experiments
with Whisper models on EPTIC revealed robust
performance across languages and speech types, particularly
for English and Italian. However, some limitations
remain regarding ASR performance and achieving verbatim
transcriptions. Fine-tuning Whisper showed promising
reductions in WER, particularly for English, indicating
the potential of adapting the model to use a more
verbatim style. Yet qualitative analysis revealed
inconsistencies in handling disfluencies, truncations, and discourse
markers. Furthermore, higher WERs for non-native and
interpreted speech underscore remaining challenges.</p>
      <p>Future research eforts could explore incorporating
additional metrics beyond WER to better capture the degree
of verbatimness in the transcriptions, and expanding the
Italian dataset to potentially improve the performance
of the fine-tuned model. Another avenue for research
could include augmenting the dataset with external data
containing pairs of audio and verbatim transcripts, most
notably the Switchboard corpus introduced in [25]. Other
methods besides fine-tuning could be explored to
enhance the quality of transcriptions, for instance by
leveraging the oficial verbatim reports on the European
Parliament’s website. Lastly, a model could be developed for
detecting the metadata item relative to the speech type,
i.e. impromptu, read out, or mixed, based on textual or
multimodal features.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <sec id="sec-5-1">
        <title>The work of A. Fedotova is supported by the NextGeneration EU programme, ALMArie CURIE 2021 - Linea SUpER, Ref. CUPJ45F21001470005.</title>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <volume>1</volume>
          (
          <year>2014</year>
          )
          <fpage>7</fpage>
          -
          <lpage>36</lpage>
          . doi:https://doi.org/10.1007/
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>s40607-014-0009-9</source>
          . [22]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kersken</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Egger</surname>
          </string-name>
          , G. Zim-
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>Accessible Computing</source>
          <volume>16</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          . doi:https:
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          //doi.org/10.1145/3636513. [23]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huh</surname>
          </string-name>
          , T. Han,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , WhisperX:
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          time-accurate speech transcription of long-form audio
          , arXiv preprint (
          <year>2023</year>
          ). URL: https://arxiv.org/
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>pdf/2303</source>
          .00747, retrieved May 20,
          <year>2024</year>
          . [24]
          <string-name>
            <given-names>L.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , Bertalign: Improved word
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>Digital Scholarship in the Humanities</source>
          <volume>38</volume>
          (
          <year>2022</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          621-
          <fpage>634</fpage>
          . doi:https://doi.org/10.1093/llc/
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          fqac089. [25]
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Holliman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McDaniel</surname>
          </string-name>
          , Switch-
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <source>ume 1</source>
          , IEEE Computer Society,
          <year>1992</year>
          , pp.
          <fpage>517</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>doi:10</source>
          .1109/ICASSP.
          <year>1992</year>
          .
          <volume>225858</volume>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>