<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Automatic Transcription of Courtroom Recordings in the JUMAS Project</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Daniele Falavigna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Diego Giuliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Gretter</string-name>
          <email>gretterg@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jonas Lööf</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Gollan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralf Schlüter</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hermann Ney</string-name>
          <email>neyg@cs.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>FBK-Irst - Via Sommarive 18</institution>
          ,
          <addr-line>38050 Povo, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Lehrstuhl für Informatik 6 - Computer Science Dept., RWTH Aachen University</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2009</year>
      </pub-date>
      <fpage>65</fpage>
      <lpage>72</lpage>
      <abstract>
        <p>In this paper we present ongoing work on speech recognition for the judicial domain, performed in the European project JUMAS (Judicial Management by Digital Libraries Semantics). The specific challenges of courtroom speech recognition are discussed, and the development of speech recognition systems for Italian and Polish is described. The results achieved on the target domain are presented and discussed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        This paper presents work performed in the context of the JUMAS project [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ],
a European Union project aimed at information extraction and indexing in the
judicial domain, specifically at the processing of court recordings. As part of
the project, Polish and Italian automatic speech recognition (ASR) systems for
the domain of court proceedings are being developed.
      </p>
      <p>
State of the art ASR systems achieve good performance
(approximately, word error rates below 10%) when the speech signal is acquired
in controlled conditions. However, as demonstrated in recent DARPA
evaluations [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], performance significantly decreases if transcription tasks include speech
from non-professional speakers, uttering their sentences in "free"
conditions, where the recording environment is not optimal. Table 1 gives
typical word error rates (WER) for a number of tasks.
      </p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption><p>Typical word error rates of state of the art ASR systems on various tasks.</p></caption>
        <table>
          <thead><tr><th>Task</th><th>WER</th></tr></thead>
          <tbody>
            <tr><td>connected digits</td><td>0.5%</td></tr>
            <tr><td>continuous dictation</td><td>5%</td></tr>
            <tr><td>studio broadcast news</td><td>10%</td></tr>
            <tr><td>telephone news reports</td><td>20%</td></tr>
            <tr><td>telephone conversations</td><td>30%</td></tr>
            <tr><td>meetings (head mounted microphone)</td><td>30%</td></tr>
            <tr><td>meetings (distant microphone)</td><td>50%</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-1-1">
        <title>Courtroom Speech Recognition - Challenges</title>
        <p>The courtroom environment presents most of the phenomena listed in the
previous section. In addition, the audio recorded in courtrooms is acquired with
several distinct microphones, generally one for each of the actors of the trial,
located in fixed positions with respect to the speakers, who are free to move
inside the room (sometimes we have observed speakers uttering their sentences in
the direction opposite to the microphone assigned to them). This type of
acoustic environment produces high levels of noise and reverberation in the speech
signal, reducing the signal-to-noise ratio and introducing non-linear distortions
which are difficult to remove and which are known to be detrimental to good
ASR performance.</p>
        <p>A further problem lies in the language that is used: it consists of sentences
spontaneously uttered, often syntactically wrong, containing a large number of
hesitations, pauses and false starts. Often, speakers are non-native or make use of
strong dialectal inflections. Foreign words are also frequently present, especially
when referring to foreign people.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>System Development</title>
      <sec id="sec-2-1">
        <title>Italian System</title>
        <p>For the development of the Italian automatic speech recognition system for the
JUMAS Project, no in-domain acoustic data was yet available; the first
prototype was trained using acoustic data from the broadcast news domain. However, a
significant set of text resources, mainly consisting of transcriptions of trials in
several Italian Courtrooms, was available: this allowed different language
models to be trained using in-domain text corpora.</p>
        <p>
An initial set of recordings, about 30 hours of radio news programs, was
provided by RAI, the major Italian broadcasting company. These recordings were
manually segmented, labelled and transcribed, and used to train a preliminary
version of an automatic broadcast news speech transcription system for
Italian [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Subsequently, about 100 hours of TV broadcast news were collected
and automatically transcribed with this preliminary system. The resulting partially supervised
corpus of about 130 hours of audio recordings was used to train the acoustic models
employed in the Italian system.
        </p>
        <p>The system consists of two main components: the audio partitioner and the
speech recognizer. The aim of the audio partitioner is to divide the continuous
audio stream into homogeneous non-overlapping segments and to cluster these
segments into homogeneous groups. Through this process, each audio file is divided
into a set of temporal segments, each with a label that indicates its nature and the
cluster to which it belongs (e.g. speaker A, speaker B, etc.). The speech
recognizer, which uses continuous density Hidden Markov Models (HMMs), generates
a word transcription for each speech segment.</p>
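        <p>The clustering stage of the partitioner can be sketched as follows. This is a minimal greedy scheme over per-segment feature vectors (e.g. per-segment mean cepstra); the function, the distance threshold and the toy data are illustrative only, not the actual partitioner implementation:</p>

```python
import math

def cluster_segments(segment_features, threshold=1.0):
    # Greedy clustering: assign each segment to the nearest existing
    # cluster centroid, or open a new cluster if none is close enough.
    centroids, counts, labels = [], [], []
    for feat in segment_features:
        best, best_d = -1, float("inf")
        for i, cen in enumerate(centroids):
            d = math.dist(feat, cen)
            if best_d > d:
                best, best_d = i, d
        if best >= 0 and threshold >= best_d:
            n = counts[best]
            # update the running centroid of the chosen cluster
            centroids[best] = [(c * n + f) / (n + 1)
                               for c, f in zip(centroids[best], feat)]
            counts[best] += 1
            labels.append(best)
        else:
            centroids.append(list(feat))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels

segs = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(cluster_segments(segs))  # two speaker-like clusters: [0, 0, 1, 1]
```

        <p>A real partitioner would use model-based distances (e.g. likelihood criteria) rather than a Euclidean threshold, but the output structure, i.e. one cluster label per segment, is the same.</p>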
        <p>
          The Italian speech transcription system makes use of two decoding steps:
for each cluster of segments, the output of the first step is used to estimate the
parameters of linear transformations applied in the second decoding step (for
feature normalization and acoustic model adaptation) to maximize the system
performance [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
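        <p>The role of the per-cluster normalization between the two passes can be sketched with a toy per-cluster mean shift; the actual system estimates linear fMLLR-style transforms per cluster, so this simplified version only illustrates the idea of estimating cluster-wise statistics from the first pass and applying them before the second:</p>

```python
def per_cluster_cmvn(features_by_cluster):
    # For each cluster found in the first decoding pass, estimate a
    # mean vector and shift that cluster's features accordingly
    # (a stand-in for per-cluster feature normalization).
    normalized = {}
    for cluster, frames in features_by_cluster.items():
        dim = len(frames[0])
        mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
        normalized[cluster] = [[f[d] - mean[d] for d in range(dim)]
                               for f in frames]
    return normalized

feats = {"speakerA": [[1.0, 3.0], [3.0, 5.0]]}
print(per_cluster_cmvn(feats))  # {'speakerA': [[-1.0, -1.0], [1.0, 1.0]]}
```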
      </sec>
      <sec id="sec-2-2">
        <title>Language Resources</title>
        <p>
          Three different 4-gram LMs were used for the experiments in this paper, all
estimated using improved Kneser-Ney smoothing [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. A first LM was trained on
a 606M word news text corpus, containing about 1.2M unique words. This LM
will be referred to as the "Out of Domain" (OD) LM. A second LM was trained
on a 25M word corpus of court transcriptions, containing about 150K unique
words; this LM will be referred to as the "In Domain" (ID) LM. A third LM was
estimated by adapting the OD LM with the in-domain data. This latter LM will
be referred to as the "Adapted" (AD) LM. For each of the three LMs we evaluated
the number of 4-grams in the training corpus, the perplexity (PP) on the test
set, and the out of vocabulary (OOV) word rate on the test set. These statistics
are presented in Table 2, together with the dictionary sizes of the LMs.
        </p>
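        <p>The OOV word rate reported for each LM is the fraction of running test words not covered by the LM vocabulary, as in the following sketch (word list and vocabulary are hypothetical toy data):</p>

```python
def oov_rate(test_words, vocabulary):
    # Percentage of running words in the test set that are
    # not covered by the LM vocabulary.
    vocab = set(vocabulary)
    missing = sum(1 for w in test_words if w not in vocab)
    return 100.0 * missing / len(test_words)

test = "il giudice ascolta il testimone straniero".split()
print(round(oov_rate(test, ["il", "giudice", "ascolta", "testimone"]), 1))  # 16.7
```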
        <p>The lexicon employed in the Italian transcription system is based on the
SAMPA phonetic alphabet, and includes a total of 85 phone-like units. Of these
units, 50 are needed for representing the Italian language, while the remaining 35
are needed for representing foreign words. The lexicon was first produced with
an automatic transcription tool, and then manually checked to correct possible
errors in the transcription of acronyms and foreign words.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Polish System</title>
        <p>In this section, the development of the Polish Automatic Speech Recognition
(ASR) system is described. Due to the limited availability of in-domain data at
the start of the project, the first efforts on a Polish ASR system were carried out
on the domain of political (parliamentary) speeches.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Language Resources</title>
        <p>In a European Parliament Plenary Session (EPPS), different languages of the EU
are spoken and simultaneously interpreted into every official language of the EU.
Starting at the time of the (now finished) TC-STAR project and continuing since
then, recordings of the parliamentary sessions have been collected.</p>
        <p>For Polish, several hundred hours of parliament recordings are currently
available. From these, a black-out period was chosen, and a half hour tuning set as
well as a three hour development set were extracted and transcribed by native
Polish speakers, see Table 3. In addition, a development set consisting of audio
data from the Court of Wroclaw was defined at RWTH and transcribed as
above. The data of this corpus is also included in Table 3.</p>
        <p>Table 3 also describes the EPPS acoustic recordings used for (unsupervised)
acoustic training of the current system. This data was taken from outside of
the black-out period, and included both original politician speeches and
interpreter audio. Since this data is completely untranscribed, the word statistics
are taken from the automatic transcription output.</p>
        <p>For Polish, only the transcriptions of the politician portions of the recordings,
and not the interpreted portions, totaling about half a million running words,
are available. Since this is clearly inadequate for language model (LM) training,
several additional sources of text data were used. The additional data consisted
of official Polish translations of European Union legal documents, as well as news
articles collected over the web from two Polish news sources, see Table 4.</p>
        <table-wrap id="tab3">
          <label>Table 3</label>
          <caption><p>Statistics of the Polish acoustic data: EPPS tuning, development and (unsupervised) training sets, and the Court of Wroclaw (WCC) development set.</p></caption>
          <table>
            <thead><tr><th/><th>EPPS Tune</th><th>EPPS Dev</th><th>WCC Dev</th><th>Train</th></tr></thead>
            <tbody>
              <tr><td>Net Duration</td><td>0.45h</td><td>3.03h</td><td>2.66h</td><td>127.8h</td></tr>
              <tr><td># Segments</td><td>195</td><td>1326</td><td>1904</td><td>40995</td></tr>
              <tr><td># Speakers</td><td>9</td><td>37</td><td>49</td><td>-</td></tr>
              <tr><td># Running words</td><td>2944</td><td>21938</td><td>21938</td><td>788098</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-2-5">
        <title>Lexicon and Language Model</title>
        <p>Since Polish is a highly inflected language, the out of vocabulary (OOV) rate
is typically much higher than that of a language such as English for the same
vocabulary size. Since good ASR performance requires an OOV rate of about
one percent or lower, it is necessary to use an increased vocabulary size when
working with Polish.</p>
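        <p>The interaction between vocabulary size and OOV rate can be illustrated with a small sketch that grows the vocabulary by training-set frequency. The data below is a toy example; the actual vocabularies were selected from the text corpora described above:</p>

```python
from collections import Counter

def oov_curve(train_words, test_words, sizes):
    # For each vocabulary size, take the most frequent training words
    # and measure the resulting OOV rate on the test set. For an
    # inflected language, this curve falls more slowly than for
    # English, motivating the larger vocabularies used here.
    by_freq = [w for w, _ in Counter(train_words).most_common()]
    rates = {}
    for size in sizes:
        vocab = set(by_freq[:size])
        missing = sum(1 for w in test_words if w not in vocab)
        rates[size] = 100.0 * missing / len(test_words)
    return rates

train = "a a a b b c d".split()
test = "a b c d e".split()
print(oov_curve(train, test, [1, 2, 4]))  # {1: 80.0, 2: 60.0, 4: 20.0}
```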
        <p>
          To achieve this, four different vocabularies were used, of approximately
75, 150, 300 and 600 thousand words, respectively. For each of the vocabulary
sizes, a three-gram language model using modified Kneser-Ney smoothing was
produced. Separate models trained on each of the four text portions were combined
using interpolation, tuned on the perplexity of the tuning corpus. For the
pronunciation lexicon, the Polish SAMPA phoneme set consisting of 37 phonemes was
used. The pronunciations for the vocabulary were generated using letter to sound
rules described in [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
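        <p>The perplexity-based tuning of the interpolation weight can be sketched as follows for two unigram component models. This is a simplified grid search over a single weight; the actual system interpolates four three-gram models, typically with EM-estimated weights:</p>

```python
import math

def perplexity(model, words):
    # model maps word to probability; every test word must be covered
    logp = sum(math.log(model[w]) for w in words)
    return math.exp(-logp / len(words))

def tune_interpolation(model_a, model_b, heldout, steps=100):
    # Grid search for the interpolation weight minimizing
    # held-out perplexity of the mixed model.
    vocab = set(model_a) | set(model_b)
    best_lam, best_pp = 0.0, float("inf")
    for i in range(1, steps):
        lam = i / steps
        mixed = {w: lam * model_a.get(w, 0.0) + (1 - lam) * model_b.get(w, 0.0)
                 for w in vocab}
        pp = perplexity(mixed, heldout)
        if best_pp > pp:
            best_lam, best_pp = lam, pp
    return best_lam, best_pp

lam, pp = tune_interpolation({"a": 0.9, "b": 0.1}, {"a": 0.1, "b": 0.9}, ["a", "b"])
print(round(lam, 2))  # prints 0.5
```

        <p>With the two symmetric toy models above, the held-out perplexity is minimized at an equal weighting, as expected.</p>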
      </sec>
      <sec id="sec-2-6">
        <title>Unsupervised Acoustic Training Using Cross-language Bootstrapping</title>
        <p>
          The development of the Polish acoustic model using cross-language
bootstrapping and unsupervised training is described in this section. A more
detailed description is available in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
        <p>
          Cross-language bootstrapping is the technique of initializing acoustic model
training using an acoustic model originally trained on a different language. For
the present system a Spanish European Parliament acoustic model, described
in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], was used as a starting point. As described in [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], for cross-language
bootstrapping, a mapping from the target language phoneme set (in our case Polish)
to the source language phonemes (Spanish) is needed.
        </p>
        <p>Both the source and target models used SAMPA phoneme sets. A manual
mapping was constructed by keeping the SAMPA phoneme symbol if present
in both phoneme sets, and using the Spanish phoneme with the most similar
properties for the remaining 14 Polish phonemes. Once a mapping is available,
it is possible to use the Spanish acoustic model in combination with the Polish
pronunciation lexicon for acoustic model retraining, and even for recognition
(albeit with a high error rate).</p>
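        <p>The mapping step can be sketched as follows. The phoneme subsets and fallback pairs below are illustrative only; the actual table for the 14 unshared Polish phonemes was constructed manually by phonetic similarity:</p>

```python
def build_mapping(target_phones, source_phones, fallback):
    # Map every target-language (Polish) phoneme to a source-language
    # (Spanish) phoneme: keep the SAMPA symbol when shared by both
    # sets, otherwise fall back to a manually chosen nearest phoneme.
    mapping = {}
    for ph in target_phones:
        if ph in source_phones:
            mapping[ph] = ph
        else:
            mapping[ph] = fallback[ph]
    return mapping

# Illustrative subsets, not the full SAMPA inventories.
polish = ["a", "e", "S", "ts", "v", "w"]
spanish = {"a", "e", "s", "t", "b", "w", "tS", "x"}
fallback = {"S": "s", "ts": "tS", "v": "b"}
print(build_mapping(polish, spanish, fallback))  # 'S' -> 's', 'ts' -> 'tS', 'v' -> 'b'
```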
        <p>
          The thus initialized acoustic model was further improved using unsupervised
training. The basic idea of unsupervised training is to improve an acoustic model
by iterated recognition and retraining on training data for which no manual
transcriptions are available. For effective use of the available acoustic data, it is
important to use confidence measures to select or weight the contributions of
the audio data in such a way that correctly recognized data is more likely to
contribute to the modeling. For the present work, the state posterior confidence
method, as presented in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], was used.
        </p>
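        <p>The selection step can be sketched as follows; the segment structure, field names and threshold are illustrative, and the actual system uses state posterior confidences rather than the simple per-segment average shown here:</p>

```python
def select_training_data(segments, threshold=0.7):
    # Keep only automatically transcribed segments whose average word
    # confidence passes the threshold, so that (likely) correctly
    # recognized audio dominates the next retraining iteration.
    kept = []
    for seg in segments:
        confs = [c for _, c in seg["words"]]
        avg = sum(confs) / len(confs)
        if avg >= threshold:
            kept.append(seg)
    return kept

hyp = [
    {"id": "s1", "words": [("wysoki", 0.95), ("sad", 0.90)]},
    {"id": "s2", "words": [("oskarzony", 0.40), ("byl", 0.35)]},
]
print([s["id"] for s in select_training_data(hyp)])  # ['s1']
```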
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Experiments</title>
      <sec id="sec-3-1">
        <title>Italian</title>
        <p>The acquisition of the Italian audio baseline for the JUMAS Project is still under
way. At the time of writing, about 4 hours of audio recordings
have been collected in the Court of Naples, the goal being about 30 hours.
We plan to use 10 hours for the development and test sets, and the remaining 20
hours for acoustic training/adaptation purposes. All the audio recordings are
acquired with 4 microphones, one each for the judge, witness, prosecutor and
lawyer. The sampling frequency is 16kHz, the precision 16 bits per sample.</p>
        <p>The four above-mentioned audio tracks from Naples were automatically
transcribed and are going to be manually corrected. From the automatic
transcriptions, the statistics reported in Table 5 were estimated, namely:
1. the number of speech segments in each audio track and in total, detected
using a start-end-point detection procedure;
2. the average speech segment duration in each audio track and in total;
3. the total speech duration in each audio track and in total;
4. the number of uttered words in each audio track and in total.</p>
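        <p>These statistics can be computed directly from the segmentation and transcription output, as sketched below (the track structure and toy data are illustrative):</p>

```python
def track_statistics(tracks):
    # Per-track statistics as in Table 5: number of detected speech
    # segments, average and total segment duration, and word count.
    # Each track is a list of (start, end, transcription) triples.
    stats = {}
    for name, segments in tracks.items():
        total = sum(end - start for start, end, _ in segments)
        words = sum(len(text.split()) for _, _, text in segments)
        stats[name] = {
            "segments": len(segments),
            "avg_dur": total / len(segments),
            "total_dur": total,
            "words": words,
        }
    return stats

demo = {"judge": [(0.0, 2.0, "la corte decide"), (3.0, 5.0, "silenzio in aula")]}
print(track_statistics(demo))
```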
        <p>Note in Table 5 that the total duration (5.8h) of the automatically
detected speech segments is significantly higher than the total duration of the trial
recording, which is about 4h. This is due to speaker overlap between the different
microphone channels, with the same speech being recorded and (automatically)
detected as speech in multiple channels simultaneously.</p>
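        <p>This effect can be illustrated by comparing the summed per-channel speech time with the union of the detected intervals (toy intervals; a sketch only):</p>

```python
def summed_vs_union(channels):
    # The summed per-channel speech time can exceed the recording
    # length because the same speech is picked up by several
    # microphones; the union of all detected intervals gives the
    # unique time actually covered by speech.
    summed = sum(e - s for ch in channels for s, e in ch)
    merged = []
    for s, e in sorted(iv for ch in channels for iv in ch):
        if merged and merged[-1][1] >= s:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    union = sum(e - s for s, e in merged)
    return summed, union

# Channel 2 overlaps channel 1 between t=1 and t=2.
print(summed_vs_union([[(0.0, 2.0), (5.0, 6.0)], [(1.0, 3.0)]]))  # (5.0, 4.0)
```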
        <p>So far, of the available audio data, only the prosecutor audio track has
been completely manually transcribed, together with about 50m of the other 3
tracks. On the small audio data set formed by the manually transcribed parts
(50m) of the 4 audio tracks, we evaluated the same statistics as for the full set.
These are reported in Table 6. We note a slightly higher average duration of
the manually detected speech segments compared to the automatic ones (given
in Table 5), suggesting that the triggering thresholds of the
start-end-point detection module should be reduced.</p>
        <p>A pilot ASR experiment has also been carried out on the small data set
reported in Table 6, to get some hints on both the possible level of performance
achievable in this domain and on which parameters of the overall automatic
transcription system could be critical to tune. The obtained results, although
lacking "statistical significance" due to the small size of the test set, are given
in Table 7, for each of the 3 LMs described in Section 2.1.</p>
        <p>Although the test data set is small, it is clear that large improvements are necessary if we
want to deliver ASR technology reliable enough for the judicial domain. In
particular, the improvement between the first and second decoding pass is smaller than
that obtained in other application domains (typically about 20% relative WER
improvement), probably due to the high absolute values of the WERs. Furthermore,
the benefits of using in-domain data for LM training/adaptation are evident
from the Table, giving hope for further improvements if in-domain data are also
used for acoustic model training/adaptation.</p>
        <p>The final experiment we report in this paper was carried out on the four hour
audio track of the prosecutor, for which transcriptions are available. Table 8 reports the
statistics of this audio track derived from both the automatic and manual
segmentations, and the corresponding word error rates, measured after the second
decoding step.</p>
        <p>The results in Table 8 still show a high absolute value of WER, higher than that
obtained with state of the art automatic transcription of meetings with distant
microphones (see Table 1).</p>
      </sec>
      <sec id="sec-3-2">
        <title>Polish</title>
        <p>The unsupervised retraining of the acoustic model was performed as several
iterations of recognition and retraining on about 130 hours of untranscribed recordings
from the European Parliament. On the original Spanish task, the bootstrap
model achieves an error rate of approximately 10%. Using cross-language
bootstrapping without retraining, the initial error rate is 60%. Several iterations of
retraining are necessary to achieve adequate performance. In Table 9, the
recognition performance on the EPPS tuning set after the different retraining steps, as
well as the amount of data selected by confidence thresholding, are summarized.</p>
        <p>[Table 9: training data, confidence-selected data [h] and WER [%] after each retraining step: initial Spanish acoustic model, two MAP iterations, two retraining iterations, SAT iterations, and full-data training with SAT re-training.]</p>
        <p>The bootstrap model used vocal tract length normalized (VTLN)
mel-frequency cepstral coefficient (MFCC) features, with cepstral mean
normalization and linear discriminant analysis, resulting in a 45 dimensional feature
vector. The system uses classification and regression tree state tying, with 4500
generalized triphone states. The acoustic models are hidden Markov models with
pooled covariance Gaussian mixture model emission probabilities. A fully trained
model consists of approximately 900k distributions in total.</p>
        <p>[Table 10: OOV rate and first pass WER for the 75k, 150k, 300k and 600k vocabulary systems.]</p>
        <p>For the first two re-estimation iterations, unsupervised maximum a
posteriori (MAP) adaptation was used. The final iterations of unsupervised training
were made using speaker adaptive training (SAT) with feature space maximum
likelihood linear regression (fMLLR), using confidence measures for estimation.
The effect of the different vocabulary sizes on the out of vocabulary rate, as well
as on the first pass recognition error rate, is shown in Table 10. As a final
improvement, maximum likelihood linear regression (MLLR) adaptation was used in
recognition. The final results for the two pass system on the WCC and EPPS
development sets are presented in Table 11.</p>
      </sec>
    </sec>
    <sec id="sec-conclusions">
      <title>Conclusions</title>
      <p>In this paper the ASR systems being developed in the context of the JUMAS
project have been described, together with the current results. The results
obtained on the present Italian acoustic data set are very preliminary and need
to be confirmed by further experiments on the whole set of data forming the
planned Italian baseline, which is still under acquisition. We also believe
that the use of unsupervised training methods (similar to what was done for
Polish) or of lightly supervised training methods and, in general, of in-domain
data for training/adapting the acoustic models can improve the present level of
performance. In any case, these topics need to be investigated in depth.</p>
      <p>The Polish system, having so far been optimized primarily for the European
Parliament domain, shows good results for this domain. The results for the court
recordings should still be considered preliminary, though. Substantial
improvements are to be expected from the inclusion of in-domain training data for both
the language model and the acoustic model.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgements</title>
      <p>This work was partly funded by the European Union under the FP6 project
JUMAS, Contract No. 214306.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1. JUMAS:
          <article-title>Judicial Management by Digital Libraries Semantics</article-title>
          (http://www.jumasproject.eu)
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Fiscus</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ajot</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>The rich transcription 2007 speech-to-text (STT) and speaker attributed STT (SASTT) results</article-title>
          .
          <source>In: Meeting Recognition Workshop</source>
          . (May
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Brugnara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cettolo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Federico</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>A Baseline for the Transcription of Italian Broadcast News</article-title>
          .
          <source>In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing</source>
          , Istanbul, Turkey (
          <year>June 2000</year>
          )
          <volume>1667</volume>
          {
          <fpage>1670</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Giuliani</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gerosa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brugnara</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Improved automatic speech recognition through speaker normalization</article-title>
          .
          <source>Computer Speech and Language</source>
          <volume>20</volume>
          (
          <issue>1</issue>
          ) (
          <year>January 2006</year>
          )
          <volume>107</volume>
          {
          <fpage>123</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Kneser</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ney</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Improved backing-off for m-gram language modeling</article-title>
          .
          <source>In: Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing</source>
          . (
          <year>1995</year>
          )
          <volume>181</volume>
          {
          <fpage>184</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Oliver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Polish text to speech synthesis</article-title>
          .
          <source>Master's thesis</source>
          , Edinburgh University, Edinburgh, UK (
          <year>1998</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. Loof,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Gollan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Rybach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            , Schluter, R.,
            <surname>Ney</surname>
          </string-name>
          , H.:
          <article-title>The RWTH 2007 TC-STAR evaluation system for European English and Spanish</article-title>
          .
          <source>In: Proc. Int. Conf. on Spoken Language Processing</source>
          , Antwerp, Belgium (
          <year>August 2007</year>
          )
          <volume>2145</volume>
          {
          <fpage>2148</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Loof,
          <string-name>
            <given-names>J.</given-names>
            ,
            <surname>Gollan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Ney</surname>
          </string-name>
          , H.:
          <article-title>Cross-language bootstrapping for unsupervised acoustic model training: Rapid development of a Polish speech recognition system</article-title>
          .
          <source>In: Proc. Int. Conf. on Spoken Language Processing</source>
          , Brighton, UK (
          <year>September 2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Schultz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Waibel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Experiments on cross-language acoustic modeling</article-title>
          .
          <source>In: Proc. European Conf. on Speech Communication and Technology</source>
          , Aalborg, Denmark (
          <year>September 2001</year>
          )
          <volume>2721</volume>
          {
          <fpage>2724</fpage>
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Gollan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hahn</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Schluter, R.,
          <string-name>
            <surname>Ney</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>An improved method for unsupervised training of LVCSR systems</article-title>
          . In: Interspeech, Antwerp, Belgium (
          <year>August 2007</year>
          )
          <volume>2101</volume>
          {
          <fpage>2104</fpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>