<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Directional and Qualitative Feature Classification for Speaker Diarization with Dual Microphone Arrays</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sergei Astapov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dmitriy Popov</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vladimir Kabarov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>International Research Laboratory “Multimodal Biometric and Speech Systems,” ITMO University</institution>
          ,
          <addr-line>Kronverksky prospekt 49A, St. Petersburg, 197101, Russian Federation</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Speech Technology Center</institution>
          ,
          <addr-line>Vyborgskaya naberezhnaya 45E, St. Petersburg, 194044, Russian Federation</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>Automatic meeting transcription has long been one of the common applications for natural language processing methods. The quality of automatic meeting transcription for distant speech acquired by a common audio recording device suffers from the negative effects of distant speech signal attenuation, distortion imposed by reverberation and background noise pollution. Automatic meeting transcription mainly involves the tasks of Automatic Speech Recognition (ASR) and speaker diarization. While state-of-the-art approaches to ASR are able to reach decent recognition quality on distant speech, there is still a lack of prominent speaker diarization methods for the distant speech case. This paper studies a set of directional and qualitative features extracted from a dual microphone array signal and evaluates their applicability to speaker diarization for the noisy distant speech case. These features represent, respectively, the speaker spatial distribution and the intrinsic signal quality properties. The feature sets are evaluated on real life data acquired in babble noise conditions by conducting several classification experiments aimed at distinguishing between utterances produced by different conversation participants and between those produced by the background speakers. The study shows that specific sets of features result in satisfactory classification accuracy and can be further investigated in experiments combining them with biometric and other types of properties.</p>
      </abstract>
      <kwd-group>
        <kwd>Meeting transcription</kwd>
        <kwd>Distant speech processing</kwd>
        <kwd>Dual microphone arrays</kwd>
        <kwd>GCC-PHAT</kwd>
        <kwd>Beamforming</kwd>
        <kwd>Signal quality features</kwd>
        <kwd>Artificial neural networks</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Automatic meeting transcription has long been a common application for Natural Speech
Processing (NSP) methods [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4">1, 2, 3, 4</xref>
        ]. Automatic meeting transcription and meeting minutes
logging mainly employ methods of Automatic Speech Recognition (ASR) and
speaker identification and diarization [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The available commercial solutions to automatic
meeting transcription can be divided into two major groups: those oriented toward close-talking
speech processing and those oriented toward distant speech processing. Close-talking speech
processing implies speech acquisition with a close-talking microphone for each speaker. It suits
scenarios where each speaker has a headset, lapel microphone, or any other type of
personal audio recording device. Distant speech processing methods, on the other hand, suit
scenarios where one or several microphones are situated at a distance from all the attending
speakers and thus record a mixture of (occasionally overlapping) speech signals arriving
from different speakers and background noise arriving from a variety of sources.
      </p>
      <p>
        Automatic meeting transcription solutions for close-talking speech are quite diverse with
several commercial solutions available on the market [
        <xref ref-type="bibr" rid="ref1 ref6 ref7">6, 7, 1</xref>
        ]. Close-talking speech
processing poses a significantly less complicated problem for automatic NSP methods than
distant speech processing. For instance, the condition that each attending speaker possesses a
personal audio recording device makes the task of speaker identification and diarization almost
trivial. Voice detection is performed in each independent channel separately, and the
effects caused by distance, such as attenuation and interference, are negligible. Furthermore,
state-of-the-art ASR methods perform on par with human perception levels for
close-talking speech [
        <xref ref-type="bibr" rid="ref7 ref8">8, 7</xref>
        ].
      </p>
      <p>
        Distant speech processing, on the other hand, poses a greater problem for ASR and speaker
diarization [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The effects of speech signal mixing, signal attenuation due to distance,
interference, reverberation and noise pollution all negatively affect ASR and speaker
diarization accuracy. While the state of the art in ASR has reached decent recognition quality
on distant speech, there is still a lack of prominent speaker diarization methods fitted for
the task. Part of the recent developments in the field of distant speaker diarization focuses on
applying both lexical [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ] and acoustic [
        <xref ref-type="bibr" rid="ref10 ref12">12, 10</xref>
        ] features for model training, usually based
on Artificial Neural Networks (ANN) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Application of multichannel sound acquisition
devices (microphone arrays) also provides the opportunity to extract spatial features for different
sound sources [
        <xref ref-type="bibr" rid="ref13 ref4 ref9">4, 9, 13</xref>
        ]. Along with biometric profile extraction this tends to be a prominent
direction of research [
        <xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>
        ]. Noise pollution remains the main concern for diarization systems,
specifically under babble noise produced by speakers not-of-interest situated nearby [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ].
Babble noise severely affects diarization quality regardless of the feature type, lexical or acoustic.
      </p>
      <p>This paper considers a set of acoustic features extracted from an audio signal acquired by a
dual microphone array. The examined set of features consists of directional features, regarding
the spatial disposition of speakers, and signal quality features, regarding the amount of
distortion in the signal, closeness to the microphone, estimated Signal to Noise Ratio (SNR) and
the <inline-formula><tex-math>T_{60}</tex-math></inline-formula> reverberation time. The applicability of the features to the task of distant speaker
diarization is evaluated by determining the classification accuracy of speaker utterances belonging
either to a target speaker among other active speakers or to the background speakers
not-of-interest. Thus the combination of spatial and qualitative acoustic features is aimed at reducing
the negative effects of babble noise on diarization quality.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem Formulation</title>
      <p>
        The considered application of the method described in this paper consists of logging a
conversation between two parties of speakers situated on the two opposite sides of a desk or table.
Such a scenario arises, for example, in an interview, negotiations, service provision,
or any other kind of meeting where such a disposition of parties is appropriate [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The
audio recording device is placed in the middle of the desk or table between the two parties of
speakers. The recording device houses two microphones situated on a straight line parallel to
the speaker disposition, i.e., one of the two microphones is directed to one side of the desk
and the other to the opposite side. The recording device is compact, with a
distance of several centimeters between the two microphones. The two microphones
are sampled synchronously and thus form a dual microphone array.
      </p>
      <p>
        The task of speaker diarization in the considered scenario consists of estimating the active
speaker party for each spoken word or phrase (utterance). If the conversation involves just
two active speakers, the distinction of utterances must be made between these two speakers; if
any party contains more than one active speaker, the entire party is considered as one speaker
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This is allowed in our study as speaker biometric information is not included in the set of
examined features. In a more general case employing a greater number of microphones and
involving speaker biometrics it should be possible to expand the diarization task to a more
general meeting scenario, where any number of speakers is situated anywhere around the
common audio recording device [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In this study we primarily address the problem of speaker
diarization in babble noise, i.e., a conversation where speakers not-of-interest are present near
the conversation area.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Applied Features and Methods</title>
      <p>The examined features are extracted through several methods consisting of conventional signal
processing and ANN based models. This section addresses these feature extraction methods.
Every operation is performed either on the temporal audio signal or on its frequency domain
representation acquired through the Short-Time Fourier Transform (STFT). The dual channel signal
in the time domain is represented by a sequence of observation vectors <inline-formula><tex-math>\mathbf{x}(t) = [x_1(t), x_2(t)]</tex-math></inline-formula> and
the STFT representation of the temporal signal is denoted as <inline-formula><tex-math>\mathbf{X}(t, f) = [X_1(t, f), X_2(t, f)]</tex-math></inline-formula>, where
<inline-formula><tex-math>t</tex-math></inline-formula> is the discrete time instance and <inline-formula><tex-math>f</tex-math></inline-formula> is the STFT frequency band.</p>
      <sec id="sec-3-1">
        <title>3.1. Directional Features</title>
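        <p>
          The extracted directional features are based on the Time Difference of Arrival (TDOA) between
the two microphones. When the array is placed such that one microphone is situated closer to
the speaker-of-interest than the other, the propagating acoustic waves reach the farther
microphone with a specific delay compared to the closer microphone. This delay defines the TDOA,
which in turn defines the direction to the sound source (speaker). We estimate the TDOA by
applying Generalized Cross-Correlation with β-weighted Phase Transform (GCC-βPHAT) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ].
The GCC-βPHAT for a dual microphone array is defined by the following equation:
          <disp-formula id="eq-1"><label>(1)</label><tex-math>R_{12}(t, \tau) = \Re \left( \sum_{f} \frac{X_1(t, f) X_2^{*}(t, f)}{\left| X_1(t, f) X_2^{*}(t, f) \right|^{\beta}} e^{-i 2 \pi f \tau} \right),</tex-math></disp-formula>
        </p>
        <p>where <inline-formula><tex-math>\tau</tex-math></inline-formula> is the time delay between the two channels, <inline-formula><tex-math>|X_1(t, f) X_2^{*}(t, f)|^{\beta}</tex-math></inline-formula> is the β-PHAT coefficient,
<inline-formula><tex-math>(\cdot)^{*}</tex-math></inline-formula> denotes the complex conjugate, and <inline-formula><tex-math>\Re(\cdot)</tex-math></inline-formula> denotes the real part of a complex number. The
range of physically possible time delays between two microphones is defined as <inline-formula><tex-math>\tau \in [-d/c, d/c]</tex-math></inline-formula>,
where <inline-formula><tex-math>d</tex-math></inline-formula> is the distance between the two microphones and <inline-formula><tex-math>c</tex-math></inline-formula> is the speed of sound in air. The
TDOA for time instance <inline-formula><tex-math>t</tex-math></inline-formula> is then defined as the time delay at which GCC-βPHAT reaches its
maximal value:
          <disp-formula id="eq-2"><label>(2)</label><tex-math>\tau_{\mathrm{TDOA}}(t) = \arg\max_{\tau} \left( R_{12}(t, \tau) \right).</tex-math></disp-formula>
        </p>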
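        <p>The TDOA estimate described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the naive DFT, the resolution of the delay grid (steps), the default speed of sound of 343 m/s and the default β = 1 are all choices made for the example.</p>
        <preformat><![CDATA[
```python
import cmath
import math


def dft(frame):
    """Naive one-sided DFT of a real frame (bins 0..N/2)."""
    n = len(frame)
    return [
        sum(frame[k] * cmath.exp(-2j * math.pi * b * k / n) for k in range(n))
        for b in range(n // 2 + 1)
    ]


def gcc_beta_phat(x1, x2, fs, d, c=343.0, beta=1.0, steps=41):
    """Evaluate the GCC-betaPHAT on a grid of delays within [-d/c, d/c].

    x1, x2: one time-domain frame per microphone (equal length).
    The grid size `steps`, `c` and `beta` are illustrative assumptions.
    """
    n = len(x1)
    s1, s2 = dft(x1), dft(x2)
    tau_max = d / c
    delays = [-tau_max + 2.0 * tau_max * i / (steps - 1) for i in range(steps)]
    r12 = []
    for tau in delays:
        acc = 0.0
        for b in range(1, len(s1)):  # skip the DC bin
            cross = s1[b] * s2[b].conjugate()
            mag = abs(cross)
            if mag < 1e-12:  # avoid dividing by an empty bin
                continue
            f = b * fs / n  # bin centre frequency in Hz
            shift = cmath.exp(-2j * math.pi * f * tau)
            acc += (cross / mag ** beta * shift).real
        r12.append(acc)
    return delays, r12


def tdoa(x1, x2, fs, d, **kwargs):
    """Delay (seconds) at which the GCC-betaPHAT is maximal."""
    delays, r12 = gcc_beta_phat(x1, x2, fs, d, **kwargs)
    best = max(range(len(r12)), key=r12.__getitem__)
    return delays[best]
```
]]></preformat>
        <p>With this sign convention a positive estimate means that the second channel lags the first one. The estimate is quantized to the chosen delay grid, so a finer grid gives a finer TDOA resolution at a higher computational cost.</p>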
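        <p>For a recognized utterance <inline-formula><tex-math>\mathbf{U} = \{ \mathbf{X}(t_1, f), \ldots, \mathbf{X}(t_2, f) \}</tex-math></inline-formula> ranging in the interval <inline-formula><tex-math>t \in [t_1, t_2]</tex-math></inline-formula> we
extract directional features in the following manner. The TDOA values corresponding to this
time interval <inline-formula><tex-math>\boldsymbol{\tau} = \{ \tau_{\mathrm{TDOA}}(t_1), \ldots, \tau_{\mathrm{TDOA}}(t_2) \}</tex-math></inline-formula> are split into two groups of positive and negative
values:
          <disp-formula id="eq-3"><label>(3)</label><tex-math>\begin{aligned} \boldsymbol{\tau}^{(+)} &amp;= \{ \tau_{\mathrm{TDOA}}(t) \mid \tau_{\mathrm{TDOA}}(t) &gt; \varepsilon,\ t \in [t_1, t_2] \}, \\ \boldsymbol{\tau}^{(-)} &amp;= \{ \tau_{\mathrm{TDOA}}(t) \mid \tau_{\mathrm{TDOA}}(t) &lt; -\varepsilon,\ t \in [t_1, t_2] \}, \end{aligned}</tex-math></disp-formula>
where <inline-formula><tex-math>\varepsilon</tex-math></inline-formula> is a small positive number defining the TDOA ambiguity around zero. As the two
speakers or two speaker parties are situated on the opposite sides of the dual microphone
array (i.e., along the straight line connecting the two microphones), the TDOA values should
be either positive or negative depending on the side of the sound source. As any utterance may
include micro-pauses and noise instances, we extract the following directional features based
on TDOA:
          <disp-formula id="eq-4"><label>(4)</label><tex-math>F_D = \left\{ \frac{|\boldsymbol{\tau}^{(+)}|}{|\boldsymbol{\tau}|},\ \frac{|\boldsymbol{\tau}^{(-)}|}{|\boldsymbol{\tau}|},\ \operatorname{mean}(\boldsymbol{\tau}^{(+)}),\ \operatorname{mean}(\boldsymbol{\tau}^{(-)}),\ \operatorname{mean}(\boldsymbol{\tau}) \right\},</tex-math></disp-formula>
where <inline-formula><tex-math>|\cdot|</tex-math></inline-formula> denotes the cardinality of a set. As the set of TDOA values <inline-formula><tex-math>\boldsymbol{\tau}</tex-math></inline-formula> may include both positive
and negative values due to noise pollution and pauses (silence), the single feature of average
TDOA along <inline-formula><tex-math>\boldsymbol{\tau}</tex-math></inline-formula> is not sufficient to represent the utterance. It should be noted that if <inline-formula><tex-math>\boldsymbol{\tau}^{(+)}</tex-math></inline-formula> or
<inline-formula><tex-math>\boldsymbol{\tau}^{(-)}</tex-math></inline-formula> have zero cardinality, the mean value is deemed zero:
          <disp-formula id="eq-5"><label>(5)</label><tex-math>\operatorname{mean}(\boldsymbol{\tau}^{(+)}) = \begin{cases} \dfrac{1}{|\boldsymbol{\tau}^{(+)}|} \sum \boldsymbol{\tau}^{(+)}, &amp; |\boldsymbol{\tau}^{(+)}| &gt; 0, \\ 0, &amp; |\boldsymbol{\tau}^{(+)}| = 0. \end{cases}</tex-math></disp-formula>
        </p>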
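        <p>The five directional features of one utterance can be computed from its per-frame TDOA values as in the following sketch. The function name and the default ambiguity threshold eps are hypothetical; the source does not state the threshold value it uses.</p>
        <preformat><![CDATA[
```python
def directional_features(tdoas, eps=1e-5):
    """Directional feature vector of one utterance.

    tdoas: per-frame TDOA estimates in seconds.
    eps:   ambiguity threshold around zero (illustrative value).
    Returns [share of positive TDOAs, share of negative TDOAs,
             mean of positive, mean of negative, mean of all].
    """
    if not tdoas:
        raise ValueError("an utterance needs at least one TDOA value")
    pos = [t for t in tdoas if t > eps]
    neg = [t for t in tdoas if t < -eps]

    def mean_or_zero(values):
        # The mean of an empty group is deemed zero.
        return sum(values) / len(values) if values else 0.0

    n = len(tdoas)
    return [
        len(pos) / n,
        len(neg) / n,
        mean_or_zero(pos),
        mean_or_zero(neg),
        mean_or_zero(tdoas),
    ]
```
]]></preformat>
        <p>Values inside the ambiguity band count toward the total number of frames and the overall mean, but toward neither the positive nor the negative group.</p>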
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Qualitative Features</title>
        <p>Directional features on their own are quite representative in the ideal case of no
background noise. If noise and speech-not-of-interest are present in the signal, and the sources
of coherent noise (including background speakers) are situated in the vicinity of the
conversation desk or table, the quality of TDOA features may decrease due to masking effects. An
example of a GCC-βPHAT (discussed in Subsection 3.1) value sequence corresponding to a
conversation in noisy conditions with babble noise present is shown in Fig. 1. The figure
displays a GCC-βPHAT vector <inline-formula><tex-math>R_{12}(t, \tau)</tex-math></inline-formula> from equation (1) for each frame in a sequence of signal
STFT frames. Here the TDOA values corresponding to the actual conversation participants
are distributed at the spectral shift index of approximately 90 for one side of the desk and
approximately −100 for the other. Several other distributions can be noticed: at the
spectral shift indices of approximately 100 to 200, −200 to −100 and around 0. The first two
correspond to babble and other noise, and the third one to uncorrelated diffuse noise and
interference. This aspect would not pose a problem in close-talking applications and in applications where ASR
would recognize only the closest spoken utterances. Unfortunately, in distant speech processing
applications this is rarely the case, and phrases spoken in the vicinity of the target
conversation are very often recognized by distant speech ASR systems. Explicitly gathering biometric
information for all detected speakers would remedy the situation; however, such an approach
involves a significant amount of manual work and is often not feasible. We attempt
to distinguish between the closer speech of target speakers and the farther speech of background
speakers by applying several signal quality metrics as features.</p>
        <p>
          The first discussed signal quality metric is the signal Envelope-Variance (EV) [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. It was
originally proposed for blind channel selection in multi-microphone systems. It
involves calculating the Mel-frequency filterbank energy coefficients of STFT frames, similarly
to the process involved in Mel-frequency Cepstral Coefficient (MFCC) calculation [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ]. The Mel
filterbank coefficients are denoted for an utterance as
<inline-formula><tex-math>\mathbf{X}^{\mathrm{Mel}} = \{ X^{\mathrm{Mel}}(t_1, m), \ldots, X^{\mathrm{Mel}}(t_2, m) \}</tex-math></inline-formula>, where <inline-formula><tex-math>m</tex-math></inline-formula>
are log frequencies in the Mel-scale. To remove the short term effects of different electric gains
and impulse responses of the microphones from the signal in each channel the mean value is
subtracted in the log domain from each sub-band as
          <disp-formula id="eq-6"><label>(6)</label><tex-math>\hat{X}_i^{\mathrm{Mel}}(t, m) = e^{\log X_i^{\mathrm{Mel}}(t, m) - \mu_i^{\mathrm{Mel}}(m)},</tex-math></disp-formula>
where <inline-formula><tex-math>X_i^{\mathrm{Mel}}(t, m)</tex-math></inline-formula> is the Mel filterbank coefficient for microphone channel
<inline-formula><tex-math>i \in [1, 2]</tex-math></inline-formula>, <inline-formula><tex-math>t \in [t_1, t_2]</tex-math></inline-formula>;
the mean <inline-formula><tex-math>\mu_i^{\mathrm{Mel}}(m)</tex-math></inline-formula> is estimated by a time average in each sub-band along the whole utterance.
After mean normalization, the sequence of Mel filterbank energies is compressed applying a
cube root function, and a variance measure is calculated for each sub-band and channel:
          <disp-formula id="eq-7"><label>(7)</label><tex-math>V_i(m) = \operatorname{var} \left[ \hat{X}_i^{\mathrm{Mel}}(t, m)^{\frac{1}{3}} \right].</tex-math></disp-formula>
The cube root compression function is preferred to the conventionally used logarithm, because
very small values in the silent portions of the utterance may lead to extremely large negative
values after the log operation, which would distort variance estimation. The EV metric is then
calculated for each signal channel across all sub-bands as
          <disp-formula id="eq-8"><label>(8)</label><tex-math>\mathrm{EV}_i = \sum_{m} \frac{V_i(m)}{\max_i \left( V_i(m) \right)}.</tex-math></disp-formula>
        </p>
        <p>EV represents the degree of distortion in a signal; the EV value is higher in a signal channel
where the degree of interference, imposed by distant signal distortion and reverberation, is
lower. Thus, the feature vector <inline-formula><tex-math>\mathrm{EV} = \{ \mathrm{EV}_1, \mathrm{EV}_2 \}</tex-math></inline-formula> not only points to the channel with less
distorted speech (speaking party direction), but also sheds light on the nature of the utterance
(either conversation participant or background talker).</p>
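        <p>The EV computation (mean normalization in the log domain, cube root compression, per-band variance, and summation of the variances normalized by the per-band maximum across channels) can be sketched as follows. The function name and the nested-list data layout are hypothetical, and the population variance is used here as an assumption because the text does not specify the variance estimator.</p>
        <preformat><![CDATA[
```python
import math
from statistics import pvariance


def envelope_variance(mel):
    """EV value per channel for one utterance.

    mel[i][t][m]: strictly positive Mel filterbank energy of
    channel i, frame t, sub-band m (hypothetical layout).
    """
    n_bands = len(mel[0][0])
    variances = []  # variances[i][m]
    for chan in mel:
        per_band = []
        for m in range(n_bands):
            logs = [math.log(frame[m]) for frame in chan]
            mu = sum(logs) / len(logs)  # time average in the log domain
            # Mean normalization followed by cube root compression.
            compressed = [math.exp(v - mu) ** (1.0 / 3.0) for v in logs]
            per_band.append(pvariance(compressed))
        variances.append(per_band)

    ev = []
    for per_band in variances:
        total = 0.0
        for m in range(n_bands):
            peak = max(v[m] for v in variances)  # maximum across channels
            total += per_band[m] / peak if peak > 0 else 0.0
        ev.append(total)
    return ev
```
]]></preformat>
        <p>Because of the log-domain mean subtraction, multiplying all energies of one channel by a constant gain leaves the result unchanged, which matches the stated purpose of the normalization.</p>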
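        <p>
          The second and third discussed quality metrics are the Cepstral Distance (CD) and the related
Covariance-weighted Cepstral Distance (WCD) [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. CD is another metric of signal distortion,
but computed in the cepstral domain, i.e., on the coefficients resulting from MFCC.
Cepstrum-based comparisons are equivalent to comparisons of the smoothed log spectra; in this domain
the reverberation effect can be viewed as additive. An utterance in the cepstral domain is
denoted by <inline-formula><tex-math>\mathbf{C} = \{ C(t_1, q), \ldots, C(t_2, q) \}</tex-math></inline-formula>, where <inline-formula><tex-math>C(t, q)</tex-math></inline-formula> are the cepstral coefficients and
<inline-formula><tex-math>q</tex-math></inline-formula> is the coefficient index (so-called quefrency). For our application we define the distance for each
channel <inline-formula><tex-math>i</tex-math></inline-formula> as
          <disp-formula id="eq-9"><label>(9)</label><tex-math>d_i^{\mathrm{CEP}}(t) = \sum_{q=1}^{Q} \left( C_i(t, q) - \bar{C}(t, q) \right)^2,</tex-math></disp-formula>
where <inline-formula><tex-math>C_i(t, q)</tex-math></inline-formula> is the cepstral coefficient of <inline-formula><tex-math>C(t, q)</tex-math></inline-formula> for the <inline-formula><tex-math>i</tex-math></inline-formula>-th channel, <inline-formula><tex-math>\bar{C}(t, q)</tex-math></inline-formula> is the mean cepstral
coefficient computed by taking the MFCC from the averaged signal along all channels, and <inline-formula><tex-math>Q</tex-math></inline-formula>
is the number of cepstral coefficients. The mean coefficients contain the averaged close-talk
signal and the average reverberation component. Let us assume that one microphone signal
is better than the others in terms of direct to reverberant ratio. The basic assumption is that
such a signal will be characterized by a larger distance from the mean cepstrum. Therefore,
from the set of distances <inline-formula><tex-math>d_i^{\mathrm{CEP}}(t)</tex-math></inline-formula> between the mean cepstrum and all the available channels,
the least distorted channel can be selected as
          <disp-formula id="eq-10"><label>(10)</label><tex-math>\hat{i}(t) = \arg\max_{i} d_i^{\mathrm{CEP}}(t).</tex-math></disp-formula>
As an entire utterance can contain some number of coefficients corresponding to noise and
silence, we define the CD feature for each channel for the entire utterance as the ratio
          <disp-formula id="eq-11"><label>(11)</label><tex-math>\mathrm{CD}_i = \frac{\left| \{ \hat{i}(t) \mid \hat{i}(t) = i \} \right|}{\left| \{ \hat{i}(t) \} \right|},</tex-math></disp-formula>
where <inline-formula><tex-math>|\{\cdot\}|</tex-math></inline-formula> denotes the cardinality of a set and <inline-formula><tex-math>t \in [t_1, t_2]</tex-math></inline-formula>. The feature vector for a dual channel
signal is then <inline-formula><tex-math>\mathrm{CD} = \{ \mathrm{CD}_1, \mathrm{CD}_2 \}</tex-math></inline-formula>. The channel with less distortion will have a higher ratio value
and the ratio difference between channels will be greater for close-talking utterances than for
the background ones.</p>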
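        <p>A minimal sketch of the CD feature follows: for every frame the channel with the largest squared cepstral distance from the mean cepstrum is selected, and the feature of a channel is the share of frames in which it was selected. The function name and the data layout are hypothetical; the cepstra of the channel-averaged signal are assumed to be computed elsewhere and passed in.</p>
        <preformat><![CDATA[
```python
def cepstral_distance_ratio(cep, cep_mean):
    """CD feature per channel for one utterance.

    cep[i][t][q]:   cepstral coefficients of channel i, frame t.
    cep_mean[t][q]: cepstral coefficients of the channel-averaged
                    signal (assumed to be precomputed).
    """
    n_frames = len(cep_mean)
    wins = [0] * len(cep)
    for t in range(n_frames):
        # Squared distance of every channel from the mean cepstrum.
        dists = [
            sum((c - m) ** 2 for c, m in zip(chan[t], cep_mean[t]))
            for chan in cep
        ]
        # Ties are resolved in favour of the lower channel index.
        wins[dists.index(max(dists))] += 1
    return [w / n_frames for w in wins]
```
]]></preformat>
        <p>The returned ratios sum to one, so for a dual channel signal the second value is redundant given the first; both are kept here only to mirror the two-element feature vector described in the text.</p>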
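        <p>The WCD features are obtained in a similar fashion to the CD features, with the only
difference being the computation of equation (9). For WCD the <inline-formula><tex-math>Q \times Q</tex-math></inline-formula> covariance matrix of the
cepstral distance vector is computed for each channel:
          <disp-formula id="eq-12"><label>(12)</label><tex-math>\mathbf{S}_i(t) = \operatorname{cov} \left[ C_i(t) - \bar{C}(t) \right],</tex-math></disp-formula>
and the covariance weights <inline-formula><tex-math>w_i(t, q)</tex-math></inline-formula> are retrieved as the inverse of every <inline-formula><tex-math>q</tex-math></inline-formula>-th diagonal element
of the covariance matrix <inline-formula><tex-math>\mathbf{S}_i(t)</tex-math></inline-formula>. The WCD measure is then a weighted Euclidean distance
measure, where each individual cepstral component is variance-equalized by the weight:
          <disp-formula id="eq-13"><label>(13)</label><tex-math>d_i^{\mathrm{WCEP}}(t) = \sum_{q=1}^{Q} w_i(t, q) \left( C_i(t, q) - \bar{C}(t, q) \right)^2.</tex-math></disp-formula>
        </p>
        <p>The subsequent computations are equivalent to equations (10), (11). The WCD ratio for an
entire utterance for all channels is defined as <inline-formula><tex-math>\mathrm{WCD} = \{ \mathrm{WCD}_1, \mathrm{WCD}_2 \}</tex-math></inline-formula>.</p>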
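        <p>The variance equalization behind WCD can be illustrated with the sketch below. It assumes that the covariance is estimated over the frames of one utterance, so that the weights depend only on the coefficient index, and it uses only the diagonal of the covariance matrix, since only the diagonal elements enter the weights. The function name, the data layout and the population variance estimator are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
from statistics import pvariance


def weighted_cepstral_distances(chan, cep_mean):
    """Per-frame weighted cepstral distances of one channel.

    chan[t][q]:     cepstral coefficients of the channel.
    cep_mean[t][q]: cepstral coefficients of the channel-averaged signal.
    """
    n_coeffs = len(cep_mean[0])
    diffs = [
        [c - m for c, m in zip(frame_c, frame_m)]
        for frame_c, frame_m in zip(chan, cep_mean)
    ]
    weights = []
    for q in range(n_coeffs):
        # Diagonal element of the covariance matrix for coefficient q,
        # estimated over the frames of the utterance (assumption).
        var = pvariance([row[q] for row in diffs])
        weights.append(1.0 / var if var > 0 else 0.0)
    return [sum(w * v * v for w, v in zip(weights, row)) for row in diffs]
```
]]></preformat>
        <p>The resulting per-frame distances can then be passed through the same channel selection and ratio steps as the unweighted distances.</p>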
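        <p>
          Other quality features include Root Mean Square (RMS) energy and common SNR and <inline-formula><tex-math>T_{60}</tex-math></inline-formula>
reverberation time estimates. Instant RMS is computed for each channel for each STFT frame
as
          <disp-formula><tex-math>\mathrm{RMS}_i(t) = \sqrt{2 \sum_{f=0}^{f_s/2} \left| X_i(t, f) \right|^2},</tex-math></disp-formula>
where <inline-formula><tex-math>f_s</tex-math></inline-formula> is the sampling rate and <inline-formula><tex-math>|X_i(t, f)|</tex-math></inline-formula> is the modulus of the complex spectrum. For an entire
utterance the RMS energy is computed as the average of instant values <inline-formula><tex-math>\mathrm{RMS}_i = \operatorname{mean}(\mathrm{RMS}_i(t))</tex-math></inline-formula>
for <inline-formula><tex-math>t \in [t_1, t_2]</tex-math></inline-formula>, and the RMS feature for all channels is <inline-formula><tex-math>\mathrm{RMS} = \{ \mathrm{RMS}_1, \mathrm{RMS}_2 \}</tex-math></inline-formula>. The SNR and <inline-formula><tex-math>T_{60}</tex-math></inline-formula>
are estimated by a neural network based voice activity detector (NN VAD) integrated into
the ASR system applied in this study [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. For every detected and recognized utterance a
scalar estimate of the common SNR (dB) and <inline-formula><tex-math>T_{60}</tex-math></inline-formula> (seconds), one for all channels, is provided.
        </p>
        <p>The entire set of qualitative features thus consists of the following features per each
utterance:
          <disp-formula id="eq-15"><label>(15)</label><tex-math>F_Q = \{ \mathrm{EV}, \mathrm{CD}, \mathrm{WCD}, \mathrm{RMS}, \mathrm{SNR}, T_{60} \}.</tex-math></disp-formula>
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Beamformed Features</title>
        <p>
          In addition to extracting features from the raw input signal, we study the influence of features
extracted from the beamformed signal [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. For this study we apply two types of simple
beamforming algorithms: Delay and Sum Beamforming (DSB) and Differential beamforming (DIF).
We intend to examine the influence on the features imposed by steering the dual channel signal
in two extreme directions along the linear array (i.e., in the directions to the two participants).
For this we apply beamforming in an endfire fashion.
        </p>
        <p>The principle of DSB is expressed in the following equation:
          <disp-formula id="eq-16"><label>(16)</label><tex-math>Y_{\mathrm{DSB}}(t, f, \tau) = \frac{1}{2} \left( X_1(t, f) + X_2(t, f) e^{-i 2 \pi f \tau} \right),</tex-math></disp-formula>
where <inline-formula><tex-math>e^{-i 2 \pi f \tau}</tex-math></inline-formula> is the spectral shift operator at time delay <inline-formula><tex-math>\tau</tex-math></inline-formula>, the same as for equation (1). The
DIF beamformer, on the other hand, is defined as
          <disp-formula id="eq-17"><label>(17)</label><tex-math>Y_{\mathrm{DIF}}(t, f, \tau) = \frac{1}{2} \left( X_1(t, f) - X_2(t, f) e^{-i 2 \pi f \tau} \right).</tex-math></disp-formula>
        </p>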
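        <p>As the frequency response of the DIF beamformer is significantly nonuniform, with the lower
frequencies being attenuated in the first half-lobe of the response, we apply an equalizer <inline-formula><tex-math>H_{eq}(f)</tex-math></inline-formula> to the
lower frequencies below the cutting frequency of the first lobe <inline-formula><tex-math>f_c = c/d</tex-math></inline-formula> (for the endfire case).
The equalizer equals the inverse of a sine function of frequency for <inline-formula><tex-math>f \le f_c</tex-math></inline-formula> and unity for <inline-formula><tex-math>f &gt; f_c</tex-math></inline-formula>.
The equalizer is then applied to all DIF beamformed frames as <inline-formula><tex-math>Y_{\mathrm{DIF}}(t, f) \leftarrow H_{eq}(f) Y_{\mathrm{DIF}}(t, f)</tex-math></inline-formula>.</p>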
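        <p>Applying beamforming in an endfire fashion means that the time delay is set to the two
extreme values <inline-formula><tex-math>\tau = \{ -d/c, d/c \}</tex-math></inline-formula>. We thus obtain the two beamformed channels as
<inline-formula><tex-math>\mathbf{Y}_{\mathrm{DSB}}(t, f) = [Y_{\mathrm{DSB}}(t, f, -d/c), Y_{\mathrm{DSB}}(t, f, d/c)]</tex-math></inline-formula> and
<inline-formula><tex-math>\mathbf{Y}_{\mathrm{DIF}}(t, f) = [Y_{\mathrm{DIF}}(t, f, -d/c), Y_{\mathrm{DIF}}(t, f, d/c)]</tex-math></inline-formula>. As
beamformed signals lose the initial phase information, they cannot be applied to directional feature
extraction. Thus they are applied to extract all the qualitative features excluding SNR and <inline-formula><tex-math>T_{60}</tex-math></inline-formula>,
as these are provided by NN VAD, which implies only raw signal input. These features are
thus: EV, CD, WCD, RMS. The sets of these features extracted on specific beamformed data are
denoted as <inline-formula><tex-math>F_{\mathrm{DSB}}</tex-math></inline-formula> and <inline-formula><tex-math>F_{\mathrm{DIF}}</tex-math></inline-formula>.</p>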
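        <p>Endfire DSB and DIF beamforming of a single STFT frame can be sketched as below, without the equalizer. The function name, the dictionary layout of the result and the default speed of sound of 343 m/s are assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import cmath
import math


def endfire_beamform(x1_f, x2_f, freqs, d, c=343.0):
    """Endfire DSB and DIF outputs for one STFT frame.

    x1_f, x2_f: complex spectra of the two channels (same length).
    freqs:      bin centre frequencies in Hz.
    Returns {"DSB": [y(-d/c), y(+d/c)], "DIF": [y(-d/c), y(+d/c)]}.
    """
    out = {"DSB": [], "DIF": []}
    for tau in (-d / c, d / c):  # the two extreme steering delays
        shift = [cmath.exp(-2j * math.pi * f * tau) for f in freqs]
        out["DSB"].append(
            [0.5 * (a + b * s) for a, b, s in zip(x1_f, x2_f, shift)]
        )
        out["DIF"].append(
            [0.5 * (a - b * s) for a, b, s in zip(x1_f, x2_f, shift)]
        )
    return out
```
]]></preformat>
        <p>For a plane wave that reaches the second microphone d/c later than the first, steering with the negative delay aligns the two channels in this sketch: the DSB output then reproduces the first channel, while the DIF output places a null in that direction.</p>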
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Extraction Procedure Review</title>
        <p>
          The block diagram of the entire feature extraction procedure is presented in Fig. 2. The dual
channel signal in the figure denotes the signal interval of one recognized speech segment
(utterance). The SNR and <inline-formula><tex-math>T_{60}</tex-math></inline-formula> estimates are retrieved from this signal by the NN VAD [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ], as
described in Subsection 3.2. For the extraction of the other features it is sufficient to perform
all the steps of MFCC computation separately, extracting the respective features at every intermediate
step. Thus, as described in Subsections 3.1 and 3.2, after STFT the directional features and RMS are
extracted; after computing Mel filterbank energy coefficients the EV features are extracted; and
the CD and WCD features are extracted at the last stage of computing cepstral coefficients. The
signal is beamformed to additionally extract the features described in Subsection 3.3. This
approach reduces the number of MFCC computation instances to just three (one, if beamformers
are not involved), which reduces the computational load of feature extraction.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation and Results</title>
      <p>This section presents the experimental setup for data acquisition and the applied feature
classification approaches, and discusses the feature evaluation results.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Setup and Data</title>
        <fig id="fig-2">
          <label>Figure 2</label>
          <caption>
            <p>Block diagram of the feature extraction procedure. Blocks: 2-channel signal, STFT, GCC-PHAT, Beamforming, Mel filterbanks, Envelope Variance, MFCC, Cepstral Distance, NN VAD (SNR, T60), RMS energy. Outputs: RMS energy, TDOA features, EV quality features, CD quality features.</p>
          </caption>
        </fig>
        <p>The experiments are performed on a signal database of real life recordings of conversations
according to the scenario outlined in Section 2. People were asked to hold their meetings
and discussions at a table with dimensions 2 × 1 meters in an open office space. No physical
boundaries were erected around the table, i.e., background speakers were able to move and
go about their daily routine along both of the longer sides of the table. Several desk
phones, a printer and an air conditioning unit were situated in the vicinity of the table. Private
discussions (2 participants) and working group meetings (3–6 participants) took place
at the table with the only restriction being that all the participants had to be seated along
the longer sides of the table. The dual microphone array with an inter-microphone distance
of <inline-formula><tex-math>d = 0.05</tex-math></inline-formula> m was placed at the center of the table perpendicular to the longer side.</p>
        <p>The resulting database contains conversations between people of both sexes in a noisy office
environment with ample babble noise. The audio signals are stored in 16 bit PCM WAV files
with a sampling rate of 16 kHz. The reverberation time measured at the table equals
<inline-formula><tex-math>T_{60} = 450</tex-math></inline-formula>–500 ms. The meeting recordings were manually transcribed with assignment of
three distinct classes: person on the left side of the table, person on the right side of the table, and background
speaker. Background speech was transcribed according to typical human hearing capabilities,
i.e., distant inaudible speech was not transcribed. Transcription resulted in 115,460 utterances,
which were selected in order to reach an approximately equal sample distribution between the
three classes.</p>
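        <p>As a rough worked example of the delay range implied by this setup, assuming a speed of sound of about 343 m/s (a value not stated in the text), the largest physically possible TDOA for the inter-microphone distance of 0.05 m and its equivalent at the 16 kHz sampling rate are:</p>
        <preformat><![CDATA[
```latex
\tau_{\max} = \frac{d}{c} \approx \frac{0.05\ \mathrm{m}}{343\ \mathrm{m/s}}
            \approx 1.46 \times 10^{-4}\ \mathrm{s},
\qquad
\tau_{\max} f_s \approx 1.46 \times 10^{-4}\ \mathrm{s} \times 16\,000\ \mathrm{Hz}
                \approx 2.3\ \text{samples}.
```
]]></preformat>
        <p>Under this assumption the whole admissible delay range spans fewer than five sampling periods, which suggests that the TDOA has to be resolved on a delay grid finer than one sample.</p>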
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Feature Classification</title>
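        <p>According to Section 3, the feature set comprises a directional feature vector <inline-formula><tex-math>F_D</tex-math></inline-formula> (4) of length 5,
a qualitative feature vector <inline-formula><tex-math>F_Q</tex-math></inline-formula> (15) of length 10, and two qualitative feature vectors <inline-formula><tex-math>F_{\mathrm{DSB}}</tex-math></inline-formula>, <inline-formula><tex-math>F_{\mathrm{DIF}}</tex-math></inline-formula>
extracted from DSB and DIF beamformed data, both of length 8. The entire feature set thus
comprises 31 features. The corresponding feature subsets are examined in specific combinations by
applying a classification procedure for three variations of target classes:
          <list list-type="bullet">
            <list-item><p>3 class: speaker right (sp1), speaker left (sp2), background speaker (bg);</p></list-item>
            <list-item><p>2 class: any of the participant speakers (sp1&amp;2), background speaker (bg);</p></list-item>
            <list-item><p>2 class: speaker right (sp1), speaker left (sp2).</p></list-item>
          </list>
        </p>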
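        <p>The three target class variations can be derived from one set of transcribed labels, as in the following sketch. The task names, the function name and the subset keys are hypothetical labels introduced for the example; only the subset lengths and the class names are taken from the text.</p>
        <preformat><![CDATA[
```python
# Lengths of the feature subsets as stated in the text.
SUBSET_LENGTHS = {"F_D": 5, "F_Q": 10, "F_DSB": 8, "F_DIF": 8}


def select_task(samples, task):
    """Prepare (features, label) pairs for one classification task.

    samples: list of (features, label), label in {"sp1", "sp2", "bg"}.
    task:    "3class", "participants_vs_bg" or "sp1_vs_sp2"
             (hypothetical task names).
    """
    if task == "3class":
        return list(samples)
    if task == "participants_vs_bg":
        # Merge both participant classes into one.
        return [(x, "bg" if y == "bg" else "sp1&2") for x, y in samples]
    if task == "sp1_vs_sp2":
        # Background utterances are not used for this task.
        return [(x, y) for x, y in samples if y != "bg"]
    raise ValueError(f"unknown task: {task}")
```
]]></preformat>
        <p>Summing the subset lengths reproduces the 31 features of the complete set, and the last task keeps only participant utterances.</p>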
        <p>For feature classification an ANN classifier is trained and tested on every examined subset of
features. For the first two variations of target classes all utterances from the database are used;
for the last variation only participant utterances are used. The classifier is implemented in
TensorFlow and consists of three layers: a first dense layer with 80–200 neurons and ReLU activation;
a second dense layer with 40–100 neurons and ReLU activation; and a third dense layer with softmax activation. The
number of neurons is selected experimentally to benefit classification of different feature
subsets. EarlyStopping and ReduceLROnPlateau are applied during training, both monitoring
the validation loss.</p>
        <p>Additionally, feature selection is performed by applying Recursive Feature Elimination with
Cross-Validation (RFECV) from the sklearn package while iteratively training a classifier of
type GradientBoostingClassifier on combinations of features with the elimination step
equal to 1. As a result the selected feature set excludes 10 features without significant
loss of classification accuracy.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Results</title>
        <p>The results of feature subset evaluation are presented in Table 1. It is evident from the table
that the directional (TDOA) features <inline-formula><tex-math>F_D</tex-math></inline-formula> alone cannot distinguish between the 3 classes defined in
Subsection 4.2 beyond the margin of random choice. Furthermore, for the first two classification
problems the recall of class 3 (bg) is the lowest, which means that applying just TDOA features
does not provide near vs far speaker separation. Applying qualitative features, on the other
hand, significantly improves classification quality for all three classification tasks. The
application of beamformed data improves the quality only slightly and, therefore, may be superfluous
for limited resource solutions. Classification on the selected feature set, which incorporates 21
of 31 features (a reduction of 32%), is on par with the accuracy over the entire feature set. This
implies that some individual features from all the regarded subsets are redundant for the classification task
at hand. However, this may hold only for this specific dataset.</p>
        <p>Generally, the discussed directional and qualitative features seem to be applicable for the
task of distinguishing between the participating speakers and background speakers.
Qualitative features significantly improve classification accuracy. The established feature set can be
considered for further investigation in combination with biometric and other types of features.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>The paper considered a set of directional and qualitative signal features for the task of speaker
diarization. The discussed feature set is shown to be applicable to the task of speaker utterance
classification for the distant speech processing case in office and babble noise conditions. The
feature set can be considered for further investigation in combination with biometric and
other types of features.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This research was financially supported by the Foundation NTI (Contract 20/18gr,
ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nedoluzhko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Bojar</surname>
          </string-name>
          ,
          <article-title>Towards automatic minuting of the meetings</article-title>
          ,
          <source>in: 19th Conference ITAT 2019: Slovenskocesky NLP workshop (SloNLP 2019)</source>
          , CreateSpace Independent Publishing Platform,
          <year>2019</year>
          , pp.
          <fpage>112</fpage>
          -
          <lpage>119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bokhove</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Downey</surname>
          </string-name>
          ,
          <article-title>Automated generation of “good enough” transcripts as a first step to transcription of audio-recorded data</article-title>
          ,
          <source>Methodological Innovations</source>
          <volume>11</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T.</given-names>
            <surname>Hain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dines</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Garau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Karafiát</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ordelman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Renals</surname>
          </string-name>
          ,
          <article-title>Transcription of conference room meetings: an investigation</article-title>
          ,
          <source>in: INTERSPEECH</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hinthorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Meeting transcription using virtual microphone arrays</article-title>
          , in: ArXiv,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Aronowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzuki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kurata</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hoory</surname>
          </string-name>
          ,
          <article-title>New advances in speaker diarization</article-title>
          ,
          <source>in: INTERSPEECH</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <article-title>Acoustic model adaptation for presentation transcription and intelligent meeting assistant systems</article-title>
          ,
          <source>in: ICASSP 2020</source>
          , IEEE,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>F.</given-names>
            <surname>Filippidou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moussiades</surname>
          </string-name>
          ,
          <article-title>Alpha benchmarking of IBM, Google and Wit automatic speech recognition systems</article-title>
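          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Maglogiannis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iliadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Pimenidis</surname>
          </string-name>
          (Eds.),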
          <source>Artificial Intelligence Applications and Innovations</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>73</fpage>
          -
          <lpage>82</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Stolcke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Droppo</surname>
          </string-name>
          ,
          <article-title>Comparing human and machine errors in conversational speech transcription</article-title>
          ,
          <source>ArXiv</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Yoshioka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Abramovski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Aksoylar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>David</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitriadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurvich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hurvitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koubi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Krupka</surname>
          </string-name>
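          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Leichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,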
          <string-name>
            <given-names>P.</given-names>
            <surname>Parthasarathy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vinnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Advances in online audio-visual meeting transcription</article-title>
          ,
          <source>in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>276</fpage>
          -
          <lpage>283</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>S.</given-names>
            <surname>Horiguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagamatsu</surname>
          </string-name>
          ,
          <article-title>Utterance-wise meeting transcription system using asynchronous distributed microphones</article-title>
          ,
          <source>in: INTERSPEECH</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Horiguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Takashima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagamatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <article-title>Auxiliary interference speaker loss for target-speaker speech recognition</article-title>
          ,
          <source>in: INTERSPEECH</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Astapov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Popov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kabarov</surname>
          </string-name>
          ,
          <article-title>Directional clustering with polyharmonic phase estimation for enhanced speaker localization</article-title>
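          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Karpov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Potapova</surname>
          </string-name>
          (Eds.),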
          <source>Speech and Computer</source>
          , Springer International Publishing, Cham,
          <year>2020</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Zmolikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Delcroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kinoshita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Higuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ogawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakatani</surname>
          </string-name>
          ,
          <article-title>Speaker-aware neural network based beamformer for speaker extraction in speech mixtures</article-title>
          ,
          <source>in: INTERSPEECH</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Medennikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Korenevsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Prisyach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khokhlov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Korenevskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sorokin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Timofeeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mitrofanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrusenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Podluzhny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Laptev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Romanenko</surname>
          </string-name>
          ,
          <article-title>Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario</article-title>
          ,
          <source>in: Interspeech 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>274</fpage>
          -
          <lpage>278</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fujita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Horiguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nagamatsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Watanabe</surname>
          </string-name>
          ,
          <article-title>End-to-end neural speaker diarization with self-attention</article-title>
          ,
          <source>in: 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>296</fpage>
          -
          <lpage>303</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H. R.</given-names>
            <surname>Muckenhirn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. L.</given-names>
            <surname>Moreno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hershey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wilson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sridhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Saurous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weiss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Voicefilter: Targeted voice separation by speaker-conditioned spectrogram masking</article-title>
          ,
          <source>in: ICASSP 2019</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Delcroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zmolikova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kinoshita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ogawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nakatani</surname>
          </string-name>
          ,
          <article-title>Single channel target speaker extraction and recognition with speaker beam</article-title>
          ,
          <source>in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>5554</fpage>
          -
          <lpage>5558</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Brandstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <source>Microphone Arrays: Signal Processing Techniques and Applications</source>
          , Digital Signal Processing, Springer-Verlag,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nadeu</surname>
          </string-name>
          ,
          <article-title>Channel selection measures for multi-microphone speech recognition</article-title>
          ,
          <source>Speech Communication</source>
          <volume>57</volume>
          (
          <year>2014</year>
          )
          <fpage>170</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>S.</given-names>
            <surname>Davis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mermelstein</surname>
          </string-name>
          ,
          <article-title>Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences</article-title>
          ,
          <source>IEEE Transactions on Acoustics, Speech, and Signal Processing</source>
          <volume>28</volume>
          (
          <year>1980</year>
          )
          <fpage>357</fpage>
          -
          <lpage>366</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.</given-names>
            <surname>Guerrero Flores</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tryfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Omologo</surname>
          </string-name>
          ,
          <article-title>Cepstral distance based channel selection for distant speech recognition</article-title>
          ,
          <source>Computer Speech &amp; Language</source>
          <volume>47</volume>
          (
          <year>2018</year>
          )
          <fpage>314</fpage>
          -
          <lpage>332</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>G.</given-names>
            <surname>Lavrentyeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Volkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Avdeeva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Novoselov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gorlanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Andzhukaev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ivanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kozlov</surname>
          </string-name>
          ,
          <article-title>Blind speech signal quality estimation for speaker verification systems</article-title>
          ,
          <source>in: Proc. Interspeech 2020</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1535</fpage>
          -
          <lpage>1539</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>