=Paper=
{{Paper
|id=Vol-2893/paper_17
|storemode=property
|title=Directional and Qualitative Feature Classification for Speaker Diarization with Dual Microphone Arrays
|pdfUrl=https://ceur-ws.org/Vol-2893/paper_17.pdf
|volume=Vol-2893
|authors=Sergei Astapov,Dmitry Popov,Vladimir Kabarov
|dblpUrl=https://dblp.org/rec/conf/micsecs/AstapovPK20
}}
==Directional and Qualitative Feature Classification for Speaker Diarization with Dual Microphone Arrays==
Sergei Astapov (a), Dmitriy Popov (b) and Vladimir Kabarov (a)

(a) International Research Laboratory “Multimodal Biometric and Speech Systems,” ITMO University, Kronverksky prospekt 49A, St. Petersburg, 197101, Russian Federation
(b) Speech Technology Center, Vyborgskaya naberezhnaya 45E, St. Petersburg, 194044, Russian Federation

astapov@speechpro.com (S. Astapov); popov-d@speechpro.com (D. Popov); kabarov@speechpro.com (V. Kabarov)
ORCID: 0000-0001-8381-8841 (S. Astapov); 0000-0001-7641-5542 (D. Popov); 0000-0001-6300-9473 (V. Kabarov)

Proceedings of the 12th Majorov International Conference on Software Engineering and Computer Systems, December 10–11, 2020, Online & Saint Petersburg, Russia
Abstract
Automatic meeting transcription has long been one of the common applications for natural language
processing methods. The quality of automatic meeting transcription for the cases of distant speech ac-
quired by a common audio recording device suffers from the negative effects of distant speech signal
attenuation, distortion imposed by reverberation and background noise pollution. Automatic meeting
transcription mainly involves the tasks of Automatic Speech Recognition (ASR) and speaker diariza-
tion. While state-of-the-art approaches to ASR are able to reach decent recognition quality on distant
speech, there still exists a lack of prominent speaker diarization methods for the distant speech case.
This paper studies a set of directional and qualitative features extracted from a dual microphone array
signal and evaluates their applicability to speaker diarization for the noisy distant speech case. These
features represent respectively the speaker spatial distribution and the intrinsic signal quality proper-
ties. Evaluation of the feature sets is performed on real life data acquired in babble noise conditions
by conducting several classification experiments aimed at distinguishing between utterances produced
by different conversation participants and between those produced by the background speakers. The
study shows that specific sets of features result in satisfying classification accuracy and can be further
investigated in experiments combining them with biometric and other types of properties.
Keywords
Meeting transcription, Distant speech processing, Dual microphone arrays, GCC-PHAT, Beamforming,
Signal quality features, Artificial neural networks, Classification
1. Introduction
Automatic meeting transcription has long been a common application for Natural Speech Pro-
cessing (NSP) methods [1, 2, 3, 4]. Automatic meeting transcription and meeting minutes log-
ging is a problem that mainly employs methods of Automatic Speech Recognition (ASR) and
speaker identification and diarization [5]. The available commercial solutions to automatic
meeting transcription can be divided into two major groups: ones tending to close-talking
speech processing and ones tending to distant speech processing. Close-talking speech pro-
cessing implies speech acquisition with a close-talking microphone per each speaker. It tends
to the scenarios where each speaker has a headset, lapel microphone, or any other type of per-
sonal audio recording device. Distant speech processing methods, on the other hand, tend to
the scenarios where one or several microphones are situated at a distance to all the attending
speakers and, thus, record a mixture of (occasionally overlapping) speech signals incoming
from different speakers and background noise incoming from a variety of sources.
Automatic meeting transcription solutions for close-talking speech are quite diverse with
several commercial solutions available on the market [6, 7, 1]. Close-talking speech process-
ing poses a significantly less complicated problem for automatic NSP methods compared to
distant speech processing. As each attending speaker possesses a personal audio recording
device, the task of speaker identification and diarization becomes almost trivial. Voice detection
is performed in each channel independently, and the effects caused by distance, such as
attenuation and interference, are negligible. Furthermore, state-of-the-art ASR methods perform
on par with human perception levels for close-talking speech [8, 7].
Distant speech processing, on the other hand, poses a greater problem for ASR and speaker
diarization [9]. The effects of speech signal mixing, signal attenuation due to distance, influ-
ence of interference and reverberation, noise pollution — all negatively affect ASR and speaker
diarization accuracy. While the state-of-the-art in ASR has reached decent recognition quality
on distant speech, there still exists a lack of prominent speaker diarization methods fitted for
the task. Part of the recent developments in the field of distant speaker diarization focuses on
applying both lexical [10, 11] and acoustic [12, 10] features for model training, usually based
on Artificial Neural Networks (ANN) [5]. Application of multichannel sound acquisition de-
vices (microphone arrays) also provides the opportunity to extract spatial features for different
sound sources [4, 9, 13]. Along with biometric profile extraction this tends to be a prominent di-
rection of research [14, 15]. Noise pollution remains the main concern for diarization systems,
specifically under babble noise produced by speakers not-of-interest situated nearby [16, 17].
Babble noise harshly affects diarization quality based on any type of feature, lexical or acoustic.
This paper considers a set of acoustic features extracted from an audio signal acquired by a
dual microphone array. The examined set of features consists of directional features, regarding
the spatial disposition of speakers, and signal quality features, regarding the amount of dis-
tortion in the signal, closeness to the microphone, estimated Signal to Noise Ratio (SNR) and
the 𝑇60 reverberation time. The applicability of features to the task of distant speaker diariza-
tion is evaluated by determining the classification accuracy of speaker utterances belonging
either to a target speaker among other active speakers or to the background speakers not-of-
interest. Thus the combination of spatial and qualitative acoustic features is aimed at reducing
the negative effects of babble noise on diarization quality.
2. Problem Formulation
The considered application of the method described in this paper consists of logging a conver-
sation between two parties of speakers situated on the two opposite sides of a desk or table.
Such a scenario arises, for example, in cases of an interview, negotiations, service provision,
or any other kind of meeting where such a disposition of parties is appropriate [1, 2]. The
audio recording device is placed in the middle of the desk or table between the two parties of
speakers. The recording device houses two microphones situated in a straight line parallel to
the speaker disposition, i.e., one of the two microphones is directed to one side of the desk
and the other — to the opposite side of the desk. The recording device is compact, with a dis-
tance between the two microphones measured in several centimeters. The two microphones
are sampled synchronously and thus form a dual microphone array.
The task of speaker diarization in the considered scenario consists of estimating the active
speaker party per each spoken word or phrase (utterance). If the conversation involves just
two active speakers, the distinction of utterances must be made between these two speakers; if
any party contains more than one active speaker, an entire party is considered as one speaker
[1]. This is allowed in our study as speaker biometric information is not included in the set of
examined features. In a more general case employing a greater number of microphones and
involving speaker biometrics it should be possible to expand the diarization task to a more
general meeting scenario, where any number of speakers is situated anywhere around the
common audio recording device [2]. In this study we primarily address the problem of speaker
diarization in babble noise, i.e., a conversation where speakers not-of-interest are present near
the conversation area.
3. Applied Features and Methods
The examined features are extracted through several methods consisting of conventional signal
processing and ANN based models. This section addresses these feature extraction methods.
Any operation is performed either on the temporal audio signal or its frequency domain repre-
sentation acquired through the Short-Time Fourier Transform (STFT). The dual channel signal
in the time domain is represented by a sequence of observation vectors 𝐱(𝑡) = [𝑥1 (𝑡), 𝑥2 (𝑡)] and
the STFT representation of the temporal signal is denoted as 𝐗(𝑡, 𝑓 ) = [𝑋1 (𝑡, 𝑓 ), 𝑋2 (𝑡, 𝑓 )], where
𝑡 is the discrete time instance and 𝑓 is the STFT frequency band.
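As a minimal illustration of this notation, the following Python sketch computes the dual channel STFT representation 𝐗(𝑡, 𝑓 ) with scipy; the file name and the STFT parameters (window length, overlap) are illustrative assumptions.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import stft

# Hypothetical two-channel recording; STFT parameters are illustrative only.
fs, x = wavfile.read("meeting_2ch.wav")        # x has shape (n_samples, 2)
x = x.astype(np.float32) / 32768.0             # 16-bit PCM to [-1, 1]

# STFT of each channel: Xi has shape (n_freq_bins, n_frames)
freqs, frames, X1 = stft(x[:, 0], fs=fs, nperseg=512, noverlap=256)
_, _, X2 = stft(x[:, 1], fs=fs, nperseg=512, noverlap=256)
X = np.stack([X1, X2])                         # X(t, f) for channels 1 and 2
```

The shape convention (2 channels, frequency bins, time frames) is reused in the feature sketches below.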
3.1. Directional Features
The extracted directional features are based on the Time Difference of Arrival (TDOA) between
the two microphones. As the array is placed such that one microphone is situated closer to
the speaker-of-interest than the other, the propagating acoustic waves reach the farther
microphone with a specific delay compared to the closer microphone. This delay defines the TDOA,
which in turn defines the direction to the sound source (speaker). We estimate the TDOA by ap-
plying Generalized Cross-Correlation with 𝛽-weighted Phase Transform (GCC-𝛽PHAT) [18].
The GCC-𝛽PHAT for a dual microphone array is defined by the following equation:
$$R_{12}(t,\tau) = \Re\left(\sum_f \frac{X_1(t,f)\,X_2^*(t,f)}{\left|X_1(t,f)\,X_2^*(t,f)\right|^{\beta}}\, e^{-i 2\pi f \tau}\right), \qquad (1)$$
where 𝜏 is the time delay between two channels, ||𝑋1 (𝑡, 𝑓 )𝑋2∗ (𝑡, 𝑓 )||^𝛽 is the 𝛽PHAT coefficient,
(⋅)∗ denotes the complex conjugate, and ℜ(⋅) denotes the real part of a complex number. The
range of physically possible time delays between two microphones is defined as 𝜏 ∈ [−𝑑/𝑐, 𝑑/𝑐],
where 𝑑 is the distance between the two microphones and 𝑐 is the speed of sound in air. The
TDOA for time instance 𝑡 is then defined as the time delay at which GCC-𝛽PHAT reaches its
maximal value:
$$\tau_{\mathrm{TDOA}}(t) = \arg\max_{\tau}\left(R_{12}(t,\tau)\right). \qquad (2)$$
For a recognized utterance 𝐒 = {𝐗(𝑡1 , 𝑓 ), … , 𝐗(𝑡2 , 𝑓 )} ranging in the interval 𝑡 ∈ [𝑡1 , 𝑡2 ] we
extract directional features in the following manner. The TDOA values corresponding to this
time interval 𝝉 = {𝜏TDOA (𝑡1 ), … , 𝜏TDOA (𝑡2 )} are split into two groups of positive and negative
values:
$$\boldsymbol{\tau}^{(+)} = \left\{\tau_{\mathrm{TDOA}}(t) \mid \tau_{\mathrm{TDOA}}(t) > \epsilon,\ t \in [t_1, t_2]\right\}, \quad \boldsymbol{\tau}^{(-)} = \left\{\tau_{\mathrm{TDOA}}(t) \mid \tau_{\mathrm{TDOA}}(t) < -\epsilon,\ t \in [t_1, t_2]\right\}, \qquad (3)$$
where 𝜖 is a small positive number defining the TDOA ambiguity around zero. As the two
speakers or two speaker parties are situated on the opposite sides of the dual microphone
array (i.e., along the straight line connecting the two microphones), the TDOA values should
be either positive or negative depending on the side of the sound source. As any utterance may
include micro-pauses and noise instances, we extract the following directional features based
on TDOA:
$$\mathbf{F}_D = \left\{\frac{|\boldsymbol{\tau}^{(+)}|}{|\boldsymbol{\tau}|},\ \frac{|\boldsymbol{\tau}^{(-)}|}{|\boldsymbol{\tau}|},\ \operatorname{mean}\left(\boldsymbol{\tau}^{(+)}\right),\ \operatorname{mean}\left(\boldsymbol{\tau}^{(-)}\right),\ \operatorname{mean}\left(\boldsymbol{\tau}\right)\right\}, \qquad (4)$$
where |⋅| denotes the cardinality of a set. As the set of TDOA values 𝝉 may include both positive
and negative values due to noise pollution and pauses (silence), one simple feature of average
TDOA along 𝝉 is not sufficient to represent the utterance. It should be noted that if 𝝉 (+) or
𝝉 (−) has zero cardinality, the mean value is deemed zero:
$$\operatorname{mean}\left(\boldsymbol{\tau}^{(+)}\right) = \begin{cases} \dfrac{1}{|\boldsymbol{\tau}^{(+)}|}\sum \boldsymbol{\tau}^{(+)}, & |\boldsymbol{\tau}^{(+)}| > 0,\\ 0, & |\boldsymbol{\tau}^{(+)}| = 0. \end{cases} \qquad (5)$$
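A possible NumPy implementation of the directional features of equations (1)-(5) is sketched below. It is a simplified illustration: the 𝛽 value, the resolution of the 𝜏 grid, and the ambiguity threshold eps are illustrative choices, not fixed values from the method description.

```python
import numpy as np

def gcc_beta_phat(X1, X2, freqs, taus, beta=0.7):
    """GCC-betaPHAT of eq. (1) for one STFT frame, evaluated on a grid of delays."""
    cross = X1 * np.conj(X2)
    weighted = cross / (np.abs(cross) ** beta + 1e-12)
    # Sum over frequency of the phase-weighted cross-spectrum shifted by each tau
    shifts = np.exp(-2j * np.pi * np.outer(taus, freqs))      # (n_tau, n_freq)
    return np.real(shifts @ weighted)

def directional_features(X, freqs, d=0.05, c=343.0, n_tau=201, eps=1e-5):
    """TDOA-based feature vector F_D of eqs. (2)-(5) for an utterance X of shape (2, F, T)."""
    taus = np.linspace(-d / c, d / c, n_tau)                  # physically possible delays
    tdoa = np.array([taus[np.argmax(gcc_beta_phat(X[0, :, t], X[1, :, t], freqs, taus))]
                     for t in range(X.shape[2])])             # eq. (2) per frame
    pos, neg = tdoa[tdoa > eps], tdoa[tdoa < -eps]            # eq. (3)
    safe_mean = lambda v: float(np.mean(v)) if v.size else 0.0    # eq. (5)
    return np.array([pos.size / tdoa.size, neg.size / tdoa.size,
                     safe_mean(pos), safe_mean(neg), float(np.mean(tdoa))])  # eq. (4)
```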
3.2. Qualitative Features
Directional features on their own are quite representative in the ideal case where background
noise is absent. If noise and speech-not-of-interest are present in the signal, and the sources
of coherent noise (including background speakers) are situated in the vicinity of the conver-
sation desk or table, the quality of TDOA features may decrease due to masking effects. An
example of GCC-𝛽PHAT (discussed in Subsection 3.1) value sequence corresponding to a con-
versation in noisy conditions with presence of babble noise is presented in Fig. 1. The figure
displays a GCC-𝛽PHAT vector 𝑅12 (𝑡, 𝜏 ) from equation (1) for each of the sequence of signal
STFT frames. Here the TDOA values corresponding to the actual conversation participants
are distributed at the spectral shift index of approximately 90 for one side of the desk and ap-
proximately −100 for the other. It can be noticed that several other distributions exist: at the
spectral shift indices of approximately 100 to 200, −200 to −100, and around 0. The first two
correspond to babble and other noise, and the third to uncorrelated diffuse noise and interference.
This aspect would not pose a problem in close-talking applications, where ASR would recognize
only the closest spoken utterances.
Figure 1: A sequence of GCC-𝛽PHAT frames for a conversation recording. Darker color represents higher GCC values.
Unfortunately, in distant speech processing
applications this is rarely the case, and phrases spoken in the vicinity of the target conversa-
tion are very often recognized by distant speech ASR systems. Explicitly gathering biometric
information for all detected speakers would remedy the situation; however, such an approach
involves a significant amount of manual effort and is often not feasible. We attempt to
distinguish between the closest speech of target speakers and the farther speech of background
speakers by applying several signal quality metrics as features.
The first discussed signal quality metric is the signal Envelope-Variance (EV) [19]. It was
originally proposed for the purposes of blind channel selection in multi-microphone systems. It
involves calculating the Mel-frequency filterbank energy coefficients of STFT frames, similarly
to the process involved in Mel-frequency Cepstral Coefficient (MFCC) calculation [20]. The Mel
filterbank coefficients are denoted for an utterance as 𝐒 = {𝐗Mel (𝑡1 , 𝑓 ), … , 𝐗Mel (𝑡2 , 𝑓 )}, where 𝑓
are log frequencies in the Mel-scale. To remove the short term effects of different electric gains
and impulse responses of the microphones from the signal in each channel the mean value is
subtracted in the log domain from each sub-band as
$$\hat{X}_i^{\mathrm{Mel}}(t,f) = e^{\log X_i^{\mathrm{Mel}}(t,f) - \mu_{X_i^{\mathrm{Mel}}}(f)}, \qquad (6)$$
where 𝑋𝑖Mel (𝑡, 𝑓 ) is the Mel filterbank coefficient for microphone channel 𝑖 ∈ [1, 2], 𝑡 ∈ [𝑡1 , 𝑡2 ];
the mean 𝜇𝑋𝑖Mel (𝑓 ) is estimated by a time average in each sub-band along the whole utterance.
After mean normalization, the sequence of Mel filterbank energies is compressed applying a
cube root function, and a variance measure is calculated for each sub-band and channel:
$$V_i(f) = \operatorname{var}\left[\hat{X}_i^{\mathrm{Mel}}(t,f)^{\frac{1}{3}}\right]. \qquad (7)$$
The cube root compression function is preferred to the conventionally used logarithm, because
very small values in the silent portions of the utterance may lead to extremely large negative
values after the log operation, which would distort variance estimation. The EV metric is then
calculated for each signal channel across all sub-bands as
$$\mathrm{EV}_i = \sum_f \frac{V_i(f)}{\max_i\left(V_i(f)\right)}. \qquad (8)$$
EV represents the degree of distortion in a signal; the EV value is higher in a signal channel
where the degree of interference, imposed by distant signal distortion and reverberation, is
lower. Thus, the feature vector EV = {EV1 , EV2 } not only points to the channel with less dis-
torted speech (speaking party direction), but also gives insight into the nature of the utterance
(either conversation participant or background talker).
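Under the assumption that the two-channel Mel filterbank energies have already been computed (e.g., with any standard MFCC front end), the EV computation of equations (6)-(8) could be sketched as follows; this is an illustrative sketch rather than the exact implementation used here.

```python
import numpy as np

def envelope_variance(mel):
    """Envelope-Variance per channel (eqs. (6)-(8)).

    mel: Mel filterbank energies of shape (2, n_bands, n_frames) for one utterance.
    Returns the feature vector EV = {EV_1, EV_2}.
    """
    log_mel = np.log(mel + 1e-12)
    # eq. (6): subtract the per-band time average in the log domain, return to linear scale
    norm = np.exp(log_mel - log_mel.mean(axis=2, keepdims=True))
    # eq. (7): cube-root compression, then variance over time in each sub-band
    V = np.var(np.cbrt(norm), axis=2)                # shape (2, n_bands)
    # eq. (8): normalize each band by its maximum over channels and sum over bands
    return (V / V.max(axis=0, keepdims=True)).sum(axis=1)
```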
The second and third discussed quality metrics are the Cepstral Distance (CD) and the related
Covariance-weighted Cepstral Distance (WCD) [21]. CD is another metric of signal distortion
but computed in the cepstral domain, i.e., on the coefficients resulting from MFCC. Cepstrum-
based comparisons are equivalent to comparisons of the smoothed log spectra; in this domain
the reverberation effect can be viewed as additive. An utterance in the cepstral domain is
denoted by 𝐒 = {𝐜(𝑡1 , 𝑘), … , 𝐜(𝑡2 , 𝑘)}, where 𝐜(𝑡, 𝑘) are the cepstral coefficients and 𝑘 is the
coefficient index (so-called quefrency). For our application we define the distance for each
channel 𝑖 as
$$d_i^{\mathrm{CEP}}(t) = \sum_{k=1}^{p}\left(c_i(t,k) - \bar{c}(t,k)\right)^2, \qquad (9)$$
where 𝑐𝑖 (𝑡, 𝑘) is the cepstral coefficient of 𝐜(𝑡, 𝑘) for the 𝑖-th channel, 𝑐̄ (𝑡, 𝑘) is the mean cepstral
coefficient computed by taking the MFCC from the averaged signal along all channels, and 𝑝
is the number of cepstral coefficients. The mean coefficients contain the averaged close-talk
signal, and the average reverberation component. Let us assume that one microphone signal
is better than the others in terms of direct to reverberant ratio. The basic assumption is that
such a signal will be characterized by a larger distance from the mean cepstrum. Therefore,
from the set of distances 𝑑𝑖CEP (𝑡) between the mean cepstrum and all the available channels,
the least distorted channel can be selected as
$$\hat{M}(t) = \arg\max_i d_i^{\mathrm{CEP}}(t). \qquad (10)$$
As an entire utterance can contain some number of coefficients corresponding to noise and
silence, we define the CD feature for each channel for the entire utterance as the ratio
$$\mathrm{CD}_i = \frac{\left|\left\{\hat{M}(t) \mid \hat{M}(t) = i\right\}\right|}{\left|\left\{\hat{M}(t)\right\}\right|}, \qquad (11)$$
where |{⋅}| denotes the cardinality of a set and 𝑡 ∈ [𝑡1 , 𝑡2 ]. The feature vector for a dual channel
signal is then CD = {CD1 , CD2 }. The channel with less distortion will have a higher ratio value
and the ratio difference between channels will be greater for close-talking utterances than for
the background ones.
The WCD features are obtained in a similar fashion as the CD features, with the only dif-
ference being the computation of equation (9). For WCD the 𝑘 × 𝑘 covariance matrix of the
cepstral distance vector is computed for each channel:
$$\mathbf{V}_i(t) = \operatorname{cov}\left[\mathbf{c}_i(t) - \bar{\mathbf{c}}(t)\right], \qquad (12)$$
and the covariance weights 𝑤𝑖 (𝑡, 𝑘) are retrieved as the inverse of every 𝑘-th diagonal element
𝑣𝑘𝑘 of the covariance matrix 𝐕𝑖 (𝑡). The WCD measure is then a weighted Euclidean distance
measure, where each individual cepstral component is variance-equalized by the weight:
$$d_i^{\mathrm{WCEP}}(t) = \sum_{k=1}^{p} w_i(t,k)\left(c_i(t,k) - \bar{c}(t,k)\right)^2. \qquad (13)$$
The subsequent computations are equivalent to equations (10), (11). The WCD ratio for an
entire utterance for all channels is defined as WCD = {WCD1 , WCD2 }.
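The CD and WCD ratios of equations (9)-(13) might be implemented as below. The MFCC computation itself is assumed to be done elsewhere, and the covariance weights are simplified to inverse per-coefficient variances over the utterance rather than the per-frame covariance matrix of equation (12).

```python
import numpy as np

def cd_wcd_features(cep, cep_mean):
    """Cepstral Distance and Covariance-weighted Cepstral Distance ratios.

    cep: cepstral coefficients of shape (2, n_frames, p) for the two channels.
    cep_mean: mean cepstrum of shape (n_frames, p) from the channel-averaged signal.
    Returns (CD, WCD), each of shape (2,).
    """
    diff = cep - cep_mean[None]                      # c_i(t, k) - c_bar(t, k)
    d_cep = (diff ** 2).sum(axis=2)                  # eq. (9), shape (2, n_frames)
    # Simplified eqs. (12)-(13): inverse per-coefficient variance as weights
    w = 1.0 / (np.var(diff, axis=1, keepdims=True) + 1e-12)
    d_wcep = (w * diff ** 2).sum(axis=2)             # eq. (13)

    def ratio(d):
        best = np.argmax(d, axis=0)                  # eq. (10): least distorted channel per frame
        return np.array([(best == i).mean() for i in range(d.shape[0])])  # eq. (11)

    return ratio(d_cep), ratio(d_wcep)
```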
Other quality features include Root Mean Square (RMS) energy and common SNR and 𝑇60
reverberation time estimates. Instant RMS is computed for each channel for each STFT frame
as
$$\mathrm{RMS}_i(t) = \sqrt{2\sum_{f=0}^{F_s/2}\left|X_i(t,f)\right|^2}, \qquad (14)$$
where 𝐹𝑠 is the sampling rate and |𝑋𝑖 (𝑡, 𝑓 )| is the modulus of the complex spectrum. For an entire
utterance the RMS energy is computed as the average of instant values RMS𝑖 = mean(RMS𝑖 (𝑡)),
for 𝑡 ∈ [𝑡1 , 𝑡2 ]. And the RMS feature for all channels is RMS = {RMS1 , RMS2 }. The SNR and
𝑇60 are estimated by a neural network based voice activity detector (NN VAD) integrated into
the ASR system applied in this study [22]. For every detected and recognized utterance a
scalar estimate of the common SNR (dB) and 𝑇60 (seconds), one for all channels, is provided.
The entire set of qualitative features thus consists of the following features per each utter-
ance:
$$\mathbf{F}_Q = \left\{\mathrm{EV}, \mathrm{CD}, \mathrm{WCD}, \mathrm{RMS}, \mathrm{SNR}, T_{60}\right\}. \qquad (15)$$
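For completeness, a short sketch of the per-utterance RMS feature of equation (14) and the assembly of 𝐅𝑄 from equation (15); the SNR and 𝑇60 values are treated as given, since they come from the external NN VAD.

```python
import numpy as np

def rms_feature(X):
    """Per-utterance RMS energy: eq. (14) averaged over frames, X of shape (2, F, T)."""
    rms_t = np.sqrt(2.0 * (np.abs(X) ** 2).sum(axis=1))   # instant RMS per channel and frame
    return rms_t.mean(axis=1)                             # RMS = {RMS_1, RMS_2}

def qualitative_features(ev, cd, wcd, rms, snr, t60):
    """Concatenate the qualitative feature vector F_Q of eq. (15), length 10."""
    return np.concatenate([ev, cd, wcd, rms, [snr], [t60]])
```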
3.3. Beamformed Features
In addition to extracting features from the raw input signal we study the influence of features
extracted from the beamformed signal [18]. For this study we apply two types of simple
beamforming algorithms: Delay and Sum Beamforming (DSB) and Differential Beamforming (DIF).
We intend to examine the influence on the features imposed by steering the dual channel signal
in two extreme directions along the linear array (i.e., in the directions to the two participants).
For this we apply beamforming in an endfire fashion.
The principle of DSB is expressed in the following equation:
$$X_{\mathrm{DSB}}(t,f,\tau) = \frac{1}{2}\left(X_1(t,f) + X_2(t,f)\,e^{-i2\pi f\tau}\right), \qquad (16)$$
where 𝑒 −i2𝜋𝑓 𝜏 is the spectral shift operator at time delay 𝜏 , the same as for equation (1). The
DIF beamformer, on the other hand, is defined as
$$X_{\mathrm{DIF}}(t,f,\tau) = \frac{1}{2}\left(X_1(t,f) - X_2(t,f)\,e^{-i2\pi f\tau}\right). \qquad (17)$$
As the frequency response of the DIF beamformer is significantly nonuniform, with the lower
frequencies being attenuated in the first half-lobe of the response, we apply an equalizer to the
lower frequencies below the cutoff frequency of the first lobe 𝑓𝑐 = 𝑐/𝑑 (for the endfire case):
$$H_{eq}(f) = \begin{cases}\left[\sin\left(\pi f \frac{d}{c}\right)\right]^{-1}, & f \leq f_c,\\ 1, & f > f_c.\end{cases} \qquad (18)$$
The equalizer is then applied to all DIF beamformed frames as 𝑋DIF (𝑡, 𝑓 ) ← 𝐻𝑒𝑞 (𝑓 )𝑋DIF (𝑡, 𝑓 ).
Applying beamforming in an endfire fashion means that the time delay is set to the two ex-
treme values 𝜏 = {−𝑑/𝑐, 𝑑/𝑐}. And so we obtain the two beamformed channels as 𝐗DSB (𝑡, 𝑓 ) =
[𝑋DSB (𝑡, 𝑓 , −𝑑/𝑐) , 𝑋DSB (𝑡, 𝑓 , 𝑑/𝑐)] and 𝐗DIF (𝑡, 𝑓 ) = [𝑋DIF (𝑡, 𝑓 , −𝑑/𝑐) , 𝑋DIF (𝑡, 𝑓 , 𝑑/𝑐)]. As beam-
formed signals lose the initial phase information, they cannot be applied to directional feature
extraction. Thus they are applied to extract all the qualitative features excluding SNR and 𝑇60 ,
as these are provided by the NN VAD, which accepts only the raw signal as input. These features are
thus: EV, CD, WCD, RMS. The sets of these features extracted on specific beamformed data are
denoted as 𝐅DSB and 𝐅DIF .
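A sketch of the endfire DSB and DIF beamformers of equations (16)-(18) is given below; the clamping of the equalizer gain near the zeros of the sine term is an added safeguard and an assumption, not part of the original formulation.

```python
import numpy as np

def endfire_beamform(X, freqs, d=0.05, c=343.0):
    """DSB and DIF beamforming steered to the two endfire directions (eqs. (16)-(18)).

    X: STFT of shape (2, F, T); freqs: STFT bin frequencies in Hz.
    Returns (X_dsb, X_dif), each of shape (2, F, T): one channel per steering direction.
    """
    f_c = c / d                                            # cutoff frequency of the first DIF lobe
    # eq. (18): equalize the attenuated low frequencies of the DIF response
    h_eq = np.where(freqs <= f_c,
                    1.0 / np.maximum(np.abs(np.sin(np.pi * freqs * d / c)), 1e-2),
                    1.0)
    dsb, dif = [], []
    for tau in (-d / c, d / c):                            # the two extreme (endfire) delays
        shift = np.exp(-2j * np.pi * freqs * tau)[:, None]       # spectral shift operator
        dsb.append(0.5 * (X[0] + X[1] * shift))                  # eq. (16)
        dif.append(h_eq[:, None] * 0.5 * (X[0] - X[1] * shift))  # eq. (17) + equalization
    return np.stack(dsb), np.stack(dif)
```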
3.4. Extraction Procedure Review
The block diagram of the entire feature extraction procedure is presented in Fig. 2. The dual
channel signal in the figure denotes the signal interval of one recognized speech segment (ut-
terance). The SNR and 𝑇60 estimates are retrieved from this signal by the NN VAD [22], as
described in Subsection 3.2, and for the extraction of the other features it is sufficient to perform
the MFCC computation step by step, extracting the respective features at every intermediate
step. Thus, as described in Subsections 3.1, 3.2, after STFT the directional features and RMS are
extracted; after computing Mel filterbank energy coefficients the EV features are extracted; and
the CD and WCD features are extracted at the last stage of computing cepstral coefficients. The
signal is beamformed to additionally extract the features described in Subsection 3.3. This ap-
proach reduces the number of MFCC computation instances to just three (one, if beamformers
are not involved), which reduces feature extraction computational load.
Figure 2: Block diagram of the applied feature extraction procedure.

4. Experimental Evaluation and Results
This section presents the experimental setup for data acquisition and the applied feature
classification approaches, and discusses the feature evaluation results.
4.1. Experimental Setup and Data
The experiments are performed on a signal database of real life recordings of conversations
according to the scenario highlighted in Section 2. People were asked to hold their meetings
and discussions at a table with dimensions 2 × 1 meters in an open office space. No physical
boundaries were erected around the table, i.e., background speakers were able to move and
participate in their daily routine along both of the longer sides of this table. Several desk
phones, a printer and an air conditioning unit were situated in the vicinity of the table. Private
discussions (2 participants) and working group meetings (3–6 participants) were taking place
at the table with the only restriction being that all the participants had to be seated along both
of the longer sides of the table. The dual microphone array with an inter-microphone distance
of 𝑑 = 0.05 m was placed at the center of the table perpendicular to the longer side.
The resulting database contains conversations between people of both sexes in a noisy office
environment with ample babble noise. The audio signals are acquired in 16 bit PCM WAV files
with the sampling rate equal to 16 kHz. The reverberation time measured at the table equals
𝑇60 = 450 − 500 ms. The meeting recordings were manually transcribed with assignment of
three distinct classes: person on the left, person on the right side of the table, background
speaker. Background speech was transcribed according to typical human hearing capabilities,
i.e., distant inaudible speech was not transcribed. Transcription resulted in 115,460 utterances,
which were selected in order to reach an approximately equal sample distribution between the
three classes.
4.2. Feature Classification
According to Section 3, the feature set comprises a directional feature vector 𝐅𝐷 (4) of length 5,
a qualitative feature vector 𝐅𝑄 (15) of length 10, and two qualitative feature vectors 𝐅DSB , 𝐅DIF
extracted from DSB and DIF beamformed data, both of length 8. The entire feature set then
comprises 31 features. The according feature subsets are examined in specific combinations by
applying a classification procedure for three variations of target classes:
• 3 class: speaker right (sp1), speaker left (sp2), background speaker (bg);
• 2 class: any of the participant speakers (sp1&2), background speaker (bg);
• 2 class: speaker right (sp1), speaker left (sp2).

Table 1: Feature classification results. Classification accuracy (%) is given for each class variation.

Feature set        | sp1, sp2, bg | sp(1&2), bg | sp1, sp2
TDOA               | 60.3         | 67.7        | 76.1
TDOA+Qual          | 78.2         | 85.2        | 89.9
TDOA+Qual+DSB      | 79.8         | 87.5        | 91.8
TDOA+Qual+DIF      | 79.7         | 87.6        | 93.0
All                | 81.2         | 87.7        | 93.2
Selected           | 80.8         | 87.8        | 93.5
For feature classification an ANN classifier is trained and tested on every examined subset of
features. For the first two variations of target classes all utterances from the database are used;
for the last variation only participant utterances are used. The classifier is implemented in
tensorflow and consists of three layers: first dense layer, 80–200 neurons, ReLU activation;
second dense layer, 40–100 neurons, ReLU activation; third dense layer, softmax activation. The
number of neurons is experimentally selected to benefit classification of different feature sub-
sets. EarlyStopping and ReduceLROnPlateau are applied during training for monitoring
the validation loss.
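A minimal tensorflow/Keras sketch matching this description is given below; the concrete layer sizes are picked from within the stated ranges, while the optimizer, loss, patience values, and the commented-out training call (with hypothetical F_train and y_train arrays) are assumptions.

```python
import tensorflow as tf

def build_classifier(n_features=31, n_classes=3, units1=128, units2=64):
    """Three dense layers: ReLU, ReLU, softmax output over the target classes."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_features,)),
        tf.keras.layers.Dense(units1, activation="relu"),
        tf.keras.layers.Dense(units2, activation="relu"),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

callbacks = [
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                     restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=5),
]
# model = build_classifier()
# model.fit(F_train, y_train, validation_data=(F_val, y_val),
#           epochs=200, batch_size=256, callbacks=callbacks)
```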
Additionally, feature selection is performed by applying Recursive Feature Elimination with
Cross-Validation (RFECV) from the sklearn package while iteratively training a classifier of
type GradientBoostingClassifier on combinations of features with elimination factor
equal to 1. As a result the selected feature set excludes 10 features without significant classifi-
cation accuracy loss.
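A sketch of this selection step with sklearn is shown below; the cross-validation fold count and scoring metric are assumptions, the elimination factor of 1 is mapped to the step parameter of RFECV, and F, y are hypothetical names for the feature matrix and class labels.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

selector = RFECV(estimator=GradientBoostingClassifier(),
                 step=1,               # eliminate one feature per iteration
                 cv=5,                 # assumed number of cross-validation folds
                 scoring="accuracy")
# selector.fit(F, y)                   # F: (n_utterances, 31), y: class labels
# selected_mask = selector.support_    # boolean mask of the 21 retained features
```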
4.3. Evaluation Results
The results of feature subset evaluation are presented in Table 1. It is evident from the table
that directional (TDOA) 𝐅𝐷 features alone cannot distinguish between the 3 classes defined in
Subsection 4.2 beyond the margin of random choice. Furthermore, for the first two classification
problems the recall of class 3 (bg) is the lowest, which means that applying just TDOA features
does not provide near vs far speaker separation. Applying qualitative features, on the other
hand, significantly improves classification quality for all three classification tasks. The applica-
tion of beamformed data improves the quality just slightly and, therefore, may be superfluous
for limited resource solutions. Classification on the selected feature set, which incorporates 21
of 31 features (reduction by 32%), is on par with the accuracy over the entire feature set. This
implies that individual features from all regarded subsets are redundant for the classification task
at hand. However, this may be the case only for this specific dataset.
Generally, the discussed directional and qualitative features seem to be applicable for the
task of distinguishing between the participating speakers and background speakers. Qualita-
tive features significantly improve classification accuracy. The established feature set can be
considered for further investigation in combination with biometric and other types of features.
5. Conclusion
The paper regarded a set of directional and qualitative signal features for the task of speaker
diarization. The discussed feature set is shown to be applicable to the task of speaker utterance
classification for the distant speech processing case in office and babble noise conditions. The
feature set can be considered for further investigation in combination with biometric and
other types of features.
Acknowledgments
This research was financially supported by the Foundation NTI (Contract 20/18gr,
ID 0000000007418QR20002) and by the Government of the Russian Federation (Grant 08-08).
References
[1] A. Nedoluzhko, O. Bojar, Towards automatic minuting of the meetings, in: 19th Confer-
ence ITAT 2019: Slovenskocesky NLP workshop (SloNLP 2019), CreateSpace Independent
Publishing Platform, 2019, pp. 112–119.
[2] C. Bokhove, C. Downey, Automated generation of ”good enough” transcripts as a first
step to transcription of audio-recorded data, Methodological Innovations 11 (2018).
[3] T. Hain, J. Dines, G. Garau, M. Karafiát, D. Moore, V. Wan, R. Ordelman, S. Renals, Tran-
scription of conference room meetings: an investigation, in: INTERSPEECH, 2005.
[4] T. Yoshioka, Z. Chen, D. Dimitriadis, W. Hinthorn, X. Huang, A. Stolcke, M. Zeng, Meeting
transcription using virtual microphone arrays, in: ArXiv, 2019.
[5] H. Aronowitz, W. Zhu, M. Suzuki, G. Kurata, R. Hoory, New advances in speaker diariza-
tion, in: INTERSPEECH, 2020.
[6] Y. Huang, Y. Gong, Acoustic model adaptation for presentation transcription and intelli-
gent meeting assistant systems, in: ICASSP 2020, IEEE, 2020.
[7] F. Filippidou, L. Moussiades, Alpha benchmarking of IBM, Google and Wit automatic
speech recognition systems, in: I. Maglogiannis, L. Iliadis, E. Pimenidis (Eds.), Artificial
Intelligence Applications and Innovations, Springer International Publishing, Cham, 2020,
pp. 73–82.
[8] A. Stolcke, J. Droppo, Comparing human and machine errors in conversational speech
transcription, ArXiv (2017).
[9] T. Yoshioka, I. Abramovski, C. Aksoylar, Z. Chen, M. David, D. Dimitriadis, Y. Gong,
I. Gurvich, X. Huang, Y. Huang, A. Hurvitz, L. Jiang, S. Koubi, E. Krupka, I. Leichter, C. Liu,
P. Parthasarathy, A. Vinnikov, L. Wu, X. Xiao, W. Xiong, H. Wang, Z. Wang, J. Zhang,
Y. Zhao, T. Zhou, Advances in online audio-visual meeting transcription, in: 2019 IEEE
Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 276–283.
[10] S. Horiguchi, Y. Fujita, K. Nagamatsu, Utterance-wise meeting transcription system using
asynchronous distributed microphones, in: INTERSPEECH, 2020.
[11] N. Kanda, S. Horiguchi, R. Takashima, Y. Fujita, K. Nagamatsu, S. Watanabe, Auxiliary
interference speaker loss for target-speaker speech recognition, in: INTERSPEECH, 2019.
[12] S. Astapov, D. Popov, V. Kabarov, Directional clustering with polyharmonic phase esti-
mation for enhanced speaker localization, in: A. Karpov, R. Potapova (Eds.), Speech and
Computer, Springer International Publishing, Cham, 2020, pp. 45–56.
[13] K. Zmolikova, M. Delcroix, K. Kinoshita, T. Higuchi, A. Ogawa, T. Nakatani, Speaker-
aware neural network based beamformer for speaker extraction in speech mixtures, in:
INTERSPEECH, 2017.
[14] I. Medennikov, M. Korenevsky, T. Prisyach, Y. Khokhlov, M. Korenevskaya, I. Sorokin,
T. Timofeeva, A. Mitrofanov, A. Andrusenko, I. Podluzhny, A. Laptev, A. Romanenko,
Target-speaker voice activity detection: A novel approach for multi-speaker diarization
in a dinner party scenario, in: Interspeech 2020, 2020, pp. 274–278.
[15] Y. Fujita, N. Kanda, S. Horiguchi, Y. Xue, K. Nagamatsu, S. Watanabe, End-to-end neural
speaker diarization with self-attention, 2019 IEEE Automatic Speech Recognition and
Understanding Workshop (ASRU) (2019) 296–303.
[16] H. R. Muckenhirn, I. L. Moreno, J. Hershey, K. Wilson, P. Sridhar, Q. Wang, R. A. Saurous,
R. Weiss, Y. Jia, Z. Wu, Voicefilter: Targeted voice separation by speaker-conditioned
spectrogram masking, in: ICASSP 2019, 2018.
[17] M. Delcroix, K. Zmolikova, K. Kinoshita, A. Ogawa, T. Nakatani, Single channel target
speaker extraction and recognition with speaker beam, in: 2018 IEEE International Con-
ference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5554–5558.
[18] M. Brandstein, D. Ward, Microphone Arrays: Signal Processing Techniques and Applica-
tions, Digital Signal Processing - Springer-Verlag, Springer, 2001.
[19] M. Wolf, C. Nadeu, Channel selection measures for multi-microphone speech recognition,
Speech Communication 57 (2014) 170 – 180.
[20] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic
word recognition in continuously spoken sentences, IEEE Transactions on Acoustics,
Speech, and Signal Processing 28 (1980) 357–366.
[21] C. Guerrero Flores, G. Tryfou, M. Omologo, Cepstral distance based channel selection for
distant speech recognition, Computer Speech & Language 47 (2018) 314 – 332.
[22] G. Lavrentyeva, M. Volkova, A. Avdeeva, S. Novoselov, A. Gorlanov, T. Andzhukaev,
A. Ivanov, A. Kozlov, Blind speech signal quality estimation for speaker verification sys-
tems, in: Proc. Interspeech 2020, 2020, pp. 1535–1539.