<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CNN-Based Audio Recognition in Open-set Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hitham Jleed</string-name>
          <email>h.jleed@ieee.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Bouchard</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Electrical Engineering and Computer Science, University of Ottawa</institution>
          ,
          <addr-line>Ottawa</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a method that uses convolutional neural networks (CNNs) for audio recognition in an open-set scenario. In such a scenario, test sounds often fall outside the training data distribution, which necessitates a model that can recognize the known classes while rejecting the unknown ones. We propose a convolutional approach for recognizing audio events that addresses open-set recognition by adding the inclusion probabilities of extreme value machines. Extensive experiments show that the proposed method outperforms representative existing methods under the open-set regime.</p>
      </abstract>
      <kwd-group>
        <kwd>Audio Recognition</kwd>
        <kwd>Open-set Recognition</kwd>
        <kwd>Convolutional Neural Networks</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        CNNs with the SoftMax activation function have been used in closed-set recognition tasks and
have demonstrated outstanding performance in many applications in the literature, such as speech sound
recognition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], audio source identification [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], and environmental audio recognition [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. However,
traditional closed-set recognition methods have no way of rejecting data from previously unknown
classes. Open-set recognition has attracted growing interest among image recognition
and computer vision researchers for both deep and shallow classifiers. Some non-deep approaches were
investigated in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for open-set image recognition. The work was then expanded in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] by proposing
the 1-vs-set machine approach to improve the robustness of image recognition. A Weibull-calibrated
SVM was introduced in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for open-set image recognition. It was built for minimizing the empirical
error and open space risk. An open-set algorithm called PSR-SVM was proposed in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] to compute the
posterior probability distribution for all classifier outputs, using a confidence measurement to determine
whether a certain event belongs to a specific group of predefined events or not. Similarly, for radar
images, an automatic target recognition method was published in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where open-set recognition was applied
to high range resolution radar data. For the
deep open-set problem, a deep CNN for face recognition produced some promising results in [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
Gutoski et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] introduced a human action recognition system using a 3D CNN that rejects inputs
belonging to unknown classes. A deep CNN for environmental sound recognition was proposed in [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]
but it did not perform open-set recognition well. In this paper, we adapt the extreme value machine (EVM) and
meta-recognition [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to the SoftMax activation layer. This method measures the probability that a sample
belongs to each known class and detects potential novel classes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Preprocessing</title>
      <p>
        The proposed CNN architecture includes several convolutional layers [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The convolutional layers
extract higher-level features for the final classification. We used 2D CNNs since they can capture the
spatiotemporal information of the signal [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The output of the last convolutional layer is flattened
into a 1D vector. This automatically generated feature vector is
the result of this phase.
      </p>
      <p>
        Each audio signal is composed of different frequencies and different energy amplitudes, with quick
variations within a short time. There is a need to define and represent audio signals such that a robust
recognition system can be built. We have chosen the log-Mel spectrogram since it has proven to be
suitable to model the human auditory system and is used in many speech and audio recognition tasks
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. The Mel spectrogram is a spectrogram whose frequency axis is converted to the Mel scale. It is computed
by applying a set of overlapping triangular filters to measure the energy in each spectral band. Audio features
are obtained by computing 64 log-Mel bins with a window length of 1024 samples and a hop size of 500 samples
at a 44.1 kHz sampling rate.
      </p>
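      <p>As an illustration of the feature extraction described above, the following Python sketch computes a log-Mel spectrogram with the stated parameters (64 Mel bins, a 1024-sample window, a 500-sample hop, and a 44.1 kHz sampling rate). It is a minimal NumPy sketch under stated assumptions, not the implementation used in the experiments: the Hann window, the HTK-style Mel formula, the small constant added before the logarithm, and all function names are choices made only for this example.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

SR = 44100      # sampling rate from the text (44.1 kHz)
N_FFT = 1024    # window length in samples
HOP = 500       # hop size in samples
N_MELS = 64     # number of log-Mel bins


def hz_to_mel(f):
    # HTK-style Mel formula (assumption: the paper does not name a variant)
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def mel_filterbank(sr=SR, n_fft=N_FFT, n_mels=N_MELS):
    """Overlapping triangular filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, equally spaced on the Mel scale
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb


def log_mel_spectrogram(signal, sr=SR, n_fft=N_FFT, hop=HOP, n_mels=N_MELS):
    """Return a log-Mel spectrogram of shape (n_mels, n_frames)."""
    window = np.hanning(n_fft)  # Hann window (assumption)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2     # (n_frames, n_fft // 2 + 1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T    # (n_frames, n_mels)
    return np.log(mel + 1e-10).T                         # small offset avoids log(0)


if __name__ == "__main__":
    t = np.arange(SR) / SR                    # one second of audio
    tone = np.sin(2 * np.pi * 1000.0 * t)     # 1 kHz test tone
    print(log_mel_spectrogram(tone).shape)
```
]]></preformat>
      <p>For one second of audio at 44.1 kHz, these settings give 1 + (44100 - 1024) // 500 = 87 frames, so the resulting 2D input array for that clip has 64 rows and 87 columns.</p>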
    </sec>
    <sec id="sec-5">
      <title>3.2. Network Architectures</title>
      <p>
        The architecture block of the CNN is composed of convolution layers and pooling layers. A
convolution layer applies filters to the input then takes the inner product and adds the bias. Each filter
has its own bias and weights. A pooling layer reduces the dimensionality of the subsequent layers. It is
applied to each convolution feature map independently. Please refer to [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for a more detailed
description. The input features (2D Mel spectrogram arrays) are organized to be fed into the CNN,
each representing a small window of the input audio signal for training or testing. Rectified
Linear Unit (ReLU) activation functions are used in each convolutional layer to introduce
nonlinearity into the feature maps.
      </p>
      <p>As more convolution and pooling operations are applied, the resolution of the higher-level feature
maps decreases. Before feeding the features to the output layer, they must be integrated across all
frequency bands. On top of the last CNN layer, fully connected hidden layers are formed. The SoftMax
function is used as the activation of the output layer to predict probabilities over the class labels. Each
number in the SoftMax output is interpreted as the probability of belonging to the corresponding class.
However, SoftMax does not work well for open-set recognition, so we propose to replace it with the
EVM to determine the probability of the output for each class.</p>
      <p>For closed-set recognition, let us assume the known classes {C<sub>1</sub>, C<sub>2</sub>, ..., C<sub>N</sub>}, where N is the number
of known classes. The final layer has the same size as the number of known classes. We denote the
representation of this final network layer as y = f(x), where f denotes the network as a function.
When an audio data point x arrives, the SoftMax function to label this sound is defined as follows:
        <disp-formula><tex-math>p(C_i \mid x, x \in N) = \mathrm{SoftMax}_i(y) = \frac{\exp(y_i)}{\sum_{j=1}^{N} \exp(y_j)}</tex-math></disp-formula>
      </p>
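      <p>To make the closed-set behaviour of the SoftMax function concrete, the short Python sketch below evaluates it on two hypothetical activation vectors, one with clear evidence for a class and one with weak evidence for every class. The activation values and the function name are assumptions chosen only for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def softmax(y):
    """Map a list of final-layer activations y to class probabilities."""
    shift = max(y)                            # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in y]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    clear = softmax([4.0, 1.0, 0.5])   # strong evidence for the first class
    vague = softmax([0.2, 0.1, 0.0])   # weak evidence for every class
    print(clear, sum(clear))
    print(vague, sum(vague))
```
]]></preformat>
      <p>In both cases the outputs sum to one, so even weak evidence is distributed entirely among the known classes. This is why a plain SoftMax layer has no way to reject a novel sound.</p>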
      <p>
        The SoftMax function assigns a probability to each training class, and the class with the maximum
SoftMax probability is taken as the label, which is suitable for optimizing the deep network in closed-set
recognition. In open-set settings, we also need to consider x ∉ N, where the class C<sub>N+1</sub> corresponds to a novel
class. The crucial step is to find a suitable threshold value for separating known and unknown classes.
Some previous works such as [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] used test data distributions and threshold values. In this
work, we set the threshold by calibrating the activation vector with the inclusion probabilities of each
class; extreme value theory indicates that the Weibull family of distributions is suitable for this
purpose [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ]. To build a matched score distribution at training time, the distance between each
training sample of a given class and the associated class mean is calculated using a distance
function such as the Euclidean, hybrid, or cosine distance. Then, a Weibull distribution is fitted to
the tail of the matched distribution. We used the libmr library [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] to compute the parameters of the
Weibull distribution, with hyperparameter values taken as suggested in [19].
      </p>
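      <p>The calibration idea described above can be sketched in plain Python as follows. This is not the libmr implementation and makes several simplifying assumptions: it uses the Euclidean distance only, fits a two-parameter Weibull distribution by maximum likelihood with no shifting or flipping of the tail, takes a hypothetical tail size of 20, and treats one minus the Weibull CDF of the distance as the inclusion probability. All function names are illustrative.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def fit_weibull(samples):
    """Maximum-likelihood fit of a two-parameter Weibull; returns (shape, scale)."""
    top = max(samples)
    z = [s / top for s in samples]        # rescaling does not change the shape estimate
    logs = [math.log(v) for v in z]
    mean_log = sum(logs) / len(z)

    def g(k):                             # the ML shape parameter is the root of g
        powers = [v ** k for v in z]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while g(hi) < 0.0 and hi < 1e6:       # expand the bracket until the sign changes
        hi *= 2.0
    for _ in range(100):                  # bisection on the shape parameter
        mid = 0.5 * (lo + hi)
        if g(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(v ** shape for v in z) / len(z)) ** (1.0 / shape)
    return shape, scale


def fit_class_tail(activations, tail_size=20):
    """Fit a Weibull to the largest distances between one class's training
    activation vectors and their mean; returns (mean, shape, scale)."""
    dim = len(activations[0])
    mean = [sum(vec[d] for vec in activations) / len(activations) for d in range(dim)]
    dists = sorted(euclidean(vec, mean) for vec in activations)
    shape, scale = fit_weibull(dists[-tail_size:])
    return mean, shape, scale


def inclusion_probability(vec, mean, shape, scale):
    """One minus the Weibull CDF of the distance to the class mean."""
    d = euclidean(vec, mean)
    return math.exp(-((d / scale) ** shape))


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 2-D "activation vectors" for one known class (illustration only)
    train = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)]
    mean, shape, scale = fit_class_tail(train)
    print(inclusion_probability([0.1, 0.0], mean, shape, scale))    # near the class mean
    print(inclusion_probability([10.0, 10.0], mean, shape, scale))  # far from the class mean
```
]]></preformat>
      <p>Under these assumptions, a sample would be rejected as unknown when its inclusion probability falls below the chosen threshold for every known class, and otherwise assigned to the class with the highest inclusion probability.</p>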
    </sec>
    <sec id="sec-6">
      <title>4. Evaluation Metrics</title>
      <p>We assess the effectiveness of our proposed algorithm by aligning the
recognition outputs with a reference ground truth and comparing them. The evaluation uses cross-validation, which allows
the accuracy to be evaluated on data that are not part of the training dataset. Two evaluation
approaches have been used in the DCASE/AASP challenges [20] to evaluate how individual audio
sounds are classified:</p>
      <list list-type="bullet">
        <list-item><p>Segment-based evaluation: the system output and ground truth are compared in each segment of
a fixed length.</p></list-item>
        <list-item><p>Event-based evaluation: the system output is considered the same over the whole duration of
the event. This means that event labels in the recognition output are compared to the ground
truth events.</p></list-item>
      </list>
      <p>Let us consider a binary classification, where the labels consist only of positives or negatives. Based
on true labels and predicted labels, we divide the metrics into four intermediate statistics: true-positive
(TP), false-positive (FP), true-negative (TN), and false-negative (FN). A count is made for each category.
Applying this to a multi-class problem, every single classifier that produces a “positive” or “negative”
prediction can be “true” or “false” depending on the corresponding ground-truth label.</p>
    </sec>
    <sec id="sec-7">
      <title>4.1. Recognition Accuracy</title>
      <p>The recognition accuracy (RA) can be described as the ratio of the correctly labeled predictions to the
whole pool:
        <disp-formula><tex-math>\mathrm{RA} = \frac{TP + TN}{TP + FP + FN + TN}</tex-math></disp-formula>
      </p>
    </sec>
    <sec id="sec-8">
      <title>4.2. Precision And Recall</title>
      <p>Precision (Pr) is the ratio of correctly predicted positive samples to
all predicted positive samples, while recall (Re) is the ratio of correctly predicted positive samples
to all ground-truth positive samples (labels). For multi-class classification with N classes, there are two ways
of computing them: macro-averaging and micro-averaging [21].</p>
      <p>
        <disp-formula><tex-math>\mathrm{Pr}_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Pr}_{micro} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FP_i)}</tex-math></disp-formula>
      </p>
      <p>
        <disp-formula><tex-math>\mathrm{Re}_{macro} = \frac{1}{N} \sum_{i=1}^{N} \frac{TP_i}{TP_i + FN_i}, \qquad \mathrm{Re}_{micro} = \frac{\sum_{i=1}^{N} TP_i}{\sum_{i=1}^{N} (TP_i + FN_i)}</tex-math></disp-formula>
      </p>
    </sec>
    <sec id="sec-9">
      <title>4.3. F1-measure</title>
      <p>The F1-measure merges precision and recall into a single score, computed as the
harmonic mean of the two:
        <disp-formula><tex-math>F_1\text{-measure} = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}</tex-math></disp-formula>
      </p>
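      <p>As a worked example of macro- and micro-averaged precision and recall and of the F1-measure as their harmonic mean, the Python sketch below computes all of them from per-class counts. The counts are made up for illustration, the function names are hypothetical, and two conventions are assumed: an undefined ratio counts as zero, and the macro F1 is taken from the macro-averaged precision and recall.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def ratio(num, den):
    # Assumed convention: an undefined ratio (zero denominator) counts as 0
    return num / den if den else 0.0


def f1(pr, re):
    """Harmonic mean of precision and recall."""
    return ratio(2.0 * pr * re, pr + re)


def averaged_scores(tp, fp, fn):
    """Macro and micro (precision, recall, F1) from per-class TP, FP, FN counts."""
    n = len(tp)
    pr_macro = sum(ratio(t, t + p) for t, p in zip(tp, fp)) / n
    re_macro = sum(ratio(t, t + f) for t, f in zip(tp, fn)) / n
    pr_micro = ratio(sum(tp), sum(tp) + sum(fp))
    re_micro = ratio(sum(tp), sum(tp) + sum(fn))
    return {
        "macro": (pr_macro, re_macro, f1(pr_macro, re_macro)),
        "micro": (pr_micro, re_micro, f1(pr_micro, re_micro)),
    }


if __name__ == "__main__":
    # Illustrative counts for three classes
    scores = averaged_scores(tp=[8, 5, 2], fp=[2, 5, 0], fn=[0, 3, 2])
    for name, (pr, re, f) in scores.items():
        print(name, round(pr, 4), round(re, 4), round(f, 4))
```
]]></preformat>
      <p>Macro-averaging gives every class the same weight, whereas micro-averaging pools the counts first, so frequent classes dominate the micro scores.</p>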
    </sec>
    <sec id="sec-10">
      <title>4.4. Confusion Matrix</title>
      <p>The confusion matrix CM(i, j) summarizes the performance of multi-class recognition. It depicts the
various ways in which the classification model gets confused when making predictions. Each column of
the matrix represents a real class, whereas each row represents a predicted class. The diagonal of this
matrix (i = j) contains the correct predictions.</p>
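      <p>The following Python sketch builds a confusion matrix with the orientation used here (rows are predicted classes, columns are real classes) and derives the per-class TP, FP, and FN counts from it. The labels in the example are made up, and the function names are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def confusion_matrix(true_labels, predicted_labels, n_classes):
    """cm[i][j] counts samples predicted as class i whose real class is j."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        cm[p][t] += 1
    return cm


def per_class_counts(cm):
    """Return per-class (TP, FP, FN) lists derived from the confusion matrix."""
    n = len(cm)
    tp = [cm[i][i] for i in range(n)]
    # Row i without the diagonal: predicted as i but really another class
    fp = [sum(cm[i]) - cm[i][i] for i in range(n)]
    # Column i without the diagonal: really i but predicted as another class
    fn = [sum(cm[r][i] for r in range(n)) - cm[i][i] for i in range(n)]
    return tp, fp, fn


def overall_accuracy(cm):
    """Share of samples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


if __name__ == "__main__":
    true = [0, 0, 1, 1, 2, 2]
    pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(true, pred, 3)
    for row in cm:
        print(row)
    print(per_class_counts(cm), overall_accuracy(cm))
```
]]></preformat>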
    </sec>
    <sec id="sec-11">
      <title>5. Experiments</title>
      <p>Extensive experiments were conducted using Python. A Keras [22] implementation of CNNs was
used, with TensorFlow [23] as the backend. First, we carried out closed-set recognition experiments
where the audio dataset is separated into training and testing datasets.</p>
      <p>To model the classifiers, we applied 5-fold cross-validation, where in each fold 80% of the data
is used as the training dataset, while the remaining 20% is used for testing. The experiments
were conducted on the DCASE2016 dataset. This dataset consists of audio recorded in everyday life and
includes 11 sound classes that were recorded in an office environment: clearing throat, coughing,
speech, drawer, keyboard, keys drop, knock, laughter, page turning, phone ringing, and door slam.</p>
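      <p>The 80%/20% split per fold can be sketched in Python as follows. This is a generic illustration of 5-fold cross-validation on sample indices, not the exact splitting procedure used in the experiments; the shuffling seed and the function name are assumptions.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed for repeatability
    folds = [indices[i::k] for i in range(k)]   # k nearly equal-sized parts
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for train, test in k_fold_indices(100):
        print(len(train), len(test))
```
]]></preformat>
      <p>Each sample appears in the test set of exactly one fold, so every recording is evaluated once by a model that did not see it during training.</p>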
    </sec>
    <sec id="sec-12">
      <title>5.1. Closed-set Recognition</title>
      <p>The classification output is evaluated as correct or not according to the ground truth. We did not
perform comparisons with other algorithms in this part, because the experiments in the closed-set regime
aim to evaluate the ability of the algorithm to differentiate among known classes. The comparison
is conducted in the open-set recognition part. As can be seen from Fig. 1, the event-based
confusion matrix shows that most of the classes were recognized very well, except for the door-slam
and phone-ring classes, whose accuracies were 60% and 70%, respectively. Fig. 2 shows the confusion
matrix for frame-based recognition. The right column gives the percentage accuracy of
each class. Most of the classes were recognized correctly; the misclassifications that can be
observed arise because the similarities among the affected classes are high. As expected, the sound class has a great impact
on the results. For example, the door-slam class was the hardest class to recognize,
probably because of the short duration of such sounds.</p>
      <p>[Fig. 3: label sets, with panel labels Known (Y<sub>k</sub>) and Target (Y<sub>t</sub>)]</p>
    </sec>
    <sec id="sec-13">
      <title>5.2. Open-set Recognition</title>
      <p>The experiments in this section were performed to recognize audio sounds where the testing set also
includes classes that may not be part of the training dataset. These experiments measure the capability
to discriminate known classes from novel classes and to discriminate known classes from one another.
The level of openness for a classification task can be defined as:</p>
      <p>
        <disp-formula><tex-math>\mathrm{Openness} = 1 - \sqrt{\frac{2 \times |Y_t|}{|Y_k| + |Y_u|}}</tex-math></disp-formula>
      </p>
      <p>where the subscripts k, t, and u denote the known, target, and unknown label sets, as defined in Fig.
3. The testing dataset is defined as Y<sub>test</sub> = Y<sub>k</sub> ∪ Y<sub>t</sub> ∪ Y<sub>u</sub>, while the training dataset is the combination
Y<sub>train</sub> = Y<sub>t</sub> ∪ Y<sub>k</sub>, and the unknown classes are its complement Y<sub>u</sub> = { y | y ∉ Y<sub>train</sub> and y ∈ Y<sub>test</sub> }. If we set Y<sub>k</sub> = Y<sub>t</sub>,
this yields Y<sub>train</sub> = Y<sub>k</sub>.</p>
      <p>
        We used varying degrees of openness and followed k-fold cross-validation to obtain robust evaluation
metrics. Across several evaluations, we examined how well our proposed algorithm
performs in comparison to other representative algorithms: W-SVM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], IOmSVM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], OSNN [24],
and OSmIL [25].
      </p>
      <p>The parameters of all previous algorithms were set according to the corresponding papers. To ensure
a fair comparison, all of the algorithms were run on the same dataset and the same distribution of classes.</p>
      <p>The experimental results are depicted in Fig. 4 for event-based recognition and in Fig. 5 for
frame-based recognition. There are considerable differences in performance among the
different methods. In general, all the methods suffer from a performance
decrease as the openness increases. However, our proposed algorithm performs relatively well compared
to the other methods, in terms of both detecting the novel classes and discriminating among the known
classes. The OSmIL algorithm had the worst performance. As expected, event-based
recognition outperforms frame-based recognition.</p>
    </sec>
    <sec id="sec-14">
      <title>6. Conclusion</title>
      <p>In this work, we presented a CNN architecture that is efficient for robust open-set audio
recognition. Extensive experiments were conducted on distinguishing between known and unknown audio classes. Overall, our
proposed method outperformed representative previous work across a wide range of openness
levels. For future work, more research should be done to see how well the proposed CNN performs on
large real-world audio datasets. Experiments and algorithmic modifications for incremental learning
should also be performed.</p>
    </sec>
    <sec id="sec-15">
      <title>7. References</title>
      <p>[19] A. Bendale and T. E. Boult, “Towards open set deep networks,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 1563–1572. doi: 10.1109/CVPR.2016.173.</p>
      <p>[20] A. Diment, T. Heittola, and T. Virtanen, “Sound event detection for office live and office synthetic AASP challenge,” Proc IEEE AASP Chall. Detect. Classif Acoust Scenes Events WASPAA, 2013, Accessed: Nov. 11, 2016. [Online]. Available: http://c4dm.eecs.qmul.ac.uk/sceneseventschallenge/abstracts/OL/DHV.pdf</p>
      <p>[21] K. Zhang, H. Su, and Y. Dou, “Beyond AP: a new evaluation index for multiclass classification task accuracy,” Appl. Intell., vol. 51, no. 10, pp. 7166–7176, Oct. 2021, doi: 10.1007/s10489-021-02223-7.</p>
      <p>[22] J. Moolayil, Learn Keras for Deep Neural Networks: A Fast-Track Approach to Modern Deep Learning with Python. Berkeley, CA: Apress, 2019. doi: 10.1007/978-1-4842-4240-7.</p>
      <p>[23] M. Abadi et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” ArXiv Prepr. ArXiv160304467, 2016.</p>
      <p>[24] P. R. Mendes Júnior et al., “Nearest neighbors distance ratio open-set classifier,” Mach. Learn., vol. 106, no. 3, pp. 359–386, Mar. 2017, doi: 10.1007/s10994-016-5610-8.</p>
      <p>[25] S. Dang, Z. Cao, Z. Cui, Y. Pi, and N. Liu, “Open set incremental learning for automatic target recognition,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4445–4456, Jul. 2019, doi: 10.1109/TGRS.2019.2891266.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Marsland</surname>
          </string-name>
          ,
          <source>Machine Learning: An Algorithmic Perspective</source>
          ,
          <edition>Second Edition</edition>
          . Chapman and Hall/CRC,
          <year>2014</year>
          . doi:
          <pub-id pub-id-type="doi">10.1201/b17476</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Abdel-Hamid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          , G. Penn, and
          <string-name>
            <given-names>D.</given-names>
            <surname>Yu</surname>
          </string-name>
          , “
          <article-title>Convolutional neural networks for speech recognition</article-title>
          ,”
          <source>IEEE/ACM Trans. Audio Speech Lang. Process.</source>
          , vol.
          <volume>22</volume>
          , no.
          <issue>10</issue>
          , pp.
          <fpage>1533</fpage>
          -
          <lpage>1545</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Grais</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wierstorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ward</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Plumbley</surname>
          </string-name>
          ,
          <article-title>“Multi-resolution fully convolutional neural networks for monaural audio source separation</article-title>
          ,
          <source>” in International Conference on Latent Variable Analysis and Signal Separation</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>340</fpage>
          -
          <lpage>350</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Das</surname>
          </string-name>
          , “
          <article-title>Very Deep Convolutional Neural Networks for Raw Waveforms</article-title>
          ,” ArXiv161000087 Cs, Oct.
          <year>2016</year>
          , Accessed: Feb.
          <volume>28</volume>
          ,
          <year>2022</year>
          . [Online]. Available: http://arxiv.org/abs/1610.00087
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D. R.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sapkota</surname>
          </string-name>
          , and T. E. Boult, “
          <article-title>Toward open set recognition,”</article-title>
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell., vol.
          <volume>35</volume>
          , no.
          <issue>7</issue>
          , pp.
          <fpage>1757</fpage>
          -
          <lpage>1772</lpage>
          , Jul.
          <year>2013</year>
          , doi: 10.1109/TPAMI.
          <year>2012</year>
          .
          <volume>256</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Jain</surname>
          </string-name>
          , and T. E. Boult, “
          <article-title>Probability models for open set recognition,”</article-title>
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell., vol.
          <volume>36</volume>
          , no.
          <issue>11</issue>
          , pp.
          <fpage>2317</fpage>
          -
          <lpage>2324</lpage>
          , Nov.
          <year>2014</year>
          , doi: 10.1109/TPAMI.
          <year>2014</year>
          .
          <volume>2321392</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          , and T. E. Boult,
          “
          <article-title>Multi-class open set recognition using probability of inclusion</article-title>
          ,” in
          <source>Computer Vision - ECCV 2014</source>
          , vol.
          <volume>8691</volume>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pajdla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          , and T. Tuytelaars, Eds. Cham: Springer International Publishing,
          <year>2014</year>
          , pp.
          <fpage>393</fpage>
          -
          <lpage>409</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-319-10578-9_26</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Jleed</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Bouchard</surname>
          </string-name>
          , “
          <article-title>Open set audio recognition for multi-class classification with rejection</article-title>
          ,
          <source>” IEEE Access</source>
          , vol.
          <volume>8</volume>
          , pp.
          <fpage>146523</fpage>
          -
          <lpage>146534</lpage>
          ,
          <year>2020</year>
          , doi: 10.1109/ACCESS.
          <year>2020</year>
          .
          <volume>3015227</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Roos</surname>
          </string-name>
          and
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Shaw</surname>
          </string-name>
          , “
          <article-title>Probabilistic SVM for open set automatic target recognition on high range resolution radar data</article-title>
          ,” Anaheim, California, United States, May
          <year>2017</year>
          , p. 102020B. doi:
          <pub-id pub-id-type="doi">10.1117/12.2262840</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O. M.</given-names>
            <surname>Parkhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vedaldi</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          , “
          <article-title>Deep face recognition</article-title>
          ,”
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gutoski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Lazzaretti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Lopes</surname>
          </string-name>
          , “
          <article-title>Deep metric learning for open-set human action recognition in videos</article-title>
          ,”
          <source>Neural Comput. Appl.</source>
          , Jun.
          <year>2020</year>
          , doi: 10.1007/s00521-020-05009-z.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mushtaq</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.-F.</given-names>
            <surname>Su</surname>
          </string-name>
          , “
          <article-title>Environmental sound classification using a regularized deep convolutional neural network with data augmentation</article-title>
          ,
          <source>” Appl. Acoust.</source>
          , vol.
          <volume>167</volume>
          , p.
          <fpage>107389</fpage>
          , Oct.
          <year>2020</year>
          , doi: 10.1016/j.apacoust.
          <year>2020</year>
          .
          <volume>107389</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rocha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. J.</given-names>
            <surname>Micheals</surname>
          </string-name>
          , and T. E. Boult, “
          <article-title>Meta-Recognition: The Theory and Practice of Recognition Score Analysis,”</article-title>
          <source>IEEE Trans. Pattern Anal. Mach</source>
          . Intell., vol.
          <volume>33</volume>
          , no.
          <issue>8</issue>
          , pp.
          <fpage>1689</fpage>
          -
          <lpage>1695</lpage>
          , Aug.
          <year>2011</year>
          , doi: 10.1109/TPAMI.
          <year>2011</year>
          .
          <volume>54</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>N. K.</given-names>
            <surname>Manaswi</surname>
          </string-name>
          ,
          <article-title>Deep Learning with Applications Using Python</article-title>
          . Berkeley, CA: Apress,
          <year>2018</year>
          , doi: 10.1007/978-1-4842-3516-4.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hershey</surname>
          </string-name>
          et al., “
          <article-title>CNN Architectures for Large-Scale Audio Classification</article-title>
          ,”
          <source>arXiv:1609.09430 [cs, stat]</source>
          , Jan.
          <year>2017</year>
          , Accessed: Mar. 01, 2022. [Online]. Available: http://arxiv.org/abs/1609.09430
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>H.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Yuan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          , “
          <article-title>Speech Emotion Recognition From 3D Log-Mel Spectrograms With Deep Learning Network</article-title>
          ,”
          <source>IEEE Access</source>
          , vol.
          <volume>7</volume>
          , pp.
          <fpage>125868</fpage>
          -
          <lpage>125881</lpage>
          ,
          <year>2019</year>
          , doi: 10.1109/ACCESS.2019.2938007.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. L. S.</given-names>
            <surname>Machado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Nogueira</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J. A.</given-names>
            <surname>dos Santos</surname>
          </string-name>
          , “
          <article-title>Fully convolutional open set segmentation</article-title>
          ,”
          <source>Mach. Learn.</source>
          , Jul.
          <year>2021</year>
          , doi: 10.1007/s10994-021-06027-1.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E. M.</given-names>
            <surname>Rudd</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. P.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Scheirer</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Boult</surname>
          </string-name>
          , “
          <article-title>The Extreme Value Machine</article-title>
          ,”
          <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>
          , vol.
          <volume>40</volume>
          , no.
          <issue>3</issue>
          , pp.
          <fpage>762</fpage>
          -
          <lpage>768</lpage>
          , Mar.
          <year>2018</year>
          , doi: 10.1109/TPAMI.2017.2707495.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>