=Paper=
{{Paper
|id=Vol-2696/paper_188
|storemode=property
|title=Bird Species Recognition via Neural Architecture Search
|pdfUrl=https://ceur-ws.org/Vol-2696/paper_188.pdf
|volume=Vol-2696
|authors=Markus Mühling,Jakob Franz,Nikolaus Korfhage,Bernd Freisleben
|dblpUrl=https://dblp.org/rec/conf/clef/MuhlingFKF20
}}
==Bird Species Recognition via Neural Architecture Search==
Markus Mühling, Jakob Franz, Nikolaus Korfhage, and Bernd Freisleben
Department of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Straße 6, D-35032 Marburg, Germany
{muehling,franz,korfhage,freisleb}@informatik.uni-marburg.de

Abstract. This paper presents the winning approach of the BirdCLEF 2020 challenge. The challenge is to automatically recognize bird sounds in continuous soundscapes. In our approach, a deep convolutional neural network model is used that directly operates on the audio data. The neural network architecture is based on a neural architecture search and contains multiple auxiliary heads and recurrent layers. During the training process, scheduled drop path is used as a regularization method, and extensive data augmentation is applied to the audio input. Furthermore, species location lists are used in the post-processing step to reject unlikely classes. Our best run on the test set obtains a classification mean average precision score (cmap) of 13.1% and a retrieval mean average precision score (rmap) of 19.2%.

Keywords: Bird Species Recognition · BirdCLEF 2020 · Neural Architecture Search · Gabor Wavelet Layer

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

Automatically identifying bird species in continuous sound recordings is an important task for monitoring the populations of different bird species in forest ecosystems. In the BirdCLEF 2020 challenge [4], which is part of the LifeCLEF challenge [3], participants have to identify 960 bird species in 5-second snippets of continuous audio recordings. The test data contains 153 soundscapes of 10 minutes length recorded at four different locations in Peru (Concesion de Ecoturismo Inka Terra), the USA (Sierra Nevada/High Sierra in California and Sapsucker Woods/Ithaca in New York), and Germany (Laubach). Each soundscape contains high quantities of (overlapping) bird vocalizations. In contrast to the BirdCLEF 2019 challenge [5], no data other than the provided training data is allowed to build the recognition system. This prohibits fine-tuning convolutional neural networks (CNNs) pretrained on other datasets, as applied in the best approaches of previous BirdCLEF challenges. The restriction is reasonable, since fine-tuning a CNN pretrained for the task of image classification on the ILSVRC dataset [11] has proven to yield excellent results for many tasks, even for the task of bird recognition using spectrogram images.

This restriction makes it even more important to find an optimal neural network architecture for the BirdCLEF 2020 challenge. For this purpose, we applied a neural architecture search (NAS) approach. NAS is a current field of research. It makes it possible to automatically learn an optimal neural network architecture for a specific problem and offers an alternative to the time-consuming task of manual architecture optimization.

The designed architecture is directly applied to the audio input using a Gabor wavelet transformation, similar to Zeghidour et al. [13] in speech recognition. This transformation is integrated into the neural network architecture as a complex 1-D convolutional layer.
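To make this front-end concrete, the following NumPy sketch illustrates a complex Gabor 1-D convolution over mel-scaled center frequencies whose log filter energies serve as a spectrogram-like input (the layer itself is detailed in Section 2.3). The sampling rate, number of filters, frequency range, and kernel width follow the values reported in Section 3; the kernel bandwidth, hop size, and numerical epsilon are illustrative assumptions rather than the authors' exact settings.

  import numpy as np

  def mel_frequencies(n_filters, f_min, f_max):
      """Center frequencies spaced linearly on the mel scale."""
      mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
      inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
      return inv_mel(np.linspace(mel(f_min), mel(f_max), n_filters))

  def gabor_kernels(center_freqs, sr, width):
      """Complex Gabor wavelets: a Gaussian window times a complex exponential."""
      t = (np.arange(width) - width // 2) / sr
      kernels = []
      for f in center_freqs:
          sigma = 4.0 / f  # assumed bandwidth: a few cycles per kernel
          kernels.append(np.exp(-0.5 * (t / sigma) ** 2) * np.exp(2j * np.pi * f * t))
      return np.stack(kernels)  # shape: (n_filters, width)

  def gabor_spectrogram(audio, sr=22050, n_filters=128, f_min=170.0,
                        f_max=10000.0, width=1024, hop=512):
      """Strided complex 1-D convolution, log energy, zero-centering."""
      kernels = gabor_kernels(mel_frequencies(n_filters, f_min, f_max), sr, width)
      frames = []
      for start in range(0, len(audio) - width + 1, hop):
          energy = np.abs(kernels @ audio[start:start + width]) ** 2
          frames.append(np.log(energy + 1e-6))
      spec = np.array(frames).T  # (n_filters, n_frames)
      return spec - spec.mean()  # zero-centering normalization

In the actual model, this operation is a layer inside the network, so the kernels can equally be generated on the fly within a deep learning framework and combined with the augmentations described in Section 2.2.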
The contributions of the paper are as follows:
– A novel neural architecture search approach based on a memetic algorithm is used to find an optimal neural network architecture for the task of bird sound recognition.
– The proposed neural network architecture directly operates on the audio input.

The paper is organized as follows. Section 2 describes the bird recognition approach, including data pre-processing, data augmentation, the construction of the neural network architecture, and details of the training process. Experimental results are presented in Section 3. Section 4 concludes the paper and outlines areas for future work.

2 Methods

In this section, the proposed system for bird sound recognition is presented. Section 2.1 describes the pre-processing steps. The data augmentation methods used during the training process are specified in Section 2.2. The design of the neural network architecture is explained in Section 2.3, including the Gabor wavelet layer, the NAS approach, and the composition of the neural network using multiple auxiliary heads and recurrent layers. Section 2.4 provides information about the training process.

2.1 Data Pre-processing

First, the audio recordings are split into 5-second segments with an overlap of 0.25 seconds. Then, these snippets are classified as either bird sound or noise (i.e., no audible bird sounds) using the heuristic of the BirdCLEF 2018 baseline system [6]. The resulting sets are called Tsignal (bird sound segments) and Tnoise (noise segments). Finally, the recordings are normalized to -3 dB and resampled to 22,050 Hz.

Fig. 1: Visualization of the impact of different audio augmentation methods on the resulting spectrogram: (a) no augmentation, (b) pitch augmentation, (c) mask augmentation, (d) noise augmentation, (e) loudness augmentation, (f) all previous augmentations, (g) noise snippets augmentation, (h) all augmentations.

2.2 Data Augmentation

To avoid overfitting and to improve the generalization capabilities of the neural network model to various recording conditions, the following data augmentation methods are applied to the audio segments (see Figure 1):
– Pitch augmentation: the pitch is shifted by one to three semitones.
– Mask augmentation: a randomly chosen segment of 0.5 seconds duration is masked with zeros.
– Noise augmentation: white noise is added to the training snippet.
– Loudness augmentation: the volume of a training snippet is scaled by a randomly chosen factor within the range [0.25, 4] (0.25 leads to a decrease by a factor of 4, 4 leads to an increase by a factor of 4).
– Noise snippets augmentation: each training snippet is a randomly weighted sum of one augmented bird snippet from Tsignal and four noise snippets from Tnoise.

First, pitch, mask, noise, and loudness augmentation are applied to the training segments of Tsignal. For this purpose, between one and four augmentation methods are randomly selected and applied in random order. Second, noise snippet augmentation is used. Figure 1 visualizes the impact of the different data augmentation methods on the resulting audio spectrograms.
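A minimal sketch of how such an augmentation pipeline could be implemented is given below, assuming librosa for pitch shifting. The white-noise level and the range of the random mixing weights are illustrative assumptions, since the paper does not state their exact values.

  import numpy as np
  import librosa

  SR = 22050

  def pitch_aug(y, rng):
      return librosa.effects.pitch_shift(y, sr=SR, n_steps=int(rng.integers(1, 4)))

  def mask_aug(y, rng, mask_dur=0.5):
      y = y.copy()
      length = int(mask_dur * SR)
      start = rng.integers(0, len(y) - length)
      y[start:start + length] = 0.0  # mask a random 0.5 s segment with zeros
      return y

  def noise_aug(y, rng, level=0.01):  # assumed noise level
      return y + level * rng.standard_normal(len(y))

  def loudness_aug(y, rng):
      return y * rng.uniform(0.25, 4.0)  # scale volume by a factor in [0.25, 4]

  def augment(signal_snippet, noise_snippets, rng=np.random.default_rng()):
      """Apply one to four augmentations in random order, then mix the bird
      snippet with four noise snippets using random weights."""
      augs = [pitch_aug, mask_aug, noise_aug, loudness_aug]
      chosen = rng.choice(len(augs), size=rng.integers(1, 5), replace=False)
      y = signal_snippet
      for i in chosen:
          y = augs[i](y, rng)
      weights = rng.uniform(0.1, 1.0, size=5)  # assumed weight range
      mix = weights[0] * y
      for w, n in zip(weights[1:], noise_snippets[:4]):
          mix = mix + w * n
      return mix / np.max(np.abs(mix))  # renormalize the mixture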
2.3 Neural Network Architecture

A NAS approach based on a memetic algorithm is used to find an optimal convolutional neural network architecture for the task of bird sound recognition. The overall network architecture is based on a Gabor wavelet layer directly operating on the audio input, similar to Zeghidour et al. [13] for speech recognition. The optimal cell structure found by NAS and multiple output heads including recurrent layers are used to take further advantage of temporal information.

Gabor Wavelet Layer. The extraction of the audio spectrograms is integrated into the neural network architecture using a Gabor wavelet layer. For this purpose, a complex 1-D convolution with n filters is applied to the audio input, where n is the number of frequencies. The weights of the complex kernels are created using Gabor wavelets. The complex 1-D convolution is followed by applying the logarithm and a zero-centering normalization.

Furthermore, we experimented with audio spectrograms generated by applying the FFT, as provided by the baseline system of the BirdCLEF 2018 challenge [7]. While our experiments on this dataset showed that using the Gabor wavelet layer did not lead to quality improvements compared to using audio spectrograms generated by the FFT, the implemented Gabor wavelet layer has some advantages. First, data augmentation methods can also be flexibly applied to the audio input. Second, arbitrary frequencies can be extracted, in contrast to the FFT with its equidistant spacing. This allows us to directly use mel-scaled frequencies.

Neural Architecture Search. NAS is a recent field of research. The aim of NAS is to automatically find an optimal neural network architecture for a specific task. The NASNet architecture [14] is the first design that outperformed handcrafted, manually optimized architectures for the task of image classification. The main idea is to break down the search space and search for cells that form the building blocks for the overall network architecture. Other approaches primarily focus on runtime improvements [14, 9, 1, 10, 2]. While the approaches of Chen et al. [1] and Real et al. [10] use an evolutionary search method, Dong et al. [2] start with a huge, over-parameterized network which is then shrunk and locally optimized step by step.

Fig. 2: Stacked network with reduction cells (R) and normal cells (N).

In the following, we describe a NAS approach to find an optimized architecture for bird sound recognition. The approach uses a memetic search algorithm that combines local optimization with an evolutionary algorithm. Like Zoph et al. [14], we do not search for full network architectures but instead for relatively small structures called cells that are later stacked in a predefined manner to build the full network. NAS approaches are typically categorized according to the used search space, the search strategy, and the performance estimation/evaluation strategy, which are described in the following paragraphs.

Search Space. As previously described, we search for cell structures that are later upscaled to the full network architecture. Like the NASNet search space [14], a cell consists of blocks and operations. While a cell in the NASNet search space consists of a fixed number of blocks and operations, our cell structure is more flexible. A cell can be composed of a variable number of blocks, and each block can contain a variable number of operations. The set of allowed operations is as follows:
– identity
– depth-wise separable convolution
– normal convolution
– max pooling
– average pooling

Kernel sizes for the convolution and pooling operations are restricted to the following sizes: {3 × 3, 5 × 5, 7 × 7}. The outputs of the operations within a block are merged with an add layer that forms the output of the corresponding block. Therefore, each operation contains some extra layers to satisfy shape constraints. The outputs of blocks that have not been used as input to any other operation within the cell are merged using either an add or a concatenate layer to form the output of the cell. The input of an operation can be either the output of one of the previous two cells or the output of a preceding block of the current or previous cell.
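As an illustration of this search space, the following sketch shows one possible way to encode cells, blocks, and operations as plain Python data structures; all class and field names are hypothetical and only serve to make the description above concrete.

  from dataclasses import dataclass, field
  from typing import List

  OP_TYPES = ["identity", "sep_conv", "conv", "max_pool", "avg_pool"]
  KERNEL_SIZES = [3, 5, 7]

  @dataclass
  class Operation:
      op_type: str      # one of OP_TYPES
      kernel_size: int  # 3, 5, or 7 (ignored for identity)
      input_id: str     # "cell-1", "cell-2", or a preceding block of the current/previous cell

  @dataclass
  class Block:
      # outputs of all operations in a block are merged with an add layer
      operations: List[Operation] = field(default_factory=list)

  @dataclass
  class Cell:
      # unused block outputs are merged by "add" or "concat" to form the cell output
      blocks: List[Block] = field(default_factory=list)
      merge: str = "concat"

  # Example: a block that adds a 3x3 separable convolution of the previous
  # cell's output to a 5x5 max pooling of the first block of the current cell.
  example_block = Block([Operation("sep_conv", 3, "cell-1"),
                         Operation("max_pool", 5, "block0")])

For performance estimation, such a cell description would be expanded into normal and reduction variants and stacked into a full network (the cell-to-cnn step in Algorithm 1 below).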
Performance Estimation. For performance estimation, a network architecture is constructed from the cell. First, a normal and a reduction cell are derived from the cell structure. While the normal cell retains the spatial size of its predecessor cell (i.e., every convolution or pooling operation has stride 1 × 1 and appropriate padding), the reduction cell reduces the spatial size by a factor of 2 (i.e., convolutions and pooling operations have stride 2 × 2). Second, the overall network is stacked, as shown in Figure 2. The output head is a global average pooling layer, followed by a densely connected layer with softmax activation. The convolutions of the first reduction cell have F filters. The number of filters is then doubled after every reduction cell.

To determine the quality (or fitness) of a cell, a full network is derived from each cell with F = 8 initial filters. Then, the model is trained and evaluated on a subset of the BirdCLEF 2020 training data set. For this purpose, 300 training samples are randomly selected for each of 50 randomly chosen classes. The resulting data set with 15,000 audio samples is split into a training and a validation set with a validation split of 0.1. Samples of the same audio file are either all in the training set or all in the validation set. The model is trained for 20 epochs using the ADAM optimizer and a cosine learning rate scheduler. Finally, the accuracy of the model is calculated on the validation set.

Algorithm 1: Neural architecture search
  Input: initial population P, number of cells to visit r
  for cell in P do
      // transform the cell into a full network and evaluate it
      cell.network := cell-to-cnn(cell, F=8)
      cell.accuracy := train-and-eval(cell.network)
      cell.flops := compute-flops(cell.network)
  end
  // compute fitness values of the population members
  compute-fitness(P)
  for round = 1 to r do
      parents := select-parents(P)
      // perform the crossover operation
      child-cell := create-child(parents)
      // perform mutation operations
      child-cell := mutate-child(child-cell)
      // add the child cell to the population and evaluate it
      P.append(child-cell)
      child-cell.network := cell-to-cnn(child-cell, F=8)
      child-cell.accuracy := train-and-eval(child-cell.network)
      child-cell.flops := compute-flops(child-cell.network)
      compute-fitness(P)
  end
  return max(P, key=lambda cell: cell.fitness)

Search Strategy. The idea of the search algorithm is to combine an evolutionary algorithm with local optimization. This memetic search algorithm starts with a population of cells randomly drawn from the search space.

Fig. 3: General structure of the full network (Gabor wavelet layer, stacked reduction and normal cells, and normal as well as recurrent output heads).

The fitness of a cell c is calculated as follows:

  c.fitness = 0.7 · (c.accuracy − acc_mean) / acc_std + (1 − 0.7) · (flops_mean − c.flops) / flops_std

where c.accuracy is the performance estimation and c.flops is the number of floating point operations of cell c. acc_mean, acc_std, flops_mean, and flops_std are the mean and the standard deviation of the validation accuracy and of the flops, respectively, over all cells of the current population.
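A minimal sketch of this fitness computation is given below; the dictionary keys are illustrative and mirror the attributes used in Algorithm 1.

  import numpy as np

  def compute_fitness(population):
      """Standardize accuracy and FLOPs over the population and combine them
      with weights 0.7 and 0.3 (higher accuracy and fewer FLOPs are better)."""
      accs = np.array([c["accuracy"] for c in population], dtype=float)
      flops = np.array([c["flops"] for c in population], dtype=float)
      acc_mean, acc_std = accs.mean(), accs.std() + 1e-12
      flops_mean, flops_std = flops.mean(), flops.std() + 1e-12
      for cell, a, f in zip(population, accs, flops):
          cell["fitness"] = (0.7 * (a - acc_mean) / acc_std
                             + 0.3 * (flops_mean - f) / flops_std)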
The tournament selection method is used to select two cells of the current population based on their fitness values. Cells with high fitness values have a higher chance of being chosen. The two selected cells are used to create a new cell by merging all blocks. Considering the order of the blocks, all operations are joined, so that block i of the new cell contains all operations of the i-th blocks of the parents, whereby the input connections of the operations are preserved. In a local optimization step, the most important operations per block are identified, similar to the approach of Dong et al. [2]. For this purpose, a network is derived from the child cell, trained for two epochs, and shrunk to the most important operations. Afterwards, mutation operations are applied to the new cell, for example, changing operation types or input connections, or adding and removing blocks. Finally, the cell is added to the population, and the fitness values are recomputed. This procedure is repeated for a certain number of rounds, and the cell with the best fitness value is returned. The search algorithm is described in Algorithm 1.

Multi-Head Model. In contrast to natural images, audio spectrograms contain temporal information. It seems reasonable to use this information, since the individual sounds of a bird's song may follow a certain chronological order. For this reason, we added output heads with recurrent layers to the network architecture. Altogether, the final network architecture contains two pairs of output heads, each pair consisting of one head with and one head without recurrent layers: one pair at the end of the network and one pair at an intermediate stage. Output heads without a recurrent layer contain a global average pooling layer followed by a densely connected layer with softmax activation; recurrent output heads contain two consecutive GRU layers with 128 units each, followed by a global average pooling layer and a densely connected layer with softmax activation. Each output head has its own loss function in the training phase, whereas for inference the output heads are merged using a concat layer followed by global average pooling. Figure 3 shows the structure of the network including all output heads.

2.4 Training Methodology

We trained our models using a weighted loss function consisting of a softmax log loss for each output head. The loss terms of the intermediate output heads are weighted by a factor of 0.6, whereas the loss terms of the output heads at the end of the network are weighted by a factor of 1. The ADAM optimizer [8] is used for the training process with a cosine learning rate scheduler. Furthermore, scheduled drop path [14], where the drop rate dr is linearly increased to 0.4 throughout the training process, is used as a regularization method. Within scheduled drop path, an operation of the c-th of n cells in the network is dropped (i.e., its output is set to zero) with probability dr · c / n, so the drop probability increases with the depth of the cell in the network.
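The following sketch illustrates this drop rule; the linear schedule over training steps and the function signatures are illustrative assumptions.

  import numpy as np

  def drop_rate(step, total_steps, max_rate=0.4):
      """Global drop-path rate, linearly increased towards 0.4 over training."""
      return max_rate * step / total_steps

  def drop_path(x, cell_index, n_cells, dr, training=True,
                rng=np.random.default_rng()):
      """Zero an operation's output in the cell_index-th of n_cells cells
      (1-indexed) with probability dr * cell_index / n_cells."""
      if not training:
          return x
      if rng.random() < dr * cell_index / n_cells:
          return np.zeros_like(x)
      return x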
3 Experiments

In this section, we present the results of the two official and the three post-challenge submissions.

3.1 Evaluation Metrics

The task of the challenge is to identify all audible bird species in 5-second snippets of continuous soundscape recordings from four different locations. In the BirdCLEF 2020 challenge [4], the following two metrics are used to measure the performance of a submitted run:

– Classification Mean Average Precision (cmap):

  cmap = (1 / C) · Σ_{c=1}^{C} AveP(c)

where C is the total number of classes and AveP(c) is the average precision of class c.

– Retrieval Mean Average Precision (rmap):

  rmap = (1 / X) · Σ_{x=1}^{X} AveP(x)

where X is the total number of audio snippets of all test files and AveP(x) is the average precision of snippet x.

Fig. 4: Structure of the cell used in Runs 1 and 2.

3.2 Data Sets

The training set consists of more than 70,000 recordings across 960 bird species classes contributed by the Xeno-canto community. Exactly one foreground bird species is assigned to each audio recording file. Additionally, metadata such as recording location, recording date, elevation, and recording quality is provided. The validation set contains 12 soundscape files recorded at two different locations (one in Peru, one in the USA). Each soundscape file has a duration of 10 minutes and is divided into 5-second snippets. For our evaluation, a list of audible bird species is assigned to each 5-second snippet of the validation data. The test set consists of 153 soundscape files of 10 minutes duration recorded at four different locations (one in Peru, two in the USA, and one in Germany).

3.3 Results

Two network architectures found by the NAS method described in Section 2.3 are evaluated. Altogether, we submitted five runs: two official runs with the first architecture and three post-challenge runs with the second architecture. The first network architecture was found by running the search algorithm described in Section 2.3 for 100 rounds. Figure 4 shows the cell architecture of this network. The overall network was constructed as described in Section 2.3 with the number of initial filters set to F = 16 and trained for 30 epochs, as described in Section 2.4. We used 128 mel-scaled frequencies in the range of 170 Hz to 10,000 Hz. The width of the kernel of the Gabor wavelet layer was set to w = 1024, resulting in spectrograms with a resolution of 256 × 128. The training took 65 hours on four Nvidia TITAN Xp GPUs.

The trained network takes about 43 seconds to analyze a 10-minute soundscape file. However, most of the time (≈ 40 s) is consumed by the non-optimized, single-threaded pre-processing step. The inference time of the network is only ≈ 3 s, which corresponds to 40 snippets per second.

Fig. 5: Structure of the cell used in Post-Submission Runs 1, 2, and 3.

The two official runs differ only in the post-processing step.

Official Run 1. In the first run, the species location lists provided by the challenge organizers are used to reject impossible species from the model's predictions. This run obtains a cmap of 12.8% and a rmap of 19.3%. The cmap is 8.6 percentage points better than the cmap of the second best submission (4.2%) and thus wins the challenge.

Official Run 2. For the second run, we used species lists per location and season, derived from the training data, to remove unlikely species from the model's predictions. This led to a slight decrease in cmap and a slight increase in rmap, resulting in a cmap value of 12.7% and a rmap value of 19.8%.
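The post-processing of both official runs can be summarized by a simple masking step, sketched below. load_species_list is a hypothetical helper, and all variable names are illustrative.

  import numpy as np

  def filter_by_location(probs, class_names, allowed_species):
      """Zero out scores of species that do not occur in the location's species list."""
      allowed = np.array([name in allowed_species for name in class_names])
      return probs * allowed

  # Example: keep only the species listed for the Peruvian recording site.
  # peru_list = load_species_list("PER")  # hypothetical helper
  # filtered = filter_by_location(snippet_probs, class_names, peru_list)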
After the challenge, we ran the NAS algorithm for another 50 rounds and found a new promising cell architecture. The network corresponding to the newly found cell, again with the number of initial filters set to F = 16, was used to submit three further runs. Figure 5 shows how two consecutive cells of the network are connected. While Post-Submission Runs 1 and 2 use the species lists per location and season, Post-Submission Run 3 only applies the species lists per location in the post-processing step.

Post-Submission Run 1. The network used in this run was trained in the same way as the network of Run 1, but only for 10 instead of 30 epochs, since this seemed to be sufficient. This run obtains a cmap of 12.6% and a rmap of 20.3%.

Post-Submission Run 2. For this run, the width of the kernel of the Gabor wavelet layer was reduced from w = 1024 to w = 512, resulting in higher resolution spectrograms with 428 × 128 pixels. The network was initialized with the weights obtained in Post-Submission Run 1 and fine-tuned for another 5 epochs. This resulted in a small improvement of the cmap and rmap values (cmap = 12.7%, rmap = 20.6%).

Post-Submission Run 3. In contrast to Post-Submission Run 1, this run uses a different sampling strategy. Instead of sampling from the pre-classified bird sound snippets (Tsignal), the batches are sampled from the audio files by extracting a random five-second snippet per file. In this case, the number of training samples per epoch equals the number of mono-species audio files. Due to the smaller number of samples per epoch, the network was trained for 100 epochs. This run obtains a cmap of 13.1% and a rmap of 19.2%, which is the best result of the challenge in terms of cmap.

                           validation          test
                           cmap     rmap       cmap     rmap
  Run 1                    14.8%    21.8%      12.8%    19.3%
  Run 2                    14.8%    22.2%      12.7%    19.8%
  Post-Submission Run 1    15.1%    24.4%      12.6%    20.3%
  Post-Submission Run 2    16.2%    24.0%      12.7%    20.6%
  Post-Submission Run 3    17.8%    23.4%      13.1%    19.2%

Table 1: Comparison of the submitted runs in terms of cmap and rmap on the validation and test sets.

3.4 Discussion

Table 1 shows a comparison of all runs on the validation and test sets. It is evident that even though the network used for the post-submission runs scored significantly better on the validation set, this improvement only partially transfers to the test set. Post-Submission Run 2, for example, obtains a validation cmap of 16.2% and a validation rmap of 24.0%, but yields a test cmap of only 12.7%, which is 0.1 percentage points worse than the test cmap of Run 1, although Run 1 obtains a significantly lower validation cmap of only 14.8%. The 2.2 percentage points better validation rmap score is, on the other hand, reflected on the test set with a test rmap of 20.6% in comparison to only 19.3% for Run 1. Interestingly, Post-Submission Run 3 yielded the best cmap score both on the validation and the test set without distinguishing between bird sound and noise snippets during the training phase. The extraction of snippets at random positions and the different weighting of classes in the training phase seem to be beneficial for the cmap score, with 13.1% on the test data, while the rmap score decreases to 19.2%. Both network architectures used for the submissions have a similar inference time of approximately 40 snippets per second.

4 Conclusion

The presented approach won the BirdCLEF 2020 challenge with a cmap of 12.8% and a rmap of 19.3%. In the post-challenge phase, the scores were further improved to 13.1% cmap and 20.6% rmap.
However, the task of recognizing all bird species in soundscapes is far from being solved. One possible reason could be the discrepancy between training and test data. The training data consists of sound files of various lengths, and only the most audible bird is labeled in each sound file. This means that there could be samples in the training data where multiple birds or even a bird other than the labeled one is audible, which can be misleading in the training process. The task for validation and testing, however, is to identify all audible birds in a snippet. This is a much harder task than just recognizing the most audible bird in a recording. A more fine-grained multi-label annotation of the training data could significantly improve the quality of the bird sound recognition models.

There are several possibilities to improve our approach with the existing training data. There is room for improvement in the data augmentation step, which has proven to be very important in previous challenges. For example, we could apply further data augmentation methods to the audio data or even apply some augmentation to the spectrograms produced by the Gabor wavelet layer. Another option is to learn the weights of the complex 1-D convolution kernels of the Gabor wavelet layer to produce better spectrograms, or to use a self-supervised pre-training approach to learn a more suitable audio representation [12] and adapt it for birdcall identification. Furthermore, it may be beneficial to run the neural architecture search algorithm for a longer period of time to find a network that works even better on this task. Finally, the performance could be improved at the cost of longer training and inference runtimes by scaling up the network's capacity and using stronger regularization.

5 Acknowledgement

This work is funded by the Hessian State Ministry for Higher Education, Research and the Arts (HMWK) (LOEWE Natur 4.0).

References

1. Chen, Y., Meng, G., Zhang, Q., Xiang, S., Huang, C., Mu, L., Wang, X.: Reinforced evolutionary neural architecture search (2018)
2. Dong, X., Yang, Y.: Searching for a robust neural architecture in four GPU hours. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1761–1770 (2019)
3. Joly, A., Goëau, H., Kahl, S., Deneu, B., Servajean, M., Cole, E., Picek, L., Ruiz De Castañeda, R., Lorieul, T., Botella, C., Glotin, H., Champ, J., Vellinga, W.P., Stöter, F.R., Dorso, A., Bonnet, P., Eggel, I., Müller, H.: Overview of LifeCLEF 2020: A system-oriented evaluation of automated species identification and species distribution prediction. In: Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)
4. Kahl, S., Clapp, M., Hopping, A., Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P., Joly, A.: Overview of BirdCLEF 2020: Bird sound recognition in complex acoustic environments. In: CLEF task overview 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)
5. Kahl, S., Stöter, F.R., Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P., Joly, A.: Overview of BirdCLEF 2019: Large-scale bird recognition in soundscapes. In: Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, pp. 1–9. CEUR (Sep 2019)
6. Kahl, S., Wilhelm-Stein, T., Klinck, H., Kowerko, D., Eibl, M.: Recognizing birds from sound - the 2018 BirdCLEF baseline system. arXiv preprint arXiv:1804.07177 (2018)
7. Kahl, S., Wilhelm-Stein, T., Klinck, H., Kowerko, D., Eibl, M.: Recognizing birds from sound - the 2018 BirdCLEF baseline system. arXiv preprint arXiv:1804.07177 (2018)
8. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. In: Proceedings of the International Conference on Learning Representations (2015)
9. Liu, C., Zoph, B., Neumann, M., Shlens, J., Hua, W., Li, L.J., Fei-Fei, L., Yuille, A., Huang, J., Murphy, K.: Progressive neural architecture search. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 19–34 (2018)
10. Real, E., Aggarwal, A., Huang, Y., Le, Q.V.: Regularized evolution for image classifier architecture search. In: Proceedings of the AAAI Conference on Artificial Intelligence 33, pp. 4780–4789 (2019)
11. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015)
12. Schneider, S., Baevski, A., Collobert, R., Auli, M.: wav2vec: Unsupervised pre-training for speech recognition. arXiv preprint arXiv:1904.05862 (2019)
13. Zeghidour, N., Usunier, N., Kokkinos, I., Schatz, T., Synnaeve, G., Dupoux, E.: Learning filterbanks from raw speech for phone recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (2018)
14. Zoph, B., Vasudevan, V., Shlens, J., Le, Q.V.: Learning transferable architectures for scalable image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710 (2018)