Xception Based Method for Bird Sound Recognition of BirdCLEF 2020

Jisheng Bai, Chen Chen, and Jianfeng Chen

Northwestern Polytechnical University, Xi'an, China
{baijs, cc chen524}@mail.nwpu.edu.cn, chenjf@nwpu.edu.cn

Abstract. In this paper, we present an Xception based method for bird sound recognition in BirdCLEF 2020. The goal of BirdCLEF 2020 is to detect and classify 960 bird species within the provided soundscape recordings; differentiating such a large number of birds is more complex than in BirdCLEF 2019. In our approach, logmel or loglinear spectrograms are extracted as features, and several data augmentation techniques are used to improve the performance of detecting bird sounds. Finally, we evaluate our system on the BirdCLEF 2020 test dataset and achieve a classification mean average precision (c-mAP) score of 0.0421.

Keywords: Bird sound recognition · Xception · Data augmentation.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CLEF 2020, 22-25 September 2020, Thessaloniki, Greece.

1 Introduction

It is difficult to take clear photos of birds: doing so requires an open environment, professional equipment, and a high level of photographic skill, and it takes a lot of time to actively search for subjects. Instead, we can identify a bird's species from a segment of its voice. Recognizing birds by their sounds is important for many environmental and scientific purposes, and successful recognition techniques could be used for ecological monitoring in the future. Birds are highly sensitive to their environment: if an area is polluted, some birds will gradually leave, so changes in bird habits and populations can reflect changes in the environment. If microphones are installed in forests, information on bird populations and activities can be collected automatically, which can significantly improve efficiency and accuracy compared with manual observation. The sounds of different birds are indeed distinctive, and large amounts of sound data can be collected by installing dedicated recording equipment. This can be done unattended, without people investing extra time and money.

With the development of deep learning, a large body of research has shown that deep models outperform traditional methods in bird sound classification [5]. Convolutional neural networks (CNNs) show great feature extraction ability and strong results in many computer vision tasks, and image-based architectures such as Inception-v3 can obtain the best performance in sound classification, whatever the targeted domain [6].

2 Dataset

The training data consists of audio recordings of bird species from South and North America and Europe. The Xeno-canto community contributed this data and provides more than 70,000 high-quality recordings across 960 species for this year's challenge. Each recording is accompanied by metadata containing information on the recording location, date, and other high-level descriptions provided by the recordists.

The test data consists of 153 soundscapes recorded in Peru, the USA, and Germany. Each soundscape is ten minutes long and contains high quantities of (overlapping) bird vocalizations [1].
We counted the number of recordings for each species: although the training dataset contains over 70,000 recordings, some species have fewer than 90 recordings, while the remaining species have between 90 and 100. For example, the species with the eBird name "banfru1" contains only 1 recording, whereas the species "whcspa" has 100. The distribution is shown in Fig. 1. This imbalance of the dataset can affect the recognition of bird species.

Fig. 1. The number distribution of training dataset recordings

3 Data preparation

3.1 Bird song separation

To separate bird sound and background noise from an original recording, we apply dilation, erosion, and smooth masking. Similar techniques are presented in [11] and used in [9] and [2]. We then separate all recordings into 960 bird song classes plus noise. The details are as follows:

- Every recording is loaded at a sample rate of 22050 Hz.
- The short-time Fourier transform (STFT) is used to calculate spectrograms with a window length of 1024 and a hop length of 512.
- We calculate the median value of each row and column, then set each element of the spectrogram to 1 if it is more than 1.5 times the median of both its row and its column; otherwise it is set to 0.
- Binary erosion and dilation filters are applied to distinguish the noise and signal parts. The filter is a 4 by 4 square.
- We create a one-dimensional indicator vector whose i-th element is set to 1 if the corresponding column contains at least one 1, and 0 otherwise.
- Finally, we smooth the indicator vector twice with a dilation filter of size 8 by 1 and use it as a mask to separate the original bird recording. Each recording may be divided into many signal and noise parts; all signal parts are concatenated into one file, and likewise for the noise parts.

We then cut all recordings of every species into 5-second segments, which yields 960 per-species folders of 5-second bird song and noise recordings. An example of the separation is shown in Fig. 2.

Fig. 2. An example of bird sound separation
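To make the masking procedure above concrete, here is a minimal Python sketch using librosa and scipy. The function name and the frame-to-sample mapping at the end are our own illustration under the stated parameters, not the authors' exact code:

```python
import numpy as np
import librosa
from scipy import ndimage

def separate_signal_noise(path, sr=22050, n_fft=1024, hop=512, thresh=1.5):
    """Split a recording into signal (bird song) and noise samples via
    median clipping and binary morphology, as described in Section 3.1."""
    y, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # Keep cells exceeding 1.5x the medians of both their row and column.
    row_med = np.median(spec, axis=1, keepdims=True)
    col_med = np.median(spec, axis=0, keepdims=True)
    mask = (spec > thresh * row_med) & (spec > thresh * col_med)

    # Binary erosion then dilation with a 4x4 square filter.
    mask = ndimage.binary_erosion(mask, structure=np.ones((4, 4)))
    mask = ndimage.binary_dilation(mask, structure=np.ones((4, 4)))

    # Indicator vector: 1 for any frame (column) that contains at least one 1.
    indicator = mask.any(axis=0)

    # Smooth the indicator twice with a dilation filter of size 8.
    indicator = ndimage.binary_dilation(indicator, structure=np.ones(8))
    indicator = ndimage.binary_dilation(indicator, structure=np.ones(8))

    # Map frame decisions back to samples and split the waveform
    # (this sample-level mapping is our assumption).
    sample_mask = np.repeat(indicator, hop)[: len(y)]
    return y[sample_mask], y[~sample_mask]
```

The erosion removes isolated spectrogram cells that pass the median test by chance, while the dilation restores the full extent of genuine song events before the per-frame decision is taken.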
3.2 Data augmentation

In recent years, data augmentation techniques have been applied successfully in sound recognition tasks, and two such methods are utilized in our approach. Time- and frequency-domain augmentation methods have been used in [9] and [11]; in our method we use some of these techniques to augment the dataset, as follows:

- Load a bird sound file from a random position (wrapping back to the beginning when the end is reached).
- Add at most three files on top of a 5-second bird sound fragment, each with an independent chance: bird song from the same species, bird noise from the same species, and noise from another species, with probabilities of 0.3, 0.3, and 0.5 respectively. An amplitude factor between 0.3 and 0.5 is applied during the process.
- The STFT is used to generate a spectrogram from the mixed file with a window size of 1024 and a hop length of 512.
- Normalization and a logarithm are applied to calculate the logmel or loglinear spectrogram with 128 Mel bands; frequencies above 11025 Hz and below 50 Hz are removed.
- To match the input size of Xception, interpolation filters are applied to resize the spectrograms to 299 x 299 x 1.

To handle the imbalanced dataset and prevent overfitting, we apply mix-up during the training stage [13]. It can be expressed as:

\tilde{x} = \lambda x_i + (1 - \lambda) x_j \quad (1)

\tilde{y} = \lambda y_i + (1 - \lambda) y_j \quad (2)

where x_i, x_j are input features, y_i, y_j are target labels, and \lambda \in [0, 1] is a random number drawn from the Beta(\alpha, \alpha) distribution. In this way we obtain more training samples without extra computing resources; the new samples are linear interpolations of real samples.

3.3 Denoising

The bird sounds come from diverse areas around the world, each with its own environmental background noise. In bird sound recordings, loud rain, quiet backgrounds, wind, and the sounds of other animals may be present and may even cover some bird sounds. To mitigate this, we propose a denoising method that performs spectral subtraction on a spectrogram:

- Load a 5-second sound file and generate an amplitude spectrogram.
- Calculate the mean amplitude of each frame, mark the 20 frames with the smallest means, and average their amplitudes to obtain a primary subtraction vector. Subtracting this vector over frequencies from every frame gives the final denoised spectrogram.
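As a sketch, this spectral subtraction step could look as follows; the function name and the clipping of negative values at zero are our assumptions, since the paper does not specify how negative amplitudes are handled:

```python
import numpy as np
import librosa

def spectral_subtract(path, sr=22050, n_fft=1024, hop=512, n_quiet=20):
    """Subtract a background-noise estimate, taken from the 20 quietest
    frames, from every frame of an amplitude spectrogram."""
    y, _ = librosa.load(path, sr=sr)
    spec = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

    # Mean amplitude per frame; the quietest frames approximate pure noise.
    frame_energy = spec.mean(axis=0)
    quiet = np.argsort(frame_energy)[:n_quiet]

    # Primary subtraction vector: mean noise amplitude per frequency bin.
    noise_profile = spec[:, quiet].mean(axis=1, keepdims=True)

    # Subtract over frequencies for each frame; clipping negatives to zero
    # is an assumption (the paper does not say how negatives are treated).
    return np.clip(spec - noise_profile, 0.0, None)
```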
3.4 Features

During training, separated bird song, augmented bird sound, and spectrally subtracted sound are converted into logmel and loglinear spectrograms of 128 bands and used as input. All spectrograms are compressed with a logarithm.

4 Network architecture

4.1 Xception

Inception-v3 is one of the state-of-the-art architectures in image classification challenges [12], and it has been confirmed that Inception-based CNNs on Mel spectrograms provide the best performance [4]. The best network for bird song detection appears to be the Inception-v3 architecture, which performs better than even more recent architectures [10].

Xception is an improvement of Inception-v3 proposed by Google. In [3], the correlation between channels and the spatial correlation are dealt with separately: the convolution operations of the original Inception-v3 are replaced by depthwise separable convolutions (an "extreme" Inception), which form the basic module of Xception. The Xception network is composed of a series of separable convolutions, residual connections similar to those of ResNet, and other conventional operations.

4.2 Training strategy

To recognize 960 species and handle such a large number of recordings, we used the Xception architecture rather than other CNN architectures. As features, we selected logmel and loglinear spectrograms. We applied the denoising method and the data augmentation method during data preprocessing. PyTorch was used to train the model, and the Python librosa library was used to process the recordings and generate features. During training, categorical cross-entropy was used as the loss function, and stochastic gradient descent was used as the optimizer with a weight decay of 1e-4 and a constant learning rate of 0.001.
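Below is a minimal PyTorch sketch of this training setup combined with the mix-up of Eqs. (1) and (2). The Xception implementation from timm, the mix-up parameter alpha = 0.4, and the batch shapes are our assumptions; none of these details are specified in the paper:

```python
import numpy as np
import torch
import torch.nn.functional as F
import timm  # assumption: any Xception implementation would do

model = timm.create_model('xception', num_classes=960, in_chans=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, weight_decay=1e-4)

def mixup(x, y, alpha=0.4):
    """Mix-up of Eqs. (1)-(2): lambda drawn from Beta(alpha, alpha)."""
    lam = np.random.beta(alpha, alpha)
    idx = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

def train_step(batch_x, batch_y):
    """batch_x: (B, 1, 299, 299) spectrograms; batch_y: (B, 960) one-hot."""
    x, y = mixup(batch_x, batch_y)
    logits = model(x)
    # Categorical cross-entropy with soft (mixed) targets.
    loss = -(y * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```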
5 Results

5.1 Evaluation metrics

The evaluation metric is the classification mean average precision (c-mAP), considering each class c of the ground truth as a query. This means that for each class c, all predictions with ClassId c are extracted from the run file, ranked by decreasing probability, and the average precision is computed for that class:

c\text{-mAP} = \frac{\sum_{c=1}^{C} AveP(c)}{C} \quad (3)

where C is the number of species in the ground truth and AveP(c) is the average precision for a given species c, computed as:

AveP(c) = \frac{\sum_{k=1}^{n} P(k) \times rel(k)}{n_{rel}(c)} \quad (4)

where k is the rank of an item in the list of predicted segments containing c, n is the total number of predicted segments containing c, P(k) is the precision at cut-off k in the list, rel(k) is an indicator function equal to 1 if the segment at rank k is relevant (i.e. labeled as containing c in the ground truth), and n_{rel}(c) is the total number of relevant segments for c.

5.2 Performance on validation split

For each validation or test file, we smoothed the prediction vector by applying a moving window of 5 seconds, and we selected only the 3 highest probabilities as the result for each 5-second test fragment. To assess the performance of our methods, we trained on only the 76 bird species included in the validation dataset. The best c-mAP scores of the experiments are shown in Table 1, where data augmentation by randomly adding sounds is denoted AR and spectral subtraction is denoted Spec sub. Experiments 1 and 2 used only the extracted bird song to generate features; the remaining experiments extracted features continuously from the original recordings.

From Table 1 we can see that logmel with mix-up achieves the best c-mAP score, while loglinear performs well with separated bird song and with the randomly-adding data augmentation.

Table 1. Best c-mAP scores of the experiments on the validation split

ID | Features  | Data augmentation | Denoised | c-mAP
1  | Logmel    | /                 | /        | 0.074
2  | Loglinear | /                 | /        | 0.096
3  | Logmel    | AR                | /        | 0.083
4  | Loglinear | AR                | /        | 0.092
5  | Logmel    | Mix-up            | /        | 0.154
6  | STFT      | /                 | Spec sub | 0.083
7  | Logmel    | /                 | Spec sub | 0.084
8  | Loglinear | /                 | Spec sub | 0.065

5.3 Performance on test split

Four teams made 29 submissions to LifeCLEF 2020 Bird Monophone. We ranked 2nd among the teams and achieved a best c-mAP score of 0.0421 and an r-mAP of 0.0671; details are shown in Table 2 [7][8]. Three different methods were tested, with results shown in Table 3:

- result0: We used the randomly-adding data augmentation method, the Xception architecture, and loglinear spectrograms in this run. Test output vectors were smoothed to obtain the final prediction. This was the best performance of all runs.
- result1: We used the spectral subtraction denoising method, the Xception architecture, and logmel spectrograms in this run. Test output vectors were smoothed as well. This run achieved a c-mAP score of 0.032 and an r-mAP score of 0.0592.
- result2: We used the Xception architecture, with loglinear spectrograms of separated bird song as features. Test output vectors were smoothed as well. This run achieved a c-mAP score of 0.027 and an r-mAP score of 0.0558.

Table 2. Leaderboard of LifeCLEF 2020 Bird [1]

Participant          | c-mAP  | r-mAP
mmuehling            | 0.1282 | 0.193
NPU bird             | 0.0421 | 0.067
thailsson clementino | 0.0097 | 0.008
JS CHUNGNAM          | 0.072  | 0.055

Table 3. Results of all runs

Run  | Data augmentation | Feature   | Denoised | c-mAP | r-mAP
Run0 | Yes               | Loglinear | No       | 0.042 | 0.067
Run1 | No                | Logmel    | Yes      | 0.032 | 0.059
Run2 | No                | Loglinear | No       | 0.027 | 0.059

6 Conclusion and future work

In this paper, we proposed a system for bird sound recognition based on Xception together with data augmentation and denoising techniques, and we obtained a c-mAP score of 0.0421 on the official test dataset. To handle more than 70,000 recordings, Xception was chosen for its strong feature extraction ability. During training, the mix-up and randomly-adding data augmentation methods were applied to prevent overfitting and improve generalization.

We submitted 7 runs covering three main methods. Ensembles of networks were banned this year. In future work, we will focus on convolutional recurrent neural networks and on other data augmentation methods that do not require more computing resources. Features can also have a great impact on performance, and they will be studied as well. There is still much room for improvement in bird sound recognition.

References

1. AIcrowd BirdCLEF 2020, https://www.aicrowd.com/challenges/lifeclef-2020-bird-monophone
2. Bai, J., Wang, B., Chen, C., Chen, J., Fu, Z.: Inception-v3 based method of LifeCLEF 2019 bird recognition. In: CLEF (Working Notes) (2019)
3. Chollet, F.: Xception: Deep learning with depthwise separable convolutions (2016)
4. Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P., Kahl, S., Joly, A.: Overview of BirdCLEF 2018: monophone vs. soundscape bird identification. CLEF working notes (2018)
5. Joly, A., Goëau, H., Botella, C., Glotin, H., Bonnet, P., Vellinga, W.P., Planqué, R., Müller, H.: Overview of LifeCLEF 2018: a large-scale evaluation of species identification and recommendation algorithms in the era of AI. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 247-266. Springer (2018)
6. Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Lombardo, J.C., Planqué, R., Palazzo, S., Müller, H.: LifeCLEF 2017 lab overview: multimedia species identification challenges. In: International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 255-274. Springer (2017)
7. Joly, A., Goëau, H., Kahl, S., Deneu, B., Servajean, M., Cole, E., Picek, L., Ruiz De Castañeda, R., Lorieul, T., Botella, C., Glotin, H., Champ, J., Vellinga, W.P., Stöter, F.R., Dorso, A., Bonnet, P., Eggel, I., Müller, H.: Overview of LifeCLEF 2020: a system-oriented evaluation of automated species identification and species distribution prediction. In: Proceedings of CLEF 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)
8. Kahl, S., Clapp, M., Hopping, A., Goëau, H., Glotin, H., Planqué, R., Vellinga, W.P., Joly, A.: Overview of BirdCLEF 2020: Bird sound recognition in complex acoustic environments. In: CLEF task overview 2020, CLEF: Conference and Labs of the Evaluation Forum, Sep. 2020, Thessaloniki, Greece (2020)
9. Lasseck, M.: Audio-based bird species identification with deep convolutional neural networks. Working Notes of CLEF 2018 (2018)
10. Sevilla, A., Glotin, H.: Audio bird classification with Inception-v4 extended with time and time-frequency attention mechanisms. In: Working Notes of CLEF 2017, Dublin, Ireland, September 11-14, 2017 (2017), http://ceur-ws.org/Vol-1866/paper_177.pdf
11. Sprengel, E., Jaggi, M., Kilcher, Y., Hofmann, T.: Audio based bird species identification using deep learning techniques. Tech. rep. (2016)
12. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the Inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826 (2016)
13. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)