Automatic Speech Detection on a Smart Beehive's Raspberry Pi

Pascal Janetzky, Philip Lissmann, Andreas Hotho and Anna Krause
University of Würzburg, Department of Computer Science, CAIDAS, Chair for Data Science, 97074 Würzburg, Germany

LWDA'23: Lernen, Wissen, Daten, Analysen, October 09–11, 2023, Marburg, Germany
janetzky@informatik.uni-wuerzburg.de (P. Janetzky); philip.lissmann@stud-mail.uni-wuerzburg.de (P. Lissmann); hotho@informatik.uni-wuerzburg.de (A. Hotho); anna.krause@informatik.uni-wuerzburg.de (A. Krause)

Abstract
The we4bee project has deployed 100 smart hives all over Germany. These hives are equipped with microphones, among other sensors. Beekeepers and bee researchers have observed the importance of sounds when monitoring bee hives, but audio can only be recorded in accordance with privacy laws. To prevent saving recordings of human voices, our aim is to deploy a pre-trained deep learning model on the Raspberry Pi 3B computer controlling the smart hive. This model has to classify recorded data in real time. In this technical report, we document the process of setting up the software on the Raspberry Pi, the adaptations required for existing code to run in the new environment, and the necessity of modifying the trained models for deployment on the mini-computer. We find that in both standard operating conditions and under various artificial levels of high CPU and I/O load, the model's inference runs in real time.

Keywords
TensorFlow, machine learning, audio classification, speech detection, mobile computing

1. Introduction

The we4bee project (we4bee.org) runs around 100 smart beehives all over Germany. Since 2019, these hives have been collecting data in and around the hives with 16 sensors and two cameras. The sensor system is powered by a Raspberry Pi 3B mini-computer, which handles data collection and transfer to the central database at the University of Würzburg. In our earliest work, we relied on inside temperatures to detect events in the hive via anomaly detection [1], but audio recorded in beehives is also a valuable data source for precision beekeeping [2, 3, 4]. In recent work, we have thus leveraged audio data for event detection [5]. However, since written consent is mandatory to record audio data, audio is currently recorded in a single beehive only. In order to record audio data in more hives, Janetzky et al. [6] trained deep learning models that identify human speech in audio recordings obtained from the we4bee hive. In this technical report, we deploy the best-performing model from our earlier research directly on the Raspberry Pi in the smart hives. For successful deployment, we require the model to be real-time capable: a 60 s recording has to be classified in less than a minute so that data without speech is buffered or uploaded before successive samples are recorded.

Figure 1: An overview of the existing data collection system (dotted boxes) and our speech detection pipeline. Sensor and image data are automatically collected and uploaded as is. For the audio data, we obtain a recording's embeddings through TensorFlow Lite (TFlite) and predict their class. Data is only uploaded if no speech is detected.

2. Detecting Human Speech

In our previous work [6], we evaluated three Siamese neural networks, Saeed [7], ESC [8, 9] and Bulbul [10], each followed by a k-Nearest Neighbor (k-NN) classifier [11, 12], on the detection of speech in audio recordings obtained from beehives. Of these networks, Bulbul showed the best performance. This Siamese network consists of four blocks of convolution, leaky ReLU [13], and max-pooling layers. Afterwards, the output is flattened and followed by two blocks of dropout [14], dense, and leaky ReLU layers. The Siamese network was trained on a total of 200 labeled samples of 60 s each, from which we created random data pairs. The training objective minimizes the Euclidean distance between audio pairs from the same class (e.g., speech–speech) and maximizes the distance between pairs of different classes. From the trained network, the embeddings of the training data were then extracted from an intermediate layer and used to train a k-NN classifier to predict a sample's class.
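To make the layer layout described above concrete, the following is a minimal training-side sketch of a Bulbul-style embedding network and the Siamese distance head in Keras. The filter counts, layer sizes, and spectrogram input shape are illustrative assumptions rather than the exact configuration from [6, 10], and the input here is a precomputed spectrogram, whereas the deployed model computes the spectrogram inside the network (see section 3.1).

```python
import tensorflow as tf
from tensorflow.keras import layers

SPEC_SHAPE = (80, 512, 1)  # (frequency bins, frames, channels) -- assumed input shape

def build_embedding_net() -> tf.keras.Model:
    """Bulbul-style embedding network: four conv / leaky ReLU / max-pooling
    blocks, a flatten, and two dropout / dense / leaky ReLU blocks."""
    inp = layers.Input(shape=SPEC_SHAPE)
    x = inp
    for filters in (16, 16, 16, 16):          # filter counts are placeholders
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU()(x)
        x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    for units in (256, 32):                   # unit counts are placeholders
        x = layers.Dropout(0.5)(x)
        x = layers.Dense(units)(x)
        x = layers.LeakyReLU()(x)
    return tf.keras.Model(inp, x, name="embedding")

# Siamese setup: both recordings of a pair pass through the same embedding
# network, and the Euclidean distance between the two embeddings is the
# model output used by the contrastive training objective.
embedding_net = build_embedding_net()
left = layers.Input(shape=SPEC_SHAPE)
right = layers.Input(shape=SPEC_SHAPE)
distance = layers.Lambda(
    lambda t: tf.norm(t[0] - t[1], axis=-1, keepdims=True)
)([embedding_net(left), embedding_net(right)])
siamese = tf.keras.Model([left, right], distance)
```

Because both inputs share the same weights, the network is Siamese; after training, the embeddings fed to the k-NN classifier are read from an intermediate layer of the embedding network, as described above.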
3. On-device Classification

To be able to record audio in more than one beehive, we want to identify and discard audio recordings with human speech directly on the beehive's Raspberry Pi. To this end, we selected an exemplary smart beehive and completed the following tasks: installing the TensorFlow Lite library on the Raspberry Pi; migrating our Python environment and scripts; migrating the model; and enabling real-time classification of incoming audio data. The proposed approach to real-time speech detection is visualized in fig. 1. The existing system, visualized on the left, records sensor and image data and uploads them directly. For the detection of speech, we use the pipeline visualized in the middle, where only no-speech data is stored for upload. The remainder of this report will delve into the necessary adaptations in more detail.

3.1. Adaptations to the new environment

Figure 2: Sample recordings (a, b), total inference runtime per sample (c), and the average total runtime at different I/O loads (d). Panels: (a) sample recording of speech; (b) sample recording of bee humming; (c) total runtime distribution; (d) total runtime at different I/O loads. Note the vertical lines indicating speech in (a) and the difference in the lower frequencies between (a) and (b). The total runtime is most strongly determined by Bulbul's forward pass, which takes 37.30 s on average. Increasing the I/O load also increases runtime.

Since our earlier research was conducted using the TensorFlow library [15], TensorFlow is required on the Raspberry Pi. While the full library offers all features related to machine learning research, the roughly 1 MB TensorFlow Lite (TFlite) package focuses on inference and deployment and is sufficient for our purpose. We verified the installation by running an official audio classification tutorial [16]. In addition to the steps in the tutorial, we had to add our user to the audio group to gain access to the microphone. Following this configuration change, the tutorial ran successfully, confirming the installation and functionality of TFlite.

After installing TFlite, we migrated the Python environment and adapted the scripts. The methodology outlined in [6] requires librosa for audio processing, but installing it on the Raspbian 11 Bullseye OS was not possible due to incompatible dependencies. Therefore, librosa was replaced by the soundfile 0.12.1 package [17], which also supports processing audio data.
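As a minimal sketch of this replacement, the helper below reads a 60 s hive recording with soundfile; the function name and the mono down-mix are our own additions, and, unlike librosa.load, soundfile does not resample, so the waveform keeps the recording's native sample rate.

```python
import numpy as np
import soundfile as sf

def load_recording(path: str) -> tuple[np.ndarray, int]:
    """Read a hive recording as a mono float32 waveform at its native sample rate."""
    audio, sample_rate = sf.read(path, dtype="float32")
    if audio.ndim > 1:                 # down-mix multi-channel files to mono
        audio = audio.mean(axis=1)
    return audio, sample_rate
```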
The last step was migrating the best-performing model from our earlier research. Bulbul relies on the kapre library [18] to transform the raw audio input into a spectrogram during the forward pass. Two spectrograms of speech and bee humming are given in fig. 2a and fig. 2b, respectively, which show different intensities in the lower frequencies. In fig. 2a, vertical lines additionally indicate human speech. The audio-to-spectrogram conversion takes place in custom layers requiring non-standard TFlite operations. To avoid installing the full TensorFlow library, we replaced the spectrogram and magnitude-scaling layers of the original model with TFlite-compatible ones and transferred the trained weights. After conversion to the .tflite format, we confirmed that the model performance did not suffer by re-running the experiments from [6]. The k-NN model likewise does not run out of the box and was re-initialized on the device with k = 5. A tutorial of the setup process, including code, is available at https://professor-x.de/beepi-speech.

3.2. Performance boundaries

The smart beehive runs various sensor recording services, whose average CPU load over 15 min is 25.48 % (std 15.05). Memory usage is generally small, ranging from 5.78 % to 12.59 %, indicating that around 100 MB of the 1 GB RAM are occupied. These statistics show that the Raspberry Pi has enough resources available for real-time audio classification and for uploading data without speech. To evaluate the boundary at which this is no longer feasible, we loaded the Bulbul and k-NN models and separately timed the audio pre-processing, embedding extraction, and prediction over 117 files uploaded from our research beehive. For that, we disabled the sensor recording services and used the Linux commands stress to induce artificial CPU load and nice to prioritize it. Despite these severe restrictions, the system scheduler ensures that our classification runs in real time. Further, we evaluated the prediction performance under varying I/O loads using the stress --io n_threads command with {1..10} threads. The results in fig. 2d show that with more than 5 threads inducing I/O load, more outliers arise, and the average total runtime and its standard deviation increase. However, while these increases indicate that our script spends more time waiting, the system is still capable of real-time inference.

3.3. Real-time Classification of Incoming Audio Data

To verify the performance in a controlled, realistic setting, we re-activated the sensor recording system and ran the inference five times in sequence, yielding 585 measurements in total. A histogram of the overall runtimes is given in fig. 2c; it shows that the majority of samples are predicted in less than 60 s, with the k-NN having negligible influence. Of the 585 tested files, only 19 take longer than 60 s to classify. According to the logs, 10 of these delays are caused by the camera recording, 4 by measuring the fine-dust concentration, and for 5 the source is unclear. Apart from this, the parallel recording of the sensor modalities had no negative impact on model runtimes. In summary, the results show that only around 20 min of audio data remain unclassified over a period of roughly 10 h. On average, this translates to two audio snippets per hour that cannot be classified in real time. Further data loss through connection failures is prevented by buffering up to 60 h of classified, non-speech data.
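For illustration, the sketch below outlines how a single recording passes through the deployed pipeline: the TFlite interpreter computes the Bulbul embedding, the k-NN predicts the class, and only no-speech files are moved to the upload buffer. The model path, input shape, label values, file handling, and the use of scikit-learn for the k-NN are assumptions made for this sketch, not necessarily the exact setup on the hive.

```python
import shutil
import time
from pathlib import Path

import numpy as np
import soundfile as sf
from sklearn.neighbors import KNeighborsClassifier   # k-NN with k = 5 (assumed implementation)
from tflite_runtime.interpreter import Interpreter   # lightweight TFlite runtime package

interpreter = Interpreter(model_path="bulbul.tflite")  # converted model; file name is an assumption
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

knn = KNeighborsClassifier(n_neighbors=5)
# The k-NN is re-initialized on the device and fitted on the stored training embeddings, e.g.:
# knn.fit(train_embeddings, train_labels)

def classify_and_gate(wav_path: Path, upload_dir: Path) -> str:
    """Embed one 60 s recording, predict its class, and keep it only if no speech is detected."""
    start = time.perf_counter()
    audio, _ = sf.read(str(wav_path), dtype="float32")           # pre-processing
    batch = audio[np.newaxis, :].astype(np.float32)              # shape must match the converted model
    interpreter.set_tensor(inp["index"], batch)
    interpreter.invoke()                                         # Bulbul forward pass (dominant cost)
    embedding = interpreter.get_tensor(out["index"])
    label = knn.predict(embedding)[0]                            # k-NN on the extracted embedding
    if label == "speech":
        wav_path.unlink()                                        # discard recordings containing speech
    else:
        shutil.move(str(wav_path), str(upload_dir / wav_path.name))  # buffer non-speech data for upload
    print(f"{wav_path.name}: {label} ({time.perf_counter() - start:.1f} s)")
    return label
```

In the deployed setup, this check has to finish within the 60 s window before the next recording arrives, which is exactly the real-time requirement evaluated above.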
To evaluate the model performance on a different hive, we asked one male and one female volunteer to perform different activities (talking, playing music and singing, laughing and rough-housing, and staying silent for a fixed time period) at 1 m, 5 m, and 10 m distance from the hive. Our observations show that all activities are well detected at all distances, including quiet speech. Lastly, we let the automated audio classification run for one week, logging the predictions. The speech detected by our system can generally be mapped to real events, as consultation with the hive's owners revealed. For example, speech was detected one morning between 7:00 and 7:15 am, when the owners took their dog for a walk; in another instance, they were preparing for travel.

4. Conclusion

In this technical report, we describe the process of deploying a speech detection model onto a Raspberry Pi 3B in a smart beehive. We show that, after making adaptations to the Python code and model architecture, we can deploy and run the model on the mini-computer. On this hardware, our setup can predict the class of a one-minute audio recording in less than 60 s, i.e., in real time. Even when high CPU and I/O load simulate extreme scenarios, the setup is still capable of real-time inference. Lastly, using controlled activities with two volunteers as a test case, we showed that our model can detect human speech at various distances from the recording device. The next step is rolling out the models to all hives of the we4bee project.

References

[1] P. Davidson, M. Steininger, F. Lautenschlager, K. Kobs, A. Krause, A. Hotho, Anomaly detection in beehives using deep recurrent autoencoders, CoRR abs/2003.04576 (2020). URL: https://arxiv.org/abs/2003.04576. arXiv:2003.04576.
[2] A. Žgank, Acoustic monitoring and classification of bee swarm activity using MFCC feature extraction and HMM acoustic modeling, in: 2018 ELEKTRO, IEEE, 2018, pp. 1–4.
[3] S. Cecchi, A. Terenzi, S. Orcioni, F. Piazza, Analysis of the sound emitted by honey bees in a beehive, in: Audio Engineering Society Convention 147, Audio Engineering Society, 2019.
[4] I. Nolasco, A. Terenzi, S. Cecchi, S. Orcioni, H. L. Bear, E. Benetos, Audio-based identification of beehive states, in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 8256–8260.
[5] P. Janetzky, M. Schaller, A. Krause, A. Hotho, Swarming detection in smart beehives using auto encoders for audio data, in: 2023 30th International Conference on Systems, Signals and Image Processing (IWSSIP), 2023, pp. 1–5. doi:10.1109/IWSSIP58668.2023.10180253.
[6] P. Janetzky, P. Davidson, M. Steininger, A. Krause, A. Hotho, Detecting presence of speech in acoustic data obtained from beehives, in: DCASE, 2021, pp. 26–30.
[7] A. Saeed, Urban Sound Classification, 2016. URL: https://github.com/aqibsaeed/Urban-Sound-Classification, accessed: 2023-09-07.
[8] K. J. Piczak, ESC: Dataset for environmental sound classification, in: Proceedings of the 23rd ACM International Conference on Multimedia, 2015, pp. 1015–1018.
[9] K. J. Piczak, Environmental sound classification with convolutional neural networks, in: 2015 IEEE 25th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2015, pp. 1–6.
[10] T. Grill, J. Schlüter, Two convolutional neural networks for bird detection in audio signals, in: 2017 25th European Signal Processing Conference (EUSIPCO), IEEE, 2017, pp. 1764–1768.
[11] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician 46 (1992) 175–185.
[12] E. Fix, J. L. Hodges, Discriminatory analysis. Nonparametric discrimination: Consistency properties, International Statistical Review / Revue Internationale de Statistique 57 (1989) 238–247.
[13] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al., Rectifier nonlinearities improve neural network acoustic models, in: Proc. ICML, volume 30, Atlanta, GA, 2013, p. 3.
[14] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple way to prevent neural networks from overfitting, The Journal of Machine Learning Research 15 (2014) 1929–1958.
[15] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL: https://www.tensorflow.org/, software available from tensorflow.org.
[16] TensorFlow Lite Python audio classification example with Raspberry Pi, https://github.com/tensorflow/examples/tree/master/lite/examples/audio_classification/raspberry_pi, 2022. Accessed: 2023-07-13.
[17] B. Bechtold, soundfile audio library, https://pypi.org/project/soundfile/, 2023. Accessed: 2023-07-13.
[18] K. Choi, D. Joo, J. Kim, Kapre: On-GPU audio preprocessing layers for a quick implementation of deep neural network models with Keras, in: Machine Learning for Music Discovery Workshop at the 34th International Conference on Machine Learning, ICML, 2017.