<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo Fazzari</string-name>
          <email>edoardo.fazzari@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Carrara</string-name>
          <email>fabio.carrara@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <email>fabrizio.falchi@cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesare Stefanini</string-name>
          <email>cesare.stefanini@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Romano</string-name>
          <email>donato.romano@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Excellence in Robotics and AI, Sant'Anna School of Advanced Studies</institution>
          ,
          <addr-line>Piazza Martiri della Libertà, Pisa, 56127</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science and Technologies of the National Research Council of Italy (ISTI-CNR)</institution>
          ,
          <addr-line>via G. Moruzzi, Pisa, 56124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Biorobotics Institute</institution>
          ,
          <addr-line>Viale Rinaldo Piaggio, Pontedera, 56025</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Animals are sometimes exploited as biosensors for assessing the presence of volatile organic compounds (VOCs) in the environment by interpreting their stereotyped behavioral responses. However, current approaches are based on direct human observation to assess the changes in animal behavior associated with specific environmental stimuli. We propose a general workflow based on artificial intelligence that uses pose estimation and sequence classification techniques to automate this process. This study also provides an example of its application by studying the antennae movement of an insect (i.e., a cricket) in response to the presence of two chemical stimuli.</p>
      </abstract>
      <kwd-group>
        <kwd>biosensor</kwd>
        <kwd>deep learning</kwd>
        <kwd>pose estimation</kwd>
        <kwd>sequence classification</kwd>
        <kwd>cricket</kwd>
        <kwd>biohybrid system</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Biosensors exploit animal olfactory capabilities to identify volatile organic compounds (VOCs) [1] through the conversion of biological selective responses into measurable signals. This approach has several advantages: a) the animals' sensitive olfactory system is beyond human capabilities and electronic devices; b) biosensors can be portable, easy-to-use, and eco-friendly, and do not need any manufacturing process for the analysis; c) biosensors have potential applications in a wide range of fields, from ecological studies to biomedical uses. Previous research has primarily focused on the detection of explosives and narcotics [2, 3], medical diagnosis [4], and the use of animal biosensors as early warning systems for forest fires [5] that are sustainable and do not require the installation of electronic sensors. This paper focuses on classifying the type of response of crickets post-exposure to two chemical substances, namely ammonia and sucrose powders, through the analysis of the movement of their antennae. Furthermore, while previous studies have relied on user inputs or direct nerve stimuli readings [6], our work emphasizes the development of an autonomous and intelligent workflow utilizing computer vision techniques.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <sec id="sec-2-1">
        <title>We here describe the models, dataset, and proposed workflow.</title>
        <sec id="sec-2-1-1">
          <title>2.1. Models</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>We base our pose estimation on SLEAP, configuring it with a UNet [8] backbone. We tested multiple configurations.</title>
        <p>We searched for the best combination of max stride, number of filters, and input scaling. For the classification network, we assessed the performance of LSTM-GRU [9, 10] and 1D-convolutional [11] neural networks, searching for the best architecture via a genetic algorithm.</p>
        <p>Human-in-the-Loop Labeling: To generate the labels required to train our SLEAP model, we adopted the human-in-the-loop approach developed by Pereira et al. [12]. This method involves labeling a restricted number of frames, training the pose estimation model, and then using it to produce new labels for unlabeled frames. Additionally, incorrectly placed keypoints are repositioned, and the new labels are combined with the previously labeled ones to retrain the model from scratch. A fixed validation set is used to determine when the process should be terminated, by comparing the results in terms of mean Average Precision (mAP) between each training-labeling iteration. If there is no improvement in mAP despite increasing the number of labeled frames, the labeling phase is concluded.</p>
        <p>2.2. Dataset</p>
        <p>For this study, adult crickets (Acheta domesticus) were obtained from an e-commerce site and maintained in controlled conditions. A total of 69 crickets were selected based on size and antennae visibility. To ensure no behavioral bias, only one video was taken per cricket. Crickets were placed in a Petri dish with one of three stimuli: nothing (i.e., the control case), sucrose powder, or ammonia powder. Each recording was longer than 3 minutes and comprised two parts, the "settling in" (1 minute) and "interaction" (2 minutes) periods. An iPhone 14 Pro was used to record the Petri dish, and a light panel was used to reduce reflections. The resulting dataset is balanced and includes 23 videos for each stimulus, totaling 3 hours, 37 minutes, and 56 seconds.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2.1. Dataset Processing</title>
        <p>The obtained videos were preprocessed by reducing the frames per second (fps) to 29 to ensure equal measurement across all videos. The "interaction period" was identified between frames 1740 and 5220, resulting in 3480 frames. The videos containing only the interactive part were reformatted to 1080x1080 pixels, centering the Petri dish and removing pixels from outside the Petri dish that could interfere with the neural network's learning for pose estimation. This was done to ensure consistency in the dataset, enabling unbiased and accurate data analysis.</p>
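        <p>The frame window described above can be sketched in a few lines (our own illustration; the helper name is hypothetical, not from the paper's code):</p>

```python
# Sketch: at 29 fps, the "interaction" period between frames 1740 and
# 5220 corresponds to the two minutes that follow the one-minute
# "settling in" period.
FPS = 29
SETTLING_S = 60      # "settling in": 1 minute
INTERACTION_S = 120  # "interaction": 2 minutes

def interaction_window(fps=FPS):
    """Return (first_frame, last_frame) of the interaction period."""
    start = SETTLING_S * fps
    end = (SETTLING_S + INTERACTION_S) * fps
    return start, end
```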
        <sec id="sec-2-3-1">
          <title>2.3. Proposed Workflow</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>We describe here the suggested workflow steps for the development of BISS.</title>
        <p>We refer only to the techniques exploited and not to our specific experiments (see section 3 for those). Figure 1 schematizes the tasks undergone in our workflow.</p>
        <p>Grid Search for Parameter Optimization: Once a suitable training set has been identified in the previous step, the subsequent task is to optimize the pose estimation architecture by modifying the parameters that have the most significant impact, such as max stride, initial number of filters, and input scaling. To limit the number of training runs, it is crucial to consider the configuration used in the previous phase and follow three key guidelines. Firstly, if the objective is to identify fine features characterized by a small number of pixels, it is recommended to increase the value of input scaling.</p>
        <p>Conversely, if most of the keypoints are already detected,
it is advisable to reduce the value of input scaling to
obtain a smaller model. Secondly, to ensure that the entire
animal is covered, the receptive field should be resized by
changing the value of the max stride. Lastly, the initial
number of filters should be tested with values of 32 and
64, and the preference should be given to the lower value,
i.e., a smaller and faster network, even if it results in the
same mAP.</p>
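        <p>This search can be sketched as a plain grid enumeration (a minimal sketch; evaluate_map is a hypothetical stand-in for training a model with the given configuration and measuring mAP on the validation set):</p>

```python
import itertools

# Sketch of the grid search over the pose-estimation parameters named
# above: max stride, initial number of filters, and input scaling.
def grid_search(evaluate_map,
                max_strides=(32, 64),
                filters=(32, 64),
                input_scalings=(0.7, 0.8, 0.9, 1.0)):
    best_cfg, best_map = None, float("-inf")
    for cfg in itertools.product(max_strides, filters, input_scalings):
        score = evaluate_map(*cfg)
        # Strictly-greater comparison with filters iterated low-to-high
        # keeps the smaller (faster) network on an mAP tie, as the
        # guidelines above recommend.
        if score > best_map:
            best_cfg, best_map = cfg, score
    return best_cfg, best_map
```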
        <p>Keypoint Sequence Extraction and Preprocessing: Once the best pose estimation model has been identified, tracking sequences can be obtained for all videos in the dataset. Before proceeding to the next stage, each sequence should undergo preprocessing, which includes implementing a filling strategy to remove any NaN values and normalizing the values by subtracting the mean and dividing by the standard deviation. In our workflow, the filling strategy is tailored to each specific keypoint sequence and utilizes the following formula to fill a missing value at position t: x_t = α · x_{t+k} + (1 − α) · x_{t−1}, with α = 1/k (1), where t + k is the first subsequent frame with a non-NaN value for the keypoint under consideration. An important case is when the value at t = 0 is NaN; here, the value is set to the first subsequent non-NaN value.
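A minimal sketch of this filling rule follows (pure Python; the weighting is our reading of Equation (1) and should be treated as an assumption, and the sequence is assumed to end with a valid value):

```python
import math

def fill_missing(seq):
    """Fill NaNs per Equation (1): a missing value at position t is
    replaced by alpha * x[t+k] + (1 - alpha) * x[t-1], where t + k is
    the first subsequent non-NaN frame and alpha = 1/k. A NaN at t = 0
    is set to the first subsequent non-NaN value."""
    out = list(seq)
    for t, v in enumerate(out):
        if not math.isnan(v):
            continue
        # index offset k of the first subsequent non-NaN frame
        k = next(d for d in range(1, len(out) - t)
                 if not math.isnan(out[t + d]))
        nxt = out[t + k]
        if t == 0:
            out[t] = nxt
        else:
            alpha = 1.0 / k
            out[t] = alpha * nxt + (1.0 - alpha) * out[t - 1]
    return out
```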
</p>
        <p>Genetic Algorithm (GA) for Architecture Development: In this stage, the search for the best classification model is undertaken using a Genetic Algorithm (GA). This method typically results in the development of models with good predictive accuracy at a relatively low cost compared to other approaches, such as random search [13]. In order to employ a GA, six key parameters are taken into consideration: the initial population, the fitness function, the selection, crossover, and mutation functions, and the chromosome structure. The size of the initial population should be chosen based on the computational power available, as a larger population will take longer to compute due to increased training, while a smaller one may lead to inferior results. The fitness function, which is developed as a maximization objective, is defined as follows: fit(gene) = −15 if (a); −10 · (1 − train_accuracy) if (b); −20 if (c); −val_loss / train_accuracy otherwise (2), where (a) stands for "the training or validation accuracies are less than or equal to 1 over the number of classes, or the training accuracy is less than the validation accuracy"; (b) stands for "the training accuracy is less than 0.1"; (c) stands for "no convolutional or RNN layers are present". The decision to create such a fitness function, rather than simply minimizing the validation loss, is motivated by the fact that, for problems with limited data and inherent complexity such as ours, a model can achieve a validation loss that is close to, or even lower than, that of models with better validation accuracy above the random guess. To overcome this issue, the fitness function was designed to take into account the training accuracy, enabling another metric for evaluating the network's quality. Higher training accuracy values lead to fitness values closer to those obtained from the validation loss, indicating a kind of network goodness that can be utilized in subsequent GA iterations. Additionally, the "train_accuracy&lt;val_accuracy" check is incorporated to prevent the genetic algorithm from overfitting on the validation accuracy, which could adversely affect the training accuracy and hinder the ability to generalize effectively. Finally, we check whether the training accuracy is less than 0.1, setting a default value for this case; this was done to prevent false suggestions of good models through the first case in Equation (2).</p>
        <p>The employed selection algorithm is tournament selection, which randomly selects a predetermined number of individuals from the population and then picks the fittest individual of the group, adding it to the mating pool. Along with tournament selection, elitism was incorporated as a selection strategy. The chosen crossover algorithm was a bounded version of Simulated Binary Crossover (SBX) [14]. Lastly, the mutation function utilized was bounded polynomial mutation, a bounded mutation operator that uses a polynomial function for the probability distribution.</p>
        <p>The chromosomes used to construct the one-dimensional convolutional network are composed of 56 real-coded genes. The first block consists of six genes repeated 5 times, indicating: 1) the presence of the convolutional block (0 if absent, 1 if present); 2) the number of filters of the one-dimensional convolutional layer (16 to 1024); 3) the presence of the batch normalization layer (0 if absent, 1 if present); 4) the activation function (0: sigmoid, 1: swish, 2: tanh, 3: relu, 4: gelu, 5: elu, 6: leaky relu); 5) the presence of dropout (0 if absent, 1 if present); 6) the dropout rate (0 to 0.5, considering only multiples of 0.05). Following this, a gene is used to indicate the type of connection between the convolutional and fully connected layers: it determines whether the layer is Flatten or GlobalAveragePooling1D (0 indicating the former, and 1 the latter). The second block consists of 5 genes indicating: 1) the presence of the fully connected block (0 if absent, 1 if present); 2) the number of units (3 to 512); 3) the activation function; 4) the presence of dropout; 5) the dropout rate. The chromosomes used to construct the RNN are composed of 50 real-coded genes instead. Compared with the previous case, the only changes are related to the first block and the removal of the gene encoding the connection between the convolutional and the fully connected part, which is not necessary in this case. The first block consists of 5 genes repeated 5 times, indicating: 1) the presence of the RNN block; 2) the use of a bidirectional layer (0 if absent, 1 if present); 3) the type of RNN (0: LSTM, 1: GRU); 4) the number of units (16 to 1024); 5) the activation function.</p>
        <p>Compare GA Results: After finding the best model using the genetic algorithm for both the convolutional and RNN cases, the models were evaluated using iterated K-fold validation. This validation technique involves randomly shuffling the dataset, splitting it into training and validation sets, and then running K-fold validation multiple times. This method is crucial in cases where the available data is limited, as in our study, and an accurate evaluation of the model is needed. The final score is obtained by calculating the mean accuracy across all K-fold validation runs.</p>
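        <p>A compact sketch of this fitness function is given below (the constants and the branch order follow our reading of Equation (2), so they are assumptions rather than the authors' exact code):</p>

```python
def fitness(train_acc, val_acc, val_loss, has_layers, n_classes=3):
    """Maximization objective sketched from Equation (2)."""
    chance = 1.0 / n_classes
    if not has_layers:   # case (c): no convolutional or RNN layers
        return -20.0
    if train_acc < 0.1:  # case (b): default for very low training accuracy
        return -10.0 * (1.0 - train_acc)
    if (train_acc <= chance or val_acc <= chance
            or train_acc < val_acc):  # case (a)
        return -15.0
    # otherwise: validation loss scaled by training accuracy, so higher
    # training accuracy moves the fitness closer to -val_loss
    return -val_loss / train_acc
```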
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results &amp; Discussion</title>
      <p>In this section, we present the findings of our experiments. Firstly, we elaborate on the results derived from human-in-the-loop testing, followed by a detailed exposition of the optimal configuration parameters identified for the pose estimation model, in subsection 3.1. Subsequently, in subsection 3.2, we showcase the two models (namely the LSTM-GRU and convolutional architectures) discovered by the genetic algorithm, in tandem with a comparative analysis of their performance based on the tracking sequences obtained from our SLEAP model. Finally, each section entails a discussion of the limitations inherent in our methodology, along with a discourse on possible future extensions that may serve to enhance the effectiveness of our workflow.</p>
      <p>Figure 2: The image depicts an instance in which our model has erroneously labeled a frame by misplacing the right and left distal ends in a single location. A correction of the placement of the right distal end is indicated by a red dot, which was achieved by scrutinizing the temporal information. This particular example serves to underscore the prospective advantages that may be derived from integrating temporal context into the keypoint detection process.</p>
      <sec id="sec-3-1">
        <title>3.1. Pose Estimation</title>
        <p>Our experiment focused on determining whether it was
feasible to predict the chemical composition of substances
(i.e., sucrose or ammonia powders) by analyzing the
movement of a cricket’s antennae. In order to achieve
this, we strategically placed five keypoints at the
proximal and distal ends of the antennae (both left and right)
as well as one over the head.</p>
        <p>To generate the required labels for training the SLEAP
model, a human-in-the-loop approach was utilized. The
SLEAP model employed had a maximum stride value of
64, an initial filter rate of 64, and input scaling of 0.7.
This process produced a total of 5460 training frames
and resulted in a mean average precision (mAP) value
of 0.804768, which was validated using 300 frames from
videos different from those used in the training set.</p>
        <p>A grid search was performed on the obtained training
set by varying the maximum stride value (32 or 64), initial
filter rate (32 or 64), and input scaling (increased by 0.1
until it reached 1.0). The optimal parameter
configuration was determined to be a maximum stride value of 64
and an input scaling of 1.0, which achieved an mAP of
0.837392 on the validation set. These parameters were
selected based on the guidelines outlined in subsection 2.3,
as detecting fine features such as the distal end of the
antennae (characterized by 4 ± 1 pixels in width) required
the parameters to be set to their maximum suggested
values.</p>
        <p>The mean average precision (mAP) value obtained is strongly associated with the difficulty of locating the distal ends of the antennae, particularly when the crickets are situated close to the wall of the Petri dish. In such instances, the two antennae are more prone to overlap, as is evident in Figure 2. The occurrence of such errors can be mitigated by taking into account temporal information, such as previous and subsequent frames, when predicting for a single frame. Although such an architecture already exists in the Animal Pose Estimation (APE) domain [15], it employs a less complex neural network [12] compared to the one utilized in this study. While this approach may yield higher mAP values, the computational cost may increase substantially.</p>
        <p>Training details: we trained each model for 400 epochs with a batch size of 8 using the Adam optimizer. The initial learning rate was set to 1e-4, and we made use of SLEAP's default learning rate decay strategy, with a patience of 20 epochs and a minimum delta of 1e-8. To mitigate the potential risk of overfitting, we incorporated early stopping, terminating the training when the validation loss did not decrease for 50 epochs.</p>
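        <p>The early-stopping rule above follows a standard pattern, sketched here generically (our illustration, not SLEAP's implementation):</p>

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved by at least
    min_delta for `patience` consecutive epochs."""
    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Record one epoch's validation loss; return True to stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```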
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Chemical Interaction Classification</title>
        <sec id="sec-3-2-1">
          <title>The genetic algorithm hyperparameters for the population, the number of epochs, and the elitism were set to 250, 20, and 10, respectively.</title>
          <p>The structure of the best generated CNN consists of a convolutional layer with 821 filters, followed by a batch normalization layer and a tanh activation layer. This was followed by a convolutional layer with 821 filters and an elu activation layer, which was in turn followed by a dropout layer with a rate of 0.2.</p>
          <p>The final convolutional layer contained 483 filters and a tanh activation layer. The output of the convolutional layers was flattened. The RNN was structured as a bidirectional LSTM layer with 707 units and an elu activation function, followed by a GRU layer with 660 units and a leaky relu activation function. This was followed by a bidirectional GRU layer with 469 units and a leaky relu activation function, a dense layer with 138 units and a gelu activation function, and a dropout layer with a rate of 0.2. This was in turn followed by a dense layer with 150 units and a leaky relu activation function. The validation accuracy obtained by the two models was 58.33% and 50%, respectively.</p>
          <p>Training details: we trained each model constructed for the genetic algorithm and the iterated K-fold validation using 1000 epochs with a batch size of 16. To prevent overfitting, we made use of early stopping, terminating the training when the validation loss did not decrease for 100 epochs. The learning rate was reduced on plateau with a minimum delta of 1e-3 and a patience of 50.</p>
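          <p>The iterated K-fold procedure used to compare the models can be sketched as follows (pure Python, splitting logic only; train_and_score is a hypothetical callback that fits a model and returns its validation accuracy):</p>

```python
import random

def iterated_kfold(n_samples, train_and_score, k=4, iterations=10, seed=0):
    """Shuffle, split into k folds, score each fold, repeat; return the
    mean accuracy across all k * iterations runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(train_and_score(train_idx, val_idx))
    return sum(scores) / len(scores)
```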
          <p>The effective comparison between the two models was conducted using 10 iterations of 4-fold validation, and the corresponding boxplot for each iteration is presented in Figure 3. Notably, the recurrent neural network exhibited superior accuracy in the initial iteration, achieving a noteworthy 80% accuracy. However, upon averaging the results, it performed slightly worse than the convolutional neural network: the average accuracy was 45.33% ± 5.85% for the convolutional network and 44% ± 6.6% for the recurrent one, with the generated CNN model proving marginally more effective. It is essential to acknowledge that while the outcomes are not extraordinary, they are still superior to current human capabilities, as discerning antennae movement remains a challenging task. Additionally, the results may be attributed to the limited attention span of animals and the potential impact of behavioral variations [1]. The crickets used in the study were not trained to exhibit specific behaviors towards the two powders; rather, their innate behavior was assessed. Moreover, it is worth noting that strategies to train crickets to respond in a specific manner to certain stimuli already exist [16, 17] and can be introduced into our workflow to further enhance the accuracy of our models. Therefore, future studies may incorporate such training methodologies to overcome the potential limitations of using untrained crickets as sensors. By doing so, we can not only improve the performance of the models but also pave the way for the development of novel sensor technologies inspired by natural systems.</p>
          <p>4. Conclusion</p>
          <p>This paper proposes a novel workflow for the creation of Biohybrid Intelligent Sensing Systems (BISS) utilizing deep learning techniques, specifically convolutional and recurrent neural networks. The underlying motivation for this approach is to enhance the performance and broaden the spectrum of potential applications of animal biosensors, by facilitating a more precise mapping of animal behaviors to the identification of volatile organic compounds or other environmental changes. The development of such methodologies has the potential to address certain limitations associated with the use of animal biosensors, such as the introduction of errors stemming from human observation and interpretation, while also facilitating method standardization. Additionally, by relying solely on recordings, BISS presents an ethical and environmentally sustainable alternative. In our upcoming endeavors, we plan to incorporate temporal information in our pose estimation task to further enhance our workflow. Furthermore, we intend to perform experiments with trained animals to more accurately gauge how our approach can improve the current state of the art in animal biosensors.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>