<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Edoardo Fazzari</string-name>
          <email>edoardo.fazzari@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Carrara</string-name>
          <email>fabio.carrara@isti.cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabrizio Falchi</string-name>
          <email>fabrizio.falchi@cnr.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cesare Stefanini</string-name>
          <email>cesare.stefanini@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Donato Romano</string-name>
          <email>donato.romano@santannapisa.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Excellence in Robotics and AI, Sant'Anna School of Advanced Studies</institution>
          ,
          <addr-line>Piazza Martiri della Libertà, Pisa, 56127</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Information Science and Technologies of the National Research Council of Italy (ISTI-CNR)</institution>
          ,
          <addr-line>via G. Moruzzi, Pisa, 56124</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>The Biorobotics Institute</institution>
          ,
          <addr-line>Viale Rinaldo Piaggio, Pontedera, 56025</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Animals are sometimes exploited as biosensors for assessing the presence of volatile organic compounds (VOCs) in the environment by interpreting their stereotyped behavioral responses. However, current approaches are based on direct human observation to assess the changes in animal behavior associated with specific environmental stimuli. We propose a general workflow based on artificial intelligence that uses pose estimation and sequence classification techniques to automate this process. This study also provides an example of its application by studying the antennae movement of an insect (i.e., a cricket) in response to the presence of two chemical stimuli.</p>
      </abstract>
      <kwd-group>
        <kwd>biosensor</kwd>
        <kwd>deep learning</kwd>
        <kwd>pose estimation</kwd>
        <kwd>sequence classification</kwd>
        <kwd>cricket</kwd>
        <kwd>biohybrid system</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Biosensors exploit animal olfactory capabilities to identify volatile organic compounds (VOCs) [1] through the conversion of biological selective responses into measurable signals. This approach has several advantages: a) the animals' sensitive olfactory system is beyond human capabilities and electronic devices; b) biosensors can be portable, easy-to-use, and eco-friendly, and do not need any manufacturing process for the analysis; c) biosensors have potential applications in a wide range of fields, from ecological studies to biomedical uses. Previous research has primarily focused on the detection of explosives and narcotics [2, 3], medical diagnosis [4], and the use of animal biosensors as early warning systems for forest fires [5] that are sustainable and do not require the installation of electronic sensors. This paper focuses on classifying the type of response of crickets post-exposure to two chemical substances, namely ammonia and sucrose powders, through the analysis of the movement of their antennae. Furthermore, while previous studies have relied on user inputs or direct nerve stimuli readings [6], our work emphasizes the development of an autonomous and intelligent workflow utilizing computer vision techniques.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Materials and Methods</title>
      <sec id="sec-2-1">
        <title>We here describe the models, dataset, and proposed workflow.</title>
        <sec id="sec-2-1-1">
          <title>2.1. Models</title>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>We base our pose estimation on SLEAP, configuring it with a UNet [8] backbone. We tested multiple configurations.</title>
        <p>We searched for the best combination of max stride, number of filters, and input scaling. For the classification network, we assessed the performance of LSTM-GRU [9, 10] and 1D-convolutional [11] neural networks, searching for the best architecture via a genetic algorithm.</p>
        <p>Human-in-the-Loop Labeling: To generate the labels required to train our SLEAP model, we adopted the human-in-the-loop approach developed by Pereira et al. [12]. This method involves labeling a restricted number of frames, training the pose estimation model, and then using it to produce new labels for unlabeled frames. Additionally, incorrectly placed keypoints are repositioned, and the new labels are combined with the previously labeled ones to retrain the model from scratch. A fixed validation set is used to determine when the process should be terminated, by comparing the results in terms of mean Average Precision (mAP) between each training-labeling iteration. If there is no improvement in mAP despite increasing the number of labeled frames, the labeling phase is concluded.</p>
        <p>2.2. Dataset</p>
        <p>For this study, adult crickets (Acheta domesticus) were obtained from an e-commerce site and maintained in controlled conditions. A total of 69 crickets were selected based on size and antennae visibility. To ensure no behavioral bias, only one video was taken per cricket. Crickets were placed in a Petri dish with one of three stimuli: nothing (i.e., the control case), sucrose powder, or ammonia powder. Each recording was longer than 3 minutes and comprised two parts, the "settling in" (1 minute) and "interaction" (2 minutes) periods. An iPhone 14 Pro was used to record the Petri dish, and a light panel was used to reduce reflections. The resulting dataset is balanced and includes 23 videos for each stimulus, totaling 3 hours, 37 minutes, and 56 seconds.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.2.1. Dataset Processing</title>
        <p>The obtained videos were preprocessed by reducing the frames per second (fps) to 29 to ensure equal measurement across all videos. The "interaction period" was identified between frames 1740 and 5220, resulting in 3480 frames. The videos containing only the interactive part were reformatted to 1080x1080 pixels, centering the Petri dish and removing pixels from outside the Petri dish that could interfere with the neural network's learning for pose estimation. This was done to ensure consistency in the dataset, enabling unbiased and accurate data analysis.</p>
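        <p>The frame window described above can be sketched in a few lines (our own illustration; the helper name is hypothetical, not from the paper's code):</p>

```python
# Sketch: at 29 fps, the "interaction" period between frames 1740 and
# 5220 corresponds to the two minutes that follow the one-minute
# "settling in" period.
FPS = 29
SETTLING_S = 60      # "settling in": 1 minute
INTERACTION_S = 120  # "interaction": 2 minutes

def interaction_window(fps=FPS):
    """Return (first_frame, last_frame) of the interaction period."""
    start = SETTLING_S * fps
    end = (SETTLING_S + INTERACTION_S) * fps
    return start, end
```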
        <sec id="sec-2-3-1">
          <title>2.3. Proposed Workflow</title>
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>We describe here the suggested workflow steps for the development of BISS.</title>
        <p>We refer only to the techniques exploited and not to our specific experiments (see section 3 for those). Figure 1 schematizes the tasks undergone in our workflow.</p>
        <p>Grid Search for Parameter Optimization: Once a suitable training set has been identified in the previous step, the subsequent task is to optimize the pose estimation architecture by modifying the parameters that have the most significant impact, such as max stride, initial number of filters, and input scaling. To limit the number of training runs, it is crucial to consider the configuration used in the previous phase and follow three key guidelines. Firstly, if the objective is to identify fine features characterized by a small number of pixels, it is recommended to increase the value of input scaling.</p>
        <p>Conversely, if most of the keypoints are already detected,
it is advisable to reduce the value of input scaling to
obtain a smaller model. Secondly, to ensure that the entire
animal is covered, the receptive field should be resized by
changing the value of the max stride. Lastly, the initial
number of filters should be tested with values of 32 and
64, and the preference should be given to the lower value,
i.e., a smaller and faster network, even if it results in the
same mAP.</p>
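        <p>This search can be sketched as a plain grid enumeration (a minimal sketch; evaluate_map is a hypothetical stand-in for training a model with the given configuration and measuring mAP on the validation set):</p>

```python
import itertools

# Sketch of the grid search over the pose-estimation parameters named
# above: max stride, initial number of filters, and input scaling.
def grid_search(evaluate_map,
                max_strides=(32, 64),
                filters=(32, 64),
                input_scalings=(0.7, 0.8, 0.9, 1.0)):
    best_cfg, best_map = None, float("-inf")
    for cfg in itertools.product(max_strides, filters, input_scalings):
        score = evaluate_map(*cfg)
        # Strictly-greater comparison with filters iterated low-to-high
        # keeps the smaller (faster) network on an mAP tie, as the
        # guidelines above recommend.
        if score > best_map:
            best_cfg, best_map = cfg, score
    return best_cfg, best_map
```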
        <p>Keypoint Sequence Extraction and Preprocessing: Once the best pose estimation model has been identified, tracking sequences can be obtained for all videos in the dataset. Before proceeding to the next stage, each sequence should undergo preprocessing, which includes implementing a filling strategy to remove any NaN values and normalizing the values by subtracting the mean and dividing by the standard deviation. In our workflow, the filling strategy is tailored to each specific keypoint sequence and utilizes the following formula to fill a missing value at position t: x_t = α · x_{t+k} + (1 − α) · x_{t−1}, with α = 1/k (1), where t + k is the first subsequent frame with a non-NaN value for the keypoint under consideration. An important case is when the value at t = 0 is NaN; here, the value is set to the first subsequent non-NaN value.
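A minimal sketch of this filling rule follows (pure Python; the weighting is our reading of Equation (1) and should be treated as an assumption, and the sequence is assumed to end with a valid value):

```python
import math

def fill_missing(seq):
    """Fill NaNs per Equation (1): a missing value at position t is
    replaced by alpha * x[t+k] + (1 - alpha) * x[t-1], where t + k is
    the first subsequent non-NaN frame and alpha = 1/k. A NaN at t = 0
    is set to the first subsequent non-NaN value."""
    out = list(seq)
    for t, v in enumerate(out):
        if not math.isnan(v):
            continue
        # index offset k of the first subsequent non-NaN frame
        k = next(d for d in range(1, len(out) - t)
                 if not math.isnan(out[t + d]))
        nxt = out[t + k]
        if t == 0:
            out[t] = nxt
        else:
            alpha = 1.0 / k
            out[t] = alpha * nxt + (1.0 - alpha) * out[t - 1]
    return out
```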
</p>
        <p>Genetic Algorithm (GA) for Architecture Development: In this stage, the search for the best classification model is undertaken using a Genetic Algorithm (GA). This method typically results in the development of models with good predictive accuracy at a relatively low cost compared to other approaches, such as random search [13]. In order to employ a GA, six key parameters are taken into consideration: the initial population, the fitness function, the selection, crossover, and mutation functions, and the chromosome structure. The size of the initial population should be chosen based on the computational power available, as a larger population will take longer to compute due to increased training, while a smaller one may lead to inferior results. The fitness function, which is developed as a maximization objective, is defined as follows: fit(gene) = −15 if (a); −10 · (1 − train_accuracy) if (b); −20 if (c); −val_loss / train_accuracy otherwise (2), where (a) stands for "the training or validation accuracies are less than or equal to 1 over the number of classes, or the training accuracy is less than the validation accuracy"; (b) stands for "the training accuracy is less than 0.1"; (c) stands for "no convolutional or RNN layers are present". The decision to create such a fitness function, rather than simply minimizing the validation loss, is motivated by the fact that, for problems with limited data and inherent complexity such as ours, a model can achieve a validation loss that is close to, or even lower than, that of models with better validation accuracy above the random guess. To overcome this issue, the fitness function was designed to take into account the training accuracy, enabling another metric for evaluating the network's quality. Higher training accuracy values lead to fitness values closer to those obtained from the validation loss, indicating a kind of network goodness that can be utilized in subsequent GA iterations. Additionally, the "train_accuracy&lt;val_accuracy" check is incorporated to prevent the genetic algorithm from overfitting on the validation accuracy, which could adversely affect the training accuracy and hinder the ability to generalize effectively. Finally, we check whether the training accuracy is less than 0.1, setting a default value for this case; this was done to prevent false suggestions of good models through the first case in Equation (2).</p>
        <p>The employed selection algorithm is tournament selection, which randomly selects a predetermined number of individuals from the population and then picks the fittest individual of the group, adding it to the mating pool. Along with tournament selection, elitism was incorporated as a selection strategy. The chosen crossover algorithm was a bounded version of Simulated Binary Crossover (SBX) [14]. Lastly, the mutation function utilized was bounded polynomial mutation, a bounded mutation operator that uses a polynomial function for the probability distribution.</p>
        <p>The chromosomes used to construct the one-dimensional convolutional network are composed of 56 real-coded genes. The first block consists of six genes repeated 5 times, indicating: 1) the presence of the convolutional block (0 if absent, 1 if present); 2) the number of filters of the one-dimensional convolutional layer (16 to 1024); 3) the presence of the batch normalization layer (0 if absent, 1 if present); 4) the activation function (0: sigmoid, 1: swish, 2: tanh, 3: relu, 4: gelu, 5: elu, 6: leaky relu); 5) the presence of dropout (0 if absent, 1 if present); 6) the dropout rate (0 to 0.5, considering only multiples of 0.05). Following this, a gene is used to indicate the type of connection between the convolutional and fully connected layers: it determines whether the layer is Flatten or GlobalAveragePooling1D (0 indicating the former, and 1 the latter). The second block consists of 5 genes indicating: 1) the presence of the fully connected block (0 if absent, 1 if present); 2) the number of units (3 to 512); 3) the activation function; 4) the presence of dropout; 5) the dropout rate. The chromosomes used to construct the RNN are composed of 50 real-coded genes instead. Compared with the previous case, the only changes are related to the first block and the removal of the gene encoding the connection between the convolutional and the fully connected part, which is not necessary in this case. The first block consists of 5 genes repeated 5 times, indicating: 1) the presence of the RNN block; 2) the use of a bidirectional layer (0 if absent, 1 if present); 3) the type of RNN (0: LSTM, 1: GRU); 4) the number of units (16 to 1024); 5) the activation function.</p>
        <p>Compare GA Results: After finding the best model using the genetic algorithm for both the convolutional and RNN cases, the models were evaluated using iterated K-fold validation. This validation technique involves randomly shuffling the dataset, splitting it into training and validation sets, and then running K-fold validation multiple times. This method is crucial in cases where the available data is limited, as in our study, and an accurate evaluation of the model is needed. The final score is obtained by calculating the mean accuracy across all K-fold validation runs.</p>
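        <p>A compact sketch of this fitness function is given below (the constants and the branch order follow our reading of Equation (2), so they are assumptions rather than the authors' exact code):</p>

```python
def fitness(train_acc, val_acc, val_loss, has_layers, n_classes=3):
    """Maximization objective sketched from Equation (2)."""
    chance = 1.0 / n_classes
    if not has_layers:   # case (c): no convolutional or RNN layers
        return -20.0
    if train_acc < 0.1:  # case (b): default for very low training accuracy
        return -10.0 * (1.0 - train_acc)
    if (train_acc <= chance or val_acc <= chance
            or train_acc < val_acc):  # case (a)
        return -15.0
    # otherwise: validation loss scaled by training accuracy, so higher
    # training accuracy moves the fitness closer to -val_loss
    return -val_loss / train_acc
```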
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Results &amp; Discussion</title>
      <p>In this section, we present the findings of our experiments. Firstly, we elaborate on the results derived from human-in-the-loop testing, followed by a detailed exposition of the optimal configuration parameters identified for the pose estimation model, in subsection 3.1. Subsequently, in subsection 3.2, we showcase the two models (namely the LSTM-GRU and convolutional architectures) discovered by the genetic algorithm, in tandem with a comparative analysis of their performance based on the tracking sequences obtained from our SLEAP model. Finally, each section entails a discussion of the limitations inherent in our methodology, along with a discourse on possible future extensions that may serve to enhance the effectiveness of our workflow.</p>
      <p>Figure 2: The image depicts an instance in which our model has erroneously labeled a frame by misplacing the right and left distal ends in a single location. A correction of the placement of the right distal end is indicated by a red dot, which was achieved by scrutinizing the temporal information. This particular example serves to underscore the prospective advantages that may be derived from integrating temporal context into the keypoint detection process.</p>
      <sec id="sec-3-1">
        <title>3.1. Pose Estimation</title>
        <p>Our experiment focused on determining whether it was
feasible to predict the chemical composition of substances
(i.e., sucrose or ammonia powders) by analyzing the
movement of a cricket’s antennae. In order to achieve
this, we strategically placed five keypoints at the
proximal and distal ends of the antennae (both left and right)
as well as one over the head.</p>
        <p>To generate the required labels for training the SLEAP
model, a human-in-the-loop approach was utilized. The
SLEAP model employed had a maximum stride value of
64, an initial filter rate of 64, and input scaling of 0.7.
This process produced a total of 5460 training frames
and resulted in a mean average precision (mAP) value
of 0.804768, which was validated using 300 frames from
videos different from those used in the training set.</p>
        <p>A grid search was performed on the obtained training
set by varying the maximum stride value (32 or 64), initial
filter rate (32 or 64), and input scaling (increased by 0.1
until it reached 1.0). The optimal parameter
configuration was determined to be a maximum stride value of 64
and an input scaling of 1.0, which achieved an mAP of
0.837392 on the validation set. These parameters were
selected based on the guidelines outlined in subsection 2.3,
as detecting fine features such as the distal end of the
antennae (characterized by 4 ± 1 pixels in width) required
the parameters to be set to their maximum suggested
values.</p>
        <p>The mean average precision (mAP) value obtained is strongly associated with the difficulty of locating the distal ends of the antennae, particularly when the crickets are situated close to the wall of the Petri dish. In such instances, the two antennae are more prone to overlap, as is evident in Figure 2. The occurrence of such errors can be mitigated by taking into account temporal information, such as previous and subsequent frames, when predicting for a single frame. Although such an architecture already exists in the Animal Pose Estimation (APE) domain [15], it employs a less complex neural network [12] compared to the one utilized in this study. While this approach may yield higher mAP values, the computational cost may increase substantially.</p>
        <p>Training details: we trained each model for 400 epochs with a batch size of 8 using the Adam optimizer. The initial learning rate was set to 1e-4, and we made use of SLEAP's default learning rate decay strategy, with a patience of 20 epochs and a minimum delta of 1e-8. To mitigate the potential risk of overfitting, we incorporated early stopping, terminating the training when the validation loss did not decrease for 50 epochs.</p>
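        <p>The early-stopping rule above follows a standard pattern, sketched here generically (our illustration, not SLEAP's implementation):</p>

```python
class EarlyStopping:
    """Stop when the monitored loss has not improved by at least
    min_delta for `patience` consecutive epochs."""
    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Record one epoch's validation loss; return True to stop."""
        if loss < self.best - self.min_delta:
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
        return self.wait >= self.patience
```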
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Chemical Interaction Classification</title>
        <sec id="sec-3-2-1">
          <title>The genetic algorithm hyperparameters for the population, the number of epochs, and the elitism were set to 250, 20, and 10, respectively.</title>
          <p>The structure of the best generated CNN consists of a convolutional layer with 821 filters, followed by a batch normalization layer and a tanh activation layer. This was followed by a convolutional layer with 821 filters and an elu activation layer, which was in turn followed by a dropout layer with a rate of 0.2.</p>
          <p>The final convolutional layer contained 483 filters and a tanh activation layer. The output of the convolutional layers was flattened. The RNN was structured as a bidirectional LSTM layer with 707 units and an elu activation function, followed by a GRU layer with 660 units and a leaky relu activation function. This was followed by a bidirectional GRU layer with 469 units and a leaky relu activation function, a dense layer with 138 units and a gelu activation function, and a dropout layer with a rate of 0.2. This was in turn followed by a dense layer with 150 units and a leaky relu activation function. The validation accuracy obtained by the two models was 58.33% and 50%, respectively.</p>
          <p>Training details: we trained each model constructed for the genetic algorithm and the iterated K-fold validation using 1000 epochs with a batch size of 16. To prevent overfitting, we made use of early stopping, terminating the training when the validation loss did not decrease for 100 epochs. The learning rate was reduced on plateau with a minimum delta of 1e-3 and a patience of 50.</p>
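          <p>The iterated K-fold procedure used to compare the models can be sketched as follows (pure Python, splitting logic only; train_and_score is a hypothetical callback that fits a model and returns its validation accuracy):</p>

```python
import random

def iterated_kfold(n_samples, train_and_score, k=4, iterations=10, seed=0):
    """Shuffle, split into k folds, score each fold, repeat; return the
    mean accuracy across all k * iterations runs."""
    rng = random.Random(seed)
    scores = []
    for _ in range(iterations):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for i in range(k):
            val_idx = folds[i]
            train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(train_and_score(train_idx, val_idx))
    return sum(scores) / len(scores)
```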
          <p>The effective comparison between the two models was conducted using 10 iterations of 4-fold validation, and the corresponding boxplot for each iteration is presented in Figure 3. Notably, the recurrent neural network exhibited superior accuracy in the initial iteration, achieving a noteworthy 80% accuracy. However, upon averaging the results, it performed slightly worse than the convolutional neural network: the average accuracy was 45.33% ± 5.85% for the convolutional network and 44% ± 6.6% for the recurrent one, with the generated CNN model proving marginally more effective. It is essential to acknowledge that while the outcomes are not extraordinary, they are still superior to current human capabilities, as discerning antennae movement remains a challenging task. Additionally, the results may be attributed to the limited attention span of animals and the potential impact of behavioral variations [1]. The crickets used in the study were not trained to exhibit specific behaviors towards the two powders; rather, their innate behavior was assessed. Moreover, it is worth noting that strategies to train crickets to respond in a specific manner to certain stimuli already exist [16, 17] and can be introduced into our workflow to further enhance the accuracy of our models. Therefore, future studies may incorporate such training methodologies to overcome the potential limitations of using untrained crickets as sensors. By doing so, we can not only improve the performance of the models but also pave the way for the development of novel sensor technologies inspired by natural systems.</p>
          <p>4. Conclusion</p>
          <p>This paper proposes a novel workflow for the creation of Biohybrid Intelligent Sensing Systems (BISS) utilizing deep learning techniques, specifically convolutional and recurrent neural networks. The underlying motivation for this approach is to enhance the performance and broaden the spectrum of potential applications of animal biosensors, by facilitating a more precise mapping of animal behaviors to the identification of volatile organic compounds or other environmental changes. The development of such methodologies has the potential to address certain limitations associated with the use of animal biosensors, such as the introduction of errors stemming from human observation and interpretation, while also facilitating method standardization. Additionally, by relying solely on recordings, BISS presents an ethical and environmentally sustainable alternative. In our upcoming endeavors, we plan to incorporate temporal information in our pose estimation task to further enhance our workflow. Furthermore, we intend to perform experiments with trained animals to more accurately gauge how our approach can improve the current state of the art in animal biosensors.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>