<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Spiking Emotions: Dynamic Vision Emotion Recognition Using Spiking Neural Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Binqiang Wang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gang Dong</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yaqian Zhao</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rengang Li</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongbin Yang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenfeng Yin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lingyan Liang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Shandong Massive Information Technology Research Institute</institution>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>State Key Laboratory of High-end Server &amp; Storage Technology Inspur (Beijing) Electronic Information Industry Co., Ltd.</institution>
          <addr-line>Beijing</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <fpage>50</fpage>
      <lpage>58</lpage>
      <abstract>
        <p>Emotion recognition from visual information is a significant research topic in the computer vision community. The currently prevalent solutions based on Artificial Neural Networks (ANNs) achieve high accuracy but at a large computational cost. Compared with ANNs, Spiking Neural Networks (SNNs) are more biologically realistic and computationally efficient. However, applying SNNs to vision-based emotion recognition remains a great challenge, mainly due to the lack of an emotional Dynamic Vision Sensor (DVS) dataset and of a properly designed SNN framework. In this paper, we present a method to generate a simulated DVS dataset, leveraging an existing emotion recognition dataset containing video segments. Meanwhile, an SNN framework and a counterpart ANN are adopted to perform dynamic vision emotion recognition on the simulated DVS dataset and the original frame data, respectively. The proposed SNN framework consists of a feature extraction module that extracts informative features from input spike-trains, a voting neurons group module containing two groups of emotional neurons, and an emotional mapping module that translates output spike-trains into emotion polarity labels. The results demonstrate that, compared with the ANN, the proposed SNN achieves better performance while consuming only one-quarter of the ANN's energy.</p>
      </abstract>
      <kwd-group>
        <kwd>spiking neural network</kwd>
        <kwd>dynamic vision sensor</kwd>
        <kwd>emotion recognition</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Emotion recognition, a hot research topic in the affective computing community, has drawn the attention of researchers from domains such as computer vision [1, 2], natural language processing [3], speech processing [4, 5], and human-computer interaction [6, 7]. At present, most methods adopt Artificial Neural Networks (ANNs) to perform emotion recognition, which yields state-of-the-art solutions. An efficient emotion recognition method would facilitate communication between people in wearable scenarios [8]. However, the high energy consumption of ANNs hinders the application of emotion recognition on embedded and mobile devices. Although knowledge distillation [9] and neural architecture search [10] can produce ANN architectures with fewer parameters, reducing energy consumption and making them more suitable for mobile devices, they do not change the essence of ANNs.</p>
      <p>As the third generation of neural networks, the Spiking Neural Network (SNN) [11], with its low power consumption, is one potential path toward embedded and mobile emotion recognition algorithms. Several studies have applied SNNs to emotion recognition tasks, extracting emotion information from speech, cross-modal signals, or electroencephalogram (EEG) data [12, 13, 14, 15, 16]. Feature extraction in most of these methods involves pre-processing operations such as audio feature extraction (e.g., Mel cepstrum coefficients). To perform the recognition itself, a shallow SNN, typically three layers in most existing methods, is adopted as the classifier. Based on these techniques, previous methods have achieved encouraging performance on the relevant datasets. Nevertheless, it remains challenging to extract emotionally representative information from video segments using SNNs. The first challenge is collecting an emotion recognition dataset with a dynamic vision sensor, which is expensive. To mitigate this cost, we propose a simulation method that generates spike-like data, inspired by the frame-difference encoding in [17]. Note that, to better simulate the mechanism by which the human ocular nerve receives information, a float-valued frame is adopted in the simulation method, which is a novel scheme in the spike-encoding domain [18, 19]. On the other hand, the structures in existing SNN-based emotion recognition methods are simple, and the spiking neuron model used in most previous literature is the Leaky Integrate-and-Fire (LIF) model. To take full advantage of the abstract structures existing in ANNs, a framework is designed that leverages the latest progress in SNNs proposed in [20], where a new spiking neuron model termed the Parametric Leaky Integrate-and-Fire (PLIF) neuron model is introduced.</p>
      <p>In this paper, we propose a scheme that combines the advantage of the short-term high-performance results obtained with ordinary cameras and the low energy consumption of dynamic vision sensors. Thus, the simulated data contains both float-valued data for the first capture of the scene and spike-train data for the remaining observation period. Experiments are designed to demonstrate the effectiveness of the proposed scheme.</p>
      <p>Our contributions are summarized as follows:
1) To the best of our knowledge, this is the first attempt to apply SNNs to emotion recognition based on simulated dynamic vision sensor data. As SNNs have higher biological plausibility than ANNs, the combination of SNNs and the dynamic vision sensor may help capture the emotions expressed by humans.
2) We propose a method to generate simulated dynamic vision sensor data. Note that the generated data is not pure spikes: considering the real application scene, the first frame is represented by float values and the following frames consist of pure spikes.
3) The Parametric Leaky Integrate-and-Fire (PLIF) neuron is adopted to construct the SNN in this paper. We evaluate the SNN on the simulated dynamic vision sensor data, where it achieves better performance than its counterpart ANN.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 DVS Simulation Algorithm</title>
      <p>The simulation algorithm is explained in detail in this section. First, the concept of the DVS is introduced briefly. Then, the data format of the DVS is clarified. Finally, the simulation algorithm is presented, which generates DVS-format data from video segments.</p>
      <p>Focusing on dynamic information, a DVS records the dynamic changes of the scene under perception. Unlike traditional cameras, which record the whole scene pixel by pixel with a float number representing light intensity, a DVS captures only the changes of light in the scene; the recorded contents are either 0 or 1, indicating whether the intensity at a location in the scene has changed. Directly collecting emotion recognition data with a DVS is not trivial, as the sensor is expensive. An alternative is to generate simulated dynamic vision emotion recognition data based on the DVS data format and an existing vision emotion dataset.</p>
      <p>The data generated by a DVS is called neuromorphic data and is represented by events $E(x_i, y_i, t_i, p_i)$ $(i = 0, 1, ..., N-1)$, where $(x_i, y_i)$ is the location at which the event happened, $t_i$ is the time when the event occurred, and $p_i$ is the polarity of the event.</p>
      <p>Emotion recognition based on video segments provides a series of frames recording the changes in a scene, and such data is publicly available [21]. To simulate a DVS's output from these frames, each RGB frame is first converted to grayscale to represent its intensity. Then the difference between adjacent frames is used to generate the polarity. A hyperparameter named sensitivity represents the degree of intensity change required to trigger an event. Finally, the spikes are ordered by the frame sequence of the original video to form the simulation output. The final representation simulates a DVS's output based on video segments and is compatible with the data format recorded by a real DVS. Note that we account for a real phenomenon: we first perceive a scene as a whole and then attend to its changes. Inspired by this, the first frame is kept as float-valued gray intensities. In other words, the output of the algorithm includes two parts: a first frame with float values representing the ordinary camera, and subsequent frames of spike values representing the dynamic vision sensor.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Neuron Model</title>
      <p>The principle of the SNN is to mimic brain cells at the micro-physiological scale, so the neuron model in an SNN differs from that in traditional ANNs. Generally, the basic neuron model in ANNs is the McCulloch-Pitts model, while the popular neuron model in SNNs is the Leaky Integrate-and-Fire (LIF) model. The information transmitted in a neuron cell is not just the summation of all the inputs coming from other neurons through synapses. Rather, as time goes on, the input accumulates in the cell membrane and raises the membrane potential. Once the membrane potential exceeds a certain threshold, a spike is generated and the potential is set to a reset value. The LIF model [20] captures the temporal information transmitted in an SNN and can be defined as:</p>
      <p>
        $\tau \frac{dV(t)}{dt} = -(V(t) - V_{reset}) + X(t)$, (1)
where $V(t)$ is the cell membrane potential at time $t$, $X(t)$ denotes the input at time $t$, $\tau$ is the membrane time constant, and $V_{reset}$ is the reset value after a spike is generated. The threshold potential is denoted by $V_{th}$, and the generation of a spike at time $t$ can be formulated as:
      </p>
      <p>
        $S(t) = \begin{cases} 1, &amp; V(t) \ge V_{th} \\ 0, &amp; V(t) &lt; V_{th} \end{cases}$, (2)
where 1 denotes a spike and 0 means no operation. Generally, the $\tau$ in Eq. (1) is a constant. Based on the case analysis in [20], the Parametric Leaky Integrate-and-Fire (PLIF) spiking neuron model was proposed to make $\tau$ adjustable during the training phase. To incorporate the expressiveness of this novel neuron type, PLIF is adopted herein as the fundamental unit of the SNN framework for dynamic vision emotion recognition. Following the strategy in [20], the surrogate gradient method is used to make backpropagation-based learning work.
      </p>
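      <p>Eqs. (1)-(2) translate into a simple discrete-time update. The sketch below uses a forward-Euler step with a hard reset, in the style of SpikingJelly's LIF neuron; under PLIF the fixed tau would become a learnable parameter. The function name and default values are illustrative assumptions.</p>

```python
import numpy as np

def lif_step(v, x, tau=2.0, v_reset=0.0, v_th=1.0):
    """One discrete-time LIF update following Eqs. (1)-(2).

    Forward-Euler step with unit time step and a hard reset:
    v <- v + (x - (v - v_reset)) / tau, then spike where v >= v_th.
    Here tau is a fixed constant; PLIF makes it a learnable parameter.
    """
    v = v + (x - (v - v_reset)) / tau        # membrane charging, Eq. (1)
    spike = (v >= v_th).astype(np.float32)   # threshold crossing, Eq. (2)
    v = np.where(spike == 1.0, v_reset, v)   # reset fired neurons
    return v, spike
```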
    </sec>
    <sec id="sec-5">
      <title>2.3 SNN Framework</title>
      <p>Neuron cells in the brain are connected by synapses. High biological plausibility requires the connections of neurons in an SNN to mimic the true structure of the brain. However, biologists are still studying the brain's connectivity, and some regional structures have already been used to build neural networks: the convolutional layer's design is an imitation of the GCs in the human brain, while the pooling layer achieves an invariance effect similar to that of the CCs in V1 and V4 [22]. Therefore, convolutional and pooling layers are adopted to build the SNN framework for dynamic vision emotion recognition.</p>
      <p>The aim of dynamic vision emotion recognition is to produce emotional results given a series of data recording a specific scenario. The proposed SNN framework consists of three parts: a feature extraction module that extracts informative features from input spike-trains, a voting neurons group module that produces spike-trains for the different emotion neuron groups, and an emotional mapping module that converts the spike-trains into final emotion polarity results. An overview of the framework, including the dataset simulation process, is illustrated in Fig. 1. The DVS simulation algorithm has been introduced above; the other modules are presented below.</p>
      <p>1) Feature Extraction Module: The Feature Extraction Module (FEM) extracts informative features from input spike-trains. The original video frames are encoded into spike-trains. Unlike real data recorded by a DVS, whose temporal resolution is high, the temporal resolution of spike-trains generated from video frames is determined by the temporal resolution of the videos. This way of organizing the data avoids the step of converting an asynchronous event stream into frames. Moreover, as mentioned before, the first frame is kept float-valued, which better resembles the real application scenario. The components of the FEM used herein are inspired by [20], adopting convolution, batch normalization, max-pooling, and PLIF neurons. Unlike [20], max-pooling is replaced by average pooling (AvgPool) based on our experimental results, which prove the superiority of average pooling on our dataset. This may be due to the tremendous information loss that max-pooling causes in spike-trains, as stated in [23].</p>
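      <p>As a rough illustration of the FEM layout, the helper below tracks the feature-map size through a stack of Conv3x3(pad=1) -&gt; BatchNorm -&gt; PLIF -&gt; AvgPool blocks: the convolution preserves spatial size and each pooling halves it. The channel widths and block count are illustrative assumptions, not the paper's exact configuration.</p>

```python
def fem_output_shape(in_hw=128, channels=(2, 64, 128, 128), pool=2):
    """Track the feature-map shape through a stack of FEM blocks.

    Each block is Conv3x3(pad=1) -> BatchNorm -> PLIF -> AvgPool(pool),
    so convolution keeps the spatial size and pooling divides it by `pool`.
    The channel widths and block count here are illustrative assumptions.
    """
    hw = in_hw
    shapes = []
    for c_out in channels[1:]:
        hw = hw // pool                 # AvgPool halves the spatial size
        shapes.append((c_out, hw, hw))  # (channels, height, width)
    return shapes
```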
      <p>2) Voting Neurons Group Module: To obtain the final emotion results, neurons are needed to represent the corresponding emotion labels. Suppose there are two emotion labels: positive and negative. In traditional ANNs, two different neurons are generally used directly to represent the two emotions. In the human brain, however, information is often transmitted by a group of neurons. We therefore apply two neuron populations of 10 neurons each to represent the final output: 10 positive voting neurons and 10 negative voting neurons. The informative features from the FEM are divided between the two groups of voting neurons. This module has no additional parameters; it merely reorganizes the spike-trains from another perspective. Thus, the final output of the voting module is the spike-trains of 20 neurons, where the length of each train equals the number of simulating steps.</p>
      <p>3) Emotional Mapping Module: The Emotional Mapping Module serves as a classifier that conducts the final emotion recognition, specifically translating the output spike-trains into emotion labels. In an SNN, the fire rates of neurons over the $N$ simulation steps are used to represent their contribution to the corresponding target. One possible approach is to directly define a desired spike-train for each emotion label. However, such definitions are intricate and tedious [24], and techniques for measuring the distance between two spike-trains are not as mature as those for two float-valued vectors. Thus, the average fire rate of the 10 positive voting neurons is treated as the positive output of emotion recognition, and likewise for the negative voting neurons. If the average fire rate of the positive voting neurons is larger than that of the negative voting neurons, the input scene is classified as positive, and vice versa. Observed in the spike-trains, a higher fire rate means a denser distribution of spikes over the simulation period.</p>
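      <p>The voting and mapping steps amount to averaging fire rates over two neuron groups. A minimal sketch, assuming the output spike-trains are arranged as a (T, 20) array with the positive group in columns 0-9 (the index convention is our assumption):</p>

```python
import numpy as np

def decode_emotion(spike_trains):
    """Translate output spike-trains into an emotion polarity label.

    `spike_trains` is a (T, 20) 0/1 array over T simulating steps; columns
    0-9 are assumed to be the positive voting group and 10-19 the negative
    one. Each group's average fire rate over the simulation decides the
    polarity, as described above.
    """
    rates = spike_trains.mean(axis=0)    # per-neuron fire rate over T steps
    pos_rate = rates[:10].mean()         # positive voting group
    neg_rate = rates[10:].mean()         # negative voting group
    label = "positive" if pos_rate > neg_rate else "negative"
    return label, pos_rate, neg_rate
```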
      <p>The average fire rate makes the measurement of the output simple, and the Mean Squared Error (MSE) between the average fire rates and the ground-truth emotion labels is used as the loss. The surrogate gradient method is applied to update the parameters of the framework via backpropagation.</p>
    </sec>
    <sec id="sec-6">
      <title>3. Experiments</title>
      <p>In this section, we first introduce a popular dataset for emotion recognition based on video segments and present the DVS simulation based on it. The experimental settings of the SNN are then detailed. Finally, the experimental results are analyzed to verify the effectiveness of the SNN on the simulated dataset.</p>
    </sec>
    <sec id="sec-7">
      <title>3.1 Dataset</title>
      <p>To conduct experiments with the proposed model, a simulation dataset must first be constructed. Herein, the Carnegie Mellon University Multimodal Opinion Sentiment Intensity (CMU-MOSI) dataset [21], widely used and recognized by the community, is chosen as the basis for the DVS simulation algorithm. The original labels are mapped to two categories (positive and negative). Following the procedure described above, a simulated DVS dataset can be generated. Note that the hyperparameter $Sens$ influences the simulation results: a larger $Sens$ makes the generated spike-trains sparser, while a smaller $Sens$ generates more spikes. Generally speaking, dense spike-trains can achieve better accuracy, while sparse spike-trains can potentially achieve lower energy consumption. $Sens$ is set to 0.001 to trade off accuracy against energy consumption.</p>
    </sec>
    <sec id="sec-8">
      <title>3.2 Experimental Setting</title>
      <p>TABLE I Compared methods and their explanations.
ANN: counterpart of the baseline SNN.
baseline SNN: the baseline structure with the LIF neuron.
PLIF SNN: the baseline with its cell type changed to PLIF.
0.8SNN: the PLIF SNN with a changed voltage threshold.
0.4SNN: the PLIF SNN with a changed voltage threshold.
mpSNN: 0.4SNN with pooling changed to max-pooling.
pureSNN: 0.4SNN with the first float-valued frame omitted.</p>
      <p>Experiments are implemented with SpikingJelly [25], a framework for SNNs. The code runs on a Linux system with four Tesla V100 graphics cards. The synapse weights are initialized with PyTorch's default method using a fixed random seed [26]. The parameters are optimized with the surrogate-gradient-based stochastic gradient descent optimizer implemented in SpikingJelly, with a learning rate of 0.01. The batch size is set to 8 for all experiments. To find an appropriate time length for the input signal, we counted the frame numbers of the video segments and found that a large number of samples have around 68 frames; we therefore set the time length to 68. The input to the neural networks is rescaled to 128×128. To make the comparison as fair as possible, an ANN basically parallel to the SNN structure is constructed, called the counterpart ANN, where each PLIF neuron is replaced by ReLU. The classifier of the ANN is a fully connected layer. We evaluate the performance of the SNN and its counterpart ANN on the simulation dataset.</p>
    </sec>
    <sec id="sec-9">
      <title>3.3 Experimental Results</title>
      <p>The performance on the testing set is summarized in Table I. The first column lists the compared methods or hyperparameter settings, the second column explains each algorithm, and the last column reports the corresponding accuracy. Compared with PLIF-Fang from [20], the counterpart ANN of the baseline SNN gives a performance increase of 3.48%. The performance of PLIF-Fang and the baseline SNN is the same, which can be explained by the influence of the different network structures: although the baseline SNN uses LIF, whose performance is theoretically worse than that of PLIF-Fang, the pooling methods differ, with max-pooling used in PLIF-Fang and average pooling in the baseline SNN.</p>
      <p>The performance difference between the two pooling methods on the simulated dataset can be seen by comparing 0.4SNN and mpSNN in Table I. Max-pooling damages the performance of the model, which may be due to the loss of local information compared with average pooling. Comparing the baseline SNN to the PLIF SNN, networks constructed with the PLIF neuron model perform better (a 2.24% accuracy improvement), consistent with the conclusion in [20]. However, our experiments suggest a different setting for the voltage threshold on our dataset. The default voltage threshold is 1, and adjusting it was considered unnecessary in [20]. In practice, however, we find that a relatively lower voltage threshold yields a performance gain: 0.4SNN achieves better emotion recognition performance than its counterpart ANN. We argue that a smaller voltage threshold of the membrane potential fires the neurons earlier, which effectively provides more training and leads to better performance. Finally, a pure spike-train input setting is evaluated to validate the effectiveness of the float-valued first frame: compared with 0.4SNN, pureSNN loses some performance.</p>
      <p>To examine how training influences the SNN's output at every simulating step (68 steps in total), further experiments are conducted using the model with pure spike-train input. The accuracy and loss curves on the testing set are shown in Fig. 2; in each subfigure, the accuracy curve is shown above and the loss curve below. Fig. 2(a) shows that before training the SNN gives the same output at all simulating steps; this is due to the random initialization of the parameters, with no spike fired in the classifier layer. During the early phase of training, changes can be observed in Fig. 2(b), reflecting the weight updates. After training, the loss in Fig. 2(c) first decreases as the simulating step grows, but then increases slightly. A possible explanation is that the later spikes push the model's output away from the ground truth. The fluctuation of the accuracy curve also illustrates the influence of the spikes that follow the simulating step achieving the lowest loss. To illustrate the benefit of using float values in the first frame, the loss and accuracy curves during training are visualized in Fig. 2(d): compared with the pure spike-train input setting, using a float-valued first frame leads to a faster loss drop at the beginning of training and makes the accuracy exceed 60% earlier.</p>
      <p>TABLE II The number of operations and energy consumption of the ANN and the SNN on the generated dataset. # denotes the number of operations; M means million.
ANN: #First layer = 310.9M, #Other layers = 47.9M, 4.6pJ per operation, total consumption 1.65mJ, total consumption on the dataset 1.65mJ.
SNN: #First layer = 9.1M, #Other layers = 47.9M, 0.9pJ per operation, total consumption 1.57mJ (N=30), total consumption on the dataset 0.41mJ.</p>
      <p>
To vividly show the membrane potentials and spike patterns, two examples are given in Fig. 3. Recall that ten neurons correspond to each of the positive and negative emotions and that the output is defined by the average firing rate. For the membrane potentials in Fig. 3(a), for every neuron on the y-axis, the yellow rectangular bars represent high potentials, from which a spike is relatively easy to fire. The output spike-trains corresponding to these membrane potentials are shown in Fig. 3(b). Neuron indexes 0 to 9 represent the negative emotion output and neuron indexes 10 to 19 represent the positive output. The spikes present two different patterns. For the first example, with a positive emotion label, shown in Fig. 3(b), the output spike-trains of the last 10 neurons are visually denser than those of the first 10 neurons. Numerically, the average spiking rate of the first ten neurons is 0.4925, smaller than that of the last ten neurons, 0.5075; thus, the predicted emotion is positive. For the second example, shown in Fig. 3(d), the visual analysis is similar to the previous example. Numerically, the average spiking rate of the first ten neurons is 0.5060, larger than that of the last ten neurons, 0.4955; thus, the predicted emotion is negative, although the ground truth is positive.</p>
      <p>To demonstrate the efficiency of the SNN, the energy consumption of the SNN and the counterpart ANN is analyzed theoretically. Note that the operations in an SNN are mainly ACcumulations (ACC), while operations in an ANN are Multiply-ACcumulations (MAC) [27]. It has been shown that a 32-bit floating-point MAC operation consumes 4.6 pJ while an ACC operation consumes 0.9 pJ on a 45nm 0.9V chip [28]. The relevant figures are reported in Table II. As the structures of the SNN and the counterpart ANN are mostly the same, the differences in the numbers of MACs and ACCs are caused by the input. The input frame size is 128×128 and the filter kernel size is 3×3. For the ANN with 68-channel input, the number of MAC operations is 310.9M. For the SNN with 2-channel input (positive and negative events), the number of ACC operations is 9.1M. Note that under the SNN with a float-valued first frame, the operations of the first layer change from ACC to MAC for that frame. The computation of the remaining, identical structure is 47.9M operations. The total consumption of the ANN is 4.6×(47.9+310.9) = 1.65mJ. For the SNN, the first round is computed separately and the total consumption is (9.1×4.6) + 9.1×(N-1)×0.9 + 47.9×N×0.9 = (51.3×N+33.67)×10^-3 mJ. Consequently, the consumption of the SNN is smaller than that of the counterpart ANN when N &lt; 32. As Fig. 2 shows, the performance is almost stable when the simulating step reaches around 30. A critical point is that when the SNN is implemented on neuromorphic hardware [29, 30], computation occurs exclusively when there is a spike. Counting the proportion of spikes relative to the total frame data in the dataset, only 10.79% of positions carry spike events; in other words, the potential efficiency advantage of the SNN is even larger than stated above. Thus, the total consumption of the SNN with 68 simulating steps is about 0.41mJ. Compared with the ANN's energy consumption, the SNN's energy consumption is reduced by three quarters.</p>
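      <p>The energy accounting above can be reproduced with a few lines. The constants follow the text (9.1M first-layer operations, 47.9M remaining operations, 4.6 pJ per MAC, 0.9 pJ per ACC); the function name is ours.</p>

```python
def snn_energy_mj(n_steps, first_layer_ops=9.1e6, other_ops=47.9e6,
                  mac_pj=4.6, acc_pj=0.9):
    """Theoretical SNN energy over N simulating steps, in mJ.

    Follows the accounting in the text: the float-valued first frame makes
    the first layer's first step use MAC operations; every remaining
    operation is an ACC. Constants come from the text above.
    """
    first = first_layer_ops * mac_pj                  # first frame, MACs
    rest = first_layer_ops * (n_steps - 1) * acc_pj   # later spike frames
    shared = other_ops * n_steps * acc_pj             # remaining layers
    return (first + rest + shared) * 1e-9             # pJ -> mJ
```

Evaluating it confirms the numbers in the text: about 1.57 mJ at N = 30, and a crossover with the ANN's 1.65 mJ between N = 31 and N = 32.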
    </sec>
    <sec id="sec-10">
      <title>4. Conclusion</title>
      <p>In this paper, we have proposed a simulation method to generate DVS-like data from video segments, together with an SNN framework that considers the real application scene to complete emotion recognition. Inspired by the float-valued input of ANNs, the first frame of the SNN's input is changed from spikes to float values. The proposed SNN framework uses a feature extraction module to extract informative spike patterns from the simulated input spike-trains, and employs a voting neurons group module and an emotional mapping module to convert the output spike-trains into the final emotion labels. In addition, on our dataset, the theoretical energy consumption of the SNN is only a quarter of that of the ANN. An interesting future direction is to further explore other potential network topologies for SNNs.</p>
    </sec>
    <sec id="sec-11">
      <title>5. Acknowledgements</title>
      <p>
        This work was supported by the Natural Science Foundation of Shandong Province (No. ZR2021QF145).
6. References
[4] K. Maher, Z. Huang, J. Song, X. Deng, Y. Lai, C. Ma, H. Wang, Y. Liu, and H. Wang. “E-ffective:
A visual analytic system for exploring the emotion and effectiveness of inspirational speeches,”
IEEE Transactions on Visualization and Computer Graphics, 28(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ):508–517, 2021.
[5] L. Sun, B. Liu, J. Tao, and Z. Lian. “Multimodal cross-and self-attention network for speech
emotion recognition,” In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP), pages 4275–4279. IEEE, 2021.
[6] B. Wang, G. Dong, Y. Zhao, R. Li, Q. Cao, and Y. Chao. “Non-uniform attention network for
multi-modal sentiment analysis,” In International Conference on Multimedia Modeling, pages
612–623. Springer, 2022.
[7] Y. Zhang, G. Zhao, Y. Shu, Y. Ge, D. Zhang, Y. Liu, and X. Sun. “Cped: A chinese positive
emotion database for emotion elicitation and analysis,” IEEE Transactions on Affective
Computing, 2021.
[8] J. Chen, D. Jiang, Y. Zhang, and P. Zhang. “Emotion recognition from spatiotemporal eeg
representations with hybrid convolutional recurrent neural networks via wearable multi-channel
headset,” Computer Communications, 154:58–65, 2020.
[9] L. Wang and K. Yoon. “Knowledge distillation and student-teacher learning for visual intelligence:
A review and new outlooks,” IEEE Transactions on Pattern Analysis and Machine Intelligence,
2021.
[10] T. Elsken, J. Metzen, and F. Hutter. “Neural architecture search: A survey,” The Journal of Machine
      </p>
      <p>
        Learning Research, 20(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ):1997–2017, 2019.
[11] W. Maass. “Networks of spiking neurons: the third generation of neural network models,” Neural
networks, 10(9):1659–1671, 1997.
[12] E. Benssassi and J. Ye. “Investigating multisensory integration in emotion recognition through
bioinspired computational models,” IEEE Transactions on Affective Computing, 2021.
[13] C. Buscicchio, P. Górecki, and L. Caponetti. “Speech emotion recognition using spiking neural
networks,” In International Symposium on Methodologies for Intelligent Systems, pages 38–46.
      </p>
      <p>
        Springer, 2006.
[14] R. Lotfidereshgi and P. Gournay. “Biologically inspired speech emotion recognition,” In 2017
IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 5135–
5139. IEEE, 2017.
[15] Y. Luo, Q. Fu, J. Xie, Y. Qin, G. Wu, J. Liu, F. Jiang, Y. Cao, and X. Ding. “Eeg-based emotion
classification using spiking neural networks,” IEEE Access, 8:46007–46016, 2020.
[16] E. Benssassi and J. Ye. “Speech emotion recognition with early visual cross-modal enhancement
using spiking neural networks,” In 2019 International Joint Conference on Neural Networks
(IJCNN), pages 1–8. IEEE, 2019.
[17] C. Posch, D. Matolin, and R. Wohlgenannt. “A qvga 143 db dynamic range frame-free pwm image
sensor with lossless pixel-level video compression and time-domain cds,” IEEE Journal of
Solid-State Circuits, 46(
        <xref ref-type="bibr" rid="ref1">1</xref>
        ):259–275, 2010.
[18] P. Diehl and M. Cook. “Unsupervised learning of digit recognition using spike-timing-dependent
plasticity,” Frontiers in Computational Neuroscience, 9:99, 2015.
[19] J. Wu, Y. Chua, and H. Li. “A biologically plausible speech recognition framework based on
spiking neural networks,” In 2018 International Joint Conference on Neural Networks (IJCNN),
2018.
[20] W. Fang, Z. Yu, Y. Chen, T. Masquelier, T. Huang, and Y. Tian. “Incorporating learnable
membrane time constant to enhance learning of spiking neural networks,” In Proceedings of the
IEEE/CVF International Conference on Computer Vision, pages 2661–2671, 2021.
[21] A. Zadeh, R. Zellers, E. Pincus, and L. Morency. “MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos,” arXiv preprint arXiv:1606.06259, 2016.
[22] Q. Xu, Y. Qi, H. Yu, J. Shen, H. Tang, G. Pan, et al. “CSNN: An augmented spiking based framework with perceptron-inception,” In IJCAI, pages 1646–1652, 2018.
[23] X. Cheng, Y. Hao, J. Xu, and B. Xu. “LISNN: Improving spiking neural networks with lateral interactions for robust object recognition,” In IJCAI, pages 1519–1525, 2020.
[24] Q. Liu, G. Pan, H. Ruan, D. Xing, and H. Tang. “Unsupervised AER object recognition based on multiscale spatio-temporal features and spiking neurons,” IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–12, 2020.
[25] W. Fang, Y. Chen, J. Ding, D. Chen, Z. Yu, H. Zhou, Y. Tian, and other contributors. “SpikingJelly,” https://github.com/fangwei123456/spikingjelly, 2020.
[26] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. “PyTorch: An imperative style, high-performance deep learning library,” Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
[27] B. Chakraborty, X. She, and S. Mukhopadhyay. “A fully spiking hybrid neural network for energy-efficient object detection,” arXiv preprint arXiv:2104.10719, 2021.
[28] M. Horowitz. “1.1 computing’s energy problem (and what we can do about it),” In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14. IEEE, 2014.
[29] D. Ma, J. Shen, Z. Gu, M. Zhang, X. Zhu, X. Xu, Q. Xu, Y. Shen, and G. Pan. “Darwin: A
neuromorphic hardware co-processor based on spiking neural networks,” Journal of Systems
Architecture, 77:43–51, 2017.
[30] J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, et al. “Towards artificial general intelligence with hybrid Tianjic chip architecture,” Nature, 572(7767):106–111, 2019.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Halimi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McCarthy</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Buller</surname>
          </string-name>
          .
          <article-title>“Learning non-local spatial correlations to restore sparse 3D single-photon data,”</article-title>
          <source>IEEE Transactions on Image Processing</source>
          ,
          <volume>PP(99)</volume>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chen</surname>
          </string-name>
          and
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          .
          <article-title>“Multi-modal emotion recognition by fusing correlation features of speechvisual,”</article-title>
          <source>IEEE Signal Processing Letters</source>
          ,
          <volume>28</volume>
          :
          <fpage>533</fpage>
          -
          <lpage>537</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , and
          <string-name>
            <given-names>J.</given-names>
            <surname>Dang</surname>
          </string-name>
          . “
          <article-title>A sentiment similarity-oriented attention model with multi-task learning for text-based emotion recognition,”</article-title>
          <source>In International Conference on Multimedia Modeling</source>
          , pages
          <fpage>278</fpage>
          -
          <lpage>289</lpage>
          . Springer,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>