Inner speech recognition through electroencephalographic signals

Francesca Gasparini1,2,*,†, Elisa Cazzaniga1,† and Aurora Saibene1,2,*,†

1 University of Milano-Bicocca, Viale Sarca 336, 20126, Milano, Italy
2 NeuroMI, Milan Center for Neuroscience, Piazza dell'Ateneo Nuovo 1, 20126, Milano, Italy

Abstract
This work focuses on inner speech recognition starting from electroencephalographic (EEG) signals. Inner speech recognition is defined as the internalised process in which a person thinks in pure meanings, generally associated with an auditory imagery of one's own inner "voice". The decoding of the EEG into text should be understood as the classification of a limited number of words (commands) or of the presence of phonemes (units of sound that make up words). Speech-related brain-computer interfaces provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals, improving the quality of life of people who have lost the capability to speak by restoring communication with their environment. Two public inner speech datasets are analysed. Using these data, several classification models are studied and implemented, starting from basic methods such as Support Vector Machines, moving to ensemble methods such as the eXtreme Gradient Boosting classifier, up to neural networks such as Long Short Term Memory (LSTM) and Bidirectional Long Short Term Memory (BiLSTM). With the LSTM and BiLSTM models, generally not used in the inner speech recognition literature, results in line with or superior to those in the state of the art are obtained.

Keywords
EEG, inner speech recognition, BCI

1. Introduction

Human speech production is a complex motor process that starts in the brain and ends with respiratory, laryngeal, and articulatory gestures creating the acoustic signals of verbal communication. Physiological measurements using specialised sensors and methods can be made at each level of speech processing, including the central and peripheral nervous systems, muscular action potentials, speech kinematics (tongue, lips, jaw), and sound pressure [1]. However, there are cases of subjects suffering from neurodegenerative diseases or motor disorders that prevent the normal transmission of signals from the brain to the peripheral areas. These subjects are prevented from communicating or carrying out certain actions.

Italian Workshop on Artificial Intelligence for Human Machine Interaction (AIxHMI 2022), December 02, 2022, Udine, Italy
* Corresponding author.
† These authors contributed equally.
francesca.gasparini@unimib.it (F. Gasparini); e.cazzaniga@campus.unimib.it (E. Cazzaniga); aurora.saibene@unimib.it (A. Saibene)
ORCID: 0000-0002-6279-6660 (F. Gasparini); 0000-0002-4405-8234 (A. Saibene)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Brain Computer Interfaces (BCIs) are promising technologies for improving the quality of life of people who have lost the capability to move or speak, by restoring communication with their environment. A BCI is a system that makes the interaction between an individual and a computer possible without using the brain's normal output pathways of peripheral nerves and muscles.
In particular, speech-related BCI technologies provide neuro-prosthetic help for people with speaking disabilities, neuro-muscular disorders and diseases. They can equip these users with a medium to communicate and express their thoughts, thereby improving the quality of rehabilitation and clinical neurology [2]. Speech-related paradigms, based on either silent, imagined or inner speech, provide a more natural way of controlling external devices [3].

There are different types of brain-signal recording techniques, mainly divided into invasive and non-invasive methods. The former involve implanting electrodes directly into the brain. They provide better spatial and temporal resolution, also increasing the quality of the obtained signal. However, invasive technologies have problems related to usability and the need for surgical intervention on the subject. This is why non-invasive techniques are increasingly used in BCI research. Among the non-invasive technologies, the electroencephalogram (EEG) is the most used method for measuring the electrical activity of the brain from the human scalp. It has an exceedingly high time resolution, it is simple to record and it is sufficiently inexpensive [4]. Over the years, EEG hardware technology has evolved and several wireless multichannel systems have emerged that deliver high quality EEG and physiological signals in a simpler, more convenient and comfortable design than the traditional, cumbersome systems.

Therefore, this paper focuses on inner speech recognition starting from EEG signals, where inner speech is defined as "the subjective experience of language in the absence of overt and audible articulation" [5]. As suggested in [6], there is evidence from past neuroscience research that inner speech engages brain regions that are commonly associated with language comprehension and production [7]. These include temporal, frontal and sensorimotor areas, predominantly in the left hemisphere of the brain [7, 8]. Therefore, by monitoring these brain areas, it is theoretically possible to develop an inner speech BCI that classifies neural representations of imagined words [8].

We propose analyses of the publicly available Thinking Out Loud and Imagined Speech datasets, using a Support Vector Machine (SVM) as a traditional machine learning model and then an ensemble approach based on eXtreme Gradient Boosting (XGBoost). Finally, we design two deep learning architectures based on Long Short Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) models.

In Section 2, the studies in the field of inner speech are described. Section 3 presents the two publicly available datasets used for the analyses proposed in Section 4. In Section 5 the results obtained with our models are presented and discussed. Finally, in Section 6 some conclusions are proposed.

2. Related works

Most studies on the classification of inner speech focus on invasive methods such as electrocorticography [9], as they provide higher spatial resolution, while fewer studies concerning inner speech classification using EEG data are available [10]. However, it is important for a BCI application to be non-invasive, accessible and easy to implement, so that it can be used by a large number of subjects.

Inner speech recognition is generally addressed considering phonemes, typically vowels or syllables such as /ba/ or /ku/, or simple words such as left, right, up and down, in subject-dependent approaches. Preliminary works were conducted with very few participants and syllables by D'Zmura et al.
[11], where EEG waveform envelopes were adopted to recognise EEG patterns. Brigham and Kumar [12] and Deng et al. [13] also considered the recognition of two syllables. In the first work, the accuracy obtained for the 7 subjects ranged from 46% to 88%. The authors preprocessed the raw EEG data to reduce the effects of artifacts and noise, and applied a k-Nearest Neighbor classifier to autoregressive coefficients extracted as features. Instead, Deng et al. used Hilbert spectra and linear discriminant analysis to recognise the two syllables imagined in three different rhythms, for a 6-class task, with accuracy ranging from 19% to 22%.

Considering works where the recognition of phonemes has been investigated, DaSalla et al. [14, 15] analysed the recognition of three tasks, /a/, /u/, and rest, obtaining from 68% to 79% accuracy by using Common Spatial Patterns (CSP). On the same dataset, several other researchers have tested different models, obtaining promising results [16, 17]. Instead, Kim et al. [18] considered the three vowels /a/, /i/ and /u/ and applied multivariate empirical mode decomposition and common spatial patterns for feature extraction, together with linear discriminant analysis, reaching around 70% accuracy.

Few representative studies that try to recognise imagined words using EEG data are reported in the literature. Given the complexity of the task, the number of terms considered is generally limited. Suppes et al. [19] proposed an experiment in which five subjects performed internal speech considering the words first, second, third, yes, no, right, and left, with the addition of to, too and hear for the last three subjects. In the work performed by Wang et al. [20], eight Chinese subjects were required to read in their mind two Chinese characters (meaning left and one). The authors were able to distinguish between the two characters and the rest state. Feature vectors of the EEG signals were extracted using CSP and then classified with an SVM. Accuracies between 73.65% and 95.76% were obtained when comparing each of the imagined words with the rest state. A mean accuracy of 82.30% was achieved between the two words themselves. Salama et al. [21] implemented different types of classifiers, such as SVM, discriminant analysis, self-organising map, feed-forward back-propagation and a combination of them, to recognise two words (Yes and No). They used a single-electrode EEG device to collect data from seven subjects and the obtained accuracy ranged from 57% to 59%. In [22], Mohanchandra et al. constructed a one-against-all multiclass SVM classifier to discriminate five subvocalised words (water, help, thanks, food, and stop) and reported an accuracy ranging from 60% to 92%. In the analyses of González-Castañeda et al. [23], sonification and textification techniques were applied, which allowed the EEG signals to be characterised as either an audio signal or a text document. Five imagined Spanish words (including up, down, left and right) and 27 subjects were considered. The average accuracy rate using the EEG textified signals was 83.34%. Using the data from six subjects, [24] reported an average accuracy of 50.10% for the three short words (in, out and up) classification problem and 66.20% for the two long words (cooperate and independent) classification problem, using a multi-class relevance vector machine. In order to evaluate the effect of the sound, three phonemes were also used, namely /a/, /i/ and /u/, obtaining an accuracy of 49.0%. Coretto et al.
[25], who collected one of the two datasets considered in this paper, reported a mean recognition rate of 22.32% in classifying five Spanish vowels and 18.58% in classifying six Spanish words using a Random Forest (RF) algorithm. Using the same dataset, in [26] accuracies of 30.00% and 24.97% were obtained for vowels and words, respectively, using Convolutional Neural Networks (CNNs). Recently, van den Berg et al. [6], working on the Thinking Out Loud dataset [3], also considered in our work, reported an average accuracy of 29.70% for a four-word classification task using a 2D CNN based on the EEGNet architecture [27].

3. Datasets

The testing of the proposed strategies is performed on two publicly available datasets, i.e., the Thinking Out Loud [3] and the Imagined Speech [25] datasets. In particular, the latter dataset is used to check the validity of the best approach resulting for the Thinking Out Loud one.

3.1. Thinking Out Loud dataset

The first literature dataset chosen to conduct the subsequent analyses is the Thinking Out Loud [3] one, which is focused on an inner speech paradigm intended for the control of a BCI system through the imagination of Spanish words. The selected Spanish words are arriba (up), abajo (down), derecha (right), and izquierda (left). Notice that the words were presented randomly with a visual cue.

Ten (four females) healthy right-handed subjects with mean ± std age 34 ± 10, without any hearing or speech loss, nor any previous BCI experience, participated in the experiment, which comprised three experimental conditions, i.e., inner speech, pronounced speech and visualised condition. During the inner speech condition, the participant was asked to imagine his/her own voice repeating the corresponding word. Instead, the participant was asked to repeatedly pronounce aloud the word corresponding to each visual cue during the pronounced speech condition. Finally, during the visualised condition, the participant was asked to focus on moving a circle presented at the center of the screen. The direction of the movement was provided by a visual cue.

Each subject participated in 3 consecutive sessions (200 words/session), separated by a break. Each session consisted of a baseline recording (15 s), the pronounced speech run, two inner speech runs, and two visualised condition runs. Each run consisted of a series of trials containing the different experimental tasks. Notice that in this paper we consider only the inner speech condition, and that the number of trials for each of its classes varied from subject to subject; however, the number of trials per class was balanced within each subject. The minimum (maximum) number of trials for a pronounced speech class was 25 (30), while the minimum (maximum) number of trials for an inner speech class was 45 (60). Please refer to Table 4 of the original work by Nieto et al. [3] for further details on the dataset.

Figure 1 shows the organisation of each inner speech trial, under investigation in this paper. A white circle was shown in the center of the screen and the subject was asked to stare at it without blinking. Subsequently, a white triangle was shown pointing in one of the four directions corresponding to the chosen Spanish words. When the triangle disappeared and the white circle was presented again, the subject had to perform the indicated task. The task execution had to be stopped when the white circle turned blue. The subject was asked to control eye blinking until the circle disappeared.
Finally, to evaluate the participants' attention, the subjects were asked to indicate the last inner speech and visualised conditions after a random number of trials. The subject answered using the keyboard arrows and feedback was displayed.

Figure 1: Trial workflow, reported following the Thinking Out Loud dataset original paper [3].

The data acquisition was performed using 128 active wet EEG electrodes and 8 external active wet EOG/EMG electrodes, with a 24-bit resolution and a 1024 Hz sampling rate. The EEG signals of the Thinking Out Loud dataset were preprocessed by its authors. The preprocessing included a band-pass filter between 0.5 and 100 Hz, a notch filter at 50 Hz and a downsampling to 254 Hz.

3.2. Imagined Speech dataset

The Imagined Speech dataset [25] was chosen to confirm the validity of the model obtaining the best results on the Thinking Out Loud dataset. In fact, it presents similar experimental conditions, considering Spanish words and also vowels. The vowels /a/, /e/, /i/, /o/ and /u/ were selected due to their acoustic stationarity, simplicity and lack of meaning by themselves, while the Spanish words arriba (up), abajo (down), derecha (right), izquierda (left), adelante (forward), and atras (backward) were chosen as possible BCI commands to control the movements of an external device.

Fifteen (seven females) healthy subjects with a mean age of 25 years, without any hearing or speech loss, participated in the experiment. Only one of the subjects reported being left-handed, while the rest were right-handed. EEG signals were recorded under two conditions: imagined speech and pronounced speech. During imagined speech, the subjects had to imagine pronouncing the word without moving muscles or producing sounds.

Target stimuli were presented in a sequence composed of four intervals of predefined duration (Figure 2). During the ready interval (2 s), the subject was informed that the rest interval had finished and that a new cue would be displayed soon. Afterwards, the target word was presented, both visually and acoustically, during the stimulus presentation interval (2 s). In the Imagine/Pronounce stage, an image represented the requested task (either imagined or pronounced speech). In this stage the subject had to imagine the pronunciation of, or pronounce, the word given as a cue. If the word was a vowel, the subject had to perform the task during the complete 4 s of this interval, while if the word was a command, a sequence of three audible clicks indicated when to imagine or pronounce the target word. Finally, during the rest interval (4 s) the subject was allowed to move, swallow or blink.

Also for the Imagined Speech dataset, the number of trials for each class and condition varied from subject to subject; however, the number of trials per class was sufficiently balanced within each subject, with variations of at most 4 trials. The minimum (maximum) number of trials for a pronounced speech class was 7 (14), while the minimum (maximum) number of trials for an inner speech class was 39 (51). Please refer to Table 4 of the original work by Coretto et al. [25] for further details on the dataset.

Figure 2: Sequence time course for the presentation of one stimulus, in this particular case for the word adelante and under the imagined speech condition. Graphic inspired by the original dataset paper [25].

EEG signals were recorded using Ag-AgCl cup electrodes, attached to the scalp according to the 10-20 international system and with conductive paste. No electrode cap was used. F3, F4, C3, C4, P3, and P4 were chosen as active electrodes, while the reference and ground electrodes were placed on the left and right mastoids. The EEG signals were acquired with a 1024 Hz sampling rate and a 16-bit resolution. The EEG signals of the Imagined Speech dataset were preprocessed by its authors. The preprocessing included a band-pass filter between 2 and 40 Hz.
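To make the trial structures described above concrete, the following minimal sketch segments a continuous multichannel recording into the fixed-length action-interval epochs that feed the classifiers of Section 4. The sampling rate, cue onsets and cue-to-action offset are illustrative assumptions, and the snippet does not reproduce the loading utilities distributed with either dataset.

import numpy as np

def extract_epochs(eeg, cue_onsets_s, fs=254, offset_s=0.5, dur_s=2.5):
    # eeg: (n_channels, n_samples) continuous recording
    # cue_onsets_s: cue onset times in seconds (hypothetical marker stream)
    # offset_s: assumed delay between the cue and the start of the action interval
    # dur_s: length of the action interval used in Section 4.1 (2.5 s)
    n = int(round(dur_s * fs))
    epochs = []
    for onset in cue_onsets_s:
        start = int(round((onset + offset_s) * fs))
        if start + n <= eeg.shape[1]:
            epochs.append(eeg[:, start:start + n])
    return np.stack(epochs)  # (n_trials, n_channels, n_samples)

# toy usage: random data standing in for one subject's 128-channel recording at 254 Hz
rng = np.random.default_rng(0)
eeg = rng.standard_normal((128, 254 * 600))  # 10 minutes of signal
cues = np.arange(5.0, 590.0, 8.0)            # hypothetical cue times
epochs = extract_epochs(eeg, cues)
print(epochs.shape)                          # (74, 128, 635)

In practice, the per-trial windows follow the timings reported in Figures 1 and 2 and the intervals defined in Section 4.1.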
4. Proposed approaches

Several classification models are studied and implemented, starting from basic methods such as SVM, moving to ensemble methods such as the XGBoost classifier, up to neural networks such as LSTM and BiLSTM.

4.1. Machine Learning approaches

Since inner speech recognition is a very complicated task, a simpler preliminary analysis was performed on the Thinking Out Loud dataset. Therefore, a binary classification between the resting state and the action interval was performed. The first 1.5 s of the rest interval and the 2.5 s of the action interval were considered for each trial (Figure 1). Subsequently, the multiclass classification was considered on the four words (left, right, up and down) of the same dataset. The action intervals of 2.5 s were used for each trial (Figure 1). The following ML analyses were performed for both classification tasks.

The Power Spectral Density (PSD) was used as a feature extraction technique before proceeding with the classification. The PSD was calculated using Welch's method and based on the relative power in specific frequency bands: alpha (8-13 Hz), beta (13-30 Hz) and gamma (30-100 Hz). The models were trained and tested on each subject individually, using K-fold cross validation. In this study, the data were split into four folds, resulting in test sets of 237 to 285 trials (depending on the subject) and training sets of 713 to 855 trials for the binary classification, and in test sets of 118 to 142 trials and training sets of 357 to 428 trials for the multiclass classification. The SVM and XGBoost classifiers were trained on a PSD feature matrix of dimension (n_epochs, n_channels * n_band_freqs) for each subject. Since the number of features is too high compared to the number of trials of each subject, three possible solutions were analysed:

• to apply Principal Component Analysis (PCA);
• to extract the most important features identified with the XGBoost classifier, both considering the subjects individually and making an intersection of the most important features in common to all subjects;
• to choose a subset of meaningful electrodes, since the neural correlates of inner speech processing are reported to be mainly present in the left hemisphere (see Section 1).

The analyses carried out in the binary classification showed a difference between the action interval and the rest interval. It was therefore verified whether this difference could be associated with one or more time windows, in order to identify a particularly significant area in which the inner speech activity could be encoded. The action interval was split into 0.5 s wide sliding windows with a 50% overlap. For each window of the action interval, a binary SVM model was trained, considering the whole resting state and using the features extracted with XGBoost. The idea was to identify the best window for binary classification and then use it in the multiclass classification. Again, a subject-based approach was used. The performed analyses do not justify the choice of one interval over another, and this suggests continuing to consider the entire interval in the following tests carried out with deep learning methods.
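A minimal sketch of the band-power pipeline described in this section is reported below, assuming SciPy and scikit-learn. The relative Welch power of the frequency bins falling in the alpha, beta and gamma bands is computed per channel and flattened into the (n_epochs, n_channels * n_band_freqs) feature matrix, which is then evaluated with a 4-fold cross-validated SVM; the data are random stand-ins and the SVM hyperparameters are illustrative, not the values tuned in this work.

import numpy as np
from scipy.signal import welch
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

BANDS = {"alpha": (8, 13), "beta": (13, 30), "gamma": (30, 100)}

def relative_band_power(epochs, fs=254):
    # epochs: (n_trials, n_channels, n_samples)
    # returns a (n_trials, n_channels * n_band_freqs) feature matrix, i.e. the
    # relative Welch power of every frequency bin falling in alpha, beta or gamma
    freqs, pxx = welch(epochs, fs=fs, nperseg=fs, axis=-1)
    rel = pxx / pxx.sum(axis=-1, keepdims=True)      # relative power per bin
    keep = np.zeros_like(freqs, dtype=bool)
    for lo, hi in BANDS.values():
        keep |= (freqs >= lo) & (freqs < hi)
    feats = rel[:, :, keep]
    return feats.reshape(feats.shape[0], -1)

# toy rest-vs-action example: random stand-ins for one subject's epochs and labels
rng = np.random.default_rng(0)
X = relative_band_power(rng.standard_normal((240, 128, 635)))
y = rng.integers(0, 2, size=240)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
print(cross_val_score(clf, X, y, cv=4).mean())       # mean 4-fold accuracy

The same call is simply repeated on each subject's epochs, in line with the subject-dependent evaluation adopted here, and the XGBoost classifier can be plugged into the same pipeline in place of the SVM.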
4.2. Deep Learning approaches

The DL models were trained and tested for the multiclass classification task on each subject individually, using nested cross-validation. Three different types of analysis were performed in order to obtain the input data used to train the models:

• the PSD was calculated using Welch's method and the most important features were extracted using the XGBoost feature importance vector;
• the raw data considering all the channels were used;
• the raw data considering only the channels associated with the most important features extracted with XGBoost were used.

LSTM and BiLSTM networks were trained for each type of input data. The architecture and parameter choices were made iteratively through a trial-and-error process, focusing on the accuracy and loss trends. The Imagined Speech dataset [25] was chosen to confirm the validity of the model that obtained the best results on the Thinking Out Loud dataset.

5. Results and discussion

5.1. Machine Learning Models results

In the binary classification task, among the various tests performed using PCA, the best results for the SVM are obtained without PCA, reaching an accuracy of 79%, while XGBoost reaches an accuracy of 81% using PCA explaining 99% of the variance. Using the most important features extracted with XGBoost, considering each subject individually, the SVM performance improves up to 80% accuracy with a gain of 0.9.

In the multiclass classification task, the results obtained with SVM and XGBoost are very similar, 26.20% and 27.90% accuracy respectively. In this case, using the most important features extracted with XGBoost, both considering each subject individually and in common to all subjects, the results are approximately the same. This means that the subjects have common characteristics relevant for classification. Furthermore, there are no particular differences when using all channels or only those of the left hemisphere. The features extracted in common to all subjects were analysed and are highlighted in Figure 3. Considering all the channels, the most involved areas are the occipital one, probably engaged by the visual signals presented on the screen, and the temporal and frontal ones. Both the right and left hemispheres are involved. Since the performance still remains good even when using only the electrodes identified in the left hemisphere, this could mean that there are electrodes in the left channels that compensate for the absence of the right ones. The frequency band most involved is alpha, usually associated with intense mental activity. These features were later used for some of the analyses carried out with the DL models.

Figure 3: Features in common to all subjects in the multiclass task, identified with XGBoost using all channels and gain = 0.95.
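The gain-based feature selection behind Figure 3 can be sketched as follows with the xgboost Python package. One plausible reading, assumed here, is that the features covering a given fraction of the total gain are kept for each subject and the per-subject selections are then intersected; the exact criterion, thresholds and model hyperparameters used in this work may differ.

import numpy as np
from xgboost import XGBClassifier

def important_features(X, y, gain_fraction=0.95):
    # Keep the features that together account for `gain_fraction` of the total
    # XGBoost gain; this is an assumed reading of the selection of Section 4.1.
    model = XGBClassifier(n_estimators=100, max_depth=3, importance_type="gain")
    model.fit(X, y)
    gains = model.feature_importances_               # normalised gain per feature
    order = np.argsort(gains)[::-1]
    cutoff = np.searchsorted(np.cumsum(gains[order]), gain_fraction) + 1
    return set(order[:cutoff].tolist())

def common_features(per_subject_data, gain_fraction=0.95):
    # Intersection of the per-subject selections, as described in Section 4.1.
    selections = [important_features(X, y, gain_fraction) for X, y in per_subject_data]
    return sorted(set.intersection(*selections))

# toy usage: three subjects with random band-power features and four word labels
rng = np.random.default_rng(0)
subjects = [(rng.standard_normal((120, 384)), rng.integers(0, 4, 120)) for _ in range(3)]
print(common_features(subjects))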
5.2. Deep Learning Models results

This paragraph summarises the performances obtained with the different deep learning models using the Thinking Out Loud dataset. In Table 1 the results are shown, averaged over all subjects. In general, with the BiLSTM the performances increase with respect to the LSTM. This is due to the fact that the BiLSTM is able to capture the sequential dependencies between data in both directions. The greatest improvement is obtained using the raw data from all channels as input.

Table 1
Deep learning models comparison.

Input Type                                     LSTM Accuracy    BiLSTM Accuracy
Most important features                        30.40%           31.30%
Raw data (all channels)                        27.20%           36.10%
Raw data (channels most important features)    26.70%           33.10%

Looking at the average performance subject by subject, we observe a repeating trend. Using both different models and different types of input data, the data of some subjects are classified better than others. Specifically, subjects 4 and 5 achieve an average performance which is always slightly lower. This could probably be related to the data acquisition phase for these volunteers: maybe the placement of the electrodes was not perfect, or their data are noisier. Analysing the results of the attention monitoring, there are no differences with respect to the other subjects, so a lack of attention in carrying out the task should not be the cause of the lower performance.

The best model network is composed of a BiLSTM layer followed by two dense layers (with ReLU activation function) and a dense output layer (with softmax activation function). Two dropout layers were used to reduce overfitting. SGD was used as the optimisation method and categorical cross-entropy as the loss function. We recall that the architecture and parameter choices were made iteratively through a trial-and-error process, focusing on the accuracy and loss trends.

Figure 4 shows the results obtained for each subject. All the subjects achieve an accuracy above chance (represented by the red line, 25%) and the mean accuracy over all the subjects is 36.10% (± 0.054 std). Table 2 summarises the results of our best model for each subject.

Figure 4: Accuracy of the BiLSTM network for multiclass classification using the Thinking Out Loud raw data. The red line represents the chance level (25%).

Table 2
BiLSTM performance for each subject on the 4-class inner speech classification task using the Thinking Out Loud raw data.

Subject    Accuracy ± std     Precision    Recall    F1-score
Sub 1      38.60 ± 0.050%     40.37%       38.93%    37.87%
Sub 2      40.17 ± 0.029%     40.06%       40.43%    39.56%
Sub 3      37.60 ± 0.055%     38.25%       37.50%    35.50%
Sub 4      33.67 ± 0.072%     34.44%       33.69%    32.81%
Sub 5      31.67 ± 0.028%     32.13%       31.94%    31.06%
Sub 6      34.81 ± 0.057%     37.69%       33.00%    33.75%
Sub 7      34.83 ± 0.062%     36.00%       34.94%    34.31%
Sub 8      36.60 ± 0.058%     37.69%       36.88%    34.88%
Sub 9      38.00 ± 0.089%     38.19%       37.75%    37.31%
Sub 10     35.33 ± 0.043%     34.50%       34.94%    34.38%
Average    36.12 ± 0.054%     36.93%       36.00%    35.14%

Instead, Table 3 shows a comparison of our proposed approaches and works in the literature using the Thinking Out Loud dataset.

The Imagined Speech dataset was chosen to confirm the validity of the model that obtained the best results on the Thinking Out Loud dataset. Figure 5 shows the results obtained for each subject. All the subjects achieve an accuracy above chance (represented by the red line, 16.67%). The mean accuracy over all the subjects is 25.10% (± 0.045 std). Table 4 shows a comparison of our proposed approach and works in the literature using the Imagined Speech dataset.
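A sketch of the best-performing network described above is reported below in Keras (TensorFlow). The layer types, optimiser and loss follow the description in this section, while the number of units, the dropout rates and the learning rate are illustrative assumptions rather than the values obtained by our trial-and-error tuning.

from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(n_timesteps, n_channels, n_classes=4):
    # Sketch of the best model of Section 5.2: one BiLSTM layer, two ReLU dense
    # layers, a softmax output and two dropout layers, trained with SGD and
    # categorical cross-entropy. Layer sizes and rates are illustrative guesses.
    model = keras.Sequential([
        keras.Input(shape=(n_timesteps, n_channels)),     # raw EEG, all channels
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dropout(0.4),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(32, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
                  loss="categorical_crossentropy",        # expects one-hot labels
                  metrics=["accuracy"])
    return model

# toy usage: 2.5 s of 128-channel EEG at 254 Hz, four word classes
model = build_bilstm(n_timesteps=635, n_channels=128)
model.summary()

For training, the raw epochs are arranged as (trials, time steps, channels) and the word labels are one-hot encoded to match the categorical cross-entropy loss.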
Table 3
Comparison of our ML and DL models with results in the literature using the Thinking Out Loud dataset.

Classifier    Input Data                                               Accuracy
SVM           PSD features (channels left hemisphere) + PCA (0.99)     26.20%
XGBoost       PSD features + PCA (0.99)                                27.90%
LSTM          Most important features                                  30.40%
LSTM          Raw data (all channels)                                  27.20%
LSTM          Raw data (channels most important features)              26.70%
BiLSTM        Most important features                                  31.30%
BiLSTM        Raw data (all channels)                                  36.10%
BiLSTM        Raw data (channels most important features)              33.10%
EEGNet [6]    Raw data (channels left hemisphere)                      29.67%

Figure 5: Accuracy of the BiLSTM network for multiclass classification using the Imagined Speech raw data. The red line represents the chance level (16.7%).

6. Conclusions

Inner speech recognition by decoding the EEG signal is still an open field of research. Few datasets are available in the literature and the classification performance, even if above chance, is still very low. The results obtained with this work confirm that the adoption of a BiLSTM architecture increases the classification performance with respect to the state of the art. In particular, the model designed for the Thinking Out Loud dataset was tested on the Imagined Speech one, acquired using a similar experimental protocol but with fewer electrodes, thus confirming the validity of our proposal. The best classifier is obtained considering the raw data of all channels, denoting that a deeper analysis of the most significant features should be performed, recalling that inner speech recognition should be considered in BCI applications where classification should be performed in real time.

Table 4
Comparison of our DL model with results in the literature using the Imagined Speech dataset.

Classifier     Input Data                       Accuracy
BiLSTM         Raw data (all channels)          25.10%
RF [25]        Relative Wavelet Energy (RWE)    18.58%
EEGNet [26]    Raw data (all channels)          24.97%

In future works, a parameter optimisation will be performed considering a subject-based approach, to further increase the classification performance. Another future development of this work would be to test the API proposed by [28] on the datasets analysed in the present paper, to provide an insight into speech activity recognition and to extend it to the multiclass identification of words in real time. Finally, the need for more numerous and less noisy datasets is crucial for further development in this field of research. In fact, the studies on speech-related tasks may benefit from the collection of a larger pool of data or the introduction of data augmentation approaches. However, it will be of fundamental importance to provide an unbiased data augmentation that may be based on data-driven approaches like the empirical mode decomposition described in [29].

References

[1] T. Schultz, M. Wand, T. Hueber, D. J. Krusienski, C. Herff, J. S. Brumberg, Biosignal-Based Spoken Communication: A Survey, IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (2017) 2257–2271.
[2] P. Saha, M. Abdul-Mageed, S. Fels, Speak your mind! Towards imagined speech recognition with hierarchical deep learning, arXiv preprint arXiv:1904.05746 (2019).
[3] N. Nieto, V. Peterson, H. L. Rufiner, J. E. Kamienkowski, R. Spies, Thinking out loud, an open-access EEG-based BCI dataset for inner speech recognition, Scientific Data 9 (2022) 1–17.
[4] X. Gu, Z. Cao, A. Jolfaei, P. Xu, D. Wu, T.-P. Jung, C.-T. Lin, EEG-based brain-computer interfaces (BCIs): A survey of recent studies on signal sensing technologies and computational intelligence approaches and their applications, IEEE/ACM Transactions on Computational Biology and Bioinformatics 18 (2021) 1645–1666.
[5] B. Alderson-Day, C. Fernyhough, Inner speech: development, cognitive functions, phenomenology, and neurobiology, Psychological Bulletin 141 (2015) 931.
[6] B. van den Berg, S. van Donkelaar, M. Alimardani, Inner Speech Classification using EEG Signals: A Deep Learning Approach, in: 2021 IEEE 2nd International Conference on Human-Machine Systems (ICHMS), IEEE, 2021, pp. 1–4.
[7] E. Amit, C. Hoeflin, N. Hamzah, E. Fedorenko, An asymmetrical relationship between verbal and visual thinking: Converging evidence from behavior and fMRI, NeuroImage 152 (2017) 619–627.
[8] F. Bocquelet, T. Hueber, L. Girin, S. Chabardès, B. Yvert, Key considerations in designing a speech brain-computer interface, Journal of Physiology-Paris 110 (2016) 392–401.
[9] S. Martin, I. Iturrate, J. d. R. Millán, R. T. Knight, B. N. Pasley, Decoding inner speech using electrocorticography: Progress and challenges toward a speech prosthesis, Frontiers in Neuroscience 12 (2018) 422.
[10] J. T. Panachakel, A. G. Ramakrishnan, Decoding covert speech from EEG - a comprehensive review, Frontiers in Neuroscience (2021) 392.
[11] M. D'Zmura, S. Deng, T. Lappas, S. Thorpe, R. Srinivasan, Toward EEG sensing of imagined speech, in: International Conference on Human-Computer Interaction, Springer, 2009, pp. 40–48.
[12] K. Brigham, B. V. Kumar, Imagined speech classification with EEG signals for silent communication: a preliminary investigation into synthetic telepathy, in: 2010 4th International Conference on Bioinformatics and Biomedical Engineering, IEEE, 2010, pp. 1–4.
[13] S. Deng, R. Srinivasan, T. Lappas, M. D'Zmura, EEG classification of imagined syllable rhythm using Hilbert spectrum methods, Journal of Neural Engineering 7 (2010) 046006.
[14] C. S. DaSalla, H. Kambara, M. Sato, Y. Koike, Single-trial classification of vowel speech imagery using common spatial patterns, Neural Networks 22 (2009) 1334–1339.
[15] C. S. DaSalla, H. Kambara, Y. Koike, M. Sato, Spatial filtering and single-trial classification of EEG during vowel speech imagery, in: Proceedings of the 3rd International Convention on Rehabilitation Engineering & Assistive Technology, 2009, pp. 1–4.
[16] B. M. Idrees, O. Farooq, Vowel classification using wavelet decomposition during speech imagery, in: 2016 3rd International Conference on Signal Processing and Integrated Networks (SPIN), IEEE, 2016, pp. 636–640.
[17] A. Riaz, S. Akhtar, S. Iftikhar, A. A. Khan, A. Salman, Inter comparison of classification techniques for vowel speech imagery using EEG sensors, in: The 2014 2nd International Conference on Systems and Informatics (ICSAI 2014), IEEE, 2014, pp. 712–717.
[18] J. Kim, S.-K. Lee, B. Lee, EEG classification in a single-trial basis for vowel speech perception using multivariate empirical mode decomposition, Journal of Neural Engineering 11 (2014) 036010.
[19] P. Suppes, Z.-L. Lu, B. Han, Brain wave recognition of words, Proceedings of the National Academy of Sciences 94 (1997) 14965–14969.
[20] L. Wang, X. Zhang, X. Zhong, Y. Zhang, Analysis and classification of speech imagery EEG for BCI, Biomedical Signal Processing and Control 8 (2013) 901–908.
[21] M. Salama, L. ElSherif, H. Lashin, T. Gamal, Recognition of unspoken words using electrode electroencephalographic signals, in: The Sixth International Conference on Advanced Cognitive Technologies and Applications, Citeseer, 2014, pp. 51–5.
[22] K. Mohanchandra, S. Saha, A communication paradigm using subvocalized speech: translating brain signals into speech, Augmented Human Research 1 (2016) 1–14.
[23] E. F. González-Castañeda, A. A. Torres-García, C. A. Reyes-García, L. Villaseñor-Pineda, Sonification and textification: Proposing methods for classifying unspoken words from EEG signals, Biomedical Signal Processing and Control 37 (2017) 82–91.
[24] C. H. Nguyen, G. K. Karavas, P. Artemiadis, Inferring imagined speech using EEG signals: a new approach using Riemannian manifold features, Journal of Neural Engineering 15 (2017) 016002.
[25] G. A. P. Coretto, I. E. Gareis, H. L. Rufiner, Open access database of EEG signals recorded during imagined speech, in: 12th International Symposium on Medical Information Processing and Analysis, volume 10160, SPIE, 2017, p. 1016002.
[26] C. Cooney, A. Korik, R. Folli, D. Coyle, Evaluation of hyperparameter optimization in machine and deep learning methods for decoding imagined speech EEG, Sensors 20 (2020) 4629.
[27] V. J. Lawhern, A. J. Solon, N. R. Waytowich, S. M. Gordon, C. P. Hung, B. J. Lance, EEGNet: a compact convolutional neural network for EEG-based brain-computer interfaces, Journal of Neural Engineering 15 (2018) 056013.
[28] L. A. Moctezuma, M. M. Molinas Cabrera, Towards an API for EEG-based imagined speech classification, in: ITISE 2018 - International Conference on Time Series and Forecasting, 2018.
[29] J. Dinarès-Ferran, R. Ortner, C. Guger, J. Solé-Casals, A new method to generate artificial frames using the empirical mode decomposition for an EEG-based motor imagery BCI, Frontiers in Neuroscience 12 (2018) 308.