<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Realtime depression estimation using mid-term audio features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Theodoros Giannakopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christos Smailis</string-name>
          <email>xsmailis@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stavros Perantonis</string-name>
          <email>sper@iit.demokritos.gr</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Constantine Spyropoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>National Center for Scientific Research DEMOKRITOS</institution>
          ,
          <addr-line>Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <fpage>41</fpage>
      <lpage>45</lpage>
      <abstract>
        <p>This paper presents a method for estimating a clinical depression-specific score, namely the Beck Depression Inventory (BDI) score, based on the analysis of mid-term audio features. A combination of support vector machines and semi-supervised learning has been applied to map the mid-term features to the BDI score. The method has been evaluated on the AVEC 2013 depression dataset. The overall system has been implemented in Python, achieving a 20x real-time computational complexity on an average computer.</p>
      </abstract>
      <kwd-group>
        <kwd>Depression</kwd>
        <kwd>Audio Analysis</kwd>
        <kwd>Regression</kwd>
        <kwd>Support Vector Machines</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>One of the most common mood disorders is clinical depression. Automatic
estimation or detection of its presence through audio-visual signal analysis has
therefore gained research interest in recent years. This work focuses
on the audio domain in order to estimate the Beck Depression Inventory (BDI)
score of an individual, given a long audio recording of his/her voice.</p>
      <p>
        Some of the previous studies in the field of depression analysis focus on
utilizing acoustic features for the task of either predicting the Beck Depression
Inventory (BDI) score of an individual, or classifying a person as depressive or
not depressive. For example, in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] the performance of different audio features is
explored for the task of classifying audio files contained within the AVEC 2013
dataset [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], as either depressed or non-depressed. In [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] several novel Canonical
Correlation Analysis (CCA) based feature selection methods are presented, in
order to reduce the massive feature dimensionality introduced for AVEC 2013.
Results revealed that, using only 17% of the original features, a relative RMSE
decrease of 30% over the baseline was obtained on the challenge test set.
In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] audio features that reflect changes in coordination of vocal tract motion
have been utilized.
      </p>
      <p>
        Other studies have adopted different approaches, fusing the audio
and video modalities in order to boost performance. For example,
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] the authors made use of Motion History Histogram (MHH) in the video
modality, to capture the movement of each pixel (texture variation) within the
face area. In the audio modality the authors used a set of spectral Low-Level
Descriptors (LLD). MHH was once again used for extracting change information
of the vocal expression. For each modality, the Partial Least Square (PLS)
regression algorithm was applied, and the predicted values of the visual and vocal cues were
further combined at the decision level. Another multimodal approach is presented in
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. That study uses a GMM-UBM system with
three different kernels for the audio subsystem and Space Time Interest Points
in a Bag-of-Words approach for the vision subsystem. These are then fused at
the feature level to form the combined AV system. Key results include the strong
performance of acoustic audio features and the bag-of-words visual features in
predicting an individual's level of depression.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Feature Extraction</title>
      <p>In most audio analysis and processing methods, it is rather common that the
input audio signal is divided into short-term frames before feature extraction. In
particular, the audio signal is broken into (non-)overlapping frames and a set of
features is extracted for each frame. The result of this procedure is a sequence
of feature vectors per audio signal. Another common technique used as a second
step in audio feature extraction is the processing of the feature sequence on a
mid-term basis. The audio signal is first divided into mid-term segments and
then, for each segment, the short-term processing stage is carried out. As a next
step, the feature sequence extracted from each mid-term segment
is used for computing feature statistics. In practice, the duration of mid-term
windows typically lies in the range of 1 to 10 seconds, depending on the application
domain. In this work, extensive experimentation has led to selecting a 5-second
mid-term window and a 50 ms short-term frame. In both cases, 50% overlap
has been adopted.</p>
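      <p>The two-level windowing described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the use of NumPy, and the 16 kHz sampling rate of the synthetic test signal are assumptions made for the example; only the 5-second window, the 50 ms frame and the 50% overlap come from the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def split_windows(signal, win_len, step):
    """Return all full windows of win_len samples, advancing by step samples."""
    return [signal[start:start + win_len]
            for start in range(0, len(signal) - win_len + 1, step)]


def segment_signal(signal, fs, mid_win=5.0, short_win=0.050, overlap=0.5):
    """Split a signal into mid-term segments, each split into short-term frames.

    Defaults follow the paper: 5 s mid-term window, 50 ms short-term frame,
    50% overlap at both levels.
    """
    mid_len = int(round(mid_win * fs))
    short_len = int(round(short_win * fs))
    mid_step = int(round(mid_len * (1.0 - overlap)))
    short_step = int(round(short_len * (1.0 - overlap)))
    return [split_windows(segment, short_len, short_step)
            for segment in split_windows(signal, mid_len, mid_step)]


# Synthetic 12-second noise signal; the 16 kHz rate is a hypothetical value.
fs = 16000
signal = np.random.default_rng(0).standard_normal(12 * fs)
segments = segment_signal(signal, fs)
print(len(segments), len(segments[0]), len(segments[0][0]))
```
]]></preformat>
      <p>Incomplete trailing windows are simply dropped in this sketch; padding them instead would be an equally valid choice.</p>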
      <sec id="sec-2-1">
        <title>Short-term audio features</title>
        <p>
          In this section we describe the adopted short-term features. These features have
been used in several general audio analysis methods and speech processing
applications [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] and they cover a wide range of audio signal properties
achieving discrimination abilities in several classification and regression tasks.
        </p>
        <p><bold>Energy</bold> The energy feature is computed as a sum of squared signal values (in
the time domain), normalized by the window length. Short-term energy usually
exhibits high variation over successive speech frames, since speech signals contain
weak phonemes and short periods of silence between words.
        </p>
        <p><bold>Zero Crossing Rate</bold> The Zero Crossing Rate (ZCR) of an audio signal is the
number of sign changes of the signal divided by the duration of that signal. ZCR
has been interpreted as a measure of the noisiness of a signal; it therefore usually
exhibits higher values in the case of noisy signals.</p>
        <p><bold>Entropy of Energy</bold> The entropy of energy is a measure of abrupt changes
in the energy level of an audio signal. It is computed by first dividing each
frame into sub-frames of fixed duration. Then, for each sub-frame j, its energy is
computed and divided by the total energy of the frame. Finally, the entropy of that sequence
of (normalized) sub-energies e<sub>j</sub> is computed as the final feature value. This
feature has lower values if there exist abrupt changes in the energy envelope of the
respective signal.</p>
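        <p>The three time-domain features above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the use of NumPy, and the number of sub-frames (10) are choices made for the example. The ZCR is expressed here per sample pair rather than per second.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

EPS = 1e-12  # guards against division by zero and log(0)


def energy(frame):
    """Sum of squared samples, normalized by the frame length."""
    return float(np.sum(frame ** 2) / len(frame))


def zero_crossing_rate(frame):
    """Number of sign changes divided by the number of sample pairs."""
    sign_changes = np.sum(np.abs(np.diff(np.sign(frame)))) / 2.0
    return float(sign_changes / (len(frame) - 1))


def energy_entropy(frame, n_sub=10):
    """Entropy of the normalized sub-frame energies (n_sub is an assumption)."""
    sub_len = len(frame) // n_sub
    sub_frames = frame[:sub_len * n_sub].reshape(n_sub, sub_len)
    sub_energy = np.sum(sub_frames ** 2, axis=1)
    e = sub_energy / (np.sum(sub_energy) + EPS)
    return float(-np.sum(e * np.log2(e + EPS)))


steady = np.ones(100)                      # constant energy envelope
impulse = np.zeros(100); impulse[5] = 1.0  # one abrupt energy burst
alternating = np.array([1.0, -1.0, 1.0, -1.0])

print(energy(steady), zero_crossing_rate(alternating))
print(energy_entropy(steady), energy_entropy(impulse))
```
]]></preformat>
        <p>The impulse frame yields a much lower entropy than the steady frame, in line with the statement that abrupt energy changes lower the feature value.</p>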
        <p><bold>Spectral Centroid and Spread</bold> The spectral centroid and the spectral spread
are two basic spectral domain features that quantify the position and shape of the spectrum. The
spectral centroid is the center of gravity of the spectrum, while the spectral spread
is the second central moment of the spectrum.</p>
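        <p>A minimal sketch of these two features, assuming a magnitude spectrum obtained with NumPy's real FFT. The function name is hypothetical, and the spread is returned as the square root of the second central moment so that it is expressed in Hz, which is a choice of this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def spectral_centroid_spread(frame, fs):
    """Centroid (Hz) and spread (Hz) of the magnitude spectrum of a frame."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    # Treat the normalized magnitudes as a distribution over frequency.
    weights = magnitude / (np.sum(magnitude) + 1e-12)
    centroid = np.sum(freqs * weights)
    second_moment = np.sum(((freqs - centroid) ** 2) * weights)
    return float(centroid), float(np.sqrt(second_moment))


# A pure 1 kHz tone: the centroid sits at 1 kHz and the spread is close to 0.
fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 1000 * t)
centroid, spread = spectral_centroid_spread(tone, fs)
print(centroid, spread)
```
]]></preformat>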
        <p><bold>Spectral Entropy</bold> Spectral entropy is computed similarly to the entropy
of energy; however, it is applied in the frequency domain.</p>
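        <p>By analogy with the entropy of energy, a sketch of spectral entropy could divide the power spectrum into sub-bands and take the entropy of the normalized sub-band energies. The function name and the number of sub-bands (10) are assumptions made for this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def spectral_entropy(frame, n_bands=10):
    """Entropy of normalized sub-band energies of the power spectrum."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    band_len = len(power) // n_bands
    bands = power[:band_len * n_bands].reshape(n_bands, band_len)
    band_energy = np.sum(bands, axis=1)
    e = band_energy / (np.sum(band_energy) + 1e-12)
    return float(-np.sum(e * np.log2(e + 1e-12)))


fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 1000 * t)                      # energy in one band
noise = np.random.default_rng(0).standard_normal(800)   # energy in all bands
print(spectral_entropy(tone), spectral_entropy(noise))
```
]]></preformat>
        <p>The tone concentrates its energy in a single band and therefore has an entropy near zero, while white noise approaches the maximum of log2(10).</p>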
        <p><bold>Spectral Flux</bold> Spectral flux is a measure of spectral change between two
successive frames and it is computed as the squared difference between the normalized
magnitudes of the spectra of the two successive frames.</p>
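        <p>A minimal sketch of spectral flux, assuming that each magnitude spectrum is normalized by its sum before the squared differences are accumulated. The function name and the normalization by the sum are assumptions of this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def spectral_flux(frame, previous_frame):
    """Sum of squared differences between normalized magnitude spectra."""
    mag = np.abs(np.fft.rfft(frame))
    prev_mag = np.abs(np.fft.rfft(previous_frame))
    mag = mag / (np.sum(mag) + 1e-12)
    prev_mag = prev_mag / (np.sum(prev_mag) + 1e-12)
    return float(np.sum((mag - prev_mag) ** 2))


fs = 8000
t = np.arange(800) / fs
tone_1k = np.sin(2 * np.pi * 1000 * t)
tone_2k = np.sin(2 * np.pi * 2000 * t)
flux_same = spectral_flux(tone_1k, tone_1k)   # no spectral change
flux_diff = spectral_flux(tone_2k, tone_1k)   # energy moved to another bin
print(flux_same, flux_diff)
```
]]></preformat>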
        <p><bold>Spectral Rolloff</bold> Spectral rolloff is the frequency below which a certain
percentage of the magnitude distribution of the spectrum is concentrated. It can be
treated as a spectral shape descriptor of an audio signal and it has been used
for discriminating between voiced and unvoiced sounds.</p>
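        <p>The rolloff frequency can be sketched with a cumulative sum over the magnitude spectrum. The text does not state the percentage used, so the 90% value below is an assumption, as is the function name.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def spectral_rolloff(frame, fs, fraction=0.90):
    """Lowest frequency below which `fraction` of the magnitude is contained."""
    magnitude = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    cumulative = np.cumsum(magnitude)
    # First bin whose cumulative magnitude reaches the requested fraction.
    index = int(np.searchsorted(cumulative, fraction * cumulative[-1]))
    return float(freqs[min(index, len(freqs) - 1)])


fs = 8000
t = np.arange(800) / fs
tone = np.sin(2 * np.pi * 1000 * t)
rolloff = spectral_rolloff(tone, fs)
print(rolloff)
```
]]></preformat>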
        <p><bold>MFCCs</bold> The Mel-Frequency Cepstral Coefficients (MFCCs) have been very
popular in the field of speech analysis. In practice, MFCCs are the discrete
cosine transform coefficients of the mel-scaled log-power spectrum. MFCCs have
been widely used in speech recognition, speaker clustering and many other audio
analysis applications.</p>
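        <p>The definition above (a discrete cosine transform of the mel-scaled log-power spectrum) can be sketched as follows. The number of triangular filters (26), the number of coefficients (13), the mel formula and all function names are assumptions of this example; the paper does not specify them.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def hz_to_mel(freq_hz):
    return 2595.0 * np.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, fs):
    """Triangular filters with centers equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.linspace(0.0, fs / 2.0, n_bins)
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        low, center, high = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - low) / (center - low)
        falling = (high - freqs) / (high - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def mfcc(frame, fs, n_filters=26, n_coeffs=13):
    """DCT-II of the log mel-band energies of one frame."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    bank = mel_filterbank(n_filters, len(power), fs)
    log_mel = np.log(bank @ power + 1e-12)
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    dct_basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))
    return dct_basis @ log_mel


fs = 8000
frame = np.random.default_rng(0).standard_normal(800)
coeffs = mfcc(frame, fs)
print(coeffs.shape)
```
]]></preformat>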
        <p><bold>Chroma vector</bold> This is a 12-dimensional representation of the spectral energy
of an audio signal. This is a widely used descriptor, mostly in music-related
applications; however, it has also been used in speech analysis. The chroma
vector is computed by grouping the spectral coefficients of a frame into 12 bins
representing the 12 equal-tempered pitch classes of western-type music.</p>
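        <p>The grouping of spectral coefficients into 12 pitch classes can be sketched as follows. The reference frequency of 27.5 Hz (the note A0, so that bin 0 corresponds to pitch class A), the use of squared magnitudes and the function name are assumptions of this example.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def chroma_vector(frame, fs, ref_freq=27.5):
    """Normalized spectral energy per equal-tempered pitch class (12 bins)."""
    power = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)
    valid = freqs >= ref_freq  # skip DC and bins below the reference note
    # Nearest semitone relative to the reference, folded into one octave.
    semitone = np.round(12 * np.log2(freqs[valid] / ref_freq)).astype(int)
    chroma = np.bincount(semitone % 12, weights=power[valid], minlength=12)
    return chroma / (np.sum(chroma) + 1e-12)


# A 440 Hz tone (A4) should put almost all energy in pitch class 0 (A).
fs = 8000
t = np.arange(800) / fs
tone_a4 = np.sin(2 * np.pi * 440 * t)
chroma = chroma_vector(tone_a4, fs)
print(int(np.argmax(chroma)))
```
]]></preformat>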
      </sec>
      <sec id="sec-2-2">
        <title>Final feature vector extraction</title>
        <p>The process described in Section 2.1 leads to a sequence of short-term feature vectors of
21 dimensions (this is the total number of short-term features described above).
As a next step, statistics are calculated on a mid-term basis, as described at the
beginning of this Section. In particular, the following statistics are computed:
(a) average value μ, (b) standard deviation σ<sup>2</sup> and (c) σ<sup>2</sup>/μ ratio. This leads to
several mid-term feature vectors of 63 elements. The number of these mid-term
vectors depends on the overall duration of the audio signal. Each of these vectors
is fed as input to the next regression step, which produces the final depression
estimate.</p>
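        <p>The mid-term statistics step can be sketched as follows for one mid-term segment. This is an illustration only: the function name is hypothetical, the second statistic is taken to be the standard deviation as named in the text, and the input is random data standing in for a real 21-dimensional short-term feature sequence.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def mid_term_statistics(short_term_features):
    """Map an (n_frames, n_features) sequence to one vector of 3 * n_features.

    The output concatenates the per-feature mean, standard deviation and
    deviation-to-mean ratio.
    """
    eps = 1e-12
    mu = short_term_features.mean(axis=0)
    sigma = short_term_features.std(axis=0)
    safe_mu = np.where(np.abs(mu) < eps, eps, mu)  # avoid division by zero
    return np.concatenate([mu, sigma, sigma / safe_mu])


# Stand-in for the 199 short-term frames of one 5-second segment.
rng = np.random.default_rng(0)
short_term = rng.standard_normal((199, 21)) + 5.0
mid_term_vector = mid_term_statistics(short_term)
print(mid_term_vector.shape)
```
]]></preformat>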
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Depression Estimation</title>
      <p>Each mid-term feature vector described in the previous section is used in the
context of a Support Vector Machine regression technique to estimate the Beck
Depression Inventory (BDI) score. Since one decision is calculated per feature vector,
the final decision (per audio recording) is extracted by averaging the mid-term
BDI estimates. This rationale helps in generating a sufficient number of samples for the
SVM regression model training phase and, in addition, it manages to handle
mid-term vocal characteristics. Finally, a semi-supervised dimensionality reduction
technique has been adopted in order to give weight to feature dimensions that are
discriminative in terms of depression estimation. In particular, recordings that
share similar BDIs have been grouped together using a simple k-Means clustering
approach. Subsequently, a Linear Discriminant Analysis (LDA) step has been performed
on the initial feature space to extract a depression-discriminant subspace. During
this process, the BDI clusters have been used as class labels in the LDA process.</p>
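      <p>The pipeline described above (k-Means on the BDI scores, LDA with the cluster indices as labels, SVM regression per mid-term vector, averaging per recording) can be sketched as follows. This is not the authors' implementation: the use of scikit-learn, the number of clusters (4), the RBF kernel, the value of C and the synthetic data are all assumptions of this example, and the model is fitted and applied on the same data purely for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVR

rng = np.random.default_rng(0)

# Synthetic stand-in data: 20 recordings, 30 mid-term vectors each, 63 features.
n_rec, n_seg, n_feat = 20, 30, 63
bdi_per_recording = rng.integers(0, 64, size=n_rec).astype(float)
recording_id = np.repeat(np.arange(n_rec), n_seg)
X = rng.standard_normal((n_rec * n_seg, n_feat))
y = bdi_per_recording[recording_id]  # each vector inherits its recording's BDI

# 1) Group recordings with similar BDI scores (number of clusters is assumed).
n_clusters = 4
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
cluster_of_recording = kmeans.fit_predict(bdi_per_recording.reshape(-1, 1))

# 2) LDA, using the BDI cluster index of each vector as its class label.
lda = LinearDiscriminantAnalysis(n_components=n_clusters - 1)
X_lda = lda.fit_transform(X, cluster_of_recording[recording_id])

# 3) SVM regression on the reduced mid-term vectors.
svr = SVR(kernel="rbf", C=1.0).fit(X_lda, y)

# 4) One estimate per mid-term vector, averaged per recording.
segment_estimates = svr.predict(X_lda)
recording_estimates = np.array(
    [segment_estimates[recording_id == r].mean() for r in range(n_rec)])
print(X_lda.shape, recording_estimates.shape)
```
]]></preformat>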
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Dataset</title>
        <p>
          In order to evaluate the presented method, we have used the AVEC 2013
depression dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This includes 340 video recordings of 292 subjects performing a
human-computer interaction task while being recorded by audio-visual
sensors. The average age is 31.5 years, with a range of 18 to 63 years. The length
of each recording varies from 20 to 50 minutes, with an average duration of 25
minutes per recording. The total duration of all recordings is 240 hours. The
behaviour within the clips consisted of different tasks such as sustained vowel
phonation, speaking out loud while solving a task, counting from 1 to 10, read speech,
singing, telling a story from the subject’s own past, and telling an imagined story.
The 16-bit audio was recorded at a sampling rate of 41 kHz.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Results</title>
        <p>A cross-validation procedure on the AVEC dataset has been conducted in
order to evaluate the presented scheme via the following performance measures:
(a) Root mean square error (RMSE). (b) 4-class classification accuracy: in order
to calculate this measure, the final decisions are discretized to four “depression
levels” according to the following categorization: 0-13: minimal depression,
14-19: mild depression, 20-28: moderate depression and 29-63: severe depression.
Then the overall accuracy of the classification task is computed. (c) 2-class
classification accuracy: similarly to the 4-class case, the problem is binarized using
BDI = 14 as a threshold. The cross-validation procedure estimated an RMSE
of 9.5, a 4-class accuracy rate of 52% and a binary accuracy rate of 70%.</p>
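        <p>The three performance measures can be sketched as follows. The function names and the toy values are hypothetical; the level boundaries follow the categorization given above, and a score of 14 or more is assumed to fall on the depressed side of the binary threshold.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import numpy as np

# Lower bounds of the mild, moderate and severe levels (0-13 is minimal).
LEVEL_EDGES = [14, 20, 29]
BINARY_THRESHOLD = 14


def rmse(y_true, y_pred):
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def depression_level(bdi):
    """0 = minimal, 1 = mild, 2 = moderate, 3 = severe."""
    return np.digitize(bdi, LEVEL_EDGES)


def accuracy(labels_true, labels_pred):
    return float(np.mean(np.asarray(labels_true) == np.asarray(labels_pred)))


# Toy values for four recordings (not taken from the paper).
y_true = np.array([5, 14, 22, 40])
y_pred = np.array([8.0, 19.5, 18.0, 31.0])

error = rmse(y_true, y_pred)
acc_4 = accuracy(depression_level(y_true), depression_level(y_pred))
acc_2 = accuracy(y_true >= BINARY_THRESHOLD, y_pred >= BINARY_THRESHOLD)
print(round(error, 3), acc_4, acc_2)
```
]]></preformat>
        <p>In this toy example the third recording is assigned to the mild instead of the moderate level, so the 4-class accuracy is 0.75 while the binary accuracy remains 1.0.</p>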
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and future work</title>
      <p>We have presented a method for estimating a subject’s clinical depression score
using audio information. A wide range of audio features has been extracted on
a mid-term basis, while a combination of SVMs and a semi-supervised
dimensionality reduction method has been used in the recognition process. Results
demonstrate a 20% drop in the error relative to the baseline performance of the
adopted dataset. We are conducting ongoing research on using temporal analysis
techniques to enhance the signal representation process by selecting representative
areas of the recording with high discrimination ability (in terms of depression
analysis). In addition, we plan to implement visual shape modelling methods in
the context of a fusion system. Finally, it is rather important to
collaborate with a specialist in mental health in order to extract useful correlations
between low-level audio features and mid-level depression-related characteristics.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Lopez-Otero</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Docio-Fernandez</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia-Mateo</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>A study of acoustic features for the classification of depressed speech</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Valstar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smith</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bilakhia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schnieder</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cowie</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pantic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>AVEC 2013: the continuous audio/visual emotion and depression recognition challenge</article-title>
          .
          <source>In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge</source>
          ,
          <source>ACM</source>
          (
          <year>2013</year>
          )
          <fpage>3</fpage>
          -
          <lpage>10</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Kaya</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eyben</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salah</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schuller</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>CCA based feature selection with application to continuous depression recognition from acoustic speech features</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Williamson</surname>
            ,
            <given-names>J.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Quatieri</surname>
            ,
            <given-names>T.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Helfer</surname>
            ,
            <given-names>B.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Horwitz</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mehta</surname>
            ,
            <given-names>D.D.</given-names>
          </string-name>
          :
          <article-title>Vocal biomarkers of depression based on motor incoordination</article-title>
          .
          <source>In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge</source>
          ,
          <source>ACM</source>
          (
          <year>2013</year>
          )
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Shuraifi</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Depression recognition based on dynamic facial and vocal expression features using partial least square regression</article-title>
          .
          <source>In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge</source>
          ,
          <source>ACM</source>
          (
          <year>2013</year>
          )
          <fpage>21</fpage>
          -
          <lpage>30</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Cummins</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Joshi</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dhall</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sethu</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Goecke</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Epps</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Diagnosis of depression by behavioural signals: a multimodal approach</article-title>
          .
          <source>In: Proceedings of the 3rd ACM international workshop on Audio/visual emotion challenge</source>
          ,
          <source>ACM</source>
          (
          <year>2013</year>
          )
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Giannakopoulos</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pikrakis</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Introduction to Audio Analysis: A MATLAB Approach</article-title>
          . Academic Press (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Theodoridis</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Koutroumbas</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Pattern Recognition, Fourth Edition</article-title>
          . Academic Press, Inc. (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>H.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Moreau</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sikora</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>MPEG-7 Audio and Beyond: Audio Content Indexing and Retrieval</article-title>
          . John Wiley &amp; Sons (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>