<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>song_sent_scores: Computational Design for Charting Dynamic Emotion in Songs with a Multimodal Circumplex Framework</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frederico Gonçalves Pedrosa</string-name>
          <email>fredericopedrosa@musica.ufmg.br</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Music, Artificial Intelligence, Psychometrics, Computational Design, Circumplex Model</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Federal University of Minas Gerais</institution>
          ,
          <addr-line>Av. Pres. Antônio Carlos, 6627 - Pampulha, Belo Horizonte - MG, 31270-901</addr-line>
          ,
          <country country="BR">Brazil</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Music profoundly impacts human emotion, yet computationally modeling its dynamic afective qualities remains challenging. This pictorial introduces song_sent_scores, a novel open-source toolkit (R/Python) that operationalizes Russell's Circumplex Model of Afect to represent and visualize the dynamic interplay of Valence (pleasantness/unpleasantness) and Arousal (energy/calmness) in songs. The toolkit derives these dimensions multimodally from both the audio signal (via a CLAP model) and lyrical content (transcribed by ASR and analyzed with an NLI-based model). We visually demonstrate song_sent_scores' capabilities through: (1) a small-scale experiment comparing its afective classifications against human ratings, revealing varied agreement across songs (e.g., r = 0.79 for 'Negro Drama', r = -0.70 for 'My Baby Just Cares for Me'); and (2) a case study analyzing 'Negro Drama' via its circumplex trajectory and an autoregressive afective network. This computational design approach ofers a richer, continuous representation of musical afect, providing a novel lens for designers, music therapists, and researchers to explore music's emotional architecture.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal Circumplex</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Music is a powerful medium for conveying and evoking emotion, yet computationally
capturing its nuanced afective dynamics remains a significant challenge [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. This pictorial introduces
song_sent_scores [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], a novel open-source toolkit (available in R and Python) that operationalizes
Russell’s Circumplex Model of Afect [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] to represent and visualize the dynamic interplay of Valence
(pleasantness/unpleasantness) and Arousal (energy/calmness) in songs, derived multimodally from both
audio signals and lyrical content.
      </p>
      <p>
        The song_sent_scores function leverages state-of-the-art learning models for its core analysis
including 1) audio analysis with Contrastive Language-Audio Pretraining (CLAP) model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] performs
zero-shot classification on audio segments to infer Valence and Arousal directly from sonic
characteristics; and 2) song lyrics, either provided or transcribed using an Automatic Speech Recognition (ASR)
model (e.g., OpenAI’s Whisper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]), are analyzed for Valence and Arousal using a Natural Language
Inference (NLI) based zero-shot text classification model [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The pipeline was inspired by Tomasevic
and colleagues [7].
      </p>
      <p>The toolkit processes song segments, yielding Valence and Arousal scores for both audio and lyrical
modalities. These scores are then mapped onto a 2D circumplex plot, allowing for a dynamic visualization
of the song’s afective journey over time. This method moves beyond static categorization, ofering a
richer, continuous representation that can reveal subtle emotional shifts and the congruence between
musical sound and lyrical meaning.</p>
      <p>This pictorial will visually demonstrate song_sent_scores' capabilities. We begin by presenting a
small-scale experiment to assess the correspondence between the tool’s afective classifications and
LGOBE https://github.com/FredPedrosa/ (F.</p>
      <p>G. Pedrosa)</p>
      <p>CEUR
Workshop</p>
      <p>ISSN1613-0073
human ratings through statistical analysis. Building on these findings, we then showcase a detailed
case study: the analysis of a song’s temporal emotional development, visualized on the circumplex,
and its afective trajectory explored via dynamic network analysis. We aim to showcase how this
computational tool can enhance our understanding and discussion of dynamic musical afect, regarding
Artificial Intelligence (AI) classification and computational design.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        The computational analysis and understanding of emotion in music is an intense research area within
music informatics and afective computing [
        <xref ref-type="bibr" rid="ref1">8, 1</xref>
        ]. Researchers have explored both discrete emotion
models (categorizing music into labels like happy, sad, angry) and dimensional approaches [9]. Among
dimensional models, Russell’s circumplex model of afect [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which posits Valence and Arousal as
core afective dimensions, is widely adopted due to its intuitive appeal, extensive empirical support
(having garnered over 14,000 citations according to Semantic Scholar as of 2025), and applicability in
computational settings [10, 11].
      </p>
      <p>
        Recent advancements in zero-shot learning (ZSL) have opened new avenues for music emotion
recognition. Contrastive Language-Audio Pretraining (CLAP) models, for instance, learn joint representations
of audio and text, enabling the classification of audio based on natural language descriptions of target
states [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Similarly, ZSL for text classification using Natural Language Inference (NLI) techniques
allows for afective analysis of lyrics without task-specific training data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Complementing these,
some recent works in Zero-Shot Audio Classification (ZSAC) also explore leveraging Large Language
Models to generate rich sound attribute descriptions to enhance classification [ 12], highlighting the
growing synergy between audio processing and advanced language understanding.
      </p>
      <p>Although many systems focus on static emotion classification of songs or segments, the temporal
dynamics of afect in music are crucial for a deeper understanding, as music itself is a temporal art form
[13]. Our work with song_sent_scores contributes to this area by not only providing multimodal
circumplex scores, but also enabling the visualization of dynamic afective trajectories. The modeling of
the interrelations between these afective streams is inspired by approaches such as dynamic Exploratory
Graph Analysis (EGA) [14, 7], which has been used in other domains to understand emotion dynamics.
Furthermore, by comparing our toolkit’s output with human perceptual ratings, we align with the
critical need for robust evaluation methodologies in this field. While datasets for AI-driven afective
music *generation* are emerging (e.g., SunoCaps by Civit and colleagues [15]), our focus is on providing
an analytical tool for existing musical pieces.</p>
    </sec>
    <sec id="sec-3">
      <title>3. The song_sent_scores Toolkit Architecture</title>
      <p>1. Input Audio: The toolkit takes a song audio file (e.g., MP3, WAV) as input. This audio is used
for both direct sonic analysis and, if lyrics are not provided, for ASR.
2. Audio Afective Analysis (CLAP): The raw audio signal is processed by a Contrastive
LanguageAudio Pretraining (CLAP) model. This model generates embeddings of the audio segments and
performs zero-shot classification against predefined textual descriptions of afective states (e.g.,
‘audio with positive valence’, ‘audio with high arousal’) to infer Valence and Arousal scores
directly from the sonic characteristics. This results in the Audio V/A Scores.
3. Lyrical Content Processing:</p>
      <p>• Optional Provided Lyrics: Users can directly supply the song lyrics.
4. Text Afective Analysis (NLI): The obtained lyrical content (either provided or transcribed) is
then analyzed by a Natural Language Inference (NLI) based zero-shot text classification model.
This model assesses the interaction between lyrics and phrases representing diferent Valence
and Arousal states to produce the Text V/A Scores.
5. Output Aggregation: The derived Audio V/A Scores and Text V/A Scores are compiled into an
output dictionary, typically structured with separate entries for ‘audio’ and ‘text’ modalities, each
containing their respective valence and arousal values.
6. Visualization: Finally, these scores are mapped as coordinates onto a 2D circumplex plot. This
plot visually represents the song segment’s position in the Valence-Arousal space, with distinct
points for the audio and text modalities, allowing for an intuitive understanding of the song’s
multimodal afective profile.</p>
      <p>This pipeline allows song_sent_scores to ofer a comprehensive, multimodal assessment of a song’s
perceived emotional character, moving beyond unimodal analysis or simplistic discrete emotion labels.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Song Selection and Segmentation</title>
        <p>Four songs were selected to ensure linguistic and genre variability: ‘My Baby Just Cares for Me’
composed and interpreted by Nina Simone (Jazz, English lyrics), ‘Negro Drama’ by the RAP group
Racionais MCs (Brazilian Hip Hop, Portuguese lyrics), ‘O Mundo é um Moinho’ by the composer Cartola
(Samba, Portuguese lyrics), and ‘Territory’ by Sepultura band (Thrash Metal, English lyrics). From
each song, two distinct approximately 22-second excerpts (e.g., ‘My Baby Just Cares for me’ segment 1:
3-25s) were chosen for analysis, aiming to capture potentially diferent afective states.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Human Afective Ratings</title>
        <p>Thirteen human judges (mean age = 45.92, SD = 14.13; 69.2% female; high academic background with
61.5% holding Master’s or PhD degrees; 76.9% with at least one year of music study/practice) rated each
of the 8 song excerpts. After listening to each excerpt, judges provided scores on 7-point Likert-type
scales for perceived Valence (1=Very Negative/Unpleasant to 7=Very Positive/Pleasant) and Arousal
(1=Very Low/Calm to 7=Very High/Agitated), separately for the musical audio and the lyrical content
(transcriptions were provided for lyrical assessment). Inter-rater reliability for these human ratings
was assessed using the Intraclass Correlation Coeficient (ICC).</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Computational Afective Scoring with</title>
        <p>song_sent_scores
The same 8 song excerpts were processed using the song_sent_scores toolkit. The function
was configured to use laion/clap-htsat-unfused as the CLAP model,
openai/whisper-largev3 as the ASR model, and joeddav/xlm-roberta-large-xnli as the NLI model. For each excerpt,
song_sent_scores generated Valence and Arousal scores (ranging from -1 to +1) for both the audio
(a_v, a_a) and the transcribed lyrical content (t_v, t_a).</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Data Analysis</title>
        <p>Human ratings were averaged across the 13 judges for each of the 32 items (4 songs x 2 excerpts/song x
2 modalities/excerpt x 2 dimensions/modality). These average human scores were then rescaled from
their original 1-7 range to a -1 to +1 range to match the scale of the AI’s scores. Pearson’s correlation
coeficient (r) was used to assess the correspondence between the rescaled average human ratings and
the AI’s scores, both overall and for specific subsets per song.
To illustrate the toolkit’s capability for dynamic analysis, the full song ‘Negro Drama’ was segmented
into consecutive 15-second chunks. song_sent_scores was applied to each chunk to derive time-series
data for Valence and Arousal from both audio and lyrics. These time-series were visualized to show
the song’s afective trajectory. Additionally, a autoregressive network model of these four afective
streams (audio valence, audio arousal, text valence, text arousal) across the chunks was estimated using
a Graphical Vector Autoregression (GVAR) approach with the psychonetrics [16] package in R, and
visualized using qgraph [17].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Inter-Rater Reliability of Human Ratings</title>
        <p>The inter-rater reliability for human evaluations was calculated using the Intraclass Correlation
Coeficient (ICC). The average ratings of the 13 judges demonstrated excellent reliability, with an ICC(2,k) of
0.83 (95% CI [0.73, 0.90]) for absolute agreement and an ICC(3,k) of 0.86 (95% CI [0.77, 0.92]) for
consistency. This indicates that the aggregated human scores provide a robust benchmark for comparison.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. song_sent_scores Example Output: Circumplex Plot</title>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Correlation between Human Ratings and song_sent_scores</title>
        <p>Figure 3 presents the Pearson’s r correlation coeficients between the rescaled average human ratings
and the song_sent_scores outputs for each of the four songs, across all 8 items per song (2 excerpts x
2 modalities x 2 dimensions). The 95% confidence intervals for r are also shown. Strong and statistically
significant positive correlations were observed for ‘Negro Drama’ (r = 0.79, p = 0.019) and ‘Territory’ (r
= 0.78, p = 0.021), indicating good agreement between human perception and the AI for these songs. For
‘O Mundo é um Moinho’, the correlation was weak and not statistically significant (r = 0.20, p = 0.64),
suggesting poor agreement. Notably, for ‘My Baby Just Cares for Me’, a strong negative correlation
was found (r = -0.70, p = 0.054), suggesting that the AI’s assessment tended to be opposite to human
perception for this song. This warrants further investigation into the specific characteristics of this
song or its processing by the models.</p>
        <sec id="sec-5-3-1">
          <title>5.3.1. Temporal Dynamics</title>
          <p>Figures 4 and 5 illustrate the dynamic afective scores (Valence and Arousal, respectively, for both audio
and text) for ‘Negro Drama‘ when analyzed in consecutive 15-second chunks across the entire song.
These visualizations allow for tracking the evolving afective arc of the song, highlighting moments
of convergence or divergence between the emotional tone of the music and the lyrics across both
fundamental afective dimensions.</p>
          <p>Observing the Valence dynamics (Figure 4), for instance, around the 90-second mark, text valence
becomes highly positive while audio valence remains negative. Issues with ASR transcription (e.g.,
transcribing ‘Music‘ for instrumental segments) can also be identified, such as at the 75s and 105s marks,
where text scores revert to a default pattern.</p>
          <p>Similarly, the Arousal dynamics (Figure 5) reveal distinct trajectories for the audio and lyrical
modalities. The audio arousal for ‘Negro Drama‘ generally maintains a moderately high level throughout
most of the song, consistent with its energetic hip-hop style, though with notable peaks and valleys
reflecting structural changes in the music. Text arousal, on the other hand, shows more pronounced
lfuctuations. For example, a significant peak in text arousal is observed around the 300-second mark,
corresponding to a climactic lyrical moment when MC Mano Brown begins to rhyme. The content of
his lyrics intensifies the overall atmosphere.</p>
        </sec>
        <sec id="sec-5-3-2">
          <title>5.3.2. Temporal Influence Network (VAR Coeficients)</title>
          <p>Figure 6 displays the temporal influence network estimated from the four afective time-series (audio
valence, audio arousal, text valence, text arousal) of “Negro Drama”. This network was derived from a
Vector Autoregressive model of order 1 (VAR(1)) using the psychonetrics package [16] and visualized
with qgraph [17]. The directed edges represent how the state of one afective stream in a given
15second chunk (time  − 1 ) predicts the state of another (or the same) stream in the subsequent chunk
(time  ).</p>
          <p>The network reveals the predictive dynamics between modalities and afective dimensions. For
instance, one of the strongest predictive paths shows that higher Text Arousal at  − 1 strongly predicts
more positive Text Valence at  (coeficient ≈ 0.56). This suggests that energetic lyrical segments tend
to be followed by segments perceived as more positively valenced in their lyrical content.</p>
          <p>Autoregressive efects (loops on the nodes) indicate the temporal stability or inertia of each stream.
Audio Valence shows moderate positive autoregression (coef. ≈ 0.26), implying that its current
valence level tends to carry over to the next segment. In contrast, Text Arousal exhibits a weak
negative autoregression (coef. ≈ −0.14), suggesting that peaks in lyrical energy might be followed by a
slight decrease.</p>
          <p>Other cross-lagged influences are also evident. For example, higher Audio Arousal at  −1 moderately
predicts increased Text Valence at  (coef. ≈ 0.31), while higher Audio Valence at  − 1 moderately
predicts increased Text Arousal at  (coef. ≈ 0.30). These relationships highlight a complex interplay
where the afective state of one modality in one moment can set the stage for changes in both dimensions
of the other modality in the near future. This temporal network analysis provides insights into the
evolving emotional narrative constructed by the interplay of music and lyrics in ‘Negro Drama’.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>
        The song_sent_scores toolkit demonstrates a promising approach for operationalizing Russell’s
circumplex model of afect for multimodal song analysis. Building upon established dimensional
theories [
        <xref ref-type="bibr" rid="ref3">3, 10</xref>
        ] and leveraging recent advances in zero-shot learning for audio and text processing
[
        <xref ref-type="bibr" rid="ref4 ref6">4, 6, 12</xref>
        ], the dynamic visualizations generated by our toolkit ofer richer insights into musical afect
than static emotion categories traditionally allow [9, 13].
      </p>
      <p>The comparison with human ratings yielded varied results across the selected songs. Strong positive
agreement for Rap (‘Negro Drama‘, r = 0.79) and Thrash Metal (‘Territory‘, r = 0.78) suggests the toolkit
can efectively capture the perceived afect in these well-known and more ‘global’ genres. However, the
poor agreement for Samba (‘O Mundo é um Moinho‘, r = 0.20) and the striking inverse relationship
for Jazz (‘My Baby Just Cares for Me’, r = -0.70) highlight significant challenges. These discrepancies
likely point to genre-specific sensitivities of the current AI models, ASR limitations (especially for
non-sung segments or vocally complex styles like Jazz and Samba), and the potential misinterpretation
of culturally or aged specific expressive cues not well-represented in the models’ training data. The
negative correlation for the Jazz piece, for example, warrants deeper investigation into how its unique
vocal delivery, improvisation, and harmonic complexity might be processed divergently by the AI
compared to human listeners.</p>
      <p>The dynamic analysis of ‘Negro Drama‘ (Figures 4, 5, and 6) showcased the toolkits’ utility in mapping
evolving emotional trajectories and modeling interrelations between afective streams. This moves
towards a systemic understanding of how audio and lyrical valence and arousal interact and influence
each other over time, ofering a novel quantitative method for computational musicology and afective
computing, complementing qualitative approaches. The ability to identify, for instance, how Text
Arousal might predict subsequent Text Valence, or the strong negative interplay between Audio Arousal
and Audio Valence within this specific song, provides a granular view of its emotional architecture and
yield for a better understanding of how music afect is perceived.</p>
      <p>While this pictorial demonstrates clear potential, certain limitations of the current study and toolkit
should be noted. The human validation involved 13 judges and four stylistically diverse songs; larger,
more culturally varied participant samples and a broader selection of songs, particularly multiple
examples within each genre, are needed for more generalizable conclusions. The performance of the
constituent AI models (CLAP, Whisper, NLI) is also a factor, as their inherent biases and limitations can
propagate through the pipeline.</p>
      <p>Future work should therefore focus on several key areas. Refining ASR performance for diverse
vocal and styles and sung Portuguese, as well as Brazilian musical peculiarities is critical. Exploring
alternative or supplementary audio features beyond CLAP embeddings, perhaps incorporating more
psychoacoustically-grounded or music-theoretically informed features, could enhance the audio
analysis module. Furthermore, the toolkit could be extended to practical applications: interactive media
designers could use it to dynamically score experiences, music therapists to select or analyze music for
interventions, and researchers to conduct large-scale comparative studies of emotional expression in
music.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used a large language model from Google (based
on the Gemini family of models) in order to assist with grammar and spelling checks, provide help in
coding, and for brainstorming and discussing ideas. After using this tool/service, the author(s) reviewed
and edited the content as needed and take(s) full responsibility for the publication’s content.
ral Language Processing (EMNLP), 2019, pp. 3904–3913. URL: https://arxiv.org/abs/1909.00161,
arXiv:1909.00161.
[7] A. Tomasevic, H. Golino, A. Christensen, Decoding emotion dynamics in videos using dynamic
exploratory graph analysis and zero-shot image classification: A simulation and tutorial using the
transforemotion R package, 2024. URL: https://osf.io/preprints/psyarxiv/hf3g7. doi:10.31234/osf.
io/hf3g7, psyArXiv.
[8] A. Dash, K. Agres, AI-Based Afective Music Generation Systems: A Review of Methods and</p>
      <p>Challenges, ACM Computing Surveys 56 (2024) 1–34. doi:10.1145/3672554.
[9] M. Nunes-Silva, P. S. da Conceição Moreira, P. Eyer, Modelos cognitivos de emoções musicais:
Abordagens discreta e dimensional, Percepta: Revista de Cognição e Artes Musicais 11 (2023)
23–38. doi:10.34018/2318-891X.11(1)23-38.
[10] J. Posner, J. A. Russell, B. S. Peterson, The circumplex model of afect: An integrative approach
to afective neuroscience, cognitive development, and psychopathology, Development and
Psychopathology 17 (2005) 715–734. doi:10.1017/s0954579405050340.
[11] Y.-S. Seo, J.-H. Huh, Automatic emotion-based music classification for supporting intelligent IoT
applications, Electronics 8 (2019) 164. doi:10.3390/electronics8020164.
[12] X. Xu, P. Zhang, M. Yan, J. Zhang, M. Wu, Enhancing zero-shot audio classification using sound
attribute knowledge from large language models, arXiv preprint arXiv:2407.14355 (2024). URL:
https://arxiv.org/abs/2407.14355. arXiv:2407.14355.
[13] R. R. McCrae, Music lessons for the study of afect, Frontiers in Psychology 12 (2021) 760167.</p>
      <p>doi:10.3389/fpsyg.2021.760167.
[14] H. Golino, A. Christensen, EGAnet: Exploratory Graph Analysis – A framework for estimating
the number of dimensions in multivariate data using network psychometrics, 2025. URL: https:
//r-ega.net. doi:10.32614/CRAN.package.EGAnet, r package version 2.1.1.
[15] M. Civit, V. Drai-Zerbib, D. Lizcano, M. Escalona, SunoCaps: A novel dataset of text-prompt based
AI-generated music with emotion annotations, Data in Brief 55 (2024) 110743. doi:10.1016/j.
dib.2024.110743.
[16] S. Epskamp, psychonetrics: Structural Equation Modeling and Confirmatory Network
Analysis, 2024. URL: https://CRAN.R-project.org/package=psychonetrics. doi:10.32614/CRAN.package.
psychonetrics, r package version 0.13.
[17] S. Epskamp, A. O. J. Cramer, L. J. Waldorp, V. D. Schmittmann, D. Borsboom, qgraph: Network
visualizations of relationships in psychometric data, Journal of Statistical Software 48 (2012) 1–18.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kong</surname>
          </string-name>
          , J. Han,
          <string-name>
            <surname>G</surname>
          </string-name>
          . Wang,
          <article-title>A survey of music emotion recognition</article-title>
          ,
          <source>Frontiers of Computer Science</source>
          <volume>16</volume>
          (
          <year>2022</year>
          )
          <article-title>166335</article-title>
          . doi:
          <volume>10</volume>
          .1007/s11704- 021- 0569- 4.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>F. G.</surname>
          </string-name>
          <article-title>Pedrosa, song_sent_scores: Functions for multimodal song circumplex analysis in r and python, 2025</article-title>
          . URL: https://github.com/FredPedrosa/song_sent_scores, [Software].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>A circumplex model of afect</article-title>
          ,
          <source>Journal of Personality and Social Psychology</source>
          <volume>39</volume>
          (
          <year>1980</year>
          )
          <fpage>1161</fpage>
          -
          <lpage>1178</lpage>
          . doi:
          <volume>10</volume>
          .1037/h0077714.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Elizalde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deshmukh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Ismail</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>CLAP</surname>
          </string-name>
          :
          <article-title>Learning audio concepts from natural language supervision</article-title>
          ,
          <source>in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '23)</source>
          , Rhodes Island, Greece,
          <year>2023</year>
          . URL: https://resourcecenter.ieee. org/conferences/icassp-2023
          <source>/spsicassp23vid0039.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] OpenAI, Whisper:
          <article-title>Robust speech recognition via large-scale weak supervision</article-title>
          , https://openai. com/index/whisper/,
          <year>2022</year>
          . Accessed:
          <volume>18</volume>
          /05/2025].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roth</surname>
          </string-name>
          ,
          <article-title>Benchmarking zero-shot text classification: Datasets, evaluation and entailment approach</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference on Empirical Methods in Natu-</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>