<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sentiment Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander J. Kilpatrick</string-name>
          <email>alexander_kilpatrick@nucba.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Sound Symbolism; Automatic Emotion Recognition; Automatic Sentiment Analysis;</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Nagoya University of Commerce and Business</institution>
          ,
          <addr-line>Nisshin</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This report documents the construction and output of extreme gradient boosted algorithms that were trained on the phonemes that make up American English words to identify how different sounds express emotion and sentiment. The data comprise two corpora of words that have been assigned scores according to how they reflect certain emotions and sentiments. The models are trained only on the phonemes that make up each word. This is a unique approach to automatic emotion recognition and sentiment analysis, which typically do not consider individual phonemes. In addition to the boosted algorithms, linear regression is used to examine the relationships between word length and emotions and sentiments.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>2023 Copyright for this paper by its authors. CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>
manner so that each sample returned a count of the number of times each phoneme occurs in each word.
The outcome is a dataset composed mostly of null values. Following this method, researchers
constructed algorithms to classify Pokémon names according to their evolution level using sound
symbolism [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. They showed that the random forest algorithms were able to classify novel Pokémon
names more accurately than Japanese university students assigned to an identical task. An issue of
overfitting due to the high number of null values in the dataset was uncovered and resolved using
cross-validation. In the present study, word length is found to be a significant predictor of several emotions
and sentiments, and this effect is taken into consideration in the design and analysis of the algorithms.
In potentially related work, Li et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] examined the relationship between utterance length and word error rates
in automatic speech recognition and speech emotion recognition. They found that shorter utterances
tended to have higher word error rates, likely due to a lack of contextual information.
      </p>
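      <p>As a rough illustration of the count-based features described above, the following Python sketch turns a phoneme sequence into one count per phoneme and measures how sparse the resulting table is. The nine-phoneme inventory, the three example words, and the function names are assumptions made for illustration only; the study itself was conducted in R on the full CMU phoneme set.</p>
      <preformat><![CDATA[
```python
from collections import Counter

# Hypothetical mini-inventory (ARPAbet-style labels); the real feature set
# would contain every phoneme in the pronouncing dictionary.
INVENTORY = ["AE", "AH", "IY", "K", "L", "N", "R", "S", "T"]


def phoneme_counts(phonemes, inventory=INVENTORY):
    """Return one count per inventory phoneme for a single word."""
    counts = Counter(phonemes)
    return [counts.get(p, 0) for p in inventory]


def zero_share(rows):
    """Fraction of cells that are zero across all feature rows."""
    cells = [v for row in rows for v in row]
    return sum(1 for v in cells if v == 0) / len(cells)


# Illustrative words only, not taken from either corpus.
words = {
    "cat": ["K", "AE", "T"],
    "sun": ["S", "AH", "N"],
    "tree": ["T", "R", "IY"],
}
rows = [phoneme_counts(p) for p in words.values()]
print(rows[0], round(zero_share(rows), 2))
```
]]></preformat>
      <p>Even in this toy case most cells are zero, because a short word uses only a few of the available phonemes; with a full inventory the share of zeros grows accordingly.</p>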
      <p>The present report outlines the construction and output of algorithms designed to combine sound
symbolism with automatic emotion recognition and sentiment analysis. Two corpora, with a combined
total of almost 20,000 words, are used to train 19 algorithms, each designed to classify samples
according to a specific emotion or sentiment. All algorithms classify held-out samples with accuracy significantly above chance.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Method</title>
      <p>All data, files, code, and links to a YouTube series documenting this project can be found in the
following online repository: https://osf.io/brus3/?view_only=63412.</p>
      <p>
        The present study uses two separate corpora to train machine learning algorithms. The first is the
Glasgow Norms [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], a list of 5,500 words that have been assigned Likert scores for 9 sentiments. A
full list of the sentiments in the Glasgow Norms is provided in Table 1. The second corpus is the NRC
Word-Emotion Association Lexicon [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], a list of 14,000 words that have been assigned a binary score
indicating whether each word is associated with each of 10 emotions and sentiments. A full list of the emotions and
sentiments in the NRC Lexicon can be found in Table 2. Words from both corpora were cross-referenced
with the Carnegie Mellon University Pronouncing Dictionary (CMU [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]) to obtain American English
phonemes for each word. Words with no match in the CMU were checked manually. Instances
of mismatched spelling were corrected; all other unmatched samples were discarded.
      </p>
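      <p>The cross-referencing step can be sketched as follows, assuming the plain-text layout of the CMU dictionary (one entry per line, a word followed by its phonemes, with digits marking vowel stress). The three-line excerpt, the function names, the decision to strip stress digits, and the choice to keep only the first listed pronunciation are all assumptions for illustration; the report does not state how these details were handled.</p>
      <preformat><![CDATA[
```python
import re

# Hypothetical excerpt in the dictionary's plain-text layout.
CMU_EXCERPT = """\
;;; comment lines start with three semicolons
HAPPY  HH AE1 P IY0
ANGER  AE1 NG G ER0
TRUST  T R AH1 S T
"""


def parse_cmu(text):
    """Map each lower-cased word to its phonemes, with stress digits removed."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith(";;;"):
            continue
        head, *phones = line.split()
        # Alternate pronunciations are written WORD(1), WORD(2), ...
        word = re.sub(r"\(\d+\)$", "", head).lower()
        # Assumption: keep only the first pronunciation listed for a word.
        entries.setdefault(word, [re.sub(r"\d", "", p) for p in phones])
    return entries


def cross_reference(words, cmu):
    """Split corpus words into those with and without a dictionary match."""
    matched = {w: cmu[w.lower()] for w in words if w.lower() in cmu}
    unmatched = [w for w in words if w.lower() not in cmu]
    return matched, unmatched


cmu = parse_cmu(CMU_EXCERPT)
matched, unmatched = cross_reference(["happy", "trust", "colour"], cmu)
print(matched["happy"], unmatched)
```
]]></preformat>
      <p>In this sketch the unmatched list is what would be passed on for manual checking, for example to correct a spelling variant before a second lookup.</p>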
      <p>
        All analyses were conducted in the R environment [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Word length was calculated by summing
the number of phonemes in each word. No additional considerations were made for diphthongs or long
vowels, which were counted as single phonemes. The relationships between word length and emotions
and sentiments were analyzed using a series of regression equations, with the
average Likert scores in the Glasgow Norms and the binary classifications in the NRC Lexicon as dependent variables
and word length as the independent variable. The XGBoost algorithms were constructed using the
XGBoost [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] and caret [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] packages. K-fold cross-validation (K = 28) was used to avoid the
overfitting issue reported in [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. The data were split into 8 subsets (A-H) and recombined using a Latin
square, resulting in 28 train/test combinations with a 3:1 training-to-testing split. For example, the first iteration of each
model is trained using subsets A through F and tested on subsets G and H. The following results report
on the aggregate of each series of 28 iterations. Combined significance was calculated using Stouffer's
[
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] and Fisher’s [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] methods; however, only Fisher’s method is reported as it returned more
conservative significance estimates. The algorithms for both corpora were designed to classify samples,
so the Likert scale values in the Glasgow Norms were assigned to binary categories using a median
split. The XGBoost algorithm was found to be susceptible to distribution skew, so categories were
balanced by randomly removing samples from the majority category. This had little effect on the
Glasgow Norms dataset due to the median split but removed around 80% of samples in the NRC dataset
because only around 10% of samples in that dataset have a value of 1 in the binary dependent variable.
To increase variability, balancing was conducted after cross-validation sub-setting. To limit the
influence of word length in the XGBoost models, phoneme counts were divided by word length so that
each feature was the proportion of a word made up by a given phoneme. This resulted in a convergence
issue during tuning, so the learning rate α (the eta parameter in XGBoost) was set manually and the same value was applied to all models
(α = 0.1). All other hyperparameters were automatically tuned by inputting diverse hyperparameter
settings into a tuning grid.
      </p>
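      <p>The sub-setting, median split, and balancing steps described above can be sketched in Python as follows. Holding out every pair of the eight subsets is one reading that reproduces the reported 28 iterations and the 3:1 ratio; the exact Latin-square procedure used in the study may differ. The function names, the tie rule in the median split, and the fixed random seed are likewise assumptions for illustration.</p>
      <preformat><![CDATA[
```python
import random
from itertools import combinations
from statistics import median

SUBSETS = "ABCDEFGH"  # the eight subsets named in the text


def make_splits(subsets=SUBSETS):
    """Hold out every pair of subsets once: C(8, 2) = 28 train/test splits."""
    splits = []
    # Reverse-sorted so the first split tests on G and H, as in the text's example.
    for test in sorted(combinations(subsets, 2), reverse=True):
        train = tuple(s for s in subsets if s not in test)
        splits.append((train, test))
    return splits


def median_split(scores):
    """Binary labels from mean Likert scores: 1 above the median, else 0.
    Assigning ties to 0 is an assumption."""
    cut = median(scores)
    return [1 if s > cut else 0 for s in scores]


def balance(labels, rng):
    """Indices kept after randomly dropping majority-class samples."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    return sorted(rng.sample(pos, n) + rng.sample(neg, n))


splits = make_splits()
print(len(splits), splits[0])

# Roughly the NRC situation: about 10% of samples carry the label 1.
kept = balance([1] * 10 + [0] * 90, random.Random(0))
print(len(kept))
```
]]></preformat>
      <p>With a 10% positive rate, balancing keeps 20 of 100 samples, which is consistent with the roughly 80% reduction reported for the NRC dataset.</p>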
    </sec>
    <sec id="sec-3">
      <title>3. Results</title>
    </sec>
    <sec id="sec-4">
      <title>3.1. Linear Regression and Word Length</title>
      <p>A series of linear regression models were calculated to test the relationship between word length and
the emotions and sentiments in the Glasgow Norms (Likert score) and the NRC Lexicon (binary). Table
1 reports the findings of the analyses conducted on the Glasgow Norms. Increased Age of
Acquisition, Arousal, Size, and Valence had a significant positive correlation with word length, while
Concreteness, Familiarity, and Imageability had a negative one. All significant relationships observed
in the analyses conducted on the NRC Lexicon (Table 2) showed a positive correlation. These include
the Anger, Sadness, and Trust emotions and the Negative and Positive sentiments.</p>
      <p>All models constructed and tested using the Glasgow Norms achieved an accuracy greater than chance
and a Fisher’s combined p-value &lt; 0.001. Table 3 reports the aggregated accuracy and standard
deviation of these models. A similar result was found in the NRC models except in the case of the
Surprise algorithm (p = 0.022 using Fisher’s method and p = 0.021 using Stouffer’s method). The NRC
models are presented in Table 4. The Glasgow Norms models did report a greater accuracy than the
NRC models; however, it is important to note that these were constructed and tested on larger datasets
due to the balancing outlined in Section 2. On average, accuracy was higher and variability was lower in
the Glasgow Norms models than in the NRC models.</p>
      <p>emotions. Features with high feature importance across models include voiceless plosives (/t/ and /k/),
the alveolar nasal (/n/), approximant consonants (/ɹ/ and /l/), the alveolar fricative (/s/), and the
open-mid back vowel (/ʌ/), which appears particularly important, but it should be noted that this is also the
most common phoneme in American English according to the CMU.</p>
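      <p>For readers unfamiliar with the two combination methods named above, the following Python sketch shows how a series of per-iteration p-values can be merged into one. It uses only the standard library: Fisher's statistic is referred to a chi-square distribution with 2k degrees of freedom, whose upper tail has a closed form for even degrees of freedom, and Stouffer's method averages z-scores. The function names and the example p-values are assumptions; the report does not say which implementation or weighting was used.</p>
      <preformat><![CDATA[
```python
import math
from statistics import NormalDist


def fisher_combined(p_values):
    """Fisher's method: X = -2 * sum(ln p) is chi-square with 2k df under H0."""
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)  # equals X / 2
    # Upper tail of chi-square with 2k df: exp(-X/2) * sum_{i<k} (X/2)^i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total


def stouffer_combined(p_values):
    """Unweighted Stouffer's method for one-sided p-values.
    Each p must lie strictly between 0 and 1."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - nd.cdf(z)


# Made-up p-values standing in for a series of cross-validation iterations.
example = [0.04, 0.20, 0.01, 0.08]
print(fisher_combined(example), stouffer_combined(example))
```
]]></preformat>
      <p>With a single p-value both functions return that value unchanged, which is a convenient sanity check before combining a full series of 28 iterations.</p>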
    </sec>
    <sec id="sec-5">
      <title>4. Discussion</title>
      <p>All models achieved accuracy significantly greater than chance (p &lt; 0.001 in all cases but one).
Although further investigation is recommended, the results suggest that it is unlikely that sound
symbolism in American English expresses fine-grained emotions and sentiments because the feature
importance scores suggest many models are using the same features to make decisions. Rather, it seems
that sound symbolism communicates emotional and sentimental weight. Consider that the Valence
model in the Glasgow Norms—where high valence is positive and low valence is negative—and the
Positive and Negative models in the NRC Lexicon all showed a positive correlation with word length.
Both positivity and negativity appear to be expressed sound-symbolically through longer words, although this effect is
slightly stronger for positive sentiments, as shown by the NRC Lexicon models and the significant, but
relatively weak, positive correlation in the Valence regression model.</p>
      <p>
        Those sounds that have high feature importance scores across models include voiceless plosives (/t/
and /k/), the alveolar nasal (/n/), approximant consonants (/ɹ/ and /l/), the alveolar fricative (/s/) and the
open-mid back vowel (/ʌ/). Most of the consistently important consonants are produced at the alveolar
ridge. /ʌ/ appears to be an especially important feature across models. This observation falls in line with
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] who showed that /ʌ/ was associated with negative valence; however, few other patterns can be drawn
between that study and the present report. The high importance of /ʌ/ might also be due to a
combination of its high occurrence frequency, being the most common phoneme in the CMU, and the
distribution of null values in independent variables (NRC = 85%; Glasgow Norms = 88%). XGBoost
algorithms are constructed using decision trees, which base their decisions upon the outcomes of nodes.
At each node a certain number of features are tested. Because /ʌ/ is the most commonly occurring phoneme in
English, it will often be tested against low-frequency features with null values. This issue was
somewhat mitigated by dividing phoneme counts by word length, but this does not solve the problem
entirely. Take, for example, the Age of Acquisition Likert scores, which were shown to have the strongest
association with increased word length across all models; this is an unsurprising finding. However, the Age
of Acquisition XGBoost model feature importance scores revealed that /ʌ/ was the most important
feature in that model, to a much greater degree than in other models. This suggests that word length is still
contributing to the models despite attempts to mitigate its influence through model tuning and data
engineering. Word length could be included in the XGBoost models; however, given that length has a
greater range than phoneme counts and no null values, this would likely mask weaker features [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
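      <p>The normalization discussed above, dividing each phoneme count by word length, can be sketched in Python as follows. The sketch also illustrates one possible route by which length information may survive the normalization: a phoneme that occurs once contributes exactly 1 / length. This is offered as an illustration under assumed inputs, not as a finding of the study; the count vectors and function name are hypothetical.</p>
      <preformat><![CDATA[
```python
def to_proportions(counts):
    """Divide each phoneme count by word length (the sum of all counts)."""
    length = sum(counts)
    if length == 0:
        raise ValueError("word has no phonemes")
    return [c / length for c in counts]


# Hypothetical count vectors over a five-phoneme inventory.
short = to_proportions([1, 0, 1, 0, 1])  # a 3-phoneme word
long_ = to_proportions([1, 2, 1, 0, 2])  # a 6-phoneme word

# The smallest non-zero feature is 1 / length whenever some phoneme
# occurs exactly once, so the feature values still vary with length.
print(min(v for v in short if v > 0), min(v for v in long_ if v > 0))
```
]]></preformat>
      <p>Each normalized row sums to 1, so raw length is no longer a direct feature, but the granularity of the values still differs between short and long words.</p>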
      <p>That said, all models achieved accuracy significantly above chance. Given that most automatic emotion recognition
and sentiment analysis systems rely heavily on lexical and syntactic features, this study underscores the
potential of phonemic information as an additional valuable resource for improving the accuracy of
such systems, especially when dealing with emotional and sentimental aspects of language. While the
current study provides valuable insights into the role of sound symbolism in sentiment analysis, future
research could delve further into the interplay between phonemic features and linguistic and contextual
factors to enhance the robustness and generalizability of sentiment analysis models across different
languages and domains.</p>
    </sec>
    <sec id="sec-6">
      <title>5. References</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Saussure</surname>
            ,
            <given-names>F. D.</given-names>
          </string-name>
          (
          <year>1916</year>
          ). Cours de linguistique générale. Paris: Payot.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Ćwiek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fuchs</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Draxler</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Asu</surname>
            ,
            <given-names>E. L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dediu</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hiovain</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , ... &amp;
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          (
          <year>2022</year>
          ).
          <article-title>The bouba/kiki effect is robust across cultures and writing systems</article-title>
          .
          <source>Philosophical Transactions of the Royal Society B</source>
          ,
          <volume>377</volume>
          (
          <issue>1841</issue>
          ),
          <fpage>20200390</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>Fort</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lammertink</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peperkamp</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guevara‐Rukoz</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fikkert</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Tsuji</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Symbouki: a meta‐analysis on the emergence of sound symbolism in early language acquisition</article-title>
          .
          <source>Developmental science</source>
          ,
          <volume>21</volume>
          (
          <issue>5</issue>
          ),
          <elocation-id>e12659</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Shinohara</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kawahara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2010</year>
          ).
          <article-title>A cross-linguistic study of sound symbolism: The images of size</article-title>
          .
          <source>In Annual meeting of the berkeley linguistics society</source>
          (Vol.
          <volume>36</volume>
          , No.
          <issue>1</issue>
          , pp.
          <fpage>396</fpage>
          -
          <lpage>410</lpage>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Sidhu</surname>
            ,
            <given-names>D. M.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Pexman</surname>
            ,
            <given-names>P. M.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Five mechanisms of sound symbolic association</article-title>
          .
          <source>Psychonomic bulletin &amp; review</source>
          ,
          <volume>25</volume>
          ,
          <fpage>1619</fpage>
          -
          <lpage>1643</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Adelman</surname>
            ,
            <given-names>J. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Estes</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Cossu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2018</year>
          ).
          <article-title>Emotional sound symbolism: Languages rapidly signal valence via phonemes</article-title>
          .
          <source>Cognition</source>
          ,
          <volume>175</volume>
          ,
          <fpage>122</fpage>
          -
          <lpage>130</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Winter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Perlman</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          (
          <year>2021</year>
          ).
          <article-title>Size sound symbolism in the English lexicon</article-title>
          .
          <source>Glossa: a journal of general linguistics</source>
          ,
          <volume>6</volume>
          (
          <issue>1</issue>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>Kilpatrick</surname>
            ,
            <given-names>A. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ćwiek</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kawahara</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>Random forests, sound symbolism and Pokémon evolution</article-title>
          .
          <source>PloS one</source>
          ,
          <volume>18</volume>
          (
          <issue>1</issue>
          ),
          <elocation-id>e0279350</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klejch</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bell</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Lai</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2023</year>
          ).
          <article-title>ASR and Emotional Speech: A Word-Level Investigation of the Mutual Impact of Speech and Emotion Recognition</article-title>
          .
          <source>arXiv preprint arXiv:2305.16065</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>G.G.</given-names>
            <surname>Scott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Keitel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Becirspahic</surname>
          </string-name>
          , et al.
          <article-title>The Glasgow Norms: Ratings of 5,500 words on nine scales</article-title>
          .
          <source>Behav Res</source>
          <volume>51</volume>
          ,
          <fpage>1258</fpage>
          -
          <lpage>1270</lpage>
          (
          <year>2019</year>
          ). https://doi.org/10.3758/s13428-018-1099-3.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.M.</given-names>
            <surname>Mohammad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.D.</given-names>
            <surname>Turney</surname>
          </string-name>
          , NRC Emotion Lexicon. National Research Council Canada,
          <volume>2</volume>
          ,
          <issue>234</issue>
          (
          <year>2013</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <source>CMU Pronouncing Dictionary</source>
          . (n.d.)
          . Carnegie Mellon University. Retrieved June 16,
          <year>2023</year>
          , from http://www.speech.cs.cmu.edu/cgi-bin/cmudict.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <collab>R Core Team</collab>
          .
          <source>R: A language and environment for statistical computing</source>
          . (
          <year>2023</year>
          ) [Computer software].
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>He</surname>
          </string-name>
          .
          <source>XGBoost: Extreme Gradient Boosting</source>
          . R package version 1.5.0.1
          . (
          <year>2021</year>
          ) [Computer software].
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kuhn</surname>
          </string-name>
          .
          <source>caret: Classification and Regression Training</source>
          . R package version 6.0-88
          . (
          <year>2023</year>
          ) [Computer software].
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>S.A.</given-names>
            <surname>Stouffer</surname>
          </string-name>
          .
          <article-title>The American Soldier: Adjustment During Army Life</article-title>
          (Vol.
          <volume>1</volume>
          ). Princeton University Press. (
          <year>1949</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>R.A.</given-names>
            <surname>Fisher</surname>
          </string-name>
          .
          <article-title>Statistical methods for research workers</article-title>
          .
          <source>Oliver and Boyd</source>
          . (
          <year>1925</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>