<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>of English Words Unique to</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jin Jye Wong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cynthia S. Q. Siew</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Psychology, National University of Singapore</institution>
          ,
          <addr-line>9 Arts Link</addr-line>
          ,
          <country>Singapore, Singapore</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This research describes a database of word association norms for words unique to the Singapore English dialect (e.g., shiok, which means "great!", and swaku, which refers to a country bumpkin). Because the basilectal form of Singapore English (colloquially referred to as "Singlish") is predominantly spoken, the large-scale, local written corpora commonly used to compute semantic vector spaces for a language are scarce. To study the semantic representations of uniquely Singapore English words, word associations were collected from native speakers of Singapore English via an online web application. When presented with a word (e.g., shiok), participants listed the first words that came to mind. The characteristics of the resulting association network are described.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic networks</kwd>
        <kwd>word associations</kwd>
        <kwd>cognitive networks</kwd>
        <kwd>Singapore English</kwd>
        <kwd>linguistics</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Singlish, a portmanteau of "Singapore English", is an informal, colloquial form of English used in
Singapore [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. As an English-based creole language, it incorporates elements of other languages
commonly spoken in Singapore, such as Malay, Chinese dialects of Hokkien, Cantonese, and Teochew,
and Indian languages such as Tamil [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and deviates from English at both the lexical and grammatical
levels [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. This paper reports word associations collected from native speakers of Singlish in a paradigm
similar to the Small World of Words project (SWOW) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        While natural language processing more commonly uses purely data-based approaches such as
distributional semantics, which is derived from textual analysis, use cases remain for word association
approaches. For instance, word association-based models outperform text-based word co-occurrence
models in predicting affective properties of words such as valence (the degree to which a word is
positive or negative) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while also showing high correlation with human ratings [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Internal language
models, which treat language as a body of knowledge residing in the brains of its speakers and are
derived from word association data, also outperform external language models (such as those
trained on corpora), which treat language as an extraneous object consisting of content produced by users
of the language, in judging whether words are related or similar [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. It is worth noting that the set of word
associations used to create the internal language models was smaller than the combined training corpora
used by the external language models by several orders of magnitude.
      </p>
      <p>
        Human-sourced word association data can also reinforce language models, especially for
languages whose available written corpora are small. SWOW data has already been used to
improve downstream task performance on commonsense reasoning benchmarks [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
      <p>
        For Singlish in particular, modern large language models can generate readable samples of
text that adhere to Singlish grammar, but such snippets often use only the most common Singlish
constructs and vocabulary (such as appending 'lah' or using 'makan' in place of 'eat'), while
simultaneously hallucinating Singlish words or constructs that do not exist (e.g. 'trafik' as an incorrect
spelling of 'traffic', or ungrammatically tacking 'la(k)' onto words within a sentence) [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>This study aims to construct a large-scale free association network of Singapore
English, the first of its kind. Such a network can provide more insight into the structure of the mental
lexicon of speakers of Singapore English and serve as a basis for future
psycholinguistic experiments.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
    </sec>
    <sec id="sec-3">
      <title>Participant Details</title>
      <p>Participants were recruited online using a crowd-sourced approach relying on social media, email,
and word-of-mouth. To further incentivize participation, a lucky draw was held during the months of
July to December 2022; 100 random participants were chosen every month to receive five Singapore
dollars’ (approximately 3.70 USD) worth of vouchers. Participation in the lucky draw was voluntary.
This study was approved by the Institutional Review Board at the National University of Singapore.</p>
      <p>This paper analyses data collected from 1 July 2022 to 16 January 2023. The dataset obtained
consists of responses from 1,095 participants, of whom 614 (56.1%) identified as female, 458 (41.8%) identified as male,
and 23 (2.1%) identified as neither. The average self-reported age was 31.9 years (SD = 12.72). Due to
ethical considerations, no identifying data was collected, and it was possible for individuals to
participate more than once, so these counts refer to participation sessions rather than unique individuals.</p>
      <p>Besides gender and age, information about race, birth country, current country of residence, and
spoken languages was also collected. Regarding race, 828 (75.6%) identified as Chinese, 87 (7.94%)
identified as Malay, and 67 (6.12%) identified as Indian, which approximately reflects Singapore's
racial profile. Of the remainder, 45 (4.11%) identified as none of the above three races, while 68 (6.21%)
did not respond. Most participants indicated their birth country as Singapore (n = 1,027, 93.8%),
with the next largest subset being Malaysia (n = 16, 1.46%). The vast majority also indicated that their
current country of residence was Singapore (n = 1,068, 97.5%). Participants were also
asked if they were native (L1) English speakers; 950 (86.8%) indicated that they were, while the
remainder indicated that they were not. Other languages spoken by participants include Mandarin Chinese (n = 800,
73.1%), Malay (n = 96, 8.76%), and Tamil (n = 16, 1.46%). Participants could indicate more than one
non-English language.</p>
    </sec>
    <sec id="sec-4">
      <title>Data Collection Procedure</title>
      <p>
        Upon clicking the button to begin the study, participants were instructed to fill in their demographic
information. Next, they were presented with 20 cues (Singlish words/phrases) each from a list of
approximately 4,500 Singlish words or phrases manually compiled from various dictionaries (e.g. the
Coxford Singlish Dictionary [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]) and were instructed to provide up to three responses to each cue. If
a participant was unable to provide any responses, they could proceed to the next cue regardless. After
20 cues, participants were directed to a landing page which contained a debriefing and the unique
identifier code for their response which they could email to the researchers to participate in the lucky
draw. The stimuli presented were selected pseudo-randomly; cues that had the fewest responses in the
current iteration were more likely to be presented.
      </p>
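      <p>The pseudo-random cue selection described above can be illustrated with a minimal Python sketch. The text states only that cues with the fewest responses were more likely to be presented; the weighting used here, 1 / (1 + count), the function name sample_cues, and the toy counts are all assumptions made for illustration and are not taken from the study's web application.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_cues(response_counts, k=20, seed=None):
    """Draw k distinct cues, favouring cues with fewer responses so far.

    Assumption: the weight of a cue is 1 / (1 + count). The paper does
    not state the actual weighting scheme.
    """
    rng = random.Random(seed)
    remaining = dict(response_counts)
    chosen = []
    for _ in range(min(k, len(remaining))):
        cues = list(remaining)
        weights = [1.0 / (1 + remaining[c]) for c in cues]
        pick = rng.choices(cues, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]  # sample without replacement
    return chosen


# Toy counts (made up): "shiok" already has many responses, the rest few.
counts = {"shiok": 50, "swaku": 0, "makan": 3, "buay tahan": 1}
session_cues = sample_cues(counts, k=3, seed=1)
```
]]></preformat>
      <p>Under this assumed weighting, a cue with no responses yet is 51 times more likely to be drawn first than a cue that already has 50 responses, which gradually evens out coverage across the cue list.</p>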
    </sec>
    <sec id="sec-5">
      <title>Quality Control</title>
      <p>
        Following criteria similar to those of the original SWOW project [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], responses from some
participants were excluded from analysis. First, to ensure that participants had sufficient experience
with Singlish, participants who indicated that their country of birth or current residence was not Singapore
were excluded. This removed 76 (6.94%) of the original 1,095 participants,
leaving 1,019. Next, participants for whom fewer than 75% of responses were
unique (i.e. participants who gave the same response to many different cue words) were also excluded.
This subset consisted of 3 (&lt;1%) participants. Finally, participants with 75% or fewer of their responses
appearing in a compiled Singlish dictionary were removed; this group comprises participants who
responded with non-English words or phrases as well as participants who responded with
unrecognizable strings of letters. This removed 15 (1.37%) of the participants. As a result, a
total of 1,900 responses were not considered further, and the final dataset consists of 1,001 participants
and 19,999 responses.
      </p>
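      <p>The three exclusion criteria can be expressed as a single filter. The following Python sketch is illustrative only: the record layout (birth_country, residence, responses), the function name keep_participant, and the toy data are assumptions, and the handling of participants with no responses is a choice made here rather than one documented in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def keep_participant(p, dictionary, min_unique=0.75, min_known=0.75):
    """Return True if a participant passes all three exclusion criteria."""
    # Criterion 1: born in and currently residing in Singapore.
    if p["birth_country"] != "Singapore" or p["residence"] != "Singapore":
        return False
    responses = [r.lower() for r in p["responses"]]
    if not responses:
        return False  # assumption: nothing to analyse, so drop
    # Criterion 2: excluded if fewer than 75% of responses are unique.
    if len(set(responses)) / len(responses) < min_unique:
        return False
    # Criterion 3: excluded if 75% or fewer responses are in the dictionary.
    known = sum(r in dictionary for r in responses)
    return known / len(responses) > min_known


# Toy data (made up) covering one case per criterion.
dictionary = {"good", "food", "eat", "nice", "shiok"}
participants = [
    {"id": "a", "birth_country": "Singapore", "residence": "Singapore",
     "responses": ["good", "food", "eat", "nice"]},
    {"id": "b", "birth_country": "Malaysia", "residence": "Singapore",
     "responses": ["good", "food", "eat", "nice"]},
    {"id": "c", "birth_country": "Singapore", "residence": "Singapore",
     "responses": ["good", "good", "good", "nice"]},
    {"id": "d", "birth_country": "Singapore", "residence": "Singapore",
     "responses": ["asdf", "qwer", "eat", "nice"]},
]
kept = [p["id"] for p in participants if keep_participant(p, dictionary)]
```
]]></preformat>
      <p>In the toy data only participant a is retained: b fails the country criterion, c repeats the same response too often, and d gives too many responses that are not in the dictionary.</p>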
      <p>
        Data cleaning and analysis were conducted using R [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Verbose responses (i.e. responses that
consisted of more than one word) were split into their constituent words, with each word treated as a
separate response from the same participant.
      </p>
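      <p>As a small illustration of the splitting step, the Python sketch below expands each multi-word response into one row per word. The tuple layout (participant, cue, response) and the function name split_responses are assumptions; the original pipeline was written in R and its data structures are not described.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def split_responses(rows):
    """Expand multi-word responses into one row per constituent word."""
    expanded_rows = []
    for participant, cue, response in rows:
        for word in response.split():  # split on any whitespace
            expanded_rows.append((participant, cue, word))
    return expanded_rows


# Toy rows (made up): the first response is verbose, the second is not.
rows = [("p1", "shiok", "very good"), ("p1", "makan", "eat")]
expanded = split_responses(rows)
```
]]></preformat>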
      <p>
        For spelling correction, the hunspell R package [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] was used. A custom Singlish dictionary was
created by adding the list of Singlish cues to the default British English dictionary. Since many cues
were phrases consisting of more than one word, the individual component words of each phrase were
added manually as separate entries. For example, the cues buay pai ("not bad") and buay tahan ("can't
endure") would result in the single words buay, pai, and tahan being added. In the automated pipeline,
words deemed to be misspelled were automatically replaced with the first suggestion returned by the
hunspell_suggest function. This also means that words spelled differently in American English
were corrected to their British English equivalents; consequently, different spellings of a word (e.g.
color and colour) are treated as the same word (colour).
      </p>
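      <p>The dictionary construction and first-suggestion replacement can be sketched in Python as follows. This is not the hunspell pipeline used in the study: suggestions here come from difflib.get_close_matches in the Python standard library, which ranks candidates by string similarity rather than by hunspell's rules, and the three-word base list is a made-up stand-in for a full British English dictionary. The function names build_dictionary and correct are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import difflib


def build_dictionary(base_words, cues):
    """Add every component word of every cue to a base word list."""
    words = set(base_words)
    for cue in cues:
        words.update(cue.lower().split())
    return words


def correct(word, dictionary):
    """Replace an unknown word with its closest dictionary entry, if any."""
    w = word.lower()
    if w in dictionary:
        return w
    # Stand-in for taking the first spell-checker suggestion.
    suggestions = difflib.get_close_matches(w, sorted(dictionary), n=1)
    return suggestions[0] if suggestions else w


base = {"colour", "monkey", "endure"}  # tiny stand-in word list
cues = ["buay pai", "buay tahan"]
dictionary = build_dictionary(base, cues)
corrected = correct("color", dictionary)
```
]]></preformat>
      <p>With this toy dictionary the American spelling color is mapped to colour, while cue components such as tahan are accepted unchanged because they were added as separate entries.</p>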
      <p>
        To convert words to their standard forms, lemmatization was performed using the lemmatize_words
function from the textstem R package [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. This maps inflected forms of a word back to its root form
(e.g. incurred and incurs are both mapped to incur). Finally, any remaining word-final punctuation
(e.g. full stops, tags, and quotes) and non-English characters were removed.
      </p>
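      <p>The final normalisation steps can be sketched in Python under stated assumptions. The lookup table LEMMAS is a hypothetical two-entry stand-in for a real lemmatizer such as the one in textstem, "non-English characters" is interpreted here as non-ASCII characters, and the order of operations follows the description above (lemmatization first, then character removal).</p>
      <preformat preformat-type="code"><![CDATA[
```python
import re

# Hypothetical lookup table standing in for a real lemmatizer.
LEMMAS = {"incurred": "incur", "incurs": "incur"}


def clean_word(word):
    """Lowercase, lemmatize via lookup, then drop unwanted characters."""
    w = word.lower()
    w = LEMMAS.get(w, w)
    w = re.sub(r"[^\x00-\x7F]", "", w)  # assumption: non-English = non-ASCII
    w = re.sub(r"[^a-z0-9]+$", "", w)   # strip word-final punctuation
    return w


cleaned = [clean_word(w) for w in ["Incurred", "incurs", 'shiok."']]
```
]]></preformat>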
    </sec>
    <sec id="sec-6">
      <title>Qualities of the Singlish Network</title>
      <p>
        A word association network was created using the igraph R package [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Vertices of the graph
consist of the Singlish cues and the split responses provided by participants. An edge connects two
vertices if one of them, presented as a cue, elicited the other as a response. Edges are weighted, with edge lengths
determined by taking the reciprocal of the number of times the response was produced when the cue
word was presented; a short edge thus denotes a stronger relation between the two vertices it connects.
      </p>
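      <p>The reciprocal edge-length rule can be illustrated with a short Python sketch. The function name edge_lengths, the use of unordered pairs as edge keys, and the decision to skip self-loops are assumptions made for this example; the study itself built the graph with igraph in R.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def edge_lengths(pairs):
    """Map each unordered cue-response pair to 1 / (number of occurrences)."""
    counts = Counter()
    for cue, response in pairs:
        if cue != response:  # assumption: ignore self-loops
            counts[frozenset((cue, response))] += 1
    return {edge: 1.0 / n for edge, n in counts.items()}


# Toy data (made up): "good" is given four times for "shiok", "nice" once.
pairs = [("shiok", "good")] * 4 + [("shiok", "nice")]
lengths = edge_lengths(pairs)
strong = lengths[frozenset(("shiok", "good"))]
weak = lengths[frozenset(("shiok", "nice"))]
```
]]></preformat>
      <p>A response produced four times yields an edge of length 0.25, whereas a response produced once yields the maximum length of 1, so frequently produced associations sit closer together in the network.</p>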
      <p>The final network was constructed by considering the first responses (R1) given by all participants,
the rationale being that second and third responses, where given, might be susceptible to influence by
any responses that precede them. The decision was made to construct an undirected instead of a directed
graph given that the set of cues contained only uniquely Singlish words or phrases, while the set of
responses contained both Singlish words and words found in the conventional English lexicon.</p>
      <p>The R1 network has a total of 8,215 vertices (mean degree = 4.27, SD = 7.74), with the vertex "no"
having the highest degree of 191. The R1 network has a total of 17,545 edges (mean edge
length = 0.924, SD = 0.194), with the shortest edge (strongest association) being the edge from the
Singlish cue phrase monyet see monyet do to the response monkey, which has a length of 0.111 (corresponding to nine productions of the response). The
network is not fully connected and consists of 62 connected components, with the largest connected
component consisting of 8,071 vertices (98.2% of all vertices); this component has a global clustering
coefficient of 0.00326 and an average local clustering coefficient of 0.0118.</p>
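      <p>The descriptive statistics reported above (mean degree, connected components, and global clustering coefficient) can be computed for any undirected graph. The Python sketch below uses only the standard library and a made-up six-vertex graph; it is not the igraph code used in the study, and all function names are hypothetical. The global clustering coefficient is computed here as transitivity, i.e. the proportion of connected triples that are closed.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import deque
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of vertex pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def connected_components(adj):
    """Return a list of vertex sets, one per connected component (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), {start}
        while queue:
            node = queue.popleft()
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    comp.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components


def global_clustering(adj):
    """Transitivity: closed connected triples / all connected triples."""
    closed = triples = 0
    for nbs in adj.values():
        for a, b in combinations(nbs, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0


def mean_degree(adj):
    """Average number of neighbours per vertex."""
    return sum(len(nbs) for nbs in adj.values()) / len(adj)


# Toy graph (made up): a triangle with a pendant vertex, plus a separate pair.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d"), ("x", "y")]
adj = build_adjacency(toy_edges)
components = connected_components(adj)
largest = max(components, key=len)
```
]]></preformat>
      <p>For an undirected graph the mean degree equals twice the number of edges divided by the number of vertices; applied to the reported totals, 2 × 17,545 / 8,215 ≈ 4.27, which agrees with the mean degree stated above.</p>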
      <p>
        The average shortest path length (ASPL) of the network is 1.73 and the network diameter is 8; these
measures provide an indication of how efficient the network is overall [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. The Small-World Index of
the network is 0.44 (values above 1 indicate small-world structure), showing that it is less interconnected than an Erdős-Rényi random graph with the
same number of vertices and edges [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ].
      </p>
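      <p>To make these measures concrete, the Python sketch below computes the ASPL and diameter of a toy graph and evaluates the small-world index of Humphries and Gurney, S = (C / C_rand) / (L / L_rand), where C and L are the clustering coefficient and ASPL of the observed network and C_rand and L_rand are the corresponding values for a comparable random graph. The sketch counts hops and ignores edge lengths, which is an assumption, since the text does not state whether weighted distances were used; the numeric inputs are made up and are not the study's values.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def hop_distances(adj, source):
    """Breadth-first search distances (in hops) from one vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def aspl_and_diameter(adj):
    """Mean and maximum hop distance over all reachable ordered pairs."""
    total = pairs = longest = 0
    for source in adj:
        for target, d in hop_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
                longest = max(longest, d)
    return (total / pairs if pairs else 0.0), longest


def small_world_index(c, c_rand, l, l_rand):
    """Humphries-Gurney index: (C / C_rand) / (L / L_rand)."""
    return (c / c_rand) / (l / l_rand)


# Toy path graph a-b-c-d (made up).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
aspl, diameter = aspl_and_diameter(path)

# Made-up inputs for illustration only; these are not the paper's values.
sigma = small_world_index(c=0.30, c_rand=0.05, l=3.0, l_rand=2.5)
```
]]></preformat>
      <p>An index above 1 arises when clustering exceeds the random baseline by more than path length does; an index below 1, as reported for the R1 network, indicates the opposite.</p>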
    </sec>
    <sec id="sec-7">
      <title>Discussion and Future Directions</title>
      <p>
        The data collected represent the first word association dataset dedicated to Singlish. Beyond network
construction, the dataset can be used in tandem with existing Singapore English resources such
as the Auditory English Lexicon Project (AELP) [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] to further examine the Singlish creole.
Completeness of the network can be improved by further studies that examine Singlish responses to
English cues, as well as consideration of non-English responses.
      </p>
      <p>Other possible avenues of investigation include comparison of the network with similar word
association networks of other languages. The constructed network can also be used as a starting point
to predict properties of less common Singlish words.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T. R.</given-names>
            <surname>Yeo</surname>
          </string-name>
          , Singlish,
          <year>2010</year>
          . URL: https://eresources.nlb.gov.sg/infopedia/articles/SIP_1745_2010-12-29.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Platt</surname>
          </string-name>
          ,
          <article-title>The Singapore English speech continuum and its basilect 'Singlish' as a 'Creoloid'</article-title>
          ,
          <source>Anthropological Linguistics</source>
          (
          <year>1975</year>
          ):
          <fpage>363</fpage>
          -
          <lpage>374</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. R. E.</given-names>
            <surname>Leimgruber</surname>
          </string-name>
          ,
          <article-title>Singapore English</article-title>
          .
          <source>Language and Linguistics Compass</source>
          ,
          <volume>5</volume>
          .1 (
          <year>2011</year>
          ):
          <fpage>47</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>De Deyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Navarro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perfors</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Brysbaert</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Storms</surname>
          </string-name>
          ,
          <article-title>The “Small World of Words” English word association norms for over 12,000 cue words</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>51</volume>
          (
          <year>2019</year>
          ),
          <fpage>987</fpage>
          -
          <lpage>1006</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Vankrunkelsven</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Verheyen</surname>
          </string-name>
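          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Storms</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>De Deyne</surname>
          </string-name>
          ,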
          <article-title>Predicting Lexical Norms: A Comparison between a Word Association Model and Text-Based Word Co-occurrence Models</article-title>
          .
          <source>Journal of Cognition</source>
          <volume>1</volume>
          (
          <year>2018</year>
          ),
          <fpage>45</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
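          <string-name>
            <given-names>B.</given-names>
            <surname>van Rensbergen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>De Deyne</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G.</given-names>
            <surname>Storms</surname>
          </string-name>
          ,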
          <article-title>Estimating affective word covariates using word association data</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>48</volume>
          (
          <year>2016</year>
          ),
          <fpage>1644</fpage>
          -
          <lpage>1652</lpage>
          . https://doi.org/10.3758/s13428-015-0680-2
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>De Deyne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Perfors</surname>
          </string-name>
          , and
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Navarro</surname>
          </string-name>
          ,
          <article-title>Predicting human similarity judgments with distributional models: The value of word associations</article-title>
          , in:
          <source>COLING 2016 - 26th International Conference on Computational Linguistics, Proceedings of COLING 2016: Technical Papers</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1861</fpage>
          -
          <lpage>1870</lpage>
          . URL: https://lirias.kuleuven.be/2324421
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>Frermann</surname>
          </string-name>
          ,
          <article-title>Commonsense Knowledge in Word Associations and ConceptNet</article-title>
          ,
          <year>2021</year>
          . arXiv preprint arXiv:2109.09309.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z. X.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Forde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Cahyawijaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lovenia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sutawika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C. B.</given-names>
            <surname>Cruz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Phan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. L.</given-names>
            <surname>Tan</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Aji</surname>
          </string-name>
          ,
          <article-title>Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>C.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Y.</given-names>
            <surname>Woo</surname>
          </string-name>
          ,
          <source>The Coxford Singlish Dictionary</source>
          , 2nd ed.,
          <publisher-name>Angsana Books</publisher-name>
          ,
          <year>2009</year>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <collab>R Core Team</collab>
          ,
          <source>R: A language and environment for statistical computing</source>
          ,
          <year>2019</year>
          . URL: https://www.R-project.org/. Version 4.1.2 (2021-11-01)
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ooms</surname>
          </string-name>
          ,
          <source>hunspell: High-Performance Stemmer, Tokenizer, and Spell Checker</source>
          ,
          <year>2020</year>
          . URL: https://CRAN.R-project.org/package=hunspell
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T. W.</given-names>
            <surname>Rinker</surname>
          </string-name>
          ,
          <source>textstem: Tools for stemming and lemmatizing text</source>
          ,
          <year>2018</year>
          . URL: http://github.com/trinker/textstem
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>G.</given-names>
            <surname>Csardi</surname>
          </string-name>
          and
          <string-name>
            <given-names>T.</given-names>
            <surname>Nepusz</surname>
          </string-name>
          ,
          <article-title>The igraph software package for complex network research</article-title>
          .
          <source>Interjournal, Complex Systems</source>
          <year>2006</year>
          , p.
          <fpage>1695</fpage>
          . URL: https://igraph.org
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>C. S. Q.</given-names>
            <surname>Siew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. U.</given-names>
            <surname>Wulff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beckage</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kenett</surname>
          </string-name>
          ,
          <article-title>Cognitive Network Science: A review of research on cognition through the lens of network representations, processes, and dynamics</article-title>
          .
          <source>Complexity</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Humphries</surname>
          </string-name>
          and
          <string-name>
            <given-names>K.</given-names>
            <surname>Gurney</surname>
          </string-name>
          ,
          <article-title>Network 'small-world-ness': a quantitative method for determining canonical network equivalence</article-title>
          .
          <source>PLoS ONE</source>
          <volume>3</volume>
          .4 (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>W.D.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Yap</surname>
          </string-name>
          ,
          and
          <string-name>
            <given-names>Q. W.</given-names>
            <surname>Chee</surname>
          </string-name>
          ,
          <article-title>The Auditory English Lexicon Project: A multi-talker, multi-region psycholinguistic database of 10,170 spoken words and nonwords</article-title>
          .
          <source>Behavior Research Methods</source>
          <volume>52</volume>
          .5
          (
          <year>2020</year>
          ):
          <fpage>2202</fpage>
          -
          <lpage>2231</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>