<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Predicting Social Exclusion: A Study of Linguistic Ostracism in Social Networks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Greta Gandolfi</string-name>
          <email>greta.gandolfi@alumni.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Carlo Strapparava</string-name>
          <email>strappa@fbk.eu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>FBK</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Ostracism is a community-level phenomenon, shared by most social animals, including humans. Its detection plays a crucial role for the individual, with possible evolutionary consequences for the species. Considering (1) its bond with communication and (2) its social nature, we hypothesise that combining (a) linguistic and (b) community-level features has a positive impact on the automatic recognition of ostracism in human online communities. We model an English linguistic community through Reddit data and analyse the performance of simple classification algorithms. We show that models based on the combination of (a) and (b) generally outperform the same architectures when fed (a) or (b) in isolation.<sup>1</sup></p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Ostracism is a social phenomenon in which an
individual or a group ignores or excludes an
individual from a group. Due to its
relevance in our everyday life - as a threat to basic
needs
        <xref ref-type="bibr" rid="ref11">(Wesselmann et al., 2012)</xref>
        - and its impact
on essential community-level patterns - such as
mother-infant attachment, xenophobia, and
leadership
        <xref ref-type="bibr" rid="ref7">(Raleigh and McGuire, 1986)</xref>
        - each
person must develop a system to predict and avoid it.
Humans and other social animals (such as rhesus
monkeys) use ostracism as a form of
social control over problematic group members, as a
way to strengthen their group and to remove
members that do not conform to social norms.
Moreover, it reinforces the hierarchical role of the
perpetrators while causing the social or even the
actual death of their direct victims. For these
reasons, the scope of ostracism allows researchers to
assume that its identification has adaptive
advantages
        <xref ref-type="bibr" rid="ref11">(Wesselmann et al., 2012)</xref>
        .
      </p>
      <p><sup>1</sup> Copyright ©2020 for this paper by its authors. Use
permitted under Creative Commons License Attribution 4.0
International (CC BY 4.0).</p>
      <p>Given its intrinsic relation with communication
and its community-level impact, we assume that
its detection can be automated by relying on
linguistic features and on extra-linguistic, community-level
social features. We expect both types of
information to be predictive but to work best when
combined.</p>
      <p>Reddit communities<sup>2</sup> can be used as proxies of
linguistic communities since they provide huge
amounts of linguistic data<sup>3</sup> paired with social
information. The performance of minimal binary
classifiers, such as Naïve Bayes and SVM, can be
investigated to analyse the relevance of such cues
in distinguishing between prospectively ostracised and
non-ostracised members of a group, modelling our
adaptive ability to detect ostracism in advance.</p>
    </sec>
    <sec id="sec-2">
      <title>2 Background</title>
      <p>As far as we know, this is the first
attempt to analyse the phenomenon of ostracism
from the point of view of computational
linguistics.</p>
      <p>
        Linguistic behaviours have been analysed as
predictors of social exclusion. Researchers focused
both on the treatment of silence - i.e. the voluntary
suspension of any linguistic utterance -
        <xref ref-type="bibr" rid="ref12">(Williams,
2002)</xref>
        and on the proactive use of language - i.e.
the voluntary application of particular linguistic
acts. An example of such linguistic acts is the use
of gender-exclusive language (e.g., using he to
refer to either a male or a female member),
experienced as ostracism by female members of the
group
        <xref ref-type="bibr" rid="ref10">(Stout and Dasgupta, 2011)</xref>
        .
      </p>
      <p>
        Non-linguistic cues have also been considered,
such as members’ competitive behaviour
        <xref ref-type="bibr" rid="ref13">(Wu et al.,
2015)</xref>
        or agreeableness
        <xref ref-type="bibr" rid="ref3">(Hales et al., 2016)</xref>
        .
In both cases, predictors have been
sought in the victims’ behaviour or personality
type. Critically, our approach is meant to focus
primarily on cues coming from the perpetrators.
The following proposal is purely observational:
we define a set of possible predictors of
social exclusion without relying on a proper theoretical
model. We think that this exploration can help
other researchers to define a paradigm of social
exclusion that focuses on general empirical
linguistic and extra-linguistic data.
      </p>
      <p><sup>2</sup> Described in Section 3.</p>
      <p><sup>3</sup> Mainly written in the English language.</p>
    </sec>
    <sec id="sec-3">
      <title>3 Methods and Tools</title>
      <p>Reddit is an American news aggregation and
discussion website. It ranks as the fifth most
visited website in the U.S., with an average of 430M
monthly active users and more than 130K active
communities<sup>4</sup>. It is organised in subreddits, i.e.
hubs for discussion, controlled by moderators and
administrators and characterised by a transparent
hierarchical structure. Moderators and
administrators are listed on each community page, and the
importance of each user on the platform is
represented by their karma<sup>5</sup>.</p>
      <p>Reddit provides a good balance of linguistic and
extra-linguistic data. Even if some sort of jargon
is present, the linguistic analysis is not constrained
by particular boundaries of length and form (being
more reliable, in this case, than Twitter data). The
extra-linguistic features that are particularly
relevant for this work are the ones reflecting the
structure and the hierarchical organisation of the Reddit
community. A more detailed description of these
features and their selection will be provided below.</p>
      <sec id="sec-3-1">
        <title>3.1 Dataset</title>
        <p>To collect data we used PRAW (Python Reddit
API Wrapper), a Python package that allows for
simple access to Reddit’s API (http://praw.readthedocs.io).</p>
        <p>The dataset creation was strictly controlled.
Following the work of Raleigh and McGuire
(1986), which focused on the behaviour of
subadult and adult non-human primates leaving a
group after they failed to maintain their role as
dominant figures, we selected all reactions (i.e.,
comments to submissions and comments to posts)
addressed to ten moderators during nine years<sup>6</sup>.</p>
        <p><sup>4</sup> Data from https://www.redditinc.com.</p>
        <p><sup>5</sup> I.e. a number computed from the popularity
(ratio between upvotes and downvotes) of all of
the user’s comments and submissions (discussion posts).</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.1.1 Moderator selection</title>
        <p>We distinguished between moderators who left the
linguistic community and moderators who are still
relevant (in terms of karma), trying to match their
period of activity on Reddit, for future
longitudinal comparisons<sup>7</sup>.</p>
        <p>Ostracised moderators are defined on the basis of
two identification processes. First, we
automatically searched for all the posts in the subreddit
/r/redditrequest, a space in
which users can ask for a
moderator to be removed from a group due to their inactivity
or abusive, harmful or disrespectful behaviour
towards the other users (in that particular group or
in the whole Reddit community)<sup>8</sup>.</p>
        <p>We identified 5 users in this way. They are proxies of
directly ostracised individuals who violated the
social norms of their groups. Secondly, we
automatically searched for all the moderators’ posts that
stated their willingness to leave the Reddit
community and were followed by actual inactivity, using
a simple word-based search. We
selected another 5 moderators, representing a subset of
individuals who left the community deliberately.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.1.2 Sampling</title>
        <p>To create a balanced dataset, we searched for
popular moderators who shared the same period of
activity as the target ones. We selected the ones
with the highest karma. For each year of
activity, we then randomly extracted a sample of
the comments received, to obtain the same number of
reactions per year for each moderator.</p>
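        <p>The per-year balancing described above can be sketched as follows. This is a minimal illustration and not the code used for the paper: the record layout (dictionaries with the hypothetical keys moderator, year and text), the function name sample_per_year, the fixed seed and the choice to skip groups that are too small are all assumptions made for the example.</p>
        <preformat><![CDATA[
```python
import random
from collections import defaultdict


def sample_per_year(reactions, per_year, seed=0):
    """Draw the same number of reactions per (moderator, year) group.

    `reactions` is a list of dicts with the hypothetical keys
    'moderator', 'year' and 'text'. Groups holding fewer than
    `per_year` reactions are skipped (an assumption), so every
    kept group contributes exactly `per_year` items.
    """
    rng = random.Random(seed)  # fixed seed only for reproducibility
    groups = defaultdict(list)
    for reaction in reactions:
        groups[(reaction["moderator"], reaction["year"])].append(reaction)

    sample = []
    for key in sorted(groups):  # sorted for a deterministic order
        items = groups[key]
        if len(items) >= per_year:
            sample.extend(rng.sample(items, per_year))
    return sample


# Toy data: two moderators, two years, five reactions per group.
toy = [
    {"moderator": m, "year": y, "text": f"comment {i}"}
    for m in ("mod_a", "mod_b")
    for y in (2013, 2014)
    for i in range(5)
]
balanced = sample_per_year(toy, per_year=3)
```
]]></preformat>
        <p>In this toy run every (moderator, year) pair ends up with three reactions, mirroring the requirement that each moderator contributes the same number of reactions per year.</p>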
        <p>We created a dataset<sup>9</sup> of 4,200 linguistic reactions,
50% of which are addressed to the moderators that
left the community. The remaining 50% is
composed of reactions addressed to active and popular
moderators.</p>
        <p><sup>6</sup> From 2010 to 2019.</p>
        <p><sup>7</sup> For example, if one of the ostracised moderators had
been active in the community from the summer of 2013 to
the winter of 2015, we searched for another admin who had
been productive in the same period of time, without being
excluded from the community.</p>
        <p><sup>8</sup> We could select only the posts in which the user name of
the target moderator was explicit (e.g. "Please remove
moderator X from the subreddit Y"). Several times, however, it
was more likely to find posts of this form: "Please remove
the moderator of the subreddit Y", which is more ambiguous.
We then reduced the set of moderators, keeping only the ones
that actually stopped their activity, i.e. that are no longer active
with respect to the definition of inactivity provided by the
Reddit administrators: 3 months of silence in the whole Reddit
environment.</p>
        <p><sup>9</sup> Relevant materials can be found here: https://github.com/gretagandolfi/ostracism.</p>
      </sec>
      <sec id="sec-3-4">
        <title>3.2 Models</title>
        <p>We trained and tested a Naïve Bayes and an SVM
algorithm (10-fold cross-validation) and we
analysed the fluctuations of their accuracy scores. We
took 0.50 as the baseline since the corpus is new
and perfectly balanced.</p>
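        <p>To illustrate why 0.50 is a natural baseline for a perfectly balanced binary corpus, the sketch below runs a stratified 10-fold cross-validation with a trivial majority-class predictor. It is not the Naïve Bayes or SVM model used in this work; the helper names (stratified_folds, majority_baseline_accuracy) and the toy labels are assumptions introduced for illustration.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def stratified_folds(labels, k=10):
    """Assign each index to one of k folds, round-robin within each class."""
    folds = [[] for _ in range(k)]
    seen = Counter()
    for index, label in enumerate(labels):
        folds[seen[label] % k].append(index)
        seen[label] += 1
    return folds


def majority_baseline_accuracy(labels, k=10):
    """Mean accuracy of a majority-class predictor under k-fold CV."""
    folds = stratified_folds(labels, k)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [labels[i] for i in range(len(labels)) if i not in held]
        # Most frequent training label (either class may win a tie).
        majority = Counter(train).most_common(1)[0][0]
        hits = sum(labels[i] == majority for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k


# Toy corpus with the same 50/50 class balance as the dataset.
toy_labels = [0, 1] * 100  # 1 = ostracised, 0 = not ostracised
baseline = majority_baseline_accuracy(toy_labels)
```
]]></preformat>
        <p>Because every fold holds the two classes in equal proportion, a predictor that always outputs one class is right on half of each held-out fold, which is the reference point against which the trained models are compared.</p>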
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Feature selection</title>
      <p>To select the right features to detect ostracism, we
focused on the formal properties of written
English, intentionally ignoring semantically
relevant information. This choice is justified by our
intention to proceed in a domain-general
fashion and by the fact that
ostracism generally differs from hate speech or swearing,
being more subtle.</p>
      <sec id="sec-4-1">
        <title>4.1 Linguistic Features</title>
        <p>
          <bold>Punctuation and Stop-words.</bold> Punctuation
marks and function words can reveal the syntactic
structure of a text and are useful in authorship
attribution and gender classification tasks
          <xref ref-type="bibr" rid="ref6 ref8">(Koppel
et al., 2006; Sarawgi et al., 2011)</xref>
          . Their analysis
does not involve semantics, thus promoting
generalisation. Moreover, punctuation has been
considered helpful in performing sentiment
detection
          <xref ref-type="bibr" rid="ref1">(Barbosa and Feng, 2010)</xref>
          .
        </p>
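        <p>A minimal sketch of how punctuation and stop-word counts might be extracted from a single reaction is given below. The deliberately tiny stop-word set and the function name surface_features are assumptions for illustration; the paper does not state which stop-word list or vectoriser was used.</p>
        <preformat><![CDATA[
```python
import re
import string
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "you", "i", "it"}


def surface_features(text):
    """Count punctuation marks and stop-words, ignoring content words."""
    counts = Counter()
    for char in text:
        if char in string.punctuation:
            counts["punct:" + char] += 1
    for token in re.findall(r"[a-z']+", text.lower()):
        if token in STOP_WORDS:
            counts["stop:" + token] += 1
    return counts


features = surface_features("You are the mod?! Is it a joke, or...")
```
]]></preformat>
        <p>Content words such as "mod" or "joke" never enter the feature counts, which reflects the choice of ignoring semantically relevant information.</p>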
        <p>
          <bold>Length.</bold> The length of the comments can give
hints on the conversation modality. Short posts,
for example, can sometimes show a closer
relationship between users if compared to longer ones.
Intuitively, fewer words are uttered when
interlocutors feel aligned with each other, while
re-phrasing and the need for long explanations
are signs of misalignment and misunderstanding,
plausible manifestations of conflict
          <xref ref-type="bibr" rid="ref2">(Clark and
Henetz, 2014)</xref>
          . We computed the median length of
the sentences (identified by the sentence tokeniser
provided by the NLTK Python package) that compose
each comment, coding long and short comments
differently.
        </p>
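        <p>The length feature can be sketched as follows. To stay dependency-free, a naive regular-expression splitter stands in for the NLTK sentence tokeniser mentioned above, and the threshold separating long from short comments is a hypothetical value, since the cut-off is not reported in the paper.</p>
        <preformat><![CDATA[
```python
import re
from statistics import median


def median_sentence_length(comment):
    """Median number of whitespace-separated tokens per sentence."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", comment.strip()) if s]
    if not sentences:
        return 0
    return median(len(sentence.split()) for sentence in sentences)


def length_code(comment, threshold=10):
    """Code a comment as 1 (long) or 0 (short); the threshold is assumed."""
    return 1 if median_sentence_length(comment) > threshold else 0


short_comment = "Agreed. Thanks for the update!"
value = median_sentence_length(short_comment)
```
]]></preformat>
        <p>Using the median rather than the mean keeps a single very long sentence from dominating the value assigned to an otherwise terse comment.</p>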
        <p>
          <bold>Emoticons.</bold> Emoticons are meant to express
feelings. They have been shown to play a crucial
role in sentiment analysis
          <xref ref-type="bibr" rid="ref9">(Shin and Maldonado,
2013)</xref>
          . The use of emoticons can reveal an author’s
positive or negative attitude towards a target
individual. We compute the informativeness of the
emoticons by performing a VADER analysis, which
provides polarity scores for each reaction passed
to the model
          <xref ref-type="bibr" rid="ref5">(Hutto and Gilbert, 2015)</xref>
          .
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Extra-linguistic features</title>
        <p>In this context, we define extra-linguistic features as
the set of relevant data that is not related to the
users’ language use. Extra-linguistic features
mainly relate to the hierarchical organisation of
subreddits or the users’ popularity.</p>
        <p><bold>Moderators.</bold> Raleigh and McGuire (1986)
showed how the behaviour of ostracised ruling
primates can be seen as a function of the relations
between the prospective ruling individuals and
other members of the group. Considering this
fact, we decided to study the reactions addressed
to moderators from the Reddit community, as a
way of formalising and implementing the idea of
the balancing of power in human and animal
communities. Reactions can come from normal
users, administrators or moderators themselves.
Here, we took as a feature the role of the author
of each reaction, computing its relevance for the
classification task<sup>10</sup>.</p>
        <p><bold>Score.</bold> Each Reddit post is associated with a
publicly visible score. Defined as the sum of
the upvotes (likes, positive integers) and
downvotes (dislikes, negative integers) that the target
post or comment has obtained since it was
written, the score gives an idea of how
useful, funny or appreciated the post is from the
point of view of the community members.</p>
        <p><bold>Reddit Karma.</bold> The karma is a measure of the
appreciation and the respect that a user gains over
years of activity. Its computation is based on
the ratio of the scores of each post and comment
they produced. We considered the karma
of the users addressing our targets.</p>
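        <p>The three extra-linguistic features can be assembled into a numeric vector along the following lines. The role coding (0, 0.5, 1) is the one reported in footnote 10; the dictionary keys and the function name extra_linguistic_vector are hypothetical, and no scaling is applied because none is described in the paper.</p>
        <preformat><![CDATA[
```python
# Role coding as reported in the paper: user 0, moderator 0.5, admin 1.
ROLE_CODES = {"user": 0.0, "moderator": 0.5, "admin": 1.0}


def extra_linguistic_vector(reaction):
    """Return [role code, score, author karma] for one reaction.

    `reaction` is a dict with the hypothetical keys 'author_role',
    'upvotes', 'downvotes' and 'author_karma'.
    """
    # Score as described above: each upvote counts +1, each downvote -1.
    score = reaction["upvotes"] - reaction["downvotes"]
    return [ROLE_CODES[reaction["author_role"]], score, reaction["author_karma"]]


example = {"author_role": "moderator", "upvotes": 12, "downvotes": 5,
           "author_karma": 3400}
vector = extra_linguistic_vector(example)
```
]]></preformat>
        <p>Keeping the role as a single ordered value, rather than three indicator variables, preserves the hierarchy between basic users, moderators and administrators.</p>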
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Experiment</title>
      <p>We can operationalise the impact of linguistic and
extra-linguistic features on the binary
classification task by looking at the fluctuations of the models’
accuracy. We focused on minimal questions, such
as: do the linguistic features have an impact on
the classification accuracy? Which is the best (i.e.
most accurate) combination? What is the impact
of each extra-linguistic feature on the
classification accuracy? Does the performance improve
if we combine linguistic and extra-linguistic
features?</p>
      <p><sup>10</sup> We coded basic users with 0, moderators with 0.5 and
admins with 1.</p>
    </sec>
    <sec id="sec-6">
      <title>6 Results</title>
      <sec id="sec-6-1">
        <title>6.1 Linguistic and Extra-linguistic Features</title>
        <p>The relevance of the linguistic features and
extra-linguistic features taken individually is given by the
scores reported in Table 1<sup>11</sup>. The best linguistic
combination is C3, which contains all the
linguistic features considered. At this level,
the accuracy depends on the
number of linguistic features considered,
increasing as the latter increases. Regarding the set of
extra-linguistic features, the social status of the
reaction’s author (moderator) seems to be the most
relevant.</p>
        <p><sup>11</sup> C1 stands for the combination of punctuation and stop
words, C2 for punctuation, stop words and sentence length
and C3 for punctuation, stop words, sentence length and
emoticons.</p>
        <p><sup>12</sup> C1, C2 and C3 represent the sets of linguistic features
listed above, and each row of the table contains the accuracy
scores given by the addition of the social feature(s) (on
the left). EL1 stands for the combination of moderator and
score; EL2 for score and Reddit karma; EL3 for moderator,
score and Reddit karma.</p>
        <p>[...] models when trained only on linguistic or
extra-linguistic features. Moreover, for all the
combinations, the SVM models outperform the Naïve
Bayes models.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7 Conclusion</title>
      <p>We explored the phenomenon of social exclusion
through Reddit data within a period of 9 years. We
collected reactions addressed to moderators, here
considered as leading figures of the groups. We
selected 10 moderators that left the community
influenced by the linguistic and non-linguistic
behaviour of the group they led. We performed a
binary classification task on a total of 14200
linguistic reactions addressed to each of the target
moderators, analysing the influence of linguistic
and extra-linguistic or social patterns on two
simple models’ performance.</p>
      <p>
        We showed that the performance of both models
increases if linguistic and extra-linguistic features
are combined. The best combination of features,
concerning the SVM model, is given by the
combination of all the linguistic features and all the
social features considered. We can consider this
work as an attempt to follow the sociolinguistic
view that considers language as
intrinsically bound up with society
        <xref ref-type="bibr" rid="ref4">(Hovy, 2018)</xref>
        .
Our experiment and the related techniques are
simple and easy to replicate. We think that they
can also be applied in non-English domains, simply by
using a translation system for the stop-words. All
the other features can be directly generalised to
other languages.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Luciano</given-names>
            <surname>Barbosa</surname>
          </string-name>
          and
          <string-name>
            <given-names>Junlan</given-names>
            <surname>Feng</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>Robust sentiment detection on twitter from biased and noisy data</article-title>
          .
          <source>In 23rd International Conference on Computational Linguistics, COLING</source>
          , volume
          <volume>2</volume>
          , pages
          <fpage>36</fpage>
          -
          <lpage>44</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Herbert H.</given-names>
            <surname>Clark</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tania</given-names>
            <surname>Henetz</surname>
          </string-name>
          .
          <year>2014</year>
          .
          <article-title>Working together</article-title>
          .
          <source>In The Oxford handbook of language and social psychology</source>
          , page
          <fpage>85</fpage>
          . Oxford University Press, USA.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Andrew H.</given-names>
            <surname>Hales</surname>
          </string-name>
          , Matthew P. Kassner,
          <string-name>
            <given-names>Kipling D.</given-names>
            <surname>Williams</surname>
          </string-name>
          , and William G. Graziano.
          <year>2016</year>
          .
          <article-title>Disagreeableness as a cause and consequence of ostracism</article-title>
          .
          <source>Personality and Social Psychology Bulletin</source>
          ,
          <volume>42</volume>
          (
          <issue>6</issue>
          ):
          <fpage>782</fpage>
          -
          <lpage>797</lpage>
          . PMID:
          <pub-id pub-id-type="pmid">27044246</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Dirk</given-names>
            <surname>Hovy</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>The social and the neural network: How to make natural language processing about people again</article-title>
          .
          <source>In Proceedings of the Second Workshop on Computational Modeling of People's Opinions, Personality, and Emotions in Social Media</source>
          . Association for Computational Linguistics, June.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>C.J.</given-names>
            <surname>Hutto</surname>
          </string-name>
          and
          <string-name>
            <given-names>Eric</given-names>
            <surname>Gilbert</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Vader: A parsimonious rule-based model for sentiment analysis of social media text</article-title>
          .
          <source>In Proceedings of the 8th International Conference on Weblogs and Social Media, ICWSM 2014</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Moshe</given-names>
            <surname>Koppel</surname>
          </string-name>
          , Jonathan Schler, Shlomo Argamon, and
          <string-name>
            <given-names>Eran</given-names>
            <surname>Messeri</surname>
          </string-name>
          .
          <year>2006</year>
          .
          <article-title>Authorship attribution with thousands of candidate authors</article-title>
          .
          <source>In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Michael J.</given-names>
            <surname>Raleigh</surname>
          </string-name>
          and
          <string-name>
            <given-names>Michael T.</given-names>
            <surname>McGuire</surname>
          </string-name>
          .
          <year>1986</year>
          .
          <article-title>Animal analogues of ostracism: Biological mechanisms and social consequences</article-title>
          .
          <source>Ethology and Sociobiology</source>
          ,
          <volume>7</volume>
          (
          <issue>3</issue>
          ):
          <fpage>201</fpage>
          -
          <lpage>214</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Ruchita</given-names>
            <surname>Sarawgi</surname>
          </string-name>
          , Kailash Gajulapalli, and
          <string-name>
            <given-names>Yejin</given-names>
            <surname>Choi</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>Gender attribution: Tracing stylometric evidence beyond topic and genre</article-title>
          .
          <source>In Proceedings of 2011 Conference on Computational Natural Language Learning - CoNLL.</source>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>S.Y.</given-names>
            <surname>Shin</surname>
          </string-name>
          and J.C. Maldonado, editors.
          <year>2013</year>
          .
          <article-title>Exploiting emoticons in sentiment analysis</article-title>
          .
          <source>Association for Computing Machinery</source>
          , Inc.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jane G.</given-names>
            <surname>Stout</surname>
          </string-name>
          and
          <string-name>
            <given-names>Nilanjana</given-names>
            <surname>Dasgupta</surname>
          </string-name>
          .
          <year>2011</year>
          .
          <article-title>When he doesn't mean you: Gender-exclusive language as ostracism</article-title>
          .
          <source>Personality and Social Psychology Bulletin</source>
          ,
          <volume>37</volume>
          (
          <issue>6</issue>
          ):
          <fpage>757</fpage>
          -
          <lpage>769</lpage>
          . PMID:
          <pub-id pub-id-type="pmid">21558556</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Eric</given-names>
            <surname>Wesselmann</surname>
          </string-name>
          , James Nairne, and
          <string-name>
            <given-names>Kipling</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <year>2012</year>
          .
          <article-title>An evolutionary social psychological approach to studying the effects of ostracism</article-title>
          .
          <source>Journal of Social, Evolutionary, and Cultural Psychology</source>
          ,
          <volume>6</volume>
          :
          <fpage>309</fpage>
          ,
          <fpage>09</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Kipling D</given-names>
            <surname>Williams</surname>
          </string-name>
          .
          <year>2002</year>
          .
          <article-title>Ostracism: The power of silence</article-title>
          . Guilford Press.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Long-Zeng</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Lance</given-names>
            <surname>Ferris</surname>
          </string-name>
          , Ho Kwong Kwan, Flora Chiang, Ed Snape, and
          <string-name>
            <given-names>Lindie H.</given-names>
            <surname>Liang</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Breaking (or making) the silence: How goal interdependence and social skill predict being ostracized</article-title>
          .
          <source>Organizational Behavior and Human Decision Processes</source>
          ,
          <volume>131</volume>
          :
          <fpage>51</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>