<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Sentiment Classification using Sociolinguistic Clusters</article-title>
      </title-group>
      <fpage>99</fpage>
      <lpage>104</lpage>
      <abstract>
        <p>Sociolinguistic studies suggest that people of similar social status use language in similar ways, and recent large-scale computational analyses of online text support this view, showing, for example, the effect of social class, geography, and political preference on language use. We approach the tasks of TASS 2015 with sociolinguistic insights in order to capture patterns in the expression of sentiment. Our approach expands the scope of analysis from the text itself to the authors: their social status and political preference. The tweets of authors with similar social status or political preference are grouped into a cluster, and a classifier is built separately for each cluster to learn the linguistic style of that cluster. The approach can be further improved by combining it with other language processing and machine learning techniques.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The social aspect of language is an important means of understanding commonalities and differences in language use, as communication is inherently a social activity.</p>
      <p>Shared ideas and preferences of people are reflected in language use and are frequently observed in linguistic features such as memes, style, and word choices. The social aspect is also clear in the expression of sentiment, especially in social media. Social media platforms have many elements that encourage the use of similar expressions within social groups. For example, retweets and hashtags facilitate the adoption of expressions, and the short length of messages encourages the use of familiar expressions.</p>
      <p>
        Our approach to the tasks of TASS 2015
        <xref ref-type="bibr" rid="ref9">(Villena-Román et al., 2015)</xref>
        is based on the insights of sociolinguistics. Specifically, we focus on the effect of social variables on linguistic variation: people who share a preference or status may express sentiment more similarly to one another than to other people. For each task, we cluster the tweets by people who share some social feature (e.g., political orientation, occupation, or football team preference). To capture the style of each sociolinguistic cluster, a classification model is trained separately for each cluster.
      </p>
      <p>While the primary benefit of the approach is that it can distinguish the different styles of sentiment expression among social groups, it also mitigates the limited scale of the training data. For instance, some football players in the Social TV corpus and some entity-aspect pairs in the STOMPOL corpus have only a small number of associated tweets. Clustering them with other tweets written by people with similar preferences expands the amount of data that can be used for training.</p>
      <p>The approach can be easily combined with other language processing and machine learning techniques. Since our approach mainly considers the characteristics of the authors rather than the text of the tweets itself, more advanced language processing techniques would complement it. In addition, there is much room for future improvement, as the current implementation uses primitive language processing methods due to the author's limited knowledge of Spanish.</p>
      <p>Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with a recognized ISSN.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>The increasing availability of large-scale text corpora and the advances of big data processing platforms allow computational analysis of sociolinguistic phenomena. Many studies in NLP and computational social science now take hypotheses from sociolinguistics and other social sciences and test them on online data sets.</p>
      <p>
        In the context of computational analysis of sociolinguistic theories, a number of studies have shown the effect of social features on linguistic variation. For example,
        <xref ref-type="bibr" rid="ref3">Eisenstein et al. (2011)</xref>
        observed differences in term frequency depending on the demographics and geographical information of people, and also showed that differences in language use can play a significant role in predicting the demographics of authors. Similar studies were conducted with information about occupation
        <xref ref-type="bibr" rid="ref6">(Preotiuc-Pietro et al., 2015)</xref>
        and gender
        <xref ref-type="bibr" rid="ref10">(Wang et al., 2013)</xref>.
        Other studies specifically examined the relation between the expression of sentiment and social variables, for example, daily routine
        <xref ref-type="bibr" rid="ref2">(Dodds et al., 2011)</xref>
        and urban characteristics
        <xref ref-type="bibr" rid="ref4">(Mitchell et al., 2013)</xref>.
      </p>
      <p>
        Differences in language use depending on political or ideological preference have been explored as well. In the communication literature, researchers have conceptualized the phenomenon as framing
        <xref ref-type="bibr" rid="ref7">(Scheufele, 1999)</xref>,
        and many studies have analyzed how political and social issues are framed differently by media outlets and partisan organizations, and how these frames relate to the perception of the public. Many studies apply computational methods for similar purposes and observe differences in language use in various kinds of online text, for example, news articles
        <xref ref-type="bibr" rid="ref3 ref5">(Park et al., 2011a)</xref>,
        comments
        <xref ref-type="bibr" rid="ref5">(Park et al., 2011b)</xref>,
        and discussion forums
        <xref ref-type="bibr" rid="ref8">(Somasundaran and Wiebe, 2009)</xref>.
      </p>
    </sec>
    <sec id="sec-3">
      <title>System Design</title>
      <p>The classification systems that we have
developed for the tasks share the central idea of
using sociolinguistic clusters. We describe
below the system developed for each task in
order.</p>
      <p>
        The classification tool is identical for all the tasks. We use a linear SVM with Elastic Net regularization as the classifier. Given a set of tweets, the system trains a binary classifier for each class in a one-vs-all manner and combines them for multi-class classification. The input text of the classifier goes through a TF-IDF bag-of-words transformation. For Task 1, we optionally applied lemmatization and stop-word removal with FreeLing
        <xref ref-type="bibr" rid="ref1">(Carreras et al., 2004)</xref>.
      </p>
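      <p>As an illustration of the pipeline just described, the following minimal sketch chains a TF-IDF bag-of-words transformation, binary linear classifiers trained with hinge loss and an Elastic Net penalty, and a one-vs-all combination. It is not the implementation used for the submission: the text does not name a training procedure or hyperparameters, so the stochastic gradient loop, the function names, and the values of <monospace>alpha</monospace>, <monospace>l1_ratio</monospace>, <monospace>lr</monospace>, and <monospace>epochs</monospace> are all assumptions.</p>
      <preformat>
```python
import math
import random
from collections import Counter

# Sketch only: hyperparameters and the SGD training loop are assumptions,
# not the settings of the submitted system.

def tfidf_fit(docs):
    """Learn smoothed IDF weights from a list of tokenized tweets."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf_transform(doc, idf):
    """Turn one tokenized tweet into an L2-normalized sparse TF-IDF vector."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def decision(model, x):
    """Score of sparse vector x under a linear model (weights, bias)."""
    w, b = model
    return sum(w.get(k, 0.0) * v for k, v in x.items()) + b

def train_binary_svm(X, y, epochs=100, lr=0.1, alpha=1e-4, l1_ratio=0.15, seed=0):
    """SGD on hinge loss with an Elastic Net penalty; y holds +1 / -1 labels."""
    rng = random.Random(seed)
    w, b = {}, 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            violated = 1.0 > y[i] * decision((w, b), X[i])
            # Elastic Net step: the L2 part rescales the weights and the
            # L1 part soft-thresholds them towards zero.
            for k in list(w):
                wk = w[k] * (1.0 - lr * alpha * (1.0 - l1_ratio))
                w[k] = math.copysign(max(abs(wk) - lr * alpha * l1_ratio, 0.0), wk)
            if violated:  # hinge loss is active: move towards the example
                for k, v in X[i].items():
                    w[k] = w.get(k, 0.0) + lr * y[i] * v
                b += lr * y[i]
    return w, b

def train_one_vs_all(docs, labels, **svm_args):
    """Fit the TF-IDF weights, then one binary SVM per sentiment class."""
    idf = tfidf_fit(docs)
    X = [tfidf_transform(doc, idf) for doc in docs]
    models = {}
    for cls in sorted(set(labels)):
        y = [1 if lab == cls else -1 for lab in labels]
        models[cls] = train_binary_svm(X, y, **svm_args)
    return idf, models

def predict(idf, models, doc):
    """Pick the class whose binary classifier gives the highest score."""
    x = tfidf_transform(doc, idf)
    return max(models, key=lambda cls: decision(models[cls], x))
```
</preformat>
      <p>In this sketch, <monospace>train_one_vs_all(docs, labels)</monospace> returns the IDF table together with one model per class, and <monospace>predict(idf, models, tokens)</monospace> labels a tokenized tweet. An off-the-shelf SVM implementation would normally replace the hand-written training loop.</p>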
      <sec id="sec-3-2">
        <title>3.1 Task 1: General Sentiment Classification</title>
        <p>The corpus of this task includes the tweets of selected famous people together with information about them, namely their occupation and political orientation.</p>
        <p>
          Our system for this task clusters people based on this information and uses the tweets of each cluster for training. The idea behind the system is that people with the same occupation or political orientation will have similar patterns in the expression of sentiment. For example, journalists may have a certain way of expressing sentiment, which can be different from that of celebrities. A similar idea was tested with English tweets by
          <xref ref-type="bibr" rid="ref6">Preotiuc-Pietro et al. (2015)</xref>,
          who predicted the occupation of authors from their tweets.
        </p>
        <p>We tested several ways of clustering people: by occupation, by political orientation, and by both occupation and political orientation. The system trains a classifier for each cluster, using only the tweets written by the people of that cluster. The classifiers are trained on the label set that matches the task granularity (5-level or 3-level).</p>
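        <p>The three clustering variants and the per-cluster training can be sketched as follows. This is a hedged illustration rather than the submitted code: the record layout, the field names <monospace>occupation</monospace> and <monospace>orientation</monospace>, and the placeholder majority-label trainer are assumptions, and any real classifier could be passed in as <monospace>train_fn</monospace>.</p>
        <preformat>
```python
from collections import Counter, defaultdict

# Hypothetical record layout: (tokens, label, author_profile). The field names
# "occupation" and "orientation" are assumptions, not the corpus schema.
KEY_FUNCS = {
    "occupation": lambda a: (a["occupation"],),
    "orientation": lambda a: (a["orientation"],),
    "both": lambda a: (a["occupation"], a["orientation"]),
}

def train_majority(examples):
    """Placeholder trainer: remembers the most frequent label of the cluster."""
    top = Counter(label for _, label in examples).most_common(1)[0][0]
    return lambda tokens: top

def train_per_cluster(records, scheme, train_fn=train_majority):
    """Group tweets by their author's cluster and train one model per cluster."""
    key_of = KEY_FUNCS[scheme]
    clusters = defaultdict(list)
    for tokens, label, author in records:
        clusters[key_of(author)].append((tokens, label))
    return {key: train_fn(examples) for key, examples in clusters.items()}

def classify(models, scheme, tokens, author):
    """Route a tweet to the model of its author's cluster."""
    return models[KEY_FUNCS[scheme](author)](tokens)
```
</preformat>
        <p>Switching between the variants only changes the <monospace>scheme</monospace> argument; the "both" scheme produces finer clusters with fewer training tweets each.</p>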
      </sec>
      <sec id="sec-3-4">
        <title>3.2 Task 2 (a): Aspect-based Sentiment Analysis with SocialTV corpus</title>
        <p>Unlike the corpus of Task 1, this corpus does not include information about the authors; thus, it is not clear how to cluster the tweets. However, the unique characteristic of the topic (the football match between Real Madrid and F.C. Barcelona) and the aspect-sentiment pairs of the tweets provide useful hints about the authors. The rivalry between the two teams suggests that many of the authors prefer one of the two, and the aspect-sentiment pair gives hints about the preferred team. For example, if a tweet discusses Xavier Hernández and its sentiment is positive, it is reasonable to guess that the author prefers F.C. Barcelona and shares with other F.C. Barcelona fans a common sentiment towards both F.C. Barcelona and Real Madrid.</p>
        <p>Thus, we group the aspects based on team affiliation. The players of each team are grouped into a single entity, and one classifier is developed for each team. The remaining aspects (e.g., Afición) are not clustered, since they are not affiliated with either team; a separate classifier is developed for each of them.</p>
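        <p>A minimal sketch of this grouping is given below. The aspect labels and the small player-to-team table are illustrative assumptions and do not reproduce the corpus annotation; the point is only that player aspects collapse into their team while every other aspect keeps its own cluster.</p>
        <preformat>
```python
# Illustrative mapping only: these aspect labels are assumptions and the
# table is far from complete.
TEAM_OF_ASPECT = {
    "Jugador-Xavi": "F.C. Barcelona",
    "Jugador-Messi": "F.C. Barcelona",
    "Jugador-Casillas": "Real Madrid",
    "Jugador-Ramos": "Real Madrid",
}

def cluster_key(aspect):
    """Players collapse into their team; every other aspect stays on its own."""
    return TEAM_OF_ASPECT.get(aspect, aspect)

def group_by_cluster(examples):
    """Group (text, aspect, sentiment) triples into training sets per cluster."""
    clusters = {}
    for text, aspect, sentiment in examples:
        clusters.setdefault(cluster_key(aspect), []).append((text, sentiment))
    return clusters
```
</preformat>
        <p>Each value of the returned dictionary is the training set of one classifier, so a player with few tweets borrows examples from teammates.</p>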
      </sec>
      <sec id="sec-3-6">
        <title>3.3 Task 2 (b): Aspect-based Sentiment Analysis with STOMPOL corpus</title>
        <p>For this task, we cluster tweets at two levels. First, we cluster tweets by entity-aspect pair. Thus, even if two tweets cover the same entity (party), they are treated as covering different topics if the aspect is not the same. For example, a tweet about the economic proposal (aspect) of Podemos (entity) is distinguished from a tweet about the education policy (aspect) of Podemos (entity). It is also possible to cluster tweets by entity only; however, we consider both elements, as every tweet in the corpus has a specific aspect associated with the entity. In addition, people frequently evaluate a political party differently with regard to different aspects; a person may evaluate the economic policies of Podemos positively but its foreign policies negatively. Theories of political communication, such as agenda setting and framing theory, suggest that people often consider parties and issues together when they evaluate the parties.</p>
        <p>Second, we further cluster the tweets based on the characteristics of the political parties. For example, following the left vs. right dimension, the tweets about the entity Izquierda Unida and the aspect Economia are grouped with those about Podemos and Economia, as the economic policies of the two parties are likely to be more similar to each other than to those of right-wing parties. As a result, 10 clusters are produced (2 party groups x 5 aspects), and a classifier is developed separately for each cluster.</p>
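        <p>The two-level key can be sketched as below. Only the placement of Izquierda Unida and Podemos in the same group comes from the text; the third party entry and the five aspect names are assumptions added to make the example runnable, and the actual grouping is the one given in Table 1.</p>
        <preformat>
```python
from itertools import product

# Only the placement of Izquierda Unida and Podemos comes from the text;
# the remaining entry and the aspect names are assumptions for illustration.
PARTY_GROUP = {
    "Izquierda Unida": "left",
    "Podemos": "left",
    "Partido Popular": "right",
}
ASPECTS = ["Economia", "Sanidad", "Educacion", "Propio_partido", "Otros_aspectos"]

def cluster_of(entity, aspect, party_group=PARTY_GROUP):
    """Level 1 keeps the aspect; level 2 replaces the party by its group."""
    return (party_group[entity], aspect)

def all_clusters(party_group=PARTY_GROUP, aspects=ASPECTS):
    """Enumerate every (party group, aspect) cluster that needs a classifier."""
    groups = sorted(set(party_group.values()))
    return list(product(groups, aspects))
```
</preformat>
        <p>With two party groups and five aspects, <monospace>all_clusters()</monospace> enumerates the 10 clusters mentioned above; swapping in a different <monospace>party_group</monospace> table changes the grouping dimension without touching the rest.</p>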
        <p>We compared two ways of grouping the parties: the first is the left vs. right dimension, as in the Izquierda Unida example, and the second is the new vs. old dimension, which reflects the new political landscape of Spain. The details of the party grouping are shown in Table 1.</p>
      </sec>
      <sec id="sec-3-8">
        <title>4.1 Task 1 General Sentiment Classification (5-levels, Full corpus)</title>
        <p>For this task, we ran three versions of the method: clustering the authors by occupation, by political orientation, and by both. We submitted the first version (clustering by occupation), as it performed better than the other two. The performance metrics are summarized in Table 2. The results and the performance trend were similar for the 1k test set, so we only describe the results for the full corpus.</p>
        <p>The breakdown of the performance by sentiment category in Table 3 offers more insight. The performance for the categories NEU and P is worse than that for the other categories. While other optimizations could be made for these two categories, we believe the method can be improved simply by having more examples of them in the training set: the current corpus includes far fewer examples of these two categories than of the others.</p>
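        <p>A per-category breakdown of this kind can be computed as in the sketch below, which assumes the usual per-class definitions of precision, recall, and F1 and also reports the number of gold examples per class. The function name and the report layout are illustrative, and no figures from the tables are reproduced here.</p>
        <preformat>
```python
from collections import Counter

def per_class_report(gold, predicted):
    """Precision, recall, F1 and support for every label seen in the data."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    report = {}
    for c in sorted(set(gold) | set(predicted)):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[c] = {"precision": prec, "recall": rec, "f1": f1,
                     "support": tp[c] + fn[c]}
    return report
```
</preformat>
        <p>Reading the <monospace>support</monospace> column next to the F1 column makes it easy to check whether weak categories are also the ones with few examples.</p>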
        <p>Interestingly, the performance drops further when preprocessing (lemmatization and stop-word removal) is applied to the tweets. This drop was observed regardless of the version of our approach. The result suggests that conventional preprocessing removes linguistic features that are relevant to sentiment expression. Because of the drop, we chose not to apply the preprocessing in the following tasks.</p>
      </sec>
      <sec id="sec-3-10">
        <title>4.2 Task 1 General Sentiment Classification (3-levels, Full corpus)</title>
        <p>We ran the same three versions of the method; the results are shown in Table 4. The performance is generally higher than in the 5-level classification task. As in the previous result, the version that clusters people by occupation performs better than the other two.</p>
        <p>The performance breakdown shows some differences from the previous task. First, the performance for the category P is much higher. We believe this is because the number of training examples of this category is larger than in the previous task, as the examples of the P+ and P categories are merged. We also see a similar improvement for the category N. The category NEU remains a bottleneck. The improvement observed for the categories N and P suggests that a similar improvement may be achieved for NEU if there are more examples in the training set.</p>
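        <p>The effect of merging labels on the class counts can be illustrated with the short sketch below. The P+ to P merge is stated in the text; the N+ to N merge is assumed by symmetry, and the label strings themselves are assumptions about the corpus encoding.</p>
        <preformat>
```python
from collections import Counter

# P+ -> P is stated in the text; N+ -> N is assumed by symmetry.
MERGE = {"P+": "P", "N+": "N"}

def to_three_level(label):
    """Map a 5-level label to its 3-level counterpart; others pass through."""
    return MERGE.get(label, label)

def class_counts(labels):
    """Count training examples per class after merging."""
    return Counter(to_three_level(lab) for lab in labels)
```
</preformat>
        <p>After merging, P and N each pool the examples of two former classes, while NEU keeps only its own, which is consistent with NEU remaining the weakest category.</p>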
      </sec>
      <sec id="sec-3-12">
        <title>4.3 Task 2a Aspect-based Sentiment Analysis with SocialTV corpus</title>
        <p>As described, the approach to this task is to group the tweets, in the training phase, by aspects that share a team membership. The performance of the approach is shown in Table 6.</p>
        <p>Further analysis is required to understand the effect of the method. The breakdown of the performance by category does not show a clear pattern: the tweets related to some players are identified very accurately while those of other players are not, and the performance does not differ much by the team of the players or by the sentiment expressed. We believe a larger test set with enough samples for all players would better reveal the effect of the approach.</p>
      </sec>
      <sec id="sec-3-14">
        <title>4.4 Task 2b Aspect-based Sentiment Analysis with STOMPOL corpus</title>
        <p>Two versions of the approach are applied to the task: the first clusters the tweets of the same aspect by the ideological leaning of the parties (left vs. right); the second, by the novelty of the parties. The results are shown in Table 7.</p>
        <p>The version that groups by the ideological leaning of the parties performed better than the other version. The breakdown of the performance revealed that the approach generally performed better for tweets that express a negative sentiment. For example, nine of the top-10 categories in terms of F1 score were those expressing a negative sentiment. This is partly because tweets related to politics often convey a negative sentiment; hence, there are more training examples with negative sentiment.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>In this paper, we present a sentiment classification method that utilizes sociolinguistic insights. The method is based on the idea that people with similar social status (e.g., occupation) or political orientation may also be similar in the way they express sentiment online. Thus, the method focuses on grouping authors with similar preferences or occupations. A classifier is developed separately for each group to capture the patterns of expression within that group.</p>
      <p>The method achieves an accuracy of around 0.45 and 0.6 for the 5-level and 3-level Task 1 classification, respectively. It achieves 0.63 and 0.56 for the Social TV corpus and the STOMPOL corpus, respectively. The results show that the method performs better for the sentiment classes with more training examples. It can be further improved by combining it with more language processing methods optimized for Spanish.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
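          <string-name><surname>Carreras</surname>, <given-names>Xavier</given-names></string-name>,
          <string-name><given-names>Isaac</given-names> <surname>Chao</surname></string-name>,
          <string-name><given-names>Lluís</given-names> <surname>Padró</surname></string-name>, and
          <string-name><given-names>Muntsa</given-names> <surname>Padró</surname></string-name>
          .
          <year>2004</year>
          .
          <article-title>FreeLing: An Open-Source Suite of Language Analyzers</article-title>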
          .
          <source>In Proc. of LREC.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Dodds</surname>
            ,
            <given-names>P. S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Harris</surname>
            ,
            <given-names>K. D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kloumann</surname>
            ,
            <given-names>I. M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bliss</surname>
            ,
            <given-names>C. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Danforth</surname>
            ,
            <given-names>C. M.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter</article-title>
          .
          <source>PloS ONE</source>
          ,
          <volume>6</volume>
          (
          <issue>12</issue>
          ),
          <elocation-id>e26752</elocation-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name><surname>Eisenstein</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Smith</surname>, <given-names>N. A.</given-names></string-name>, and
          <string-name><surname>Xing</surname>, <given-names>E. P.</given-names></string-name>
          <year>2011</year>
          .
          <article-title>Discovering sociolinguistic associations with structured sparsity</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name><surname>Mitchell</surname>, <given-names>L.</given-names></string-name>,
          <string-name><surname>Frank</surname>, <given-names>M. R.</given-names></string-name>,
          <string-name><surname>Harris</surname>, <given-names>K. D.</given-names></string-name>,
          <string-name><surname>Dodds</surname>, <given-names>P. S.</given-names></string-name>, &amp;
          <string-name><surname>Danforth</surname>, <given-names>C. M.</given-names></string-name>
          <year>2013</year>.
          <article-title>The geography of happiness: Connecting twitter sentiment and expression, demographics, and objective characteristics of place</article-title>.
          <source>PLoS ONE</source>
          <volume>8</volume>:
          <elocation-id>e64417</elocation-id>.
        </mixed-citation>
      </ref>
      <ref>
        <mixed-citation>
          <string-name><surname>Park</surname>, <given-names>S.</given-names></string-name>,
          <string-name><surname>Ko</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Kim</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>Liu</surname>, <given-names>Y.</given-names></string-name>, &amp;
          <string-name><surname>Song</surname>, <given-names>J.</given-names></string-name>
          <year>2011a</year>.
          <article-title>The politics of comments: predicting political orientation of news stories with commenters' sentiment patterns</article-title>.
          <source>In Proceedings of the ACM conference on Computer supported cooperative work.</source>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2011b</year>
          .
          <article-title>Contrasting opposing views of news articles on contentious issues</article-title>
          .
          <source>In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.</source>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Preotiuc-Pietro</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lampos</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Aletras</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>An analysis of the user occupational class through Twitter content</article-title>
          .
          <source>In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.</source>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Scheufele</surname>
            ,
            <given-names>D. A.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Framing as a theory of media effects</article-title>
          .
          <source>Journal of communication</source>
          ,
          <volume>49</volume>
          (
          <issue>1</issue>
          ),
          <fpage>103</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Somasundaran</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Wiebe</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2009</year>
          .
          <article-title>Recognizing stances in online debates</article-title>
          .
          <source>In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP</source>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name><surname>Villena-Román</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>García-Morera</surname>, <given-names>J.</given-names></string-name>,
          <string-name><surname>García-Cumbreras</surname>, <given-names>M. A.</given-names></string-name>,
          <string-name><surname>Martínez-Cámara</surname>, <given-names>E.</given-names></string-name>,
          <string-name><surname>Martín-Valdivia</surname>, <given-names>M. T.</given-names></string-name>,
          <string-name><surname>Ureña-López</surname>, <given-names>L. A.</given-names></string-name>
          <year>2015</year>
          .
          <article-title>Overview of TASS 2015</article-title>
          .
          <source>In Proceedings of TASS 2015: Workshop on Sentiment Analysis at SEPLN</source>
          . CEUR-WS.org vol.
          <volume>1397</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Y. C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Burke</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kraut</surname>
            ,
            <given-names>R. E.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Gender, topic, and audience response: an analysis of user-generated content on facebook</article-title>
          .
          <source>In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>