=Paper=
{{Paper
|id=Vol-1397/telefonica
|storemode=property
|title=Sentiment Classification Using Sociolinguistic Clusters
|pdfUrl=https://ceur-ws.org/Vol-1397/telefonica.pdf
|volume=Vol-1397
|dblpUrl=https://dblp.org/rec/conf/sepln/Park15
}}
==Sentiment Classification Using Sociolinguistic Clusters==
<pdf width="1500px">https://ceur-ws.org/Vol-1397/telefonica.pdf</pdf>
<pre>
TASS 2015, septiembre 2015, pp 99-104                                           recibido 20-07-15 revisado 24-07-15 aceptado 28-07-15


            Sentiment Classification using Sociolinguistic Clusters
           Clasificación de Sentimiento basada en Grupos Sociolingüísticos
                                                      Souneil Park
                                                  Telefonica Research
                                              souneil.park@telefonica.com


        Resumen: Estudios sociolingüísticos sugieren una alta similitud entre el lenguaje utilizado por
        personas de una misma clase social. Análisis recientes realizados a gran escala sobre textos en
        Internet y mediante el uso de minería, sustentan esta hipótesis. Datos como la clase social del
        autor, su geolocalización o afinidades políticas tienen efecto sobre el uso del lenguaje en dichos
        textos. En nuestro trabajo utilizamos la información sociolingüística del autor para la
        identificación de patrones de expresión de sentimiento. Nuestro enfoque expande el ámbito del
        análisis de textos al análisis de los autores mediante el uso de su clase social y afinidad política.
        Más concretamente, agrupamos tweets de autores de clases sociales o afinidades políticas
        similares y entrenamos clasificadores de forma independiente con el propósito de aprender el
        estilo lingüístico de cada grupo. Este mismo enfoque podría mejorarse en combinación con
        otras técnicas de procesado del lenguaje y aprendizaje automático.
        Palabras clave: sociolingüística, clase social, estilo lingüístico, clustering de usuario.

        Abstract: Sociolinguistic studies suggest that people of similar social status use language in
        similar ways, and recent large-scale computational analyses of online text support this view,
        showing, for example, the effect of social class, geography, and political preference on
        language use. We approach the tasks of TASS 2015 with sociolinguistic insights in order to
        capture patterns in the expression of sentiment. Our approach expands the scope of analysis
        from the text itself to the authors: their social status and political preference. The tweets of
        authors with similar social status or political preference are grouped into a cluster, and a
        classifier is built separately for each cluster to learn the linguistic style of that cluster. The
        approach can be further improved by combining it with other language processing and machine
        learning techniques.
        Keywords: Sociolinguistics, Social Group, Linguistic Styles, User Clustering.


Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with a recognized
ISSN (ISSN 1613-0073).

1    Introduction
The social aspect of language is an important means of understanding commonalities and differences
in language use, as communication is inherently a social activity. Shared ideas and preferences are
reflected in language use and are frequently observed in linguistic features such as memes, style,
and word choice. The social aspect is also clear in the expression of sentiment, especially in
social media. Social media platforms have many elements that encourage the use of similar
expressions within social groups. For example, retweets and hashtags facilitate the adoption of
expressions, and the short length of messages encourages the use of familiar expressions.

Our approach to the tasks of TASS 2015 (Villena-Román et al., 2015) is based on insights from
sociolinguistics. Specifically, we focus on the effect of social variables on linguistic variation:
people who share a similar preference or status may express sentiment more similarly to one another
than to others. For each task, we cluster the tweets by people who share some social feature (e.g.,
political orientation, occupation, or football team preference). To capture the style of each
sociolinguistic cluster, a classification model is trained separately for each cluster.

While the primary benefit of the approach is that it can distinguish the different styles of
sentiment expression among different social groups, it also mitigates the limited size of the
training data. For instance, some football players in the Social TV corpus and some entity-aspect
pairs in the STOMPOL corpus have only a limited number of associated tweets. Clustering them with
other tweets written by people with similar preferences expands the amount of data that can be used
for training.

The approach can be easily combined with other language processing and machine learning techniques.
Since our approach mainly considers the characteristics of the authors rather than the text of the
tweets, it is complementary to more advanced language processing techniques. In addition, there is
much room for future improvement, as the current implementation uses primitive language processing
methods due to the author's limited knowledge of Spanish.

2    Related Work
The increasing availability of large-scale text corpora and advances in big data processing
platforms allow computational analysis of sociolinguistic phenomena. Many works in NLP and
computational social science now take hypotheses from sociolinguistics and other social sciences
and test them on online data sets.

In the context of computational analysis of sociolinguistic theories, a number of works have shown
the effect of social features on linguistic variation. For example, Eisenstein et al. (2011)
observed differences in term frequency depending on people's demographics and geographical
information, and showed that these differences in language use can play a significant role in
predicting the demographics of authors. Similar studies were conducted on occupation
(Preotiuc-Pietro et al., 2015) and gender (Wang et al., 2013). Other works specifically observed
the relation between the expression of sentiment and social variables, for example, daily routine
(Dodds et al., 2011) and urban characteristics (Mitchell et al., 2013).

Differences in language use depending on political or ideological preference have been explored as
well. In the communication literature, researchers have conceptualized the phenomenon as framing
(Scheufele, 1999), and many studies have analyzed how political and social issues are framed
differently by media outlets and partisan organizations, and how this framing relates to public
perception. Many works apply computational methods for similar purposes and observe differences in
language use in various online text data, for example, news articles (Park et al., 2011a), comments
(Park et al., 2011b), and discussion forums (Somasundaran and Wiebe, 2009).

3    System Design
The classification systems that we developed for the tasks share the central idea of using
sociolinguistic clusters. We describe the system developed for each task in order below.

The classification tool is identical for all tasks. We use a linear SVM with the Elastic Net
regularizer as the classifier. Given a set of tweets, the system trains a binary classifier for
each class in a one-vs-all manner and combines them for multi-class classification. The input text
is transformed into a TF-IDF bag-of-words representation. For Task 1, we optionally applied
lemmatization and stop-word removal with FreeLing (Carreras et al., 2004).

3.1 Task 1: General Sentiment Classification
The corpus of this task includes the tweets of selected famous people and information about them,
namely their occupation and political orientation.

Our system for this task clusters people based on this information and uses the tweets of each
cluster for training. The idea behind the system is that people with the same occupation or
political orientation will have similar patterns in the expression of sentiment. A similar idea was
tested with English tweets by Preotiuc-Pietro et al. (2015), who predicted the occupation of
authors based on
their tweets. For example, journalists may have a certain way of expressing sentiment, which can be
different from that of celebrities.

We tested several ways of clustering people: by occupation, by political orientation, and by both
occupation and political orientation. The system trains a classifier for each cluster, using only
the tweets written by the people of that cluster. The classifiers are trained according to the task
granularity (5-level or 3-level).

3.2 Task 2 (a): Aspect-based Sentiment Analysis with SocialTV corpus
Unlike Task 1, the corpus does not include information about the authors; thus, it is not clear how
to cluster the tweets. However, the unique characteristic of the topic (the football match between
Real Madrid and F.C. Barcelona) and the aspect-sentiment pairs of the tweets provide useful hints
about the authors. The rivalry between the two teams suggests that many of the authors prefer one
of the two, and the aspect-sentiment pair hints at the preferred team. For example, if a tweet
discusses Xavier Hernández and its sentiment is positive, it is possible to guess that the author
prefers F.C. Barcelona. Such an author is likely to share with other F.C. Barcelona fans a common
sentiment towards either F.C. Barcelona or Real Madrid.

Thus, we group the aspects based on team affiliation. The players of each team are grouped as a
single entity, and one classifier is developed for each team. The rest of the aspects (e.g.,
Afición) are not clustered since they do not share a common membership with either of the teams; a
separate classifier is developed for each of them.

3.3 Task 2 (b): Aspect-based Sentiment Analysis with STOMPOL corpus
For this task, we cluster tweets at two levels. First, we cluster tweets by entity-aspect pair.
Thus, even if two tweets cover the same entity (party), they are treated as covering different
topics if the aspect is not the same. For example, a tweet about the economic proposal (aspect) of
Podemos (entity) is
distinguished from a tweet about the education policy (aspect) of Podemos (entity). It is also
possible to cluster tweets only by entity; however, we consider both elements for clustering, as
all the tweets of the corpus have a specific aspect associated with the entity. In addition, people
frequently evaluate a political party differently with respect to different aspects; a person may
evaluate the economic policies of Podemos positively but its foreign policies negatively. Theories
of political communication, such as agenda setting and framing theory, suggest that people often
consider parties and issues together when they evaluate the parties.

Second, we further cluster the tweets based on the characteristics of the political parties. For
example, following the left vs. right dimension, the tweets about the entity Izquierda Unida and
the aspect Economia are grouped with those about Podemos and Economia, as the economic policies of
these two parties are expected to be more similar to each other than to those of the right-wing
parties. As a result, 10 clusters are produced (2 party groups x 5 aspects) and a classifier is
developed separately for each cluster.

We compared two ways of grouping the parties: the first is the left vs. right dimension, as in the
example, and the second is the new vs. old dimension, reflecting the new political landscape of
Spain. The details of the party grouping are shown in Table 1.

Table 1: Two groupings of the parties

4    Results and Discussion

4.1 Task 1: General Sentiment Classification (5 levels, full corpus)
For this task, we ran three versions of the method: first, clustering the authors by occupation;
second, by political orientation; third, by both. We submitted the first version (clustering by
occupation) as it performed better
than the other two. The performance metrics are summarized in Table 2. The results and the
performance trend were similar for the 1k test set, so we only describe the results for the full
corpus.

Table 2. Performance Summary

The breakdown of the performance by sentiment category in Table 3 offers more insight. The
performance for the categories NEU and P is worse than that for the other categories. While other
optimizations could be made for these two categories, we believe the method can be improved simply
by having more examples of those categories in the training set. Compared to the other categories,
the current corpus includes far fewer examples of these two.

Table 3. Performance of Version 1 (Cluster-by-Occupation) by Sentiment Category

Interestingly, the performance drops further when preprocessing (lemmatization and stop-word
removal) is applied to the tweets. This drop was observed regardless of the version of our
approach. The result suggests that conventional preprocessing removes linguistic features that are
relevant to sentiment expression. Due to the performance drop, we chose not to apply the
preprocessing in the following tasks.

4.2 Task 1: General Sentiment Classification (3 levels, full corpus)
We ran the same three versions of the method; the results are shown in Table 4. The performance is
generally higher than in the 5-level classification task. As in the previous result, the version
that clusters people by occupation performs better than the other two.

Table 4. Performance Summary

The performance breakdown shows some differences from the previous task. First of all, the
performance for the category P is much higher. We believe this is because this category has more
training examples than in the previous task, since the examples of the P+ and P categories are
merged. We also see a similar improvement for the category N. The category NEU remains a
bottleneck. The improvement observed for the categories N and P suggests that a similar improvement
may be achieved for the category NEU if there are more examples in the training set.

Table 5. Performance of Version 1 (Cluster-by-Occupation) by Sentiment Category

4.3 Task 2 (a): Aspect-based Sentiment Analysis with SocialTV corpus
As described, the approach to this task is to group the tweets, in the training phase, by aspects
that share team membership. The
performance of the approach is shown in Table 6.

Table 6. Performance Summary

Further analysis is required to understand the effect of the method. The breakdown of the
performance by category does not show a clear pattern: tweets related to some players are
classified very accurately while those related to other players are not, and the performance does
not differ much by the team of the players or by the sentiment expressed. We believe a larger test
set with enough samples for all players would better reveal the effect of the approach.

4.4 Task 2 (b): Aspect-based Sentiment Analysis with STOMPOL corpus
Two versions of the approach are applied to the task: first, clustering the tweets of the same
aspect by parties of the same ideological leaning (left vs. right); second, by the novelty of the
parties (new vs. old). The results are shown in Table 7.

Table 7. Performance Summary

The version that groups by the ideological leaning of the parties performed better than the other
version. The breakdown of the performance revealed that the approach generally performed better for
tweets that express a negative sentiment. For example, nine of the top-10 categories in terms of F1
score were those expressing a negative sentiment. This is partly because tweets related to politics
often convey a
negative sentiment; hence there are more training examples with negative sentiment.

5    Conclusion
In this paper, we present a sentiment classification method that utilizes sociolinguistic insights.
The method is based on the idea that people with similar social status (e.g., occupation) or
political orientation may also be similar in the way they express their sentiment online. Thus, the
method focuses on grouping authors with similar taste or occupation. A classifier is developed
separately for each group to capture the similarities and differences of expression within that
group.

The method achieves an accuracy of around 0.45 and 0.6 for the 5-level and 3-level Task 1
classification, respectively. It achieves 0.63 and 0.56 for the Social TV corpus and the STOMPOL
corpus, respectively. The results show that the method performs better for the sentiment classes
with more training examples. It can be further improved by combining it with language processing
methods optimized for Spanish.

References
Carreras, X., Chao, I., Padró, L., & Padró, M. 2004. FreeLing: An open-source suite of language
   analyzers. In Proc. of LREC.
Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., & Danforth, C. M. 2011. Temporal
   patterns of happiness and information in a global social network: Hedonometrics and Twitter.
   PLoS ONE, 6(12), e26752.
Eisenstein, J., Smith, N. A., & Xing, E. P. 2011. Discovering sociolinguistic associations with
   structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for
   Computational Linguistics: Human Language Technologies.
Mitchell, L., Frank, M. R., Harris, K. D., Dodds, P. S., & Danforth, C. M. 2013. The geography of
   happiness: Connecting Twitter sentiment and expression, demographics,
   and objective characteristics of place. PLoS ONE, 8, e64417.
Park, S., Ko, M., Kim, J., Liu, Y., & Song, J. 2011a. The politics of comments: Predicting
   political orientation of news stories with commenters' sentiment patterns. In Proceedings of
   the ACM Conference on Computer Supported Cooperative Work.
Park, S., Lee, K., & Song, J. 2011b. Contrasting opposing views of news articles on contentious
   issues. In Proceedings of the 49th Annual Meeting of the Association for Computational
   Linguistics: Human Language Technologies.
Preotiuc-Pietro, D., Lampos, V., & Aletras, N. 2015. An analysis of the user occupational class
   through Twitter content. In Proceedings of the 53rd Annual Meeting of the Association for
   Computational Linguistics: Human Language Technologies.
Scheufele, D. A. 1999. Framing as a theory of media effects. Journal of Communication, 49(1),
   103-122.
Somasundaran, S., & Wiebe, J. 2009. Recognizing stances in online debates. In Proceedings of the
   Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint
   Conference on Natural Language Processing of the AFNLP. Association for Computational
   Linguistics.
Villena-Román, J., García-Morera, J., García-Cumbreras, M. A., Martínez-Cámara, E.,
   Martín-Valdivia, M. T., & Ureña-López, L. A. 2015. Overview of TASS 2015. In Proceedings of
   TASS 2015: Workshop on Sentiment Analysis at SEPLN. CEUR-WS.org vol. 1397.
Wang, Y. C., Burke, M., & Kraut, R. E. 2013. Gender, topic, and audience response: An analysis of
   user-generated content on Facebook. In Proceedings of the SIGCHI Conference on Human Factors
   in Computing Systems.
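
The classifier described in Section 3 (TF-IDF bag of words, linear SVM with an Elastic Net
regularizer, one-vs-all combination) can be illustrated with the minimal, dependency-free sketch
below. It is not the original implementation: the paper does not name the SVM library or its
settings, so the plain subgradient-descent trainer, the values of epochs, lr, alpha and l1_ratio,
the whitespace tokenizer, and the toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed inverse document frequencies from tokenized documents."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(doc, idf):
    """Sparse, L2-normalized TF-IDF vector; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in doc if tok in idf)
    vec = {tok: cnt * idf[tok] for tok, cnt in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def score(model, x):
    """Linear decision value w.x + b for a sparse vector x."""
    w, b = model
    return sum(w.get(tok, 0.0) * v for tok, v in x.items()) + b

def train_binary_svm(X, y, epochs=200, lr=0.1, alpha=1e-4, l1_ratio=0.15):
    """Hinge loss plus Elastic Net penalty, minimized by subgradient descent.
    y holds +1/-1 targets. All hyperparameter values are assumed."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            violated = target * score((w, b), x) < 1.0
            for tok in w:  # Elastic Net shrinkage: L2 part plus L1 part
                sign = (w[tok] > 0) - (w[tok] < 0)
                w[tok] -= lr * alpha * ((1 - l1_ratio) * w[tok] + l1_ratio * sign)
            if violated:  # hinge subgradient step
                for tok, v in x.items():
                    w[tok] = w.get(tok, 0.0) + lr * target * v
                b += lr * target
    return w, b

def train_one_vs_all(X, labels):
    """Train one binary SVM per class (that class against all others)."""
    return {c: train_binary_svm(X, [1 if lab == c else -1 for lab in labels])
            for c in sorted(set(labels))}

def predict(models, x):
    """Multi-class decision: the class whose binary SVM scores highest."""
    return max(models, key=lambda c: score(models[c], x))

# Toy tweets invented for illustration; labels follow the 3-level scheme.
train = [
    ("me encanta este partido", "P"),
    ("gran gol fantastico", "P"),
    ("odio este arbitro", "N"),
    ("que mal juego horrible", "N"),
    ("hoy juega el equipo", "NEU"),
    ("el partido empieza a las nueve", "NEU"),
]
docs = [text.split() for text, _ in train]
labels = [lab for _, lab in train]
idf = fit_idf(docs)
X = [tfidf(doc, idf) for doc in docs]
models = train_one_vs_all(X, labels)
predictions = [predict(models, x) for x in X]
```

In the clustered setting of the paper, a routine like train_one_vs_all would be called once per
sociolinguistic cluster, on the tweets of that cluster only.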


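
The clustering rules of Sections 3.1 to 3.3 can likewise be sketched as key functions that map a
tweet to its cluster, followed by one training call per cluster. This is an illustrative sketch
under stated assumptions: the record field names (occupation, orientation, aspect, entity, label),
the label "Educacion", and the toy records are hypothetical; the two lookup tables contain only the
assignments mentioned in the text (the full party grouping is in Table 1); and majority_label is a
trivial stand-in for the SVM, used only to keep the example short.

```python
from collections import Counter, defaultdict

# Illustrative lookups only: the full player and party assignments come from the
# corpora and Table 1 and are not reproduced here.
PLAYER_TEAM = {"Xavier Hernández": "F.C. Barcelona"}
PARTY_GROUP = {"Podemos": "left", "Izquierda Unida": "left"}

def task1_key(tweet, mode="occupation"):
    """Task 1: cluster by the author's occupation, orientation, or both."""
    if mode == "both":
        return (tweet["occupation"], tweet["orientation"])
    return (tweet[mode],)

def task2a_key(tweet):
    """Task 2 (a): players collapse into their team; other aspects stay separate."""
    return PLAYER_TEAM.get(tweet["aspect"], tweet["aspect"])

def task2b_key(tweet):
    """Task 2 (b): entity-aspect pairs merged across parties of the same group."""
    return (PARTY_GROUP[tweet["entity"]], tweet["aspect"])

def train_per_cluster(tweets, key_fn, train_fn):
    """Group tweets by cluster key and train one model per cluster."""
    clusters = defaultdict(list)
    for tweet in tweets:
        clusters[key_fn(tweet)].append(tweet)
    return {key: train_fn(members) for key, members in clusters.items()}

def majority_label(members):
    """Stand-in 'classifier' (not the paper's SVM): the most frequent label."""
    return Counter(t["label"] for t in members).most_common(1)[0][0]

# Toy STOMPOL-style records, invented for illustration.
tweets = [
    {"entity": "Podemos", "aspect": "Economia", "label": "N"},
    {"entity": "Izquierda Unida", "aspect": "Economia", "label": "N"},
    {"entity": "Podemos", "aspect": "Educacion", "label": "P"},
]
models = train_per_cluster(tweets, task2b_key, majority_label)
```

Here the two Economia tweets about Podemos and Izquierda Unida fall into the same cluster, which is
how the approach enlarges the training data available for sparsely covered entity-aspect pairs.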

</pre>