=Paper=
{{Paper
|id=Vol-1397/telefonica
|storemode=property
|title=Sentiment Classification Using Sociolinguistic Clusters
|pdfUrl=https://ceur-ws.org/Vol-1397/telefonica.pdf
|volume=Vol-1397
|dblpUrl=https://dblp.org/rec/conf/sepln/Park15
}}
==Sentiment Classification Using Sociolinguistic Clusters==
TASS 2015, September 2015, pp. 99-104. Received 20-07-15, revised 24-07-15, accepted 28-07-15.
Sentiment Classification using Sociolinguistic Clusters
Clasificación de Sentimiento basada en Grupos Sociolingüísticos
Souneil Park
Telefonica Research
souneil.park@telefonica.com
Summary: Sociolinguistic studies suggest a high similarity in the language used by people of the same social class. Recent large-scale analyses of Internet texts using mining techniques support this hypothesis. Attributes such as the author's social class, geolocation, or political affinity affect the use of language in these texts. In our work, we use sociolinguistic information about the author to identify patterns in the expression of sentiment. Our approach expands the scope of analysis from the texts to their authors by using the authors' social class and political affinity. More specifically, we group tweets by authors of similar social class or political affinity and train classifiers independently in order to learn the linguistic style of each group. The approach could be improved in combination with other language processing and machine learning techniques.
Keywords (Spanish version): sociolinguistics, social class, linguistic style, user clustering.
Abstract: Sociolinguistic studies suggest that people with similar social status use language similarly, and recent large-scale computational analyses of online text provide supporting evidence, for example, of the effect of social class, geography, and political preference on language use. We approach the tasks of TASS 2015 with sociolinguistic insights in order to capture patterns in the expression of sentiment. Our approach expands the scope of analysis from the text itself to the authors: their social status and political preference. The tweets of authors with similar social status or political preference are grouped into a cluster, and classifiers are built separately for each cluster to learn the linguistic style of that cluster. The approach can be further improved by combining it with other language processing and machine learning techniques.
Keywords: Sociolinguistics, Social Group, Linguistic Styles, User Clustering.
Published at http://ceur-ws.org/Vol-1397/. CEUR-WS.org is a serial publication with recognized ISSN 1613-0073.

===1 Introduction===
The social aspect of language is an important means of understanding commonalities and differences in language use, as communication is inherently a social activity. Shared ideas and preferences are reflected in language use and are frequently observed in linguistic features such as memes, style, and word choice. The social aspect is also clear in the expression of sentiment, especially in social media. Social media platforms have many elements that encourage the use of similar expressions within social groups. For example, retweets and hashtags facilitate the adoption of expressions, and the short length of messages encourages the use of familiar expressions.

Our approach to the tasks of TASS 2015 (Villena-Román et al., 2015) is based on the insights of sociolinguistics. Specifically, we focus on the effect of social variables on linguistic variation: people who share a similar preference or status may express sentiment more similarly to each other than to others. For each task, we cluster the tweets by people who share some social features (e.g., political orientation, occupation, or football team preference). To capture the style of each sociolinguistic cluster, a classification model is trained separately for each cluster.

While the primary benefit of the approach is that it can distinguish the different styles of
sentiment expression among different social groups, it also mitigates the scale limitation of the training data. For instance, some football players in the Social TV corpus and some entity-aspect pairs in the STOMPOL corpus have a limited number of associated tweets. Clustering them with other tweets written by people with a similar preference expands the amount of data that can be used for training.

The approach can be easily combined with other language processing and machine learning techniques. Since our approach mainly considers the characteristics of the authors rather than the text of the tweets itself, it and more advanced language processing techniques complement each other. In addition, there is much room for future improvement, as the current implementation of our approach uses primitive language processing methods due to the author's limited knowledge of local Spanish.

===2 Related Work===
The increasing availability of large-scale text corpora and the advances in big data processing platforms allow computational analysis of sociolinguistic phenomena. Many works in NLP and computational social science now take the hypotheses of sociolinguistics and other social sciences and test them with online data sets.

In the context of computational analysis of sociolinguistic theories, a number of works have shown the effect of social features on linguistic variation. For example, Eisenstein et al. (2011) observed differences in term frequency depending on the demographics and geographical information of people, and also that these differences in language use can play a significant role in predicting the demographics of authors. Similar studies were conducted with information about occupation (Preotiuc-Pietro et al., 2015) and gender (Wang et al., 2013). There are also works that specifically observed the relation between the expression of sentiment and social variables, for example, daily routine (Dodds et al., 2011) and urban characteristics (Mitchell et al., 2013).

Differences in language use depending on political or ideological preference have been explored as well. In the communication literature, researchers have conceptualized the phenomenon as framing (Scheufele, 1999), and many studies have analyzed how political and social issues are framed differently by media outlets and partisan organizations, and how the framing relates to public perception. Many works apply computational methods for similar purposes and observe differences in language use in various online text data, for example, news articles (Park et al., 2011a), comments (Park et al., 2011b), and discussion forums (Somasundaran and Wiebe, 2009).

===3 System Design===
The classification systems that we have developed for the tasks share the central idea of using sociolinguistic clusters. We describe the system developed for each task below, in order.

The classification tool is identical for all tasks. We use a linear SVM with the Elastic Net regularizer as the classifier. Given a set of tweets, the system trains a binary classifier for each class in a one-vs-all manner and combines them for multi-class classification. The input text of the classifier goes through the TF-IDF bag-of-words transformation. We optionally applied lemmatization and stop-word removal with FreeLing (Carreras et al., 2004) in the system for Task 1.

====3.1 Task 1: General Sentiment Classification====
The corpus of this task includes the tweets of selected famous people and information about them. The information about the people includes their occupation and political orientation.

Our system for this task clusters people based on this information and uses the tweets of each cluster for training. The idea behind the system is that people with the same occupation or political orientation will have similar patterns in the expression of sentiment. A similar idea was tested with English tweets by Preotiuc-Pietro et al. (2015), who predicted the occupation of authors based on
their tweets. For example, journalists may have a certain way of expressing sentiment, which can be different from that of celebrities.

We tested various clusterings of people: by occupation, by political orientation, and by both occupation and political orientation. The system trains a classifier for each cluster, using only the tweets written by the people of that cluster. Depending on the task granularity (5-level or 3-level), the system trains the classifiers accordingly.

====3.2 Task 2 (a): Aspect-based Sentiment Analysis with SocialTV corpus====
Unlike Task 1, the corpus does not have information about the authors; thus, it is not clear how to cluster the tweets. However, the unique characteristic of the topic (the football match between Real Madrid and F.C. Barcelona) and the aspect-sentiment pairs of the tweets provide useful hints about the authors. The rivalry between the two teams suggests that many of the authors prefer one of the two, and the aspect-sentiment pair gives hints about the preferred team. For example, if a tweet discusses Xavier Hernández and its sentiment is positive, it is possible to guess that the author prefers F.C. Barcelona, and that the author will share the sentiment with other fans of F.C. Barcelona, who will commonly share the sentiment towards either F.C. Barcelona or Real Madrid.

Thus, we group the aspects based on team affiliation. The players of each team are grouped as a single entity, and one classifier is developed for each team. The rest of the aspects (e.g., Afición) are not clustered since they do not share a common membership with either of the teams. Classifiers are also developed separately for the rest of the aspects.

====3.3 Task 2 (b): Aspect-based Sentiment Analysis with STOMPOL corpus====
For this task, we cluster tweets at two levels. First, we cluster tweets by entity-aspect pair. Thus, even if two tweets cover the same entity (party), they are treated as covering different topics if the covered aspects are not the same. For example, a tweet about the economic proposal (aspect) of Podemos (entity) is distinguished from a tweet about the education policy (aspect) of Podemos (entity). It is also possible to cluster tweets only by entity; however, we consider both elements for clustering, as all the tweets of the corpus have a specific aspect associated with the entity. In addition, people frequently evaluate a political party in multiple ways regarding different aspects; a person may evaluate the economic policies of Podemos positively but its foreign policies negatively. Theories of political communication, such as agenda setting and framing theory, suggest that people often recognize the parties and issues together when they evaluate the parties.

Second, we further cluster the tweets based on the characteristics of the political parties. For example, following the left vs. right dimension, the tweets about the entity Izquierda Unida and the aspect Economia are grouped with those about Podemos and Economia, as the two parties would be more similar to each other in terms of economic policies than to the parties on the right wing. As a result, 10 clusters are produced (2 party groups x 5 aspects), and a classifier is developed separately for each cluster.

We compared two ways of grouping the parties: the first is the left vs. right dimension as in the example, and the second is the new vs. old dimension, considering the new political landscape of Spain. The details of the party grouping are shown in Table 1.

Table 1: Two groupings of the parties

===4 Results and Discussion===
====4.1 Task 1 General Sentiment Classification (5-levels, Full corpus)====
For this task, we ran three versions of the method: first, clustering the authors by occupation; second, by political orientation; third, by both. We submitted the first version (cluster by occupation) as it performed better
than the other two. The performance metrics are summarized in Table 2. The results and the performance trend were similar for the 1k test set corpus, so we only describe the results for the full corpus.

Table 2. Performance Summary

The breakdown of the performance by sentiment category in Table 3 offers more insights. The performance for the categories NEU and P is worse than that for the other categories. While other optimizations could be made for the two categories, we believe the method can be improved simply by having more examples of those categories in the training set. Compared to the other categories, the current corpus includes far fewer examples for these two categories.

Table 3. Performance of Version 1 (Cluster-by-Occupation) by Sentiment Category

Interestingly, the performance drops further when preprocessing (lemmatization and stop-word removal) is applied to the tweets. This performance drop was observed regardless of the version of our approach. The result suggests that conventional preprocessing removes important linguistic features that are relevant to sentiment expression. Due to the performance drop, we chose not to apply the preprocessing in the following tasks.

====4.2 Task 1 General Sentiment Classification (3-levels, Full corpus)====
We ran the same three versions of the method, and the results are shown in Table 4. The performance is generally higher than in the 5-level classification task. As in the previous result, the version that clusters people by occupation performs better than the other two.

Table 4. Performance Summary

The performance breakdown shows some differences from the previous task. First of all, the performance for the category P is much higher. We believe this is because the number of training examples of this category is higher than in the previous task; the examples of the P+ and P categories are merged together. We also see a similar improvement for the category N. The category NEU still remains a bottleneck. The improvement observed in the categories N and P suggests that a similar improvement may be achieved for the category NEU if there are more examples in the training set.

Table 5. Performance of Version 1 (Cluster-by-Occupation) by Sentiment Category

====4.3 Task 2a Aspect-based Sentiment Analysis with SocialTV corpus====
As described, the approach to this task is to group the tweets by aspects that share the team membership in the training phase. The
performance of the approach is shown in Table 6.

Table 6. Performance Summary

Further analysis is required to understand the effect of the method. The breakdown of the performance by category does not show a clear pattern: the tweets related to some players are identified very accurately while those of some other players are not, and the performance does not differ much depending on the team of the players or the sentiment expressed. We believe a larger test set that has enough samples for all players will better reveal the effect of the approach.

====4.4 Task 2b Aspect-based Sentiment Analysis with STOMPOL corpus====
Two versions of the approach are applied to the task: first, clustering the tweets of the same aspect by parties of the same ideological leaning (left vs. right); second, by the novelty of the parties. The results are shown in Table 7.

Table 7. Performance Summary

The version that groups by the ideological leaning of the parties performed better than the other version. The breakdown of the performance revealed that the approach generally performed better for the tweets that express a negative sentiment. For example, nine of the top-10 categories in terms of F1 score were those expressing a negative sentiment. This is partly because many tweets related to politics convey a negative sentiment; hence there are more training examples with negative sentiment.

===5 Conclusion===
In this paper, we present a sentiment classification method that utilizes sociolinguistic insights. The method is based on the idea that people with similar social status (e.g., occupation) or political orientation may also show similarity in the way they express their sentiment online. Thus, the method focuses on grouping authors with similar taste or occupation. A classifier is developed separately for each group to capture the similarities and differences of expression within the group.

The method achieves an accuracy of around 0.45 and 0.6 for the 5-level and 3-level Task 1 classification, respectively. It achieves 0.63 and 0.56 for the Social TV corpus and the STOMPOL corpus, respectively. The results show that the method performs better for the sentiment classes with more training examples. It can also be further improved by combining it with more language processing methods optimized for Spanish.

===References===
Carreras, X., Chao, I., Padró, L., & Padró, M. 2004. FreeLing: An Open-Source Suite of Language Analyzers. In Proc. of LREC.

Dodds, P. S., Harris, K. D., Kloumann, I. M., Bliss, C. A., & Danforth, C. M. 2011. Temporal patterns of happiness and information in a global social network: Hedonometrics and Twitter. PLoS ONE, 6(12), e26752.

Eisenstein, J., Smith, N. A., & Xing, E. P. 2011. Discovering sociolinguistic associations with structured sparsity. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Mitchell, L., Frank, M. R., Harris, K. D., Dodds, P. S., & Danforth, C. M. 2013. The geography of happiness: Connecting Twitter sentiment and expression, demographics, and objective characteristics of place. PLoS ONE 8: e64417.

Park, S., Ko, M., Kim, J., Liu, Y., & Song, J. 2011a. The politics of comments: predicting political orientation of news stories with commenters’ sentiment patterns. In Proceedings of the ACM conference on Computer supported cooperative work.

Park, S., Lee, K., & Song, J. 2011b. Contrasting opposing views of news articles on contentious issues. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Preotiuc-Pietro, D., Lampos, V., & Aletras, N. 2015. An analysis of the user occupational class through Twitter content. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics: Human Language Technologies.

Scheufele, D. A. 1999. Framing as a theory of media effects. Journal of Communication, 49(1), 103-122.

Somasundaran, S., & Wiebe, J. 2009. Recognizing stances in online debates. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics.

Villena-Román, J., García-Morera, J., García-Cumbreras, M. A., Martínez-Cámara, E., Martín-Valdivia, M. T., & Ureña-López, L. A. 2015. Overview of TASS 2015. In Proceedings of TASS 2015: Workshop on Sentiment Analysis at SEPLN. CEUR-WS.org vol. 1397.

Wang, Y. C., Burke, M., & Kraut, R. E. 2013. Gender, topic, and audience response: an analysis of user-generated content on Facebook. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems.
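
The per-cluster training described in Sections 3.1 and 3.3 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the function names, the dictionary fields (`text`, `label`, `occupation`, `entity`, `aspect`), and the toy tweets are all hypothetical, and the learner is a majority-label placeholder standing in for the linear SVM used in the paper. The only grouping taken from the text is that Izquierda Unida and Podemos fall into the same (left) party group.

```python
from collections import Counter, defaultdict

# Only the Izquierda Unida / Podemos pairing comes from the text; the
# "unassigned" fallback below is an illustrative assumption.
PARTY_GROUP = {"Izquierda Unida": "left", "Podemos": "left"}


def by_occupation(tweet):
    """Task 1 style key: one cluster per author occupation."""
    return tweet["occupation"]


def by_party_group_and_aspect(tweet):
    """STOMPOL style key: a (party group, aspect) pair."""
    return (PARTY_GROUP.get(tweet["entity"], "unassigned"), tweet["aspect"])


def train_majority(texts, labels):
    """Placeholder learner (NOT the paper's SVM): predicts the majority label."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda text: majority


def train_per_cluster(tweets, cluster_key, train_fn=train_majority):
    """Group tweets by cluster_key and train one model per cluster."""
    clusters = defaultdict(list)
    for tweet in tweets:
        clusters[cluster_key(tweet)].append(tweet)
    return {
        key: train_fn([t["text"] for t in members], [t["label"] for t in members])
        for key, members in clusters.items()
    }


def predict(models, tweet, cluster_key, fallback=None):
    """Route a tweet to its cluster's model; return fallback for unseen clusters."""
    model = models.get(cluster_key(tweet))
    return model(tweet["text"]) if model is not None else fallback


# Toy data, invented for illustration only.
task1_tweets = [
    {"text": "gran noticia", "label": "P", "occupation": "journalist"},
    {"text": "buen trabajo", "label": "P", "occupation": "journalist"},
    {"text": "mal dia", "label": "N", "occupation": "journalist"},
    {"text": "que desastre", "label": "N", "occupation": "politician"},
]
stompol_tweets = [
    {"text": "mala propuesta", "label": "N",
     "entity": "Izquierda Unida", "aspect": "Economia"},
    {"text": "no convence", "label": "N",
     "entity": "Podemos", "aspect": "Economia"},
]

task1_models = train_per_cluster(task1_tweets, by_occupation)
stompol_models = train_per_cluster(stompol_tweets, by_party_group_and_aspect)
```

Passing the clustering rule as a key function keeps the training loop the same across tasks, which mirrors the statement that the classification tool is identical for all tasks. In the STOMPOL toy data, the two Economia tweets end up in a single cluster, which is how clustering enlarges the training set for sparse entity-aspect pairs.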
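
Section 3 names the classifier (a linear SVM with the Elastic Net regularizer, trained one-vs-all on TF-IDF vectors) but does not state its objective or hyperparameters. One standard formulation, given here only as an assumption, is shown below. The symbols are introduced for illustration and do not come from the paper: x_i is the TF-IDF vector of tweet i, y_i^{(c)} is +1 if tweet i belongs to class c and -1 otherwise, λ is the regularization strength, and ρ in [0, 1] mixes the L1 and L2 penalties. Combining the binary classifiers by taking the highest-scoring class is likewise a common choice, not a detail confirmed by the text.

```latex
\min_{\mathbf{w}_c,\, b_c}\;
\frac{1}{n}\sum_{i=1}^{n}
  \max\bigl(0,\; 1 - y_i^{(c)}\,(\mathbf{w}_c^{\top}\mathbf{x}_i + b_c)\bigr)
\;+\; \lambda\Bigl(\rho\,\lVert\mathbf{w}_c\rVert_1
  + \frac{1-\rho}{2}\,\lVert\mathbf{w}_c\rVert_2^2\Bigr),
\qquad
\hat{c}(\mathbf{x}) = \arg\max_{c}\,\bigl(\mathbf{w}_c^{\top}\mathbf{x} + b_c\bigr)
```

With ρ = 0 this reduces to the usual L2-regularized linear SVM, and with ρ = 1 to an L1-regularized one; the Elastic Net sits between the two and tends to produce sparse weights over the bag-of-words vocabulary.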