<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>New York City, USA, July</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Exposing a Set of Fine-Grained Emotion Categories from Tweets</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jasy Liew Suet Yan</string-name>
          <email>jliewsue@syr.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Howard R. Turtle</string-name>
          <email>turtle@syr.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Information Studies, Syracuse University Syracuse</institution>
          ,
          <addr-line>New York</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2016</year>
      </pub-date>
      <volume>10</volume>
      <issue>2016</issue>
      <fpage>8</fpage>
      <lpage>14</lpage>
      <abstract>
        <p>An important starting point in analyzing emotions on Twitter is the identification of a set of suitable emotion classes representative of the range of emotions expressed on Twitter. This paper first presents a set of 48 emotion categories discovered inductively from 5,553 annotated tweets through a small-scale content analysis by trained or expert annotators. We then refine the emotion categories to a set of 28 and test how representative they are on a larger set of 10,000 tweets through crowdsourcing. We describe the two-phase methodology used to expose and refine the set of fine-grained emotion categories from tweets, compare the inter-annotator agreement between annotations generated by expert and novice annotators (crowdsourcing) and show that it is feasible to perform fine-grained emotion classification using gold standard data generated from these two phases. Our main goal is to offer a more representative and finer-grained framework of emotions expressed in microblog text, thus allowing study of emotions that are currently underexplored in sentiment analysis.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The ways that individuals express themselves in tweets
provide windows into their emotional worlds. Twitter, a
popular microblogging site with 500 million tweets being
sent a day, is particularly rich with emotion expressions.
These emotion expressions can be harnessed for sentiment
analysis and to build more emotion-sensitive systems. The
availability of tweets has paved the way for studies of how
emotions expressed on microblogs affect stock market
trends [Bollen et al., 2011a], relate to fluctuations in social
and economic indicators [Bollen et al., 2011b], serve as a
measure for the population’s level of happiness [Dodds and
Danforth, 2010], provide situational awareness for both the
authorities and the public in the event of disasters [Vo and
Collier, 2013], and reflect clinical depression [Park et al.,
2012].</p>
      <p>An important starting point in analyzing emotions on
Twitter is the identification of a set of suitable emotion
classes. This set of emotion classes should be representative
of the emotions expressed in tweets. No consensus has
emerged as to how many classes are needed to represent the
emotions expressed in text [Farzindar and Inkpen, 2015].
Previous studies have focused on adapting conventional
emotion theories from psychology to represent emotions
expressed on Twitter and have not attempted to discover the
actual range of emotions expressed or how these emotions
are characterized in tweets. The most commonly
used emotion categories are adopted from the basic emotion
framework: Ekman’s six basic emotions (happiness,
sadness, fear, anger, disgust, and surprise) [Ekman, 1971] or
Plutchik’s eight basic emotions, comprising Ekman’s six
basic emotions plus trust and anticipation
[Plutchik, 1962].</p>
      <p>Instead of borrowing a set of emotion categories from
existing emotion theories in psychology, this paper aims to
expose a set of categories that are representative of the
emotions expressed on Twitter by analyzing the range of
emotions humans can reliably detect in microblog text. Our
main goal is to offer a more representative and finer-grained
framework of emotions expressed in microblog text, thus
allowing study of emotions that are currently underexplored
in sentiment analysis. In this paper, we address the general
research question of what emotions humans can detect in
microblog text. We first uncover the set of emotion
categories inductively from data and then refine that
set into a manageable one that both humans and machine
learning systems are able to detect reliably.</p>
    </sec>
    <sec id="sec-2">
      <title>Theoretical Background</title>
      <p>Generally, we define emotion in text as “a subset of
particularly visible and identifiable feelings” [Besnier, 1990;
Kagan, 1978] that are expressed in written form through
descriptions of expressive reactions (furrowed brow, smile),
physiological reactions (increase in heart rate, teeth
grinding), cognitions (thoughts of abandonment), behaviors
(escape, attack, avoidance) as well as other socially
prescribed sets of responses [Averill, 1980; Cornelius, 1996].
The classification of emotion in text is largely based on two
common models of emotion: 1) the dimensional model, and
2) the categorical model [Calvo and Mac Kim, 2012; Zachar
and Ellis, 2012].</p>
      <p>The dimensional model organizes emotions into more
general dimensions representing the underlying fundamental
structure. Emotions can be identified through the
composition of two or more independent dimensions
[Zachar and Ellis, 2012]. Attempts to identify the
dimensions have been conducted through multidimensional
scaling of human similarity judgments of emotion
expressions based on facial expressions [Abelson and
Sermat, 1962], vocal expressions [Green and Cliff, 1975]
and emotion terms [Russell, 1978]. The two common
dimensions that emerged from these studies are
pleasure-displeasure (valence) and degree of arousal (intensity).
Similar results are reported in semantic differential studies
on emotion terms, with the addition of a third dimension,
dominance-submissiveness [Russell and Mehrabian, 1977].
Valence (also referred to as polarity) classifies emotion as
either being positive, negative or neutral [Alm et al., 2005;
Strapparava and Mihalcea, 2007]. Intensity is somewhat
similar to the degree of arousal although it is generally used
to measure the strength of the emotion (i.e., very weak to
very strong) [Aman and Szpakowicz, 2007]. It can be
operationalized as a nominal variable with labels
representing varying intensities or measured on a numeric
scale.</p>
      <p>The categorical model organizes emotions into categories
that are formed around prototypes. Each emotion category
has a set of distinguishable properties and is assigned a label
that best describes the category (e.g., happy, sad and angry).
The basic emotion framework follows the categorical
model, where emotion is organized and represented using a
category system. Each category represents a prototypical
emotion. Using a hierarchical classification approach,
[Shaver et al., 2001] expanded the basic emotions into 25
finer categories through similarity sorting of 135 emotion
words. These finer categories are more representative of the
emotions that can be expressed using English words.</p>
      <p>The dimensional model offers a more coarse-grained
representation of emotion while the categorical model can
be used to represent emotion at a finer-grained level. In
addition, the categorical model uses emotion labels that are
more intuitive, thus making recognition of the emotion
easier for humans. Therefore, we adopted the categorical
model in line with our goal to develop a fine-grained
emotion taxonomy for microblog text.</p>
    </sec>
    <sec id="sec-3">
      <title>Methodology</title>
      <p>We used content analysis to identify a stable set of emotion
categories that is representative of the range of emotions
expressed in tweets. In Phase 1, a small-scale content analysis was
conducted by training a group of annotators
to annotate a sample of 5,553 tweets. Three tasks were
completed to uncover this set of emotion categories: 1)
inductive coding, 2) card sorting, and 3) emotion word
rating. In Phase 2, we tested the representativeness of the
emotion categories derived from Phase 1 using large-scale
content analysis. Annotations were collected through
crowdsourcing using Amazon Mechanical Turk (AMT).</p>
      <sec id="sec-3-1">
        <title>2.1 Data Collection</title>
        <p>Data consisted of tweets (i.e., microblog posts) retrieved
from Twitter. Four different sampling strategies were used
to retrieve the tweets to be included in the corpus: random
sampling (RANDOM), sampling by topic (TOPIC), and two
variations of sampling by user type (SEN-USER and
AVG-USER). For the RANDOM sample, nine stopwords (the, be,
to, of, and, a, in, that, have) reported to be words most
frequently used on Twitter were used to retrieve tweets.
Topic sampling was done by retrieving tweets that contain
selected topical hashtags or keywords. Sampling by user
type retrieved tweets using selected user names
(@usernames). One user sample contained tweets retrieved
from US Senators (SEN-USER). Tweets from the second
user sample were retrieved using randomly selected user
names (AVG-USER). Tweets were either retrieved using
the Twitter API or acquired from two publicly available data
sets: 1) the SemEval 2014 tweet data set [Nakov et al.,
2013; Rosenthal et al., 2014], and 2) the 2012 US
presidential elections data set [Mohammad et al., 2014]. The
data set containing 15,553 tweets received roughly equal
contribution from each of the four sampling strategies.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2 Phase 1: Small-scale Content Analysis</title>
      </sec>
      <sec id="sec-3-3">
        <title>2.2.1 Task 1: Inductive Coding</title>
        <p>We adapted grounded theory [Glaser and Strauss, 1967] to
expose a set of fine-grained emotion categories from tweets.
This method used inductive coding to derive the
classification scheme through observation of content [Potter
and Levine-Donnerstein, 1999]. Annotators engaged in
three coding activities central to this method: open coding,
axial coding, and selective coding [Corbin and Strauss,
2008]. In open coding, annotators read the content of each
tweet to capture all possible meanings, and took a first pass
at assigning concepts to describe the interpretation of the
data. No restriction was imposed on analysis in this phase, and
minimal instructions were provided to avoid predisposing
annotators. Axial coding then involved the process of
drawing the relationships between concepts and categories.
Based on their knowledge of emotion, annotators started
with a set of self-defined emotion tags. They then met in
groups with the primary researcher to start drawing
relationships between different emotion tags suggested by
individuals in the group. Emotion tags were examined,
accepted, modified, and discarded. Discrete emotion
categories started to form in this phase, and were
systematically applied to more data. Annotators switched
back and forth between axial coding and open coding until a
stable set of categories was identified. Finally, selective
coding represented an integration phase where the identified
discrete categories were further developed, defined and
refined under a unifying theme of emotion. Annotators then
continued to validate the classification scheme by applying
and refining it on more data until no new category emerged.</p>
        <p>Graduate students who were interested in undertaking the
task as part of a class project (e.g., Natural Language
Processing course) or to gain research experience in content
analysis (e.g., independent study) were recruited as
annotators. Annotators were not expected to possess special
skills except for the required abilities to read and interpret
English text. A total of eighteen annotators worked on the
annotation task over a period of ten months. To derive an
emotion framework based on collective knowledge, each
tweet was annotated by at least three annotators. Thus,
annotators were divided into groups of at least three. Each
group was assigned to work on one of the four samples.</p>
        <p>All the annotators went through the same training
procedures to reduce as much as possible the variation
among different individuals. Each annotator first attended a
one-hour training session to discuss the concept of emotion
with the researcher and to receive instructions on how to
perform annotations of the tweets. Annotators were not
given any emotion categories and were asked to suggest the
best-fitting emotion tags or labels to describe the emotion
expressed in each tweet (Example 1). For tweets containing
multiple emotions, annotators were asked to first identify
the primary emotion expressed in the tweet, and then also
include the other emotions observed (Example 2).</p>
        <p>Example 1: Alaska is so proud of our Spartans! The 4-25
executed every mission in Afghanistan with honor &amp; now,
they're home http://t.co/r8pLpnud [Pride]</p>
        <p>Example 2: Saw Argo yesterday, a movie about the 1979
Iranian Revolution. Chilling, sobering, and inspirational at
the same time. [Inspiration, Fear]</p>
        <p>Annotation was done in an iterative fashion. In the first
iteration, also referred to as the training round, all annotators
annotated the same sample of 300 tweets from SEN-USER.
Upon completing the training round, annotators were
assigned to annotate at least 1,000 tweets from one of the
four samples (RANDOM, TOPIC, AVG-USER or
SEN-USER) in subsequent iterations. Every week, annotators
worked independently on annotating a subset of 150 – 200
tweets and then met with the researcher in groups to discuss
disagreements; 100% agreement on the emotion tags was
achieved after discussion. In these weekly meetings, the
researcher also facilitated the discussions among annotators
working on the same sample to merge, remove, and refine
suggested emotion tags. Output of Task 1 included 4,010
annotated tweets in the gold standard corpus and 246
emotion tags.</p>
      </sec>
      <sec id="sec-3-4">
        <title>2.2.2 Task 2: Card Sorting</title>
        <p>Some of the 246 emotion tags were simply morphological
variations and many were semantically similar. Task 2
served as an intermediate step to refine the emotion tags
emerging from data into a more manageable set of higher
level emotion categories. Annotators were asked to perform
a card sorting exercise in different teams to group emotion
tags that are variants of the same root word or semantically
similar into the same category. Annotators were divided into
5 teams, and each team received a pack of 1’ x 5’ cards
containing only the emotion tags used by all the members in
their respective teams.</p>
        <p>Each team consisted of 2 - 3 members who worked on the
same sample. Teams were instructed to follow the four-step
procedure described below:
a) Group all the emotion tags into categories. Members
were allowed to create a “Not Emotion” category if
needed.
b) Create a name for each emotion category. Collectively
pick the most descriptive emotion tag or suggest a new
name to represent each category.
c) Group all the emotion categories based on valence:
positive, negative and neutral.
d) Match emotion categories generated from other teams’
card sorting activities to the emotion categories
proposed by your team.</p>
      </sec>
      <sec id="sec-3-5">
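        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Number of categories proposed by each card sorting team</p>
          </caption>
          <table>
            <thead>
              <tr>
                <th rowspan="2">Team</th>
                <th rowspan="2">Sample</th>
                <th colspan="4">Number of Emotion Categories</th>
              </tr>
              <tr>
                <th>Positive</th>
                <th>Negative</th>
                <th>Neutral</th>
                <th>Total</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>G1</td><td>SEN-USER</td><td>8</td><td>13</td><td>2</td><td>23</td></tr>
              <tr><td>G2</td><td>TOPIC</td><td>16</td><td>14</td><td>5</td><td>35</td></tr>
              <tr><td>G3</td><td>TOPIC</td><td>16</td><td>18</td><td>8</td><td>42</td></tr>
              <tr><td>G4</td><td>AVG-USER</td><td>14</td><td>18</td><td>15</td><td>47</td></tr>
              <tr><td>G5</td><td>RANDOM</td><td>14</td><td>16</td><td>9</td><td>39</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-6">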
        <p>Members in the same team were allowed to discuss their
decisions with each other during the card sorting exercise
with minimal intervention from the researcher. The session
concluded when all members completed the four-step
procedure and reached a consensus on final groupings of the
emotion tags. No limit was placed on the number of
categories or the number of emotion tags within each
category so the number of categories proposed varied across
the five teams as shown in Table 1. Some teams decided to
put the emotion tags into fewer higher-level categories,
while others who chose to capture more subtle emotions
generated more emotion categories. Finally, the researcher
merged, divided, and verified the final emotion categories to
be included in the classification scheme.</p>
        <p>Once the final 48 emotion categories shown in Table 2
were identified (see Emotion-Category-48 column), the
original emotion tag labels generated from the open coding
exercise were systematically replaced by the appropriate
emotion category labels. Annotators then incrementally
annotated more tweets (150 - 200 tweets per round) to
ensure that a point of saturation was reached. No new
emotion category emerged from data in this coding phase.
Another 1,543 annotated tweets with gold labels were added
to the corpus.</p>
      </sec>
      <sec id="sec-3-7">
        <title>2.2.3 Task 3: Emotion Word Rating</title>
        <p>We found it methodologically challenging and
time-consuming to provide rigorous training to a large number of
annotators in order to grow the size of the corpus with 48
emotion categories. A word rating study was conducted as a
systematic method to merge and distill the number of
categories into a more manageable set. The motivation
behind the word rating study came from prior studies
showing that emotion words with greater similarity tend to
be in close proximity to one another on a two-dimensional
pleasure and degree of arousal space [Russell, 1980]. In
order to plot our emotion categories in this two-dimensional
space, we collected the pleasure and arousal ratings for each
emotion category. A set of 50 emotion words was selected
for the emotion rating task. We included the 48 emotion
category names and added 2 emotion words that were
deemed to be more appropriate category names than the
ones determined by the annotators in Task 2. These two
emotion words were “longing” for the category “yearning”
and “torn” for “ambivalence”.</p>
        <p>To obtain a complete set of pleasure and arousal ratings
for our set of 50 emotion words, we conducted an emotion
word rating study on AMT. We adapted the instrument that
was used in [Bradley &amp; Lang, 1999] to collect the ratings.
We implemented the study using exactly the same 9-point
scale for the pleasure and arousal ratings. The validity of the
scales is described in [Bradley &amp; Lang, 1994]. The same
set of instructions was reused but modified to fit the
crowdsourcing context.</p>
        <p>Human raters were recruited from the pool of workers
available on AMT. The rating instrument was offered to the
workers via a Human Intelligence Task (HIT), and workers
received payment of US$ 0.20 upon completion and
approval of the HIT. HITs were restricted to workers in the
US to increase the likelihood that ratings came from native
English speakers. Each respondent first read the instructions
on how to use the pleasure and arousal scales. Respondents
were then instructed to make a pleasure rating and an
arousal rating for each of the 50 emotion words.</p>
        <p>After removing incomplete and rejected responses, mean
rating and standard deviation were computed from 76 usable
responses.</p>
        <p>[Figure 1: Plot of the 50 emotion words by normalized pleasure rating (displeasure to pleasure) and normalized arousal rating (calm to aroused); both axes run from 0.0 to 1.0.]</p>
        <p>Figure 1 shows the plot for all 50 emotion words based
on AMT ratings normalized using feature scaling. Emotion
categories that are semantically related and relatively close
to one another on the plot were merged. The
merge process involved some subjective judgment and
reduced the number of emotion categories from 48 to the
final set of 28 shown in Table 2 (see Emotion-Category-28
column). The category name “ambivalence” was replaced by
its more descriptive member term, “torn”, and “yearning”
was replaced by “longing”. Also, two emotion categories
from the original 48, “desire” and “lust” were dropped
altogether from the final set of 28 because it is not clear that
they should be considered separate emotional states [Ortony
and Turner, 1990]. Based on their conceptualization in our
annotation scheme, they were considered to be more general
feelings of wanting rather than distinct emotional states.
Finally, the 48 emotion category labels in the corpus were
systematically replaced by the corresponding 28 emotion
category labels.</p>
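        <p>The rating summary and normalization described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that “feature scaling” means min-max rescaling of the mean ratings to the range 0 to 1 (consistent with the axis range of Figure 1), and the ratings and the names <monospace>summarize</monospace> and <monospace>min_max_scale</monospace> are hypothetical.</p>
        <preformat><![CDATA[
```python
from statistics import mean, stdev

# Hypothetical 9-point ratings from four raters per word. The study itself
# used 76 usable responses for each of the 50 emotion words.
ratings = {
    "happiness": {"pleasure": [8, 9, 8, 7], "arousal": [6, 7, 6, 5]},
    "boredom": {"pleasure": [3, 2, 3, 4], "arousal": [2, 1, 2, 3]},
    "anger": {"pleasure": [2, 1, 2, 3], "arousal": [8, 9, 7, 8]},
}


def summarize(word_ratings):
    """Return {scale: (mean, sample standard deviation)} for one word."""
    return {scale: (mean(vals), stdev(vals)) for scale, vals in word_ratings.items()}


def min_max_scale(values):
    """Rescale numbers linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


words = sorted(ratings)
mean_pleasure = [mean(ratings[w]["pleasure"]) for w in words]
mean_arousal = [mean(ratings[w]["arousal"]) for w in words]

# Each word becomes a point (pleasure, arousal) in the unit square.
scaled = dict(zip(words, zip(min_max_scale(mean_pleasure), min_max_scale(mean_arousal))))
```
]]></preformat>
        <p>Words whose scaled points lie close together are then candidates for merging, subject to the semantic check described above.</p>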
        <p>The set of 28 categories is derived from the corpus and is
a “good” representation of the set of emotions expressed
therein. It is substantially more refined than the traditional 5
to 8 category set yet is small enough that human annotators
are comfortable with the distinctions.</p>
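        <p>The relabeling step from 48 to 28 categories can be illustrated with a small lookup table. The sketch below is an assumption-laden illustration: it encodes only the merges that can be read from Table 2 and from the text of this section, so it is partial, and dropping “desire” and “lust” labels from a tweet is our assumption about how those tweets were handled. The names <monospace>MERGE_48_TO_28</monospace>, <monospace>DROPPED</monospace> and <monospace>relabel</monospace> are hypothetical.</p>
        <preformat><![CDATA[
```python
# Partial mapping: categories that are not listed keep their own name.
MERGE_48_TO_28 = {
    "annoyance": "anger",
    "displeased": "anger",
    "disappointment": "anger",
    "confusion": "doubt",
    "torn": "doubt",
    "ambivalence": "doubt",  # renamed to "torn", which is listed under doubt
    "anticipation": "excitement",
    "amazement": "fascination",
    "dread": "fear",
    "worry": "fear",
    "pleased": "happiness",
    "yearning": "longing",
}

# Categories removed from the final set (assumption: their labels are dropped).
DROPPED = {"desire", "lust"}


def relabel(labels_48):
    """Map a tweet's 48-category labels to 28-category labels, without duplicates."""
    merged_labels = []
    for label in labels_48:
        key = label.lower()
        if key in DROPPED:
            continue
        merged = MERGE_48_TO_28.get(key, key)
        if merged not in merged_labels:
            merged_labels.append(merged)
    return merged_labels
```
]]></preformat>
        <p>For example, <monospace>relabel(["Worry", "Dread"])</monospace> collapses both labels into the single category fear.</p>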
      </sec>
      <sec id="sec-3-8">
        <title>2.3 Phase 2: Large-scale Content Analysis</title>
        <p>Manual annotations for an additional 10,000 tweets were
obtained using AMT in Phase 2. For the emotion tag, workers
were given a set of 28 emotion categories to choose from
plus an “other” option with a text box so they could suggest
a new emotion tag when none of the listed emotion
categories was applicable. The order in which the emotion
categories were presented to the workers was randomized
across the four samples in order to control for order effect.
If a tweet was flagged as containing multiple emotions,
annotators were asked to provide all relevant emotion tags.</p>
        <p>Recruitment of workers was done through Human
Intelligence Tasks (HITs) on the online AMT platform.
AMT workers must fulfill at least the basic requirement of
being able to read and understand English text. We set the
HIT approval rate for all requesters’ HITs to greater than or
equal to 95% and the number of HITs approved to greater
than or equal to 1000 to increase the probability of
recruiting first-rate workers.</p>
        <p>In the design of the HIT, workers were provided clear and
simple instructions describing the task, the annotation site
link, as well as a batch id required to retrieve a subset of 30
tweets to work on. Of the 30 tweets in each HIT, 25 were
new tweets and 5 were gold standard tweets intended to be
used for quality control. Each HIT was assigned to three
different annotators. Each HIT bundled a different subset of
30 tweets so a worker could attempt more than one HIT.
Workers were paid US$ 0.50 for every completed and
approved HIT containing 30 tweets.</p>
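        <p>The HIT layout described above (25 new tweets plus 5 gold standard tweets per HIT) can be sketched as a simple batching routine. This is an illustration under stated assumptions: how the gold tweets were chosen for each HIT is not described, so random sampling from a gold pool is assumed, and the pool size, the seed and the name <monospace>build_hits</monospace> are hypothetical.</p>
        <preformat><![CDATA[
```python
import random


def build_hits(new_tweets, gold_tweets, new_per_hit=25, gold_per_hit=5, seed=0):
    """Bundle tweets into HITs of new_per_hit new tweets plus gold_per_hit gold tweets."""
    rng = random.Random(seed)
    hits = []
    for start in range(0, len(new_tweets), new_per_hit):
        batch = list(new_tweets[start:start + new_per_hit])
        # Assumption: gold tweets are drawn at random for every HIT.
        batch += rng.sample(gold_tweets, gold_per_hit)
        # Shuffle so workers cannot tell gold tweets from new ones by position.
        rng.shuffle(batch)
        hits.append(batch)
    return hits


new_tweets = [f"new-{i}" for i in range(10000)]   # placeholder tweet ids
gold_tweets = [f"gold-{i}" for i in range(200)]   # hypothetical gold pool
hits = build_hits(new_tweets, gold_tweets)
```
]]></preformat>
        <p>Under these assumptions every new tweet appears in exactly one HIT, and each HIT would then be assigned to three different workers.</p>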
      </sec>
      <sec id="sec-3-9">
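        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr>
                <th>Emotion-Category-28</th>
                <th>Emotion-Category-48</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Admiration</td><td>Admiration</td></tr>
              <tr><td>Amusement</td><td>Amusement</td></tr>
              <tr><td>Anger</td><td>Anger, Annoyance, Displeased, Disappointment</td></tr>
              <tr><td>Boredom</td><td>Boredom</td></tr>
              <tr><td>Confidence</td><td>Confidence</td></tr>
              <tr><td>Curiosity</td><td>Curiosity</td></tr>
              <tr><td>Desperation</td><td>Desperation</td></tr>
              <tr><td>Doubt</td><td>Doubt, Confusion, *Torn</td></tr>
              <tr><td>Excitement</td><td>Excitement, Anticipation</td></tr>
              <tr><td>Exhaustion</td><td>Exhaustion</td></tr>
              <tr><td>Fascination</td><td>Fascination, Amazement</td></tr>
              <tr><td>Fear</td><td>Fear, Dread, Worry</td></tr>
              <tr><td>Gratitude</td><td>Gratitude</td></tr>
              <tr><td>Happiness</td><td>Happiness, Pleased</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-3-10">
        <table-wrap>
          <label>Table 2 (continued)</label>
          <table>
            <thead>
              <tr>
                <th>Emotion-Category-28</th>
              </tr>
            </thead>
            <tbody>
              <tr><td>Hate</td></tr>
              <tr><td>Hope</td></tr>
              <tr><td>Indifference</td></tr>
              <tr><td>Inspiration</td></tr>
              <tr><td>Jealousy</td></tr>
              <tr><td>Longing</td></tr>
              <tr><td>Love</td></tr>
              <tr><td>Pride</td></tr>
              <tr><td>Regret</td></tr>
              <tr><td>Relaxed</td></tr>
              <tr><td>Sadness</td></tr>
              <tr><td>Shame</td></tr>
              <tr><td>Surprise</td></tr>
              <tr><td>Sympathy</td></tr>
            </tbody>
          </table>
        </table-wrap>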
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3 Inter-annotator Agreement</title>
      <p>emotion categories in Phase 1 is 0.50. Emotion annotation
especially at a fine-grained level is a subjective and difficult
task. It is possible to generate reliable data when annotators
are given sufficient training. With limited training, α in
Phase 2 decreases almost by half to 0.28.</p>
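      <p>For readers who want to reproduce an agreement figure of this kind, the sketch below computes Krippendorff's α for nominal labels. It rests on two assumptions that the text above does not confirm: that α denotes Krippendorff's alpha, and that each annotator contributes a single label per tweet (tweets with multiple emotion tags would need a different treatment). The function name and the toy labels are hypothetical.</p>
      <preformat><![CDATA[
```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists, one inner list per item, holding the labels that
    the annotators gave to that item. Items with fewer than two labels are
    ignored because they carry no agreement information.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        # Every ordered pair of labels from different annotators counts 1 / (m - 1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n < 2:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        # Only one category was ever used; treated as full agreement here.
        return 1.0
    return 1 - observed / expected


toy_units = [["joy", "joy"], ["joy", "joy"], ["fear", "fear"], ["joy", "fear"]]
toy_alpha = krippendorff_alpha_nominal(toy_units)
```
]]></preformat>
      <p>In the toy data, three of the four items are labeled identically by both annotators, and the resulting α is 8/15, roughly 0.53.</p>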
      <p>For Phase 1, all disagreements were first resolved through
discussion with expert annotators. Essentially, expert
annotators achieved 100% agreement in Phase 1. In Phase 2,
about one third of the tweets had full agreement for emotion
tag among all annotators (32%). To avoid throwing away
any data, the researcher manually reviewed all annotations
and resolved the disagreements. Such effort was deemed
necessary to reduce as much noise as possible in the corpus,
and to ensure that the classification schemes were applied
consistently across the two phases of data collection. As
in Phase 1, each tweet in Phase 2 was assigned final
labels for emotion category.</p>
    </sec>
    <sec id="sec-5">
      <title>4 Emotion Distribution</title>
      <p>Slightly over half (51%) of the tweets contain emotion.
Table 3 shows that the frequencies of the emotion
categories are imbalanced. Of the 28 emotion categories, the full corpus
(Phase 1 and Phase 2) contains the most instances of
happiness (12%) and the fewest instances of jealousy
(0.2%). Only 9 categories have fewer than 100 instances. The
frequency distributions of the emotion categories in Phase 1,
Phase 2 and Phase 1 + Phase 2 are roughly similar.</p>
      <p>The corpus contains a large proportion of tweets tagged
with a single emotion category (92%) and only 8% of tweets
tagged with more than one emotion category. Although
tweets containing multiple emotions represent only 8% of
the corpus, including such tweets in the corpus leads to an
overall increase of over 40% in the number of positive examples
(i.e., instances of an emotion category).</p>
    </sec>
    <sec id="sec-6">
      <title>5 Comparing Machine Learning Results from Phase 1 and Phase 2</title>
      <p>Since a tweet might be assigned multiple emotion
categories, we frame the problem as a multi-label
classification task. A separate binary classifier was built for
each emotion category to detect whether that category was
present or absent in a tweet (emotion X or not emotion X).</p>
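      <p>The one-classifier-per-category framing above amounts to turning each tweet's label set into one binary target per emotion category. A minimal sketch follows; the function name <monospace>to_binary_targets</monospace> and the three toy tweets (modeled on Examples 1 and 2 plus a tweet without emotion) are hypothetical.</p>
      <preformat><![CDATA[
```python
def to_binary_targets(tweet_labels, categories):
    """Return {category: [1 or 0 per tweet]}, one target vector per binary classifier."""
    return {
        category: [1 if category in labels else 0 for labels in tweet_labels]
        for category in categories
    }


# One label set per tweet; an empty set means no emotion was tagged.
tweet_labels = [{"pride"}, {"inspiration", "fear"}, set()]
categories = ["fear", "inspiration", "pride"]
targets = to_binary_targets(tweet_labels, categories)
```
]]></preformat>
      <p>Here the second tweet is a positive example for both the fear and the inspiration classifiers, which is how tweets with multiple emotions add positive examples.</p>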
      <p>We conducted a wide range of classification experiments
to better understand the impact of classifier and feature set
selection on classification accuracy [Liew, 2016]. We
present here results for a single representative selection:
Sequential Minimal Optimization (SMO), an SVM variant
[Platt, 1998] trained with features that include unigrams
occurring three or more times in the corpus that are
stemmed and lowercased. Classifiers were evaluated using
ten-fold cross validation.</p>
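      <p>The feature set and evaluation protocol described above can be sketched in plain Python. The sketch covers only the unigram vocabulary (lowercased, stemmed, occurring three or more times) and the ten-fold split; it does not train an SVM. Several details are assumptions: the toy suffix stripper stands in for whatever stemmer was actually used, features are taken to be binary presence indicators, and folds are formed without stratification. All function names are hypothetical.</p>
      <preformat><![CDATA[
```python
import re
from collections import Counter


def crude_stem(token):
    """Toy suffix stripper; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def tokenize(text):
    """Lowercase the text, split it into unigrams and stem each one."""
    return [crude_stem(t) for t in re.findall(r"[a-z0-9#@']+", text.lower())]


def build_vocabulary(texts, min_count=3):
    """Keep unigrams that occur at least min_count times in the whole corpus."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    return sorted(tok for tok, c in counts.items() if c >= min_count)


def vectorize(text, vocabulary):
    """Binary presence vector over the vocabulary (assumption: not counts)."""
    present = set(tokenize(text))
    return [1 if tok in present else 0 for tok in vocabulary]


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) so each item is tested exactly once."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, test


texts = [
    "So proud of our team",
    "Proud moment for the team",
    "The team is proud and happy",
    "Feeling bored",
]
vocabulary = build_vocabulary(texts)
```
]]></preformat>
      <p>With the toy corpus only “proud” and “team” reach the frequency threshold, so every tweet is represented by a two-dimensional vector; each binary classifier would then be trained and tested on the index pairs produced by <monospace>k_fold_indices</monospace>.</p>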
      <p>The precision, recall and F1 for SMO across Phase 1,
Phase 2 and Phase 1 + Phase 2 are shown in Table 3. A
general upward trend in precision (P), recall (R) and F1 is
observed across the three data sets. There are two key
takeaways from our preliminary experiments. First, using
the combined data from Phase 1 and Phase 2 generally yields higher
performance than using Phase 1 or Phase 2 data alone. For a majority
of the emotion categories, the classifiers
achieved similar performance using the gold
standard data generated in Phase 1 and in Phase 2, respectively.
Second, classifiers provided with more training examples
usually produce higher overall performance, as evidenced by
higher F1 when larger data sets are used. However, the results for
individual emotion categories show that more data does
not always lead to higher performance. The classifiers may
behave differently depending on the linguistic
characteristics of the category. More experiments will be
conducted in future work to identify the salient linguistic
features for each emotion category.</p>
    </sec>
    <sec id="sec-7">
      <title>6 Conclusion</title>
      <p>We describe a two-phase methodology to uncover a set of
28 emotion categories representative of the emotions
expressed in tweets. There are two main contributions: 1)
the introduction of an emotion taxonomy tailored to emotion
expressed in text and 2) the development of a gold standard
corpus that can be used to train and evaluate more
fine-grained emotion classifiers.</p>
      <p>The set of 28 emotions is derived using an integrative
view of emotion and grounded in linguistic expressions of
emotion in text. In Phase 1, inductive coding was first used
to expose a set of emotion categories from 5,553 tweets.
The categories were then further merged and refined using
card sorting and emotion word rating. In Phase 2, we then
tested the representativeness of the emotion categories on a
larger data set of 10,000 tweets using crowdsourcing. No
new emotion categories emerged from Phase 2, indicating
that the 28 emotion categories are sufficient to capture the
richness of emotional experiences expressed in tweets.
However, the classifiers perform poorly on some categories
such as confidence, desperation, doubt and indifference. We
intend to perform a closer examination of the low
performing categories to determine if they should be
removed.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank the annotators who volunteered to perform the
annotation task. We are grateful to Dr. Elizabeth D. Liddy
for her insights into the study.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Abelson and Sermat, 1962]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Abelson</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Vello</given-names>
            <surname>Sermat</surname>
          </string-name>
          .
          <article-title>Multidimensional Scaling of Facial Expressions</article-title>
          .
          <source>Journal of Experimental Psychology</source>
          ,
          <volume>63</volume>
          (
          <issue>6</issue>
          ):
          <fpage>546</fpage>
          -
          <lpage>554</lpage>
          ,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
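          [Alm et al., 2005]
          <string-name>
            <given-names>Cecilia</given-names>
            <surname>Alm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dan</given-names>
            <surname>Roth</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Richard</given-names>
            <surname>Sproat</surname>
          </string-name>
          .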
          <article-title>Emotions from Text: Machine Learning for Text-Based Emotion Prediction</article-title>
          .
          <source>In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing</source>
          , pages
          <fpage>579</fpage>
          -
          <lpage>586</lpage>
          , Stroudsburg, PA, USA,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Aman and Szpakowicz, 2007]
          <string-name>
            <given-names>Saima</given-names>
            <surname>Aman</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Stan</given-names>
            <surname>Szpakowicz</surname>
          </string-name>
          .
          <article-title>Identifying Expressions of Emotion in Text</article-title>
          .
          <source>In Text, Speech and Dialogue</source>
          , pages
          <fpage>196</fpage>
          -
          <lpage>205</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Averill, 1980]
          <string-name>
            <given-names>James R.</given-names>
            <surname>Averill</surname>
          </string-name>
          .
          <article-title>A Constructivist View of Emotion</article-title>
          .
          <source>Emotion: Theory, Research, and Experience</source>
          ,
          <volume>1</volume>
          :
          <fpage>305</fpage>
          -
          <lpage>339</lpage>
          , Academic Press, New York,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Besnier, 1990]
          <string-name>
            <given-names>Niko</given-names>
            <surname>Besnier</surname>
          </string-name>
          .
          <article-title>Language and Affect</article-title>
          .
          <source>Annual Review of Anthropology</source>
          ,
          <volume>19</volume>
          :
          <fpage>419</fpage>
          -
          <lpage>451</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Bollen et al., 2011a]
          <string-name>
            <given-names>Johan</given-names>
            <surname>Bollen</surname>
          </string-name>
          , Huina Mao, and
          <string-name>
            <given-names>Xiaojun</given-names>
            <surname>Zeng</surname>
          </string-name>
          .
          <article-title>Twitter Mood Predicts the Stock Market</article-title>
          .
          <source>Journal of Computational Science</source>
          ,
          <volume>2</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Bollen et al., 2011b]
          <string-name>
            <given-names>Johan</given-names>
            <surname>Bollen</surname>
          </string-name>
          , Alberto Pepe, and
          <string-name>
            <given-names>Huina</given-names>
            <surname>Mao</surname>
          </string-name>
          .
          <article-title>Modeling Public Mood and Emotion: Twitter Sentiment and Socio-Economic Phenomena</article-title>
          .
          <source>In Proceedings of the 5th International AAAI Conference on Weblogs and Social Media</source>
          , pages
          <fpage>450</fpage>
          -
          <lpage>453</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Bradley et al., 1994]
          <string-name>
            <given-names>Margaret M.</given-names>
            <surname>Bradley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <article-title>Measuring Emotion: The Self-Assessment Manikin and the Semantic Differential</article-title>
          .
          <source>Journal of Behavior Therapy and Experimental Psychiatry</source>
          ,
          <volume>25</volume>
          (
          <issue>1</issue>
          ):
          <fpage>49</fpage>
          -
          <lpage>59</lpage>
          ,
          <year>1994</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Bradley and Lang, 1999]
          <string-name>
            <given-names>Margaret M.</given-names>
            <surname>Bradley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Peter J.</given-names>
            <surname>Lang</surname>
          </string-name>
          .
          <article-title>Affective Norms for English Words (ANEW): Instruction Manual and Affective Ratings</article-title>
          . University of Florida:
          <source>Technical Report C-1</source>
          , The Center for Research in Psychophysiology,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Calvo and Mac Kim, 2012]
          <string-name>
            <given-names>Rafael A.</given-names>
            <surname>Calvo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Sunghwan</given-names>
            <surname>Mac Kim</surname>
          </string-name>
          .
          <article-title>Emotions in Text: Dimensional and Categorical Models</article-title>
          .
          <source>Computational Intelligence</source>
          ,
          <volume>29</volume>
          (
          <issue>3</issue>
          ):
          <fpage>527</fpage>
          -
          <lpage>543</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Corbin and Strauss, 2008]
          <string-name>
            <given-names>Juliet</given-names>
            <surname>Corbin</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anselm</given-names>
            <surname>Strauss</surname>
          </string-name>
          .
          <article-title>Basics of Qualitative Research: Techniques and Procedures for Developing Grounded Theory</article-title>
          . Sage,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Cornelius, 1996]
          <string-name>
            <given-names>Randolph R.</given-names>
            <surname>Cornelius</surname>
          </string-name>
          .
          <article-title>The Science of Emotion: Research and Tradition in the Psychology of Emotions</article-title>
          . Upper Saddle River, Prentice Hall, New Jersey,
          <year>1996</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Dodds and Danforth, 2010]
          <string-name>
            <given-names>Peter S.</given-names>
            <surname>Dodds</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Christopher M.</given-names>
            <surname>Danforth</surname>
          </string-name>
          .
          <article-title>Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents</article-title>
          .
          <source>Journal of Happiness Studies</source>
          ,
          <volume>11</volume>
          (
          <issue>4</issue>
          ):
          <fpage>441</fpage>
          -
          <lpage>456</lpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Ekman, 1971]
          <string-name>
            <given-names>Paul</given-names>
            <surname>Ekman</surname>
          </string-name>
          .
          <article-title>Universals and Cultural Differences in Facial Expressions of Emotion</article-title>
          .
          <source>Nebraska Symposium on Motivation</source>
          ,
          <volume>19</volume>
          :
          <fpage>207</fpage>
          -
          <lpage>283</lpage>
          ,
          <year>1971</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Farzindar and Inkpen, 2015]
          <string-name>
            <given-names>Atefeh</given-names>
            <surname>Farzindar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Diana</given-names>
            <surname>Inkpen</surname>
          </string-name>
          .
          <article-title>Natural Language Processing for Social Media</article-title>
          .
          <source>Synthesis Lectures on Human Language Technologies</source>
          ,
          <volume>8</volume>
          (
          <issue>2</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>166</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Glaser and Strauss, 1967]
          <string-name>
            <given-names>Barney G.</given-names>
            <surname>Glaser</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anselm L.</given-names>
            <surname>Strauss</surname>
          </string-name>
          .
          <article-title>The Discovery of Grounded Theory: Strategies for Qualitative Research</article-title>
          . Aldine Publishing, Chicago,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Green and Cliff, 1975]
          <string-name>
            <given-names>Rex S.</given-names>
            <surname>Green</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Norman</given-names>
            <surname>Cliff</surname>
          </string-name>
          .
          <article-title>Multidimensional Comparisons of Structures of Vocally and Facially Expressed Emotion</article-title>
          .
          <source>Perception &amp; Psychophysics</source>
          ,
          <volume>17</volume>
          (
          <issue>5</issue>
          ):
          <fpage>429</fpage>
          -
          <lpage>438</lpage>
          ,
          <year>1975</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Kagan, 1978]
          <string-name>
            <given-names>Jerome</given-names>
            <surname>Kagan</surname>
          </string-name>
          .
          <article-title>On Emotion and Its Development: A Working Paper</article-title>
          .
          <source>In The Development of Affect</source>
          , pages
          <fpage>11</fpage>
          -
          <lpage>41</lpage>
          , Genesis of Behavior 1,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Liew, 2016]
          <string-name>Jasy Liew Suet Yan</string-name>
          .
          <article-title>Fine-Grained Emotion Detection in Microblog Text</article-title>
          . Syracuse, NY, USA: Syracuse University,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Mohammad et al., 2014]
          <string-name>
            <given-names>Saif</given-names>
            <surname>Mohammad</surname>
          </string-name>
          , Xiaodan Zhu, and Joel Martin.
          <article-title>Semantic Role Labeling of Emotions in Tweets</article-title>
          .
          <source>In Proceedings of the ACL 2014 Workshop on Computational Approaches to Subjectivity, Sentiment, and Social Media</source>
          , pages
          <fpage>32</fpage>
          -
          <lpage>41</lpage>
          , Baltimore, MD, USA,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Nakov et al., 2013]
          <string-name>
            <given-names>Preslav</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Zornitsa Kozareva, Alan Ritter, Sara Rosenthal, Veselin Stoyanov, and Theresa Wilson.
          <article-title>SemEval-2013 Task 2: Sentiment Analysis in Twitter</article-title>
          .
          <source>In Proceedings of the 7th International Workshop on Semantic Evaluation</source>
          ,
          <volume>2</volume>
          :
          <fpage>312</fpage>
          -
          <lpage>320</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Ortony and Turner, 1990]
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ortony</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Terence J.</given-names>
            <surname>Turner</surname>
          </string-name>
          .
          <article-title>What's Basic about Basic Emotions?</article-title>
          <source>Psychological Review</source>
          ,
          <volume>97</volume>
          (
          <issue>3</issue>
          ):
          <fpage>315</fpage>
          -
          <lpage>331</lpage>
          ,
          <year>1990</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Park et al., 2012]
          <string-name>
            <given-names>Minsu</given-names>
            <surname>Park</surname>
          </string-name>
          , Chiyoung Cha, and
          <string-name>
            <given-names>Meeyoung</given-names>
            <surname>Cha</surname>
          </string-name>
          .
          <article-title>Depressive Moods of Users Portrayed in Twitter</article-title>
          .
          <source>In Proceedings of the ACM SIGKDD Workshop on Healthcare Informatics</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Platt, 1998]
          <string-name>
            <given-names>John C.</given-names>
            <surname>Platt</surname>
          </string-name>
          .
          <article-title>Fast Training of Support Vector Machines Using Sequential Minimal Optimization</article-title>
          .
          <source>In Advances in Kernel Methods</source>
          , pages
          <fpage>41</fpage>
          -
          <lpage>65</lpage>
          , MIT Press,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Plutchik, 1962]
          <string-name>
            <given-names>Robert</given-names>
            <surname>Plutchik</surname>
          </string-name>
          .
          <article-title>The Emotions: Facts, Theories, and a New Model</article-title>
          . Studies in Psychology, Random House, New York,
          <year>1962</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [Potter and Levine-Donnerstein, 1999]
          <string-name>
            <given-names>W. James</given-names>
            <surname>Potter</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Deborah</given-names>
            <surname>Levine-Donnerstein</surname>
          </string-name>
          .
          <article-title>Rethinking Validity and Reliability in Content Analysis</article-title>
          .
          <source>Journal of Applied Communication Research</source>
          ,
          <volume>27</volume>
          (
          <issue>3</issue>
          ):
          <fpage>258</fpage>
          -
          <lpage>284</lpage>
          ,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Rosenthal et al., 2014]
          <string-name>
            <given-names>Sara</given-names>
            <surname>Rosenthal</surname>
          </string-name>
          , Preslav Nakov, Alan Ritter, and
          <string-name>
            <given-names>Veselin</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          .
          <article-title>SemEval-2014 Task 9: Sentiment Analysis in Twitter</article-title>
          .
          <source>In Proceedings of the 8th International Workshop on Semantic Evaluation</source>
          , pages
          <fpage>73</fpage>
          -
          <lpage>80</lpage>
          . Dublin, Ireland,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Russell, 1978]
          <string-name>
            <given-names>James A.</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>Evidence of Convergent Validity on the Dimensions of Affect</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          ,
          <volume>36</volume>
          (
          <issue>10</issue>
          ):
          <fpage>1152</fpage>
          -
          <lpage>1168</lpage>
          ,
          <year>1978</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Russell, 1980]
          <string-name>
            <given-names>James A.</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>A Circumplex Model of Affect</article-title>
          .
          <source>Journal of Personality and Social Psychology</source>
          ,
          <volume>39</volume>
          (
          <issue>6</issue>
          ):
          <fpage>1161</fpage>
          -
          <lpage>1178</lpage>
          ,
          <year>1980</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Russell and Mehrabian, 1977]
          <string-name>
            <given-names>James A.</given-names>
            <surname>Russell</surname>
          </string-name>
          , and
          <string-name>
            <given-names>A.</given-names>
            <surname>Mehrabian</surname>
          </string-name>
          .
          <article-title>Evidence for a Three-Factor Theory of Emotions</article-title>
          .
          <source>Journal of Research in Personality</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>273</fpage>
          -
          <lpage>294</lpage>
          ,
          <year>1977</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [Shaver et al., 2001]
          <string-name>
            <given-names>Phillip</given-names>
            <surname>Shaver</surname>
          </string-name>
          , Judith Schwartz, Donald Kirson, and
          <string-name>
            <given-names>Cary</given-names>
            <surname>O'Connor</surname>
          </string-name>
          .
          <article-title>Emotion Knowledge: Further Exploration of a Prototype Approach</article-title>
          .
          <source>In Emotions in Social Psychology</source>
          , pages
          <fpage>26</fpage>
          -
          <lpage>56</lpage>
          . Psychology Press,
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [Strapparava and Mihalcea, 2007]
          <string-name>
            <given-names>Carlo</given-names>
            <surname>Strapparava</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Rada</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          .
          <article-title>SemEval-2007 Task 14: Affective Text</article-title>
          .
          <source>In Proceedings of the 4th International Workshop on Semantic Evaluations</source>
          , pages
          <fpage>70</fpage>
          -
          <lpage>74</lpage>
          . Prague,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [Vo and Collier, 2013]
          <string-name>
            <given-names>Bao-Khanh H.</given-names>
            <surname>Vo</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Nigel</given-names>
            <surname>Collier</surname>
          </string-name>
          .
          <article-title>Twitter Emotion Analysis in Earthquake Situations</article-title>
          .
          <source>International Journal of Computational Linguistics and Applications</source>
          ,
          <volume>4</volume>
          (
          <issue>1</issue>
          ):
          <fpage>159</fpage>
          -
          <lpage>173</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [Zachar and Ellis, 2012]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Zachar</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ralph D.</given-names>
            <surname>Ellis</surname>
          </string-name>
          .
          <article-title>Categorical versus Dimensional Models of Affect: A Seminar on the Theories of Panksepp and Russell</article-title>
          . Vol.
          <volume>7</volume>
          . John Benjamins Publishing Company,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>