Comparing Brand Perception Through Exploratory Sentiment Analysis in Social Media

Mario Cichonczyk and Carsten Gips
Bielefeld University of Applied Sciences, Minden, Germany
firstname.lastname@fh-bielefeld.de

Abstract. The presented student project outlines a natural language processing pipeline for brand metric comparison in the Twitter ecosystem. Sentiment calculation for an unlabeled data set is demonstrated and calibrated using the statistical Central Limit Theorem as guidance to anchor the sentiment indicator in a homogeneous market. The process is evaluated by comparing the sentimental market performance of three leading German logistics companies. The results support the value of sentiment analysis for automated, real-time customer feedback analysis.

1 Introduction

The brand philosophy behind a business is usually a driving principle of the entrepreneurial actions it follows. Ideally, these actions culminate in a brand strategy and a finely tuned marketing mix to acquire market share and establish brand awareness and perception. The success of these marketing efforts can be measured by evaluating the time-delayed return on investment of associated profit margins. This approach has a deficit in explanatory power, as it lacks fine-grained insight into the complex effects of diversely faceted, multi-channel marketing and brand positioning methods. Therefore, marketing research relies on qualitative and quantitative analyses and surveying techniques for a more sophisticated evaluation of marketing investment impact. Targeted studies with resource expenditure are employed to answer specific questions of subjective brand perception. With technological development and progress, new approaches may be introduced to increase the efficiency of effect monitoring and thereby reduce inertia in strategic realignment according to market feedback.
Since social media is becoming increasingly established in everyday life [18], intelligence can be gathered through a new and essentially cost-free feedback channel [12]. While social media marketing is concerned with the public relations effort directed toward the customer, the same platforms allow for an inversion of communication from consumer to business. The presented work explores how brand perception can be measured and compared by making use of natural language processing in the Twitter ecosystem. To achieve this objective, the hashtag space of leading German logistics companies was analyzed and collected as an exemplary use case. An analytics pipeline was constructed and applied to the dataset to better understand the customer base and approximate overall consumer sentiment towards the logistics brands, linking a scalar metric to consensus opinion as outlined in [3]. This approach can show the value of sentiment analysis in social media analysis for customer satisfaction research of single companies, and it also allows for contextualization with regard to competitors. As social media is essentially a means of open many-to-many communication, the same pipeline can be applied to feedback aimed at other brands and will then allow a more empirical comparison.

Copyright (c) 2020 by the paper's authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2 Approach

The main contribution of this work aims to show how current ideas in natural language processing may be applied to take advantage of automated social media brand perception analysis. We would like to outline our course of action and the thought process used in this applied research project, as it may inspire other research towards the stated problem case. For this purpose, two typical marketing questions were chosen as examples to demonstrate the approach: 1. How do leading German consumer logistics brands rank in Twitter user approval? 2.
How can Twitter meta information augment user approval analysis?

Related works predominantly make use of machine learning on labeled data with the primary goal of method testing and evaluation [28][15][26]. These approaches perform well regarding categorization into sentiment classes, but they cannot be directly transferred to the investigated problem without manual data annotation as a pre-step for model training. Real-world data is unlabeled and seed datasets do not yet exist for the highly specific domain of German logistics, seemingly producing an unsupervised clustering problem with feature engineering and selection in focus.

The specific characteristics of brand perception analysis allow for a third option. Munoz & Kumar [23] point out that perceptional metrics help to gauge the effectiveness of brand-building activities across points of customer interaction. Brand profiling can be achieved by fixing an indicator within a metric of a market and then comparing this indicator to a company's brand and its competitors [23]. Therefore, the aim of the presented approach is to acquire such a polarity indicator. Tools, models and methods from both supervised and unsupervised natural language classification and processing become available with this modified problem definition. The constructed data pipeline adheres to this premise and is presented in detail in the following paragraphs.

2.1 Collection and Transformation

Data acquisition was implemented using the Tweepy^1 library for the Python programming language to interact with the commercially available Twitter API^2 for developers. Tweepy was chosen for its native ability to handle rate-limited free access tokens and to dynamically adjust traffic bandwidth. To stay within these limits and to focus on customer opinion, only the top three logistics companies for courier, express and parcel services in Germany were selected, which as of 2018 are DHL, Hermes and DPD [13].
Therefore, the Tweepy filter "#dhl OR #hermes OR #dpd OR dhl OR hermes OR dpd" was constructed and 10594 tweets were collected over three weeks in the winter of 2018, limited to tweets with a set "German" language flag. The Twitter API returns tweets in JSON format, containing a large number of partly redundant data fields. All JSON data was parsed and attributes of interest to the research question were selected and named accordingly. The selected and renamed fields were "usr_id", "tweet_id", "usr_followers", "timestamp", "favorites", "retweets", "client", "hashtags" and the "text" itself. After transformation, the tweets were stacked as rows to form a 10594 x 9 matrix M.

2.2 Data Exploration, Filtering and Feature Construction

First exploration and inspection of the dataset showed that the Tweepy filter was not sufficient to separate the communications concerning the three selected logistics brands in a satisfactory manner. The data contained tweets which did not directly relate to the domain of interest. It became clear that a more in-depth analysis of the communication topics was necessary to further sanitize the results. Tweet topic modeling through the utilization of keyword annotations - better known as "hashtags" - was identified to hold valuable advantages for the solution of this problem [31][29]. Based on the work presented by Wang, Wei & Zhang [30], a graph model was defined by the set of all hashtags H = {h_0, h_1, ..., h_m} contained within the dataset, where each hashtag h_i represents a node weighted by its global occurrence count and is associated with the set of tweets T_i = {τ_0, τ_1, ..., τ_n} in which it occurs. The set of edges E consists of a link between two hashtags if they co-occur in the same tweet. The weight of an edge e_ij between h_i and h_j is incremented for each co-occurrence of h_i and h_j.
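This co-occurrence model can be sketched with the networkx library; the function name and the input shape (one hashtag list per tweet) are illustrative assumptions, not the project's actual code:

```python
from itertools import combinations

import networkx as nx

def build_hashtag_graph(tweet_hashtags):
    """Build the weighted hashtag co-occurrence graph HG = {H, E}.

    tweet_hashtags: iterable of hashtag lists, one list per tweet.
    Node weight = global occurrence count of the hashtag;
    edge weight = number of tweets in which two hashtags co-occur.
    """
    hg = nx.Graph()
    for tags in tweet_hashtags:
        for h in tags:  # every occurrence increments the node weight
            if hg.has_node(h):
                hg.nodes[h]["weight"] += 1
            else:
                hg.add_node(h, weight=1)
        for hi, hj in combinations(sorted(set(tags)), 2):
            if hg.has_edge(hi, hj):  # increment e_ij per co-occurrence
                hg[hi][hj]["weight"] += 1
            else:
                hg.add_edge(hi, hj, weight=1)
    return hg

# Export for layout in external graph software, e.g.:
# nx.write_gexf(build_hashtag_graph(hashtag_column), "hashtags.gexf")
```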
The graph model HG = {H, E} was used to isolate the logistics subgraph of interest, which reduced T, and thereby the rows of M, to only hold tweets relevant to the research. The "hashtags" column of M was transformed to HG and stored in GEXF format^3. This step made it possible to import the hashtag graph into existing graph analysis software^4, providing access to pre-optimized methods.

1 https://github.com/tweepy/tweepy
2 https://developer.twitter.com/
3 https://networkx.github.io/documentation/stable/reference/readwrite/gexf.html
4 https://gephi.org/

The Yifan Hu multilevel layout algorithm was chosen for its speed and good quality on large graphs [11] to achieve topic modeling. With this technique, all node and edge weights are used as force simulations of attraction and repulsion, creating a dispersed planar graph projection. Clusters of hashtags in frequent co-occurrence form topic subgraphs while unrelated topics drift apart. After the embedding step, topics weakly connected to logistics can be visually identified. The collective topics "dhl", "dpd" and "hermes" were found to form a distinct subgraph with minor outliers regarding "jobs" topics. Stronger interrelations to unrelated themes existed for the single topic "hermes". This effect can be attributed to the ambiguity of the term: subgraphs concerning "fashion" and "export politics" were intertwined with tweets about the logistics company. The set of hashtags identifying those outlying tweets was added as a filter to remove non-logistics rows from M, resulting in an on-topic dataset.

Tweets which did not resemble a consumer opinion were filtered out; advertisements and news, for example, were undesirable data as they would distort the sentiment analysis. Therefore, the column "client" was investigated. Under the assumption that consumer opinion is predominantly voiced through consumer client software, other agents can be ignored.
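A sketch of this client-based pruning; the mapping entries shown here are hypothetical examples, whereas the actual project mapped every user agent observed in the data by hand:

```python
# Hypothetical excerpt of the hand-made client mapping (the real
# mapping covered all user agents found in the "client" column).
CLIENT_CATEGORY = {
    "Twitter for Android": "Android",
    "Twitter for iPhone": "iOS",
    "Twitter Web Client": "Desktop",
    "IFTTT": "Other",              # multi-platform, no single device
    "Hootsuite": "Professional",   # social media management software
}

def is_consumer_tweet(client):
    """Keep a row only if its client agent is not a professional tool."""
    return CLIENT_CATEGORY.get(client, "Other") != "Professional"
```

Unknown agents default to "Other" here; in the project every agent was categorized explicitly.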
All user agents of the "client" column were manually mapped to the categories "Android", "iOS", "Desktop", "Other" and "Professional". "Other" represents multi-platform clients which cannot directly be associated with a single platform. The "Professional" label identifies all user agents known to be developed for commercial usage, e.g. tweet automation or social media management software. All "Professional" rows were removed. Commercial users who rely on consumer client software are exempt from this pruning step. These were further analyzed by the count of their followers. For all rows in M, the unique "usr_id" fields were collected and then ranked by their "usr_followers" value. Since the Twitter follower count adheres to a power law [22], manually inspecting the top followed profiles was sufficient to mark opinion-bearing commercial accounts for removal and neutralize their influence on global sentiment. For professional tweets which may have remained in the data after all filters, it was assumed that their opinion influx was significantly outweighed by the now dominant consumer-class tweets.

To finally conclude filtering and feature construction, M was extended by the columns "is_dhl", "is_dpd", "is_hermes", "hour" and "weekday". With these extra columns, searching and querying the dataset for the following steps is faster and more convenient. Finding tweets associated with specific brands is done by simple boolean masking, resulting in desired subsets of M without having to repeatedly parse and search tweet content for each query. Aggregation over time windows is sped up by pre-splitting the complex timestamp format of the Twitter API. After the outlined data sanitation treatment, actual text analysis was possible.

2.3 Text Preprocessing

The Natural Language Toolkit (NLTK) offers the basic functionalities necessary for natural language processing [2].
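For illustration, a preprocessing pass of the kind described next can be approximated with a few lines of regex-based Python; the inlined stop word subset merely stands in for NLTK's full German stop word list, which the actual pipeline used:

```python
import re

# Small stand-in for the German NLTK stop word list plus the
# domain stop words identified later via term frequency.
STOPWORDS = {"mein", "ist", "von", "und", "der", "die", "das",
             "dhl", "dpd", "hermes", "paket", "mal", "fuer"}
UMLAUTS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def preprocess(text):
    """Tokenize a tweet: strip URLs, convert umlauts, drop stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # remove URLs
    for umlaut, ascii_form in UMLAUTS.items():
        text = text.replace(umlaut, ascii_form)
    tokens = re.findall(r"[a-z]+", text)  # keep textual tokens only
    return [t for t in tokens if t not in STOPWORDS]
```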
As such, it was used to pre-process the tweet content formulated by Twitter users. The "text" column was tokenized and all single tokens of interest were added as a new "tokens" column, containing a list of tokens for each tweet. All tokens were stripped of non-textual information, URLs were removed, umlauts converted and otherwise undesirable information filtered out. The sanitized tokens represent all German words of a tweet, but not all words hold analytic information. NLTK provides a list of stop words for several languages, including German. Accordingly, all tokens were checked against the German NLTK stop word list and removed if they were considered a match. Afterwards, the term frequency of all tokens was calculated to identify other possible stop words. Several high-ranking tokens were found to lack value for analysis: "dhl, dpd, hermes, paket, mal, dass, schon, kommt, immer, seit, fuer". These were added to the stop word list and also removed. Brand name tokens are redundant as their information is already stored in M (see Section 2.2). With the token list constructed, semantic polarity estimation can be examined on a word level.

2.4 Token Polarity

Since algorithms do not by default have any understanding of the emotional impact of word semantics, sentiment analysis relies on human consensus opinion. Databases with word polarities annotated in [-1, 1] for negative and positive sentiment respectively are used to look up a scalar value for a given token. For the German language, such a dictionary exists in the form of the SentiWS [27] project. As a first, naive approach, the pipeline's sentiment resolver tries to annotate the tokens of M directly through a query to SentiWS. If the word is present in the dictionary, no further search is required and its sentiment can be returned. SentiWS contains about 3500 basic forms and 34000 inflections. Despite this size, less than 15% of all tweet tokens were found in SentiWS.
This result is not surprising given the combination of language syntax complexity, a raw tweet token count of more than 80000 and the effort involved in dictionary building. It was expected that only a low number of entries would directly match. Therefore, each token that could not be resolved in SentiWS is forwarded to the NLTK German stemmer. Stemming is the process of reducing a complex word, which might be an inflection, to its basic form [19]. This step can be understood as a simplification of the token, or in a more technical sense even a functional projection. The goal is to project the token space sample in such a way that its transformation aligns with the SentiWS target space. Thereby, a morphological alternative is potentially found and might allow for a sentiment lookup. Increasing the number of token morphemes using this method also increased polarity coverage.

If the stemmer was not able to find a morpheme in SentiWS, syntactical alternative search is exhausted. Then, the actual meaning of the token can be used to find synonyms whose sentiment is known. Liebeck [17] found the introduction of semantic equivalence to be of advantageous value for more thorough sentiment analysis and referred to synonym search in synset databases like GermaNet [10]. Mohtarami et al. [21] explored and alternatively suggested the use of vector-based approaches for the same purpose. They observed that WordNet [20] - and therefore GermaNet as a descendant - performs satisfactorily when semantic synonyms are searched but lacks accuracy when sentimental equivalence is the primary metric. To accommodate this problem, they introduced emotional features of words to construct their vectors. The key insight for the presented work is the necessity of a more general human-like language intelligence to identify word alternatives beyond pure semantic synonyms.
Webber [32] came to the same conclusion and presented a proof-of-concept disambiguation and language analysis system trained on half a million Wikipedia articles instead of a domain-specific corpus. The results suggested superior context-based general language processing capabilities. Therefore, a similar approach was chosen. Instead of synonym search in statically assembled synsets, a Word2Vec model trained on Wikipedia articles was used to find alternatives for tokens which neither themselves nor their stemmed variants could be resolved through SentiWS. Word2Vec was chosen due to its proven performance [8] regarding this purpose and because a large (650 million words), pre-trained, general Wikipedia knowledge model already exists^5 for the gensim^6 Word2Vec implementation. Training a similarly large model would not have been possible for the purposes of this student project.

The lookup process was constructed as follows. Gensim is used to retrieve a vector space embedding for the token, e.g. the word "house". As stated above, this vector representation shares a topological vicinity with its contextual synonyms, e.g. "building" or "home". The aim is to find a spatially close synonym which can be associated with an entry in SentiWS. Starting from the "house" embedding, its neighbouring entries are probed in order of increasing distance. For example, "building" may be highlighted as the closest approximation to "house". The word "building" is therefore chosen as an alternative candidate. This candidate is then tested against SentiWS and if a polarity value can be retrieved, the candidate is selected. If, in this example, "building" is not a valid alternative, it is rejected and the process continues with the next neighbour in increasing distance, e.g. "home". The new candidate is tested in the same manner and can be successfully matched with a sentiment, leading to its selection as a valid alternative to "house".
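The probing loop just described can be sketched as follows; here sentiws is assumed to be a plain word-to-polarity dict and neighbors a caller-supplied wrapper around gensim's KeyedVectors.most_similar, so the logic can be exercised without the full model (the stemming fallback is omitted for brevity):

```python
def resolve_sentiment(token, sentiws, neighbors, max_candidates=10):
    """Resolve a token's polarity via SentiWS, falling back to
    embedding neighbours probed in order of increasing distance.

    sentiws:   dict mapping words to polarity values in [-1, 1]
    neighbors: callable (token, n) -> list of (word, similarity)
               pairs sorted by decreasing similarity, e.g. a thin
               wrapper around gensim's KeyedVectors.most_similar
    """
    if token in sentiws:  # direct hit, no search necessary
        return sentiws[token]
    for candidate, _similarity in neighbors(token, max_candidates):
        if candidate in sentiws:  # first resolvable neighbour wins
            return sentiws[candidate]
    return 0.0  # no alternative found: treat the token as neutral
```

With the pre-trained German model loaded into a gensim KeyedVectors instance `kv`, neighbors could simply be `lambda w, n: kv.most_similar(w, topn=n)`.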
The algorithm encapsulates the same process a human would employ when thinking about other phrasings of the same utterance. Figuratively speaking, the amount of imagination necessary to come up with another phrase is continually incremented until a suitable rephrasing is found. This can of course lead to misleading levels of synonymity if done ad infinitum. Therefore, it was decided to penalize the retrieved sentiment by the distance between the original token vector and its alternative, similarly to Kim & Shin [14].

5 https://github.com/devmount/GermanWordEmbeddings
6 https://radimrehurek.com/gensim/models/word2vec.html

The complete algorithm to resolve the sentiment value for a token vector t, with D(s, t) denoting the embedding distance between s and t, can be defined as follows:

sentiment(t) = max_{s ∈ SentiWS ∩ Word2Vec} SentiWS(s) / D(s, t)    (1)

Note that this algorithm implicitly neutralizes alternatives if they are too broadly associated synonyms:

lim_{D(s,t) → ∞} sentiment(t) = 0    (2)

The end behaviour of sentiment(t) ensures the absence of polarity distortion in synonym search. All tokens in the data were labeled using this process.

2.5 Sentiment Weighting and Analysis

After the tokens of M were given an emotional weight on a word level as outlined in Section 2.4, further analysis on a sentence level can proceed. Fang & Zhan [5] summarize that every word of a sentence has its syntactic role which defines how the word is used. These roles, also known as parts of speech, have significant impact on the importance of their underlying sentiment for the polarity of the complete sentence. For example, words like pronouns usually do not contain any sentiment and are therefore neutral. In contrast, verbs or adjectives can hold different weights respectively [5]. Part-of-speech taggers are used to classify words according to their syntactic role. The NLTK tagger class has been extended for the German language and trained on the TIGER corpus [4] in a different project^7, achieving an accuracy of 98% as stated by the authors.
This effective tagger was chosen for the presented work due to its good performance, generalization capabilities and fast integration into NLTK. After POS processing, all tokens were labeled according to the STTS [33] tag system. Nichols & Song [25] have examined the relationship between scalar sentiment, part of speech and overall sentence polarity. They empirically compared the influence of POS strengths on classifier performance and approximated an optimal solution. Their exhaustive search over the set POS = {noun, verb, adjective, adverb} and the strength weights str(POS_i) ∈ {1, 2, 3, 4, 5} has shown that the best performance for purposes of sentiment analysis was achieved with the following scalar weight vector:

(str(noun) = 2, str(verb) = 3, str(adv) = 4, str(adj) = 5)    (3)

A mapping from the STTS tags to the categories utilized by Nichols & Song was introduced to ensure compatibility between the German tagger and their weighting approach. Afterwards, all tokens were accentuated according to their syntactical sentence function, resulting in increased sentiment variance and therefore more expressive overall tweet polarity.

As a last step before the culminating conflation of all individual token polarities per tweet, negations need to be handled as they significantly influence the calculated emotion through their valency scopes [6].

7 https://github.com/ptnplanet/NLTK-Contributions/tree/master/ClassifierBasedGermanTagger

Two primary ways of negation handling were tested: syntax scope analysis and a heuristic approach. Carrillo et al. [1] proposed that superior performance is achieved if the negation scope is determined by examining the valence subtree of the negation token based on part-of-speech association. After successfully labeling each word with a POS tag, the grammatical syntax reveals the subtree which is supposed to be negated and therefore inversely influential on sentiment.
While this approach would grant realistic language sentiment, it presupposes that the syntax tree is immaculate. Especially for Twitter, this is rarely the case. Gui et al. [9] found that the Twitter culture of mutual communication is inherently comprised of non-standard orthography, and reconstructing an approximately valid syntax tree requires substantial effort. Their findings could be confirmed here and argued against the syntactical negation handling proposed by Carrillo et al. [1] for practical application in the presented project. For this reason, the more widely used heuristic solution was employed [6]. The German tagger was able to reliably identify the negation token itself (e.g. "nicht") and labeled it accordingly with the fitting STTS tag. This label gave an anchor to which a rule-based negation heuristic could be expediently attached. Inspired by the syntactical solution, the heuristic successively searches for the next token with a sentiment that has been weighted by its tag (see the beginning of this section). The rationale behind the algorithm is that the sentiment-bearing successor feature is assumed to be the most likely target for negation. Samples suggested that this heuristic rule performs sufficiently well in relation to the goals of the analysis. After all negations were handled, it was finally possible to propose an estimated polarity per tweet. Similarly to the work of Kumar & Sebastian [16], sentiment was calculated by summing the weighted and - if necessary - negated token polarity scalars. The resulting values were added as a new column to M.

2.6 Scale Calibration

Having calculated a value which one might consider "sentiment" is not enough for actual market analyses, for two reasons:

1. The scale - while argumentatively coherent and grounded in the outlined rationale - can be understood as a valid indicator, yet it is still arbitrarily defined.
Its definition is sound, but given that the scale is supposed to measure levels of human emotion, it needs validation. Such a test would require human evaluation, altering the problem to a supervised interpretation.

2. As stated in Section 2, Munoz & Kumar [23] emphasize indicator fixation and anchoring within the metric of a market to achieve brand profiling. Only then is empirical comparison to the calibrated indicator, and thereby to competitive brands, feasible.

These two problems seemingly demand further research and evaluation. However, it is argued that their combination allows for a use-case-specific solution if the fundamental nature of the underlying data is exploited by utilizing the broad scope of opinions being uttered on Twitter. This characteristic permits the introduction of the established statistical Central Limit Theorem (CLT). The CLT describes the convergence of the distribution of sums (or means) of an increasing number of one- or multi-dimensional random variables to a normal distribution [7]. For a public opinion surveying purpose as presented, the CLT leads to a beneficial conclusion: if a sufficiently large number of unrelated, random sample opinions are gathered from a sample population, the overall sample mean will be normally distributed around the population mean. Furthermore, if the opinion distribution is limited to the interval [-1, 1] by the pre-conceived sentiment constraint and, additionally, baseline polarity is assumed to be neutral, all essential properties of the expected opinion distribution are known in advance without human intervention for validation.
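This reasoning is easy to illustrate: treating each tweet's polarity as the mean of many independent token polarities (here assumed to be uniformly distributed, purely for illustration), a small simulation shows the tweet-level means clustering normally around the neutral population mean:

```python
import random
import statistics

random.seed(0)

# Simulate 5000 "tweets", each averaging 20 independent token
# polarities drawn uniformly from [-1, 1]. Per the CLT, the
# tweet-level means approach a normal distribution with mean 0
# and standard deviation sqrt((1/3) / 20), roughly 0.129.
tweet_means = [
    statistics.mean(random.uniform(-1, 1) for _ in range(20))
    for _ in range(5000)
]

print(round(statistics.mean(tweet_means), 3))   # close to 0
print(round(statistics.stdev(tweet_means), 3))  # close to 0.129
```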
To exploit this reasoning for the calibration of the proposed polarity estimation process, a second Twitter dataset GT (for Ground Truth) was collected using the Tweepy filter "#2018 OR #2019 OR #december OR #january" and processed through the same pipeline as M, leading to a broad, dispersed set of tweets unrelated to any specific topic. As these tweets cover a wide range of independent themes and conversational domains, it can be reasoned that the global population sentiment characteristics behave according to the CLT. This reasoning provides the missing link between the "arbitrarily" constructed polarity estimation pipeline and actual market sentiment, resulting in the desired indicator described by Munoz & Kumar. Establishing the connection mathematically is possible in a multitude of ways, as long as the link adheres to the following formalism. Since the global sentiment distribution of the general dataset GT should follow the listed constraints, its histogram Φ(GT_polarity) should resemble the normal distribution as closely as possible. As such, the aim is to find a projection of GT_polarity which minimizes the error between the histogram and the normal distribution. If such a projection is found, it acts as the calibration metric for the analytics pipeline. Thereupon, the calibrated projection can be used on the actual logistics dataset to infer class labels in relation to the opinion of the general population. If the distribution is discretized at the quartile boundaries, half of all opinions fall into the central area between the first quartile (Q1) and the third quartile (Q3). These will be considered "neutral" and make up the overall majority. A quarter of all opinions fall left of Q1 and will be considered negatively extreme; their class label is "negative". Lastly, the remaining data points fall beyond Q3 and are hence labeled "positive" as they express positively extreme sentiment.
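A minimal version of this discretization, assuming plain polarity lists rather than the project's matrix M:

```python
import statistics

def label_by_quartiles(baseline, polarities):
    """Assign class labels relative to the ground-truth baseline:
    below Q1 -> "negative", above Q3 -> "positive", else "neutral".
    """
    q1, _q2, q3 = statistics.quantiles(baseline, n=4)
    return [
        "negative" if p < q1 else "positive" if p > q3 else "neutral"
        for p in polarities
    ]
```

Labels for the logistics tweets would then be obtained by passing the GT polarities as the baseline and the polarities of M as the values to label.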
This baseline polarity will be the ground truth reference, and M can be labeled in the same way, using the absolute sentiment boundaries dictated by GT. Subsequently, sentiment analysis and classification are concluded and evaluation of logistics opinions is finally possible.

3 Analysis

For evaluation, the questions put forward in Section 2 were answered using the constructed pipeline. After discretization, the Twitter class label distribution is normally distributed and zero-centered; the IQR markers force the calibration into the CLT assumption. Therefore, specialized tweet topics can be compared by calculating the relative distance between the class label tendencies.

3.1 "How do leading German consumer logistics brands rank in Twitter user approval?"

For the complete logistics dataset M, the class labels deviate from the Twitter baseline GT. Neutral sentiment is 11.92% less present in tweets relating to logistics, while positive and negative sentiment are 4.58% and 7.34% above baseline respectively. This observation of increased variance shows that users communicate with higher emotional tendencies towards the topic. It can be concluded that opinions regarding the logistics domain are mostly stated more vigorously. To reduce selection bias, the logistics brands must therefore be compared exclusively within their domain. Otherwise, their relative ranking would be distorted by the overall preconceived notions of opinion. Appendix Figure 1 (top) visualizes this distortion. At first glance, all three brands perform with high emotional response, skewed towards negative feedback. This issue is the result of the distributional relationship to Φ(GT_polarity). Drawing the conclusion that the three specific brands perform badly on Twitter is not precise, as the entire domain generally provokes the shown response. Due to this implication, a more accurate baseline indicator for performance ranking is the sentiment histogram of the logistics domain.
All brands react differently and more truthfully to this metric and better conclusions can be drawn, as presented in Appendix Figure 1 (middle). Therein, it can be seen that the relative ranking is now more expressive. DHL performs better than its competitors within the domain, having fewer negative class labels and more positive class labels than average. DPD and HERMES are negatively skewed beyond average expectation, performing worse than DHL. HERMES exclusively falls behind in both negative (more than average) and positive (less than average) opinions.

3.2 "How can Twitter meta information augment user approval analysis?"

The presented comparison solely relies on single tweet content and thus individual sentiment. The Twitter API grants access to information beyond pure textual data. Correlating the meta data to polarity can yield clarified insight. For example, one of the defining functionalities of Twitter is the ability to like and/or share ("retweet") opinions of other users. The implications for sentiment statistics are pivotal. Nagarajan, Purohit & Sheth [24] observed different levels of endorsement by peer users depending on tweet positivity. Hence, weighting tweet sentiment by the amount of shared and approved peer affirmation links argumentative popularity to brand performance indication. Furthermore, if a large enough dataset is gathered, sentiment can be followed along the chain of retweets, forming an interesting graph traversal problem. It could be mapped out how positive and negative sentiment propagate through the Twitter ecosystem and how these multiplicative patterns differ between brands. For the presented project, the dataset is not large enough to reconstruct such patterns. Nonetheless, each entry of M does contain the integer counts of likes and retweets, and these values hold analytic value. Based on the arguments above, the two integers were summed per row and added as a new column labeled "Propagation".
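One way to fold this column into the aggregation, assuming each like or retweet counts as one additional voice alongside the author (this exact weighting scheme is our illustrative choice, not necessarily the project's):

```python
def propagation_weighted_polarity(rows):
    """Aggregate tweet polarities weighted by their endorsement.

    rows: iterable of (polarity, favorites, retweets) triples.
    Each tweet counts once for its author plus once per endorsement.
    """
    weighted_sum = total_weight = 0.0
    for polarity, favorites, retweets in rows:
        propagation = 1 + favorites + retweets
        weighted_sum += polarity * propagation
        total_weight += propagation
    return weighted_sum / total_weight
```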
The new value represents the number of people sharing the view expressed in the tweet and can be used as a weight vector for sentiment aggregation. The results of Nagarajan, Purohit & Sheth were observable here as well. Our observations indicate that tweets in the logistics domain are generally shared less often than baseline, but if they are shared, they are more likely negative in contrast to GT. Due to this finding, the class distributions change significantly if polarity propagation is factored into relational brand metrics, as shown in Appendix Figure 1 (bottom). User approval leads to considerable amplification of disparity. DPD and HERMES shift their distributions to pronounced neutrality: they both reduce negative sentiment slightly but also decrease in positive sentiment by a large factor. In contrast, DHL overwhelmingly profits from the shared user opinions. Negative and neutral sentiment labels are shifted to an 11.04% increase in positive sentiment.

In addition to likes and retweets, the data also contains the timestamp at which a tweet was written. Especially in the logistics business, time plays an important role and should therefore be targeted analytically. In Appendix Figure 2, a heatmap of aggregated class labels per day and hour, averaged in mean rows and columns, is shown. For aggregation, the labels were interpreted as {-1, 0, 1} for {negative, neutral, positive} respectively. In GT, it can be observed that negative sentiment correlates with business hours. Furthermore, negativity peaks towards the end of the business week. Intuitively, non-delivery at the beginning of the weekend may induce disappointment, as the customer would have to wait past the work-free days until the next business day. Such a hypothesis could be further evaluated if domain knowledge is introduced. For the individual brands, different observations stand out:

1. DHL: Generally, DHL performs above average on Mondays.
Sentiment then declines as the week progresses. Negativity peaks at the weekend.
2. HERMES: Best performance is expected on Tuesdays. Daily peak negativity shifts with weekly progression. Customers tend to express negative feedback incrementally in later hours, peaking at 20:00 hours. This relation may be connected to the longer business hours employed by HERMES.
3. DPD: No obvious pattern is present except a sentiment low on Mondays at 14:00 hours. Otherwise, the data correlates with the baseline.

The underlying tweet texts and their themes were investigated at the emphasized days and hours but did not reveal any apparent common denominator explaining their occurrence beyond their mere temporal correlation. These time distributions could serve more appropriate analyses for research with added knowledge of the internal structures of the different companies. Then, more precise assumptions could be made about the cause of the observed patterns. The different client agents were also investigated in relation to sentiment but did not show any discernible correlation.

4 Conclusion

The outlined process has shown how powerful Twitter can be for sentiment analysis in brand perception polling. Especially when meta data is introduced, insights far beyond classical polling techniques become apparent. The different logistics brands have shown distinguishable approval performance, and the inclusion of Twitter meta data was beneficial for inquiring into these variations. Acquiring professional API access may be costly at first, but the increase in data volume and quality can add more capabilities, such as instant metrics or even precise geolocation, a variable directly linkable to geographical key performance indicators in logistics. As such, the real-time data pipeline could be integrated into business intelligence and monitoring systems for anomaly detection and cause-effect analysis.
Furthermore, natural language processing methods were employed to demonstrate their value for modern-day feedback evaluation. In-depth studies into the many different ways of approaching language processing problems may highlight even more fruitful pipeline steps. Building a domain-specific language corpus should be the first obstacle to tackle in that direction. The current lack thereof discouraged the usage of many techniques such as supervised machine learning. The same holds for the limited size of the tested data set. If more comprehensive collection were possible, access to other ways of statistical description and model building would open up. Especially in regard to finer recognition of irony and sarcasm, further improvement of the pipeline is required. In its current form, sarcasm was not adequately recognized. Relative brand metric comparison is still valid, as all brands suffered from this deficit in the same way. Sarcasm seems to be inherently linked to strongly opinionated utterances, which is a reason why the proposed workflow did not rely on emoticon recognition for label inference as other works suggest. Data exploration has shown that emoticons played a predominant role in sarcasm emphasis. This observation may hold value for further research but discouraged their use in the current student project.

In summary, the work supports the assumption that modern social media can make a vital contribution to fast refinement of a brand's marketing mix for strategic realignment in real time. This is a benefit to ensure positive reception of corporate philosophies directly as a reaction to automated analyses of innovative, digital consumer feedback channels, using contemporary research in natural language processing.

Appendix

Fig. 1. Twitter sentiment compared to logistics brands.
Fig. 2. Daily logistics sentiment.

References

1.
Carrillo de Albornoz, J., Plaza, L., Diaz, A., Ballesteros, M.: UCM-I: A rule-based syntactic approach for resolving the scope of negation. In: *SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012). pp. 282–287. Association for Computational Linguistics (2012), http://aclweb.org/anthology/S12-1037
2. Bird, S., Loper, E.: NLTK: The natural language toolkit. In: Proceedings of the ACL 2004 on Interactive Poster and Demonstration Sessions. ACLdemo '04, Association for Computational Linguistics, Stroudsburg, PA, USA (2004). https://doi.org/10.3115/1219044.1219075
3. Culotta, A., Cutler, J.: Mining brand perceptions from Twitter social networks. Marketing Science 35(3), 343–362 (2016). https://doi.org/10.1287/mksc.2015.0968
4. Dipper, S., Kübler, S.: German Treebanks: TIGER and TüBa-D/Z, pp. 595–639. Springer Netherlands, Dordrecht (2017)
5. Fang, X., Zhan, J.: Sentiment analysis using product review data. Journal of Big Data 2(1), 5 (Jun 2015). https://doi.org/10.1186/s40537-015-0015-2
6. Farooq, U., Mansoor, H., Nongaillard, A., Ouzrout, Y., Qadir, M.A.: Negation handling in sentiment analysis at sentence level. Journal of Computers 12, 470–478 (01 2016)
7. Fischer, H.: A History of the Central Limit Theorem: From Classical to Modern Probability Theory. Springer Science & Business Media (2010)
8. Goldberg, Y., Levy, O.: word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. CoRR abs/1402.3722 (2014), http://arxiv.org/abs/1402.3722
9. Gui, T., Zhang, Q., Huang, H., Peng, M., Huang, X.: Part-of-speech tagging for Twitter with adversarial neural networks. In: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 2411–2420 (2017)
10. Hamp, B., Feldweg, H.: GermaNet – a lexical-semantic net for German.
In: Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications (1997)
11. Hu, Y.: Efficient and high quality force-directed graph drawing. Mathematica Journal 10, 37–71 (01 2006)
12. Jansen, B.J., Zhang, M., Sobel, K., Chowdhury, A.: Twitter power: Tweets as electronic word of mouth. JASIST 60, 2169–2188 (2009)
13. Jansen, B.J., Zhang, M., Sobel, K., Chowdhury, A.: Umsatzverteilung im KEP-Endkundenmarkt in Deutschland nach Anbietern im Geschäftsjahr 2017/18. Handelsblatt 134, 16 (2018)
14. Kim, Y., Shin, H.: A new approach for measuring sentiment orientation based on multi-dimensional vector space. CoRR abs/1801.00254 (2018), http://arxiv.org/abs/1801.00254
15. Kouloumpis, E., Wilson, T., Moore, J.D.: Twitter sentiment analysis: The good the bad and the OMG! ICWSM 11(538-541), 164 (2011)
16. Kumar, A., Sebastian, T.: Sentiment analysis on Twitter. International Journal of Computer Science Issues 9, 372–378 (07 2012)
17. Liebeck, M.: Aspekte einer automatischen Meinungsbildungsanalyse von Online-Diskussionen. In: Ritter, N., Henrich, A., Lehner, W., Thor, A., Friedrich, S., Wingerath, W. (eds.) Datenbanksysteme für Business, Technologie und Web (BTW 2015) – Workshopband. pp. 203–212. Gesellschaft für Informatik e.V., Bonn (2015)
18. Löffler, R., Wittern, H.: Markenwahrnehmung und Marken-Differenzierung im Zeitalter des Web 2.0. In: Markendifferenzierung, pp. 359–375. Springer (2011)
19. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval, p. 32. Cambridge University Press, New York, NY, USA (2008)
20. Miller, G.A.: WordNet: a lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
21. Mohtarami, M., Amiri, H., Lan, M., Tran, T.P., Tan, C.L.: Sense sentiment similarity: An analysis. In: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence. pp. 1706–1712. AAAI'12, AAAI Press (2012), http://dl.acm.org/citation.cfm?id=2900929.2900970
22.
Mueller, J., Stumme, G.: Predicting rising follower counts on Twitter using profile information. CoRR abs/1705.03214 (2017), http://arxiv.org/abs/1705.03214
23. Munoz, T., Kumar, S.: Brand metrics: Gauging and linking brands with business performance. Journal of Brand Management 11(5), 381–387 (2004)
24. Nagarajan, M., Purohit, H., Sheth, A.P.: A qualitative examination of topical tweet and retweet practices. In: ICWSM (2010)
25. Nicholls, C., Song, F.: Improving sentiment analysis with part-of-speech weighting. vol. 3, pp. 1592–1597 (08 2009)
26. Pak, A., Paroubek, P.: Twitter as a corpus for sentiment analysis and opinion mining. In: LREC (2010)
27. Remus, R., Quasthoff, U., Heyer, G.: SentiWS – a publicly available German-language resource for sentiment analysis. In: LREC (2010)
28. Schweidel, D.A., Moe, W.W., Boudreaux, C.: Social media intelligence: Measuring brand sentiment from online conversations (2011)
29. Steinskog, A., Therkelsen, J., Gambäck, B.: Twitter topic modeling by tweet aggregation. In: Proceedings of the 21st Nordic Conference on Computational Linguistics. pp. 77–86. Association for Computational Linguistics (2017), http://aclweb.org/anthology/W17-0210
30. Wang, X., Wei, F., Liu, X., Zhou, M., Zhang, M.: Topic sentiment analysis in Twitter: a graph-based hashtag sentiment classification approach. In: CIKM (2011)
31. Wang, Y., Liu, J., Qu, J., Huang, Y., Chen, J., Feng, X.: Hashtag graph based topic model for tweet mining. In: 2014 IEEE International Conference on Data Mining. pp. 1025–1030 (Dec 2014). https://doi.org/10.1109/ICDM.2014.60
32. Webber, F.D.S.: Semantic folding theory and its application in semantic fingerprinting. CoRR abs/1511.08855 (2015), http://arxiv.org/abs/1511.08855
33. Westpfahl, S., Schmidt, T., Jonietz, J., Borlinghaus, A.: STTS 2.0. Guidelines für die Annotation von POS-Tags für Transkripte gesprochener Sprache in Anlehnung an das Stuttgart-Tübingen-Tagset (STTS) (2017)