=Paper=
{{Paper
|id=Vol-2458/invited6
|storemode=property
|title=Bitcoin Value and Sentiment Expressed in Tweets
|pdfUrl=https://ceur-ws.org/Vol-2458/paper6.pdf
|volume=Vol-2458
|authors=Bernhard Preisler,Margot Mieskes,Christoph Becker
|dblpUrl=https://dblp.org/rec/conf/swisstext/PreislerMB19
}}
==Bitcoin Value and Sentiment Expressed in Tweets==
Bernhard Preisler∗ Margot Mieskes† Christoph Becker†
†University of Applied Sciences Darmstadt, Germany
firstname.lastname@h-da.de
Abstract

In recent years, traditional economic models failed to foresee several developments resulting in a considerable economic crisis. Other phenomena, such as the increase in Bitcoin value, cannot be completely modeled by these traditional means either. As Bitcoin and other cryptocurrencies are a playground for technically interested people, it might be worthwhile to look into other communication channels, such as Social Media, to find clues for the development we observe. We hypothesize that sentiment expressed in, for example, Tweets might model the development of Bitcoin value better than traditional models. In this work, we present a data set of Tweets covering almost one year, which we annotated for sentiment. Additionally, we show results from preliminary experiments which support our hypothesis that sentiment information is highly predictive of the value development.

1 Introduction

Financial markets sometimes exhibit tendencies that Keynes (1936) describes as Animal Spirits, and in the past, traditional models failed to capture all that was observable from an economic point of view. Kindleberger (1978) was able to show, already in 1978 and in the context of a financial crisis, that opinions and beliefs of investors are related to news and journal articles. This is especially true for cryptocurrencies such as Bitcoin, which showed rather erratic behaviour in the past two years. To get new insights into market behaviour, we decided to use Twitter and evaluate whether Tweets can give us more information on the currency's behaviour than traditional models. The hypothesis behind this is that people investing in Bitcoin might also voice their opinions and/or beliefs through Social Media channels such as Twitter and therefore influence the market on a subjective level. To that end, we collect Tweets and perform a sentiment analysis on them. Our main question is whether sentiments expressed in Tweets correlate with the value of the cryptocurrency. Our preliminary results indicate that the degree of sentiment does strongly correlate with the development of the currency and that information found in Tweets could improve traditional economic models.

Our major contributions are1:

• A data set of Tweets related to Bitcoin.

• A subset of the main data set that was manually annotated for sentiment.

• An evaluation of various off-the-shelf machine learning methods to automatically classify sentiment in Tweets.

• A preliminary analysis of the development of sentiment in Tweets in correlation to the development of the value of the cryptocurrency Bitcoin.

The paper is structured as follows: Section 2 gives an overview of the relevant related work. In Section 3 we describe the data collection and manual annotation. In Section 4 we describe the machine learning and baseline methods used and the features extracted from the data. Section 5 presents the results and their discussion, and we finalize the paper with our conclusions and some pointers for future work in Section 6.

∗The author was doing his final thesis at the University of Applied Sciences Darmstadt.
1The data set and its annotations are available at https://github.com/mieskes/BitcoinTweets
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related Work

Work on sentiment analysis is available in abundance and reviewing the whole field is beyond the scope of this paper. Mäntylä et al. (2018) present a survey on the topic of sentiment analysis, looking at over 6,000 publications, of which 99% were published after 2004. Therefore, we focus on the work that was most influential for us.

Gonçalves et al. (2013) look into methods for assigning sentiment to five data sets. They test various methods, including lexicon-based approaches. Their results indicate that machine learning works best for Twitter.

With respect to sentiment analysis of Twitter, the SemEval tasks are of specific interest. Results from the 2016 installment (Nakov et al., 2016), especially subtask A "Message Polarity Classification" and subtask B "Classification to a two-point scale", show that accuracy ranges from 0.646 for the best team to 0.342 for the baseline on Task A. For Task B the accuracy is at 0.862 for the best system and 0.778 for the baseline.

In 2017, subtask A aimed at a three-point classification (positive, negative and neutral), while subtask B was the same as in 2016 (Rosenthal et al., 2017). Results are again in the range of 0.651 (accuracy) for the best system. The baseline is annotating all Tweets as either positive, negative or neutral, and results range from 0.193 for the case where everything was labeled as positive to 0.483 for labeling everything as neutral.

For 2018 the tasks changed slightly to look at emotions and valence. The annotation for the valence task was done on a 7-point scale, ranging from 3 (very positive mental state) to -3 (very negative mental state).2

2https://competitions.codalab.org/competitions/17751

With respect to Bitcoin, Kim (2014) analysed comments in a Bitcoin forum in order to predict the value development of the currency. The author uses data from three years and analyses the comments for sentiment. Using machine learning, the author models the comments and the currency development based on 90% of the data and tests the resulting model on the remaining 10%. The accuracy is at 80% correct for the prediction of the currency value based on comments.

3 Data Collection and Annotation

As Bitcoin values evolve rapidly, we assume that a medium that allows for rapid communication, such as Twitter, more closely reflects the development of the currency.

From Twitter we extract Tweets related to Bitcoin by identifying them through their respective hashtags, such as #bitcoin, #btc, #cryptocurrency, etc. We collected data from January 2018 until August 2018 and restricted our collection to English Tweets only, to reduce the chance of ending up with a mixed-language data set. The total data set contains over 50 million Tweets.3

3The set of Tweet IDs is available at https://github.com/mieskes/BitcoinTweets

Figure 2 shows how often hashtags related to Bitcoin and cryptocurrencies occur in our data set. We observe that only approximately 17% of the Tweets are actually marked with #bitcoin, while a lot of Tweets refer to other cryptocurrencies or deal with general topics related to them, such as mining. To reduce the data set, we removed duplicate Tweets, as identified by their ID, and also retweeted Tweets.

Figure 2: Distribution of Hashtags in the Data set

3.1 Preprocessing

We perform a range of preprocessing steps inspired by Martínez-Cámara et al. (2013) in order to extract features and feed the data to the machine learning algorithms. These preprocessing steps include filtering for stop words and the removal of hashtags, user IDs and URLs within the Tweets. The remaining data contains only plain text.

3.2 Annotation

To be able to train a machine learning model, we need training data. We handed slightly less than 2,000 Tweets to human annotators via Amazon Mechanical Turk to annotate them for sentiment.

Figure 1 shows the task description and the annotation interface as displayed on Amazon Mechanical Turk. We coloured the various levels of sentiment for ease of use. In the description, we refer to positive sentiment as an indication of rising value and negative sentiment as an indication of dropping value of the currency. Apart from the plain text, Turkers did not get any metadata on the Tweets.

Figure 1: Task description and Annotation Interface for Amazon Mechanical Turk.

Each Tweet is annotated by 7 Turkers and the results were averaged. Average values ≥ 0.15 are considered positive Tweets, ≤ −0.15 are considered negative Tweets, and results in between are considered neutral. Our final training data set contains 1042 positive Tweets, 727 negative Tweets and 88 neutral Tweets.

We evaluate the annotation quality using Krippendorff's α. As expected, the inter-annotator agreement for the full distinction is fairly low (α = 0.13). As we are primarily interested in positive, negative and neutral sentiment, we collapsed the annotations to represent only the three main classes (details are described above). Nevertheless, the result (α = 0.43) was considered improvable. A more detailed look at the annotation revealed that in some cases individual annotators annotated the complete or near opposite of what the majority had done. We identified these instances and removed them from consideration. This left us with enough annotations to create a gold standard and raised the inter-annotator agreement to α = 0.53, which, considering the complexity of the task, is a good result.

Table 1 shows example Tweets from the training data. The first column shows the average sentiment value based on all annotations and the last column shows the mapped sentiment classification.

  mean  Text                                             Sent
  -2.3  @SilverBulletBTC Damn, and I can not buy . . .    -1
   0.4  Gauthier-Mohammed: I will be a father of . . .     1
  -3.4  Oh my! So many #scam these days . . .             -1
   1.7  New #Blockchain marketplace Repayment . . .        1

Table 1: Exemplary Tweets including average and mapped sentiment classifications.

Figure 4 shows the distribution of tweets annotated with a specific sentiment class. We observe that more tweets receive a positive classification, while fewer receive a negative classification. Most tweets are annotated as Moderately or Very Positive, while on the negative side the various subclasses are more evenly distributed. It is interesting to note that very few tweets are marked as Extremely Negative, while on the positive side a considerable number of tweets are marked as Extremely Positive. This indicates that most tweets are positive, up to the degree of being enthusiastic.
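The aggregation scheme described above (averaging the seven Turker scores per Tweet and mapping the mean to a three-way label via the ±0.15 thresholds) can be sketched as follows. This is an illustrative Python sketch; the function name and implementation are ours, not the authors' original code.

```python
def aggregate_annotations(scores, pos=0.15, neg=-0.15):
    """Average the annotator scores for one Tweet and map the mean
    to a sentiment class: 1 (positive), -1 (negative) or 0 (neutral),
    using the thresholds described in Section 3.2."""
    mean = sum(scores) / len(scores)
    if mean >= pos:
        label = 1
    elif mean <= neg:
        label = -1
    else:
        label = 0
    return mean, label
```

For instance, seven scores averaging to -2.3 would map to -1, matching the first row of Table 1.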
Figure 3: Development of the model using training and test data both for the original HDLTex and the modified HDLTex architecture.

Figure 4: Distribution of manual annotations into the various sentiment classes.

4 Sentiment Classification

We experiment with a range of machine learning methods – both classical and deep learning-based. We use SVMs and Random Forest in addition to two deep learning-based methods, which we describe in the following.

4.1 Baselines

We employ two baseline systems in our experiments. Hutto and Gilbert (2014) describe vaderSentiment4 as a lexicon- and rule-based sentiment analysis tool which is specifically targeted towards Social Media. On Social Media data the authors achieve an overall F1 score of 0.96 for the classification of positive, negative and neutral sentiment. The tool is implemented in Python.

Sentimentr5 is implemented in R and is also lexicon-based. The implementation is tested on three different review data sets (Amazon, Yelp and IMDB) and achieves accuracy rates between 76.5% for the Amazon Review data set and 71.5% for the Yelp data set.

4https://github.com/cjhutto/vaderSentiment
5https://github.com/trinker/sentimentr; https://cran.r-project.org/web/packages/sentimentr/sentimentr.pdf

4.2 Machine Learning Approaches

We also experiment with various machine learning approaches. Two serve as baselines and are traditional machine learning systems, while two are deep learning-based.

4.2.1 Baselines

We use Random Forest and Support Vector Machines (SVM) in their implementation in R using standard features. Using a grid search and 10-fold cross-validation, we experimentally determine the best parameters for both SVM and Random Forest and use them to classify the data.

4.2.2 HDLTex

Hierarchical Deep Learning for Text Classification (HDLTex) has been developed specifically for text classification (Kowsari et al., 2017). In its original implementation it contains an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). We experimentally adapt the model with respect to the various parameters. Most importantly, we increase the dropout to 65% and use only 15 epochs.

Figure 3 shows how the accuracy of the models using the original (left side) and modified (right side) HDLTex architecture develops. We see that both methods reach a plateau in accuracy between 5 and 10 epochs.6 But the modified HDLTex architecture achieves a higher accuracy on the test data than the original HDLTex architecture.

6The graph on the right is based on fewer epochs.

4.2.3 CNNSC

We use the Convolutional Neural Network for Sentence Classification (CNNSC) by Kim (2014) with pretrained Word2Vec-based vectors from the Twitter domain. Similar to HDLTex, we experimentally create a modified architecture, which uses fewer epochs (20), more filters (128) and a higher dropout rate (75%).

Figure 5 shows how the models using the original (left side) and modified (right side) CNNSC architecture develop. While the original architecture shows a somewhat "bumpy" start in the first 5 epochs, the learning curve for the modified architecture is considerably smoother. Furthermore, the modified CNNSC achieves a higher accuracy both on the training and the test data.

Figure 5: Development of the model using training and test data both for the original CNNSC and the modified CNNSC architecture.

5 Results

In the following we present results for the sentiment classification and the relation of the sentiment index to the development of the Bitcoin value.

5.1 Sentiment Classification

Table 2 shows the results for the various machine learning methods and the two baselines we used (see Section 4 for details). We observe that all methods are fairly close together in terms of F1 and overall accuracy. For the negative class, the modified CNNSC achieves the best results, while for the positive class vaderSentiment achieves the best results. Both methods perform similarly with respect to overall accuracy. This lack of difference between the two methods might be due to the comparably small data set used for training, and a larger data set might boost the performance of the deep learning-based system. Overall, our results are comparable to what has been reported in the literature.

  Method          Class 1 F1   Class -1 F1   Accuracy
  CNNSC              0.79         0.86         0.80
  ad. CNNSC          0.86         0.89         0.85
  HDLTex             0.69         0.81         0.75
  ad. HDLTex         0.75         0.83         0.77
  RandomForest       0.73         0.86         0.73
  SVM                0.79         0.83         0.79
  vaderSentiment     0.85         0.90         0.85
  sentimentr         0.80         0.87         0.79

Table 2: Results for the various automatic sentiment annotation methods examined.

Figure 6 shows the unigram features ranked by their importance. We observe that the most predictive unigrams are actually easily associated with positive or negative sentiment. Words like join are less clear, but nevertheless rank comparably high for the sentiment classification.

An initial error analysis shows that, as expected, the neutral class, which makes up about 5% of our data set, causes misclassifications, either because neutral tweets are classified as having positive or negative sentiment or the other way around. Therefore, improving the classification of the neutral class might also improve the overall classification.

5.2 Sentiment Index and Bitcoin Value

In the next step, we apply the adapted CNNSC to the whole data set in order to classify the data from the complete observed time frame. Figure 7 shows the results for the sentiment development in comparison to the Bitcoin value. The index is normalized to range between 0 and 1. The negative value of the sentiment index at the starting point is an artefact due to the lack of previous data. We observe that the Bitcoin value constantly dropped during the observed time frame, with some bumps in between. The sentiment index closely follows this development and reflects it.

In addition, we perform initial experiments using time-series analysis. For this, we look at the development of the sentiment index and the Bitcoin value on a daily basis. These preliminary results indicate that the sentiment index is a highly significant predictor of the Bitcoin value. But as both Twitter and Bitcoin are rapidly developing and changing, it would be interesting to also investigate shorter time frames, such as half-day or hourly predictions.

6 Conclusion

We presented a data set of Tweets related to cryptocurrencies. We manually analysed a subset of the Tweets in order to re-train and evaluate various machine learning and off-the-shelf sentiment classification methods. The main question, though, was to analyse the development of the sentiment
expressed in Tweets in relation to the development of the currency's value. We found that off-the-shelf tools perform well enough to automatically analyse this type of data. Moreover, the sentiment index closely reflected the Bitcoin value, which indicates that the analysis of social media data could support current economic models in predicting future developments. Initial results using time-series analysis indicate that the sentiment index is highly predictive of the currency development.

Figure 6: Feature Importance in Sentiment Classification.

Future Work The first next step is to extend the time-series analysis and evaluate if the predictions also hold on shorter time frames (i.e., half-day or hourly predictions). Additionally, looking not only at sentiment, but also at emotions, and especially extreme emotions, might provide additional information.

We currently only looked at positive, negative and neutral sentiment. Extending this to cover the whole annotated range could give additional improvement on the prediction of the currency value development. Finally, it would be interesting to evaluate whether these findings also hold in other areas of economics. Work by Soo (2018) on the American housing market indicates that analysing textual data with respect to economic data could improve current models.
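To make the idea of such a time-series check concrete, a minimal version of regressing the daily Bitcoin value on the previous day's sentiment index could look as follows. This is a hedged illustration using a hand-rolled least-squares fit on synthetic data; it is not the authors' actual analysis or model.

```python
def fit_lagged_ols(sentiment, value):
    """Ordinary least squares of value[t] on sentiment[t-1] (one-day lag).
    Returns (slope, intercept); a clearly non-zero slope is the kind of
    signal that would mark the sentiment index as a useful predictor."""
    x = sentiment[:-1]            # sentiment index on day t-1
    y = value[1:]                 # Bitcoin value on day t
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic daily series in which the value follows the previous day's
# sentiment exactly; the fit recovers slope 2.0 and intercept 0.1.
sentiment = [i / 49 for i in range(50)]
value = [0.5] + [2.0 * s + 0.1 for s in sentiment[:-1]]
slope, intercept = fit_lagged_ols(sentiment, value)
```

A real analysis would of course use the observed series and test significance, but the lag structure sketched here is the core of the half-day or hourly extension mentioned above.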
Figure 7: Sentiment and Bitcoin value development in the observed time frame.

Acknowledgments

This work was supported by the research center for Applied Computer Science (FZAI) and the Faculty for Mathematics and Natural Sciences, University of Applied Sciences Darmstadt.
References

Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013. Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks (COSN '13). ACM, New York, NY, USA, pages 27–38. https://doi.org/10.1145/2512938.2512951.

C.J. Hutto and Eric Gilbert. 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media. pages 216–225.

John Maynard Keynes. 1936. The General Theory of Employment, Interest and Money. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1746–1751. https://doi.org/10.3115/v1/D14-1181.

Charles Kindleberger. 1978. Manias, Panics, and Crashes: A History of Financial Crises. Oxford University Press.

K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes. 2017. HDLTex: Hierarchical Deep Learning for Text Classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA). pages 364–371. https://doi.org/10.1109/ICMLA.2017.0-134.

Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The evolution of sentiment analysis – A review of research topics, venues, and top cited papers. Computer Science Review 27:16–32.

Eugenio Martínez-Cámara, Arturo Montejo-Ráez, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López. 2013. SINAI: Machine Learning and Emotion of the Crowd for Sentiment Analysis in Microblogs. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, pages 402–407. http://aclweb.org/anthology/S13-2066.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, California, pages 1–18. http://www.aclweb.org/anthology/S16-1001.

Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017). Vancouver, Canada, pages 502–518.

Cindy K. Soo. 2018. Quantifying Sentiment with News Media across Local Housing Markets. The Review of Financial Studies 31(10):3689–3719. https://doi.org/10.1093/rfs/hhy036.