Bitcoin Value and Sentiment Expressed in Tweets

Bernhard Preisler∗  Margot Mieskes†  Christoph Becker†
† University of Applied Sciences Darmstadt, Germany
firstname.lastname@h-da.de

Abstract

In recent years, traditional economic models failed to foresee several developments resulting in a considerable economic crisis. Other phenomena, such as the increase in Bitcoin value, cannot be completely modeled by these traditional means either. As Bitcoin and other cryptocurrencies are a playground for technically interested people, it might be worthwhile to look into other communication channels, such as Social Media, to find clues for the development we observe. We hypothesize that sentiment expressed in, for example, Tweets might model the development of the Bitcoin value better than traditional models. In this work, we present a data set of Tweets covering almost one year, which we annotated for sentiment. Additionally, we show results from preliminary experiments which support our hypothesis that sentiment information is highly predictive of the value development.

1 Introduction

Financial markets sometimes exhibit tendencies Keynes (1936) describes as Animal Spirits, and in the past, traditional models failed to capture all that was observable from an economic point of view. Kindleberger (1978) was able to show, already in 1978 and in the context of a financial crisis, that the opinions and beliefs of investors are related to news and journal articles. This is especially true for cryptocurrencies such as Bitcoin, which showed rather erratic behaviour in the past two years. To get new insights into market behaviour, we decide to use Twitter and evaluate whether Tweets can give us more information on the currency's behaviour than traditional models. The hypothesis behind this is that people investing in Bitcoin might also voice their opinions and/or beliefs through Social Media channels such as Twitter, and therefore influence the market on a subjective level. To that end, we collect Tweets and perform a sentiment analysis on them. Our main question is whether sentiments expressed in Tweets correlate with the value of the cryptocurrency. Our preliminary results indicate that the degree of sentiment strongly correlates with the development of the currency and that information found in Tweets could improve traditional economic models.

Our major contributions are:¹

• A data set of Tweets related to Bitcoin.

• A subset of the main data set that was manually annotated for sentiment.

• An evaluation of various off-the-shelf machine learning methods to automatically classify sentiment in Tweets.

• A preliminary analysis of the development of sentiment in Tweets in correlation to the development of the value of the cryptocurrency Bitcoin.

The paper is structured as follows: Section 2 gives an overview of the relevant related work. In Section 3 we describe the data collection and manual annotation. In Section 4 we describe the machine learning and baseline methods used and the features extracted from the data. Section 5 presents the results and their discussion, and we finalize the paper with our conclusions and some pointers for future work in Section 6.

∗ The author was doing his final thesis at the University of Applied Sciences Darmstadt.
¹ The data set and its annotations are available at https://github.com/mieskes/BitcoinTweets
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related Work

Work on sentiment analysis is available in abundance, and reviewing the whole field is beyond the scope of this paper. Mäntylä et al. (2018) present a survey of the topic of sentiment analysis, looking at over 6,000 publications, of which 99% were published after 2004. We therefore focus on the work that was most influential for us.

Gonçalves et al. (2013) look into methods for assigning sentiment to five data sets. They test various methods, including lexicon-based approaches. Their results indicate that machine learning works best for Twitter.

With respect to sentiment analysis of Twitter, the SemEval tasks are of specific interest. Results from the 2016 installment (Nakov et al., 2016), especially subtask A "Message Polarity Classification" and subtask B "Classification to a two-point scale", show that accuracy ranges from 0.646 for the best team to 0.342 for the baseline on Task A. For Task B the accuracy is 0.862 for the best system and 0.778 for the baseline.

In 2017, subtask A aimed at a three-point classification (positive, negative and neutral), while subtask B was the same as in 2016 (Rosenthal et al., 2017). Results are again in the range of 0.651 (accuracy) for the best system. The baseline is annotating all Tweets as either positive, negative or neutral, and results range from 0.193 for the case where everything was labeled as positive to 0.483 for labeling everything as neutral.

For 2018, the tasks changed slightly to look at emotions and valence. The annotation for the valence task was done on a 7-point scale, ranging from 3 (very positive mental state) to -3 (very negative mental state).²

² https://competitions.codalab.org/competitions/17751

With respect to Bitcoin, Kim (2014) analysed comments in a Bitcoin forum in order to predict the value development of the currency. The author uses data from three years and analyses the comments for sentiment. Using machine learning, the author models the comments and the currency development based on 90% of their data and tests the resulting model on 10% of the data. The accuracy is at 80% correct for the prediction of the currency value based on comments.

3 Data Collection and Annotation

As Bitcoin values evolve rapidly, we assume that a medium that allows for rapid communication, such as Twitter, more closely reflects the development of the currency.

From Twitter we extract Tweets related to Bitcoin by identifying them through their respective hashtags, such as #bitcoin, #btc, #cryptocurrency etc. We collected data from January 2018 until August 2018 and restricted our collection to English Tweets only, to reduce the chance of ending up with a mixed-language data set. The total data set contains over 50 million Tweets.³

³ The set of Tweet IDs is available at https://github.com/mieskes/BitcoinTweets

Figure 2 shows how often hashtags related to Bitcoin and cryptocurrencies occur in our data set. We observe that only approximately 17% of the Tweets are actually marked with bitcoin, while a lot of Tweets refer to other cryptocurrencies or deal with general topics related to them, such as mining. To reduce the data set, we removed duplicate Tweets, as identified by their ID, as well as retweets.

Figure 2: Distribution of Hashtags in the Data Set.

3.1 Preprocessing

We perform a range of preprocessing steps inspired by Martínez-Cámara et al. (2013) in order to extract features and feed the data to the machine learning algorithms. These preprocessing steps include filtering for stop words and the removal of hashtags, user IDs and URLs within the Tweets. The remaining data only contains plain text.
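For illustration, a preprocessing step of this kind might look as follows. This is a minimal sketch, not the pipeline actually used in our experiments; the regular expressions, the NLTK stop-word list and the function name are our own illustrative choices.

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(tweet: str) -> str:
    """Strip URLs, user IDs, hashtags, punctuation and stop words."""
    text = re.sub(r"https?://\S+", " ", tweet)   # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove user IDs (@mentions)
    text = re.sub(r"#\w+", " ", text)            # remove hashtags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep plain text only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("Oh my! So many #scam these days https://t.co/xyz @someuser"))
```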
3.2 Annotation

To be able to train a machine learning model, we need training data. We handed slightly less than 2,000 Tweets to human annotators via Amazon Mechanical Turk to annotate them for sentiment. Figure 1 shows the task description and the annotation interface as displayed on Amazon Mechanical Turk. We coloured the various levels of sentiment for ease of use. In the description, we refer to positive sentiment as an indication of rising value and negative sentiment as an indication of dropping value of the currency. Apart from the plain text, Turkers did not get any meta data on the Tweets.

Figure 1: Task description and Annotation Interface for Amazon Mechanical Turk.

Each Tweet is annotated by 7 Turkers and the results were averaged. Average values ≥ 0.15 are considered positive Tweets, values ≤ −0.15 are considered negative Tweets, and results in between are considered neutral. Our final training data set contains 1042 positive Tweets, 727 negative Tweets and 88 neutral Tweets.

We evaluate the annotation quality using Krippendorff's α. As expected, the inter-annotator agreement for the full distinction is fairly low (α = 0.13). As we are primarily interested in positive, negative and neutral sentiment, we collapsed the annotations to represent only the three main classes (details are described above). Nevertheless, the result (α = 0.43) was considered improvable. A more detailed look at the annotation revealed that in some cases individual annotators annotated the complete or near opposite of what the majority had done. We identified these instances and removed them from consideration. This left us with enough annotations to create a gold standard and raised the inter-annotator agreement to α = 0.53, which, considering the complexity of the task, is a good result.

Table 1 shows example Tweets from the training data. The first column shows the average sentiment value based on all annotations and the last column shows the mapped sentiment classification.

mean   Text                                             Sent
-2.3   @SilverBulletBTC Damn, and I can not buy . . .   -1
 0.4   Gauthier-Mohammed: I will be a father of . . .    1
-3.4   Oh my! So many #scam these days . . .            -1
 1.7   New #Blockchain marketplace Repayment . . .       1

Table 1: Exemplary Tweets including average and mapped sentiment classifications.

Figure 4 shows the distribution of Tweets annotated with a specific sentiment class. We observe that more Tweets receive a positive classification, while fewer receive a negative classification. Most Tweets are annotated as Moderately or Very Positive, while on the negative side the various subclasses are more evenly distributed. It is interesting to note that very few Tweets are marked as Extremely Negative, while on the positive side a considerable amount of Tweets are marked as Extremely Positive. This indicates that most Tweets are positive, up to the degree of being enthusiastic.

Figure 4: Distribution of manual annotations into the various sentiment classes.
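For illustration, the mapping from averaged Turker scores to the three sentiment classes can be sketched as follows; the function and the example score vectors are our own, only the ±0.15 thresholds come from the scheme described above.

```python
from statistics import mean

POS_THRESHOLD, NEG_THRESHOLD = 0.15, -0.15

def map_sentiment(scores: list[float]) -> int:
    """Average the seven per-Turker scores and map them to -1/0/1."""
    avg = mean(scores)
    if avg >= POS_THRESHOLD:
        return 1    # positive
    if avg <= NEG_THRESHOLD:
        return -1   # negative
    return 0        # neutral

# Two hypothetical annotated Tweets, seven scores each on the annotation scale.
print(map_sentiment([3, 2, 3, 1, 2, 3, 2]))   # -> 1 (positive)
print(map_sentiment([0, 0, -1, 1, 0, 0, 0]))  # -> 0 (neutral)
```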
4 Sentiment Classification

We experiment with a range of machine learning methods, both classical and deep learning-based. We use SVMs and Random Forest in addition to two deep learning-based methods, which we describe in the following.

4.1 Baselines

We employ two baseline systems in our experiments. Hutto and Gilbert (2014) describe vaderSentiment⁴ as a lexicon- and rule-based sentiment analysis tool which is specifically targeted towards Social Media. On Social Media, the authors achieve an overall F1 score of 0.96 for the classification of positive, negative and neutral sentiment. The tool is implemented in Python.

Sentimentr⁵ is implemented in R and is also lexicon-based. The implementation is tested on three different review data sets (Amazon, Yelp and IMDB) and achieves accuracy rates between 76.5% for the Amazon review data set and 71.5% for the Yelp data set.

⁴ https://github.com/cjhutto/vaderSentiment
⁵ https://github.com/trinker/sentimentr; https://cran.r-project.org/web/packages/sentimentr/sentimentr.pdf

4.2 Machine Learning Approaches

We also experiment with various machine learning approaches. Two serve as baselines and are traditional machine learning systems, while two are deep learning-based.

4.2.1 Baselines

We use Random Forest and Support Vector Machines (SVM) in their R implementations, using standard features. Using a grid search and 10-fold cross-validation, we experimentally determine the best parameters for both SVM and Random Forest and use them to classify the data.

4.2.2 HDLTex

Hierarchical Deep Learning for Text Classification (HDLTex) has been developed specifically for text classification (Kowsari et al., 2017). In its original implementation, it contains an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). We experimentally adapt the model with respect to the various parameters. Most importantly, we increase the dropout to 65% and use only 15 epochs.

Figure 3 shows how the accuracy of the models using the original (left side) and modified (right side) HDLTex architecture develops. We see that both methods reach their accuracy plateau between 5 and 10 epochs.⁶ But the modified HDLTex architecture achieves a higher accuracy on the test data than the original HDLTex architecture.

⁶ The graph on the right is based on fewer epochs.

Figure 3: Development of the model using training and test data both for the original HDLTex and the modified HDLTex architecture.

4.2.3 CNNSC

We use the Convolutional Neural Network for Sentence Classification (CNNSC) by Kim (2014) with pretrained Word2Vec-based vectors from the Twitter domain. Similar to the HDLTex, we experimentally create a modified architecture which uses fewer epochs (20), more filters (128) and a higher dropout rate (75%).

Figure 5 shows how the models using the original (left side) and modified (right side) CNNSC architecture develop. While the original architecture shows a somewhat "bumpy" start in the first 5 epochs, the learning curve for the modified architecture is considerably smoother. Furthermore, the modified CNNSC achieves a higher accuracy both on the training and the test data.

Figure 5: Development of the model using training and test data both for the original CNNSC and the modified CNNSC architecture.
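To make the modified CNNSC setup concrete, the following is a minimal Keras sketch of a Kim (2014)-style CNN sentence classifier using the adapted hyperparameters named above (128 filters, 75% dropout, 20 epochs). The vocabulary size, sequence length, filter widths and embedding handling are our own assumptions, not details of the actual implementation.

```python
from tensorflow.keras import layers, models

# Assumed data dimensions (not specified in the paper).
MAX_LEN, VOCAB_SIZE, EMB_DIM, N_CLASSES = 50, 20000, 300, 3

def build_cnnsc() -> models.Model:
    """Kim (2014)-style CNN with the adapted settings: 128 filters, 75% dropout."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    # In the actual setup, pretrained Twitter Word2Vec vectors would initialize this layer.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    # Parallel convolutions over several filter widths, max-pooled over time.
    pooled = []
    for width in (3, 4, 5):
        conv = layers.Conv1D(128, width, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Concatenate()(pooled)
    x = layers.Dropout(0.75)(x)  # adapted dropout rate
    out = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnnsc()
# model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))
```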
5 Results

In the following, we present results for the sentiment classification and for the relation of the sentiment index to the development of the Bitcoin value.

5.1 Sentiment Classification

Table 2 shows the results for the various machine learning methods and the two baselines we used (see Section 4 for details). We observe that all methods are fairly close together in terms of F1 and overall accuracy. For the positive class, the modified CNNSC achieves the best results, while for the negative class vaderSentiment achieves the best results. Both methods perform similarly with respect to overall accuracy. This lack of difference between the two methods might be due to the comparably small data set used for training; a larger data set might boost the performance of the deep learning-based system. Overall, our results are comparable to what has been reported in the literature.

Method           Class 1 F1   Class -1 F1   Accuracy
CNNSC            0.79         0.86          0.80
ad. CNNSC        0.86         0.89          0.85
HDLTex           0.69         0.81          0.75
ad. HDLTex       0.75         0.83          0.77
RandomForest     0.73         0.86          0.73
SVM              0.79         0.83          0.79
vaderSentiment   0.85         0.90          0.85
sentimentr       0.80         0.87          0.79

Table 2: Results for the various automatic sentiment annotation methods examined.

Figure 6 shows the unigram features ranked by their importance. We observe that the most predictive unigrams are actually easily associated with positive or negative sentiment. Words like join are less clear, but nevertheless rank comparably high for the sentiment classification.

Figure 6: Feature Importance in Sentiment Classification.

An initial error analysis shows that, as expected, the neutral class, which makes up about 5% of our data set, causes misclassifications, either because neutral Tweets are classified as having positive or negative sentiment or the other way around. Therefore, improving the classification of the neutral class might also improve the overall classification.

5.2 Sentiment Index and Bitcoin Value

In the next step, we apply the adapted CNNSC to the whole data set in order to classify the data from the complete observed time frame. Figure 7 shows the results for the sentiment development in comparison to the Bitcoin value. The index is normalized to range between 0 and 1. The negative value for the sentiment index at the starting point is an artefact due to the lack of previous data. We observe that the Bitcoin value constantly dropped during the observed time frame, with some bumps in between. The sentiment index closely follows this development and reflects it.

Figure 7: Sentiment and Bitcoin value development in the observed time frame.

In addition, we perform initial experiments using time-series analysis. For this, we look at the development of the sentiment index and the Bitcoin value on a daily basis. These preliminary results indicate that the sentiment index is a highly significant predictor for the Bitcoin value. But as both Twitter and Bitcoin are rapidly developing and changing, it would be interesting to also investigate shorter time frames, such as half-day or hourly predictions.
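As a sketch of what such a daily check could look like (not our actual analysis, whose exact specification is left open here), the snippet below regresses the daily Bitcoin value on a daily sentiment index with statsmodels; the CSV input and column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per day, columns "date", "sentiment_index", "btc_value".
df = pd.read_csv("daily_index.csv", parse_dates=["date"]).set_index("date")

# Normalize the sentiment index to [0, 1], as done for Figure 7.
s = df["sentiment_index"]
df["sentiment_norm"] = (s - s.min()) / (s.max() - s.min())

# Simple OLS: does the daily sentiment index predict the daily Bitcoin value?
X = sm.add_constant(df["sentiment_norm"])
result = sm.OLS(df["btc_value"], X).fit()
print(result.summary())                            # coefficient and significance
print(df["sentiment_norm"].corr(df["btc_value"]))  # Pearson correlation
```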
6 Conclusion

We presented a data set of Tweets related to cryptocurrencies. We manually annotated a subset of the Tweets in order to re-train and evaluate various machine learning and off-the-shelf sentiment classification methods. The main question, though, was to analyse the development of the sentiment expressed in Tweets in relation to the development of the currency's value. We found that off-the-shelf tools perform well enough to automatically analyse this type of data. Moreover, the sentiment index closely reflected the Bitcoin value, which indicates that the analysis of social media data could support current economic models in predicting future developments. Initial results using time-series analysis indicate that the sentiment index is highly predictive of the currency development.

Future Work The first next step is to extend the time-series analysis and evaluate whether the predictions also hold on a shorter time frame (i.e., half-day or hourly predictions). Additionally, looking not only at sentiment but also at emotions, and especially extreme emotions, might provide additional information.

We currently only looked at positive, negative and neutral sentiment. Extending this to cover the whole annotated range could further improve the prediction of the currency's value development. Finally, it would be interesting to evaluate whether these findings also hold in other areas of economics. Work by Soo (2018) on the American housing market indicates that analysing textual data with respect to economic data could improve current models.

Acknowledgments

This work was supported by the research center for Applied Computer Science (FZAI) and the Faculty for Mathematics and Natural Sciences, University of Applied Sciences Darmstadt.

References

Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013. Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks (COSN '13). ACM, New York, NY, USA, pages 27–38. https://doi.org/10.1145/2512938.2512951.

C.J. Hutto and Eric Gilbert. 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, pages 216–225.

John Maynard Keynes. 1936. The General Theory of Employment, Interest and Money. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1746–1751. https://doi.org/10.3115/v1/D14-1181.

Charles Kindleberger. 1978. Manias, Panics, and Crashes: A History of Financial Crises. Oxford University Press.

K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes. 2017. HDLTex: Hierarchical Deep Learning for Text Classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 364–371. https://doi.org/10.1109/ICMLA.2017.0-134.

Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The evolution of sentiment analysis – A review of research topics, venues, and top cited papers. Computer Science Review 27:16–32.

Eugenio Martínez-Cámara, Arturo Montejo-Ráez, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López. 2013. SINAI: Machine Learning and Emotion of the Crowd for Sentiment Analysis in Microblogs. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, pages 402–407. http://aclweb.org/anthology/S13-2066.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, California, pages 1–18. http://www.aclweb.org/anthology/S16-1001.
Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017). Vancouver, Canada, pages 502–518.

Cindy K. Soo. 2018. Quantifying Sentiment with News Media across Local Housing Markets. The Review of Financial Studies 31(10):3689–3719. https://doi.org/10.1093/rfs/hhy036.