Bitcoin Value and Sentiment Expressed in Tweets

Bernhard Preisler∗  Margot Mieskes†  Christoph Becker†
† University of Applied Sciences Darmstadt, Germany
firstname.lastname@h-da.de

Abstract

In recent years, traditional economic models failed to foresee several developments resulting in a considerable economic crisis. Other phenomena, such as the increase in Bitcoin value, cannot be completely modeled by these traditional means either. As Bitcoin and other cryptocurrencies are a playground for technically interested people, it might be worthwhile to look into other communication channels, such as Social Media, to find clues for the development we observe. We hypothesize that sentiment expressed in, for example, Tweets might model the development of the Bitcoin value better than traditional models. In this work, we present a data set of Tweets covering almost one year, which we annotated for sentiment. Additionally, we show results from preliminary experiments which support our hypothesis that sentiment information is highly predictive of the value development.

1 Introduction

Financial markets sometimes exhibit tendencies Keynes (1936) describes as Animal Spirits, and in the past, traditional models failed to capture all that was observable from an economic point of view. Kindleberger (1978) was able to show, already in 1978 and in the context of a financial crisis, that the opinions and beliefs of investors are related to news and journal articles. This is especially true for cryptocurrencies such as Bitcoin, which showed rather erratic behaviour in the past two years. To get new insights into market behaviour, we decide to use Twitter and evaluate whether Tweets can give us more information on the currency's behaviour than traditional models. The hypothesis behind this is that people investing in Bitcoin might also voice their opinions and/or beliefs through Social Media channels such as Twitter, and therefore influence the market on a subjective level. To that end, we collect Tweets and perform a sentiment analysis on them. Our main question is whether sentiments expressed in Tweets correlate with the value of the cryptocurrency. Our preliminary results indicate that the degree of sentiment strongly correlates with the development of the currency and that information found in Tweets could improve traditional economic models.

Our major contributions are:¹

• A data set of Tweets related to Bitcoin.

• A subset of the main data set that was manually annotated for sentiment.

• An evaluation of various off-the-shelf machine learning methods to automatically classify sentiment in Tweets.

• A preliminary analysis of the development of sentiment in Tweets in correlation to the development of the value of the cryptocurrency Bitcoin.

The paper is structured as follows: Section 2 gives an overview of the relevant related work. In Section 3 we describe the data collection and manual annotation. In Section 4 we describe the machine learning and baseline methods used and the features extracted from the data. Section 5 presents the results and their discussion, and we finalize the paper with our conclusions and some pointers for future work in Section 6.

∗ The author was doing his final thesis at the University of Applied Sciences Darmstadt.
¹ The data set and its annotations are available at https://github.com/mieskes/BitcoinTweets
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
2 Related Work

Work on sentiment analysis is available in abundance, and reviewing the whole field is beyond the scope of this paper. Mäntylä et al. (2018) present a survey of the topic of sentiment analysis, looking at over 6,000 publications, of which 99% were published after 2004. We therefore focus on the work that was most influential for us.

Gonçalves et al. (2013) look into methods for assigning sentiment to five data sets. They test various methods, including lexicon-based approaches. Their results indicate that machine learning works best for Twitter.

With respect to sentiment analysis of Twitter, the SemEval tasks are of specific interest. Results from the 2016 installment (Nakov et al., 2016), especially subtask A "Message Polarity Classification" and subtask B "Classification to a two-point scale", show that accuracy ranges from 0.646 for the best team to 0.342 for the baseline on Task A. For Task B the accuracy is 0.862 for the best system and 0.778 for the baseline.

In 2017, subtask A aimed at a three-point classification (positive, negative and neutral), while subtask B was the same as in 2016 (Rosenthal et al., 2017). Results are again in the range of 0.651 (accuracy) for the best system. The baseline is annotating all Tweets as either positive, negative or neutral, and results range from 0.193 for the case where everything was labeled as positive to 0.483 for labeling everything as neutral.

For 2018, the tasks changed slightly to look at emotions and valence. The annotation for the valence task was done on a 7-point scale, ranging from 3 (very positive mental state) to -3 (very negative mental state).²

² https://competitions.codalab.org/competitions/17751

With respect to Bitcoin, Kim (2014) analysed comments in a Bitcoin forum in order to predict the value development of the currency. The author uses data from three years and analyses the comments for sentiment. Using machine learning, the author models the comments and the currency development based on 90% of their data and tests the resulting model on 10% of the data. The accuracy is at 80% correct for the prediction of the currency value based on comments.

3 Data Collection and Annotation

As Bitcoin values evolve rapidly, we assume that a medium that allows for rapid communication, such as Twitter, more closely reflects the development of the currency.

From Twitter we extract Tweets related to Bitcoin by identifying them through their respective hashtags, such as #bitcoin, #btc, #cryptocurrency etc. We collected data from January 2018 until August 2018 and restricted our collection to English Tweets only, to reduce the chance of ending up with a mixed-language data set. The total data set contains over 50 million Tweets.³

³ The set of Tweet IDs is available at https://github.com/mieskes/BitcoinTweets

Figure 2 shows how often hashtags related to Bitcoin and cryptocurrencies occur in our data set. We observe that only approximately 17% of the Tweets are actually marked with bitcoin, while a lot of Tweets refer to other cryptocurrencies or deal with general topics related to them, such as mining. To reduce the data set, we removed duplicate Tweets, as identified by their ID, as well as retweets.

Figure 2: Distribution of Hashtags in the Data Set.

3.1 Preprocessing

We perform a range of preprocessing steps inspired by Martínez-Cámara et al. (2013) in order to extract features and feed the data to the machine learning algorithms. These preprocessing steps include filtering for stop words and the removal of hashtags, user IDs and URLs within the Tweets. The remaining data only contains plain text.
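For illustration, a preprocessing step of this kind might look as follows. This is a minimal sketch, not the pipeline actually used in our experiments; the regular expressions, the NLTK stop-word list and the function name are our own illustrative choices.

```python
import re
from nltk.corpus import stopwords  # requires: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def preprocess(tweet: str) -> str:
    """Strip URLs, user IDs, hashtags, punctuation and stop words."""
    text = re.sub(r"https?://\S+", " ", tweet)   # remove URLs
    text = re.sub(r"@\w+", " ", text)            # remove user IDs (@mentions)
    text = re.sub(r"#\w+", " ", text)            # remove hashtags
    text = re.sub(r"[^a-z\s]", " ", text.lower())  # keep plain text only
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(preprocess("Oh my! So many #scam these days https://t.co/xyz @someuser"))
```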
3.2 Annotation

To be able to train a machine learning model, we need training data. We handed slightly less than 2,000 Tweets to human annotators via Amazon Mechanical Turk to annotate them for sentiment. Figure 1 shows the task description and the annotation interface as displayed on Amazon Mechanical Turk. We coloured the various levels of sentiment for ease of use. In the description, we refer to positive sentiment as an indication of rising value and negative sentiment as an indication of dropping value of the currency. Apart from the plain text, Turkers did not get any meta data on the Tweets.

Figure 1: Task description and Annotation Interface for Amazon Mechanical Turk.

Each Tweet is annotated by 7 Turkers and the results were averaged. Average values ≥ 0.15 are considered positive Tweets, values ≤ −0.15 are considered negative Tweets, and results in between are considered neutral. Our final training data set contains 1042 positive Tweets, 727 negative Tweets and 88 neutral Tweets.

We evaluate the annotation quality using Krippendorff's α. As expected, the inter-annotator agreement for the full distinction is fairly low (α = 0.13). As we are primarily interested in positive, negative and neutral sentiment, we collapsed the annotations to represent only the three main classes (details are described above). Nevertheless, the result (α = 0.43) was considered improvable. A more detailed look at the annotation revealed that in some cases individual annotators annotated the complete or near opposite of what the majority had done. We identified these instances and removed them from consideration. This left us with enough annotations to create a gold standard and raised the inter-annotator agreement to α = 0.53, which, considering the complexity of the task, is a good result.

Table 1 shows example Tweets from the training data. The first column shows the average sentiment value based on all annotations and the last column shows the mapped sentiment classification.

mean   Text                                             Sent
-2.3   @SilverBulletBTC Damn, and I can not buy . . .   -1
 0.4   Gauthier-Mohammed: I will be a father of . . .    1
-3.4   Oh my! So many #scam these days . . .            -1
 1.7   New #Blockchain marketplace Repayment . . .       1

Table 1: Exemplary Tweets including average and mapped sentiment classifications.

Figure 4 shows the distribution of Tweets annotated with a specific sentiment class. We observe that more Tweets receive a positive classification, while fewer receive a negative classification. Most Tweets are annotated as Moderately or Very Positive, while on the negative side the various subclasses are more evenly distributed. It is interesting to note that very few Tweets are marked as Extremely Negative, while on the positive side a considerable amount of Tweets are marked as Extremely Positive. This indicates that most Tweets are positive, up to the degree of being enthusiastic.

Figure 4: Distribution of manual annotations into the various sentiment classes.
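For illustration, the mapping from averaged Turker scores to the three sentiment classes can be sketched as follows; the function and the example score vectors are our own, only the ±0.15 thresholds come from the scheme described above.

```python
from statistics import mean

POS_THRESHOLD, NEG_THRESHOLD = 0.15, -0.15

def map_sentiment(scores: list[float]) -> int:
    """Average the seven per-Turker scores and map them to -1/0/1."""
    avg = mean(scores)
    if avg >= POS_THRESHOLD:
        return 1    # positive
    if avg <= NEG_THRESHOLD:
        return -1   # negative
    return 0        # neutral

# Two hypothetical annotated Tweets, seven scores each on the annotation scale.
print(map_sentiment([3, 2, 3, 1, 2, 3, 2]))   # -> 1 (positive)
print(map_sentiment([0, 0, -1, 1, 0, 0, 0]))  # -> 0 (neutral)
```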
4 Sentiment Classification

We experiment with a range of machine learning methods, both classical and deep learning-based. We use SVMs and Random Forest in addition to two deep learning-based methods, which we describe in the following.

4.1 Baselines

We employ two baseline systems in our experiments. Hutto and Gilbert (2014) describe vaderSentiment⁴ as a lexicon- and rule-based sentiment analysis tool which is specifically targeted towards Social Media. On Social Media, the authors achieve an overall F1 score of 0.96 for the classification of positive, negative and neutral sentiment. The tool is implemented in Python.

Sentimentr⁵ is implemented in R and is also lexicon-based. The implementation is tested on three different review data sets (Amazon, Yelp and IMDB) and achieves accuracy rates between 76.5% for the Amazon review data set and 71.5% for the Yelp data set.

⁴ https://github.com/cjhutto/vaderSentiment
⁵ https://github.com/trinker/sentimentr; https://cran.r-project.org/web/packages/sentimentr/sentimentr.pdf

4.2 Machine Learning Approaches

We also experiment with various machine learning approaches. Two serve as baselines and are traditional machine learning systems, while two are deep learning-based.

4.2.1 Baselines

We use Random Forest and Support Vector Machines (SVM) in their R implementations, using standard features. Using a grid search and 10-fold cross-validation, we experimentally determine the best parameters for both SVM and Random Forest and use them to classify the data.

4.2.2 HDLTex

Hierarchical Deep Learning for Text Classification (HDLTex) has been developed specifically for text classification (Kowsari et al., 2017). In its original implementation, it contains an Artificial Neural Network (ANN), a Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN). We experimentally adapt the model with respect to the various parameters. Most importantly, we increase the dropout to 65% and use only 15 epochs.

Figure 3 shows how the accuracy of the models using the original (left side) and modified (right side) HDLTex architecture develops. We see that both methods reach their accuracy plateau between 5 and 10 epochs.⁶ But the modified HDLTex architecture achieves a higher accuracy on the test data than the original HDLTex architecture.

⁶ The graph on the right is based on fewer epochs.

Figure 3: Development of the model using training and test data both for the original HDLTex and the modified HDLTex architecture.

4.2.3 CNNSC

We use the Convolutional Neural Network for Sentence Classification (CNNSC) by Kim (2014) with pretrained Word2Vec-based vectors from the Twitter domain. Similar to the HDLTex, we experimentally create a modified architecture which uses fewer epochs (20), more filters (128) and a higher dropout rate (75%).

Figure 5 shows how the models using the original (left side) and modified (right side) CNNSC architecture develop. While the original architecture shows a somewhat "bumpy" start in the first 5 epochs, the learning curve for the modified architecture is considerably smoother. Furthermore, the modified CNNSC achieves a higher accuracy both on the training and the test data.

Figure 5: Development of the model using training and test data both for the original CNNSC and the modified CNNSC architecture.
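To make the modified CNNSC setup concrete, the following is a minimal Keras sketch of a Kim (2014)-style CNN sentence classifier using the adapted hyperparameters named above (128 filters, 75% dropout, 20 epochs). The vocabulary size, sequence length, filter widths and embedding handling are our own assumptions, not details of the actual implementation.

```python
from tensorflow.keras import layers, models

# Assumed data dimensions (not specified in the paper).
MAX_LEN, VOCAB_SIZE, EMB_DIM, N_CLASSES = 50, 20000, 300, 3

def build_cnnsc() -> models.Model:
    """Kim (2014)-style CNN with the adapted settings: 128 filters, 75% dropout."""
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    # In the actual setup, pretrained Twitter Word2Vec vectors would initialize this layer.
    emb = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inp)
    # Parallel convolutions over several filter widths, max-pooled over time.
    pooled = []
    for width in (3, 4, 5):
        conv = layers.Conv1D(128, width, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(conv))
    x = layers.Concatenate()(pooled)
    x = layers.Dropout(0.75)(x)  # adapted dropout rate
    out = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inp, out)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnnsc()
# model.fit(x_train, y_train, epochs=20, validation_data=(x_test, y_test))
```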
5 Results

In the following, we present results for the sentiment classification and for the relation of the sentiment index to the development of the Bitcoin value.

5.1 Sentiment Classification

Table 2 shows the results for the various machine learning methods and the two baselines we used (see Section 4 for details). We observe that all methods are fairly close together in terms of F1 and overall accuracy. For the positive class, the modified CNNSC achieves the best results, while for the negative class vaderSentiment achieves the best results. Both methods perform similarly with respect to overall accuracy. This lack of difference between the two methods might be due to the comparably small data set used for training; a larger data set might boost the performance of the deep learning-based system. Overall, our results are comparable to what has been reported in the literature.

Method           Class 1 F1   Class -1 F1   Accuracy
CNNSC            0.79         0.86          0.80
ad. CNNSC        0.86         0.89          0.85
HDLTex           0.69         0.81          0.75
ad. HDLTex       0.75         0.83          0.77
RandomForest     0.73         0.86          0.73
SVM              0.79         0.83          0.79
vaderSentiment   0.85         0.90          0.85
sentimentr       0.80         0.87          0.79

Table 2: Results for the various automatic sentiment annotation methods examined.

Figure 6 shows the unigram features ranked by their importance. We observe that the most predictive unigrams are actually easily associated with positive or negative sentiment. Words like join are less clear, but nevertheless rank comparably high for the sentiment classification.

Figure 6: Feature Importance in Sentiment Classification.

An initial error analysis shows that, as expected, the neutral class, which makes up about 5% of our data set, causes misclassifications, either because neutral Tweets are classified as having positive or negative sentiment or the other way around. Therefore, improving the classification of the neutral class might also improve the overall classification.

5.2 Sentiment Index and Bitcoin Value

In the next step, we apply the adapted CNNSC to the whole data set in order to classify the data from the complete observed time frame. Figure 7 shows the results for the sentiment development in comparison to the Bitcoin value. The index is normalized to range between 0 and 1. The negative value for the sentiment index at the starting point is an artefact due to the lack of previous data. We observe that the Bitcoin value constantly dropped during the observed time frame, with some bumps in between. The sentiment index closely follows this development and reflects it.

Figure 7: Sentiment and Bitcoin value development in the observed time frame.

In addition, we perform initial experiments using time-series analysis. For this, we look at the development of the sentiment index and the Bitcoin value on a daily basis. These preliminary results indicate that the sentiment index is a highly significant predictor for the Bitcoin value. But as both Twitter and Bitcoin are rapidly developing and changing, it would be interesting to also investigate shorter time frames, such as half-day or hourly predictions.
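As a sketch of what such a daily check could look like (not our actual analysis, whose exact specification is left open here), the snippet below regresses the daily Bitcoin value on a daily sentiment index with statsmodels; the CSV input and column names are hypothetical.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical input: one row per day, columns "date", "sentiment_index", "btc_value".
df = pd.read_csv("daily_index.csv", parse_dates=["date"]).set_index("date")

# Normalize the sentiment index to [0, 1], as done for Figure 7.
s = df["sentiment_index"]
df["sentiment_norm"] = (s - s.min()) / (s.max() - s.min())

# Simple OLS: does the daily sentiment index predict the daily Bitcoin value?
X = sm.add_constant(df["sentiment_norm"])
result = sm.OLS(df["btc_value"], X).fit()
print(result.summary())                            # coefficient and significance
print(df["sentiment_norm"].corr(df["btc_value"]))  # Pearson correlation
```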
6 Conclusion

We presented a data set of Tweets related to cryptocurrencies. We manually annotated a subset of the Tweets in order to re-train and evaluate various machine learning and off-the-shelf sentiment classification methods. The main question, though, was to analyse the development of the sentiment expressed in Tweets in relation to the development of the currency's value. We found that off-the-shelf tools perform well enough to automatically analyse this type of data. Moreover, the sentiment index closely reflected the Bitcoin value, which indicates that the analysis of social media data could support current economic models in predicting future developments. Initial results using time-series analysis indicate that the sentiment index is highly predictive of the currency development.

Future Work The first next step is to extend the time-series analysis and evaluate whether the predictions also hold on a shorter time frame (i.e., half-day or hourly predictions). Additionally, looking not only at sentiment but also at emotions, and especially extreme emotions, might provide additional information.

We currently only looked at positive, negative and neutral sentiment. Extending this to cover the whole annotated range could further improve the prediction of the currency's value development. Finally, it would be interesting to evaluate whether these findings also hold in other areas of economics. Work by Soo (2018) on the American housing market indicates that analysing textual data with respect to economic data could improve current models.

Acknowledgments

This work was supported by the research center for Applied Computer Science (FZAI) and the Faculty for Mathematics and Natural Sciences, University of Applied Sciences Darmstadt.

References

Pollyanna Gonçalves, Matheus Araújo, Fabrício Benevenuto, and Meeyoung Cha. 2013. Comparing and combining sentiment analysis methods. In Proceedings of the First ACM Conference on Online Social Networks (COSN '13). ACM, New York, NY, USA, pages 27–38. https://doi.org/10.1145/2512938.2512951.

C.J. Hutto and Eric Gilbert. 2014. VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. In Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, pages 216–225.

John Maynard Keynes. 1936. The General Theory of Employment, Interest and Money. Springer.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pages 1746–1751. https://doi.org/10.3115/v1/D14-1181.

Charles Kindleberger. 1978. Manias, Panics, and Crashes: A History of Financial Crises. Oxford University Press.

K. Kowsari, D. E. Brown, M. Heidarysafa, K. Jafari Meimandi, M. S. Gerber, and L. E. Barnes. 2017. HDLTex: Hierarchical Deep Learning for Text Classification. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 364–371. https://doi.org/10.1109/ICMLA.2017.0-134.

Mika V. Mäntylä, Daniel Graziotin, and Miikka Kuutila. 2018. The evolution of sentiment analysis – A review of research topics, venues, and top cited papers. Computer Science Review 27:16–32.

Eugenio Martínez-Cámara, Arturo Montejo-Ráez, M. Teresa Martín-Valdivia, and L. Alfonso Ureña-López. 2013. SINAI: Machine Learning and Emotion of the Crowd for Sentiment Analysis in Microblogs. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013). Association for Computational Linguistics, pages 402–407. http://aclweb.org/anthology/S13-2066.

Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoyanov. 2016. SemEval-2016 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016). Association for Computational Linguistics, San Diego, California, pages 1–18. http://www.aclweb.org/anthology/S16-1001.
Sara Rosenthal, Noura Farra, and Preslav Nakov. 2017. SemEval-2017 Task 4: Sentiment Analysis in Twitter. In Proceedings of the 11th International Workshop on Semantic Evaluations (SemEval-2017). Vancouver, Canada, pages 502–518.

Cindy K. Soo. 2018. Quantifying Sentiment with News Media across Local Housing Markets. The Review of Financial Studies 31(10):3689–3719. https://doi.org/10.1093/rfs/hhy036.