Detecting Potential Subscribers on Twitch:
      A Text Mining Approach with XGBoost —
      Discovery Challenge ChAT: CoolStoryBob

    Marvin Gärtner[0000−0003−1651−1326] , Andreas Theissler[0000−0003−0746−0424] ,
                               and Marc Fernandes

               Aalen University of Applied Sciences, 73430 Aalen, Germany


         Abstract. In this paper we describe our approach to solve the text clas-
         sification problem of the Chat Analytics for Twitch (ChAT) discovery
         challenge of ECML-PKDD 2020. The task was to predict the subscrip-
         tion status of Twitch users for a given channel based on their comments
         posted within the Twitch chat. Users have the opportunity to support
         channels in the form of monthly subscriptions, giving them exclusive
         subscriber-only features in return. Half of the earnings from subscrip-
         tions are received by the streamers themselves, with the other half going
         to Twitch. Thus, there is a monetary motivation for Twitch and the
         streamers to acquire more subscriptions. The motivation of this research
         is to detect potential subscribers by predicting a user’s subscription sta-
         tus using a trained ML model. These users can then be targeted with
         marketing campaigns. For our solution we use BOW and TF-IDF vec-
         tors as text features as well as additional extracted numerical features.
         We applied downsampling to the majority class and used XGBoost as
         the binary classifier. On the organizers’ evaluation set our submission
         achieved an F1-score of 0.2647 on the class of subscribers (random
         baseline: 0.0741) and reached second place among all submissions.

         Keywords: Twitch.tv · Chat Analytics · Text Mining · Natural Lan-
         guage Processing · XGBoost · ECML-PKDD Discovery Challenge


1      Introduction
With the growth of the video game industry, a new variation of online enter-
tainment has developed. So-called online video game streaming allows private
individuals and professional e-sports athletes to stream their video gameplay
while others watch them play [11]. Among a variety of different streaming plat-
forms, Twitch.tv is by far the most popular one [4]. In 2020, Twitch had more
than 2.4 million average monthly viewers and over 160,000 active channels2 .
Twitch invites streamers with a certain number of monthly viewers to join its
    Copyright ©2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).
2
    https://twitchtracker.com/statistics, accessed June 27th 2020
partnership program, enabling them to become full-time professional streamers.
Besides running ads, Twitch partners can also offer their viewers the option to
subscribe to their channel. Subscribers pay a monthly fee, with half of the earnings going
to Twitch. Consequently, there is a monetary motivation for Twitch and for
the streamers to encourage viewers to subscribe. A subscription offers several
advantages for the viewers including, but not limited to, watching the stream
without advertisements, extended communication channels and subscriber-only
features [7]. The most frequently used communication channel on Twitch is the
integrated Twitch chat (TC), allowing viewers to communicate directly with the
streamer and with other viewers [13]. Here again, subscribers have the advantage
of extended TC functions such as special username highlighting and
the possibility of sending special subscriber emotes [7].
    In this paper we present our submission to the Chat Analytics for Twitch
(ChAT) discovery challenge of ECML-PKDD 2020. The challenge’s task was to
predict whether a Twitch user has subscribed to a channel, by applying ma-
chine learning (ML) methods on the comments posted in the TC. Detecting the
subscription status of users based on their chat data could help to identify po-
tential subscribers to a channel. These results could then be used for targeted
advertisement. In our submission we used TF-IDF and BOW vectors for the
textual features, as well as additionally extracted numerical features. We tested
different ML models, among which XGBoost [3] produced the best result with
an F1-score of 0.2647 on the organizers’ evaluation set, denoted as E, and 0.3229
on our test sets T that were randomly sampled from the provided data. Our
submission reached second place among all submissions.


2   Related Work
Twitch has received considerable attention over the past years. In [1], Barbieri
et al. analyze the problem of emote prediction, i.e. the task of predicting which
emote the user is more likely to use based on a collection of chatroom messages,
as well as the trolling detection problem, i.e. recognizing a certain set of emotes
commonly used in troll messages. They use a bidirectional long short-term
memory network (LSTM), which is compared to a bag-of-words baseline and a logistic
classifier based on word embeddings; the LSTM outperformed both
baselines [1]. Kobs et al. present sentiment analysis based on the
Twitch-exclusive emotes [10], which can be used by users in the chat function. With the
help of a created emote dictionary, they attach the sentiments to the initially
unlabeled chat data set to obtain a labeled data set. This is then used as input
for a convolutional neural network. Their results show the suitability of emotes
as indicators for sentiment analysis [10]. Poché et al. investigate comments of
a similar community in [14]. They analyze user comments on YouTube coding
tutorials to help content creators understand the needs and
concerns of their viewers. They use naive Bayes and support vector machines
(SVM) to classify comments. Their findings can help to deliver higher quality
content and increase the number of subscribers. The discussed papers show dif-
ferent valuable text mining approaches applied on problems like user-targeted
content and sentiment analysis. Despite extensive research in the field of text
mining, the relationship between chat messages and channel subscription status
of users in social media platforms has not been widely investigated. In partic-
ular, Twitch itself has not gained much attention regarding Natural Language
Processing research [1]. With this paper, based on the problem setting defined
by the discovery challenge [9], we aim to help fill this gap.


3     Discovery Challenge: Chat Analytics for Twitch
3.1   Data Set
The data for the ChAT discovery challenge was provided by the organizers from
the Universities of Würzburg, Leipzig, and Weimar [9], who obtained the data
via the Twitch API. The data set has more than 400 million Twitch comments
from English channels captured in January 2020 and is composed of 29 mil-
lion unique channel-user combinations with 7.9 million different users Ui and
140,000 channels Cj . The overall data volume is 38 GB. The two-class data set
is fully labelled, with channel-user combinations being labelled as subscribed or
not subscribed. The class distribution is imbalanced and skewed towards
non-subscription: only 8% of the channel-user combinations hold subscriptions.


3.2   Task Description
The task was to predict whether a user is subscribed to a given channel based on
the user’s chat messages on Twitch [9]. The organizers released the aforementioned
data set approximately 12 weeks prior to the submission deadline. In summary,
the task was to solve an imbalanced binary classification problem requiring text
analytics and supervised ML models. Submissions were evaluated on an evaluation set
E prepared by the organizers, which is composed of 90,000 unseen channel-user
combinations, with 50% of the users in E being present in the
training data with their contributions to other channels. The submitted model
was evaluated with TIRA, an online platform presented by Potthast
et al. in [15]. Due to the class imbalance, the F1-score on the class of subscribed
users was used as the evaluation criterion.


3.3   Applicability of Results
While the prediction of the subscription itself is an interesting and challenging
problem, we consider the results useful beyond this challenge for the
following reason: In the evaluation set of the discovery challenge
the subscription status is obviously not available but is rather the variable to
be predicted. However, one can reasonably assume that Twitch as well as the
streamers know if a user Ui has subscribed to a Twitch channel Cj . Yet, an ML
model to predict the subscription status for Ui × Cj is highly valuable. When
a model can classify a user’s subscription status based on chat messages, it can be
assumed that the model has managed to extract some sort of knowledge of how
to tell if a user is subscribed based on their behavior in the chat. Let C × U
be the channel-user combinations, i.e. the chat messages of users in the different
channels. In the subset of one channel Cj , let Ysub be the set of subscribers and
Ŷsub be the set of users classified as subscribers by some ML model. Since in
practice Ysub is known, the ML models developed in this challenge can be used
to detect potential subscribers P . We refer to potential subscribers as users that
act like subscribers but have not subscribed and might hence be interested in
a subscription, which could be supported e.g. by special offers. So outside the
frame of this discovery challenge, potential subscribers P can be determined with
the models developed in this challenge by:

                                   P = Ŷsub \ Ysub                              (1)
   Note that at first glance it seems counter-intuitive to consider the set of
misclassifications (here, the false positives). However, due to the availability
of the true class labels, this problem setting differs from traditional classification.
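
Equation (1) amounts to a plain set difference. The following is a minimal sketch, assuming hypothetical user IDs for a single channel; the names and values are illustrative only and are not taken from the challenge data.

```python
# Hypothetical user IDs for a single channel C_j (illustrative values only).
y_sub = {"user_a", "user_b", "user_c"}                 # known subscribers (Y_sub)
y_hat_sub = {"user_b", "user_c", "user_d", "user_e"}   # users classified as subscribers

# Potential subscribers P: classified as subscribed, but not actually subscribed.
potential = y_hat_sub - y_sub

print(sorted(potential))
```

Users that are subscribed but not classified as such ("user_a" in this toy case) do not appear in P, since they need no further acquisition effort.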

4     Model Description
4.1   Pre-processing Pipeline
As mentioned above, the data is highly imbalanced, which can cause classifiers to
optimize towards the majority class. As a result, these classifiers tend to predict
the majority class very accurately, but fail to predict the minority class [5]. To
address this, resampling methods were used. Undersampling, where the majority
class is randomly sampled to have the number of samples of the minority class,
was experimentally found to be the best in our setting. SMOTE [2] was tested
but achieved slightly lower results.
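
Random undersampling as described above can be sketched as follows. This is an illustrative stand-alone version using only the Python standard library, not the implementation used for the submission; the function name `undersample`, the seed and the toy data are assumptions.

```python
import random

def undersample(samples, labels, seed=0):
    """Randomly drop majority-class samples until both classes have equal size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep all minority samples and an equally sized random subset of the majority.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [samples[i] for i in kept], [labels[i] for i in kept]

# Toy data mirroring the 8% / 92% class distribution of the data set.
labels = [1] * 8 + [0] * 92
samples = [f"chat_{i}" for i in range(len(labels))]

X_bal, y_bal = undersample(samples, labels)
print(sum(y_bal), len(y_bal) - sum(y_bal))
```

In such a setup the resampling would typically be applied to the training portion only, so that held-out data keeps its original class distribution.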
    In order to reduce noise in the train set, the following pre-processing steps
(see Fig. 1) were applied to the text data before training the classifiers [6]: lower
case conversion, replacement of emojis3 and emoticons4 by a text representa-
tion, removal of the most common colloquial terms, stop words removal using
NLTK’s stop words list5 , removal of the most and least frequently used words,
lemmatization using WordNet’s Lemmatizer6 , replacement of letters that occur
consecutively more than twice (e.g. goood → good), and removal of words that
only contain one character.
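
A few of the listed steps can be sketched in plain Python. This is a simplified illustration, not the pipeline of Fig. 1: the stop word list is a tiny hypothetical stand-in for NLTK's list, and emoji replacement, colloquial term removal, frequency-based filtering and lemmatization are omitted.

```python
import re

# Tiny stand-in stop word list (assumption); the paper uses NLTK's list.
STOP_WORDS = {"the", "a", "is", "this", "and", "to"}

def preprocess(message):
    """Apply a subset of the pre-processing steps to one chat message."""
    tokens = message.lower().split()                      # lower case conversion
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop word removal
    # Letters occurring more than twice in a row are reduced to two (goood -> good).
    tokens = [re.sub(r"([a-z])\1{2,}", r"\1\1", t) for t in tokens]
    return [t for t in tokens if len(t) > 1]              # drop single characters

print(preprocess("This GAME is sooooo goood u r the best"))
```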

4.2   Feature Engineering
For the numerical representation of the chat messages, bag-of-words (BOW)
and term frequency-inverse document frequency (TF-IDF) were evaluated. The
3
  https://pypi.org/project/emoji/
4
  https://github.com/NeelShah18/emot/blob/master/emot/emo_unicode.py
5
  https://www.nltk.org/nltk_data/
6
  https://www.nltk.org/howto/wordnet.html

                     Fig. 1: Pre-processing of the raw user messages


BOW considers the term frequency TF(w,d) assuming that a more frequent
word w in a document d is more representative for the document [8, 12]. For the
representation of the game titles we decided to use the BOW model, since the
context is irrelevant here. Furthermore, we assume that the more often a game
occurs in the text the more likely it represents a respective class.
    An alternative is TF-IDF, which considers the relevance of words by
multiplying TF with the inverse document frequency (IDF). In TF-IDF, a word
that occurs in fewer documents is assigned a higher importance and vice versa [12].
From the most frequent words in Fig. 2, it can be seen that the text data does
not differ significantly between subscribers and non-subscribers. From this we
conclude that words with a lower frequency are more likely to represent the class
than those with a higher frequency, which is why we chose TF-IDF vectors for
the numerical representation of the chat data.
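
The difference between the two representations can be illustrated on a toy corpus. The sketch below assumes the plain IDF variant log(N / df); the exact variant used for the submission is not stated here, and library implementations usually add smoothing and normalization. The example documents are hypothetical.

```python
import math
from collections import Counter

# Toy "documents": one concatenated chat history per channel-user combination.
docs = [
    "lol gg lol nice",
    "gg nice stream",
    "lol gg poggers",
]
tokenized = [d.split() for d in docs]

# Bag-of-words: raw term frequency TF(w, d).
bow = [Counter(tokens) for tokens in tokenized]

# Document frequency: in how many documents each word occurs.
df = Counter(w for tokens in tokenized for w in set(tokens))
n_docs = len(docs)

def tfidf(word, doc_index):
    """TF-IDF with the plain IDF variant log(N / df) (an assumption)."""
    return bow[doc_index][word] * math.log(n_docs / df[word])

print(bow[0]["lol"], tfidf("gg", 0), tfidf("lol", 0), tfidf("poggers", 2))
```

A word used in every document ("gg") receives a weight of zero, while a word that is specific to one document ("poggers") receives the highest weight, which matches the motivation given above.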


      Fig. 2: Most frequent words of subscribers (left) and non-subscribers (right)


    As one key feature we identified the number of subscriber emotes used.
Twitch emotes are images or animations that can be posted by users in the
chat and provide a quick and wordless form of expression [10]. A certain number
of emotes are available for all users, whereas the majority of emotes are exclu-
sive subscriber emotes which can only be used by subscribing to a channel [7] or
by having received special gifts which is more likely when frequently watching a
channel. During data analysis we observed that the subscriber emotes in the com-
ments are shown as text beginning with a lower-case letter followed by an upper-
case letter. This common syntax was used to extract the frequency of emotes.
The public emotes7 , which are available to all users, were excluded since their
frequency was found to be no indicator for a channel subscription. In addition, we
extracted further numerical features from the user messages, as can be seen in
Table 1. The correlation of the features with the subscription status was determined
with the point biserial correlation coefficient

$$ r_{pbs} = \frac{m_{sub} - m_{notsub}}{s} \sqrt{\frac{n_{sub} \, n_{notsub}}{n^2}} $$
7
    https://twitchemotes.com/
with rpbs ∈ [−1, 1], where mi and ni refer to the mean values and number of
instances of the two classes, n to the total number of instances, and s to the
standard deviation.
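
The coefficient defined above can be computed in a few lines. The sketch below is illustrative: it assumes that s is the population standard deviation over all instances, and the feature values are hypothetical. As a sanity check, it compares the result with the Pearson correlation between the feature and the binary label, which coincides with the point biserial coefficient under this assumption.

```python
import math
from statistics import mean, pstdev

def point_biserial(values_sub, values_notsub):
    """r_pbs = (m_sub - m_notsub) / s * sqrt(n_sub * n_notsub / n^2)."""
    all_values = list(values_sub) + list(values_notsub)
    n_sub, n_notsub, n = len(values_sub), len(values_notsub), len(all_values)
    s = pstdev(all_values)  # assumption: population standard deviation of all values
    diff = mean(values_sub) - mean(values_notsub)
    return diff / s * math.sqrt(n_sub * n_notsub / n**2)

def pearson(xs, ys):
    """Plain Pearson correlation, used only as a cross-check."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature values (e.g. number of distinct emotes) per class.
sub = [3, 5, 4]
notsub = [1, 2, 0, 1]

r = point_biserial(sub, notsub)
labels = [1] * len(sub) + [0] * len(notsub)
print(round(r, 4), round(pearson(sub + notsub, labels), 4))
```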


Table 1: Numerical features extracted for each channel-user combination and their
correlation with the subscription status

feature                       correlation    feature                               correlation
number of words                  0.088       number of emojis                         0.011
number of distinct emotes        0.168       number of emotes                         0.067
number of words in upper case    0.040       number of single chars                   0.081
average word length              0.040       number of words with numerical values    0.024
number of stop words             0.084       number of words starting with !, #, @    0.063
number of emoticons              0.030       sentiment using TextBlob8 (-1...+1)      0.051
number of distinct games         0.169       number of messages                       0.110
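
The emote-related features in Table 1 rely on the observation that subscriber emotes appear as text beginning with a lower-case letter followed by an upper-case letter. A minimal sketch of such an extraction is given below; the regular expression is one possible reading of that syntax, and the emote names and the exclusion list are hypothetical stand-ins, not the lists used for the submission.

```python
import re

# Assumed pattern: lower-case prefix followed by an upper-case letter, e.g. "abcHype".
EMOTE_PATTERN = re.compile(r"[a-z][a-z0-9]*[A-Z][A-Za-z0-9]*")

# Hypothetical exclusion list standing in for the public emotes.
PUBLIC_EMOTES = {"bleedPurple"}

def emote_features(messages):
    """Return (number of emotes, number of distinct emotes) for a list of messages."""
    emotes = [
        token
        for message in messages
        for token in message.split()
        if EMOTE_PATTERN.fullmatch(token) and token not in PUBLIC_EMOTES
    ]
    return len(emotes), len(set(emotes))

messages = ["abcHype abcHype nice play", "Kappa xyzLove bleedPurple", "LUL gg"]
print(emote_features(messages))
```

Tokens starting with an upper-case letter (such as "Kappa" or "LUL") and ordinary lower-case words do not match the pattern and are therefore not counted.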


    In order to combine all features into one feature space, we used scikit-learn’s
column transformer9 , a pipeline that takes input of different data types, then
performs the desired transformations and finally combines all features into one
feature space, ready to be used as input by an ML algorithm. After creating the
input feature space, we used another pipeline that takes the transformed features
as input, scales them and finally trains our model. Fig. 3 illustrates the process
of feature transformation and model training.


            Fig. 3: ML pipeline: feature extraction and training of XGBoost.


4.3     Classification
For classification we evaluated AdaBoost, SVMs, logistic regression, decision
trees, and XGBoost. We also experimented with deep neural networks (DNNs).
However, since the feature representation rather than the classifier itself
turned out to be the decisive factor, we discarded DNNs due to their significantly
slower training times. We adapted the training process to optimize for F1-score
instead of accuracy. For training, 5-fold cross-validation (CV) was used. During CV and on our
8
    https://textblob.readthedocs.io/en/dev/quickstart.html#sentiment-analysis
9
    https://scikit-learn.org/stable/modules/compose.html#column-transformer
test sets, XGBoost [3] (a classifier combining the benefits of tree methods and
ensembles using gradient-boosted decision trees) achieved the highest F1-score.
Besides yielding the best results, XGBoost proved particularly
useful for the following reasons: it yielded robust results over different
settings during our experiments, it has a moderate computational complexity
compared to DNNs, and as a tree-based method it is scale-invariant, which is
beneficial while experimenting with features.


5   Experiments and Results
In this section we describe the results on our randomly sampled test sets T
and the provided evaluation set E. We adopt the baseline provided by the
organizers of the challenge: by randomly classifying channel-user combinations
as subscribed/not subscribed, given the respective class distribution of 8% and
92%, the random baseline F1-score is 0.0741.
    Before we initiated the training procedure, we isolated 5 randomly sampled
test sets to test our classifier on unseen data. Then we resampled the data
and used 5-fold CV to evaluate the generalizability of our model. From the
tested classifiers, the best results on our own test sets were achieved using
XGBoost [3] with the described pre-processing and feature extraction steps. This
model was submitted and achieved an F1-score of 0.3229 on our sampled test
sets T . On the evaluation set E an F1-score of 0.2647 (see Table 2) was achieved,
with 43.41% of the subscribers detected, i.e. a recall of 0.4341 (see Table 3).


Table 2: Results of evaluated models on our own test sets T and for the submitted
model on the organizers’ evaluation set E.

          data set          baseline  XGBoost  AdaBoost  SVM     log. regr.  dec. trees
          own test sets T   0.0741    0.3229   0.3033    0.2171  0.3029      0.2383
          evaluation set E  0.0741    0.2647   -         -       -           -


Table 3: Confusion matrix of evaluation set E with 90,000 channel-user combinations.
                                    prediction: not subscribed prediction: subscribed
            class: not subscribed             72363                    11440
            class: subscribed                  3507                     2690
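
The reported scores on E follow directly from the counts in Table 3. The short calculation below recomputes precision, recall and F1-score for the class of subscribers using the standard definitions; only the variable names are our own.

```python
# Counts from Table 3 (positive class: subscribed).
tn, fp = 72363, 11440   # class: not subscribed
fn, tp = 3507, 2690     # class: subscribed

precision = tp / (tp + fp)   # share of predicted subscribers that are subscribed
recall = tp / (tp + fn)      # share of subscribers that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

The comparatively low precision reflects the 11,440 non-subscribers classified as subscribed, which is exactly the group P discussed in Section 3.3.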
6   Discussion
In general, all submissions in this challenge had relatively low F1-scores,
indicating that the classes are hard to separate based on a user’s chat messages.
We found that the choice of the classifier itself did not dramatically improve
the results; the choice of pre-processing and feature extraction steps was more
crucial. The number of distinct emotes was identified as a particularly strong
feature. As shown in Fig. 4, the importance of the distinct emotes count
exceeds the importance of the other features significantly. For comparison we re-trained
the model without the number of distinct emotes, resulting in a decrease of the
F1-score from 0.3229 to 0.3012 on T . Despite the low F1-scores, following our
discussion on the applicability of these models in Section 3.3, the developed ML
models might still prove useful for the acquisition of new subscribers.


      Fig. 4: Top 5 features measured by XGBoost feature importance (gain)


    While on our own randomly sampled test sets T we achieved an F1-score of
0.3229, the result on the evaluation set E was 0.2647, which could suggest
overfitting. In this case, however, we assume that the lower result is predominantly
caused by the composition of the evaluation set E. In E the channels and
users were categorized into different activity groups, measured by the number
of chat messages. The lower and upper 25% of the channels and users
correspond to an activity of "low" and "high" respectively. The remaining 50%
correspond to the category "normal". As a result, there are 9 different activity
categories for each channel-user combination (channel=low&user=low,
channel=low&user=normal, ...). From each category 10,000 channel-user
combinations were sampled, yielding the evaluation set with 90,000 instances. While the
activity categories were sampled to occur with equal frequencies, in the full data
set a rather different distribution is present. Table 4 shows a comparison of the
distributions of the raw data used to train our model and the evaluation set E.
    In addition, the model’s F1-scores for each combination in E are given. On the
one hand, our model performed poorly in particular for low activities of users
and/or channels. On the other hand, these combinations are quite rare in practice,
as shown by the percentages for the full data set. For higher channel and/or user
activities our model performs significantly better. Naively calculating a weighted
average of the categories’ F1-scores, weighted by their occurrence in the full data
set, yields an F1avg of 0.297.

Table 4: Distribution of channel-user activity combinations in full data set and evalu-
ation test set E (l: low, n: normal, h: high) and model performance on each category

channel-user activity        l-l     l-n     l-h     n-l     n-n     n-h     h-l     h-n     h-h
full data set distribution   0.03%   0.49%   7.99%   0.16%   1.87%   29.03%  0.34%   5.22%   54.87%
evaluation set distribution  11.11%  11.11%  11.11%  11.11%  11.11%  11.11%  11.11%  11.11%  11.11%
F1-score (test set)          0.1565  0.1872  0.1898  0.1567  0.2139  0.2721  0.3546  0.3913  0.3206
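
The naive weighted average F1avg can be reproduced from the values in Table 4. The sketch below uses the full data set distribution as weights; the variable names are our own. For comparison it also prints the unweighted mean of the category scores, which is not the same quantity as the overall F1-score on E.

```python
# Values from Table 4, ordered l-l, l-n, l-h, n-l, n-n, n-h, h-l, h-n, h-h.
full_share = [0.03, 0.49, 7.99, 0.16, 1.87, 29.03, 0.34, 5.22, 54.87]  # in percent
f1_scores = [0.1565, 0.1872, 0.1898, 0.1567, 0.2139, 0.2721, 0.3546, 0.3913, 0.3206]

# Weight each category's F1-score by its occurrence in the full data set.
weighted_f1 = sum(w * f for w, f in zip(full_share, f1_scores)) / sum(full_share)

# Equal weights, mirroring the uniform category sampling of E.
uniform_f1 = sum(f1_scores) / len(f1_scores)

print(round(weighted_f1, 3), round(uniform_f1, 3))
```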


7    Conclusion and Future Work
In this work we presented a text mining approach to predict the subscription
status of Twitch users for a given channel as part of the ChAT discovery chal-
lenge. We found feature representation to be more important than the actual
classifier. As features we encoded the text as TF-IDF and BOW vectors and
additionally extracted numerical features. We experimented with different
classifiers, with XGBoost achieving the best results with an F1-score of 0.3229 on our
randomly sampled test sets. With an F1-score of 0.2647 on the evaluation set
we reached second place in the challenge. The results show that the problem is
challenging and requires more research. Nevertheless, we believe that the
submitted models can be used for the acquisition of new subscribers, given that
the true subscription status is known in practice (see Section 3.3).
     In the future, the representation of the chat messages using word embeddings
like Word2Vec, as described in [1], could be a promising direction. Also the use
of convolutional neural networks as in [10] might be an option for improvement.
Applying models on the raw text in order to learn feature representations is
another option worth researching.


Acknowledgements
We would like to thank Marc Ebert and Dominik Hahn for implementation
support during pre-processing and Julian Theissler for his valuable hints during
feature engineering based on his domain knowledge on Twitch. Finally, our grat-
itude goes to the organizers of the discovery challenge Konstantin Kobs, Martin
Potthast, Albin Zehe, and Matti Wiegmann [9].


References
 1. Barbieri, F., Espinosa Anke, L., Ballesteros, M., Soler, J., Saggion, H.: To-
    wards the understanding of gaming audiences by modeling twitch emotes.
    In: Proceedings of the 3rd Workshop on Noisy User-generated Text. pp. 11–
    20. Association for Computational Linguistics, Stroudsburg, PA, USA (2017).
    https://doi.org/10.18653/v1/W17-4402
 2. Chawla, N.V., Bowyer, K.W., Hall, L.O., Kegelmeyer, W.P.: SMOTE: Synthetic
    minority over-sampling technique. J. Artif. Int. Res. 16(1), 321–357 (Jun 2002)
 3. Chen, T., Guestrin, C.: XGBoost: A Scalable Tree Boosting System. In:
    22nd ACM SIGKDD International Conference on Knowledge Discovery and
    Data Mining. p. 785–794. KDD ’16, ACM, New York, NY, USA (2016).
    https://doi.org/10.1145/2939672.2939785
 4. Claypool, M., Farrington, D., Muesch, N.: Measurement-based analysis of
    the video characteristics of twitch.tv. In: 2015 IEEE Games Entertainment
    Media Conference (GEM). pp. 1–4. IEEE (2015).
    https://doi.org/10.1109/GEM.2015.7377227
 5. Estabrooks, A., Jo, T., Japkowicz, N.: A multiple resampling method for learn-
    ing from imbalanced data sets. Computational Intelligence 20(1), 18–36 (2004).
    https://doi.org/10.1111/j.0824-7935.2004.t01-1-00228.x
 6. Haddi, E., Liu, X., Shi, Y.: The role of text pre-processing in
    sentiment analysis. Procedia Computer Science 17, 26–32 (2013).
    https://doi.org/10.1016/j.procs.2013.05.005
 7. Hamilton, W.A., Garretson, O., Kerne, A.: Streaming on twitch. In: Proceed-
    ings of the 32nd annual ACM conference on Human factors in computing sys-
    tems - CHI ’14. pp. 1315–1324. ACM Press, New York, New York, USA (2014).
    https://doi.org/10.1145/2556288.2557048
 8. Joachims, T.: A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for
    Text Categorization. In: Proceedings of the Fourteenth International Conference
    on Machine Learning. p. 143–151. ICML ’97, Morgan Kaufmann Publishers Inc.,
    San Francisco, CA, USA (1997)
 9. Kobs, K., Potthast, M., Wiegmann, M., Zehe, A., Stein, B., Hotho, A.: Towards
    Predicting the Subscription Status of Twitch.tv Users — ECML-PKDD ChAT
    Discovery Challenge 2020. Proceedings of ECML-PKDD 2020 ChAT Discovery
    Challenge (2020)
10. Kobs, K., Zehe, A., Bernstetter, A., Chibane, J., Pfister, J., Tritscher, J.,
    Hotho, A.: Emote-Controlled: Obtaining Implicit Viewer Feedback Through
    Emote-Based Sentiment Analysis on Comments of Popular Twitch.Tv Channels.
    Trans. Soc. Comput. 3(2) (Apr 2020). https://doi.org/10.1145/3365523
11. Nascimento, G., Ribeiro, M., Cerf, L., Cesario, N., Kaytoue, M., Raissi, C., Vas-
    concelos, T., Meira, W.: Modeling and analyzing the video game live-streaming
    community. In: 2014 9th Latin American Web Congress. pp. 1–9. IEEE (2014).
    https://doi.org/10.1109/LAWeb.2014.9
12. Neto, J.L., Santos, A.D., Kaestner, C.A.A., Freitas, A.A.: Document clustering
    and text summarization (2000)
13. Pan, R., Bartram, L., Neustaedter, C.: Twitchviz. In: Kaye, J., Druin, A.,
    Lampe, C., Morris, D., Hourcade, J.P. (eds.) Proceedings of the 2016 CHI
    Conference Extended Abstracts on Human Factors in Computing Systems -
    CHI EA ’16. pp. 1959–1965. ACM Press, New York, New York, USA (2016).
    https://doi.org/10.1145/2851581.2892427
14. Poche, E., Jha, N., Williams, G., Staten, J., Vesper, M., Mahmoud, A.: Analyz-
    ing user comments on youtube coding tutorial videos. In: 2017 IEEE/ACM 25th
    International Conference on Program Comprehension (ICPC). pp. 196–206. IEEE
    (2017). https://doi.org/10.1109/ICPC.2017.26
15. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: Tira integrated research archi-
    tecture. In: Information Retrieval Evaluation in a Changing World, pp. 123–160
    (2019). https://doi.org/10.1007/978-3-030-22948-1_5