StinkyCheese: Chat-Based Model for Subscription Classification

Túlio Corrêa Loures[0000-0003-1321-4844], Gustavo Lúcius Fernandes[0000-0002-1748-8976], Fernanda G. Araújo[0000-0002-3035-2678], Karen S. Martins[0000-0001-7949-4573], and Pedro O. S. Vaz-de-Melo[0000-0002-9749-0151]

Universidade Federal de Minas Gerais, Belo Horizonte - MG, Brazil
{loures.tc,gustavo.lucius,fernandaguim,karensm,olmo}@dcc.ufmg.br

Abstract. In this paper, we report the process of building a binary classifier for the Chat Analytics for Twitch challenge. The goal is to predict the subscription status of a user in a channel based on their comments. Our final submission achieves an F1-score of 0.1422 on the dataset used to evaluate the classifiers. We explore the results and try to understand which scenarios contribute to this performance. We hope these analyses can be used to direct new research and to improve classifiers designed for this or related tasks.

Keywords: Chat Analytics · Deep Learning · Classification

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Streaming platforms such as Twitch are extremely popular, attracting millions of broadcasters and hundreds of millions of viewers each month (https://blogs.wsj.com/digits/2015/01/29/twitchs-viewers-reach-100-million-a-month/, visited July 23, 2020). Through them, multimedia content creators stream a multitude of recordings or live streams on the internet. Some platforms also enable live interactions between the broadcasters and their audience. This is the case of Twitch (https://www.twitch.tv/, visited June 23, 2020), which started mostly as a game streaming platform and has since grown to include other types of content, such as cooking, programming, music, and tourism channels.

To encourage and attract content creators, the Twitch platform has a monetization model in partnership with the owners of stream channels [11]. This model offers three ways in which partners can earn money: channel subscriptions, Bits, and advertisements. By subscribing to a channel, viewers pay monthly for packages of different prices (e.g. $4.99, $9.99, or $24.99) and receive in return permission to access exclusive content. Viewers can also purchase Bits, a virtual product that serves to incentivize and financially support the channel owner. Lastly, channel owners can broadcast advertisements on their channels. In all cases, the collected amount is divided between Twitch and the channel owner.

In this context, the Chat Analytics for Twitch (ChAT) challenge [6] proposes the construction of a binary classifier to predict the subscription status of user-channel pairs based on the comments the user made in the channel's chat. With that, it would be possible to identify, for instance, which users are potentially interested in a subscription and serve them targeted advertising. In this work, we propose a classification model that combines text representations of the messages with conversational features.

2 The Dataset

2.1 Description

The ChAT dataset was provided by the ECML-PKDD 2020 Discovery Challenge [6] and is divided into a training and a test set. The dataset consists of channel and user combinations where, for each user-channel pair, it contains all messages posted by the user in the channel and some metadata. The metadata contains two more keys for each message in the user-channel pair: the timestamp of the comment and the game that was being played in the channel when the comment was posted. In addition, for each user-channel pair we have the corresponding subscription status, that is, whether the user is subscribed to the channel or not. In total, the training set contains more than 700,000,000 messages posted in English channels in January 2020, divided across almost 30 million user-channel pairs. For the test set, 90,000 user-channel instances were given.
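To make this structure concrete, the following is a minimal sketch of how one such instance could be represented in Python. The field names are illustrative only; the actual file format of the ChAT dataset is not reproduced here.

```python
# Hypothetical representation of one user-channel instance, based only on the
# description above; the real dataset's keys and layout may differ.
instance = {
    "user": "user_123",          # illustrative identifier
    "channel": "channel_abc",    # illustrative identifier
    "messages": [
        {"text": "good game!", "timestamp": 1578000000, "game": "Fortnite"},
        {"text": "PogChamp",   "timestamp": 1578000042, "game": "Fortnite"},
    ],
    "subscribed": True,  # label: subscription status of the user-channel pair
}
```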
2.2 Initial Analysis

Before proposing a classification model, we performed some initial exploratory analysis on the dataset. We generated a random sample containing 500,000 user-channel pairs, of which 85,985 are subscribed and 414,015 are non-subscribed. First, we measured the number of messages per user-channel pair. On average, users write 70 messages in channels they are subscribed to (median 20). On the other hand, non-subscribed users write an average of only 25 messages (median 4). A similar trend can be observed for the total length of the messages. In other words, users who are subscribed to a channel tend to post longer messages than non-subscribed users.

We also analyzed the frequency of messages for subscribed and non-subscribed users. Figure 1 illustrates the timestamps of consecutive comments (starting with a timestamp of 0 for the first comment) for two example instances in the same channel, one of a subscribed user and another of a non-subscribed user. Note that, while the non-subscribed user made more comments, that user's participation had a shorter duration than the participation of the subscribed user. While this single example cannot be used as a generalization, it showcases the intricacies of trying to relate frequency of comments to the subscription status.

Fig. 1: Messages over time for two example instances.

3 The StinkyCheese Model

Deep learning architectures have become the state-of-the-art solution for several tasks, including text classification [2, 5]. Long Short-Term Memory (LSTM) [4] is a deep learning model that has achieved great success in this task due to its ability to learn long-term dependencies and extract high-level text information [13]. Our proposed model (implementation available at https://github.com/lourestc/chatpkdd), represented in Figure 2, is in essence a combination of the text representations of comments learned by an LSTM architecture concatenated with additional descriptive features.

Fig. 2: Structure of the StinkyCheese model (blocks: Text Input → Embedding → LSTM; Features → Concatenation → Dense → Output).

The LSTM architecture receives as input the word embeddings of the messages and has 100 units. It is responsible for learning the text representation of the messages in the user-channel pair. The word embeddings were randomly initialized, with a size of 100, and then learned during the training process together with the remainder of the network. We avoided using pre-trained embeddings, given that they are usually built on more formal texts (such as Google News [8] or Wikipedia [12]), while Twitch chat tends to have highly irregular, noisy and informal text, especially with the use of unique tokens for emotes (for examples, see https://twitchemotes.com/). Thus, we decided to train the embeddings for our text model entirely on the Twitch chat messages.

Finally, we applied a dropout of 0.5 for regularization, both for the embedding layer and the LSTM architecture. The text representation learned by the LSTM is concatenated with additional features, which are described in the following subsection. The combined tensor is then passed through a simple dense layer, and the output is predicted with a sigmoid activation function. The loss of the model is calculated with the binary cross-entropy function.
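The following is a minimal Keras sketch of this architecture, under our reading of the description above. The layer sizes follow the text (100-dimensional embeddings, a 100-unit LSTM, dropout of 0.5 on the embedding and LSTM layers); the vocabulary size is a placeholder, the feature count anticipates Section 3.1, and the exact placement of the dropout is our assumption rather than the original implementation.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.optimizers import Adagrad

VOCAB_SIZE = 50_000   # placeholder: the actual vocabulary size is not reported
MAX_WORDS = 200       # best maximum sequence length found in Section 4.3
NUM_FEATURES = 11     # the conversational features of Section 3.1

text_in = layers.Input(shape=(MAX_WORDS,), name="text_input")
feat_in = layers.Input(shape=(NUM_FEATURES,), name="features")

x = layers.Embedding(VOCAB_SIZE, 100)(text_in)  # embeddings learned from scratch
x = layers.Dropout(0.5)(x)                      # dropout on the embedding layer
x = layers.LSTM(100, dropout=0.5)(x)            # 100-unit LSTM text representation

merged = layers.Concatenate()([x, feat_in])     # join text and feature tensors
out = layers.Dense(1, activation="sigmoid")(merged)

model = Model(inputs=[text_in, feat_in], outputs=out)
model.compile(optimizer=Adagrad(learning_rate=1e-4),  # best values from Section 4.3
              loss="binary_crossentropy")
```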
3.1 Conversational Features

Considering the subject of the challenge, the data presents other relevant information that could be useful apart from the text of the messages themselves, such as the user-channel relationships and the structure of the users' participation in the chat system. Thus, besides the text model previously described, we defined features that could help the classification model based on these conversational traits. For each instance, features were calculated based on these characteristics, taking into account the other users and channels in the dataset.

These features can be divided into groups related to three different aspects: verbosity, which is derived from traits of the user's messages in the given channel; attendance, which is related to the frequency with which the user posts messages in the channel; and participation, which indicates how much the given user and channel participate in the dataset as a whole. A sketch of how such features could be computed is shown after the feature lists below.

Verbosity. These features measure how much content is generated by the user in the channel. Both the amount and the length of the comments are considered. As a reference point for these numbers, average values for the user and for the channel are used as additional features.

– NumMsg_UC: The number of messages sent by the user in the channel.
– LenMsgs_UC: The text size (term count) of the user's concatenated messages in the channel.
– AvgLenMsgs_U: The average text size (term count) of the user's concatenated messages, considering all the channels in which they participate.
– AvgLenMsgs_C: The average concatenated-message text size (term count) of users participating in the channel.

Attendance. These features aim to capture the assiduity and permanence of the user in the channel. They measure both the length of time over which the user participated in the channel and the intervals between messages.

– AttendanceTime_UC: The total time of the user's activity in the channel, that is, the time between the user's last and first messages in the channel.
– AvgInterval_UC: The average interval between messages sent by the user in the channel.
– StdInterval_UC: The standard deviation of the intervals between messages sent by the user in the channel.
– MinInterval_UC: The shortest interval between messages sent by the user in the channel.
– MaxInterval_UC: The longest interval between messages sent by the user in the channel.

Participation. These features inform the general involvement of the user and of the channel in the dataset. In a way, they measure the dedication between the user-channel pair.

– NumChannels_U: The number of channels in which the user participates.
– NumUsers_C: The number of users participating in the channel.
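As referenced above, the following is a minimal pandas sketch of how these features could be computed from a hypothetical message-level DataFrame with columns user, channel, timestamp (in seconds), and n_terms (term count per message). The column names and structure are ours, not the challenge's.

```python
import pandas as pd

def conversational_features(df: pd.DataFrame) -> pd.DataFrame:
    """Compute the Section 3.1 features; one output row per user-channel pair."""
    g = df.groupby(["user", "channel"])

    feats = pd.DataFrame({
        # Verbosity
        "NumMsg_UC": g.size(),
        "LenMsgs_UC": g["n_terms"].sum(),
        # Attendance (interval statistics are NaN for single-message pairs)
        "AttendanceTime_UC": g["timestamp"].max() - g["timestamp"].min(),
        "AvgInterval_UC": g["timestamp"].apply(lambda t: t.sort_values().diff().mean()),
        "StdInterval_UC": g["timestamp"].apply(lambda t: t.sort_values().diff().std()),
        "MinInterval_UC": g["timestamp"].apply(lambda t: t.sort_values().diff().min()),
        "MaxInterval_UC": g["timestamp"].apply(lambda t: t.sort_values().diff().max()),
    })

    # Verbosity reference points: averages over a user's channels / a channel's users
    feats["AvgLenMsgs_U"] = feats.groupby("user")["LenMsgs_UC"].transform("mean")
    feats["AvgLenMsgs_C"] = feats.groupby("channel")["LenMsgs_UC"].transform("mean")

    # Participation: each pair appears once in feats, so group sizes count
    # distinct channels per user and distinct users per channel
    feats["NumChannels_U"] = feats.groupby("user")["NumMsg_UC"].transform("size")
    feats["NumUsers_C"] = feats.groupby("channel")["NumMsg_UC"].transform("size")
    return feats
```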
4 Experimental Setup

To run our training process, we used a computer with an Intel(R) Core(TM) i7-9700KF CPU @ 3.60GHz processor, 56 GiB of memory, and an NVIDIA GeForce RTX 2080 Ti GPU. The implementation was done in Python, using the Keras package.

4.1 Dataset Division

For the training process performed in this project, we divided the training dataset into files of 100,000 (one hundred thousand) instances (channel-user pairs) each. All preprocessing and dataset manipulation steps were performed on a single subfile at a time. So, for example, when counting the number of channels a user participated in, our experimentation only considered events in the same subfile as the instance for which the feature was being calculated. Although this reduces the overall quality of the training process, it was done so that the preprocessing steps needed for the model described in Section 3 could be performed more easily. In this paper, only one such file was used for the training step of the model, and a second one for the validation of the model. Using the remaining instances would potentially provide more robust results in future work.

4.2 Data Preprocessing

As a first step, we preprocessed the dataset into a format that could be more easily used in our model. The messages from each user-channel pair are initially concatenated in order of occurrence, with no separation tokens between messages. Then stop words are removed from the text. We split the text into terms, considering each sequence of punctuation characters as a single token (in order to capture the use of simple emoticons, such as ;-/). Finally, we calculated the features defined in Section 3.1. This preprocessing step took 11 minutes for the file we used in the experiments.
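The following is a minimal sketch of this tokenization step. The regular expression and the stop-word list are our own illustrative choices, not the original implementation; the key property is that runs of punctuation survive as single tokens.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "to", "and"}  # illustrative subset only

def tokenize(text: str) -> list[str]:
    # Split into word-character runs or punctuation runs, then drop stop words;
    # a punctuation run such as ";-/" is kept as one token.
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(tokenize("good game ;-/ see you tomorrow"))
# ['good', 'game', ';-/', 'see', 'you', 'tomorrow']
```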
4.3 Ablation Study

In this section, we describe an ablation study performed over a number of aspects that may influence the performance of our model. The hyperparameters were selected using a second 100,000-instance file as the validation (dev) set, and the scores are also reported on this set. We train a number of models with differing hyperparameters, exploring the impact of the optimizer, the learning rate, and the maximum number of words on the F1-score.

We follow practical recommendations for some of the most commonly used hyperparameters: it is often the case that a few of them make a big difference [1]. We first experiment on the learning rate, perhaps the most important hyperparameter [3, p. 424]. It is typical to pick the learning rate using a grid search with values approximately following a logarithmic scale [3, p. 428], so we take values within the set {0.1, 0.01, 0.001, 0.0001}. Regarding the optimization process, there is no consensus on the optimal algorithm to use [10], so we chose three of the most popular optimizers: RMSprop, Adam, and Adagrad. Lastly, we varied the number of words per sequence used for the text representation, with 200, 300 and 400 as values. These values were chosen based on the computational resources available at the time. Note that even a 400-word limit is not enough to cover all instances: 865 test-set instances and 9,120 training-set instances cannot be completely covered with 400 tokens. Moreover, cutting the last words of each sequence can have a different effect than cutting the first words or cutting at random, although we have not evaluated these different approaches in this work.

The best overall result was obtained with the Adagrad optimizer, a learning rate of 0.0001, and a 200-word representation; the most promising experiments are shown in Table 1. Once the best hyperparameter selection was defined, we performed a final training run of the model over 10 epochs. This execution took 83 minutes and 13 seconds to complete.

OPT      LR      MW   Dev set F1
Adagrad  0.0001  200  0.309
Adagrad  0.001   200  0.303
Adam     0.0001  300  0.255
Adam     0.0001  300  0.244
RMSprop  0.0001  300  0.238

Table 1: Ablation over the hyperparameters. OPT = optimizer; LR = learning rate; MW = maximum number of words.
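For reference, the following is a minimal sketch of such a grid search. The helpers build_model and evaluate_f1 are hypothetical stand-ins (e.g. the Keras sketch of Section 3 and a dev-set F1 computation); they are not the original code.

```python
from itertools import product

OPTIMIZERS = ["rmsprop", "adam", "adagrad"]
LEARNING_RATES = [0.1, 0.01, 0.001, 0.0001]
MAX_WORDS = [200, 300, 400]

def grid_search(build_model, train, dev, evaluate_f1):
    """Try every combination; return the best (optimizer, lr, max_words) by dev F1."""
    best_config, best_f1 = None, -1.0
    for opt, lr, mw in product(OPTIMIZERS, LEARNING_RATES, MAX_WORDS):
        model = build_model(optimizer=opt, learning_rate=lr, max_words=mw)
        model.fit(*train)              # fit on the 100,000-instance training file
        f1 = evaluate_f1(model, *dev)  # score on the 100,000-instance dev file
        if f1 > best_f1:
            best_config, best_f1 = (opt, lr, mw), f1
    return best_config, best_f1
```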
5 Results

5.1 General Evaluation

For the ChAT challenge, participants were asked to submit their code to run on the TIRA platform [9]. Each submitted model was then run on a hidden test set, and the results were shared (https://events.professor-x.de/dc-ecmlpkdd-2020/results.html, visited July 20, 2020). Our final submission achieves an F1-score of 0.1422 on the test set, ranking in fourth place in the challenge. As a baseline, a random sampler classifier achieves an F1-score of 0.0741 on the same set.

5.2 Qualitative Analysis

In this section, we analyze how our model behaves based on the activity level of the user, provided in the ground-truth file of the test set. This activity is classified as low, normal, or high, according to the number of messages written or received. As shown in Figure 3a, the model predicts subscribed users with a high activity level with 0.72 accuracy. On the other hand, it makes wrong predictions for subscribed users with low activity, achieving only 0.232 accuracy for this subset. This tendency, however, is not seen among non-subscribed users. While the model has high accuracy (0.787) when predicting low-activity users in this scenario, it often fails (0.38 accuracy) for the highly active ones, as seen in Figure 3b.

Fig. 3: Accuracy according to user activity. (a) Subscribed users; (b) Non-subscribed users.

This analysis indicates that our model tends to correlate activity level with the probability of the user being subscribed, as we often correctly predict high-activity users that are subscribed and low-activity users that are non-subscribed. Considering the features we employed, as described in Section 3.1, this matches the expected capabilities and bias of the model.

We also observed how the model performs with known and unknown channels and users. While the model performed better with channels from the test set that were already seen in the sampled training set, with an F1-score of 0.1964, it performed worse with seen users, having an F1-score of 0.0962. It is worth noting that, as known users and channels make up a small portion of the dataset (1,474 known channels and 2,255 known users, out of the 90,000 test instances), inferences based on this data can be unreliable.

In order to better visualize the different instances in which the StinkyCheese model was or was not able to correctly predict the subscription status, we present a few examples of messages from user-channel pairs. From these examples, we can see that the task of identifying whether a user is subscribed based on the chat is a complex one. Subscribed users can have interactions in chat as short as those of non-subscribed users, while non-subscribed users can have conversations that seem as intimate as those of a dedicated subscriber.

True Positive (subscribed, predicted as such):
– "so —you getting a new monitor"
– "Your welcome. I finally get to catch your stream —Ok. Haven't played d2 in 3 weeks. Content wasn't keeping interested —I hope season is good. This season was lackluster"

False Positive (non-subscribed, predicted subscribed):
– "Hey —Dude —Wassaup —Wanna play"
– "bots —cya and happy new year —the beer was perfect"

True Negative (non-subscribed, predicted as such):
– "just stopping by for a minute to whish max and everyone a happy new year"
– "!giveaway —What do u play on —Pc nvm"

False Negative (subscribed, predicted non-subscribed):
– "whiskey time —what?? —good night :) —ye ? —always"
– "!delcom !giveaway"

6 Discussion and Conclusions

This work seeks to identify the subscription status of a user in a Twitch channel with word vectors fed into an LSTM and then concatenated with conversational features. Using chat messages for machine learning tasks is a complex endeavor. Not only is the textual content of the messages an important factor, but the structure of the conversation and the connections between agents also factor into the picture. Studies in the field have several possible angles to look into, and there is still much to improve.

Analyzing our results, we observe that the predictions of our classifier are tied to user activity: it tends to predict as subscribed a user that has high activity in the channel, and to predict as non-subscribed a user that has low activity in the channel. To make the model better at dealing with other cases, increasing the complexity of the later steps of the network could help it extract more complex relations between features and make more accurate predictions.

We also have other limitations, such as the fact that we use features that heavily summarize complex information regarding the structure of the conversation and the user-channel relationships. A more complex network structure could be used with the extra metadata provided in the dataset, for example, using the timestamp sequence as a whole instead of simply taking summary measures over it as features.

As future work, we also intend to train our model again using pre-trained word embeddings created from Twitch chat, as proposed by [7]. As we said, Twitch chat tends to have highly irregular, noisy and informal text, so embeddings built on formal texts would likely not perform well; it is important to use pre-trained embeddings built on the same kind of language.

In addition, some features have a much larger range of values than others, such as those based on timestamps in comparison to those based on message length. This can disrupt the classifier and affect the results. To avoid this problem, the feature values could be normalized, potentially improving the learning process, as sketched below.

One final point that could greatly improve the quality of the model would be a heavier training process. This includes both the use of the entire training dataset and the improvement of the hyperparameter search.
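As one example of the normalization suggested above, the following sketch standardizes the features with scikit-learn's StandardScaler. This is one possible choice among several (min-max scaling would also work) and is not part of the submitted model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
train_feats = np.array([[3, 120.0], [40, 86400.0], [7, 600.0]])  # toy values
train_scaled = scaler.fit_transform(train_feats)  # zero mean, unit variance
# dev_scaled = scaler.transform(dev_feats)        # reuse the fitted statistics
```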
References

1. Bengio, Y.: Practical recommendations for gradient-based training of deep architectures. CoRR abs/1206.5533 (2012), http://arxiv.org/abs/1206.5533
2. Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805 (2018), http://arxiv.org/abs/1810.04805
3. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016), http://www.deeplearningbook.org
4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735
5. Howard, J., Ruder, S.: Fine-tuned language models for text classification. CoRR abs/1801.06146 (2018), http://arxiv.org/abs/1801.06146
6. Kobs, K., Potthast, M., Wiegmann, M., Zehe, A., Stein, B., Hotho, A.: Towards predicting the subscription status of twitch.tv users — ECML-PKDD ChAT Discovery Challenge 2020. In: Proceedings of the ECML-PKDD 2020 ChAT Discovery Challenge (2020)
7. Kobs, K., Zehe, A., Bernstetter, A., Chibane, J., Pfister, J., Tritscher, J., Hotho, A.: Emote-controlled: Obtaining implicit viewer feedback through emote-based sentiment analysis on comments of popular twitch.tv channels. Trans. Soc. Comput. 3(2) (Apr 2020). https://doi.org/10.1145/3365523
8. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space (2013)
9. Potthast, M., Gollub, T., Wiegmann, M., Stein, B.: TIRA Integrated Research Architecture. In: Ferro, N., Peters, C. (eds.) Information Retrieval Evaluation in a Changing World. The Information Retrieval Series, Springer (Sep 2019). https://doi.org/10.1007/978-3-030-22948-1_5
10. Schaul, T., Antonoglou, I., Silver, D.: Unit tests for stochastic optimization (2013)
11. TWITCH: Twitch partner program (2020), https://www.twitch.tv/p/partners/, last accessed 23 June 2020
12. Yamada, I., Asai, A., Sakuma, J., Shindo, H., Takeda, H., Takefuji, Y., Matsumoto, Y.: Wikipedia2Vec: An efficient toolkit for learning and visualizing the embeddings of words and entities from Wikipedia. arXiv preprint arXiv:1812.06280 (2020)
13. Zhang, L., Wang, S., Liu, B.: Deep learning for sentiment analysis: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8(4), e1253 (2018)