    Team voyTECH: User Activity Modeling with
                Boosting Trees

                     Immanuel Bayer^1 and Anastasios Zouzias^2

                  ^1 Palaimon GmbH, Berlin, Germany
                     immanuel.bayer@palaimon.io
                     https://palaimon.io
                  ^2 Huawei Technologies, Zurich Research Center, Switzerland
                     anastasios.zouzias@huawei.com



        Abstract. This paper describes our winning solution for the ECML-
        PKDD ChAT Discovery Challenge 2020. We show that whether or not
        a Twitch user has subscribed to a channel can be well predicted by
        modeling user activity with boosting trees. We introduce the connection
        between target-encodings and boosting trees in the context of high cardi-
        nality categoricals and find that modeling user activity is more powerful
        than direct modeling of content when encoded properly and combined
        with a gradient boosting optimization approach.

        Keywords: competition · boosting · high cardinality categoricals · target-
        encodings



1     The Competition

The task of the ECML-PKDD ChAT Discovery Challenge 2020 [11] is to predict
whether or not a Twitch user has subscribed to a channel (binary classification
task) given the list of messages the user has posted on this and other channels.
    The dataset consists of more than 400 million public Twitch comments taken
from English-language channels published during January 2020, along with
metadata. The training data consists of over 29 million channel-user combinations
and the test dataset of 90,000 channel-user combinations, together with their
comments. In more detail, each input training sample consists of the channel-id,
the user-id, and a list of time-stamped comments from this user, including the
specific game played in this channel at this particular time.

    Copyright © 2020 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0).

1.1     Competition Challenges

The ChAT competition presents two peculiarities. The first challenge is that
half of the users in the test set have no history, which requires special
attention when extracting user and channel features. This challenge draws
similarities with the cold start problem in recommendation systems [12]. The
second challenge is related to the sampling distribution of the test set (leader-
board). More precisely, the entire spectrum of user/channel activity levels (low,
normal, high) is weighted equally across all groups, which is vastly different from
the frequencies of those groups in the training set (see Table 1). Namely, for each
of the 9 combinations of user activity levels (low, normal, high) and channel
activity levels (low, normal, high), 10,000 channel-user pairs are sampled uniformly
(e.g., one channel-user activity group contains the pairs where the user has low
activity and the channel has normal activity). Hence, in total 90,000 test samples
are generated.
In Table 1, we outline statistics of the dataset within the activity groups.


Table 1. Statistics of the training dataset per channel-user activity group. ‘u low-
c normal‘ corresponds to the low user and normal channel activity group.

             group          users   channels     pairs   subscribed   % pairs   % subscribed
         u low-c normal      141K       40K       144K       5,696     0.0049       0.0397
           u low-c low        10K       7.7K       10K         611     0.0003       0.0595
       u normal-c normal     480K       67K       562K      35,021     0.0190       0.0622
          u low-c high     2,153K       34K     2,359K        181K     0.0798       0.0768
         u normal-c low       46K       23K        47K       3,651     0.0016       0.0770
        u normal-c high    3,508K       36K     8,740K        683K     0.2958       0.0782
          u high-c high    1,911K       36K    16,045K      1,314K     0.5432       0.0819
          u high-c low        77K       31K        99K       8,498     0.0033       0.0858
        u high-c normal      663K       73K     1,531K        135K     0.0518       0.0886




1.2     Contributions

The main contributions of this paper are:

    – The detailed presentation of the winning solution of the Discovery Challenge
      2020 including training and evaluation setup.
    – Introduction of the connection between target encodings and boosting trees
      in the context of high cardinality categoricals.
    – Additional experiments on the competition dataset that examine the critical
      modeling decisions of our solution.


1.3     User Activity Modeling
Our approach is based on the assumption that modeling user activity is more
important than specific content (e.g. message text). User activity is modeled as
interactions between the user and key objects of the system she interacts with
(e.g. channels and games).

Fig. 1. Interactions between user, channel, and game: a user subscribes to a channel,
a user plays or watches a game, and a channel hosts a game.

    This approach naturally leads to a high dimensional categorical feature/variable
representation that has been well studied in the context of recommender systems,
click-through-rate prediction and similar industrial applications. It is also closely
related to the concept of graph-based relational features [1].
    Our experimental results (Section 3) indicate that the interactions illustrated
in Figure 1, together with features describing the quantity of user activity (e.g.
days active, number of frequently used channels), have strong predictive power.
Introducing the game-id as a high level object is especially important for the 50%
of test set users without history (cold-start). For a cold-start user, their most
frequent game-id can effectively proxy their user-id (more details in Section 3.1).
Before presenting details of our solution (Section 3.3), we first introduce the
concept of target encoding to motivate our choice to combine high dimensional
feature representations with boosting tree models.


2     High Cardinality Categoricals and Boosting Trees

In this section, we discuss in detail the interaction between (high cardinality)
categoricals, mean target encodings and boosting trees, which is at the core of
our winning solution in the form of the popular CatBoost library that we used to
implement our models [4].
    Several user and channel categorical features are present in the dataset, such
as which game has been played and activity levels for users, games, and channels.
By computing interaction features between user and channel representations,
several categoricals with high cardinality are extracted. Due to their sparsity,
such high cardinality categoricals pose several challenges in modeling and, in
general, can lead to poor generalization performance. A popular class of models
for handling such semi-structured datasets containing high cardinality categoricals
are Gradient Boosting Trees [8], and in particular the CatBoost library. The
winning solution is based on a single CatBoost model. Model ensembles would
likely further improve our results, but were skipped due to time restrictions.


2.1   Categorical Encodings in Models

The handling of categorical features usually happens during the feature engi-
neering phase, where the modeler has the freedom to arbitrarily transform or
extract the input features before those are fed into a model. However, models
exist that can handle categorical features automatically, i.e., the modeler simply

specifies the features that should be handled as categoricals without any fur-
ther pre-processing required. The user of such models can only adjust the
predefined categorical encoding process through hyper-parameters. For example,
hyper-parameters for categorical features include ‘perform one-hot-encoding if the
cardinality of a categorical is less than a threshold‘ or ‘perform hash encoding
with a specified number of hashing dimensions‘, to name a few.
    Here, we summarize a few recently proposed models that handle categorical
features as part of the model definition. Two gradient boosting tree implemen-
tations, Microsoft’s LightGBM [10] and Yandex’s CatBoost [4], allow the user
to specify which features should be handled as categoricals by the models. The
h2o.ai implementation of Random Forests handles categoricals out of the box.
Neural networks can provide an embedding layer to handle categoricals as an
additional layer of a neural network, see Keras embedding layer or the so-called
‘entity embeddings‘ [7]. LightGBM splits a categorical feature by partitioning
its categories into 2 subsets. If the categorical feature has k levels, there are
2^(k-1) - 1 possible partitions. However, there is an efficient O(k log k) time
solution for regression trees [5]. The basic idea is to sort the categories according
to the training objective at each split.
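    As an illustration of this interface, the following minimal Python sketch (with
made-up column names and synthetic data, not the competition schema) passes raw
categorical columns directly to CatBoost and LightGBM and lets each library apply
its own internal encoding:

    # Minimal sketch: declaring categorical columns in CatBoost and LightGBM.
    # Column names and data are illustrative, not the competition schema.
    import pandas as pd
    import lightgbm as lgb
    from catboost import CatBoostClassifier

    df = pd.DataFrame({
        "uid": ["u1", "u2", "u1", "u3"],   # high cardinality in practice
        "cid": ["c1", "c1", "c2", "c2"],
        "t_days": [3, 10, 1, 7],
        "subscribed": [0, 1, 0, 1],
    })
    X, y = df[["uid", "cid", "t_days"]], df["subscribed"]

    # CatBoost applies its ordered target statistics to the declared categoricals.
    cb = CatBoostClassifier(iterations=50, verbose=False)
    cb.fit(X, y, cat_features=["uid", "cid"])

    # LightGBM performs its sorted-category split search on the declared categoricals.
    dtrain = lgb.Dataset(X.astype({"uid": "category", "cid": "category"}),
                         label=y, categorical_feature=["uid", "cid"])
    booster = lgb.train({"objective": "binary", "min_data_in_leaf": 1, "verbose": -1},
                        dtrain, num_boost_round=50)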
    CatBoost is a gradient boosting tree implementation that applies a regularized
mean target encoding on the top-level tree split as a preprocessing step. Such
preprocessing could be considered sub-optimal, at least for the case of trees with
large depth [3]. Although the CatBoost approach might result in sub-optimal
greedy binary splits, CatBoost requires fewer operations per tree split and offers
a very efficient and optimized implementation. The efficiency is based on the
property that regularized mean target encoding values are computed only once,
compared to the optimal greedy approach where the mean target encodings have
to be maintained or recomputed on every tree split.
    In the following section, we provide more background on the fundamentals
of CatBoost and, in particular, its connection to mean target encodings since
mean target encodings are the core design principle behind CatBoost.


2.2     Target Encodings

In this section, we set up the framework of feature extraction from categoricals
that is usually called target encoding by machine learning practitioners.
    We denote m samples with n features by an m × n design matrix X whose
columns are either numerical (in R) or categorical^4. In other words, the j-th
column of X lies in R^m or C_j^m, where C_j denotes the set of levels of the
j-th categorical. Moreover, we denote by X_j the j-th column of X and by [n] the
set {1, 2, . . . , n}. In addition, we denote by y the m-dimensional target vector.
The mean value of y, also referred to as the mean target value, is denoted by µ.
The tuple (X, y) contains all relevant information for a prediction task and we
call such a tuple a design matrix pair or, for simplicity, a design matrix.

^4  Throughout this paper, a feature is an input variable/predictor that is used for
    prediction. A categorical feature (or categorical) is a feature whose domain is a
    fixed set without an explicit ordering. The elements of a categorical are referred
    to as levels. For example, the postal code, favorite color, city or country of a
    specific individual are examples of categoricals.
   We focus on the typical binary classification task, i.e., assuming an input
target vector y ∈ {0, 1}^m. The analysis can be extended directly to the regression
task. Now we are ready to define target encodings.

Definition 1 (Target Encodings). Given (X, y) and an integer j ∈ [n] such that
the j-th column of X is categorical, a target encoding is a function
f_(X_j, y) : C_j → R.

From now on, we write f instead of f_(X_j, y) for notational convenience. It is
important to note that we allow f to depend on the input dataset. Moreover, we say
that f is defined (or fitted) on (X, y) to explicitly specify the input data used
in the definition of f .
    A very common example of a target encoding is the mean target encoding. That
is, assume that the j-th column of X is a categorical containing values/levels in
C_j = {L_1, L_2, . . . , L_k}. The mean target encoding µ_j of the j-th column is
a function with support on C_j defined, for any L ∈ C_j, as

                  µ_j(L) = (1/N) Σ_{i=1}^{m} y_i · 1[X_{i,j} = L]                  (1)

where N equals the number of occurrences of L in the j-th column of X and
1[pred] is the indicator function, i.e., it equals 1 if pred is true and 0
otherwise. In words, the mean target encoding is roughly the mean target value
within each level of the categorical (group).
    In general, any property of the per-group target value distribution can also be
extracted. For example, ML practitioners frequently use the minimum, maximum,
standard deviation, kurtosis, or percentiles of the target values in addition
to the mean value. The main idea is to extract as much statistical information
about the target distribution of the group as possible.
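    As a minimal sketch of the mean target encoding in Equation 1 (assuming a pandas
DataFrame with an illustrative categorical column ‘cid‘ and binary target ‘y‘), the
encoding is a simple group-by mean, and further group statistics can be added in the
same way:

    # Minimal sketch of Equation 1: unregularized mean target encoding.
    # 'cid' and 'y' are illustrative column names.
    import pandas as pd

    df = pd.DataFrame({"cid": ["a", "a", "b", "b", "b", "c"],
                       "y":   [1,   0,   1,   1,   0,   0]})

    mu_j = df.groupby("cid")["y"].mean()    # mu_j(L) for every level L of the categorical
    df["cid_te"] = df["cid"].map(mu_j)      # encoded feature fed to a downstream model

    # Other statistics of the per-level target distribution can be added the same way,
    # e.g. df.groupby("cid")["y"].agg(["min", "max", "std"]).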


Regularization of Target Encodings. By definition, target encodings intro-
duce target leakage and could lead to poor generalization performance, hence,
target encoding regularization must always be used [9]. In this section, we outline
several regularization methods of target encodings.
    Extra caution on regularization should be given in the presence of high car-
dinality categoricals, i.e., categoricals with a large number of distinct levels as
present in this competition. In fact, it is relatively easy to construct an exam-
ple where the naive application of target encodings leads to severe overfitting.
In order to exemplify this behavior, a minimal example is constructed by the
authors of the ‘vtreat‘ package [15]. CatBoost provides an implementation that
handles these issues automatically, however, it is important for the modeler to
better understand the general approaches that we outline next.
6      I. Bayer and A. Zouzias

Smoothing / Empirical Bayes / Shrinkage of Mean Target Encoding. In the
presence of high-cardinality categoricals, it is quite often the case that indi-
vidual categorical levels appear only in a small number of samples. In such a
scenario, the estimate of the mean target encoding does not generalize well due
to the small number of samples used to calculate the statistic. Here, smoothing
or shrinkage can be applied, which has a similar effect to empirical Bayesian
(EB) approaches [6]. Indeed, empirical Bayesian conditional probabilities of a
categorical can be understood as mean target encodings [13].
    In our notation, the EB regularized version of the mean target encoding is
defined as
                      µ_j^EB(L) := λ(N) µ_j(L) + (1 − λ(N)) µ                      (2)
where N equals the number of occurrences of L in the j-th column of X and λ(n)
is a monotonically increasing function of n bounded between 0 and 1. A common
choice of practitioners for λ is λ(n) = 1 / (1 + exp(−(n − l)/σ)), which is an
s-shaped function with a value of 0.5 for n = l and with σ controlling the
steepness [13, Equation 4]. Thus, Equation 2 is a smoothed version of the mean
target encoding.
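    A minimal sketch of the smoothed encoding in Equation 2, using the sigmoid choice
of λ(n) with illustrative values for l and σ and the same illustrative columns as
above:

    # Minimal sketch of Equation 2: smoothed (EB-style) mean target encoding.
    # l and sigma are illustrative smoothing hyper-parameters.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"cid": ["a", "a", "b", "b", "b", "c"],
                       "y":   [1,   0,   1,   1,   0,   0]})

    mu = df["y"].mean()                                    # global mean target value
    stats = df.groupby("cid")["y"].agg(["mean", "count"])  # mu_j(L) and N per level

    l, sigma = 2.0, 1.0
    lam = 1.0 / (1.0 + np.exp(-(stats["count"] - l) / sigma))  # lambda(N), s-shaped in N

    mu_eb = lam * stats["mean"] + (1.0 - lam) * mu         # shrink rare levels towards mu
    df["cid_te_eb"] = df["cid"].map(mu_eb)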

Bootstrapping / rolling mean. Bootstrapping is another approach to regularizing
target encodings. A specific instance of bootstrapping combined with target
encodings is implemented in CatBoost [4].
    CatBoost uses a bootstrapping rolling mean approach to reduce overfitting
while utilizing the whole training dataset for estimating the target encodings. In
a nutshell, CatBoost performs a random permutation of the rows of X and, for the
i-th row of X (with respect to the random permutation), the mean target encoding
is computed using only the rows up to the (i−1)-th row. In addition, CatBoost
averages over several independent random permutations and adds a shrinkage prior
towards the global mean.
    To sum up, CatBoost performs categorical encoding for a level L ∈ C_j as
follows. For the i-th row and a fixed permutation of the rows, CatBoost computes

            µ_j^Cat(L, i) := λ µ_j(L; (X_{1:(i−1), j}, y_{1:(i−1)})) + (1 − λ) µ

where µ_j(L; (X_{1:(i−1), j}, y_{1:(i−1)})) is the mean target encoding of L fitted
on the first i − 1 rows of the permutation, µ is the mean target value and λ is a
smoothing hyper-parameter.
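    For a single random permutation, the ordered (rolling-mean) statistic can be
sketched as below; note that CatBoost averages several permutations and uses its own
prior weighting internally, so this is only an approximation of the idea:

    # Sketch of an ordered (rolling-mean) target statistic for one random permutation.
    # CatBoost averages several permutations and uses its own prior weighting internally.
    import pandas as pd

    df = pd.DataFrame({"cid": ["a", "a", "b", "b", "b", "c"],
                       "y":   [1,   0,   1,   1,   0,   0]})
    mu, lam = df["y"].mean(), 0.8   # global mean and an illustrative smoothing weight

    perm = df.sample(frac=1.0, random_state=0)             # random permutation of the rows
    # Mean of the target over *previous* rows of the same level only (row i never sees y_i).
    prev_mean = (perm.groupby("cid")["y"]
                     .apply(lambda s: s.shift().expanding().mean())
                     .reset_index(level=0, drop=True))
    perm["cid_te_ordered"] = lam * prev_mean.fillna(mu) + (1 - lam) * mu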


3     Winning Model and Additional Experiments
In this section, we describe the winning solution in more detail and present
additional experimental results that better explain the critical choices made to
achieve our result.

3.1   Feature Engineering
We have extracted a range of features from the message log representing the
training data. The following features focus on different aspects of the data
(time/activity, message, game).
               Team voyTECH: User Activity Modeling with Boosting Trees           7

 1. t_min: day of first message
 2. t_max: day of last message
 3. t_days: number of days with messages
 4. t_dur: t_max - t_min
 5. t_active: t_days / t_dur
 6. m_total: total number of characters over all messages
 7. m_max: maximum number of characters per message
 8. m_med: median number of characters per message
 9. g_n_mes: total number of messages
10. g_n: number of games
11. g_top: game with the most messages
12. g_chat: number of messages for the game "Just Chatting"
13. g_top_freq: fraction of messages for g_top


   All features have been computed as user- and channel-features. This can be
done efficiently by aggregating the messages on either the user or the channel id.
Features such as the number of days between the first and last message (t_dur)
are also computed for individual channel-user combinations.
   In addition, the following features have been added:

 1. uid: user id
 2. n_channel: number of channels per user
 3. u_group [low, normal, high]: user activity
 4. cid: channel id
 5. n_user: number of users per channel
 6. c_group [low, normal, high]: channel activity


   We show in Section 3.3 that only a small subset of all features is needed to
achieve competitive performance.
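   As an illustration of how such features can be computed by aggregation, the
sketch below derives a few of the listed user-features from a message log; the log
schema (columns uid, cid, day, game, message) is an assumption for illustration,
not the official field names:

    # Sketch: a few of the listed user-features computed by aggregating a message log.
    # The log columns (uid, cid, day, game, message) are an assumed, illustrative schema.
    import pandas as pd

    log = pd.DataFrame({
        "uid":     ["u1", "u1", "u2", "u2", "u2"],
        "cid":     ["c1", "c1", "c1", "c2", "c2"],
        "day":     [1, 5, 2, 2, 9],
        "game":    ["g1", "g1", "g2", "g1", "g1"],
        "message": ["hi", "gg wp", "lol", "nice", "pog"],
    })

    user_feats = log.groupby("uid").agg(
        t_min=("day", "min"),                        # day of first message
        t_max=("day", "max"),                        # day of last message
        t_days=("day", "nunique"),                   # number of days with messages
        n_channel=("cid", "nunique"),                # number of channels per user
        g_n=("game", "nunique"),                     # number of games
        g_top=("game", lambda s: s.mode().iloc[0]),  # game with the most messages
    )
    user_feats["t_dur"] = user_feats["t_max"] - user_feats["t_min"]
    # Channel-level features follow the same pattern with groupby("cid").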




        (a) Top features (train)                    (b) Top features (test)

Fig. 2. CatBoost’s feature importances show clear differences between the train and our
constructed test data that can be attributed to the group activity shift (especially
pronounced for the uid feature). The top 10 test features have been selected for a
simplified model (Table 3).

3.2   Best Performing Model
Model definition. The model is based on the CatBoost library version 0.23.1.
The loss function is set to logistic loss, also known as cross-entropy loss.
Training of the model is stopped early based on the performance on our custom
validation set using the overfitting-detector capabilities of CatBoost (‘od_type‘
set to "Iter" and ‘od_wait‘ set to 20). The best model is selected automatically
by setting use_best_model=True.
    The top performing submission is a single CatBoost model trained with
the following hyper-parameters: ’l2_leaf_reg’: 64, ’learning_rate’: 0.08,
’threshold’: 0.167, ’depth’: 9, ’random_strength’: 0.5, ’max_ctr_complexity’: 2.
These parameters have been manually selected on our constructed test set.
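    A sketch of this configuration is shown below; data loading and the exact feature
lists are omitted, and the reported threshold of 0.167 is interpreted as the
probability cut-off used to produce the submitted labels:

    # Sketch of the final model configuration; data loading and feature lists are omitted.
    from catboost import CatBoostClassifier

    model = CatBoostClassifier(
        loss_function="Logloss",     # logistic / cross-entropy loss
        l2_leaf_reg=64,
        learning_rate=0.08,
        depth=9,
        random_strength=0.5,
        max_ctr_complexity=2,        # limit the order of categorical feature combinations
        od_type="Iter",              # overfitting detector: stop after od_wait bad iterations
        od_wait=20,
        use_best_model=True,         # requires an eval_set in fit()
    )
    # model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=cat_features)
    # The reported 'threshold': 0.167 is presumably the probability cut-off applied to
    # the predicted subscription probabilities when producing the submitted labels.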

Table 2. Leaderboard results of the competition. Our submission voyTECH ranked
first with a clear lead to the second best approach.

                         Rank     Teamname        Test F1 score
                           1.      voyTECH           0.3433
                           2.    CoolStoryBob        0.2647
                           3.    ItsBoshyTime        0.2593
                           4.    StinkyCheese        0.1422
                           5.   Random Baseline      0.0741




Cross validation Setup. Training and testing data come from vastly different
distributions, hence, a careful cross validation setup was crucial for model training
and hyper-parameter tuning. The competition description states that half of the
users in the test set have no history and that user-channel interactions are
sampled uniformly from the low, normal, and high activity levels. Therefore, our
goal is to construct a validation set with similar properties. The validation set is
constructed as follows: we sample in total 45,000 channel-user pairs (5,000 pairs
per activity level pair group), ensuring that these pairs do not appear in the
training set. These 45,000 channel-user pairs are then duplicated in the validation
set by replacing the user-id with an unknown identifier not present in the training
set. The training dataset is sub-sampled with a ‘max_per_group‘ parameter that
restricts the number of samples per channel-user group.
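    A sketch of this validation-set construction is given below; the helper name and
column names are hypothetical:

    # Sketch of the validation set construction: 5,000 pairs per activity group,
    # each duplicated with an unseen user id to mimic cold-start users.
    # The helper name and column names are hypothetical.
    import pandas as pd

    def build_validation_set(pairs: pd.DataFrame, n_per_group: int = 5000, seed: int = 0):
        # pairs: one row per channel-user pair with columns uid, cid, u_group, c_group, ...
        sampled = (pairs.groupby(["u_group", "c_group"], group_keys=False)
                        .apply(lambda g: g.sample(min(n_per_group, len(g)), random_state=seed)))
        cold = sampled.copy()
        cold["uid"] = "unknown_user"   # identifier not present in the training set
        return pd.concat([sampled, cold], ignore_index=True)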

3.3   Additional Experiments
We find that we can achieve strong performance (Table 3) even with a very small
subset of features (’uid’, ’cid’, ’t_days’, ’g_top’, ’g_top_c’, ’n_user’, ’n_channel’,
’m_med’, ’m_max’, ’t_min’) selected by using CatBoost’s feature importance (Fig-
ure 2). The sub-sampling of the training data decreases the model performance
(Figure 3) and the interaction between features (max_ctr_complexity ≥ 2) is impor-
tant. However, the memory and run-time requirements increase dramatically
when max_ctr_complexity > 2. We therefore had to trade higher-order interactions
against more training samples, which were more critical for performance.




         (a) Leaderboard Model                              (b) Best Features Model

Fig. 3. Activity group specific F1-score. The difference between the full and the reduced
feature model is most pronounced for the highest (h, h) and lowest (l, l) activity groups.

Table 3. Model performance with respect to train data size, features, and hyper-
parameters (random_strength=0.5, threshold=0.167, l2_leaf_reg=64 and depth=9 are
the same for all rows).

        f1 leaderboard   f1 (our test)   train data     features      max_ctr_complexity    lr
            0.3433           0.3522         16m            all                 2            0.15
            0.3382           0.3505          8m            all                 2            0.08
            0.3345           0.3490         16m       top-features             2            0.15
            0.3329           0.3417         full      top-features             1            0.08
            0.3322           0.3421         16m       top-features             1            0.15




3.4   Conclusions & Further Work


In this work we showed that modeling user activity is more powerful than di-
rect modeling of content when encoded properly and combined with a suitable
optimization approach. We also introduced the connection between target en-
codings and boosting trees in the context of high cardinality categoricals and
highlighted differences between the two popular boosting tree implementations,
CatBoost and LightGBM. We plan to conduct further experiments that compare
Boosting Trees to Factorization Machines [14], a model that has been used suc-
cessfully to model user activity in an earlier Discovery Challenge [2].



Acknowledgement

We would like to thank the organizers for the interesting competition and their
support with the TIRA platform. From Palaimon we want to thank Phoebe Kuhn
for her help with the manuscript and Alexander Pisarenko for his relentless work
to provide us with a reliable and scalable infrastructure.

References
 1. Bayer, I., Nagel, U., Rendle, S.: Graph based relational features for collective clas-
    sification. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining.
    pp. 447–458. Springer (2015)
 2. Bayer, I., Rendle, S.: Factor models for recommending given names. ECML PKDD
    Discovery Challenge p. 81 (2013)
 3. Breiman, L., Friedman, J., Stone, C., Olshen, R.: Classification and Regression
    Trees. The Wadsworth and Brooks-Cole statistics-probability series, Taylor & Fran-
    cis (1984)
 4. Dorogush, A.V., Ershov, V., Gulin, A.: Catboost: gradient boosting with categor-
    ical features support. In: Workshop on ML Systems at NIPS 2017 (2017)
 5. Fisher, W.D.: On grouping for maximum homogeneity. Journal of the American
    Statistical Association 53(284), 789–798 (1958).
    https://doi.org/10.1080/01621459.1958.10501479
 6. Gelman, A., Carlin, J.B., Stern, H.S., Rubin, D.B.: Bayesian Data Analysis. Chap-
    man and Hall/CRC, 2nd edn. (2004)
 7. Guo, C., Berkhahn, F.: Entity embeddings of categorical variables. CoRR
    abs/1604.06737 (2017)
 8. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data
    Mining, Inference, and Prediction. Springer series in statistics, Springer (2009)
 9. Kaufman, S., Rosset, S., Perlich, C.: Leakage in data mining: Formulation, de-
    tection, and avoidance. In: Proceedings of the 17th ACM SIGKDD International
    Conference on Knowledge Discovery and Data Mining (KDD). pp. 556–563. ACM
    (2011)
10. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.Y.:
    Lightgbm: A highly efficient gradient boosting decision tree. In: Neural Information
    Processing Systems (NIPS), pp. 3146–3154. Curran Associates, Inc. (2017)
11. Kobs, K., Potthast, M., Wiegmann, M., Zehe, A., Stein, B., Hotho, A.: Towards
    predicting the subscription status of Twitch.tv users — ECML-PKDD ChAT Discov-
    ery Challenge 2020. Proceedings of ECML-PKDD 2020 ChAT Discovery Challenge
    (2020)
12. Lam, X.N., Vu, T., Le, T.D., Duong, A.D.: Addressing cold-start problem in rec-
    ommendation systems. In: Proceedings of the 2nd International Conference on
    Ubiquitous Information Management and Communication. pp. 208–211. ICUIMC
    '08, ACM (2008). https://doi.org/10.1145/1352793.1352837
13. Micci-Barreca, D.: A preprocessing scheme for high-cardinality categorical at-
    tributes in classification and prediction problems. SIGKDD Explor. Newsl. 3(1),
    27–32 (Jul 2001). https://doi.org/10.1145/507533.507538
14. Rendle, S.: Factorization machines. In: 2010 IEEE International Conference on
    Data Mining. pp. 995–1000. IEEE (2010)
15. Zumel, N., Mount, J.: vtreat: a data.frame processor for predictive modeling. CoRR
    abs/1706.09516 (2017)