=Paper=
{{Paper
|id=Vol-2473/paper12
|storemode=property
|title=Learning on a Stream of Features with Random Forest
|pdfUrl=https://ceur-ws.org/Vol-2473/paper12.pdf
|volume=Vol-2473
|authors=Jan Motl,Pavel Kordik
|dblpUrl=https://dblp.org/rec/conf/itat/MotlK19
}}
==Learning on a Stream of Features with Random Forest==
Jan Motl, Pavel Kordík

Czech Technical University in Prague, Thákurova 9, 160 00 Praha 6, Czech Republic, jan.motl@fit.cvut.cz, pavel.kordik@fit.cvut.cz

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

'''Abstract:''' We study an interesting and challenging problem, supervised learning on a stream of features, in which the size of the feature set is unknown and not all features are available for learning, while the number of observations stays constant. In this problem, the features arrive one at a time, and the learner's task is to train a model equivalent to a model trained from scratch. When a new feature is inserted into the training set, a new set of trees is trained and added to the current forest. However, it is desirable to correct the selection bias: older features have more opportunities to get selected into trees than the new features. We combat the selection bias by adjusting the feature selection distribution. While this correction improves the accuracy of the random forest, it may require training many new trees. To keep the count of the new trees small, we furthermore put more weight on the more recent trees than on the old trees.

'''Keywords:''' random forest, incremental learning, online learning, sequential learning, stream learning

'''Problem formulation''' One of the common issues in machine learning is changing data and the need to keep machine learning models up to date with the changing data. A common and successful simplification is to assume that over time we receive new samples. This article, however, is concerned with the orthogonal problem: fast updates of models when new features arrive (see Figure 1).

Figure 1: The difference between learning from a stream of samples (top) and a stream of features (bottom). In both cases, we have n samples and d features at time t. But at time t + 1, we either have one more sample (top) or one more feature (bottom).

'''Motivation''' Our original need for learning on a stream of features arose from our interest in propositionalization [3]. Propositionalization is a data preprocessing step that converts relational data into a single table. One of the persistent problems of propositionalization is that it generates a vast quantity of redundant and/or unpredictive features (e.g., [3, 2]). Would it not be interesting to intelligently guide the propositionalization in order to avoid the wasteful generation of these irrelevant features? Our previous research [4] answered this question positively: based on univariate feature selection on a stream of features, we obtained a 10-fold acceleration of the propositionalization (while maintaining the accuracy of the downstream model comparable to the accuracy obtained on exhaustive propositionalization). However, our former research had an evident weakness: it neglected to take into account possible interactions between features. This paper attempts to address this weakness.

'''Why not a feature selection filter?''' Features that are currently unpredictive may become predictive as new features appear. For example, consider the XOR problem, in which the binary label is determined by two binary features f1 and f2: y = xor(f1, f2). Features f1 and f2 are individually unpredictive, but together they define the label. Univariate feature selection filters (e.g., those based on information gain ratio) cannot correctly identify the change in the first feature's relevance as the second feature is added in the XOR problem. But models capable of modeling feature interactions (like random forests) can eventually identify these features as important.
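To make this concrete, here is a minimal sketch (ours, not from the paper) contrasting a univariate filter with a random forest on XOR data; the dataset size and the printed values are illustrative:

<pre>
# Minimal XOR illustration: univariate mutual information scores each feature
# as (nearly) useless, while a random forest, which can model feature
# interactions, classifies XOR perfectly.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))            # two binary features f1, f2
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)  # y = xor(f1, f2)

# Univariate view: each feature alone carries almost no information about y.
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))
# -> approximately [0.0, 0.0]

# Joint view: the forest recovers the interaction.
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
print(forest.score(X, y))  # -> 1.0
</pre>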
'''Application''' Besides propositionalization, learning on a stream of features has another interesting use case: Kaggle competitions. In these challenges, competitors are given a dataset and the team with the best model wins (see the list of challenges at https://www.kaggle.com/competitions). Based on the analysis of solutions of past winners (http://blog.kaggle.com/category/winners-interviews/), one of the common differentiating factors is extensive feature engineering. However, competitive feature engineering is generally not a one-time task but rather an iterative process:

1. formulate a hypothesis (e.g., a log transformation of the features will improve the accuracy of the downstream model),

2. test the hypothesis (e.g., evaluate the change in the accuracy of the downstream model),

where the choice of the next round of hypotheses is influenced by the success of the previously evaluated hypotheses. Traditionally, the evaluation of a hypothesis required retraining the model from scratch. Our solution is to update the current model. The benefit is evident: updating the current model takes less time than retraining it from scratch, and that gives us the freedom to test more hypotheses.

'''Random forest''' We take random forest [1] as the starting model to expand into an online implementation because it can deal with dirty data (e.g., missing values, outliers, a mix of numerical and nominal attributes) and, given an implementation of a decision tree, it is easy to implement and reason about.

The key idea behind the random forest classifier is to build an ensemble of decision trees. In order to create diversity between the trees, it employs two strategies: bagging and random feature selection. Bagging is based on random sampling of training instances with replacement, while random feature selection samples features without replacement. The count of features to select is one of the most tunable parameters of random forests [5], and multiple heuristics for the optimal value have been proposed in the literature. For simplicity of the following analysis, we assume that the count of selected features is a fixed ratio of the count of all features. We call this ratio mtry.

===1 Implementation===

'''Bias''' Whenever a new feature x_new arrives, we may train n new trees and add the newly trained trees to the current random forest. Unfortunately, with this approach, the new features would be underrepresented in the forest in comparison to the old features, simply because the old features had multiple opportunities to get used in a tree while the new feature had only one. Consequently, earlier features would have a bigger impact (weight) in the forest than the newer features. This presents a bias, which is generally undesirable.

'''Variable count of trees''' The first intuitive improvement is to make sure that the new feature is always passed to the new trees (instead of leaving it to chance). And instead of generating an arbitrary count of trees, we can calculate the optimal count n that minimizes the random feature selection bias.

First, we introduce the notation. Let c be the count of how many times a feature x was passed to decision trees. And let the subscript old describe some old feature and the subscript new describe the new feature. If we want to avoid the random feature selection bias, the following should hold:

c_new = c_old. (1)

Since

c_new = n (2)

because the new feature is always selected, and

c_old = mtry · d_old + mtry · n, (3)

where d_old is the count of the old features, we get:

n = mtry · d_old + mtry · n. (4)

Hence, we get the optimal n with:

n = (mtry · d_old) / (1 − mtry). (5)
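As a quick check of equation (5) (our worked example; the input values and the helper name are illustrative): with mtry = 2/3 and d_old = 10 old features, n = (2/3 · 10) / (1 − 2/3) = 20 new trees.

<pre>
# Optimal count of new trees per equation (5); the inputs are illustrative.
def optimal_tree_count(mtry: float, d_old: int) -> float:
    return mtry * d_old / (1 - mtry)

print(optimal_tree_count(2 / 3, 10))  # -> 20.0 new trees for 10 old features
</pre>

Note that n grows linearly with d_old, which leads to the issue discussed next.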
The issue with this approach is that if we keep adding d features one by one, the total count of trees in the ensemble grows quadratically.

'''Tree weighting''' If we want to avoid the quadratic growth of the random forest, we may weight the late trees more than the former trees. While we could have calculated the tree weight analytically, we provide an algorithmic solution in Algorithm 1. In practice, the advantage of the algorithmic solution is that it is self-correcting: if some of the assumptions are not fully fulfilled (e.g., when we have 11 features and the feature selection ratio is 0.5, we can either select 5 or 6 features, but not 5.5), the error is not ignored (as it would be in a closed-form analytical solution) but is encoded in weightedFeatureUseCnt, and each call of Algorithm 1 directly minimizes the error.

Algorithm 1: Random forest update when a new feature arrives. Function featureCnt() returns the count of features to sample.
Input: X: training data, y: training labels, col: index of the new feature, treeCnt: count of trees to train, weightedFeatureUseCnt: bookkeeping vector initialized to zeros, ensemble: collection of trees.
Output: ensemble, treeWeight, weightedFeatureUseCnt.

<pre>
featureUseCnt = zeros(col)
for i = 1:treeCnt do
    oldFeatures = choice(1:col-1, featureCnt(col-1), replacement=False)
    features = [oldFeatures, col]
    samples = choice(nrow(X), nrow(X), replacement=True)
    tree = fitTree(X[samples, features], y[samples])
    ensemble = [ensemble, tree]
    featureUseCnt[features]++
end
treeWeight = avg(weightedFeatureUseCnt[1:col-1]) / (featureUseCnt[col] - avg(featureUseCnt[1:col-1]))
weightedFeatureUseCnt = weightedFeatureUseCnt + treeWeight * featureUseCnt
</pre>

'''Meta-parameters''' At each generation (i.e., addition of a new feature), we train 30 new trees. This value is recommended by Breiman [1], and we decided to go with it. For the feature selection ratio mtry, we used 2/3.

When scoring new samples, we evaluate the trees in the ensemble and calculate the weighted average of their predictions (each generation of trees shares the same treeWeight).
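For concreteness, here is a runnable Python sketch of the update and scoring (our paraphrase of Algorithm 1, not the authors' code). It assumes scikit-learn decision trees, 0-based feature indices, featureCnt(d) = floor(mtry · d), and a weight of 1.0 for the very first generation, where there are no old features to balance against:

<pre>
# A sketch of Algorithm 1 with scikit-learn trees (our paraphrase).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MTRY = 2 / 3  # feature selection ratio used in the paper

def feature_cnt(d):
    # Count of old features to sample per tree (assumption: floor(mtry * d)).
    return int(MTRY * d)

def add_feature(X, y, col, tree_cnt, weighted_use_cnt, ensemble, rng):
    """Grow the forest when the feature with index `col` arrives.

    X holds the col+1 features seen so far; weighted_use_cnt is the
    bookkeeping vector with one entry per old feature. Returns the updated
    (ensemble, tree_weight, weighted_use_cnt).
    """
    weighted_use_cnt = np.append(weighted_use_cnt, 0.0)  # slot for new feature
    use_cnt = np.zeros(col + 1)
    new_trees = []
    for _ in range(tree_cnt):
        old = (rng.choice(col, size=feature_cnt(col), replace=False)
               if col > 0 else np.array([], dtype=int))
        features = np.append(old, col).astype(int)  # new feature always chosen
        rows = rng.choice(len(X), size=len(X), replace=True)  # bagging
        tree = DecisionTreeClassifier().fit(X[rows][:, features], y[rows])
        new_trees.append((tree, features))
        use_cnt[features] += 1
    if col == 0:
        tree_weight = 1.0  # first generation: no old features to balance
    else:
        tree_weight = weighted_use_cnt[:col].mean() / (
            use_cnt[col] - use_cnt[:col].mean())
    ensemble += [(t, f, tree_weight) for t, f in new_trees]
    weighted_use_cnt += tree_weight * use_cnt
    return ensemble, tree_weight, weighted_use_cnt

def predict(ensemble, X):
    # Weighted average of the trees' 0/1 predictions, as described above.
    num = sum(w * t.predict(X[:, f]) for t, f, w in ensemble)
    return num / sum(w for _, _, w in ensemble)
</pre>

A driver simply starts with an empty ensemble and an empty bookkeeping vector and calls add_feature once per arriving feature; the evaluation sketch in Section 2 below shows such a loop.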
===2 Experiments===

We compare two online random forest implementations: baseline and challenger. In the baseline, features are selected with uniform probability (as in an ordinary random forest). In the challenger model, the new feature is always selected, while the old features are selected with uniform probability (this probability is smaller in the challenger model than in the baseline model in order to keep the final count of features in the challenger's trees identical to the count of features in the baseline's trees). Furthermore, we train an offline random forest with the same meta-parameters as the online random forests in order to depict the value of the online learning. In the case of the offline random forest, we train the random forest just once on all the features.

'''Protocol''' For each data set, we performed the following procedure 10 times: We randomly split the data set into training/testing subsets with stratified sampling in a 2:1 ratio. Then we randomly permute the feature order in the data set (because our proposal should work regardless of the feature ordering). Finally, on the online random forests, we perform incremental learning feature by feature (i.e., first we train the random forest on the first feature, then we add the second feature into the forest, and so on, until the last feature is added into the forest). After adding the last feature, the final model is evaluated on the testing set with AUC (Area Under the Receiver Operating Characteristic curve).

'''Data sets''' We used all 232 data sets (see Appendix A) from OpenML [6] that have a binary label (because we evaluate the models with AUC), less than 200000 samples (because of runtime) and less than 15 features (again, because of runtime).

'''Results''' In 87% of the data sets (201/232), the challenger model had a higher average testing AUC than the baseline. A sign test on this statistic gives a one-tail P-value < 10^−29. The average difference in the testing AUC across all the data sets was 2.10 percentage points. Furthermore, in 71% of the data sets (164/232), the challenger model had a higher average testing AUC than the offline model (P-value < 10^−8). The table with the results and the code that generated the table are available from https://github.com/janmotl/rf.
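For reference, the protocol above can be sketched as follows (our paraphrase, assuming the add_feature and predict helpers from the sketch in Section 1; the authors' actual code is at https://github.com/janmotl/rf):

<pre>
# Evaluation protocol sketch: 10 repetitions of a stratified 2:1 split,
# a random feature permutation, incremental learning, and AUC on the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def evaluate(X, y, repeats=10, tree_cnt=30):
    aucs = []
    for r in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1 / 3, stratify=y, random_state=r)  # 2:1 split
        rng = np.random.default_rng(r)
        perm = rng.permutation(X.shape[1])  # random feature order
        X_tr, X_te = X_tr[:, perm], X_te[:, perm]
        ensemble, wfu = [], np.empty(0)
        for col in range(X_tr.shape[1]):  # feature-by-feature learning
            ensemble, _, wfu = add_feature(
                X_tr[:, :col + 1], y_tr, col, tree_cnt, wfu, ensemble, rng)
        aucs.append(roc_auc_score(y_te, predict(ensemble, X_te)))
    return float(np.mean(aucs))
</pre>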
===3 Discussion===

'''Overhead''' The challenger model, in comparison to the baseline model, uses three more variables: featureUseCnt, weightedFeatureUseCnt and treeWeight. Each of these variables is (or fills in) a vector of length d, the count of features. Ignoring the differences in the data types, the total memory overhead is equivalent to three more training data samples. The computational complexity of updating these three variables when a new feature is added is O(d), since treeCnt is a constant.

'''Limitation''' Our experiment suffers from one limitation: while we make sure that the feature selection rate is uniform, we ignore interactions between the features. This could be a topic of further research.

'''Extension''' One possible extension of our work, which we did not pursue further, is pruning of the oldest trees from the ensemble. The idea is simple: the older generations of trees have so small a weight that they hardly influence the final prediction.

===4 Conclusion===

We have extended random forest to work on a stream of features. The idea was simple: when a new feature arrives, extend the forest with a new set of trees. However, with this strategy, older features end up used more frequently than the new features. When we fix this feature selection bias, the testing AUC improves on average by 2 percentage points. The proposed algorithm for feature selection bias correction is fast, easy to implement and robust. The code is open-sourced at https://github.com/janmotl/rf.

===5 Acknowledgments===

We would like to thank the anonymous reviewers; their comments helped to improve this paper. This research was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS17/210/OHK3/3T/18.

===References===

[1] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.

[2] Valentin Kassarnig and Franz Wotawa. Evolutionary propositionalization of multi-relational data. Proc. 30th Int. Conf. Softw. Eng. Knowl. Eng., 2018:629–690, 2018.

[3] Mark-André Krogel. On Propositionalization for Knowledge Discovery in Relational Databases. PhD thesis, Otto-von-Guericke-Universität Magdeburg, 2005.

[4] Jan Motl and Pavel Kordík. Do we need to observe features to perform feature selection? CEUR Workshop Proc., 2203:44–51, 2018.

[5] Philipp Probst, Marvin N. Wright, and Anne-Laure Boulesteix. Hyperparameters and tuning strategies for random forest, 2019.

[6] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl., 15(2):49–60, 2014.

===A Used datasets===

2dplanes, abalone, acute-inflammations, aids, Amazon_employee_access, analcatdata_apnea1, analcatdata_apnea2, analcatdata_apnea3, analcatdata_asbestos, analcatdata_bankruptcy, analcatdata_birthday, analcatdata_bondrate, analcatdata_boxing1, analcatdata_boxing2, analcatdata_broadway, analcatdata_broadwaymult, analcatdata_challenger, analcatdata_chlamydia, analcatdata_creditscore, analcatdata_cyyoung8092, analcatdata_cyyoung9302, analcatdata_dmft, analcatdata_draft, analcatdata_fraud, analcatdata_germangss, analcatdata_gsssexsurvey, analcatdata_gviolence, analcatdata_japansolvent, analcatdata_lawsuit, analcatdata_michiganacc, analcatdata_neavote, analcatdata_negotiation, analcatdata_olympic2000, analcatdata_reviewer, analcatdata_runshoes, arsenic-female-bladder, arsenic-female-lung, arsenic-male-bladder, arsenic-male-lung, autoMpg, badges2, balance-scale, balloon, banana, bank8FM, banknote-authentication, baskball, biomed, blogger, blood-transfusion, BNG(breast-w), BNG(tic-tac-toe), bolts, boston, braziltourism, breast-cancer, breast-cancer-dropped, breast-w, breastTumor, bridges, car, cars, chatfield_4, cholesterol, chscase_adopt, chscase_census2, chscase_census3, chscase_census4, chscase_census5, chscase_census6, chscase_funds, chscase_geyser1, chscase_health, chscase_vine1, chscase_vine2, chscase_whale, cleve, cleveland, Click_prediction_small, cloud, cm1_req, cmc, datatrieve, delta_ailerons, delta_elevators, diabetes, diabetes_numeric, diggle_table_a1, diggle_table_a2, disclosure_x_bias, disclosure_x_noise, disclosure_x_tampered, disclosure_z, dresses-sales, echoMonths, ecoli, electricity, elusage, fertility, fishcatch, fri_c0_100_10, fri_c0_100_5, fri_c0_1000_10, fri_c0_1000_5, fri_c0_250_10, fri_c0_250_5, fri_c0_500_10, fri_c0_500_5, fri_c1_100_10, fri_c1_100_5, fri_c1_1000_10, fri_c1_1000_5, fri_c1_250_10, fri_c1_250_5, fri_c1_500_10, fri_c1_500_5, fri_c2_100_10, fri_c2_100_5, fri_c2_1000_10, fri_c2_1000_5, fri_c2_250_10, fri_c2_250_5, fri_c2_500_10, fri_c2_500_5, fri_c3_100_10, fri_c3_100_5, fri_c3_1000_10, fri_c3_1000_5, fri_c3_250_10, fri_c4_250_10, fri_c4_500_10, fried, fruitfly, glass, grub-damage, haberman, hayes-roth, heart-c, heart-h, heart-statlog, hip, house_8L, houses, humandevel, hungarian, hutsof99_logis, ilpd, iris, irish, jEdit_4.0_4.2, jEdit_4.2_4.3, kdd_el_nino-small, kidney, kin8nm, lowbwt, lupus, machine_cpu, MagicTelescope, mammography, mbagrade, mfeat-morphological, mofn-3-7-10, monks-problems-1, monks-problems-2, monks-problems-3, mozilla4, mu284, mux6, mv, newton_hema, no2, nursery, page-blocks, parity5, parity5_plus_5, pc1_req, pollen, postoperative-patient-data, prnn_crabs, prnn_fglass, prnn_synth, profb, puma8NH, pwLinear, quake, qualitative-bankruptcy, rabe_131, rabe_148, rabe_166, rabe_176, rabe_265, rabe_266, rabe_97, rmftsa_ctoarrivals, rmftsa_ladata, rmftsa_sleepdata, Run_or_walk_information, sa-heart, schlvote, sensory, servo, sleep, sleuth_case1102, sleuth_case1201, sleuth_case1202, sleuth_case2002, sleuth_ex1221, sleuth_ex1605, sleuth_ex1714, sleuth_ex2015, sleuth_ex2016, socmob, solar-flare, space_ga, stock, strikes, tae, threeOf9, tic-tac-toe, Titanic, transplant, vertebra-column, veteran, visualizing_hamster, visualizing_livestock, visualizing_slope, visualizing_soil, vowel, wholesale-customers, wilt, wine, witmer_census_1980

Table 1: List of used data sets.