=Paper=
{{Paper
|id=Vol-2473/paper12
|storemode=property
|title=Learning on a Stream of Features with Random Forest
|pdfUrl=https://ceur-ws.org/Vol-2473/paper12.pdf
|volume=Vol-2473
|authors=Jan Motl,Pavel Kordik
|dblpUrl=https://dblp.org/rec/conf/itat/MotlK19
}}
==Learning on a Stream of Features with Random Forest==
Jan Motl, Pavel Kordík

Czech Technical University in Prague, Thákurova 9, 160 00 Praha 6, Czech Republic, jan.motl@fit.cvut.cz, pavel.kordik@fit.cvut.cz

Copyright ©2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

'''Abstract:''' We study an interesting and challenging problem, supervised learning on a stream of features, in which the size of the feature set is unknown and not all features are available for learning, while the number of observations stays constant. In this problem, the features arrive one at a time, and the learner's task is to train a model equivalent to a model trained from scratch. When a new feature is inserted into the training set, a new set of trees is trained and added to the current forest. However, it is desirable to correct the selection bias: older features have more opportunities to get selected into trees than the new features. We combat the selection bias by adjusting the feature selection distribution. While this correction improves the accuracy of the random forest, it may require training many new trees. To keep the count of the new trees small, we furthermore put more weight on the more recent trees than on the old trees.

'''Keywords:''' random forest, incremental learning, online learning, sequential learning, stream learning

'''Problem formulation''' One of the common issues in machine learning is changing data and the need to keep machine learning models up to date with the changing data. A common and successful simplification is to assume that over time we receive new samples. This article, however, is concerned with the orthogonal problem: fast updates of models when new features arrive (see Figure 1).

Figure 1: The difference between learning from a stream of samples (top) and a stream of features (bottom). In both cases, we have n samples and d features at time t. But at time t + 1, we either have one more sample (top) or one more feature (bottom).

'''Motivation''' Our original need for learning on a stream of features arose from our interest in propositionalization [3]. Propositionalization is a data preprocessing step that converts relational data into a single table. One of the persistent problems of propositionalization is that it generates a vast quantity of redundant and/or unpredictive features (e.g., [3, 2]). Would it not be interesting to intelligently guide the propositionalization in order to avoid the wasteful generation of these irrelevant features? Our previous research [4] answered this question positively: based on univariate feature selection on a stream of features, we obtained a 10-fold acceleration of the propositionalization (while maintaining the accuracy of the downstream model comparable to the accuracy obtained on exhaustive propositionalization). However, our former research had an evident weakness: it neglected to take into account possible interactions between features. This paper attempts to address this weakness.

'''Why not a feature selection filter?''' Features that are currently unpredictive may become predictive as new features appear. For example, consider the XOR problem, in which the binary label is determined by two binary features f1 and f2: y = xor(f1, f2). Features f1 and f2 are individually unpredictive, but together they define the label. Univariate feature selection filters (e.g., those based on information gain ratio) cannot correctly identify the change in the first feature's relevance as the second feature is added in the XOR problem. But models capable of modeling feature interactions (like random forests) can eventually identify these features as important.
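To make this concrete, here is a minimal sketch (ours, not from the paper) contrasting a univariate filter with a random forest on XOR data; the dataset size and the printed values are illustrative:

<pre>
# Minimal XOR illustration: univariate mutual information scores each feature
# as (nearly) useless, while a random forest, which can model feature
# interactions, classifies XOR perfectly.
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 2))            # two binary features f1, f2
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)  # y = xor(f1, f2)

# Univariate view: each feature alone carries almost no information about y.
print(mutual_info_classif(X, y, discrete_features=True, random_state=0))
# -> approximately [0.0, 0.0]

# Joint view: the forest recovers the interaction.
forest = RandomForestClassifier(n_estimators=30, random_state=0).fit(X, y)
print(forest.score(X, y))  # -> 1.0
</pre>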
'''Application''' Besides propositionalization, learning on a stream of features has another interesting use case: Kaggle competitions. In these challenges, competitors are given a dataset and the team with the best model wins (see the list of challenges at https://www.kaggle.com/competitions). Based on the analysis of solutions of past winners (http://blog.kaggle.com/category/winners-interviews/), one of the common differentiating factors is extensive feature engineering. However, competitive feature engineering is generally not a one-time task but rather an iterative process:

1. formulate a hypothesis (e.g., a log transformation of the features will improve the accuracy of the downstream model),

2. test the hypothesis (e.g., evaluate the change in the accuracy of the downstream model),

where the choice of the next round of hypotheses is influenced by the success of the previously evaluated hypotheses. Traditionally, the evaluation of a hypothesis required retraining the model from scratch. Our solution is to update the current model. The benefit is evident: updating the current model takes less time than retraining it from scratch, and that gives us the freedom to test more hypotheses.

'''Random forest''' We take random forest [1] as the starting model to expand into an online implementation because it can deal with dirty data (e.g., missing values, outliers, a mix of numerical and nominal attributes) and, given an implementation of a decision tree, it is easy to implement and reason about.

The key idea behind the random forest classifier is to build an ensemble of decision trees. In order to create diversity between the trees, it employs two strategies: bagging and random feature selection. Bagging is based on random sampling of training instances with replacement, while random feature selection samples features without replacement. The count of features to select is one of the most tunable parameters of random forests [5], and multiple heuristics for the optimal value have been proposed in the literature. For simplicity of the following analysis, we assume that the count of selected features is a fixed ratio of the count of all features. We call this ratio mtry.

===1 Implementation===

'''Bias''' Whenever a new feature x_new arrives, we may train n new trees and add the newly trained trees to the current random forest. Unfortunately, with this approach, the new features would be underrepresented in the forest in comparison to the old features, simply because the old features had multiple opportunities to get used in a tree while the new feature had only one. Consequently, earlier features would have a bigger impact (weight) in the forest than the newer features. This presents a bias, which is generally undesirable.

'''Variable count of trees''' The first intuitive improvement is to make sure that the new feature is always passed to the new trees (instead of leaving it to chance). And instead of generating an arbitrary count of trees, we can calculate the optimal count n that minimizes the random feature selection bias.

First, we introduce the notation. Let c be the count of how many times a feature x was passed to decision trees. And let the subscript old describe some old feature and the subscript new describe the new feature. If we want to avoid the random feature selection bias, the following should hold:

c_new = c_old. (1)

Since

c_new = n (2)

because the new feature is always selected, and

c_old = mtry · d_old + mtry · n, (3)

where d_old is the count of the old features, we get:

n = mtry · d_old + mtry · n. (4)

Hence, we get the optimal n with:

n = (mtry · d_old) / (1 − mtry). (5)
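As a quick check of equation (5) (our worked example; the input values and the helper name are illustrative): with mtry = 2/3 and d_old = 10 old features, n = (2/3 · 10) / (1 − 2/3) = 20 new trees.

<pre>
# Optimal count of new trees per equation (5); the inputs are illustrative.
def optimal_tree_count(mtry: float, d_old: int) -> float:
    return mtry * d_old / (1 - mtry)

print(optimal_tree_count(2 / 3, 10))  # -> 20.0 new trees for 10 old features
</pre>

Note that n grows linearly with d_old, which leads to the issue discussed next.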
The issue with this approach is that if we keep adding d features one by one, the total count of trees in the ensemble grows quadratically.

'''Tree weighting''' If we want to avoid the quadratic growth of the random forest, we may weight the late trees more than the former trees. While we could have calculated the tree weight analytically, we provide an algorithmic solution in Algorithm 1. In practice, the advantage of the algorithmic solution is that it is self-correcting: if some of the assumptions are not fully fulfilled (e.g., when we have 11 features and the feature selection ratio is 0.5, we can either select 5 or 6 features, but not 5.5), the error is not ignored (as it would be in a closed-form analytical solution) but is encoded in weightedFeatureUseCnt, and each call of Algorithm 1 directly minimizes the error.

Algorithm 1: Random forest update when a new feature arrives. Function featureCnt() returns the count of features to sample.
Input: X: training data, y: training labels, col: index of the new feature, treeCnt: count of trees to train, weightedFeatureUseCnt: bookkeeping vector initialized to zeros, ensemble: collection of trees.
Output: ensemble, treeWeight, weightedFeatureUseCnt.

<pre>
featureUseCnt = zeros(col)
for i = 1:treeCnt do
    oldFeatures = choice(1:col-1, featureCnt(col-1), replacement=False)
    features = [oldFeatures, col]
    samples = choice(nrow(X), nrow(X), replacement=True)
    tree = fitTree(X[samples, features], y[samples])
    ensemble = [ensemble, tree]
    featureUseCnt[features]++
end
treeWeight = avg(weightedFeatureUseCnt[1:col-1]) / (featureUseCnt[col] - avg(featureUseCnt[1:col-1]))
weightedFeatureUseCnt = weightedFeatureUseCnt + treeWeight * featureUseCnt
</pre>

'''Meta-parameters''' At each generation (i.e., addition of a new feature), we train 30 new trees. This value is recommended by Breiman [1], and we decided to go with it. For the feature selection ratio mtry, we used 2/3.

When scoring new samples, we evaluate the trees in the ensemble and calculate the weighted average of their predictions (each generation of trees shares the same treeWeight).
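For concreteness, here is a runnable Python sketch of the update and scoring (our paraphrase of Algorithm 1, not the authors' code). It assumes scikit-learn decision trees, 0-based feature indices, featureCnt(d) = floor(mtry · d), and a weight of 1.0 for the very first generation, where there are no old features to balance against:

<pre>
# A sketch of Algorithm 1 with scikit-learn trees (our paraphrase).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

MTRY = 2 / 3  # feature selection ratio used in the paper

def feature_cnt(d):
    # Count of old features to sample per tree (assumption: floor(mtry * d)).
    return int(MTRY * d)

def add_feature(X, y, col, tree_cnt, weighted_use_cnt, ensemble, rng):
    """Grow the forest when the feature with index `col` arrives.

    X holds the col+1 features seen so far; weighted_use_cnt is the
    bookkeeping vector with one entry per old feature. Returns the updated
    (ensemble, tree_weight, weighted_use_cnt).
    """
    weighted_use_cnt = np.append(weighted_use_cnt, 0.0)  # slot for new feature
    use_cnt = np.zeros(col + 1)
    new_trees = []
    for _ in range(tree_cnt):
        old = (rng.choice(col, size=feature_cnt(col), replace=False)
               if col > 0 else np.array([], dtype=int))
        features = np.append(old, col).astype(int)  # new feature always chosen
        rows = rng.choice(len(X), size=len(X), replace=True)  # bagging
        tree = DecisionTreeClassifier().fit(X[rows][:, features], y[rows])
        new_trees.append((tree, features))
        use_cnt[features] += 1
    if col == 0:
        tree_weight = 1.0  # first generation: no old features to balance
    else:
        tree_weight = weighted_use_cnt[:col].mean() / (
            use_cnt[col] - use_cnt[:col].mean())
    ensemble += [(t, f, tree_weight) for t, f in new_trees]
    weighted_use_cnt += tree_weight * use_cnt
    return ensemble, tree_weight, weighted_use_cnt

def predict(ensemble, X):
    # Weighted average of the trees' 0/1 predictions, as described above.
    num = sum(w * t.predict(X[:, f]) for t, f, w in ensemble)
    return num / sum(w for _, _, w in ensemble)
</pre>

A driver simply starts with an empty ensemble and an empty bookkeeping vector and calls add_feature once per arriving feature; the evaluation sketch in Section 2 below shows such a loop.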
===2 Experiments===

We compare two online random forest implementations: baseline and challenger. In the baseline, features are selected with uniform probability (as in an ordinary random forest). In the challenger model, the new feature is always selected, while the old features are selected with uniform probability (this probability is smaller in the challenger model than in the baseline model in order to keep the final count of features in the challenger's trees identical to the count of features in the baseline's trees). Furthermore, we train an offline random forest with the same meta-parameters as the online random forests in order to depict the value of the online learning. In the case of the offline random forest, we train the random forest just once on all the features.

'''Protocol''' For each data set, we performed the following procedure 10 times: We randomly split the data set into training/testing subsets with stratified sampling in a 2:1 ratio. Then we randomly permute the feature order in the data set (because our proposal should work regardless of the feature ordering). Finally, on the online random forests, we perform incremental learning feature by feature (i.e., first we train the random forest on the first feature, then we add the second feature into the forest, and so on, until the last feature is added into the forest). After adding the last feature, the final model is evaluated on the testing set with AUC (Area Under the Receiver Operating Characteristic curve).

'''Data sets''' We used all 232 data sets (see Appendix A) from OpenML [6] that have a binary label (because we evaluate the models with AUC), less than 200000 samples (because of runtime) and less than 15 features (again, because of runtime).

'''Results''' In 87% of the data sets (201/232), the challenger model had a higher average testing AUC than the baseline. A sign test on this statistic gives a one-tail P-value < 10^−29. The average difference in the testing AUC across all the data sets was 2.10 percentage points. Furthermore, in 71% of the data sets (164/232), the challenger model had a higher average testing AUC than the offline model (P-value < 10^−8). The table with the results and the code that generated the table are available from https://github.com/janmotl/rf.
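For reference, the protocol above can be sketched as follows (our paraphrase, assuming the add_feature and predict helpers from the sketch in Section 1; the authors' actual code is at https://github.com/janmotl/rf):

<pre>
# Evaluation protocol sketch: 10 repetitions of a stratified 2:1 split,
# a random feature permutation, incremental learning, and AUC on the test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

def evaluate(X, y, repeats=10, tree_cnt=30):
    aucs = []
    for r in range(repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=1 / 3, stratify=y, random_state=r)  # 2:1 split
        rng = np.random.default_rng(r)
        perm = rng.permutation(X.shape[1])  # random feature order
        X_tr, X_te = X_tr[:, perm], X_te[:, perm]
        ensemble, wfu = [], np.empty(0)
        for col in range(X_tr.shape[1]):  # feature-by-feature learning
            ensemble, _, wfu = add_feature(
                X_tr[:, :col + 1], y_tr, col, tree_cnt, wfu, ensemble, rng)
        aucs.append(roc_auc_score(y_te, predict(ensemble, X_te)))
    return float(np.mean(aucs))
</pre>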
===3 Discussion===

'''Overhead''' The challenger model, in comparison to the baseline model, uses three more variables: featureUseCnt, weightedFeatureUseCnt and treeWeight. Each of these variables is (or fills in) a vector of length d, the count of features. Ignoring the differences in the data types, the total memory overhead is equivalent to three more training data samples. The computational complexity of updating these three variables when a new feature is added is O(d), since treeCnt is a constant.

'''Limitation''' Our experiment suffers from one limitation: while we make sure that the feature selection rate is uniform, we ignore interactions between the features. This could be a topic of further research.

'''Extension''' One possible extension of our work, which we did not pursue further, is pruning of the oldest trees from the ensemble. The idea is simple: the older generations of trees have so small a weight that they hardly influence the final prediction.

===4 Conclusion===

We have extended random forest to work on a stream of features. The idea was simple: when a new feature arrives, extend the forest with a new set of trees. However, with this strategy, older features end up used more frequently than the new features. When we fix this feature selection bias, the testing AUC improves on average by 2 percentage points. The proposed algorithm for feature selection bias correction is fast, easy to implement and robust. The code is open-sourced at https://github.com/janmotl/rf.

===5 Acknowledgments===

We would like to thank the anonymous reviewers; their comments helped to improve this paper. This research was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS17/210/OHK3/3T/18.

===References===

[1] Leo Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.

[2] Valentin Kassarnig and Franz Wotawa. Evolutionary propositionalization of multi-relational data. Proc. 30th Int. Conf. Softw. Eng. Knowl. Eng., 2018:629–690, 2018.

[3] Mark-André Krogel. On Propositionalization for Knowledge Discovery in Relational Databases. PhD thesis, Otto-von-Guericke-Universität Magdeburg, 2005.

[4] Jan Motl and Pavel Kordík. Do we need to observe features to perform feature selection? CEUR Workshop Proc., 2203:44–51, 2018.

[5] Philipp Probst, Marvin N. Wright, and Anne-Laure Boulesteix. Hyperparameters and tuning strategies for random forest, 2019.

[6] Joaquin Vanschoren, Jan N. van Rijn, Bernd Bischl, and Luis Torgo. OpenML: networked science in machine learning. ACM SIGKDD Explor. Newsl., 15(2):49–60, 2014.

===A Used datasets===

2dplanes, abalone, acute-inflammations, aids, Amazon_employee_access, analcatdata_apnea1, analcatdata_apnea2, analcatdata_apnea3, analcatdata_asbestos, analcatdata_bankruptcy, analcatdata_birthday, analcatdata_bondrate, analcatdata_boxing1, analcatdata_boxing2, analcatdata_broadway, analcatdata_broadwaymult, analcatdata_challenger, analcatdata_chlamydia, analcatdata_creditscore, analcatdata_cyyoung8092, analcatdata_cyyoung9302, analcatdata_dmft, analcatdata_draft, analcatdata_fraud, analcatdata_germangss, analcatdata_gsssexsurvey, analcatdata_gviolence, analcatdata_japansolvent, analcatdata_lawsuit, analcatdata_michiganacc, analcatdata_neavote, analcatdata_negotiation, analcatdata_olympic2000, analcatdata_reviewer, analcatdata_runshoes, arsenic-female-bladder, arsenic-female-lung, arsenic-male-bladder, arsenic-male-lung, autoMpg, badges2, balance-scale, balloon, banana, bank8FM, banknote-authentication, baskball, biomed, blogger, blood-transfusion, BNG(breast-w), BNG(tic-tac-toe), bolts, boston, braziltourism, breast-cancer, breast-cancer-dropped, breast-w, breastTumor, bridges, car, cars, chatfield_4, cholesterol, chscase_adopt, chscase_census2, chscase_census3, chscase_census4, chscase_census5, chscase_census6, chscase_funds, chscase_geyser1, chscase_health, chscase_vine1, chscase_vine2, chscase_whale, cleve, cleveland, Click_prediction_small, cloud, cm1_req, cmc, datatrieve, delta_ailerons, delta_elevators, diabetes, diabetes_numeric, diggle_table_a1, diggle_table_a2, disclosure_x_bias, disclosure_x_noise, disclosure_x_tampered, disclosure_z, dresses-sales, echoMonths, ecoli, electricity, elusage, fertility, fishcatch, fri_c0_100_10, fri_c0_100_5, fri_c0_1000_10, fri_c0_1000_5, fri_c0_250_10, fri_c0_250_5, fri_c0_500_10, fri_c0_500_5, fri_c1_100_10, fri_c1_100_5, fri_c1_1000_10, fri_c1_1000_5, fri_c1_250_10, fri_c1_250_5, fri_c1_500_10, fri_c1_500_5, fri_c2_100_10, fri_c2_100_5, fri_c2_1000_10, fri_c2_1000_5, fri_c2_250_10, fri_c2_250_5, fri_c2_500_10, fri_c2_500_5, fri_c3_100_10, fri_c3_100_5, fri_c3_1000_10, fri_c3_1000_5, fri_c3_250_10, fri_c4_250_10, fri_c4_500_10, fried, fruitfly, glass, grub-damage, haberman, hayes-roth, heart-c, heart-h, heart-statlog, hip, house_8L, houses, humandevel, hungarian, hutsof99_logis, ilpd, iris, irish, jEdit_4.0_4.2, jEdit_4.2_4.3, kdd_el_nino-small, kidney, kin8nm, lowbwt, lupus, machine_cpu, MagicTelescope, mammography, mbagrade, mfeat-morphological, mofn-3-7-10, monks-problems-1, monks-problems-2, monks-problems-3, mozilla4, mu284, mux6, mv, newton_hema, no2, nursery, page-blocks, parity5, parity5_plus_5, pc1_req, pollen, postoperative-patient-data, prnn_crabs, prnn_fglass, prnn_synth, profb, puma8NH, pwLinear, quake, qualitative-bankruptcy, rabe_131, rabe_148, rabe_166, rabe_176, rabe_265, rabe_266, rabe_97, rmftsa_ctoarrivals, rmftsa_ladata, rmftsa_sleepdata, Run_or_walk_information, sa-heart, schlvote, sensory, servo, sleep, sleuth_case1102, sleuth_case1201, sleuth_case1202, sleuth_case2002, sleuth_ex1221, sleuth_ex1605, sleuth_ex1714, sleuth_ex2015, sleuth_ex2016, socmob, solar-flare, space_ga, stock, strikes, tae, threeOf9, tic-tac-toe, Titanic, transplant, vertebra-column, veteran, visualizing_hamster, visualizing_livestock, visualizing_slope, visualizing_soil, vowel, wholesale-customers, wilt, wine, witmer_census_1980

Table 1: List of used data sets.