<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Adaptive Incremental Learning for Statistical Relational Models Using Gradient-Based Boosting</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yulong Gu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Missier</string-name>
          <email>paolo.missier@newcastle.ac.uk</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>School of Computing Science, Newcastle University Newcastle upon Tyne</institution>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2017</year>
      </pub-date>
      <fpage>4</fpage>
      <lpage>6</lpage>
      <abstract>
<p>We consider the problem of incrementally learning models from relational data. Most existing learning methods for statistical relational models use batch learning, which becomes computationally expensive and eventually infeasible for large datasets. The majority of previous work in relational incremental learning assumes that the model's structure is given and only the model's parameters need to be learned. In this paper, we propose algorithms that can incrementally learn the model's parameters and structure simultaneously. These algorithms are based on the successful formalisation of the relational functional gradient boosting system (RFGB), and extend the classical propositional ensemble methods to relational learning for handling evolving data streams.</p>
      </abstract>
      <kwd-group>
        <kwd>Gradient-Based Boosting</kwd>
        <kwd>Incremental Learning</kwd>
<kwd>Hoeffding Bound</kwd>
        <kwd>Ensemble Methods</kwd>
        <kwd>Statistical Relational Learning</kwd>
        <kwd>Relational Regression Tree</kwd>
        <kwd>Concept Drift</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        There have already been extensions that enable the RFGB system to handle imbalanced data with Soft Margin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and incomplete data with Structural EM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In this paper, inspired by HTILDE-RT [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and DWM [6], we further develop the RFGB system for incremental learning: we upgrade the RRT used in RFGB with the Hoeffding bound and the concept adaptation strategy successfully employed in CVFDT [7], and we combine the upgraded RRT with ensemble methods through a Hoeffding-based stability metric to cope with concept drift.
      </p>
      <p>In this work, the Hoeffding Relational Regression Tree (HRRT), Relational Incremental Boosting (RIB) and the Relational Boosted Forest (RBF) are introduced for adaptive incremental learning in the SRL setting. This paper is organized as follows: in Section 2, the background, including RFGB and CVFDT, is reviewed; in Section 3, rule stability, HRRT, RIB, and RBF are explained; in Section 4, conclusions and future work are provided.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <sec id="sec-2-1">
        <title>Relational Functional Gradient Boosting</title>
        <p>
          Friedman [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] proposed a boosting approach where functional gradients are computed for each example over the
objective function. The RFGB [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] considers the objective function as a sigmoid function (Equation 2). These
gradients correspond to the di erence between the true label and predicted probability for an example and are
then used to generate a regression dataset. In each iteration, a RRT i is learned to t to these gradients and
added into the potential function m:
        </p>
        <p>ψm = ψ0 + ∆1 + ... + ∆m (1)</p>
        <p>This approach learns a boosted probabilistic model that corresponds to the local conditional probability distribution in some SRL models.</p>
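        <p>To make the boosting step concrete, the following minimal Python sketch illustrates one RFGB-style iteration under the sigmoid objective referred to as Equation 2: the functional gradient of an example is the difference between its true label and its predicted probability, and a regression tree ∆ is fitted to these gradients. The learner fit_rrt, the tree interface and the example representation are hypothetical placeholders; the actual RFGB implementation operates on relational examples.</p>
        <preformat>
import math

def sigmoid(x):
    # predicted probability P(y = 1 | example) when psi(example) = x
    return 1.0 / (1.0 + math.exp(-x))

def boosting_round(psi_trees, examples, fit_rrt):
    """One functional-gradient iteration: fit an RRT (Delta) to the gradients.

    psi_trees : list of trees whose summed predictions give the potential psi
    examples  : iterable of (example, label) pairs with label in {0, 1}
    fit_rrt   : hypothetical learner fitting a relational regression tree
                to (example, gradient) pairs
    """
    regression_data = []
    for x, y in examples:
        psi_x = sum(t.predict(x) for t in psi_trees)   # current potential value
        gradient = y - sigmoid(psi_x)                  # true label - predicted prob.
        regression_data.append((x, gradient))
    delta = fit_rrt(regression_data)                   # Delta_i fitted to gradients
    psi_trees.append(delta)                            # psi_m = psi_0 + ... + Delta_m
    return psi_trees
        </preformat>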
        <p>The propositional incremental decision tree learner VFDT [8] uses Hoeffding bounds to guarantee that its output is asymptotically nearly identical to that of a batch learner. CVFDT [7] introduces a concept adaptation strategy to VFDT for handling concept drift: it keeps the incrementally learned tree consistent with a sliding window of examples. In this paper, we call the contents of a sliding window the window data. The concept adaptation strategy is implemented as follows. The statistics of the window data are encoded into the sufficient statistics of the nodes in the tree. These sufficient statistics are then used to periodically check whether the Hoeffding bound holds at each node; if not, the algorithm creates an alternate sub-tree for the node to compete with the failed sub-tree in terms of prediction accuracy. If the alternate sub-tree beats the failed one, the failed sub-tree and the old conflicting rules encoded in it are discarded and replaced by the alternate sub-tree. The disadvantage of eliminating old conflicting rules entirely is that the model only supports the current window data, which usually results in larger prediction variance.</p>
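        <p>The following condensed Python sketch (all names are ours, not CVFDT's actual interface; grow_subtree and accuracy are injected helpers) outlines this adaptation strategy at a single inner node: the window statistics feed a periodic Hoeffding-bound check, a failing node grows an alternate sub-tree, and the alternate replaces the node only once it wins on prediction accuracy.</p>
        <preformat>
def concept_adaptation(node, window, epsilon, grow_subtree, accuracy):
    # Hypothetical sketch of CVFDT-style concept adaptation at one inner node.
    node.update_sufficient_statistics(window)        # encode the window data
    delta_g = node.gain(node.working_test) - node.gain(node.second_best_test)
    if delta_g &lt; epsilon:                            # Hoeffding bound fails here
        if node.alternate is None:
            node.alternate = grow_subtree(window)    # start a competing sub-tree
        elif accuracy(node.alternate, window) &gt; accuracy(node, window):
            return node.alternate                    # old conflicting rules discarded
    return node
        </preformat>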
        <p>In the following sections, we introduce our algorithms, which use ensemble methods to handle concept drift by allowing the coexistence of conflicting rules in the relational setting.</p>
        <p>In this paper, we qualify a tree as stable with the following considerations: the tree is highly consistent with the window data, and the rules encoded in the tree are real rules that can, with high confidence, interpret the distribution associated with the data generator. We boost a tree when it is stable, so that the objective function (Equation 2) is best optimized for the current window data and the newly found stable rules can be solidified and transformed into established rules. On the other hand, a stable tree is highly resistant to concept drift, as the newly found stable rules require many counter-examples to be invalidated. Inspired by DWM [6], in which ensemble methods are shown by extensive experiments to be efficient at adapting to drifting concepts, we propose to employ ensemble methods to tackle the concept drift problem in the relational setting.</p>
        <p>In the presence of concept drift, we want to know how stable the rules that the tree has learned so far are, as an indicator of the stability of the tree. We introduce a metric that measures the stability of a tree as follows.</p>
        <p>Definition 1. Define the Rule Stability of a model as the size n of the smallest change to the sample D that may cause a conflicting rule r′ to become superior to the learned rule r, as shown in Equation 3, where D′ is D after the change:</p>
        <p>Learner : (Diff(D, D′) = n, r) → r′ (3)</p>
        <p>According to CVFDT [7], to guarantee that the rule learned from a data sample is the real rule r of the corresponding population with desired confidence 1 − δ, the condition ∆Ḡ(Xa, Xb) &gt; ε must be met, where ∆Ḡ(Xa, Xb) = Ḡ(Xa) − Ḡ(Xb), Ḡ(Xi) is the average of the results of the splitting function G(Xi), Xa and Xb are respectively the working test and the second-best test at a node, and ε is a bound calculated from the sample size, the choice of splitting criterion, and δ. As the streaming process continues, a rule r′ conflicting with rule r might be introduced. For r′ to be adopted, the condition ∆Ḡ(Xa, Xb) &lt; ε must be met to trigger the concept adaptation strategy. Therefore, with confidence 1 − δ, the size of the smallest change that may cause r′ to become superior to r in the CVFDT context is Tolerance = ∆Ḡ(Xa, Xb) − ε. By Definition 1, the Tolerance measures the rule stability of an inner node, and TreeTol, the sum of the Tolerance of every inner node in a tree, denotes the stability of the tree.</p>
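        <p>As a concrete reading of these quantities, the Python sketch below computes the Hoeffding bound ε = sqrt(R² ln(1/δ) / 2n) for a sample of size n, where R is the range of the splitting criterion G, and the resulting Tolerance of an inner node; the function names are ours.</p>
        <preformat>
import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 ln(1/delta) / 2n); holds with confidence 1 - delta
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def node_tolerance(gain_best, gain_second_best, value_range, delta, n):
    # Tolerance = (mean gain of working test - mean gain of second-best test)
    #             - epsilon: the margin beyond the Hoeffding bound
    return (gain_best - gain_second_best) - hoeffding_bound(value_range, delta, n)
        </preformat>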
      </sec>
      <sec id="sec-2-2">
        <title>Hoeffding Relational Regression Tree</title>
        <p>
          Our HRRT algorithm upgrades the RRT learner used in RFGB by incorporating methods from CVFDT [7]. Its test search space includes conjunctions of recursive and aggregated predicates, obtained by using a refinement operator with θ-subsumption and aggregate conditions [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Most of the algorithm is similar to CVFDT, except that HRRT uses the learning-from-interpretations setting; for a detailed explanation, please refer to CVFDT and HTILDE-RT [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. We implement TreeTol in the StabilityCheck function with a user-defined threshold to periodically check whether the tree qualifies as stable: the tree is stable when TreeTol is greater than the threshold. HRRT and StabilityCheck are the building blocks of the following algorithms.
        </p>
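        <p>A minimal sketch of such a StabilityCheck, assuming a hypothetical node interface and reusing hoeffding_bound from the sketch above:</p>
        <preformat>
def stability_check(tree, threshold, delta):
    # TreeTol is the sum of the per-node Tolerances (Definition 1); the tree
    # qualifies as stable when TreeTol exceeds the user-defined threshold.
    tree_tol = 0.0
    for node in tree.inner_nodes():
        eps = hoeffding_bound(node.criterion_range, delta, node.example_count)
        tree_tol += (node.gain_best - node.gain_second_best) - eps
    return tree_tol &gt; threshold
        </preformat>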
      </sec>
      <sec id="sec-2-3">
        <title>Relational Incremental Boosting</title>
        <p>As shown in Algorithm 1, the initial tree ψ is incrementally learned by HRRT and is boosted once it has satisfied the StabilityCheck (line 9). The empty tree ∆ is then trained to fit the functional gradients of the boosted tree ψ for the examples in the window data (lines 5-7). In the presence of concept drift, the error introduced by the old conflicting rules in ψ will be corrected by ∆. When ∆ has also met the StabilityCheck and the execution signal S is true (line 6), the algorithm adds ∆ to ψ and then boosts their sum: since ∆ itself is just a functional gradient tree (FGT) of ψ, only their composite can represent the newly found stable rules.</p>
        <p>Algorithm 1 Relational Incremental Boosting
1: procedure RIB(DataStream, p)
2:   Initialize empty trees ψ and ∆
3:   for each d in DataStream do
4:     After every p examples do {ψ, S} ← EvalCentre(ψ, d)
5:     if ψ.boosted then ∆ ← HRRT(∆, GradExpGen(ψ, d))
6:       if StabilityCheck(∆) and S then Boosting(ψ + ∆) and reset ∆
7:       end if
8:     else ψ ← HRRT(ψ, d)
9:       if StabilityCheck(ψ) then Boosting(ψ), ψ.boosted ← True
10:      end if
11:    end if
12:  end for
13:  return (ψ + ∆)
14: end procedure</p>
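        <p>Read procedurally, Algorithm 1 corresponds to the Python-style sketch below. HRRT, GradExpGen, Boosting, EvalCentre and StabilityCheck stand for the components described in this paper, but their interfaces here, like EmptyTree and the + that composes two trees' regression values, are illustrative assumptions only.</p>
        <preformat>
def rib(data_stream, p, hrrt, grad_exp_gen, boosting, eval_centre, stability_check):
    psi, delta = EmptyTree(), EmptyTree()     # psi: initial tree, delta: its FGT
    s = True                                  # execution signal from EvalCentre
    for i, d in enumerate(data_stream, start=1):
        if i % p == 0:                        # after every p examples
            psi, s = eval_centre(psi, d)
        if psi.boosted:
            # fit delta to the functional gradients of the boosted psi
            delta = hrrt(delta, grad_exp_gen(psi, d))
            if stability_check(delta) and s:
                boosting(psi + delta)         # boost the composite, not delta alone
                delta = EmptyTree()           # reset delta
        else:
            psi = hrrt(psi, d)                # grow the initial tree
            if stability_check(psi):
                boosting(psi)
                psi.boosted = True
    return psi + delta
        </preformat>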
        <p>The execution signal S is set by EvalCentre (line 4), which handles two crucial issues shared by RIB and the following RBF. One issue is that the complexity of the resulting model grows with the increasing size of the learned data. Inspired by DWM [6], we introduce an evaluation centre that periodically evaluates the contribution to error of each FGT in the resulting model and discards the FGTs that perform poorly over time. In this way, the model's complexity decreases at the expense of its integrity. The other issue is that, if the concept never drifts, new boosted trees will be added to the model constantly while providing only trivial improvements in performance. In this case, EvalCentre evaluates the global performance of the model and stops further boosting by setting the signal S to false once the performance reaches a pre-defined threshold, indicating that strong consistency of the model with the window data has been achieved.</p>
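        <p>One possible shape for EvalCentre, covering both issues, is sketched below; the error-attribution and performance measures are simplifying assumptions of ours, not a specification of the actual component.</p>
        <preformat>
def eval_centre(model, window, err_threshold, perf_threshold):
    # Issue 1: discard FGTs whose contribution to error is persistently poor,
    # trading some of the model's integrity for lower complexity
    model.fgts = [t for t in model.fgts
                  if t.error_contribution(window) &lt; err_threshold]
    # Issue 2: once global performance reaches the pre-defined threshold, the
    # model is strongly consistent with the window data, so stop boosting
    s = model.performance(window) &lt; perf_threshold
    return model, s
        </preformat>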
      </sec>
      <sec id="sec-2-4">
        <title>Relational Boosted Forest</title>
        <p>In this explanation of RBF, the model makes predictions based on the weighted average of the regression values of the trees in the forest, with normalized weights; other prediction models can be applied depending on the scenario. As shown in Algorithm 2, the forest and its weights are initialized along with an empty tree ψ and its associated weight w set to 1 (line 2). When ψ has passed the StabilityCheck and the execution signal S is true, the boosted ψ and its weight w are added to the forest and the weights respectively (lines 7-8). When a boosted tree makes a mistake in a predictive attempt, the EvalCentre in RBF decreases its weight. In this way, the forest is recursively populated: each boosted tree contains established rules that co-exist in the forest, and the weights are dynamically tuned to adapt to the window data. In response to the complexity problem, poorly performing trees whose weight falls below a pre-defined threshold are removed from the forest. The signal S is set in the same way as in RIB, to handle the case where the concept does not drift.</p>
        <p>Algorithm 2 Relational Boosted Forest
1: procedure RBF(DataStream, p)
2:   Initialize empty Forest &amp; Weights, empty tree ψ and w ← 1
3:   for each d in DataStream do
4:     After every p examples do
5:       {Forest, Weights, S} ← EvalCentre(Forest, Weights, d)
6:     ψ ← HRRT(ψ, d)
7:     if StabilityCheck(ψ) and S then Boosting(ψ)
8:       Add ψ to Forest, w to Weights and reset ψ, w
9:     end if
10:  end for
11:  return {Forest, Weights}
12: end procedure</p>
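        <p>For concreteness, the weighted-average prediction and the mistake-driven weight decay could look as follows; this is a sketch under our own naming, with beta and the removal floor as assumed hyper-parameters and sigmoid as in the earlier RFGB sketch.</p>
        <preformat>
def rbf_predict(forest, weights, example):
    # weighted average of the trees' regression values with normalized weights
    total = sum(weights)
    return sum((w / total) * tree.predict(example)
               for tree, w in zip(forest, weights))

def penalise_mistake(forest, weights, example, label, beta=0.5, floor=0.01):
    # decrease the weight of every boosted tree that predicts the wrong label;
    # trees whose weight falls below the floor are removed from the forest
    for i, tree in enumerate(forest):
        predicted = 1 if sigmoid(tree.predict(example)) &gt;= 0.5 else 0
        if predicted != label:
            weights[i] *= beta
    kept = [(t, w) for t, w in zip(forest, weights) if w &gt;= floor]
    return [t for t, _ in kept], [w for _, w in kept]
        </preformat>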
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusions and Future Work</title>
      <p>
        In this paper, we have introduced three adaptive incremental learning algorithms: HRRT, RIB and RBF. All of these algorithms can incrementally and adaptively learn the parameters and the structure simultaneously for SRL models such as Relational Dependency Networks (RDNs) and Markov Logic Networks (MLNs). RIB and RBF extend the classical ensemble methods for the first time to relational scenarios for handling drifting concepts. As they are developed within the RFGB system, RIB and RBF can be naturally integrated with algorithms designed for that system, such as Soft Margin [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Structural EM [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], for modelling imbalanced and incomplete data in an incremental fashion.
      </p>
      <p>In future work, we will evaluate these algorithms on some of the standard SRL benchmark datasets, comparing their performance against each other and against other state-of-the-art online structure learning algorithms.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khot</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gutmann</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shavlik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gradient-based boosting for statistical relational learning: The relational dependency network case</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>86</volume>
          ,
          <fpage>25</fpage>
          -
          <lpage>56</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Friedman</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          :
          <article-title>Greedy function approximation: A gradient boosting machine</article-title>
          .
          <source>Ann. Stat</source>
          .
          <volume>29</volume>
          ,
          <fpage>1189</fpage>
          -
          <lpage>1232</lpage>
          (
          <year>2001</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Khot</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kunapuli</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hauser</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Learning from Imbalanced Data in Relational Domains: A Soft Margin Approach</article-title>
          .
          <source>Proc. IEEE Int. Conf. Data Mining (ICDM)</source>
          ,
          <fpage>1085</fpage>
          -
          <lpage>1090</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Khot</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kersting</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shavlik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Gradient-based boosting for statistical relational learning: the Markov logic network and missing data cases</article-title>
          .
          <source>Mach. Learn</source>
          .
          <volume>100</volume>
          ,
          <fpage>75</fpage>
          -
          <lpage>100</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Menezes</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zaverucha</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          :
          <article-title>HTILDE-RT: Scaling up relational regression trees for very large datasets</article-title>
          .
          <source>Inductive Log. Program.</source>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>