Online Transfer Learning for Concept Drifting Data Streams

Helen McKay (Department of Computer Science, University of Warwick, Coventry, UK; H.McKay@warwick.ac.uk)
Nathan Griffiths (Department of Computer Science, University of Warwick, Coventry, UK)
Phillip Taylor (Department of Computer Science, University of Warwick, Coventry, UK)
Theo Damoulas (Department of Computer Science and Department of Statistics, University of Warwick, Coventry, UK)
Zhou Xu (Jaguar Land Rover Research, Coventry, UK)

Abstract

Transfer learning uses knowledge learnt in a source domain to aid predictions in a target domain. When both source and target domains are online, each is susceptible to concept drift, which may alter the mapping of knowledge between them. Drifts in online domains can make additional information available, necessitating knowledge transfer both from the source to the target and vice versa. To address this we introduce the Bi-directional Online Transfer Learning framework (BOTL), which uses knowledge learnt in each online domain to aid predictions in others. We also introduce two variants of BOTL that incorporate model culling to minimise negative transfer in frameworks with large numbers of domains. We provide a theoretical performance guarantee that indicates BOTL achieves a loss no worse than the underlying local concept drift detection algorithm. Empirical results are presented using two data stream generators, the drifting hyperplane emulator and the smart home heating simulator, and real-world data predicting Time To Collision (TTC) from vehicle telemetry. The evaluation shows BOTL and its variants outperform the existing state-of-the-art technique.

1 Introduction

Online Learning (OL) and Transfer Learning (TL) have been extensively studied within the machine learning community [6, 19, 24]. OL enables supervised learning to be conducted on data streams that are susceptible to concept drift [6]. Concept drift can cause the distribution of data to change over time, modifying the underlying concept, meaning predictive models must be updated or discarded to maintain effective predictions. To build accurate models, many real-world applications require large amounts of training data, which is often limited due to concept drifts [19]. TL enables models to be learnt in domains where training data is readily available, and used where it is limited to build more effective predictors [19]. TL has typically been conducted offline, limiting its use in real-world online environments [30]. It may be desirable to use on-device learning to personalise the functionalities of user-facing applications, but a rich history of data may not be available locally due to memory limitations, and drifts may be encountered frequently. Predictive performance could be enhanced by using TL in an online setting, using knowledge learnt from other domains to aid the target predictor. The Online Transfer Learning framework (OTL), developed by Zhao et al. [30], was proposed to enable TL to be used within an online setting. Current versions of OTL, such as [9, 7, 26], assume the source is in an offline environment, ignoring the possibility of concept drift occurring in the source domain.
In this paper, we propose the Bi-directional Online Transfer Learning framework (BOTL, available at https://github.com/hmckay/BOTL), which considers both the source and target to be online. This has three benefits over existing approaches. Firstly, individual concepts are learnt, using concept drift detection strategies, and transferred to improve performance in the target [6]. Secondly, additional knowledge is transferred as new concepts are encountered in the source domain. Thirdly, knowledge can be transferred bi-directionally, enabling more effective predictions to be made in both domains. Specifically, we:

• introduce the BOTL framework, enabling each domain to benefit from online TL in a regression setting,
• provide a theoretical performance guarantee showing predictions made by BOTL are no worse than the underlying local concept drift detection algorithm, and
• show the performance of BOTL exceeds existing state-of-the-art techniques using a variety of datasets.

We evaluate BOTL in a regression setting using two synthetic datasets and one real-world dataset containing both sudden and gradual drifts. We compare BOTL with a state-of-the-art online TL framework, the Generalised Online Transfer Learning framework (GOTL), which assumes the source is offline [9]. The remainder of this paper is organised as follows. Section 2 outlines related work. Section 3 formulates the setting in which BOTL is used. Section 4 presents the proposed framework, and its theoretical performance guarantee is presented in Section 5. Section 6 specifies the data used for the results presented in Section 7. Finally, Section 8 concludes the paper.

2 Related Work

Online TL combines OL and TL. The aim of TL is to use knowledge learnt in one task, referred to as the source, to improve the effectiveness of predictions in another, referred to as the target [18]. There are three distinct types of TL: inductive, transductive and unsupervised [1, 3, 19]. Inductive TL is used when source and target predictive tasks are different. Knowledge is transferred from the source to induce a supervised predictive function in the target [3]. Typically, large amounts of labelled target data are required to create a mapping between domains [19]. Unsupervised TL is applied in a similar way, but to tasks such as clustering [19]. Transductive TL is used when source and target tasks are the same, transferring knowledge to improve predictive performance for a target where no labelled data is available [1]. TL can be further categorised as homogeneous, where the domains of source and target are the same, or heterogeneous, where they differ [30]. In this paper we consider a homogeneous setting, and use inductive TL to improve the predictive performance within both source and target domains.

It is desirable, for many modern applications, such as smart home heating systems, to predict future events from historical data. However, applications are often limited by memory constraints, preventing a complete history of data being retained [8]. Additionally, due to the dynamic and non-stationary environment of data streams, the underlying concept may evolve or drift over time [13]. Concept drift is a change in the distribution of the observed data, or a change in the mapping between observations and values to be predicted [11]. If the underlying concept changes, the previously built model may no longer make effective predictions, requiring the model to be modified or re-learnt [6]. To maintain effective predictions, concept drift detection algorithms are frequently used in OL.
Concept drift detection algorithms typically use a sliding window to maintain a subset of recent instances, usually used to update or rebuild the model. Strategies to update a model include ensemble learning approaches, where the window of recent instances is used to create a new model that is combined with previously learnt models to improve the predictive performance. Model predictions are aggregated; for example, DWM uses a mean weighted by the models' estimated performance [14]. Alternatively, concept drift detection algorithms such as ADWIN [2] use the window of data to create a model that represents the current concept independently of previous ones. A challenge associated with these concept drift detection strategies is that every time a concept is encountered, a new model must be learnt, and data must be collected to build the model. RePro [27] uses an approach similar to ADWIN, but retains a history of concepts and concept transitions to prevent re-learning recurring concepts [28]. This removes the need to collect new data each time a recurring concept is encountered; however, data must still be collected to build models for new concepts. For many real-world applications, particularly those that are user facing, knowledge obtained from other users could enhance predictions when new concepts are encountered through the use of online TL.

Existing online TL frameworks aim to transfer knowledge learnt from an offline source to an online target for classification tasks. OTL [30] combines the offline source model with the online target model using a weighting mechanism that is updated with respect to the performance of the models on a sliding window of data in the target domain. GOTL [9] extends the OTL weighting mechanism such that online TL can be used for both classification and regression. The weighting mechanism used by GOTL incrementally updates in steps to obtain weightings for source and target models. If the step size used to modify the weights is small enough, the ensemble of source and target models approximates the optimal weight combination [8]. However, if the step size is too small, it may take substantial time for the weights to update to their desired values, making predictions unreliable during this period.

The field of online TL relates to Online Multi-task Learning (OMTL) [17, 20, 21], and Multistream Regression (MSR) [10]. MSR can be seen as a special case, where the source and target data streams are drawn from the same underlying distribution, and concepts encountered in the target domain have previously been encountered in the source [4]. This means the models transferred from the source can be used to make predictions in the target without requiring a target learner. This is unrealistic for many real-world applications as, although source and target domains may be similar, it is unlikely the data streams are drawn from the same distribution. The goal of OMTL is to minimise the cumulative global loss across all domains [16], whereas online TL aims to minimise the predictive losses within each individual domain. Considering loss in this way is beneficial when applied to tasks such as application personalisation, where each domain represents a different user, and prediction errors should be minimised for that specific individual. Although online TL has been actively studied [9, 7, 26, 25, 29, 30], existing approaches assume the source is offline.
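To illustrate why a small step size slows adaptation in a step-wise weighting scheme such as GOTL's, the sketch below nudges a source/target weight by a fixed step towards whichever model predicted the latest instance better. The update rule, function names and the step value are illustrative assumptions, not the authors' implementation of GOTL.

```python
# Hypothetical sketch of a GOTL-style step-wise weighting of one source model
# and one target model; the exact update rule used by GOTL [9] may differ.
def step_weight_update(w_src, y_true, y_src, y_tgt, step=0.025):
    """Move the source weight one step towards the better-performing model;
    the target model implicitly receives weight (1 - w_src)."""
    err_src = (y_true - y_src) ** 2
    err_tgt = (y_true - y_tgt) ** 2
    if err_src < err_tgt:
        w_src = min(1.0, w_src + step)
    elif err_tgt < err_src:
        w_src = max(0.0, w_src - step)
    return w_src

def combined_prediction(w_src, y_src, y_tgt):
    """Weighted ensemble prediction of the source and target models."""
    return w_src * y_src + (1.0 - w_src) * y_tgt
```

With a step of 0.025, moving a weight from 0.5 to near 0 or 1 requires on the order of twenty updates, which is why such weights can lag behind a sudden change in concept.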
We propose BOTL, which considers both source and target to be in online environments, as might be expected in real-world applications such as smart home heating or vehicle Adaptive Cruise Control (ACC) personalisation.

3 Problem Formulation

Let domain D consist of a feature space X, where x_t ∈ R^m is the instance observed at time t such that x_t = {x_t1, ..., x_tm} ∈ X. Given domain D, a task consists of the target response variable, y ∈ Y, where y ∈ R, and a regression function, f : X → Y, which is learnt to map observed data to the target concept [19]. The knowledge learnt in a source domain, D^S, can be transferred to the target domain, D^T, and used to enhance predictions [24]. Online TL aims to learn the target predictive function, f^T, that effectively predicts the response variable, y_t ∈ Y^T, for each instance, x_t^T ∈ X^T, observed in the target data stream, such that ŷ'_t^{T_i} = f_i^T(x_t^T). Model transfer is used to enhance the target predictor by combining knowledge learnt in the local domain with knowledge learnt from other domains. For example, if we consider the scenario of application personalisation, where each domain represents an individual user, each instance, x_t, may describe the user's current environmental setting. If we wish to personalise application functionality by predicting some unknown value, y_t, we may be able to utilise knowledge learnt from another user, f_j^S, to enhance the predictive performance of the target learner. Identifying concept drift in the source allows models, f_j^S where j = 1 ... k, to be transferred for each of the k concepts encountered.

BOTL aims to minimise the predictive error in the target domain by combining knowledge learnt from the target data stream with models previously learnt in a source domain. Focusing on minimising the loss with respect to the local, or target, domain makes BOTL highly applicable to the task of application personalisation, where predictions are made to benefit a specific individual. To achieve this, if we have a source domain, D^S, that has previously learnt models f_1^S, ..., f_j^S, and a target domain, D^T, that has previously learnt models f_1^T, ..., f_i^T, at time t, then models f_1^S, ..., f_j^S should be made available to the target domain such that the target learner can benefit from the knowledge learnt in the source domain, D^S. As both domains are online, and knowledge transfer is bi-directional, the models f_1^T, ..., f_i^T should also be made available to the source domain, D^S, such that the source learner can benefit from the knowledge learnt in the target domain, D^T. In this paper, the source and target domains are considered to be homogeneous, such that they share the same underlying feature space, X^S = X^T, and Y^S = Y^T.

Table 1: Notation
  D^α                 Domain α: target (T) or source (S)
  X^α                 Data stream of domain α
  Y^α                 Response variables of domain α
  x_t ∈ X^α           The t-th observed instance in X^α
  y_t ∈ Y^α           The response variable of instance x_t
  M                   Knowledge base of models
  f^α : X^α → Y^α     Model learnt in domain α
  F^M : X^T → Y^T     Meta-model of M
  ŷ_t                 Prediction using F^M(x_t)
  ŷ'_t^α              Prediction using f^α(x_t)
  W                   Sliding window of instances
  W_max               Maximum window size
  err_t               Predictive error of instance x_t
  err_W               Predictive error across W
  l                   Loss threshold
  d                   Drift threshold
  c_perf              Performance culling threshold
  c_MI                Mutual information culling threshold
Although the domains are homogeneous, the underlying concepts to be learnt within source and target domains may not be equivalent, therefore models from a source domain may not be relevant to the current target concept. BOTL provides a mechanism to combine models and maximise the impact of transferred models on the target. In presenting BOTL we use the notation detailed in Table 1.

4 Bi-directional Online Transfer Learning (BOTL)

To utilise knowledge of distinct concepts, BOTL hinges upon a sliding window based concept drift detection algorithm. In this paper we use an adaptation of RePro [28] for regression as the underlying drift detector; however, other concept drift detection algorithms could be used. Although RePro requires some domain expertise to select appropriate parameter values, such as window size and drift threshold, it has been selected as the foundation of BOTL as it provides two benefits that outweigh this limitation. Firstly, RePro aims to select a single model that represents the current concept rather than using an ensemble of models to make predictions [27]. If instead an ensemble-based detection algorithm were used, such as DWM [14], the knowledge learnt to represent a single source concept would be encompassed across multiple models in the ensemble, and therefore all models would need to be transferred. Limiting the number of models that need to be transferred to the target domain reduces the communication and computational overhead of combining knowledge. Secondly, RePro retains a history of previously encountered concepts, preventing the need to re-learn models for recurring concepts [28]. RePro uses a sliding window of data, W, in the target domain to identify whether a previously learnt model represents the current concept. If the concept has not previously been encountered, a new model is created.

RePro was initially developed specifically for classification, therefore modifications are required for regression tasks, highlighted in Algorithm 1. The original RePro algorithm detects drifts by measuring the target model's classification accuracy across W. To apply RePro to regression, we instead use the R² performance of the target model, f_i^T, across W. A concept drift is said to have occurred if the R² performance drops below a predefined drift threshold, d. RePro traditionally maintains a sliding window of data with a maximum size of W_max by discarding one incorrectly classified instance, and all subsequent correctly classified instances, when the window is full. In order to encapsulate this behaviour we introduce ε-insensitivity through a loss threshold, l, that allows instance x_{t−|W|} to be discarded from the window if the predicted value, ŷ'_{t−|W|}, satisfies |ŷ'_{t−|W|} − y_{t−|W|}| ≤ l.

Algorithm 1 Adapted RePro for regression.
Input: W_max, l, d, X^T.
  H^T = ∅ (historical concepts), M^T = ∅ (transition matrix)
  Learn f_1^T using x_1 ... x_{W_max}, add to H^T
  for t = W_max + 1, W_max + 2, ... do
    Receive x_t and predict ŷ'_t = f_i^T(x_t)
    Receive y_t, add ⟨x_t, ŷ'_t, y_t⟩ to W
    if f_i^T is new and stable then
      Add f_i^T to H^T and f_{i−1}^T → f_i^T to M^T
    if R²(f_i^T, W) < d then
      f_{i+1}^T = SelectModel(H^T, M^T, W) or learn new model over W
    else if |W| ≥ W_max then
      Remove x_{t−|W|} from W
      while |f_i^T(x_{t−|W|}) − y_{t−|W|}| ≤ l do
        Remove x_{t−|W|} from W
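The regression drift check and ε-insensitive window trimming of Algorithm 1 can be sketched as follows. This is a minimal illustration assuming a scikit-learn style regressor; the function and parameter names are ours, not those of the released BOTL code.

```python
import numpy as np
from sklearn.metrics import r2_score

# Minimal sketch of the drift check and epsilon-insensitive window trimming
# described in Algorithm 1; names and structure are illustrative only.
def drift_detected(model, window, drift_threshold=0.6):
    """Flag a drift when the local model's R^2 over the window falls below d."""
    X = np.array([x for x, _ in window])
    y = np.array([y for _, y in window])
    return r2_score(y, model.predict(X)) < drift_threshold

def trim_window(model, window, w_max=30, loss_threshold=0.05):
    """When the window is full, drop the oldest instance, then keep dropping
    leading instances whose absolute error is within the loss threshold l."""
    if len(window) >= w_max:
        window.pop(0)
        while window:
            x, y = window[0]
            if abs(model.predict(np.array([x]))[0] - y) <= loss_threshold:
                window.pop(0)
            else:
                break
    return window
```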
Concept drift detection in BOTL is conducted solely using the locally learnt model. Drift detection should be conducted independently of any knowledge transfer, as transfer may enhance the predictive performance across the current window of target data and, as such, hinder drift detection. RePro uses a transition matrix to determine the likelihood of encountering recurring concepts. To prevent the reuse of unstable models that make poor predictions, only those that have been used to make predictions over a predefined number of instances, without causing a drift to be detected, are considered to be stable and added to the transition matrix. BOTL adopts this notion of a stable model to reduce negative transfer, which occurs when an ineffective model is transferred to the target domain [19]. Unstable models are not transferred, preventing them from negatively impacting the target predictor. A stable model is defined in BOTL to be one that has been used across 2|W| instances while maintaining a performance above the drift threshold, d. Once a model is deemed to be stable, it can be transferred to other domains to aid their respective predictors, as shown in Algorithm 2. This means that model transfer is limited to models that RePro considers to have successfully learnt a concept in their local domain.

Algorithm 2 BOTL: transferring models.
Input: W_max, l, d, X^T, M.
  for t = 1, 2, ... do
    if f_{j+1}^S available then
      Receive f_{j+1}^S from source, add to M
    Receive x_t
    Select f_i^T and detect drifts using Alg. 1, add to M
    x'_t = ⟨f_1^S(x_t), ..., f_k^S(x_t), f_i^T(x_t)⟩
    ŷ_t = F^M(x'_t)  (see Eq. 1)
    Receive y_t
    if f_i^T is a new stable model then
      Send f_i^T to all other domains in the framework

Knowledge transfer is achieved in BOTL by communicating models across domains. When model f_j^S is received, it is added to the set of transferred models, M, and combined with the target predictor, f_i^T, to enhance the overall predictive performance. Our instantiation of BOTL uses an Ordinary Least Squares (OLS) regressor as a meta learner to combine the available models such that the squared error of the predicted values, ŷ, across W is minimised. Other regression learners could be used in place of OLS. Each transferred model, f_j^S ∈ M, and the current target model, f_i^T, are used to generate a new window of data. Each sample x'_t in the newly generated window is of the form x'_t = {ŷ'_t^{S_1}, ..., ŷ'_t^{S_k}, ŷ'_t^{T_i}}, where ŷ'_t^{S_j}, for all j = 1, ..., k, is the predicted value of source model f_j^S on instance x_t from the original window of target data, and ŷ'_t^{T_i} is the predicted value of the locally learnt target model, f_i^T, selected by RePro for the current concept, c_i. This window of model predictions is used by the OLS meta learner to obtain the overarching predictive function,

  \hat{y}_t = F^M(x'_t) = w_0 + \left( \sum_{j=1}^{k} w_j f_j^S(x_t) \right) + w_{k+1} f_i^T(x_t).    (1)
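To make Equation 1 concrete, the sketch below builds meta-instances from the model predictions over the window and fits an OLS meta learner; scikit-learn's LinearRegression stands in for the OLS regressor, and all names are illustrative rather than taken from the BOTL code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of the OLS meta learner in Equation 1: each meta-instance x'_t is the
# vector of predictions made by the transferred source models and the current
# local model on instance x_t.
def build_meta_instances(models, X_window):
    """Stack model predictions column-wise: one column per model."""
    return np.column_stack([m.predict(X_window) for m in models])

def fit_meta_learner(models, X_window, y_window):
    """Fit F^M by minimising the squared error of the combined prediction over W."""
    meta_X = build_meta_instances(models, X_window)
    return LinearRegression().fit(meta_X, y_window)

def meta_predict(meta_model, models, x_t):
    """Overarching prediction for a single incoming instance x_t."""
    meta_x = np.array([[m.predict(x_t.reshape(1, -1))[0] for m in models]])
    return meta_model.predict(meta_x)[0]
```

Because the intercept w_0 and the weights are re-fitted on each window, the influence of a transferred model can change immediately after a drift, in contrast to a step-wise weighting scheme.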
4.1 Bi-directional Transfer

BOTL considers the scenario where all domains are online, therefore distinctions between source and target can be disregarded. In this paper, BOTL conducts peer-to-peer model transfer, allowing knowledge transfer to enhance the predictive performance of all domains. When a newly learnt model is stable, it is transferred to all other domains in the framework, and each domain updates its model set, M, when a concept drift is encountered. Real-world applications, such as smart home heating system personalisation, may be comprised of a large number of domains, rapidly increasing the number of models to be transferred as the number of domains grows. Such applications can suffer in predictive performance due to the curse of dimensionality, where the number of input features to the OLS meta learner becomes large in comparison to the window size [5]. To combat this, we introduce culling to BOTL, referred to as BOTL-C.

4.2 Model Culling

Culling transferred models from the model set, M, helps to prevent the OLS meta learner overfitting when a large number of models have been transferred and only a small window of data is available. We introduce two variants of BOTL-C. Firstly, BOTL-C.I reduces the number of models available to the OLS meta learner by temporarily removing transferred models from the model set, M, when their R² performance across the current window of data drops below a threshold, c_perf. These models are re-added to M when a concept drift is encountered, to enhance predictions of future concepts in the target domain. Although this method of culling is naïve, it can reduce the impact of negative transfer. In scenarios with high volumes of model transfer, BOTL-C.I requires a high R² threshold to sufficiently reduce the number of models to prevent the OLS meta learner overfitting. This can be detrimental, as a high proportion of the transferred models containing useful information are culled and are no longer available to enhance the predictive performance of the target learner.

To overcome this, BOTL-C.II, outlined in Algorithm 3, evaluates transferred models based on both performance and diversity, metrics commonly used in the ensemble pruning literature [31]. Initially, BOTL-C.II reduces the impact of negative transfer by culling models that achieve an R² performance less than c_perf on W. A low c_perf value is preferred, ensuring transferred models containing some useful information are retained. Using a low threshold may not sufficiently reduce the model set, M, to prevent overfitting, therefore a second round of culling is performed based on model diversity. BOTL-C.II measures the diversity between transferred models using the mutual information of their predictions on the local window of data. If two transferred models have a mutual information greater than c_MI, BOTL-C.II culls the model that performs worse. A high c_MI should be selected, as the window of locally available data is often small; therefore, if a complex concept is to be learnt, the target learner may benefit from utilising knowledge transferred from similar concepts. However, if this threshold is too high, the model set, M, will not be reduced sufficiently to prevent overfitting.

Algorithm 3 BOTL-C.II: model culling.
Input: W, c_perf, c_MI, M, f_i^T.
  for f_j^S ∈ M do
    if R²(f_j^S, W) < c_perf then
      Remove f_j^S from M
    for f_k^S ∈ M do
      if MutualInformation(f_j^S, f_k^S, W) ≥ c_MI then
        if R²(f_j^S, W) ≥ R²(f_k^S, W) then
          Remove f_k^S from M
        else
          Remove f_j^S from M
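A minimal sketch of the two-stage culling in BOTL-C.II follows. scikit-learn's mutual_info_regression is used here as one possible estimator of the mutual information between two models' predictions; the exact estimator and any normalisation used by BOTL-C.II are not specified above, so those choices are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.feature_selection import mutual_info_regression

# Sketch of BOTL-C.II culling (Algorithm 3). The MI estimator and thresholds
# here illustrate the idea only; they are not the released implementation.
def cull_models(models, X_window, y_window, c_perf=0.2, c_mi=0.95):
    preds = {m: m.predict(X_window) for m in models}

    # Stage 1: drop transferred models whose R^2 on W is below c_perf.
    kept = [m for m in models if r2_score(y_window, preds[m]) >= c_perf]

    # Stage 2: for highly redundant pairs, keep only the better performer.
    removed = set()
    for i, m_a in enumerate(kept):
        for m_b in kept[i + 1:]:
            if m_a in removed or m_b in removed:
                continue
            mi = mutual_info_regression(preds[m_a].reshape(-1, 1), preds[m_b])[0]
            if mi >= c_mi:
                worse = (m_b if r2_score(y_window, preds[m_a]) >= r2_score(y_window, preds[m_b])
                         else m_a)
                removed.add(worse)
    return [m for m in kept if m not in removed]
```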
4.3 Initialisation

The underlying adaptation of RePro requires an initial window of data, W, to create the first predictive model, f_1^T. Prior to obtaining this data, no predictions can be made using RePro. BOTL allows models transferred from other domains to be used to make predictions during this period. Transferred models are initially weighted equally to obtain,

  \hat{y}_t = \frac{1}{|M|} \sum_{j=1}^{|M|} f_j^S(x_t).    (2)

Before the first target model, f_1^T, has been learnt, and only a small amount of data has been observed, the OLS regressor can create a model, F^M, using only source models, f_j^S. This approach is prone to overfitting due to the small amount of data available, but may be preferred over making no predictions or using Equation 2 over the entire initial window of data. The BOTL-C variants help to reduce overfitting within this initial period; however, as the amount of available data is small, all transferred models may have R² performances below the culling threshold. In this scenario, both BOTL-C variants select the best three transferred models, regardless of the culling threshold.

5 Performance Guarantee

Theorem. Assuming an i.i.d. data generating process, BOTL has a loss less than or equal to that of the model learnt locally using RePro with no knowledge transfer,

  L(f_i^T) \geq L(F^M),    (3)

where L(f_i^T) denotes the loss of the local model, f_i^T, created using RePro, and L(F^M) is the loss of the OLS meta learner, F^M, created using the set of k models transferred from the source, {f_1^S, ..., f_k^S}, and the current target model, f_i^T.

Proof. We measure loss over the local window of data, W, using the mean squared error of predictions,

  L(\cdot) = \frac{1}{|W|} \sum_{t=1}^{|W|} (y_t - \hat{y}_t)^2,    (4)

where y_t is the response variable for instance x_t, and ŷ_t is the predicted value. If no transfer is used, the local model, f_i^T, is used to predict ŷ_t for each instance x_t such that ŷ_t = f_i^T(x_t). BOTL uses the set of models, M, to obtain predictions ŷ'_t^{T_i} and all ŷ'_t^{S_j} for instance x_t, using the locally learnt model, f_i^T, and each transferred model, f_j^S ∈ {f_1^S, ..., f_k^S}, respectively. Predictions are used to create a meta instance, x'_t, which the OLS meta learner, F^M, uses to obtain an overarching prediction,

  \hat{y}_t = F^M(x'_t) = F^M\big(\langle f_1^S(x_t), \ldots, f_k^S(x_t), f_i^T(x_t) \rangle\big) = F^M\big(\langle \hat{y}'^{S_1}_t, \ldots, \hat{y}'^{S_k}_t, \hat{y}'^{T_i}_t \rangle\big),    (5)

where

  F^M(x'_t) = w_0 + \sum_{j=1}^{k} w_j \hat{y}'^{S_j}_t + w_{k+1} \hat{y}'^{T_i}_t.    (6)

Weights w_0, ..., w_{k+1} are assigned to each prediction, ŷ'_t^n, for each model n in M, where |M| = (k + 1), to obtain an ensemble prediction, ŷ_t, for instance x_t, by solving the optimisation problem that minimises the squared error of F^M:

  \min_{w_0, \ldots, w_{k+1}} \sum_{t=1}^{|W|} \left( y_t - \left( w_0 + \sum_{j=1}^{k} w_j \hat{y}'^{S_j}_t + w_{k+1} \hat{y}'^{T_i}_t \right) \right)^2.    (7)

F^M is used to make predictions, ŷ_t, for instance x_t, using Equation 6. Using Equation 4, we can rewrite the loss of F^M as,

  L(F^M) = \frac{1}{|W|} \sum_{t=1}^{|W|} \left( y_t - \left( w_0 + \sum_{j=1}^{k} w_j \hat{y}'^{S_j}_t + w_{k+1} \hat{y}'^{T_i}_t \right) \right)^2.    (8)

If we constrain the optimisation problem in Equation 7 to obtain the meta learner F^{M*} by fixing the weights such that the weight associated with the locally learnt model, f_i^T, is 1, while all others are 0, we obtain a meta model of the form,

  F^{M*}(x'_t) = 0 + \sum_{j=1}^{k} 0 \cdot \hat{y}'^{S_j}_t + 1 \cdot \hat{y}'^{T_i}_t,    (9)

giving the loss function

  L(F^{M*}) = \frac{1}{|W|} \sum_{t=1}^{|W|} \left( y_t - \hat{y}'^{T_i}_t \right)^2,    (10)

which is equivalent to only using the locally learnt model, L(F^{M*}) = L(f_i^T). As the optimisation problem in Equation 7 is convex, and F^{M*} is a feasible solution of it,

  L(F^{M*}) \geq L(F^M).    (11)

Finally, as the constrained optimisation problem in Equation 9 is equivalent to using only the locally learnt model, f_i^T, the loss of BOTL is less than or equal to the loss of the locally learnt model.
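As an illustrative sanity check of this argument (not part of the original paper), the snippet below fits the OLS meta learner with and without the source-model predictions on the same window and confirms that the unconstrained meta learner's training MSE never exceeds that of the local model alone, since the local-only weighting is a feasible solution.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Illustrative check: on a fixed window W, the OLS meta learner over
# {source predictions, local prediction} cannot have higher training MSE
# than the local model alone. Data below is synthetic and arbitrary.
rng = np.random.default_rng(0)
n = 50
y = rng.normal(size=n)                            # response over the window W
local_pred = y + rng.normal(scale=0.5, size=n)    # local model predictions
source_preds = rng.normal(size=(n, 3))            # three (unhelpful) source models

loss_local = mean_squared_error(y, local_pred)

meta_X = np.column_stack([source_preds, local_pred])
meta = LinearRegression().fit(meta_X, y)
loss_meta = mean_squared_error(y, meta.predict(meta_X))

assert loss_meta <= loss_local + 1e-9
print(f"local MSE={loss_local:.4f}, meta MSE={loss_meta:.4f}")
```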
6 Experimental Set-up

Many benchmark datasets have been created to evaluate concept drift detection algorithms [22, 23, 12]; however, most are categorically labelled. In order to evaluate BOTL in a regression setting, we present a modification of the benchmark drifting hyperplane dataset [15]. Additionally, a simulation of a smart home heating system was created using data from a UK weather station to derive desired heating temperatures for a user. The use of such data enables BOTL to be evaluated on data streams containing drifts that are more typical of real-world environments. Finally, we evaluate the performance of BOTL using a following distance dataset created from vehicular data, used to predict the time to the vehicle being followed. (Data generators, sample vehicle data, and reproducibility documentation are available at https://github.com/hmckay/BOTL.)

6.1 Drifting Hyperplane

For this benchmark data generator, an instance at time t, x_t, is a vector, x_t = {x_t1, x_t2, ..., x_tn}, containing n randomly generated, uniformly distributed variables, x_tn ∈ [0, 1]. For each instance, x_t, a response variable, y_t ∈ [0, 1], is created using the function y_t = (x_tp + x_tq + x_tr)/3, where p, q, and r reference three of the n variables of instance x_t. This function represents the underlying concept, c_a, to be learnt and predicted. Concept drifts are introduced by modifying which features are used to create y. For example, an alternative concept, c_b, may be represented by the function y_t = (x_tu + x_tv + x_tw)/3, where {p, q, r} ≠ {u, v, w} such that c_a ≠ c_b. We introduce uniform noise, ±0.1, by modifying y_t for each instance x_t with probability 0.1. A variety of drift types have been synthesised in this generator, including sudden, gradual and recurring drifts. A sudden drift from concept c_a to concept c_b is encountered immediately between time steps t and (t + 1), by changing the underlying function used to create y_t and y_(t+1). A gradual drift from concept c_a to c_b occurs between time steps t and (t + m), where m instances of data are observed during the drift. Instances of data created between t and (t + m) use one of the two underlying concept functions to determine their response variable. The probability of an instance belonging to concept c_a decreases proportionally to the number of instances seen after time t, while the probability of it belonging to c_b increases as we approach (t + m). Recurring drifts are created by introducing a concept c_c that reuses the underlying function defined by a previous concept, c_a, such that we achieve conceptual equivalence, c_c = c_a.
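The drift mechanics above can be summarised in a short generator sketch. The feature count, drift scheduling and the reading of the ±0.1 noise are illustrative and follow the description rather than the released generator code.

```python
import numpy as np

# Sketch of the drifting hyperplane generator described in Section 6.1;
# sudden drift is the special case gradual_m = 0.
def hyperplane_stream(n_features=10, n_instances=10_000, drift_every=500,
                      gradual_m=0, noise_prob=0.1, seed=0):
    rng = np.random.default_rng(seed)
    concept = tuple(rng.choice(n_features, size=3, replace=False))
    next_concept = None
    for t in range(n_instances):
        if t > 0 and t % drift_every == 0:           # schedule a drift
            next_concept = tuple(rng.choice(n_features, size=3, replace=False))
        if next_concept is not None:
            # Probability of drawing the new concept grows over m steps.
            progress = 1.0 if gradual_m == 0 else min(1.0, (t % drift_every) / gradual_m)
            active = next_concept if rng.random() < progress else concept
            if progress >= 1.0:
                concept, next_concept = next_concept, None
        else:
            active = concept
        x = rng.uniform(size=n_features)
        y = x[list(active)].mean()                    # y = (x_p + x_q + x_r) / 3
        if rng.random() < noise_prob:
            y += rng.uniform(-0.1, 0.1)               # uniform +/-0.1 noise
        yield x, y
```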
6.2 Heating Simulation

A simulation of a smart home heating system was created, deriving the desired room temperature of a user. Heating temperatures were derived using weather data collected from a weather station in Birmingham, UK, from 2014 to 2016. This dataset contained rainfall, temperature and sunrise patterns, which were combined with a schedule, obtained by sampling an individual's pattern of life, to determine when the heating system should be engaged. The schedule was synthesised to vary desired temperatures based on time of day, day of week, and external weather conditions, creating complex concepts. To create multiple domains, weather data was sampled from overlapping time periods and used as input to the synthesised schedule to determine the desired heating temperatures. Due to the dependencies on weather data, each stream was subject to large amounts of noise. Concept drifts were introduced manually by changing the schedule; however, drifts also occurred naturally due to changing weather conditions. By sampling weather data from overlapping time periods, and due to the seasonality of weather, data streams follow similar trends, ensuring predictive performance can benefit from knowledge transfer. By using complex concepts, dependent on noisy data, the evaluation of BOTL on this data is more indicative of what is achievable when used in real-world environments.

6.3 Following Distance

This dataset uses a vehicle's following distance and speed to calculate TTC when following a vehicle. Vehicle telemetry data such as speed, gear position, brake pressure, throttle position and indicator status, alongside sensory data that infer external conditions, such as temperature, headlight status, and windscreen wiper status, were recorded at a sample rate of 1 Hz. Additionally, some signals, such as vehicle speed, brake pressure and throttle position, were averaged over a window of 5 seconds to capture a recent history of vehicle state. Vehicle telemetry and environmental data can be used to predict TTC, which can in turn be used to personalise vehicle functionalities such as ACC by identifying the preferred following distance, reflecting current driving conditions. Data was collected from 4 drivers over 17 journeys, which varied in duration, collection time and route. Each journey is considered to be an independent domain, and BOTL enables knowledge to be learnt and transferred across journeys and between drivers. Each data stream is subject to concept drifts that occur naturally due to changes in the surrounding environment, such as road types and traffic conditions.

7 Experimental Results

We compared BOTL against existing algorithms designed to make predictions in online data streams, namely RePro [27] and GOTL [9], using the drifting hyperplane, heating simulation and following distance datasets. BOTL is model agnostic; however, in order to make comparisons between BOTL and existing techniques, all implementations used an ε-insensitive Support Vector Regressor (SVR). RePro is used to determine a baseline performance threshold, obtained when no knowledge is transferred [27]. The RePro implementation defines parameters including the maximum window size, W_max, loss threshold, l, and drift threshold, d, dependent upon the data stream and complexity of concepts. GOTL was designed to learn from an offline source; however, as we are considering the implications of both domains being online, we used RePro to detect individual concepts in the source. This is necessary as many online applications cannot retain an entire history of data, preventing a single model from being learnt. We used RePro to identify the model that had been used in the source for the largest proportion of the data stream, which is therefore considered the most stable. GOTL transferred this model from the source domain to the target to enhance the effectiveness of the target predictor. A small step size of 0.025 was chosen, as suggested by Grubinger et al. [8], which slowly modified the weights used to combine source and target models. BOTL combines knowledge via the OLS meta learner, therefore no additional parameters are required; however, the BOTL-C culling parameters must be defined. We set c_perf = 0 for BOTL-C.I, thereby discarding models that performed worse than the average predictor (R² < 0). For BOTL-C.II, we used c_perf = 0.2, such that poorly performing models were culled more aggressively, and c_MI = 0.95, such that two models with extremely high mutual information were not both retained by the meta learner. This parameter value was high to ensure knowledge of similar concepts was retained when predicting complex concepts. When evaluating BOTL and the BOTL-C variants, all data streams for a given experiment were used as source domains with bi-directional transfer. Repeat experiments were conducted by randomising the ordering and the interval between the commencement of learning in each domain.
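For reference, the experimental configuration described above and in Sections 7.1 to 7.3 can be summarised as follows; the SVR hyperparameters shown are illustrative placeholders, since only the use of an ε-insensitive SVR is stated, and the dictionary layout is ours.

```python
from sklearn.svm import SVR

# All frameworks in the comparison use an epsilon-insensitive SVR as the base
# learner; the kernel and epsilon values below are illustrative, not reported.
def new_base_learner():
    return SVR(kernel="rbf", epsilon=0.1)

# Per-dataset RePro parameters (W_max, l, d) as reported in Sections 7.1-7.3,
# and the culling thresholds reported above.
REPRO_PARAMS = {
    "drifting_hyperplane": {"w_max": 30,  "l": 0.05, "d": 0.6},
    "heating_simulation":  {"w_max": 700, "l": 0.5,  "d": 0.6},
    "following_distance":  {"w_max": 80,  "l": 0.1,  "d": 0.5},
}
BOTL_C1_PARAMS = {"c_perf": 0.0}
BOTL_C2_PARAMS = {"c_perf": 0.2, "c_mi": 0.95}
```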
For RePro, all data streams were learnt independently, without knowledge transfer. When evaluating GOTL, experiments were conducted such that each data stream was paired with every other data stream as source and target domains respectively. Because only the most stable model is transferred when using GOTL, learning in the target domain only commenced once learning in the source domain had completed, such that the most stable source model could be identified and transferred to the target domain.

Table 2: Drifting Hyperplanes: predictions using RePro, GOTL, BOTL and BOTL-C variants for five sudden and five gradual drifting domains, where * indicates p < 0.01 in comparison to RePro and GOTL, and bold indicates the highest R² performance.

             | Sudden Drift               | Gradual Drift
             | R²               RMSE      | R²               RMSE
  RePro      | 0.950 (±0.001)   0.058     | 0.855 (±0.001)   0.070
  GOTL       | 0.928 (±0.001)   0.069     | 0.825 (±0.001)   0.076
  BOTL       | *0.958 (±0.001)  0.053     | *0.893 (±0.001)  0.060
  BOTL-C.I   | *0.957 (±0.001)  0.054     | *0.886 (±0.003)  0.062
  BOTL-C.II  | *0.956 (±0.001)  0.054     | *0.889 (±0.004)  0.061

7.1 Drifting Hyperplane

We considered the effectiveness of BOTL on synthetic data created using the drifting hyperplane data generator containing two types of drift: sudden and gradual. Five data streams of 10,000 instances were created for each drift type, with drifts encountered every 500 time steps. Sudden drifts occurred immediately, whereas gradual drifts occurred over a period of 100 time steps. Each data stream shared at most three concepts with another domain, ensuring that some transferred models were useful to the target learner, while others were not. Sudden and gradual drifting data streams were separated such that transfer occurred only between domains of the same drift type. RePro parameters W_max = 30, l = 0.05, and d = 0.6 were used across all frameworks.

The results in Table 2 indicate that, although GOTL transferred the most stable model, a performance increase over RePro was not achieved. Although GOTL did not benefit from knowledge transfer, BOTL was able to outperform both GOTL and RePro, with statistical t-tests achieving p-values < 0.01, highlighting the importance of acknowledging the online nature of the source domain when transferring knowledge. The performance increase observed over GOTL can be attributed to the availability of all source models in the target domain. Additionally, GOTL's step-wise weighting mechanism prevents the influence of a model changing drastically over a small period of time. This means a large amount of data must be observed after each drift to converge on an approximation of the optimal weights. To overcome this, a larger step size could be used; however, this may prevent or hinder convergence. BOTL overcomes this by using the OLS meta learner to minimise the squared error of the combined predictor with instantaneous effect. The performances of the BOTL-C variants were also significantly better than RePro and GOTL, obtaining t-test values of p < 0.01; however, they performed slightly worse than BOTL. Figure 1 highlights the aggressive nature of the culling techniques used by BOTL-C.I and BOTL-C.II. It also shows that BOTL used at least four times more models than the BOTL-C variants, and highlights correlations between the number of models used and performance. When the number of models used was small, the predictive performance of the BOTL-C variants decreased. This performance decrease can be attributed to the aggressive nature of these culling mechanisms.
Culling based on model performance alone prohibited the inclusion of a diverse set of models, reducing the overarching predictive performance of the meta learner. When the BOTL-C variants retained a larger proportion of the transferred models, a performance similar to BOTL was achieved.

We also considered the performance of BOTL on data streams with varying concept durations, showing that BOTL is more effective than existing approaches irrespective of the duration of each concept. Additional data streams of 10,000 instances were created with different time periods between drifts. Figures 2a and 2b show the performance of BOTL, RePro and GOTL on these data streams. Since the BOTL variants are underpinned by RePro, their performance dropped similarly as drifts occurred more frequently. This is because RePro must observe a local drop in performance over the sliding window of data to identify drifts, therefore a higher proportion of the data stream has poor predictions as the number of concept drifts increases. Despite the decrease in performance, BOTL withstood the impact of frequent drifts more readily than RePro and GOTL, as BOTL uses knowledge learnt in other domains to aid the target predictor prior to a new model being learnt by RePro. While GOTL also uses transferred knowledge, its weighting mechanism prevented effective use of this knowledge at drift points.

Figure 1: Drifting Hyperplanes: R² performance and number of models used by BOTL and BOTL-C variants. (a) Sudden concept drift. (b) Gradual concept drift, with drift durations of 100 time steps.

Figure 2: Drifting Hyperplanes: performance of RePro, GOTL and BOTL variants with sudden and gradual concept drifts encountered at different frequencies.

7.2 Heating Simulation

We chose RePro parameters l = 0.5, d = 0.6, and W_max = 700, creating a sliding window that encapsulated approximately two weeks of heating and weather data, to analyse the frameworks on this data. These parameters meant that instances were removed from the start of the sliding window when predictions were made within ±0.5 °C of the desired heating temperature, and drifts were detected when the R² performance of the target model on the current window of data dropped below 0.6. The concepts to be learnt in these domains were more complex than those in the drifting hyperplane data streams, causing lower R² performances overall. The addition of knowledge transfer, using GOTL and BOTL, provided a significant increase in performance in comparison to RePro with no transfer, as shown in Table 3. GOTL, BOTL and the BOTL-C variants performed better than RePro with statistical t-test p-values of p < 0.01. The ability to transfer knowledge was more advantageous than in the drifting hyperplane setting because concepts were more complex, preventing RePro from building effective models on the window of available data. This meant the knowledge transferred helped boost the performance of the target predictor, even when only a single model was transferred using GOTL. Transferring multiple models provided a significant benefit, as BOTL performed better than GOTL with a t-test p-value < 0.01.
Table 3: Heating Simulations: predicting desired heating temperatures across five domains using RePro, GOTL, BOTL and BOTL-C variants, where * indicates p < 0.01 in comparison to RePro and GOTL, and bold indicates the highest R² performance.

             | R²               | RMSE
  RePro      | 0.607 (±0.003)   | 2.601 (±0.015)
  GOTL       | 0.709 (±0.003)   | 2.231 (±0.009)
  BOTL       | *0.786 (±0.009)  | 1.914 (±0.042)
  BOTL-C.I   | *0.779 (±0.010)  | 1.946 (±0.044)
  BOTL-C.II  | *0.744 (±0.008)  | 2.102 (±0.036)

Table 4: Following Distances: predicting TTC across seven domains using RePro, GOTL, BOTL and BOTL-C variants, where * indicates p < 0.01 in comparison to RePro and GOTL, and bold indicates the highest R² performance.

             | R²               | RMSE
  RePro      | 0.497 (±0.002)   | 0.636 (±0.002)
  GOTL       | 0.619 (±0.001)   | 0.554 (±0.003)
  BOTL       | 0.665 (±0.002)   | 0.524 (±0.002)
  BOTL-C.I   | 0.666 (±0.002)   | 0.523 (±0.002)
  BOTL-C.II  | *0.673 (±0.001)  | 0.518 (±0.002)

7.3 Following Distance

Finally, we evaluated BOTL on real-world data using the following distance dataset, where the task was to predict TTC. Due to the real-world nature of this data, concept drifts occur frequently and the data is noisy. The RePro parameters l = 0.1, d = 0.5 and W_max = 80 were used, encapsulating 80 seconds of vehicle data. This meant that instances were removed from the start of the sliding window if predictions were made within ±0.1 seconds of the recorded TTC value, and drifts were detected when the R² performance of the target model on the current window of data dropped below the drift threshold, d = 0.5. Table 4 shows the performance of RePro, GOTL and the BOTL variants across seven data streams. These results highlight that GOTL was less suitable when the relationship between source and target concepts was unknown. All variants of BOTL and BOTL-C performed better than RePro and GOTL, with the BOTL-C variants slightly outperforming BOTL. BOTL-C.II achieved statistical t-test p-values of p < 0.01 when compared to both RePro and GOTL. This can be attributed to the number of domains in the framework, indicating that large numbers of transferred models caused the OLS meta learner to overfit the local window of data.

To investigate scalability, Figure 3 displays the average R² performance per domain and the number of models used for predictions by RePro, BOTL and the BOTL-C variants as the number of domains increased. For settings with a small number of domains, BOTL and the BOTL-C variants performed similarly, both outperforming the baseline RePro algorithm. As the number of domains expanded, and the number of models transferred increased, the performance of BOTL dropped below RePro. This can be attributed to the OLS meta learner overfitting the small window of local data when the number of models in the ensemble was large. Culling using the performance of transferred models alone (BOTL-C.I) enabled a larger number of domains to be used; however, it cannot be considered scalable, as the performance of BOTL-C.I dropped below that of RePro when more domains were added. BOTL-C.II culled more aggressively, using diversity alongside performance, ensuring enough beneficial knowledge was retained to enhance the target learner's performance, while minimising negative transfer and preventing the OLS meta learner overfitting the small window of locally available data. The results indicate that the ability to consider both source and target domains to be online is beneficial. In doing so, the number of transferred models greatly increases, requiring culling mechanisms, particularly when used in noisy real-world data streams, to retain the benefit of transferring knowledge between domains.

8 Conclusion

Online domains that must learn complex models often have limited data availability, and are hindered by the presence of concept drift.
We have presented the BOTL framework, and two BOTL-C variants, that enable knowledge to be transferred across online domains to aid the target learner. By using RePro as the underlying concept drift detection algorithm, we ensured effective models were learnt from the available data. We enhanced predictive performance by combining knowledge transferred from other online domains using an OLS meta learner, enabling additional knowledge to be used to minimise the error of the overarching prediction.

Figure 3: Following Distances: performance (with standard error) and number of models used by RePro, BOTL and BOTL-C variants as the number of domains increases.

In this paper, RePro was chosen as the underlying concept drift detection algorithm. Although RePro requires some domain expertise to identify appropriate parameter values, such as window size and drift threshold, its ability to retain a history of models to prevent relearning recurring concepts helped to reduce the number of models transferred between domains, and therefore allowed more domains to be included in the framework before the OLS meta learner suffered from overfitting. However, in real-world environments with many domains, the number of models transferred may need to be reduced further. The BOTL-C variants achieved this using common ensemble pruning strategies. These pruning strategies also required culling parameter values to be specified. To overcome the need to specify these parameters, in future work we will investigate the use of task relatedness to identify similar concepts across domains without parameterised thresholds. This will reduce the dependency on domain expertise and will allow BOTL to be used for applications that require scalability to larger numbers of domains. Additionally, BOTL considers only the homogeneous setting, therefore a natural extension is to incorporate domain adaptation to enable knowledge transfer across heterogeneous domains using the BOTL framework.

Acknowledgements

This work was supported by the UK EPSRC and Jaguar Land Rover under the iCASE scheme.

References

[1] Andrew Arnold, Ramesh Nallapati, and William W. Cohen. A comparative study of methods for transductive transfer learning. In Seventh IEEE International Conference on Data Mining Workshops, ICDMW '07, pages 77–82, 2007.

[2] Albert Bifet and Ricard Gavalda. Learning from time-changing data with adaptive windowing. In Proceedings of the 2007 SIAM International Conference on Data Mining, pages 443–448. SIAM, 2007.

[3] Hal Daume III and Daniel Marcu. Domain adaptation for statistical classifiers. Journal of Artificial Intelligence Research, 26:101–126, 2006.

[4] Bo Dong, Yifan Li, Yang Gao, Ahsanul Haque, Latifur Khan, and Mohammad M. Masud. Multistream regression with asynchronous concept drift detection. In 2017 IEEE International Conference on Big Data, pages 596–605, Dec 2017.

[5] Jerome H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, Mar 1997.

[6] João Gama, Indrė Žliobaitė, Albert Bifet, Mykola Pechenizkiy, and Abdelhamid Bouchachia. A survey on concept drift adaptation. ACM Computing Surveys, 46(4):44:1–44:37, 2014.

[7] Liang Ge, Jing Gao, and Aidong Zhang. OMS-TL: A framework of online multiple source transfer learning. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, CIKM '13, pages 2423–2428. ACM, 2013.

[8] Thomas Grubinger, Georgios Chasparis, and Thomas Natschläger.
Online transfer learning for climate control in residential buildings. In Proceedings of the 5th Annual European Control Conference, ECC '16, pages 1183–1188, 2016.

[9] Thomas Grubinger, Georgios Chasparis, and Thomas Natschläger. Generalized online transfer learning for climate control in residential buildings. Energy and Buildings, 139:63–71, 2017.

[10] Ahsanul Haque, Hemeng Tao, Swarup Chandra, Jie Liu, and Latifur Khan. A framework for multistream regression with direct density ratio estimation. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[11] T. Ryan Hoens, Robi Polikar, and Nitesh V. Chawla. Learning from streaming data with concept drift and imbalance: an overview. Progress in Artificial Intelligence, 1(1):89–101, 2012.

[12] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 97–106. ACM, 2001.

[13] Mark G. Kelly, David J. Hand, and Niall M. Adams. The impact of changing populations on classifier performance. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '99, pages 367–371. ACM, 1999.

[14] Jeremy Z. Kolter and Marcus A. Maloof. Dynamic weighted majority: a new ensemble method for tracking concept drift. In Third IEEE International Conference on Data Mining, pages 123–130, 2003.

[15] Jeremy Z. Kolter and Marcus A. Maloof. Using additive expert ensembles to cope with concept drift. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 449–456. ACM, 2005.

[16] Guangxia Li, Steven C.H. Hoi, Kuiyu Chang, Wenting Liu, and Ramesh Jain. Collaborative online multitask learning. IEEE Transactions on Knowledge and Data Engineering, 26(8):1866–1876, 2014.

[17] Keerthiram Murugesan and Jamie Carbonell. Multi-task multiple kernel relationship learning. In Proceedings of the 2017 SIAM International Conference on Data Mining, pages 687–695. SIAM, 2017.

[18] Jianhan Pan, Xuegang Hu, Peipei Li, Huizong Li, Wei He, Yuhong Zhang, and Yaojin Lin. Domain adaptation via multi-layer transfer learning. Neurocomputing, 190:10–24, 2016.

[19] Sinno J. Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

[20] Paul Ruvolo and Eric Eaton. Active task selection for lifelong machine learning. In AAAI, 2013.

[21] Avishek Saha, Piyush Rai, Hal Daumé III, Suresh Venkatasubramanian, et al. Online learning of multiple tasks and their relationships. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 643–651, 2011.

[22] Jeffery C. Schlimmer and Richard H. Granger. Incremental learning from noisy data. Machine Learning, 1(3):317–354, 1986.

[23] W. Nick Street and YongSeog Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pages 377–382. ACM, 2001.

[24] Alexey Tsymbal. The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin, 106, 2004.

[25] Qingyao Wu, Hanrui Wu, Xiaoming Zhou, Mingkui Tan, Yonghui Xu, Yuguang Yan, and Tianyong Hao. Online transfer learning with multiple homogeneous or heterogeneous sources. IEEE Transactions on Knowledge and Data Engineering, 29(7):1494–1507, 2017.

[26] Yuguang Yan, Qingyao Wu, Mingkui Tan, Michael K.
Ng, Huaqing Min, and Ivor W. Tsang. Online heterogeneous transfer by hedge ensemble of offline and online decisions. IEEE Transactions on Neural Networks and Learning Systems, 29(7):3252–3263, July 2018.

[27] Ying Yang, Xindong Wu, and Xingquan Zhu. Combining proactive and reactive predictions for data streams. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 710–715. ACM, 2005.

[28] Ying Yang, Xindong Wu, and Xingquan Zhu. Mining in anticipation for concept change: Proactive-reactive prediction in data streams. Data Mining and Knowledge Discovery, 13(3):261–289, 2006.

[29] Haibo Yin and Yun-an Yang. Online transfer learning with extreme learning machine. In AIP Conference Proceedings, volume 1839. AIP Publishing, 2017.

[30] Peilin Zhao and Steven C. Hoi. OTL: A framework of online transfer learning. In Proceedings of the 27th International Conference on Machine Learning, ICML '10, pages 1231–1238, 2010.

[31] Zhi-Hua Zhou. Ensemble Methods: Foundations and Algorithms. Chapman and Hall/CRC, 2012.