SVR-based Modelling for the MoReBikeS Challenge: Analysis, Visualisation and Prediction

Yu Chen, Peter Flach
University of Bristol, United Kingdom
{yc14600, Peter.Flach}@bristol.ac.uk

Abstract. We present a solution to the MoReBikeS challenge of the ECML PKDD 2015 conference, built by analysing the data from different aspects, visualising latent patterns and constructing a new set of features. The proposed model is accurate and efficient, yet easy to implement.

1 Introduction

During development, we considered it necessary to answer the following questions:
– Which features are informative and which are noisy?
– How do they affect the target value?
– How can we make the best use of them?

The solution to the MoReBikeS challenge has two parts: reconstruction of the feature space, and regression. The task is to predict the number of bikes at a station three hours in advance, so a regression algorithm is needed to learn the mapping from the features to the target value; for this part we deploy an SVR model. The critical part is feature reconstruction, because better features can significantly improve the performance of a simple regression model. Based on the analysis and visualisation of the given features, we therefore generate a new feature space for regression.

2 Feature Selection and Representation

2.1 Raw Features

There are 24 given features in total, which can be divided into 4 categories.

1. Facts of stations. The facts provided in the data set are the station ID, the latitude, the longitude and the number of docks of the station. None of these properties changes over time for a given station.
2. Temporal information. The timestamp of a data entry consists of eight fields: "Timestamp" in seconds since the UNIX epoch, "Year", "Month", "Day", "Hour", "Weekday", "Weekhour", and "IsHoliday", which indicates whether the day is a public holiday.
These features carry overlapping temporal information, so only a subset of them is needed to represent a time point. As shown in Figure 1, "Timestamp" already contains the information of "Year", "Month", "Day", "Hour", "Weekday" and "Weekhour", whereas "Weekday" and "Hour" can also be deduced from "Weekhour". Only "IsHoliday" is independent of all the others.

[Fig. 1: Relations between Temporal Features ("Timestamp", "Year", "Month", "Day", "Hour", "Weekday", "Weekhour", "IsHoliday")]

3. Weather. This set of features includes "windMaxSpeed", "windMeanSpeed", "windDirection", "temperature", "relHumidity", "airPressure" and "precipitation". One major observation is that the values of all seven fields are shared among all stations.
4. Counts and their statistics. This set of features relates directly to the target value. First of all, "bikes 3h ago" gives the target value at the same station three hours earlier. The full profile features use all previous data points with the same "Weekhour" to obtain long-term statistics for each "Weekhour" at each station; correspondingly, the short profile features use at most four previous data points to obtain short-term statistics. The long-term statistics of the 200 old stations change very little over time, in contrast to the short-term ones.

2.2 Selection of Features

1. Features to remove:
– "Year" and "Month": both are fixed to October 2014 in the training data set, so there is nothing to learn from them.
– "Timestamp": it is too general to distinguish different kinds of temporal information, and it has too many possible values to indicate similarity between time points clearly.
– Full profile features: their actual meaning differs between the 200 old stations and the 75 new stations, because the historical data used to calculate the long-term statistics span over 2 years for the 200 old stations but only several weeks for the 75 new stations.
2. Features to keep:
– "IsHoliday": this feature does not overlap with any other temporal feature (Figure 1) and gives extra information in addition to the timestamp.
– "Day": since "Timestamp" has been removed, "Day" is a temporal feature without any alternative, and it probably carries some periodic information.
– "bikes 3h ago" and the short profile features: unlike the long-term statistics, these features of the 75 new stations are aligned with those of the 200 old stations, so they can be very informative.
– Temperature: the existing linear models keep only temperature among the seven weather features, which suggests that it is useful information.
3. Features to keep or remove:
– Facts of stations: these features characterise a station. As they do not change over time, they provide only static knowledge, but they might be useful for recognising the pattern of a particular station.
– "Weekday+Hour" or "Weekhour": although "Weekhour" is a general representation of "Weekday+Hour", it emphasises the difference between every hour of the week, whereas "Weekday+Hour" represents two types of difference: a different "Weekday" or a different "Hour". Further analysis is needed to decide which representation is more appropriate for this data set.
– Other weather features: without further analysis it is not clear whether these weather features are signal or noise.

2.3 Visualisation

The provided statistical features use "Weekhour" as the criterion for choosing time points, which implies that "Weekhour" may have a strong effect on the target value. Therefore, we visualise the target of each station at different week hours to capture latent knowledge in the data.

The figures below give an overview of all stations at various hours and weekdays. Each "+" marks the location of a station, and the circle around it indicates how many bikes are at that station 3 hours later; the radius of the circle is normalised by the number of docks of that station.
Figure 2 shows four chosen hours on a Wednesday. The bike distribution in the city is stable during the night, and most bikes are stored in stations near the north and south edges (Figure 2 (a) and (b)); these areas are probably the popular residential areas of the city. In contrast, most bikes have been moved to the city centre during a typical working hour (Figure 2 (c)), and in an off-work hour the bikes are spread over the city (Figure 2 (d)).

[Fig. 2: Different hours on Wednesday. Panels: (a) Wednesday 0:00, (b) Wednesday 7:00, (c) Wednesday 10:00, (d) Wednesday 16:00]

To decide between "Weekday+Hour" and "Weekhour", we need to compare different temporal categories. Figure 3 gives an overview of the same daytime hours on a Saturday and a Monday. At 10:00 on Saturday (Figure 3 (a)), a number of bikes are still stored in residential areas and the stations in the city centre are nearly empty; at 16:00 (Figure 3 (b)) a new hot spot emerges, as quite a lot of bikes have been moved to the east edge of the city. These two hours have completely different properties compared with the same hours on Wednesday. By contrast, the two hours on Monday (Figure 3 (c), (d)) are very similar to those on Wednesday (Figure 2 (c), (d)). They clearly share a pattern, which is easy to understand: Monday and Wednesday are both working days.

[Fig. 3: Same hours in different weekdays. Panels: (a) Saturday 10:00, (b) Saturday 16:00, (c) Monday 10:00, (d) Monday 16:00]

Conclusions from the visualisation results:
• Two time points with different hours on the same weekday are not necessarily more similar than time points with different hours on different weekdays.
• Two time points with the same hour on different weekdays are not necessarily more similar than time points with different hours on different weekdays.

It is therefore clear that "Weekday" + "Hour" may indicate false similarity between time points, whereas "Weekhour" avoids such errors without losing useful information.
The visualisation results also indicate that different groups of stations follow different behaviour patterns: for example, stations near the north and south edges behave similarly to one another but very differently from stations in the city centre. It could therefore be helpful to recognise patterns in groups; the method for doing so is introduced in Section 4.1 (Fast Test). Since the facts of stations (station ID, longitude, latitude, number of docks) are not the criteria that identify a group, we can remove them from the feature space as well.

2.4 Feature Representation

How to represent the selected raw features is another important aspect of reshaping the feature space. Two simple but powerful methods are applied in this task.

1. Vectorisation
In this context the two selected temporal features ("Day", "Weekhour") are categorical variables; according to the visualisation results, the distance between their numerical values certainly cannot represent the similarity between two time points. Vectorisation is a common choice for this kind of problem: each value of a categorical feature is transformed into one bit of a binary vector. After this transformation, "Day" and "Weekhour" generate 199 new features (31 days plus 168 week hours).

2. Normalisation
The profile features count numbers of bikes, and their upper limits differ because station capacities differ. These features can be normalised by the number of docks of each station so that they are comparable across all stations:

$$\hat{f}_k(t) = \frac{f_k(t)}{N_k(t)} \tag{1}$$

where $f_k(t)$ represents one of the following profile features of station $k$ at time $t$:
· "bikes 3h ago"
· "short profile bikes"
· "short profile 3h diff bikes"
and $N_k(t)$ is the number of docks of station $k$ at time $t$.

So far we have selected and reconstructed the temporal features by visualisation and vectorisation, and transformed the profile features by normalisation.
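The two representation steps can be illustrated with a minimal NumPy sketch. It is not the authors' code: the helper names `one_hot` and `normalise_by_docks`, the toy values, and the assumption that "Day" is coded 1..31 and "Weekhour" 1..168 are all illustrative.

```python
import numpy as np

def one_hot(codes, n_categories):
    """Turn integer codes 0..n_categories-1 into binary indicator rows."""
    codes = np.asarray(codes, dtype=int)
    encoded = np.zeros((codes.size, n_categories), dtype=float)
    encoded[np.arange(codes.size), codes] = 1.0
    return encoded

def normalise_by_docks(profile, num_docks):
    """Equation (1): divide a bike-count feature by the number of docks."""
    return np.asarray(profile, dtype=float) / np.asarray(num_docks, dtype=float)

# Toy rows with made-up values (not taken from the challenge data).
day = np.array([1, 15, 31])          # assumed coding: 1..31
weekhour = np.array([1, 60, 168])    # assumed coding: 1..168
bikes_3h_ago = np.array([5.0, 12.0, 0.0])
num_docks = np.array([20.0, 24.0, 15.0])

features = np.hstack([
    one_hot(day - 1, 31),                                  # 31 indicator columns
    one_hot(weekhour - 1, 168),                            # 168 indicator columns
    normalise_by_docks(bikes_3h_ago, num_docks)[:, None],  # 1 normalised column
])
print(features.shape)  # (3, 200)
```

With this encoding any two distinct week hours are equally far apart (Euclidean distance √2), so the numerical gap between two "Weekhour" values no longer suggests a similarity that the data do not support.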
The remaining uncertainty about the weather features will be resolved by feature filtering during the tests.

3 Regression Model

There are abundant options for the regression model, such as Linear Regression, Gaussian Process Regression, Nearest Neighbour Regression, Support Vector Regression and Neural Networks. Considering prediction performance and computational complexity, Support Vector Regression (SVR) is a safe and handy choice here. It is probably not the best regression model for this task, but it is capable of performing well.

The deployed model is the Epsilon-Support Vector Regression model [1] as implemented in scikit-learn [3]. The chosen kernel function is the sigmoid kernel [2]:

$$K_{ij} = \tanh(\gamma x_i^T x_j + c_0) \tag{2}$$

where $x_i^T x_j$ is the inner product of two data points. The parameters of the SVR model are chosen by the fast tests introduced in Section 4.1:

$$C = 2, \quad \epsilon = 0.02, \quad \gamma = 0.25, \quad c_0 = -1 \tag{3}$$

To align with the normalised profile features and to fit the sigmoid kernel better, the target value is transformed as follows for training the SVR model:

$$\hat{\nu}_k(t) = \frac{\nu_k(t)}{N_k(t)} - \frac{\nu_k(t - 3h)}{N_k(t - 3h)} \tag{4}$$

where $\nu_k(t)$ is the number of bikes at time $t$ in station $k$, and $N_k(t)$ is the number of docks at time $t$ in station $k$. This equation simplifies when the number of docks of a station does not change over time, i.e. when there exists a positive integer $N_k$ such that $N_k = N_k(t)$ for all $t \in \mathbb{R}^+$, which is the case in our data:

$$\hat{\nu}_k(t) = \frac{\nu_k(t) - \nu_k(t - 3h)}{N_k} \tag{5}$$

4 Testing and Evaluation

4.1 Fast Test

Training the SVR model on all data points is computationally expensive, so building the model from only part of the data reduces the cost considerably. The idea of the fast test is to train an SVR model on the data of the K neighbours of a given station n and then test it on the data of station n.
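As a rough illustration of the model from Section 3, the sketch below configures scikit-learn's `SVR` with the sigmoid kernel and the parameters of equation (3), and trains it on the transformed target of equation (5). The data are random stand-ins, and the helper names `to_target` and `to_bikes` are hypothetical; in particular, the inverse transform back to a bike count is an assumption about how predictions would be read off, not a step stated in the text.

```python
import numpy as np
from sklearn.svm import SVR

def to_target(bikes, bikes_3h_ago, num_docks):
    """Equation (5): change in bikes over 3 hours as a fraction of the docks."""
    return (bikes - bikes_3h_ago) / num_docks

def to_bikes(predicted, bikes_3h_ago, num_docks):
    """Assumed inverse of equation (5): recover a bike count from a prediction."""
    return predicted * num_docks + bikes_3h_ago

# Random stand-in data (not the challenge data).
rng = np.random.default_rng(0)
n = 200
X = rng.random((n, 5))                                   # placeholder feature matrix
num_docks = rng.integers(10, 40, size=n).astype(float)
bikes_3h_ago = np.floor(rng.random(n) * num_docks)
bikes = np.floor(rng.random(n) * num_docks)

# Epsilon-SVR with K_ij = tanh(gamma * <x_i, x_j> + coef0), parameters from equation (3).
model = SVR(kernel="sigmoid", C=2.0, epsilon=0.02, gamma=0.25, coef0=-1.0)
model.fit(X, to_target(bikes, bikes_3h_ago, num_docks))

predicted_bikes = to_bikes(model.predict(X), bikes_3h_ago, num_docks)
mae = np.mean(np.abs(predicted_bikes - bikes))           # Mean Absolute Error in bikes
```

In scikit-learn the kernel offset $c_0$ is passed as `coef0`. For the fast test, `X` would hold only the rows of the K neighbouring stations.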
According to the visualisation results, different groups of stations follow different behaviour patterns, so we can obtain a group for a given station by identifying its neighbours with similar behaviour. The neighbours are obtained by ranking the Euclidean distances between stations. The target value at each time point of a station is treated as a feature; as each station has 745 time points, the Euclidean distance is calculated in a 745-dimensional feature space. If the values at some time points are missing, the mean value over all time points of that station is used instead.

More neighbours mean more data points and a more robust regression model, but more data points also greatly increase the computational cost of both training and testing. Setting the number of neighbours K to 10 gives reasonable performance with a considerable reduction in computing expense. Here, reasonable performance means better than the average performance of the existing linear regression models.

4.2 Validation

Two data sets are used to validate optimisations of the model: the data of the 75 new stations in October 2014, and the data of 10 old stations in November, December and January from 2012 to 2014. The reason is that the test data come from the 75 new stations in November and December 2014 and January 2015, so these validation sets help to avoid overfitting to a certain month or to certain stations. Only changes that improve the performance on both validation sets are adopted.

The hyper-parameters of the SVR model are selected by this validation strategy. The weather features are also filtered one by one through such validation: each time, we remove one weather feature and check whether the performance becomes better or worse. If it becomes worse, we keep the feature; otherwise, we remove it.
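The one-by-one filtering of weather features can be sketched as a simple greedy loop. This is only one reading of the procedure: the callback `evaluate` is hypothetical (it stands for training the model on a feature list and returning the MAE on each validation set), and "worse" is assumed to mean a higher MAE on at least one validation set. The toy scoring function exists purely to make the example runnable.

```python
from typing import Callable, Dict, List, Sequence

def filter_features(
    candidates: Sequence[str],
    base_features: Sequence[str],
    evaluate: Callable[[List[str]], Dict[str, float]],
) -> List[str]:
    """Drop candidate features one at a time unless dropping hurts validation MAE."""
    kept = list(candidates)
    best = evaluate(list(base_features) + kept)
    for feature in candidates:
        trial = [f for f in kept if f != feature]
        scores = evaluate(list(base_features) + trial)
        # Assumption: removal is "worse" if the MAE rises on any validation set.
        worse = any(scores[name] > best[name] for name in best)
        if not worse:
            kept, best = trial, scores
    return kept

def toy_evaluate(features: List[str]) -> Dict[str, float]:
    """Made-up scores: 'temperature' helps, 'windDirection' hurts, the rest is neutral."""
    mae = {"new_stations": 2.5, "old_stations": 2.6}
    if "temperature" in features:
        mae = {name: value - 0.2 for name, value in mae.items()}
    if "windDirection" in features:
        mae = {name: value + 0.1 for name, value in mae.items()}
    return mae

kept = filter_features(
    ["temperature", "windDirection", "relHumidity"],
    ["bikes 3h ago"],
    toy_evaluate,
)
print(kept)  # ['temperature']
```

Because each decision is checked against both validation sets, a feature is removed only when doing so does not hurt either the new-station or the old-station score.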
4.3 Online Test

The submission opportunities for the online test, which is evaluated on a small set of test data, provided important feedback about the model: the SVR model trained on all data points is more robust than those trained on neighbours, the importance of "Weekhour" is consistent with our analysis, and the full profile features confuse the model. A comparison of some attempts is shown in Table 1.

Table 1: Comparison between Online Test Results
(Feature option columns: "Weekhour", "Weekday"+"Hour", Full Profiles. An X marks an option in use; the column position of each X is not recoverable from the source layout.)

MAE     Size of Training Set (K)   Feature Options
2.625   20                         X X
2.612   20                         X X
2.52    20                         X
2.496   275                        X X
2.46    275                        X
2.37    275                        X

4.4 Final Model and Full Test

The final model for this challenge uses the following features:
1. "IsHoliday", "Day", "Weekhour": vectorised.
2. "bikes 3h ago", "short profile bikes", "short profile 3h diff bikes": normalised by equation (1).
3. "temperature".

The SVR model is as described in Section 3, and the training set is the data of the 275 stations in October 2014. The MAE (Mean Absolute Error) of the final model on the full test data is 2.051. Since all online tests included vectorisation, we also tested the same combination of features without vectorisation on the full test data; the MAE is 3.519. Vectorisation therefore appears to play a key role in boosting the performance of this model.

5 Conclusion

The deployed methods are simple and easy to apply. The main idea is to extract important information from the given features and then build new features upon them to optimise the feature space for regression. The performance depends strongly on how well the feature space represents the patterns underlying the data. The disadvantages of the final model are that the new features are not obtained automatically by learning algorithms, and that the computational cost is still high; the latter might be improved by a better sampling method that shrinks the training set, or by a faster regression algorithm in place of SVR.

References

1. Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2(3), 27 (2011)
2. Lin, H.T., Lin, C.J.: A study on sigmoid kernels for SVM and the training of non-PSD kernels by SMO-type methods. Submitted to Neural Computation, pp. 1–32 (2003)
3. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)