Model Selection in Data Analysis Competitions

David Kofoed Wind (Technical University of Denmark, dawi@dtu.dk)
Ole Winther (Technical University of Denmark, olwi@dtu.dk)

Abstract. The use of data analysis competitions for selecting the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the most well-known examples of this trend were the Netflix Competition and, more recently, the competitions hosted on the online platform Kaggle.

In this paper, we state and try to verify a set of qualitative hypotheses about predictive modelling, both in general and in the scope of data analysis competitions. To verify our hypotheses we look at previous competitions and their outcomes, use qualitative interviews with top performers from Kaggle, and use previous personal experiences from competing in Kaggle competitions.

The stated hypotheses about feature engineering, ensembling, overfitting, model complexity and evaluation metrics give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.

1 Introduction

In recent years, the amount of available data has increased exponentially and "Big Data Analysis" is expected to be at the core of most future innovations [2, 4, 5]. A new and very promising trend in the field of predictive machine learning is the use of data analysis competitions for model selection. Due to the rapid development in the field of competitive data analysis, there is still a lack of consensus and literature on how one should approach predictive modelling competitions.

In his well-known paper "Statistical Modeling: The Two Cultures" [1], Leo Breiman divides statistical modelling into two cultures: the data modelling culture and the algorithmic modelling culture.
The arguments put forward in [1] justify an approach to predictive modelling where the focus is purely on predictive accuracy. That this is the right way of looking at statistical modelling is the underlying assumption in statistical prediction competitions, and consequently also in this paper.

The concept of machine learning competitions was made popular with the Netflix Prize, a massive open competition with the aim of constructing the best algorithm for predicting user ratings of movies. The competition featured a prize of 1,000,000 dollars for the first team to improve Netflix's own results by 10%, and multiple teams achieved this goal. After the success of the Netflix Prize, the website Kaggle was born, providing a platform for predictive modelling. Kaggle hosts numerous data prediction competitions and has more than 170,000 users worldwide.

The basic structure of a predictive modelling competition – as seen for example on Kaggle and in the Netflix competition – is the following: A predictive problem is described, and the participants are given a dataset with a number of samples and the true target values (the values to predict) for each sample; this is called the training set. The participants are also given another dataset like the training set, but where the target values are not known; this is called the test set. The task of the participants is to predict the correct target values for the test set, using the training set to build their models. When participants have a set of proposed predictions for the test set, they can submit these to a website, which then evaluates the submission on a part of the test set known as the quiz set, the validation set, or simply the public part of the test set. The result of this evaluation on the quiz set is shown in a leaderboard, giving the participants an idea of how they are progressing.

Using a competitive approach to predictive modelling is being praised by some as the modern way to do science:

"Kaggle recently hosted a bioinformatics contest, which required participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection). Within a week and a half, the best submission had already outdone the best methods in the scientific literature." [3] (Anthony Goldbloom, Founder and CEO at Kaggle)

"These prediction contests are changing the landscape for researchers in my area, an area that focuses on making good predictions from finite (albeit sometimes large) amounts of data. In my personal opinion, they are creating a new paradigm with distinctive advantages over how research is traditionally conducted in our field." [6] (Mu Zhu, Associate Professor, University of Waterloo)

This competitive approach is interesting and seems fruitful – one can even see it as an extension of the aggregation ideas put forward in [1], in the sense that the winning model is simply the model with the best accuracy, not taking computational efficiency or interpretability into account. Still, one could ask whether the framework provided by, for example, Kaggle gives a trustworthy resemblance of real-world predictive modelling problems, where problems do not come with a quiz set and a leaderboard.

In this paper we state five hypotheses about building and selecting models for competitive data analysis. To verify these hypotheses we look at previous competitions and their outcomes, use qualitative interviews with top performers from Kaggle, and use previous personal experiences from competing in Kaggle competitions.

2 Interviews and Previous Competitions

In this section we shortly describe the data we are using. We list the people whom we interviewed and name the previous Kaggle competitions we are using for empirical data.

2.1 Interviews

To help answer the questions we are stating, we have asked a series of questions to some of the best Kaggle participants throughout time. We have talked (by e-mail) with the following participants (name, Kaggle username, current rank on Kaggle):

• Steve Donoho (BreakfastPirate, #2)
• Lucas Eustaquio (Leustagos, #6)
• Josef Feigl (Josef Feigl, #7)
• Zhao Xing (xing zhao, #10)
• Anil Thomas (Anil Thomas, #11)
• Luca Massaron (Luca Massaron, #13)
• Gábor Takács (Gábor Takács, #20)
• Tim Salimans (Tim Salimans, #48)
Answers and parts of answers to our questions are included in this paper as quotes when relevant.

2.2 Previous competitions

Besides the qualitative interviews with Kaggle masters, we also looked at 10 previous Kaggle competitions, namely the following:

• Facebook Recruiting III - Keyword Extraction
• Partly Sunny with a Chance of Hashtags
• See Click Predict Fix
• Multi-label Bird Species Classification - NIPS 2013
• Accelerometer Biometric Competition
• AMS 2013-2014 Solar Energy Prediction Contest
• StumbleUpon Evergreen Classification Challenge
• Belkin Energy Disaggregation Competition
• The Big Data Combine Engineered by BattleFin
• Cause-effect pairs

These competitions were selected as 10 consecutive competitions, where we excluded a few competitions which did not fit the standard framework of statistical data analysis (for example challenges in optimization and operations research). Throughout this paper, these competitions are referenced with the following abbreviated names: FACEBOOK, SUNNY HASHTAGS, SEE CLICK PREDICT, BIRD, ACCELEROMETER, SOLAR ENERGY, STUMBLEUPON, BELKIN, BIG DATA and CAUSE EFFECT.

3 Hypotheses

In this section we state five hypotheses about predictive modelling in a competitive framework. We try to verify the validity of each hypothesis using a combination of mathematical arguments, empirical evidence from previous competitions, and the qualitative interviews we did with some of the top participants at Kaggle. The five hypotheses to be investigated are:

1. Feature engineering is the most important part of predictive machine learning
2. Overfitting to the leaderboard is a real issue
3. Simple models can get you very far
4. Ensembling is a winning strategy
5. Predicting the right thing is important

3.1 Feature engineering is the most important part

With the extensive amount of free tools and libraries available for data analysis, everybody has the possibility of trying advanced statistical models in a competition. As a consequence of this, what gives you the most "bang for the buck" is rarely the statistical method you apply, but rather the features you apply it to. By feature engineering, we mean using domain-specific knowledge or automatic methods for generating, extracting, removing or altering features in the data set.

"For most Kaggle competitions the most important part is feature engineering, which is pretty easy to learn how to do." (Tim Salimans)

"The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering." (Luca Massaron)

"Feature engineering is certainly one of the most important aspects in Kaggle competitions and it is the part where one should spend the most time on. There are often some hidden features in the data which can improve your performance by a lot and if you want to get a good place on the leaderboard you have to find them. If you screw up here you mostly can't win anymore; there is always one guy who finds all the secrets. However, there are also other important parts, like how you formulate the problem. Will you use a regression model or classification model or even combine both or is some kind of ranking needed. This, and feature engineering, are crucial to achieve a good result in those competitions. There are also some competitions where (manual) feature engineering is not needed anymore; like in image processing competitions. Current state of the art deep learning algorithms can do that for you." (Josef Feigl)

There are some specific types of data which have previously required a larger amount of feature engineering, namely text data and image data. In many of the previous competitions with text and image data, feature engineering was a huge part of the winning solutions (examples of this are SUNNY HASHTAGS, FACEBOOK, SEE CLICK PREDICT and BIRD). At the same time (perhaps due to the amount of work needed to do good feature engineering here), deep learning approaches to automatic feature extraction have gained popularity.

In the competition SUNNY HASHTAGS, which featured text data taken from Twitter, feature engineering was a major part of the winning solution. The winning solution used a simple regularized regression model, but generated a lot of features from the text:

"My set of features included the basic tfidf of 1,2,3-grams and 3,5,6,7 ngrams. I used a CMU Ark Twitter dedicated tokenizer which is especially robust for processing tweets + it tags the words with part-of-speech tags which can be useful to derive additional features. Additionally, my base feature set included features derived from sentiment dictionaries that map each word to a positive/neutral/negative sentiment. I found this helped to predict S categories by quite a bit. Finally, with Ridge model I found that doing any feature selection was only hurting the performance, so I ended up keeping all of the features – about 1.9 million. The training time for a single model was still reasonable." (aseveryn, 1st place winner)
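To make the flavour of such a feature set concrete, here is a minimal Python sketch of word- and character-n-gram tfidf features feeding a regularized linear model. This is not the winner's actual code: the CMU Ark tokenizer and sentiment dictionaries are omitted, the exact n-gram ranges are approximated, and the tweets and targets are invented for illustration.

    # Sketch of tfidf n-gram features + Ridge, in the spirit of the quote above.
    # Data and targets are made up; the winner's tokenizer and sentiment
    # dictionaries are omitted.
    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge

    tweets = ["the sun is shining and it is hot",
              "storm clouds rolling in tonight"]
    y = [0.9, 0.1]  # illustrative sentiment-style targets

    word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 3))
    char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 7))

    # Stack word- and character-level tfidf matrices side by side.
    X = hstack([word_tfidf.fit_transform(tweets),
                char_tfidf.fit_transform(tweets)])

    model = Ridge(alpha=1.0).fit(X, y)  # regularized regression, as in the entry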
In the competitions which did not have text or image data, feature engineering sometimes still played an important role in the winning entries. An example of this is the CAUSE EFFECT competition, where the winning entry created thousands of features, and then used genetic algorithms to remove non-useful features again. On the contrary, sometimes the winning solutions are those which go a non-intuitive way and simply use a black-box approach. An example of this is the SOLAR ENERGY competition, where the Top-3 entries used almost no feature engineering (even though this seemed like the most intuitive approach for many) – they simply combined the entire dataset into one big table and used a complex black-box model.

Having too many features (making the feature set overcomplete) is not advisable either, since redundant or useless features tend to reduce the model accuracy.

3.1.1 Mathematical justification for feature engineering

When using simple models, it is often necessary to engineer new features to capture the right trends in the data. The most common example of this is attempting to use a linear method to model non-linear behaviour.

To give a simple example, assume we want to predict the price of a house H given the dimensions (length l_H and width w_H of the floor plan) of the house. Assume also that the price p(H) can be described as a linear function p(H) = α·a_H + β, where a_H = l_H · w_H is the area. By fitting a linear regression model to the original parameters l_H, w_H, we will not capture the quadratic trend in the data. If we instead construct a new feature a_H = l_H · w_H (the area) for each data sample (house), and fit a linear regression model using this new feature, then we will be able to capture the trend we are looking for.
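The following short Python sketch illustrates this point numerically. The data is synthetic (generated exactly according to the assumed price function p(H) = α·a_H + β), and scikit-learn is used for the regression:

    # Synthetic illustration of the house-price example: a linear model on raw
    # length and width misses the multiplicative trend, while the engineered
    # area feature a_H = l_H * w_H makes the relationship exactly linear.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    length = rng.uniform(5, 20, 200)
    width = rng.uniform(5, 20, 200)
    alpha, beta = 1500.0, 50000.0
    price = alpha * length * width + beta  # p(H) = alpha * a_H + beta

    raw = np.column_stack([length, width])
    engineered = (length * width).reshape(-1, 1)  # the engineered area feature

    print(LinearRegression().fit(raw, price).score(raw, price))  # R^2 < 1
    print(LinearRegression().fit(engineered, price)
          .score(engineered, price))                             # R^2 = 1.0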
3.2 Simple models can get you very far

When looking through descriptions of people's solutions after a competition has ended, there is often a surprising number of very simple solutions obtaining good results. What is also (initially) surprising is that the simplest approaches are often described by some of the most prominent competitors.

"I think beginners sometimes just start to 'throw' algorithms at a problem without first getting to know the data. I also think that beginners sometimes also go too-complex-too-soon. There is a view among some people that you are smarter if you create something really complex. I prefer to try out simpler. I 'try' to follow Albert Einstein's advice when he said, 'Any intelligent fool can make things bigger and more complex. It takes a touch of genius – and a lot of courage – to move in the opposite direction.'" (Steve Donoho)

"My first few submissions are usually just 'baseline' submissions of extremely simple models – like 'guess the average' or 'guess the average segmented by variable X'. These are simply to establish what is possible with very simple models. You'd be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model." (Steve Donoho)

"I think a simple model can make you top 10 in a Kaggle competition. In order to get a money prize, you have to go to ensembles most of the time." (Zhao Xing)

"You can go very far [with simple models], if you use them well, but likely you cannot win a competition by a simple model alone. Simple models are easy to train and to understand and they can provide you with more insight than more complex black boxes. They are also easy to be modified and adapted to different situations. They also force you to work more on the data itself (feature engineering, data cleaning, missing data estimation). On the other hand, being simple, they suffer from high bias, so they likely cannot catch a complex mapping of your unknown function." (Luca Massaron)

Simplicity can come in multiple forms, regarding both the complexity of the model and the pre-processing of the data. In some competitions, regularized linear regression can be the winning model in spite of its simplicity. In other cases, the winning solutions are those which do almost no pre-processing of the data (as seen for example in the SOLAR ENERGY competition).
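As a hypothetical illustration of the baseline submissions Steve Donoho describes above, the following pandas sketch implements "guess the average" and "guess the average segmented by variable X"; the column names and values are invented:

    # Two baseline predictors: the global mean, and the mean within each
    # segment of a (hypothetical) categorical variable X.
    import pandas as pd

    train = pd.DataFrame({"X": ["a", "b", "a", "b"],
                          "target": [1.0, 3.0, 2.0, 5.0]})
    test = pd.DataFrame({"X": ["a", "b"]})

    # Baseline 1: predict the global training mean for every test sample.
    test["pred_global"] = train["target"].mean()

    # Baseline 2: predict the mean within each segment of X.
    segment_means = train.groupby("X")["target"].mean()
    test["pred_segmented"] = test["X"].map(segment_means)
    print(test)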
3.3 Ensembling is a winning strategy

As described in [1], complex models, and in particular models which are combinations of many models, should perform better when measured on predictive accuracy. This hypothesis can be backed up by looking at the winning solutions for the latest competitions on Kaggle.

If one considers the 10 Kaggle competitions mentioned in Section 2.2 and looks at which models the top participants used, one finds that in 8 of the 10 competitions, model combination and ensemble models were a key part of the final submission. The only two competitions where no ensembling was used by the top participants were FACEBOOK and BELKIN, where a possible usage of model combination was non-trivial and where the data sets were of a size that favored simple models.

"No matter how faithful and well tuned your individual models are, you are likely to improve the accuracy with ensembling. Ensembling works best when the individual models are less correlated. Throwing a multitude of mediocre models into a blender can be counterproductive. Combining a few well constructed models is likely to work better. Having said that, it is also possible to overtune an individual model to the detriment of the overall result. The tricky part is finding the right balance." (Anil Thomas)

"[The fact that most winning entries use ensembling] is natural from a competitor's perspective, but potentially very hurtful for Kaggle/its clients: a solution consisting of an ensemble of 1000 black box models does not give any insight and will be extremely difficult to reproduce. This will not translate to real business value for the comp organizers." (Tim Salimans)

"I am a big believer in ensembles. They do improve accuracy. BUT I usually do that as a very last step. I usually try to squeeze all that I can out of creating derived variables and using individual algorithms. After I feel like I have done all that I can on that front, I try out ensembles." (Steve Donoho)

"Ensembling is a no-brainer. You should do it in every competition since it usually improves your score. However, for me it is usually the last thing I do in a competition and I don't spend too much time on it." (Josef Feigl)

Besides the intuitive appeal of averaging models, one can justify ensembling mathematically.

3.3.1 Mathematical justification for ensembling

To justify ensembling mathematically, we refer to the approach of [7]. They look at a one-of-K classification problem and model the probability of input x belonging to class i as

    f_i(x) = p(c_i | x) + β_i + η_i(x),

where p(c_i | x) is the a posteriori probability distribution of the i-th class given input x, β_i is a bias for the i-th class (which is independent of x), and η_i(x) is the error of the output for class i.

They then derive the following expression for how the added error (the part of the error due to our model fit being wrong) changes when averaging over the different models in the ensemble:

    E_add^ave = E_add · (1 + δ(N − 1)) / N,

where δ is the average correlation between the models (weighted by the prior probabilities of the different classes) and N is the number of models trained.

The important take-away from this result is that ensembling works best if the models we combine have a low correlation. A key thing to note, though, is that low correlation between models is in itself not enough to guarantee a lowering of the overall error. Ensembling as described above is effective in lowering the variance of a model, but not in lowering the bias.
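The result can be checked numerically. The following sketch simulates N models whose errors are unit-variance Gaussians with a constant pairwise correlation δ, and compares the empirical variance of the averaged error with the theoretical factor (1 + δ(N − 1))/N; the parameter values are arbitrary:

    # Numerical check of the Tumer & Ghosh factor: averaging N models with
    # pairwise error correlation delta scales the added error (variance) by
    # (1 + delta*(N-1))/N.
    import numpy as np

    rng = np.random.default_rng(0)
    N, delta, n_samples = 10, 0.3, 200_000

    # Covariance with unit variances and constant pairwise correlation delta.
    cov = np.full((N, N), delta) + (1 - delta) * np.eye(N)
    errors = rng.multivariate_normal(np.zeros(N), cov, size=n_samples)

    ensemble_error = errors.mean(axis=1)
    print(ensemble_error.var())       # empirical variance of the averaged error
    print((1 + delta * (N - 1)) / N)  # theoretical factor; the two agree closely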
3.4 Overfitting to the leaderboard is an issue

During a competition on Kaggle, the participants have the possibility of submitting their solutions (predictions on the public and private test set) to a public leaderboard. By submitting a solution to the leaderboard you get back an evaluation of your model on the public part of the test set. It is clear that obtaining evaluations from the leaderboard gives you additional information/data, but it also introduces the possibility of overfitting to the leaderboard scores:

"The leaderboard definitely contains information. Especially when the leaderboard has data from a different time period than the training data (such as with the heritage health prize). You can use this information to do model selection and hyperparameter tuning." (Tim Salimans)

"The public leaderboard is some help, [...] but one needs to be careful to not overfit to it especially on small datasets. Some masters I have talked to pick their final submission based on a weighted average of their leaderboard score and their CV score (weighted by data size). Kaggle makes the dangers of overfit painfully real. There is nothing quite like moving from a good rank on the public leaderboard to a bad rank on the private leaderboard to teach a person to be extra, extra careful to not overfit." (Steve Donoho)

"Having a good cross validation system by and large makes it unnecessary to use feedback from the leaderboard. It also helps to avoid the trap of overfitting to the public leaderboard." (Anil Thomas)

"Overfitting to the leaderboard is always a major problem. The best way to avoid it is to completely ignore the leaderboard score and trust only your cross-validation score. The main problem here is that your cross-validation has to be correct and that there is a clear correlation between your cv-score and the leaderboard score (e.g. improvement in your cv-score leads to improvement on the leaderboard). If that's the case for a given competition, then it's easy to avoid overfitting. This works usually well if the test set is large enough. If the test set is only small in size and if there is no clear correlation, then it's very difficult to only trust your cv-score. This can be the case if the test set is taken from another distribution than the train set." (Josef Feigl)

In the last 10 competitions on Kaggle, two showed extreme cases of overfitting and four showed mild cases of overfitting. The two extreme cases were BIG DATA and STUMBLEUPON. In Table 1 the Top-10 submissions on the public test set from BIG DATA are shown, together with the results of the same participants on the private test set.

Table 1. Results of the Top-10 participants on the leaderboard for the competition "Big Data Combine".

Name               | Public rank | Private rank | Public score | Private score
Konstantin Sofiyuk | 1           | 378          | 0.40368      | 0.43624
Ambakhof           | 2           | 290          | 0.40389      | 0.42748
SY                 | 3           | 2            | 0.40820      | 0.42331
Giovanni           | 4           | 330          | 0.40861      | 0.42893
asdf               | 5           | 369          | 0.41078      | 0.43364
dynamic24          | 6           | 304          | 0.41085      | 0.42782
Zoey               | 7           | 205          | 0.41220      | 0.42605
GKHI               | 8           | 288          | 0.41225      | 0.42746
Jason Sumpter      | 9           | 380          | 0.41262      | 0.44014
Vikas              | 10          | 382          | 0.41264      | 0.44276

In BIG DATA, the task was to predict the value of stocks multiple hours into the future, which is generally thought to be extremely difficult (this is similar to what is known as the Efficient Market Hypothesis). The extreme jumps on the leaderboard are most likely due to the sheer difficulty of predicting stocks combined with overfitting. In the cases where there were small differences between the public leaderboard and the private leaderboard, the discrepancy can also sometimes be explained by the scores for the top competitors being so close that random noise affected the positions.
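A minimal sketch of the cross-validation discipline described in these quotes is given below, assuming a scikit-learn setup; the estimator, data and hyperparameter grid are placeholders, the point being only that model selection is driven by the cv-score rather than by leaderboard feedback:

    # Select hyperparameters by cross-validation score instead of repeated
    # public-leaderboard probing. Data and model here are placeholders.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_regression(n_samples=500, n_features=20,
                           noise=10.0, random_state=0)

    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    for alpha in [0.1, 1.0, 10.0]:
        scores = cross_val_score(Ridge(alpha=alpha), X, y,
                                 scoring="neg_mean_squared_error", cv=cv)
        # Trust this number, not the public leaderboard, when picking alpha.
        print(alpha, -scores.mean())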
3.5 Predicting the right thing is important

One task that is sometimes trivial, and other times not, is that of "predicting the right thing". It seems quite trivial to state that it is important to predict the right thing, but it is not always a simple matter in practice.

"A next step is to ask, 'What should I actually be predicting?'. This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I'll use the GE Flightquest as an example: you don't want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that times the original estimate." (Steve Donoho)

There are two ways to address the problem of predicting the right thing: The first is the one addressed in the quote from Steve Donoho, about predicting the correct derived variable. The other is to train the statistical models using the appropriate loss function.

"Just moving from RMSE to MAE can drastically change the coefficients of a simple model such as a linear regression. Optimizing for the correct metric can really allow you to rank higher in the LB, especially if there is variable selection involved." (Luca Massaron)

"Usually it makes sense to optimize the correct metric (especially in your cv-score). [...] However, you don't have to do that. For example one year ago, I've won the Event Recommendation Engine Challenge whose metric was MAP. I never used this metric and evaluated all my models using LogLoss. It worked well there." (Josef Feigl)

As an example of why using the wrong loss function might give rise to issues, look at the following simple example: Say you want to fit the simplest possible regression model, namely just an intercept a, to the data

    x = (0.1, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 100).

If we let a_MSE denote the a minimizing the mean squared error (the mean of x), and a_MAE denote the a minimizing the mean absolute error (the median of x), we get

    a_MSE ≈ 9.2818,    a_MAE ≈ 0.2000.

If we now compute the MSE and MAE using both estimates of a, we get the following results:

    (1/11) Σ_i |x_i − a_MAE| ≈ 9       (1/11) Σ_i |x_i − a_MSE| ≈ 16
    (1/11) Σ_i (x_i − a_MAE)² ≈ 905    (1/11) Σ_i (x_i − a_MSE)² ≈ 822

We see (as expected) that for each loss function (MAE and MSE), the parameter which was fitted to minimize that loss function achieves the lower error. This should come as no surprise, but when the loss functions and statistical methods become complicated (such as the Normalized Discounted Cumulative Gain used for some ranking competitions), it is not always as trivial to see whether one is actually optimizing the correct thing.
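This example is easy to reproduce; the following Python snippet computes both estimates (the mean and the median of x) and evaluates each under both loss functions:

    # Reproduces the intercept example: the MSE-optimal constant is the mean,
    # the MAE-optimal constant is the median, and each wins on its own metric.
    import numpy as np

    x = np.array([0.1, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 100])

    a_mse = x.mean()      # ~9.2818, minimizes mean squared error
    a_mae = np.median(x)  # 0.2, minimizes mean absolute error

    for name, a in [("a_MSE", a_mse), ("a_MAE", a_mae)]:
        print(name, np.mean(np.abs(x - a)), np.mean((x - a) ** 2))
    # MAE: ~16 vs ~9 ; MSE: ~822 vs ~905 -> each estimate wins on its own loss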
4 Additional advice

In addition to the quotes related to the five hypotheses, the top Kaggle participants also revealed helpful comments for performing well in a machine learning competition. Some of their statements are given in this section.

"The best tip for a newcomer is to read the forums. You can find a lot of good advice there and nowadays also some code to get you started. Also, one shouldn't spend too much time on optimizing the parameters of the model at the beginning of the competition. There is enough time for that at the end of a competition." (Josef Feigl)

"In each competition I learn a bit more from the winners. A competition is not won by one insight, usually it is won by several careful steps towards a good modelling approach. Everything plays its role, so there is no secret formula here, just several lessons learned applied together. I think new kagglers would benefit more from carefully reading the forums and the past competitions' winning posts. Kaggle masters aren't cheap on advice!" (Lucas Eustaquio)

"My most surprising experience was to see the consistently good results of Friedman's gradient boosting machine. It does not turn out from the literature that this method shines in practice." (Gábor Takács)

"The more tools you have in your toolbox, the better prepared you are to solve a problem. If I only have a hammer in my toolbox, and you have a toolbox full of tools, you are probably going to build a better house than I am. Having said that, some people have a lot of tools in their toolbox, but they don't know *when* to use *which* tool. I think knowing when to use which tool is very important. Some people get a bunch of tools in their toolbox, but then they just start randomly throwing a bunch of tools at their problem without asking, 'Which tool is best suited for this problem?'" (Steve Donoho)

5 Conclusion

This paper looks at the recent trend of using data analysis competitions for selecting the most appropriate model for a specific problem. When participating in data analysis competitions, models get evaluated solely on their predictive accuracy. Because the submitted models are not evaluated on their computational efficiency, novelty or interpretability, the model construction differs slightly from the way models are normally constructed for academic purposes and in industry.

We stated a set of five hypotheses about the way to select and construct models for competitive purposes. We then used a combination of mathematical theory, experience from past competitions and qualitative interviews with top participants from Kaggle to try to verify these hypotheses.

Although there is no secret formula for winning a data analysis competition, the stated hypotheses, together with additional good advice from top-performing Kaggle competitors, give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.

REFERENCES

[1] Leo Breiman, 'Statistical modeling: The two cultures', Statistical Science, (2001).
[2] World Economic Forum, 'Big data, big impact: New possibilities for international development', http://bit.ly/1fbP4aj, January 2012. [Online].
[3] A. Goldbloom, 'Data prediction competitions – far more than just a bit of fun', in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pp. 1385–1386, (Dec 2010).
[4] Steve Lohr, 'The age of big data', http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html, February 2012. [Online; posted 11-February-2012].
[5] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers, 'Big data: The next frontier for innovation, competition and productivity', http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation, May 2011. [Online; posted May-2011].
[6] Mu Zhu, 'The impact of prediction contests', 2011.
[7] K. Tumer and J. Ghosh, 'Error correlation and error reduction in ensemble classifiers', Connection Science, 8(3-4), 385–403, (1996).