             Model Selection in Data Analysis Competitions
                                               David Kofoed Wind¹ and Ole Winther²

¹ Technical University of Denmark, Denmark, email: dawi@dtu.dk
² Technical University of Denmark, Denmark, email: olwi@dtu.dk


Abstract. The use of data analysis competitions for selecting the most appropriate model for a problem is a recent innovation in the field of predictive machine learning. Two of the most well-known examples of this trend were the Netflix Prize and, more recently, the competitions hosted on the online platform Kaggle.
   In this paper, we state and try to verify a set of qualitative hypotheses about predictive modelling, both in general and in the scope of data analysis competitions. To verify our hypotheses we look at previous competitions and their outcomes, use qualitative interviews with top performers from Kaggle, and draw on previous personal experience from competing in Kaggle competitions.
   The stated hypotheses about feature engineering, ensembling, overfitting, model complexity and evaluation metrics give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.

1   Introduction

In recent years, the amount of available data has increased exponentially, and “Big Data Analysis” is expected to be at the core of most future innovations [2, 4, 5]. A new and very promising trend in the field of predictive machine learning is the use of data analysis competitions for model selection. Due to the rapid development of competitive data analysis, there is still a lack of consensus and literature on how one should approach predictive modelling competitions.
   In his well-known paper “Statistical Modeling: The Two Cultures” [1], Leo Breiman divides statistical modelling into two cultures: the data modelling culture and the algorithmic modelling culture. The arguments put forward in [1] justify an approach to predictive modelling where the focus is purely on predictive accuracy. That this is the right way of looking at statistical modelling is the underlying assumption in statistical prediction competitions, and consequently also in this paper.
   The concept of machine learning competitions was made popular by the Netflix Prize, a massive open competition with the aim of constructing the best algorithm for predicting user ratings of movies. The competition featured a prize of 1,000,000 dollars for the first team to improve Netflix’s own results by 10%, and multiple teams achieved this goal. After the success of the Netflix Prize, the website Kaggle was born, providing a platform for predictive modelling. Kaggle hosts numerous data prediction competitions and has more than 170,000 users worldwide.
   The basic structure of a predictive modelling competition, as seen for example on Kaggle and in the Netflix competition, is the following: A predictive problem is described, and the participants are given a dataset with a number of samples and the true target values (the values to predict) for each sample; this is called the training set. The participants are also given another dataset like the training set, but where the target values are not known; this is called the test set. The task of the participants is to predict the correct target values for the test set, using the training set to build their models. When participants have a set of proposed predictions for the test set, they can submit these to a website, which will then evaluate the submission on a part of the test set known as the quiz set, the validation set or simply the public part of the test set. The result of this evaluation on the quiz set is shown on a leaderboard, giving the participants an idea of how they are progressing.
   Using a competitive approach to predictive modelling is being praised by some as the modern way to do science:

   Kaggle recently hosted a bioinformatics contest, which required participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection). Within a week and a half, the best submission had already outdone the best methods in the scientific literature. [3]
                     (Anthony Goldbloom, Founder and CEO at Kaggle)

   These prediction contests are changing the landscape for researchers in my area, an area that focuses on making good predictions from finite (albeit sometimes large) amounts of data. In my personal opinion, they are creating a new paradigm with distinctive advantages over how research is traditionally conducted in our field. [6]
                     (Mu Zhu, Associate Professor, University of Waterloo)

   This competitive approach is interesting and seems fruitful; one can even see it as an extension of the aggregation ideas put forward in [1], in the sense that the winning model is simply the model with the best accuracy, not taking computational efficiency or interpretability into account. Still, one could ask whether the framework provided by, for example, Kaggle gives a trustworthy resemblance of real-world predictive modelling problems, where problems do not come with a quiz set and a leaderboard.
   In this paper we state five hypotheses about building and selecting models for competitive data analysis. To verify these hypotheses we look at previous competitions and their outcomes, use qualitative interviews with top performers from Kaggle, and draw on previous personal experience from competing in Kaggle competitions.

2   Interviews and Previous Competitions

In this section we briefly describe the data we are using: we list the people we interviewed and name the previous Kaggle competitions we use for empirical data.

2.1   Interviews

To help answer the questions we are stating, we asked a series of questions to some of the best Kaggle participants of all time. We talked (by e-mail) with the following participants (name, Kaggle username, current rank on Kaggle):

• Steve Donoho (BreakfastPirate #2)
• Lucas Eustaquio (Leustagos #6)
• Josef Feigl (Josef Feigl #7)
• Zhao Xing (xing zhao #10)
• Anil Thomas (Anil Thomas #11)
• Luca Massaron (Luca Massaron #13)
• Gábor Takács (Gábor Takács #20)
• Tim Salimans (Tim Salimans #48)

Answers and parts of answers to our questions are included in this paper as quotes when relevant.

2.2   Previous competitions

Besides the qualitative interviews with Kaggle masters, we also looked at 10 previous Kaggle competitions, namely the following:

• Facebook Recruiting III - Keyword Extraction
• Partly Sunny with a Chance of Hashtags
• See Click Predict Fix
• Multi-label Bird Species Classification - NIPS 2013
• Accelerometer Biometric Competition
• AMS 2013-2014 Solar Energy Prediction Contest
• StumbleUpon Evergreen Classification Challenge
• Belkin Energy Disaggregation Competition
• The Big Data Combine Engineered by BattleFin
• Cause-effect pairs

These competitions were selected as 10 consecutive competitions, where we excluded a few competitions which did not fit the standard framework of statistical data analysis (for example challenges in optimization and operations research).
   Throughout this paper, these competitions are referenced with the following abbreviated names: FACEBOOK, SUNNY HASHTAGS, SEE CLICK PREDICT, BIRD, ACCELEROMETER, SOLAR ENERGY, STUMBLEUPON, BELKIN, BIG DATA and CAUSE EFFECT.

3   Hypotheses

In this section we state five hypotheses about predictive modelling in a competitive framework. We try to verify the validity of each hypothesis using a combination of mathematical arguments, empirical evidence from previous competitions and the qualitative interviews we conducted with some of the top participants on Kaggle. The five hypotheses to be investigated are:

1. Feature engineering is the most important part of predictive machine learning
2. Overfitting to the leaderboard is a real issue
3. Simple models can get you very far
4. Ensembling is a winning strategy
5. Predicting the right thing is important

3.1   Feature engineering is the most important part

With the extensive amount of free tools and libraries available for data analysis, everybody has the possibility of trying advanced statistical models in a competition. As a consequence, what gives you the most “bang for the buck” is rarely the statistical method you apply, but rather the features you apply it to. By feature engineering, we mean using domain-specific knowledge or automatic methods for generating, extracting, removing or altering features in the data set.

   For most Kaggle competitions the most important part is feature engineering, which is pretty easy to learn how to do.
                                                   (Tim Salimans)

   The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.
                                                   (Luca Massaron)

   Feature engineering is certainly one of the most important aspects in Kaggle competitions and it is the part where one should spend the most time on. There are often some hidden features in the data which can improve your performance by a lot and if you want to get a good place on the leaderboard you have to find them. If you screw up here you mostly can’t win anymore; there is always one guy who finds all the secrets.
      However, there are also other important parts, like how you formulate the problem. Will you use a regression model or classification model or even combine both or is some kind of ranking needed. This, and feature engineering, are crucial to achieve a good result in those competitions.
      There are also some competitions where (manual) feature engineering is not needed anymore; like in image processing competitions. Current state of the art deep learning algorithms can do that for you.
                                                   (Josef Feigl)

There are some specific types of data which have previously required a larger amount of feature engineering, namely text data and image data. In many of the previous competitions with text and image data, feature engineering was a huge part of the winning solutions (examples of this are SUNNY HASHTAGS, FACEBOOK, SEE CLICK PREDICT and BIRD). At the same time (perhaps due to the amount of work needed to do good feature engineering here), deep learning approaches to automatic feature extraction have gained popularity.
   In the competition SUNNY HASHTAGS, which featured text data taken from Twitter, feature engineering was a major part of the winning solution. The winning solution used a simple regularized regression model, but generated a lot of features from the text:

   My set of features included the basic tfidf of 1,2,3-grams and 3,5,6,7 ngrams. I used a CMU Ark Twitter dedicated tokenizer which is especially robust for processing tweets + it tags the words with part-of-speech tags which can be useful to derive additional features. Additionally, my base feature set included features derived from sentiment dictionaries that map each word to a positive/neutral/negative sentiment. I found this helped to predict S categories by quite a bit. Finally, with Ridge model I found that doing any feature selection was only hurting the performance, so I ended up keeping all of the features, ~1.9 mil. The training time for a single model was still reasonable.
                                    (aseveryn - 1st place winner)
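
Features of this kind are straightforward to reproduce with standard tools. Below is a minimal sketch, assuming scikit-learn is available; the combination of word 1-3-grams and character 3-7-grams follows the spirit of the quote, but the toy data, hyperparameters and pipeline layout are illustrative assumptions, not the winner’s actual setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, make_pipeline

# Word-level tf-idf of 1-3-grams plus character-level tf-idf n-grams,
# loosely following the quoted feature set (parameters are illustrative).
features = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 3))),
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 7))),
])
model = make_pipeline(features, Ridge(alpha=1.0))

tweets = ["what a gorgeous sunny day out here",
          "storm rolling in again, so cold and grey"]
targets = [0.9, 0.1]  # stand-in for one of the sentiment-related targets

model.fit(tweets, targets)
print(model.predict(["another sunny afternoon"]))
```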

In the competitions which did not have text or image data, feature engineering sometimes still played an important role in the winning entries. An example of this is the CAUSE EFFECT competition, where the winning entry created thousands of features and then used genetic algorithms to remove non-useful features again. On the contrary, sometimes the winning solutions are those which go a non-intuitive way and simply use a black-box approach. An example of this is the SOLAR ENERGY competition, where the Top-3 entries used almost no feature engineering (even though this seemed like the most intuitive approach to many) and simply combined the entire dataset into one big table and used a complex black-box model.
   Having too many features (making the feature set overcomplete) is not advisable either, since redundant or useless features tend to reduce the model accuracy.

3.1.1   Mathematical justification for feature engineering

When using simple models, it is often necessary to engineer new features to capture the right trends in the data. The most common example of this is attempting to use a linear method to model non-linear behaviour.
   To give a simple example, assume we want to predict the price of a house $H$ given the dimensions (length $l_H$ and width $w_H$ of the floor plan) of the house. Assume also that the price $p(H)$ can be described as a linear function $p(H) = \alpha a_H + \beta$, where $a_H = l_H \cdot w_H$ is the area. By fitting a linear regression model to the original parameters $l_H$, $w_H$, we will not capture the quadratic trend in the data. If we instead construct a new feature $a_H = l_H \cdot w_H$ (the area) for each data sample (house), and fit a linear regression model using this new feature, then we will be able to capture the trend we are looking for.
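
The same example can be run in a few lines. The sketch below assumes scikit-learn and synthetic data (the coefficients $\alpha = 3$ and $\beta = 10$ are invented for illustration); the raw features leave the quadratic trend uncaptured, while the engineered area feature recovers it exactly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
length = rng.uniform(5, 20, size=200)  # l_H
width = rng.uniform(5, 20, size=200)   # w_H
price = 3.0 * length * width + 10.0    # p(H) = alpha * a_H + beta

X_raw = np.column_stack([length, width])  # raw features l_H, w_H
X_area = (length * width).reshape(-1, 1)  # engineered feature a_H = l_H * w_H

# R^2 of the raw linear fit stays clearly below 1; the engineered fit is exact.
print(LinearRegression().fit(X_raw, price).score(X_raw, price))
print(LinearRegression().fit(X_area, price).score(X_area, price))
```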

3.2   Simple models can get you very far

When looking through descriptions of people’s solutions after a competition has ended, there is often a surprising number of very simple solutions obtaining good results. What is also (initially) surprising is that the simplest approaches are often described by some of the most prominent competitors.

   I think beginners sometimes just start to “throw” algorithms at a problem without first getting to know the data. I also think that beginners sometimes also go too-complex-too-soon. There is a view among some people that you are smarter if you create something really complex. I prefer to try out simpler. I “try” to follow Albert Einstein’s advice when he said, “Any intelligent fool can make things bigger and more complex. It takes a touch of genius – and a lot of courage – to move in the opposite direction”.
                                                   (Steve Donoho)

   My first few submissions are usually just “baseline” submissions of extremely simple models – like “guess the average” or “guess the average segmented by variable X”. These are simply to establish what is possible with very simple models. You’d be surprised that you can sometimes come very close to the score of someone doing something very complex by just using a simple model.
                                                   (Steve Donoho)

   I think a simple model can make you top 10 in a Kaggle competition. In order to get a money prize, you have to go to ensembles most of the time.
                                                   (Zhao Xing)

   You can go very far [with simple models], if you use them well, but likely you cannot win a competition with a simple model alone. Simple models are easy to train and to understand and they can provide you with more insight than more complex black boxes. They are also easy to modify and adapt to different situations. They also force you to work more on the data itself (feature engineering, data cleaning, missing data estimation). On the other hand, being simple, they suffer from high bias, so they likely cannot catch a complex mapping of your unknown function.
                                                   (Luca Massaron)

Simplicity can come in multiple forms, both regarding the complexity of the model and regarding the pre-processing of the data. In some competitions, regularized linear regression can be the winning model in spite of its simplicity. In other cases, the winning solutions are those that do almost no pre-processing of the data (as seen for example in the SOLAR ENERGY competition). Baselines of the kind Donoho describes take only a few lines to implement, as sketched below.
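
A minimal sketch of the two baselines, assuming pandas; the column names and values are hypothetical:

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B"],  # hypothetical segmenting variable X
    "target": [1.0, 3.0, 10.0, 12.0, 14.0],
})
test = pd.DataFrame({"city": ["A", "B", "C"]})

# Baseline 1: guess the global average for every test row.
test["pred_global"] = train["target"].mean()

# Baseline 2: guess the average segmented by "city", falling back
# to the global average for segments unseen in training.
segment_means = train.groupby("city")["target"].mean()
test["pred_segmented"] = (test["city"].map(segment_means)
                          .fillna(train["target"].mean()))
print(test)
```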

3.3   Ensembling is a winning strategy

As described in [1], complex models, and in particular models which are combinations of many models, should perform better when measured on predictive accuracy. This hypothesis can be backed up by looking at the winning solutions for the latest competitions on Kaggle.
   If one considers the 10 Kaggle competitions mentioned in Section 2.2 and looks at which models the top participants used, one finds that in 8 of the 10 competitions, model combination and ensemble models were a key part of the final submission. The only two competitions where no ensembling was used by the top participants were FACEBOOK and BELKIN, where a possible usage of model combination was non-trivial and where the data sets were of a size that favored simple models.

   No matter how faithful and well tuned your individual models are, you are likely to improve the accuracy with ensembling. Ensembling works best when the individual models are less correlated. Throwing a multitude of mediocre models into a blender can be counterproductive. Combining a few well constructed models is likely to work better. Having said that, it is also possible to overtune an individual model to the detriment of the overall result. The tricky part is finding the right balance.
                                                   (Anil Thomas)

   [The fact that most winning entries use ensembling] is natural from a competitor’s perspective, but potentially very hurtful for Kaggle/its clients: a solution consisting of an ensemble of 1000 black box models does not give any insight and will be extremely difficult to reproduce. This will not translate to real business value for the comp organizers.
                                                   (Tim Salimans)

   I am a big believer in ensembles. They do improve accuracy. BUT I usually do that as a very last step. I usually try to squeeze all that I can out of creating derived variables and using individual algorithms. After I feel like I have done all that I can on that front, I try out ensembles.
                                                   (Steve Donoho)

   Ensembling is a no-brainer. You should do it in every competition since it usually improves your score. However, for me it is usually the last thing I do in a competition and I don’t spend too much time on it.
                                                   (Josef Feigl)

Besides the intuitive appeal of averaging models, one can justify ensembling mathematically.

3.3.1   Mathematical justification for ensembling

To justify ensembling mathematically, we refer to the approach of [7]. They look at a one-of-K classification problem and model the probability of input $x$ belonging to class $i$ as

\[
    f_i(x) = p(c_i \mid x) + \beta_i + \eta_i(x),
\]

where $p(c_i \mid x)$ is the a posteriori probability distribution of the $i$-th class given input $x$, $\beta_i$ is a bias for the $i$-th class (which is independent of $x$) and $\eta_i(x)$ is the error of the output for class $i$.
   They then derive the following expression for how the added error (the part of the error due to our model fit being wrong) changes when averaging over the $N$ different models in the ensemble:

\[
    E_{\mathrm{add}}^{\mathrm{ave}} = E_{\mathrm{add}} \left( \frac{1 + \delta(N - 1)}{N} \right),
\]

where $\delta$ is the average correlation between the models (weighted by the prior probabilities of the different classes) and $N$ is the number of models trained.
   The important take-away from this result is that ensembling works best if the models we combine have a low correlation. A key thing to note, though, is that low correlation between models is not in itself enough to guarantee a lowering of the overall error. Ensembling as described above is effective in lowering the variance of a model, but not in lowering the bias.
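
The role of the correlation $\delta$ is easy to check numerically. The following sketch (plain NumPy, with arbitrary parameter values) draws $N$ unit-variance error vectors with pairwise correlation $\delta$, averages them, and compares the resulting variance reduction with the factor $(1 + \delta(N - 1))/N$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, delta, n = 10, 0.3, 200_000

# Each error vector is sqrt(delta) * shared noise + sqrt(1 - delta) * own
# noise, which gives unit variance and pairwise correlation delta.
shared = rng.standard_normal(n)
errors = (np.sqrt(delta) * shared
          + np.sqrt(1 - delta) * rng.standard_normal((N, n)))

ensemble_error = errors.mean(axis=0)  # averaging the N models' errors

print("empirical variance ratio :", ensemble_error.var() / errors[0].var())
print("theoretical (1+d(N-1))/N :", (1 + delta * (N - 1)) / N)
# delta -> 0 gives the ideal 1/N reduction; delta -> 1 gives no reduction.
```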

3.4   Overfitting to the leaderboard is an issue

During a competition on Kaggle, the participants have the possibility of submitting their solutions (predictions on the public and private parts of the test set) to a public leaderboard. By submitting a solution to the leaderboard, you get back an evaluation of your model on the public part of the test set. It is clear that obtaining evaluations from the leaderboard gives you additional information/data, but it also introduces the possibility of overfitting to the leaderboard scores:

   The leaderboard definitely contains information. Especially when the leaderboard has data from a different time period than the training data (such as with the Heritage Health Prize). You can use this information to do model selection and hyperparameter tuning.
                                                   (Tim Salimans)

   The public leaderboard is some help, [...] but one needs to be careful to not overfit to it, especially on small datasets. Some masters I have talked to pick their final submission based on a weighted average of their leaderboard score and their CV score (weighted by data size). Kaggle makes the dangers of overfit painfully real. There is nothing quite like moving from a good rank on the public leaderboard to a bad rank on the private leaderboard to teach a person to be extra, extra careful to not overfit.
                                                   (Steve Donoho)

   Having a good cross validation system by and large makes it unnecessary to use feedback from the leaderboard. It also helps to avoid the trap of overfitting to the public leaderboard.
                                                   (Anil Thomas)

   Overfitting to the leaderboard is always a major problem. The best way to avoid it is to completely ignore the leaderboard score and trust only your cross-validation score. The main problem here is that your cross-validation has to be correct and that there is a clear correlation between your cv-score and the leaderboard score (e.g. improvements in your cv-score lead to improvements on the leaderboard). If that’s the case for a given competition, then it’s easy to avoid overfitting. This usually works well if the test set is large enough.
      If the test set is only small in size and if there is no clear correlation, then it’s very difficult to only trust your cv-score. This can be the case if the test set is taken from another distribution than the train set.
                                                   (Josef Feigl)
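
In practice, the quoted advice amounts to maintaining a trustworthy local error estimate and using the leaderboard sparingly. A minimal cross-validation loop, assuming scikit-learn and synthetic stand-in data, could look as follows:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

model = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="neg_log_loss")

# Select models and features on this score; treat the public leaderboard
# as, at most, one extra noisy validation fold.
print("CV log-loss: %.4f +/- %.4f" % (-scores.mean(), scores.std()))
```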

In the 10 last competitions on Kaggle, two showed extreme cases of overfitting and four showed mild cases of overfitting. The two extreme cases were BIG DATA and STUMBLEUPON. Table 1 shows the Top-10 submissions on the public test set from BIG DATA, together with the results of the same participants on the private test set.

   Name                 Public rank   Private rank   Public score   Private score
   Konstantin Sofiyuk        1            378          0.40368        0.43624
   Ambakhof                  2            290          0.40389        0.42748
   SY                        3              2          0.40820        0.42331
   Giovanni                  4            330          0.40861        0.42893
   asdf                      5            369          0.41078        0.43364
   dynamic24                 6            304          0.41085        0.42782
   Zoey                      7            205          0.41220        0.42605
   GKHI                      8            288          0.41225        0.42746
   Jason Sumpter             9            380          0.41262        0.44014
   Vikas                    10            382          0.41264        0.44276

   Table 1. Results of the Top-10 participants on the public leaderboard for the competition “Big Data Combine”, together with their private leaderboard results.

In BIG DATA, the task was to predict the value of stocks multiple hours into the future, which is generally thought to be extremely difficult³. The extreme jumps on the leaderboard are most likely due to the sheer difficulty of predicting stocks combined with overfitting.
   In the cases where there were small differences between the public leaderboard and the private leaderboard, the discrepancy can also sometimes be explained by the scores of the top competitors being so close that random noise affected the positions.

³ This is similar to what is known as the Efficient Market Hypothesis.

3.5   Predicting the right thing is important

One task that is sometimes trivial, and other times not, is that of “predicting the right thing”. It seems quite trivial to state that it is important to predict the right thing, but it is not always a simple matter in practice.

   A next step is to ask, “What should I actually be predicting?”. This is an important step that is often missed by many – they just throw the raw dependent variable into their favorite algorithm and hope for the best. But sometimes you want to create a derived dependent variable. I’ll use the GE Flightquest as an example: you don’t want to predict the actual time the airplane will land; you want to predict the length of the flight; and maybe the best way to do that is to use the ratio of how long the flight actually was to how long it was originally estimated to be, and then multiply that times the original estimate.
                                                   (Steve Donoho)

There are two ways to address the problem of predicting the right thing. The first is the one addressed in the quote from Steve Donoho: predicting the correct derived variable. The other is to train the statistical models using the appropriate loss function.

   Just moving from RMSE to MAE can drastically change the coefficients of a simple model such as a linear regression. Optimizing for the correct metric can really allow you to rank higher in the LB, especially if there is variable selection involved.
                                                   (Luca Massaron)

   Usually it makes sense to optimize the correct metric (especially in your cv-score). [...] However, you don’t have to do that. For example, one year ago I won the Event Recommendation Engine Challenge, whose metric was MAP. I never used this metric and evaluated all my models using LogLoss. It worked well there.
                                                   (Josef Feigl)

As an example of why using the wrong loss function might give rise to issues, consider the following simple example: say we want to fit the simplest possible regression model, namely just an intercept $a$, to the data

\[
    x = (0.1, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 100).
\]

If we let $a_{\mathrm{MSE}}$ denote the $a$ minimizing the mean squared error, and $a_{\mathrm{MAE}}$ the $a$ minimizing the mean absolute error, we get

\[
    a_{\mathrm{MSE}} \approx 9.2818, \qquad a_{\mathrm{MAE}} \approx 0.2000.
\]

If we now compute the MSE and MAE using both estimates of $a$, we get the following results:

\[
    \frac{1}{11} \sum_i |x_i - a_{\mathrm{MAE}}| \approx 9, \qquad
    \frac{1}{11} \sum_i |x_i - a_{\mathrm{MSE}}| \approx 16,
\]
\[
    \frac{1}{11} \sum_i (x_i - a_{\mathrm{MAE}})^2 \approx 905, \qquad
    \frac{1}{11} \sum_i (x_i - a_{\mathrm{MSE}})^2 \approx 822.
\]

We see (as expected) that for each loss function (MAE and MSE), the parameter which was fitted to minimize that loss function achieves the lower error. This should come as no surprise, but when the loss functions and statistical methods become complicated (such as the Normalized Discounted Cumulative Gain used for some ranking competitions), it is not always as trivial to see whether one is actually optimizing the correct thing.
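
The numbers above are easy to reproduce, since the MSE-minimizing intercept is the mean of the data and the MAE-minimizing intercept is the median. A short NumPy check of the computation:

```python
import numpy as np

x = np.array([0.1, 0.2, 0.4, 0.2, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 100])

a_mse = x.mean()      # minimizes mean squared error, approx. 9.2818
a_mae = np.median(x)  # minimizes mean absolute error, exactly 0.2

for name, a in [("a_MSE", a_mse), ("a_MAE", a_mae)]:
    print(name,
          "MAE = %.1f" % np.mean(np.abs(x - a)),
          "MSE = %.1f" % np.mean((x - a) ** 2))
# The outlier at 100 pulls the mean far from the bulk of the data, so
# each intercept wins only under the loss it was fitted to minimize.
```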

4   Additional advice

In addition to the quotes related to the five hypotheses, the top Kaggle participants also revealed helpful comments for performing well in a machine learning competition. Some of their statements are given in this section.

   The best tip for a newcomer is to read the forums. You can find a lot of good advice there and nowadays also some code to get you started. Also, one shouldn’t spend too much time on optimizing the parameters of the model at the beginning of the competition. There is enough time for that at the end of a competition.
                                                   (Josef Feigl)

   In each competition I learn a bit more from the winners. A competition is not won by one insight, usually it is won by several careful steps towards a good modelling approach. Everything plays its role, so there is no secret formula here, just several lessons learned applied together. I think new kagglers would benefit more from carefully reading the forums and the past competitions’ winning posts. Kaggle masters aren’t cheap on advice!
                                                   (Lucas Eustaquio)

   My most surprising experience was to see the consistently good results of Friedman’s gradient boosting machine. It does not turn out from the literature that this method shines in practice.
                                                   (Gábor Takács)

   The more tools you have in your toolbox, the better prepared you are to solve a problem. If I only have a hammer in my toolbox, and you have a toolbox full of tools, you are probably going to build a better house than I am. Having said that, some people have a lot of tools in their toolbox, but they don’t know *when* to use *which* tool. I think knowing when to use which tool is very important. Some people get a bunch of tools in their toolbox, but then they just start randomly throwing a bunch of tools at their problem without asking, “Which tool is best suited for this problem?”
                                                   (Steve Donoho)

5   Conclusion

This paper looks at the recent trend of using data analysis competitions for selecting the most appropriate model for a specific problem. When participating in data analysis competitions, models are evaluated solely on their predictive accuracy. Because the submitted models are not evaluated on their computational efficiency, novelty or interpretability, model construction differs slightly from the way models are normally constructed for academic purposes and in industry.
   We stated a set of five hypotheses about the way to select and construct models for competitive purposes. We then used a combination of mathematical theory, experience from past competitions and qualitative interviews with top participants from Kaggle to try to verify these hypotheses.
   Although there is no secret formula for winning a data analysis competition, the stated hypotheses, together with additional good advice from top-performing Kaggle competitors, give indications and guidelines on how to select a proper model for performing well in a competition on Kaggle.

REFERENCES

[1] Leo Breiman, ‘Statistical modeling: The two cultures’, Statistical Science, (2001).
[2] World Economic Forum. Big data, big impact: New possibilities for international development. http://bit.ly/1fbP4aj, January 2012. [Online].
[3] A. Goldbloom, ‘Data prediction competitions – far more than just a bit of fun’, in Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pp. 1385–1386, (Dec 2010).
[4] Steve Lohr. The age of big data. http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html, February 2012. [Online; posted 11-February-2012].
[5] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. Big data: The next frontier for innovation, competition and productivity. http://www.mckinsey.com/insights/business_technology/big_data_the_next_frontier_for_innovation, May 2011. [Online; posted May-2011].
[6] Mu Zhu, ‘The impact of prediction contests’, 2011.
[7] K. Tumer and J. Ghosh, ‘Error correlation and error reduction in ensemble classifiers’, Connection Science, 8(3-4), 385–403, (1996).



