<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Breaking the curse of dimensionality for machine learning on genomic data</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aidan O'Brien</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Piotr Szul</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oscar Luo</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew George</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Dunne</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Denis Bauer</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Genomic data analyses are performed on ever larger patient cohorts. Machine learning (ML) is employed to detect the complex genomic interactions that can lead to diseases like diabetes or cancer. However, current ML approaches are unable to cope with these data volumes. We introduce CursedForest, a tailored implementation of random forests designed to handle data with an extremely large number of variables per sample. CursedForest is included in our earlier genome interpretation package, VariantSpark, allowing it to perform near-realtime classification of population-scale patient cohorts in “patients like mine” scenarios, as well as GWAS analysis on large unfiltered whole genome sequencing cohorts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Using more sophisticated machine learning (ML) approaches, in particular tree-based models, has proven successful for taking the interaction of variables into account [Wright et al., 2016]. In addition, random forests are well suited to processing “wide” genomic data for two reasons. Firstly, while other machine learning approaches have a propensity to overfit datasets with more features p than samples n
        <xref ref-type="bibr" rid="ref3">(a consequence of the “curse of dimensionality” [Bauer et al., 2014])</xref>
        , decision trees are resistant to overfitting. Secondly, random forests are easy to parallelise: as the forest is an ensemble of decision trees, separate trees can be grown on different processors and the results combined.
      </p>
      <p>However, the use of traditional compute infrastructure limits the parallelisation strategies that can be employed. Programs are limited to utilising only the CPUs on the same compute node (multithreading), or to farming out independent tasks to CPUs distributed across nodes that do not require communication between the processes (a separate tree grown on each node). Hadoop/Spark overcomes these limitations by enabling programs to scale beyond compute-node boundaries and hence enables more sophisticated parallelisation strategies. In the case of random forests, the computations for each node of a tree can thus be handed off to separate processors.</p>
      <p>Despite overcoming the node-boundary
limitation, the standard implementation of random
forest in Spark ML is not able to handle the extremely
“wide” genomic data as it was developed for a large
number of samples with only modest
dimensionality [Abuzaid et al., 2016]. Although Spark ML
can build a random forest model on a subset of the
data (chromosome 1), we show that the time taken
is excessive due to the large amount of data being
aggregated and processed by the driver node
during intermediate stages of building the model. This
unbalanced work load where the driver node
becomes the bottleneck and worker nodes are idle
prevents seamless scaling to larger datasets. We also show that the memory requirements per executor increase with dimensionality due to the data types Spark ML uses.</p>
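      <p>As a back-of-the-envelope illustration of that memory growth (assuming an 8-byte double per genotype in a dense vector, versus a single byte per {0, 1, 2} ordinal genotype; the figures are illustrative, not measurements from Spark ML):</p>

```python
# Approximate per-sample memory for p features stored as dense
# 8-byte doubles versus single-byte ordinal genotypes.
def megabytes(p, bytes_per_feature):
    return p * bytes_per_feature / 1e6

p = 50_000_000               # variants in a whole-genome dataset
dense_mb = megabytes(p, 8)   # stored as doubles: hundreds of MB per sample
byte_mb = megabytes(p, 1)    # stored as bytes: an eighth of that
```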
      <p>Here we introduce CursedForest, a tailored
Hadoop/Spark-based implementation of random
forests specifically designed to cater for “big”
(many samples) and “wide” (many features)
datasets. CursedForest extends our previously
developed variant interpretation framework,
VariantSpark [O’Brien et al., 2015], to now offer
supervised as well as unsupervised ML algorithms
in the Spark framework. In our implementation a
Spark application runs on a “driver” node and
distributes tasks to many “worker” nodes, or
“executors”. By also utilising VariantSpark, which uses
Spark to read in and manipulate the standard
genomic variant format (VCF) directly, CursedForest
outperforms existing tools even on small datasets
where multithreading generally performs well.
Harnessing the virtually unlimited capability to
parallelise tasks, CursedForest can hence explore the
solution space faster by building a larger number of
diverse models to generate a consensus from.</p>
      <p>Using this facility, CursedForest is capable of
parallelising the split for each node in a tree thereby
handling millions of features, as required to
process whole genome sequencing data or SNP array
data with unobserved genotypes imputed [Howie et
al., 2012]. This provides the potential to
generate datasets of hundreds of thousands of
individuals with millions of variants (imputing the GWAS
catalog), highlighting the need for modern compute
paradigms in the genomics space.</p>
      <p>VariantSpark [O’Brien et al., 2015] with the
CursedForest extension therefore offers a
comprehensive analysis toolkit that can scale to future data
demands. To showcase the framework’s ability, we demonstrate a classification as well as a feature-selection task on synthetic data in the first section. In the second section, we demonstrate the ability of CursedForest to successfully replicate findings from a previous GWAS study, as well as to identify novel variants associated with bone mineral density (BMD). Thirdly, we demonstrate the scalability of CursedForest with respect to the dimensionality of the data by building a random forest model on whole-genome data from the 1000 Genomes Project [1000 Genomes Project Consortium, 2012] to predict ethnicity. Finally, given the role different parameter values can play in model construction, we explore the effect that tuning these parameters can have on the prediction accuracy of the model.</p>
    </sec>
    <sec id="sec-2">
      <title>Methods</title>
      <sec id="sec-2-1">
        <title>CursedForest</title>
        <p>As is standard with Spark applications, we store our data
in a Resilient Distributed Dataset (RDD), where an RDD
is essentially a collection of elements. In the case of
Spark ML, each element in the RDD is a sample. RDDs
contribute to the scalability of Spark as they can be
distributed across multiple nodes and operated on in parallel.
Even as we add more samples to a dataset, Spark can
simply schedule extra tasks to handle the additional items in
the RDD.</p>
        <p>However, within an RDD, Spark ML stores each
sample as a vector. Unlike RDDs, which can be partitioned
and distributed across multiple nodes, each vector must
be present in its entirety on any node accessing it. This is
no problem with typical datasets; however, as
dimensionality increases, the vectors eventually reach a size where
they can no longer fit into a single node’s memory.</p>
        <p>So in the case of adding more samples, Spark ML can simply create more tasks, keeping memory consumption within the cluster’s bounds. However, as the dimensionality of each sample grows, the memory requirements of the job increase to allow these increasingly large vectors to be loaded into memory.</p>
        <p>CursedForest, on the other hand, is specifically designed to handle wide, “cursed” data. It avoids the relation between memory and dimensionality by avoiding calculations that rely on entire feature vectors, taking the parallelisation down to the level of individual features. For each node of a tree, CursedForest distributes tasks that each consist of a single feature (variant) across all individuals. Each task calculates the information gain for that specific feature. Once these tasks have completed, the results are reduced to return the feature with the greatest information gain. This process is repeated until CursedForest has created the entire decision tree.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Scalability</title>
        <p>The current implementation of CursedForest uses the Gini impurity criterion for splitting. Let f_q be the fraction of items labeled with value q, where q = 1, …, Q, at a node.</p>
        <p>The Gini impurity is</p>
        <p>I_G(f) = Σ_{q=1}^{Q} f_q (1 − f_q),</p>
        <p>which is at a minimum when all observations at the node are in the same class.</p>
        <p>[Figure 1: (a) number of trees built per hour for a growing number of variables; (b) number of trees built per hour for a growing number of samples.]</p>
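        <p>A minimal numeric check of this formula (assuming two classes, Q = 2):</p>

```python
def gini_impurity(fractions):
    # I_G(f) = sum over q of f_q * (1 - f_q), for Q class fractions f_q.
    return sum(f * (1 - f) for f in fractions)

# All observations in one class: impurity is at its minimum, 0.
pure = gini_impurity([1.0, 0.0])

# An evenly mixed two-class node: the maximum for Q = 2, namely 0.5.
mixed = gini_impurity([0.5, 0.5])
```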
        <p>We ran Apache Spark 1.6.1 on a cluster with 12 worker nodes, each with 16 Intel Xeon E5-2660 @ 2.20 GHz CPU cores and 128 GB of memory.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Synthetic data</title>
        <p>Each dataset consists of n samples and p variables, where n ≪ p. Values for each variable are ordinal with three levels, represented as the numbers {0, 1, 2} (corresponding to an additive-effect encoding of genomic variation) and randomly generated from a uniform distribution.</p>
        <p>The model parameters are w_i = 1/√(2^(i−1)) for i = 1, …, 5, and we set</p>
        <p>z = Σ_{i=1}^{5} w_i x_i. (1)</p>
        <p>We let σ² = Var(z)(1 − θ)/θ, where θ is a parameter controlling the fraction of variance explained by the informative variables; in our study we chose θ = 0.125, as used by previous approaches. Then y = z + ε, where ε ∼ N(0, σ²). The dichotomous response is generated by thresholding y at the 0.5 quantile.</p>
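        <p>The generative model above can be sketched as follows; `make_dataset` is an illustrative name, the weights w_i = 1/√(2^(i−1)) and θ = 0.125 follow the description, and the values of n and p here are toy sizes.</p>

```python
import math
import random
import statistics

def make_dataset(n, p, theta=0.125, seed=0):
    rng = random.Random(seed)
    # Genotypes: ordinal values {0, 1, 2} drawn uniformly.
    X = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
    # Five informative variables with weights w_i = 1 / sqrt(2**(i-1)).
    w = [1 / math.sqrt(2 ** (i - 1)) for i in range(1, 6)]
    z = [sum(wi * row[i] for i, wi in enumerate(w)) for row in X]
    # Noise variance chosen so that the informative variables explain
    # a fraction theta of the variance of y.
    sigma2 = statistics.variance(z) * (1 - theta) / theta
    y_cont = [zi + rng.gauss(0, math.sqrt(sigma2)) for zi in z]
    # Dichotomise at the 0.5 quantile (the median).
    threshold = statistics.median(y_cont)
    y = [1 if v > threshold else 0 for v in y_cont]
    return X, y
```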
        <p>y =
We consider the parameter settings for the random forest
algorithm. We use the R notation from the random forest
package [Liaw and Wiener, 2002] which incorporates the
original Fortran code by Brieman and Cutler. We
incorporate the advice of [Liaw and Wiener, 2002], which we
have found mirrors our own experience:
ntree – the number of trees. The number of trees
necessary for good performance grows with the
number of predictors. [Liaw and Wiener, 2002]
suggest that a high ntree is necessary to get stable
estimates of variable importance and proximity;
however, even though the variable importance measures
may vary from run to run, we note that it is
possible for a random forest model to have a poorer fit
and still have an accurate ranking of variable
importance;
mtry – the number of variables considered at each
split (if mtry=p, we have a boosted decision tree
model). If one has a very large number of variables
but expects only very few to be “important”, using
larger mtry may give better performance;
the size and complexity of the individual trees is
controlled in random forest by setting nodesize,
the minimum size of terminal nodes. It is controlled
in Spark ML by setting maxDepth, the maximum
depth of each tree in the forest.
3</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Results</title>
      <p>In this section we explore the performance of
CursedForest in more detail by testing its ability to
scale to different sizes of data and computational
resources.</p>
      <p>In order to assess these characteristics, we run CursedForest classification on synthetic datasets with varying numbers of variables (features) and samples, similar to the dataset used in [Wright and Ziegler, 2016] to evaluate ranger, allocating varying numbers of CPU cores to CursedForest and varying the computational complexity of the random forests by using a range of mtry values.</p>
      <p>[Figure 2a: time in seconds to build 100 trees for different mtry fractions.]</p>
      <p>We investigate the different synthetic datasets generated for Section 2.3 and measure the time taken to build a random forest model of 100 trees. The results reported below are averages of 5 runs, and all cases were executed with the same random seed to improve the consistency of measurements.</p>
      <p>First we look at the horizontal scalability of CursedForest for a medium-sized dataset of 2.5 million variables and 5,000 samples, varying the mtry fraction and the number of CPU cores allocated to the execution. Regardless of the number of cores used, CursedForest displays an approximately linear dependency between execution time and mtry (Fig 2a).</p>
      <p>CursedForest scales almost linearly with the number of CPU cores for medium values of the mtry fraction, but for both lower and higher values the performance degrades slightly (Fig 2b). For lower mtry values the likely cause is communication overhead (the ratio of parallelisable computation time to inter-node communication time is lower), while for higher values it is most likely reaching the cluster’s computational capacity.</p>
      <p>Next we investigate CursedForest scalability with regard to the size of the data, by varying the number of variables and samples for a fixed mtry fraction of 0.25 and execution on 128 CPU cores. The results are visualized in Fig 1 (note the log scale on the axes; the values on the y axes are expressed in trees per hour).</p>
      <p>[Figure 2b: number of trees built per hour when using a growing number of CPU cores.]</p>
      <p>Generally, the number of trees per hour decreases with an increasing number of variables and samples. Some irregularities in the graph can be attributed to the computation-versus-communication trade-off. It is also worth noting that keeping the mtry fraction constant results in higher mtry values as the number of variables grows, and it is this, rather than the increase in dataset size itself, that drives the performance down.</p>
      <p>To conclude, CursedForest is capable of processing 60 trees per hour on a dataset with 50 million variables and 10,000 samples, which is in the size range of whole genome sequencing experiments for clinically relevant cohort sizes.</p>
      <sec id="sec-3-1">
        <title>Exploring theoretical recovery rate of wide data</title>
        <p>Donoho and Tanner [Donoho and Tanner, 2009] give a “universal phase change” result that has applications in a large number of areas, including variable selection in high dimensions. Consider Fig 3a, which shows the region where a model can recover the important variables, plotted as a function of δ = n/p and ρ = k/n (where k is the number of significant variables). There is a distinct, empirically demonstrated boundary to the region where we can reliably recover significant variables. [Donoho and Stodden, 2006] investigate the behavior of a number of regression approaches for variable selection (LARS, Lasso and forward stepwise) and make the point that above the phase-transition line variable recovery is still possible by a combinatorial approach.</p>
        <p>It is not surprising that it is more difficult to recover the signal variables in the upper-left area of the figure, as the problem is both under-determined and sparse. What is surprising is the connection with arguments from combinatorial geometry. This suggests that we are seeing a universal rule rather than an implementation issue. As CursedForest is designed for extremely large numbers of variables, it is likely to be operating in the difficult regions of the figure where the ratio δ = n/p is small.</p>
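        <p>For intuition about where wide genomic data sits in this diagram, the two ratios can be computed directly; the values of n, p, and k below are illustrative, in the range discussed in this paper.</p>

```python
# Donoho-Tanner coordinates: delta = n/p (undersampling ratio) and
# rho = k/n (sparsity), with k the number of significant variables.
def dt_coordinates(n, p, k):
    return n / p, k / n

# e.g. 5,000 samples, 2.5 million variants, 5 informative variables
delta, rho = dt_coordinates(n=5000, p=2500000, k=5)
# delta = 0.002: deep in the small-delta region of the phase diagram
```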
        <p>We note several things here: the Donoho-Tanner phase transition arises in recovering the coefficients of data generated by a linear model. In a decision tree (random forest), however, there is no notion of estimating such coefficients; a decision tree (random forest) is a heuristic search, and it may recover a relationship in the space of the combinatorial search.</p>
        <p>The existence of the Donoho-Tanner phase transition is a salutary warning: there are likely to be limits, both computational and logical, to the recovery of signals from noisy data. CursedForest is a contribution to addressing the practical limits, but the logical limits will still apply. However, in the case of data that is both big and wide, CursedForest and the other VariantSpark methods may provide a useful tool.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Biological data</title>
        <p>We apply CursedForest to two biological datasets: firstly, the 1000 Genomes Project dataset, to test its classification accuracy, and secondly, a bone mineral density dataset, to demonstrate a GWAS-style analysis. We train CursedForest on the 1000 Genomes dataset, which consists of 2,504 samples with 81,047,467 features each, to predict ethnicity from genomic profiles. CursedForest achieves an out-of-bag error of 0.01 and completes in 36 minutes 54 seconds, demonstrating its capability to run on population-scale cohorts in real-world applications. Next we perform feature selection on over 7.2 million genomic variants and identify the locations associated with bone mineral density (BMD) in a previously published GWAS dataset [Duncan et al., 2011]. We faithfully recover 5 known BMD genes that were previously identified in GWAS studies, and also find two probable new associations that were previously only suggestive. This demonstrates the utility of our approach, as well as its ability to amplify signal by taking SNP interactions into account rather than limiting the analysis to individual strong responders.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Conclusion</title>
      <p>We have demonstrated that using a different parallelisation model can extend random forests to the case of an extremely large number of variables. We have treated the case of variable selection in a p ≫ n model, where most of the variables are uninformative, and have demonstrated the utility of the model for large GWAS datasets by comparing this implementation to other implementations, including those optimized for large datasets.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1000 Genomes Project Consortium,
          <year>2012</year>
          ] 1000 Genomes Project Consortium.
          <article-title>An integrated map of genetic variation from 1,092 human genomes</article-title>
          .
          <source>Nature</source>
          ,
          <volume>491</volume>
          (
          <issue>7422</issue>
          ):
          <fpage>56</fpage>
          -
          <lpage>65</lpage>
          ,
          <year>November 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Abuzaid et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Firas</given-names>
            <surname>Abuzaid</surname>
          </string-name>
          , Joseph K Bradley, Feynman T Liang, Andrew Feng, Lee Yang, Matei Zaharia, and Ameet S Talwalkar.
          <article-title>Yggdrasil: An optimized system for training deep decision trees at scale</article-title>
          . In D. D.
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sugiyama</surname>
            ,
            <given-names>U. V.</given-names>
          </string-name>
          <string-name>
            <surname>Luxburg</surname>
            ,
            <given-names>I. Guyon</given-names>
          </string-name>
          , and R. Garnett, editors,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>29</volume>
          , pages
          <fpage>3817</fpage>
          -
          <lpage>3825</lpage>
          . Curran Associates, Inc.,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Bauer et al.,
          <year>2014</year>
          ]
          <string-name>
            <given-names>Denis C.</given-names>
            <surname>Bauer</surname>
          </string-name>
          , Clara Gaff, Marcel E. Dinger, Melody Caramins, Fabian A.
          <string-name>
            <surname>Buske</surname>
            ,
            <given-names>Michael</given-names>
          </string-name>
          <string-name>
            <surname>Fenech</surname>
            , David Hansen,
            <given-names>and Lynne</given-names>
          </string-name>
          <string-name>
            <surname>Cobiac</surname>
          </string-name>
          .
          <article-title>Genomics and personalised whole-of-life healthcare</article-title>
          .
          <source>Trends in Molecular Medicine</source>
          ,
          <volume>20</volume>
          (
          <issue>9</issue>
          ):
          <fpage>479</fpage>
          -
          <lpage>486</lpage>
          ,
          <year>September 2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Donoho and Stodden, 2006]
          <string-name>
            <given-names>David</given-names>
            <surname>Donoho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Victoria</given-names>
            <surname>Stodden</surname>
          </string-name>
          .
          <article-title>Breakdown point of model selection when the number of variables exceeds the number of observations</article-title>
          .
          <source>In The 2006 IEEE International Joint Conference on Neural Network Proceedings</source>
          , pages
          <fpage>1916</fpage>
          -
          <lpage>1921</lpage>
          . IEEE,
          <year>2006</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Donoho and Tanner, 2009]
          <string-name>
            <given-names>David</given-names>
            <surname>Donoho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jared</given-names>
            <surname>Tanner</surname>
          </string-name>
          .
          <article-title>Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing</article-title>
          .
          <source>Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences</source>
          ,
          <volume>367</volume>
          (
          <year>1906</year>
          ):
          <fpage>4273</fpage>
          -
          <lpage>4293</lpage>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Duncan et al.,
          <year>2011</year>
          ] Emma L. Duncan, Patrick Danoy, John P. Kemp,
          <string-name>
            <given-names>Paul J.</given-names>
            <surname>Leo</surname>
          </string-name>
          ,
          <string-name>
            <surname>Eugene</surname>
            <given-names>McCloskey</given-names>
          </string-name>
          , Geoffrey C. Nicholson, Richard Eastell, Richard L. Prince, John A. Eisman, Graeme Jones,
          <string-name>
            <given-names>Philip N.</given-names>
            <surname>Sambrook</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ian R. Reid</surname>
          </string-name>
          ,
          <string-name>
            <surname>Elaine M. Dennison</surname>
            , John Wark,
            <given-names>J. Brent</given-names>
          </string-name>
          <string-name>
            <surname>Richards</surname>
            , Andre G. Uitterlinden,
            <given-names>Tim D.</given-names>
          </string-name>
          <string-name>
            <surname>Spector</surname>
            , Chris Esapa,
            <given-names>Roger D.</given-names>
          </string-name>
          <string-name>
            <surname>Cox</surname>
          </string-name>
          ,
          <string-name>
            <surname>Steve D. M. Brown</surname>
          </string-name>
          , Rajesh V.
          <article-title>Thakker, Kathryn A. Addison, Linda A</article-title>
          .
          <string-name>
            <surname>Bradbury</surname>
            , Jacqueline R. Center, Cyrus Cooper, Catherine Cremin, Karol Estrada, Dieter Felsenberg, ClausC. Gler, Johanna Hadler,
            <given-names>Margaret J</given-names>
          </string-name>
          . Henry,
          <string-name>
            <surname>Albert Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Mark A.</given-names>
            <surname>Kotowicz</surname>
          </string-name>
          , Joanna Makovey, Sing C. Nguyen,
          <string-name>
            <surname>Tuan</surname>
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Nguyen</surname>
          </string-name>
          , Julie A.
          <string-name>
            <surname>Pasco</surname>
            , Karena Pryce,
            <given-names>David M.</given-names>
          </string-name>
          <string-name>
            <surname>Reid</surname>
            , Fernando Rivadeneira, Christian Roux, Kari Stefansson, Unnur Styrkarsdottir, Gudmar Thorleifsson, Rumbidzai Tichawangana,
            <given-names>David M.</given-names>
          </string-name>
          <string-name>
            <surname>Evans</surname>
          </string-name>
          , and
          <string-name>
            <surname>Matthew</surname>
            <given-names>A. Brown.</given-names>
          </string-name>
          <article-title>Genomewide association study using extreme truncate selection identifies novel genes affecting bone mineral density and fracture risk</article-title>
          .
          <source>PLoS Genetics</source>
          ,
          <volume>7</volume>
          (
          <issue>4</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          ,
          <year>April 2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Howie et al.,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Bryan</given-names>
            <surname>Howie</surname>
          </string-name>
          , Christian Fuchsberger, Matthew Stephens, Jonathan Marchini, and Gonçalo
          <string-name>
            <given-names>R.</given-names>
            <surname>Abecasis</surname>
          </string-name>
          .
          <article-title>Fast and accurate genotype imputation in genome-wide association studies through pre-phasing</article-title>
          .
          <source>Nature Genetics</source>
          ,
          <volume>44</volume>
          (
          <issue>8</issue>
          ):
          <fpage>955</fpage>
          -
          <lpage>959</lpage>
          ,
          <year>August 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Liaw and Wiener, 2002]
          <string-name>
            <given-names>Andy</given-names>
            <surname>Liaw</surname>
          </string-name>
          and
          <string-name>
            <given-names>Matthew</given-names>
            <surname>Wiener</surname>
          </string-name>
          .
          <article-title>Classification and regression by randomforest</article-title>
          .
          <source>R News</source>
          ,
          <volume>2</volume>
          (
          <issue>3</issue>
          ):
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Loebbecke and Picot, 2015]
          <string-name>
            <given-names>Claudia</given-names>
            <surname>Loebbecke</surname>
          </string-name>
          and
          <string-name>
            <given-names>Arnold</given-names>
            <surname>Picot</surname>
          </string-name>
          .
          <article-title>Reflections on societal and business model transformation arising from digitization and big data analytics: A research agenda</article-title>
          .
          <source>The Journal of Strategic Information Systems</source>
          ,
          <volume>24</volume>
          (
          <issue>3</issue>
          ):
          <fpage>149</fpage>
          -
          <lpage>157</lpage>
          ,
          <year>September 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [O'Brien et al.,
          <year>2015</year>
          ]
          <string-name>
            <surname>Aidan R. O'Brien</surname>
            ,
            <given-names>Neil F. W.</given-names>
          </string-name>
          <string-name>
            <surname>Saunders</surname>
          </string-name>
          , Yi Guo, Fabian A.
          <string-name>
            <surname>Buske</surname>
            ,
            <given-names>Rodney J.</given-names>
          </string-name>
          <string-name>
            <surname>Scott</surname>
          </string-name>
          , and
          <string-name>
            <surname>Denis</surname>
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Bauer</surname>
          </string-name>
          .
          <article-title>Variantspark: population scale clustering of genotype information</article-title>
          .
          <source>BMC Genomics</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ),
          <year>December 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Wright and Ziegler, 2016]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Wright</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          .
          <article-title>ranger: A fast implementation of random forests for high dimensional data in C++ and R</article-title>
          .
          <source>Journal of Statistical Software</source>
          ,
          <year>2016</year>
          . in press.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [Wright et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Marvin N.</given-names>
            <surname>Wright</surname>
          </string-name>
          , Andreas Ziegler, and
          <string-name>
            <surname>Inke R. König</surname>
          </string-name>
          .
          <article-title>Do little interactions get lost in dark random forests?</article-title>
          <source>BMC Bioinformatics</source>
          ,
          <volume>17</volume>
          (
          <issue>1</issue>
          ):
          <fpage>145</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>