                 Understanding Data Science:
       An Emerging Discipline for Data-Intensive Discovery
                                              © Michael L. Brodie
                                                  CSAIL, MIT
                                             Cambridge, MA, USA
                                              mlbrodie@csail.mit.edu

                       Abstract

     Over the past two decades, Data-Intensive Analysis has emerged not only as a basis for the Fourth Paradigm of engineering and scientific discovery but as a basis for discovery in most human endeavors for which data is available. Originating in the 1960s, its recent emergence due to Big Data and massive computing power is leading to widespread deployment, yet it remains in its infancy in its application and our understanding of it, and hence in its development. Given the potential risks and rewards of Data-Intensive Analysis and its breadth of application, it is imperative that we get this right.
     The objective of this emerging Fourth Paradigm is more than acquiring data and extracting knowledge. Like its predecessor, the scientific method, the objective of the Fourth Paradigm is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge. In addition, Data Science is a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. It is now time to identify and understand the fundamentals. In my research, I have analyzed more than 30 very large-scale use cases to understand current practical aspects, to gain insight into the fundamentals, and to address the fourth "V" of Big Data – veracity – the accuracy of the data and of the resulting analytics. This development may take decades.

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015), Obninsk, Russia, October 13-16, 2015

1 Data Science: A New Discovery Paradigm That Will Transform Our World

1.1 Introduction

Over the past two decades, Data-Intensive Analysis (also called Big Data Analytics) has emerged not only as a basis for the Fourth Paradigm [8] of engineering and scientific discovery but more broadly as a basis for discovery in most human endeavours for which data is available. The roots of Data-Intensive Analysis (DIA) that have led to its recent dramatic growth include Big Data (c. 2000) that, just emerging, is opening the door to profound change – to new ways of reasoning, problem solving, and processing that in turn bring new opportunities and challenges.
     To better understand DIA and its opportunities and challenges, I examined over 30 DIA use cases that are at very large scale – in the range where theory and practice may break. This paper summarizes some key results of my research related to understanding and defining Data Science as a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. As with its predecessor discovery paradigms, establishing this emerging Fourth Paradigm and the underlying principles and techniques of Data Science may take decades.

1.2 Significance of DIA and Data Science

Data Science is transforming discovery in many human endeavours including healthcare, manufacturing, education, financial modelling, policing, and marketing [10][13]. It has been used to produce significant results in areas ranging from particle physics (e.g., the Higgs boson), to identifying and resolving sleep disorders using Fitbit data, to recommenders for literature, theatre, and shopping. More than 50 national governments have established data-driven strategies as official policy in science and engineering [2] as well as in healthcare, e.g., the US National Institutes of Health and President Obama's Precision Medicine Initiative [15] for "Delivering the right treatments, at the right time, every time to the right person." The hope, supported by early results, is that data-driven techniques will accelerate the discovery of treatments that manage and prevent chronic diseases with more precision, that are tailored to specific individuals, and that come at dramatically lower cost.
     Data Science is being used to radically transform entire domains, such as medicine and biomedical research, as stated in the purpose of the newly created Center for Biomedical Informatics at the Harvard Medical School. It is also making an impact in economics [16], drug discovery [18], and many other domains. As a result of its successes and potential, Data Science is rapidly becoming a sub-discipline of most academic areas. These developments suggest a strong belief in the potential value of Data Science – but can it deliver?
     Early successes and clearly stated expectations of Data Science are truly remarkable; however, its actual deployment, like many hot trends, is far less than it appears.
According to Gartner's 2015 survey of Big Data Management and Analytics, 60% of the Fortune 500 claim to have deployed Data Science, less than 20% have implemented consequent significant changes, and less than 1% have optimized its benefits. Gartner concludes that 85% will be unable to exploit Big Data in 2015. The vast majority of deployments address tactical aspects of existing processes and static business intelligence rather than realizing its power by identifying strategic advantages through discovering previously unforeseen value.

1.3 Illustrious Histories: The Origins of Data Science

Data Science is in its infancy. Few individuals or organizations understand the potential of, and the paradigm shift associated with, Data Science, let alone understand it conceptually. The high rewards, the equally high risks, and its pervasive application make it imperative that we better understand Data Science – its models, methods, processes, and results.
     Data Science is inherently multi-disciplinary, drawing on over 30 allied disciplines according to some definitions. Its principal components include mathematics, statistics, and computer science, especially areas such as AI (e.g., machine learning), data management, and high-performance computing. While these disciplines need to be evaluated in the new paradigm, they have long, illustrious histories. Data analysis developed over 4,000 years ago, with origins in Babylon (17th-12th C BCE) and India (12th C BCE). Mathematical analysis originated in the 17th C around the time of the Scientific Revolution. While statistics has its roots in the 5th C BCE and the 18th C, its application in Data Science originated in 1962 with John W. Tukey [20] and George Box [4]. These long, illustrious histories suggest that Data Science draws on well-established results that took decades or centuries to develop. To what extent do they (e.g., statistical significance) apply in this paradigmatically new context?
     Data Science constitutes a new paradigm in the sense of Kuhn's scientific revolutions [12]. Data Science's predecessor paradigm, the Scientific Method, has approximately 2,000 years of development of empiricism starting with Aristotle (384-322 BCE), Ptolemy (1st C), and the Bacons (13th, 16th C). Data Science, a primary basis of eScience [8], collectively termed the Fourth Paradigm, is emerging following the ~1,000-year development of its three predecessor paradigms of scientific and engineering discovery: theory, experimentation, and simulation [8]. Data Science, which has developed and been applied for over 50 years, changed qualitatively in the late 20th century with the emergence of Big Data, typically defined as data at volumes, velocities, and variety that current technologies, let alone humans, cannot handle efficiently. This paper addresses another characteristic that current technologies and theories do not handle well: veracity.

1.4 What Could Possibly Go Wrong?

Do we understand the risks of recommending the wrong film, the wrong product, or the wrong medical diagnoses, treatments, or drugs? The minimal apparent risk of a result that fails to achieve its objectives when acted upon includes losses in time, resources, customer satisfaction, and customers, and potentially a loss of business. The vast majority of Data Science applications face such small risks; hence veracity has received little attention. Far greater risks could be incurred if incorrect Data Science results are acted upon in critical contexts, such as those already underway in drug discovery [18] and personalized medicine. Most scientists in these contexts are well aware of the risks of errors and hence go to extremes to estimate and minimize them. The wonder of the CERN ATLAS and CMS projects' "discovery" of the Higgs boson, announced July 4, 2012 with a confidence of 5 sigma, might suggest that the results were achieved overnight. They were not. They took 40 years and included Data Science techniques developed over a decade applied over Big Data by two independent projects, ATLAS and CMS, each of which was subsequently peer reviewed and published [1][11], with a further year-long verification that established a confidence of 10 sigma. To what extent do the vast majority of Data Science applications concern themselves with verification and error bounds, let alone understand the verification methods applied at CERN? Informal surveys of data scientists conducted in this study at Data Science conferences suggest that 80% of customers never ask for error bounds.
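To make the significance conventions above concrete, the following minimal sketch converts a z-score into the corresponding one-sided p-value using only the normal distribution; the thresholds shown are standard conventions, and the snippet is an illustration rather than a description of how the ATLAS and CMS analyses computed their confidence.

# Minimal illustration: what "5 sigma" means as a one-sided Gaussian p-value.
# Standard library only; assumes the usual normal-tail convention for
# discovery significance, not the actual CERN statistical procedure.
from math import erfc, sqrt

def one_sided_p_value(z_sigma: float) -> float:
    """Probability of a fluctuation at least z_sigma standard deviations
    above the background-only expectation, under a normal approximation."""
    return 0.5 * erfc(z_sigma / sqrt(2.0))

for z in (2.0, 3.0, 5.0):
    print(f"{z:.0f} sigma -> p = {one_sided_p_value(z):.2e}")
# 2 sigma -> p = 2.28e-02  (roughly the 95% confidence common in business analytics)
# 3 sigma -> p = 1.35e-03  ("evidence" by particle-physics convention)
# 5 sigma -> p = 2.87e-07  (the discovery threshold cited above)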
     The existential risks of applying Data Science have been raised by world-leading authorities such as the Organization for Economic Cooperation and Development and the AI [3][7][9][19] and legal [5] communities, with the most extreme concerns stated by the Future of Life Institute, whose objective is safeguarding life and developing optimistic visions of the future in order to mitigate existential risks facing humanity from AI.
     Given the potential risks and rewards of DIA and its breadth of application across conventional, empirical scientific and engineering domains as well as across most human endeavors, we had better get this right! The scientific and engineering communities place high confidence in their existing discovery paradigms, with well-defined measures of likelihood and confidence within relatively precise error estimates (even after 1,000 years, serious issues persist, e.g., P values (significance) and reproducibility). Can we say the same for modern Data Science as a discovery paradigm and for its results? A simple observation of the formal development of the processes and methods of its predecessors suggests that we cannot. Indeed, we do not know if or under what conditions the constituent disciplines, like statistics, may break down.
     Do we understand DIA to the extent that we can assign probabilistic measures of likelihood to its results?
With the scale and emerging nature of DIA-based discovery, how do we estimate the correctness and completeness of analytical results relative to a hypothesized discovery question when the underlying principles and techniques may not apply in this new context?
     In summary, we do not yet understand DIA well enough to quantify the probability or likelihood that a projected outcome will occur within estimated error bounds. While CERN used Data Science and Big Data to identify results, verification was ultimately empirical, as it must be in drug discovery [18] and other critical areas, until analytical techniques are developed and proven robust.

1.5 Do We Understand Data Science?

Do we even understand what Data Science methods compute or how they work? Human thought is limited by the human mind. According to Miller's Law [14], the human mind (short-term working memory) is capable of conceiving fewer than ten (7 +/- 2) concepts at one time. Hence, humans have difficulty understanding complex models involving more than ten variables. The conventional process is to imagine a small number of variables (physical science PhDs typically involve fewer than 5 variables) and then abstract or encapsulate that knowledge into a model that can subsequently be augmented with more variables. Thus most scientific theories develop slowly over time into complex models. For example, Newton's model of particle physics was extended for 350 years through Bohr, Heisenberg, Einstein, and others, up to Glashow, Salam, and Weinberg, to form the Standard Model of Particle Physics. Scientific discovery in particle physics is wonderful and has taken over 350 years. Due to its complexity, no physicist has understood the entire Standard Model for decades; rather, it is represented in complex computational models.
     When humans analyse a problem, they do so with models with a limited number of variables. As the number of variables increases, it is increasingly difficult to understand the model and the potential combinations and correlations. Hence, humans limit their models and analyses to those that they can comprehend. These human-scale models are typically theory-driven, thus limiting their scale (number of variables) to what can be conceived.
     What if the phenomenon is arbitrarily complex or beyond immediate human conception? I suspect that this is addressed iteratively, with one model (theory) becoming abstracted as the base for another, more complex theory, and so on (standing on the shoulders of those who have gone before), e.g., the development of quantum physics from elementary particles. That is, once the human mind understands a model, it can form the basis of a more complex model. This development under the scientific method scales at a rate limited by human conception, thus limiting the number of variables and the complexity. It is also error-prone, since phenomena may not manifest at a certain level of complexity; hence models correct at one scale may be wrong at a larger scale or, vice versa, a model wrong at one scale (hence discarded) may become correct at a higher scale (in a more complex model).
     Machine learning algorithms can identify correlations between thousands, millions, or even billions of variables. This suggests that it is difficult, if not impossible, for humans to understand what or how these algorithms discover. Imagine trying to understand such a model that results from selecting some subset of the correlations on the assumption that they may be causal and thus constitute a model of the phenomenon with high confidence of being correct with respect to some hypotheses, with or without error bars.
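A back-of-the-envelope calculation shows why correlation structures at this scale exceed human inspection and why error bounds matter: the number of candidate pairwise correlations grows quadratically with the number of variables, and at a naive significance threshold a very large number of "discoveries" among unrelated variables are expected by chance alone. The variable counts and the 5% threshold below are illustrative assumptions, not figures from the use cases.

# Illustrative arithmetic only: candidate pairwise correlations among p variables,
# and how many would pass an uncorrected 5% significance test purely by chance
# if the variables were actually unrelated.
ALPHA = 0.05  # assumed naive per-test threshold

for p in (10, 1_000, 1_000_000):
    pairs = p * (p - 1) // 2          # distinct variable pairs to examine
    spurious = ALPHA * pairs          # expected false "significant" correlations
    print(f"{p:>9,} variables -> {pairs:>15,} pairs, ~{spurious:,.0f} spurious hits")
# Ten variables fit human working memory; a million variables yield ~5e11 pairs,
# which is why multiple-comparison corrections, error bounds, and verification
# cannot be an afterthought at this scale.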
1.6 Cornerstone of a New Discovery Paradigm

The Fourth Paradigm – eScience supported by Data Science – is paradigmatically different from its predecessor discovery paradigms. It provides revolutionary new ways [12] of thinking, reasoning, and processing – new modes of inquiry, problem solving, and decision-making. It is not the Third Paradigm augmented by Big Data, but something profoundly different. Losing sight of this difference forfeits its power and benefits and loses the perspective that it is A Revolution That Will Transform How We Live, Work, and Think [13].
     Paradigm shifts are difficult to notice as they emerge, just as the proverbial frog does not notice that its hot bath is becoming lethal. There are several ways to describe the shift. There is a shift of resources from (empirically) discovering causality (Why the phenomenon occurs) – the heart of the Scientific Method – to discovering interesting correlations (What might have occurred). This shift involves moving from a strategic perspective driven by human-generated hypotheses (theory-driven, top-down) to a tactical perspective driven by observations (data-driven, bottom-up).
     Seen at their extremes, the Scientific Method involves testing hypotheses (theories) posed by scientists, while Data Science can be used to generate hypotheses to be tested based on significant correlations amongst variables that are identified algorithmically in the data. In principle, vast amounts of data and computing power can be used to accelerate discovery simply by outpacing human thinking in both power and complexity. The power of Data Science is growing rapidly due to the development of ever more powerful computing resources and algorithms, such as deep learning. So rather than optimizing an existing process, Data Science can be used to identify patterns that suggest unforeseen solutions, thus automating serendipity – the moment when a human observes an anomaly that stimulates a bright idea to resolve it.
     However, even more compelling is one step beyond the simple version of this shift, namely a symbiosis of both paradigms.
For example, Data Science can be used to offer highly probable hypotheses or correlations, from which we select those with acceptable error estimates that are worthy of subsequent empirical analysis. In turn, empiricism is used to pursue these hypotheses until some converge and some diverge, at which point Data Science can be applied to refine or confirm the converging hypotheses, having discarded the divergent hypotheses, and the cycle starts again. Ideally, one would optimize the combination of theory-driven empirical analysis with data-driven analysis to accelerate discovery faster than either could on its own.
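The alternating cycle just described can be read as a simple control loop. The sketch below makes the bookkeeping of that loop explicit under assumed interfaces; generate_hypotheses, error_estimate, and empirical_test are hypothetical placeholders for the data-driven, veracity, and empirical steps respectively, not components defined in this paper.

# Schematic sketch of the theory-driven / data-driven symbiosis described above.
# The three callables are assumed placeholders: data-driven hypothesis generation,
# an error-bound estimator, and empirical (experimental) follow-up.
def discovery_cycle(data, generate_hypotheses, error_estimate, empirical_test,
                    max_error=0.05, rounds=10):
    surviving = []
    for _ in range(rounds):
        # Data-driven step: propose correlations/hypotheses from the data.
        candidates = generate_hypotheses(data, prior=surviving)
        # Veracity gate: keep only hypotheses with acceptable error estimates.
        worthy = [h for h in candidates if error_estimate(h, data) <= max_error]
        # Theory-driven step: pursue the worthy hypotheses empirically;
        # converging hypotheses seed the next data-driven refinement round.
        surviving = [h for h in worthy if empirical_test(h)]
    return surviving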
     While Data Science is a cornerstone of a new discovery paradigm, it may be conceptually and methodologically more challenging than its predecessors, since it involves everything included in its predecessor paradigms – modelling, methods, processes, measures of correctness, completeness, and efficiency – in a much more complex context, namely that of Big Data. Following well-established developments, we should try to find the fundamentals of Data Science – its principles and techniques – to help manage the complexity and guide its understanding and application.

2 Data Science: A Perspective

Since Data Science is in its infancy and is inherently multi-disciplinary, there are naturally many definitions of Data Science that should emerge and evolve with the discipline. As definitions serve many purposes, it is reasonable to have multiple definitions, each serving different purposes. Most Data Science definitions attempt to define Why (its purpose), What (constituent disciplines), and How (constituent actions of discovery workflows).
     A common definition of Data Science is the activity of extracting knowledge from data (Wikipedia.com). While simple, it does not convey the larger goal of Data Science or its consequent challenges. A DIA activity is far more than a collection of actions or the mechanical processes of acquiring and analyzing data. Like its predecessor paradigm, the Scientific Method, the purpose of Data Science and of a DIA activity is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge – continually evolving our current understanding of the phenomena based on newly available data. We seldom start from scratch, clearly the simplest case. Hence, discovering, understanding, and integrating data must precede extracting knowledge, all at massive scale, i.e., largely by automated means.
     The Scientific Method that underlies the Third Paradigm is a body of principles and techniques that provide the formal and practical bases of scientific and engineering discovery. The principles and techniques have been developed over hundreds of years, originating with Plato, and are still evolving today, with significant unresolved issues such as statistical significance (i.e., P values) and reproducibility.
     While Data Science had its origins 50 years ago with Tukey [20] and Box [4], it started to change qualitatively less than two decades ago with the emergence of Big Data and the consequent paradigm shift described above. The focus of this research into modern Data Science is on veracity – the ability to estimate the correctness, completeness, and efficiency of an end-to-end DIA activity and of its results. Hence, I use the following definition, which is in the spirit of [17].

  Data Science is a body of principles and techniques for applying data-intensive analysis to investigate phenomena, acquire new knowledge, and correct and integrate previous knowledge with measures of correctness, completeness, and efficiency of the derived results with respect to some pre-defined (top down) or emergent (bottom up) specification (scope, question, hypothesis).

3 Understanding Data Science From Practice

3.1 Methodology to Better Understand DIA

Driven by a passion for understanding Data Science in practice, my year-long and ongoing research study has investigated over 30 very large-scale Big Data applications, most of which have produced, or are daily producing, significant value. The use cases include particle physics; astrophysics and satellite imagery; oceanography; economics; information services; several life sciences applications in pharmaceuticals, drug discovery, and genetics; and various areas of medicine including precision medicine, hospital studies, clinical trials, and intensive care unit and emergency room medicine.
     The focus is to investigate relatively well-understood, successful use cases where correctness is critical and the Big Data context is at massive scale; such use cases constitute less than 5% of all deployed Big Data analytics. The focus was on these use cases because we do not know where errors may arise outside normal scientific and analytical errors. There is a greater likelihood that established disciplines, e.g., statistics and data management, might break at very large scale, where errors due to failed fundamentals may be more obvious.
     The breadth and depth of the use cases revealed strong, significant emerging trends, some of which are listed below. These confirmed for some use case owners, and suggested to others, solutions and directions that they were pursuing but could not have seen without the perspective of 30+ use cases.

3.2 DIA Processes

A Data-Intensive Activity is an analytical process that consists of applying sophisticated analytical methods to large data sets that are stored under some analytical models. While this is the typical view of Data Science projects or DIA use cases, this analytical component of the DIA activity constitutes ~20% of an end-to-end DIA pipeline or workflow. Currently it consumes ~20% of the resources required to complete a DIA analysis.
     An end-to-end DIA activity involves two data management processes that precede the DIA process, namely Raw Data Acquisition and Curation, and Analytical Data Acquisition. Raw Data Acquisition and Curation starts with discovering and understanding data in data sources and ends with integrating and storing curated data in a repository that represents entities in the domain of interest, together with metadata about those entities with which to make specific interpretations, and that is shared by a community of users. Analytical Data Acquisition starts with discovering and understanding data within the shared repository and ends with storing the resulting information, specific entities and interpretations, into an analytical model to be used by the subsequent DIA process.
     Sophisticated algorithms such as machine learning largely automate DIA processes, as they have to be automated to process such large volumes of data using complex algorithms. Currently, the Raw Data Acquisition and Curation and Analytical Data Acquisition processes are far less automated, typically requiring 80% or more of the total resources to complete.
     This understanding leads to the following definitions.

  Data-Intensive Discovery (DID) is the activity of using Big Data to investigate phenomena, to acquire new knowledge, and to correct and integrate previous knowledge.

"-Intensive" is added when the data is "at scale". Theory-driven DID is the investigation of human-generated scientific, engineering, or other hypotheses over Big Data. Data-Driven DID employs automatic hypothesis generation.

  Data-Intensive Analysis is the process of analyzing Big Data with analytical methods and models.

     DID goes beyond the Third Paradigm of scientific or engineering discovery by investigating scientific or engineering hypotheses using DIA. A DIA activity is an experiment over data, thus requiring all aspects of a scientific experiment, e.g., experimental design, expressed over data, a.k.a. data-based empiricism.

  A DIA Process (workflow or pipeline) is a sequence of operations that constitute an end-to-end DIA activity from the source data to the quantified, qualified result.

Currently, ~80% of the effort and resources required for the entire DIA activity are due to the two data management processes – areas where scientists and analysts are not experts. Emerging technology, such as for data curation at scale, aims to flip that ratio from 80:20 to 20:80 so as to let scientists do science, analysts do analysis, etc. This requires an understanding of the data management processes and of their correctness, completeness, and efficiency, in addition to those of the DIA process. Another obvious consequence is that, proportionally, 80% of the errors that could arise in DIA may arise in the data management processes, prior to DIA even starting.
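To make the end-to-end structure concrete, here is a minimal skeleton of a DIA process as just defined, with the two data management stages preceding the analysis stage and a crude per-stage effort record. The stage functions are hypothetical placeholders, and the skeleton reflects only the description above, not the implementation of any studied use case.

# Minimal sketch of an end-to-end DIA process (pipeline) with placeholder stages.
# It mirrors the three stages described above and records effort per stage,
# since the two data management stages currently dominate (~80%) the cost.
from dataclasses import dataclass, field

@dataclass
class DIAResult:
    findings: object                  # the analytical result (model, correlations, ...)
    error_bounds: object              # veracity estimate attached to the result
    effort: dict = field(default_factory=dict)   # resources consumed per stage

def dia_process(sources, curate, acquire_for_analysis, analyze):
    effort = {}
    # Stage 1: Raw Data Acquisition and Curation -> shared curated repository.
    repository, effort["raw_acquisition_and_curation"] = curate(sources)
    # Stage 2: Analytical Data Acquisition -> data staged into an analytical model.
    analytical_model, effort["analytical_data_acquisition"] = acquire_for_analysis(repository)
    # Stage 3: Data-Intensive Analysis proper (today ~20% of total resources).
    findings, bounds, effort["analysis"] = analyze(analytical_model)
    return DIAResult(findings, bounds, effort)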
3.3 Characteristics of Large-Scale DIA Use Cases

The focus of my research is successful, very large-scale, multi-year projects, many with 100s to 1,000s of ongoing DIA activities. These activities are supported by a DIA ecosystem consisting of a community of users (e.g., over 5,000 scientists in the ATLAS and CMS projects at CERN, and similar numbers of scientists using the worldwide Cancer Genome Atlas) and technology (e.g., science gateways – there are over 60 large-scale scientific gateways, such as The Cancer Genome Atlas and CERN's Worldwide LHC Computing Grid), collectively referred to in some branches of science as networked science. Some significant trends that have emerged from the analysis of these use cases are listed, briefly, below.
     The typical view of Data Science appears to be based on the vast majority (~95%) of DIA use cases. While they share some characteristics with those in this study, there are fundamental differences, such as the concern for and due diligence associated with veracity mentioned above.
     Based on this study, data analysis appears to fall into three classes. Conventional data analysis over "small data" accounts for at least 95% of all data analysis, often using Microsoft Excel. DIA over Big Data has two sub-classes: simple DIA, i.e., the vast majority of DIA use cases mentioned above, and complex DIA, such as the use cases analyzed in this study, which are characterized by complex analytical models (e.g., sub-models of the Standard Model of Physics, economic models, an organizational model for enterprises worldwide, and models for genetics and epigenetics) and a corresponding plethora of analytical methods (e.g., the vast method libraries in CERN's ROOT framework). The models and methods are as complex as the phenomena being analyzed.
     The most widely used DIA tools for simple cases claim to support analyst self-service in point-and-click environments, some claiming "point us at the data and we will find the patterns of interest for you". This characteristic is infeasible in the use cases analyzed. A requirement common to the use cases analyzed is not only the principle of being machine-driven and human-guided, i.e., a man-machine symbiosis, but extensive attempts to optimize this symbiosis for scale, cost, and precision (too much human-in-the-loop leads to errors, too little leads to nonsense).
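One common way to operationalize that machine-driven, human-guided balance is a confidence gate that routes only low-confidence cases to people, within a limited review budget. The sketch below is a generic illustration under assumed thresholds and record fields; it is not a mechanism reported by the use cases, each of which tunes the symbiosis to its own scale, cost, and precision targets.

# Generic sketch of a human-in-the-loop gate: the machine handles confident
# cases, scarce human attention goes to the least confident ones.
# The 0.9 threshold and the review budget are illustrative assumptions.
def triage(records, classify, confidence_threshold=0.9, human_budget=100):
    auto_accepted, uncertain = [], []
    for record in records:
        label, confidence = classify(record)            # machine-driven step
        if confidence >= confidence_threshold:
            auto_accepted.append((record, label))
        else:
            uncertain.append((record, label, confidence))
    # Human-guided step: review the least confident cases first, up to the budget
    # (too much human-in-the-loop is costly and error-prone, too little is nonsense).
    uncertain.sort(key=lambda item: item[2])
    return auto_accepted, uncertain[:human_budget], uncertain[human_budget:]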
     DIA ecosystems are inherently multi-disciplinary (ideally interdisciplinary), collaborative, and iterative.
Not only does DIA (Big Data Analytics) require multiple disciplines, e.g., genetics, statistics, and machine learning; so too do the data management processes require multiple disciplines, e.g., data management, domain, and machine learning experts for data curation, statisticians for sampling, etc.
     In large-scale DIA ecosystems, a DIA is a virtual experiment [6]. Far from claims of simplicity and point-and-click self-service, most large-scale DIA activities reflect the complexity of the analysis at hand and are the result of long-term (months to years) experimental designs that involve greater complexity than their empirical counterparts, to deal with scale, significance, hypotheses, null hypotheses, and deeper challenges such as determining causality from correlations and identifying and dealing with biases and often irrational human intervention.
     Finally, veracity is one of the most significant challenges and critical requirements of all the DIA ecosystems studied. While there are many complex methods in conventional Data Science to estimate veracity, most owners of the use cases studied expressed concern about adequately estimating veracity in modern Data Science. Most assume that all data is imprecise and hence require probabilistic measures, error bars, and likelihood estimates for all results. More basically, most DIA ecosystem experts recognize that errors can arise across an end-to-end DIA activity and are investing substantially in addressing these issues in both the DIA processes and the data management processes that currently require significant human guidance.
     An objective of this research is to discover the extent to which the above characteristics of very large-scale, complex DIAs also apply to simple DIAs. There is a strong likelihood that they apply directly but are difficult to detect; that is, the principles and techniques of DIA apply equally to simple and complex DIA.

3.4 Looking Into a Use Case

Due to the detail involved, there is not space here to describe in full even a single use case considered in this study. However, let's look into a single step of a use case involving a virtual experiment conducted at CERN in the ATLAS project. The heart of empirical science is experimental design. It starts by identifying, formulating, and verifying a worthy hypothesis to pursue. This first, complex step typically involves a multi-disciplinary team, called the collaborators for this virtual experiment, often from around the world, for more than a year. We consider the second step, the construction of the control or background model (executable software and data) that creates the background (e.g., an executable or testable model and a given data set) required as the basis within which to search (analyze) for "signals" that would represent the phenomenon being investigated in the hypothesis. This is the control that completely excludes the data of interest. The data of interest (the signal region) is "blinded" completely so as not to bias the experiment. The background (control) is designed using software that simulates relevant parts of the standard model of particle physics, plus data from ATLAS selected with the appropriate signatures, with the data of interest blinded.
     Over time, ATLAS contributors have developed simulations of many parts of the standard model. Hence, constructing the model required for the background involves selecting and combining relevant simulations. If there is no simulation for some aspect that you require, then it must be requested, or you may have to build it yourself. Similarly, if there is no relevant data of interest in the experimental data repository, it must be requested from subsequent capture from the detectors when the LHC is next fired up at the appropriate energy levels. This comes from a completely separate team running the (non-virtual) experiment.
     The development of the background is approximately a one person-year activity, as it involves the experimental design, the design and refinement of the model (software simulations), the selection and tuning of methods to achieve the correct signature (i.e., to get the right data), verification of the model (observing expected outcomes when tested), and dealing with errors (statistical and systematic) that arise from the hardware or the process. The result of the background phase is a model, approved by the collaboration, that represents the background required by the experiment with the signal region blinded. The model is an "application" that runs on the ATLAS "platform" using ATLAS resources – libraries, software, simulations, and data – much of it drawing on the ROOT framework, CERN's core modeling and analysis infrastructure. It is verified by being executed under various testing conditions.
     This is an incremental or iterative process, each step of which is reviewed. The resulting design document for the Top Quark experiment was approximately 200 pages of design choices, parameter settings, and results – both positive and negative! All experimental data and analytical results are probabilistic. All results have error bars; in particle physics they must be at least 5 sigma to be accepted. This explains the year of iteration in which analytical models are adjusted, analytical methods are selected and tuned, and results are reviewed by the collaboration.
     The next step is the actual virtual experiment. This too takes months. You might be surprised to find that once the data is un-blinded (i.e., synthetic data is replaced in the region of interest with experimental data), the experimenter, often a PhD candidate, gets one and only one execution of the "verified" model over the experimental data.
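The mechanics of that blinded, one-shot step can be illustrated with a toy counting experiment: a background expectation is built and checked while the signal region stays masked, and a single unblinding reveals the observed count, whose excess is judged against the 5 sigma convention. The numbers, region boundaries, and the simple Gaussian approximation below are illustrative assumptions, not the ATLAS procedure or its actual statistical treatment.

# Toy sketch of a blinded counting analysis with made-up numbers: estimate the
# background with the signal region masked, then "unblind" exactly once.
from math import sqrt

FULL_RANGE = (100.0, 200.0)        # assumed observable range (arbitrary units)
SIGNAL_REGION = (120.0, 130.0)     # assumed blinded region of interest

def in_signal_region(x):
    lo, hi = SIGNAL_REGION
    return lo <= x < hi

def expected_background(simulated_events):
    """Crude sideband extrapolation: count events outside the blinded region and
    scale by the width ratio (real analyses fit simulations plus control data)."""
    sideband_count = sum(1 for x in simulated_events if not in_signal_region(x))
    region_width = SIGNAL_REGION[1] - SIGNAL_REGION[0]
    sideband_width = (FULL_RANGE[1] - FULL_RANGE[0]) - region_width
    return sideband_count * region_width / sideband_width

def unblind_once(observed_events, b_expected):
    """The one and only look at the signal region: observed count plus a simple
    Gaussian approximation of the excess significance, judged against 5 sigma."""
    n_observed = sum(1 for x in observed_events if in_signal_region(x))
    z = (n_observed - b_expected) / sqrt(b_expected)
    return n_observed, z, ("candidate discovery" if z >= 5 else "not significant")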
     Hopefully this portion of a use case illustrates that DIA is a complex but critical tool in scientific discovery, used with a well-defined understanding of veracity. It must stand up to scrutiny that evaluates whether the experiment – consisting of all models, methods, and data, with probabilistic results and error bounds better than 5 sigma – is adequate to be accepted by Science or Nature as demonstrating that the hypothesized correlation is causal.

4 Research For An Emerging Discipline
The next step in this research – to better understand the theory and practice of the emerging discipline of Data Science, to understand and address its opportunities and challenges, and to guide its development – is given in its definition. Modern Data Science builds on conventional Data Science and on all of its constituent disciplines required to design, verify, and operate end-to-end DIA activities, including both data management and DIA processes, in a DIA ecosystem for a shared community of users. Each discipline must be considered with respect to what it contributes to investigating phenomena, acquiring new knowledge, and correcting and integrating new with previous knowledge. Each operation must be understood with respect to how its correctness, completeness, and efficiency can be estimated.
     This research involves identifying relevant principles and techniques. Principles concern the theories that are established formally, e.g., mathematically, and possibly demonstrated empirically. Techniques involve the application of wisdom [21], i.e., domain knowledge, art, experience, methodologies, and practice, often called best practices. The principles and techniques, especially those established for conventional Data Science, must be verified and, if required, extended, augmented, or replaced for the new context of the Fourth Paradigm, especially its volumes, velocities, and variety. For example, new departments at MIT, Stanford, and the University of California, Berkeley, are conducting such research under what some are calling 21st Century Statistics.
     A final, stimulating challenge is what is called meta-modelling or meta-theory. DIA, and more generally Data Science, is inherently multi-disciplinary [10]. This area emerged in the physical sciences in the 1980s and subsequently in statistics and machine learning, and is now being applied in other areas to address combining the results of multiple disciplines. Analogously, meta-modelling arises when using multiple analytical models and multiple analytical methods to analyze different perspectives or characteristics of the same phenomena. This extremely natural and useful methodology, called ensemble modelling, is required in many physical sciences, statistics, and AI, and should be explored as a fundamental modelling methodology.
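As a small illustration of the ensemble idea, the sketch below combines the predictions of several independently built models of the same phenomenon and reports their spread, one simple way to expose disagreement between perspectives. The weighting scheme and the model interface are assumptions for illustration, not a prescription from this research.

# Minimal sketch of ensemble modelling: combine several models of the same
# phenomenon and use their disagreement as a rough indicator of uncertainty.
from statistics import pstdev

def ensemble_predict(models, x, weights=None):
    predictions = [m(x) for m in models]          # each model is a callable
    if weights is None:
        weights = [1.0] * len(predictions)
    combined = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
    spread = pstdev(predictions) if len(predictions) > 1 else 0.0
    return combined, spread                        # point estimate and disagreement

# Example with three toy "models" of the same quantity:
models = [lambda x: 2.0 * x, lambda x: 2.1 * x - 0.5, lambda x: 1.9 * x + 0.4]
print(ensemble_predict(models, 10.0))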
Acknowledgement

     I gratefully acknowledge the brilliant insights and improvements proposed by Prof. Jennie Duggan, Northwestern University, and Prof. Thilo Stadelmann, Zurich University of Applied Sciences.

References

 [1] G. Aad et al. 2012. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B 716, 1 (2012), 1–29.
 [2] Accelerating Discovery in Science and Engineering Through Petascale Simulations and Analysis (PetaApps), National Science Foundation, Posted July 28, 2008.
 [3] J. Bohannon. 2015. Fears of an AI pioneer. Science 349, 6245 (July 2015), 252.
 [4] G.E.P. Box. Science and Statistics. Journal of the American Statistical Association 71, 356 (April 2012), 791–799; reprint of the original from 1962.
 [5] N. Diakopoulos. Algorithmic Accountability Reporting: On the Investigation of Black Boxes. Tow Center. February 2014.
 [6] J. Duggan and M. Brodie. Hephaestus: Data Reuse for Accelerating Scientific Discovery. In CIDR 2015.
 [7] S.J. Gershman, E.J. Horvitz, and J.B. Tenenbaum. 2015. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349, 6245 (2015), 273–278.
 [8] Jim Gray on eScience: a transformed scientific method. In A.J.G. Hey, S. Tansley, and K.M. Tolle (Eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Proc. IEEE 99, 8 (2009), 1334–1337.
 [9] E. Horvitz and D. Mulligan. 2015. Data, privacy, and the greater good. Science 349, 6245 (July 2015), 253–255.
[10] M.I. Jordan and T.M. Mitchell. 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (July 2015), 255–260.
[11] V. Khachatryan et al. 2012. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B 716, 1 (2012), 30–61.
[12] T.S. Kuhn. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press, 1996.
[13] V. Mayer-Schönberger and K. Cukier. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
[14] G.A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2, 81–97.
[15] NIH Precision Medicine Initiative, http://www.nih.gov/precisionmedicine/
[16] D.C. Parkes and M.P. Wellman. 2015. Economic reasoning and artificial intelligence. Science 349, 6245 (July 2015), 267–272.
[17] F. Provost and T. Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 1 (March 2013), 51–59.
[18] S. Spangler et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1877–1886.
[19] J. Stajic, R. Stone, G. Chin, and B. Wible. 2015. Rise of the Machines. Science 349, 6245 (July 2015), 248–249.
[20] J.W. Tukey. 1962. The Future of Data Analysis. Ann. Math. Statist., 1–67.
[21] Bin Yu. Data Wisdom for Data Science. ODBMS.org, April 13, 2015.



