                 Understanding Data Science:
       An Emerging Discipline for Data-Intensive Discovery
                                              © Michael L. Brodie
                                                  CSAIL, MIT
                                             Cambridge, MA, USA
                                              mlbrodie@csail.mit.edu

                       Abstract

     Over the past two decades, Data-Intensive Analysis has emerged not only as a basis for the Fourth Paradigm of engineering and scientific discovery but as a basis for discovery in most human endeavors for which data is available. Originating in the 1960s, its recent emergence due to Big Data and massive computing power is leading to widespread deployment, yet it remains in its infancy in its application and our understanding of it, and hence in its development. Given the potential risks and rewards of Data-Intensive Analysis and its breadth of application, it is imperative that we get this right.
     The objective of this emerging Fourth Paradigm is more than acquiring data and extracting knowledge. Like its predecessor, the scientific method, the objective of the Fourth Paradigm is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge. In addition, Data Science is a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. It is now time to identify and understand the fundamentals. In my research, I have analyzed more than 30 very large-scale use cases to understand current practical aspects, to gain insight into the fundamentals, and to address the fourth "V" of Big Data – veracity – the accuracy of the data and of the resulting analytics. This development may take decades.

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015), Obninsk, Russia, October 13-16, 2015

1 Data Science: A New Discovery Paradigm That Will Transform Our World

1.1 Introduction

Over the past two decades, Data-Intensive Analysis (also called Big Data Analytics) has emerged not only as a basis for the Fourth Paradigm [8] of engineering and scientific discovery but more broadly as a basis for discovery in most human endeavours for which data is available. The roots of Data-Intensive Analysis (DIA) that have led to its recent dramatic growth include Big Data (c. 2000) that, just emerging, is opening the door to profound change – to new ways of reasoning, problem solving, and processing that in turn bring new opportunities and challenges.
     To better understand DIA and its opportunities and challenges, I examined over 30 DIA use cases that are at very large scale – in the range where theory and practice may break. This paper summarizes some key results of my research related to understanding and defining Data Science as a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. As with its predecessor discovery paradigms, establishing this emerging Fourth Paradigm and the underlying principles and techniques of Data Science may take decades.

1.2 Significance of DIA and Data Science

Data Science is transforming discovery in many human endeavours including healthcare, manufacturing, education, financial modelling, policing, and marketing [10][13]. It has been used to produce significant results in areas ranging from particle physics (e.g., the Higgs boson), to identifying and resolving sleep disorders using Fitbit data, to recommenders for literature, theatre, and shopping. More than 50 national governments have established data-driven strategies as official policy in science and engineering [2] as well as in healthcare, e.g., the US National Institutes of Health and President Obama's Precision Medicine Initiative [15] for "Delivering the right treatments, at the right time, every time to the right person." The hope, supported by early results, is that data-driven techniques will accelerate the discovery of treatments that manage and prevent chronic diseases with more precision, that are tailored to specific individuals, and that come at dramatically lower cost.
     Data Science is being used to radically transform entire domains, such as medicine and biomedical research, as stated in the purpose of the newly created Center for Biomedical Informatics at the Harvard Medical School. It is also making an impact in economics [16], drug discovery [18], and many other domains. As a result of its successes and potential, Data Science is rapidly becoming a sub-discipline of most academic areas. These developments suggest a strong belief in the potential value of Data Science – but can it deliver?
     Early successes and clearly stated expectations of Data Science are truly remarkable; however, its actual deployment, like many hot trends, is far less than it appears.
According to Gartner's 2015 survey of Big Data Management and Analytics, 60% of the Fortune 500 claim to have deployed Data Science, less than 20% have implemented consequent significant changes, and less than 1% have optimized its benefits. Gartner concludes that 85% will be unable to exploit Big Data in 2015. The vast majority of deployments address tactical aspects of existing processes and static business intelligence rather than realizing its power by identifying strategic advantages through discovering previously unforeseen value.

1.3 Illustrious Histories: The Origins of Data Science

Data Science is in its infancy. Few individuals or organizations understand the potential of, and the paradigm shift associated with, Data Science, let alone understand it conceptually. The high rewards, the equally high risks, and its pervasive application make it imperative that we better understand Data Science – its models, methods, processes, and results.
     Data Science is inherently multi-disciplinary, drawing on over 30 allied disciplines according to some definitions. Its principal components include mathematics, statistics, and computer science, especially areas such as AI (e.g., machine learning), data management, and high-performance computing. While these disciplines need to be evaluated in the new paradigm, they have long, illustrious histories. Data analysis developed over 4,000 years ago, with origins in Babylon (17th-12th C BCE) and India (12th C BCE). Mathematical analysis originated in the 17th C around the time of the Scientific Revolution. While statistics has its roots in the 5th C BCE and the 18th C, its application in Data Science originated in 1962 with John W. Tukey [20] and George Box [4]. These long, illustrious histories suggest that Data Science draws on well-established results that took decades or centuries to develop. To what extent do they (e.g., statistical significance) apply in this paradigmatically new context?
     Data Science constitutes a new paradigm in the sense of Kuhn's scientific revolutions [12]. Data Science's predecessor paradigm, the Scientific Method, has approximately 2,000 years of development of empiricism starting with Aristotle (384-322 BCE), Ptolemy (1st C), and the Bacons (13th, 16th C). Data Science, a primary basis of eScience [8], collectively termed the Fourth Paradigm, is emerging following the ~1,000-year development of its three predecessor paradigms of scientific and engineering discovery: theory, experimentation, and simulation [8]. Data Science, which has developed and been applied for over 50 years, changed qualitatively in the late 20th century with the emergence of Big Data, typically defined as data at volumes, velocities, and variety that current technologies, let alone humans, cannot handle efficiently. This paper addresses another characteristic that current technologies and theories do not handle well: veracity.

1.4 What Could Possibly Go Wrong?

Do we understand the risks of recommending the wrong film, the wrong product, or the wrong medical diagnoses, treatments, or drugs? The minimal apparent risk of a result that fails to achieve its objectives when acted upon includes losses in time, resources, customer satisfaction, and customers, and potentially a loss of business. The vast majority of Data Science applications face such small risks; hence veracity has received little attention. Far greater risks could be incurred if incorrect Data Science results are acted upon in critical contexts, such as those already underway in drug discovery [18] and personalized medicine. Most scientists in these contexts are well aware of the risks of errors and hence go to extremes to estimate and minimize them. The wonder of the CERN ATLAS and CMS projects' "discovery" of the Higgs boson, announced July 4, 2012 with a confidence of 5 sigma, might suggest that the results were achieved overnight. They were not. They took 40 years and included Data Science techniques developed over a decade applied over Big Data by two independent projects, ATLAS and CMS, each of which was subsequently peer reviewed and published [1][11], with a further year-long verification that established a confidence of 10 sigma. To what extent do the vast majority of Data Science applications concern themselves with verification and error bounds, let alone understand the verification methods applied at CERN? Informal surveys of data scientists conducted in this study at Data Science conferences suggest that 80% of customers never ask for error bounds.
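To make the significance conventions above concrete, the following minimal sketch converts a z-score into the corresponding one-sided p-value using only the normal distribution; the thresholds shown are standard conventions, and the snippet is an illustration rather than a description of how the ATLAS and CMS analyses computed their confidence.

# Minimal illustration: what "5 sigma" means as a one-sided Gaussian p-value.
# Standard library only; assumes the usual normal-tail convention for
# discovery significance, not the actual CERN statistical procedure.
from math import erfc, sqrt

def one_sided_p_value(z_sigma: float) -> float:
    """Probability of a fluctuation at least z_sigma standard deviations
    above the background-only expectation, under a normal approximation."""
    return 0.5 * erfc(z_sigma / sqrt(2.0))

for z in (2.0, 3.0, 5.0):
    print(f"{z:.0f} sigma -> p = {one_sided_p_value(z):.2e}")
# 2 sigma -> p = 2.28e-02  (roughly the 95% confidence common in business analytics)
# 3 sigma -> p = 1.35e-03  ("evidence" by particle-physics convention)
# 5 sigma -> p = 2.87e-07  (the discovery threshold cited above)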
     The existential risks of applying Data Science have been raised by world-leading authorities such as the Organization for Economic Cooperation and Development and the AI [3][7][9][19] and legal [5] communities, with the most extreme concerns stated by the Future of Life Institute, whose objective is safeguarding life and developing optimistic visions of the future in order to mitigate existential risks facing humanity from AI.
     Given the potential risks and rewards of DIA and its breadth of application across conventional, empirical scientific and engineering domains as well as across most human endeavors, we had better get this right! The scientific and engineering communities place high confidence in their existing discovery paradigms, with well-defined measures of likelihood and confidence within relatively precise error estimates (even after 1,000 years, serious issues persist, e.g., P values (significance) and reproducibility). Can we say the same for modern Data Science as a discovery paradigm and for its results? A simple observation of the formal development of the processes and methods of its predecessors suggests that we cannot. Indeed, we do not know if or under what conditions the constituent disciplines, like statistics, may break down.
     Do we understand DIA to the extent that we can assign probabilistic measures of likelihood to its results?
With the scale and emerging nature of DIA-based discovery, how do we estimate the correctness and completeness of analytical results relative to a hypothesized discovery question when the underlying principles and techniques may not apply in this new context?
     In summary, we do not yet understand DIA well enough to quantify the probability or likelihood that a projected outcome will occur within estimated error bounds. While CERN used Data Science and Big Data to identify results, verification was ultimately empirical, as it must be in drug discovery [18] and other critical areas, until analytical techniques are developed and proven robust.

1.5 Do We Understand Data Science?

Do we even understand what Data Science methods compute or how they work? Human thought is limited by the human mind. According to Miller's Law [14], the human mind (short-term working memory) is capable of conceiving fewer than ten (7 +/- 2) concepts at one time. Hence, humans have difficulty understanding complex models involving more than ten variables. The conventional process is to imagine a small number of variables (physical science PhDs typically involve fewer than 5 variables) and then abstract or encapsulate that knowledge into a model that can subsequently be augmented with more variables. Thus most scientific theories develop slowly over time into complex models. For example, Newton's model of particle physics was extended for 350 years through Bohr, Heisenberg, Einstein, and others, up to Glashow, Salam, and Weinberg, to form the Standard Model of Particle Physics. Scientific discovery in particle physics is wonderful and has taken over 350 years. Due to its complexity, no physicist has understood the entire Standard Model for decades; rather, it is represented in complex computational models.
     When humans analyse a problem, they do so with models with a limited number of variables. As the number of variables increases, it is increasingly difficult to understand the model and the potential combinations and correlations. Hence, humans limit their models and analyses to those that they can comprehend. These human-scale models are typically theory-driven, thus limiting their scale (number of variables) to what can be conceived.
     What if the phenomenon is arbitrarily complex or beyond immediate human conception? I suspect that this is addressed iteratively, with one model (theory) becoming abstracted as the base for another, more complex theory, and so on (standing on the shoulders of those who have gone before), e.g., the development of quantum physics from elementary particles. That is, once the human mind understands a model, it can form the basis of a more complex model. This development under the scientific method scales at a rate limited by human conception, thus limiting the number of variables and the complexity. It is also error-prone, since phenomena may not manifest at a certain level of complexity; hence models correct at one scale may be wrong at a larger scale or, vice versa, a model wrong at one scale (hence discarded) may become correct at a higher scale (in a more complex model).
     Machine learning algorithms can identify correlations between thousands, millions, or even billions of variables. This suggests that it is difficult, if not impossible, for humans to understand what or how these algorithms discover. Imagine trying to understand such a model that results from selecting some subset of the correlations on the assumption that they may be causal and thus constitute a model of the phenomenon with high confidence of being correct with respect to some hypotheses, with or without error bars.
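A back-of-the-envelope calculation shows why correlation structures at this scale exceed human inspection and why error bounds matter: the number of candidate pairwise correlations grows quadratically with the number of variables, and at a naive significance threshold a very large number of "discoveries" among unrelated variables are expected by chance alone. The variable counts and the 5% threshold below are illustrative assumptions, not figures from the use cases.

# Illustrative arithmetic only: candidate pairwise correlations among p variables,
# and how many would pass an uncorrected 5% significance test purely by chance
# if the variables were actually unrelated.
ALPHA = 0.05  # assumed naive per-test threshold

for p in (10, 1_000, 1_000_000):
    pairs = p * (p - 1) // 2          # distinct variable pairs to examine
    spurious = ALPHA * pairs          # expected false "significant" correlations
    print(f"{p:>9,} variables -> {pairs:>15,} pairs, ~{spurious:,.0f} spurious hits")
# Ten variables fit human working memory; a million variables yield ~5e11 pairs,
# which is why multiple-comparison corrections, error bounds, and verification
# cannot be an afterthought at this scale.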
1.6 Cornerstone of a New Discovery Paradigm

The Fourth Paradigm – eScience supported by Data Science – is paradigmatically different from its predecessor discovery paradigms. It provides revolutionary new ways [12] of thinking, reasoning, and processing – new modes of inquiry, problem solving, and decision-making. It is not the Third Paradigm augmented by Big Data, but something profoundly different. Losing sight of this difference forfeits its power and benefits and loses the perspective that it is A Revolution That Will Transform How We Live, Work, and Think [13].
     Paradigm shifts are difficult to notice as they emerge, just as the proverbial frog does not notice that its hot bath is becoming lethal. There are several ways to describe the shift. There is a shift of resources from (empirically) discovering causality (Why the phenomenon occurs) – the heart of the Scientific Method – to discovering interesting correlations (What might have occurred). This shift involves moving from a strategic perspective driven by human-generated hypotheses (theory-driven, top-down) to a tactical perspective driven by observations (data-driven, bottom-up).
     Seen at their extremes, the Scientific Method involves testing hypotheses (theories) posed by scientists, while Data Science can be used to generate hypotheses to be tested based on significant correlations amongst variables that are identified algorithmically in the data. In principle, vast amounts of data and computing power can be used to accelerate discovery simply by outpacing human thinking in both power and complexity. The power of Data Science is growing rapidly due to the development of ever more powerful computing resources and algorithms, such as deep learning. So rather than optimizing an existing process, Data Science can be used to identify patterns that suggest unforeseen solutions, thus automating serendipity – the moment when a human observes an anomaly that stimulates a bright idea to resolve it.
     However, even more compelling is one step beyond the simple version of this shift, namely a symbiosis of both paradigms.
For example, Data Science can be used to offer highly probable hypotheses or correlations, from which we select those with acceptable error estimates that are worthy of subsequent empirical analysis. In turn, empiricism is used to pursue these hypotheses until some converge and some diverge, at which point Data Science can be applied to refine or confirm the converging hypotheses, having discarded the divergent hypotheses, and the cycle starts again. Ideally, one would optimize the combination of theory-driven empirical analysis with data-driven analysis to accelerate discovery faster than either could on its own.
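The alternating cycle just described can be read as a simple control loop. The sketch below makes the bookkeeping of that loop explicit under assumed interfaces; generate_hypotheses, error_estimate, and empirical_test are hypothetical placeholders for the data-driven, veracity, and empirical steps respectively, not components defined in this paper.

# Schematic sketch of the theory-driven / data-driven symbiosis described above.
# The three callables are assumed placeholders: data-driven hypothesis generation,
# an error-bound estimator, and empirical (experimental) follow-up.
def discovery_cycle(data, generate_hypotheses, error_estimate, empirical_test,
                    max_error=0.05, rounds=10):
    surviving = []
    for _ in range(rounds):
        # Data-driven step: propose correlations/hypotheses from the data.
        candidates = generate_hypotheses(data, prior=surviving)
        # Veracity gate: keep only hypotheses with acceptable error estimates.
        worthy = [h for h in candidates if error_estimate(h, data) <= max_error]
        # Theory-driven step: pursue the worthy hypotheses empirically;
        # converging hypotheses seed the next data-driven refinement round.
        surviving = [h for h in worthy if empirical_test(h)]
    return surviving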
     While Data Science is a cornerstone of a new discovery paradigm, it may be conceptually and methodologically more challenging than its predecessors, since it involves everything included in its predecessor paradigms – modelling, methods, processes, measures of correctness, completeness, and efficiency – in a much more complex context, namely that of Big Data. Following well-established developments, we should try to find the fundamentals of Data Science – its principles and techniques – to help manage the complexity and guide its understanding and application.

2 Data Science: A Perspective

Since Data Science is in its infancy and is inherently multi-disciplinary, there are naturally many definitions of Data Science that should emerge and evolve with the discipline. As definitions serve many purposes, it is reasonable to have multiple definitions, each serving different purposes. Most Data Science definitions attempt to define Why (its purpose), What (constituent disciplines), and How (constituent actions of discovery workflows).
     A common definition of Data Science is the activity of extracting knowledge from data (Wikipedia.com). While simple, it does not convey the larger goal of Data Science or its consequent challenges. A DIA activity is far more than a collection of actions or the mechanical processes of acquiring and analyzing data. Like its predecessor paradigm, the Scientific Method, the purpose of Data Science and of a DIA activity is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge – continually evolving our current understanding of the phenomena based on newly available data. We seldom start from scratch, clearly the simplest case. Hence, discovering, understanding, and integrating data must precede extracting knowledge, all at massive scale, i.e., largely by automated means.
     The Scientific Method that underlies the Third Paradigm is a body of principles and techniques that provide the formal and practical bases of scientific and engineering discovery. The principles and techniques have been developed over hundreds of years, originating with Plato, and are still evolving today, with significant unresolved issues such as statistical significance (i.e., P values) and reproducibility.
     While Data Science had its origins 50 years ago with Tukey [20] and Box [4], it started to change qualitatively less than two decades ago with the emergence of Big Data and the consequent paradigm shift described above. The focus of this research into modern Data Science is on veracity – the ability to estimate the correctness, completeness, and efficiency of an end-to-end DIA activity and of its results. Hence, I use the following definition, which is in the spirit of [17].

  Data Science is a body of principles and techniques for applying data-intensive analysis to investigate phenomena, acquire new knowledge, and correct and integrate previous knowledge with measures of correctness, completeness, and efficiency of the derived results with respect to some pre-defined (top down) or emergent (bottom up) specification (scope, question, hypothesis).

3 Understanding Data Science From Practice

3.1 Methodology to Better Understand DIA

Driven by a passion for understanding Data Science in practice, my year-long and ongoing research study has investigated over 30 very large-scale Big Data applications, most of which have produced, or are daily producing, significant value. The use cases include particle physics; astrophysics and satellite imagery; oceanography; economics; information services; several life sciences applications in pharmaceuticals, drug discovery, and genetics; and various areas of medicine including precision medicine, hospital studies, clinical trials, and intensive care unit and emergency room medicine.
     The focus is to investigate relatively well-understood, successful use cases where correctness is critical and the Big Data context is at massive scale; such use cases constitute less than 5% of all deployed Big Data analytics. The focus was on these use cases because we do not know where errors may arise outside normal scientific and analytical errors. There is a greater likelihood that established disciplines, e.g., statistics and data management, might break at very large scale, where errors due to failed fundamentals may be more obvious.
     The breadth and depth of the use cases revealed strong, significant emerging trends, some of which are listed below. These confirmed for some use case owners, and suggested to others, solutions and directions that they were pursuing but could not have seen without the perspective of 30+ use cases.

3.2 DIA Processes

A Data-Intensive Activity is an analytical process that consists of applying sophisticated analytical methods to large data sets that are stored under some analytical models. While this is the typical view of Data Science projects or DIA use cases, this analytical component of the DIA activity constitutes ~20% of an end-to-end DIA pipeline or workflow. Currently it consumes ~20% of the resources required to complete a DIA analysis.
     An end-to-end DIA activity involves two data management processes that precede the DIA process, namely Raw Data Acquisition and Curation, and Analytical Data Acquisition. Raw Data Acquisition and Curation starts with discovering and understanding data in data sources and ends with integrating and storing curated data in a repository that represents entities in the domain of interest, together with metadata about those entities with which to make specific interpretations, and that is shared by a community of users. Analytical Data Acquisition starts with discovering and understanding data within the shared repository and ends with storing the resulting information, specific entities and interpretations, into an analytical model to be used by the subsequent DIA process.
     Sophisticated algorithms such as machine learning largely automate DIA processes, as they have to be automated to process such large volumes of data using complex algorithms. Currently, the Raw Data Acquisition and Curation and Analytical Data Acquisition processes are far less automated, typically requiring 80% or more of the total resources to complete.
     This understanding leads to the following definitions.

  Data-Intensive Discovery (DID) is the activity of using Big Data to investigate phenomena, to acquire new knowledge, and to correct and integrate previous knowledge.

"-Intensive" is added when the data is "at scale". Theory-driven DID is the investigation of human-generated scientific, engineering, or other hypotheses over Big Data. Data-Driven DID employs automatic hypothesis generation.

  Data-Intensive Analysis is the process of analyzing Big Data with analytical methods and models.

     DID goes beyond the Third Paradigm of scientific or engineering discovery by investigating scientific or engineering hypotheses using DIA. A DIA activity is an experiment over data, thus requiring all aspects of a scientific experiment, e.g., experimental design, expressed over data, a.k.a. data-based empiricism.

  A DIA Process (workflow or pipeline) is a sequence of operations that constitute an end-to-end DIA activity from the source data to the quantified, qualified result.

Currently, ~80% of the effort and resources required for the entire DIA activity are due to the two data management processes – areas where scientists and analysts are not experts. Emerging technology, such as for data curation at scale, aims to flip that ratio from 80:20 to 20:80 so as to let scientists do science, analysts do analysis, etc. This requires an understanding of the data management processes and of their correctness, completeness, and efficiency, in addition to those of the DIA process. Another obvious consequence is that, proportionally, 80% of the errors that could arise in DIA may arise in the data management processes, prior to DIA even starting.
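To make the end-to-end structure concrete, here is a minimal skeleton of a DIA process as just defined, with the two data management stages preceding the analysis stage and a crude per-stage effort record. The stage functions are hypothetical placeholders, and the skeleton reflects only the description above, not the implementation of any studied use case.

# Minimal sketch of an end-to-end DIA process (pipeline) with placeholder stages.
# It mirrors the three stages described above and records effort per stage,
# since the two data management stages currently dominate (~80%) the cost.
from dataclasses import dataclass, field

@dataclass
class DIAResult:
    findings: object                  # the analytical result (model, correlations, ...)
    error_bounds: object              # veracity estimate attached to the result
    effort: dict = field(default_factory=dict)   # resources consumed per stage

def dia_process(sources, curate, acquire_for_analysis, analyze):
    effort = {}
    # Stage 1: Raw Data Acquisition and Curation -> shared curated repository.
    repository, effort["raw_acquisition_and_curation"] = curate(sources)
    # Stage 2: Analytical Data Acquisition -> data staged into an analytical model.
    analytical_model, effort["analytical_data_acquisition"] = acquire_for_analysis(repository)
    # Stage 3: Data-Intensive Analysis proper (today ~20% of total resources).
    findings, bounds, effort["analysis"] = analyze(analytical_model)
    return DIAResult(findings, bounds, effort)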
3.3 Characteristics of Large-Scale DIA Use Cases

The focus of my research is successful, very large-scale, multi-year projects, many with 100s to 1,000s of ongoing DIA activities. These activities are supported by a DIA ecosystem consisting of a community of users (e.g., over 5,000 scientists in the ATLAS and CMS projects at CERN, and similar numbers of scientists using the worldwide Cancer Genome Atlas) and technology (e.g., science gateways – there are over 60 large-scale scientific gateways, such as The Cancer Genome Atlas and CERN's Worldwide LHC Computing Grid), collectively referred to in some branches of science as networked science. Some significant trends that have emerged from the analysis of these use cases are listed, briefly, below.
     The typical view of Data Science appears to be based on the vast majority (~95%) of DIA use cases. While they share some characteristics with those in this study, there are fundamental differences, such as the concern for and due diligence associated with veracity mentioned above.
     Based on this study, data analysis appears to fall into three classes. Conventional data analysis over "small data" accounts for at least 95% of all data analysis, often using Microsoft Excel. DIA over Big Data has two sub-classes: simple DIA, i.e., the vast majority of DIA use cases mentioned above, and complex DIA, such as the use cases analyzed in this study, which are characterized by complex analytical models (e.g., sub-models of the Standard Model of Physics, economic models, an organizational model for enterprises worldwide, and models for genetics and epigenetics) and a corresponding plethora of analytical methods (e.g., the vast method libraries in CERN's ROOT framework). The models and methods are as complex as the phenomena being analyzed.
     The most widely used DIA tools for simple cases claim to support analyst self-service in point-and-click environments, some claiming "point us at the data and we will find the patterns of interest for you". This characteristic is infeasible in the use cases analyzed. A requirement common to the use cases analyzed is not only the principle of being machine-driven and human-guided, i.e., a man-machine symbiosis, but extensive attempts to optimize this symbiosis for scale, cost, and precision (too much human-in-the-loop leads to errors, too little leads to nonsense).
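One common way to operationalize that machine-driven, human-guided balance is a confidence gate that routes only low-confidence cases to people, within a limited review budget. The sketch below is a generic illustration under assumed thresholds and record fields; it is not a mechanism reported by the use cases, each of which tunes the symbiosis to its own scale, cost, and precision targets.

# Generic sketch of a human-in-the-loop gate: the machine handles confident
# cases, scarce human attention goes to the least confident ones.
# The 0.9 threshold and the review budget are illustrative assumptions.
def triage(records, classify, confidence_threshold=0.9, human_budget=100):
    auto_accepted, uncertain = [], []
    for record in records:
        label, confidence = classify(record)            # machine-driven step
        if confidence >= confidence_threshold:
            auto_accepted.append((record, label))
        else:
            uncertain.append((record, label, confidence))
    # Human-guided step: review the least confident cases first, up to the budget
    # (too much human-in-the-loop is costly and error-prone, too little is nonsense).
    uncertain.sort(key=lambda item: item[2])
    return auto_accepted, uncertain[:human_budget], uncertain[human_budget:]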
     DIA ecosystems are inherently multi-disciplinary (ideally interdisciplinary), collaborative, and iterative.
Not only does DIA (Big Data Analytics) require multiple disciplines, e.g., genetics, statistics, and machine learning; so too do the data management processes require multiple disciplines, e.g., data management, domain, and machine learning experts for data curation, statisticians for sampling, etc.
     In large-scale DIA ecosystems, a DIA is a virtual experiment [6]. Far from claims of simplicity and point-and-click self-service, most large-scale DIA activities reflect the complexity of the analysis at hand and are the result of long-term (months to years) experimental designs that involve greater complexity than their empirical counterparts, to deal with scale, significance, hypotheses, null hypotheses, and deeper challenges such as determining causality from correlations and identifying and dealing with biases and often irrational human intervention.
     Finally, veracity is one of the most significant challenges and critical requirements of all the DIA ecosystems studied. While there are many complex methods in conventional Data Science to estimate veracity, most owners of the use cases studied expressed concern about adequately estimating veracity in modern Data Science. Most assume that all data is imprecise and hence require probabilistic measures, error bars, and likelihood estimates for all results. More basically, most DIA ecosystem experts recognize that errors can arise across an end-to-end DIA activity and are investing substantially in addressing these issues in both the DIA processes and the data management processes that currently require significant human guidance.
     An objective of this research is to discover the extent to which the above characteristics of very large-scale, complex DIAs also apply to simple DIAs. There is a strong likelihood that they apply directly but are difficult to detect; that is, the principles and techniques of DIA apply equally to simple and complex DIA.

3.4 Looking Into a Use Case

Due to the detail involved, there is not space here to describe in full even a single use case considered in this study. However, let's look into a single step of a use case involving a virtual experiment conducted at CERN in the ATLAS project. The heart of empirical science is experimental design. It starts by identifying, formulating, and verifying a worthy hypothesis to pursue. This first, complex step typically involves a multi-disciplinary team, called the collaborators for this virtual experiment, often from around the world, for more than a year. We consider the second step, the construction of the control or background model (executable software and data) that creates the background (e.g., an executable or testable model and a given data set) required as the basis within which to search (analyze) for "signals" that would represent the phenomenon being investigated in the hypothesis. This is the control that completely excludes the data of interest. The data of interest (the signal region) is "blinded" completely so as not to bias the experiment. The background (control) is designed using software that simulates relevant parts of the standard model of particle physics, plus data from ATLAS selected with the appropriate signatures, with the data of interest blinded.
     Over time, ATLAS contributors have developed simulations of many parts of the standard model. Hence, constructing the model required for the background involves selecting and combining relevant simulations. If there is no simulation for some aspect that you require, then it must be requested, or you may have to build it yourself. Similarly, if there is no relevant data of interest in the experimental data repository, it must be requested from subsequent capture from the detectors when the LHC is next fired up at the appropriate energy levels. This comes from a completely separate team running the (non-virtual) experiment.
     The development of the background is approximately a one person-year activity, as it involves the experimental design, the design and refinement of the model (software simulations), the selection and tuning of methods to achieve the correct signature (i.e., to get the right data), verification of the model (observing expected outcomes when tested), and dealing with errors (statistical and systematic) that arise from the hardware or the process. The result of the background phase is a model, approved by the collaboration, that represents the background required by the experiment with the signal region blinded. The model is an "application" that runs on the ATLAS "platform" using ATLAS resources – libraries, software, simulations, and data – much of it drawing on the ROOT framework, CERN's core modeling and analysis infrastructure. It is verified by being executed under various testing conditions.
     This is an incremental or iterative process, each step of which is reviewed. The resulting design document for the Top Quark experiment was approximately 200 pages of design choices, parameter settings, and results – both positive and negative! All experimental data and analytical results are probabilistic. All results have error bars; in particle physics they must be at least 5 sigma to be accepted. This explains the year of iteration in which analytical models are adjusted, analytical methods are selected and tuned, and results are reviewed by the collaboration.
     The next step is the actual virtual experiment. This too takes months. You might be surprised to find that once the data is un-blinded (i.e., synthetic data is replaced in the region of interest with experimental data), the experimenter, often a PhD candidate, gets one and only one execution of the "verified" model over the experimental data.
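The mechanics of that blinded, one-shot step can be illustrated with a toy counting experiment: a background expectation is built and checked while the signal region stays masked, and a single unblinding reveals the observed count, whose excess is judged against the 5 sigma convention. The numbers, region boundaries, and the simple Gaussian approximation below are illustrative assumptions, not the ATLAS procedure or its actual statistical treatment.

# Toy sketch of a blinded counting analysis with made-up numbers: estimate the
# background with the signal region masked, then "unblind" exactly once.
from math import sqrt

FULL_RANGE = (100.0, 200.0)        # assumed observable range (arbitrary units)
SIGNAL_REGION = (120.0, 130.0)     # assumed blinded region of interest

def in_signal_region(x):
    lo, hi = SIGNAL_REGION
    return lo <= x < hi

def expected_background(simulated_events):
    """Crude sideband extrapolation: count events outside the blinded region and
    scale by the width ratio (real analyses fit simulations plus control data)."""
    sideband_count = sum(1 for x in simulated_events if not in_signal_region(x))
    region_width = SIGNAL_REGION[1] - SIGNAL_REGION[0]
    sideband_width = (FULL_RANGE[1] - FULL_RANGE[0]) - region_width
    return sideband_count * region_width / sideband_width

def unblind_once(observed_events, b_expected):
    """The one and only look at the signal region: observed count plus a simple
    Gaussian approximation of the excess significance, judged against 5 sigma."""
    n_observed = sum(1 for x in observed_events if in_signal_region(x))
    z = (n_observed - b_expected) / sqrt(b_expected)
    return n_observed, z, ("candidate discovery" if z >= 5 else "not significant")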
     Hopefully this portion of a use case illustrates that DIA is a complex but critical tool in scientific discovery, used with a well-defined understanding of veracity. It must stand up to scrutiny that evaluates whether the experiment – consisting of all models, methods, and data, with probabilistic results and error bounds better than 5 sigma – is adequate to be accepted by Science or Nature as demonstrating that the hypothesized correlation is causal.

4 Research For An Emerging Discipline
The next step in this research – to better understand the theory and practice of the emerging discipline of Data Science, to understand and address its opportunities and challenges, and to guide its development – is given in its definition. Modern Data Science builds on conventional Data Science and on all of its constituent disciplines required to design, verify, and operate end-to-end DIA activities, including both data management and DIA processes, in a DIA ecosystem for a shared community of users. Each discipline must be considered with respect to what it contributes to investigating phenomena, acquiring new knowledge, and correcting and integrating new with previous knowledge. Each operation must be understood with respect to how its correctness, completeness, and efficiency can be estimated.
     This research involves identifying relevant principles and techniques. Principles concern the theories that are established formally, e.g., mathematically, and possibly demonstrated empirically. Techniques involve the application of wisdom [21], i.e., domain knowledge, art, experience, methodologies, and practice, often called best practices. The principles and techniques, especially those established for conventional Data Science, must be verified and, if required, extended, augmented, or replaced for the new context of the Fourth Paradigm, especially its volumes, velocities, and variety. For example, new departments at MIT, Stanford, and the University of California, Berkeley, are conducting such research under what some are calling 21st Century Statistics.
     A final, stimulating challenge is what is called meta-modelling or meta-theory. DIA, and more generally Data Science, is inherently multi-disciplinary [10]. This area emerged in the physical sciences in the 1980s and subsequently in statistics and machine learning, and is now being applied in other areas to address combining the results of multiple disciplines. Analogously, meta-modelling arises when using multiple analytical models and multiple analytical methods to analyze different perspectives or characteristics of the same phenomena. This extremely natural and useful methodology, called ensemble modelling, is required in many physical sciences, statistics, and AI, and should be explored as a fundamental modelling methodology.
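As a small illustration of the ensemble idea, the sketch below combines the predictions of several independently built models of the same phenomenon and reports their spread, one simple way to expose disagreement between perspectives. The weighting scheme and the model interface are assumptions for illustration, not a prescription from this research.

# Minimal sketch of ensemble modelling: combine several models of the same
# phenomenon and use their disagreement as a rough indicator of uncertainty.
from statistics import pstdev

def ensemble_predict(models, x, weights=None):
    predictions = [m(x) for m in models]          # each model is a callable
    if weights is None:
        weights = [1.0] * len(predictions)
    combined = sum(w * p for w, p in zip(weights, predictions)) / sum(weights)
    spread = pstdev(predictions) if len(predictions) > 1 else 0.0
    return combined, spread                        # point estimate and disagreement

# Example with three toy "models" of the same quantity:
models = [lambda x: 2.0 * x, lambda x: 2.1 * x - 0.5, lambda x: 1.9 * x + 0.4]
print(ensemble_predict(models, 10.0))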
Acknowledgement

     I gratefully acknowledge the brilliant insights and improvements proposed by Prof. Jennie Duggan, Northwestern University, and Prof. Thilo Stadelmann, Zurich University of Applied Sciences.

References

 [1] G. Aad et al. 2012. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B 716, 1 (2012), 1–29.
 [2] Accelerating Discovery in Science and Engineering Through Petascale Simulations and Analysis (PetaApps), National Science Foundation, Posted July 28, 2008.
 [3] J. Bohannon. 2015. Fears of an AI pioneer. Science 349, 6245 (July 2015), 252.
 [4] G.E.P. Box. Science and Statistics. Journal of the American Statistical Association 71, 356 (April 2012), 791–799; reprint of the original from 1962.
 [5] N. Diakopoulos. Algorithmic Accountability Reporting: On the Investigation of Black Boxes. Tow Center. February 2014.
 [6] J. Duggan and M. Brodie. Hephaestus: Data Reuse for Accelerating Scientific Discovery. In CIDR 2015.
 [7] S.J. Gershman, E.J. Horvitz, and J.B. Tenenbaum. 2015. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349, 6245 (2015), 273–278.
 [8] Jim Gray on eScience: a transformed scientific method. In A.J.G. Hey, S. Tansley, and K.M. Tolle (Eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Proc. IEEE 99, 8 (2009), 1334–1337.
 [9] E. Horvitz and D. Mulligan. 2015. Data, privacy, and the greater good. Science 349, 6245 (July 2015), 253–255.
[10] M.I. Jordan and T.M. Mitchell. 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (July 2015), 255–260.
[11] V. Khachatryan et al. 2012. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B 716, 1 (2012), 30–61.
[12] T.S. Kuhn. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press, 1996.
[13] V. Mayer-Schönberger and K. Cukier. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
[14] G.A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2, 81–97.
[15] NIH Precision Medicine Initiative, http://www.nih.gov/precisionmedicine/
[16] D.C. Parkes and M.P. Wellman. 2015. Economic reasoning and artificial intelligence. Science 349, 6245 (July 2015), 267–272.
[17] F. Provost and T. Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 1 (March 2013), 51–59.
[18] S. Spangler et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1877–1886.
[19] J. Stajic, R. Stone, G. Chin, and B. Wible. 2015. Rise of the Machines. Science 349, 6245 (July 2015), 248–249.
[20] J.W. Tukey. 1962. The Future of Data Analysis. Ann. Math. Statist., 1–67.
[21] Bin Yu. Data Wisdom for Data Science. ODBMS.org, April 13, 2015.



