=Paper=
{{Paper
|id=Vol-1536/paper32
|storemode=property
|title=
Understanding Data Science: An Emerging Discipline for Data Intensive Discovery
|pdfUrl=https://ceur-ws.org/Vol-1536/paper32.pdf
|volume=Vol-1536
|dblpUrl=https://dblp.org/rec/conf/rcdl/Brodie15
}}
==Understanding Data Science: An Emerging Discipline for Data Intensive Discovery==
Understanding Data Science: An Emerging Discipline for Data-Intensive Discovery

© Michael L. Brodie
CSAIL, MIT, Cambridge, MA, USA
mlbrodie@csail.mit.edu

Proceedings of the XVII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL'2015), Obninsk, Russia, October 13-16, 2015

Abstract

Over the past two decades, Data-Intensive Analysis has emerged not only as a basis for the Fourth Paradigm of engineering and scientific discovery but as a basis for discovery in most human endeavors for which data is available. Originating in the 1960s, its recent emergence due to Big Data and massive computing power is leading to widespread deployment, yet it is in its infancy in its application and in our understanding of it; hence in its development. Given the potential risks and rewards of Data-Intensive Analysis and its breadth of application, it is imperative that we get this right.

The objective of this emerging Fourth Paradigm is more than acquiring data and extracting knowledge. Like its predecessor the scientific method, the objective of the Fourth Paradigm is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge. In addition, data science is a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. It is now time to identify and understand the fundamentals. In my research, I have analyzed more than 30 very large-scale use cases to understand current practical aspects, to gain insight into the fundamentals, and to address the fourth "V" of Big Data – veracity – the accuracy of the data and the resulting analytics. This development may take decades.

1 Data Science: A New Discovery Paradigm That Will Transform Our World

1.1 Introduction

Over the past two decades, Data-Intensive Analysis (also called Big Data Analytics) has emerged not only as a basis for the Fourth Paradigm [8] of engineering and scientific discovery but more broadly as a basis for discovery in most human endeavours for which data is available. Roots of Data-Intensive Analysis (DIA) that have led to its recent dramatic growth include Big Data (c. 2000) that, just emerging, is opening the door to profound change – to new ways of reasoning, problem solving, and processing that in turn bring new opportunities and challenges.

To better understand DIA and its opportunities and challenges I examined over 30 DIA use cases that are at very large scale – in the range where theory and practice may break. This paper summarizes some key results of my research related to understanding and defining Data Science as a body of principles and techniques with which to measure and improve the correctness, completeness, and efficiency of Data-Intensive Analysis. As with its predecessor discovery paradigms, establishing this emerging Fourth Paradigm and the underlying principles and techniques of Data Science may take decades.
1.2 Significance of DIA and Data Science

Data Science is transforming discovery in many human endeavours including healthcare, manufacturing, education, financial modelling, policing, and marketing [10][13]. It has been used to produce significant results in areas from particle physics (e.g., the Higgs boson), to identifying and resolving sleep disorders using Fitbit data, to recommenders for literature, theatre, and shopping. More than 50 national governments have established data-driven strategies as official policy in science and engineering [2] as well as in healthcare, e.g., the US National Institutes of Health and President Obama's Precision Medicine Initiative [15] for "Delivering the right treatments, at the right time, every time to the right person." The hope, supported by early results, is that data-driven techniques will accelerate the discovery of treatments that manage and prevent chronic diseases with more precision, that are tailored to specific individuals, and that come at dramatically lower cost.

Data Science is being used to radically transform entire domains, such as medicine and biomedical research, as stated as the purpose of the newly created Center for Biomedical Informatics at the Harvard Medical School. It is also making an impact in economics [16], drug discovery [17], and many other domains. As a result of its successes and potential, Data Science is rapidly becoming a sub-discipline of most academic areas. These developments suggest a strong belief in the potential value of Data Science – but can it deliver?

Early successes and clearly stated expectations of Data Science are truly remarkable; however, its actual deployment, like many hot trends, is far less than it appears. According to Gartner's 2015 survey of Big Data Management and Analytics, 60% of the Fortune 500 claim to have deployed Data Science, less than 20% have implemented consequent significant changes, and less than 1% have optimized its benefits. Gartner concludes that 85% will be unable to exploit Big Data in 2015. The vast majority of deployments address tactical aspects of existing processes and static business intelligence rather than realizing its power by identifying strategic advantages through discovering previously unforeseen value.

1.3 Illustrious Histories: The Origins of Data Science

Data Science is in its infancy. Few individuals or organizations understand the potential of and the paradigm shift associated with Data Science, let alone understand it conceptually. The high rewards, the equally high risks, and its pervasive application make it imperative that we better understand Data Science – its models, methods, processes, and results.

Data Science is inherently multi-disciplinary, drawing on over 30 allied disciplines according to some definitions. Its principal components include mathematics, statistics, and computer science, especially areas such as AI (e.g., machine learning), data management, and high performance computing. While these disciplines need to be evaluated in the new paradigm, they have long, illustrious histories. Data analysis developed over 4,000 years ago with origins in Babylon (17th-12th C BCE) and India (12th C BCE). Mathematical analysis originated in the 17th C around the time of the Scientific Revolution. While statistics has its roots in the 5th C BCE and the 18th C, its application in Data Science originated in 1962 with John W. Tukey [20] and George Box [4]. These long, illustrious histories suggest that Data Science draws on well-established results that took decades or centuries to develop. To what extent do they (e.g., statistical significance) apply in this paradigmatically new context?

Data Science constitutes a new paradigm in the sense of Kuhn's scientific revolutions [12]. Data Science's predecessor paradigm, the Scientific Method, reflects approximately 2,000 years of development of empiricism starting with Aristotle (384-322 BCE), Ptolemy (1st C), and the Bacons (13th, 16th C). Data Science, a primary basis of eScience [8], collectively termed the Fourth Paradigm, is emerging following the ~1,000-year development of its three predecessor paradigms of scientific and engineering discovery: theory, experimentation, and simulation [8]. Data Science, which has developed and been applied for over 50 years, changed qualitatively in the late 20th century with the emergence of Big Data, typically defined as data at volumes, velocities, and variety that current technologies, let alone humans, cannot handle efficiently. This paper addresses another characteristic that current technologies and theories do not handle well: veracity.
1.4 What Could Possibly Go Wrong?

Do we understand the risks of recommending the wrong film, the wrong product, the wrong medical diagnoses, treatments, or drugs? The minimal apparent risk of a result that fails to achieve its objectives when acted upon includes losses in time, resources, customer satisfaction, customers, and potentially a loss of business. The vast majority of Data Science applications face such small risks; hence veracity has received little attention. Far greater risks could be incurred if incorrect Data Science results are acted upon in critical contexts, such as those already underway in drug discovery [18] and personalized medicine. Most scientists in these contexts are well aware of the risks of errors, hence go to extremes to estimate and minimize them. The wonder of the ATLAS and CMS projects' "discovery" of the Higgs boson at CERN, announced July 4, 2012 with a confidence of 5 sigma, might suggest that the results were achieved overnight. They were not. They took 40 years and included Data Science techniques developed over a decade applied over Big Data by two independent projects, ATLAS and CMS, each of which was subsequently peer reviewed and published [1][11] with a further yearlong verification that established a confidence of 10 sigma. To what extent do the vast majority of Data Science applications concern themselves with verification and error bounds, let alone understand the verification methods applied at CERN? Informal surveys of data scientists conducted in this study at Data Science conferences suggest that 80% of customers never ask for error bounds.

The existential risks of applying Data Science have been raised by world-leading authorities such as the Organization for Economic Cooperation and Development and the AI [3][7][9][19] and legal [5] communities, with the most extreme concerns stated by the Future of Life Institute, whose objective is safeguarding life and developing optimistic visions of the future in order to mitigate existential risks facing humanity from AI.

Given the potential risks and rewards of DIA and its breadth of application across conventional, empirical scientific and engineering domains as well as across most human endeavors, we had better get this right! The scientific and engineering communities place high confidence in their existing discovery paradigms, with well-defined measures of likelihood and confidence within relatively precise error estimates¹. Can we say the same for modern Data Science as a discovery paradigm and for its results? A simple observation of the formal development of the processes and methods of its predecessors suggests that we cannot. Indeed, we do not know if or under what conditions the constituent disciplines, like statistics, may break down.

¹ Even after 1,000 years serious issues persist, e.g., P values (significance) and reproducibility.

Do we understand DIA to the extent that we can assign probabilistic measures of likelihood to its results? With the scale and emerging nature of DIA-based discovery, how do we estimate the correctness and completeness of analytical results relative to a hypothesized discovery question when the underlying principles and techniques may not apply in this new context? Machine learning algorithms can identify correlations between thousands, millions, or even billions of variables. This suggests that it is difficult to impossible for humans to understand what or how these algorithms discover. Imagine trying to understand such a model that results from selecting some subset of the correlations on the assumption that they may be causal and thus constitute a model of the phenomenon, with high confidence of being correct with respect to some hypotheses, with or without error bars.
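To make the error-bounds concern concrete, consider how many apparently significant correlations pure noise produces when many variables are screened. The sketch below is my own illustration, not an analysis from any of the use cases in this study; the sample size, variable count, and thresholds are arbitrary assumptions.

```python
# Illustrative sketch: spurious "discoveries" when screening many variables.
# Synthetic Gaussian noise only; variable counts and thresholds are arbitrary.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_samples, n_vars = 200, 1_000
X = rng.standard_normal((n_samples, n_vars))   # candidate predictors: pure noise
y = rng.standard_normal(n_samples)             # outcome, independent of every predictor

p_values = np.array([stats.pearsonr(X[:, j], y)[1] for j in range(n_vars)])

naive_hits = int((p_values < 0.05).sum())               # ~5% expected by chance alone
corrected_hits = int((p_values < 0.05 / n_vars).sum())  # Bonferroni-corrected threshold

print(f"'significant' at p < 0.05:       {naive_hits}")
print(f"surviving Bonferroni correction: {corrected_hits}")
```

Roughly 50 of the 1,000 noise variables pass the naive test and essentially none survive the correction, which is one reason results selected from millions or billions of candidate correlations need explicit error bounds and independent verification.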
1.5 Do We Understand Data Science?

Do we even understand what Data Science methods compute or how they work? Human thought is limited by the human mind. According to Miller's Law [14], the human mind (short-term working memory) is capable of conceiving less than ten (7 +/- 2) concepts at one time. Hence, humans have difficulty understanding complex models involving more than ten variables. The conventional process is to imagine a small number of variables² and then abstract or encapsulate that knowledge into a model that can subsequently be augmented with more variables. Thus most scientific theories develop slowly over time into complex models. For example, Newton's model of particle physics was extended for 350 years through Bohr, Heisenberg, Einstein, and others, up to Glashow, Salam, and Weinberg, to form the Standard Model of Particle Physics. Scientific discovery in particle physics is wonderful and has taken over 350 years. Due to its complexity, no physicist has understood the entire Standard Model for decades; rather, it is represented in complex, computational models.

² Physical science PhDs typically involve < 5 variables.

When humans analyse a problem, they do so with models with a limited number of variables. As the number of variables increases, it is increasingly difficult to understand the model and the potential combinations and correlations. Hence, humans limit their models and analyses to those that they can comprehend. These human-scale models are typically theory-driven, thus limiting their scale (number of variables) to what can be conceived.
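A back-of-the-envelope count (my own illustration) shows how quickly the space of candidate relationships outgrows the 7 +/- 2 concepts of Miller's Law: even restricting attention to pairwise correlations, the number of candidates grows quadratically with the number of variables.

```python
# Illustrative arithmetic: candidate pairwise correlations among n variables, n*(n-1)/2.
from math import comb

for n in (7, 100, 10_000, 1_000_000):
    print(f"{n:>9,} variables -> {comb(n, 2):>15,} candidate pairwise correlations")
```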
What if the phenomenon is arbitrarily complex or beyond immediate human conception? I suspect that this is addressed iteratively, with one model (theory) becoming abstracted as the base for another more complex theory, and so on (standing on the shoulders of those who have gone before), e.g., the development of quantum physics from elementary particles. That is, once the human mind understands a model, it can form the basis of a more complex model. This development under the scientific method scales at a rate limited by human conception, thus limiting the number of variables and complexity. This is error-prone since phenomena may not manifest at a certain level of complexity; hence models correct at one scale may be wrong at a larger scale or vice versa – a model wrong at one scale (hence discarded) may become correct at a higher scale (more complex model).

In summary, we do not yet understand DIA adequately to quantify the probability or likelihood that a projected outcome will occur within estimated error bounds. While CERN used Data Science and Big Data to identify results, verification was ultimately empirical, as it must be in drug discovery [18] and other critical areas, until analytical techniques are developed and proven robust.

1.6 Cornerstone of A New Discovery Paradigm

The Fourth Paradigm – eScience supported by Data Science – is paradigmatically different from its predecessor discovery paradigms. It provides revolutionary new ways [12] of thinking, reasoning, and processing – new modes of inquiry, problem solving, and decision-making. It is not the Third Paradigm augmented by Big Data, but something profoundly different. Losing sight of this difference forfeits its power and benefits and loses the perspective that it is A Revolution That Will Transform How We Live, Work, and Think [13].

Paradigm shifts are difficult to notice as they emerge, just as the proverbial frog does not notice that its hot bath is becoming lethal. There are several ways to describe the shift. There is a shift of resources from (empirically) discovering causality (why the phenomenon occurs) – the heart of the Scientific Method – to discovering interesting correlations (what might have occurred). This shift involves moving from a strategic perspective driven by human-generated hypotheses (theory-driven, top-down) to a tactical perspective driven by observations (data-driven, bottom-up).

Seen at their extremes, the Scientific Method involves testing hypotheses (theories) posed by scientists, while Data Science can be used to generate hypotheses to be tested based on significant correlations amongst variables that are identified algorithmically in the data. In principle, vast amounts of data and computing power can be used to accelerate discovery simply by outpacing human thinking in both power and complexity. The power of Data Science is growing rapidly due to the development of ever more powerful computing resources and algorithms, such as deep learning. So rather than optimize an existing process, Data Science can be used to identify patterns that suggest unforeseen solutions, thus automating serendipity – as it is called when a human observes an anomaly that stimulates a bright idea to resolve it.

However, even more compelling is one step beyond the simple version of this shift, namely a symbiosis of both paradigms. For example, Data Science can be used to offer highly probable hypotheses or correlations from which we select those with acceptable error estimates and that are worthy of subsequent empirical analysis. In turn, empiricism is used to pursue these hypotheses until some converge and some diverge, at which point Data Science can be applied to refine or confirm the converging hypotheses, having discarded the divergent hypotheses, and the cycle starts again. Hence, ideally, one would optimize the combination of theory-driven empirical analysis with data-driven analysis to accelerate discovery faster than either on its own.
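The symbiosis just described can be pictured as a loop: a data-driven step proposes hypotheses with error estimates, and an empirical, theory-driven step accepts, refines, or discards them. The skeleton below is a hypothetical sketch of that control flow on synthetic data; generate_candidate_hypotheses, empirically_test, the thresholds, and the use of a held-out split as the "empirical" arm are illustrative stand-ins, not methods from this paper.

```python
# Hypothetical sketch of the data-driven / theory-driven cycle on synthetic data.
# Function names, thresholds, and the held-out "empirical" step are placeholders.
import numpy as np
from scipy import stats

def generate_candidate_hypotheses(X, y, max_p=1e-3):
    """Data-driven step: propose variables whose correlation with y looks significant."""
    candidates = []
    for j in range(X.shape[1]):
        r, p = stats.pearsonr(X[:, j], y)
        if p < max_p:
            candidates.append((j, r, p))
    return candidates

def empirically_test(hypothesis, X_new, y_new, max_p=0.01):
    """Stand-in for the empirical arm: re-test the proposed relation on held-out data."""
    j, _, _ = hypothesis
    _, p = stats.pearsonr(X_new[:, j], y_new)
    return p < max_p

rng = np.random.default_rng(1)
n, d = 500, 200
X = rng.standard_normal((n, d))
y = 0.5 * X[:, 3] + rng.standard_normal(n)       # one genuine effect hidden among noise

X_screen, y_screen = X[:250], y[:250]            # data used to generate hypotheses
X_confirm, y_confirm = X[250:], y[250:]          # data reserved for confirmation

candidates = generate_candidate_hypotheses(X_screen, y_screen)
confirmed = [h for h in candidates if empirically_test(h, X_confirm, y_confirm)]
print("proposed variables:", [j for j, _, _ in candidates])
print("confirmed variables:", [j for j, _, _ in confirmed])
```

In a real DIA ecosystem the confirmation step would be a designed experiment rather than a second pass over similar data, and the surviving hypotheses would feed the next round of data-driven refinement.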
While Data Science is a cornerstone of a new discovery paradigm, it may be conceptually and methodologically more challenging than its predecessors since it involves everything included in its predecessor paradigms – modelling, methods, processes, measures of correctness, completeness, and efficiency – in a much more complex context, namely that of Big Data. Following well-established developments, we should try to find the fundamentals of Data Science – its principles and techniques – to help manage the complexity and guide its understanding and application.

2 Data Science: A Perspective

Since Data Science is in its infancy and is inherently multi-disciplinary, there are naturally many definitions of Data Science that should emerge and evolve with the discipline. As definitions serve many purposes, it is reasonable to have multiple definitions, each serving several different purposes. Most Data Science definitions attempt to define Why (its purpose), What (constituent disciplines), and How (constituent actions of discovery workflows).

A common definition of Data Science is the activity of extracting knowledge from data³. While simple, it does not convey the larger goal of Data Science or its consequent challenges. A DIA activity is far more than a collection of actions or the mechanical processes of acquiring and analyzing data. Like its predecessor paradigm, the Scientific Method, the purpose of Data Science and a DIA activity is to investigate phenomena by acquiring new knowledge and correcting and integrating it with previous knowledge – continually evolving our current understanding of the phenomena based on newly available data. We seldom start from scratch, clearly the simplest case here. Hence, discovering, understanding, and integrating data must precede extracting knowledge, all at massive scale, i.e., largely by automated means.

³ Wikipedia.com

The Scientific Method that underlies the Third Paradigm is a body of principles and techniques that provide the formal and practical bases of scientific and engineering discovery. The principles and techniques have been developed over hundreds of years, originating with Plato, and are still evolving today with significant unresolved issues such as statistical significance (i.e., P values) and reproducibility.

While Data Science had its origins 50 years ago with Tukey [20] and Box [4], it started to change qualitatively less than two decades ago with the emergence of Big Data and the consequent paradigm shift described above. The focus of this research into modern Data Science is on veracity – the ability to estimate the correctness, completeness, and efficiency of an end-to-end DIA activity and of its results. Hence, I use the following definition, which is in the spirit of [17].

Data Science is a body of principles and techniques for applying data-intensive analysis to investigate phenomena, acquire new knowledge, and correct and integrate previous knowledge with measures of correctness, completeness, and efficiency of the derived results with respect to some pre-defined (top down) or emergent (bottom up) specification (scope, question, hypothesis).
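One generic way to attach the measures of correctness the definition asks for is to report every derived quantity with an interval, for example via bootstrap resampling. The snippet below is a minimal sketch on synthetic data; the data and the chosen statistic are illustrative assumptions, and it says nothing about errors introduced earlier in the data management stages.

```python
# Minimal bootstrap sketch: attach an uncertainty interval to a derived statistic.
# The synthetic data and the chosen statistic (a trimmed mean) are illustrative.
import numpy as np

rng = np.random.default_rng(42)
data = rng.lognormal(mean=0.0, sigma=1.0, size=5_000)   # skewed "measurement" data

def statistic(sample):
    trimmed = np.sort(sample)[len(sample) // 20 : -(len(sample) // 20)]  # drop 5% tails
    return trimmed.mean()

estimate = statistic(data)
boot = np.array([statistic(rng.choice(data, size=data.size, replace=True))
                 for _ in range(1_000)])
low, high = np.percentile(boot, [2.5, 97.5])
print(f"estimate = {estimate:.3f}, 95% bootstrap interval = [{low:.3f}, {high:.3f}]")
```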
3 Understanding Data Science From Practice

3.1 Methodology to Better Understand DIA

Driven by a passion for understanding Data Science in practice, my year-long and on-going research study has investigated over 30 very large scale Big Data applications, most of which have produced or are daily producing significant value. The use cases include particle physics; astrophysics and satellite imagery; oceanography; economics; information services; life sciences applications in pharmaceuticals, drug discovery, and genetics; and various areas of medicine including precision medicine, hospital studies, clinical trials, and intensive care unit and emergency room medicine.

The focus is to investigate relatively well-understood, successful use cases where correctness is critical and the Big Data context is at massive scale; such use cases constitute less than 5% of all deployed Big Data analytics. The focus was on these use cases, as we do not know where errors may arise outside normal scientific and analytical errors. There is a greater likelihood that established disciplines, e.g., statistics and data management, might break at very large scale, where errors due to failed fundamentals may be more obvious.

The breadth and depth of the use cases revealed strong, significant emerging trends, some of which are listed below. These confirmed for some use case owners, and suggested to others, solutions and directions that they were pursuing but could not have seen without the perspective of 30+ use cases.

3.2 DIA Processes

A Data-Intensive Analysis activity is an analytical process that consists of applying sophisticated analytical methods to large data sets that are stored under some analytical models. While this is the typical view of Data Science projects or DIA use cases, this analytical component of the DIA activity constitutes ~20% of an end-to-end DIA pipeline or workflow. Currently it consumes ~20% of the resources required to complete a DIA analysis.

An end-to-end DIA activity involves two data management processes that precede the DIA process, namely Raw Data Acquisition and Curation, and Analytical Data Acquisition. Raw Data Acquisition and Curation starts with discovering and understanding data in data sources and ends with integrating and storing curated data in a repository that represents entities in the domain of interest, with metadata about those entities with which to make specific interpretations, and that is shared by a community of users. Analytical Data Acquisition starts with discovering and understanding data within the shared repository and ends with storing the resulting information, specific entities and interpretations, into an analytical model to be used by the subsequent DIA process.

Sophisticated algorithms such as machine learning largely automate DIA processes, as they have to be automated to process such large volumes of data using complex algorithms. Currently, the Raw Data Acquisition and Curation and Analytical Data Acquisition processes are far less automated, typically requiring 80% or more of the total resources to complete.

This understanding leads to the following definitions.

Data-Intensive Discovery (DID) is the activity of using Big Data to investigate phenomena, to acquire new knowledge, and to correct and integrate previous knowledge. "-Intensive" is added when the data is "at scale".

Theory-driven DID is the investigation of human-generated scientific, engineering, or other hypotheses over Big Data. Data-driven DID employs automatic hypothesis generation.

Data-Intensive Analysis (DIA) is the process of analyzing Big Data with analytical methods and models.

DID goes beyond the Third Paradigm of scientific or engineering discovery by investigating scientific or engineering hypotheses using DIA. A DIA activity is an experiment over data, thus requiring all aspects of a scientific experiment, e.g., experimental design, expressed over data, a.k.a. data-based empiricism.

A DIA Process (workflow or pipeline) is a sequence of operations that constitute an end-to-end DIA activity from the source data to the quantified, qualified result.

Currently, ~80% of the effort and resources required for the entire DIA activity are due to the two data management processes – areas where scientists and analysts are not experts. Emerging technology, such as for data curation at scale, aims to flip that ratio from 80:20 to 20:80 so as to let scientists do science, analysts do analysis, etc. This requires an understanding of the data management processes and their correctness, completeness, and efficiency in addition to those of the DIA process. Another obvious consequence is that proportionally 80% of the errors that could arise in DIA may arise in the data management processes, prior to DIA even starting.
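The three stages named above – Raw Data Acquisition and Curation, Analytical Data Acquisition, and the DIA proper – can be sketched as a pipeline skeleton. The outline below is hypothetical: the stage names follow the text, but the function bodies, the toy data, and the provenance bookkeeping are invented for illustration.

```python
# Hypothetical skeleton of an end-to-end DIA pipeline with the three stages named
# in the text. All function bodies and data are illustrative stubs, not a real system.
from dataclasses import dataclass, field

@dataclass
class Provenance:
    """Minimal bookkeeping so errors can be traced to the stage that introduced them."""
    steps: list = field(default_factory=list)

    def record(self, stage: str, note: str) -> None:
        self.steps.append((stage, note))

def raw_data_acquisition_and_curation(sources, prov):
    """Discover, clean, and integrate source data into a shared curated repository."""
    curated = {name: [x for x in rows if x is not None] for name, rows in sources.items()}
    prov.record("curation", f"integrated {len(curated)} sources")
    return curated

def analytical_data_acquisition(curated, prov):
    """Select and reshape curated entities into the analytical model the analysis expects."""
    staged = [value for rows in curated.values() for value in rows]
    prov.record("analytical acquisition", f"{len(staged)} observations staged")
    return staged

def data_intensive_analysis(staged, prov):
    """The ~20% analysis stage: here just a mean with a crude spread estimate."""
    n = len(staged)
    mean = sum(staged) / n
    spread = (sum((x - mean) ** 2 for x in staged) / (n - 1)) ** 0.5
    prov.record("DIA", "computed mean with sample standard deviation")
    return {"estimate": mean, "spread": spread, "n": n}

if __name__ == "__main__":
    prov = Provenance()
    sources = {"sensor_a": [1.0, 2.0, None, 2.5], "sensor_b": [1.5, None, 2.2]}
    result = data_intensive_analysis(
        analytical_data_acquisition(
            raw_data_acquisition_and_curation(sources, prov), prov), prov)
    print(result)
    print(prov.steps)
```

Recording provenance per stage reflects the observation that most of the effort, and potentially most of the errors, sits in the two data management stages rather than in the analysis itself.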
3.3 Characteristics of Large-Scale DIA Use Cases

The focus of my research is successful, very large scale, multi-year projects, many with 100s to 1,000s of ongoing DIA activities. These activities are supported by a DIA ecosystem consisting of a community of users (e.g., over 5,000 scientists in the ATLAS and CMS projects at CERN, and similar numbers of scientists using the worldwide Cancer Genome Atlas) and technology (e.g., science gateways⁴, collectively referred to in some branches of science as networked science). Some significant trends that have emerged from the analysis of these use cases are listed, briefly, below.

⁴ There are over 60 large-scale scientific gateways, e.g., The Cancer Genome Atlas and CERN's Worldwide LHC Computing Grid.

The typical view of Data Science appears to be based on the vast majority (~95%) of DIA use cases. While they share some characteristics with those in this study, there are fundamental differences, such as the concern for and due diligence associated with veracity mentioned above.

Based on this study, data analysis appears to fall into three classes. Conventional data analysis over "small data" accounts for at least 95% of all data analysis, often using Microsoft Excel. DIA over Big Data has two sub-classes: simple DIA, i.e., the vast majority of DIA use cases mentioned above, and complex DIA, such as the use cases analyzed in this study, which are characterized by complex analytical models (e.g., sub-models of the Standard Model of Physics, economic models, an organizational model for enterprises worldwide, and models for genetics and epigenetics) and a corresponding plethora of analytical methods (e.g., the vast method libraries in CERN's ROOT framework). The models and methods are as complex as the phenomena being analyzed.

The most widely used DIA tools for simple cases claim to support analyst self-service in point-and-click environments, some claiming "point us at the data and we will find the patterns of interest for you". This characteristic is infeasible in the use cases analyzed. A requirement common to the use cases analyzed is not only the principle of being machine-driven and human-guided, i.e., a man-machine symbiosis, but extensive attempts to optimize this symbiosis for scale, cost, and precision (too much human-in-the-loop leads to errors, too little leads to nonsense).

DIA ecosystems are inherently multi-disciplinary (ideally interdisciplinary), collaborative, and iterative. Not only does DIA (Big Data Analytics) require multiple disciplines, e.g., genetics, statistics, and machine learning, so too do the data management processes require multiple disciplines, e.g., data management, domain and machine learning experts for data curation, statisticians for sampling, etc.

In large-scale DIA ecosystems, a DIA is a virtual experiment [6]. Far from claims of simplicity and point-and-click self-service, most large-scale DIA activities reflect the complexity of the analysis at hand and are the result of long-term (months to years) experimental designs that involve greater complexity than their empirical counterparts to deal with scale, significance, hypotheses, null hypotheses, and deeper challenges such as determining causality from correlations and identifying and dealing with biases and often irrational human intervention.

Finally, veracity is one of the most significant challenges and critical requirements of all DIA ecosystems studied. While there are many complex methods in conventional Data Science to estimate veracity, most owners of the use cases studied expressed concern for adequately estimating veracity in modern Data Science. Most assume that all data is imprecise, hence require probabilistic measures, error bars, and likelihood estimates for all results. More basically, most DIA ecosystem experts recognize that errors can arise across an end-to-end DIA activity and are investing substantially in addressing these issues in both the DIA processes and the data management processes that currently require significant human guidance.

An objective of this research is to discover the extent to which the above characteristics of very large scale, complex DIAs also apply to simple DIAs. There is a strong likelihood that they apply directly but are difficult to detect. That is, the principles and techniques of DIA apply equally to simple and complex DIA.
3.4 Looking Into A Use Case

Due to the detail involved, there is not space here to describe a single use case considered in this study. However, let's look into a single step of a use case involving a virtual experiment conducted at CERN in the ATLAS project. The heart of empirical science is experimental design. It starts by identifying, formulating, and verifying a worthy hypothesis to pursue. This first complex step typically involves a multi-disciplinary team, called the collaborators for this virtual experiment, often from around the world, for more than a year. We consider the second step, the construction of the control or background model (executable software and data) that creates the background (e.g., an executable or testable model and a given data set) required as the basis within which to search (analyze) for "signals" that would represent the phenomenon being investigated in the hypothesis. This is the control that completely excludes the data of interest. The data of interest (the signal region) is "blinded" completely so as not to bias the experiment. The background (control) is designed using software that simulates relevant parts of the Standard Model of particle physics plus data from ATLAS selected with the appropriate signatures, with the data of interest blinded.

Over time, ATLAS contributors have developed simulations of many parts of the Standard Model. Hence, constructing the model required for the background involves selecting and combining relevant simulations. If there is no simulation for some aspect that you require, then it must be requested or you may have to build it yourself. Similarly, if there is no relevant data of interest in the experimental data repository, it must be requested from subsequent capture from the detectors when the LHC is next fired up at the appropriate energy levels. This comes from a completely separate team running the (non-virtual) experiment.

The development of the background is approximately a one person-year activity, as it involves the experimental design, the design and refinement of the model (software simulations), the selection of methods and tuning to achieve the correct signature (i.e., get the right data), verifying the model (observing expected outcomes when tested), and dealing with errors (statistical and systematic) that arise from the hardware or process. The result of the background phase is a model approved by the collaboration to represent the background required by the experiment, with the signal region blinded. The model is an "application" that runs on the ATLAS "platform" using ATLAS resources – libraries, software, simulations, and data – drawing much on the ROOT framework, CERN's core modeling and analysis infrastructure. It is verified by being executed under various testing conditions.

This is an incremental or iterative process, each step of which is reviewed. The resulting design document for the Top Quark experiment was approximately 200 pages of design choices, parameter settings, and results – both positive and negative! All experimental data and analytical results are probabilistic. All results have error bars; in particle physics they must be at least 5 sigma to be accepted. This explains the year of iteration in which analytical models are adjusted, analytical methods are selected and tuned, and results are reviewed by the collaboration.
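The background-and-blinding step just described can be caricatured in a few lines: fix a background expectation from control regions with the signal region excluded, then, in a single unblinding, compare the observed count in the signal region against that expectation and quote the excess in sigma. The counting model, region boundaries, and numbers below are invented for illustration; they are not ATLAS's actual procedure or values.

```python
# Illustrative caricature of a blinded counting analysis: estimate the background from
# sidebands, keep the signal region blinded, then unblind once and quote a significance.
# The event model, region boundaries, and numbers are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic "events": a falling exponential background plus a small bump near x = 5.0.
background = rng.exponential(scale=3.0, size=20_000)
signal = rng.normal(loc=5.0, scale=0.1, size=120)
events = np.concatenate([background, signal])

signal_region = (4.8, 5.2)                      # blinded until the final step
sidebands = [(4.0, 4.8), (5.2, 6.0)]            # control regions for the background model

# Background model: average sideband rate per unit x, scaled to the signal window width.
sideband_count = sum(((events > lo) & (events < hi)).sum() for lo, hi in sidebands)
sideband_width = sum(hi - lo for lo, hi in sidebands)
expected_bkg = sideband_count / sideband_width * (signal_region[1] - signal_region[0])

# Unblinding: one and only one look at the signal region.
observed = ((events > signal_region[0]) & (events < signal_region[1])).sum()

p_value = stats.poisson.sf(observed - 1, expected_bkg)   # P(N >= observed | background only)
significance = stats.norm.isf(p_value)                   # one-sided Gaussian-equivalent sigma
print(f"expected background = {expected_bkg:.1f}, observed = {observed}, "
      f"p = {p_value:.2e}, significance = {significance:.1f} sigma")
```

A real background at CERN is built from full detector simulation with statistical and systematic errors and collaboration-wide review; the sketch only captures the discipline of freezing the background model before the single unblinded look.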
The next step is the actual virtual experiment. This too takes months. You might be surprised to find that once the data is un-blinded (i.e., synthetic data is replaced in the region of interest with experimental data), the experimenter, often a PhD candidate, gets one and only one execution of the "verified" model over the experimental data.

Hopefully this portion of a use case illustrates that DIA is a complex but critical tool in scientific discovery, used with a well-defined understanding of veracity. It must stand up to scrutiny that evaluates whether the experiment – consisting of all models, methods, and data, with probabilistic results and error bounds better than 5 sigma – is adequate to be accepted by Science or Nature as demonstrating that the hypothesized correlation is causal.

4 Research For An Emerging Discipline

The next step in this research – to better understand the theory and practice of the emerging discipline of Data Science, to understand and address its opportunities and challenges, and to guide its development – is given in its definition. Modern Data Science builds on conventional Data Science and on all of its constituent disciplines required to design, verify, and operate end-to-end DIA activities, including both data management and DIA processes, in a DIA ecosystem for a shared community of users. Each discipline must be considered with respect to how it contributes to investigating phenomena, acquiring new knowledge, and correcting and integrating new with previous knowledge. Each operation must be understood with respect to how its correctness, completeness, and efficiency can be estimated.

This research involves identifying relevant principles and techniques. Principles concern the theories that are established formally, e.g., mathematically, and possibly demonstrated empirically. Techniques involve the application of wisdom [21], i.e., domain knowledge, art, experience, methodologies, and practice, often called best practices. The principles and techniques, especially those established for conventional Data Science, must be verified and, if required, extended, augmented, or replaced for the new context of the Fourth Paradigm, especially its volumes, velocities, and variety. For example, new departments at MIT, Stanford, and the University of California, Berkeley, are conducting such research under what some are calling 21st Century Statistics.

A final, stimulating challenge is what is called meta-modelling or meta-theory. DIA, and more generally Data Science, is inherently multi-disciplinary [10]. This area emerged in the physical sciences in the 1980s and subsequently in statistics and machine learning, and is now being applied in other areas to address combining results of multiple disciplines. Analogously, meta-modelling arises when using multiple analytical models and multiple analytical methods to analyze different perspectives or characteristics of the same phenomena. This extremely natural and useful methodology, called ensemble modelling, is required in many physical sciences, statistics, and AI, and should be explored as a fundamental modelling methodology.
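Ensemble modelling of the kind mentioned above can be illustrated with a toy combination of dissimilar models whose disagreement doubles as a rough uncertainty signal. The models, data, and combination rule below are my own illustrative choices (using scikit-learn), not a methodology proposed in this paper.

```python
# Toy ensemble: combine predictions from dissimilar models and use their spread
# as a crude uncertainty signal. Models, data, and the averaging rule are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(400)

X_train, y_train = X[:300], y[:300]
X_test, y_test = X[300:], y[300:]

models = [
    LinearRegression(),
    KNeighborsRegressor(n_neighbors=10),
    DecisionTreeRegressor(max_depth=4, random_state=0),
]
predictions = np.column_stack([m.fit(X_train, y_train).predict(X_test) for m in models])

ensemble_mean = predictions.mean(axis=1)     # combined prediction
ensemble_spread = predictions.std(axis=1)    # inter-model disagreement per test point

rmse = np.sqrt(np.mean((ensemble_mean - y_test) ** 2))
print(f"ensemble RMSE = {rmse:.3f}, mean inter-model spread = {ensemble_spread.mean():.3f}")
```

Points where the component models disagree most are natural candidates for the human-guided review that the use cases above insist on.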
Acknowledgement

I gratefully acknowledge the brilliant insights and improvements proposed by Prof. Jennie Duggan, Northwestern University, and Prof. Thilo Stadelmann, Zurich University of Applied Sciences.

References

[1] G. Aad et al. 2012. Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC. Physics Letters B 716, 1 (2012), 1–29.
[2] Accelerating Discovery in Science and Engineering Through Petascale Simulations and Analysis (PetaApps), National Science Foundation, posted July 28, 2008.
[3] J. Bohannon. 2015. Fears of an AI pioneer. Science 349, 6245 (July 2015), 252.
[4] G.E.P. Box. Science and Statistics. Journal of the American Statistical Association 71, 356 (April 2012), 791–799; reprint of original from 1962.
[5] N. Diakopoulos. 2014. Algorithmic Accountability Reporting: On the Investigation of Black Boxes. Tow Center, February 2014.
[6] J. Duggan and M. Brodie. 2015. Hephaestus: Data Reuse for Accelerating Scientific Discovery. In CIDR 2015.
[7] S.J. Gershman, E.J. Horvitz, and J.B. Tenenbaum. 2015. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science 349, 6245 (2015), 273–278.
[8] Jim Gray on eScience: a transformed scientific method, in A.J.G. Hey, S. Tansley, and K.M. Tolle (Eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Proc. IEEE 99, 8 (2009), 1334–1337.
[9] E. Horvitz and D. Mulligan. 2015. Data, privacy, and the greater good. Science 349, 6245 (July 2015), 253–255.
[10] M.I. Jordan and T.M. Mitchell. 2015. Machine learning: Trends, perspectives, and prospects. Science 349, 6245 (July 2015), 255–260.
[11] V. Khachatryan et al. 2012. Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC. Physics Letters B 716, 1 (2012), 30–61.
[12] T.S. Kuhn. 1996. The Structure of Scientific Revolutions. 3rd ed. Chicago, IL: University of Chicago Press.
[13] V. Mayer-Schönberger and K. Cukier. 2013. Big Data: A Revolution That Will Transform How We Live, Work, and Think. Houghton Mifflin Harcourt.
[14] G.A. Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review 63, 2, 81–97.
[15] NIH Precision Medicine Initiative, http://www.nih.gov/precisionmedicine/
[16] D.C. Parkes and M.P. Wellman. 2015. Economic reasoning and artificial intelligence. Science 349, 6245 (July 2015), 267–272.
[17] F. Provost and T. Fawcett. 2013. Data Science and its Relationship to Big Data and Data-Driven Decision Making. Big Data 1, 1 (March 2013), 51–59.
[18] S. Spangler et al. 2014. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 1877–1886.
[19] J. Stajic, R. Stone, G. Chin, and B. Wible. 2015. Rise of the Machines. Science 349, 6245 (July 2015), 248–249.
[20] J.W. Tukey. 1962. The Future of Data Analysis. Ann. Math. Statist., 1–67.
[21] Bin Yu. 2015. Data Wisdom for Data Science. ODBMS.org, April 13, 2015.