The Big Mechanism Program: Changing How Science Is Done

Andrey Rzhetsky
University of Chicago, 900 East 57th Street, Chicago, IL 60637, USA
andrey.rzhetsky@uchicago.edu

Proceedings of the XVIII International Conference «Data Analytics and Management in Data Intensive Domains» (DAMDID/RCDL’2016), Ershovo, Russia, October 11 - 14, 2016

Abstract

The talk will describe details of actively evolving research conducted by the UChicago consortium of the Big Mechanism program, funded by the US DARPA agency. The consortium’s work focuses on: (1) probabilistic reasoning across cancer claims culled from the literature, using custom-designed ontologies; (2) the computational modelling of cancer mechanisms and pathways to automatically predict therapeutic clues; (3) automated hypothesis generation to strategically extend this knowledge; and (4) developing a ‘Robot Scientist’ that performs experiments to test hypotheses probabilistically and then feeds those results back to the system.

1 Introduction

DARPA is funding the Big Mechanism program (http://www.darpa.mil/program/big-mechanism) in order to study large, explanatory models of complicated systems in which interactions have important causal effects. The program’s aim is to develop technology that can read research abstracts and papers and extract pieces of causal mechanisms, assemble these pieces into more complete causal models, and reason over these models to produce explanations. The program’s domain is cancer biology, with an emphasis on signalling pathways; this is just one example of the causal, explanatory models that we hope will be extensible across multiple domains, similar to what IBM Watson’s team [1] is attempting presently.

2 The overall structure of the Big Mechanism program

The program is currently organized into three consortia, which take different views of causal models and use different reading technologies and different use cases.

The largest consortium, called FRIES, includes groups at CMU, SRI, University of Arizona, Oregon Health Sciences University, and others. FRIES’s main focus is to explain signalling pathway behaviours. For instance, why is the expression of a gene ephemeral? Technologically, FRIES focuses on information extraction over deep reading, simulation, and even FPGA acceleration of systems biology simulators.

The second consortium (“UChicago”), in which the author of this keynote acts as the PI, is composed of researchers at the University of Chicago and the United Kingdom’s National Centre for Text Mining at the University of Manchester, along with participants from Brunel University in London, all of whom collaborate on developing robotic platforms for experiment design and analysis.

The third consortium, called CURE, consists of two groups from Harvard Medical School, IHMC in Florida, and SIFT. Their focus is on deep reading, fine-grained modeling, and simulation of cell signaling’s underlying biochemistry.

This talk will provide an overview of the objectives and results related mostly to the work of the second consortium.

3 UChicago consortium

As the project is ongoing and far from completion, we will cover the ideas that led the consortium to our current system design, our biological and medical motivations, and preliminary results.

Motivation: Today, cancer-related text mining is performed in linear pipelines (named entity recognition to event extraction) without explicitly estimating statement uncertainty or importance relative to a total model of cancer. Moreover, reading is divorced from reasoning and experimentation. Probabilistic reasoning is rarely used. Similarly, the Robot Scientist approach currently uses non-probabilistic logic, is disconnected from text mining, and is not applied to medicine. In addition, a wealth of panomics data is increasingly available, but existing methods treat each event independently and disregard prior knowledge.

Fundamental medical problem: We do not fully understand how to stop cancer cells from growing faster than normal tissue and spreading throughout the body (metastasizing). Death from cancer typically occurs when uncontrolled growth occurs in a place where it cannot be surgically removed. Most traditional anti-cancer drugs are highly toxic to patients. As a result, single drug treatment is generally undesirable for the following reasons: (1) it is generic and not targeted to the patient and their cancer’s genotype(s); (2) intervention is required at multiple points along a cancer pathway; and (3) cancer evolves resistance. The Holy Grail of cancer therapy is to find highly potent, non-toxic drug combinations that are tailored to individual patients and linked to the readout of gene and protein expression from their specific cancer(s).

The system developed by the consortium incorporates three components, called Reading, Assembly, and Explanation (see Figure 1). These components integrate machine reading with probabilistic modelling, the design of custom-made ontologies, and automated experiments conducted by the Robot Scientist (a robot that is driven by experiment-designing and planning programs). For quality control and benchmarking, an independent set of experiments is conducted by humans.

Figure 1 The integrated system; see references [2,3] for related prior work contributing to the components of the system

To illustrate how all these components come together, the talk will present a use case: automated, optimal drug combination prediction for achieving activation or silencing of target gene(s) in a breast cancer cell line. In our initial setup, we are using a text-mined network of about three hundred genes and proteins, containing parts of networks in use cases 1 and 2. In the first pass, we focused on activating the estrogen receptor gene (ESR1) in a triple-negative breast cancer cell line by administering a cocktail of two or more FDA-approved drugs.

The motivation for the use case is to practically apply a model of cellular machinery, which grows through machine reading and experimental validation, to manipulate the state of the cancer cell, achieving silencing or activation of target genes/proteins in the absence of drugs specifically targeting these molecules. If successful, computationally-derived drug cocktails could at least partially reduce the need to develop new drugs, easing the economic burden of discovering and testing new medications. (Each new FDA-approved drug has an estimated price tag of somewhere between 100 million and 1 billion US dollars.)

The system generates hypotheses of the form “cocktail of drugs X1, …, Xn activates gene ESR1”, and each hypothesis is tested experimentally in a triple-negative breast cancer cell line. Either human biologists or the Robot Scientist carries out these experiments.

4 “UChicago” team

Reading (NLP and text-mining; ontologies, corpus-dependent and unsupervised information extraction, logic): Sophia Ananiadou, Junichi Tsujii, Larisa Soldatova, Hoifung Poon, Andrey Rzhetsky, Robert Stevens, James Evans.

Assembling (Models of quality of science, quality of extraction, consistency, statement provenance, Markov Logic, crowdsourcing): Jacob Foster, James Evans, Hoifung Poon, Andrey Rzhetsky.

Explaining (Markov Logic, graphical models, consistency models, kinetic/dynamic consistency models): Hoifung Poon, Jacob Foster, James Evans, Ishanu Chattopadhyay, Andrey Rzhetsky.

AI and Robotics: Hoifung Poon, Kevin P. White, Ross D. King.

Cancer-specific, wet-lab experiments: Ross D. King, Kevin P. White.

In prior work, Ross D. King's laboratory has developed two Robot Scientists, “Adam” and “Eve”, which are among the most advanced existing laboratory automation systems.

5 Conclusion

The approach chosen by the team relies on the assimilation of massive, pre-existing literature (similar to IBM Watson) combined with iterative model updating based on empirical data and newly designed experiments (unlike IBM Watson). The project’s general methodology is not domain-specific, so it is theoretically extensible across scientific domains.

References

[1] Best, J. IBM Watson: The Inside Story Of How The Jeopardy-Winning Supercomputer Was Born, And What It Wants To Do Next - Feature - TechRepublic. TechRepublic. N.p., 2015. Web. 13 May 2015.
[2] Evans J, Rzhetsky A. Machine science. Science. Jul 23 2010;329(5990):399-400.
[3] King RD, Rowland J, Oliver SG, Young M, Aubrey W, Byrne E, Liakata M, Markham M, Pir P, Soldatova LN, Sparkes A, Whelan KE, Clare C. The Automation of Science. Science. 2009;324:85-89.