Organization of Virtual Experiments in Data-Intensive Domains: Hypotheses and Workflow Specification

© Dmitry Kovalev   © Leonid Kalinichenko   © Sergey Stupnikov
Federal Research Center «Computer Science and Control» of the Russian Academy of Sciences, Moscow, Russia
dkovalev@ipiran.ru   lkalinichenko@ipiran.ru   sstupnikov@ipiran.ru

Abstract. Organization and management of virtual experiments in data-intensive research has been widely studied over the past several years. The authors survey existing approaches for dealing with virtual experiments and hypotheses, and analyze virtual experiment management in a real astronomy use case. Requirements for a system organizing virtual experiments in a data-intensive domain have been gathered, and the overall structure and functionality of a system running virtual experiments are presented. The relationships between hypotheses and models in a virtual experiment are discussed. The authors also illustrate how to conceptually model virtual experiments and the respective hypotheses and models in the provided astronomy use case. Potential benefits and drawbacks of the approach are discussed, including maintenance of experiment consistency and shrinkage of the experiment space. Overall, an infrastructure for managing virtual experiments is presented.

Keywords: virtual experiment, hypothesis, conceptual modeling, data intensive domains.

1 Introduction

Data intensive research (DIR) is evolving according to the 4th paradigm of scientific development and reflects the fact that modern science is highly dependent on knowledge extraction from massive datasets [5]. Data intensive research is multidisciplinary in its nature, bringing in many separate principles and techniques to handle complex data analysis and management. Up to 80% of a researcher's time is spent on management of raw and analytical data, including data collection, curation and integration. The remaining part requires knowledge inference from the collected data in order to test proposed hypotheses, gather novel information and correctly integrate it. Although it is the core of scientific work, it takes just 20% of a researcher's time. To overcome that, a new approach for handling multidisciplinary DIR is needed.

Large-scale scientific experiments are highly sophisticated beyond their data processing issues: they include workflows, models and analytical methods. Every implementation of DIR can be treated as a virtual experiment over massive collections of data.

In [7] a survey is presented discussing different approaches to experiment modeling and how its core artifacts, hypotheses, can be specified. The use of a conceptual representation of hypotheses and their corresponding implementation is emphasized, thus leading to the need for proper tools.

The article aims at developing methods and tools to support the execution and conceptual modeling of virtual experiments and at designing an infrastructure to manage them.

The article is structured as follows. In Section 2 related works are discussed. Section 3 explains why the systems from Section 2 are not enough and introduces a real-world use case coming from astronomy. In Section 4 the main notions are defined. In Section 5 the infrastructure and functionality of system components is proposed. Section 6 concludes the article.
2 Related Works

Systems with an explicit representation of hypotheses have been rapidly developed during the last several years [2-4, 6, 10]. The authors analyzed three different systems for executing virtual experiments and hypotheses: Hephaestus, Upsilon-DB and SDI. Some requirements for organizing and managing virtual experiments were extracted during the analysis. Although these platforms provide some important insights into defining and handling hypotheses, they miss some important features. In particular, they do not describe the perception of automatically derived hypotheses by domain experts, do not track their evolution, and do not discuss experiment design principles.

Hephaestus. It is a system for running virtual experiments over existing collections of data. It provides independence from resources, and the system rewrites its queries into data source queries. The system hides the underlying implementation details from the user, letting him work only with the Hephaestus language. The language itself is SQL-like and is used to specify a virtual experiment and the underlying hypotheses.

Hephaestus separates two different classes of hypotheses: top-down and bottom-up. Top-down hypotheses are the ones introduced by the researcher, while bottom-up hypotheses are derived from data. The system supports the discovery of bottom-up hypotheses by looking for correlations in the data. These hypotheses are then ranked by some score (e.g. the p-value of some statistical test), and the ones with the highest score are passed to the researcher. Yet the system does not support automatic finding of causality, which is an important requirement for future work. Hephaestus emphasizes the role of the expert in understanding which relationships should be further studied and which should not be chased. Hephaestus also computes metrics about experiments to estimate significance adequate to abandon further computation. The system is used in testing clinical trials. The system does not catch the evolution of hypotheses or experiments yet.

Upsilon-DB. The system enables a researcher to code and manage deterministic scientific hypotheses as uncertain data. It uses an internal database to form hypotheses as relations and adds an uncertainty parameter. Later, that uncertainty parameter is used to rank hypotheses using Bayes rule. The provided approach can be treated as complementary to the classical statistical approach. The system allows working with two types of uncertainty: theoretical uncertainty, which is brought by competing hypotheses, and empirical uncertainty, which appears because of the alternative datasets used. The system introduces an algorithm to rank hypotheses using observed data. This is done because several competing hypotheses can explain the same observation well, and some score to distinguish them is needed. When new data becomes available, this score can be adjusted accordingly. Hypotheses have a mathematical representation, and the authors provide a method to translate the mathematical representation into relations in the database. The simulations are also treated as data, and the respective relations are put inside the same database as the hypotheses. The authors emphasize the need to support and develop the extraction of hypotheses from data and methods to sample both hypotheses and data. They illustrate that systems such as Eureqa [8] can be used to learn formula representations from data.

The following example is presented in the paper: the authors present three different laws describing free fall together with some simulated data, and they rank the hypotheses accordingly.
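To make the ranking step concrete, below is a minimal Python sketch of the idea (our illustration, not Upsilon-DB code): competing deterministic hypotheses start from uniform priors and are re-weighted by Bayes rule as simulated observations arrive. The three free-fall "laws", the Gaussian noise model and the data points are toy assumptions.

  import math

  def likelihood(predicted, observed, sigma):
      # Gaussian likelihood of one observation given a deterministic prediction.
      return math.exp(-0.5 * ((observed - predicted) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

  def rank_hypotheses(hypotheses, observations, sigma=2.0):
      # hypotheses: name -> prediction function of time; observations: list of (time, value).
      posterior = {name: 1.0 / len(hypotheses) for name in hypotheses}   # uniform priors
      for t, value in observations:
          for name, predict in hypotheses.items():
              posterior[name] *= likelihood(predict(t), value, sigma)
          total = sum(posterior.values())
          posterior = {name: p / total for name, p in posterior.items()}  # Bayes rule (normalization)
      return sorted(posterior.items(), key=lambda item: item[1], reverse=True)

  # Three toy "free fall" laws and simulated (time, distance) observations.
  laws = {
      "quadratic": lambda t: 0.5 * 9.81 * t ** 2,
      "linear": lambda t: 9.81 * t,
      "cubic": lambda t: t ** 3,
  }
  data = [(1.0, 4.8), (2.0, 19.7), (3.0, 44.0)]
  print(rank_hypotheses(laws, data))

With such a score, the ranking can be revised whenever new observations become available, as described above.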
This improvements of the whole process were also made is done because several competing hypothesis can during the lifetime of the model. Also, the BGM authors explain the same observation well and some score to enabled the community to change some parts of the distinguish them is needed. When new data becomes model. available, this score can be adjusted accordingly. Due to the great experience collected by the BGM Hypotheses have mathematical representation and authors in the respective articles and associated code, authors provide method to translate its mathematical now there is a possibility to collect the requirements for representation into relations in database. The simulations the system to supports experiments and provide rationale are also treated as data and respective relations are put to choosing the appropriate methods and adequate inside the same database as hypotheses. Authors techniques for the infrastructure. emphasize the need to support and develop the extraction BGM takes as input hypotheses and their parameters. of hypotheses from data and methods to sample both The examples of such hypotheses are star formation rate hypotheses and data. They illustrate that systems such as (SFR), initial mass function (IMF), density laws, Eureqa [8] can be used to learn formula representation evolutionary tracks and so on [1]. As the model is from data. evolving, new values for hypotheses parameters, even Following example is presented in the paper: authors new parameters have been introduced into the BGM, e.g. present three different laws describing free fall and some for the IMF hypothesis in the last realization there has not simulated data. They rank hypotheses accordingly. only been tests of several new values of the hypothesis, SDI. Platform is used to support scientific but also separation of 2-slope and 3-slope instances of experiments. The system has the ability to integrate open IMF is done. data, reuse observed data and simulation data in the It is very important to explicitly catch the relationship further development of experiments. The system enables between several hypotheses in VE. Hypotheses and their multiple groups of researchers to access data and parameters can be interrelated. For example, stellar experiments simultaneously. Components of the birthrate function is derived from both IMF and SFR framework are developed in such way that they could be functions and local volume density function is based on deployed, adapted and accessed in individual research provided density law. The relationships between projects fast. SDI requires the support of lineage, hypotheses put constraints on the tuning of their provenance, classification, indexing of experiments and parameters – model can quickly become. data, the whole cycle of obtaining data, curating and Parameters of a single hypothesis can be linked to cleaning it, building experiments to test hypotheses over each other directly through equations. There are also massive data, aggregating results is supported over long indirect connections of parameters of several hypotheses, periods of time. The use of semantics is required by the e.g. SFR parameter correlates with the slopes of IMF. system. This implies that one could not give the best solution for a particular variable without correlating it with others. 
Not all model ingredients are allowed to be changed by the user. This is done because if some hypothesis is changed in the model and no further adjustments are made for the dependent hypotheses, the model consistency is broken. Furthermore, the model has the property of being self-consistent, meaning that when input values change, the hypotheses derived by the changed one are, where possible, properly adjusted in order not to break fundamental equations of astronomy. Therefore, the derived_by relationship needs to be modeled. Also, a system component should enable such adjustment and calibration.

Figure 1. BGM hypotheses lattice with the derived_by relationship (nodes: AgeBins, DensityLaw, SFR, AgeVelocityDispersion, IMF, LocalVolumeDensity, StellarBirthRate, EvolutionaryTracks, LocalLuminosity, DynamicalSelfConsistency).

Apart from explicit hypotheses, there are also implicit hypotheses in the model. They are not described in the articles and are tacit. An example of such a hypothesis is that no stars come from outside of the Galaxy. It is important to explicitly store such hypotheses and to understand how to extract them from publications and data sources.

A workflow is used to implement the BGM experiment, specifying when each model that conforms to the related hypotheses should be invoked. The workflow has also evolved since the first version; e.g., for the thin disk treatment new activities dependent on the IMF and SFR hypotheses were introduced. This development can only be tracked using publications.

Some activities in the model structure require the usage of statistical methods, tests and tools, which are applied both to local hypotheses and to the general simulations from the whole experiment.

As the number of experiments is huge due to the increasing size of the family of competing hypotheses, not all of the possible experiments are now run against the whole sky. Studying ways to reduce the number of experiments that give the best fit and to choose when and whether to abandon further computations of an experiment is a major part of the requirements for the new system. Using information from experiment runs done both locally and by other research groups can be helpful in achieving that goal.

Some researchers in data-intensive analysis emphasize the role of error bars. As the data in astronomy is usually provided with errors, BGM uses special methods to work with this type of uncertainty. A component supporting statistical tools that work with error bars is a major requirement for the infrastructure.
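A minimal sketch of the kind of statistical tooling meant here (ours, not the BGM implementation): a chi-square-like distance compares simulated and observed quantities when every observation carries an error bar; smaller values indicate a better fit. The star counts below are made-up numbers.

  def chi_square(simulated, observed, errors):
      # Sum of squared deviations, each weighted by the error bar of the observation.
      if not (len(simulated) == len(observed) == len(errors)):
          raise ValueError("inputs must have equal length")
      return sum(((s - o) / e) ** 2 for s, o, e in zip(simulated, observed, errors))

  # Hypothetical star counts per magnitude bin: model output vs. survey counts with errors.
  model_counts = [120.0, 340.0, 560.0]
  survey_counts = [118.0, 352.0, 549.0]
  survey_errors = [11.0, 19.0, 24.0]
  print(chi_square(model_counts, survey_counts, survey_errors))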
4 Hypotheses and Models in Virtual Experiment

4.1 Main Notions

The extracted information needs to be formally specified. For that, the authors define an additional artifact, the virtual experiment. It is a tuple <O, H, M, R, W, C>, where:

O is a domain ontology. A domain ontology is a set of concepts and relationships in the applied domain, formally specified with some language.

H is a set of hypotheses specifications and relationships between them. H is a part of the ontology and uses concepts from it. Together they form the ontology of the virtual experiment. A hypothesis is a proposed explanation of a phenomenon that still has to be rigorously tested.

M is a set of models. Each model is a set of functions. Every model implements a hypothesis specification. If a model generates the expected behavior of some phenomenon, it is said that the model and the respective hypothesis are supported by observations.

R: H -> M is a mapping from the set of hypotheses into the set of models.

W is a workflow. A workflow is a set of tasks orchestrated by specific constructs (workflow patterns: split, join, etc.). Each task represents a function with a predefined signature, which invokes models from M. The workflow implements the experiment, specifying when each model that conforms to the related hypotheses should be invoked.

C is a configuration for each experiment run. It consists of a total mapping from workflow tasks into sets of function parameter values.

There exist many possible hypotheses representations: mathematical models, Boolean networks, ontologies, predicates in first-order logic, etc. The authors use ontologies to specify hypotheses.

Possible relationships between hypotheses are competes_with, which is used to relate competing hypotheses, and derived_by, which relates two hypotheses one of which was used to derive the other. Derived_by can be used to form a hypotheses lattice [9], an algebraic structure with a partial order relation. Hypotheses derived from a single hypothesis are atomic; otherwise they are complex (see Fig. 1).

A model which implements a hypothesis should conform to the hypothesis specification. If the model generates the expected behavior of some phenomenon, it is said that the model and the respective hypothesis are supported by observations.
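The tuple <O, H, M, R, W, C> can be sketched as plain data structures; the Python below is our illustration only (the field names are not normative) and ignores the ontology language in which O and H would actually be expressed.

  from dataclasses import dataclass, field
  from typing import Callable, Dict, List, Tuple

  @dataclass
  class HypothesisSpec:
      name: str
      description: str = ""
      competes_with: List[str] = field(default_factory=list)  # names of competing hypotheses
      derived_by: List[str] = field(default_factory=list)     # hypotheses this one is derived by

  @dataclass
  class VirtualExperiment:
      ontology: Dict[str, str]                    # O: domain concepts and their definitions
      hypotheses: Dict[str, HypothesisSpec]       # H: hypothesis specifications and relationships
      models: Dict[str, Callable[..., float]]     # M: each model implements a hypothesis
      implements: Dict[str, str]                  # R: hypothesis name -> model name
      workflow: List[Tuple[str, str]]             # W: ordered (task name, invoked model) pairs
      configuration: Dict[str, Dict[str, float]]  # C: task -> parameter values for one run

  # The derived_by links induce the partial order of the hypotheses lattice (cf. Fig. 1).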
4.2 Remarks on methodology

Since hypotheses become the core artifact of a virtual experiment, there is a shift in treating data to successfully manage it. Fig. 2 depicts the process of specifying a virtual experiment.

First, hypotheses are extracted from articles. Usually, it is text or formulas. Sometimes there is a need to provide external hypotheses and substitute existing ones. The next step is to define the mapping between the hypotheses and the models which implement these hypotheses, and to build a workflow specifying the sequence of tasks.

Forming a research lattice is the next step. A virtual experiment needs a configuration and an execution plan. After that, one can launch the virtual experiment.

Figure 2. Methodology to form a virtual experiment (steps: extract hypotheses from articles; define external hypotheses; define mapping between hypotheses and models; define workflow; define experiment configuration; build research lattice; build execution plan; configure statistical tests; use hypotheses cache; run virtual experiment).

4.3 Virtual Experiment Specification

A conceptual schema to define a virtual experiment is provided. It is written in a simplified OWL functional syntax (the Declaration keyword is omitted; property, domain, and range declarations are combined). A virtual experiment (the VirtualExperiment class) has an associated set of hypotheses, a single workflow, observed_data against which the experiment will run, and a probability, which describes how well the underlying model suits the observed data. The closer the probability is to 1, the better the underlying model simulates the phenomenon.

Ontology(
  Class(VirtualExperiment)
  ObjectProperty(Hypothesis domain(VirtualExperiment) range(Hypothesis))
  ObjectMinCardinality(1 Hypothesis VirtualExperiment)
  DataProperty(workflow domain(VirtualExperiment) range(xsd:anyURI))
  DataMinCardinality(1 workflow VirtualExperiment)
  DataProperty(mediator domain(VirtualExperiment) range(xsd:anyURI))
  DataMinCardinality(1 mediator VirtualExperiment)
  DataProperty(probability domain(VirtualExperiment) range(xsd:float))
  DataExactCardinality(1 probability VirtualExperiment))

A hypothesis is specified in the same ontology as the virtual experiment. Every hypothesis has a name, a description, author(s) and associated articles. It also has a model associated with it. Following [4], an associated probability of the hypothesis is introduced. Several hypotheses explaining one and the same phenomenon are called competing. Also, a hypothesis can be derived by some other hypothesis. The hypotheses lattice is formed with the derived_by relationship on the hypotheses space.

Class(Hypothesis)
DataProperty(probability domain(Hypothesis) range(xsd:float))
DataExactCardinality(1 probability Hypothesis)
DataProperty(name domain(Hypothesis) range(xsd:string))
DataExactCardinality(1 name Hypothesis)
DataProperty(description domain(Hypothesis) range(xsd:string))
DataMinCardinality(1 description Hypothesis)
DataProperty(author domain(Hypothesis) range(xsd:string))
DataMinCardinality(1 author Hypothesis)
DataProperty(article domain(Hypothesis) range(xsd:anyURI))
DataMinCardinality(1 article Hypothesis)
DataProperty(model domain(Hypothesis) range(xsd:anyURI))
DataExactCardinality(1 model Hypothesis)
Class(HypothesisMetaClass)
ClassAssertion(HypothesisMetaClass Hypothesis)
ObjectProperty(competes domain(HypothesisMetaClass) range(HypothesisMetaClass))
ObjectProperty(derivedBy domain(HypothesisMetaClass) range(HypothesisMetaClass))
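As a small illustration of how the cardinalities above could be checked outside an OWL reasoner, the Python sketch below (ours) validates a hypothesis individual represented as a plain dictionary; the RobinSFR values used here (probability, model URI, article) are made up for the example.

  EXACTLY_ONE = ["probability", "name", "model"]
  AT_LEAST_ONE = ["description", "author", "article"]

  def validate_hypothesis(individual):
      # Return a list of violated constraints; an empty list means the individual conforms.
      problems = []
      for prop in EXACTLY_ONE:
          count = len(individual.get(prop, []))
          if count != 1:
              problems.append(f"{prop}: expected exactly 1 value, got {count}")
      for prop in AT_LEAST_ONE:
          if len(individual.get(prop, [])) < 1:
              problems.append(f"{prop}: expected at least 1 value")
      return problems

  robin_sfr = {
      "name": ["RobinSFR"], "probability": [0.8], "model": ["urn:example:sfr-exponential"],
      "description": ["Exponential star formation rate"], "author": ["Robin"],
      "article": ["arXiv:1402.3257"],
  }
  print(validate_hypothesis(robin_sfr))  # -> []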
4.4 Hypotheses Specification

Examples of hypotheses and their relationships come from the Besancon Galaxy Model (BGM). For the sake of clarity, not all hypotheses in BGM are specified. All of the BGM hypotheses are treated as subclasses of the Hypothesis class.

The Initial Mass Function (IMF) is the mass distribution of a given population of stars and is represented by a standard power law. Due to the construction of the hypothesis in BGM, the IMF has a mathematical representation as a piecewise function with 2 or 3 pieces (slopes) for which it is defined over mass regions. As there are just two possible sizes of the piecewise function, we put this into two disjoint subclasses. There are restrictions on the available mass to solar mass ratio. For the IMF, the authors test 10 different versions of the hypothesis, 4 of them 2-slope functions and 6 of them 3-slope functions. All of the tested hypotheses are competing. An example instance from each subclass is given.

Class(Slope)
DataProperty(alpha domain(Slope) range(xsd:float))
DataProperty(minMass domain(Slope) range(xsd:float))
DataProperty(maxMass domain(Slope) range(xsd:float))
DataExactCardinality(1 alpha Slope)
DataExactCardinality(1 minMass Slope)
DataExactCardinality(1 maxMass Slope)
SubClassOf(IMF Hypothesis)
ObjectProperty(Slopes domain(IMF) range(Slope))
DataProperty(availableMass domain(IMF) range(xsd:float))
DataExactCardinality(1 availableMass IMF)
DataProperty(outputStarMass domain(IMF) range(xsd:float))
DataExactCardinality(1 outputStarMass IMF)
SubClassOf(ThreeSlopeIMF IMF)
ObjectExactCardinality(3 Slopes ThreeSlopeIMF)
SubClassOf(TwoSlopeIMF IMF)
ObjectExactCardinality(2 Slopes TwoSlopeIMF)
DisjointClasses(TwoSlopeIMF ThreeSlopeIMF)
ObjectPropertyAssertion(competes TwoSlopeIMF IMF)
ObjectPropertyAssertion(competes ThreeSlopeIMF IMF)
ClassAssertion(Slope HaywoodSlope1)
DataPropertyAssertion(alpha HaywoodSlope1 "1.7"^^xsd:float)
DataPropertyAssertion(minMass HaywoodSlope1 "0.09"^^xsd:float)
DataPropertyAssertion(maxMass HaywoodSlope1 "1.0"^^xsd:float)
ClassAssertion(Slope HaywoodSlope2)
DataPropertyAssertion(alpha HaywoodSlope2 "2.5"^^xsd:float)
DataPropertyAssertion(minMass HaywoodSlope2 "1.0"^^xsd:float)
DataPropertyAssertion(maxMass HaywoodSlope2 "3.0"^^xsd:float)
ClassAssertion(Slope HaywoodSlope3)
DataPropertyAssertion(alpha HaywoodSlope3 "3.0"^^xsd:float)
DataPropertyAssertion(minMass HaywoodSlope3 "3.0"^^xsd:float)
DataPropertyAssertion(maxMass HaywoodSlope3 "120.0"^^xsd:float)
ClassAssertion(ThreeSlopeIMF HaywoodIMF)
ObjectPropertyAssertion(Slopes HaywoodIMF HaywoodSlope1)
ObjectPropertyAssertion(Slopes HaywoodIMF HaywoodSlope2)
ObjectPropertyAssertion(Slopes HaywoodIMF HaywoodSlope3)

The Star Formation Rate, Ψ(t), represents the total mass of stars born per unit time per unit mass of the Galaxy. The star formation rate has subclasses for representing a constant rate Ψ(t) = C and an exponential function Ψ(t) = exp{-γt}, where γ is a parameter. The authors tested several competing hypotheses: two possible values for gamma (0.12 and 0.25) and one constant value. They can be stated as instances of the respective classes.

SubClassOf(SFR Hypothesis)
DataProperty(time domain(SFR) range(xsd:float))
DataExactCardinality(7 time SFR)
DataProperty(bornStarMass domain(SFR) range(xsd:float))
DataExactCardinality(7 bornStarMass SFR)
SubClassOf(ConstantSFR SFR)
DataProperty(constant domain(ConstantSFR) range(xsd:float))
DataExactCardinality(1 constant ConstantSFR)
SubClassOf(ExponentSFR SFR)
DataProperty(gamma domain(ExponentSFR) range(xsd:float))
DataExactCardinality(1 gamma ExponentSFR)
DisjointClasses(ExponentSFR ConstantSFR)
ObjectPropertyAssertion(competes ConstantSFR SFR)
ObjectPropertyAssertion(competes ExponentSFR SFR)
ClassAssertion(ExponentSFR RobinSFR)
DataPropertyAssertion(gamma RobinSFR "0.12"^^xsd:float)

Apart from the model ingredients, BGM also has implicit hypotheses, which are not marked as ingredients. For example, 1) the thin disk is divided into seven age bins; 2) no stellar population comes from outside of the Galaxy. For the first example we can specify an additional class AgeBins which has exactly seven age bins.

SubClassOf(AgeBins Hypothesis)
DataProperty(ageBin domain(AgeBins) range(xsd:integer))
DataExactCardinality(7 ageBin AgeBins)

It is more difficult to deal with the second one. As a possible solution, an additional hypothesis could later be specified.

The hypotheses lattice is modeled with the derivedBy object property. Some classes can be specified using the EquivalentClasses construction. The hypotheses lattice for BGM was created manually, but later it should be constructed automatically by the system for executing experiments. A part of the hypotheses lattice for BGM is shown in Fig. 1.

ObjectPropertyAssertion(derivedBy SFR AgeBins)
ObjectPropertyAssertion(derivedBy AgeVelocityDispersion AgeBins)
ObjectPropertyAssertion(derivedBy SFR LocalVolumeDensity)
ObjectPropertyAssertion(derivedBy DensityLaw LocalVolumeDensity)

For the IMF class there are relations between the slopes, the outputStarMass and the availableMass. Based on the availableMass, the parameter alpha is chosen and then the outputStarMass is computed: if the availableMass lies inside the respective interval, the corresponding alpha is taken and the outputStarMass is computed. Next, the post-condition for ExponentSFR is written. It says that born stars should have a mass respecting the exponential equation. Other pre- and post-conditions are specified in the same manner.

Document(
 Group(Forall ?IMF ?am ?s ?om ?a ?min ?max (
  And(?IMF[availableMass -> ?am Slopes -> ?s outputStarMass -> ?om]
      ?s[alpha -> ?a minMass -> ?min maxMass -> ?max]) :-
  And(External(pred:numeric-greater-than(?am
        External(func:numeric-multiply(?min con:solMass))))
      External(pred:numeric-less-than(?am
        External(func:numeric-multiply(?max con:solMass)))))
 )))

Forall ?ExponentSFR ?g ?t ?m (
 ?ExponentSFR[gamma -> ?g time -> ?t bornStarMass -> ?m] :-
  And(External(pred:numeric-equal(?m
    External(func:numeric-exponent(
      func:numeric-multiply(func:numeric-multiply("-1.0"^^xsd:float ?t) ?g))))))
)
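To show how the specified instances translate into computations, here is a small Python sketch (ours, not BGM code) that evaluates the 3-slope Haywood IMF from the Slope instances above and the exponential SFR with gamma = 0.12 (RobinSFR); the continuity and normalization constants of the real BGM functions are omitted.

  from math import exp

  # (alpha, minMass, maxMass) of the Haywood slopes specified above, in solar masses.
  HAYWOOD_SLOPES = [(1.7, 0.09, 1.0), (2.5, 1.0, 3.0), (3.0, 3.0, 120.0)]

  def imf(mass, slopes=HAYWOOD_SLOPES):
      # Piecewise power law phi(m) ~ m ** (-alpha) on the slope whose interval contains m.
      for alpha, min_mass, max_mass in slopes:
          if min_mass <= mass < max_mass:
              return mass ** (-alpha)
      raise ValueError("mass %.2f is outside the defined intervals" % mass)

  def sfr(t, gamma=0.12):
      # Exponential star formation rate Psi(t) = exp(-gamma * t), as for RobinSFR.
      return exp(-gamma * t)

  print(imf(0.5), imf(2.0), imf(10.0), sfr(5.0))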
4.5 Workflow Specification

The model of mass determination consists of a local mass normalization and the simulation of the local distribution. These tasks can be further divided into the following ones (a simplified sketch of the iteration described in task 11 is given after the list):

1. getRSVDensity. The relative density is calculated using the ratio of surface to volume density (RSV).
2. getSurfaceDensity. For each thin disk subcomponent the surface density is calculated and then summed. The surface density of each age subcomponent has to be proportional to the intensity of the SFR in its respective age bin.
3. getVolumeDensity. Volume stellar mass densities are calculated and summed. The total volume is checked to fit the observations.
4. adjustSurfaceDensity. If a difference occurs, the surface and volume densities are adjusted and recomputed.
5. getLNSimulations. Provided with specific hypotheses (IMF, SFR, evolutionary tracks and so on), stars and their parameters are simulated in the local neighborhood.
6. getAliveStarsRemnants. Stars are split into alive stars and remnants. Remnants are possible stars for which the age and mass combination was not on the evolutionary tracks.
7. solvePotentialEquation. The Poisson equation is solved with the stellar content of the thin disk as input.
8. constrainPotential. The calculated potential should be constrained by the observed Galactic rotation curve. The central mass and corona parameters are computed in such a way that the potential reproduces the observed rotation curve.
9. calculatePotentialParameters. Based on the calculated potential, the central mass parameters and the corona parameters are computed.
10. solveBoltzmannEquation. The collisionless Boltzmann equation for an isothermal and relaxed stellar population is solved in order not to break the fundamentals of the model.
11. checkDynamicalConsistency. As the equations in 6, 7 and 8 are solved separately, the potential does not satisfy both constraints. These tasks should be run until the changes in the potential and the other parameters are less than 0.01.
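The iteration described in task 11 can be sketched as the following control-flow skeleton (ours); the task bodies are placeholders with made-up arithmetic, and only the loop structure and the 0.01 threshold follow the description above.

  def solve_potential_equation(surface_density):
      return 0.5 * surface_density        # placeholder for solvePotentialEquation

  def constrain_potential(potential):
      return 0.9 * potential + 0.1        # placeholder for constrainPotential

  def solve_boltzmann_equation(potential):
      return 1.0 + 0.5 * potential        # placeholder for solveBoltzmannEquation

  def run_until_consistent(surface_density, tolerance=0.01, max_iterations=50):
      # Repeat the potential-related tasks until the relative change drops below the tolerance.
      potential = 0.0
      for iteration in range(1, max_iterations + 1):
          new_potential = constrain_potential(solve_potential_equation(surface_density))
          surface_density = solve_boltzmann_equation(new_potential)
          change = abs(new_potential - potential) / max(abs(new_potential), 1e-9)
          if change < tolerance:          # checkDynamicalConsistency: consistency reached
              return new_potential, surface_density, iteration
          potential = new_potential
      raise RuntimeError("dynamical consistency not reached")

  print(run_until_consistent(surface_density=1.0))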
The workflow is specified as a RIF-PRD document. The ontology for the virtual experiment and the BGM ontology are imported. The rules in the document are separated into two groups. The first group, with priority 2, is used to define the workflow input and output parameters and variables. This part of the specification declares several hypotheses passed as input parameters and the calculated local surface density for each age bin as output. The getLocalSurfaceDensity task is specified in a group with priority 1. The task gets as input the SFR hypothesis and the total surface density vector (initially a guess) and multiplies the provided values. The task checks whether the Xor of its dependent tasks is done.

Document( Dialect(RIF-PRD)
 Base()
 Import()
 Import()
 Prefix(bgm )
 Prefix(ve )
 Group 2 (
  Do(
   Assert(External(wkfl:parameter-definition(sfr bgm:SFR IN)))
   Assert(External(wkfl:parameter-definition(imf bgm:IMF IN)))
   Assert(External(wkfl:parameter-definition(avd bgm:AgeVelocityDispersion IN)))
   Assert(External(wkfl:parameter-definition(dl bgm:DensityLaw IN)))
   Assert(External(wkfl:parameter-definition(et bgm:EvolutionaryTracks IN)))
   Assert(External(wkfl:variable-definition(lsds List(xsd:float) IN)))
   Assert(External(wkfl:variable-definition(clsds List(xsd:float) OUT)))
   Assert(External(wkfl:variable-value(clsd List())))
  ))
 Group 1 (
  Do(
   Forall ?sfr ?bsm ?lsd ?lsds ?clsd ?clsds such that (
    External(wkfl:variable-value(lsds ?lsds))
    External(wkfl:variable-value(clsds ?clsds))
    External(wkfl:variable-value(sfr ?sfr))
    ?lsd # ?lsds
    ?clsd # ?clsds
    ?sfr[bornStarMass -> ?bsm]
   ) (
    If Or(Not(External(wkfl:end-of-task(getRSVDensity)))
          External(wkfl:end-of-task(adjustSurfaceDensity)))
    Then Do(
     Modify(?clsd -> External(func:numeric-multiply(?bsm ?lsd)))
     Assert(External(wkfl:end-of-task(getSurfaceDensity)))
    )
   )
  ))
)

4.6 Choosing parameters of hypotheses for virtual experiment execution

Since some hypotheses can take quite a few values, the number of possible models can reach thousands. This poses a question about the order of model execution and how to make these executions effective (and not recompute previously obtained, unchanged results). For that we use special structures to cache and store results. The system can put model executions in some order and use the results of previous executions. This can drastically increase the speed of model computation, especially on big amounts of data. To implement this we use properties of hypotheses lattices.
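A minimal sketch (ours) of such a cache: task results are keyed by the parameter values of the hypotheses the task depends on, so that when only one hypothesis changes, the tasks that do not depend on it reuse their stored results. The task-to-hypothesis dependencies below are illustrative, not taken from BGM.

  cache = {}

  def run_task(task_name, depends_on, parameters, compute):
      # depends_on: hypothesis names the task uses; parameters: hypothesis name -> value tuple.
      key = (task_name, tuple(sorted((h, parameters[h]) for h in depends_on)))
      if key not in cache:
          cache[key] = compute(parameters)     # recomputed only for unseen parameter combinations
      return cache[key]

  params = {"IMF": ("3-slope", 1.7, 2.5, 3.0), "SFR": ("exponential", 0.12)}
  run_task("getLNSimulations", ["IMF", "SFR"], params, lambda p: "simulated stars")
  run_task("getRSVDensity", ["SFR"], params, lambda p: "density ratio")

  params["IMF"] = ("2-slope", 1.7, 2.5)        # change only the IMF hypothesis
  run_task("getLNSimulations", ["IMF", "SFR"], params, lambda p: "simulated stars, new IMF")
  run_task("getRSVDensity", ["SFR"], params, lambda p: "density ratio")   # reused from the cache
  print(len(cache))                            # 3 entries: getRSVDensity was not recomputed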
The researcher can run several experiments, finding the probability of each, which can later be queried by other researchers. For example, the following query takes the two experiments whose underlying models best explain the observed data, with a fixed value for the SFR hypothesis and a workflow specified by a URI. Since there can be thousands of possible experiments, there is a need to order them by their probability. As in [3], we do not want the researcher to be buried in thousands of possible models, and we just take the several best ones.

SELECT ?experiment
WHERE {
 ?experiment probability ?probability .
 ?experiment workflow ?workflow .
 ?experiment Hypothesis ?hypothesis .
 ?hypothesis name ?name .
 FILTER(?name = 'RobinSFR' && ?workflow = URI)
}
ORDER BY DESC(?probability)
LIMIT 2

5 Requirements for Infrastructure for Managing Virtual Experiments

In a series of experiment runs it is important to keep track of the evolution of models, hypotheses and experiments, as well as to identify new data sources. Operations to manipulate virtual experiments and their components need to be defined. Next, the system needs to capture dependencies (competes, derived_by) between hypotheses and invariants within a single hypothesis. Correlations between parameters of several hypotheses should also be considered.

Second, the infrastructure should contain components responsible for the automatic extraction of dependencies between hypotheses and between parameters in single and multiple hypotheses. The obtained data is used in deciding which experiments should be abandoned and also in keeping the hypotheses within a single experiment consistent.

Third, one needs components for maintaining experiment consistency and constraining the number of possible experiments, as well as for defining the metric which is used to decide that an experiment explains the phenomena poorly and to abandon further computations. Methods for removing poor experiments based on previous experiment runs are also required. Experiments and hypotheses should stay consistent when the parameters of a hypothesis change.

Since several hypotheses in some experiments could explain some phenomena equally well, and due to errors in the data, the researcher needs to deal with uncertainty and needs methods to rank experiments and competing hypotheses on massive datasets.

While an experiment may change only slightly from a previous experiment run (e.g. one hypothesis parameter changes), the system should store some data about previous executions. Methods for understanding which parts of experiments should be recomputed and which should not have to be developed as well. Defining structures to store the results of previous experiments and to query these results is important. Since there could be thousands of possible experiments, the system should use a method to form a plan to execute experiments in such a way that stored results are mostly reused and no additional recomputations are made.

Some stages will investigate and adopt or reject certain values, such as a velocity hypothesis, and then continue. The design of the paths to be followed is called experimental design, which, as in the scientific method, is the hardest part of the analysis. In principle, as in many systems, Hephaestus could pursue multiple paths in parallel, using some metric to determine when to abandon a path. Some have criticized DeepDive and others for following a single path.

Reducing computational experiments (what we call virtual experiments) means, as mentioned above, using metrics to estimate significance adequate to abandon further computation.
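The abandonment idea can be sketched as follows (our illustration; the experiment names, scores and threshold are made up): a probability-like score is updated as partial results arrive, experiments falling below a threshold are dropped, and only the best remaining ones are continued.

  def prune_experiments(partial_scores, threshold=0.05, keep_best=2):
      # partial_scores: experiment name -> current probability-like score.
      alive = [name for name, score in partial_scores.items() if score >= threshold]
      alive.sort(key=lambda name: partial_scores[name], reverse=True)
      return alive[:keep_best]                 # experiments worth continuing, best first

  scores = {
      "exp-2slopeIMF-constSFR": 0.02,          # below the threshold: abandon further computation
      "exp-3slopeIMF-expSFR-gamma012": 0.61,
      "exp-3slopeIMF-expSFR-gamma025": 0.34,
  }
  print(prune_experiments(scores))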
6 Conclusion

The article aims at developing a new approach to managing virtual experiments. Hypotheses become the core artifacts of that approach. By analyzing existing systems and the use case, requirements are extracted. A formal specification of the determination of the mass model from BGM is presented in OWL syntax.

Further work should concentrate on developing a metasystem for handling hypotheses, models and other metadata in a virtual experiment.

Acknowledgments

This research was partially supported by the Russian Foundation for Basic Research (projects 15-29-06045, 16-07-01028).

References

[1] Czekaj, M. et al.: The Besancon Galaxy Model Renewed I. Constraints on the Local Star Formation History from Tycho Data. arXiv preprint arXiv:1402.3257 (Feb. 2014)
[2] Demchenko, Y. et al.: Addressing Big Data Issues in Scientific Data Infrastructure. Collaboration Technologies and Systems (CTS), 2013 International Conference on, pp. 48-55 (2013)
[3] Duggan, J., Brodie, M.L.: Hephaestus: Data Reuse for Accelerating Scientific Discovery (2015)
[4] Gonçalves, B. et al.: Υ-DB: A System for Data-driven Hypothesis Management and Analytics (Nov. 2014)
[5] Hey, A.J. et al. (eds.): The Fourth Paradigm: Data-Intensive Scientific Discovery. Microsoft Research, Redmond, WA (2009)
[6] Kalinichenko, L. et al.: Rule-based Multi-dialect Infrastructure for Conceptual Problem Solving over Heterogeneous Distributed Information Resources. New Trends in Databases and Information Systems. Springer International Publishing, pp. 61-68 (2013)
[7] Kalinichenko, L.A. et al.: Methods and Tools for Hypothesis-Driven Research Support: A Survey. Informatics and Applications, 9 (1), pp. 28-54 (2015)
[8] Ly, D.L., Lipson, H.: Learning Symbolic Representations of Hybrid Dynamical Systems. J. of Machine Learning Research, 13, pp. 3585-3618 (Dec. 2012)
[9] Porto, F. et al.: A Scientific Hypothesis Conceptual Model. Advances in Conceptual Modeling, pp. 101-110. Springer (2012)
[10] Porto, F., Schulze, B.: Data Management for eScience in Brazil. Concurrency and Computation: Practice and Experience, 25 (16), pp. 2307-2309 (2013)
[11] Robin, A., Crézé, M.: Stellar Populations in the Milky Way: A Synthetic Model. Astronomy and Astrophysics, 157, pp. 71-90 (1986)
[12] Robin, A.C. et al.: A Synthetic View on Structure and Evolution of the Milky Way. Astronomy & Astrophysics, 409 (2), pp. 523-540 (2003)