=Paper=
{{Paper
|id=Vol-2846/paper20
|storemode=property
|title=Guiding Parameter Estimation of Agent-Based Modeling Through Knowledge-Based Function Approximation
|pdfUrl=https://ceur-ws.org/Vol-2846/paper20.pdf
|volume=Vol-2846
|authors=William Broniec,Sungeun An,Spencer Rugaber,Ashok K. Goel
|dblpUrl=https://dblp.org/rec/conf/aaaiss/BroniecAR021
}}
==Guiding Parameter Estimation of Agent-Based Modeling Through Knowledge-Based Function Approximation==
Guiding Parameter Estimation of Agent-Based Modeling Through Knowledge-Based Function Approximation William Broniec, Sungeun An, Spencer Rugaber and Ashok K. Goel Design Intelligence Laboratory, School of Interactive Computing, Georgia Institute of Technology, 85 Fifth Street NW, Atlanta, GA 30308, USA Abstract Parameter estimation is a common challenge in scientific modeling. However, agent-based modeling offers particular challenges: since the system behavior emerges out of local interactions among agents, many solutions are computationally intensive and do not scale with the number of parameters. The challenge is especially acute in interactive agent-based modeling where the goal is to support humans with little domain expertise. We describe a knowledge-based function approximation technique for the problem of parameter estimation in interactive agent-based modeling. Our method uses domain knowledge to decompose a large parameter search space into smaller and simpler spaces, and ranks the spaces by priority of search, thereby making the problem more tractable. We describe three experiments for validating the technique using the VERA system for interactive agent-based modeling. Keywords 1 Parameter estimation, Agent-based modeling, Genetic algorithms, Optimization, Scientific modeling 1. Introduction A common challenge in science is the generation of a model that can explain a set of observed data. Using a model, scientists can forecast future data and evaluate hypothetical “what-if” scenarios by altering the values of the parameters of the model [6]. AI has developed many methods to (partially) automate the process of scientific modeling [e.g., 7]. Recently, with the rise of ML techniques, symbolic regression has been used to learn both model equations and model parameters from data [25, 26]. Traditionally many scientific models of complex systems were described with differential equations, for example, the Lotka-Volterra [20] equations for modeling predator-prey relationships in ecology, the Kermack-McKendrick [18] model of epidemiology, and the Bass [3] model for innovation diffusion. Over the last generation, agent-based models have become very popular in some scientific disciplines such as ecology, economics, and epidemiology [5, 23]. While differential equation models are deterministic and describe system-level behavior of homogeneous populations, agent-based models are stochastic and describe individual-level interactions among heterogeneous populations [14, 22]. The parameter estimation problem in agent-based modeling is particularly challenging because the system behavior emerges out of interactions among a large number of individuals and thus it is computationally very intensive and scales with the number of parameters. Given the large dimensionality of the problem, optimization techniques such as genetic algorithms (GA) can and have been used in conjunction with agent-based modeling to explore the parameter space and find the best parameter set with respect to the optimization function [10, 19, 28]. However, this is an incomplete solution because GAs themselves can require a very large number of iterations to In A. Martin, K. Hinkelmann, H.-G. Fill, A. Gerber, D. Lenat, R. Stolle, F. van Harmelen (Eds.), Proceedings of the AAAI 2021 Spring Symposium on Combining Machine Learning and Knowledge Engineering (AAAI-MAKE 2021) - Stanford University, Palo Alto, California, USA, March 22-24, 2021. EMAIL: williambroniec@gatech.edu (W. Broniec); sungeun.an@gatech.edu (S. An); spencer@cc.gatech.edu (S. Rugaber); ashok.goel@cc.gatech.edu (A. Goel) ORCID: 0000-0002-0877-7063 (W. Broniec); 0000-0001-7116-9338 (S. An); 0000-0001-7116-9338 (S. Rugaber); 0000-0001-7116-9338 (A. Goel) ©️ 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org) converge and thus are computationally very intensive. Bayesian methods too have been used in conjunction with agent-based modeling so that a priori probabilities of the parameters can help bias the estimation process. However, this method requires prior knowledge about the probability distributions of the components being modeled, which is not always available. For example, specification of the parameters of approximately 1.5 million biological species in Smithsonian Institution’s Encyclopedia of Life (EOL; eol.org) [21] contain little, if any, information about their prior probabilities. The problem of parameter optimization is especially acute in interactive agent-based modeling, for example, when an agent-based simulation platform, such as NetLogo (https://ccl.northwestern.edu/netlogo/), is used for supporting human learning. This is in part because of the limited domain expertise of human learners and partly because of a lack of cognitive strategies to search a complex search space. In fact, recent research suggests that human learners typically struggle with estimating the parameter values of agent-based simulations [2]. In this paper, we describe a knowledge-based method to guide a GA to convergence in agent-based simulations for the parameter estimation problem. The research question associated with this effort is: How can AI techniques be applied to agent-based modeling (ABM) in order to (1) given existing process model m, induce a model m* such that m* better explains observed data? (2) analyze, ground, and provide an understanding of the emergent properties of the simulation? Our method has two phases. First, we develop a categorization of parameters for agent-based simulations, combining and grouping simulation parameters into four types according to their functions in the simulation: start-state, isolated, relationship, or object property. Our method uses this knowledge to decompose the large search space of parameter estimation into smaller and simpler spaces, and ranks the spaces by priority of search, thereby making the problem more tractable. Second, in the resulting smaller and simpler search spaces, our technique uses random variables and polynomial functions that can give a close approximation of the agent-based simulations while being much faster. We have evaluated our knowledge-based function approximation technique on the Virtual Experimentation Research Assistant (VERA; https://vera.cc.gatech.edu/) [1], a free, public online modeling and simulation tool. In this paper, we illustrate the utility of the proposed method on three models in two separate domains (ecology and epidemiology). We have also evaluated our method with a simulation taken from the NetLogo standard library (https://ccl.northwestern.edu/netlogo/) for external validity. 2. Related Work 2.1. Agent-Based Modeling Agent-based modeling (ABM) is a powerful simulation technique that has seen a number of applications in the last few years, including applications to understand complex systems and solve real- world problems [11]. In ABM, a system is modeled as a collection of autonomous individual entities that simulate real systems by interacting with each other within the environment. ABM serves as a “virtual laboratory” where alternative traits for key behaviors can be tested by plugging them into the ABM and testing how well the ABM then reproduces patterns observed in the real system. However, an important drawback of ABM is its time complexity [23]. Interactions between agents will introduce at least polynomial time complexity with regard to the number of agents, and interactions with even higher complexity may also be introduced. Regardless of optimization techniques employed, we necessarily will need to repeatedly make comparisons between the target data and the proposed simulation. 2.2. Optimization in ABM Optimization approaches including genetic algorithms have previously been applied to ABMs to reach global or near-global optima. However, the use of such metaheuristics in the context of ABM brings specific difficulties [9, 10, 19]. First, the computation of the fitness function requires the execution of the interactions among a large number of agents, which implies a high time complexity. Second, although the property of emergence in ABM is powerful, it does not naturally provide an explanation for how the result ties back to the parameters. Instead, understanding of the parameters comes from statistical “sensitivity analysis” that can be used to determine the most important input variables for an output behavior within the model [15]. It is thus necessary to develop strategies to accelerate the convergence of the algorithm and to understand the parameters. In this paper, we describe a knowledge- based approach based on a categorization of the functional roles of the parameters in the simulation to guide the generic algorithm to address these issues. 3. Virtual Laboratory for inquiry-based modeling VERA supports inquiry-based modeling by providing learners the authentic experience of scientific inquiry (e.g., identifying a problem, proposing multiple hypotheses, testing the hypotheses, and rejecting/accepting the hypotheses) through construction, evaluation, and revision of conceptual models. Hypothesis testing is particularly important because then learners can take a more active role in constructing their own understanding in a feedback loop. However, experimenting by running simulations requires mathematical abilities as well as programming skills because a student should understand complex mathematics to write code in the simulation language. VERA empowers students to test their hypotheses irrespective of their mathematical abilities because it can automatically spawn NetLogo simulations from the conceptual models. 3.1 VERA for ecological modeling VERA for ecological modeling (or VERA-Eco) enables users to build a conceptual model by adding biotic or abiotic components and drawing relationships among them on the model canvas. Conceptual models of ecological phenomena in VERA are expressed in the Component-Mechanism-Phenomenon (CMP) language [17, 27] that derive from the Structure-Behavior-Function theory of modeling complex systems [12]. A CMP model consists of components and relationships between components. A component can be one of three types: biotic, abiotic, and habitat. A relationship relates one component to another in a directed manner (e.g., component X consumes component Y). Figure 1 illustrates a CMP model of phosphorus run-off in the Chesapeake Bay; the large oval boxes in the middle depict habitats, in this case, land and shallow water. (The template on the right depicts simulation parameters and their values.) Figure 1: A screenshot of the VERA model editor page. Following our earlier work [16], VERA automatically translate the patterns in the conceptual models into the primitives of agent-based simulation of NetLogo. The running of the simulation enables the user to observe the evolution of the system variables over time and iterate through the generate- evaluate-revise loops. In this way, VERA integrates both qualitative reasoning in the conceptual model and quantitative reasoning in the simulation reasoning on one hand, and explanatory reasoning (conceptual model) and predictive reasoning (simulation) on the other. VERA thus acts as a virtual laboratory for scientific experimentation. The learner begins with a question. She then generates (potentially) multiple hypotheses for answering the question. In the process, the user may consult EOL for inspiration. Next, she elaborates on the hypotheses by constructing a detailed conceptual model. Then the learner asks VERA to spawn a simulation from the conceptual model. VERA provides the learner with templates of simulation parameters. The user sets initial values for the parameters and may again consult EOL for finding the values. VERA now automatically spawns the simulation and displays the results as graphs, for example, a graph indicating the changes in populations of various species over time. The learner may now experiment with different simulation parameters, or revise the conceptual model, or generate an alternative hypothesis. 3.2. VERA for Epidemiological Modeling At the start of the COVID-19 pandemic, VERA Epidemiology (VERA-Epi) was created to support agent-based versions of compartmental epidemiology models [8]. Just as with VERA-Eco, users develop a graphical representation of a model, provide parameter values, and VERA will generate a subsequent agent-based simulation. The model semantics for VERA-Epi are based on the Harel statechart [13]; nodes now represent the states of individual agents, and edges represent likelihoods for those agents to transition between states. 4. Parameter Estimation Method Figure 2 illustrates our method for automated estimation of the values of the simulation parameters in VERA. Since the search space for optimizing an agent-based simulation is large, the method uses parameter categorization to simplify the structure and reduce the computation while preserving its semantics. Then various functions are applied to approximate the agent-based simulation output. After the ABM approximation process, a genetic algorithm is used to solve the combinatorial problem of finding the optimal set of parameter values for different components. Figure 2: Overview of the proposed method for ABM approximation and parameter optimization. 4.1. Function Approximation Given a dataset and an existing model, we want to assist human learners in finding the optimal parameter values that, when used to generate a simulation, yield results closest to that dataset. This can be formalized as an optimization problem where the inputs are the simulation parameters of a model and error is the distance between the simulation output and the initial dataset. Figure 3: Parameter categorization for VERA Ecology 4.1.1. Parameter Categorization First, a distinction needs to be made between object properties and class properties. Object properties are concerned with each agent in the simulation, and their values change each tick of the simulation’s clock based on the agents’ behaviors (e.g., age, location, etc.). On the other hand, class properties are constant values used to set up the simulation (e.g., starting population, lifespan, body mass, etc.) The top row of Figure 3 below shows the original parameters used in the agent-based simulation (e.g., start- state, isolated, relationship parameters) and the derived properties from the original parameters (e.g., object properties), color-coded by their category. Here are the descriptions of each parameter category: • Start-State Parameters: Simulation values that set up the simulation’s starting state, and have no effect after • Isolated Parameters: Parameters describing behaviors that only affect an individual agent and no others • Relationship Parameters: Parameters affecting interactions among different agents • Object Property: Each agent tracks these core values internally This categorization of simulation parameters can be compared earlier work on the use of ontologies for building agent-based simulations. For example, Benjamin, Patki & Mayer (2006) describe an ontology of components of an agent-based simulation [4]. In contrast, our work focuses on the categorization of the functional role of simulation parameters. The "stacked" parameters with pairs of blocks connected shown in the start-state and isolated parameters in Figure 3 mean that these pairs of parameters are treated as a single parameter from the eyes of the simulation. This is primarily driven by the semantics of the user interface. Users may benefit in conceptual understanding from different wording as it applies to different classes of agents, while programmatically these two different parameters serve the same purpose or are integrated into a single value used in the simulation. Measured output values in the simulation are also displayed, and in the case of VERA-Eco this is simply the count of each agent class. Instead of optimizing each parameter individually and calculating them repeatedly in the ABM (e.g., "lifespan" does not have to be calculated over and over), behaviors are simulated using polynomial function approximation. 4.1.2. Random Variables and Polynomial Functions Using the parameter categorization, an approximation of the agent-based simulation output can be derived using random variables to model populations of agents and polynomial functions to model agent behaviors. Using random variable distributions as stand-ins for population groups drastically reduces the number of computations performed and the memory used. Different populations may be more accurately modeled by specific distribution functions, but the normal distribution serves as the best stand-in with an unknown distribution due to the central limit theorem. Therefore, rather than storing biomass for thousands of individual agents, a Gaussian distribution can be represented using two variables, the mean and the variance, to describe the biomass for each age. The same process is applied to represent reproductive interval Gaussians as well. In the case of Vera-Eco, the simulation initialization (e.g., tick 0) assigns each of the starting populations a random age from 0 to max age (i.e., lifespan - 1) and sets the initial biomass value for each population, and the biomass follows a uniform distribution with the mean of initial biomass value and the variance of 0. Each tick of the simulation, the polynomial functions are applied to these populations to skew the distribution. For example, in every simulation tick, a certain amount of biomass is lost from every agent due to its metabolism as determined by its respiratory rate, which will subtract from the mean while the variance does not change. However, when there is a relationship between two populations, such as predation, the corresponding consumption events will increase the average biomass for some predator agents and the reduction of some prey agents. In this case, the Gaussians are recomputed off the changing values, and computation of the next behavior proceeds. 4.2. Optimization To obtain the closest values possible to the target dataset, an optimization algorithm is necessary to test and evaluate different parameter sets. Scientific models based on differential equations can rely on regression analysis to achieve this, but agent-based models typically lack such representations. Heuristic search is needed to explore the space, and due to the highly combinatorial nature of estimating parameters, genetic algorithm was selected. Figure 4 shows a standard genetic algorithm representation. The process begins with a set of individual members of a species which is called a Population. A species is characterized by a set of parameters (also known as genes) that together determine the dynamics of the individuals of the species (also known as a chromosome). Figure 4: A standard genetic algorithm representation and process The population of chromosomes is initialized randomly. Each chromosome is then evaluated using the difference between the simulation approximation and the target dataset as a fitness evaluation. A selection is made among the population of chromosomes based on these scores, and we obtain a new population named parent population. Recombination (also known as crossover) and mutation operators are then applied to this population which yields new sets of parameter values to continue the process. 4.2.1. Fitness Function To evaluate how “fit” the simulation output r is with respect to dataset d, we compare the similarity between the two sets of output data. Multiple methods including simple Euclidean distance can be used, but we used dynamic time warping (DTW), which is a robust, simple, and efficient measure for computing the dissimilarity between two time-series datasets [24]. DTW belongs to the group of so- called elastic dissimilarity measures and works by optimally aligning (or ‘warping’) the time series in the temporal dimension so that the accumulated cost of this alignment is minimal. In its most basic form, this cost can be obtained by dynamic programming, recursively applying: 𝐷𝑖,𝑗 = 𝛿(𝑥𝑖, 𝑦𝑗) + 𝑚𝑖𝑛(𝐷𝑖,𝑗−1, 𝐷𝑖−1,𝑗, 𝐷𝑖−1,𝑗−1) (1) for i = 1,...,M and j = 1,...,N, being M and N the lengths of our two time series (here the dataset and the new parameter set). As we are using distance as a fitness measure, we used negative distance to represent the fitness of the solution (larger fitness measure means better solutions). 5. VERA Ecology Results Using the genetic algorithm to optimize over the combination of random variables and polynomial functions to approximate our ABM, we get results faster by orders of magnitude at the cost of some accuracy. In Figure 5, Graph (a) below shows a synthetic target dataset and simulation output graph of a simple VERA-Eco model, and Graph (b) shows the same simulation with parameter values randomized. This basic simulation consists of sunlight, two different plants, and a species of bug that consumes both of them. While both graphs show roughly the same pattern for the blue, orange, and grey lines, the population shown in yellow varies drastically. In the left graph, the population rises and then falls after a few cycles, whereas in the simulation dataset it collapses immediately. Target Dataset Initial Simulation Trial 1 Trial 2 Bug Tree Kudzu Bug Tree Kudzu Bug Tree Kudzu Bug Tree Kudzu 200 1,000 500 200 1,000 500 200 1000 500 200 1,000 500 145 873 481 145 797 406 210 755 483 147 866 480 1,380 302 337 1,725 112 48 749 283 15,479 992 382 369 1,304 115 12,797 1,479 12 535 1,220 62 13,808 817 190 15,437 1,246 36 13,097 1,099 4 161 4,939 0 4,656 839 87 15,465 6,867 0 1,991 16,694 0 0 9,998 0 933 4,448 1 5,328 18,009 0 13 11,900 0 0 19,067 0 8 10,996 0 408 5,929 0 238 575 0 68 19,992 0 0 12,563 0 340 19,762 0 1 11,655 0 0 20,000 0 0 19,660 0 3 19,999 0 0 3,275 0 1 20,000 0 0 19,997 0 0 20000 20000 20000 20000 10000 10000 10000 10000 0 0 0 0 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 1 3 5 7 9 (a) (b) (c) (d) Figure 5: Results of using our methods on the agent-based simulation of VERA. (a) Left-most graph– Target data d. (b) Middle-left graph–Initial model m. (c) Middle-right graph–Improved model m* on first trial (< 2 minutes). (d) Right-most graph– improved model m* on second trial (= 2 minutes). In each graph: the blue line represents the bug species, the orange line represents sunlight, the grey line represents the tree species, and the yellow line represents the kudzu vine. Sunlight is omitted from the table due to being at a constant value of 5,000 units in all derivations of this simulation. Graphs (c) and (d) show the results of two independent runs of our function approximation methods. Since the mutation, crossover, and selection are stochastic, each run of the simulation yields different results. In the first graph, the kudzu population (indicated as the yellow line) more closely resembles that from the target dataset while the valley in the bug population (indicated as blue lines) was absent due to compounding error in our approximation. In the second run, the bug population resembles the original dataset even closer. 5.1 VERA Epidemiology Results To test the domain generality of our technique, we used the same optimization framework in VERA- Epi that uses an agent-based version of the SIR model of epidemiology, a basic but significant and well- studied model of disease spread that groups a population into three categories – Susceptible (S), Infected (I), and Recovered (R) – and provides equations that describe the rates at which the sizes of these groups change [29]. Traditionally, the parameters of the SIR model are written as beta (β), the disease trans- mission rate, and gamma (γ), the recovery rate. The user interface of VERA-Epi presents the user with a larger set of more detailed parameters in the SIR model, but these are reduced to functionally equivalent parameters. The first step in the optimization process is to classify and group the parameters according to the categorization. Figure 6: Parameter categorization of VERA-Epi For the VERA-Epi SIR model, we used the same categorization as before to group the parameters (see Figure 6). The starting population value can be used as the initial state, and the only object property that needs to be tracked is the health state of the agent (susceptible, infected, or recovered). The average contacts per day per person is combined with the transmission likelihood per contact to generate a likelihood that an agent will become infected (corresponds to beta of the ordinary SIR model). Average recovery time also impacts state by defining the likelihood an individual agent will recover from infection (corresponds with gamma of the original SIR model). As we can see, Average Recovery Time is classified as an isolated parameter while Average Contacts Per Day Per Person and Transmission Likelihood are combined into a single relationship parameter. This is because agents recover on their own, irrespective of any other agents. However, agents will only get sick if they come into contact with other agents. Because the agent’s state in this model is one of several possibilities rather than a numeric value, it cannot be represented by a typical Gaussian distribution. The distribution selected should be that most appropriate for the target simulation, and because the SIR model makes no representation of “partially sick” or “partially recovered”, the simplest solution is to treat each state as simply a separate distribution with zero variance, also known as the Dirac delta distribution. With the parameter space mapped out and the distributions known, this reduction can be plugged into the genetic algorithm method explained above. While the performance gains are not as significant as with VERA-Eco due to this simulation being simpler, it does reduce the time complexity as a function of simulation size. In effect, this reduction closely recreates the original equation form of the SIR model, although still operating on discrete units. 6. External Validity VERA is simply one engine for producing agent-based models. Being able to apply our method to different types of ABMs would increase the external validity of our methods. The “Rabbits, Grass, Weeds” simulation from the NetLogo example library [30] was selected for two main reasons. First, the example was also in the domain of ecology and posited a scenario (rabbits foraging for food) that could be replicated in VERA-Eco but was written with entirely different simulation code. Second, the example possessed only a handful of parameters, providing an example simulation more basic than VERA’s to work with. Figure 7: (a) Left: The "Rabbits, Grass, Weeds" simulation from the NetLogo example library. This screenshot shows the simulation interface in action with variable sliders on the left controlling the different simulation parameters. (b) Right: Parameter map in the "Rabbits, Grass, Weeds" simulation. The "Rabbits, Grass, Weeds" simulation is a simplified model of a predator and prey between the rabbits, grass, and weeds. When a rabbit bumps into some grass or weeds, it eats the grass to gain its energy (see Figure 7). If the rabbit gains enough energy, it reproduces. Otherwise, it dies. This simulation consists of six parameters: starting number, birth threshold, grass growth rate, grass energy, Starting Population Grass Growth Rate 0.6 2 0.5 1.5 0.4 0.3 1 0.2 0.5 0.1 0 0 127 379 631 1 64 127 190 253 316 379 442 505 568 631 694 757 820 883 946 1 64 190 253 316 442 505 568 694 757 820 883 946 Birth Threshold Grass Energy 10 20 8 15 6 10 4 2 5 0 0 1 7 13 19 25 31 37 43 49 55 61 67 73 79 85 91 97 85 1 7 13 19 25 31 37 43 49 55 61 67 73 79 91 97 Figure 8: Results of Sensitivity Analysis of the different Parameters in the "Rabbits, Grass, Weeds" simulation. Blue line–Original. Orange line–Estimation. (a) Upper-left graph: Size of the rabbit population. (b) Upper-right graph: Grass growth rate. (c) Lower-left graph: Birth threshold. (d) Lower- right graph: Grass energy. weeds growth rate, and weeds energy. Each individual rabbit agent has two variable values associated with it–current energy and location. If a rabbit finds some grass, it will consume the grass and gain energy. If the rabbit finds weeds, it will gain no energy. During each tick of the simulation clock, the rabbits expend a fixed amount of energy, and a rabbit that runs out of energy dies, removing it from the simulation. Using the same categorization described in the previous section (see Section 4.1) to break down the simulation parameters, we get the following map as shown in Figure 7 (b). Grass growth rate, grass energy, and birth threshold are combined to describe energy using polynomial functions, and energy and location of each agent are represented as a set of Gaussian distributions. The grass parameters affect the energy Gaussian of the rabbit population. Location is also a Gaussian distribution in the simulation, but no parameters in this simulation control the location. Figure 8 shows four graphs with sensitivity analysis of the different parameters–the blue line being the sensitivity analysis of the actual simulation and the orange line being that of the approximation. The x axis for these four graphs are the attempted parameter values, and the y axis is the difference in distance between the outputs. In other words, it shows how much each parameter affects the simulation results. For example, starting population (a) and grass growth rate (b) have minor, roughly linear impacts on the output whereas birth threshold (c) is a sharp cutoff (e.g., if it is too high, rabbits will die before they have a chance to reproduce), and grass energy (d) has a stair step effect. 7. Conclusion We have described a knowledge-based method for speeding up the use of GAs for optimizing agent- based simulations. Specifically, we described a general categorization for classifying simulation parameters that can be used by other agent-based simulations. This categorization of simulation parameters complements and supplements earlier research on ontologies of components of agent-based simulations. The validity of our method was shown by the application examples across domains using the VERA modeling and simulation platform as well as through an external NetLogo predation model. Overall, our system works well for decomposing and understanding the semantic characteristics of the agent-based simulation parameters with exponentially faster results than optimization over the simulation itself. This affords rapid simulations thereby supporting end users. The primary drawback to our method is error propagation. With one species or a small number of relationships, the simulation is near-exact. With more complex simulations running over longer periods of time, it slowly begins to deviate: some important information may be missing, which can take the simulation into a completely different course. Therefore, our next step is to develop additional strategies to reduce the compounding error in our approximation and to apply the method to more complex examples. Another direction for further work is to conduct a user study to better understand how parameter estimation can facilitate the process of human learning and scientific discovery. Acknowledgements This research is supported in part by an US NSF grant #1636848 (Big Data Spokes: Collaborative: Using Big Data for Environmental Sustainability: Big Data + AI Technology = Accessible, Usable, Useful Knowledge!) and the NSF South Big Data Hub. References [1] S. An, R. Bates, J. Hammock, S. Rugaber, E. Weigel & A. Goel. (2020). Scientific modeling using large scale knowledge. In Procs. Twenty-first International Conference on AI in Education (AIED’2020), pp. 20-24. [2] S. An, S. Rugaber, E. Weigel & A. Goel (2021) Cognitive strategies for navigating high- dimensional parameter spaces in modeling complex systems; submitted for publication. [3] F. Bass. (1969). A new product growth for model consumer durables. Management science, 15(5), 215-227. [4] P. Benjamin, M. Patki & R. Mayer. (2006) Using ontologies for simulation modeling. In Procs. 2006 IEEE Winter Simulation Conference. [5] E. Bonabeau & C. Meyer. (2001) Swarm intelligence: A whole new way to think about business. Harvard Business Review 79(5), 106-115. [6] W. Bridewell, J. Sánchez, P. Langley & D. Billman. (2006). An interactive environment for the modeling and discovery of scientific knowledge. International Journal of Human-Computer Studies, 64(11), 1099-1114. [7] W. Bridewell, P. Langley, L. Todorovski & S. Džeroski. (2008). Inductive process modeling. Machine Learning, 71(1), 1-32. [8] W. Broniec, S. An, S. Rugaber, & A. Goel. (2020). Using VERA to explain the impact of social distancing on the spread of COVID-19. arXiv preprint arXiv:2003.13762. [9] E. Cabrera, M. Taboada, M. Iglesias, F. Epelde & E. Luque. (2011). Optimization of healthcare emergency departments by agent-based simulation. Procedia computer science, 4, 1880-1889. [10] B. Calvez & G. Hutzler. (2005). Automatic tuning of agent-based models using genetic algorithms. In Procs. International Workshop on Multi-Agent Systems and Agent-Based Simulation (pp. 41- 57). Springer, Berlin, Heidelberg. [11] V. Grimm, U. Berger, F. Bastiansen, et al. (2006). A standard protocol for describing individual- based and agent-based models. Ecological modelling, 198(1-2), 115–126. [12] A. Goel, S. Rugaber & S. Vattam. (2009). Structure, Behavior and Function Models of Complex Systems: The Structure-Behavior-Function Modeling Language. AIEDAM 23: 23-35. [13] D. Harel. (1987). Statecharts: A visual formalism for complex systems, Science of computer programming. 231-274. [14] E. Hunter, B. MacNamee & J. Kelleher. (2018). A comparison of agent-based models and equation based models for infectious disease epidemiology. In Procs. AICS (pp. 33-44). [15] B. Iooss & P. Lemaître. (2015). A review on global sensitivity analysis methods. In Uncertainty management in simulation-optimization of complex systems (pp. 101-122). Springer, Boston, MA. [16] D. Joyner, A. Goel & N. Papin. (2014). MILA--S: generation of agent-based simulations from conceptual models of complex systems. In Procs. 19th international conference on intelligent user interfaces (pp. 289-298). [17] D. Joyner, A. Goel, S. Rugaber, C. Hmelo-Silver & R. Jordan. (2011). Evolution of an Integrated Technology for Supporting Learning about Complex Systems: Looking Back, Looking Ahead. In Procs. 11th IEEE International Conference on Advanced Learning Technologies, pp. 257-259. [18] W. Kermack & A. McKendrick. (1927). A contribution to the mathematical theory of epidemics. In Procs. Royal Society of London. Series A, Containing papers of a mathematical and physical character, 115(772), 700-721. [19] J. Lee, T. Filatova, A. Ligmann-Zielinska, et al. (2015). The complexities of agent-based modeling output analysis. The journal of artificial societies and social simulation, 18(4). [20] A. Lotka. (1910). Contribution to the Theory of Periodic Reaction. The Journal of Physical Chemistry, 14, 271-274. [21] C. Parr, M. Wilson, M. Leary et al et al. (2014). The encyclopedia of life v2: providing global access to knowledge about life on earth. Biodiversity Data Journal (2). [22] H. Parunak, R. Savit & R. Riolo. (1998). Agent-based modeling vs. equation-based modeling: A case study and user’s guide. In Procs. Multi-Agent Systems and Agent-Based Simulation, 10-25. [23] S. Railsback & V. Grimm. (2019). Agent-based and individual-based modeling: a practical introduction. Princeton University Press. [24] H. Sakoe & S. Chiba. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE transactions on acoustics, speech, and signal processing, 26(1), 43-49. [25] M. Schmidt & H. Lipson. (2009). Distilling free-form natural laws from experimental data. Science, 324(5923), 81-85. [26] S. Udrescu & M. Tegmark. (2021). AI Feynman: A physics-inspired method for symbolic regression. Science Advances, 6(16), eaay2631. [27] S. Vattam, A. Goel, S. Rugaber et al. (2011). Understanding complex natural systems by articulating Structure-Behavior-Function models. Educational Technology & Society, 14(1): 66- 81. [28] Z. Wang & J. Zhang. (2012). Agent-based modeling and genetic algorithm simulation for the climate game problem. Mathematical Problems in Engineering. [29] H. Weiss. (2013). The SIR model and the foundations of public health. Materials mathematics. 0001-17. [30] U. Wilensky. (2001). NetLogo rabbits grass weeds model. Center for Connected Learning and Computer-Based Modeling, Northwestern University, Evanston, IL. http://ccl.northwestern. edu/netlogo/models/RabbitsGrassWeeds.