Hydrologic Predictions using Probabilistic Relational Models Max Metzger Alison O’Connor David F. Boutt Joe Gorman Charles River Analytics. Charles River Analytics University of Massachusetts Charles River Analytics. 625 Mt. Auburn St. 625 Mt. Auburn St. 611 North Pleasant Street 625 Mt. Auburn St. Cambridge, MA 02138 Cambridge, MA 02138 Amherst, MA 01003 Cambridge, MA 02138 Abstract significant role played by terrain, soil, and subsurface factors in the effects of rainfall run-off on terrain. A decision aid capable of automatically evaluating the The US Army faces a significant burden in suitability of emplacement sites would reduce the time planning sustainment operations. Currently, needed for evaluation by logistics planners and improve logistics planners must manually evaluate the quality of sites selected. potential emplacement sites to determine their To reduce the time and the difficulty of logistics site terrain suitability. Sites subject to rainfall-runoff selection we designed and demonstrated a series of responses such as flooding are ill-suited for Terrain Impact Decision Extensions (TIDE) for logistics emplacements, but evaluating the likelihood of planning tools and processes. TIDE performs data-fusion such responses requires significant time and over a variety of terrain and weather data sets, and uses expertise. To reduce the time and to ease the probabilistic relational models (PRMs) to reason with difficulty of logistics site selection we uncertainty to evaluate the suitability of potential logistics demonstrated a series of Terrain Impact Decision sites against a series of expert rules for a variety of Extensions (TIDE) for use in logistics planning emplacement systems. By using PRMs to rank the tools and processes. TIDE performs data-fusion severity of potential rainfall-runoff responses, TIDE was over a variety of terrain and weather data sets able to site determine suitability much faster than by using probabilistic relational models (PRMS), rigorous physical simulation. Additionally, PRMs can providing a high-performance alternative to reason with incomplete data (e.g., a lack of detailed soil physics-based hydrologic models. information), making them useful even when evaluating data-poor regions. 1. INTRODUCTION 1.1 PROBLEM DESCRIPTION Maintaining a constant supply of water and fuel is critical The rainfall-runoff response of landscapes is a to sustaining the US Army’s forces in the field. fundamental problem in the field of hydrology (Singh, (Department of the Army, 2008). To provide access to 1988). The accumulation of water at a particular time- these critical resources, logistics planners must deliver space location on the Earth’s surface (i.e., terrain those resources with minimal failure to establish and ponding) is the result of the confluence of many maintain emplacements (e.g., tanks, fuel lines) capable of climatologic, hydrologic, and physical factors and storing these crucial commodities. Water and fuel parameters. During a liquid precipitation event (e.g., supplies must not be susceptible to disruption, damage, rain), water is transported in three main ways: water can and contamination by water due to rainfall-runoff run-off/on in the form of overland flow; infiltrate into the responses such as flooding, overland flow, and ponding soil and become ground water; or be transferred back into (i.e., the temporary accumulation of surface water). the atmosphere via evapotranspiration. Overland flow can in turn lead to the accumulation of surface water (e.g., Currently, the risks posed by rainfall-runoff responses to flooding), which poses a risk to US Army emplacements. potential emplacement sites are manually evaluated, and require considerable expertise and time. Site evaluation is The terrain assessment model must account for multiple further complicated for areas lacking detailed data that aspects of the area of interest. First, overall climatic describe terrain, soil properties, and subsurface conditions conditions (i.e., arid, semi-arid, humid) have an important (e.g., the presence of aquifers). This occurs due to the influence over the relative distribution of water in the 31 three pathways. The model must account for uncertainty with limited or incomplete data before executing any in weather predictions and climatologic predictions. impact assessment rules. The PRM model developed Second, the model must account for local factors within under the TIDE effort is capable of reasoning with the area of interest that influence the rainfall-runoff incomplete data and inferring data that may be absent. response. There are many such factors, including rainfall Additionally, while our initial model is very simple, intensity and duration, slope of the land, land use and land further work may expand the model to be very complex. cover characteristics (i.e., vegetation and impervious The object-oriented PRM approach is well-suited for such surfaces), soil and air temperature, soil hydraulic complexity. properties, and soil moisture conditions. The data sources The PRM output is used to generate maps showing the for these factors may be incomplete or inaccurate, likelihood for flow accumulation at a given location for a introducing additional uncertainty. certain amount of time. We based our models on the The prediction of where, when, and how long water will Hortonian Infiltration and Runoff/On (HIRO2) model, accumulate on the land surface is reliant on constraining which was originally developed for the USDA (Meng, parameters that describe the above processes and Green, Salas et al., 2008). This model predicts rainfall- conditions. Fortunately, hydrologists have been runoff responses, including runoff channels (in which developing tools to both quantify these factors and surface water flows) and the time until ponding occurs. develop quantitative models for predicting rainfall-runoff The HIRO2 model performs well, but operates at larger response to precipitation events. scales than are useful to emplacement selection, generally being most accurate at scales of hundreds of meters. The These models are often based on solving complex HIRO2 model served as the basis for our model, but was equations that govern the physics of surface and modified to operate at higher levels of fidelity without subsurface water (Abbott, Bathurst, Cunge et al., 1986; significantly compromising performance. Panday & Huyakorn, 2004) or assign statistical values to terrain based on observation (Yoram, 2003). These Bayesian modeling techniques have been used in the field models are not practical for US Army planning because of hydrology for decades (Vicens, Rodriguez-Iturbe, they require complete data sets, are extremely time- Schaake et al., 1975), but the majority of this work has consuming to compute, and do not scale to the levels of different goals than TIDE. Bayesian modelling detail and scope required by US Army logistics planners. approaches generally take existing models that use direct measurements as inputs (e.g., rainfall) and predict specific 2. APPROACH hydrologic response values (e.g., runoff rate, groundwater level). Bayesian techniques are then used to calibrate the Given the potential incompleteness of input parameters models parameters to improve their accuracy (Beven & (including terrain, soil and subsurface data), our approach Binley, 1992; Thiemann, Trosset, Gupta et al., 2001; uses a probability-based method to track the inferences Vrugt, Ter Braak, Clark et al., 2008). made about data through the model. For TIDE to be useful, the system must infer terrain characteristics, soil TIDE differs from past Bayesian hydrologic models in properties, and subsurface conditions from limited data. two fundamental ways. First, our model attempts to While terrain elevation data is available for most of the predict the impact of rainfall-runoff responses, not their world at varying levels of detail, soil data is less precise values. Generally speaking, US Army logistics prevalent. Land use, land cover (e.g., vegetation), as well planners are not concerned about predicting the exact as the soil’s hydrologic properties and moisture amount of surface water that may accumulate, but are conditions are all factors in predicting rainfall-runoff instead primarily concerned about the impact on the response. When this information is not directly available, mission. For example, the difference between 1.2 meters it needs to be estimated or inferred. For example, soil of standing water or 2.4 meters is irrelevant if either properties for a given region within the United States may makes the mission impossible to complete. be well-known and stored in a Geographic Information Second, the TIDE model must perform with reasonable System (GIS) database, but this data may be unknown for accuracy in regions of the world that have little, if any, many rural regions around the world. An exhaustive hydrologic data observations (e.g., hourly flow rates for a geological survey of potential sites within that region is stream) that can be used to train or calibrate a model. not possible given time and personnel constraints. Even Instrumenting and measuring rainfall-runoff responses in when terrain, soil, and subsurface data are present, it may these areas may be too costly, logistically infeasible, or not be at resolutions high enough to be relevant to the dangerous. As a consequence the TIDE model must rely emplacements (e.g., a map with soil data at a resolution of on generally available data (e.g., elevation, land cover, 500m is of limited use when selecting a site for a fuel line weather). less than a meter across). In cases where data describing terrain characteristics, soil properties, and subsurface 2.1 PROBABALISTIC RELATIONAL MODELS conditions are absent, purely rule-based approaches are insufficient, as rules alone are poorly-suited to handling To represent our terrain and hydrologic models in a incomplete data. The system must be capable of reasoning probabilistic form that allows us to determine the 32 suitability of an area of interest, we designed a situation. The relationships that hold between these probabilistic relational model (PRM) (Koller & Pfeffer, instances are captured by the PRM. For example, our 1998; Pfeffer, Koller, Milch et al., 1999; Friedman, PRM contains a class SiteModel that has one attribute, Getoor, Koller et al., 1999). PRMs describe the world in Suitability. The value of Suitability depends on the terms of classes of objects, instances of those classes and instance of the Runoff class’ attribute, relationships between them. Serving as a powerful RankRunoffPotential. To reason over this model, one extension of Bayesian Networks (BNs), PRMs use object- must create instances of both the SiteModel and Runoff oriented semantics that capture attribute, structural, and classes. class uncertainty to overcome computational and storage The flexibility and reusability of PRMs grant us the complexity challenges faced by BNs. ability to reason over millions of locations. For each The design of PRMs has proven to be useful in location, the relevant set of known facts about specific representing a wide range of complex domains that attributes – the land cover type, soil type, rank of slope involve uncertainty and require flexibility and reusability. and rank of flow –must be provided to the instances of In regard to complexity, PRMs capture the logical and classes. As we transition to discuss our PRM in greater relational structure of a domain. For example, PRMs detail, it will become more evident that these four key specify how one attribute influences the value of another features of PRMs – complexity, uncertainty, flexibility, attribute. In our PRM, the value of the attribute, and reusability – are crucial to obtaining successful RankDrainageCapacity, is dependent on the values of results. Bayesian Networks could also apply to this attributes, LandCoverType and SoilType, from two other problem, as the relational structure is fixed for every classes. Therefore, the model uses the values of instance. Nevertheless, the object-oriented representation RankDrainageCapacity’s dependent attributes, of PRMs were quite helpful in designing the model. LandCoverType and SoilType, to infer the value of RankDrainageCapacity. 2.2 PRM EDITOR To handle uncertainty, PRMs use probability distributions We developed a PRM Editor that provides an intuitive encoded in the model to determine values of unknown graphical user interface (GUI) that allows users to create variables. The value of LandCoverType and SoilType for complex PRMs by defining classes of objects, adding a location are retrieved from data sources outside the attributes to those class definitions, creating instances of model and then posted to the model. Therefore, there is the classes, and specifying the relationships between little uncertainty in regard to these two attributes. them. Conversely, the value of RankDrainageCapacity is inferred inside the model using probability distributions. Upon launching the PRM Editor, the user can navigate To overcome the uncertainty involved with this attribute, between three views: the global view, class view, and encoded in the model is a map of possible combinations instance view. While these three views are initially blank, of land cover and soil types to appropriate probability the panels become populated with information and distributions. Relying on the team’s hydrologic expertise, graphical representations of the model. The global view we created initial distributions for each possible pair of allows a user to view the PRM as a whole in a folder land cover and soil types, as well as each land cover and format. Its top-level folder, named after the PRM, can be soil type provided the other attribute was unknown. expanded to display three other folders, enums, classes, Similarly, using domain knowledge, we supplied and instances. The enums and instances folders can distributions for each individual slope ranking, flow further be expanded to show all enumerations and ranking, and drainage ranking assuming that the other two instances of classes in the model. Within the classes attributes were unknown. Given the increased number of folder are additional folders for each class that can be combinations for RankRunoffPotential, to obtain expanded to view the attributes in that class. distributions in the case that two or three attributes were Unlike the global view, the class view displays a known, we multiplied the probabilities of the known graphical representation of the PRM. Each class is attributes for each of the five possible represented by a box labeled with the class name. If RankRunoffPotential values. The distributions for applicable, arrows are automatically drawn between Suitability were much simpler to encode, as only five boxes indicating super and subclasses (parent-child probability distributions that required no further relationships). The instance view also displays boxes that, calculations were necessary. While these initial rather than represent classes, represent the instances of distributions pass face validation, future work is needed to classes in the model. adjust the distributions to meet higher accuracy needs. To begin utilizing these three views, the user has the To support flexibility and reusability, PRMs allow the option to either load an existing model into the GUI or reuse of the same class probability models for all create a new PRM. After loading or initializing the model, instances of a class. New probabilistic models do not have the user can begin building the model by adding classes. to be constructed for each new situation. Instances of When creating a class, the user must specify the name and classes can be configured in any way desired for a given parent class of the new class. In the case of our model, we 33 created six classes, none of which had parents, so this the resolution to assignment, the user must enter the exact field remained blank. value of the attribute or its reference. For example, if the attribute were an integer, the user could indicate that Adding an attribute requires more detail than adding a value was 10. Alternatively, the reference could be set to class. A user must specify the attribute name, type, and an attribute of another class that was also an integer. The resolution. The user has the option to assign single or appropriate resolution for Suitability is dependency. multi as the attribute’s type, as well as choose from a list Therefore, the user must specify the influencer, the of possible types. Possible types contained in this list attribute that Suitability depends on, as well as the include integer, real number, Boolean, type, nothing, as conditions and their respective distributions. The well as all of the classes and enumerations created by the conditions are the possible values of the influencer. Each user. If the attribute is of type enumeration, the user must possible value of the influencer is paired with a CPD have previously defined the appropriate enumeration. For indicating the likelihood of each possible value of the example, in our model, the SiteModel class’ attribute, attribute. Suitability depends on the instance of the Suitability, is of type enumeration. The possible values Runoff class’, RunoffInstance, attribute for Suitability are VeryPoor, Poor, Medium, Well and RankRunoffPotential. RankRunoffPotential has five VeryWell. Therefore, we created an enumeration called possible values – VeryLow, Low, Medium, High, and RankPoor, to represent an attribute with these five VeryHigh. Therefore, Suitability will have five conditions possible values. Before defining an attribute’s resolution, and five distributions that indicate the probability of each the user must create instances of other classes. By of Suitability’s five possible values occurring given the creating instances of classes, attributes in other classes value of RankRunoffPotential. can depend on the attributes of these instances. Figure 1 shows the global and class relationship views after the six Having defined four enumerations, six classes, classes and their six respective instances have been seven attributes, and six instances in our model using the created. PRM Editor, the model was saved to as a .prm file that could be used by the TIDE system. 2.3 HYRDOLOGIC MODEL A PRM consists of a set of class probability models. The final version of our PRM (Figure 2) contains six classes – SiteModel, Runoff, Topography, DrainageCapacity, LandCover, and Soil. Each class has a set of attributes. Attributes are either simple or complex. Simple attributes are random variables that represent direct properties of an object, such as the type of land cover or type of soil, whereas complex attributes represent relationships to other objects. The attributes in our model are all simple. Logical relationships can be described between classes. The lines in Figure 2 represent these relationships. Assuming we have an instance of every class, an instance of LandCover is related to an instance of the DrainageCapacity class by the LandCoverType attribute. Each simple attribute is associated with a set of parents and a CPD. The parents are determined by the attributes that the attribute depends on. Attributes can depend on either other simple attributes of the same object or of related objects. An example of an attribute of an object depending on an attribute of a related object is the dependence of the RankDrainageCapacity on LandCoverType. Figure 1: Global View (left), Class Relationship View Attributes of related objects are specified via attribute slot (right) chains, such as the slot chain LandCoverInstance LandCoverType. This slot chain begins with the object To complete the implementation of the previously representing the land cover of a location, and accesses the discussed attribute, Suitability, its resolution must be simple attribute indicating the type of land cover at this defined. The resolution of an attribute can be assigned as location. The model specifies that the nothing, assignment, or dependency. Upon creating the RankDrainageCapacity attribute of the DrainageCapacity attribute, the default choice is nothing. If the user updates class has this slot chain as a parent. To reiterate, this 34 indicates that the RankDrainageCapacity depends Table 1: SiteModel Class Implementation probabilistically on the LandCoverType. The other slot chain parent of RankDrainageCapacity is class SiteModel = { SoilInstance.SoilType. It is important to emphasize that Suitability: single [RunoffInstance.RankRunoffPotential] RankPoor depends on these parent relationships are general, meaning that the case [VeryLow] => land cover or soil type may vary from scenario to (0.1 -> VeryPoor, 0.15 -> Poor, 0.5 -> Med, 0.15 -> Well, 0.1 -> VeryWell) scenario, but the probabilistic relationships hold for all case [Low] => scenarios (e.g., when a new area is investigated). (0.7 -> VeryPoor, 0.1 -> Poor, 0.1 -> Med, 0.05 -> Well, 0.05 -> VeryWell) case [Med] => (0.05 -> VeryPoor, 0.1 -> Poor, 0.7 -> Med, 0.1 -> Well, 0.05 -> VeryWell) case [High] => (0.05 -> VeryPoor, 0.05 -> Poor, 0.1 -> Med, 0.7 -> Well, 0.1 -> VeryWell) case [VeryHigh] => (0.05 -> VeryPoor, 0.05 -> Poor, 0.1 -> Med, 0.1 -> Well, 0.7 -> VeryWell) case [_] => (0.2 -> VeryPoor, 0.2 -> Poor, 0.2 -> Med, 0.2 -> Well, 0.2 -> VeryWell) } With a clear understanding of how relationships and CPDs are specified in the model, we can discuss how inference ultimately determines if a location is suitable. The basic order of how the model performs inference is: once the values of an attribute’s parents are known, the value of that attribute can be inferred. Therefore, the process begins by posting evidence to the leaf classes. First, the LandCoverType, SoilType, RankSlope, and RankFlow evidence is posted to the model. Next, the Figure 2: Hydrologic PRM model can infer the value of RankDrainageCapacity from the land cover and soil data. For example, if the In our tree-structured PRM, the attributes in the classes LandCoverType is Shrub and the SoilType is Vertisols, directly below another class are parents to the attributes in the probability distribution encoded in the model for the class above them. Therefore, the attributes in the three RankDrainageCapacity given this evidence is: case leaf classes, LandCover, Soil and Topography, do not [Shrub,Vertisols] => (0.7 -> VeryPoor, 0.3 -> Poor, 0.0 -> have parents. The values of these attributes are derived Med, 0.0 -> Well, 0.0 -> VeryWell). Again, this outside the model and posted as evidence to the model. distribution can be interpreted as: If LandCoverType is Conversely, the values of the attributes in the remaining Shrub and SoilType is Vertisols, there is 70% likelihood classes, SiteModel, Runoff, and DrainageCapacity, are the RankDrainageCapacity is VeryPoor and 30% inferred from the data available within the model. likelihood the rank of drainage capacity is Poor. Once the Recall that the other information associated with a simple CPD of RankDrainageCapacity is determined, the attribute is a CPD that specifies a distribution over values distribution can be used in conjunction with the of an attribute given the values of its parents. In the case Topography evidence to infer the value of the of Suitability, its parent is Runoff.RankRunoffPotential. RankRunoffPotential. This process propagates up the Table 1 shows the code for the implementation of the model, as RankRunoffPotential influences the value of SiteModel class, complete with its attribute, Suitability, Suitability. specification of its parent, RankRunoffPotential, and CPD To determine a site’s suitability, the model uses all for every possible value of RankRunoffPotential. The available data. While accuracy increases with amount of bolded line specifies, in plain terms, that if the available data, our model is capable of reasoning with RankRunoffPotential is VeryLow then there is 10% incomplete or no data. In the case that data is unavailable likelihood Suitability is VeryPoor, 15% likelihood or unknown for the four inputs – LandCoverType, Suitability is Poor, 50% likelihood Suitability is Medium, SoilType, RankSlope, and RankFlow – the probability is 15% likelihood Suitability is Well and 10% likelihood evenly distributed over all possible values. Suitability is VeryWell. Our PRM Editor utilizes the open source Figaro Similar to how parent relationships are defined, the probabilistic programming language (PPL) assigned CPD is general; it holds no matter what the (www.cra.com/figaro) to perform inference. PPLs provide specific related objects are. a powerful and flexible way to represent probabilistic models using the power of programming languages. In 35 addition, PPLs offer general-purpose reasoning Table 2: Mapping terrain slope angles to model inputs algorithms for inference and machine learning. Our implementation utilizes the Metropolis-Hastings Angle Range Rank of Slope reasoning algorithm, capped with a runtime of 5,000 milliseconds per inference. ”Ϊ ” Very Low 10 < Ϊ ” Low 2.4 INTEGRATION 20 < Ϊ ” Medium Our PRM used data from the following data sources. 30 < Ϊ ” High 2.4.1 SRTMVF2 Ϊ > 60 Very High The Shuttle Radar Topography Mission (SRTM) was a joint project between NASA and the National Geospatial- Intelligence Agency (NGA) to create high-resolution land Flow surface data for much of the world (roughly 80% of the The elevation dataset is used to predict flow channels – Earth’s land surface is covered). The SRTM Void-Filled 2 that is, paths that surface water is likely to take in the (SRTMVF2) data set is at 1-arc-second (approximately event of rainfall. A greater amount of flow indicates a risk 30-meter) resolution data, with many gaps in data void- of surface water accumulation. To calculate flow, we filled using interpolation techniques (Dowding, Kuuskivi, relied on the TopoToolbox (Schwanghart & Kuhn, 2010). Li et al., 2004). The SRTMVF2 dataset serves as our The toolbox includes techniques for predicting flow primary elevation data source, as our hydrologic model is estimation. The flow values predicted can vary wildly. In heavily dependent on accurate, high-resolution elevation the case of our AOI, estimated flow varied between 0 and data. However, we have identified that there are gaps over 3,300. To normalize the dataset, we first transformed within the SRTMVF2 elevation data. In areas where no the flows to a logarithmic scale (changing the range from SRTMVF2 data can be found, we can fall back to lower 1 to ~9.7) and then normalized the results to [0, 1]. resolution DTED data, including the SRTMVF1 and SRTMVF0 data sets. Elevation is used to determine Table 3: Mapping Flow to Model Inputs inputs to our model, slope and water flow. Slope and flow implicitly capture the spatial relationships of each DTED Flow Range Rank of Flow point with its neighbors, allowing the PRM to reason about each point’s data independently. flow is exactly 0 Very Low Slope 0 < flow ” 0.1 Low Slope is determined using the elevation dataset. For the 0.1 < flow ” Medium initial effort, we used a simple algorithm that iterates across each elevation point. For each point, the relative 0.2 < flow ” High change, dE, in elevation is calculated for each adjacent flow > 0.5 Very High point (excluding diagonally adjacent points.) The dE value with the greatest magnitude is selected, and the distance between points (1 arc-second in the case of the As shown in Table 3, these values are then translated into SRTMVF2 dataset) is used to calculate the angle of the five discrete inputs for the terrain assessment model WHUUDLQ¶VVXUIDFHԦ7KLVYDOXHFDQUDQJHIURPGHJUHHV (same as the slope). As with the slope values, the process (i.e., perfectly flat) to 90 degrees (which would be a of mapping flows to discrete ranking values is perfectly vertical surface.) While there are more elaborate independent from the flow calculations. This means that methods for determining slope that provide more accurate calculating the flows (a process that took roughly two results, this technique can process millions of points in a hours for the Demonstration Scenario’s AOI) need only matter of minutes, and yields sufficient accuracy for the be run once per AOI, even if we adjust model values or needs of the terrain assessment model. how flow values are mapped to model inputs. Once the slope angles have been calculated using the algorithm described above, they are translated from a 2.4.2 GeoCover continuum of [0, 90) to five discrete values, which are Earth Satellite Corporation (EarthSat) developed the used as inputs for the terrain model. Table 2 shows how GeoGover data set, a global landcover database. The angle ranges are mapped to model inputs. GeoCover dataset consists of 13 land cover classes and is available for much of the world (Cunningham, Melican, Wemmelmann et al., 2002). Classes of land cover include grasslands, agriculture areas (i.e., farmland), wetlands, and water/ice. This data will serve as additional inputs to 36 our terrain models so we can more accurately assess Table 4: Mission Rules rainfall-runoff response. The GeoCover dataset will also enable TIDE to identify bodies of water. Condition Effect 2.4.3 Harmonized World Soil Database If the point is a body of water Mission suitability is Very Poor The Harmonized World Soil Database (HWSD) was If slope is ranked as “Low” or Mission suitability is High produced by the European Union’s European Commission “Very Low” and hydrologic Joint Research Centre (more specifically, the Land suitability is “Medium” Management Unit of the Institute for Environment and If slope is ranked as “Low” or Mission suitability is equal to Sustainability.) The HWSD is a 30 arc-second “Very Low” and hydrologic hydrologic suitability (approximately 90-meter) resolution that contains detailed suitability is not “Medium” information about the top soil and subsoil properties. It was created by merging data from four different soil If slope is ranked as “Medium” Mission suitability is Medium databases (Nachtergaele, Van Velthuizen, Verelst et al., and hydrologic suitability is 2008). “High” or “Very High” This data allows the model to more accurately predict If slope is ranked as “Medium” Mission suitability is equal to how terrain will respond to surface water (for example, and hydrologic suitability is not hydrologic suitability how quickly water will be absorbed into the soil.) This “High” or “Very High” dataset’s low resolution means that some terrain boundaries (such as coasts) and geographical features If slope is “High” or “Very Mission suitability is Very Poor (such as bodies of water) are of low accuracy compared to High” the other data sets. 2.5 MISSION DECISION RULES Currently, rules are distinct from the PRM model, so that custom rules can be written for different operational needs The system must provide a set of logistic system-specific while using the same PRM model. For example, while the terrain assessment rules for a variety of systems and PRM output is constant, the mission requirements for a purposes (e.g., Tactical Water Distribution System, long fuel pipeline may be very different than the mission Assault Hose Line System). Terrain suitability may vary requirements for a convoy. The fuel line would have a from system to system—for example, a suspended hose very low tolerance for changes in elevation (as the pumps may be unaffected by some types of standing water while cannot handle the increased workload) and would be a ground-level hose could be at risk for contamination. susceptible to contamination from standing water. The Rule sets for individual systems will need to account for convoy, while still limited by severe terrain or flooding, these differences, allowing planners to choose the would be much more resilient to water and slopes. appropriate system given the characteristics of a prospective emplacement site. Additionally, logistics A standard rules engine and associated rules language, planners must be able to easily modify and expand these such as that provided by JBoss, would allow mission rules as new systems are introduced, and as mission experts to author rules for the TIDE system without requirements change. (For example, different rules would requiring them to understand the PRM or hydrology. be used for route planning than well placement.) 2.6 VISUALIZATION In our initial effort, we have implemented some basic rules that filter terrain suitability for a hypothetical fuel Outputs of the PRM and the rules engine, as well as the line. The fuel line has two requirements: (1) it must be data sources themselves, were rendered within NASA installed on flat land (so the pumps can function WorldWind. WorldWind can accept a variety of GIS data properly); and (2) the fuel lines cannot be placed in formats and is easily customizable. Using the open-source standing water (to prevent contamination), which includes the Geospatial Data Abstraction Library (GDAL), we bodies of water (such as lakes) and areas that are prone to wrote custom modules to render the HWSD and flooding. GeoCover data sets, while the SRTMVF2 data was loaded in using built-in WorldWind methods. Model output and To determine suitability for the fuel lines, we take slope, rules output were rendered as textures which were then land coverage, and hydrologic suitability as inputs. We projected onto the WorldWind globe at the appropriate then apply a set of rules as described in Table 4. The rules coordinates, but the data could easily be written to a transform the hydrologic suitability into mission variety of GIS formats. suitability. These rules favor flat land over sloped land. The hydrologic suitability is presented as belief values in five categories: {Very Poor, Poor, Medium, High, Very High}. Related military impact assessment (e.g., weather impact assessment) is done at three intervals (e.g., low 37 risk, medium risk, and high risk.) We expanded our model the central region of the image – these are riverbeds and to use five intervals instead of three to present additional their surrounding valleys, which were detected despite granularity in the model’s output. Further work is needed those bodies of water not being explicitly present within to determine the best number of intervals and their our GeoCover or HWSD data sets. thresholds. Figure 4 shows the output of our rules engine (and a For the initial effort, the category with the highest belief region slightly larger than the figure above). While these value is selected as the ‘correct’ suitability value. These rules are very simple, they demonstrate how rules can categories are then color-mapped for visualization: {Red, transform the high-density output of the models (Figure 3, Orange, Yellow, Green, Blue}. The same categories and above). The model output scores each point in the colors are used for the rules output. elevation grid (approximately a 30 by 30 meter square when using the SRTMVF2 data set), producing a very 3. DISCUSSION dense output. Rules can be used to simplify the models’ output into easier-to-interpret regions. With these simple For our area of interest and the 1 arc-second SRTMFV2 rules, we were able to execute rules across the entire set, there are 25,934,402 points to process. Executing the region in five minutes. Figure 4 shows the same riverbeds entire PRM for each point would be unnecessarily as in Figure 3, but the view is expanded to show a large complex – instead, we store each unique combination of lake to the west, which has been appropriately flagged as {soil type, land cover, rank slope, rank flow} and store having very poor mission suitability. Unlike the riverbeds the associated beliefs. This means we can simply look up (which were predicted by the PRM), this body of water is the correct PRM output for each unique combination of present within both the GeoCover and HWSD data sets inputs, which need only be run through the PRM once. As (at differing levels of precision). a result, we are able to process all 25.9 million points in only two hours. (Further updates to the Figaro library should increase runtime performance as well.) In a full- scale TIDE system, the PRM values for all combinations could be calculated once and only once, and then stored in a database for quick reference. This database would only need to be updated when the PRM is updated. Figure 4: Rules Output 4. FUTURE WORK 4.1 TERRAIN AND HYDROLOGIC MODELS Future improvements to the model begin by incorporating more data. The more information captured by the model, the more accurate the inferences will be. The next data Figure 3: PRM Model Output source to integrate is precipitation data. Depending on the duration of the mission, weather or climate data would be Figure 3 shows the output of the model. (Figures 3 and 4 used. For example, missions spanning from zero to four are best viewed in color.) The output of the model is very months would heavily rely on weather information, grainy as each point in the elevation set can have a missions spanning from four to eight months would distinct rank. Of note are the red regions running across integrate both weather and climate data, and missions 38 lasting longer than eight months would incorporate sources into the model, the number of attributes and climate data. We will also work to quantitatively evaluate dependencies will increase, resulting in more accurate the performance and applicability of our models. inferences. Existing data can also be used to infer missing data. For example, using higher-resolution data (such as Our approach to testing and verifying the accuracy of our elevation data or land cover data) we can easily determine models is two-fold. First, we will compare the outputs of that the HSWD fails to cover the coastlines. We can then our models to those of existing, alternative hydrologic predict the missing values using spatial relationships. models. These models are often based on solving complex Ambiguous areas could be assigned multiple values with equations that govern the physics of surface and different confidence values. Figure 5 shows how the two subsurface water (Abbott et al., 1986; Panday & HSWD regions could be used to infer the values for the Huyakorn, 2004) or assign statistical values to terrain missing regions. based on observation (Yoram, 2003). These models are not practical for US Army planning because they require Point A, to the north, would be assigned a high complete data sets, are extremely time-consuming to probability of having luvisols as the dominate soil type. compute, and do not scale to the levels of detail and scope Point B would be assigned near equal probabilities of required by US Army logistics planners. However, their being either luvisols or vertisols. Point C, to the south, outputs have been validated when tested on carefully would be assigned a high probability of vertisols as the monitored and measured regions of terrain, typically dominate soil type. The inference used for point B could within the US. By running the TIDE models on the same be assigned to any region near the boundaries of low- regions and comparing its output to that of the established resolution data sets – for example, point D could also be models, we can confirm that the TIDE models are assigned a probability of being either vertisols or luvisols; functioning correctly. even though the data set classifies it as vertisols, the resolution is low enough that the point could be a Second, we will gather existing data sources of rainfall- misclassification. The assigned probabilities, along with runoff responses. Several regions within the United States the soil types themselves, would serve as inputs to the have had their rainfall-runoff responses measured at PRM models.. For example, the soil type input to our various degrees of fidelity. For example, the Leaf River PRMs for Point D could be “{Vertisols-50%, Luvisols- basin in Mississippi has over forty years of time series 50%} instead of simply {Vertisols}. data that includes precipitation and runoff (Yapo, Gupta, Sorooshian et al., 1996). Additional data sources could be built from flood records and high water level records. These data sets will serve to validate the PRM models used by TIDE. They may also serve as training data to calibrate the model to more accurately predict the severity of rainfall run-off responses (e.g., flooding). 4.2 DATA FUSION MODEL Our basic solution for handling cases of limited or missing data assumes that each value is equally likely if no evidence is posted to the model. Under this assumption, the accuracy of our inferences declines with limited or no data. The inferences are only as strong as the Figure 5: Reasoning about incomplete data data known and evidence provided. Future improvements for how to reason with 5. CONCLUSIONS incomplete or no data involve adjusting the prior distributions. Although the prior distributions in our Flooding, and other terrain rainfall-runoff responses, pose current model assume that all values of an attribute are significant risk and cost to US Army operations. equally likely if no data is available, one would argue this Assessing the magnitude of flood risk and the impact it is not representative of the real world. We plan to explore will have on a mission requires both time and expertise the possibilities of more representative prior distributions. that may not always be available. An automated system For example, the prior distribution for land cover type for predicting the likelihood and impact of flooding and could reflect that fact that over 70% of the earth’s surface surface water accumulation would be of great benefit to is covered in water, making it the most likely of the seven logistics planners and the US Army at large. values. During our initial effort, we demonstrated the feasibility This being said, the most dramatic mitigation of of Terrain Impact Decision Extensions to predict rainfall- consequences due to incomplete data or unknown values runoff response. We have identified key data sources will result from future improvements to the model itself required for predicting flooding and have developed an rather than the dependencies. As we integrate more data initial set of models that are capable of identifying regions 39 that are at high risk of flooding. These models are capable Infiltration and Runoff/On. Environmental of processing millions of data points per hour, allowing Modelling & Software, 23, 794-812. them to process thousands of square kilometers. We feel these models and their performance indicate our approach Nachtergaele, F., Van Velthuizen, H., Verelst, L., Batjes, is sound, and future work will refine and validate the N., Dijkshoorn, K., Van Engelen, V., Fischer, G., models’ performance. Jones, A., Montanarella, L., and Petri, M. (2008). Harmonized world soil database. Food and Acknowledgements Agriculture Organization of the United Nations. This work was performed under US Army Research Lab contract number W911QX-13-C-0111. The authors would Panday, S.& Huyakorn, P. S. (2004). A fully coupled like to thank Mr. Peter Grazaitis for his significant physically-based spatially-distributed model for technical support and eager engagement on this project. evaluating surface/subsurface flow. Advances in This work was funded in its entirety by ARL. We would water Resources, 27, 361-382. also like to thank Ms. Yvonne Fuller and Ms. Jill Oliver for their assistance in preparing this paper. Pfeffer, A., Koller, D., Milch, B., and Takusagawa, K. T. (1999). SPOOK: A system for probabilistic References: object-oriented knowledge expression. In 14th Annual Conference on Uncertainty in AI (UAI). Abbott, M. B., Bathurst, J. C., Cunge, J. A., O'Connell, P. E., and Rasmussen, J. (1986). An introduction to Schwanghart, W.& Kuhn, N. J. (2010). TopoToolbox: A the European Hydrological System—Systeme set of Matlab functions for topographic analysis. Hydrologique Europeen,“SHE”, 1: History and Environmental Modelling & Software, 25, 770- philosophy of a physically-based, distributed 781. modelling system. Journal of hydrology, 87, 45- 59. Singh, V. P. (1988). Hydrologic systems. Volume I: Rainfall-runoff modeling. Prentice Hall, Beven, K.& Binley, A. (1992). The future of distributed Englewood Cliffs New Jersey. 1988. 480. models: model calibration and uncertainty prediction. Hydrological processes, 6, 279-298. Thiemann, M., Trosset, M., Gupta, H., and Sorooshian, S. (2001). Bayesian recursive parameter estimation Cunningham, D., Melican, J., Wemmelmann, E., and for hydrologic models. Water Resources Jones, T. (2002). GeoCover LC-A moderate Research, 37, 2521-2535. resolution global land cover database. In ESRI International User Conference. Vicens, G. J., Rodriguez-Iturbe, I., and Schaake, J. C. (1975). A Bayesian framework for the use of Dowding, S., Kuuskivi, T., and Li, X. (2004). Void fill of regional information in hydrology. Water SRTM elevation data–principles, processes and Resources Research, 11, 405-414. performance. In Images to Decisions: Remote Sensing Foundations for GIS Applications, Vrugt, J. A., Ter Braak, C. J., Clark, M. P., Hyman, J. M., ASPRS, Fall Conf., Sep, 12-16. and Robinson, B. A. (2008). Treatment of input uncertainty in hydrologic modeling: Doing Friedman, N., Getoor, L., Koller, D., and Pfeffer, A. hydrology backward with Markov chain Monte (1999). Learning probabilistic relational models. Carlo simulation. Water Resources Research, 44. In Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99). Yapo, P. O., Gupta, H. V., and Sorooshian, S. (1996). Automatic calibration of conceptual rainfall- Koller, D.& Pfeffer, A. (1998). Probabilistic frame-based runoff models: sensitivity to calibration data. systems. In Fifteenth National Conference on Journal of Hydrology, 181, 23-48. Artificial Intelligence (AAAI-98), 580-587. Yoram, R. (2003). Applied stochastic hydrogeology.: Meng, H., Green, T. R., Salas, J. D., and Ahuja, L. R. Oxford University Press. (2008). Development and testing of a terrain- based hydrologic model for spatial Hortonian 40