Adopting Markov Logic Networks for Big Spatial Data and Applications

Ibrahim Sabek
Supervised by: Mohamed F. Mokbel
Department of Computer Science and Engineering, University of Minnesota, MN, USA
{sabek, mokbel}@cs.umn.edu

ABSTRACT
Spatial data has become ubiquitous, e.g., GPS data and medical data, at increasingly large scales. This raises the need for efficient spatial machine learning and analysis solutions to extract useful insights from such data. Meanwhile, Markov Logic Networks (MLN) have emerged as a powerful framework for building usable and scalable machine learning tools. Unfortunately, MLN is ill-equipped for spatial applications because it ignores the distinguished characteristics of spatial data. This paper describes SMLN, the first full-fledged MLN framework with native support for spatial data. SMLN comes with a high-level datalog-like language with spatial constructs, and spatially-equipped grounding, inference, and learning modules. We show the effectiveness of SMLN by illustrating three systems, namely, Sya, TurboReg, and Flash, that are already built using SMLN.

1. INTRODUCTION
Data scientists and developers have been spending significant effort applying machine learning and artificial intelligence methods, e.g., deep learning, to analyze their massive data and turn it into useful insights. However, the expertise and effort needed to deploy these methods have become a major blocking factor in the wide deployment of machine learning applications. As a result, Markov Logic Networks (MLN) [19] were recently introduced to reduce this gap. In particular, MLN combines first-order logic [5] with probabilistic graphical models to efficiently represent statistical learning and inference problems with a few logical rules (e.g., rules with imply and bit-wise AND predicates) instead of thousands of lines of code. With MLN, data scientists and developers do not need to worry about the underlying machine learning machinery. Instead, they focus their efforts on developing the rules that represent their applications. Recently, MLN has been widely adopted as a research vehicle for deploying machine learning in various applications, including knowledge base construction [25], data cleaning [18], and genetic analysis [23], among others.

Meanwhile, in recent years, there has been a proliferation in the amount of spatial data produced by devices such as satellites, space telescopes, and medical devices. Various applications and agencies need to analyze these unprecedented amounts of spatial data. For example, epidemiologists use spatial analysis techniques to track infectious diseases [2], and news reporters use geotagged tweets for event detection and analysis [24]. Unfortunately, the spatial machine learning tasks used in these applications (e.g., spatial regression [9]) have not yet taken advantage of the recent advances of MLN to boost their usability, scalability, and accuracy. Furthermore, MLN-based applications (e.g., knowledge base construction [25]) miss important results and lose accuracy when dealing with spatial data. The main reason is that MLN is oblivious to spatial data: the only way to support spatial data in MLN is to simply ignore its distinguished properties (e.g., spatial relationships among objects) and treat it as non-spatial data. While this works to some extent, it results in sub-par performance.

The goal of our work is to provide the first full-fledged MLN framework with native support for spatial data, called Spatial Markov Logic Networks (SMLN). In particular, SMLN pushes spatial awareness inside the internal data structures and the core learning and inference functionalities of MLN, and hence inside all MLN-based machine learning techniques and applications. SMLN consists of four main modules, namely, language, grounding, inference, and learning. The language module extends the DDlog language [25] with spatial data types and predicates to express spatial semantics when writing rules.
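As a miniature illustration of the weighted-rules idea described above, the sketch below shows how a single weighted first-order rule induces a probability distribution over possible worlds, in the spirit of Markov Logic Networks [19]. This is a hedged toy example, not part of SMLN: the rule, the constants, the friendship evidence, and the weight are all invented for illustration.

```python
import itertools
import math

# Toy MLN with one weighted rule: Smokes(x) AND Friends(x, y) => Smokes(y).
# All names and the weight W below are invented for illustration.
PEOPLE = ["anna", "bob"]
FRIENDS = {("anna", "bob"), ("bob", "anna")}  # fixed evidence
W = 1.5                                       # rule weight

def n_satisfied(smokes):
    """Number of ground instances of the rule satisfied in one world."""
    total = 0
    for x, y in itertools.product(PEOPLE, repeat=2):
        antecedent = smokes[x] and (x, y) in FRIENDS
        total += (not antecedent) or smokes[y]
    return total

# A world is a truth assignment to all Smokes(.) atoms;
# P(world) is proportional to exp(W * n_satisfied(world)).
WORLDS = [dict(zip(PEOPLE, bits))
          for bits in itertools.product([False, True], repeat=len(PEOPLE))]

def prob(person, evidence=None):
    """Marginal P(Smokes(person) = true | evidence) by exhaustive enumeration."""
    evidence = evidence or {}
    ok = [w for w in WORLDS if all(w[k] == v for k, v in evidence.items())]
    den = sum(math.exp(W * n_satisfied(w)) for w in ok)
    num = sum(math.exp(W * n_satisfied(w)) for w in ok if w[person])
    return num / den

print(round(prob("bob"), 3))                  # 0.5 (symmetric prior)
print(round(prob("bob", {"anna": True}), 3))  # 0.818 (evidence raises it)
```

In a real MLN the number of possible worlds is exponential, which is exactly why production systems replace enumeration with grounding and sampling-based inference machinery.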
The grounding module constructs a spatial variation of the factor graph [28], namely, the Spatial Factor Graph, to efficiently represent SMLN graphical models. The inference module provides a novel Gibbs Sampling [30] algorithm that combines the conclique [11] concept from spatial statistics with the query-driven independent Metropolis-Hastings approach [12]. The learning module employs a new distance-based optimization technique based on the gradient descent method [31] to efficiently learn SMLN model parameters.

Users of SMLN can seamlessly build a myriad of scalable spatial applications without worrying about the underlying spatial machine learning and computation details. We show three case studies that use SMLN as a backbone for their computation: Sya [21], a system for spatial probabilistic knowledge base construction; TurboReg [20], a framework for scaling up spatial autologistic regression models; and Flash [22], a framework for scalable spatial data analysis.

This work is partially supported by the National Science Foundation Grants IIS-1525953 and CNS-1512877, USA.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

2. SMLN OVERVIEW
Figure 1: SMLN System Architecture.

Figure 1 provides an overview of the SMLN architecture. Three types of users interact with SMLN, namely, developers, casual users, and administrators. Developers have expertise with MLN and use the high-level language provided by SMLN to create new applications; we briefly review three example applications, Sya [21], TurboReg [20], and Flash [22], in Sections 3, 4, and 5, respectively. Casual users use standard querying or visualization APIs to perform inference and learning queries over the built applications (e.g., what is the probability of a specific event happening?). Administrators monitor the system and tune the inference and learning configurations. SMLN adopts an extensible approach, where it injects spatial awareness inside the four main modules of MLN, namely, language, grounding, inference, and learning. In the rest of this section, we highlight our contributions in each of these four modules.

2.1 The Language Module
SMLN employs a high-level language to help users write on-top applications as a set of rules, saving huge coding effort. Instead of providing a completely new language, SMLN extends DDlog [25], a datalog-like language for defining MLN rules, with spatial data types and predicates that conform to the Open Geospatial Consortium (OGC) standard [16]. Such extensions allow users to express their spatial semantics without re-implementing user-defined functions in each application. For example, SMLN adds four spatial data types, namely, point, rectangle, polygon, and linestring, to the schema declaration of relations in DDlog. Figure 2 shows an example of two schema declarations S1 and S2 with point spatial attributes.

Schema Declaration
S1: County (id bigint, location point, hasLowSanitation bool).
S2: HasEbola? (id bigint, location point).
Inference Rule
R1: HasEbola(C1, L1) => HasEbola(C2, L2) :- County(C1, L1, -), County(C2, L2, S2) [distance(L1, L2) < 2.5, within(liberia_geom, L1), S2 = true].

Figure 2: Example on Spatial Extensions in DDlog.

In addition, SMLN extends the derivation, supervision, and inference rules in DDlog with spatial predicates (e.g., overlaps, within, and distance) and functions (e.g., union and buffer) to efficiently evaluate the relationships between spatial objects. Such predicates and functions can be composed. For example, the inference rule R1 in Figure 2, which captures the effect of Ebola-infected counties on each other, is composed of two spatial predicates, distance and within, that measure the distance between infected counties and check whether they are located in Liberia, respectively. Once an SMLN program is submitted, this module checks the syntactic correctness of the used spatial constructs, compiles the program, and forwards the output to the grounding module.

2.2 The Grounding Module
Grounding is an essential operation in the MLN execution pipeline: it constructs a data structure called the factor graph [28], on which inference and learning operations are later performed. Such a factor graph is efficiently constructed by evaluating the compiled rules from the language module as a sequence of SQL queries (e.g., [25]). To accommodate the newly introduced spatial constructs in rules, e.g., distance, SMLN adapts a scalable in-database grounding technique from [14] to translate and evaluate rules with these constructs as a set of spatial SQL queries (e.g., range queries and spatial joins). The generated queries are then executed through standard spatial database engines, e.g., PostGIS, to produce a spatial variation of the factor graph, namely, the Spatial Factor Graph, that consists of: (1) logical and spatial random variables, and (2) logical and spatial correlations among these variables.

SMLN provides two effective optimizations in the grounding process: (1) it supports creating on-the-fly spatial indices (e.g., R-tree [7]) on relations with spatial attributes, making the evaluation of complex predicates (e.g., overlap) more efficient, and (2) it provides a simple heuristic query optimizer that re-orders the execution of nested spatial queries that come from rules with multiple spatial predicates. Moreover, SMLN provides an abstract database driver that supports defining spatial storage, functions, and query capabilities. This abstract driver can be extended by users to run their spatial database engine of choice inside SMLN.

2.3 The Inference Module
The main objective of the inference module in MLN is to infer the values of the variables in the constructed factor graph and compute their associated probabilities in an efficient and scalable manner [14]. Gibbs Sampling [30] is considered the most widely used inference algorithm in MLN systems, mainly because of its simplicity and efficiency. However, there are two main limitations in using existing Gibbs sampling techniques to infer the values of spatial factor graph variables. First, these techniques infer values that maximize the satisfaction of the logical semantics (e.g., imply) encoded in the factor graph; in the case of a spatial factor graph, these inferred values will be suboptimal because they never consider the spatial correlation between variables. Second, these techniques require a large number of sampling iterations (i.e., slow convergence) to obtain acceptable output when there are spatial correlations among variables, because they perform sequential sampling over the factor graph nodes [11].

To overcome these two limitations, SMLN employs a novel Gibbs Sampling algorithm, namely, Spatial Gibbs Sampling, that can efficiently perform inference on the spatial factor graph coming from the grounding module.
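For intuition about the baseline being improved here, the sketch below runs plain sequential Gibbs sweeps over a toy lattice of binary spatial variables with an attractive pairwise coupling. This is a hedged Ising-style stand-in for a spatial factor graph, not SMLN's actual sampler; the grid size, coupling strength, and seed are invented.

```python
import math
import random

# Hedged sketch: sequential Gibbs sampling over a toy N x N lattice of
# binary spatial variables with an attractive pairwise coupling (an
# Ising-style stand-in for a spatial factor graph; N, COUPLING, and the
# seed are invented for illustration).
N, COUPLING = 8, 0.6
random.seed(7)

def neighbors(i, j):
    """Yield the 4-neighbourhood of cell (i, j), clipped to the grid."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < N and 0 <= j + dj < N:
            yield i + di, j + dj

def gibbs_sweep(grid):
    """Resample every cell, in order, from its conditional distribution
    given the *current* values of its four neighbours."""
    for i in range(N):
        for j in range(N):
            # Signed sum of neighbour states (+1 for True, -1 for False).
            s = sum(1 if grid[a][b] else -1 for a, b in neighbors(i, j))
            p_one = 1.0 / (1.0 + math.exp(-2.0 * COUPLING * s))
            grid[i][j] = random.random() < p_one

grid = [[random.random() < 0.5 for _ in range(N)] for _ in range(N)]
for _ in range(100):
    gibbs_sweep(grid)

# With an attractive coupling, sweeps drive neighbouring cells to agree.
agreement = sum(grid[i][j] == grid[a][b]
                for i in range(N) for j in range(N)
                for a, b in neighbors(i, j))
pairs = sum(1 for i in range(N) for j in range(N) for _ in neighbors(i, j))
print(f"neighbour agreement: {agreement / pairs:.2f}")
```

Each update depends on the current values of the cell's neighbours, which is what makes naive sweeps inherently sequential. Note that cells of the same checkerboard colour share no edges and are conditionally independent given the other colour; that checkerboard is the simplest example of the conclique-style independent partitions that can be sampled in parallel.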
To take the spatial correlations into account, the proposed sampling algorithm adapts a new variation of the query-driven independent Metropolis-Hastings approach [12] that uses the inverse-distance method [13] to spatially weigh the correlations among variables in the spatial factor graph, and hence yields more accurate inferred values.

To alleviate the slow-convergence issue, a straightforward solution is to randomly partition the variables into a set of buckets and then sample these buckets in parallel. Even though this solution finishes the sampling iterations faster than the sequential one, it may not converge to an acceptable solution, as spatially-dependent variables might be sampled in parallel (i.e., independently of each other). This forces the sampler to run additional sampling iterations to converge, and hence incurs a significant latency overhead. As a result, SMLN employs an approach that combines an in-memory spatial partitioning technique, namely, the pyramid index [1], with a well-known spatial statistics concept, namely, concliques [11], to heuristically partition the spatial factor graph into a set of spatially-independent partitions and sample them in parallel. Defining concliques ensures the neighbouring independence between nodes in the same conclique set, and hence these nodes can be efficiently sampled in parallel.

2.4 The Learning Module
In general, the learning module in MLN mainly focuses on optimizing the weights of the correlations defined in the factor graph. However, in the case of a spatial factor graph, the relative distance between the spatial variables participating in any correlation should be considered as well to learn optimal weights. In particular, correlations between spatially close variables should have a higher effect on the learned weights than correlations between distant variables. We refer to this concept as correlation locality. Recently, gradient descent optimization [31] has been widely used to optimize the weights in MLN models that use Gibbs sampling inference (e.g., [25]). However, standard gradient descent optimization techniques fall short in supporting the correlation locality concept. As a result, SMLN introduces a new variation of gradient descent optimization that applies a distance-based weighing technique. Given a correlation c, defined over m spatial variables v_1, v_2, ..., v_m in the spatial factor graph, its weight w_c is optimized as follows:

    w_c = w_c + \frac{m(m-1)/2}{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d(v_i, v_j)} \, \alpha g    (1)

where g is the gradient sign (either 1 or -1), α is the step size, and d(v_i, v_j) is the Euclidean distance between the variables v_i and v_j. Since m(m-1)/2 is the number of variable pairs in c, the multiplier is the reciprocal of the average pairwise distance, so correlations among spatially close variables receive larger gradient steps.

Figure 3: Initial Experiments. (a) Sya Accuracy. (b) TurboReg Latency.

3. SYA SYSTEM
Knowledge base construction (KBC) has been an active area of research over the last two decades, with several system prototypes coming from academia and industry, along with many important applications, e.g., web search, digital libraries, and health care. Most recently, KBC systems have employed the idea of MLN to associate each extracted relation with a probability of how confident the system is that this relation is factual. DeepDive [25], an MLN-based system, has emerged as one of the most popular probabilistic KBC systems, applied in different domains (e.g., law enforcement [10], geology [29], and paleontology [17]).

Unfortunately, DeepDive does not fully utilize the underlying spatial information, which results in less accurate factual scores. This is for two reasons: (1) DeepDive treats any predicate in the knowledge base inference rules as a boolean function, which yields either true or false (i.e., satisfied or not); so, although one can define spatial predicates in DeepDive, internally DeepDive and its inference engine do nothing special for any spatial predicate. (2) DeepDive estimates the factual scores of extracted relations based only on how much support these relations have in the training data; however, when spatial information exists, the relative distance between the entities participating in the extracted relations should be considered as well.

To overcome these limitations in DeepDive, we proposed Sya [21], a spatial MLN-based KBC system built using our SMLN framework. We re-implemented the existing grounding, inference, and learning modules in DeepDive using the corresponding modules in SMLN. We initially evaluated Sya by building two real spatial knowledge bases about: (1) the water quality in Texas [27], namely, GWDB, and (2) the air pollution in New York City [15], namely, NYCCAS. Figure 3(a) shows the quality (i.e., accuracy) of both Sya and DeepDive, measured by the F1-score, when building these two knowledge bases. Sya achieves an improvement of 120% and 27% over DeepDive in GWDB and NYCCAS, respectively.

4. TURBOREG SYSTEM
Autologistic regression [9] is an important statistical tool for predicting spatial phenomena. Unlike standard logistic regression, which assumes that predictions of spatial phenomena over neighbouring locations are completely independent of each other, autologistic regression takes into account the spatial dependence between neighbouring locations while building and running the prediction model (i.e., neighbouring locations tend to systematically affect each other). Myriad applications require the autologistic regression model to be built over large spatial data. However, existing methods for autologistic regression (e.g., see [9]) are prohibitively computationally expensive for large grid data, e.g., fine-grained satellite images and large spatial epidemiology datasets. It can take about a week to infer the model parameters using training data of only a few gigabytes [9].

To solve this issue, we introduced TurboReg [20], a scalable framework for building autologistic models with large numbers of prediction and predictor variables. In TurboReg, we employed the inference and learning modules of SMLN to estimate the autologistic model parameters in an accurate and efficient manner. Basically, TurboReg provides an equivalent first-order logic [5] representation of the dependency relations among neighbours in autologistic models. In particular, TurboReg transforms each neighbouring dependency relation into a predicate with a bitwise-AND operation on all variables involved in this relation. For example, a binary dependency relation between neighbouring variables C1 and C2 is transformed to C1 ∧ C2. This simple logical transformation allows non-experts to express the dependency relations within autologistic models in a simple way, without needing to specify complex details.

We experimentally evaluated TurboReg using a real dataset of the daily distribution of bird species [26], and compared its scalability to a state-of-the-art method, namely, ngspatial [8]. Figure 3(b) shows the running time for both systems while building an autologistic model over grid sizes ranging from 250 to 84k cells. TurboReg achieves at least three orders of magnitude reduction in running time over ngspatial, while preserving the same accuracy in estimating the model parameters.

5. FLASH FRAMEWORK
Just as MLN made it possible for data scientists and developers to overcome the difficulty of deploying machine learning techniques, we aim to use our proposed SMLN framework as a backbone infrastructure to support long-lasting spatial analysis techniques that lack scalability and suffer from deployment difficulty (e.g., [3, 4]).

To that end, we proposed Flash [22], a framework for generic and scalable spatial data analysis. Flash focuses on building a major class of spatial analysis techniques, called spatial probabilistic graphical modelling (SPGM) (e.g., [3, 4, 6]), which uses probability distributions and graphical representations to describe spatial phenomena and make predictions about them. In Flash, we built a generic transformation module that is responsible for generating a set of weighted logical rules representing any SPGM input. The generated rules are then executed normally through the SMLN framework, where the weights of these rules represent the parameters of the original SPGM and are inferred using the SMLN inference and learning modules. Currently, Flash supports transformations for three spatial graphical models: spatial Markov random fields [4], spatial hidden Markov models [6], and spatial Bayesian networks [3].

6. CONCLUSION
In this paper, we introduced SMLN, a framework that injects spatial awareness inside the core functionality of Markov Logic Networks (MLN). SMLN equips the MLN framework with a spatial high-level language, and spatially-optimized grounding, inference, and learning modules. Data scientists and developers can exploit SMLN to build numerous scalable and efficient spatial machine learning and analysis applications. We showed three on-top applications: Sya, a system for spatial probabilistic knowledge base construction; TurboReg, a framework for scaling up spatial autologistic regression models; and Flash, a framework for scalable spatial data analysis.

7. REFERENCES
[1] W. G. Aref and H. Samet. Efficient Processing of Window Queries in the Pyramid Data Structure. In PODS, 1990.
[2] A. Auchincloss et al. A Review of Spatial Methods in Epidemiology. Annual Review of Public Health, 2012.
[3] bnspatial: Spatial Bayesian Networks, 2019. cran.r-project.org/web/packages/bnspatial.
[4] gamlss.spatial: Gaussian Markov Random Fields, 2019. cran.r-project.org/web/packages/gamlss.spatial.
[5] M. Genesereth and N. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, 1987.
[6] P. J. Green and S. Richardson. Hidden Markov Models and Disease Mapping. JASA, 2002.
[7] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Record, pages 47-57, 1984.
[8] J. Hughes. ngspatial: A Package for Fitting the Centered Autologistic and Sparse Spatial Generalized Linear Mixed Models for Areal Data. The R Journal, 2014.
[9] J. Hughes, M. Haran, and P. C. Caragea. Autologistic Models for Binary Data on a Lattice. Environmetrics, 2011.
[10] Human Trafficking: Forbes. http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/.
[11] M. Kaiser et al. Goodness of Fit Tests for a Class of Markov Random Field Models. The Annals of Statistics, 2012.
[12] K. Li et al. In-database Batch and Query-time Inference over Probabilistic Graphical Models using UDA-GIST. VLDB Journal, 26(2):177-201, 2017.
[13] G. Y. Lu and D. W. Wong. An Adaptive Inverse-distance Weighting Spatial Interpolation Technique. Computers & Geosciences, 34(9):1044-1055, Sept. 2008.
[14] F. Niu et al. Tuffy: Scaling Up Statistical Inference in Markov Logic Networks Using an RDBMS. PVLDB, 4(6):373-384, 2011.
[15] NYC OpenData. https://data.cityofnewyork.us/Environment/NYCCAS-Air-Pollution-Rasters/q68s-8qxv.
[16] Open Geospatial Consortium. http://www.opengeospatial.org/.
[17] S. E. Peters, C. Zhang, M. Livny, and C. Ré. A Machine Reading System for Assembling Synthetic Paleontological Databases. PLoS One, 9(12), 2014.
[18] T. Rekatsinas et al. HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB, 10(11):1190-1201, 2017.
[19] M. Richardson and P. M. Domingos. Markov Logic Networks. Machine Learning, 2006.
[20] I. Sabek, M. Musleh, and M. Mokbel. TurboReg: A Framework for Scaling Up Spatial Logistic Regression Models. In SIGSPATIAL, pages 129-138, 2018.
[21] I. Sabek, M. Musleh, and M. F. Mokbel. A Demonstration of Sya: A Spatial Probabilistic Knowledge Base Construction System. In SIGMOD, pages 1689-1692, 2018.
[22] I. Sabek, M. Musleh, and M. F. Mokbel. Flash in Action: Scalable Spatial Data Analysis Using Markov Logic Networks. In VLDB, 2019.
[23] N. A. Sakhanenko and D. J. Galas. Markov Logic Networks in the Analysis of Genetic Data. Journal of Computational Biology, 17(11):1491-1508, Nov. 2010.
[24] J. Sankaranarayanan et al. TwitterStand: News in Tweets. In SIGSPATIAL, 2009.
[25] J. Shin et al. Incremental Knowledge Base Construction Using DeepDive. PVLDB, 8(11):1310-1321, 2015.
[26] B. Sullivan et al. eBird: A Citizen-based Bird Observation Network in the Biological Sciences. Biological Conservation, 2009.
[27] Texas Ground Water Database. http://www.twdb.texas.gov/groundwater/data/.
[28] M. Wick et al. Scalable Probabilistic Databases with Factor Graphs and MCMC. PVLDB, 3(1-2):794-804, 2010.
[29] C. Zhang et al. GeoDeepDive: Statistical Inference Using Familiar Data-processing Languages. In SIGMOD, 2013.
[30] C. Zhang and C. Ré. Towards High-throughput Gibbs Sampling at Scale: A Study Across Storage Managers. In SIGMOD, pages 397-408, 2013.
[31] M. Zinkevich et al. Parallelized Stochastic Gradient Descent. In NIPS, pages 2595-2603, 2010.