=Paper=
{{Paper
|id=Vol-2399/paper02
|storemode=property
|title=Adopting Markov Logic Networks for Big Spatial Data and Applications
|pdfUrl=https://ceur-ws.org/Vol-2399/paper02.pdf
|volume=Vol-2399
|authors=Ibrahim Sabek
|dblpUrl=https://dblp.org/rec/conf/vldb/Sabek19
}}
==Adopting Markov Logic Networks for Big Spatial Data and Applications==
Ibrahim Sabek
Supervised by: Mohamed F. Mokbel
Department of Computer Science and Engineering
University of Minnesota, MN, USA
{sabek, mokbel}@cs.umn.edu
ABSTRACT

Spatial data has become ubiquitous, e.g., GPS data and medical data, with increasingly large sizes. This raises the need for efficient spatial machine learning and analysis solutions to extract useful insights from such data. Meanwhile, Markov Logic Networks (MLN) have emerged as a powerful framework for building usable and scalable machine learning tools. Unfortunately, MLN is ill-equipped for spatial applications because it ignores the distinguishing characteristics of spatial data. This paper describes SMLN, the first full-fledged MLN framework with native support for spatial data. SMLN comes with a high-level datalog-like language with spatial constructs, and spatially-equipped grounding, inference and learning modules. We show the effectiveness of SMLN by illustrating three systems, namely, Sya, TurboReg, and Flash, that are already built using SMLN.

This work is partially supported by the National Science Foundation Grants IIS-1525953 and CNS-1512877, USA.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019, Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

1. INTRODUCTION

Data scientists and developers have been spending significant effort applying machine learning and artificial intelligence methods, e.g., deep learning, to analyze and turn their massive data into useful insights. However, the expertise and effort needed to deploy these methods have become a major blocking factor in the wide deployment of machine learning applications. As a result, Markov Logic Networks (MLN) [19] were introduced to reduce this gap. In particular, MLN combines first-order logic [5] with probabilistic graphical models to efficiently represent statistical learning and inference problems with a few logical rules (e.g., rules with imply and bit-wise AND predicates) instead of thousands of lines of code. With MLN, data scientists and developers do not need to worry about the underlying machine learning machinery. Instead, they can focus their efforts on developing the rules that represent their applications. Recently, MLN has been widely adopted as a research vehicle for deploying machine learning in various applications, including knowledge base construction [25], data cleaning [18], and genetic analysis [23], among others.

Meanwhile, in recent years, there has been a proliferation in the amount of spatial data produced by devices such as satellites, space telescopes, and medical devices. Various applications and agencies need to analyze these unprecedented amounts of spatial data. For example, epidemiologists use spatial analysis techniques to track infectious diseases [2]. News reporters use geotagged tweets for event detection and analysis [24]. Unfortunately, researchers have not taken advantage of the recent advances in Markov Logic Networks (MLN) to boost the usability, scalability, and accuracy of the spatial machine learning tasks (e.g., spatial regression [9]) used in these applications. Furthermore, MLN-based applications (e.g., knowledge base construction [25]) miss important results and lose accuracy when dealing with spatial data. The main reason is that MLN is oblivious to spatial data. The only way to support spatial data in MLN is to simply ignore its distinguishing properties (e.g., spatial relationships among objects) and treat it as non-spatial data. While this works to some extent, it results in sub-par performance.

The goal of our work is to provide the first full-fledged MLN framework with native support for spatial data, called Spatial Markov Logic Networks (SMLN). In particular, SMLN pushes spatial awareness inside the internal data structures and core learning and inference functionalities of MLN, and hence inside all MLN-based machine learning techniques and applications. SMLN consists of four main modules, namely, language, grounding, inference and learning. The language module extends the DDlog language [25] with spatial data types and predicates to express spatial semantics when writing rules. The grounding module constructs a spatial variation of the factor graph [28], namely the Spatial Factor Graph, to efficiently represent SMLN graphical models. The inference module provides a novel Gibbs Sampling [30] algorithm that combines the conclique [11] concept from spatial statistics with the query-driven independent Metropolis-Hastings approach [12]. The learning module employs a new distance-based optimization technique based on the gradient descent method [31] to efficiently learn SMLN model parameters.

Users of SMLN are able to seamlessly build a myriad of scalable spatial applications, without worrying about the underlying spatial machine learning and computation details. We show three case studies that use SMLN as a backbone for their computation. These case studies include
Figure 1: SMLN System Architecture.

Sya [21], a system for spatial probabilistic knowledge base construction, TurboReg [20], a framework for scaling up spatial autologistic regression models, and Flash [22], a framework for scalable spatial data analysis.

2. SMLN OVERVIEW

Figure 1 provides an overview of the SMLN architecture. Three types of users interact with SMLN, namely, developers, casual users, and administrators. Developers have expertise with MLN and use the high-level language provided by SMLN to create new applications. We briefly review three example applications, Sya [21], TurboReg [20], and Flash [22], in Sections 3, 4 and 5, respectively. Casual users can use standard querying or visualization APIs to perform inference and learning queries over the built applications (e.g., what is the probability of a specific event happening?). Administrators can monitor the system and tune the inference and learning configurations. SMLN adopts an extensible approach, where it injects spatial awareness inside the four main modules of MLN, namely, language, grounding, inference, and learning. In the rest of this section, we highlight our contributions in each of these four modules.

2.1 The Language Module

SMLN employs a high-level language to help users write on-top applications as a set of rules, saving huge coding effort. Instead of providing a completely new language, SMLN extends DDlog [25], a datalog-like language for defining MLN rules, with spatial data types and predicates that conform to the Open Geospatial Consortium (OGC) standard [16]. Such extensions allow users to express their spatial semantics without re-implementing user-defined functions in each application. For example, SMLN adds four spatial data types, namely, point, rectangle, polygon, and linestring, to the schema declaration of relations in DDlog. Figure 2 shows an example of two schema declarations S1 and S2 with point spatial attributes. In addition, SMLN extends the derivation, supervision and inference rules in DDlog with spatial predicates (e.g., overlaps, within, and distance) and functions (e.g., union and buffer) to efficiently evaluate the relationships between spatial objects. Such predicates and functions can be composed. For example, the inference rule R1 in Figure 2, which captures the effect of Ebola-infected counties on each other, is composed of two spatial predicates, distance and within, that measure the distance between infected counties and check whether they are located in Liberia, respectively. Once the SMLN program is submitted, this module checks the syntactic correctness of the used spatial constructs, compiles the program, and forwards the output to the grounding module.

Schema Declaration
S1: County (id bigint, location point, hasLowSanitation bool).
S2: HasEbola? (id bigint, location point).
Inference Rule
R1: HasEbola(C1, L1) => HasEbola(C2, L2) :-
    County(C1, L1, -), County(C2, L2, S2)
    [distance(L1, L2) < 2.5, within(liberia_geom, L1), S2 = true].

Figure 2: Example on Spatial Extensions in DDlog.

2.2 The Grounding Module

Grounding is an essential operation in the MLN execution pipeline, where it constructs a data structure called the factor graph [28] that is later used to perform inference and learning operations. Such a factor graph is efficiently constructed by evaluating the compiled rules from the language module as a sequence of SQL queries (e.g., [25]). To accommodate the newly introduced spatial constructs in rules, e.g., distance, SMLN adapts a scalable in-database grounding technique from [14] to translate and evaluate rules with these constructs as a set of spatial SQL queries (e.g., range query and spatial join). The generated queries are then executed through standard spatial database engines, e.g., PostGIS, to produce a spatial variation of the factor graph, namely, the Spatial Factor Graph, that consists of: (1) logical and spatial random variables, and (2) logical and spatial correlations among these variables.

SMLN provides two effective optimizations in the grounding process: (1) It supports creating on-the-fly spatial indices (e.g., R-tree [7]) on relations with spatial attributes, making the evaluation of complex predicates (e.g., overlap) more efficient. (2) It provides a simple heuristic query optimizer that re-orders the execution of nested spatial queries that come from rules with multiple spatial predicates. Moreover, SMLN provides an abstract database driver for defining spatial storage, functions and query capabilities. This abstraction can be extended by users to run their spatial database engine of choice inside SMLN.

2.3 The Inference Module

The main objective of the inference module in MLN is to infer the values of variables in the constructed factor graph and compute their associated probabilities in an efficient and scalable manner [14]. Gibbs Sampling [30] is considered the most widely used inference algorithm in MLN systems, mainly because of its simplicity and efficiency. However, there are two main limitations in using existing Gibbs sampling techniques to infer the values of the spatial factor graph variables. First, these techniques infer values that maximize the satisfaction of the logical semantics (e.g., imply) encoded in the factor graph. In the case of a spatial factor graph, the inferred values will therefore be suboptimal because they never consider the spatial correlation between variables. Second, these techniques require a large number of sampling iterations (i.e., slow convergence) to obtain an acceptable output when there are spatial correlations among variables, because they perform sequential sampling over the factor graph nodes [11].
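To make the baseline concrete, the following is a minimal sketch (our illustration, not SMLN code) of the plain sequential Gibbs pass that the limitations above refer to: one variable is resampled at a time from its conditional given the current neighbour states. The graph encoding, pairwise agreement weights, and all names here are simplifying assumptions for illustration only.

```python
import math
import random

def gibbs_sample(neighbors, weights, n_iters=1000, seed=7):
    """Sequential Gibbs sampling over binary variables in {0, 1}.

    neighbors: dict var -> list of neighbouring vars
    weights:   dict (u, v) sorted pair -> weight favouring u == v
    Returns the estimated marginal P(var = 1) for each variable.
    """
    rng = random.Random(seed)
    state = {v: rng.randint(0, 1) for v in neighbors}
    counts = {v: 0 for v in neighbors}
    for _ in range(n_iters):
        for v in neighbors:  # sequential pass: one variable at a time
            # Score of setting v = 1 given current neighbour states (+/-1 coding).
            score = sum(weights[tuple(sorted((v, u)))] * (2 * state[u] - 1)
                        for u in neighbors[v])
            p1 = 1.0 / (1.0 + math.exp(-2.0 * score))  # logistic conditional
            state[v] = 1 if rng.random() < p1 else 0
            counts[v] += state[v]
    return {v: counts[v] / n_iters for v in neighbors}
```

The sequential inner loop is exactly what makes convergence slow on large graphs: no two variables are ever sampled concurrently, which motivates the conclique-based parallelization described next.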
To overcome the above two limitations, SMLN employs a novel Gibbs Sampling algorithm, namely Spatial Gibbs Sampling, that can efficiently perform inference on the spatial factor graph coming from the grounding module. To take the spatial correlations into account, the proposed sampling algorithm adapts a new variation of the query-driven independent Metropolis-Hastings approach [12] that uses the inverse-distance method [13] to spatially weigh the correlations among variables in the spatial factor graph, and hence yields more accurate inferred values.

To alleviate the slow-convergence issue, a straightforward solution is to randomly partition the variables into a set of buckets and then sample these buckets in parallel. Even though this solution finishes the sampling iterations faster than the sequential one, it may not converge to an acceptable solution, as spatially-dependent variables might run in parallel (i.e., independently of each other). This forces the sampler to run additional sampling iterations to converge, and hence incurs a significant latency overhead. As a result, SMLN employs an approach that combines an in-memory spatial partitioning technique, namely the pyramid index [1], with a well-known spatial statistics concept, namely concliques [11], to heuristically partition the spatial factor graph into a set of spatially-independent partitions and sample them in parallel. Defining concliques ensures the neighbouring independence between nodes in the same conclique set, and hence these nodes can be efficiently sampled in parallel.

2.4 The Learning Module

In general, the learning module in MLN mainly focuses on optimizing the weights of correlations defined in the factor graph. However, in the case of a spatial factor graph, the relative distance between spatial variables participating in any correlation should be considered as well to learn optimal weights. In particular, correlations between spatially close variables should have a higher effect on learned weights than correlations between distant variables. We refer to this concept as correlation locality. Recently, gradient descent optimization [31] has been widely used to optimize the weights in MLN models that use Gibbs sampling inference (e.g., [25]). However, standard gradient descent optimization techniques fall short in supporting the correlation locality concept. As a result, SMLN introduces a new variation of gradient descent optimization that applies a distance-based weighing technique. Given a correlation c, defined over m spatial variables v1, v2, ..., vm in the spatial factor graph, its weight wc is optimized as follows:

    wc = wc + [ m(m-1) / ( 2 · Σ_{i=1}^{m-1} Σ_{j=i+1}^{m} d(vi, vj) ) ] · α · g        (1)

where g is the gradient sign (either 1 or -1), α is the step size, and d(vi, vj) is the Euclidean distance between the variables vi and vj.

3. SYA SYSTEM

Knowledge base construction (KBC) has been an active area of research over the last two decades, with several system prototypes coming from academia and industry, along with many important applications, e.g., web search, digital libraries, and health care. Most recently, KBC systems employed the idea of MLN to associate each extracted relation with a probability of how confident the system is that this relation is factual. DeepDive [25], an MLN-based system, has emerged as one of the most popular probabilistic KBC systems, applied in different domains (e.g., law enforcement [10], geology [29], and paleontology [17]).

Unfortunately, DeepDive does not fully utilize the underlying spatial information, which results in less accurate factual scores. This is because of two reasons: (1) DeepDive treats any predicate in the knowledge base inference rules as a boolean function, which yields either true or false (i.e., satisfied or not). So, although one can define spatial predicates in DeepDive, internally DeepDive and its inference engine do nothing special for any spatial predicate. (2) DeepDive estimates the factual scores of extracted relations based only on how much support these relations have in the training data. However, when spatial information exists, the relative distance between entities participating in the extracted relations should be considered as well.

To overcome the above limitations in DeepDive, we proposed Sya [21], a spatial MLN-based KBC system built using our SMLN framework. We re-implemented the existing grounding, inference and learning modules in DeepDive using the corresponding modules in SMLN. We initially evaluated Sya by building two real spatial knowledge bases about: (1) the water quality in Texas [27], namely, GWDB, and (2) the air pollution in New York City [15], namely, NYCCAS. Figure 3(a) shows the quality (i.e., accuracy) for both Sya and DeepDive, measured by the F1-score, when building these two knowledge bases. Sya has an improvement of 120% and 27% over DeepDive in GWDB and NYCCAS, respectively.

Figure 3: Initial Experiments. (a) Sya Accuracy (F1-score of Sya vs. DeepDive on the GWDB and NYCCAS datasets). (b) TurboReg Latency (running time, log scale, of TurboReg vs. ngspatial over grid sizes from 250 to 84k cells).

4. TURBOREG SYSTEM

Autologistic regression [9] is an important statistical tool for predicting spatial phenomena. Unlike standard logistic regression, which assumes that predictions of spatial phenomena over neighbouring locations are completely independent of each other, autologistic regression takes into account the spatial dependence between neighbouring locations while building and running the prediction model (i.e., neighbouring locations tend to systematically affect each other). Myriad applications require the autologistic regression model to be built over large spatial data. However, existing methods for autologistic regression (e.g., see [9]) are prohibitively computationally expensive for large grid data, e.g., fine-grained satellite images and large spatial epidemiology datasets. It could take about a week to infer the model parameters using training data of only a few gigabytes [9].
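The spatial dependence described above can be made concrete with a minimal sketch (our illustration, not TurboReg's code) of the autologistic conditional probability at one grid cell: a logistic model whose score is augmented with a term over neighbouring outcomes. The sigmoid form and the single dependence parameter eta are one common formulation of the models in [9]; all parameter names are assumptions.

```python
import math

def autologistic_conditional(x_beta, neighbour_values, eta):
    """P(z_i = 1 | neighbours) for a binary autologistic model.

    x_beta:           covariate contribution (x_i . beta) at cell i
    neighbour_values: binary outcomes (0/1) at neighbouring cells
    eta:              spatial-dependence parameter; eta = 0 reduces
                      the model to ordinary logistic regression
    """
    score = x_beta + eta * sum(neighbour_values)
    return 1.0 / (1.0 + math.exp(-score))  # logistic (sigmoid) link
```

With eta = 0 the neighbours are ignored, recovering standard logistic regression; a positive eta pulls the prediction toward agreement with neighbouring cells, which is exactly the dependence that makes exact inference over large grids expensive.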
To solve this issue, we introduced TurboReg [20], a scalable framework for building autologistic models with large numbers of prediction and predictor variables. In TurboReg, we employed the inference and learning modules of SMLN to estimate the autologistic model parameters in an accurate and efficient manner. Basically, TurboReg provides an equivalent first-order logic [5] representation of the dependency relations among neighbours in autologistic models. In particular, TurboReg transforms each neighbouring dependency relation into a predicate with a bitwise-AND operation over all variables involved in this relation. For example, a binary dependency relation between neighbouring variables C1 and C2 is transformed to C1 ∧ C2. This simple logical transformation allows non-experts to express the dependency relations within autologistic models in a simple way without needing to specify complex details.

We experimentally evaluated TurboReg using a real dataset of the daily distribution of bird species [26], and compared its scalability to a state-of-the-art method, namely ngspatial [8]. Figure 3(b) shows the running time for both systems while building an autologistic model over grid sizes ranging from 250 to 84k cells. TurboReg achieves at least three orders of magnitude reduction in running time over ngspatial, while preserving the same accuracy in estimating the model parameters.

5. FLASH FRAMEWORK

Just as MLN made it possible for data scientists and developers to overcome the difficulty of deploying machine learning techniques, we aim to use our proposed SMLN framework as a backbone infrastructure to support long-lasting spatial analysis techniques that lack scalability and suffer from difficulty of deployment (e.g., [3, 4]).

To that end, we proposed Flash [22], a framework for generic and scalable spatial data analysis. Flash focuses on building a major class of spatial analysis techniques, called spatial probabilistic graphical modelling (SPGM) (e.g., [3, 4, 6]), which uses probability distributions and graphical representations to describe spatial phenomena and make predictions about them. In Flash, we built a generic transformation module that is responsible for generating a set of weighted logical rules representing any SPGM input. The generated rules are then executed normally through the SMLN framework, where the weights of these rules represent the parameters of the original SPGM and are inferred using the SMLN inference and learning modules. Currently, Flash supports transformation for three spatial graphical models: spatial Markov random fields [4], spatial hidden Markov models [6] and spatial Bayesian networks [3].

6. CONCLUSION

In this paper, we introduce SMLN, a framework that injects spatial awareness inside the core functionality of Markov Logic Networks (MLN). SMLN equips the MLN framework with a spatial high-level language, and spatially-optimized grounding, inference and learning modules. Data scientists and developers can exploit SMLN to build numerous scalable and efficient spatial machine learning and analysis applications. We showed three on-top applications: Sya, a system for spatial probabilistic knowledge base construction, TurboReg, a framework for scaling up spatial autologistic regression models, and Flash, a framework for scalable spatial data analysis.

7. REFERENCES

[1] W. G. Aref and H. Samet. Efficient Processing of Window Queries in the Pyramid Data Structure. In PODS, 1990.
[2] A. Auchincloss et al. A Review of Spatial Methods in Epidemiology. Annual Review of Public Health, 2012.
[3] bnspatial: Spatial Bayesian Networks, 2019. cran.r-project.org/web/packages/bnspatial.
[4] gamlss.spatial: Gaussian Markov Random Fields, 2019. cran.r-project.org/web/packages/gamlss.spatial.
[5] M. Genesereth and N. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, 1987.
[6] P. J. Green and S. Richardson. Hidden Markov Models and Disease Mapping. JASA, 2002.
[7] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Record, pages 47–57, 1984.
[8] J. Hughes. ngspatial: A Package for Fitting the Centered Autologistic and Sparse Spatial Generalized Linear Mixed Models for Areal Data. The R Journal, 2014.
[9] J. Hughes, M. Haran, and P. C. Caragea. Autologistic Models for Binary Data on a Lattice. Environmetrics, 2011.
[10] Human Trafficking: Forbes. http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/.
[11] M. Kaiser et al. Goodness of Fit Tests for a Class of Markov Random Field Models. The Annals of Statistics, 2012.
[12] K. Li et al. In-database Batch and Query-time Inference over Probabilistic Graphical Models using UDA-GIST. VLDB Journal, 26(2):177–201, 2017.
[13] G. Y. Lu and D. W. Wong. An Adaptive Inverse-distance Weighting Spatial Interpolation Technique. Journal of Computer and GeoSciences, 34(9):1044–1055, Sept. 2008.
[14] F. Niu et al. Tuffy: Scaling Up Statistical Inference in Markov Logic Networks Using an RDBMS. PVLDB, 4(6):373–384, 2011.
[15] NYC OpenData. https://data.cityofnewyork.us/Environment/NYCCAS-Air-Pollution-Rasters/q68s-8qxv.
[16] Open Geospatial Consortium. http://www.opengeospatial.org/.
[17] S. E. Peters, C. Zhang, M. Livny, and C. Ré. A Machine Reading System for Assembling Synthetic Paleontological Databases. PLoS One, 9(12), 2014.
[18] T. Rekatsinas et al. HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB, 10(11):1190–1201, 2017.
[19] M. Richardson and P. M. Domingos. Markov Logic Networks. Machine Learning, 2006.
[20] I. Sabek, M. Musleh, and M. Mokbel. TurboReg: A Framework for Scaling Up Spatial Logistic Regression Models. In SIGSPATIAL, pages 129–138, 2018.
[21] I. Sabek, M. Musleh, and M. F. Mokbel. A Demonstration of Sya: A Spatial Probabilistic Knowledge Base Construction System. In SIGMOD, pages 1689–1692, 2018.
[22] I. Sabek, M. Musleh, and M. F. Mokbel. Flash in Action: Scalable Spatial Data Analysis Using Markov Logic Networks. In VLDB, 2019.
[23] N. A. Sakhanenko and D. J. Galas. Markov Logic Networks in the Analysis of Genetic Data. Journal of Computational Biology, 17(11):1491–1508, Nov. 2010.
[24] J. Sankaranarayanan et al. TwitterStand: News in Tweets. In SIGSPATIAL, 2009.
[25] J. Shin et al. Incremental Knowledge Base Construction Using DeepDive. PVLDB, 8(11):1310–1321, 2015.
[26] B. Sullivan et al. eBird: A Citizen-based Bird Observation Network in the Biological Sciences. Biological Conservation, 2009.
[27] Texas Ground Water Database. http://www.twdb.texas.gov/groundwater/data/.
[28] M. Wick et al. Scalable Probabilistic Databases with Factor Graphs and MCMC. PVLDB, 3(1-2):794–804, 2010.
[29] C. Zhang et al. GeoDeepDive: Statistical Inference Using Familiar Data-processing Languages. In SIGMOD, 2013.
[30] C. Zhang and C. Ré. Towards High-throughput Gibbs Sampling at Scale: A Study Across Storage Managers. In SIGMOD, pages 397–408, 2013.
[31] M. Zinkevich et al. Parallelized Stochastic Gradient Descent. In NIPS, pages 2595–2603, 2010.