Adopting Markov Logic Networks for Big Spatial Data and Applications

Ibrahim Sabek
Supervised by: Mohamed F. Mokbel
Department of Computer Science and Engineering, University of Minnesota, MN, USA
{sabek, mokbel}@cs.umn.edu

ABSTRACT
Spatial data has become ubiquitous, e.g., GPS data and medical data, at increasingly large scales. This raises the need for efficient spatial machine learning and analysis solutions to extract useful insights from such data. Meanwhile, Markov Logic Networks (MLN) have emerged as a powerful framework for building usable and scalable machine learning tools. Unfortunately, MLN is ill-equipped for spatial applications because it ignores the distinguished characteristics of spatial data. This paper describes SMLN, the first full-fledged MLN framework with native support for spatial data. SMLN comes with a high-level datalog-like language with spatial constructs, and spatially-equipped grounding, inference, and learning modules. We show the effectiveness of SMLN by illustrating three systems, namely, Sya, TurboReg, and Flash, that are already built using SMLN.

1. INTRODUCTION
Data scientists and developers have been spending significant effort applying machine learning and artificial intelligence methods, e.g., deep learning, to analyze their massive data and turn it into useful insights. However, the expertise and effort needed to deploy these methods have become a major blocking factor in the wide deployment of machine learning applications. As a result, Markov Logic Networks (MLN) [19] were recently introduced to reduce this gap. In particular, MLN combines first-order logic [5] with probabilistic graphical models to efficiently represent statistical learning and inference problems with a few logical rules (e.g., rules with imply and bit-wise AND predicates) instead of thousands of lines of code. With MLN, data scientists and developers do not need to worry about the underlying machine learning machinery. Instead, they focus their efforts on developing the rules that represent their applications. Recently, MLN has been widely adopted as a research vehicle for deploying machine learning in various applications, including knowledge base construction [25], data cleaning [18], and genetic analysis [23], among others.

Meanwhile, in recent years, there has been a proliferation in the amount of spatial data produced by devices such as satellites, space telescopes, and medical devices. Various applications and agencies need to analyze these unprecedented amounts of spatial data. For example, epidemiologists use spatial analysis techniques to track infectious diseases [2], and news reporters use geotagged tweets for event detection and analysis [24]. Unfortunately, the spatial machine learning tasks used in these applications (e.g., spatial regression [9]) have not yet taken advantage of the recent advances of MLN to boost their usability, scalability, and accuracy. Furthermore, MLN-based applications (e.g., knowledge base construction [25]) miss important results and lose accuracy when dealing with spatial data. The main reason is that MLN is oblivious to spatial data: the only way to support spatial data in MLN is to simply ignore its distinguished properties (e.g., spatial relationships among objects) and treat it as non-spatial data. While this works to some extent, it results in sub-par performance.

The goal of our work is to provide the first full-fledged MLN framework with native support for spatial data, called Spatial Markov Logic Networks (SMLN). In particular, SMLN pushes spatial awareness inside the internal data structures and the core learning and inference functionalities of MLN, and hence inside all MLN-based machine learning techniques and applications. SMLN consists of four main modules, namely, language, grounding, inference, and learning. The language module extends the DDlog language [25] with spatial data types and predicates to express spatial semantics when writing rules.
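As a miniature illustration of the weighted-rules idea described above, the sketch below shows how a single weighted first-order rule induces a probability distribution over possible worlds, in the spirit of Markov Logic Networks [19]. This is a hedged toy example, not part of SMLN: the rule, the constants, the friendship evidence, and the weight are all invented for illustration.

```python
import itertools
import math

# Toy MLN with one weighted rule: Smokes(x) AND Friends(x, y) => Smokes(y).
# All names and the weight W below are invented for illustration.
PEOPLE = ["anna", "bob"]
FRIENDS = {("anna", "bob"), ("bob", "anna")}  # fixed evidence
W = 1.5                                       # rule weight

def n_satisfied(smokes):
    """Number of ground instances of the rule satisfied in one world."""
    total = 0
    for x, y in itertools.product(PEOPLE, repeat=2):
        antecedent = smokes[x] and (x, y) in FRIENDS
        total += (not antecedent) or smokes[y]
    return total

# A world is a truth assignment to all Smokes(.) atoms;
# P(world) is proportional to exp(W * n_satisfied(world)).
WORLDS = [dict(zip(PEOPLE, bits))
          for bits in itertools.product([False, True], repeat=len(PEOPLE))]

def prob(person, evidence=None):
    """Marginal P(Smokes(person) = true | evidence) by exhaustive enumeration."""
    evidence = evidence or {}
    ok = [w for w in WORLDS if all(w[k] == v for k, v in evidence.items())]
    den = sum(math.exp(W * n_satisfied(w)) for w in ok)
    num = sum(math.exp(W * n_satisfied(w)) for w in ok if w[person])
    return num / den

print(round(prob("bob"), 3))                  # 0.5 (symmetric prior)
print(round(prob("bob", {"anna": True}), 3))  # 0.818 (evidence raises it)
```

In a real MLN the number of possible worlds is exponential, which is exactly why production systems replace enumeration with grounding and sampling-based inference machinery.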
The grounding module constructs a spatial variation of the factor graph [28], namely, the Spatial Factor Graph, to efficiently represent SMLN graphical models. The inference module provides a novel Gibbs Sampling [30] algorithm that combines the conclique [11] concept from spatial statistics with the query-driven independent Metropolis-Hastings approach [12]. The learning module employs a new distance-based optimization technique based on the gradient descent method [31] to efficiently learn SMLN model parameters.

Users of SMLN can seamlessly build a myriad of scalable spatial applications without worrying about the underlying spatial machine learning and computation details. We show three case studies that use SMLN as a backbone for their computation: Sya [21], a system for spatial probabilistic knowledge base construction; TurboReg [20], a framework for scaling up spatial autologistic regression models; and Flash [22], a framework for scalable spatial data analysis.

This work is partially supported by the National Science Foundation Grants IIS-1525953 and CNS-1512877, USA.

Proceedings of the VLDB 2019 PhD Workshop, August 26th, 2019. Los Angeles, California. Copyright (C) 2019 for this paper by its authors. Copying permitted for private and academic purposes.

2. SMLN OVERVIEW
Figure 1: SMLN System Architecture.

Figure 1 provides an overview of the SMLN architecture. Three types of users interact with SMLN, namely, developers, casual users, and administrators. Developers have expertise with MLN and use the high-level language provided by SMLN to create new applications; we briefly review three example applications, Sya [21], TurboReg [20], and Flash [22], in Sections 3, 4, and 5, respectively. Casual users use standard querying or visualization APIs to perform inference and learning queries over the built applications (e.g., what is the probability of a specific event happening?). Administrators monitor the system and tune the inference and learning configurations. SMLN adopts an extensible approach, where it injects spatial awareness inside the four main modules of MLN, namely, language, grounding, inference, and learning. In the rest of this section, we highlight our contributions in each of these four modules.

2.1 The Language Module
SMLN employs a high-level language to help users write on-top applications as a set of rules, saving huge coding effort. Instead of providing a completely new language, SMLN extends DDlog [25], a datalog-like language for defining MLN rules, with spatial data types and predicates that conform to the Open Geospatial Consortium (OGC) standard [16]. Such extensions allow users to express their spatial semantics without re-implementing user-defined functions in each application. For example, SMLN adds four spatial data types, namely, point, rectangle, polygon, and linestring, to the schema declaration of relations in DDlog. Figure 2 shows an example of two schema declarations S1 and S2 with point spatial attributes.

Schema Declaration
S1: County (id bigint, location point, hasLowSanitation bool).
S2: HasEbola? (id bigint, location point).
Inference Rule
R1: HasEbola(C1, L1) => HasEbola(C2, L2) :- County(C1, L1, -), County(C2, L2, S2) [distance(L1, L2) < 2.5, within(liberia_geom, L1), S2 = true].

Figure 2: Example on Spatial Extensions in DDlog.

In addition, SMLN extends the derivation, supervision, and inference rules in DDlog with spatial predicates (e.g., overlaps, within, and distance) and functions (e.g., union and buffer) to efficiently evaluate the relationships between spatial objects. Such predicates and functions can be composed. For example, the inference rule R1 in Figure 2, which captures the effect of Ebola-infected counties on each other, is composed of two spatial predicates, distance and within, that measure the distance between infected counties and check whether they are located in Liberia, respectively. Once an SMLN program is submitted, this module checks the syntactic correctness of the used spatial constructs, compiles the program, and forwards the output to the grounding module.

2.2 The Grounding Module
Grounding is an essential operation in the MLN execution pipeline: it constructs a data structure called the factor graph [28], on which inference and learning operations are later performed. Such a factor graph is efficiently constructed by evaluating the compiled rules from the language module as a sequence of SQL queries (e.g., [25]). To accommodate the newly introduced spatial constructs in rules, e.g., distance, SMLN adapts a scalable in-database grounding technique from [14] to translate and evaluate rules with these constructs as a set of spatial SQL queries (e.g., range queries and spatial joins). The generated queries are then executed through standard spatial database engines, e.g., PostGIS, to produce a spatial variation of the factor graph, namely, the Spatial Factor Graph, that consists of: (1) logical and spatial random variables, and (2) logical and spatial correlations among these variables.

SMLN provides two effective optimizations in the grounding process: (1) it supports creating on-the-fly spatial indices (e.g., R-tree [7]) on relations with spatial attributes, making the evaluation of complex predicates (e.g., overlap) more efficient, and (2) it provides a simple heuristic query optimizer that re-orders the execution of nested spatial queries that come from rules with multiple spatial predicates. Moreover, SMLN provides an abstract database driver that supports defining spatial storage, functions, and query capabilities. This abstract driver can be extended by users to run their spatial database engine of choice inside SMLN.

2.3 The Inference Module
The main objective of the inference module in MLN is to infer the values of the variables in the constructed factor graph and compute their associated probabilities in an efficient and scalable manner [14]. Gibbs Sampling [30] is considered the most widely used inference algorithm in MLN systems, mainly because of its simplicity and efficiency. However, there are two main limitations in using existing Gibbs sampling techniques to infer the values of spatial factor graph variables. First, these techniques infer values that maximize the satisfaction of the logical semantics (e.g., imply) encoded in the factor graph; in the case of a spatial factor graph, these inferred values will be suboptimal because they never consider the spatial correlation between variables. Second, these techniques require a large number of sampling iterations (i.e., slow convergence) to obtain acceptable output when there are spatial correlations among variables, because they perform sequential sampling over the factor graph nodes [11].

To overcome these two limitations, SMLN employs a novel Gibbs Sampling algorithm, namely, Spatial Gibbs Sampling, that can efficiently perform inference on the spatial factor graph coming from the grounding module.
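For intuition about the baseline being improved here, the sketch below runs plain sequential Gibbs sweeps over a toy lattice of binary spatial variables with an attractive pairwise coupling. This is a hedged Ising-style stand-in for a spatial factor graph, not SMLN's actual sampler; the grid size, coupling strength, and seed are invented.

```python
import math
import random

# Hedged sketch: sequential Gibbs sampling over a toy N x N lattice of
# binary spatial variables with an attractive pairwise coupling (an
# Ising-style stand-in for a spatial factor graph; N, COUPLING, and the
# seed are invented for illustration).
N, COUPLING = 8, 0.6
random.seed(7)

def neighbors(i, j):
    """Yield the 4-neighbourhood of cell (i, j), clipped to the grid."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < N and 0 <= j + dj < N:
            yield i + di, j + dj

def gibbs_sweep(grid):
    """Resample every cell, in order, from its conditional distribution
    given the *current* values of its four neighbours."""
    for i in range(N):
        for j in range(N):
            # Signed sum of neighbour states (+1 for True, -1 for False).
            s = sum(1 if grid[a][b] else -1 for a, b in neighbors(i, j))
            p_one = 1.0 / (1.0 + math.exp(-2.0 * COUPLING * s))
            grid[i][j] = random.random() < p_one

grid = [[random.random() < 0.5 for _ in range(N)] for _ in range(N)]
for _ in range(100):
    gibbs_sweep(grid)

# With an attractive coupling, sweeps drive neighbouring cells to agree.
agreement = sum(grid[i][j] == grid[a][b]
                for i in range(N) for j in range(N)
                for a, b in neighbors(i, j))
pairs = sum(1 for i in range(N) for j in range(N) for _ in neighbors(i, j))
print(f"neighbour agreement: {agreement / pairs:.2f}")
```

Each update depends on the current values of the cell's neighbours, which is what makes naive sweeps inherently sequential. Note that cells of the same checkerboard colour share no edges and are conditionally independent given the other colour; that checkerboard is the simplest example of the conclique-style independent partitions that can be sampled in parallel.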
To take the spatial correlations into account, the proposed sampling algorithm adapts a new variation of the query-driven independent Metropolis-Hastings approach [12] that uses the inverse-distance method [13] to spatially weigh the correlations among variables in the spatial factor graph, and hence yields more accurate inferred values.

To alleviate the slow-convergence issue, a straightforward solution is to randomly partition the variables into a set of buckets and then sample these buckets in parallel. Even though this solution finishes the sampling iterations faster than the sequential one, it may not converge to an acceptable solution, as spatially-dependent variables might be sampled in parallel (i.e., independently of each other). This forces the sampler to run additional sampling iterations to converge, and hence incurs a significant latency overhead. As a result, SMLN employs an approach that combines an in-memory spatial partitioning technique, namely, the pyramid index [1], with a well-known spatial statistics concept, namely, concliques [11], to heuristically partition the spatial factor graph into a set of spatially-independent partitions and sample them in parallel. Defining concliques ensures the neighbouring independence between nodes in the same conclique set, and hence these nodes can be efficiently sampled in parallel.

2.4 The Learning Module
In general, the learning module in MLN mainly focuses on optimizing the weights of the correlations defined in the factor graph. However, in the case of a spatial factor graph, the relative distance between the spatial variables participating in any correlation should be considered as well to learn optimal weights. In particular, correlations between spatially close variables should have a higher effect on the learned weights than correlations between distant variables. We refer to this concept as correlation locality. Recently, gradient descent optimization [31] has been widely used to optimize the weights in MLN models that use Gibbs sampling inference (e.g., [25]). However, standard gradient descent optimization techniques fall short in supporting the correlation locality concept. As a result, SMLN introduces a new variation of gradient descent optimization that applies a distance-based weighing technique. Given a correlation c, defined over m spatial variables v_1, v_2, ..., v_m in the spatial factor graph, its weight w_c is optimized as follows:

    w_c = w_c + \frac{m(m-1)/2}{\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} d(v_i, v_j)} \, \alpha g    (1)

where g is the gradient sign (either 1 or -1), α is the step size, and d(v_i, v_j) is the Euclidean distance between the variables v_i and v_j. Since m(m-1)/2 is the number of variable pairs in c, the multiplier is the reciprocal of the average pairwise distance, so correlations among spatially close variables receive larger gradient steps.

Figure 3: Initial Experiments. (a) Sya Accuracy. (b) TurboReg Latency.

3. SYA SYSTEM
Knowledge base construction (KBC) has been an active area of research over the last two decades, with several system prototypes coming from academia and industry, along with many important applications, e.g., web search, digital libraries, and health care. Most recently, KBC systems have employed the idea of MLN to associate each extracted relation with a probability of how confident the system is that this relation is factual. DeepDive [25], an MLN-based system, has emerged as one of the most popular probabilistic KBC systems, applied in different domains (e.g., law enforcement [10], geology [29], and paleontology [17]).

Unfortunately, DeepDive does not fully utilize the underlying spatial information, which results in less accurate factual scores. This is for two reasons: (1) DeepDive treats any predicate in the knowledge base inference rules as a boolean function, which yields either true or false (i.e., satisfied or not); so, although one can define spatial predicates in DeepDive, internally DeepDive and its inference engine do nothing special for any spatial predicate. (2) DeepDive estimates the factual scores of extracted relations based only on how much support these relations have in the training data; however, when spatial information exists, the relative distance between the entities participating in the extracted relations should be considered as well.

To overcome these limitations in DeepDive, we proposed Sya [21], a spatial MLN-based KBC system built using our SMLN framework. We re-implemented the existing grounding, inference, and learning modules in DeepDive using the corresponding modules in SMLN. We initially evaluated Sya by building two real spatial knowledge bases about: (1) the water quality in Texas [27], namely, GWDB, and (2) the air pollution in New York City [15], namely, NYCCAS. Figure 3(a) shows the quality (i.e., accuracy) of both Sya and DeepDive, measured by the F1-score, when building these two knowledge bases. Sya achieves an improvement of 120% and 27% over DeepDive in GWDB and NYCCAS, respectively.

4. TURBOREG SYSTEM
Autologistic regression [9] is an important statistical tool for predicting spatial phenomena. Unlike standard logistic regression, which assumes that predictions of spatial phenomena over neighbouring locations are completely independent of each other, autologistic regression takes into account the spatial dependence between neighbouring locations while building and running the prediction model (i.e., neighbouring locations tend to systematically affect each other). Myriad applications require the autologistic regression model to be built over large spatial data. However, existing methods for autologistic regression (e.g., see [9]) are prohibitively computationally expensive for large grid data, e.g., fine-grained satellite images and large spatial epidemiology datasets. It can take about a week to infer the model parameters using training data of only a few gigabytes [9].

To solve this issue, we introduced TurboReg [20], a scalable framework for building autologistic models with large numbers of prediction and predictor variables. In TurboReg, we employed the inference and learning modules of SMLN to estimate the autologistic model parameters in an accurate and efficient manner. Basically, TurboReg provides an equivalent first-order logic [5] representation of the dependency relations among neighbours in autologistic models. In particular, TurboReg transforms each neighbouring dependency relation into a predicate with a bitwise-AND operation on all variables involved in this relation. For example, a binary dependency relation between neighbouring variables C1 and C2 is transformed to C1 ∧ C2. This simple logical transformation allows non-experts to express the dependency relations within autologistic models in a simple way, without needing to specify complex details.

We experimentally evaluated TurboReg using a real dataset of the daily distribution of bird species [26], and compared its scalability to a state-of-the-art method, namely, ngspatial [8]. Figure 3(b) shows the running time for both systems while building an autologistic model over grid sizes ranging from 250 to 84k cells. TurboReg achieves at least three orders of magnitude reduction in running time over ngspatial, while preserving the same accuracy in estimating the model parameters.

5. FLASH FRAMEWORK
Just as MLN made it possible for data scientists and developers to overcome the difficulty of deploying machine learning techniques, we aim to use our proposed SMLN framework as a backbone infrastructure to support long-lasting spatial analysis techniques that lack scalability and suffer from deployment difficulty (e.g., [3, 4]).

To that end, we proposed Flash [22], a framework for generic and scalable spatial data analysis. Flash focuses on building a major class of spatial analysis techniques, called spatial probabilistic graphical modelling (SPGM) (e.g., [3, 4, 6]), which uses probability distributions and graphical representations to describe spatial phenomena and make predictions about them. In Flash, we built a generic transformation module that is responsible for generating a set of weighted logical rules representing any SPGM input. The generated rules are then executed normally through the SMLN framework, where the weights of these rules represent the parameters of the original SPGM and are inferred using the SMLN inference and learning modules. Currently, Flash supports transformations for three spatial graphical models: spatial Markov random fields [4], spatial hidden Markov models [6], and spatial Bayesian networks [3].

6. CONCLUSION
In this paper, we introduced SMLN, a framework that injects spatial awareness inside the core functionality of Markov Logic Networks (MLN). SMLN equips the MLN framework with a spatial high-level language, and spatially-optimized grounding, inference, and learning modules. Data scientists and developers can exploit SMLN to build numerous scalable and efficient spatial machine learning and analysis applications. We showed three on-top applications: Sya, a system for spatial probabilistic knowledge base construction; TurboReg, a framework for scaling up spatial autologistic regression models; and Flash, a framework for scalable spatial data analysis.

7. REFERENCES
[1] W. G. Aref and H. Samet. Efficient Processing of Window Queries in the Pyramid Data Structure. In PODS, 1990.
[2] A. Auchincloss et al. A Review of Spatial Methods in Epidemiology. Annual Review of Public Health, 2012.
[3] bnspatial: Spatial Bayesian Networks, 2019. cran.r-project.org/web/packages/bnspatial.
[4] gamlss.spatial: Gaussian Markov Random Fields, 2019. cran.r-project.org/web/packages/gamlss.spatial.
[5] M. Genesereth and N. Nilsson. Logical Foundations of Artificial Intelligence. Morgan Kaufmann Publishers, 1987.
[6] P. J. Green and S. Richardson. Hidden Markov Models and Disease Mapping. JASA, 2002.
[7] A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. SIGMOD Record, pages 47-57, 1984.
[8] J. Hughes. ngspatial: A Package for Fitting the Centered Autologistic and Sparse Spatial Generalized Linear Mixed Models for Areal Data. The R Journal, 2014.
[9] J. Hughes, M. Haran, and P. C. Caragea. Autologistic Models for Binary Data on a Lattice. Environmetrics, 2011.
[10] Human Trafficking: Forbes. http://www.forbes.com/sites/thomasbrewster/2015/04/17/darpa-nasa-and-partners-show-off-memex/.
[11] M. Kaiser et al. Goodness of Fit Tests for a Class of Markov Random Field Models. The Annals of Statistics, 2012.
[12] K. Li et al. In-database Batch and Query-time Inference over Probabilistic Graphical Models using UDA-GIST. VLDB Journal, 26(2):177-201, 2017.
[13] G. Y. Lu and D. W. Wong. An Adaptive Inverse-distance Weighting Spatial Interpolation Technique. Computers & Geosciences, 34(9):1044-1055, Sept. 2008.
[14] F. Niu et al. Tuffy: Scaling Up Statistical Inference in Markov Logic Networks Using an RDBMS. PVLDB, 4(6):373-384, 2011.
[15] NYC OpenData. https://data.cityofnewyork.us/Environment/NYCCAS-Air-Pollution-Rasters/q68s-8qxv.
[16] Open Geospatial Consortium. http://www.opengeospatial.org/.
[17] S. E. Peters, C. Zhang, M. Livny, and C. Ré. A Machine Reading System for Assembling Synthetic Paleontological Databases. PLoS One, 9(12), 2014.
[18] T. Rekatsinas et al. HoloClean: Holistic Data Repairs with Probabilistic Inference. PVLDB, 10(11):1190-1201, 2017.
[19] M. Richardson and P. M. Domingos. Markov Logic Networks. Machine Learning, 2006.
[20] I. Sabek, M. Musleh, and M. Mokbel. TurboReg: A Framework for Scaling Up Spatial Logistic Regression Models. In SIGSPATIAL, pages 129-138, 2018.
[21] I. Sabek, M. Musleh, and M. F. Mokbel. A Demonstration of Sya: A Spatial Probabilistic Knowledge Base Construction System. In SIGMOD, pages 1689-1692, 2018.
[22] I. Sabek, M. Musleh, and M. F. Mokbel. Flash in Action: Scalable Spatial Data Analysis Using Markov Logic Networks. In VLDB, 2019.
[23] N. A. Sakhanenko and D. J. Galas. Markov Logic Networks in the Analysis of Genetic Data. Journal of Computational Biology, 17(11):1491-1508, Nov. 2010.
[24] J. Sankaranarayanan et al. TwitterStand: News in Tweets. In SIGSPATIAL, 2009.
[25] J. Shin et al. Incremental Knowledge Base Construction Using DeepDive. PVLDB, 8(11):1310-1321, 2015.
[26] B. Sullivan et al. eBird: A Citizen-based Bird Observation Network in the Biological Sciences. Biological Conservation, 2009.
[27] Texas Ground Water Database. http://www.twdb.texas.gov/groundwater/data/.
[28] M. Wick et al. Scalable Probabilistic Databases with Factor Graphs and MCMC. PVLDB, 3(1-2):794-804, 2010.
[29] C. Zhang et al. GeoDeepDive: Statistical Inference Using Familiar Data-processing Languages. In SIGMOD, 2013.
[30] C. Zhang and C. Ré. Towards High-throughput Gibbs Sampling at Scale: A Study Across Storage Managers. In SIGMOD, pages 397-408, 2013.
[31] M. Zinkevich et al. Parallelized Stochastic Gradient Descent. In NIPS, pages 2595-2603, 2010.