Spatial Clustering
         Spatial  Clusteringof Disease Events
                               of Disease     Using Bayesian
                                            Events  Using
                              Methods
                       Bayesian    Methods

                    Lukáš Marek1, Vít Pászto1, Pavel Tuček1, Jiří Dvorský1
                 Lukáš Marek, Vı́t Pászto, Pavel Tuček, and Jiřı́ Dvorský
        1 Department of Geoinformatics, Faculty of Science, Palacky University in Olomouc,
     Department  of Geoinformatics, Faculty of Science, Palacky University in Olomouc
                    17.17. listopadu50,
                        listopadu    50, Olomouc,
                                         Olomouc, 771
                                                  77146,
                                                       46,Czech
                                                           CzechRepublic
                                                                  Republic
        {lukas.marek, pavel.tucek}@upol.cz, vit.paszto@gmail.com,
            {lukas.marek, pavel.tucek}@upol.cz,
                                    jiri.dvorsky@vsb.cz  vit.paszto@gmail.com,
                                     jiri.dvorsky@vsb.cz
                         http://www.geoinformatics.upol.cz/
                           http://www.geoinformatics.upol.cz/


           Abstract. One of main aims of the spatial analysis of health and medical da-
           tasets is to provide additional information to the specialized medical research.
           These analyses can be used for disease mapping; searching for places with a
           higher intensity and probability of the disease event; or the influence assess-
           ment of selected natural or artificial phenomena. Suitably selected methods al-
           low a proper analysis of these data and identification of irregularities and devia-
           tions of the phenomena in the area of interest. The structure of medical data
           usually needs to be standardized (over age structure of the population) before
           the comparison of different regions. Bayesian statistics derives the posterior
           probability as a consequence of a prior probability and a probability model for
           the data observed. Geosciences and geomedicine usually use the Bayesian theo-
           ry for smoothing of data - to help depict the real spatial pattern and its changea-
           bility. The Bayesian principles, together with the spatial neighbourhood and
           statistical models, are successfully used also for the identification of spatial and
           space-time clusters with significantly higher/lower risk of incidence of the dis-
           ease. These procedures are denoted as methods of spatial clustering and can be
           used with or without utilization of properties of certain phenomena. Particular-
           ly, occurrence data of campylobacteriosis infection in four Moravian regions in
           period 2008 – 2012, which were provided by The National Institute of Public
           Health, were used for the case study.


    1 Introduction
    The disease mapping, visualization of disease rates and the clustering of disease data
    are still one of the most interesting topics in geosciences. It is because of the nature of
    the data which are often pure spatial with rich descriptive part and it is easy to com-
    bine them with other data (demographic, economic, etc.) [14]. This contribution aims
    to present the usage of empirical Bayesian methods in the disease mapping and subse-
    quent creating of disease maps. Bayesian methods incorporate the prior knowledge
    about the phenomenon (or underlying processes) to provide more accurate and easily
    understandable description of the situation. Empirical Bayesian procedures are used
    for disease rates smoothing in the case of choropleth map. They also help to identify
    local clusters of more/less affected areas. The main topic of the case study in this
    paper is the analysis of the spatial distribution of disease called campylobacteriosis in


J. Pokorný, K. Richta, V. Snášel (Eds.): Dateso 2014, pp. 25–34, ISBN 978-80-01-05482-6.
26      Lukáš Marek, Vı́t Pászto, Pavel Tuček, Jiřı́ Dvorský


Moravian regions between years 2008 and 2012 with usage of Bayesian estimates
based on Poisson distribution.

2 Case study and Data
The case study, where further described methods are applied, is dealing with the spa-
tial distribution of the campylobacteriosis in four Moravian regions (Moraskoslezsky,
Zlinsky, Olomoucky and Jihomoravsky) between years 2008 and 2012. There were
almost 49 thousand of cases of the disease during that period, while only 34 thousand
were expected according to previous records. Using disease counts and disease rates
calculated for the municipalities in the area of interest, we tried to identify areas that
are possibly more vulnerable to the disease than their neighbourhood. The 5-year
observed number of cases, expected number of cases and relative risk (SIR) were used
as main disease characteristics for this study.
    Campylobacteriosis is caused by bacteria called Campylobacter jejuni, which is
found worldwide in the intestinal tracts of animals. The bacteria are spiral shaped and
can cause disease in animals and humans. Most cases of campylobacteriosis are asso-
ciated with handling or eating raw or undercooked poultry meat or fresh milk. Cam-
pylobacteriosis causes gastrointestinal symptoms, such as diarrhoea, cramping, ab-
dominal pain, and fever in domestic animals and humans. Young animals and humans
are the most severely affected [23].


2.1 Data

The data set for this study was provided by The National Institute of Public Health of
the Czech Republic. The database contains almost 50 thousands records of the cam-
pylobacteriosis occurrence in the period 2008 – 2012. Names, surnames, identity
numbers and sometimes also the full addresses are not included because it is treated
with sensitive personal data. The data were firstly cleansed of inconsistencies and then
the geocoding process was run. Furthermore, the individual records were aggregated
to the municipalities - administrative units - due to the clarity of the visualization and
analyses [15]. The problem of the conversion of spatial phenomena between different
areal or administrative units is well known as MAUP – Modifiable Area Unit Problem
[18]. During the calculation of disease rates and expected number of cases, the popu-
lation data from the Population and Housing Census of the Czech Republic were used
as the main basis for the data standardization.
    Figure 1 shows the probability density function of disease events counts, total pop-
ulation and standardized incidence ratio in Moravian municipalities visualized in the
logarithmic scale (upper graph) and in the logarithmic scale and centred (lower graph)
in order to simplify visual analysis. The probability function of population and diseas-
es events counts are fairly similar, which indicates the need for standardization and
also analysis that considers this close relation. Figure 2 then depicts the spatial distri-
bution of standardized incidence ratio (SIR). SIR is the ratio of the number of disease
cases observed in the study group or population to the number that would be expected
if the study population had the same specific rates as the standard population, multi-
plied by 100 and usually expressed as a percentage [10]. By this way, SIR expresses
                  Spatial Clustering of Disease Events Using Bayesian Methods                 27


relative risk (or vulnerability) of the municipality to certain disease. Municipalities
traversing value of 1 are more vulnerable to disease, while municipalities with SIR
lower than 1 are healthier.


Fig. 1. The probability density function of disease events counts (red), total population (green)
and standardized incidence ratio (blue) in Moravian municipalities visualized in the logarithmic
scale (upper graph) and in the logarithmic scale and centred (lower graph) in order to simplify
visual analysis.


3 Methods
During the study of disease spatial distribution, mainly in the case of aggregated data,
it is often suitable to focus on the local variability of the disease occurrence or relative
risk rather than examine the study area as a whole. This procedure is usually denoted
the disease cluster detection. The general review of methodology as well as usage of
spatial clustering methods and its Bayesian enhancements in the literature, e.g [6, 11,
21] etc.
    In geosciences the spatial clustering is often encapsulated as the analysis of the spa-
tial autocorrelation. The spatial autocorrelation is the correlation among values of a
single variable, which is strictly attributable to their relatively close locations on a
two-dimensional (2-D) surface, introducing a deviation from the independent observa-
tions assumption of classical statistics [7]. Positive spatial autocorrelation refers to the
patterns where nearby or neighbouring values are more alike; while negative spatial
28      Lukáš Marek, Vı́t Pászto, Pavel Tuček, Jiřı́ Dvorský


autocorrelation refers to the patterns where nearby or neighbouring values are dissimi-
lar. One can distinguish two main types of spatial autocorrelation, which are global
and local. The null hypothesis for global clustering is simply that no clustering exists
(i.e. random spatial dispersion ≈ CSR). Probably the most used method for both global
and local analyses of spatial autocorrelation is Moran’s I statistics (together with e.g.
Getis-Ord G and Geary’s C statistics). Moran’s I coefficient of autocorrelation is simi-
lar to Pearson’s correlation coefficient, and quantifies the similarity of an outcome
variable among areas that are defined as spatially related [16]. The problem with vari-
ance instability for rates or proportions, which served as the motivation for applying
smoothing techniques to maps may also affect the inference for Moran’s I test for
spatial autocorrelation [1]. The implementation of the adjustment procedure of
Assuncao and Reis (1999), which uses a variable transformation based on the Empiri-
cal Bayes principle may be one of solutions. This yields a new variable that has been
adjusted for the potentially biasing effects of variance instability due to differences in
the size of the underlying population at risk [1].


Fig. 2. Choropleth map of standardized incidence ratio, which is generally the ratio between
observed disease cases and its potential (expected) amount, which is based on the population
and its age structure in individual municipalities.


3.1 Spatial clustering of case events data

If a cluster is described as an uncommon collection of events, then it is needed to
detect these collections observed within the data set. Such methods define a set of
potential clusters, collections of events each of which we might define as a cluster if
the collection appears unusual enough (discrepant from the null model of interest),
then identifies the most unusual of these [21]. This general idea motivated the “geo-
                Spatial Clustering of Disease Events Using Bayesian Methods            29


graphical analysis machine” (GAM) of Openshaw where potential clusters were de-
fined as collections of events falling within circular buffers of varying radii [17]. The
buffers were centred at each point in a fine grid covering the study area and the GAM
approach mapped any circle whose collection of events were detected as unusual, e.g.,
those circles where the number of events exceeded the 99.8th percentile of a Poisson
distribution with mean defined by the population size within the buffer multiplied by
the overall disease risk [21]. GAM is very useful for descriptive purposes, but should
not be used for hypothesis testing.
   Scan statistics provide another approach that is similar to the local case/control ra-
tios. A scan statistic involves definition of a moving window and a statistical compari-
son of a measurement (e.g., a count or a rate) within the window to the same sort of
measurement outside the window. Kulldorff [8] defines a spatial scan statistic very
similar to the GAM and other methods, but with a slightly different inferential frame-
work. The primary goal of a scan statistic is to find the collection(s) of cases least
consistent with the null hypothesis, i.e. the most likely cluster(s) but Kulldorff goes a
bit further and seeks to provide a significance value representing the detected cluster’s
unusualness, with an adjustment for multiple testing [22]. Kulldorff [8] considers
circular windows with variable radii ranging from the smallest observed distance be-
tween a pair of cases to a user-defined upper bound. He builds an inferential structure
based on earlier works where authors note that variable-width one-dimensional scan
statistics represent collections of local likelihood ratio tests comparing a null hypothe-
sis of the constant risk hypothesis compared to alternatives where the disease rate
within the scanning window is greater than that outside the window. The maximum
observed likelihood ratio statistic provides a test of overall general clustering and an
indication of the most likely cluster(s), with significance determined by Monte Carlo
testing of the constant risk hypothesis [22].
   The outstanding description of methods including their mathematical apparatus or
their possible implementations and applications provide mainly [8, 9, 17].


3.2 Bayesian mapping and spatial clustering of case events data

Presentation of disease rates in area units as choropleth maps can inadvertently pro-
vide misleading information. This fact is well known mainly in the case of small-area
studies that introduces an extra source of variability into the map because of random
variation. Typically, sparsely populated areas with few (or zero) cases can generate
extreme values of the SMR (and also prevalence), as the variance of the SMR is in-
versely related to expected number of cases and small populations have large variabil-
ity in the estimated rates [5] and that is why risk estimates and other rates are rather
unstable.
Bayesian methods provide a solution for this kind of bias. They use probability mod-
els to obtain smoothed estimates consisting of a compromise between the observed
rate for each region and an estimate from a larger collection of cases and persons at
risk (e.g., the rate observed over the entire study area or over a collection of neigh-
bouring regions) [22]. The basic principle of Bayesian methods is that uncertain data
can be strengthened by combining them with prior information [19]. In the case of
30      Lukáš Marek, Vı́t Pászto, Pavel Tuček, Jiřı́ Dvorský


empirical Bayes estimation of spatially-varying disease risk, posterior risk can be
estimated from a weighted combination of the local risk (also called the likelihood)
and the risk in surrounding areas, the latter representing the prior information [4].
   The set of areal units on which data are recorded can form a regular lattice or differ
largely in both shape and size, so data typically exhibit spatial autocorrelation, with
observations from areal units close together tending to have similar values. A propor-
tion of this spatial autocorrelation may be modelled by including known covariate risk
factors in a regression model, the residual spatial autocorrelation can be induced by a
number of factors, and violates the assumption of independence that is common in
many regression models [12]. The most common remedy for this residual autocorrela-
tion is to augment the linear predictor with a set of spatially correlated random effects,
as part of a Bayesian hierarchical model. These random effects are typically repre-
sented with a conditional autoregressive (CAR) model, which induces spatial autocor-
relation through the adjacency structure of the areal units. However, the CAR priors
force the random effects to exhibit a single global level of spatial autocorrelation,
ranging from independence through to strong spatial smoothing. Such a uniform level
of spatial smoothness for the entire region is unrealistic for real data, which are in-
stead likely to exhibit sub-areas of spatial autocorrelation separated by discontinuities.
Such localized spatial smoothing may occur where rich and poor communities live
side-by-side, and in this context the response variable is likely to evolve smoothly
within each community with a sudden change in its value at the border where the two
communities meet [12].
   To be more particular, the analysis provided in the case study is based on the func-
tion that fits a Poisson log-normal random effects models to spatial count data, where
the random effects are modelled by the localised conditional autoregressive (CAR)
model proposed by [13]. The random effects in neighbouring areas (e.g. those that
share a common border) are modelled as correlated or conditionally independent,
depending on whether the populations living in the two areas are similar (correlated
random effects) or very different (conditionally independent). The model represents
the natural log of the mean function for the set of Poisson responses by a combination
of covariates and a set of random effects. Inference is based on Markov Chain Monte
Carlo (MCMC) simulation, using a combination of Gibbs sampling and Metropolis
steps [12]. The outstanding overview of Bayesian techniques are provided in [11, 12]
and others.

4 Results
Firstly, the original data of disease events needed to be aggregated to the municipality
level, filtered to selected area of the Czech Republic. Subsequently, aggregated counts
that represented actually observed cases served as the bases for the calculation of
expected number of cases in the area that were found out using internally indirect
standardization. SIR, which is the ratio between observed and expected number of
cases and expresses the relative risk of the area can be seen in the Fig. 2.
   Both values served as inputs for the Openshaw’s Geographical Analysis Machine
that allowed the identification of possible disease clusters in the area. Radius for the
analysis was chosen as 7 km, the alpha value for the cluster identification was 99.8
                 Spatial Clustering of Disease Events Using Bayesian Methods                 31


quantile of the Poisson probability distribution. Several significant clusters can be
identified throughout the study area (Fig. 3), but 2 most visible can be seen – the first
is located in the southern part of the area near the Brno municipality, while the second
cluster is placed in densely populated surroundings of Ostrava (but surprisingly except
the city itself). As it was mentioned before, GAM is very useful for descriptive pur-
poses, but should not be used for hypothesis testing because of the overestimation of
clusters. That is why other methods - scan statistics and Bayesian identification of
inference in the area, were performed.


Fig. 3. Identification of spatial disease clusters of campylobacteriosis with the use of Open-
shaw’s Geographical Analysis Machine. Colours and legend depicts standardized incidence
ration, while red squares identify locations that are involved in probable disease spatial clus-
ters.
    Results of scan statistics used for the identification of spatial disease clusters of
campylobacteriosis with the use of the clustering function for Kulldorff and Nagarwal-
la's statistic are shown on the Fig. 4. The scan statistics is based on the Poisson distri-
bution of disease events, 15 % significance and 5 % fraction of total population. Un-
like GAM results, only one significant cluster was identified in the northern part of the
study area and it is located in the surrounding of the Ostrava with the core in the vil-
lage Kateřinice (dark grey area on the Fig. 4).
    The last analysis is based on the function that fits a Poisson log-normal random ef-
fects models to spatial count data, where the random effects are modelled by the local-
ised conditional autoregressive (CAR) model. The model is based on the list of binary
neighbourhood with the queen contiguity conceptualization of space. The observed
amount of cases is modelled as the of logarithmical scale of amount of expected
number of disease events (intercept) and the ratio between young people (under 15)
and elderly people (64+), which is also the basis of the dissimilarity matrix. The anal-
ysis detected only two areas (Fig. 5 - left part) that might be the cores of possible
32       Lukáš Marek, Vı́t Pászto, Pavel Tuček, Jiřı́ Dvorský


clusters. The first municipality is located in the south-western part of the study area
(village Podhradí nad Dyjí). The second theoretical core area is placed in the village
Nelepeč-Žernůvka in the west of the study area. That might indicate other necessary
customization of the model with the use of other characteristics of the area. Similar
analysis based on the distribution of the population was performed due to the compar-
ison. It is depicted in the right part of Fig. 5. Unlike the previous analysis, the result
showed significantly more borders between clustering areas and their neighbourhood.
On the other hand most of them are densely populated, so the analyst should consider
their importance carefully and focus on several individual locations.


Fig. 4. Identification of spatial disease clusters of campylobacteriosis with the use of the clus-
tering function for Kulldorff and Nagarwalla's statistic. Dark grey areas stand for central (core)
area, light grey colour stand for other municipalities in the spatial cluster.


5 Discussion and Conclusion
The contribution aimed to introduce methods of spatial clustering and Bayesian spatial
clustering that were based either on the location of disease events in the study area of
four Moravian regions or their locations and demographical characteristics of munici-
palities. One has to realize that all presented methods are dependent on the scale and
also on the prior information, which is entering the models mostly in the form of the
probability distribution. Therefore, results and their evaluation have to be performed
carefully in order to avoid misinterpretation. The aim of the contribution is therefore
not only to use methods in real case study but also to show several different results
that originally come from the same data.
   Firstly Openshaw’s GAM detected high number of possible diseases clusters, but
due to its disadvantages, results were taken just as informative and an initial step for
further analysis. Then, scan statistics based on Kulldorff and Nagarwalla's statistic was
used for the identification of spatial disease clusters of campylobacteriosis. The scan
statistics discovered one statistically significant cluster on the north of the study area.
Lastly, the Poisson log-normal random effects models to spatial count data, where the
                  Spatial Clustering of Disease Events Using Bayesian Methods                     33


random effects are modelled by the localised conditional autoregressive (CAR) model,
was used to proceed more detailed and complex analysis. This model incorporated the
information about neighbourhood of individual municipalities and also the dissimilari-
ty matrix based on the age structure of the population in the neighbouring villages or
cities. The model was able to identify two core areas of possible clusters.
    One has to realize that Bayesian techniques usually tend to shift values to the mean
risk – global or local by incorporating information between areas. The risks in areas
with more information (e.g., urban areas) are usually less smoothed than in areas that
exhibit higher sampling variation (typically those with low number of cases), and thus
produce more stable estimates of the pattern of underlying disease risk [20]. However,
although raw risks can produce “noisy” maps that are difficult to interpret, over-
smoothed maps may produce a homogeneous risk surface, masking the true risk distri-
bution [3]. It is important to mention that all analyses presented in this paper are heav-
ily dependent on the scale. We chose the scale of municipal districts but results on
other scales could show differences. When someone chooses to broad scale for the
analysis, results will probably reveal one (or several) large cluster so the local vari-
ance disappears. On the other hand, to local scale may not lead to identification of
any clusters. The extension of Bayesian model using other characteristic of the popu-
lation, spatial unit or disease is possible; however their dynamic properties are mainly
shrunk to the sequential procession of time series or time slices.


Fig. 5. Identification of spatial disease clusters of campylobacteriosis with the use of localised
conditional autoregressive (CAR) model based on dissimilarity metrics with binary neighbour-
hood relations to spatial Poisson data. Colours and legend depicts standardized incidence ra-
tion, while red areas identify locations that are centres of probable disease spatial clusters. The
left part describes the relation of likely clusters to the ratio of old people to children, while the
right part is based on the population.


Acknowledgement

The authors gratefully acknowledge the support by the Operational Program Educa-
tion for Competitiveness - European Social Fund (project CZ.1.07/2.3.00/20.0170 of
the Ministry of Education, Youth and Sports of the Czech Republic).
34       Lukáš Marek, Vı́t Pászto, Pavel Tuček, Jiřı́ Dvorský


References

1. Anselin, L.: GeoDaTM 0.9 User’s Guide. (2003).
2. Assuncao, R., Reis, E.: A new proposal to adjust Moran’s I for population density. Stat.
    Med. 2162, November 1998, 2147–2162 (1999).
3. Beale, L. et al.: Methodologic issues and approaches to spatial epidemiology. Environ.
    Health Perspect. 116, 8, 1105–10 (2008).
4. Clayton, D., Bernardinelli, L.: Bayesian methods for mapping disease risk. In: Elliott, P. et
    al. (eds.) Geographical and Environmental Epidemiology: Methods for Small Area Studies.
    Oxford University Press, Oxford (1996).
5. Elliott, P., Wartenberg, D.: Spatial Epidemiology: Current Approaches and Future
    Challenges. Environ. Health Perspect. 112, 9, 998–1006 (2004).
6. Goodchild, M., Haining, R.: GIS and spatial data analysis: Converging perspectives. Pap.
    Reg. Sci. 44, 0, 1–26 (2004).
7. Griffith, D., Arbia, G.: Detecting negative spatial autocorrelation in georeferenced random
    variables. Int. J. Geogr. Inf. Sci. 24, 3, 417–437 (2010).
8. Kulldorff, M.: A spatial scan statistic. Commun. Stat. - Theory Methods. 26, 6, 1481–1496
    (1997).
9. Kulldorff, M., Nagarwalla, N.: Spatial disease clusters: Detection and inference. Stat. Med.
    14, 8, 799–810 (1995).
10. Last, J., Abramson, J.: A Dictionary of Epidemiology. Oxford University Press, USA
    (2001).
11. Lawson, A.B.: Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology.
    CRC Press (2009).
12. Lee, D.: CARBayes: An R Package for Bayesian Spatial. J. Stat. Softw. 55, 13, 24 (2013).
13. Lee, D., Mitchell, R.: Boundary detection in disease mapping studies. Biostatistics. 13, 3,
    415–26 (2012).
14. Marek, L. et al.: Bayesian mapping of medical data. (2014).
15. Marek, L. et al.: On Estimation of the Spatial Clustering: Case Study of Epidemiological
    Data In Olomouc Region, Czech Republic. VŠB – Technická univerzita Ostrava, Ostrava
    (2013).
16. Moran, P.: Notes on continuous stochastic phenomena. Biometrika. 37, 1, 17–23 (1950).
17. Openshaw, S. et al.: A Mark 1 Geographical Analysis Machine for the automated analysis
    of point data sets. Int. J. Geogr. Inf. Syst. 1, 4, 335–358 (1987).
18. Openshaw, S.: The Modifiable Areal Unit Problem. , Norwhich (1984).
19. Pfeiffer, D. et al.: Spatial analysis in epidemiology. Oxford University Press (2008).
20. Richardson, S. et al.: Interpreting Posterior Relative Risk Estimates in Disease-Mapping
    Studies. Environ. Health Perspect. 112, 9, 1016–1025 (2004).
21. Waller, L.: Detection of clustering in spatial data. SAGE Handb. Spat. Anal. 34 (2009).
22. Waller, L.A., Gotway, C.A.: Applied Spatial Statistics for Public Health Data. John Wiley
    & Sons (2004).
23. The Center for food security & public health: Campylobacteriosis, available at:
    http://www.cfsph.iastate.edu/FastFacts/pdfs/campylobacterosis_F.pdf, (2013).