Data Warehouse Development to Identify Regions
      with High Rates of Cancer Incidence in México
     through a Spatial Data Mining Clustering Task.

                            Joaquin Pérez Ortega1,
                        María del Rocío Boone Rojas1,2,
                       María Josefa Somodevilla García2,
                     Mariam Viridiana Meléndez Hernández2

   1 Centro Nacional de Investigación y Desarrollo Tecnológico, Cuernavaca Mor. Mex.
     2 Benemérita Universidad Autónoma de Puebla, Fac. Cs. de la Computación, México.
    jperez@cenidet.edu.mx,{rboone,mariasg}@cs.buap.mx,mvmh_099@hotmail.com


                                        Abstract

Data warehouses arise in many contexts, such as business, medicine and science, in which
the availability of a repository of heterogeneous data sources, integrated and organized
under a unified framework, facilitates analysis and supports the decision-making
process. These data repositories broaden their scope and application when used for data
mining tasks, which can extract new, useful and valuable knowledge from large amounts
of data.
    This paper presents the design and implementation of a population-based data
warehouse on the incidence of cancer in Mexico, based on the multidimensional model
at the conceptual level and on the ROLAP (Relational On-Line Analytical Processing)
model at the implementation level.
    The data warehouse is built to be used as input for clustering data mining tasks, in
particular the k-means algorithm, in order to identify regions in Mexico with high rates of
cancer incidence.
    The identified regions, as well as the dimension related to the geographic location of
the municipalities and their cancer incidence rates, are processed by IRIS, a
Geographic Information System developed by the National Institute of Statistics,
Geography and Informatics of Mexico.

1 Introduction
Data warehouses arise in many contexts, such as business, medicine and science,
in which the availability of a repository of heterogeneous data sources, integrated
and organized under a unified framework, facilitates analysis and supports the
decision-making process. These data repositories broaden their scope and
application when used for data mining tasks, which can extract new, useful and
valuable knowledge from large amounts of data.
    Data warehouses have been applied mainly in the commercial and business
areas [3]. More recently there have been some applications in the health field
[16] [17], along with a trend towards their integration with various technologies
[11] [16].
    Moreover, according to the literature, the use of data mining systems applied
to the analysis of massive population-based health databases has been limited.
Noteworthy works include: Constructing Overview+Detail Dendrogram Matrix
Views [6], Application of data mining techniques to population databases of
cancer [1], Subgroup Discovery in Cervical Cancer Analysis Using Data Mining
Techniques [18] and Data Mining for Cancer Management in Egypt [10]. In the
case of Mexico, to the best of our knowledge, the works developed at the Centro
Nacional de Investigación y Desarrollo Tecnológico and BUAP are the first in
this field.
    This work has been preceded by other studies on the incidence of other
cancers, such as stomach and lung cancer [15]. It is part of a larger project aimed
at proposing improvements to the k-means algorithm in aspects such as
effectiveness and efficiency, reported in [12], [13] and [14], and at applying the
algorithm in the health field.
    This article presents the design and integration of the data warehouse for the
development of a data mining task on cancer incidence by regions in Mexico,
based on the integration of complementary technologies such as clustering and
geographical information systems. As a case study, the results for the incidence
of cervical cancer are presented. This case is of special interest because, in
Mexico, cervical cancer is the leading cause of cancer death in women [11].
    The paper is organized as follows. Following this introduction, Section 2
describes the data sources and the design and implementation of the data
warehouse. Section 3 provides an overview of the data mining application.
Section 4 presents the results for the case of cervical cancer and their
visualization with the INEGI IRIS GIS [5]. Finally, Section 5 presents the
conclusions and perspectives of this work.

2 The Data Warehouse
Collecting and integrating the data warehouse on cancer incidence by region in
Mexico required selecting the data sources necessary to accomplish the data
mining task. This section describes the data sources, the conceptual design
based on the multidimensional model, and the implementation of the data
warehouse under the ROLAP approach.

2.1 The Data Sources

In the study, the processed databases have been derived from official records of
the National Institute of Public Health (INSP) and the National Institute of
Statistics, Geography and Informatics (INEGI) of Mexico.
    Data on cancer incidence were obtained through the Remote Consultation
System for Health Information (SCRIS) subsystem of the INSP [9]. In
particular, the databases were queried for cases of cancer mortality, and the
results were configured by considering levels of aggregation such as: national,
state, division (jurisdiction, municipality), year, age range, gender and cause
(including tumors).
    The information on the population and the actual geographical location of the
municipalities was obtained from INEGI official databases through its
Geographic Information System IRIS, which holds statistical information covering
a wide range of geographic, demographic, social and economic subjects; it also
includes aspects of the physical environment, natural resources and
infrastructure. This wealth of statistical and geographical data was obtained
through various activities such as conducting population and housing censuses
and economic censuses, and generating basic cartography.
    The information in the databases of the above institutions is integrated into a
data warehouse (see Fig. 1). Following the conventions in the health area, only
the municipalities with more than one hundred thousand inhabitants were
considered for this study.

 Fig. 1 Multidimensional model of the data warehouse on the incidence of cancer
                               in Mexico.


2.2 Multidimensional Model of the Data Warehouse for Population-Based
Cancer Incidence in Mexico

According to [4], the conceptual data model most widely used for data
warehouses is the multidimensional model. The data are organized around
facts, which have attributes or measures that can be viewed in more or less detail
along certain dimensions. In our case, the data warehouse design at the
conceptual level is based on the multidimensional model, in which the
dimensions CAUSE, TIME and PLACE can be distinguished. Here, a country is
considered to have the basic fact "deaths", which may have associated
attributes such as number of cases, incidence rate, mean, variance, etc. A fact can
be detailed along several dimensions such as cause of death, place of death, date
of death, etc. Fig. 1 shows the fact "deaths" and three dimensions with various
levels of aggregation. The arrows can be read as "is aggregated into". As shown
in Fig. 1, each dimension has a hierarchical structure, though not necessarily a
linear one. When the number of dimensions does not exceed three, each
combination of aggregation levels can be represented as a cube.
    The cube is made up of cells, with one cell for each possible combination of
values of the dimensions at the corresponding level of aggregation. In this "view",
each cell represents a fact. Fig. 2 shows a three-dimensional cube corresponding
to the fact: "According to the 2000 census, in the municipality of Atlixco there
were 15 deaths from cervical cancer", in which the dimensions Cause, Place and
Time have been aggregated by type of disease (cancer), municipality and census.
The representation of a fact therefore corresponds to a cell in the cube. The value
of the cell is the observed measure (in this case, the number of deaths).
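
    To make the cube representation concrete, the following is a minimal Python
sketch, not the implementation used in this work. It stores facts as cells keyed by
(cause, place, time) and rolls the Place dimension up from municipality to state.
The Atlixco cell is the fact quoted above; the other two death counts are taken
from the tables in Section 4 and are assumed here to belong to the same census.
The dictionary layout and the name `roll_up_place` are illustrative choices.

```python
from collections import defaultdict

# Each cell of the cube is one fact: (cause, municipality, census) -> deaths.
# The Atlixco cell is the fact quoted in the text; the other two counts come
# from the paper's tables and are ASSUMED here to refer to the same census.
cube = {
    ("cervical cancer", "Atlixco", 2000): 15,
    ("cervical cancer", "Tapachula", 2000): 27,
    ("cervical cancer", "Comitán de Domínguez", 2000): 8,
}

# One level of the Place hierarchy: municipality -> state.
state_of = {
    "Atlixco": "Puebla",
    "Tapachula": "Chiapas",
    "Comitán de Domínguez": "Chiapas",
}

def roll_up_place(cells, parent):
    """Aggregate the measure from municipality level to state level."""
    totals = defaultdict(int)
    for (cause, municipality, census), deaths in cells.items():
        totals[(cause, parent[municipality], census)] += deaths
    return dict(totals)

by_state = roll_up_place(cube, state_of)
```

Rolling up along Cause or Time would follow the same pattern with a different
parent mapping.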


             Fig. 2 Display of a fact in a multidimensional model


2.3 ROLAP (Relational OLAP) Schema Implementation of the Data Warehouse
for Population-Based Cancer Incidence in Mexico

One of the most efficient ways to implement a multidimensional model using
relational databases is the ROLAP model [4]. In our case, the tables for
the ROLAP model have the following schemas:
Snowflake Tables
  Cause dimension
    DISEASE (Clave_Enfermedad, name, IdGama, CategoryID)
    GAMA (IdGama, CategoryID, Description)
    CATEGORY (CategoryID, Description)
  Place dimension
    STATE (Clave_Estado, name, población_total)
    MUNICIPALITY (Clave_Municipio, Clave_Estado, name, población_total,
                  Loc_x, Loc_y, extension, tipo_zona, nivel_socioeconómico)
  Time dimension
    YEAR (Idan)
    CENSUS (IdCenso, Idan, number, name)
Fact Tables
    DEATH (IdEnfermedad, IdCenso, IdMunicipio, no_casos, rate, mean, variance)
Star Tables
    TIME (Idan, IdCenso)
    CAUSE (IdEnfermedad, IdGama, CategoryID)
    PLACE (IdCiudad, IdMunicipio)
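
    As an illustration of how the fact table joins to its dimension tables under
ROLAP, the sketch below builds a reduced version of the schema in an in-memory
SQLite database using Python's standard `sqlite3` module. It rests on several
assumptions and is not the implementation used in this work: the paper does not
name the DBMS, the column types are guessed, only a subset of columns is kept,
`población_total` is written without the accent, the state keys are assumed to be
the first two digits of the municipality keys, and the disease and census keys
(1 and 2000) are hypothetical. The two sample rows use the populations and death
counts reported in Table 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Reduced schema: two Place-dimension tables and the fact table.
# Column types are assumptions; the paper lists column names only.
conn.executescript("""
CREATE TABLE STATE (
    Clave_Estado TEXT PRIMARY KEY,
    name         TEXT
);
CREATE TABLE MUNICIPALITY (
    Clave_Municipio TEXT PRIMARY KEY,
    Clave_Estado    TEXT REFERENCES STATE(Clave_Estado),
    name            TEXT,
    poblacion_total INTEGER
);
CREATE TABLE DEATH (
    IdEnfermedad INTEGER,
    IdCenso      INTEGER,
    IdMunicipio  TEXT REFERENCES MUNICIPALITY(Clave_Municipio),
    no_casos     INTEGER
);
""")

conn.executemany("INSERT INTO STATE VALUES (?, ?)",
                 [("21", "Puebla"), ("07", "Chiapas")])
conn.executemany("INSERT INTO MUNICIPALITY VALUES (?, ?, ?, ?)",
                 [("21019", "21", "Atlixco", 117111),
                  ("07089", "07", "Tapachula", 271674)])
# Hypothetical keys: disease 1 = cervical cancer, census 2000.
conn.executemany("INSERT INTO DEATH VALUES (?, ?, ?, ?)",
                 [(1, 2000, "21019", 15), (1, 2000, "07089", 27)])

# Walk the Place hierarchy from the fact table up to the state level.
rows = conn.execute("""
    SELECT s.name, m.name, d.no_casos
    FROM DEATH d
    JOIN MUNICIPALITY m ON m.Clave_Municipio = d.IdMunicipio
    JOIN STATE s        ON s.Clave_Estado = m.Clave_Estado
    WHERE d.IdEnfermedad = 1 AND d.IdCenso = 2000
    ORDER BY d.no_casos DESC
""").fetchall()
```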


3 Data Mining Application on Cancer Incidence

The implemented data warehouse has been used to develop a spatial data mining
task based on the integration of the data warehouse with additional technologies,
such as clustering and Geographic Information Systems, which in this case are
well suited to identifying and displaying areas with cancer incidence in
Mexico. The following provides a general description of the process of
integrating the technologies and tools (Fig. 3) used for this application.
    The data warehouse integrates the following information for our application:
the spatial component, which allows the regions of municipalities to be viewed;
population data such as the death rate and incidence rate; and the time
component, which in this case is the census year.
    The INEGI IRIS GIS [5], through its options, allows the retrieval of
population data and the real location of the municipalities, which are integrated
into the data warehouse.
    Since IRIS stores the geographical representation of municipalities as
polygons in the standardized vector format "shape", a process of transferring
shapes and formats is needed in order to obtain a numerical representation of
each municipality. In this case, the representation is a point at the center of the
municipality, which is obtained primarily through the tools of ESRI's ArcInfo
GIS.
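
    This polygon-to-point conversion is performed with ArcInfo, and the
computation it uses is not described here. As a rough illustration only, one
common way to reduce a simple polygon to a representative point is the
area-weighted centroid obtained from the shoelace formula. The function name
`polygon_centroid` and the sample coordinates below are hypothetical.

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple (non-self-intersecting) polygon.

    `vertices` is a list of (x, y) pairs in order around the boundary;
    the first vertex does not need to be repeated at the end.
    """
    twice_area = 0.0
    cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0      # signed area contribution of this edge
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area) = sum / (3 * twice_area).
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)

# Hypothetical municipality outline (a 2 x 2 square), not real IRIS data.
center = polygon_centroid([(0, 0), (2, 0), (2, 2), (0, 2)])
```

For strongly concave municipalities the centroid can fall outside the polygon, so
a GIS tool may use a different representative point.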


              Fig. 3 Integration of Technology and Data Mining Tools

    Given the numerical representation of each municipality as a point (x, y),
along with its cancer incidence rate, the Matlab programming environment and
its implementation of the k-means algorithm [2] [7] are used to generate
patterns/groups of municipalities and the corresponding centroids.
    Once these results are available, the numerical data must be converted back
to the shape format, a process similar to the one above using ArcInfo tools,
which allows viewing through the IRIS GIS.
    Finally, the groups of municipalities and their corresponding centroids are
passed as GIS layers to IRIS for display on the geographic map of Mexico.
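
    The clustering step in this pipeline relies on Matlab's k-means
implementation. The sketch below is not that implementation: it is a plain
Python version of the standard assign-and-update iteration, included only to
show what the step does with (x, y, rate) rows. The sample rows are hypothetical,
the initial centroids are simply the first k points, and no rescaling of
coordinates against rates is applied, since the text does not say whether any
was used.

```python
def kmeans(points, k, max_iter=100):
    """Plain Lloyd-style k-means on tuples of equal length.

    Initial centroids are the first k points (a simplification chosen
    to keep the sketch deterministic). Returns (labels, centroids).
    """
    centroids = [tuple(map(float, p)) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical (x, y, rate) rows; real input would come from the warehouse.
sample = [(0, 0, 1.0), (10, 10, 9.0), (1, 0, 1.2),
          (11, 10, 8.8), (0, 1, 0.9), (10, 11, 9.1)]
labels, centroids = kmeans(sample, k=2)
```

The returned labels define the groups of municipalities, and the centroids are
the points that would be exported back to the GIS as a separate layer.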

4 Results and Visualization with IRIS
In this project we performed clustering tasks according to the affinity of location
and incidence rate of the municipalities. A series of experimental tests was
carried out on the data stored for municipalities with more than 100,000
inhabitants. The numbers of groups considered were k = 5, 10, 15, 20 and 30.
The best result was obtained for k = 20.
    As a case study, this paper presents the results obtained by the k-means
algorithm in Matlab for the cervical cancer data warehouse. Fig. 4 provides the
visualization of the 20 regions identified.


 Fig. 4 Regions of the Municipalities with an incidence of Cervical Cancer.

    From the results, we distinguish the groups headed by the three
municipalities with the highest incidence rates: Atlixco, Apatzingán and Tapachula
(Chiapas). Fig. 5 shows in detail the group corresponding to the region of
Chiapas and its incidence of cervical cancer. Table 1 provides the data for this
group, together with the mean and standard deviation of its rates.


                       Fig. 5 Tapachula Chiapas Group

    The groups identified with high incidence rates, Tapachula and Apatzingán,
match municipalities identified in other studies [4] and correspond to the
population characteristics identified in work from the medical field [8], [15],
such as poverty, lack of education and of access to effective health services,
and the initiation of sexual activity at an early age. This allows us to
assert that the clustering obtained is valid. On the other hand, the study allowed
us to discover other municipalities that had not been identified in other research,
such as the group of Atlixco, which in particular shows the highest incidence rate
in the country (see Table 2).

     Table 1 Municipality Incidence Rates of Cervical-Uterine Cancer

          State            Municipality                  Population    Deaths    Rate
          Chiapas          Tapachula                     271674        27        9.93
          Veracruz-Llave   Coatzacoalcos                 267212        23        8.60
          Veracruz-Llave   Minatitlán                    153001        13        8.49
          Chiapas          Comitán de Domínguez          105210        8         7.60
          Chiapas          San Cristóbal de las Casas    132421        9         6.79
          Tabasco          Comalcalco                    164637        11        6.68
          Tabasco          Cárdenas                      217261        11        5.06
          Tabasco          Huimanguillo                  158573        8         5.04
          Chiapas          Tuxtla Gutiérrez              434143        21        4.83
          Tabasco          Cunduacán                     104360        5         4.79
          Campeche         Carmen                        172076        8         4.64
          Tabasco          Macuspana                     133985        6         4.47
          Tabasco          Centro                        520308        23        4.42
          Chiapas          Ocosingo                      146696        2         1.36
          Average                                                                5.91
          Standard deviation                                                     2.23
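
    The text does not state how the Rate column is computed. The values in
Table 1 are consistent with deaths per 100,000 inhabitants truncated to two
decimals, and the reported average and standard deviation match the mean and
the sample standard deviation of the Rate column. The Python sketch below
reproduces these figures under that assumption; the function name
`rate_per_100k` is hypothetical.

```python
import math
import statistics

def rate_per_100k(deaths, population):
    """Deaths per 100,000 inhabitants, truncated (not rounded) to two decimals.

    ASSUMPTION: this unit and the truncation are inferred from the
    published values, not stated in the paper.
    """
    return math.floor(deaths / population * 100_000 * 100) / 100

# Rate column of Table 1, in table order.
table1_rates = [9.93, 8.60, 8.49, 7.60, 6.79, 6.68, 5.06,
                5.04, 4.83, 4.79, 4.64, 4.47, 4.42, 1.36]

tapachula = rate_per_100k(27, 271674)        # first row of Table 1
mean_rate = statistics.mean(table1_rates)
sd_rate = statistics.stdev(table1_rates)     # sample standard deviation (n - 1)
```

Rounded to two decimals, `mean_rate` and `sd_rate` give the Average and
Standard deviation rows of the table.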

    To give a global view of our results, Table 2 lists the ten municipalities
with the highest incidence rates in the country.

Table 2 Top Ten Municipality Incidence Rates of Cervical-Uterine Cancer

            Key     State            Municipality     Population    Deaths    Rate
            21019   Puebla           Atlixco          117111        15        12.80
            16006   Michoacán        Apatzingán       117949        13        11.02
            07089   Chiapas          Tapachula        271674        27        9.93
            17006   Morelos          Cuautla          153329        14        9.13
            28021   Tamaulipas       El Mante         112602        10        8.88
            06007   Colima           Manzanillo       125143        11        8.78
            30039   Veracruz-Llave   Coatzacoalcos    267212        23        8.60
            18017   Nayarit          Tepic            305176        26        8.51
            30108   Veracruz-Llave   Minatitlán       153001        13        8.49
            30118   Veracruz-Llave   Orizaba          118593        10        8.43
            General Mean                                                      4.70
            Standard Deviation                                                1.95


    Fig. 6 shows where the above incidence rates lie relative to the
national average and the corresponding standard deviation.


               Fig. 6 Top ten municipality incidence rates.
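
    The comparison with the national average can also be expressed numerically
as the number of standard deviations by which a rate exceeds the general mean.
The short Python sketch below does this with the General Mean and Standard
Deviation rows of Table 2; the function name `deviations_from_mean` is
hypothetical, and the calculation is an illustration rather than part of the
original analysis.

```python
GENERAL_MEAN = 4.70   # "General Mean" row of Table 2
GENERAL_SD = 1.95     # "Standard Deviation" row of Table 2

def deviations_from_mean(rate, mean=GENERAL_MEAN, sd=GENERAL_SD):
    """How many standard deviations a rate lies above the general mean."""
    return (rate - mean) / sd

atlixco = deviations_from_mean(12.80)   # highest rate in Table 2
orizaba = deviations_from_mean(8.43)    # tenth-highest rate in Table 2
```

Under these figures Atlixco lies a little over four standard deviations above the
general mean, and Orizaba, the lowest of the top ten, lies just under two.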


5 Conclusions
The multidimensional model turned out to be very appropriate for the conceptual
design of the data warehouse, since this model is easily scalable and allows the
information to be analyzed from different perspectives. Future studies are
expected to process other variables related to the municipalities that are included
in this design, such as socioeconomic status, type of region, gender and access to
health services, among others. Moreover, implementing the data warehouse on
the ROLAP model has made it possible to take advantage of the facilities
developed for relational databases. In addition, the design and implementation of
the data warehouse are expected to be usable in other applications.
    The processing of the spatial component of our data warehouse using the
INEGI IRIS GIS has resulted in a high-quality visual representation of our
results, based on the actual physical location of the municipalities and on an
INEGI topographic map of the Mexican Republic. Experience has also been
gained in techniques for transferring shapes (polygons, points) and
formats (number-shape) using ArcView GIS tools.
    We are currently working to complete studies on other cancer types. In
addition, data mining tasks will be developed on the incidence of conditions such
as diabetes, influenza and cardiovascular diseases, among others.


Acknowledgement. R. Boone expresses her gratitude to Ms. Rocío Pérez Osorno
from INEGI, Puebla (a graduate of the Faculty of Computer Science, BUAP), for
her advice and support in plotting the results of this work with the IRIS GIS.

References

1. Barrón Vivanco M. Arandine, Pérez O. J., Miranda H. Fátima, Pazos R., XII
Congreso de Investigación en Salud Pública, Aplicación de técnicas de minería
de datos a bases de datos poblacionales de cáncer, CENIDET, México,
Secretaría de Saúde do Estado de Pernambuco, Brasil, Abril (2007).
2. Forgy E. “Cluster analysis of multivariate data: Efficiency vs. Interpretability
of classification”, Biometrics, vol. 21, pp. 768-780, 1965.
3. Hernández-Orallo J., Ramírez-Quintana M. J., Ferri-Ramírez C., Introducción
a la Minería de Datos, Ed. Pearson Prentice Hall, Madrid (2004).
4. Hidalgo-Martínez Ana C. El cáncer cérvico-uterino su impacto en México.
Porqué no funciona el programa nacional de detección oportuna. Revista
Biomédica, Centro Nal. de Investigaciones Regionales Dr. Hideyo Noguchi,
UADY, 2006, México.
5. IRIS 4. http://mapserver.inegi.gob.mx. SNIEG Sistema Nacional de
Información Estadística y Geográfica.
6. Jin Chen, MacEachren, Alan M., Peuquet, Donna. Constructing
Overview+Detail Dendrogram Matrix Views. IEEE Transactions on Visualization
& Computer Graphics, Vol. 15, Issue 6, pp. 889-896, Dec. 2009.
7. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate
Observations. In Proceedings Fifth Berkeley Symposium Mathematics Statistics
and Probability. Vol. 1. Berkeley, CA (1967) 281-297.
8. Martínez M. Francisco Javier. Epidemiología del cáncer del cuello uterino.
Medicina Universitaria 2004, 39-46. Vol. 6, N. 22, UANL, México.
9. NAAIS, Instituto Nacional de Salud Pública, SCRIS, Mortalidad,
http://sigsalud.insp.mx/naais/, Cuernavaca, Morelos, México (2003).
10. Nevine M. Labib, Michael N. Malek: Data Mining for Cancer Management
in Egypt. Transactions on Engineering, Computing and Technology V8 October
2005: (ISSN 1305-5313).
11. Pérez-C. Nelson, Abril-Frade D.O. Estado Actual de las Tecnologías de
Bodegas de Datos Espaciales. Ing. E Investigación. Vol.27, No. 1, Univ. Nal. De
Colombia. 2007.
12. Pérez-O. J., R. Pazos R., L. Cruz R., G. Reyes S. “Improving the
Efficiency and Efficacy of the K-means Clustering Algorithm through a New
Convergence Condition”. Computational Science and Its Applications – ICCSA
2007 – International Conference Proceedings. Springer Verlag.
13. Pérez-O. J., M.F. Henriques, R. Pazos, L. Cruz, G. Reyes, J. Salinas, A.
Mexicano. Mejora al Algoritmo de K-means mediante un Nuevo criterio de
convergencia y su aplicación a bases de datos poblacionales de cáncer. 2do Taller
Latino Iberoamericano de Investigación de Operaciones, México, 2007.
14. Pérez-O. J., Rocío Boone Rojas, María J. Somodevilla García. “Research
issues on K-means Algorithm: An Experimental Trial Using Matlab”. Advances
on Semantic Web and New Technologies. Vol 534. http://ceur-ws.org/.
15. Rangel-Gómez G., Lazcano-Ponce E., Palacio-Mejía. Cáncer cervical, una
enfermedad de la pobreza: diferencias en la mortalidad por áreas urbanas y
rurales en México, http://www.insp.mx/salud/index.html.
16. Scotch, M., Parmanto, B., Monaco, V. Evaluation of SOVAT: An OLAP-GIS
decision support system for community health assessment data analysis.
BMC Medical Informatics & Decision Making, Vol. 8 (1-12), 2008.
17. Simonet, A., Landais, P., Guillon, D. A multi-source Information System for
end-stage renal disease. Comptes Rendus Biologies, 2002, Vol. 325, Issue 4, p. 515.
18. Thangavel K., Jaganathan P. and Esmy P. O., Subgroup Discovery in Cervical
Cancer Analysis Using Data Mining Techniques, Department of Computer
Science, Periyar University; Department of Computer Science and Applications,
Gandhigram Rural Institute-Deemed University, Gandhigram; Radiation
Oncologist, Christian Fellowship Community Health Centre, Tamil Nadu, India.
AIML Journal, Vol. 6, Issue 1, January 2006.

