Data Warehouse Development to Identify Regions with High Rates of Cancer Incidence in México through a Spatial Data Mining Clustering Task

Data Warehouse Development to Identify Regions with High Rates of Cancer Incidence in México through a Spatial Data Mining Clustering Task JoaquinPérezOrtega Centro Nacional de Investigación y Desarrollo Tecnológico

Cuernavaca Mor. Mex

MaríaDel Rocío BooneRojas Centro Nacional de Investigación y Desarrollo Tecnológico

Cuernavaca Mor. Mex

Fac. Cs. de la Computaciòn Benemèrita Universidad Autónoma Puebla

México

MaríaJosefaSomodevilla García Fac. Cs. de la Computaciòn Benemèrita Universidad Autónoma Puebla

México

MariamViridiana Meléndez Hernández Fac. Cs. de la Computaciòn Benemèrita Universidad Autónoma Puebla

México

Data Warehouse Development to Identify Regions with High Rates of Cancer Incidence in México through a Spatial Data Mining Clustering Task 4FCA13D0D43B9501FED821F79943687E GROBID - A machine learning software for extracting information from scholarly documents

Data warehouses arise in many contexts, such as business, medicine and science, in which the availability of a repository of heterogeneous data sources, integrated and organized under a unified framework facilitates analysis and supports the decision making process. These data repositories increase their scope and application, when used for data mining tasks, which can extract useful knowledge, new and valuable from large amounts of data.

This paper presents the design and implementation of population-based data warehouses on the incidence of cancer in Mexico; based on the conceptual level multidimensional model and the ROLAP model (Relational On-Line Analytical Processing) at the implementation level.

A data warehouses is built, to be used as input for clustering data mining tasks, in particular, the k-means algorithm, in order to identify regions in Mexico, with high rates of cancer incidence.

The identified regions, as well as, the dimension related to the geographic location of the municipalities and their rate of incidence of cancer, are processed by IRIS, a Geographic Information System, developed at the National Institute of Statistics, Geography and Informatics of Mexico.

Introduction

Data warehouses have been applied mainly in the commercial and business areas [3] and more recently there have been some applications in the Health field [16] [17] and the trend towards its integration with various technologies [11] [16].

Moreover, according to the literature, the use of data mining systems applied to the analysis of massive databases of health on a population basis has been limited, it is noteworthy work: Constructing Over Dendrogram Matrix Detail view + Views. [6], Application of data mining techniques to databases population of cancer [1], Subgroup discovery in cervical cancer using data mining Techniques [18] and Data mining for cancer management in Egypt [10]. In the case of Mexico, to the best of our knowledge, the work that has been developed at the Centro Nacional de Investigación y Desarrollo Tecnológico and BUAP, are the first ones in this field.

This work has been preceded by other works which has been done on the incidence of other cancers such as stomach and lung [15]. It is part of a larger project doomed to make proposals for improving the k-means algorithm in various aspects such as effectiveness and efficiency, reported in [12], [13] and [14] and its application in the Health field.

This article presents the data warehouse design and integration for the development of a data mining task on cancer incidence by regions in Mexico, based on the integration of complementary technologies such as clustering and geographical information systems. As a study case, the results for the incidence of cervical cancer are presented, which has been of special interest, since in Mexico, cervical cancer is the leading cause of cancer death in women [11].

The report is organized as follows, followed by this introduction, Section 2 presents the description of data sources and process design and implementation of data warehouse, Section 3 provides an overview of each application. In Section 4, results for the case of cervical cancer and its visualization by GIS INEGI IRIS [5] are included. Finally, in Section 5, conclusions and perspectives of this work are presented.

The Data Warehouse

The process of collecting and integrating data warehouse on cancer incidence by region in Mexico, required to select the data sources necessary to accomplish the task of data mining. This section describes the data sources and the conceptual design based on the multidimensional model and also, the implementation of the data warehouse under the ROLAP approach.

The Data Sources

In the study, the processed databases have been derived from official records of the National Institute of Public Health (INSP) and the National Institute of Statistics, Geography and Informatics (INEGI) of Mexico.

Data on cancer incidence were obtained through subsystem Remote Consultation System for Health Information (SCRIS) of the INSP [9]. In particular, the databases were queried for cases of mortality cancer and results were configured by considering levels of aggregation such as: National States, division (Jurisdiction, Municipalities), year, age range, gender and causes (including tumors).

The information on the population and the actual geographical location of the municipalities was obtained from INEGI official databases, through its Geographic Information System IRIS, which has statistical information covering a wide geographical number of subjects, demographic, social and economic; also includes aspects of the physical environment, natural resources and infrastructure. This wealth of statistical and geographical data was obtained through various activities such as conducting population and housing census and economic census and the generation of basic cartography and census.

The information in the databases of the above institutions are integrated into a data warehouse (see Fig. 1), and according to the conventions in the area of health, for this study, only the municipalities with more than one hundred thousand inhabitants were considered. According to [4] the conceptual data model most widely used for data warehouses is the multidimensional model. The data are organized around the facts that have attributes or measures that may be more or less detail according to certain dimensions. In our case, the data warehouse design at the conceptual level is based on the multidimensional model, in which the dimensions can be distinguished as CAUSE, TIME, and PLACE. In this case, it is considered that a country has the basic fact, "deaths" that may have associated attributes such as number of cases, incidence rate, mean, variance, etc.. Fact can be detailed in several dimensions such as cause of death, place of death, date of death, etc. In Fig. 1 shows the facts "deaths" and three dimensions with various levels of aggregation. The arrows can be read as "is added". As shown in Fig. 1, each dimension has a hierarchical structure but not necessarily linear. When the number of dimensions cannot exceed three represent each combination of levels of aggregation as a cube.

CategoryMunicipalityID

The cube is made up of boxes with one box for each possible value from each dimension to the corresponding level of aggregation. On this "view", each box represents a fact. Fig. 2 shows a three dimensional cube corresponding to the fact: "According to the 2000 census, the town of Atlixco, there were 15 deaths from cervical cancer" in which the dimensions Cause, Place and Time have been added by type of disease (cancer), Municipality and Census. The representation of a fact corresponds therefore to a square in the cube. The value of the box is the observed (in this case is the number of deaths). One of the most efficient ways to implement a multidimensional model using relational databases is based on the ROLAP model [4]

Data Mining Application on Cancer Incidence

The implemented data warehouse has been used to develop a data mining task space based on the integration of additional technologies to the data warehouse, such as clustering and Geographic Information Systems, which in this case are very suitable, to identify and display areas with incidence of cancer in Mexico. The following provides a general description of the integration process of technologies and tools (Fig. 3) made for this application.

The data warehouse integrates the following information for our application: the component space that allows viewing of the regions of municipalities, population data such as the death rate and incidence rate and the time component, which in this case is the census year.

The IRIS GIS INEGI [5], through your options allows the recovery of population data and the real location of the municipalities, which are integrated into the data warehouse.

Since IRIS stores geographical representation of municipalities in the vector format standardized "shape" and by means of polygons, there is the need for a process of transfer of forms and formats in order to have a numerical representation of each municipality, in this case, corresponds to a point on the municipality center location, which is accomplished primarily through the tools of ESRI's ArcInfo GIS.

Fig. 3 Integration of Technology and Data Mining Tools

Given the numerical representation of each municipality through a point (x, y), along with its rate of incidence of cancer, the Matlab programming environment and its implementation of k-means algorithm [2] [7] is used to generate patterns / groups of municipalities and the corresponding centroids.

Once you have the above results, it is again necessary to transfer digital data format to format shape, a process similar to above using ArcInfo tools, allowing viewing through GIS IRIS.

Finally, the groups of municipalities and their corresponding centroids, are passed as GIS layers to IRIS, for display on the geographic map of Mexico.

Results and visualization with IRIS

In this project we have done grouping tasks according to the affinity of location and incidence rate of the municipalities. Series of experimental tests on the data stores in cities with more than 100.000 inhabitants were carried out. Size groups were considered k = 5, 10, 15, 20 and 30. The best result was obtained for k = 20.

As a case study, this paper presents the results obtained by k-means algorithm in Matlab for the cervical cancer data warehouse. Fig. 4 provides the visualization of the 20 regions identified.

Fig. 4 Regions of the Municipalities with an incidence of Cervical Cancer.

From the results, we distinguish the groups spearheading the three municipalities with higher incidence rates: Atlixco, Apatzingán and Tapachula (Chiapas). In Fig. 5 the detail of the display of the group corresponding to the region of Chiapas and the incidence of cervical cancer is shown. Table 1 provides data for the previous group, and statistical measures for the mean and standard deviation.

Fig. 5 Tapachula Chiapas Group

The groups identified with high incidence rates: Tapachula and Apatzingan match municipalities identified in other studies [4] and correspond to the population characteristics, identified in the work of the medical field [8], [15] such situations such as poverty, lack of preparation and access to effective health services and the initiation of sexual activity at an early age. This allows us to assert that the grouping is made valid. On the other hand, the study allowed discovering other municipalities that had not been identified in other research, such as the group of Atlixco, in particular showing the highest incidence rate in the country (see table 2). In order to perform a global analysis of our results, Table 2 provides information of the ten municipalities with the highest incidence rate in the country. Figure 6, illustrates the location of previous incidence rates compared to the national average and the corresponding standard deviation.

Conclusions

Multidimensional model for conceptual design of the data warehouse, turned out to be very appropriate, since this model is easily scalable and allows analysis of the information under different perspectives. It is expected that future studies process other variables, related to the municipalities, included in this design, such as socioeconomic status, type of region, gender and access to health services, among others. Moreover, the implementation of data warehouse based on the ROLAP model has allowed taking advantage of the facilities developed for relational databases. In addition, it is expected that the design and implementation carried out in the data warehouse can be used in other applications.

The processing of the spatial component of our data warehouse, using the IRIS GIS INEGI, has resulted in a high quality visual representation of our results, based on the actual physical location of the municipalities and on a map of the topography of the Republic Mexican INEGI. Also experience and learning has been gained on transfer of shapes (polygons, points) techniques and formats (Number-shape) through ArcView GIS tools.

Currently we are working to complete studies in other cancer types. Besides, data mining tasks will be developed on the incidence of conditions such as diabetes, influenza and cardiovascular diseases, among others.

Fig. 1 . 2 . 2122Fig. 1 Multidimensional Model Data Warehouse on the incidence of cancer in Mexico.

Fig. 22Fig. 2 Display of a fact in a multidimensional model

Figure 6 .6Figure 6. Top Ten municipalities incidence rates.

. In our case, the tables for the ROLAP model have the following schemes:Snowflake TablesDimension CauseDISEASE (Clave_Enfermedad, name, IdGama, CategoryID)GAMA (IdGama, CategoryID, Description)CATEGORY (CategoryID, Description)Place dimensionSTATE (Clave_Estado, name, población_total)MUNICIPALITY (Clave_Municipio, Clave_Estado, name, población_total,Loc_x, Loc_y, extension, tipo_zona, nivel_socioeconómico)Time dimensionYEAR (Idan)CENSUS (IdCenso, Idan, number, name)Fact TablesDEATH (IdEnfermedad, IdCenso, IdMunicipio, no_casos, rate, mean, variance)Star TablesTIME (Idan, IdCenso)CAUSE (IdEnfermedad, IdGama, CategoryID)PLACE (IdCiudad, IdMunicipio)

Table 1 Municipalities Incidence Rates of Cervical-Uterine Cancer1StateMunicipalityPopulation Deaths RateChiapasTapachula271674279.93Veracruz-Llave Coatzacoalcos267212238.60Veracruz-Llave Minatitlán153001138.49ChiapasComitán de Domínguez10521087.60ChiapasSan Cristóbal de las Casas 13242196.79TabascoComalcalco164637116.68TabascoCárdenas217261115.06TabascoHuimanguillo15857385.04ChiapasTuxtla Gutiérrez434143214.83TabascoCunduacán10436054.79CampecheCarmen17207684.64TabascoMacuspana13398564.47TabascoCentro520308234.42ChiapasOcosingo14669621.36Average5.91Standard deviation2.23

Table 2 Top Ten Municipalities Incidence Rates of Cervical-Uterine Cancer Key State Municipality Population Deaths Rate221019 PueblaAtlixco1171111512,8016006 MichoacánApatzingán1179491311,0207089 ChiapasTapachula271674279,9317006 MorelosCuautla153329149,1328021 TamaulipasEl Mante112602108,8806007 ColimaManzanillo125143118,7830039 Veracruz-Llave Coatzacoalcos 267212238,6018017 NayaritTepic305176268,5130108 Veracruz-Llave Minatitlán153001138,4930118 Veracruz-Llave Orizaba118593108,43General Mean4.70Standard Deviation1.95

Acknowledgement. R. Boone expresses her gratitude to Ms. Rocío Pérez Osorno from INEGI, Puebla. (Graduated from the Faculty of Cs. Computing, BUAP) for advice and support in plotting the results of this work through the IRIS GIS.

BarrónVivanco MArandine OJPérez HMiranda Fátima RPazos XII Congreso de Investigación en Salud Pública, Aplicación de técnicas de minería de datos a bases de datos poblacionales de cáncer

CENIDET, México; Brasil; Abril

Secretaría de Saúde 2007 do Estado de Pernambuco Cluster analysis of multivariate data: Efficiency vs. Interpretability of classification EForgy Biometrics 21 1965 JHernández-Orallo MJRamiréz-Quintana CFerri-Ramiréz Introducción a la Minería de Datos

Madrid

Pearson Prentice Hall 2004 El cáncer cérvico-uterino su impacto en México. Porqué no funciona el programa nacional de detección oportuna CHidalgo-Martínez Ana Revista Biomédica

Centro Nal; UADY; México

De Investigaciones Regionales Dr. Hineyo Noguchi 2006 Constructing Overview+Detail Dendogram Matrix Views JinChen AlanMMaceachren DonnaPeuquet IEEE Transactions on Visualization & Computer Graphics 15 6 Dec. 2009 Some Methods for Classification and Analysis of Multivariate Observations JMacqueen Proceedings Fifth Berkeley Symposium Mathematics Statistics and Probability Fifth Berkeley Symposium Mathematics Statistics and Probability

Berkeley, CA

1967 1 Epidemiología del cáncer del cuello uterino MMartínez JavierFrancisco Medicina Universitaria 6 22 2004 NAIIS Instituto Nacional de Salud Pública, SCRIS, Mortalidad

Cuernavaca; Morelos, México

2003 Data Mining for Cancer Management in Egypt MNevine Labib NMichael Malek Transactions on Engineering, Computing and Technology 1305-5313 October 2005 Estado Actual de las Tecnologías de Bodegas de Datos Espaciales -CPérez Nelson DOAbril-Frade Ing. E Investigación 27 1 2007 Univ. Nal. De Colombia Improvement the Efficiency and Efficacy of the K-means Clustering Algorithm through a New Convergence Condition Pérez-OJ RPazos R LCruz R GReyes S Computational Science and Its Applications -ICCSA 2007 -International Conference Proceedings Springer Verlag Mejora al Algoritmo de K-means mediante un Nuevo criterio de convergencia y su aplicación a bases de datos poblacionales de cancer Pérez-OJ MFHenriques RPazos LCruz GReyes JSalinas AMexicano 2do Taller Latino Iberoamericano de Investigación de Operaciones

Mèxico

2007 Research issues on K-means Algorithm: An Experimental Trial Using Matlab Pérez-OJ RocíoBoone Rojas MaríaJSomodevilla García Advances on Semantic Web and New Technologies 534 Cáncer cervical, una enfermedad de la pobreza GRangel-Gómez ELazcano-Ponce Palacio-Mejía diferencias en la mortalidad por áreas urbanas y rurales en México Evaluation of SOVAT: An OLAP-GIS decision support system for community health assessment data analysis Scotch ParmatoBMatthew VMonaco BMC Medical Informatics & Decisión Making 8 2008 A multi-source Information System for end-stage renaldisease ASimonet PLandais DGuillon Comptes Residus Biologies 325 515 2002 Subgroup Discovery in Cervical Cancer Analysis Using Data Mining Techniques KThangavel PJaganathan POEsmy AIML journal 6 January, 2006 Departament of Computer Science, Periyar University ; Departament of Computer Science and Applications, Gandhigram Rural Institute-Deemed University, Gandhigram: Radiation Oncologist , Christian Fellowship Community Health Centre