<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Advances on Semantic Web and New Technologies</article-title>
      </title-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>62</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Davide Buscaldi and Gerardo Sierra were the invited speakers at this Third Workshop on
Semantic Web.</p>
      <p>Davide Buscaldi is currently completing his Ph.D. in pattern recognition and artificial
intelligence at the UPV - Universidad Politécnica de Valencia (Spain), with a thesis titled
"Toponym Disambiguation in NLP Applications". His research interests are mainly focused
on question answering, word sense disambiguation and geographical information retrieval.
He obtained his DEA (Diploma de Estudios Avanzados) in 2008 with a dissertation on the
"integration of resources for QA and GIR". He is the author of over 40 papers in
international conferences, workshops and journals. He was awarded an FPI grant by the
Valencian local government, which allowed him to participate in the "LiveMemories"
project during a research stay at the FBK-IRST research institute in Trento, Italy, under the
direction of Bernardo Magnini. He was responsible, on the UPV side, for the organization of the
QAST (Question Answering on Speech Transcripts) track at CLEF 2009. Currently, he is a
member of the Natural Language Engineering (NLE) Lab of the Universidad Politécnica de
Valencia.</p>
      <p>Gerardo Sierra holds a Ph.D. in Computational Linguistics from UMIST, England. He is the
coordinator of the Linguistic Engineering Group at UNAM. He has promoted this area in
teaching as well as in research and development, in areas such as computational
lexicography, terminotics, information retrieval and extraction, text mining and corpus
linguistics. Currently, he is a level A researcher, National Researcher level II, a CONACYT project
evaluator, and a member of several scientific committees. He has taught courses at UNAM, for
the Faculties of Engineering and of Philosophy and Letters, in postgraduate programs in Linguistics,
Biotechnology and Computer Science.
Invited Paper
Ambiguous Place Names on the Web
Davide Buscaldi.</p>
      <p>SV: a Visualization Mechanism for Ontologies of Records
Based on SVG Graphics
Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodríguez, and Argelia B. Urbina.</p>
      <p>Modeling of CSCW system with Ontologies
Mario Anzures-García, Luz A. Sánchez-Gálvez, Miguel J. Hornos, Patricia
Paderewski-Rodríguez, and Antonio Cid.</p>
      <p>The Use of WAP Technology in Question Answering
Fernando Zacarías F., Alberto Tellez V., Marco Antonio Balderas, Guillermo De Ita L.,
and Barbara Sánchez R.</p>
      <p>Data Warehouse Development to Identify Regions with High
Rates of Cancer Incidence in México through a Spatial Data
Mining Clustering Task.</p>
      <p>Joaquin Pérez Ortega, María del Rocío Boone Rojas, María Josefa Somodevilla García,
and Mariam Viridiana Meléndez Hernández.</p>
      <p>An Approach of Crawlers for Semantic Web Application
(Short paper)
José Manuel Pérez Ramírez, and Luis Enrique Colmenares Guillen.</p>
      <p>Decryption Through the Likelihood of Frequency of Letters
(Short paper)
Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and Martínez
Cortés Marco Antonio.
</p>
      <p>Ambiguous Place Names on the Web*</p>
      <p>Davide Buscaldi
Natural Language Engineering Lab., ELiRF Research Group,</p>
      <p>Dpto. de Sistemas Informáticos y Computación (DSIC),</p>
      <p>Universidad Politécnica de Valencia, Spain,</p>
      <p>dbuscaldi@dsic.upv.es
Abstract. Geographical information is achieving increasing
importance in the World Wide Web. Every day, the number of users looking for
geographically constrained information grows. Map-based services,
such as Google Maps or Yahoo! Maps, provide users with a graphical interface,
visualizing results on maps. However, most of the geographical
information contained in web documents is represented by means of toponyms,
which in many cases are ambiguous. Therefore, it is important to
properly disambiguate toponyms in order to improve the accuracy of web
searches. The advent of the Semantic Web will make it possible to overcome this
issue by labelling documents with geographical IDs. In this paper we
discuss the problems of using toponyms in web documents instead of
identifying places using tools such as Geonames RDF, focusing on the
errors that affect a prototype geographical web search engine, Geooreka!,
currently under development.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        The interest of users for geographically constrained information in the Web has
increased over the past years, boosted by the availability of services such as
Google Maps1. Sanderson and Kohler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] showed that 18.6% of the queries
submitted to the Excite search engine contained at least one geographic term, while
Gan et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] estimated that 12.94% of queries submitted to the AOL search
engine expressed a geographically constrained information need. Most of the
geographical information contained in the Web and in unstructured text consists
of toponyms, or place names. There are two main problems that derive from
using toponyms to represent geographical information. The first one is the
polysemy of toponyms, or toponym ambiguity: a toponym may be used to represent
more than one place, such as "Puebla", which may indicate the city
at 19°03′N, 98°12′W, the state in which it is contained, a suburb of Mexicali in
the state of Baja California, or three more small towns in Mexico. The second
problem is that the mere inclusion of a toponym in a document does not always
mean that the document is geographically relevant with respect to the region or
* We would like to thank the TIN2009-13391-C04-03 research project for partially
supporting this work.
1 http://maps.google.com
area represented by the toponym. In the first case, the solution is constituted
by the Toponym Disambiguation (TD) task, also called toponym grounding
or resolution; in the second case, the solution is to carry out Geographic Scope
Resolution, which is also affected by the problem of toponym ambiguity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The Geonames ontology2 provides users with RDF descriptions of more than
6 million places. The use of this ontology would make it possible to include geospatial
semantic information in the Web, eliminating the need for toponym disambiguation.
Unfortunately, as noted by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in the Web "references to geographical locations
remain unstructured and typically implicit in nature", determining a "lack of
explicit spatial knowledge within the Web" which "makes it difficult to service
user needs for location-specific information". In this paper, with the help of the
Geooreka!3 system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a prototype web search engine developed at the
Universidad Politécnica de Valencia in Spain, we will discuss the problems that users interested
in geographically constrained information may encounter because of the ambiguity
of toponyms on the Web.
      </p>
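      <p>As an illustration of what such explicit identifiers look like, the short sketch below retrieves the Geonames RDF description of a single place. It is a minimal sketch, not part of Geooreka!: the rdflib library is an assumed dependency, and the feature id in the URL is illustrative.</p>
      <preformat># Minimal sketch: fetch the Geonames RDF description of one place and
# list its properties (name, feature class, coordinates, ...).
# The feature id below is illustrative; rdflib is an assumed dependency.
from rdflib import Graph

g = Graph()
g.parse("http://sws.geonames.org/3521081/about.rdf", format="xml")

for subj, pred, obj in g:
    # each triple describes one property of this specific, unambiguous place
    print(pred, obj)</preformat>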
    </sec>
    <sec id="sec-3">
      <title>Geooreka!: a Geographical Web Search Engine</title>
      <p>
        Geooreka! is a search engine developed on the basis of our experiences at
GeoCLEF4 [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ], which suggested to us that the use of term-based queries may not be
the optimal method to express a geographically constrained information need.
For instance, it is common for users to employ vernacular names that have a vague
spatial extent and do not correspond to the official administrative place
name terminology. Another issue is the use of vague geographical constraints that
are difficult to translate automatically from natural language into a precise
query. For instance, the query "Cultivos de tabaco al este de Puebla" ("Tobacco
plantations East of Puebla") presents a double problem because of the
ambiguity of the place name and the fact that the geographical constraint "East of" is
vague (for instance, it does not specify whether the search should be constrained within
Mexico or extend to other countries).
      </p>
      <p>
        These issues are addressed in Geooreka! by allowing the user to specify their
geographical information needs through a map-based interface. The user writes a
natural language query representing the query theme (e.g., "Cultivos
de tabaco") and selects a rectangular area on the map in a box (Figure 1), representing
the geographical footprint of the query. All toponyms in the box are retrieved using a
PostGIS database, and then the Web is queried in order to check the maximum
Mutual Information (MI) between the thematic part of the query and all the
places retrieved. The complete architecture of the system can be seen in
Figure 2. Web counts and MI are used to determine which theme–toponym combinations
are most relevant with respect to the information need expressed
by the user (Selection of Relevant Queries). In order to speed up the process,
2 http://www.geonames.org/ontology/
3 http://www.geooreka.eu
4 http://ir.shef.ac.uk/geoclef/
web counts are calculated using the static Google 1T Web database5, indexed
using the jWeb1T interface [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], whereas Yahoo! Search is used to retrieve the
results of the queries composed by the combination of a theme and a toponym.
The key issue in the selection of the relevant queries is to obtain a relevance
model that is able to select the theme–toponym pairs that are most likely to
satisfy the user's information need. On the basis of probability theory,
we assume that the two component parts of a query, a theme T and a place G,
are independent if their conditional probabilities are unaffected by each other, i.e.,
p(T|G) = p(T) and p(G|T) = p(G), or, equivalently, if their joint probability is the product
of their individual probabilities:
p̂(T ∩ G) = p(G) p(T)
(1)
      </p>
      <p>If probabilities are calculated using page counts, that is, as the number of
pages in which the term (or phrase) representing the theme or toponym appears,
divided by Fmax = 2,147,436,244, which is the maximum term frequency
contained in the Google Web 1T database, then p̂(T ∩ G) is the expected probability
of co-occurrence of T and G in the same web page. Clearly, this is only
a rough estimate of whether T occurred in G, since the mere inclusion
of G in a page where T is mentioned does not guarantee a semantic relation
between G and T.</p>
      <p>
        Considering this model for the independence of theme and place, we can
measure the divergence of the expected probability p̂(T ∩ G) from the observed
probability p(T ∩ G): the greater the divergence, the more informative the result
of the query. The Kullback-Leibler measure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is commonly used to determine the divergence between two probability distributions.
5 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
      </p>
      <p>DKL(p(T ∩ G) || p̂(T ∩ G)) = p(T ∩ G) log [ p(T ∩ G) / (p(T) p(G)) ]
(2)
This formula is exactly one of the formulations of the Mutual Information (MI)
of T and G, usually denoted as I(T; G).</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        Geooreka! has been evaluated over the GeoCLEF 2005 test set, in order to
compare the results that could be obtained by specifying the geographic footprint by
means of keywords and those that could be obtained using a map-based interface
to de ne the geographic footprint of the query. With this setup, topic title only
was used as input for the Geooreka! thematic part, while the area
corresponding to the geographic scope of the topic was manually selected. Probabilities
were calculated using the number of occurrences in the GeoCLEF collection.
Occurrences for toponyms were calculated by taking into account only the geo
index. The results were calculated over the 25 topics of GeoCLEF-2005, minus
the queries in which the geographic footprint was composed of disjoint areas (for
instance, "Europe" and "USA" or "California" and "Australia"), which could
not be processed by Geooreka!. Mean Reciprocal Rank (MRR) was used as the
measure of accuracy. The GIR system GeoWorSE, where queries are specified
by text, was used as a baseline [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Table 1 displays the obtained results.
      </p>
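      <p>For reference, MRR is the mean over topics of the reciprocal rank of the first relevant result. A minimal sketch of the measure, with hypothetical input structures, follows.</p>
      <preformat># Mean Reciprocal Rank: average of 1/rank of the first relevant result.
# The input layout (one entry per topic) is illustrative.
def mean_reciprocal_rank(ranked_results, relevant):
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc in enumerate(results, start=1):
            if doc in rel:          # first relevant document found
                total += 1.0 / rank
                break
    return total / len(ranked_results)</preformat>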
      <p>The results show that the web-based results are considerably worse than those
obtained on the static collection. This is due primarily to two reasons. First,
the topics were tailored to the GeoCLEF collection; therefore, some
topics refer explicitly to events that are particularly salient in the collection
and are easier to retrieve. For instance, query GC-005 "Japanese Rice Imports"
targets documents regarding the opening of the Japanese rice market for the first
time to other countries; "Japan" and "Rice" appear together in the document collection
only in such documents, therefore it is easier to retrieve the relevant
documents when searching the GeoCLEF collection.</p>
      <p>The second factor affecting the results for the Web-based system is the
ambiguity of toponyms, which prevents a correct estimation of the probabilities
for places. For instance, in the results obtained for topic GC-008 ("Milk
Consumption in Europe"), the MI obtained for "Turkey" was abnormally high with
respect to the expected value for this country. The reason is that in most
documents, the name "turkey" referred to the animal and not to the country.
This kind of ambiguity represents one of the most important issues when
estimating the probability of occurrence of places. Ambiguity (or, better, the
polysemy of toponyms) grows together with the size and the scope of the
collection being searched. The GeoCLEF collection was also semantically tagged
using WordNet and Geonames IDs to identify the places referenced by toponyms,
while Web content is rarely tagged using precise IDs, thereby increasing the
chance of error in the estimation of probabilities for places which share the same
name.</p>
      <p>
        There are three kinds of toponym ambiguity that can be recognised (extending the
two main types identified by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]):
- Geo / Non-Geo ambiguity: a toponym is ambiguous with respect
to another class of name (such as "Turkey", which may be the animal or the
country);
- Geo / Geo ambiguity of different class: for instance, "Puebla" the city or the
state;
- Same-class Geo / Geo ambiguity.
      </p>
      <p>The solution in all cases would be to use an ontology to precisely identify places
in documents; the only difference is the amount of information that the ontology
should include. For the first type of ambiguity, the only information needed is
whether the name represents a place or not. In the second case, we would also
need to know the class of the place. Finally, for Geo / Geo ambiguity, we may
differentiate places using their coordinates or by knowing the containing entity,
or both. The Geonames ontology contains all this information and represents
the best option for geographically tagging place names.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The results obtained with Geooreka! over a static, semantically labelled (at least
from a geographical viewpoint) collection, compared to the results obtained on
the Web, show that the imprecise identification of places is a problem for
search engines aimed at users who are interested in searching for geographically
constrained information. The use of precise semantic tagging schemes for
toponyms, such as Geonames RDF, would allow these search engines to produce
more reliable results. Spreading the use of geographical tagging in the Semantic
Web would also allow users to mine information using geographical constraints
in a more effective way. In this sense, we would like to encourage the use of
Geonames in order to produce accurately geographically tagged Web content.</p>
      <p>SV: a visualization mechanism for ontologies of
records based on SVG graphics
Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodríguez, Argelia B. Urbina
Universidad Politécnica de Puebla
Tercer Carril del Ejido Serrano S/N</p>
      <p>Juan C. Bonilla, Puebla, Mexico
{mmedina, mcruz, rrodriguez, aurbina}@uppuebla.edu.mx,</p>
      <p>WWW home page: http://informatica.uppuebla.edu.mx/</p>
      <p>~mmedina, ~rrodriguez, ~aurbina
Abstract. This paper describes SV, a visualization mechanism used to
explore digital collections represented as hierarchical structures called
ontologies of records. These ontologies are XML files constructed using
OAI-PMH records and a clustering algorithm. SV is composed of a web
interface and SVG graphics. Through the interface, users can recognize
the organization of the collection and access the metadata of documents.</p>
      <sec id="sec-5-1">
        <title>Introduction</title>
        <p>Digital libraries gather valuable information. Organizations such as the Open
Archives Initiative (OAI1) have proposed different alternatives for sharing data. The
Protocol for Metadata Harvesting (OAI-PMH), for example, supports
interoperability between federated digital libraries. Documents are described by
metadata records. Dublin Core Metadata (DC2) is the default metadata format
for this protocol.</p>
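        <p>A minimal sketch of harvesting Dublin Core records through OAI-PMH is shown below; the repository URL is a placeholder, and the requests library is an assumed dependency.</p>
        <preformat># Minimal OAI-PMH harvesting sketch: list the titles of the Dublin
# Core records exposed by a repository. The endpoint URL is hypothetical.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://example.org/oai"   # placeholder repository
DC = "{http://purl.org/dc/elements/1.1/}"

resp = requests.get(OAI_ENDPOINT,
                    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)
for title in root.iter(DC + "title"):
    print(title.text)</preformat>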
        <p>
          The services and the collections of digital libraries are enriched in the
Semantic Web. The use of XML, the Resource Description Framework (RDF), OWL,
conceptual maps and other metadata technologies is aimed at improving search
tasks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Semantic Digital Libraries (SDLs) refer to systems built upon
digital libraries and social networking technologies (Web 2.0) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Freely distributed
software exists to construct SDLs, such as Greenstone3 or JeromeDL4. In this
type of software, ontologies play a key role: they are explicit specifications of
shared conceptualizations [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Ontologies enable the representation of knowledge
that software and human agents can understand and use.
        </p>
        <p>This paper proposes the use of ontologies called "ontologies of records",
represented as XML documents, as the basis of a visualization mechanism
1 http://www.openarchives.org/
2 http://dublincore.org
3 http://www.greenstone.org/
4 http://www.jeromedl.org/
called semantic view (SV). The name also refers to the first two letters of
"Scalable Vector Graphics". SV offers an interactive view that allows users to explore
the content of a federated collection.</p>
        <p>The paper is organized as follows. Section 2 describes the features of an
ontology of records. Section 3 includes related work. Sections 4 and 5 explain
the design and implementation of SV, respectively. Experimental results are
described in Section 6. Finally, Section 7 includes conclusions and suggests future
directions for our work.</p>
      </sec>
      <sec id="sec-5-2">
        <title>What is an ontology of records</title>
        <p>
          An ontology of records is a hierarchical structure of clusters of OAI-PMH records
that provides an unambiguous interpretation of its elements. Its construction
is based on the Frequent Itemset Hierarchical Clustering algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This
structure organizes a collection of documents and has concept-term relationships
useful for keyword-based searches. An ontology of records is stored as a
well-formed XML file that is validated against an XML Schema. An ontology of
records has the following features [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]:
1. Documents are clustered by similarity
2. Clusters at the k-level have labels of k terms
3. All the records of a cluster share the terms of its label
      </p>
      </sec>
      <sec id="sec-5-3">
        <title>Related work</title>
        <p>
          This section describes some systems that have been used to visualize collections
of documents. Proat et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] use 3D trees to visualize documents organized
according to the Library of Congress Classification (LCC). Documents are
clustered in seven subsets. The interface has controls to rotate or zoom the nodes of
trees. The leaf nodes contain metadata of documents.
        </p>
        <p>
          Geroimenko et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have proposed the Generalized Document Object Model
tree Interface (G-DOM-Tree interface) to visualize metadata from XML DOM
(Document Object Model) documents. The model displays a hierarchy of labels,
very similar to the visualization that browsers offer for XML Schemas. The
interface is implemented as a Java applet or a Flash film.
        </p>
        <p>
          Fluit et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] describe Spectacle, a mechanism that uses lightweight ontologies
to represent classes of similar objects and their relationships. Navigation
can be done by using hypertext or "cluster maps". A cluster map visualizes the
objects and their classes.
        </p>
        <p>
          Finally, Sánchez et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] use a starfield grid to visualize documents from
several collections. Documents are stored as OAI-PMH records. The axes of the
grid represent attributes of the collections that can be chosen by users. Small
polygons are associated with the type of document, and different colors are used
to distinguish the collections.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>Design of SV</title>
        <p>The design of SV is aimed at reaching the following objectives:
- Construct a visualization mechanism with semantic features that allows users
to explore a collection of documents
- Represent the organization of a collection of documents
- Retrieve the metadata and the content of a given document</p>
        <p>
          In order to reach these objectives, we have used the levels of knowledge
proposed by [
          <xref ref-type="bibr" rid="ref2">2</xref>
        ] in the design of SV. We use CORTUPP as a test bed;
this is a collection represented as an ontology of records5.
1. Level 1: Organization of the metadata. Metadata is organized in the
ontology of records. Content information is stored in the dc:title, dc:subject
and dc:description elements.
2. Level 2: Organization of the information in the documents.
Technical reports have a common structure formed by six mandatory chapters:
1) research purpose, 2) state of the art, 3) research design, 4) implementation,
5) results and 6) conclusions. This structure is defined in a LaTeX template.
The BibTeX file format is used to manage the bibliography. A technical
report is described as a @techreport entry.
3. Level 3: Organization of the information in databases. The technical
reports are stored as PDF files in a database that also includes data and
counts of users. Documents are accessible through a web interface.
4. Level 4: Organization of the topics treated in the documents. The
dc:subject element stores the topic of a document. Keywords of this
element belong to the labels of the clusters in the ontology of records.
5. Level 5: Organization of the concepts, terms and relations. This
level is also represented in the ontology of records.
      </p>
      </sec>
      <sec id="sec-5-5">
        <title>Implementation of SV</title>
        <p>SV is formed by a web interface and SVG graphics6. SVG is a format developed
and maintained by the W3C SVG Working Group. It is an XML application
used to describe animated or static two-dimensional vector graphics. The main
feature of these graphics is scalability.</p>
        <p>SV uses Xerces, a Java parser, to extract data from an ontology
of records. The classes of SV are built using the Java language. In the interface, each
document, that is, an OAI-PMH record, is represented by a yellow star on a blue
gradient background. The background is divided into five parts that correspond to
the first levels of the ontology. These levels are divided by lines that form angles
of 90 degrees. The distribution of the lines tries to reflect an estimate of the
number of documents that can be found in each level. The documents closer to
5 CORTUPP is available at http://server3.uppuebla.edu.mx/cortupp/
6 http://www.w3.org/svg/
the upper left corner belong to the first level of the ontology; these documents
share one term. The second level shows the documents that share two terms, and
so on. The stars have different sizes according to their level: they are bigger at
the first level and smaller at the last one.</p>
        <p>The interface of SV is an SVG graphic of 502 by 502 pixels. XML Parser is the
Java application used to construct the XML document that contains the
interface. XLink is used to create hyperlinks between documents and their metadata.
By clicking on a star, users can view the metadata on the right panel.
Figure 1 shows the SV interface where only six documents at the second and third
level were included; however, SV is designed to support up to 500 documents. The
colors can be modified without requiring compilation because they are stored in a
text file. The mechanism is accessible at http://informatica.uppuebla.edu.mx/
visualizacionPI/index.html.</p>
        <p>Different configurations of ontologies of records were constructed in order to check
SV; that is, unit tests and integration tests were performed successfully. After
the installation of the SVG Plugin Version 1.7, the visualization of SV was
successful using Internet Explorer 8, Google Chrome 7.0.517.41 and Opera 10.6;
however, there were some inconveniences using Firefox 1.5, Firefox 3.6 and
Firefox Beta, because these versions do not support the animation features of SVG
graphics.</p>
        <p>We have described SV, a visualization mechanism for federated collections based
on ontologies. SV has semantic features represented in the interface, such as the
location of documents in the ontology and the similarity between documents.
Additional semantic information is stored in the metadata attached to each
document and in the ontology of records. Through the SV interface, users can access
the metadata or download a document.</p>
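        <p>The following minimal sketch illustrates the kind of SVG output described above: one document star on the blue background, hyperlinked via XLink to its metadata. It is an illustration under assumed names and coordinates, not the real SV code (which is written in Java).</p>
        <preformat># Illustrative sketch of an SV-like SVG fragment: a yellow document
# "star" linked through XLink to a hypothetical metadata entry.
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                 width="502", height="502")
ET.SubElement(svg, "rect", width="502", height="502", fill="#27408b")
link = ET.SubElement(svg, "a", {"{%s}href" % XLINK: "#metadata-doc42"})
ET.SubElement(link, "polygon", fill="yellow",
              points="50,35 54,47 66,47 56,54 60,66 50,58 40,66 44,54 34,47 46,47")
ET.ElementTree(svg).write("sv_sketch.svg")</preformat>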
        <p>CORTUPP was used as a test bed for SV; however, any collection of
OAI-PMH records represented as an ontology of records can be visualized. Although
the size of an ontology of records can impact the visualization of SV, its design is
flexible enough to support distinct collections. As future work, we plan to expand
SV to show the clusters and their labels. Then, we would like to incorporate
tagging and recommendation mechanisms.</p>
        <p>Modeling of CSCW system with Ontologies
Abstract. In recent years, there has been a growing interest in the development
and use of domain ontologies, strongly motivated by the Semantic Web
initiative. However, the application of ontologies in the CSCW domain has
been scarce. Therefore in this paper, it presents a novel architectural model to
CSCW systems described by means of an ontology. This ontology defines the
fundamental organization of a CSCW system, represented in its concepts,
relations, axioms and instances.</p>
        <sec id="sec-5-5-1">
          <title>1 Introduction</title>
          <p>
            In the last two decades, the enormous growth of the Internet and the Web has given rise
to an intercreative cyberspace, in which groups of people can communicate,
collaborate and coordinate to carry out common tasks. Therefore, a great number of
groupware applications have been developed using different approaches, including
object-oriented, component-oriented, and agent-oriented ones. However, the
development of this kind of application is very complex, because different elements
and aspects must be taken into account. Hence, these applications must be
simultaneously supported by models, methodologies, architectures and platforms to
be developed in keeping with current needs. In the groupware domain, one of the
most used models is the Unified Modelling Language (UML) [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], although it has
no element to represent constraints, which are very important in applications as
complex as groupware ones.
          </p>
          <p>There has recently been an increase in the use of ontologies to
model applications in many domains. An ontology serves as a resource for organizing and
representing knowledge through an abstract model. This representation model
provides a common vocabulary for a domain and defines the meaning of the terms and
the relations among them. In the domain of groupware applications, an ontology
provides a well-defined common and shared vocabulary, which supplies a set of
concepts, relations and axioms to describe this domain in a formal way.</p>
          <p>In this paper, two ontologies for the groupware domain are proposed. The first
ontology determines who authorizes the registration of users, how interaction is carried
out among them, and how the turns for user participation are defined, among other
aspects. Moreover, it supports modifications at runtime, such as changing
the user role, the rights/obligations of a role, the current policy, etc. The second
ontology establishes the necessary SOA-based services to develop groupware
applications in accordance with the existing literature on the
development of this type of application. In addition, these services are clustered into
modules and layers with respect to the concern that they represent.</p>
          <p>This paper is organized as follows. Section 2 gives a brief introduction to the
ontologies. Section 3 describes the ontology-based modeling of the group
organizational structure. Section 4 presents an ontological model, which allows us to
specify an architectural model for the development of groupware applications.
Finally, Section 5 outlines some conclusions and future work.</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>2 Introduction to the Ontologies</title>
          <p>
            There are several definitions of ontology, which have different connotations
depending on the specific domain. In this paper, we will refer to Gruber’s well-known
definition [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], where an ontology is an explicit specification of a conceptualization.
For Gruber, a conceptualization is an abstract and simplified view of the world that
we wish to represent for some purpose, consisting of the objects, concepts, and other entities
that are presumed to exist in some area of interest, and the relationships that hold
among them. Furthermore, an explicit specification means that concepts and relations need to
be couched in explicit names and definitions.
          </p>
          <p>
            Jasper and Uschold [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] identify four main categories of ontology applications: 1)
neutral authoring, 2) ontology-based specification, 3) common access to information,
and 4) ontology-based search. In the work presented here, the main idea is to use
ontologies to specify the modeling of both the group organizational structure and the
architectural model in the groupware domain, since an ontology is a high-level formal
specification of a certain knowledge domain, which provides a simplified and
well-defined view of such a domain.
          </p>
          <p>An ontology is specified using the following components:
- Classes: There is a set of classes, which represent concepts that belong to the
ontology. Each class may contain individuals (or instances), other classes or a
combination of both, with their corresponding attributes.
- Relations: These define interactions between two or several classes (object
properties) or between a concept and a data type (data type properties).
- Axioms: These are used to impose constraints on the values of classes or
instances. Axioms represent expressions (logical statements) in the ontology and
are always true inside the ontology.
- Instances: These represent the objects, elements or individuals of an ontology.</p>
          <p>These four components will be described for the two ontologies proposed in this
paper.</p>
          <p>In addition, ontologies require a logical and formal language in which to be expressed. In
Artificial Intelligence, different languages have been developed, such as those based on
First-Order Logic (which provide powerful primitives for modeling), on Frames
(with more expressive power but less inference capacity), and on Description
Logics (which are more robust in reasoning power).</p>
          <p>
            OWL (Web Ontology Language) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is a language based on Description Logics for
defining and instantiating Web ontologies based on XML (eXtensible Markup
Language) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and RDF (Resource Description Framework) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. OWL can be used to
explicitly represent the meaning of terms in vocabularies and the relationships among
those terms. This language makes it possible to infer new knowledge from a
conceptualization, by using specific software called a reasoner. We have used the tool
Protégé [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], which supports OWL, to define the ontology for the group organizational
structure.
          </p>
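          <p>As a small illustration of these components in OWL, the sketch below encodes, with rdflib, the works (Group, Session) relation that appears later in Section 3.2; the base URI is a placeholder, and the snippet is not taken from the authors' actual ontology files.</p>
          <preformat># Hedged OWL sketch of the four components: two classes, one relation
# (object property) with axiom-like domain/range constraints, and one
# instance. The base URI is hypothetical, not the real ontology's.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

CSCW = Namespace("http://example.org/cscw#")
g = Graph()

g.add((CSCW.Group, RDF.type, OWL.Class))            # class
g.add((CSCW.Session, RDF.type, OWL.Class))          # class
g.add((CSCW.works, RDF.type, OWL.ObjectProperty))   # relation
g.add((CSCW.works, RDFS.domain, CSCW.Group))        # axiom: domain
g.add((CSCW.works, RDFS.range, CSCW.Session))       # axiom: range
g.add((CSCW.designTeam, RDF.type, CSCW.Group))      # instance

print(g.serialize(format="turtle"))</preformat>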
          <p>
            In the groupware domain, ontologies have mainly been used to model task analysis
or sessions. Different concepts and terms, such as group, role, actor, task, etc. have
been used for the design of task analysis and sessions. Many of these terms are
considered in our conceptual model. Moreover, semiformal methods (e.g. UML class
diagrams, use cases, activity graphs, transition graphs, etc.) and formal ones (such as
algebraic expressions) have also been applied to model the sessions. There is also a
work [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] for modeling cross-enterprise business processes from the perspective of
cooperative system, which is a multi-level design scheme for the construction of
cooperative system ontologies. This last work is focused on business processes, and it
describes a general scheme for the construction of ontologies. However, in this paper,
we propose to model two specific aspects: the group organizational structure and the
architecture of a groupware application. Consequently, the application domain of both
ontologies is groupware, not business processes.
3 Ontology for specifying an architectural model
In order to specify the architectural model, five concerns are identified: Data, Group,
Cooperation, Application, and Adaptation. Consequently, five layers are considered.
Four layers are composed of modules and services, while the fifth one, the Data
Layer, contains repositories with the necessary information to carry out the group
work. The services of the architectural model are defined by the concepts' ontology.
3.1. Ontology Concepts
The architecture components are characterized through the concepts' ontology (shown
in Figure 1), which is briefly described below:
- Registration is the first action that a user must carry out to be able to participate in the
group work using the collaborative application.
- Authentication validates access to the group and depends on the
organizational style defined in it.
          </p>
          <p>Group is the set of users who work in the session to perform the group work.</p>
          <p>Organizational_Style defines the organizational style that a group will use to
carry out the group work.</p>
          <p>Stage restricts user’s access to the application in accordance with the
organizational style defined in it.</p>
          <p>Session defines a shared workspace where a group carries out common tasks.
Session_Management manages and controls one or more sessions.</p>
          <p>Concurrency manages shared resources to avoid inconsistencies when they are used.
Shared_Resource is used by users to carry out basic activities.</p>
          <p>Basic_Activity is an action that a user must perform to carry out a task (which
can be made up of one or more basic activities).</p>
          <p>Task is carried out by the group to achieve a common goal.</p>
          <p>Notification notifies one or more users of all events that happen in a session.
Group_Awareness gets the necessary information to supply group awareness to
users that take part in a group.</p>
          <p>Group_Memory is supplied by the application to facilitate a common context.
Application is used by the users to carry out group work in an established session.
Configuration configures the application the first time that it is used and when
it is necessary.</p>
          <p>User_Interface shows users all the information about the application execution.
Environment modifies the user interface to present the information in
accordance with the device used by each user.</p>
          <p>Adaptation is a process that allows adapting the collaborative application to the
new needs of the group.</p>
          <p>Detection monitors the execution environment to detect the events that
determine the adaptation process.</p>
          <p>Agreement decides whether an adaptation process must be carried out or not.
Vote_Tool is used by users to perform the agreement.</p>
          <p>Adaptation_Flow is a set of steps carried out to adapt the collaborative
application in accordance with the selected event.</p>
          <p>Repair is required when the adaptation process cannot be performed.</p>
          <p>[Figure 1: the concepts' ontology, showing the architecture components and the relations among them.]</p>
          <p>
3.2. Ontology Relations
The relationships of each architecture component with its environment are symbolized
by the ontology relations (see Figure 1) listed below:
- allows (Registration, Authentication): Only registered users are allowed to
authenticate in order to access the collaborative application.
- access (Authentication, Group): Authentication allows users to access the group.
- depends (Registration, Organizational_Style): User registration depends on
the organizational style defined at a given stage.
- organizes (Organizational_Style, Group): An organizational style specifies the
way in which the group is organized.
- defines (Stage, Organizational_Style): A stage defines an organizational style.
- works (Group, Session): A group needs to be connected to a session to work.
- governs (Session_Management, Session): The session management governs a
session.
- controls (Concurrency, Session): The concurrency service controls the
interaction existing in a session.
- manages (Concurrency, Shared_Resource): The concurrency service manages
the shared resources to guarantee their mutually exclusive usage.
- is_used (Shared_Resource, Basic_Activity): The shared resources are used by
basic activities.
- is_part_of (Basic_Activity, Task): A basic activity is part of a task.
- administers (Session, Notification): The session administers the notification.
- provides (Notification, Group_Awareness): The notification process provides
group awareness.
- obtains (Group, Group_Awareness): A group obtains group awareness to avoid
inconsistencies in the collaborative application.
- supplies (Notification, Group_Memory): The notification process supplies
group memory.
- gives (Application, Group_Memory): The application gives group memory.
- establishes (Application, Session): An application establishes a session.
- presents (Application, User_Interface): An application presents a user
interface so that users can use the collaborative application.
2.3 Data warehouse scheme: ROLAP (Relational OLAP) implementation of
population-based cancer incidence in Mexico.
3 Data Mining Application on Cancer Incidence
The implemented data warehouse has been used to develop a data mining
task based on the integration of additional technologies into the data warehouse,
such as clustering and Geographic Information Systems, which in this case are
very suitable for identifying and displaying areas with incidence of cancer in
Mexico. The following provides a general description of the integration of
technologies and tools (Fig. 3) made for this application.</p>
          <p>The data warehouse integrates the following information for our application:
the spatial component, which allows viewing the regions of municipalities;
population data, such as the death rate and incidence rate; and the time component,
which in this case is the census year.</p>
          <p>
            The INEGI IRIS GIS [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], through its options, allows the recovery of
population data and the real locations of the municipalities, which are integrated
into the data warehouse.
          </p>
          <p>Since IRIS stores the geographical representation of municipalities in the
standardized "shape" vector format, by means of polygons, a process of
conversion of shapes and formats is needed in order to obtain a numerical
representation of each municipality; in this case, it corresponds to a point at the
municipality's center location. This is accomplished primarily through the tools
of ESRI's ArcInfo GIS.</p>
          <p>
            Given the numerical representation of each municipality as a point (x,
y), along with its rate of incidence of cancer, the Matlab programming
environment and its implementation of the k-means algorithm [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] are used
to generate patterns/groups of municipalities and the corresponding centroids.
          </p>
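          <p>A hedged Python equivalent of this clustering step is sketched below; the input file layout and column meanings are assumptions, and scikit-learn stands in for the Matlab implementation used by the authors.</p>
          <preformat># Illustrative Python counterpart of the Matlab k-means step: cluster
# municipalities by location (x, y) and incidence rate. File name and
# column layout are hypothetical; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.cluster import KMeans

data = np.loadtxt("municipalities.csv", delimiter=",")  # x, y, rate
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(data)

centroids = km.cluster_centers_  # one centroid per region
labels = km.labels_              # group index of each municipality</preformat>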
          <p>Once the above results are available, it is again necessary to convert the numerical
data format into the shape format, a process similar to the one above using ArcInfo
tools, which allows viewing through the IRIS GIS.</p>
          <p>Finally, the groups of municipalities and their corresponding centroids are
passed as GIS layers to IRIS, for display on the geographic map of Mexico.
4 Results and visualization with IRIS
In this project we have carried out grouping tasks according to the affinity of location
and incidence rate of the municipalities. A series of experimental tests on the data
warehouse, for cities with more than 100,000 inhabitants, was carried out. Group
sizes of k = 5, 10, 15, 20 and 30 were considered. The best result was obtained for k =
20.</p>
          <p>As a case study, this paper presents the results obtained by the k-means algorithm
in Matlab for the cervical cancer data warehouse. Fig. 4 provides the visualization
of the 20 regions identified.</p>
          <p>From the results, we distinguish the groups led by the three
municipalities with the highest incidence rates: Atlixco, Apatzingán and Tapachula
(Chiapas). Fig. 5 shows the detail of the display of the group corresponding to the
region of Chiapas and the incidence of cervical cancer. Table 1 provides
data for this group, together with statistical measures for the mean and standard
deviation.</p>
          <p>
            The groups identified with high incidence rates, Tapachula and Apatzingán,
match municipalities identified in other studies [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] and correspond to the
population characteristics identified in work from the medical field [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], [15],
such as poverty, lack of education, limited access to effective health
services, and the initiation of sexual activity at an early age. This allows us to
assert that the grouping is valid. On the other hand, the study allowed the
discovery of other municipalities that had not been identified in other research,
such as the group of Atlixco, which in particular shows the highest incidence rate in
the country (see Table 2).
          </p>
          <p>Table 1 Municipalities Incidence Rates of Cervical-Uterine Cancer</p>
          <p>[Table 1: incidence rates of cervical-uterine cancer for the municipalities of the group, located in Chiapas, Veracruz-Llave, Tabasco and Campeche, together with the group average and standard deviation.]</p>
          <p>In order to perform a global analysis of our results, Table 2 provides
information of the ten municipalities with the highest incidence rate in the
country.</p>
          <p>Table 2 Top Ten Municipalities Incidence Rates of Cervical-Uterine Cancer</p>
          <p>[Table 2: key, state, municipality, population and incidence rate for the ten municipalities with the highest incidence rates in the country: Atlixco (21019, Puebla), Apatzingán (16006, Michoacán), Tapachula (07089, Chiapas), Cuautla (17006, Morelos), El Mante (28021, Tamaulipas), Manzanillo (06007, Colima), Coatzacoalcos (30039, Veracruz-Llave, population 267212), Tepic (18017, Nayarit), Minatitlán (30108, Veracruz-Llave) and Orizaba (30118, Veracruz-Llave), followed by the general mean and standard deviation.]</p>
          <p>Figure 6 illustrates the location of the previous incidence rates compared to the
national average and the corresponding standard deviation.
5 Conclusions
The multidimensional model for the conceptual design of the data warehouse turned out
to be very appropriate, since this model is easily scalable and allows analysis of
the information from different perspectives. It is expected that future studies will
process other variables related to the municipalities included in this design, such
as socioeconomic status, type of region, gender and access to health services,
among others. Moreover, the implementation of the data warehouse based on the
ROLAP model has made it possible to take advantage of the facilities developed for
relational databases. In addition, it is expected that the design and implementation
carried out in the data warehouse can be used in other applications.</p>
          <p>The processing of the spatial component of our data warehouse, using the
INEGI IRIS GIS, has resulted in a high-quality visual representation of our
results, based on the actual physical location of the municipalities and on an INEGI
topographic map of the Mexican Republic. Experience and learning
have also been gained in techniques for transferring shapes (polygons, points) and
formats (number to shape) through ArcView GIS tools.</p>
          <p>Currently we are working to complete studies on other cancer types. Besides,
data mining tasks will be developed on the incidence of conditions such as
diabetes, influenza and cardiovascular diseases, among others.
Acknowledgement. R. Boone expresses her gratitude to Ms. Rocío Pérez Osorno
from INEGI, Puebla (graduate of the Faculty of Computer Science, BUAP) for
advice and support in plotting the results of this work through the IRIS GIS.
convergencia y su aplicación a bases de datos poblacionales de cáncer. 2do Taller
Latino Iberoamericano de Investigación de Operaciones, México, 2007.
14. Pérez-Ortega, J., Boone-Rojas, M.R., Somodevilla-García, M.J.: Research
issues on K-means Algorithm: An Experimental Trial Using Matlab. Advances
on Semantic Web and New Technologies, Vol. 534, http://ceur-ws.org/.
15. Rangel-Gómez, G., Lazcano-Ponce, E., Palacio-Mejía: Cáncer cervical, una
enfermedad de la pobreza: diferencias en la mortalidad por áreas urbanas y
rurales en México, http://www.insp.mx/salud/index.html.
16. Scotch, M., Parmanto, B., Monaco, V.: Evaluation of SOVAT: An
OLAP-GIS decision support system for community health assessment data analysis.
BMC Medical Informatics &amp; Decision Making, Vol. 8 (1-12), 2008.
17. Simonet, A., Landais, P., Guillon, D.: A multi-source Information System for
end-stage renal disease. Comptes Rendus Biologies, 2002, Vol. 325 I4, p. 515.
18. Thangavel, K., Jaganathan, P., Esmy, P.O.: Subgroup Discovery in Cervical
Cancer Analysis Using Data Mining Techniques. Department of Computer
Science, Periyar University; Department of Computer Science and Applications,
Gandhigram Rural Institute-Deemed University, Gandhigram; Radiation
Oncologist, Christian Fellowship Community Health Centre, Tamil Nadu, India:
AIML Journal, Vol. 6, Issue 1, January 2006.
An Approach of Crawlers for Semantic</p>
          <p>Web Application
José Manuel Pérez Ramírez1, Luis Enrique Colmenares Guillen1
1</p>
          <p>Benemérita Universidad Autónoma de Puebla,</p>
          <p>Facultad de Ciencias de la Computación,</p>
          <p>BUAP – FCC, Ciudad Universitaria,</p>
          <p>Apartado Postal J-32,</p>
          <p>Puebla, Pue. México.</p>
          <p>{ mankod, lecolme}@gmail.com
Abstract. This paper presents a proposal for a system capable of retrieval
information from the processes generated by the system Yacy. The information
retrieved will be used in the generation of a knowledge base. This knowledge
base may be used in the generation of semantic web applications.</p>
          <p>
            Keywords: Semantic Web, Crawler, Corpora, Knowledge base.
A knowledge base is a special type of database for managing knowledge. It provides
the means to collect, organize and retrieve knowledge in a computerized way. In general, a
knowledge base is not a static set of information; it is a dynamic resource that may
have the ability to learn. In the future, the Internet will be a complete and complex
knowledge base, already known as the Semantic Web [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
          <p>Some examples of knowledge bases are: a public library, an information database
related to a specific subject, Whatis.com, Wikipedia.org, Google.com, Bing.com and
Recaptcha.net.</p>
          <p>
            Research related to the automatic generation of a specialized corpus from the Web
is presented in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]; this research reviews methods to process knowledge bases
that generate specialized corpora.
          </p>
          <p>In Section 2 we present work related to the Semantic Web in order to understand the
benefits that may be obtained by elaborating on it.</p>
          <p>In Section 3 we describe the challenges and explain the problems that would
arise if one tried to use Google Search for getting information or tried to retrieve
information from queries to Google.</p>
          <p>Section 4 presents the methodology used to solve the problem, and Section 5
the conclusions and ongoing work.</p>
          <p>
            We continue this paper by presenting an abstract description of query processing on
the Semantic Web [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], as follows (Fig. 1):
1. A query with a data type.
2. A server that sends queries to the decentralized indexing servers. The content
found on the servers is similar to a book index, which indicates which pages
contain the words that match the query.
3. The query travels to the servers where the documents are stored; the retrieved
documents are used to generate a description of each search result.
4. The user receives the results of the semantic search, which have already been
processed on the semantic web server.
          </p>
          <p>Fig. 1. Querying the Semantic Web.
2 Related Work
Nowadays, research related to information retrieval on the Web produces different
results, such as knowledge bases and web sites dedicated to information retrieval:
Wikipedia, Twine, Evri, Google, Vivísimo, Clusty, etc.</p>
          <p>
            An example of a company working with information retrieval is Google Inc.;
one of its products is Google Search. This web search engine is one of the
most-used search engines on the Web [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]; Google receives several hundred million queries
each day through its various services [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
          <p>This example motivates the following question: why does
Google not put the information in its knowledge base in the public domain?
The answer is very simple: because its information, its knowledge base, is
money.
In Section 3 we explain some ways of extracting information from Google Search; only a
small amount of information can be obtained, and it is impossible to retrieve enough
information from Google Search to generate a knowledge base, because Google
protects the information about its queries.</p>
          <p>
            Other kinds of knowledge bases are the following.
2.1 Wikipedia
A specific case is Wikipedia, a project to write a free community encyclopedia in
all languages. This project has 514,621 articles today. The quantity and quality of
the articles make it an excellent knowledge base for the creation of semantic webs.
We present some ways to obtain semantic information from Wikipedia: from its
structure, from the notes collected from the people who contribute, and from the
links existing in the entries.
2.2 Twine
Twine is a tool to store, organize and share information, all of it with
intelligence provided by the platform, which analyzes the semantics of the information
and classifies it automatically [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The main idea is to save users from labeling and
connecting related content and leave this work to Twine, bringing more value and
storing the contents next to the information about their meaning.
3 Challenges
The principal challenge is to develop a system with the capacity to work with YaCy
to retrieve information from its indexing process and generate information; this
information will be essential to produce the knowledge base.
          </p>
          <p>Figure 5 presents all the modules of YaCy; the module to be developed will
work with some of these modules.
The principal question is:</p>
          <p>What can we do to get information in the public domain?</p>
          <p>The answer is very simple: we use the very popular Wikipedia.</p>
          <p>Wikipedia is a project of the Wikimedia Foundation. More than 13.7 million of its
articles have been drafted in conjunction with volunteers from all over the world, and
practically every one of them may be edited by any person who has access to
Wikipedia. It is currently the most popular reference work on the Internet.</p>
          <p>A dynamic-content project like Wikipedia illustrates information that has
great potential to be exploited.</p>
          <p>On the other hand, Google Search, one of the most-used search engines, provides at least
22 special features beyond the original word-search capability. These include
synonyms, weather forecasts, time zones, stock quotes, maps, earthquake data, movie
showtimes, airports, home listings, and sports scores.</p>
          <p>And maybe you are thinking:</p>
          <p>Why do people not use Google Search to get a whole knowledge base about a
specific topic, export it to a plain text file, and then manage it to generate a
corpus?</p>
          <p>The answer is very simple: because Google's information is its own information,
and gold for the company.</p>
          <p>
            In the past, Google Inc. allowed information retrieval from any kind of query [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
Google allowed information retrieval through its own forms and methods, like the
University Research Program for Google Search [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] but any kind of answered we
get of this project when we make the inscription to this program.
          </p>
          <p>
            Another way to exploit Google Search knowledge is to use scripts, APIs [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ],
programming languages such as AWK, and development tools like SED or GREP, all of
them analyzed in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], but with few results, and a lot of information is needed to create a
knowledge base.
This section gives a description of the project, taking into consideration the design that
will be used to solve the problem of creating the module.
4.1 Project description
The results obtained from the module connected to YaCy will be used to create
semantic webs, corpora, and any other project that needs plain-text information
about web content (a minimal sketch of this plain-text reduction follows).
          </p>
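          <p>Since the module's output is plain text about web content, the following minimal
sketch shows one way the reduction step could look, using only the Python standard
library; the class and function names are our own illustration, not part of YaCy.</p>
          <preformat>
# Minimal sketch: reduce crawled HTML to the plain text that downstream
# corpus tools expect. Standard library only; names are illustrative.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect character data, skipping the content of script and style tags.
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

# Example: print(html_to_text(open("crawled_page.html").read()))
          </preformat>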
          <p>Described below is the series of procedures that serves as the methodology to
implement the project.</p>
          <p>A) Review the modules of YaCy
B) Review the logistics and architecture of YaCy
C) Review the way in which YaCy creates its crawlers</p>
          <p>D) Design a module capable of managing the information from the
crawler and generating a knowledge base</p>
          <p>
            E) Some of the policies described above are implemented in YaCy [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]; the variant
to be used is the implementation of the JXTA [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] tool and the URI and RDF policies that
allow the results to be structured and outlined, to finally present them in a semantic way as a
knowledge base (a minimal sketch of this structuring step follows this list).
          </p>
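          <p>As a rough illustration of step E, the following sketch expresses one crawler result
as RDF triples using the rdflib Python library; the example.org namespace and the
property names are assumptions made for illustration only.</p>
          <preformat>
# Minimal sketch: expressing one crawler result as RDF triples with rdflib.
# The EX namespace and the property names are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/crawl/")

g = Graph()
doc = URIRef("http://en.wikipedia.org/wiki/Semantic_Web")
g.add((doc, RDF.type, EX.CrawledPage))
g.add((doc, DC.title, Literal("Semantic Web")))
g.add((doc, EX.indexedBy, Literal("yacy-peer-01")))  # hypothetical peer name

print(g.serialize(format="turtle"))
          </preformat>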
          <p>
            4.2 Development platform
This work is done with YaCy, a free, distributed search engine based on
the principles of peer-to-peer (P2P) networking. Its core is a program written in Java,
called a YaCy-peer, that has been distributed across hundreds of computers since September 2006.
Each YaCy-peer is an independent crawler that navigates through the Internet and
analyzes and indexes the web pages it finds. It stores the indexing results in a
common database (called the index), which is shared with other YaCy-peers using the
principles of P2P networks [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
          <p>Compared to semi-distributed search engines, the YaCy network has a
decentralized architecture. All YaCy-peers are equal and there is no central
server. A peer may be executed in crawling mode or as a local proxy server. Figure 2
shows a diagram that describes the distributed indexing process and the search in
the network for the YaCy crawler. (A sketch of querying a running peer follows.)</p>
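          <p>For reference, this is a minimal sketch of how an external module could ask a
running YaCy-peer for results through its JSON search interface. We assume a peer
listening on localhost:8090, and the response field names may differ between YaCy
versions.</p>
          <preformat>
# Minimal sketch: asking a local YaCy peer for results over its JSON
# search interface. Assumes a peer on localhost:8090; field names
# may differ between YaCy versions.
import json
import urllib.parse
import urllib.request

def yacy_search(terms, peer="http://localhost:8090"):
    # Query the peer's JSON search interface and yield (title, link) pairs.
    url = peer + "/yacysearch.json?" + urllib.parse.urlencode({"query": terms})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for channel in data.get("channels", []):   # RSS-like channel objects
        for item in channel.get("items", []):  # individual search hits
            yield item.get("title"), item.get("link")

for title, link in yacy_search("semantic web"):
    print(title, link)
          </preformat>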
          <p>Fig. 3. Distributed indexing process</p>
          <p>Figure 3 shows the main components of YaCy and the relations among the
web search, web crawler, indexing, and data-storage processes.
5 Conclusions and ongoing work
In this section we present some of the conclusions, the results expected from the project,
and the future work.</p>
          <p>1. Index all the content of Wikipedia.
2. Store this content.
3. Present the content of Wikipedia by topic on a web site.
4. Use a text tagger to share the information with tags.
5. Present the module and its code on a web site.
6. Share the knowledge base extracted from Wikipedia.</p>
          <p>References
1. Definition of knowledge base, http://searchcrm.techtarget.com/definition/knowledge-base.
2. Alarcón, R., Sierra, G., Bach, C. (2007). "Developing a Definitional Knowledge
Extraction System". In Vetulani, Z. (ed.), Proceedings of the 3rd Language &amp;
Technology Conference: Human Language Technologies as a Challenge for
Computer Science and Linguistics. Poznań, Adam Mickiewicz University, pp. 374-378.
3. Google Hacks, Second Edition, O'Reilly Media (2004).
4. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S.,
Stoica, I., and Yu, H. OpenDHT: a Public DHT Service and its Uses. SIGCOMM '05,
Philadelphia, Pennsylvania, USA, August 21-26 (2005).
5. http://www.jxta.org (2010).
6. http://yacy.net/ (2010).
7. http://www.twine.com/ (2010).
8. Stuckenschmidt, H. Query Processing on the Semantic Web. Vrije Universiteit Amsterdam.
9. http://www.alexa.com/siteinfo/google.com+yahoo.com+altavista.com (2009).
10. http://searchenginewatch.com/showPage.html?page=3630718 (2008).
11. http://research.google.com/university/search/ (2010).</p>
          <p>Decryption Through the Likelihood of Frequency of Letters
Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and
Martínez Cortés Marco Antonio
Benemérita Universidad Autónoma de Puebla, Computer Science
14 Sur y Av. San Claudio, Puebla, Pue., 72000 México
brinza@cs.buap.mx, fzflores@yahoo.com.mx</p>
          <p>Abstract. Decrypting information using probabilities is thorough work, because
one has to know the percentage of each of the letters of the language being analyzed,
here Spanish. One can consider not only the probabilities of single letters but also of
syllables, groups of three or four letters, and even whole words. What has to be done
then is to compare the frequencies of the ciphertext with the frequencies of the
language and begin substituting by correspondence. Finally, a text analyzer is passed
over the result to find the decrypted text.</p>
          <p>
            Keywords: Probability, Decryption.
1 Introduction
Cryptography is the science that alters the linguistic representations of a message
[
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. There are different methods for this, of which the most common is encryption.
This science masks the original references of the information through a conversion
method governed by an algorithm that also allows the reverse process, the decryption of the
information. The use of this and other techniques allows an exchange of messages
that can only be read by the intended recipients, called 'consistent' recipients. A consistent
recipient is the person to whom the sender intends to direct the message; the recipient
therefore knows the secret convention used to mask the message, and either has the
means to apply the reverse cryptographic process to the message, or can infer the
process that makes the message public. The original information to be protected is
called plaintext or cleartext. Encryption is the process of converting plaintext into
unreadable gibberish, called ciphertext or cryptogram. In general, the concrete
implementation of the encryption algorithm (also called a cipher) is based on the
existence of a key: secret information that adapts the encryption algorithm for each
different use [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>
            Decryption is the reverse process, which recovers the plaintext from the ciphertext
and the key. A cryptographic protocol specifies the details of how algorithms
and keys (and other primitive operations) are used to achieve the desired effect. The set
of protocols, encryption algorithms, key-management processes, and actions of
the users together constitutes a cryptosystem, which is what the end user
works and interacts with. In this work we must first have a ciphertext that
meets certain requirements: the encryption must be bijective, so that each element
of the domain is carried to a single element of the codomain. In addition, we must
also take into account Kerckhoffs' rules [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. (A minimal example of such a bijective cipher follows.)
          </p>
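          <p>To make the bijectivity requirement concrete, here is a minimal sketch of a
monoalphabetic substitution cipher whose key is a permutation of the alphabet, so
every ciphertext letter decrypts to exactly one plaintext letter. The key shown is
illustrative.</p>
          <preformat>
# Minimal sketch: a bijective monoalphabetic substitution over A-Z.
# Because the key maps each letter to exactly one other letter,
# decryption is just the inverse mapping. The key below is illustrative.
import string

ALPHABET = string.ascii_uppercase
KEY = "QWERTYUIOPASDFGHJKLZXCVBNM"   # a permutation of ALPHABET

ENC = str.maketrans(ALPHABET, KEY)   # plaintext letter to ciphertext letter
DEC = str.maketrans(KEY, ALPHABET)   # the inverse mapping

def encrypt(plaintext):
    return plaintext.upper().translate(ENC)

def decrypt(ciphertext):
    return ciphertext.upper().translate(DEC)

c = encrypt("EL MENSAJE SECRETO")
print(c)            # the ciphertext
print(decrypt(c))   # recovers EL MENSAJE SECRETO
          </preformat>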
          <p>2 Development work</p>
          <p>
            2.1 Frequencies in Spanish
To decrypt a text we use the odds of how frequently certain letters of the alphabet
are used; this work considers only the Spanish language [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
          </p>
          <p>The frequencies of Spanish used for this study were as follows.
Letter-frequency statistics may vary from one source to another depending on the
corpus the author has chosen to develop them; differences usually appear when the corpus
is literary or consists of texts of different origins. Table 1 shows the frequency of
each letter of the Spanish alphabet with its respective percentage. (A sketch of the
frequency-matching step follows the table.)</p>
          <p>Table 1. Frequencies of the letters of the Spanish alphabet.
High frequency        Medium frequency      Low frequency         Freq. below 0.5%
letter  freq.%        letter  freq.%        letter  freq.%
E       16.78         R       4.94          Y       1.54          G, F, V, W
A       11.96         U       4.80          Q       1.53
O        8.69         I       4.15          B       0.92
L        8.37         T       3.31          H       0.89
S        7.88         C       2.92                                J, Z, X, K, Ñ
N        7.01         P       2.76
D        6.87         M       2.12</p>
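          <p>A minimal sketch of the frequency-matching step: rank the ciphertext letters by
frequency and pair them with the Spanish ranking of Table 1 (abridged here). This
gives only a first guess, which is then refined with syllable and word frequencies.</p>
          <preformat>
# Minimal sketch: first-pass decryption guess by pairing the ciphertext's
# letter ranking with the Spanish ranking of Table 1 (abridged).
from collections import Counter

# Spanish letters ordered by descending frequency (Table 1, abridged).
SPANISH_RANK = "EAOLSNDRUITCPMYQBH"

def frequency_guess(ciphertext):
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    # Pair the i-th most frequent cipher letter with the i-th Spanish letter.
    mapping = dict(zip(ranked, SPANISH_RANK))
    return "".join(mapping.get(ch, ch) for ch in ciphertext.upper())

# The guess is rough on short texts; syllable and word frequencies
# (Tables 2-5) are then used to correct the remaining letters.
          </preformat>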
          <p>
            Most frequent words
The vowels make up about 46.38% of a text. The high-frequency letters account
for 67.56% of the text, and the medium-frequency letters account for 25% [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
In the dictionary the most common vowel is A, but in written texts it is E,
because of prepositions, conjunctions, verbs, etc. The most common consonants
are L, S, N, D, with about 30%. The six least frequent letters are V, Ñ, J, Z, X and
K (just over 1%). The average length of a Spanish word is 5.9 letters. The
index of coincidence for Spanish is 0.0775. In addition, to help solve the encryption,
Table 2 lists the words most frequently used in a text of 10,000 words (a sketch of
the word-based refinement follows the table).
          </p>
          <p>Table 2. Most common words in a 10,000-word text.
Most common words     Two-letter words      Three-letter words
Word    Freq.         Freq.                 Word    Freq.
DE      778           778                   QUE     289
LA      460           460                   LOS     196
EL      339           339                   DEL     156
EN      302           302                   LAS     114
QUE     289           119                   POR     110
Y       226            98                   CON      82
A       213            74                   UNA      78
LOS     196            64                   MAS      36
DEL     156            63                   SUS      27
SE      119            47                   HAN      19
LAS     114</p>
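          <p>A minimal sketch of how the short-word table can refine a candidate letter mapping:
swaps in the mapping are kept when they make more of the frequent ciphertext tokens
decode to common Spanish words. The scoring scheme is our own illustration, not the
paper's exact method.</p>
          <preformat>
# Minimal sketch: refine a candidate letter mapping by checking whether
# the most frequent short tokens decode to common Spanish words (Table 2).
# The scoring scheme is an illustration, not the paper's exact method.
from collections import Counter

# Common short Spanish words taken from Table 2.
COMMON_WORDS = {"DE", "LA", "EL", "EN", "SE", "QUE", "LOS", "DEL", "LAS", "POR"}

def score_mapping(ciphertext, mapping):
    # Count how many frequent decoded tokens land in the common-word list.
    tokens = ciphertext.upper().split()
    decoded = ["".join(mapping.get(ch, ch) for ch in t) for t in tokens]
    frequent = [w for w, _ in Counter(decoded).most_common(20)]
    return sum(1 for w in frequent if w in COMMON_WORDS)

def refine(ciphertext, mapping, a, b):
    # Try swapping the images of cipher letters a and b; keep the better map.
    swapped = dict(mapping)
    swapped[a], swapped[b] = mapping.get(b, b), mapping.get(a, a)
    if score_mapping(ciphertext, swapped) > score_mapping(ciphertext, mapping):
        return swapped
    return mapping
          </preformat>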
          <p>Next, Table 3 shows the frequencies of the four-letter words.
The size of the corpus is 60,115 letters, and the frequencies are absolute. The
digraphs are read by row and column, in that order; Table 4 shows the
digraphs, which are formed by joining letters with letters.</p>
          <p>Most common initial letter
The letters that most frequently start a word in Spanish are listed in Table 5.</p>
          <p>3 Results
The ciphertext used, as stated, had to be bijective and to respect Kerckhoffs' rules;
the decrypted text is shown in Figure 1.</p>
          <p>Table 3. Four-letter words.
Word    Freq.
PARA    67
COMO    36
AYER    25
ESTE    23
PERO    18
ESTA    17
AÑOS    14
TODO    11
SIDO    11
SOLO    10</p>
          <p>Distribution of letters in literary texts:
E - 16.78%    R - 4.94%    Y - 1.54%    J - 0.30%
A - 11.96%    U - 4.80%    Q - 1.53%
O -  8.69%    I - 4.15%    B - 0.92%
L -  8.37%    T - 3.31%    H - 0.89%
S -  7.88%    C - 2.92%    G - 0.73%
N -  7.01%    P - 2.77%    F - 0.52%
D -  6.87%    M - 2.12%    V - 0.39%</p>
          <p>We conclude that this method of decryption is good, although it would have to be
tuned a little more, since it depends on the text at hand and on how much text there
is to decrypt; we also observed that it only decrypts bijective encryptions. In this
work, as seen in the results of Figure 1, several processes are applied: first the
probabilities of the most frequent letters in Spanish, then the most frequent
syllables in Spanish, then the most frequent words, and finally, for the information
still missing, the text analyzer. As shown in Figure 1, a large percentage of the
information is decoded, but, as mentioned above, this depends on how much
information there is to process.</p>
          <p>Table 5. Most frequent word-initial letters (absolute frequencies).
letter     P     C     D     E    S    A    L    R    M    N    T
frequency  1128  1081  1012  989  789  761  435  425  403  346  298
letter     Q    I    H    U    G    V    F    O    B    J   Y   W   Z  K
frequency  286  281  230  219  206  183  177  169  124  47  27  19  2  1</p>
          <p>References
1. Liddell and Scott's Greek-English Lexicon. Oxford University Press (1984).
2. Anaya Multimedia, Códigos y Claves Secretas: Programas en Basic, based in turn on a
lexicographic study of the newspaper "El País", Mexico (1986).
3. Friedman, William F. and Callimahos, Lambros D., Military Cryptanalytics,
Cryptographic Series (1962).
4. Part I - Volume 2, Aegean Park Press, Laguna Hills, CA (1985).
5. Barker, Wayne G., Cryptograms in Spanish, Aegean Park Press, Laguna Hills, CA.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barrón Vivanco M. Arandine</surname>
            ,
            <given-names>Pérez O. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Fátima</surname>
          </string-name>
          , Pazos R., XII Congreso de Investigación en Salud Pública, Aplicación de técnicas de minería de datos a bases de datos poblacionales de cáncer, CENIDET, México, Secretaría de Saúde do Estado de Pernambuco, Brasil, Abril (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Forgy</surname>
            <given-names>E.</given-names>
          </string-name>
          “
          <article-title>Cluster analysis of multivariate data: Efficiency vs</article-title>
          .
          <source>Interpretability of classification”</source>
          ,
          <source>Biometrics</source>
          , vol.
          <volume>21</volume>
          , pp.
          <fpage>768</fpage>
          -
          <lpage>780</lpage>
          .
          <year>1965</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hernández-Orallo</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramírez-Quintana M. J.</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ferri-Ramírez</surname>
            <given-names>C.</given-names>
          </string-name>
          , Introducción a la Minería de Datos, Ed. Pearson Prentice Hall, Madrid (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hidalgo-Martínez Ana C.</surname>
          </string-name>
          <article-title>El cáncer cérvico-uterino su impacto en México. Porqué no funciona el programa nacional de detección oportuna</article-title>
          .
          <source>Revista Biomédica</source>
          ,
          <string-name>
            <given-names>Centro</given-names>
            <surname>Nal. De Investigaciones Regionales Dr. Hideyo Noguchi</surname>
          </string-name>
          ,
          <string-name>
            <surname>UADY</surname>
          </string-name>
          ,
          <year>2006</year>
          , México.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. IRIS 4. http://mapserver.inegi.gob.mx. SNIEG Sistema Nacional de Información Estadística y Geográfica.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jin</surname>
            <given-names>Chen</given-names>
          </string-name>
          , MacEachren,
          <string-name>
            <surname>Alan</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peuquet</surname>
          </string-name>
          , Donna. Constructing Overview+
          <article-title>Detail Dendogram Matrix Views</article-title>
          .
          <source>IEEE Transactions on Visualization &amp; Computer Graphics</source>
          ., Vol.
          <volume>15</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>6</given-names>
          </string-name>
          ,
          <fpage>p889</fpage>
          -
          <lpage>896</lpage>
          , Dec.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>MacQueen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In Proceedings Fifth Berkeley Symposium Mathematics Statistics and Probability</source>
          . Vol.
          <volume>1</volume>
          . Berkeley, CA (
          <year>1967</year>
          )
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Martínez</surname>
            <given-names>M. Francisco</given-names>
          </string-name>
          <string-name>
            <surname>Javier</surname>
          </string-name>
          .
          <article-title>Epidemiología del cáncer del cuello uterino</article-title>
          .
          <source>Medicina Universitaria</source>
          <year>2004</year>
          ,
          <fpage>39</fpage>
          -
          <lpage>46</lpage>
          . Vol.
          <volume>6</volume>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>22</year>
          ,
          <string-name>
            <surname>UANL</surname>
          </string-name>
          , México.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>NAIIS</given-names>
            <surname>Instituto Nacional de Salud Pública</surname>
          </string-name>
          ,
          <string-name>
            <surname>SCRIS</surname>
          </string-name>
          , Mortalidad, http://sigsalud.insp.mx/naais/, Cuernavaca, Morelos, México, (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Nevine M. Labib</surname>
            ,
            <given-names>Michael N.</given-names>
          </string-name>
          <article-title>Malek: Data Mining for Cancer Management in Egypt</article-title>
          .
          <source>Transactions on Engineering, Computing and Technology V8 October</source>
          <year>2005</year>
          :
          <article-title>(ISSN 1305-</article-title>
          5313).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pérez-C. Nelson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Abril-Frade D.O. Estado Actual de las Tecnologías de Bodegas de Datos Espaciales</surname>
          </string-name>
          .
          <source>Ing. E Investigación</source>
          . Vol.
          <volume>27</volume>
          , No. 1, Univ. Nal. De Colombia.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pérez-O. J.</surname>
            ,1,
            <given-names>R. Pazos R</given-names>
          </string-name>
          , L. Cruz R.,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Reyes S. “Improvement the Efficiency and Efficacy of the K-means Clustering Algorithm through a New Convergence Condition”</article-title>
          .
          <source>Computational Science and Its Applications - ICCSA 2007 - International Conference Proceedings</source>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pérez-O. J</surname>
            .2,
            <given-names>M.F.</given-names>
          </string-name>
          <string-name>
            <surname>Henriques</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Pazos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cruz</surname>
            , G. Reyes,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Salinas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mexicano</surname>
          </string-name>
          . Mejora al Algoritmo de
          <article-title>K-means mediante un Nuevo criterio de</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>