<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Advances on Semantic Web and New Technologies</article-title>
      </title-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>66</fpage>
      <lpage>107</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The Workshop on Semantic Web and New Technologies was held by second time at the Faculty of
Computer Science of Benemérita Universidad Autónoma de Puebla, Mexico in March 2009.
The Semantic Web provides a common framework that allows data to be shared and reused across
application, enterprise, and community boundaries. Semantic Web technologies are beginning to play a
significant role in many diverse areas, marking a turning point in the evolution of the Web.
The goal of this workshop is to provide a forum for the Semantic Web community, in which
participants can present and discuss approaches to add semantics on the Web, show innovative
applications in this field and identify upcoming research issues related to Semantic Web. In order to
fulfill these objectives, the more important workshop topics included Semantic Search, Semantic
Advertising and Marketing, Linked Data, Collaboration and Social Network, Foundational Topics,
Semantic Web and Web 3.0, Ontologies, Semantic Integration, Data Integration and Mashups,
Unstructured Information, Semantic Query, Semantic Rules, Developing Semantic Applications and
Semantic SOA.</p>
      <p>Dr John Cardiff was the invited speaker at this Second Workshop on Semantic Web. He is a
full-time lecturer and lead researcher in the Social Media Research Group, based at the Institute
of Technology Tallaght, Dublin, Ireland. He has previously held positions in the Department
of Computer Science, Trinity College Dublin, and at the University of Queensland, Australia,
where he obtained his Ph.D. in 1990. He has extensive experience in semantic web
technologies, heterogeneous database research, and query processing and optimization. He
collaborates closely with researchers of the National Language Engineering Laboratory at the
Polytechnic University of Valencia, Spain, the Knowledge and Data Engineering Group of
Trinity College Dublin, and the IBM Dublin Center for Advanced Studies. He is currently
supervising four PhD students who are investigating semantic-web-based recommender
systems, blogosphere analysis, and adaptive hypermedia systems. Dr Cardiff has a wide
breadth of experience of research and management of large European Union funded projects
under programmes such as RACE, Esprit, and AIM. He has over 20 refereed publications in
international conferences and journals.
Invited Paper
The Evolution of the Semantic Web
John Cardiff
Exploiting Wikipedia as a Knowledge Base: Towards an Ontology of Movies
Rodrigo Alarcón, Octavio Sánchez and Víctor Mijangos
Translation of Verbal Expressions and Context of Use Extraction through a Corpus on Web
Arturo Velasco, María J. Somodevilla, and Ivo H. Pineda
Dynamic Concept-Based Taxonomy Used for Image Recovery Based on Their Textual Description
Jaime Lara, María de la Concepción Pérez de Celis and David Pinto
The Use of Document Fingerprinting in the Web People Search Task
David Pinto, Mireya Tovar, Beatriz Beltrán, Darnes Vilariño and Héctor Furlog
mQA: Question Answering in Mobile Devices
Fernando Zacarías F., Alberto Tellez V., Marco Antonio Balderas and Rosalba Cuapa C.
Semantic Routing for Structured Peer-to-Peer Networks
Luis Enrique Colmenares Guillén, Omar Ariosto Niño Prieto and Leandro Navarro Moldes
Some Considerations for the Semantic Web
María Elena Franco Carcedo</p>
    </sec>
    <sec id="sec-2">
      <title>John Cardiff</title>
      <p>Social Media Research Group,
Institute of Technology Tallaght, Dublin, Ireland
email: John.Cardiff@ittdublin.ie
Abstract — The Semantic Web offers the exciting promise of a
world in which computers and humans can cooperate
effectively with a common understanding of the meaning of
data. However, in the decade since the term came into
widespread usage, Semantic Web applications have been slow
to emerge from the research laboratories. In this paper, we
present a brief overview of the Semantic Web vision and the
underlying technologies. We describe the advances made in
recent years and explain why we believe that Semantic Web
technology will be the driving force behind the next generation
of Web applications.</p>
      <sec id="sec-2-1">
        <title>I. INTRODUCTION</title>
        <p>The World Wide Web (WWW) was invented by Tim
Berners Lee in 1989, while he was working at the European
Laboratory for Particle Physics (CERN) in Switzerland. It
was conceived as a means to allow physicists working in
different countries to communicate and to share
documentation more efficiently. He wrote the first browser
and Web server, allowing hypertext documents to be stored,
retrieved and viewed.</p>
        <p>The Web added two important services to the internet. First, it
provided a very convenient means for us to retrieve and
view information: we can see the web as a vast
document store from which we retrieve documents (web pages)
by typing their address into a web browser. Secondly, it
provided a language called HTML, which describes to
computers how to display documents written in this
language. Documents, or web pages, are identified by a
unique identifier called a Uniform Resource Locator (URL)
and are accessed using a Web browser. Within a short
space of time, the WWW had become a popular
infrastructure for sharing information, and as the volume of
information increased its use became increasingly
widespread.</p>
        <p>Although the web provides the infrastructure for us to
publish and retrieve documents, the HTML language
defines only the visual characteristics, i.e. how the
documents are to be presented on a computer screen to the
user. It is up to the user who requested the document to
interpret the information it contains. This seems
counterintuitive, as we normally think of computers as the
tools that perform the more complex tasks, making life easier
for humans. The problem is that HTML gives no
consideration to the meaning of documents: they are not
represented in a way that allows interpretation of their
information content by computers.</p>
        <p>If computers could interpret the content of a web page, a lot
of exciting possibilities would arise. Information could be
exchanged between machines, and automated processing and
integration of data on different sites could occur.
Fundamentally, computers could improve the ways in which they
retrieve and utilise information for us, because they
would have an understanding of what we are interested in.
This is where the Semantic Web fits into the picture:
today's web (the "syntactic" web) is about documents,
whereas the semantic web is about "things" - concepts we
are interested in (people, places, events, etc.) - and the
relationships between these concepts.</p>
        <p>
          The Semantic Web vision envisages advanced management
of the information on the internet, allowing us to pose
queries rather than browse documents, to infer new
knowledge from existing facts, and to identify
inconsistencies. Some of the advantages of achieving this
goal include [
          <xref ref-type="bibr" rid="ref13 ref4">4</xref>
          ]:
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Advantages of the Semantic Web</title>
        <p>The ability to locate information based on its meaning, e.g. knowing when two statements are
equivalent, or knowing that references to a person in different web pages refer to the same
individual.</p>
        <p>Integrating information across different sources: by creating mappings across application and
terminological boundaries we can identify identical or related concepts.</p>
        <p>Improving the way in which information is presented to a user, e.g. aggregating information
from different sources, removing duplicates, and summarising the data.</p>
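        <p>The integration advantage above can be made concrete: once two sources use the same URI for a concept, merging their statements and removing duplicates is mechanical. A minimal sketch in plain Python, with invented data and URIs:
```python
# Two independent sources describe the same individual using the same URI
# (all identifiers here are invented for illustration).
site_a = [("http://example.org/person/42", "name", "J. Cardiff")]
site_b = [("http://example.org/person/42", "affiliation", "ITT Dublin"),
          ("http://example.org/person/42", "name", "J. Cardiff")]

# Because the shared URI identifies the concept, set union both integrates
# the sources and removes the duplicate statement.
merged = sorted(set(site_a) | set(site_b))
print(merged)  # three statements collapse to two
```
        </p>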
        <p>
          While the technologies to enable the development of the
Semantic Web were in place from the conception of the
web, a seminal article by Tim Berners-Lee, James Hendler
and Ora Lassila [
          <xref ref-type="bibr" rid="ref1 ref10">1</xref>
          ], published in Scientific American in 2001,
provided the impetus for research and development to
commence. The authors described a world in which
independent applications could cooperate and share their
data in a seamless way to allow the user to achieve a task
with minimal intervention. Central to this vision is the
ability to "unlock" data that is controlled by different
applications and make it available for use by other
applications. Much of this data is already available on the
Web; for example, we can access our bank statements, our
diaries and our photos online. But the data is controlled by
proprietary applications. The Semantic Web vision is to
publish this data in a sharable form: we could integrate the
items of our bank statements into our calendar so that we
could see what transactions we made on a given day, or include
photos so that we could see what we were doing at that time.
However, eight years after publication of this article, we are
still some distance from realising this vision. In this paper, we
present an overview of the Semantic Web. We explain why
progress has been slow and the reasons we believe this is
about to change.
        </p>
        <p>The paper is organized as follows. In Section II we describe
the problems we face when trying to extract meaning from
the web as it is today. Section III presents a brief overview
of the technologies underlying the Semantic Web. In
Section IV we give an overview of the gamut of typical
Semantic Web applications and Section V introduces the
Linking Open Data project. Finally, we present our
conclusions and look to the future in Section VI.</p>
      </sec>
      <sec id="sec-2-3">
        <title>II. THE PROBLEM WITH THE "SYNTACTIC WEB"</title>
        <p>In Figure 1 we see a "typical" web page written in HTML
which we will use to exemplify some of the drawbacks of
the traditional web. This page lists the keynote speeches
which took place at the 2009 World Wide Web conference1.
To the reader, the content of the page can be interpreted
intuitively. We can read the titles of the speeches, the names
of the speakers and the time and dates at which they take
place. Furthermore, by familiarity with browser interaction
paradigms, we can realize that by following a hyperlink we
can retrieve information about concepts related to the
conference (authors, sponsors, attendees etc.). In this
example, by following the hyperlink labelled "Sir Tim
Berners-Lee" we will retrieve a document containing
information about the person of this name. We intuitively
assign a meaning - perhaps "has-homepage" - to the
hyperlink, allowing us to assimilate the information
presented to us.</p>
        <p>A web browser cannot assign any meaning to the links we see in
this page - a hyperlink is simply a link from one document
to another, and the interpretation of the meaning of the link
(and of the documents themselves!) is a task for the human
reader. All that can be inferred automatically is that some
undefined association between the two documents exists.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1 http://www2009.org/keynote.html</title>
      <p>The problems are even clearer when we consider the
nature of keyword-based searching. While search engines
such as Google and Yahoo! are clearly very good at what
they do, we are frequently presented with a vast number of
results, many (most?) of which will be irrelevant to our
search. Semantically similar items will not be retrieved (for
instance, a search for "movie" will not retrieve results where
the word "film" was used). And most significantly, the
result set is a collection of individual web pages. Our tasks
often require access to multiple sites (such as when we book
a holiday), and so it is our responsibility to formulate a
sequence of queries to retrieve the individual web pages,
each one of which performs part of the task at hand.</p>
      <p>There are two potential ways to deal with this problem. One
approach is to take the web as it is currently implemented,
and to use Artificial Intelligence techniques to analyze the
content of web pages in order to provide an interpretation of
their meaning. This approach, however, would be prone to error
and would require validation. Furthermore, the rate at which
the web is growing would render it practically impossible to
achieve.</p>
      <p>The other approach is to represent the web pages in a form
in which we can represent and interpret the data they
contain. If there is a common representation to express the
meaning of the data on the web, we can then develop
languages, reasoners, and applications which can exploit
this representation. This is the approach of the Semantic
Web.</p>
      <sec id="sec-3-1">
        <title>III. SEMANTIC WEB TECHNOLOGIES</title>
        <p>
          The Semantic Web describes a web of data rather than
documents. And just as we need common formats and
standards to be able to retrieve documents from computers
all over the world, we need common formats for the
representation and integration of data. We also need
languages that allow us to describe how this data relates to
real world objects and to reason about the data. The famous
"Layer Cake" [
          <xref ref-type="bibr" rid="ref19">10</xref>
          ] diagram, shown in Figure 2, gives an
overview of the hierarchy of the principal languages and
technologies, each one exploiting the features of the levels
beneath it. It also reinforces the fact that the Semantic Web
is not separate from the existing web, but is in fact an
extension of its capabilities.
        </p>
        <p>
          In this section, we summarize and discuss the key aspects
shown in the Layer Cake diagram. Firstly, we describe the
core technologies: the languages RDF and RDFS. Next we
describe the higher-level concepts, focusing in particular on
the concept of the ontology, which is at the heart of the
Semantic Web infrastructure. Finally, we examine the trends
and directions of the technology. For further information on
the concepts presented in this section, the reader is referred
to a more detailed work (e.g. [
          <xref ref-type="bibr" rid="ref13 ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref14 ref5">5</xref>
          ]).
        </p>
        <p>What HTML is to documents, RDF (Resource Description
Framework) is to data. It is a W3C standard2 based on XML
which allows us to make statements about objects. It is a
data model rather than a language: we can say that an
object possesses a particular property, or that it has a named
relationship with another object. RDF statements are written
as triples: a subject, a predicate and an object.</p>
        <p>By way of example, the statement
"The Adventures of Tom Sawyer" was written by Mark Twain
could be expressed in RDF by a statement such as:
&lt;rdf:Description rdf:about="www.famouswriters.org/twain/mark"&gt;
  &lt;s:hasName&gt;Mark Twain&lt;/s:hasName&gt;
  &lt;s:hasWritten rdf:resource="www.books.org/ISBN0001047"/&gt;
&lt;/rdf:Description&gt;</p>
        <p>At first glance it may appear that this information could be
equally well represented using XML. However, XML makes
no commitment on which words should be used to describe
a given set of concepts. In the above example we have a
property entitled "hasWritten", but this could equally have
been "IsAuthorOf" or another such variant. So, XML is
suitable for closed and stable domains, rather than for
sharable web resources.</p>
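        <p>The uniform triple structure is what makes generic processing possible. The following sketch (plain Python rather than an RDF toolkit, reusing the illustrative URIs from the example above) shows how statements of this shape can be stored and queried with a single generic function:
```python
# RDF statements reduce to (subject, predicate, object) triples.
# The URIs are the illustrative ones from the text.
TWAIN = "www.famouswriters.org/twain/mark"
BOOK = "www.books.org/ISBN0001047"

triples = [
    (TWAIN, "hasName", "Mark Twain"),
    (TWAIN, "hasWritten", BOOK),
]

# Because every statement has the same shape, one query function works
# for any subject and any predicate.
def objects_of(subject, predicate):
    return [o for (s, p, o) in triples if s == subject and p == predicate]

print(objects_of(TWAIN, "hasWritten"))  # ['www.books.org/ISBN0001047']
```
        </p>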
        <p>The statements we make in RDF are unambiguous and have
a uniform structure. Concepts are each identified by a
Uniform Resource Identifier (URI), which allows us to
make statements about the same concept in different
applications. This provides the basis for semantic
interoperability, allowing us to distinguish between
ambiguous terms (for instance, an address could be a
geographical location, or a speech) and to define a place on
the web at which we can find the definition of the concept.</p>
        <p>To describe and make general statements collectively about
groups of objects (or classes), and to assign properties to
members of these groups, we use RDF Schema, or RDFS3.
RDFS provides a basic object model, and enables us to
describe resources in terms of classes, properties, and
values. Whereas in RDF we spoke about specific objects
such as "The Adventures of Tom Sawyer" and "Mark
Twain", in RDFS we can make general statements such as
"A book was written by an author".</p>
      </sec>
      <sec id="sec-3-2">
        <title>This could be expressed in RDFS as</title>
        <p>&lt;rdf:Property rdf:ID="HasWritten"&gt;
  &lt;rdfs:domain rdf:resource="#author"/&gt;
  &lt;rdfs:range rdf:resource="#book"/&gt;
&lt;/rdf:Property&gt;
An expansion of these examples, and the relationship
between the graphical representations of RDF and RDFS,
is shown in Figure 3.
2 www.w3.org/RDF/</p>
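        <p>Domain and range declarations like the ones above license simple inferences: any subject of a HasWritten statement must be an author, and any object a book. A minimal sketch of that inference step, in plain Python with invented identifiers rather than a real RDFS reasoner:
```python
# RDFS-style schema: domain and range of the hasWritten property.
domain = {"hasWritten": "author"}
range_ = {"hasWritten": "book"}   # "range" avoided: it shadows a builtin

# One concrete RDF statement (identifiers invented for illustration).
triples = [("twain", "hasWritten", "ISBN0001047")]

# Derive rdf:type statements from the schema declarations.
inferred = set()
for s, p, o in triples:
    if p in domain:
        inferred.add((s, "rdf:type", domain[p]))
    if p in range_:
        inferred.add((o, "rdf:type", range_[p]))

print(inferred)
```
        </p>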
      </sec>
      <sec id="sec-3-3">
        <title>3 http://www.w3.org/TR/rdf-schema/</title>
        <p>
          RDF and RDFS allow us to describe aspects of a domain,
but the modelling primitives are too restrictive to be of
general use. We need to be able to describe the taxonomic
structure of the domain, to model restrictions or
constraints of the domain, and to state and reason
over a set of inference rules associated with the domain. In short, we
need to be able to describe an ontology of our domain.
The term ontology originated in the sphere of philosophy,
where it signified the nature and the organisation of reality,
i.e. concerning the kinds of things that exist, and how to
describe them. Our definition within Computer Science is
more specific, and the most commonly cited definition has
been provided by Tom Gruber in [
          <xref ref-type="bibr" rid="ref15 ref6">6</xref>
          ], where he defines
an ontology as "an explicit and formal specification of a
conceptualization". In other words, an ontology provides us
with a shared understanding of a domain of interest. The
fact that the specification is formal means that computers
can perform reasoning about it. This in turn will improve the
accuracy of searches, since a search engine can retrieve data
regarding a precise concept, rather than a large collection of
web pages based on keyword matching.
        </p>
        <p>In relation to the Semantic Web, for us to share, reuse and
reason about data we must provide a precise definition of
the ontology, and represent it in a form that makes it
amenable to machine processing. An ontology language
should ideally extend existing standards such as XML and
RDF/S, be of "adequate" expressive power, and provide
efficient automated reasoning support. The most widely
used ontology language is the "Web Ontology Language",
which curiously has the acronym "OWL"4. Along with
RDF/S, OWL is a W3C standard and augments RDFS with
additional constraints such as localised domain and range
constraints, cardinality and existence constraints, and
transitive, inverse, and symmetric properties.</p>
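        <p>Declaring a property transitive, for instance, allows a reasoner to derive facts that were never stated explicitly. The following toy sketch (plain Python with an invented locatedIn property, not an actual OWL reasoner) illustrates the closure computation such a declaration triggers:
```python
# Two asserted facts about an invented transitive property "locatedIn".
facts = {("Tallaght", "locatedIn", "Dublin"),
         ("Dublin", "locatedIn", "Ireland")}

# Repeatedly apply the transitivity rule until no new fact is derived
# (a naive fixpoint computation).
changed = True
while changed:
    changed = False
    for (a, p1, b) in list(facts):
        for (c, p2, d) in list(facts):
            if p1 == p2 == "locatedIn" and b == c and (a, p1, d) not in facts:
                facts.add((a, p1, d))
                changed = True

print(("Tallaght", "locatedIn", "Ireland") in facts)  # True
```
        </p>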
        <p>Adding a reasoning capability to an ontology language is
tricky, since there is a trade-off between efficiency and
expressiveness. Ultimately it depends on the nature and
requirements of the end application, and it is for this reason
that OWL offers three sublanguages:
4 www.w3.org/2004/OWL</p>
        <p>OWL Lite supports only a limited subset of OWL
constructs and is computationally efficient;
OWL DL is based on a first-order logic called
Description Logic;
OWL Full offers full compatibility with RDFS, but
at the price of computational tractability.</p>
        <p>Examples of applications which could require very different
levels of reasoning capabilities are described in the
following section.</p>
        <p>The top layers of the layer cake have received surprisingly
little attention, considering that they are crucial to the successful
deployment of Semantic Web applications. The proof layer
involves the actual deductive process, the representation of
proofs, and proof validation. It allows applications to be asked
why a particular conclusion has been reached, i.e.
they can give proof of their conclusions. The trust layer
provides authentication of identity and evidence of the
trustworthiness of data and services. It is supported through
the use of digital signatures, recommendations by trusted
agents, ratings by certification agencies, etc.</p>
        <p>
          C. Recent Trends and Technological Developments
As with any maturing technology, the architecture will not
remain static. In 2006 Tim Berners-Lee suggested an update
to the layer cake diagram [
          <xref ref-type="bibr" rid="ref11 ref2">2</xref>
          ], shown in Figure 4;
however, this is just one of several proposed refinements.
Some of the new features and languages include the
following.
        </p>
        <p>Rules and Inferencing Systems. Alternative approaches to
rule specification and inferencing are being developed. RIF
(Rules Interchange Format) is a language for representing
rules on the Web and for linking different rule-based
systems together. The various formalisms are being
extended in order to capture causal, probabilistic and
temporal knowledge.</p>
        <p>Database Support for RDF. As the volume of RDF data
increases, it is necessary to provide the means to store,
query and reason efficiently over the data. Database support
for RDF and OWL is now available from Oracle (although
at present the focus is on storage, rather than inferencing
capabilities). Other open source products include 3Store5
and Jena6. The specification of a query language for RDF,
SPARQL, was adopted by the W3C in 2008.</p>
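        <p>The idea behind a SPARQL basic graph pattern - triple patterns with shared variables, matched conjunctively against the data - can be sketched in a few lines. The matcher and data below are invented for illustration and are not the SPARQL algebra itself:
```python
# Toy triple store (identifiers invented for illustration).
triples = [
    ("p1", "type", "protein"),
    ("p1", "involvedIn", "signal_transduction"),
    ("p2", "type", "protein"),
]

def match(pattern, triple, binding):
    """Unify one pattern with one triple; terms starting with '?' are variables."""
    b = dict(binding)
    for pat, val in zip(pattern, triple):
        if pat.startswith("?"):
            if b.get(pat, val) != val:
                return None     # variable already bound to something else
            b[pat] = val
        elif pat != val:
            return None         # constant term does not match
    return b

def query(patterns):
    """Conjunctive query: every pattern must match, bindings are shared."""
    results = [{}]
    for pat in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := match(pat, t, b)) is not None]
    return results

# Which proteins are involved in signal transduction?
print(query([("?x", "type", "protein"),
             ("?x", "involvedIn", "signal_transduction")]))
```
        </p>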
        <p>RDF Extraction. The language GRDDL ("Gleaning
Resource Descriptions from Dialects of Languages")
identifies when an XML document contains data compatible
with RDF and provides transformations which can extract
the data. Considering the volume of XML data available on
the web, a means of converting it to RDF is clearly highly
desirable.</p>
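        <p>The gleaning idea can be illustrated with a small stdlib sketch. GRDDL itself works via declared XSLT transformations; the XML dialect, element names and URI scheme below are invented purely for the example:
```python
import xml.etree.ElementTree as ET

# An invented XML dialect holding data that maps naturally onto triples.
xml_doc = """
<books>
  <book isbn="0001047">
    <title>The Adventures of Tom Sawyer</title>
    <author>Mark Twain</author>
  </book>
</books>
"""

# "Glean" (subject, predicate, object) triples from the XML structure:
# the isbn attribute becomes the subject, child elements become properties.
triples = []
for book in ET.fromstring(xml_doc).iter("book"):
    subject = "urn:isbn:" + book.get("isbn")
    for child in book:
        triples.append((subject, child.tag, child.text))

print(triples)
```
        </p>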
      </sec>
    </sec>
    <sec id="sec-4">
      <title>IV. SEMANTIC WEB APPLICATIONS</title>
      <p>5 http://sourceforge.net/projects/threestore/ 6 http://jena.sourceforge.net/</p>
      <p>Ontology Language Developments. The OWL language was
adopted as a standard in 2004. In 2007, work began on the
definition of a new version, OWL 2, which includes easier
query capabilities and efficient reasoning algorithms that scale
to large datasets.</p>
      <p>Even though Semantic Web technology is in its infancy,
a wide range of applications already exists. In this
section we give a brief overview of some typical application
areas.</p>
      <p>
        E-Science Applications. Typically, e-science describes
scenarios involving large data collections requiring
computationally intensive processing, where the
participants are distributed across the world. An
infrastructure whereby scientists from different disciplines
are able to share their insights and results is seen as critical,
particularly when we consider the large volumes of data
becoming available online. The Gene
Ontology7 is a project aimed at standardizing the
representation of genes across databases and species.
Perhaps the most famous e-science project is the Human
Genome Project8, which identified the genes in human DNA
and includes over 500 datasets and tools. The
International Virtual Observatory Alliance9 makes available
astronomical data from a number of digital archives.
Interoperation of Digital Libraries. Institutions such as
libraries, universities, and museums have vast inventories of
materials which are increasingly becoming available online.
These systems are implemented using a range of different
technologies, and although their aims are similar, it is a huge
challenge to enable the different institutions to access each
other's catalogues. Ontologies are useful for providing
shared descriptions of the objects, and ontology mapping
techniques are being applied to achieve semantic
interoperability [
        <xref ref-type="bibr" rid="ref12 ref3">3</xref>
        ].
7 http://www.geneontology.org/index.shtml
8 http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
9 www.ivoa.net
      </p>
      <p>Travel Information Systems. An application which would allow a user to seamlessly book
and plan the various elements of a trip (flights, hotel, car
hire, etc.) is highly desirable. Ontologies again could be used
to arrive at a common understanding of terminology. The
Open Travel Alliance is building XML-based specifications
which allow for the interchange of messages between
companies. While this is a first step, an agreed ontology
would be needed in order to achieve any meaningful
interoperation.</p>
      <p>Although many potential applications can be identified,
fewer are deployed at this time than we might expect.
One possible reason is the lack of a common understanding
of what the Semantic Web can offer, and more particularly
of what the role of the ontology is. At one end of the spectrum we
find applications which take the "traditional", or AI, view of
inferencing, in which accuracy is paramount. Such
applications arise in combinatorial chemistry, for example,
in which vast quantities of information on chemicals and
their properties are analysed in order to identify useful new
drugs. Coding the required drug's properties as assertions
will reduce the number of samples which need to be
constructed and manually analyzed by orders of magnitude.
In cases such as these, the time taken to perform the
inferencing is less important, since the trade-off will be a
large reduction in the samples to be analyzed.</p>
      <p>At the other end of the spectrum, we have "data centric"
web applications which require a swift response to the user.
Examples of this type of application include social network
recommender systems such as Twine10 which make use of
ontologies to recommend their users to other individuals
who may be of interest to them. While it is clear that a
response must be generated for the user within a few
seconds, we can observe too that there can be no logical
proof of correctness and soundness of the answers generated
in this type of case! Accordingly, the level of inferencing
required in this type of application is minimal.</p>
      <sec id="sec-4-1">
        <title>V. THE FUTURE: A WEB OF DATA?</title>
        <p>While we have stated that the Semantic Web focuses on
data in contrast to the document centric view of the
traditional web, this is not the complete picture. In order to
realize value from putting data on the web, links need to be
made in order to create a "web of data". Instead of having a
web with pages that link to each other, we can have (with
the same infrastructure) a data model with information on
each entity distributed over the web.</p>
        <p>
          The Linking Open Data [
          <xref ref-type="bibr" rid="ref12 ref3">3</xref>
          ] project aims to extend the
collections of data being published on the web in RDF
format and to create links between them. In a sense, this is
analogous to traditional navigation between hypertext
documents, where the links are now the URIs contained in
the RDF statements. Search engines could then query,
rather than browse, this information.
10 www.twine.com
        </p>
        <p>In a recent talk at the TED 2009 conference11, Tim Berners-Lee
gave a powerful motivating example for the project:
scientists investigating drug discovery for Alzheimer's
disease needed to know which proteins were involved in
signal transduction and were related to pyramidal neurons.
Searching on Google returned 223,000 hits, but no
document provided the answer, as nobody had asked the
question before. Posing the same question to the linked data
produced 32 hits, each of which is a protein meeting the
specified properties.</p>
        <p>At the conception of the project in early 2007, there were a
reported 200,000 RDF triples published. By May 2009 this
had grown to 4.7 billion. Core datasets include:</p>
      </sec>
      <sec id="sec-4-2">
        <title>DBpedia, a database extracted from Wikipedia</title>
        <p>containing over 274 million pieces of information.
The knowledge base is constructed by analyzing
the different types of structured information, such
as the "infoboxes", tables, pictures, etc.</p>
        <p>
          The DBLP Bibliography, which contains
bibliographic information on academic papers, and
Geonames, which contains RDF descriptions of 6.5
million geographical features.
So where is the Semantic Web? In a 2006 article [
          <xref ref-type="bibr" rid="ref20">11</xref>
          ], Tim
Berners-Lee agreed that the vision he described in the
Scientific American article has not yet arrived. But perhaps
it is arriving by stealth, under the guise of the "Web 3.0"
umbrella. Confusion still abounds about the meaning of the
term "Web 3.0", which has been variously described as
being about the meaning of data, intelligent search, or a
"personal assistant". This sounds like what the Semantic
Web has to offer, and even if the terms do not become
synonymous, it is clear that the Semantic Web will form a
crucial component of Web 3.0 (or vice versa!).
11 http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
12 http://en.wikipedia.org/wiki/File:Lod-datasets_2009-07-14_colored.png
13 http://protege.stanford.edu/
14 http://www.kowari.org/
        </p>
        <p>
          The last five years have seen Semantic Web applications
move from the research labs to the marketplace. While the
use of ontologies has been flourishing in niche areas such as
e-science for a number of years, a recent survey by Hendler
[
          <xref ref-type="bibr" rid="ref16 ref7">7</xref>
          ] shows a marked increase in the number of commercially
focused Semantic Web products. The main industrial players
are starting to take the technology more seriously: in August
2008, Microsoft bought Powerset, a semantic search engine,
for a reported $100m.
        </p>
        <p>As we have discussed, the "chicken and egg" dilemma is
resolving itself with tens of billions of RDF triples now
available on the web, and this number is continuing to
increase exponentially.</p>
        <p>Also, it is becoming easier for companies to enter the
market of Semantic Web applications. There are now a wide
range of open source applications such as Protégé13 and
Kowari14 which provide building blocks for application
development, making it more cost effective to develop
Semantic Web products.</p>
        <p>Some observers argue that the Semantic Web has failed to
deliver on its promise, arguing instead that the Web 2.0 genre
of applications signifies the way forward. The Web 2.0
approach has made an enormous impact in recent years, but
these applications could be developed and deployed more
rapidly because their designers did not have the inconvenience of
standards to adhere to. In this article we have described
the steady infiltration of the Semantic Web from the research
lab to the marketplace over the last
decade. As the standards mature and the web of data
expands, we are confident that the Semantic Web vision is
set to become a reality.</p>
        <p>Gruber, T. 1993. Toward principles for the design of ontologies used
for knowledge sharing. In Guarino, N., Poli, R. (eds), International
Workshop on Formal Ontology, Padova, Italy.
Hendler, J. 2008. Linked Data: The Dark Side of the Semantic Web
(tutorial), 7th International Semantic Web Conference (ISWC08),
Karlsruhe, Germany.</p>
        <p>
          Linking Open Data Wiki, available at
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural
Language Processing. MIT Press.
[
          <xref ref-type="bibr" rid="ref19">10</xref>
          ] "Semantic Web - XML2000, slide 10", W3C.
        </p>
        <p>http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html</p>
        <p>Exploiting Wikipedia as a Knowledge Base: Towards
an Ontology of Movies</p>
        <sec id="sec-4-2-1">
          <title>Rodrigo Alarcón, Octavio Sánchez, Víctor Mijangos</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Grupo de Ingeniería Lingüística, Universidad Nacional Autónoma de México</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>Basamento de la Torre de Ingeniería, Ciudad Universitaria, México, D.F. {ralarconm,osanchezv,vmijangosc}@iingen.unam.mx</title>
          <p>Abstract. Wikipedia is a huge knowledge base that grows every day thanks to the
contributions of people all around the world. Part of the information in
each article is kept in a special, consistently formatted table called an infobox.
In this article, we analyze the Wikipedia infoboxes of movie articles and
describe some of the problems that can make extracting information from these
tables a difficult task. We also present a methodology to automatically extract
information that could be useful towards the building of an ontology of movies
from the Spanish Wikipedia.
1 Introduction</p>
          <p>Wikipedia is a free encyclopedia of open content that has become an important
resource for the construction of the Semantic Web. Since its beginnings in
2001, the English version has reached more than 2 million articles, while the
Spanish version has around 480 thousand. All of the content has been
written and edited by volunteers from different countries in many different languages,
and it is covered by the GFDL (GNU Free Document License), which makes it possible
to use it freely.</p>
          <p>One important aspect of the structure of Wikipedia is the social control exercised
by the community, which is able to prevent the spam, nonsense and other kinds of
vandalism that are recurrent on some media sites. Moreover, this same control makes it
possible to constantly increase the quality and precision of the articles.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>Inside Wikipedia, there is an entry called Wikipedia: Wikipedia in academic</title>
          <p>
            studies1, where it is possible to see the growth of academic interest in this
encyclopedia. This interest is related to the use of Wikipedia in different academic
studies and as a knowledge base for developing specific tools. On the one hand, to
mention a few, some works have focused on the social phenomenon that Wikipedia
represents [
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] [
            <xref ref-type="bibr" rid="ref11 ref2">2</xref>
            ], others have denounced inherent problems of this
kind of media site [
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ], and others have obtained specific information and statistical
data about the users [
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ]. On the other hand, Wikipedia has become a useful resource
for the extraction of definitions, named entity recognition, machine translation and
semantic relation extraction [
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ]. In this last field, Wikipedia represents a huge
knowledge base that has made possible the development of specific ontologies for the
construction of the Semantic Web.
1 http://en.wikipedia.org/wiki/Academic_Research_on_Wikipedia.
          </p>
          <p>In this paper we present work in progress on the elaboration of an ontology of
movies from the Wikipedia in Spanish. First we briefly present an
overview of some studies related to the use of Wikipedia for semantic relation
extraction and ontology construction (2). Then we explain the first steps towards
the elaboration of an ontology of movies (3). This step includes: a) the description of
the so-called infobox, which is part of each movie article in Wikipedia and contains specific
data about the film (3.1); b) the specific relations to extract automatically (3.2); and c)
our proposed XML schema to represent these relations (3.3). Finally, we discuss
our preliminary results and present the future work (4).</p>
          <p>2 Wikipedia as a Semantic Knowledge Base</p>
        </sec>
        <sec id="sec-4-2-5">
          <title>There is a growing number of efforts to mine the information in Wikipedia for</title>
          <p>different purposes. As we have mentioned before, one of these interests is the extraction
of semantic information that could be helpful in the process of giving more meaning
to the Web. In Wikipedia, meaning can be seen as the knowledge about things
represented in different ways: definitions, descriptions, images, numeric data, etc.
Furthermore, the meaning of each concept explained in the encyclopedia is related to
the meaning of other concepts, which constitutes a helpful semantic network for
understanding concepts in the field to which they belong.</p>
        </sec>
        <sec id="sec-4-2-6">
          <title>In this sense, Wikipedia represents a valuable source of knowledge to extract</title>
          <p>
            semantic information between concepts. A general overview of how Wikipedia could
be used to extract concepts, relations, facts and descriptions can be found in [
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ]. Here,
the authors explain the use of Wikipedia for natural language processing, information
extraction and ontology building.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ], the authors describe a methodology that uses the links between categories to
mine specific relations. They analyze some measures to infer relations and try to
provide a semantic scheme in order to improve search capabilities and to give
users meaningful suggestions for editing articles. In the same context, in [
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] the authors
use Wikipedia to develop a methodology for the automatic annotation of different
semantic relations. This work is based on discovering lexical patterns that can be used
to recognize specific relations between concepts. They evaluate the methodology by
using a corpus and searching in it for the relations found in Wikipedia. Their results
show that this kind of methodology could be a good starting point for automatic
ontology construction.
          </p>
          <p>
            The research presented in [
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] shows how hyperlinked pages are used to generate a
domain hierarchy by means of ranking articles that are strongly linked. These articles
become a domain corpus for the automatic construction of an ontology. The same
goal of obtaining ontologies through Wikipedia is described in [
            <xref ref-type="bibr" rid="ref19">10</xref>
            ], where the authors
apply machine learning techniques to improve the performance of a system that mines
the infoboxes. Finally, in [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ] we can find another example of the use of Wikipedia
for ontology construction, specifically for document classification.
          </p>
        </sec>
        <sec id="sec-4-2-7">
          <title>This is not, and does not pretend to be, an extensive list of all the work done on</title>
          <p>semantic relation extraction or ontology construction from Wikipedia. Our main
purpose is to show both the interest that the area of extraction and
organization of semantic information has awakened, and some of the automatic analyses and
procedures that it is possible to develop taking into account Wikipedia’s structure.
Nevertheless, as we will see in this paper, this structure is often not well organized,
which makes it difficult to implement automatic processes.</p>
          <p>3 Towards an Ontology of Movies
In order to develop an ontology of movies we have defined three main steps that
lead us to our goal. The first one is to collect our input corpus from the Wikipedia
movie articles and to analyze the infobox structure in them. After that, the
second step is the delimitation and automatic extraction of specific semantic
information. Finally, as a third step we consider the implementation of the extracted
information in an XML schema that will form the basis for a later
annotation schema.</p>
          <p>3.1 Movies infobox structure
The first step of our methodology was to build a corpus from the articles in the
films by year category. We used the category tree option to find a list of the movie
titles from the year 1892 to 20082. After that, we used the export pages option to
retrieve all the articles on this list. We found a total of 5,561 articles, of which
the opening and closing infobox tags ({{Fields…}}) were present in 5,092 cases. This latter
number represents the total of articles in our corpus.</p>
        </sec>
        <sec id="sec-4-2-8">
          <title>After that, we analyzed the infobox of each entry. The infobox is a resource used in</title>
          <p>Wikipedia to summarize and group specific data in some
articles. In general terms, its purpose is to present the information in a more accessible
format, and it can be used as a resource by other applications.</p>
        </sec>
        <sec id="sec-4-2-9">
          <title>In the Spanish language, there are 49 proposed fields for the infobox, of which only two</title>
          <p>are considered required: film title and original title. The infobox is framed in
{{Fields…}}, and each field inside is preceded by a vertical bar “|” and followed
by an equal sign “=” and the specific information. Fields without descriptions
remain empty after the equal sign. This means a field has the following structure:
| Field = description of the field</p>
        </sec>
        <sec id="sec-4-2-10">
          <title>An example could be the following:</title>
          <p>| genre = Science fiction</p>
        </sec>
        <sec id="sec-4-2-11">
          <title>2 Data was collected in February 2009.</title>
        </sec>
        <sec id="sec-4-2-12">
          <title>The full set of fields used in the movie infoboxes from the Spanish Wikipedia can be found in Table 1.</title>
          <p>Table 1. Infobox template in Spanish. 
Fields
título original
título
índice
imagen
nombre imagen
dirección
dirección2
dirección3
dirección4
dirección5
dirección6
dirección7
dirección8
dirección9
ayudantedirección
dirección artistica
producción
diseño de producción
guión
música
sonido
edición
fotografía
montaje
vestuario
efectos
reparto
país
país2
país3
país4
estreno
estreno1
género
duración
clasificación
idioma
idioma2
idioma3
idioma4
productora
distribución
presupuesto
recaudación
precedida_por
sucedida_por
imdb
filmaffinity
sincat</p>
        </sec>
        <sec id="sec-4-2-13">
          <title>From the table above we can see the different kinds of information that the fields</title>
          <p>can introduce. We see information about dirección (direction), estreno (premiere),
idioma (language, language2, language3, etc.), as well as país (country, country2,
country3, etc.), IMDb (Internet Movie Database) or Filmaffinity links (external Web
sites with movie information).</p>
        </sec>
        <sec id="sec-4-2-14">
          <title>The 49 fields in this table are the ones suggested in the official Wikipedia movies infobox template. Nevertheless, in our corpus we found several empty fields. We automatically found a total of 94,584 field occurrences, of which 30,742 were empty (32.48% of the total).</title>
          <p>Furthermore, one of the problems present in the infoboxes is the lack of
standardization. Some of the elements established by Wikipedia are written
inconsistently by the authors of the articles, while others have typographical
errors. For example, the field dirección (direction) also appears as director (director);
the field título original (original title) can be found as título en España (title in Spain),
título principal (main title) or título traducido (translated title), among others. More
complicated is the case of estreno (premiere), which presents variations like año
(year), fecha (date), fecha de estreno (premiere date), or primera emisión (first
broadcast).</p>
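          <p>The kind of normalization this implies can be sketched as a simple synonym table mapping observed variants to canonical field names. The sketch below is ours and only includes the variants cited in the text; a real mapping would be compiled from a full review of the corpus.</p>

```python
# Hypothetical synonym table: observed variant (lowercase) -> canonical field.
# Only the variants mentioned in the text are included here.
CANONICAL = {
    "director": "dirección",
    "título en españa": "título original",
    "título principal": "título original",
    "título traducido": "título original",
    "año": "estreno",
    "fecha": "estreno",
    "fecha de estreno": "estreno",
    "primera emisión": "estreno",
}

def normalize_field(name: str) -> str:
    """Map a variant field name to its canonical form (identity if unknown)."""
    name = name.strip().lower()
    return CANONICAL.get(name, name)

print(normalize_field("Fecha de estreno"))  # estreno
print(normalize_field("género"))            # género (unchanged)
```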
        </sec>
        <sec id="sec-4-2-15">
          <title>Typos are another common form of non-standardization. For the field género (genre) we can find mistakes like *gènero, *genero or *genro.</title>
        </sec>
        <sec id="sec-4-2-16">
          <title>In the corpus we can also find other fields that are not proposed in the original schema, such as asistente de artes marciales (martial arts assistant), calificación (rating), premios (awards), Myspace, and so on. In total we found 205 non-official fields.</title>
        </sec>
        <sec id="sec-4-2-17">
          <title>If we compare the Spanish schema to the English one, we can notice that the latter infobox contains fewer fields, which probably allows it to be applied in a more standardized way in practice. The fields of the movies infobox in English can be seen in Table 2.</title>
          <p>Table 2. Infobox template in English 
Fields
name
image
image_size
caption
director
producer
writer
narrator
starring
music
cinematography
editing
studio
distributor
released
runtime
country
language
budget
gross
preceded_by
followed_by</p>
        </sec>
        <sec id="sec-4-2-18">
          <title>Here we can observe a total of 22 fields, compared to the 49 in the Spanish</title>
          <p>template. It is important to note that most other languages follow a
structure similar to the one described for English. There is a template similar to the
English movies infobox in the French Wikipedia, with some added elements like format,
awards and IMDb. In Italian, the infobox defines general fields for different
genres of films: generic, animation or film a episodi (films composed of several
short films), with specific fields for each genre; while in German, the fields specify
more generic data, i.e., title, original title, producer or cameraman.</p>
        </sec>
        <sec id="sec-4-2-19">
          <title>In the infoboxes of different languages, the most common fields are title, director and</title>
          <p>premiere. There are also coincidences in other fields, for example music and
photography. Between English and Spanish there is a coincidence in preceded_by and
followed_by. Furthermore, in Spanish, as well as in French, there is an IMDb field,
while Italian and English do not include it. However, in English, links to IMDb or
Allmovie can appear within the article as external links rather than inside the
infobox template. These external links are also valuable information for extending the
semantic data of an ontology, as they can add information about the films that
does not appear in Wikipedia, or be used to complete the empty fields of the
infoboxes. Nevertheless, there is also no consistency in the occurrence of the
tags with external links. In our corpus, the IMDb tag occurs in approximately 80%
of the articles, while Filmaffinity occurs in around 5%.
3.2 Extracting specific relation data</p>
        </sec>
        <sec id="sec-4-2-20">
          <title>In theory, the structure of the infoboxes contains information that should be</title>
          <p>exploitable with relative ease. We decided to automatically extract the title, original
title, director, premiere year and genre, in order to generate a database with all of this
information. However, not all of this information is present in all the movie articles
found in the films by year category.</p>
        </sec>
        <sec id="sec-4-2-21">
          <title>As we have mentioned before, there are some inconsistencies in the names of the fields, their completeness, or the way the authors write them. In the case of the</title>
          <p>director field, we found it with complete information in the 5,092
articles with infoboxes; however, the genre field occurs in only 4,499 of these cases.
Taking into account that the inconsistencies of the metadata make the process of
automatic relation extraction from the film information more difficult,
we managed to obtain the data through the process described below.</p>
        </sec>
        <sec id="sec-4-2-22">
          <title>From our corpus, we found that 5,092 articles contained at least one director,</title>
          <p>although the field name in many of them was not the same, and a review had to be
made in order to compile a list of ad hoc synonyms for searching for this specific field.
The synset was formed by dirección (direction), director (director) and dirigida
(directed). Also, the whitespace after the equal sign that should follow the name of
the field was not always the same: sometimes there were tabs, sometimes
more than one space, and sometimes no space at all. Many of the directors’
names are also entries in Wikipedia, so many users decided to establish links to their
names, using the symbol “[[” followed by the name of the director and closing with
“]]”. This has the purpose of telling the wiki engine that there is a link: [[link to
the article]]. But not all of them had those brackets, and this caused trouble when
parsing the data to recover the name of the director of the film associated
with the title of the entry.</p>
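          <p>A sketch of the kind of clean-up involved (ours, for illustration only): trimming the variable whitespace and unwrapping the optional [[...]] wikilink brackets around a director value. The sample names are invented.</p>

```python
import re

def clean_value(raw: str) -> str:
    """Normalize an infobox value: trim spaces/tabs and unwrap [[wikilinks]]."""
    value = raw.strip()  # handles tabs, multiple spaces, or no space at all
    # [[Fritz Lang]] -> Fritz Lang; [[target|label]] -> label
    value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", value)
    return value.strip()

print(clean_value("\t [[Fritz Lang]] "))  # Fritz Lang
print(clean_value("Fritz Lang"))          # Fritz Lang (no brackets, unchanged)
```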
        </sec>
        <sec id="sec-4-2-23">
          <title>The same problems were found when we tried to mine the original title of the</title>
          <p>movie. Despite the fact that this field appears in all the infoboxes, not all of its
occurrences contain information, which means that there are articles with the original
title field empty. It does not contain information in 195 articles in the corpus.</p>
        </sec>
        <sec id="sec-4-2-24">
          <title>With the premiere field it was also problematic to extract the information, because</title>
          <p>most of the films had different words to express the premiere year, for example año
(year), fecha de estreno (premiere date) or *añoacceso (access year). In this case we
decided to mine only the año (year) and estreno (premiere) variants, because of the
wide range of structural possibilities. We found that 23 film infoboxes do not contain
a premiere year: sometimes it was in the title and sometimes it was completely absent.</p>
        </sec>
        <sec id="sec-4-2-25">
          <title>Another field we exploited was género (genre), which also presents some inconsistencies that could be attributed to human error at the time of transcribing the template. This field was empty in 593 occurrences in our corpus and is the most often left unused.</title>
        </sec>
        <sec id="sec-4-2-26">
          <title>Summarizing, the number of occurrences of each field can be found in Table 3:</title>
          <p>Table 3. Numerical data found in the analysis of the infoboxes</p>
        </sec>
        <sec id="sec-4-2-27">
          <title>Field name</title>
          <p>Field name: occurrences / empty
Director: 5,092 / 0
Título: 5,092 / 0
Título ID: 5,092 / 0
Título original: 5,092 / 195
Año: 5,092 / 23
Género: 5,092 / 593</p>
        </sec>
        <sec id="sec-4-2-28">
          <title>From the table above we can see the three fields with empty information: premiere or year, original title and genre. The first one was empty in only 23 articles, while the</title>
          <p>last one in more than 500 cases. It is important to mention that the title of the movies
was not obtained from the infobox but directly from the XML provided by
Wikipedia, mainly because it is well demarcated by the labels &lt;title&gt; &lt;/title&gt;; in the
same way, we obtained the id used by Wikipedia to identify each article.</p>
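          <p>As an illustration of this step, the title and id of each page can be read directly from the export XML. This sketch is ours and assumes a simplified, namespace-free stand-in for the export file; the real Wikipedia export format uses an XML namespace that would need to be handled as well.</p>

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a Wikipedia export dump: one <page> per article.
sample = """<mediawiki>
  <page>
    <title>Metropolis (película)</title>
    <id>12345</id>
  </page>
</mediawiki>"""

root = ET.fromstring(sample)
for page in root.iter("page"):
    title = page.findtext("title")   # well demarcated by <title></title>
    page_id = page.findtext("id")    # Wikipedia's internal article id
    print(page_id, title)
```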
        </sec>
        <sec id="sec-4-2-29">
          <title>Despite the inconsistencies and typos that make the automatic process difficult, in 4,499 cases all the information that we were trying to mine was complete. We consider that this number represents a good starting point to form the basis of a first schema that could later be extended.</title>
          <p>3.3 Proposed XML schema
With the data from the infoboxes that were exploited, we decided to generate a first
XML scheme, which should give basic information about each film. This scheme can
be expanded as we extend our processes for extracting the information contained in the</p>
        </sec>
        <sec id="sec-4-2-30">
          <title>Wikipedia articles.</title>
        </sec>
        <sec id="sec-4-2-31">
          <title>To make this scheme, we decided to take the director field as the root XML tag. The</title>
          <p>first tag consists of the director’s name. Taking into account that directors can
have more than one film, we decided to introduce a filmography tag to group them.
This last tag includes each film with title, original title, year and genre tags. In
the opening film tag we added an attribute with Wikipedia’s title id number. An
example of the schema can be seen below.</p>
          <p>Proposed XML schema for the organization of movie data in the Spanish Wikipedia.</p>
          <p>As we can see in this example, the root tag is &lt;director&gt;&lt;/director&gt;. It is
followed by the director’s name tag &lt;name&gt;&lt;/name&gt;. At the same level there is the
tag &lt;filmography&gt;&lt;/filmography&gt;. This tag nests the film tag &lt;film
wiki_id=“”&gt;&lt;/film&gt;, which contains the relevant information of each film:
&lt;title&gt;&lt;/title&gt;, &lt;original_title&gt;&lt;/original_title&gt;, &lt;year&gt;&lt;/year&gt; and &lt;genre&gt;
&lt;/genre&gt;.</p>
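          <p>The nesting described above can be made concrete with a small sketch (ours; the film data and the wiki_id value are invented for illustration):</p>

```python
import xml.etree.ElementTree as ET

# Build one <director> record following the proposed scheme:
# director -> name, filmography -> film[@wiki_id] -> title, original_title, year, genre.
director = ET.Element("director")
ET.SubElement(director, "name").text = "Fritz Lang"

filmography = ET.SubElement(director, "filmography")
film = ET.SubElement(filmography, "film", wiki_id="12345")  # Wikipedia title id
ET.SubElement(film, "title").text = "Metrópolis"
ET.SubElement(film, "original_title").text = "Metropolis"
ET.SubElement(film, "year").text = "1927"
ET.SubElement(film, "genre").text = "Science fiction"

print(ET.tostring(director, encoding="unicode"))
```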
        </sec>
        <sec id="sec-4-2-32">
          <title>Based on the XML scheme, relational databases can be generated to manipulate the</title>
          <p>information that we have considered at this first stage of the ontology construction.
As we have said, this is not the final scheme, because as more data is extracted,
more can be added. This scheme is currently based on the Wikipedia film articles in
Spanish; however, it can be extended to fit other kinds of relevant
information, for example the country, external links (IMDb) or the Wikipedia ids of
directors or genres. Furthermore, it will be possible to use this scheme to
exploit Wikipedia in other languages, which could make it possible to fill the empty
fields in one language by relating them to the information in another language, as
well as to make multilingual queries.
4 Conclusions and future work
Nowadays, Wikipedia can be explored with the aim of obtaining information in
different ways. The information added manually by the users is generally well
organized and semi-structured. Also, many entries in Wikipedia have infoboxes
with summarized specific information about the theme treated in the article. We have
mentioned that the structure of Wikipedia has made it possible to exploit this
information in order to extract semantic data. The extraction of semantic relations is
one of the growing interests aimed at the construction of the Semantic Web.</p>
        </sec>
        <sec id="sec-4-2-33">
          <title>Even with this structure of Wikipedia, we have noticed some specific problems in</title>
          <p>automatically exploiting it. To summarize, these are: a) the fact that the field
names are not respected; b) typos due to human error; c) lack of information; and d)
differences in the infobox structure between languages. The latter should not be seen
as a problem; however, it would be advantageous to have standard fields across
different languages.</p>
        </sec>
        <sec id="sec-4-2-34">
          <title>Aiming at this standardization idea, it would be useful if Wikipedia’s process</title>
          <p>of writing or editing an article used a check-bot to confirm the information in the
infobox templates. Thus, fields not belonging to the template would be flagged,
as well as typos in the field names. Furthermore, the same check-bot could be used to
examine the existing fields, looking for inconsistencies in the infoboxes or the whole
articles.</p>
          <p>The work that we have presented here is a first approach towards the elaboration of
an ontology of movies from the Wikipedia in Spanish. We have shown the kinds of
semantic relations that it is possible to mine, as well as a first scheme to represent
them. We are aware that this scheme may well be improved towards achieving a
complete ontology of movies. Future work will include: a) defining a scheme to
represent subjects, relations and predicates between the extracted information, for
example an RDF scheme; b) implementing this new scheme to make the information
available and share it with systems dedicated to the construction of the Semantic
Web; and c) developing a movie-ontology query system capable of retrieving
information in specific ways related to the director, title, genre and year fields.
Acknowledgments
This research was made possible by the financial support of CONACYT (82050) and
DGAPA-PAPIIT (IN403108). The authors wish to thank Sarahi Abrego Romero for
the proofreading of this paper.
3.1 K-means Algorithm Advantages.</p>
          <p>
            MacQueen J. [
            <xref ref-type="bibr" rid="ref34">24</xref>
            ], the author of one of the initial k-means algorithms and the most
frequently cited, states:
          </p>
          <p>The process, which is called “k-means”, appears to give partitions which are
reasonably efficient in the sense of within-class variance, corroborated to some
extent by mathematical analysis and practical experience. Also, the k-means
procedure is easily programmed and is computationally economical, so that it is
feasible to process very large samples on a digital computer.</p>
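          <p>For reference, the procedure MacQueen describes is short enough to sketch in a few lines. The following is a plain NumPy version with random initial centroids (ours, for illustration; it is not any of the refined variants discussed below, and the sample data is invented):</p>

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared distance of every point to every centroid, nearest assignment
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # converged (to a local optimum)
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs: plain k-means recovers them easily.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
```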
          <p>
            Likewise, [
            <xref ref-type="bibr" rid="ref49">39</xref>
            ] summarizes the benefits of k-means in the introduction to his work:
The k-means algorithm is one of the first that a data analyst will use to investigate a
new data set, because it is algorithmically simple, relatively robust and gives “good
enough” answers over a wide variety of data sets.
3.2 K-means Algorithm Shortcomings.
          </p>
          <p>
            Taking the k-means shortcomings identified in [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ] as a framework, and as an extension and update of that analysis, the following presents
the result of the previously cited analysis in a series of tables, grouped by category of
work that arose as an extension of k-means or as a possible solution to one or more of
the limitations identified above.
3.2.1 The algorithm’s sensitivity to initial conditions: the number of partitions,
the initial centroids.
          </p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], there is no universal and efficient method to identify the initial patterns
and the number k of clusters. In [
            <xref ref-type="bibr" rid="ref50">40</xref>
            ] the sensitivity of the algorithm to the allocation of the initial
centroids is briefly discussed: in practice, the usual method is to
test iteratively with random allocations to find the best allocation in terms of
minimizing the total squared distance. However, there have been various
investigations making proposals related to these limitations:
          </p>
          <p>Authors / Title and Commentary:</p>
          <p>[<xref ref-type="bibr" rid="ref55">45</xref>] Zhang Chen and Xia Shixiong. “K-means Clustering Algorithm with improved initial Center.”
It avoids the initial random assignment of centers, using a strategy called
“sub-merger”.</p>
          <p>[<xref ref-type="bibr" rid="ref11 ref2">2</xref>] B. Bahmani Firouzi, T. Niknam and M. Nayeripour. “A New Evolutionary Algorithm
for Cluster Analysis”. It does not depend on the initial centers. The PSO-SA-K
algorithm combines the algorithms Particle Swarm Optimization (PSO),
Simulated Annealing (SA) and k-means.</p>
          <p>[<xref ref-type="bibr" rid="ref49">39</xref>] Barbakh Wesam and Colin Fyfe. “Local vs global interactions in clustering algorithms:
Advances over K-means.” It focuses on the algorithm’s sensitivity to initial
conditions and incorporates global information in the performance function.
It defines three new algorithms: Weighted K-means (WK), Inverse Weighted
K-means (IWK) and Inverse Exponential K-means (IEK).</p>
          <p>[<xref ref-type="bibr" rid="ref29">20</xref>] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to
Cluster Analysis. Textbook; defines k-means.</p>
          <p>[<xref ref-type="bibr" rid="ref12 ref3">3</xref>] G. Ball and D. Hall. “A clustering technique for summarizing multivariate data”
(ISODATA). It performs dynamic estimation of K.</p>
          <p>“A hybridized approach to data clustering” (bioinformatics draft). A hybrid
technique called K-NM-PSO, based on k-means, Nelder-Mead simplex search
and particle swarm optimization.</p>
          <p>“Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data
Partitioning along the Data Axis with the Highest Variance.” The title is explicit.</p>
          <p>“A method for initialising the K-means clustering algorithm using kd-trees”.
A kd-tree is used to estimate the density of the data and to select the number
of clusters.</p>
          <p>“Analysis of Global k-means, an Incremental Heuristic for Minimum Sum of
Squares Clustering”. Commentary on the work in [<xref ref-type="bibr" rid="ref31">22</xref>].</p>
          <p>“Selection of K in K-means clustering”. It proposes a measure to select the
reference number of clusters.</p>
          <p>“The Global K-means Clustering Algorithm.” An algorithm that consists of a
series of k-means clusterings with the number of clusters varying from 1 to k.
It argues that it is independent of initial partitions and accelerates the k-means
computations.</p>
          <p>“An empirical comparison of four initialization methods for the k-means
algorithm.” Compares initialization methods for k-means: random,
[<xref ref-type="bibr" rid="ref21">12</xref>], [<xref ref-type="bibr" rid="ref29">20</xref>] and [<xref ref-type="bibr" rid="ref34">24</xref>].</p>
          <p>“Refining initial points for k-means clustering”. Uses k-means M times on M
random subsets of the original data.</p>
          <p>3.2.2 The convergence of the algorithm to a local optimum rather than a global
optimum.</p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref34">24</xref>
            ], the iterative procedure of k-means cannot guarantee convergence
to a global optimum, although his work cites some research on special cases.
Currently, there are several developments that analyze and/or propose solutions to
this constraint:
          </p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref49">39</xref>
            ] Wesam Barbakh and
Colin Fyfe.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] Joaquín Pérez O.,
Rodolfo Pazos R., Laura
Cruz R., Gerardo Reyes S.,
Rosy Basave T. and Héctor
Fraire H.
[
            <xref ref-type="bibr" rid="ref54">44</xref>
            ] Z. Zhang, B. Tian Dai
and A.K.H. Tung.
          </p>
          <p>Title and Commentary:
“Local vs global interactions in clustering algorithms: Advances
over K-means.” Addresses the algorithm's sensitivity to initial
conditions by incorporating global information into the performance
function. Defines three new algorithms: Weighted K-means (WK),
Inverse Weighted K-means (IWK) and Inverse Exponential
K-means (IEK).
“Improvement the Efficiency and Efficacy of the K-means
Clustering Algorithm through a New Convergence Condition”.</p>
          <p>
            Improvement to the k-means algorithm by new convergence
conditions. Experimentally analyze the local convergence of
kmeans.
“On the Lower Bound of Local Optimums in K-means
Algorithm.” Estimate lower limit for local optimum.
[
            <xref ref-type="bibr" rid="ref34">24</xref>
            ] MacQUEEN J.
          </p>
          <p>“Genetic K-means algorithm.” A hybrid scheme based on a genetic
algorithm and simulated annealing, with new operators to perform
global search and achieve rapid convergence.
“Some Methods for Classification and Analysis of Multivariate
Observations.” Definition, analysis and applications of
k-means.
3.2.3 The efficiency of the algorithm.</p>
          <p>
            According to the work of [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], the complexity of the k-means algorithm is O(ndk),
involving the sample size, the number of dimensions and the number of
partitions. Several works have focused on different aspects of the
algorithm in order to reduce its computational load.
          </p>
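The O(ndk) claim can be made concrete with a back-of-the-envelope count (an illustrative Python sketch of our own, using the Iris dimensions that appear later in this paper): each of the n points is compared against each of the k centroids, and each comparison touches all d dimensions.

```python
# Rough per-iteration operation count of Lloyd's k-means: O(n*d*k).
def lloyd_iteration_cost(n: int, d: int, k: int) -> int:
    """Distance evaluations per iteration times work per evaluation."""
    distance_evaluations = n * k   # every point vs. every centroid
    work_per_evaluation = d        # one squared difference per dimension
    return distance_evaluations * work_per_evaluation

# Iris-sized problem: n=150 points, d=4 attributes, k=3 clusters
print(lloyd_iteration_cost(150, 4, 3))   # 1800 dimension-level operations
```

Doubling any one of n, d or k doubles the per-iteration cost, which is why the acceleration works listed above target the distance computations.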
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] Moh'd Belal Al-Zoubi,
Amjad Hudaib, Ammar
Huneiti and Bassam Hammo.
[
            <xref ref-type="bibr" rid="ref53">43</xref>
            ] Zalik, Krista Rizman.
[
            <xref ref-type="bibr" rid="ref36">26</xref>
            ] Cao D. Nguyen &amp; Cios,
Krzysztof J.
[
            <xref ref-type="bibr" rid="ref22">13</xref>
            ] G. Frahling &amp; Ch.
Sohler.
[
            <xref ref-type="bibr" rid="ref45">35</xref>
            ] Taoying Li &amp; Yan Chen.
[
            <xref ref-type="bibr" rid="ref27">18</xref>
            ] Kashima, H.; Hu, J.;
Ray, B.; Singh, M.
[
            <xref ref-type="bibr" rid="ref46">36</xref>
            ] Tsai, Chieh-Yuan; Chiu,
Chuang-Cheng.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] Joaquín Pérez O.,
Rodolfo Pazos R., Laura Cruz
R., Gerardo Reyes S., Rosy
Basave T. and Héctor Fraire H.
[
            <xref ref-type="bibr" rid="ref40">30</xref>
            ] J. Pérez, M.F. Henriques,
R. Pazos, L. Cruz, G. Reyes,
J. Salinas, A. Mexicano.
          </p>
          <p>Title and Commentary:
“New Efficient Strategy to Accelerate k-Means Clustering
Algorithm.” A strategy to accelerate the k-means algorithm that
avoids many distance calculations, based
on an improvement to the partial distance (PD) algorithm.
“An Efficient k`-means Clustering Algorithm.”
Based on the Rival Penalized Competitive Learning (RPCL)
algorithm. It does not require pre-allocation of the
number of clusters. A two-step process: it preprocesses and uses
the prior information to minimize the cost function.
“GAKREM: A novel hybrid clustering algorithm.”
Eliminates the need to specify the number of clusters a priori.</p>
          <p>
            Combines genetic algorithms, logarithmic regression and
expectation maximization.
“A Fast k-means implementation using coresets.”
An implementation of Lloyd's k-means [
            <xref ref-type="bibr" rid="ref32">23</xref>
            ], using a
weighted set of points that approximates the original set.
“An improved k-means algorithm for clustering using entropy
weighting measures.” Improves the algorithm by
introducing a variable into the cost function.
“K-means clustering of proportional data using L1 distance”.
          </p>
          <p>K-means based on the L1 distance, with proportionality constraints
incorporated into the calculation of centroids.
“Developing a feature weight self-adjustment mechanism for a
K-means clustering algorithm.” Improves the quality of
k-means clustering via a feature weight self-adjustment (FWSA)
mechanism, modeled as an optimization
problem.
“Improvement the Efficiency and Efficacy of the K-means
Clustering Algorithm through a New Convergence Condition”.</p>
          <p>
            An improvement to the k-means algorithm via new convergence
conditions; experimentally analyzes the local convergence of
k-means.
“Improvement of the K-means algorithm using a new
approach of convergence and its application to databases of
cancer population.” The title is explicit.
[
            <xref ref-type="bibr" rid="ref43">33</xref>
            ] Pun, W.K.D., Ali, A.S.
[
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] Zejin Ding, Jian Yu,
Yan-Qing Zhang.
[
            <xref ref-type="bibr" rid="ref26">17</xref>
            ] Kanungo, T., Mount,
D.M., Netanyahu, N.S.,
Piatko, C.D., Silverman, R.,
Wu, A.Y.
“Unique distance measure approach for K-means
(UDMAKm) clustering algorithm.” Establishes a distance measure
based on statistical data.
“A New Improved K-Means Algorithm with Penalized
Term.” Defines a new objective function and minimizes it with a
genetic algorithm.
“An Efficient K-means Clustering Algorithm: Analysis and
Implementation.” Presents an implementation of the version of
Lloyd's k-means [
            <xref ref-type="bibr" rid="ref32">23</xref>
            ] called the "filtering algorithm", based on a
kd-tree.
3.2.4 K-means is sensitive to outliers and noise.
          </p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], even if an object is quite far away from the cluster centroid, it is
still forced into a cluster and thus distorts the cluster shape. The following works
focus on this shortcoming:
          </p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] Asgharbeygi, N.,
Maleki, A.
[
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] V. Estivill-Castro and J. Yang.
[
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] G. Ball and D. Hall.
          </p>
          <p>Title and Commentary:
“Geodesic K-means clustering.” Extends k-means by using a geodesic
distance metric; the algorithm ensures resistance to outliers.
“A fast and robust general purpose clustering algorithm.”
Eliminates the effect of outliers through a process that considers real
points as centroids.
“A clustering technique for summarizing multivariate
data” (ISODATA). Performs dynamic estimation of K and considers the
effect of outliers in the clustering process.
3.2.5 The definition of “means” limits the application only to numerical
variables.</p>
          <p>Several works have been developed that extend the application of
k-means to categorical or other variables:</p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref48">38</xref>
            ] Song, Wei; Li,
Cheng Hua; Park,
Soon Cheol.
[
            <xref ref-type="bibr" rid="ref23">14</xref>
            ] S. Gupta, K.
Rao &amp; Bhatnagar.
[
            <xref ref-type="bibr" rid="ref25">16</xref>
            ] Z. Huang.
          </p>
          <p>Title and Commentary:
“Genetic Algorithm for text clustering using ontology and evaluating
the validity of various semantic similarity measures.” Improves the
k-means algorithm by using a genetic algorithm that finds conceptual
similarities. Based on an ontology and thesaurus corpus for clustering of text
fields.
“K-means clustering algorithm for categorical attributes.” The title is explicit.
“Extensions to the k-means algorithm for clustering large data sets with
categorical values.” The title is explicit.
4 The Algorithm k-means on Matlab.</p>
          <p>
            Experimental tests of k-means were conducted in Matlab [
            <xref ref-type="bibr" rid="ref35">25</xref>
            ]. Matlab
(Matrix Laboratory) is both an environment and a programming language for
numerical computation with vectors and matrices. It is a product of the company The
MathWorks Inc. (Natick, MA) [
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ]. The k-means clustering algorithm is available through the
following MATLAB function:
[IDX, C, SUMD, D] = KMEANS(X, K)
This function partitions the points in the N-by-P data matrix X into K clusters. The
partition minimizes the sum, over all clusters, of the within-cluster sums of
point-to-cluster-centroid distances. Rows of X correspond to points, columns correspond to
variables. KMEANS returns an N-by-1 vector IDX containing the cluster index of
each point. By default, KMEANS uses squared Euclidean distances. The K cluster
centroids are returned in the K-by-P matrix C, the within-cluster sums of point-to-centroid
distances in the 1-by-K vector SUMD, and the distances from each point to every centroid in
the N-by-K matrix D. Optional parameters may specify the distance measure,
the method used to choose the initial cluster centroid positions, and the display of information.
5 Test Results of K-means in Matlab.
          </p>
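The quantities returned by the Matlab call can be reproduced with a minimal numpy sketch (our own illustration of Lloyd's iteration, not the MathWorks implementation; the sample-points initialization, fixed seed and iteration cap are simplifying assumptions):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means with Matlab-like outputs:
    idx : (N,)  cluster index of each point
    C   : (K,P) cluster centroids
    sumd: (K,)  within-cluster sums of squared distances
    D   : (N,K) squared distance from each point to each centroid
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 'sample'-style start: K distinct data points as initial centroids
    C = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distances, N-by-K
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        idx = D.argmin(axis=1)
        # each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        newC = np.array([X[idx == j].mean(axis=0) if np.any(idx == j) else C[j]
                         for j in range(K)])
        if np.allclose(newC, C):   # centroids stopped moving: converged
            break
        C = newC
    D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    idx = D.argmin(axis=1)
    sumd = np.array([D[idx == j, j].sum() for j in range(K)])
    return idx, C, sumd, D
```

On two well-separated blobs, for instance, `kmeans(X, 2)` assigns each blob its own label, and `sumd` measures the within-cluster dispersion that the algorithm minimizes.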
          <p>
            Tests of k-means in Matlab used the well-known UCI Machine Learning Repository
[
            <xref ref-type="bibr" rid="ref47">37</xref>
            ]. The UCI Machine Learning Repository [
            <xref ref-type="bibr" rid="ref47">37</xref>
            ] is, among other things, a collection
of databases which is widely used by the Machine Learning research
community, especially for the empirical analysis of algorithms in this discipline.
          </p>
          <p>
            Fig. 1 Representation of the Iris Data Set
For the experimental tests carried out, the following data sets were used: Iris, Glass
and Wine. This report presents the results for the Iris data set.
The Iris Data Set is a database of types of Iris plants, with No. of instances: 150
(50 in each class), No. of attributes: 4 (sepal length, sepal width, petal length, petal
width) and No. of classes: 3 (Iris setosa, Iris versicolor, Iris virginica). One class is
linearly separable from the other two; the latter are NOT linearly separable from each
other. Based on the data and classes defined in [
            <xref ref-type="bibr" rid="ref47">37</xref>
            ] and [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], Fig. 1 shows the Iris data;
for illustrative purposes only the attributes sepal length, petal length
and petal width are considered.
          </p>
          <p>Test I4: &gt;&gt; [u,v,sumd,D]= kmeans(z,3,'display','iter');
“Test I4” is an example of the test results in Matlab of k-means for the Iris data
set. The iter column gives the iteration number, phase indicates the algorithm
phase, num gives the number of exchanged points, sum gives the total sum of
distances, and inter% is the percentage of exchanged points in each iteration.</p>
          <p>Fig. 2 % of Exchanged Points.
Fig. 2 corresponds to test I4 and gives the graphical representation of the exchange
behavior at each iteration. Likewise, Fig. 3 represents the behavior of the total sum of
distances for the same test.
Table 1 Summary of Results for the Iris Data Set
5.1 Summary of Results for the Iris data set.</p>
          <p>With regard to the exchanges between groups that the algorithm makes in the tests
conducted, it was observed that the most significant changes occurred from the first to the
second iteration. In all cases, in the first step all points are located (100%); the third
column gives the number of points exchanged in the second iteration, and the
fourth column the percentage difference in the number of items exchanged between
the first and second iterations.</p>
          <p>According to the results in Table 1:
It can be seen that for the 150 points of the Iris database, over a set of 25 tests, the k-means
algorithm in Matlab:
Converges in an average of 7.2 iterations.</p>
          <p>The average number of points exchanged during the second iteration was 18.84.
The percentage of points placed in their corresponding group by the second iteration
was 91.0%.
6 Conclusions.</p>
          <p>The results of the analysis of our sample of works allow us to establish a framework for
the theoretical study of the k-means algorithm. We also identify and
distinguish the different lines along which there is still a fertile field for investigation.
As we can see, several attempts at overcoming the shortcomings of the k-means
algorithm have been made, and different approaches from different disciplines have been
proposed: optimization, probability and statistics, neural networks, evolutionary
algorithms, among others. The vast majority of contributions have focused on the
first three lines of research identified in this study: the sensitivity of the algorithm to
initial conditions, the convergence of the algorithm to a local optimum rather than a
global optimum, and the efficiency of the algorithm. Challenges remain to be
resolved in such research, and relatively little work has been done on the lines related
to the extension of the algorithm to other variable types and to the treatment of
outliers and noise.</p>
          <p>
            According to the tests conducted in Matlab, this laboratory proved to be very
conducive to experimental testing. Its implementation of k-means allows monitoring
of the performance of the algorithm through the information that can be displayed at
runtime, such as the value of the objective function and the number of points exchanged
in each iteration. The results allow us to establish a framework for comparing a
proposed improved algorithm with previous work. As part of this project, and
to give continuity to previous work [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] [
            <xref ref-type="bibr" rid="ref40">30</xref>
            ], we also venture into different applications
of k-means, such as in the areas of health care in Mexico and in Web Usage Mining
for log files from the server of the Faculty of Computer Science, BUAP, México.
Image Classification by Texture Segmentation using
          </p>
          <p>GAF-SVM
Sergio Manuel Dorantes, Manuel Martín Ortiz, María J. Somodevilla, Jesús
Lavalle Martínez, Ivo H. Pineda Torres</p>
          <p>Facultad de Ciencias de la Computación, BUAP
sergiomanuel@hotmail.com, {mmartin, mariasg, jlavalle, ipineda}@cs.buap.mx
Abstract. Due to the amount of visual information that currently exists,
there is a need to classify it properly. In this paper we present an
alternative dual method for image categorization according to texture
content, called GAF-SVM; this method is based on the use of Gabor
Filters (GAF) and Support Vector Machines (SVM). To perform the image
classification we rely on filtering techniques for feature extraction combined
with statistical learning techniques to perform the data separation. The
experiments were carried out on a set of images containing coastal
beach scenes and a set of images containing city scenes. A feature vector is
obtained by applying a bank of Gabor Filters to the input images; the
output feature space is then used as input to the SVM classifier. The
Support Vector Machine is responsible for learning a model that is capable
of separating the sets of input images. Experimental results demonstrate
the effectiveness of the proposed dual method, achieving an error
classification rate of near 9%.</p>
          <p>
            I. INTRODUCTION
The proposal of an alternative method of image classification requires an analysis of
the methods presented so far in the area. Extracting visual information from
an image to obtain its most important features is essential for classification tasks;
over the years various approaches have been presented in this field of study,
such as color histograms, region-based classification and gray-level values of raw
pixels; one solution has been to incorporate texture analysis as the main
feature descriptor. This is largely due to the fact that most surfaces in images contain
some kind of texture. In recent years, texture analysis has been used for object
recognition, image interpretation, image segmentation and classification [
            <xref ref-type="bibr" rid="ref1 ref10 ref11 ref15 ref17 ref18 ref19 ref2 ref6 ref8 ref9">1, 2, 6, 8, 9,
10</xref>
            ].
          </p>
          <p>
            In recent papers such as [
            <xref ref-type="bibr" rid="ref13 ref15 ref16 ref19 ref20 ref21 ref4 ref6 ref7">4, 6, 7, 10, 11, 12</xref>
            ], texture has been studied in an isolated
manner to evaluate the performance of the proposed algorithms; in some cases
artificial textures have been used, which limits the application area of these methods.
Textures are used by the human visual system to separate different objects within
scenes, as well as for surface analysis [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ]. Texture can be recognized as irradiation
patterns that are perceptually uniform. Textures can be explained as an efficient
measure to estimate the structural differences of orientation, roughness, smoothness
or regularity between different regions of an image [
            <xref ref-type="bibr" rid="ref23">14</xref>
            ].
          </p>
          <p>
            However, producing a formal definition of what a texture really is becomes a subjective
matter. As mentioned in [
            <xref ref-type="bibr" rid="ref22">13</xref>
            ], the definition of texture depends on the purpose
for which it is being used; some definitions are outlined below:
1. The basic pattern and repetition frequency of a texture sample could be
perceptually invisible, although quantitatively present. In the deterministic
formulation, texture is considered as a basic local pattern that is periodically
repeated over some area.
2. An image texture may be defined as a local arrangement of image irradiances
projected from a surface patch of perceptually homogeneous irradiances.
3. Texture is characterized not only by the grey value at a given pixel, but also by the
grey value ‘pattern’ in a neighborhood surrounding the pixel.
          </p>
          <p>Our proposal is based on the use of natural textures in real-world images; for that
reason the classification model must deal with more complex images under natural
conditions.</p>
          <p>
            The 2-D Gabor filters (2D-GF) have certain properties that make them suitable for
textural identification in many ways: 2D-GF have tunable orientation and radial
frequency bandwidths, tunable center frequencies, and optimally achieve joint
resolution in space and spatial frequency. The demodulated Gabor channel envelopes
generally contain only low spatial frequencies which are optimally localized in both
domains [
            <xref ref-type="bibr" rid="ref25">16</xref>
            ].
          </p>
          <p>
            Gabor filter based methods have been successfully applied to a variety of machine
vision applications, such as texture segmentation [
            <xref ref-type="bibr" rid="ref19 ref20 ref21 ref24 ref25 ref27">10, 11, 12, 15, 16, 18</xref>
            ], texture
classification [
            <xref ref-type="bibr" rid="ref18 ref22 ref28 ref9">9, 13, 19</xref>
            ], iris recognition [
            <xref ref-type="bibr" rid="ref30 ref31 ref32">21, 22, 23</xref>
            ], on-road vehicle detection [
            <xref ref-type="bibr" rid="ref26">17</xref>
            ],
fingerprint classification [
            <xref ref-type="bibr" rid="ref29">20</xref>
            ], and as mentioned in [
            <xref ref-type="bibr" rid="ref24">15</xref>
            ] edge detection, object
detection, image representation, and recognition of handwritten numerals.
          </p>
          <p>This paper is organized as follows: Section II reviews the related work on which
this article is based; Section III gives a detailed description of the
proposed method; Section IV explains how the input data is
processed, as well as the selection of the Gabor filter parameters; Section V details
the SVM classifier parameters; and Section VI presents the experimental results.
II. RELATED WORK
The classification of images has been studied from various approaches, mostly
through the combination of methods: one for texture extraction and one for the
classification process.</p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] the use of Gabor filters as a texture extraction method is emphasized, and
classification is performed with the maximum likelihood method for the classification of
aerial and satellite digital images. In [
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] a method of image classification is proposed
that uses the color histogram as the image representation and
the Support Vector Machine as the classification method. In [
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] no external feature extractor is used; instead
the SVM classifier receives the grey-level values of each pixel of the image, aiming to
prove that SVM can implement feature extraction methods within its architecture; this
method is computationally expensive due to the number of regions that can define an
image. Another approach is presented in [
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ], where a modification of the SVM is
used for the identification of regions among a group of images. In [
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ] the SVM is
combined with the Discrete Wavelet Frame Transform for the classification of images
from the Brodatz album. In [
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] the wavelet transform known as the
pyramid-structured wavelet transform is used as a feature extractor together with SVM as the
classification method. In [
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] a method combining the Gaussian Mixture Model
with Independent Component Analysis (ICA) is proposed to perform image
classification, called the ICA Mixture Model.
          </p>
          <p>The first step of the proposed method is to extract the texture features
with a bank of Gabor filters applied to each input image, and then take the filters'
outputs to form a training dataset to feed the SVM classifier.</p>
          <p>
            III. PROPOSED METHOD
In order to accomplish the image classification, we rely on filter-based techniques to
perform texture feature extraction, combined with statistical learning theory techniques to
achieve the image data separation. Gabor filters were selected to extract texture
features from images due to their resemblance to the human visual system [
            <xref ref-type="bibr" rid="ref22">13</xref>
            ].
          </p>
          <p>
            A. Gabor Filters
A number of authors have used a bank of filters to extract local image features [
            <xref ref-type="bibr" rid="ref19 ref20 ref25 ref28">10,
11, 16, 19</xref>
            ]. Different authors have used different sets of Gabor filters, from the spatial domain
to the frequency domain.
          </p>
          <p>A 2-D Gabor filter is a linear filter whose impulse response is defined by a
harmonic function multiplied by a Gaussian function. In the spatial domain it can be
defined as follows:
ψ(x, y) = (f²/(πγη)) e^(−((f²/γ²)x′² + (f²/η²)y′²)) ⋅ e^(j2πf x′)
x′ = x cosθ + y sinθ
y′ = −x sinθ + y cosθ
(1)</p>
          <p>Where f is the central frequency of the filter, θ the rotation angle of the Gaussian
major axis and the plane wave, γ the sharpness along the major axis, and η the
sharpness along the minor axis (perpendicular to the wave). In the given form, the
aspect ratio of the Gaussian is λ = η/γ. These four parameters (f, θ, γ, η) define the shape
of the filter, and by changing them we can detect different textures.</p>
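Equation (1) can be sketched directly in numpy (our own illustration, not the authors' code; the kernel size and the (f, θ) values below are arbitrary examples). A bank in the spirit of this paper is obtained by evaluating the same function over several frequency/orientation pairs:

```python
import numpy as np

def gabor_kernel(size, f, theta, gamma=1.0, eta=1.0):
    """Complex 2-D Gabor impulse response of equation (1):
    psi(x, y) = f^2/(pi*gamma*eta)
                * exp(-((f/gamma)^2 x'^2 + (f/eta)^2 y'^2))
                * exp(j*2*pi*f*x')
    with rotated coordinates x' = x cos(theta) + y sin(theta),
                             y' = -x sin(theta) + y cos(theta)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)    # x' axis (wave direction)
    yp = -x * np.sin(theta) + y * np.cos(theta)   # y' axis
    envelope = np.exp(-((f / gamma) ** 2 * xp ** 2 + (f / eta) ** 2 * yp ** 2))
    carrier = np.exp(2j * np.pi * f * xp)         # complex plane wave
    return (f ** 2 / (np.pi * gamma * eta)) * envelope * carrier

# A small bank: 2 frequencies x 3 orientations = 6 filters,
# matching the maximum bank size used later in this paper
bank = [gabor_kernel(31, f, th)
        for f in (0.1, 0.2)
        for th in (0.0, np.pi / 3, 2 * np.pi / 3)]
```

At the kernel center (x′ = y′ = 0) both exponentials equal 1, so the response reduces to the normalization constant f²/(πγη), which is a quick sanity check on any implementation.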
          <p>The normalized 2-D Gabor filter function also has an analytical form in the frequency
domain, expressed in the rotated frequency coordinates:
u′ = u cosθ + v sinθ
v′ = −u sinθ + v cosθ</p>
          <p>
            Filter Design vs. Filter Bank
There are two approaches to the implementation of Gabor filters: on the one hand
the filter bank approach, and on the other hand the filter design approach [
            <xref ref-type="bibr" rid="ref35">25</xref>
            ]. In the
first, a bank of filters is formed by grouping multiple filters tuned to different
frequencies and different orientations. The choice of parameter settings depends
on the type of texture to be analyzed. The difficulty of the filter bank approach
lies in the fact that its parameters are established ad hoc and are not optimal for a
specific processing task. One of the goals of this work consists in presenting results
that help to specify such parameters. Furthermore, if the bank handles many
frequencies and orientations, the result is a large bank with many filters, which
translates into a large number of convolutions. The filter design approach focuses on
designing one or a few filters for a particular application, in an effort to reduce the
difficulty posed by the filter bank and also to reduce the dimensionality of the
output as well as the processing cost. The disadvantage of this approach lies in the
limitation to the tasks for which it was designed. When working with a single filter it
is possible that some of the textures in the images are not identified or detected, as the
filter has only a narrow capacity to detect local texture features.
          </p>
          <p>
            A filter bank allows the analysis of an image in a single pass at several
frequencies and in several orientations at once. Given the characteristics of our
model, the use of a filter bank is the deployment solution of choice; although it could
mean an increase in computational processing, this increase is not significant. The design
of a Gabor filter bank consists, in general, in selecting, for each filter, the
proper values of the following parameters: frequency, orientation, γ and η, the last two
parameters being known as the smoothing parameters [
            <xref ref-type="bibr" rid="ref36">26</xref>
            ].
          </p>
          <p>
            In this research a bank is defined with up to 3 orientations and up to 2
frequencies, resulting in a bank with at most 6 output filters, allowing us to
accurately detect a texture among a large set of images. This decision was based
on the studies presented in [
            <xref ref-type="bibr" rid="ref36">26</xref>
            ], which compare various parameter selection
approaches and summarize some parameter values adopted in the literature.
          </p>
          <p>
            Using many different orientations and scales (frequencies) ensures invariance:
objects and some textures can be recognized at various orientations, scales
and translations [
            <xref ref-type="bibr" rid="ref37">27</xref>
            ].
          </p>
          <p>
            C. Support Vector Machines
Support Vector Machines (SVM) were introduced by Vapnik as a powerful learning
tool based on statistical learning theory. A Support Vector Machine is a binary
classifier that makes its decision by constructing a linear decision boundary, or
hyperplane, that optimally separates the data points of the two classes in feature
space while maximizing the margin [
            <xref ref-type="bibr" rid="ref29">20</xref>
            ].
          </p>
          <p>
            SVM starts from the goal of separating the data with a hyperplane, and extends this
to non-linear decision boundaries using the kernel trick [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ]. A hyperplane can be
defined as:
          </p>
          <p>wᵀx + b = 0</p>
          <p>where x represents a point (a vector) and w represents the weight vector. We
want to choose w and b to maximize the margin, i.e. the distance between the parallel
supporting hyperplanes, while still separating the data. The hyperplane
must separate the data such that:</p>
          <p>wᵀxk + b &gt; 0 for all xk of one class</p>
          <p>wᵀxj + b &lt; 0 for all xj of the other class</p>
          <p>
            If the data are separable in this way, there will probably be more than one way to do it.
Among all the possible hyperplanes, SVM selects the one for which the
distance between the hyperplane and the closest data points is as wide as possible [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ].
          </p>
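The separation conditions above can be checked mechanically. A small numpy illustration with a hand-picked w and b (illustrative values of our own, not weights learned by an SVM solver):

```python
import numpy as np

# Hand-picked hyperplane w^T x + b = 0 for a toy 2-D set
w = np.array([1.0, 1.0])
b = -3.0

class_pos = np.array([[3.0, 2.0], [4.0, 3.0]])   # should satisfy w^T x + b > 0
class_neg = np.array([[0.0, 1.0], [1.0, 0.5]])   # should satisfy w^T x + b < 0

def side(x):
    """Sign of the decision function: which side of the hyperplane x lies on."""
    return np.sign(w @ x + b)

assert all(side(x) > 0 for x in class_pos)
assert all(side(x) < 0 for x in class_neg)
```

An SVM solver would additionally pick, among all (w, b) passing these checks, the pair with maximal margin.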
          <p>
            When working with a dataset that is not linearly separable, it is necessary to resort to
the use of a kernel function. The kernel function allows the SVM to form non-linear
boundaries [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ]. Data representation through a kernel function offers an alternative
solution to the nonlinearity problem, projecting the information to a higher-dimensional
feature space [
            <xref ref-type="bibr" rid="ref38">28</xref>
            ]. This is accomplished by changing the representation of the
function; this is similar to mapping the input space X to a new space H, called the feature
space, of the form:
φ : X ⊂ Rᵈ → H
(3)
          </p>
          <p>Now, instead of considering the input vectors {x1,…, xn}, the
transformed vectors {φ(x1),…, φ(xn)} are considered, as shown in figure 1. By making this
substitution, an SVM is obtained in the new space (this is called the ‘kernel trick’); it is
important to mention that in practice the implementation of this nonlinear technique
consumes the same amount of computational resources as its linear equivalent.
Fig. 1. Using the Kernel to transform (map) the input data space.</p>
          <p>
            The general problem that SVM aims to solve is to find, for a given learning
task with a finite amount of data, an appropriate function that achieves
good generalization, resulting from a proper trade-off between the accuracy
achieved on a particular training set and the capacity of the model [
            <xref ref-type="bibr" rid="ref40">30</xref>
            ].
          </p>
          <p>The use of the ‘Radial Basis Function’ (RBF) kernel is based on the fact that this
kernel is best suited to deal with data whose class-conditional
probability distribution function approaches the Gaussian distribution, like the
texture present in the input images. It maps such data into a different space where the
data become linearly separable. The kernel function is defined as follows:</p>
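A minimal numpy sketch of the commonly used RBF form, K(x, y) = exp(−‖x − y‖² / (2σ²)); the σ value below is an arbitrary example, and as the following paragraph notes, tuning σ (together with C) is the hard part in practice:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Radial Basis Function kernel: exp(-||x - y||^2 / (2*sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

# Identical points always map to 1; distant points decay toward 0.
print(rbf_kernel([0, 0], [0, 0]))        # 1.0
print(rbf_kernel([0, 0], [3, 4], 1.0))   # exp(-12.5), close to 0
```

Because the kernel depends only on the distance ‖x − y‖, σ directly controls how quickly similarity falls off, which is why certain σ/C combinations make the classifier overly sensitive to the training data.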
<p>A disadvantage of this kernel is that it is difficult to tune, in the sense that
it is hard to obtain an optimum value for its parameter σ (sigma) and to choose the
corresponding C that works best for a given problem. The fact that certain
combinations of σ and C make the SVM highly sensitive to the training data also
contributes to the error rate of the RBF-based SVM.</p>
<p>One advantage of the RBF kernel is that, given the kernel, the weights, the
number of support vectors and the support vectors themselves are obtained
automatically as part of the training procedure, i.e. they do not need to be
specified beforehand.</p>
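<p>A minimal sketch of this kernel (illustrative code, not part of the original experiments):</p>

```python
import numpy as np

# Sketch of the RBF kernel used in the experiments,
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma controls the width.
def rbf_kernel(x, y, sigma):
    d2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

x, y = np.zeros(3), np.ones(3)
k_same = rbf_kernel(x, x, sigma=2.0)  # identical points give k = 1
k_far = rbf_kernel(x, y, sigma=2.0)   # value decays as points move apart
```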
<p>IV. SETTING UP THE EXPERIMENTS
For the experiments we decided to work with two sets of images, one consisting
of coastal beach scenes and the other of city scenes. The input images are
processed in order to reduce computational complexity.</p>
<p>The first set consists of 128 images of beach scenes and the second of 128
images of city scenes, a total of 256 images.</p>
<p>After processing, the input images are 8-bit-per-pixel grayscale images of
dimension 128x128; working with just one channel reduces the number of
convolutions. The output of each filter is obtained by convolving the input
image with a Gabor filter. The process is shown below:</p>
<p>G(x, y) = I(x, y) ⊗ ψ(x, y)
(7)
where
G(x, y) is the output of the filter,
I(x, y) is the original image, and
ψ(x, y) is the Gabor filter.</p>
          <p>
This computation can in principle be done in the spatial domain; however, the
Gabor filter is usually narrow there, whereas in the frequency domain it is
much larger and thus less affected by aliasing effects due to sampling. It is
therefore more convenient to carry out the whole computation in the frequency
domain, where the convolution reduces to a simple and efficient point-wise
multiplication of the Fourier transforms [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ].
          </p>
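<p>The frequency-domain filtering step can be sketched as follows (the array shapes and the all-ones test filter are assumptions for illustration, not the paper's filter bank):</p>

```python
import numpy as np

# Sketch: filtering in the frequency domain replaces the convolution with a
# point-wise multiplication of the image's Fourier transform by the filter's
# frequency response.
def filter_in_frequency(image, psi_freq):
    I = np.fft.fft2(image)            # image -> frequency domain
    G = I * psi_freq                  # point-wise product instead of convolution
    return np.real(np.fft.ifft2(G))   # back to the spatial domain

# Sanity check: an all-ones frequency response leaves the image unchanged.
img = np.arange(16.0).reshape(4, 4)
out = filter_in_frequency(img, np.ones((4, 4)))
```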
<p>The family of Gabor filters selected to set up the filter bank for the
experiments is defined in the frequency domain over rotated coordinates, with
v′ = −u sinθ + v cosθ</p>
<p>I(u, v) = ∫−∞∞ ∫−∞∞ i(x, y) e−i2π(ux+vy) dx dy
(11)
where
i(x, y) is the original input image.</p>
<p>After the transformation, normalization is applied to the output image in
order to avoid illumination effects.</p>
<p>At the end of normalization we have a certain number of square matrices per
filtered image, each of dimension 128x128. The number of square matrices
depends on the number of orientations and frequencies in the filter bank; in
our experiments the filter bank consists of 2 frequencies and 3 orientations,
so the number of output matrices is 6.</p>
<p>The convolution of the input image with the Gabor filter is then performed.
In the frequency domain the convolution becomes a point-to-point multiplication
of the transformed image with the Gabor filter.</p>
<p>Once the filter output G(x, y) is obtained, it needs to be transformed back
to its spatial representation using the 2-D Inverse Fourier Transform.</p>
<p>g(x, y) = i(x, y) ⊗ ψ(x, y) - Spatial domain
G(x, y) = I(x, y) ⋅ Ψ(x, y) - Frequency domain
i(x, y) = ∫−∞∞ ∫−∞∞ I(u, v) ei2π(ux+vy) du dv
where
I(u, v) is the image in the frequency domain.</p>
<p>When the convolution is performed, some results are not useful, especially
if the image does not contain textures that respond meaningfully to the
selected filter parameters. To reduce this problem, all the outputs obtained by
convolution are summed, which removes the irrelevant results and enhances those
that help to detect texture regions; it also reduces the dimensionality of the
feature space, leaving as output one square matrix the same size as the input
image.</p>
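<p>The summation step can be sketched as follows (random arrays stand in for the real filter responses):</p>

```python
import numpy as np

# Sketch: the 6 filter responses (2 frequencies x 3 orientations), each the
# size of the input image, are summed into a single 128x128 matrix.
rng = np.random.default_rng(0)
responses = [rng.random((128, 128)) for _ in range(6)]  # placeholder outputs
combined = np.sum(responses, axis=0)  # one matrix, same size as the input image
```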
<p>At this point we have one matrix per input image, which reduces the
dimensionality of the input data. Each matrix is used to build up a feature
matrix, which serves as the input of the SVM classifier.</p>
          <p>
            To complete the convolution of the input image with each one of the Gabor filters
we take only the real part of the output filtered image. As mentioned in [
            <xref ref-type="bibr" rid="ref41">31</xref>
], in this
way we keep most of the texture response information while ignoring the phase information.
          </p>
          <p>Re(G(x, y))
(12)</p>
<p>We then modify each output matrix of dimension 128x128 to construct the
feature matrix: each matrix is transformed into a 1x16384 vector, and each
vector is stacked with the next transformed matrix to form the feature matrix.</p>
<p>Finally, we have a feature matrix of dimension 256x16384, which serves as
input to the classifier.</p>
<p>V. SVM CLASSIFIER
The goal of the experimentation is to obtain, through SVM, a training model
capable of separating a set of input images. Once we have the feature matrix of
processed and filtered images, we proceed with the SVM classification
procedure. Given the nature of the classification process, we need to define a
training dataset, so that the classifier can learn a model, and a test dataset,
which lets us test the learned model. The training dataset comprises 75% of the
input dataset and the test dataset the remaining 25%; the images for each are
selected randomly. Fig. 2(a) shows an example of the coastal beach scene
images, and fig. 2(b) an example of the city scene images, which the classifier
will try to separate.
Fig. 2. Example of images used in the experiments. (a) Beach scene images, (b) City
images.</p>
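<p>The random 75/25 split can be sketched as follows (illustrative code, not the SPIDER setup used in the paper):</p>

```python
import numpy as np

# Sketch of the random 75/25 split of the 256 labeled images
# (beach scenes labeled 1, city scenes labeled -1, as in the experiments).
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(128), -np.ones(128)])
idx = rng.permutation(len(labels))
n_train = int(0.75 * len(labels))            # 192 training images
train_idx, test_idx = idx[:n_train], idx[n_train:]
```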
          <p>
            The experiments were performed using SPIDER [
            <xref ref-type="bibr" rid="ref42">32</xref>
], a MATLAB implementation
of SVM and a complete object-oriented environment for machine learning. Since
SVM is a binary classifier, the datasets must be labeled for the classification
experiments: the beach scene images are labeled 1 and the city scene images -1.
          </p>
<p>Table I lists the kernels available in SPIDER, together with their formulas
and parameters, including the polynomial kernel
k(x, y) = (x ⋅ y + 1)^d
and the RBF kernel
k(x, y) = exp(−‖x − y‖² / (2σ²)).
The RBF kernel is used to run the experiments with different sigma values.
Another parameter used by SPIDER is the ‘soft margin’ parameter C, which
penalizes the training errors. This value is set to 1000 in all the experiments.</p>
<p>The sigma values were changed iteratively until a satisfactory error rate
was obtained. The test results for the learned algorithm are presented in
table II.</p>
<p>As can be seen in table II, the sigma value giving the lowest percentage
error is σ = 35, with an error rate of 9.37%.</p>
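<p>The sweep amounts to simple model selection over σ; a minimal sketch, with made-up error values rather than the paper's measurements and a generic train-and-score callback in place of SPIDER:</p>

```python
# Sketch of the sigma sweep: each candidate sigma is used to train and score a
# model, and the sigma with the lowest test error is kept. The error values
# below are placeholders, not the paper's measurements.
def sweep(train_and_score, sigmas):
    """train_and_score(sigma) -> test error rate in [0, 1]."""
    errors = {s: train_and_score(s) for s in sigmas}
    best_sigma = min(errors, key=errors.get)
    return best_sigma, errors[best_sigma]

fake_errors = {21: 0.20, 28: 0.14, 35: 0.0937, 42: 0.12}  # illustrative only
best_sigma, best_err = sweep(lambda s: fake_errors[s], fake_errors)
```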
<p>Table II. Test error rate of the RBF-kernel SVM for each sigma value tested.</p>
<p>VII. CONCLUSIONS
Extracting texture features with a Gabor filter bank and classifying the filter
outputs with Support Vector Machines offers an excellent accuracy rate: 90.63%
of the input images are correctly classified according to their class, beach
scenes or city scenes.</p>
<p>The article demonstrates the effectiveness of a two-stage model: first
extracting the texture features and then classifying them with an SVM.</p>
          <p>
            REFERENCES
[
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] T. Randen and J.H. Husoy, “Filtering for texture classification: a comparative study”, IEEE Trans. on
          </p>
          <p>
Pattern Analysis and Machine Intelligence, Vol. 21, Issue 4, pp. 291 – 310, Apr 1999.
[
            <xref ref-type="bibr" rid="ref11 ref2">2</xref>
            ] F. Lumbreras Ruiz, “Segmentation, classification and modelization of textures by means of
multiresolution decomposition techniques”, Ph.D. dissertation, Dept. Informática and Computer
Vision Center, Universitat Autònoma de Barcelona, Barcelona, España, 2001.
[
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] O. Chapelle, P. Haffner, and V.N. Vapnik, “Support vector machines for histogram-based image
classification”, IEEE Trans. On Neural Networks, Vol. 10, Issue 5, pp. 1055 – 1064, Sep 1999.
[
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] Kwang In Kim, Keechul Jung, Se Hyun Park, and Hang Joon Kim, “Support vector machines for
texture classification”, IEEE Trans. On Pattern Analysis and Machine Intelligence, Vol. 24, Issue 11,
pp. 1542 – 1550, Nov 2002.
[
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ] I. Gondra and D.R. Heisterkamp, “Learning in region-based image retrieval with generalized support
vector machines”, In Proc. of the Computer Vision and Pattern Recognition, pp. 149 – 154, 2004.
[
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ] Shutao Li, J.T. Kwok, Hailong Zhu, and Yaonan Wang, “Texture classification using the support
vector machines”, Pattern Recognition, Vol. 36, No. 12, pp. 2883 – 2893, 2003.
[
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] Bing-Yu Sun and De-Shuang Huang, “Texture classification based on support vector machine and
wavelet transform”, In Proc. of the Fifth World Congress on Intelligent Control and Automation,
WCICA 2004. Vol. 2, pp. 1862 – 1864, June 15–19, 2004.
[
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] V.P. Subramanyam Rallabandi and S.K. Sett, “Unsupervised texture classification and segmentation”,
          </p>
          <p>
            Proceedings Of World Academy of Science, Engineering and Technology, Vol. 5, April 2005.
[
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] J.A. Recio, L.A. Ruiz and A.Fernández-Sarriá, “Use of Gabor filters for texture classification of
digital images”, Física de la Tierra, Vol. 17, pp. 47 – 59, 2005.
[
            <xref ref-type="bibr" rid="ref19">10</xref>
            ] M.R. Turner, “Texture discrimination by Gabor functions”, Biol. Cybern., Vol. 55, Num. 2–3, pp. 71
– 82, 1986.
[
            <xref ref-type="bibr" rid="ref20">11</xref>
            ] V. Levesque, “Texture segmentation using Gabor filters”, Center for Intelligent Machines Journal,
2000
[
            <xref ref-type="bibr" rid="ref21">12</xref>
            ] P. Guha and R. Banerjee, “Segmentation and classification of multi-textured images”, 2000,
          </p>
          <p>
            Available: http://www.cse.iitk.ac.in/~amit/courses/768/00/rajrup/, last visited: April 20, 2009.
[
            <xref ref-type="bibr" rid="ref22">13</xref>
            ] V.S. Vyas and P. Rege, “Automated texture analysis with Gabor filters”, GVIP Journal, Vol. 6, Issue
1, pp. 35 – 41, July 2006.
[
            <xref ref-type="bibr" rid="ref23">14</xref>
            ] K.M. Rajpoot and N.M. Rajpoot, “Wavelets and Support Vector Machines for Texture
Classification”, In proceedings of the 8th International Multitopic Conference, INMIC 2004, 24-26
Dec., pp. 328 – 333, 2004.
[
            <xref ref-type="bibr" rid="ref24">15</xref>
            ] D.M. Tsai, “Optimal Gabor filter design for texture segmentation”, Technical Report, Machine Vision
          </p>
          <p>
            Lab, Dept. of Ind. Eng. and Mgmt., Yuan-Ze University, Chung-Li, Taiwan, 2000.
[
            <xref ref-type="bibr" rid="ref25">16</xref>
            ] A.C. Bovik, M. Clark and W.S.Geisler, “Multichannel Texture Analysis Using Localized Spatial
Filters”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, Num. 1, pp. 55 –
73, 1990.
[
            <xref ref-type="bibr" rid="ref26">17</xref>
] Zehang Sun, G. Bebis and R. Miller, “On-road vehicle detection using Gabor filters and support
vector machines”, 14th International Conference on Digital Signal Processing, DSP 2002, Vol. 2, pp.
1019 – 1022, 2002.
[
            <xref ref-type="bibr" rid="ref27">18</xref>
            ] K. Hammouda and E. Jernigan, “Texture segmentation using Gabor filters”, tech. rep., Biotechnology
and health engineering centre, University of Waterloo, Dec. 2000.
[
            <xref ref-type="bibr" rid="ref28">19</xref>
            ] S.E. Grigorescu, N. Petkov and P. Kruizinga, “Comparison of texture features based on Gabor
filters”, IEEE Trans. On Image Processing, Vol. 11, Num. 10, pp. 1160 – 1167, 2002.
[
            <xref ref-type="bibr" rid="ref29">20</xref>
            ] D. Batra, G. Singhal and S. Chaudhury, “Gabor filter based fingerprint classification using support
vector machines”, Proceedings of the IEEE First India Annual Conference, 2004, INDICON 2004,
pp. 256 – 261, 20-22 Dec. 2004.
[
            <xref ref-type="bibr" rid="ref30">21</xref>
            ] Q.A. Salih and V. Dhandapani, “IRIS Recognition based on multi-channel feature extraction using
gabor filters”, Proceedings of the 2nd IASTED international conference on Advances in computer
science and technology, ACST’06, pp. 168 – 173, 2006.
[
            <xref ref-type="bibr" rid="ref31">22</xref>
            ] L. Ma, Y. Wang and T. Tan, “Iris recognition based on multichannel Gabor filtering”, 5th Asian Conf.
          </p>
          <p>
            Computer Vision, Vol. 1, 2002.
[
            <xref ref-type="bibr" rid="ref32">23</xref>
            ] D. Carr, “Iris recognition: Gabor filtering”, Connexions. Dec. 18, 2004, Available:
http://cnx.org/content/m12493/1.4/, last visited April 20, 2009.
[
            <xref ref-type="bibr" rid="ref34">24</xref>
            ] K. Kämäräinen, “Feature extraction using Gabor filters”, Ph. D. dissertation, Lappeenranta
          </p>
          <p>
            University of Technology, Finland, Nov. 2003.
[
            <xref ref-type="bibr" rid="ref35">25</xref>
            ] T.P. Weldon, W.E. Higgins and D.F. Dunn, “Gabor filter design for multiple texture segmentation”,
          </p>
          <p>
            Optical Engineering, Vol. 35, pp. 2852 – 2863, 1996.
[
            <xref ref-type="bibr" rid="ref36">26</xref>
            ] F. Bianconi and A. Fernández, “Evaluation of the effects of Gabor filter parameters on texture
classification”, Pattern Recognition, Vol. 40, Num. 12, pp. 3325 – 3335, 2007.
[
            <xref ref-type="bibr" rid="ref37">27</xref>
            ] J. Ilonen, J.K. Kämäräinen and J.K. Kälviäinen, “Efficient computation of Gabor features”, Research
          </p>
          <p>
            Report 100, Lappeenranta University of Technology, Dept. of Information Technology, 2005.
[
            <xref ref-type="bibr" rid="ref38">28</xref>
            ] J.A. Reséndiz, “Las máquinas de vectores de soporte para identificación en línea”, Masters
dissertation, Departamento de control automático, Centro de investigación y estudios avanzados, I.P.N.,
2006.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] J.P. Lewis, “A short SVM (support vector machine) tutorial”, CGIT Lab / IMSC, University Southern
          </p>
          <p>
            California, 2004.
[
            <xref ref-type="bibr" rid="ref40">30</xref>
            ] L. González, “Modelos de clasificación basados en máquinas de vectores de soporte”, Asoc. científica
europea de econ. aplicada. Anales de economía aplicada, 2003.
[
            <xref ref-type="bibr" rid="ref41">31</xref>
            ] D. Dunn, W.E. Higgins and J. Wakeley, “Texture segmentation using 2-D Gabor elementary
functions”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, Num. 2, pp.
130 – 149, Feb 1994.
[
            <xref ref-type="bibr" rid="ref42">32</xref>
            ] SPIDER, A complete object oriented environment for machine learning in MATLAB. Available:
http://www.kyb.mpg.de/bs/people/spider/, last visited May 15, 2009.
          </p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pendleton</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mytkowicz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie</article-title>
          .
          <source>In: 25th Annual ACM Conference on Human Factors in Computing Systems (CHI</source>
          <year>2007</year>
          ), ACM, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pendleton</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>Us vs. Them: Understanding Social Dynamics in Wikipedia with Revert Graph Visualizations</article-title>
          .
          <source>In: Visual Analytics Science and Technology</source>
          . pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          , IEEE-Press, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic Vandalism Detection in Wikipedia</article-title>
          .
          <source>In: 30th European Conference on IR Research</source>
          , ECIR
          <year>2008</year>
          , pp.
          <fpage>663</fpage>
          -
          <lpage>668</lpage>
          ,
          <string-name>
            <surname>Glasgow</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Does it matter who contributes: a study on featured articles in the german wikipedia</article-title>
          .
          <source>In: Proceedings of the 18th conference on Hypertext and hypermedia</source>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>174</lpage>
          , ACM, New York (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
5. 99 Wikipedia Sources Aiding the Semantic Web. AI3, http://www.mkbergman.com/?p=417
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Medelyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milne</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          :
          <article-title>Mining meaning from Wikipedia</article-title>
          . Hamilton, (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chernov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iofciu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Extracting Semantic Relationships between Wikipedia Categories</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Semantic Wikis - From</source>
          Wiki to Semantics, ESWC2006,
          <string-name>
            <surname>Budva</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Ruiz-casado,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Semantic Wikis - From</source>
          Wiki to Semantics, ESWC2006,
          <string-name>
            <surname>Budva</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Corpus Exploitation from Wikipedia for Ontology Construction</article-title>
          .
          <source>Conference on Language Resources and Evaluation</source>
          , LREC2008,
          <string-name>
            <surname>Morocco</surname>
          </string-name>
          (
          <year>2008</year>
          )
          <fpage>10</fpage>
          .
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          :
          <article-title>Automatically Refining the Wikipedia Infobox Ontology</article-title>
          .
          <source>In 17th International World Wide Web Conference</source>
          , Beijing (
          <year>2008</year>
          )
          <fpage>11</fpage>
          .
          <string-name>
            <surname>Kozlova</surname>
          </string-name>
          , N.:
          <article-title>Automatic Ontology Extraction for Document Classification</article-title>
          . Ma. Thesis, Saarland University (
          <year>2006</year>
)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asgharbeygi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Maleki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          “
          <article-title>Geodesic K-means clustering”</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <year>2008</year>
          ,
          <string-name>
            <surname>ICPR</surname>
          </string-name>
          <year>2008</year>
          , 19th International Conference on.
          <source>Dec</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bahmani</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firouzi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Niknam</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nayeripour</surname>
          </string-name>
          .
          <article-title>“A New Evolutionary Algorithm for Cluster Analysis”</article-title>
          .
          <source>Proceedings of world Academy of Science</source>
          , Engineering and Technology Vol.
          <volume>36</volume>
          ,
          <string-name>
            <surname>Dec</surname>
          </string-name>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ball</surname>
            , G. and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hall</surname>
          </string-name>
          , “
          <article-title>A clustering technique for summarizing multivariate data”, (ISODATA), Behav Sci</article-title>
          ., vol.
          <volume>12</volume>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Belal</given-names>
            <surname>Al-Zoubi</surname>
          </string-name>
          , Al-Zoubi, Amjad Hudaib, Ammar Huneiti and
          <string-name>
            <given-names>Bassam</given-names>
            <surname>Hammo</surname>
          </string-name>
          . “
          <article-title>New Efficient Strategy to Accelerate k-Means Clustering Algorithm”</article-title>
          .
          <source>American Journal of Applied Sciences</source>
          <volume>5</volume>
          (
          <issue>9</issue>
          )
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          ,
          <string-name>
            <given-names>Science</given-names>
            <surname>Publications</surname>
          </string-name>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          5.
          <string-name><surname>Bradley</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><given-names>U.</given-names> <surname>Fayyad</surname></string-name>
          . “<article-title>Refining initial points for k-means clustering</article-title>”,
          <source>in Proc. 15th Int. Conf. Machine Learning</source>,
          <year>1998</year>, pp.
          <fpage>91</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          6.
          <string-name><surname>Deelers</surname>, <given-names>S.</given-names></string-name>
          and
          <string-name><given-names>S.</given-names> <surname>Auwatanamongkol</surname></string-name>
          . “<article-title>Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance</article-title>”.
          <source>Proceedings of World Academy of Science, Engineering and Technology</source>, Vol.
          <volume>26</volume>, Dec.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          7.
          <string-name><given-names>Zejin</given-names> <surname>Ding</surname></string-name>
          ,
          <string-name><given-names>Jian</given-names> <surname>Yu</surname></string-name>
          and
          <string-name><given-names>Yang-Qing</given-names> <surname>Zhang</surname></string-name>
          . “<article-title>A New Improved K-Means Algorithm with Penalized Term</article-title>”.
          <source>Granular Computing, 2007 (GRC 2007), IEEE International Conference on</source>, Nov.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          8.
          <string-name><surname>Duda</surname>, <given-names>R.O.</given-names></string-name>
          and
          <string-name><surname>Hart</surname>, <given-names>P.E.</given-names></string-name>
          :
          <article-title>Pattern Classification and Scene Analysis</article-title>
          . John Wiley &amp; Sons, New York, NY,
          <year>1973</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          9.
          <string-name><surname>Estivill-Castro</surname>, <given-names>V.</given-names></string-name>
          and
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>
          , “<article-title>A fast and robust general purpose clustering algorithm</article-title>”.
          <source>In Proc. 6th Pacific Rim Int. Conf. Artificial Intelligence (PRICAI’00)</source>,
          <string-name><given-names>R.</given-names> <surname>Mizoguchi</surname></string-name>
          and J. Slaney, Eds., Melbourne, Australia,
          <year>2000</year>, pp.
          <fpage>208</fpage>-<lpage>218</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          10.
          <string-name><surname>Fayyad</surname>, <given-names>U.M.</given-names></string-name>
          ,
          <string-name><surname>Piatetsky-Shapiro</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>Smyth</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><surname>Uthurusamy</surname>, <given-names>R.</given-names></string-name>
          :
          <article-title>Advances in Knowledge Discovery and Data Mining</article-title>
          . AAAI/MIT Press,
          <year>1996</year>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          11.
          <string-name><surname>Fisher</surname>, <given-names>D.</given-names></string-name>
          :
          <article-title>Knowledge Acquisition via Incremental Conceptual Clustering</article-title>
          .
          <source>Machine Learning</source>, Vol.
          <volume>2</volume>, No.
          <issue>2</issue>
          (<year>1987</year>)
          <fpage>139</fpage>-<lpage>172</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          12.
          <string-name><surname>Forgy</surname>, <given-names>E.</given-names></string-name>
          , “<article-title>Cluster analysis of multivariate data: Efficiency vs. interpretability of classification</article-title>”,
          <source>Biometrics</source>, vol.
          <volume>21</volume>, pp.
          <fpage>768</fpage>-<lpage>780</lpage>,
          <year>1965</year>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          13.
          <string-name><surname>Frahling</surname>, <given-names>G.</given-names></string-name>
          and
          <string-name><given-names>Ch.</given-names> <surname>Sohler</surname></string-name>
          . “<article-title>A fast k-means implementation using coresets</article-title>”.
          <source>International Journal of Computational Geometry &amp; Applications</source>, Dec.
          <year>2008</year>, Vol.
          <volume>18</volume>, Issue
          <issue>6</issue>, pp.
          <fpage>605</fpage>-<lpage>625</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          14.
          <string-name><surname>Gupta</surname>, <given-names>S.</given-names></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Rao</surname></string-name>
          and
          <string-name><given-names>V.</given-names> <surname>Bhatnagar</surname></string-name>
          , “<article-title>K-means clustering algorithm for categorical attributes</article-title>”,
          <source>in Proc. 1st Int. Conf. Data Warehousing and Knowledge Discovery (DaWaK’99)</source>, Florence, Italy,
          <year>1999</year>, pp.
          <fpage>203</fpage>-<lpage>208</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          15.
          <string-name><surname>Hansen</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><given-names>E.</given-names> <surname>Ngai</surname></string-name>
          . “<article-title>Analysis of Global k-means, an Incremental Heuristic for Minimum Sum of Squares Clustering</article-title>”.
          <source>Journal of Classification</source>
          <volume>22</volume>: <fpage>287</fpage>-<lpage>310</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          16.
          <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name>
          , “<article-title>Extensions to the k-means algorithm for clustering large data sets with categorical values</article-title>”.
          <source>Data Mining and Knowledge Discovery</source>, vol.
          <volume>2</volume>, pp.
          <fpage>283</fpage>-<lpage>304</lpage>,
          <year>1998</year>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          17.
          <string-name><surname>Kanungo</surname>, <given-names>T.</given-names></string-name>
          ,
          <string-name><surname>Mount</surname>, <given-names>D.M.</given-names></string-name>
          ,
          <string-name><surname>Netanyahu</surname>, <given-names>N.S.</given-names></string-name>
          ,
          <string-name><surname>Piatko</surname>, <given-names>C.D.</given-names></string-name>
          ,
          <string-name><surname>Silverman</surname>, <given-names>R.</given-names></string-name>
          and
          <string-name><surname>Wu</surname>, <given-names>A.Y.</given-names></string-name>
          :
          <article-title>An Efficient K-means Clustering Algorithm: Analysis and Implementation</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          . Vol.
          <volume>24</volume>, No.
          <issue>7</issue>
          (<year>2002</year>)
          <fpage>881</fpage>-<lpage>892</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          18.
          <string-name><surname>Kashima</surname>, <given-names>H.</given-names></string-name>
          ,
          <string-name><surname>Hu</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><surname>Ray</surname>, <given-names>B.</given-names></string-name>
          and
          <string-name><surname>Singh</surname>, <given-names>M.</given-names></string-name>
          . “<article-title>K-means clustering of proportional data using L1 distance</article-title>”.
          <source>Pattern Recognition, 2008 (ICPR 2008), International Conference on</source>, Dec.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          19.
          <string-name><surname>Kao</surname>, <given-names>Yi-Tung</given-names></string-name>
          ,
          <string-name><surname>Zahara</surname>, <given-names>Erwie</given-names></string-name>
          and
          <string-name><surname>Kao</surname>, <given-names>I-Wei</given-names></string-name>
          . “<article-title>A hybridized approach to data clustering</article-title>”.
          <source>Expert Systems with Applications</source>. Vol.
          <volume>34</volume>, Issue
          <issue>3</issue>, pp.
          <fpage>1754</fpage>-<lpage>1762</lpage>. Apr.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          20.
          <string-name><surname>Kaufman</surname>, <given-names>L.</given-names></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Rousseeuw</surname></string-name>
          .
          <article-title>Finding Groups in Data: An Introduction to Cluster Analysis</article-title>
          : Wiley,
          <year>1990</year>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          21.
          <string-name><surname>Krishna</surname>, <given-names>K.</given-names></string-name>
          and
          <string-name><given-names>M.</given-names> <surname>Murty</surname></string-name>
          , “<article-title>Genetic K-means algorithm</article-title>”.
          <source>IEEE Trans. Syst., Man, Cybern. B, Cybern.</source>, vol.
          <volume>29</volume>, no.
          <issue>3</issue>, pp.
          <fpage>433</fpage>-<lpage>439</lpage>, Jun.
          <year>1999</year>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          22.
          <string-name><surname>Likas</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Vlassis</surname>, <given-names>N.</given-names></string-name>
          and
          <string-name><surname>Verbeek</surname>, <given-names>J.J.</given-names></string-name>
          :
          <article-title>The Global K-means Clustering Algorithm</article-title>
          .
          <source>Pattern Recognition, The Journal of the Pattern Recognition Society</source>
          . Vol.
          <volume>36</volume>, No.
          <issue>2</issue>
          (<year>2003</year>)
          <fpage>451</fpage>-<lpage>461</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          23.
          <string-name><surname>Lloyd</surname>, <given-names>S.P.</given-names></string-name>
          , “<article-title>Least squares quantization in PCM</article-title>”. Unpublished Bell Lab.
          <source>Tech. Note, portions presented at the Institute of Mathematical Statistics Meeting, Atlantic City</source>, NJ, Sep.
          <year>1957</year>.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>IEEE Trans. Inform. Theory (Special Issue on Quantization)</source>, vol. IT-28, pp.
          <fpage>129</fpage>-<lpage>137</lpage>, Mar.
          <year>1982</year>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          24.
          <string-name><surname>MacQueen</surname>, <given-names>J.</given-names></string-name>
          :
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability</source>
          . Vol.
          <volume>1</volume>
          . Berkeley, CA (<year>1967</year>)
          <fpage>281</fpage>-<lpage>297</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          25. MathWorks. http://www.mathworks.com
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          26.
          <string-name><surname>Kantardzic</surname>, <given-names>M.</given-names></string-name>
          :
          <article-title>Data Mining: Concepts, Models, Methods, and Algorithms</article-title>
          . John Wiley &amp; Sons.
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          27.
          <string-name><surname>Nguyen</surname>, <given-names>Cao D.</given-names></string-name>
          and
          <string-name><surname>Cios</surname>, <given-names>Krzysztof J.</given-names></string-name>
          . “<article-title>GAKREM: A novel hybrid clustering algorithm</article-title>”.
          <source>Information Sciences</source>. Vol.
          <volume>178</volume>, Issue
          <issue>22</issue>, pp.
          <fpage>4205</fpage>-<lpage>4227</lpage>, Nov.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          28.
          <string-name><surname>Peña</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Lozano</surname></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Larrañaga</surname></string-name>
          , “<article-title>An empirical comparison of four initialization methods for the k-means algorithm</article-title>”.
          <source>Pattern Recognition Letters</source>, vol.
          <volume>20</volume>, pp.
          <fpage>1027</fpage>-<lpage>1040</lpage>,
          <year>1999</year>.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          29.
          <string-name>
            <surname>Pérez</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodolfo Pazos</surname>
            <given-names>R</given-names>
          </string-name>
          , Laura Cruz R., Gerardo Reyes S., Rosy Basave T. and Héctor Fraire H. “
          <article-title>Improvement the Efficiency and Efficacy of the K-means Clustering Algorithm through a New Convergence Condition”</article-title>
          .
          <source>Computational Science and Its Applications - ICCSA 2007 - International Conference Proceedings</source>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          30.
          <string-name><surname>Pérez</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><given-names>M.F.</given-names> <surname>Henriques</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Pazos</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Cruz</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Reyes</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Salinas</surname></string-name>
          and
          <string-name><given-names>A.</given-names> <surname>Mexicano</surname></string-name>
          .
          <article-title>Mejora al Algoritmo de K-means mediante un Nuevo criterio de convergencia y su aplicación a bases de datos poblacionales de cáncer</article-title>
          . 2do Taller Latino Iberoamericano de Investigación de Operaciones, “La IO aplicada a la solución de problemas regionales”. México. (In Spanish.)
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          31.
          <string-name><surname>Pham</surname>, <given-names>D.T.</given-names></string-name>
          ,
          <string-name><surname>Dimov</surname>, <given-names>S.S.</given-names></string-name>
          and
          <string-name><surname>Nguyen</surname>, <given-names>C.D.</given-names></string-name>
          . “<article-title>Selection of K in K-means clustering</article-title>”.
          <source>Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science</source>; Vol.
          <volume>219</volume>, Issue
          <issue>1</issue>, pp.
          <fpage>103</fpage>-<lpage>109</lpage>, Jan.
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          32.
          <string-name><surname>Proietti</surname>, <given-names>Guido</given-names></string-name>
          and
          <string-name><given-names>Christos</given-names> <surname>Faloutsos</surname></string-name>
          . “<article-title>Analysis of Range Queries on Real Region Datasets Stored Using an R-Tree</article-title>”.
          <source>IEEE Transactions on Knowledge and Data Engineering</source>, Vol.
          <volume>12</volume>, No.
          <issue>5</issue>, Sep./Oct.
          <year>2000</year>.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          33.
          <string-name><surname>Pun</surname>, <given-names>W.K.D.</given-names></string-name>
          and
          <string-name><surname>Ali</surname>, <given-names>A.S.</given-names></string-name>
          . “<article-title>Unique distance measure approach for K-means (UDMA-Km) clustering algorithm</article-title>”.
          <source>TENCON 2007 - 2007 IEEE Region 10 Conference</source>. Oct. 30,
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          34.
          <string-name><surname>Redmond</surname>, <given-names>Stephen J.</given-names></string-name>
          and
          <string-name><surname>Heneghan</surname>, <given-names>Conor</given-names></string-name>
          . “<article-title>A method for initialising the K-means clustering algorithm using kd-trees</article-title>”.
          <source>Pattern Recognition Letters</source>; Vol.
          <volume>28</volume>, Issue
          <issue>8</issue>, Jun.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          35.
          <string-name><given-names>Taoying</given-names> <surname>Li</surname></string-name>
          and
          <string-name><given-names>Yan</given-names> <surname>Chen</surname></string-name>
          . “<article-title>An improved k-means algorithm for clustering using entropy weighting measures</article-title>”.
          <source>Intelligent Control and Automation, 2008 (WCICA 2008), 7th World Congress on</source>, June
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          36.
          <string-name><surname>Tsai</surname>, <given-names>Chieh-Yuan</given-names></string-name>
          and
          <string-name><surname>Chiu</surname>, <given-names>Chuang-Cheng</given-names></string-name>
          . “<article-title>Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm</article-title>”.
          <source>Computational Statistics &amp; Data Analysis</source>. Vol.
          <volume>52</volume>, Issue
          <issue>10</issue>, Jun.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          37.
          <string-name><surname>Asuncion</surname>, <given-names>A.</given-names></string-name>
          and
          <string-name><surname>Newman</surname>, <given-names>D.J.</given-names></string-name>
          (<year>2007</year>). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          38.
          <string-name><given-names>Wei</given-names> <surname>Song</surname></string-name>
          ,
          <string-name><given-names>Cheng Hua</given-names> <surname>Li</surname></string-name>
          and
          <string-name><given-names>Soon Cheol</given-names> <surname>Park</surname></string-name>
          . “<article-title>Genetic Algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures</article-title>”.
          <source>Expert Systems with Applications</source>. Vol.
          <volume>36</volume>, Issue
          <issue>5</issue>, Jul.
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          39.
          <string-name><given-names>Wesam</given-names> <surname>Barbakh</surname></string-name>
          and
          <string-name><given-names>Colin</given-names> <surname>Fyfe</surname></string-name>
          . “<article-title>Local vs global interactions in clustering algorithms: Advances over K-means</article-title>”.
          <source>International Journal of Knowledge-Based and Intelligent Engineering Systems</source>
          <volume>12</volume>
          (<year>2008</year>):
          <fpage>83</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          40.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations</article-title>
          . Morgan Kaufmann Publishers. San Diego, CA (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          41.
          <string-name><given-names>Xindong</given-names> <surname>Wu</surname></string-name>
          ,
          <string-name><given-names>V.</given-names> <surname>Kumar</surname></string-name>
          ,
          <string-name><given-names>J. Ross</given-names> <surname>Quinlan</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Ghosh</surname></string-name>
          ,
          <string-name><given-names>Q.</given-names> <surname>Yang</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Motoda</surname></string-name>
          ,
          <string-name><given-names>G.J.</given-names> <surname>McLachlan</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Ng</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>P.S.</given-names> <surname>Yu</surname></string-name>
          ,
          <string-name><given-names>Z.</given-names> <surname>Zhou</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Steinbach</surname></string-name>
          ,
          <string-name><given-names>D.J.</given-names> <surname>Hand</surname></string-name>
          and
          <string-name><given-names>D.</given-names> <surname>Steinberg</surname></string-name>
          . “<article-title>Top 10 algorithms in data mining</article-title>”.
          <source>Knowl Inf Syst</source>
          (<year>2008</year>). Springer.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          42.
          <string-name><given-names>Rui</given-names> <surname>Xu</surname></string-name>
          and
          <string-name><given-names>Donald</given-names> <surname>Wunsch II</surname></string-name>
          .
          <article-title>Survey of Clustering Algorithms</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>, Vol.
          <volume>16</volume>, No.
          <issue>3</issue>, May
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          43.
          <string-name><surname>Zalik</surname>, <given-names>Krista Rizman</given-names></string-name>
          . “<article-title>An Efficient k’-means Clustering Algorithm</article-title>”.
          <source>Pattern Recognition Letters</source>, Vol.
          <volume>29</volume>, Issue
          <issue>9</issue>, pp.
          <fpage>1385</fpage>-<lpage>1391</lpage>. Elsevier, Jul.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          44.
          <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>
          ,
          <string-name><given-names>B. Tian</given-names> <surname>Dai</surname></string-name>
          and
          <string-name><surname>Tung</surname>, <given-names>A.K.H.</given-names></string-name>
          . “<article-title>On the Lower Bound of Local Optimums in K-means Algorithm</article-title>”.
          <source>Data Mining, 2006 (ICDM’06), Sixth International Conference on Data Mining</source>, Dec.
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          45.
          <string-name><surname>Zhang</surname>, <given-names>Chen</given-names></string-name>
          and
          <string-name><surname>Xia</surname>, <given-names>Shixiong</given-names></string-name>
          . “<article-title>K-means Clustering Algorithm with Improved Initial Center</article-title>”.
          <source>Knowledge Discovery and Data Mining, 2009, Second International Workshop on</source>, 23-25 Jan.
          <year>2009</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>