<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Advances on Semantic Web and New Technologies</article-title>
      </title-group>
      <pub-date>
        <year>2003</year>
      </pub-date>
      <fpage>66</fpage>
      <lpage>107</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>The Workshop on Semantic Web and New Technologies was held by second time at the Faculty of
Computer Science of Benemérita Universidad Autónoma de Puebla, Mexico in March 2009.
The Semantic Web provides a common framework that allows data to be shared and reused across
application, enterprise, and community boundaries. Semantic Web technologies are beginning to play a
significant role in many diverse areas, marking a turning point in the evolution of the Web.
The goal of this workshop is to provide a forum for the Semantic Web community, in which
participants can present and discuss approaches to add semantics on the Web, show innovative
applications in this field and identify upcoming research issues related to Semantic Web. In order to
fulfill these objectives, the more important workshop topics included Semantic Search, Semantic
Advertising and Marketing, Linked Data, Collaboration and Social Network, Foundational Topics,
Semantic Web and Web 3.0, Ontologies, Semantic Integration, Data Integration and Mashups,
Unstructured Information, Semantic Query, Semantic Rules, Developing Semantic Applications and
Semantic SOA.</p>
      <p>Dr John Cardiff was the invited speaker at this Second Workshop on Semantic Web. He is a
full-time lecturer and lead researcher in the Social Media Research Group, based at the Institute
of Technology Tallaght, Dublin, Ireland. He has previously held positions in the Department
of Computer Science, Trinity College Dublin, and at the University of Queensland, Australia,
where he obtained his Ph.D. in 1990. He has extensive experience in semantic web
technologies, heterogeneous database research, and query processing and optimization. He
collaborates closely with researchers of the National Language Engineering Laboratory at the
Polytechnic University of Valencia, Spain, the Knowledge and Data Engineering Group of
Trinity College Dublin, and the IBM Dublin Center for Advanced Studies. He is currently
supervising four PhD students who are investigating semantic-web-based recommender
systems, blogosphere analysis, and adaptive hypermedia systems. Dr Cardiff has a wide
breadth of experience of research and management of large European Union funded projects
under programmes such as RACE, Esprit, and AIM. He has over 20 refereed publications in
international conferences and journals.
Invited Paper
The Evolution of the Semantic Web
John Cardiff
Exploiting Wikipedia as a Knowledge Base: Towards an Ontology of Movies
Rodrigo Alarcón, Octavio Sánchez and Víctor Mijangos
Translation of Verbal Expressions and Context of Use Extraction through a Corpus on Web
Arturo Velasco, María J. Somodevilla, and Ivo H. Pineda
Dynamic Concept-Based Taxonomy Used for Image Recovery Based on Their Textual Description
Jaime Lara, María de la Concepción Pérez de Celis and David Pinto
The Use of Document Fingerprinting in the Web People Search Task
David Pinto, Mireya Tovar, Beatriz Beltrán, Darnes Vilariño and Héctor Furlog
mQA: Question Answering in Mobile Devices
Fernando Zacarías F., Alberto Tellez V., Marco Antonio Balderas and Rosalba Cuapa C.
Semantic Routing for Structured Peer-to-Peer Networks
Luis Enrique Colmenares Guillén, Omar Ariosto Niño Prieto and Leandro Navarro Moldes
Some Considerations for the Semantic Web
María Elena Franco Carcedo</p>
    </sec>
    <sec id="sec-2">
      <title>John Cardiff</title>
      <p>Social Media Research Group,
Institute of Technology Tallaght, Dublin, Ireland
email: John.Cardiff@ittdublin.ie
Abstract — The Semantic Web offers the exciting promise of a
world in which computers and humans can cooperate
effectively with a common understanding of the meaning of
data. However, in the decade since the term came into
widespread usage, Semantic Web applications have been slow
to emerge from the research laboratories. In this paper, we
present a brief overview of the Semantic Web vision and the
underlying technologies. We describe the advances made in
recent years and explain why we believe that Semantic Web
technology will be the driving force behind the next generation
of Web applications.</p>
      <sec id="sec-2-1">
        <title>I. INTRODUCTION</title>
        <p>The World Wide Web (WWW) was invented by Tim
Berners Lee in 1989, while he was working at the European
Laboratory for Particle Physics (CERN) in Switzerland. It
was conceived as a means to allow physicists working in
different countries to communicate and to share
documentation more efficiently. He wrote the first browser
and Web server, allowing hypertext documents to be stored,
retrieved and viewed.</p>
        <p>The Web added two important services to the internet. First, it
provided a very convenient means for us to retrieve and
view information: we can see the web as a vast
document store from which we retrieve documents (web pages)
by typing their address into a web browser. Secondly, it
provided a language called HTML, which describes to
computers how to display documents written in this
language. Documents, or web pages, are identified by a
unique identifier called a Uniform Resource Locator (URL)
and are accessed using a Web browser. Within a short
space of time, the WWW had become a popular
infrastructure for sharing information, and as the volume of
information increased its use became increasingly
widespread.</p>
        <p>Although the web provides the infrastructure for us to
publish and retrieve documents, the HTML language
defines only the visual characteristics, i.e. how the
documents are to be presented on a computer screen to the
user. It is up to the user who requested the document to
interpret the information it contains. This seems
counterintuitive, as we normally think of computers as the
tools that perform the more complex tasks, making life easier
for humans. The problem is that HTML gives no
consideration to the meaning of documents: they are not
represented in a way that allows interpretation of their
information content by computers.</p>
        <p>If computers could interpret the content of a web page, a lot
of exciting possibilities would arise. Information could be
exchanged between machines, and automated processing and
integration of data on different sites could occur.
Fundamentally, computers could improve the ways in which they
retrieve and utilise information for us, because they
would have an understanding of what we are interested in.
This is where the Semantic Web fits into the picture:
today's web (the "syntactic" web) is about documents,
whereas the semantic web is about "things" - concepts we
are interested in (people, places, events, etc.) - and the
relationships between these concepts.</p>
        <p>
          The Semantic Web vision envisages advanced management
of the information on the internet, allowing us to pose
queries rather than browse documents, to infer new
knowledge from existing facts, and to identify
inconsistencies. Some of the advantages of achieving this
goal include [
          <xref ref-type="bibr" rid="ref13 ref4">4</xref>
          ]:
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Advantages of the Semantic Web</title>
        <p>The ability to locate information based on its meaning, e.g. knowing when two statements are
equivalent, or knowing that references to a person in different web pages refer to the same
individual.</p>
        <p>Integrating information across different sources: by creating mappings across application and
terminological boundaries we can identify identical or related concepts.</p>
        <p>Improving the way in which information is presented to a user, e.g. aggregating information
from different sources, removing duplicates, and summarising the data.</p>
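        <p>The integration advantage above can be made concrete: once two sources use the same URI for a concept, merging their statements and removing duplicates is mechanical. A minimal sketch in plain Python, with invented data and URIs:
```python
# Two independent sources describe the same individual using the same URI
# (all identifiers here are invented for illustration).
site_a = [("http://example.org/person/42", "name", "J. Cardiff")]
site_b = [("http://example.org/person/42", "affiliation", "ITT Dublin"),
          ("http://example.org/person/42", "name", "J. Cardiff")]

# Because the shared URI identifies the concept, set union both integrates
# the sources and removes the duplicate statement.
merged = sorted(set(site_a) | set(site_b))
print(merged)  # three statements collapse to two
```
        </p>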
        <p>
          While the technologies to enable the development of the
Semantic Web were in place from the conception of the
web, a seminal article by Tim Berners-Lee, James Hendler
and Ora Lassila [
          <xref ref-type="bibr" rid="ref1 ref10">1</xref>
          ], published in Scientific American in 2001,
provided the impetus for research and development to
commence. The authors described a world in which
independent applications could cooperate and share their
data in a seamless way to allow the user to achieve a task
with minimal intervention. Central to this vision is the
ability to "unlock" data that is controlled by different
applications and make it available for use by other
applications. Much of this data is already available on the
Web; for example, we can access our bank statements, our
diaries and our photos online. But the data is controlled by
proprietary applications. The Semantic Web vision is to
publish this data in a sharable form: we could integrate the
items of our bank statements into our calendar so that we
could see what transactions we made on a given day, or include
photos so that we could see what we were doing at that time.
However, eight years after publication of this article, we are
still some distance from realising this vision. In this paper, we
present an overview of the Semantic Web. We explain why
progress has been slow and the reasons we believe this is
about to change.
        </p>
        <p>The paper is organized as follows. In Section II we describe
the problems we face when trying to extract meaning from
the web as it is today. Section III presents a brief overview
of the technologies underlying the Semantic Web. In
Section IV we give an overview of the gamut of typical
Semantic Web applications and Section V introduces the
Linking Open Data project. Finally, we present our
conclusions and look to the future in Section VI.</p>
      </sec>
      <sec id="sec-2-3">
        <title>II. THE PROBLEM WITH THE "SYNTACTIC WEB"</title>
        <p>In Figure 1 we see a "typical" web page written in HTML
which we will use to exemplify some of the drawbacks of
the traditional web. This page lists the keynote speeches
which took place at the 2009 World Wide Web conference1.
To the reader, the content of the page can be interpreted
intuitively. We can read the titles of the speeches, the names
of the speakers and the time and dates at which they take
place. Furthermore, by familiarity with browser interaction
paradigms, we can realize that by following a hyperlink we
can retrieve information about concepts related to the
conference (authors, sponsors, attendees etc.). In this
example, by following the hyperlink labelled "Sir Tim
Berners-Lee" we will retrieve a document containing
information about the person of this name. We intuitively
assign a meaning - perhaps "has-homepage" - to the
hyperlink, allowing us to assimilate the information
presented to us.</p>
        <p>A web browser cannot assign any meaning to the links we see in
this page - a hyperlink is simply a link from one document
to another, and the interpretation of the meaning of the link
(and of the documents themselves!) is a task for the human
reader. All that can be inferred automatically is that some
undefined association between the two documents exists.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1 http://www2009.org/keynote.html</title>
      <p>The problems are even clearer when we consider the
nature of keyword-based searching. While search engines
such as Google and Yahoo! are clearly very good at what
they do, we are frequently presented with a vast number of
results, many (most?) of which will be irrelevant to our
search. Semantically similar items will not be retrieved (for
instance, a search for "movie" will not retrieve results where
the word "film" was used). And most significantly, the
result set is a collection of individual web pages. Our tasks
often require access to multiple sites (such as when we book
a holiday), and so it is our responsibility to formulate a
sequence of queries to retrieve the individual web pages,
each one of which performs part of the task at hand.</p>
      <p>There are two potential ways to deal with this problem. One
approach is to take the web as it is currently implemented,
and to use Artificial Intelligence techniques to analyze the
content of web pages in order to provide an interpretation of
their meaning. This approach, however, would be prone to error
and would require validation. Furthermore, the rate at which
the web is growing would render it practically impossible to
achieve.</p>
      <p>The other approach is to represent the web pages in a form
in which we can represent and interpret the data they
contain. If there is a common representation to express the
meaning of the data on the web, we can then develop
languages, reasoners, and applications which can exploit
this representation. This is the approach of the Semantic
Web.</p>
      <sec id="sec-3-1">
        <title>III. SEMANTIC WEB TECHNOLOGIES</title>
        <p>
          The Semantic Web describes a web of data rather than
documents. And just as we need common formats and
standards to be able to retrieve documents from computers
all over the world, we need common formats for the
representation and integration of data. We also need
languages that allow us to describe how this data relates to
real world objects and to reason about the data. The famous
"Layer Cake" [
          <xref ref-type="bibr" rid="ref19">10</xref>
          ] diagram, shown in Figure 2, gives an
overview of the hierarchy of the principal languages and
technologies, each one exploiting the features of the levels
beneath it. It also reinforces the fact that the Semantic Web
is not separate from the existing web, but is in fact an
extension of its capabilities.
        </p>
        <p>
          In this section, we summarize and discuss the key aspects
shown in the Layer Cake diagram. Firstly, we describe the
core technologies: the languages RDF and RDFS. Next we
describe the higher-level concepts, focusing in particular on
the concept of the ontology, which is at the heart of the
Semantic Web infrastructure. Finally, we examine the trends
and directions of the technology. For further information on
the concepts presented in this section, the reader is referred
to a more detailed work (e.g. [
          <xref ref-type="bibr" rid="ref13 ref4">4</xref>
          ], [
          <xref ref-type="bibr" rid="ref14 ref5">5</xref>
          ]).
        </p>
        <p>What HTML is to documents, RDF (Resource Description
Framework) is to data. It is a W3C standard2 based on XML
which allows us to make statements about objects. It is a
data model rather than a language: we can say that an
object possesses a particular property, or that it has a named
relationship with another object. RDF statements are written
as triples: a subject, a predicate and an object.</p>
        <p>By way of example, the statement
"The Adventures of Tom Sawyer" was written by Mark Twain
could be expressed in RDF by a statement such as:
&lt;rdf:Description rdf:about="www.famouswriters.org/twain/mark"&gt;
  &lt;s:hasName&gt;Mark Twain&lt;/s:hasName&gt;
  &lt;s:hasWritten rdf:resource="www.books.org/ISBN0001047"/&gt;
&lt;/rdf:Description&gt;</p>
        <p>At first glance it may appear that this information could be
equally well represented using XML. However, XML makes
no commitment on which words should be used to describe
a given set of concepts. In the above example we have a
property entitled "hasWritten", but this could equally have
been "IsAuthorOf" or another such variant. So, XML is
suitable for closed and stable domains, rather than for
sharable web resources.</p>
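        <p>The uniform triple structure is what makes generic processing possible. The following sketch (plain Python rather than an RDF toolkit, reusing the illustrative URIs from the example above) shows how statements of this shape can be stored and queried with a single generic function:
```python
# RDF statements reduce to (subject, predicate, object) triples.
# The URIs are the illustrative ones from the text.
TWAIN = "www.famouswriters.org/twain/mark"
BOOK = "www.books.org/ISBN0001047"

triples = [
    (TWAIN, "hasName", "Mark Twain"),
    (TWAIN, "hasWritten", BOOK),
]

# Because every statement has the same shape, one query function works
# for any subject and any predicate.
def objects_of(subject, predicate):
    return [o for (s, p, o) in triples if s == subject and p == predicate]

print(objects_of(TWAIN, "hasWritten"))  # ['www.books.org/ISBN0001047']
```
        </p>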
        <p>The statements we make in RDF are unambiguous and have
a uniform structure. Concepts are each identified by a
Uniform Resource Identifier (URI), which allows us to
make statements about the same concept in different
applications. This provides the basis for semantic
interoperability, allowing us to distinguish between
ambiguous terms (for instance, an address could be a
geographical location, or a speech) and to define a place on
the web at which we can find the definition of the concept.</p>
        <p>To describe and make general statements collectively about
groups of objects (or classes), and to assign properties to
members of these groups, we use RDF Schema, or RDFS3.
RDFS provides a basic object model, and enables us to
describe resources in terms of classes, properties, and
values. Whereas in RDF we spoke about specific objects
such as "The Adventures of Tom Sawyer" and "Mark
Twain", in RDFS we can make general statements such as
"A book was written by an author".</p>
      </sec>
      <sec id="sec-3-2">
        <title>This could be expressed in RDFS as</title>
        <p>&lt;rdf:Property rdf:ID="HasWritten"&gt;
  &lt;rdfs:domain rdf:resource="#author"/&gt;
  &lt;rdfs:range rdf:resource="#book"/&gt;
&lt;/rdf:Property&gt;
An expansion of these examples, and the relationship
between the graphical representations of RDF and RDFS,
is shown in Figure 3.
2 www.w3.org/RDF/</p>
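        <p>Domain and range declarations like the ones above license simple inferences: any subject of a HasWritten statement must be an author, and any object a book. A minimal sketch of that inference step, in plain Python with invented identifiers rather than a real RDFS reasoner:
```python
# RDFS-style schema: domain and range of the hasWritten property.
domain = {"hasWritten": "author"}
range_ = {"hasWritten": "book"}   # "range" avoided: it shadows a builtin

# One concrete RDF statement (identifiers invented for illustration).
triples = [("twain", "hasWritten", "ISBN0001047")]

# Derive rdf:type statements from the schema declarations.
inferred = set()
for s, p, o in triples:
    if p in domain:
        inferred.add((s, "rdf:type", domain[p]))
    if p in range_:
        inferred.add((o, "rdf:type", range_[p]))

print(inferred)
```
        </p>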
      </sec>
      <sec id="sec-3-3">
        <title>3 http://www.w3.org/TR/rdf-schema/</title>
        <p>
          RDF and RDFS allow us to describe aspects of a domain,
but the modelling primitives are too restrictive to be of
general use. We need to be able to describe the taxonomic
structure of the domain, to model restrictions or
constraints of the domain, and to state and reason
over a set of inference rules associated with the domain. In short, we
need to be able to describe an ontology of our domain.
The term ontology originated in the sphere of philosophy,
where it signified the nature and the organisation of reality,
i.e. concerning the kinds of things that exist, and how to
describe them. Our definition within Computer Science is
more specific, and the most commonly cited definition has
been provided by Tom Gruber in [
          <xref ref-type="bibr" rid="ref15 ref6">6</xref>
          ], where he defines
an ontology as "an explicit and formal specification of a
conceptualization". In other words, an ontology provides us
with a shared understanding of a domain of interest. The
fact that the specification is formal means that computers
can perform reasoning about it. This in turn will improve the
accuracy of searches, since a search engine can retrieve data
regarding a precise concept, rather than a large collection of
web pages based on keyword matching.
        </p>
        <p>In relation to the Semantic Web, for us to share, reuse and
reason about data we must provide a precise definition of
the ontology, and represent it in a form that makes it
amenable to machine processing. An ontology language
should ideally extend existing standards such as XML and
RDF/S, be of "adequate" expressive power, and provide
efficient automated reasoning support. The most widely
used ontology language is the "Web Ontology Language",
which curiously has the acronym "OWL"4. Along with
RDF/S, OWL is a W3C standard and augments RDFS with
additional constraints such as localised domain and range
constraints, cardinality and existence constraints, and
transitive, inverse, and symmetric properties.</p>
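        <p>Declaring a property transitive, for instance, allows a reasoner to derive facts that were never stated explicitly. The following toy sketch (plain Python with an invented locatedIn property, not an actual OWL reasoner) illustrates the closure computation such a declaration triggers:
```python
# Two asserted facts about an invented transitive property "locatedIn".
facts = {("Tallaght", "locatedIn", "Dublin"),
         ("Dublin", "locatedIn", "Ireland")}

# Repeatedly apply the transitivity rule until no new fact is derived
# (a naive fixpoint computation).
changed = True
while changed:
    changed = False
    for (a, p1, b) in list(facts):
        for (c, p2, d) in list(facts):
            if p1 == p2 == "locatedIn" and b == c and (a, p1, d) not in facts:
                facts.add((a, p1, d))
                changed = True

print(("Tallaght", "locatedIn", "Ireland") in facts)  # True
```
        </p>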
        <p>Adding a reasoning capability to an ontology language is
tricky, since there is a trade-off between efficiency and
expressiveness. Ultimately it depends on the nature and
requirements of the end application, and it is for this reason
that OWL offers three sublanguages:
4 www.w3.org/2004/OWL</p>
        <p>OWL Lite supports only a limited subset of OWL
constructs and is computationally efficient;
OWL DL is based on a first-order logic called
Description Logic;
OWL Full offers full compatibility with RDFS, but
at the price of computational tractability.</p>
        <p>Examples of applications which could require very different
levels of reasoning capabilities are described in the
following section.</p>
        <p>The top layers of the layer cake have received surprisingly
little attention, considering that they are crucial to the successful
deployment of Semantic Web applications. The proof layer
involves the actual deductive process, the representation of
proofs, and proof validation. It allows applications to be asked
why a particular conclusion has been reached, i.e.
they can give proof of their conclusions. The trust layer
provides authentication of identity and evidence of the
trustworthiness of data and services. It is supported through
the use of digital signatures, recommendations by trusted
agents, ratings by certification agencies, etc.</p>
        <p>
          C. Recent Trends and Technological Developments
As with any maturing technology, the architecture will not
remain static. In 2006 Tim Berners-Lee suggested an update
to the layer cake diagram [
          <xref ref-type="bibr" rid="ref11 ref2">2</xref>
          ], shown in Figure 4;
however, this is just one of several proposed refinements.
Some of the new features and languages include the
following.
        </p>
        <p>Rules and Inferencing Systems. Alternative approaches to
rule specification and inferencing are being developed. RIF
(Rules Interchange Format) is a language for representing
rules on the Web and for linking different rule-based
systems together. The various formalisms are being
extended in order to capture causal, probabilistic and
temporal knowledge.</p>
        <p>Database Support for RDF. As the volume of RDF data
increases, it is necessary to provide the means to store,
query and reason efficiently over the data. Database support
for RDF and OWL is now available from Oracle (although
at present the focus is on storage, rather than inferencing
capabilities). Other open source products include 3Store5
and Jena6. The specification of a query language for RDF,
SPARQL, was adopted by the W3C in 2008.</p>
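        <p>The idea behind a SPARQL basic graph pattern - triple patterns with shared variables, matched conjunctively against the data - can be sketched in a few lines. The matcher and data below are invented for illustration and are not the SPARQL algebra itself:
```python
# Toy triple store (identifiers invented for illustration).
triples = [
    ("p1", "type", "protein"),
    ("p1", "involvedIn", "signal_transduction"),
    ("p2", "type", "protein"),
]

def match(pattern, triple, binding):
    """Unify one pattern with one triple; terms starting with '?' are variables."""
    b = dict(binding)
    for pat, val in zip(pattern, triple):
        if pat.startswith("?"):
            if b.get(pat, val) != val:
                return None     # variable already bound to something else
            b[pat] = val
        elif pat != val:
            return None         # constant term does not match
    return b

def query(patterns):
    """Conjunctive query: every pattern must match, bindings are shared."""
    results = [{}]
    for pat in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := match(pat, t, b)) is not None]
    return results

# Which proteins are involved in signal transduction?
print(query([("?x", "type", "protein"),
             ("?x", "involvedIn", "signal_transduction")]))
```
        </p>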
        <p>RDF Extraction. The language GRDDL ("Gleaning
Resource Descriptions from Dialects of Languages")
identifies when an XML document contains data compatible
with RDF and provides transformations which can extract
the data. Considering the volume of XML data available on
the web, a means of converting it to RDF is clearly highly
desirable.</p>
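        <p>The gleaning idea can be illustrated with a small stdlib sketch. GRDDL itself works via declared XSLT transformations; the XML dialect, element names and URI scheme below are invented purely for the example:
```python
import xml.etree.ElementTree as ET

# An invented XML dialect holding data that maps naturally onto triples.
xml_doc = """
<books>
  <book isbn="0001047">
    <title>The Adventures of Tom Sawyer</title>
    <author>Mark Twain</author>
  </book>
</books>
"""

# "Glean" (subject, predicate, object) triples from the XML structure:
# the isbn attribute becomes the subject, child elements become properties.
triples = []
for book in ET.fromstring(xml_doc).iter("book"):
    subject = "urn:isbn:" + book.get("isbn")
    for child in book:
        triples.append((subject, child.tag, child.text))

print(triples)
```
        </p>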
      </sec>
    </sec>
    <sec id="sec-4">
      <title>IV. SEMANTIC WEB APPLICATIONS</title>
      <p>5 http://sourceforge.net/projects/threestore/ 6 http://jena.sourceforge.net/</p>
      <p>Ontology Language Developments. The OWL language was
adopted as a standard in 2004. In 2007, work began on the
definition of a new version, OWL 2, which includes easier
query capabilities and efficient reasoning algorithms that scale
to large datasets.</p>
      <p>Even though Semantic Web technology is in its infancy,
a wide range of applications already exists. In this
section we give a brief overview of some typical application
areas.</p>
      <p>
        E-Science Applications. Typically, e-science describes
scenarios involving large data collections requiring
computationally intensive processing, where the
participants are distributed across the world. An
infrastructure whereby scientists from different disciplines
are able to share their insights and results is seen as critical,
particularly when we consider the large volumes of data
becoming available online. The Gene
Ontology7 is a project aimed at standardizing the
representation of genes across databases and species.
Perhaps the most famous e-science project is the Human
Genome Project8, which identified the genes in human DNA
and includes over 500 datasets and tools. The
International Virtual Observatory Alliance9 makes available
astronomical data from a number of digital archives.
Interoperation of Digital Libraries. Institutions such as
libraries, universities, and museums have vast inventories of
materials which are increasingly becoming available online.
These systems are implemented using a range of different
technologies, and although their aims are similar, it is a huge
challenge to enable the different institutions to access each
other's catalogues. Ontologies are useful for providing
shared descriptions of the objects, and ontology mapping
techniques are being applied to achieve semantic
interoperability [
        <xref ref-type="bibr" rid="ref12 ref3">3</xref>
        ].
7 http://www.geneontology.org/index.shtml
8 http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml
9 www.ivoa.net
      </p>
      <p>Travel Information Systems. An application which would allow a user to seamlessly book
and plan the various elements of a trip (flights, hotel, car
hire, etc.) is highly desirable. Ontologies again could be used
to arrive at a common understanding of terminology. The
Open Travel Alliance is building XML-based specifications
which allow for the interchange of messages between
companies. While this is a first step, an agreed ontology
would be needed in order to achieve any meaningful
interoperation.</p>
      <p>Although many potential applications can be identified,
fewer are deployed at this time than we might expect.
One possible reason is the lack of a common understanding
of what the Semantic Web can offer, and more particularly
of what the role of the ontology is. At one end of the spectrum we
find applications which take the "traditional", or AI, view of
inferencing, in which accuracy is paramount. Such
applications arise in combinatorial chemistry, for example,
in which vast quantities of information on chemicals and
their properties are analysed in order to identify useful new
drugs. Coding the required drug's properties as assertions
will reduce the number of samples which need to be
constructed and manually analyzed by orders of magnitude.
In cases such as these, the time taken to perform the
inferencing is less important, since the trade-off will be a
large reduction in the samples to be analyzed.</p>
      <p>At the other end of the spectrum, we have "data centric"
web applications which require a swift response to the user.
Examples of this type of application include social network
recommender systems such as Twine10 which make use of
ontologies to recommend their users to other individuals
who may be of interest to them. While it is clear that a
response must be generated for the user within a few
seconds, we can observe too that there can be no logical
proof of correctness and soundness of the answers generated
in this type of case! Accordingly, the level of inferencing
required in this type of application is minimal.</p>
      <sec id="sec-4-1">
        <title>V. THE FUTURE: A WEB OF DATA?</title>
        <p>While we have stated that the Semantic Web focuses on
data in contrast to the document centric view of the
traditional web, this is not the complete picture. In order to
realize value from putting data on the web, links need to be
made in order to create a "web of data". Instead of having a
web with pages that link to each other, we can have (with
the same infrastructure) a data model with information on
each entity distributed over the web.</p>
        <p>
          The Linking Open Data [
          <xref ref-type="bibr" rid="ref12 ref3">3</xref>
          ] project aims to extend the
collections of data being published on the web in RDF
format and to create links between them. In a sense, this is
analogous to traditional navigation between hypertext
documents, where the links are now the URIs contained in
the RDF statements. Search engines could then query,
rather than browse, this information.
10 www.twine.com
        </p>
        <p>In a recent talk at the TED 2009 conference11, Tim Berners-Lee
gave a powerful motivating example for the project:
scientists investigating drug discovery for Alzheimer's
disease needed to know which proteins were involved in
signal transduction and were related to pyramidal neurons.
Searching on Google returned 223,000 hits, but no
document provided the answer, as nobody had asked the
question before. Posing the same question to the linked data
produced 32 hits, each of which is a protein meeting the
specified properties.</p>
        <p>At the conception of the project in early 2007, there were a
reported 200,000 RDF triples published. By May 2009 this
had grown to 4.7 billion. Core datasets include:</p>
      </sec>
      <sec id="sec-4-2">
        <title>DBpedia, a database extracted from Wikipedia</title>
        <p>containing over 274 million pieces of information.
The knowledge base is constructed by analyzing
the different types of structured information, such
as the "infoboxes", tables, pictures, etc.</p>
        <p>
          The DBLP Bibliography, which contains
bibliographic information on academic papers, and
Geonames, which contains RDF descriptions of 6.5
million geographical features.
So where is the Semantic Web? In a 2006 article [
          <xref ref-type="bibr" rid="ref20">11</xref>
          ], Tim
Berners-Lee agreed that the vision he described in the
Scientific American article has not yet arrived. But perhaps
it is arriving by stealth, under the guise of the "Web 3.0"
umbrella. Confusion still abounds about the meaning of the
term "Web 3.0", which has been variously described as
being about the meaning of data, intelligent search, or a
"personal assistant". This sounds like what the Semantic
Web has to offer, and even if the terms do not become
synonymous, it is clear that the Semantic Web will form a
crucial component of Web 3.0 (or vice versa!).
11 http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
12 http://en.wikipedia.org/wiki/File:Lod-datasets_2009-07-14_colored.png
13 http://protege.stanford.edu/
14 http://www.kowari.org/
        </p>
        <p>
          The last five years have seen Semantic Web applications
move from the research labs to the marketplace. While the
use of ontologies has been flourishing in niche areas such as
e-science for a number of years, a recent survey by Hendler
[
          <xref ref-type="bibr" rid="ref16 ref7">7</xref>
          ] shows a marked increase in the number of commercially
focused Semantic Web products. The main industrial players
are starting to take the technology more seriously: in August
2008, Microsoft bought Powerset, a semantic search engine,
for a reported $100m.
        </p>
        <p>As we have discussed, the "chicken and egg" dilemma is
resolving itself with tens of billions of RDF triples now
available on the web, and this number is continuing to
increase exponentially.</p>
        <p>Also, it is becoming easier for companies to enter the
market of Semantic Web applications. There are now a wide
range of open source applications such as Protégé13 and
Kowari14 which provide building blocks for application
development, making it more cost effective to develop
Semantic Web products.</p>
        <p>Some observers argue that the Semantic Web has failed to
deliver on its promise, arguing instead that the Web 2.0 genre
of applications signifies the way forward. The Web 2.0
approach has made an enormous impact in recent years, but
these applications could be developed and deployed more
rapidly because their designers did not have the inconvenience of
standards to adhere to. In this article we have described
the steady infiltration of the Semantic Web from the research
lab to the marketplace over the last
decade. As the standards mature and the web of data
expands, we are confident that the Semantic Web vision is
set to become a reality.</p>
        <p>Gruber, T. 1993. Toward principles for the design of ontologies used
for knowledge sharing. In Guarino, N., Poli, R. (eds), International
Workshop on Formal Ontology, Padova, Italy.
Hendler, J. 2008. Linked Data: The Dark Side of the Semantic Web
(tutorial), 7th International Semantic Web Conference (ISWC08),
Karlsruhe, Germany.</p>
        <p>
          Linking Open Data Wiki, available at
http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData
Manning, C., Schütze, H., 1999. Foundations of Statistical Natural
Language Processing. MIT Press.
[
          <xref ref-type="bibr" rid="ref19">10</xref>
          ] "Semantic Web - XML2000, slide 10", W3C.
        </p>
        <p>http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html</p>
        <p>Exploiting Wikipedia as a Knowledge Base: Towards
an Ontology of Movies</p>
        <sec id="sec-4-2-1">
          <title>Rodrigo Alarcón, Octavio Sánchez, Víctor Mijangos</title>
        </sec>
        <sec id="sec-4-2-2">
          <title>Grupo de Ingeniería Lingüística, Universidad Nacional Autónoma de México</title>
        </sec>
        <sec id="sec-4-2-3">
          <title>Basamento de la Torre de Ingeniería, Ciudad Universitaria, México, D.F. {ralarconm,osanchezv,vmijangosc}@iingen.unam.mx</title>
          <p>Abstract. Wikipedia is a huge knowledge base that grows every day thanks to the
contributions of people all around the world. Part of the information in
each article is kept in a special, consistently formatted table called an infobox.
In this article, we analyze the Wikipedia infoboxes of movie articles and
describe some of the problems that can make extracting information from these
tables a difficult task. We also present a methodology to automatically extract
information that could be useful towards the building of an ontology of movies
from the Spanish Wikipedia.
1 Introduction</p>
          <p>Wikipedia is a free encyclopedia of open content that has become an important
resource for the construction of the Semantic Web. Since its beginnings in
2001, the English version has reached more than 2 million articles, while the
Spanish version has around 480 thousand. All of the content has been
written and edited by volunteers from different countries in many different languages,
and it is covered by the GFDL (GNU Free Document License), which makes it possible
to use it freely.</p>
          <p>One important aspect of the structure of Wikipedia is the social control exercised
by the community, which is able to prevent the spam, nonsense and other kinds of
vandalism that are recurrent on some media sites. Moreover, this same control makes it
possible to constantly increase the quality and precision of the articles.</p>
        </sec>
        <sec id="sec-4-2-4">
          <title>Inside Wikipedia, there is an entry called Wikipedia: Wikipedia in academic</title>
          <p>
            studies1, where it is possible to see the growth of academic interest in this
encyclopedia. This interest is related to the use of Wikipedia in different academic
studies and as a knowledge base for developing specific tools. On the one hand, to
mention a few, some works have focused on the social phenomenon that Wikipedia
represents [
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] [
            <xref ref-type="bibr" rid="ref11 ref2">2</xref>
            ], others have denounced inherent problems of this
kind of media site [
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ], and others have obtained specific information and statistical
data about the users [
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ]. On the other hand, Wikipedia has become a useful resource
for the extraction of definitions, named entity recognition, machine translation and
semantic relation extraction [
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ]. In this last field, Wikipedia represents a huge
knowledge base that has made possible the development of specific ontologies for the
construction of the Semantic Web.
1 http://en.wikipedia.org/wiki/Academic_Research_on_Wikipedia.
          </p>
          <p>In this paper we present work in progress on the elaboration of an ontology of
movies from the Wikipedia in Spanish. First we briefly present an
overview of some studies related to the use of Wikipedia for semantic relation
extraction and ontology construction (2). Then we explain the first steps towards
the elaboration of an ontology of movies (3). This step includes: a) the description of
the so-called infobox, which is part of each movie article in Wikipedia and contains specific
data about the film (3.1); b) the specific relations to extract automatically (3.2); and c)
our proposed XML schema to represent these relations (3.3). Finally, we discuss
our preliminary results and present the future work (4).</p>
          <p>2 Wikipedia as a Semantic Knowledge Base</p>
        </sec>
        <sec id="sec-4-2-5">
          <title>There is a growing number of efforts to mine the information in Wikipedia for</title>
          <p>different purposes. As we have mentioned before, one of these interests is the extraction
of semantic information that could be helpful in the process of giving more meaning
to the Web. In Wikipedia, meaning can be seen as the knowledge about things
represented in different ways: definitions, descriptions, images, numeric data, etc.
Furthermore, the meaning of each concept explained in the encyclopedia is related to
the meaning of other concepts, which constitutes a helpful semantic network for
understanding concepts in the field to which they belong.</p>
        </sec>
        <sec id="sec-4-2-6">
          <title>In this sense, Wikipedia represents a valuable source of knowledge to extract</title>
          <p>
            semantic information between concepts. A general overview of how Wikipedia could
be used to extract concepts, relations, facts and descriptions can be found in [
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ]. Here,
the authors explain the use of Wikipedia for natural language processing, information
extraction and ontology building.
          </p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ], the authors describe a methodology that uses the links between categories to
mine specific relations. They analyze some measures to infer relations and try to
provide a semantic scheme in order to improve search capabilities and to give
users meaningful suggestions for editing articles. In the same context, in [
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] the authors
use Wikipedia to develop a methodology for the automatic annotation of different
semantic relations. This work is based on discovering lexical patterns that can be used
to recognize specific relations between concepts. They evaluate the methodology by
using a corpus and searching in it for the relations found in Wikipedia. Their results
show that this kind of methodology could be a good starting point for automatic
ontology construction.
          </p>
          <p>
            The research presented in [
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] shows how hyperlinked pages are used to generate a
domain hierarchy by means of ranking articles that are strongly linked. These articles
become a domain corpus for the automatic construction of an ontology. The same
goal of obtaining ontologies through Wikipedia is described in [
            <xref ref-type="bibr" rid="ref19">10</xref>
            ], where the authors
apply machine learning techniques to improve the performance of a system that mines
the infoboxes. Finally, in [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ] we can find another example of the use of Wikipedia
for ontology construction, specifically for document classification.
          </p>
        </sec>
        <sec id="sec-4-2-7">
          <title>This is not, and does not pretend to be, an extensive list of all the work done on</title>
          <p>semantic relation extraction or ontology construction from Wikipedia. Our main
purpose is to show both the interest that the area of extraction and
organization of semantic information has awakened, and some of the automatic analyses and
procedures that it is possible to develop taking into account Wikipedia’s structure.
Nevertheless, as we will see in this paper, this structure is often not well organized,
which makes it difficult to implement automatic processes.</p>
          <p>3 Towards an Ontology of Movies
In order to develop an ontology of movies we have defined three main steps that
lead us to our goal. The first one is to collect our input corpus from the Wikipedia
movie articles and to analyze the infobox structure in them. After that, the
second step is the delimitation and automatic extraction of specific semantic
information. Finally, as a third step we consider the implementation of the extracted
information in an XML schema that will form the basis for a later
annotation schema.</p>
          <p>3.1 Movies infobox structure
The first step of our methodology was to build a corpus from the articles in the
films by year category. We used the category tree option to find a list of the movie
titles from the year 1892 to 20082. After that, we used the export pages option to
retrieve all the articles on this list. We found a total of 5,561 articles, of which
the opening and closing infobox tags ({{Fields…}}) were present in 5,092 cases. This latter
number represents the total of articles in our corpus.</p>
        </sec>
        <sec id="sec-4-2-8">
          <title>After that, we analyzed the infobox of each entry. The infobox is a resource used in</title>
          <p>Wikipedia to summarize and group specific data in some
articles. In general terms, its purpose is to present the information in a more accessible
format, and it can be used as a resource by other applications.</p>
        </sec>
        <sec id="sec-4-2-9">
          <title>In the Spanish language, there are 49 proposed fields for the infobox, of which only two</title>
          <p>are considered required: film title and original title. The infobox is framed in
{{Fields…}}, and each field inside is preceded by a vertical bar “|” and followed
by an equal sign “=” and the specific information. Fields without descriptions
remain empty after the equal sign. This means a field has the following structure:
| Field = description of the field</p>
        </sec>
        <sec id="sec-4-2-10">
          <title>An example could be the following:</title>
          <p>| genre = Science fiction</p>
        </sec>
        <sec id="sec-4-2-11">
          <title>2 Data was collected in February 2009.</title>
        </sec>
        <sec id="sec-4-2-12">
          <title>The full set of fields used in the movie infoboxes from the Spanish Wikipedia can be found in Table 1.</title>
          <p>Table 1. Infobox template in Spanish. 
Fields
título original
título
índice
imagen
nombre imagen
dirección
dirección2
dirección3
dirección4
dirección5
dirección6
dirección7
dirección8
dirección9
ayudantedirección
dirección artistica
producción
diseño de producción
guión
música
sonido
edición
fotografía
montaje
vestuario
efectos
reparto
país
país2
país3
país4
estreno
estreno1
género
duración
clasificación
idioma
idioma2
idioma3
idioma4
productora
distribución
presupuesto
recaudación
precedida_por
sucedida_por
imdb
filmaffinity
sincat</p>
        </sec>
        <sec id="sec-4-2-13">
          <title>From the table above we can see the different kinds of information that the fields</title>
          <p>can introduce. We see information about dirección (direction), estreno (premiere),
idioma (language, language2, language3, etc.), as well as país (country, country2,
country3, etc.), IMDb (Internet Movie Database) or Filmaffinity links (external Web
sites with movie information).</p>
        </sec>
        <sec id="sec-4-2-14">
          <title>The 49 fields in this table are the ones suggested in the official Wikipedia movies infobox template. Nevertheless, in our corpus we found several empty fields. We automatically found a total of 94,584 field occurrences, of which 30,742 were empty (32.48% of the total).</title>
          <p>Furthermore, one of the problems present in the infoboxes is the lack of
standardization. Some of the elements established by Wikipedia are written
inconsistently by the authors of the articles, while others have typographical
errors. For example, the field dirección (direction) also appears as director (director);
the field título original (original title) can be found as título en España (title in Spain),
título principal (main title) or título traducido (translated title), among others. More
complicated is the case of estreno (premiere), which presents variations like año
(year), fecha (date), fecha de estreno (premiere date), or primera emisión (first
broadcast).</p>
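          <p>The kind of normalization this implies can be sketched as a simple synonym table mapping observed variants to canonical field names. The sketch below is ours and only includes the variants cited in the text; a real mapping would be compiled from a full review of the corpus.</p>

```python
# Hypothetical synonym table: observed variant (lowercase) -> canonical field.
# Only the variants mentioned in the text are included here.
CANONICAL = {
    "director": "dirección",
    "título en españa": "título original",
    "título principal": "título original",
    "título traducido": "título original",
    "año": "estreno",
    "fecha": "estreno",
    "fecha de estreno": "estreno",
    "primera emisión": "estreno",
}

def normalize_field(name: str) -> str:
    """Map a variant field name to its canonical form (identity if unknown)."""
    name = name.strip().lower()
    return CANONICAL.get(name, name)

print(normalize_field("Fecha de estreno"))  # estreno
print(normalize_field("género"))            # género (unchanged)
```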
        </sec>
        <sec id="sec-4-2-15">
          <title>Typos are another common form of non-standardization. For the field género (genre) we can find mistakes like *gènero, *genero or *genro.</title>
        </sec>
        <sec id="sec-4-2-16">
          <title>In the corpus we can also find other fields that are not proposed in the original schema, such as asistente de artes marciales (martial arts assistant), calificación (rating), premios (awards), Myspace, and so on. In total we found 205 non-official fields.</title>
        </sec>
        <sec id="sec-4-2-17">
          <title>If we compare the Spanish schema to the English one, we can notice that the latter infobox contains fewer fields, which probably allows it to be applied in a more standardized way in practice. The fields of the movies infobox in English can be seen in Table 2.</title>
          <p>Table 2. Infobox template in English 
Fields
name
image
image_size
caption
director
producer
writer
narrator
starring
music
cinematography
editing
studio
distributor
released
runtime
country
language
budget
gross
preceded_by
followed_by</p>
        </sec>
        <sec id="sec-4-2-18">
          <title>Here we can observe a total of 22 fields, compared to the 49 in the Spanish</title>
          <p>template. It is important to note that most other languages follow a
structure similar to the one described for English. There is a template similar to the
English movies infobox in the French Wikipedia, with some added elements like format,
awards and IMDb. In Italian, the infobox defines general fields for different
genres of films: generic, animation or film a episodi (films composed of several
short films), with specific fields for each genre; while in German, the fields specify
more generic data, i.e., title, original title, producer or cameraman.</p>
        </sec>
        <sec id="sec-4-2-19">
          <title>In the infoboxes of different languages, the most common fields are title, director and</title>
          <p>premiere. There are also coincidences in other fields, for example music and
photography. Between English and Spanish there is a coincidence in preceded_by and
followed_by. Furthermore, in Spanish, as well as in French, there is an IMDb field,
while Italian and English do not include it. However, in English, links to IMDb or
Allmovie can appear within the article as external links rather than inside the
infobox template. These external links are also valuable information for extending the
semantic data of an ontology, as they can add information about the films that
does not appear in Wikipedia, or be used to complete the empty fields of the
infoboxes. Nevertheless, there is also no consistency in the occurrence of the
tags with external links. In our corpus, the IMDb tag occurs in approximately 80%
of the articles, while Filmaffinity occurs in around 5%.
3.2 Extracting specific relation data</p>
        </sec>
        <sec id="sec-4-2-20">
          <title>In theory, the structure of the infoboxes contains information that should be</title>
          <p>exploitable with relative ease. We decided to automatically extract the title, original
title, director, premiere year and genre, in order to generate a database with all of this
information. However, not all of this information is present in all the movie articles
found in the films by year category.</p>
        </sec>
        <sec id="sec-4-2-21">
          <title>As we have mentioned before, there are some inconsistencies in the names of the fields, their completeness, or the way the authors write them. In the case of the</title>
          <p>director field, we found it with complete information in the 5,092
articles with infoboxes; however, the genre field occurs in only 4,499 of these cases.
Taking into account that the inconsistencies of the metadata make the process of
automatic relation extraction from the film information more difficult,
we managed to obtain the data through the process described below.</p>
        </sec>
        <sec id="sec-4-2-22">
          <title>From our corpus, we found that 5,092 articles contained at least one director,</title>
          <p>although the field name in many of them was not the same, and a review had to be
made in order to compile a list of ad hoc synonyms for searching for this specific field.
The synset was formed by dirección (direction), director (director) and dirigida
(directed). Also, the whitespace after the equal sign that should follow the name of
the field was not always the same: sometimes there were tabs, sometimes
more than one space, and sometimes no space at all. Many of the directors’
names are also entries in Wikipedia, so many users decided to establish links to their
names, using the symbol “[[” followed by the name of the director and closing with
“]]”. This has the purpose of telling the wiki engine that there is a link: [[link to
the article]]. But not all of them had those brackets, and this caused trouble when
parsing the data to recover the name of the director of the film associated
with the title of the entry.</p>
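          <p>A sketch of the kind of clean-up involved (ours, for illustration only): trimming the variable whitespace and unwrapping the optional [[...]] wikilink brackets around a director value. The sample names are invented.</p>

```python
import re

def clean_value(raw: str) -> str:
    """Normalize an infobox value: trim spaces/tabs and unwrap [[wikilinks]]."""
    value = raw.strip()  # handles tabs, multiple spaces, or no space at all
    # [[Fritz Lang]] -> Fritz Lang; [[target|label]] -> label
    value = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", value)
    return value.strip()

print(clean_value("\t [[Fritz Lang]] "))  # Fritz Lang
print(clean_value("Fritz Lang"))          # Fritz Lang (no brackets, unchanged)
```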
        </sec>
        <sec id="sec-4-2-23">
          <title>The same problems were found when we tried to mine the original title of the</title>
          <p>movie. Despite the fact that this field appears in all the infoboxes, not all of its
occurrences contain information, which means that there are articles with the original
title field empty. It does not contain information in 195 articles in the corpus.</p>
        </sec>
        <sec id="sec-4-2-24">
          <title>With the premiere field it was also problematic to extract the information, because</title>
          <p>most of the films had different words to express the premiere year, for example año
(year), fecha de estreno (premiere date) or *añoacceso (access year). In this case we
decided to mine only the año (year) and estreno (premiere) variants, because of the
wide range of structural possibilities. We found that 23 film infoboxes do not contain
a premiere year: sometimes it was in the title and sometimes it was completely absent.</p>
        </sec>
        <sec id="sec-4-2-25">
          <title>Another field we exploited was género (genre), which also presents some inconsistencies that could be attributed to human error at the time of transcribing the template. This field was empty in 593 occurrences in our corpus and is the most often left unused.</title>
        </sec>
        <sec id="sec-4-2-26">
          <title>Summarizing, the number of occurrences of each field can be found in Table 3:</title>
          <p>Table 3. Numerical data found in the analysis of the infoboxes</p>
        </sec>
        <sec id="sec-4-2-27">
          <title>Field name</title>
          <p>Field name: occurrences / empty
Director: 5,092 / 0
Título: 5,092 / 0
Título ID: 5,092 / 0
Título original: 5,092 / 195
Año: 5,092 / 23
Género: 5,092 / 593</p>
        </sec>
        <sec id="sec-4-2-28">
          <title>From the table above we can see the three fields with empty information: premiere or year, original title and genre. The first one was empty in only 23 articles, while the</title>
          <p>last one in more than 500 cases. It is important to mention that the title of the movies
was not obtained from the infobox but directly from the XML provided by
Wikipedia, mainly because it is well demarcated by the labels &lt;title&gt; &lt;/title&gt;; in the
same way, we obtained the id used by Wikipedia to identify each article.</p>
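          <p>As an illustration of this step, the title and id of each page can be read directly from the export XML. This sketch is ours and assumes a simplified, namespace-free stand-in for the export file; the real Wikipedia export format uses an XML namespace that would need to be handled as well.</p>

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a Wikipedia export dump: one <page> per article.
sample = """<mediawiki>
  <page>
    <title>Metropolis (película)</title>
    <id>12345</id>
  </page>
</mediawiki>"""

root = ET.fromstring(sample)
for page in root.iter("page"):
    title = page.findtext("title")   # well demarcated by <title></title>
    page_id = page.findtext("id")    # Wikipedia's internal article id
    print(page_id, title)
```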
        </sec>
        <sec id="sec-4-2-29">
          <title>Despite the inconsistencies and typos that make the automatic process difficult, in 4,499 cases all the information that we were trying to mine was complete. We consider that this number represents a good starting point to form the basis of a first schema that could later be extended.</title>
          <p>3.3 Proposed XML schema
With the data from the infoboxes that were exploited, we decided to generate a first
XML scheme, which should give basic information about each film. This scheme can
be expanded as we extend our processes for extracting the information contained in the</p>
        </sec>
        <sec id="sec-4-2-30">
          <title>Wikipedia articles.</title>
        </sec>
        <sec id="sec-4-2-31">
          <title>To make this scheme, we decided to take the director field as the root XML tag. The</title>
          <p>first tag consists of the director’s name. Taking into account that directors can
have more than one film, we decided to introduce a filmography tag to group them.
This last tag includes each film with title, original title, year and genre tags. In
the opening film tag we added an attribute with Wikipedia’s title id number. An
example of the schema can be seen below.</p>
          <p>Proposed XML schema for the organization of movie data in the Spanish Wikipedia.</p>
          <p>As we can see in this example, the root tag is &lt;director&gt;&lt;/director&gt;. It is
followed by the director’s name tag &lt;name&gt;&lt;/name&gt;. At the same level there is the
tag &lt;filmography&gt;&lt;/filmography&gt;. This tag nests the film tag &lt;film
wiki_id=“”&gt;&lt;/film&gt;, which contains the relevant information of each film:
&lt;title&gt;&lt;/title&gt;, &lt;original_title&gt;&lt;/original_title&gt;, &lt;year&gt;&lt;/year&gt; and &lt;genre&gt;
&lt;/genre&gt;.</p>
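          <p>The nesting described above can be made concrete with a small sketch (ours; the film data and the wiki_id value are invented for illustration):</p>

```python
import xml.etree.ElementTree as ET

# Build one <director> record following the proposed scheme:
# director -> name, filmography -> film[@wiki_id] -> title, original_title, year, genre.
director = ET.Element("director")
ET.SubElement(director, "name").text = "Fritz Lang"

filmography = ET.SubElement(director, "filmography")
film = ET.SubElement(filmography, "film", wiki_id="12345")  # Wikipedia title id
ET.SubElement(film, "title").text = "Metrópolis"
ET.SubElement(film, "original_title").text = "Metropolis"
ET.SubElement(film, "year").text = "1927"
ET.SubElement(film, "genre").text = "Science fiction"

print(ET.tostring(director, encoding="unicode"))
```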
        </sec>
        <sec id="sec-4-2-32">
          <title>Based on the XML scheme, relational databases can be generated to manipulate the</title>
          <p>information that we have considered at this first stage of the ontology construction.
As we have said, this is not the final scheme, because as more data is extracted,
more can be added. This scheme is currently based on the Wikipedia film articles in
Spanish; however, it can be extended to fit other kinds of relevant
information, for example the country, external links (IMDb) or the Wikipedia ids of
directors or genres. Furthermore, it will be possible to use this scheme to
exploit Wikipedia in other languages, which could make it possible to fill the empty
fields in one language by relating them to the information in another language, as
well as to make multilingual queries.
4 Conclusions and future work
Nowadays, Wikipedia can be explored with the aim of obtaining information in
different ways. The information added manually by the users is generally well
organized and semi-structured. Also, many entries in Wikipedia have infoboxes
with summarized specific information about the theme treated in the article. We have
mentioned that the structure of Wikipedia has made it possible to exploit this
information in order to extract semantic data. The extraction of semantic relations is
one of the growing interests aimed at the construction of the Semantic Web.</p>
        </sec>
        <sec id="sec-4-2-33">
          <title>Even with this structure of Wikipedia, we have noticed some specific problems in</title>
          <p>automatically exploiting it. To summarize, these are: a) the fact that the field
names are not respected; b) typos due to human error; c) lack of information; and d)
differences in the infobox structure between languages. The latter should not be seen
as a problem; however, it would be advantageous to have standard fields across
different languages.</p>
        </sec>
        <sec id="sec-4-2-34">
          <title>Aiming at this standardization idea, it would be useful if Wikipedia’s process</title>
          <p>of writing or editing an article used a check-bot to confirm the information in the
infobox templates. Thus, fields not belonging to the template would be flagged,
as well as typos in the field names. Furthermore, the same check-bot could be used to
examine the existing fields, looking for inconsistencies in the infoboxes or the whole
articles.</p>
          <p>The work that we have presented here is a first approach towards the elaboration of
an ontology of movies from the Wikipedia in Spanish. We have shown the kinds of
semantic relations that it is possible to mine, as well as a first scheme to represent
them. We are aware that this scheme may well be improved towards achieving a
complete ontology of movies. Future work will include: a) defining a scheme to
represent subjects, relations and predicates between the extracted information, for
example an RDF scheme; b) implementing this new scheme to make the information
available and share it with systems dedicated to the construction of the Semantic
Web; and c) developing a movie-ontology query system capable of retrieving
information in specific ways related to the director, title, genre and year fields.
Acknowledgments
This research was made possible by the financial support of CONACYT (82050) and
DGAPA-PAPIIT (IN403108). The authors wish to thank Sarahi Abrego Romero for
the proofreading of this paper.
3.1 K-means Algorithm Advantages.</p>
          <p>
            MacQueen J. [
            <xref ref-type="bibr" rid="ref34">24</xref>
            ], the author of one of the initial k-means algorithms and the most
frequently cited, states:
          </p>
          <p>The process, which is called “k-means”, appears to give partitions which are
reasonably efficient in the sense of within-class variance, corroborated to some
extent by mathematical analysis and practical experience. Also, the k-means
procedure is easily programmed and is computationally economical, so that it is
feasible to process very large samples on a digital computer.</p>
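          <p>For reference, the procedure MacQueen describes is short enough to sketch in a few lines. The following is a plain NumPy version with random initial centroids (ours, for illustration; it is not any of the refined variants discussed below, and the sample data is invented):</p>

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centers
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        # squared distance of every point to every centroid, nearest assignment
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(axis=2), axis=1)
        new = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new, centroids):  # converged (to a local optimum)
            break
        centroids = new
    return centroids, labels

# Two well-separated blobs: plain k-means recovers them easily.
X = np.vstack([np.random.default_rng(1).normal(0, 0.1, (20, 2)),
               np.random.default_rng(2).normal(5, 0.1, (20, 2))])
centroids, labels = kmeans(X, k=2)
```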
          <p>
            Likewise, [
            <xref ref-type="bibr" rid="ref49">39</xref>
            ] summarizes the benefits of k-means in the introduction to his work:
The k-means algorithm is one of the first that a data analyst will use to investigate a
new data set, because it is algorithmically simple, relatively robust and gives “good
enough” answers over a wide variety of data sets.
3.2 K-means Algorithm Shortcomings.
          </p>
          <p>
            Taking the k-means shortcomings identified in [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ] as a framework, and as an extension and update of that analysis, the following presents
the result of the previously cited analysis in a series of tables, grouped by category of
work that arose as an extension of k-means or as a possible solution to one or more of
the limitations identified above.
3.2.1 The algorithm’s sensitivity to initial conditions: the number of partitions,
the initial centroids.
          </p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], there is no universal and efficient method to identify the initial patterns
and the number k of clusters. In [
            <xref ref-type="bibr" rid="ref50">40</xref>
            ] the sensitivity of the algorithm to the allocation of the initial
centroids is briefly discussed: in practice, the usual method is to
test iteratively with random allocations to find the best allocation in terms of
minimizing the total squared distance. However, there have been various
investigations making proposals related to these limitations:
          </p>
          <p>Authors / Title and Commentary:</p>
          <p>[<xref ref-type="bibr" rid="ref55">45</xref>] Zhang Chen and Xia Shixiong. “K-means Clustering Algorithm with improved initial Center.”
It avoids the initial random assignment of centers, using a strategy called
“sub-merger”.</p>
          <p>[<xref ref-type="bibr" rid="ref11 ref2">2</xref>] B. Bahmani Firouzi, T. Niknam and M. Nayeripour. “A New Evolutionary Algorithm
for Cluster Analysis”. It does not depend on the initial centers. The PSO-SA-K
algorithm combines the algorithms Particle Swarm Optimization (PSO),
Simulated Annealing (SA) and k-means.</p>
          <p>[<xref ref-type="bibr" rid="ref49">39</xref>] Barbakh Wesam and Colin Fyfe. “Local vs global interactions in clustering algorithms:
Advances over K-means.” It focuses on the algorithm’s sensitivity to initial
conditions and incorporates global information in the performance function.
It defines three new algorithms: Weighted K-means (WK), Inverse Weighted
K-means (IWK) and Inverse Exponential K-means (IEK).</p>
          <p>[<xref ref-type="bibr" rid="ref29">20</xref>] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to
Cluster Analysis. Textbook; defines k-means.</p>
          <p>[<xref ref-type="bibr" rid="ref12 ref3">3</xref>] G. Ball and D. Hall. “A clustering technique for summarizing multivariate data”
(ISODATA). It performs dynamic estimation of K.</p>
          <p>“A hybridized approach to data clustering” (bioinformatics draft). A hybrid
technique called K-NM-PSO, based on k-means, Nelder-Mead simplex search
and particle swarm optimization.</p>
          <p>“Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data
Partitioning along the Data Axis with the Highest Variance.” The title is explicit.</p>
          <p>“A method for initialising the K-means clustering algorithm using kd-trees”.
A kd-tree is used to estimate the density of the data and to select the number
of clusters.</p>
          <p>“Analysis of Global k-means, an Incremental Heuristic for Minimum Sum of
Squares Clustering”. Commentary on the work in [<xref ref-type="bibr" rid="ref31">22</xref>].</p>
          <p>“Selection of K in K-means clustering”. It proposes a measure to select the
reference number of clusters.</p>
          <p>“The Global K-means Clustering Algorithm.” An algorithm that consists of a
series of k-means clusterings with the number of clusters varying from 1 to k.
It argues that it is independent of initial partitions and accelerates the k-means
computations.</p>
          <p>“An empirical comparison of four initialization methods for the k-means
algorithm.” Compares initialization methods for k-means: random,
[<xref ref-type="bibr" rid="ref21">12</xref>], [<xref ref-type="bibr" rid="ref29">20</xref>] and [<xref ref-type="bibr" rid="ref34">24</xref>].</p>
          <p>“Refining initial points for k-means clustering”. Uses k-means M times on M
random subsets of the original data.</p>
          <p>3.2.2 The convergence of the algorithm to a local optimum rather than a global
optimum.</p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref34">24</xref>
            ], the iterative procedure of k-means cannot guarantee convergence
to a global optimum, although his work cites some research on special cases.
Currently, there are several developments that analyze and/or propose solutions to
this constraint:
          </p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref49">39</xref>
            ] Wesam Barbakh and
Colin Fyfe.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] Joaquín Pérez O.,
Rodolfo Pazos R., Laura
Cruz R., Gerardo Reyes S.,
Rosy Basave T. and Héctor
Fraire H.
[
            <xref ref-type="bibr" rid="ref54">44</xref>
            ] Z. Zhang, B. Tian Dai
and A.K.H. Tung.
          </p>
          <p>Title and Commentary:
“Local vs global interactions in clustering algorithms: Advances
over K-means.” Addresses the algorithm's sensitivity to initial
conditions by incorporating global information into the performance
function. Defines three new algorithms: Weighted K-means (WK),
Inverse Weighted K-means (IWK) and Inverse Exponential
K-means (IEK).
“Improvement the Efficiency and Efficacy of the K-means
Clustering Algorithm through a New Convergence Condition”.</p>
          <p>
            Improvement to the k-means algorithm by new convergence
conditions. Experimentally analyze the local convergence of
kmeans.
“On the Lower Bound of Local Optimums in K-means
Algorithm.” Estimate lower limit for local optimum.
[
            <xref ref-type="bibr" rid="ref34">24</xref>
            ] MacQUEEN J.
          </p>
          <p>“Genetic K-means algorithm.” A hybrid scheme based on a genetic
algorithm and simulated annealing, with new operators to perform
global search and achieve rapid convergence.
“Some Methods for Classification and Analysis of Multivariate
Observations.” Definition, analysis and applications of
k-means.
3.2.3 The efficiency of the algorithm.</p>
          <p>
            According to the work of [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], the complexity of the k-means algorithm is O(ndk),
involving the sample size, the number of dimensions and the number of
partitions. Several works have focused on different aspects of the
algorithm in order to reduce its computational load.
          </p>
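The O(ndk) claim can be made concrete with a back-of-the-envelope count (an illustrative Python sketch of our own, using the Iris dimensions that appear later in this paper): each of the n points is compared against each of the k centroids, and each comparison touches all d dimensions.

```python
# Rough per-iteration operation count of Lloyd's k-means: O(n*d*k).
def lloyd_iteration_cost(n: int, d: int, k: int) -> int:
    """Distance evaluations per iteration times work per evaluation."""
    distance_evaluations = n * k   # every point vs. every centroid
    work_per_evaluation = d        # one squared difference per dimension
    return distance_evaluations * work_per_evaluation

# Iris-sized problem: n=150 points, d=4 attributes, k=3 clusters
print(lloyd_iteration_cost(150, 4, 3))   # 1800 dimension-level operations
```

Doubling any one of n, d or k doubles the per-iteration cost, which is why the acceleration works listed above target the distance computations.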
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] Moh'd Belal Al-Zoubi,
Amjad Hudaib, Ammar
Huneiti and Bassam Hammo.
[
            <xref ref-type="bibr" rid="ref53">43</xref>
            ] Zalik, Krista Rizman.
[
            <xref ref-type="bibr" rid="ref36">26</xref>
            ] Cao D. Nguyen &amp; Cios,
Krzysztof J.
[
            <xref ref-type="bibr" rid="ref22">13</xref>
            ] G. Frahling &amp; Ch.
Sohler.
[
            <xref ref-type="bibr" rid="ref45">35</xref>
            ] Taoying Li &amp; Yan Chen.
[
            <xref ref-type="bibr" rid="ref27">18</xref>
            ] Kashima, H.; Hu, J.;
Ray, B.; Singh, M.
[
            <xref ref-type="bibr" rid="ref46">36</xref>
            ] Tsai, Chieh-Yuan; Chiu,
Chuang-Cheng.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] Joaquín Pérez O.,
Rodolfo Pazos R., Laura Cruz
R., Gerardo Reyes S., Rosy
Basave T. and Héctor Fraire H.
[
            <xref ref-type="bibr" rid="ref40">30</xref>
            ] J. Pérez, M.F. Henriques,
R. Pazos, L. Cruz, G. Reyes,
J. Salinas, A. Mexicano.
          </p>
          <p>Title and Commentary:
“New Efficient Strategy to Accelerate k-Means Clustering
Algorithm.” A strategy to accelerate the k-means algorithm that
avoids many distance calculations, based
on an improvement to the partial distance (PD) algorithm.
“An Efficient k`-means Clustering Algorithm.”
Based on the Rival Penalized Competitive Learning (RPCL)
algorithm. It does not require pre-allocation of the
number of clusters. A two-step process: it preprocesses and uses
the prior information to minimize the cost function.
“GAKREM: A novel hybrid clustering algorithm.”
Eliminates the need to specify the number of clusters a priori.</p>
          <p>
            Combines genetic algorithms, logarithmic regression and
expectation maximization.
“A Fast k-means implementation using coresets.”
An implementation of Lloyd's k-means [
            <xref ref-type="bibr" rid="ref32">23</xref>
            ], using a
weighted set of points that approximates the original set.
“An improved k-means algorithm for clustering using entropy
weighting measures.” Improves the algorithm by
introducing a variable into the cost function.
“K-means clustering of proportional data using L1 distance”.
          </p>
          <p>K-means based on the L1 distance, with proportionality constraints
incorporated into the calculation of centroids.
“Developing a feature weight self-adjustment mechanism for a
K-means clustering algorithm.” Improves the quality of
k-means clustering via a feature weight self-adjustment (FWSA)
mechanism, modeled as an optimization
problem.
“Improvement the Efficiency and Efficacy of the K-means
Clustering Algorithm through a New Convergence Condition”.</p>
          <p>
            An improvement to the k-means algorithm via new convergence
conditions; experimentally analyzes the local convergence of
k-means.
“Improvement of the K-means algorithm using a new
approach of convergence and its application to databases of
cancer population.” The title is explicit.
[
            <xref ref-type="bibr" rid="ref43">33</xref>
            ] Pun, W.K.D., Ali, A.S.
[
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] Zejin Ding, Jian Yu,
Yan-Qing Zhang.
[
            <xref ref-type="bibr" rid="ref26">17</xref>
            ] Kanungo, T., Mount,
D.M., Netanyahu, N.S.,
Piatko, C.D., Silverman, R.,
Wu, A.Y.
“Unique distance measure approach for K-means
(UDMAKm) clustering algorithm.” Establishes a distance measure
based on statistical data.
“A New Improved K-Means Algorithm with Penalized
Term.” Defines a new objective function and minimizes it with a
genetic algorithm.
“An Efficient K-means Clustering Algorithm: Analysis and
Implementation.” Presents an implementation of the version of
Lloyd's k-means [
            <xref ref-type="bibr" rid="ref32">23</xref>
            ] called the "filtering algorithm", based on a
kd-tree.
3.2.4 K-means is sensitive to outliers and noise.
          </p>
          <p>
            According to [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], even if an object is quite far away from the cluster centroid, it is
still forced into a cluster and thus distorts the cluster shape. The following works
focus on this shortcoming:
          </p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] Asgharbeygi, N.,
Maleki, A.
[
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] V. Estivill-Castro and J. Yang.
[
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] G. Ball and D. Hall.
          </p>
          <p>Title and Commentary:
“Geodesic K-means clustering.” Extends k-means by using a geodesic
distance metric; the algorithm ensures resistance to outliers.
“A fast and robust general purpose clustering algorithm.”
Eliminates the effect of outliers through a process that considers real
points as centroids.
“A clustering technique for summarizing multivariate
data” (ISODATA). Performs dynamic estimation of K and considers the
effect of outliers in the clustering process.
3.2.5 The definition of “means” limits the application only to numerical
variables.</p>
          <p>Several works have been developed that extend the application of
k-means to categorical or other variables:</p>
          <p>
            Authors:
[
            <xref ref-type="bibr" rid="ref48">38</xref>
            ] Song, Wei; Li,
Cheng Hua; Park,
Soon Cheol.
[
            <xref ref-type="bibr" rid="ref23">14</xref>
            ] S. Gupta, K.
Rao &amp; Bhatnagar.
[
            <xref ref-type="bibr" rid="ref25">16</xref>
            ] Z. Huang.
          </p>
          <p>Title and Commentary:
“Genetic Algorithm for text clustering using ontology and evaluating
the validity of various semantic similarity measures.” Improves the
k-means algorithm by using a genetic algorithm that finds conceptual
similarities. Based on an ontology and thesaurus corpus for clustering of text
fields.
“K-means clustering algorithm for categorical attributes.” The title is explicit.
“Extensions to the k-means algorithm for clustering large data sets with
categorical values.” The title is explicit.
4 The Algorithm k-means on Matlab.</p>
          <p>
            Experimental tests of k-means were conducted in Matlab [
            <xref ref-type="bibr" rid="ref35">25</xref>
            ]. Matlab
(Matrix Laboratory) is both an environment and a programming language for
numerical computation with vectors and matrices. It is a product of the company The
MathWorks Inc. (Natick, MA) [
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ]. The k-means clustering algorithm is available through the
following MATLAB function:
[IDX, C, SUMD, D] = KMEANS(X, K)
This function partitions the points in the N-by-P data matrix X into K clusters. The
partition minimizes the sum, over all clusters, of the within-cluster sums of
point-to-cluster-centroid distances. Rows of X correspond to points, columns correspond to
variables. KMEANS returns an N-by-1 vector IDX containing the cluster index of
each point. By default, KMEANS uses squared Euclidean distances. The K cluster
centroids are returned in the K-by-P matrix C, the within-cluster sums of point-to-centroid
distances in the 1-by-K vector SUMD, and the distances from each point to every centroid in
the N-by-K matrix D. Optional parameters may specify the distance measure,
the method used to choose the initial cluster centroid positions, and the display of information.
5 Test Results of K-means in Matlab.
          </p>
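The quantities returned by the Matlab call can be reproduced with a minimal numpy sketch (our own illustration of Lloyd's iteration, not the MathWorks implementation; the sample-points initialization, fixed seed and iteration cap are simplifying assumptions):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means with Matlab-like outputs:
    idx : (N,)  cluster index of each point
    C   : (K,P) cluster centroids
    sumd: (K,)  within-cluster sums of squared distances
    D   : (N,K) squared distance from each point to each centroid
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # 'sample'-style start: K distinct data points as initial centroids
    C = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # squared Euclidean distances, N-by-K
        D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        idx = D.argmin(axis=1)
        # each centroid becomes the mean of its assigned points;
        # an empty cluster keeps its previous centroid
        newC = np.array([X[idx == j].mean(axis=0) if np.any(idx == j) else C[j]
                         for j in range(K)])
        if np.allclose(newC, C):   # centroids stopped moving: converged
            break
        C = newC
    D = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    idx = D.argmin(axis=1)
    sumd = np.array([D[idx == j, j].sum() for j in range(K)])
    return idx, C, sumd, D
```

On two well-separated blobs, for instance, `kmeans(X, 2)` assigns each blob its own label, and `sumd` measures the within-cluster dispersion that the algorithm minimizes.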
          <p>
            Tests of k-means in Matlab used the well-known UCI Machine Learning Repository
[
            <xref ref-type="bibr" rid="ref47">37</xref>
            ]. The UCI Machine Learning Repository [
            <xref ref-type="bibr" rid="ref47">37</xref>
            ] is, among other things, a collection
of databases which is widely used by the Machine Learning research
community, especially for the empirical analysis of algorithms in this discipline.
          </p>
          <p>
            Fig. 1 Representation of the Iris Data Set
For the experimental tests carried out, the following data sets were used: Iris, Glass
and Wine. This report presents the results for the Iris data set.
The Iris Data Set is a database of types of Iris plants, with No. of instances: 150
(50 in each class), No. of attributes: 4 (sepal length, sepal width, petal length, petal
width) and No. of classes: 3 (Iris setosa, Iris versicolor, Iris virginica). One class is
linearly separable from the other two; the latter are NOT linearly separable from each
other. Based on the data and classes defined in [
            <xref ref-type="bibr" rid="ref47">37</xref>
            ] and [
            <xref ref-type="bibr" rid="ref52">42</xref>
            ], Fig. 1 shows the Iris data;
for illustrative purposes only the attributes sepal length, petal length
and petal width are considered.
          </p>
          <p>Test I4: &gt;&gt; [u,v,sumd,D]= kmeans(z,3,'display','iter');
“Test I4” is an example of the test results in Matlab of k-means for the Iris data
set. The iter column gives the iteration number, phase indicates the algorithm
phase, num gives the number of exchanged points, sum gives the total sum of
distances, and inter% is the percentage of exchanged points in each iteration.</p>
          <p>Fig. 2 % of Exchanged Points.
Fig. 2 corresponds to test I4 and gives the graphical representation of the exchange
behavior at each iteration. Likewise, Fig. 3 represents the behavior of the total sum of
distances for the same test.
Table 1 Summary of Results for the Iris Data Set
5.1 Summary of Results for the Iris data set.</p>
          <p>With regard to the exchanges between groups that the algorithm makes in the tests
conducted, it was observed that the most significant changes occurred from the first to the
second iteration. In all cases, in the first step all points are located (100%); the third
column gives the number of points exchanged in the second iteration, and the
fourth column the percentage difference in the number of items exchanged between
the first and second iterations.</p>
          <p>According to the results in Table 1:
It can be seen that for the 150 points of the Iris database, over a set of 25 tests, the k-means
algorithm in Matlab:
Converges in an average of 7.2 iterations.</p>
          <p>The average number of points exchanged during the second iteration was 18.84.
The percentage of points placed in their corresponding group by the second iteration
was 91.0%.
6 Conclusions.</p>
          <p>The results of the analysis of our sample of works allow us to establish a framework for
the theoretical study of the k-means algorithm. We also identify and
distinguish the different lines along which there is still a fertile field for investigation.
As we can see, several attempts at overcoming the shortcomings of the k-means
algorithm have been made, and different approaches from different disciplines have been
proposed: optimization, probability and statistics, neural networks, evolutionary
algorithms, among others. The vast majority of contributions have focused on the
first three lines of research identified in this study: the sensitivity of the algorithm to
initial conditions, the convergence of the algorithm to a local optimum rather than a
global optimum, and the efficiency of the algorithm. Challenges remain to be
resolved in such research, and relatively little work has been done on the lines related
to the extension of the algorithm to other variable types and to the treatment of
outliers and noise.</p>
          <p>
            According to the tests conducted in Matlab, this laboratory proved to be very
conducive to experimental testing. Its implementation of k-means allows monitoring
of the performance of the algorithm through the information that can be displayed at
runtime, such as the value of the objective function and the number of points exchanged
in each iteration. The results allow us to establish a framework for comparing a
proposed improved algorithm with previous work. As part of this project, and
to give continuity to previous work [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] [
            <xref ref-type="bibr" rid="ref40">30</xref>
            ], we also venture into different applications
of k-means, such as in the areas of health care in Mexico and in Web Usage Mining
for log files from the server of the Faculty of Computer Science, BUAP, México.
Image Classification by Texture Segmentation using
          </p>
          <p>GAF-SVM
Sergio Manuel Dorantes, Manuel Martín Ortiz, María J. Somodevilla, Jesús
Lavalle Martínez, Ivo H. Pineda Torres</p>
          <p>Facultad de Ciencias de la Computación, BUAP
sergiomanuel@hotmail.com, {mmartin, mariasg, jlavalle, ipineda}@cs.buap.mx
Abstract. Due to the amount of visual information that currently exists,
there is a need to classify it properly. In this paper we present an
alternative dual method for image categorization according to texture
content, called GAF-SVM; this method is based on the use of Gabor
Filters (GAF) and Support Vector Machines (SVM). To perform the image
classification we rely on filtering techniques for feature extraction combined
with statistical learning techniques to perform the data separation. The
experiments were carried out on a set of images containing coastal
beach scenes and a set of images containing city scenes. A feature vector is
obtained by applying a bank of Gabor Filters to the input images; the
output feature space is then used as input to the SVM classifier. The
Support Vector Machine is responsible for learning a model that is capable
of separating the sets of input images. Experimental results demonstrate
the effectiveness of the proposed dual method, achieving an error
classification rate of near 9%.</p>
          <p>
            I. INTRODUCTION
The proposal of an alternative method of image classification requires an analysis of
the methods presented so far in the area. Extracting visual information from
an image to obtain its most important features is essential for classification tasks;
over the years various approaches have been presented in this field of study,
such as color histograms, region-based classification and gray-level values of raw
pixels; one solution has been to incorporate texture analysis as the main
feature descriptor. This is largely due to the fact that most surfaces in images contain
some kind of texture. In recent years, texture analysis has been used for object
recognition, image interpretation, image segmentation and classification [
            <xref ref-type="bibr" rid="ref1 ref10 ref11 ref15 ref17 ref18 ref19 ref2 ref6 ref8 ref9">1, 2, 6, 8, 9,
10</xref>
            ].
          </p>
          <p>
            In recent papers such as [
            <xref ref-type="bibr" rid="ref13 ref15 ref16 ref19 ref20 ref21 ref4 ref6 ref7">4, 6, 7, 10, 11, 12</xref>
            ], texture has been studied in an isolated
manner to evaluate the performance of the proposed algorithms; in some cases
artificial textures have been used, which limits the application area of these methods.
Textures are used by the human visual system to separate different objects within
scenes, as well as for surface analysis [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ]. Texture can be recognized as irradiation
patterns that are perceptually uniform. Textures can be explained as an efficient
measure to estimate the structural differences of orientation, roughness, smoothness
or regularity between different regions of an image [
            <xref ref-type="bibr" rid="ref23">14</xref>
            ].
          </p>
          <p>
            However, producing a formal definition of what a texture really is becomes a subjective
matter. As mentioned in [
            <xref ref-type="bibr" rid="ref22">13</xref>
            ], the definition of texture depends on the purpose
for which it is being used; some definitions are outlined below:
1. The basic pattern and repetition frequency of a texture sample could be
perceptually invisible, although quantitatively present. In the deterministic
formulation, texture is considered as a basic local pattern that is periodically
repeated over some area.
2. An image texture may be defined as a local arrangement of image irradiances
projected from a surface patch of perceptually homogeneous irradiances.
3. Texture is characterized not only by the grey value at a given pixel, but also by the
grey value ‘pattern’ in a neighborhood surrounding the pixel.
          </p>
          <p>Our proposal is based on the use of natural textures in real-world images; for that
reason the classification model must deal with more complex images under natural
conditions.</p>
          <p>
            The 2-D Gabor filters (2D-GF) have certain properties that make them suitable for
textural identification in many ways: 2D-GF have tunable orientation and radial
frequency bandwidths, tunable center frequencies, and optimally achieve joint
resolution in space and spatial frequency. The demodulated Gabor channel envelopes
generally contain only low spatial frequencies which are optimally localized in both
domains [
            <xref ref-type="bibr" rid="ref25">16</xref>
            ].
          </p>
          <p>
            Gabor filter based methods have been successfully applied to a variety of machine
vision applications, such as texture segmentation [
            <xref ref-type="bibr" rid="ref19 ref20 ref21 ref24 ref25 ref27">10, 11, 12, 15, 16, 18</xref>
            ], texture
classification [
            <xref ref-type="bibr" rid="ref18 ref22 ref28 ref9">9, 13, 19</xref>
            ], iris recognition [
            <xref ref-type="bibr" rid="ref30 ref31 ref32">21, 22, 23</xref>
            ], on-road vehicle detection [
            <xref ref-type="bibr" rid="ref26">17</xref>
            ],
fingerprint classification [
            <xref ref-type="bibr" rid="ref29">20</xref>
            ], and as mentioned in [
            <xref ref-type="bibr" rid="ref24">15</xref>
            ] edge detection, object
detection, image representation, and recognition of handwritten numerals.
          </p>
          <p>This paper is organized as follows: Section II reviews the related work on which
this article is based; Section III gives a detailed description of the
proposed method; Section IV explains how the input data is
processed, as well as the selection of the Gabor filter parameters; Section V details
the SVM classifier parameters; and Section VI presents the experimental results.
II. RELATED WORK
The classification of images has been studied from various approaches, mostly
through the combination of methods: one for texture extraction and one for the
classification process.</p>
          <p>
            In [
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] the use of Gabor filters as a texture extraction method is emphasized, and
classification is performed with the maximum likelihood method for the classification of
aerial and satellite digital images. In [
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] a method of image classification is proposed
that uses the color histogram as the image representation and
the Support Vector Machine as the classification method. In [
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] no external feature extractor is used; instead
the SVM classifier receives the grey-level values of each pixel of the image, aiming to
prove that SVM can implement feature extraction methods within its architecture; this
method is computationally expensive due to the number of regions that can define an
image. Another approach is presented in [
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ], where a modification of the SVM is
used for the identification of regions among a group of images. In [
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ] the SVM is
combined with the Discrete Wavelet Frame Transform for the classification of images
from the Brodatz album. In [
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] the wavelet transform known as the
pyramid-structured wavelet transform is used as a feature extractor together with SVM as the
classification method. In [
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] a method combining the Gaussian Mixture Model
with Independent Component Analysis (ICA) is proposed to perform image
classification, called the ICA Mixture Model.
          </p>
          <p>The first step of the proposed method is to extract the texture features
with a bank of Gabor filters applied to each input image, and then take the filters'
outputs to form a training dataset to feed the SVM classifier.</p>
          <p>
            III. PROPOSED METHOD
In order to accomplish the image classification, we rely on filter-based techniques to
perform texture feature extraction, combined with statistical learning theory techniques to
achieve the image data separation. Gabor filters were selected to extract texture
features from images due to their resemblance to the human visual system [
            <xref ref-type="bibr" rid="ref22">13</xref>
            ].
          </p>
          <p>
            A. Gabor Filters
A number of authors have used a bank of filters to extract local image features [
            <xref ref-type="bibr" rid="ref19 ref20 ref25 ref28">10,
11, 16, 19</xref>
            ]. Different authors have used different sets of Gabor filters, from the spatial domain
to the frequency domain.
          </p>
          <p>A 2-D Gabor filter is a linear filter whose impulse response is defined by a
harmonic function multiplied by a Gaussian function. In the spatial domain it can be
defined as follows:
ψ(x, y) = (f²/(πγη)) e^(−((f²/γ²)x′² + (f²/η²)y′²)) ⋅ e^(j2πf x′)
x′ = x cosθ + y sinθ
y′ = −x sinθ + y cosθ
(1)</p>
          <p>Where f is the central frequency of the filter, θ the rotation angle of the Gaussian
major axis and the plane wave, γ the sharpness along the major axis, and η the
sharpness along the minor axis (perpendicular to the wave). In the given form, the
aspect ratio of the Gaussian is λ = η/γ. These four parameters (f, θ, γ, η) define the shape
of the filter, and by changing them we can detect different textures.</p>
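Equation (1) can be sketched directly in numpy (our own illustration, not the authors' code; the kernel size and the (f, θ) values below are arbitrary examples). A bank in the spirit of this paper is obtained by evaluating the same function over several frequency/orientation pairs:

```python
import numpy as np

def gabor_kernel(size, f, theta, gamma=1.0, eta=1.0):
    """Complex 2-D Gabor impulse response of equation (1):
    psi(x, y) = f^2/(pi*gamma*eta)
                * exp(-((f/gamma)^2 x'^2 + (f/eta)^2 y'^2))
                * exp(j*2*pi*f*x')
    with rotated coordinates x' = x cos(theta) + y sin(theta),
                             y' = -x sin(theta) + y cos(theta)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    xp = x * np.cos(theta) + y * np.sin(theta)    # x' axis (wave direction)
    yp = -x * np.sin(theta) + y * np.cos(theta)   # y' axis
    envelope = np.exp(-((f / gamma) ** 2 * xp ** 2 + (f / eta) ** 2 * yp ** 2))
    carrier = np.exp(2j * np.pi * f * xp)         # complex plane wave
    return (f ** 2 / (np.pi * gamma * eta)) * envelope * carrier

# A small bank: 2 frequencies x 3 orientations = 6 filters,
# matching the maximum bank size used later in this paper
bank = [gabor_kernel(31, f, th)
        for f in (0.1, 0.2)
        for th in (0.0, np.pi / 3, 2 * np.pi / 3)]
```

At the kernel center (x′ = y′ = 0) both exponentials equal 1, so the response reduces to the normalization constant f²/(πγη), which is a quick sanity check on any implementation.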
          <p>The normalized 2-D Gabor filter function also has an analytical form in the frequency
domain, expressed in the rotated frequency coordinates:
u′ = u cosθ + v sinθ
v′ = −u sinθ + v cosθ</p>
          <p>
            Filter Design vs. Filter Bank
There are two approaches to the implementation of Gabor filters: on the one hand
the filter bank approach, and on the other hand the filter design approach [
            <xref ref-type="bibr" rid="ref35">25</xref>
            ]. In the
first, a bank of filters is formed by grouping multiple filters tuned to different
frequencies and different orientations. The choice of parameter settings depends
on the type of texture to be analyzed. The difficulty of the filter bank approach
lies in the fact that its parameters are established ad hoc and are not optimal for a
specific processing task. One of the goals of this work consists in presenting results
that help to specify such parameters. Furthermore, if the bank handles many
frequencies and orientations, the result is a large bank with many filters, which
translates into a large number of convolutions. The filter design approach focuses on
designing one or a few filters for a particular application, in an effort to reduce the
difficulty posed by the filter bank and also to reduce the dimensionality of the
output as well as the processing cost. The disadvantage of this approach lies in the
limitation to the tasks for which it was designed. When working with a single filter it
is possible that some of the textures in the images are not identified or detected, as the
filter has only a narrow capacity to detect local texture features.
          </p>
          <p>
            A filter bank allows the analysis of an image in a single pass at several
frequencies and in several orientations at once. Given the characteristics of our
model, the use of a filter bank is the deployment solution of choice; although it could
mean an increase in computational processing, this increase is not significant. The design
of a Gabor filter bank consists, in general, in selecting, for each filter, the
proper values of the following parameters: frequency, orientation, γ and η, the last two
parameters being known as the smoothing parameters [
            <xref ref-type="bibr" rid="ref36">26</xref>
            ].
          </p>
          <p>
            In this research a bank is defined with up to 3 orientations and up to 2
frequencies, resulting in a bank with at most 6 output filters, allowing us to
accurately detect a texture among a large set of images. This decision was based
on the studies presented in [
            <xref ref-type="bibr" rid="ref36">26</xref>
            ], which compare various parameter selection
approaches and summarize some parameter values adopted in the literature.
          </p>
          <p>
            Using many different orientations and scales (frequencies) ensures invariance:
objects and some textures can be recognized at various orientations, scales
and translations [
            <xref ref-type="bibr" rid="ref37">27</xref>
            ].
          </p>
          <p>
            C. Support Vector Machines
Support Vector Machines (SVM) were introduced by Vapnik as a powerful learning
tool based on statistical learning theory. A Support Vector Machine is a binary
classifier that makes its decision by constructing a linear decision boundary, or
hyperplane, that optimally separates the data points of the two classes in feature
space while maximizing the margin [
            <xref ref-type="bibr" rid="ref29">20</xref>
            ].
          </p>
          <p>
            SVM starts from the goal of separating the data with a hyperplane, and extends this
to non-linear decision boundaries using the kernel trick [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ]. A hyperplane can be
defined as:
          </p>
          <p>wᵀx + b = 0</p>
          <p>where x represents a point (a vector) and w represents the weight vector. We
want to choose w and b to maximize the margin, i.e. the distance between the parallel
supporting hyperplanes, while still separating the data. The hyperplane
must separate the data such that:</p>
          <p>wᵀxk + b &gt; 0 for all xk of one class</p>
          <p>wᵀxj + b &lt; 0 for all xj of the other class</p>
          <p>
            If the data are separable in this way, there will probably be more than one way to do it.
Among all the possible hyperplanes, SVM selects the one for which the
distance between the hyperplane and the closest data points is as wide as possible [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ].
          </p>
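The separation conditions above can be checked mechanically. A small numpy illustration with a hand-picked w and b (illustrative values of our own, not weights learned by an SVM solver):

```python
import numpy as np

# Hand-picked hyperplane w^T x + b = 0 for a toy 2-D set
w = np.array([1.0, 1.0])
b = -3.0

class_pos = np.array([[3.0, 2.0], [4.0, 3.0]])   # should satisfy w^T x + b > 0
class_neg = np.array([[0.0, 1.0], [1.0, 0.5]])   # should satisfy w^T x + b < 0

def side(x):
    """Sign of the decision function: which side of the hyperplane x lies on."""
    return np.sign(w @ x + b)

assert all(side(x) > 0 for x in class_pos)
assert all(side(x) < 0 for x in class_neg)
```

An SVM solver would additionally pick, among all (w, b) passing these checks, the pair with maximal margin.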
          <p>
            When working with a dataset that is not linearly separable, it is necessary to resort to
the use of a kernel function. The kernel function allows the SVM to form non-linear
boundaries [
            <xref ref-type="bibr" rid="ref39">29</xref>
            ]. Data representation through a kernel function offers an alternative
solution to the nonlinearity problem, projecting the information to a higher-dimensional
feature space [
            <xref ref-type="bibr" rid="ref38">28</xref>
            ]. This is accomplished by changing the representation of the
function; this is similar to mapping the input space X to a new space H, called the feature
space, of the form:
φ : X ⊂ Rᵈ → H
(3)
          </p>
          <p>Now, instead of considering the input vectors {x1,…, xn}, the
transformed vectors {φ(x1),…, φ(xn)} are considered, as shown in figure 1. By making this
substitution, an SVM is obtained in the new space (this is called the ‘kernel trick’); it is
important to mention that in practice the implementation of this nonlinear technique
consumes the same amount of computational resources as its linear equivalent.
Fig. 1. Using the Kernel to transform (map) the input data space.</p>
          <p>
            The general problem that SVM aims to solve is to find, for a given learning
task with a finite amount of data, an appropriate function that achieves
good generalization, resulting from a proper trade-off between the accuracy
achieved on a particular training set and the capacity of the model [
            <xref ref-type="bibr" rid="ref40">30</xref>
            ].
          </p>
          <p>The use of the ‘Radial Basis Function’ (RBF) kernel is based on the fact that this
kernel is best suited to deal with data whose class-conditional
probability distribution function approaches the Gaussian distribution, like the
texture present in the input images. It maps such data into a different space where the
data become linearly separable. The kernel function is defined as follows:</p>
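A minimal numpy sketch of the commonly used RBF form, K(x, y) = exp(−‖x − y‖² / (2σ²)); the σ value below is an arbitrary example, and as the following paragraph notes, tuning σ (together with C) is the hard part in practice:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Radial Basis Function kernel: exp(-||x - y||^2 / (2*sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

# Identical points always map to 1; distant points decay toward 0.
print(rbf_kernel([0, 0], [0, 0]))        # 1.0
print(rbf_kernel([0, 0], [3, 4], 1.0))   # exp(-12.5), close to 0
```

Because the kernel depends only on the distance ‖x − y‖, σ directly controls how quickly similarity falls off, which is why certain σ/C combinations make the classifier overly sensitive to the training data.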
<p>A disadvantage of this kernel is that it is difficult to tune, in the sense that
it is hard to obtain an optimum value for its parameter σ (sigma) and to choose the
corresponding C that works best for a given problem. The fact that certain
combinations of σ and C make the SVM highly sensitive to the training data also
contributes to the error rate of the RBF-based SVM.</p>
<p>One advantage of the RBF kernel is that, given the kernel, the weights, the
number of support vectors and the support vectors themselves are obtained
automatically as part of the training procedure, i.e. they do not need to be
specified beforehand.</p>
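<p>A minimal sketch of this kernel (illustrative code, not part of the original experiments):</p>

```python
import numpy as np

# Sketch of the RBF kernel used in the experiments,
# k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)); sigma controls the width.
def rbf_kernel(x, y, sigma):
    d2 = np.sum((np.asarray(x, float) - np.asarray(y, float)) ** 2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

x, y = np.zeros(3), np.ones(3)
k_same = rbf_kernel(x, x, sigma=2.0)  # identical points give k = 1
k_far = rbf_kernel(x, y, sigma=2.0)   # value decays as points move apart
```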
<p>IV. SETTING UP THE EXPERIMENTS
For the experiments we decided to work with two sets of images, one consisting
of coastal beach scenes and the other of city scenes. The input images are
processed in order to reduce computational complexity.</p>
<p>The first set consists of 128 images of beach scenes and the second of 128
images of city scenes, a total of 256 images.</p>
<p>After processing, the input images are 8-bit-per-pixel grayscale images of
dimension 128x128; working with just one channel reduces the number of
convolutions. The output of each filter is obtained by convolving the input
image with a Gabor filter. The process is shown below:</p>
<p>G(x, y) = I(x, y) ⊗ ψ(x, y)
(7)
where
G(x, y) is the output of the filter,
I(x, y) is the original image, and
ψ(x, y) is the Gabor filter.</p>
          <p>
This computation can in principle be done in the spatial domain; however, the
Gabor filter is usually narrow there, whereas in the frequency domain it is
much larger and thus less affected by aliasing effects due to sampling. It is
therefore more convenient to carry out the whole computation in the frequency
domain, where the convolution reduces to a simple and efficient point-wise
multiplication of the Fourier transforms [
            <xref ref-type="bibr" rid="ref20">11</xref>
            ].
          </p>
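<p>The frequency-domain filtering step can be sketched as follows (the array shapes and the all-ones test filter are assumptions for illustration, not the paper's filter bank):</p>

```python
import numpy as np

# Sketch: filtering in the frequency domain replaces the convolution with a
# point-wise multiplication of the image's Fourier transform by the filter's
# frequency response.
def filter_in_frequency(image, psi_freq):
    I = np.fft.fft2(image)            # image -> frequency domain
    G = I * psi_freq                  # point-wise product instead of convolution
    return np.real(np.fft.ifft2(G))   # back to the spatial domain

# Sanity check: an all-ones frequency response leaves the image unchanged.
img = np.arange(16.0).reshape(4, 4)
out = filter_in_frequency(img, np.ones((4, 4)))
```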
<p>The family of Gabor filters selected to set up the filter bank for the
experiments is defined in the frequency domain over rotated coordinates, with
v′ = −u sinθ + v cosθ</p>
<p>I(u, v) = ∫−∞∞ ∫−∞∞ i(x, y) e−i2π(ux+vy) dx dy
(11)
where
i(x, y) is the original input image.</p>
<p>After the transformation, normalization is applied to the output image in
order to avoid illumination effects.</p>
<p>At the end of normalization we have a certain number of square matrices per
filtered image, each of dimension 128x128. The number of square matrices
depends on the number of orientations and frequencies in the filter bank; in
our experiments the filter bank consists of 2 frequencies and 3 orientations,
so the number of output matrices is 6.</p>
<p>The convolution of the input image with the Gabor filter is then performed.
In the frequency domain the convolution becomes a point-to-point multiplication
of the transformed image with the Gabor filter.</p>
<p>Once the filter output G(x, y) is obtained, it needs to be transformed back
to its spatial representation using the 2-D Inverse Fourier Transform.</p>
<p>g(x, y) = i(x, y) ⊗ ψ(x, y) - Spatial domain
G(x, y) = I(x, y) ⋅ Ψ(x, y) - Frequency domain
i(x, y) = ∫−∞∞ ∫−∞∞ I(u, v) ei2π(ux+vy) du dv
where
I(u, v) is the image in the frequency domain.</p>
<p>When the convolution is performed, some results are not useful, especially
if the image does not contain textures that respond meaningfully to the
selected filter parameters. To reduce this problem, all the outputs obtained by
convolution are summed, which removes the irrelevant results and enhances those
that help to detect texture regions; it also reduces the dimensionality of the
feature space, leaving as output one square matrix the same size as the input
image.</p>
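<p>The summation step can be sketched as follows (random arrays stand in for the real filter responses):</p>

```python
import numpy as np

# Sketch: the 6 filter responses (2 frequencies x 3 orientations), each the
# size of the input image, are summed into a single 128x128 matrix.
rng = np.random.default_rng(0)
responses = [rng.random((128, 128)) for _ in range(6)]  # placeholder outputs
combined = np.sum(responses, axis=0)  # one matrix, same size as the input image
```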
<p>At this point we have one matrix per input image, which reduces the
dimensionality of the input data. Each matrix is used to build up a feature
matrix, which serves as the input of the SVM classifier.</p>
          <p>
            To complete the convolution of the input image with each one of the Gabor filters
we take only the real part of the output filtered image. As mentioned in [
            <xref ref-type="bibr" rid="ref41">31</xref>
], in this
way we keep most of the texture response information while ignoring the phase information.
          </p>
          <p>Re(G(x, y))
(12)</p>
<p>We then modify each output matrix of dimension 128x128 to construct the
feature matrix: each matrix is transformed into a 1x16384 vector, and each
vector is stacked with the next transformed matrix to form the feature matrix.</p>
<p>Finally, we have a feature matrix of dimension 256x16384, which serves as
input to the classifier.</p>
<p>V. SVM CLASSIFIER
The goal of the experimentation is to obtain, through SVM, a training model
capable of separating a set of input images. Once we have the feature matrix of
processed and filtered images, we proceed with the SVM classification
procedure. Given the nature of the classification process, we need to define a
training dataset, so that the classifier can learn a model, and a test dataset,
which lets us test the learned model. The training dataset comprises 75% of the
input dataset and the test dataset the remaining 25%; the images for each are
selected randomly. Fig. 2(a) shows an example of the coastal beach scene
images, and fig. 2(b) an example of the city scene images, which the classifier
will try to separate.
Fig. 2. Example of images used in the experiments. (a) Beach scene images, (b) City
images.</p>
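<p>The random 75/25 split can be sketched as follows (illustrative code, not the SPIDER setup used in the paper):</p>

```python
import numpy as np

# Sketch of the random 75/25 split of the 256 labeled images
# (beach scenes labeled 1, city scenes labeled -1, as in the experiments).
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(128), -np.ones(128)])
idx = rng.permutation(len(labels))
n_train = int(0.75 * len(labels))            # 192 training images
train_idx, test_idx = idx[:n_train], idx[n_train:]
```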
          <p>
            The experiments were performed using SPIDER [
            <xref ref-type="bibr" rid="ref42">32</xref>
], a MATLAB implementation
of SVM and a complete object-oriented environment for machine learning. Since
SVM is a binary classifier, the datasets must be labeled for the classification
experiments: the beach scene images are labeled 1 and the city scene images -1.
          </p>
<p>Table I lists the kernels available in SPIDER, together with their formulas
and parameters, including the polynomial kernel
k(x, y) = (x ⋅ y + 1)^d
and the RBF kernel
k(x, y) = exp(−‖x − y‖² / (2σ²)).
The RBF kernel is used to run the experiments with different sigma values.
Another parameter used by SPIDER is the ‘soft margin’ parameter C, which
penalizes the training errors. This value is set to 1000 in all the experiments.</p>
<p>The sigma values were changed iteratively until a satisfactory error rate
was obtained. The test results for the learned algorithm are presented in
table II.</p>
<p>As can be seen in table II, the sigma value giving the lowest percentage
error is σ = 35, with an error rate of 9.37%.</p>
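<p>The sweep amounts to simple model selection over σ; a minimal sketch, with made-up error values rather than the paper's measurements and a generic train-and-score callback in place of SPIDER:</p>

```python
# Sketch of the sigma sweep: each candidate sigma is used to train and score a
# model, and the sigma with the lowest test error is kept. The error values
# below are placeholders, not the paper's measurements.
def sweep(train_and_score, sigmas):
    """train_and_score(sigma) -> test error rate in [0, 1]."""
    errors = {s: train_and_score(s) for s in sigmas}
    best_sigma = min(errors, key=errors.get)
    return best_sigma, errors[best_sigma]

fake_errors = {21: 0.20, 28: 0.14, 35: 0.0937, 42: 0.12}  # illustrative only
best_sigma, best_err = sweep(lambda s: fake_errors[s], fake_errors)
```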
<p>Table II. Test error rate of the RBF-kernel SVM for each sigma value tested.</p>
<p>VII. CONCLUSIONS
Extracting texture features with a Gabor filter bank and classifying the filter
outputs with Support Vector Machines offers an excellent accuracy rate: 90.63%
of the input images are correctly classified according to their class, beach
scenes or city scenes.</p>
<p>The article demonstrates the effectiveness of a two-stage model: first
extracting the texture features and then classifying them with an SVM.</p>
          <p>
            REFERENCES
[
            <xref ref-type="bibr" rid="ref1 ref10">1</xref>
            ] T. Randen and J.H. Husoy, “Filtering for texture classification: a comparative study”, IEEE Trans. on
          </p>
          <p>
Pattern Analysis and Machine Intelligence, Vol. 21, Issue 4, pp. 291 – 310, Apr 1999.
[
            <xref ref-type="bibr" rid="ref11 ref2">2</xref>
            ] F. Lumbreras Ruiz, “Segmentation, classification and modelization of textures by means of
multiresolution decomposition techniques”, Ph.D. dissertation, Dept. Informática and Computer
Vision Center, Universitat Autònoma de Barcelona, Barcelona, España, 2001.
[
            <xref ref-type="bibr" rid="ref12 ref3">3</xref>
            ] O. Chapelle, P. Haffner, and V.N. Vapnik, “Support vector machines for histogram-based image
classification”, IEEE Trans. On Neural Networks, Vol. 10, Issue 5, pp. 1055 – 1064, Sep 1999.
[
            <xref ref-type="bibr" rid="ref13 ref4">4</xref>
            ] Kwang In Kim, Keechul Jung, Se Hyun Park, and Hang Joon Kim, “Support vector machines for
texture classification”, IEEE Trans. On Pattern Analysis and Machine Intelligence, Vol. 24, Issue 11,
pp. 1542 – 1550, Nov 2002.
[
            <xref ref-type="bibr" rid="ref14 ref5">5</xref>
            ] I. Gondra and D.R. Heisterkamp, “Learning in region-based image retrieval with generalized support
vector machines”, In Proc. of the Computer Vision and Pattern Recognition, pp. 149 – 154, 2004.
[
            <xref ref-type="bibr" rid="ref15 ref6">6</xref>
            ] Shutao Li, J.T. Kwok, Hailong Zhu, and Yaonan Wang, “Texture classification using the support
vector machines”, Pattern Recognition, Vol. 36, No. 12, pp. 2883 – 2893, 2003.
[
            <xref ref-type="bibr" rid="ref16 ref7">7</xref>
            ] Bing-Yu Sun and De-Shuang Huang, “Texture classification based on support vector machine and
wavelet transform”, In Proc. of the Fifth World Congress on Intelligent Control and Automation,
WCICA 2004. Vol. 2, pp. 1862 – 1864, June 15–19, 2004.
[
            <xref ref-type="bibr" rid="ref17 ref8">8</xref>
            ] V.P. Subramanyam Rallabandi and S.K. Sett, “Unsupervised texture classification and segmentation”,
          </p>
          <p>
            Proceedings Of World Academy of Science, Engineering and Technology, Vol. 5, April 2005.
[
            <xref ref-type="bibr" rid="ref18 ref9">9</xref>
            ] J.A. Recio, L.A. Ruiz and A.Fernández-Sarriá, “Use of Gabor filters for texture classification of
digital images”, Física de la Tierra, Vol. 17, pp. 47 – 59, 2005.
[
            <xref ref-type="bibr" rid="ref19">10</xref>
            ] M.R. Turner, “Texture discrimination by Gabor functions”, Biol. Cybern., Vol. 55, Num. 2–3, pp. 71
– 82, 1986.
[
            <xref ref-type="bibr" rid="ref20">11</xref>
            ] V. Levesque, “Texture segmentation using Gabor filters”, Center for Intelligent Machines Journal,
2000
[
            <xref ref-type="bibr" rid="ref21">12</xref>
            ] P. Guha and R. Banerjee, “Segmentation and classification of multi-textured images”, 2000,
          </p>
          <p>
            Available: http://www.cse.iitk.ac.in/~amit/courses/768/00/rajrup/, last visited: April 20, 2009.
[
            <xref ref-type="bibr" rid="ref22">13</xref>
            ] V.S. Vyas and P. Rege, “Automated texture analysis with Gabor filters”, GVIP Journal, Vol. 6, Issue
1, pp. 35 – 41, July 2006.
[
            <xref ref-type="bibr" rid="ref23">14</xref>
            ] K.M. Rajpoot and N.M. Rajpoot, “Wavelets and Support Vector Machines for Texture
Classification”, In proceedings of the 8th International Multitopic Conference, INMIC 2004, 24-26
Dec., pp. 328 – 333, 2004.
[
            <xref ref-type="bibr" rid="ref24">15</xref>
            ] D.M. Tsai, “Optimal Gabor filter design for texture segmentation”, Technical Report, Machine Vision
          </p>
          <p>
            Lab, Dept. of Ind. Eng. and Mgmt., Yuan-Ze University, Chung-Li, Taiwan, 2000.
[
            <xref ref-type="bibr" rid="ref25">16</xref>
            ] A.C. Bovik, M. Clark and W.S.Geisler, “Multichannel Texture Analysis Using Localized Spatial
Filters”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, Num. 1, pp. 55 –
73, 1990.
[
            <xref ref-type="bibr" rid="ref26">17</xref>
] Zehang Sun, G. Bebis and R. Miller, “On-road vehicle detection using Gabor filters and support
vector machines”, 14th International Conference on Digital Signal Processing, DSP 2002, Vol. 2, pp.
1019 – 1022, 2002.
[
            <xref ref-type="bibr" rid="ref27">18</xref>
            ] K. Hammouda and E. Jernigan, “Texture segmentation using Gabor filters”, tech. rep., Biotechnology
and health engineering centre, University of Waterloo, Dec. 2000.
[
            <xref ref-type="bibr" rid="ref28">19</xref>
            ] S.E. Grigorescu, N. Petkov and P. Kruizinga, “Comparison of texture features based on Gabor
filters”, IEEE Trans. On Image Processing, Vol. 11, Num. 10, pp. 1160 – 1167, 2002.
[
            <xref ref-type="bibr" rid="ref29">20</xref>
            ] D. Batra, G. Singhal and S. Chaudhury, “Gabor filter based fingerprint classification using support
vector machines”, Proceedings of the IEEE First India Annual Conference, 2004, INDICON 2004,
pp. 256 – 261, 20-22 Dec. 2004.
[
            <xref ref-type="bibr" rid="ref30">21</xref>
            ] Q.A. Salih and V. Dhandapani, “IRIS Recognition based on multi-channel feature extraction using
gabor filters”, Proceedings of the 2nd IASTED international conference on Advances in computer
science and technology, ACST’06, pp. 168 – 173, 2006.
[
            <xref ref-type="bibr" rid="ref31">22</xref>
            ] L. Ma, Y. Wang and T. Tan, “Iris recognition based on multichannel Gabor filtering”, 5th Asian Conf.
          </p>
          <p>
            Computer Vision, Vol. 1, 2002.
[
            <xref ref-type="bibr" rid="ref32">23</xref>
            ] D. Carr, “Iris recognition: Gabor filtering”, Connexions. Dec. 18, 2004, Available:
http://cnx.org/content/m12493/1.4/, last visited April 20, 2009.
[
            <xref ref-type="bibr" rid="ref34">24</xref>
            ] K. Kämäräinen, “Feature extraction using Gabor filters”, Ph. D. dissertation, Lappeenranta
          </p>
          <p>
            University of Technology, Finland, Nov. 2003.
[
            <xref ref-type="bibr" rid="ref35">25</xref>
            ] T.P. Weldon, W.E. Higgins and D.F. Dunn, “Gabor filter design for multiple texture segmentation”,
          </p>
          <p>
            Optical Engineering, Vol. 35, pp. 2852 – 2863, 1996.
[
            <xref ref-type="bibr" rid="ref36">26</xref>
            ] F. Bianconi and A. Fernández, “Evaluation of the effects of Gabor filter parameters on texture
classification”, Pattern Recognition, Vol. 40, Num. 12, pp. 3325 – 3335, 2007.
[
            <xref ref-type="bibr" rid="ref37">27</xref>
            ] J. Ilonen, J.K. Kämäräinen and J.K. Kälviäinen, “Efficient computation of Gabor features”, Research
          </p>
          <p>
            Report 100, Lappeenranta University of Technology, Dept. of Information Technology, 2005.
[
            <xref ref-type="bibr" rid="ref38">28</xref>
            ] J.A. Reséndiz, “Las máquinas de vectores de soporte para identificación en línea”, Masters
dissertation, Departamento de control automático, Centro de investigación y estudios avanzados, I.P.N.,
2006.
[
            <xref ref-type="bibr" rid="ref39">29</xref>
            ] J.P. Lewis, “A short SVM (support vector machine) tutorial”, CGIT Lab / IMSC, University Southern
          </p>
          <p>
            California, 2004.
[
            <xref ref-type="bibr" rid="ref40">30</xref>
            ] L. González, “Modelos de clasificación basados en máquinas de vectores de soporte”, Asoc. científica
europea de econ. aplicada. Anales de economía aplicada, 2003.
[
            <xref ref-type="bibr" rid="ref41">31</xref>
            ] D. Dunn, W.E. Higgins and J. Wakeley, “Texture segmentation using 2-D Gabor elementary
functions”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, Num. 2, pp.
130 – 149, Feb 1994.
[
            <xref ref-type="bibr" rid="ref42">32</xref>
            ] SPIDER, A complete object oriented environment for machine learning in MATLAB. Available:
http://www.kyb.mpg.de/bs/people/spider/, last visited May 15, 2009.
          </p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pendleton</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mytkowicz</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          :
          <article-title>Power of the Few vs. Wisdom of the Crowd: Wikipedia and the Rise of the Bourgeoisie</article-title>
          .
          <source>In: 25th Annual ACM Conference on Human Factors in Computing Systems (CHI</source>
          <year>2007</year>
          ), ACM, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Suh</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chi</surname>
            ,
            <given-names>E. H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pendleton</surname>
            ,
            <given-names>B. A.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Kittur</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
<article-title>Us vs. Them: Understanding Social Dynamics in Wikipedia with Revert Graph Visualizations</article-title>
          .
          <source>In: Visual Analytics Science and Technology</source>
          . pp.
          <fpage>163</fpage>
          -
          <lpage>170</lpage>
          , IEEE-Press, New York (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Potthast</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anderka</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Automatic Vandalism Detection in Wikipedia</article-title>
          .
          <source>In: 30th European Conference on IR Research</source>
          , ECIR
          <year>2008</year>
          , pp.
          <fpage>663</fpage>
          -
          <lpage>668</lpage>
          ,
          <string-name>
            <surname>Glasgow</surname>
          </string-name>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Stein</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hess</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Does it matter who contributes: a study on featured articles in the german wikipedia</article-title>
          .
          <source>In: Proceedings of the 18th conference on Hypertext and hypermedia</source>
          , pp.
          <fpage>171</fpage>
          -
          <lpage>174</lpage>
          , ACM, New York (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
5. 99 Wikipedia Sources Aiding the Semantic Web. AI3, http://www.mkbergman.com/?p=417
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Medelyan</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Milne</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Legg</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.A.</given-names>
          </string-name>
          :
          <article-title>Mining meaning from Wikipedia</article-title>
          . Hamilton, (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Chernov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iofciu</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nejdl</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          :
          <article-title>Extracting Semantic Relationships between Wikipedia Categories</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Semantic Wikis - From</source>
          Wiki to Semantics, ESWC2006,
          <string-name>
            <surname>Budva</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Ruiz-casado,
          <string-name>
            <given-names>M.</given-names>
            ,
            <surname>Alfonseca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ,
            <surname>Castells</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>From Wikipedia to Semantic Relationships: a Semi-automated Annotation Approach</article-title>
          .
          <source>In: Proceedings of the 1st Workshop on Semantic Wikis - From</source>
          Wiki to Semantics, ESWC2006,
          <string-name>
            <surname>Budva</surname>
          </string-name>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cui</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Corpus Exploitation from Wikipedia for Ontology Construction</article-title>
          .
          <source>Conference on Language Resources and Evaluation</source>
          , LREC2008,
          <string-name>
            <surname>Morocco</surname>
          </string-name>
          (
          <year>2008</year>
          )
          <fpage>10</fpage>
          .
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weld</surname>
            ,
            <given-names>D. S.</given-names>
          </string-name>
          :
          <article-title>Automatically Refining the Wikipedia Infobox Ontology</article-title>
          .
          <source>In 17th International World Wide Web Conference</source>
          , Beijing (
          <year>2008</year>
          )
          <fpage>11</fpage>
          .
          <string-name>
            <surname>Kozlova</surname>
          </string-name>
          , N.:
          <article-title>Automatic Ontology Extraction for Document Classification</article-title>
          . Ma. Thesis, Saarland University (
          <year>2006</year>
)
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          1.
          <string-name>
            <surname>Asgharbeygi</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Maleki</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          “
          <article-title>Geodesic K-means clustering”</article-title>
          .
          <source>Pattern Recognition</source>
          ,
          <year>2008</year>
          ,
          <string-name>
            <surname>ICPR</surname>
          </string-name>
          <year>2008</year>
          , 19th International Conference on.
          <source>Dec</source>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bahmani</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Firouzi</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Niknam</surname>
            , and
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Nayeripour</surname>
          </string-name>
          .
          <article-title>“A New Evolutionary Algorithm for Cluster Analysis”</article-title>
          .
          <source>Proceedings of world Academy of Science</source>
          , Engineering and Technology Vol.
          <volume>36</volume>
          ,
          <string-name>
            <surname>Dec</surname>
          </string-name>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ball</surname>
            , G. and
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hall</surname>
          </string-name>
          , “
          <article-title>A clustering technique for summarizing multivariate data”, (ISODATA), Behav Sci</article-title>
          ., vol.
          <volume>12</volume>
          , pp.
          <fpage>153</fpage>
          -
          <lpage>155</lpage>
          ,
          <year>1967</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Belal</given-names>
            <surname>Al-Zoubi</surname>
          </string-name>
          , Al-Zoubi, Amjad Hudaib, Ammar Huneiti and
          <string-name>
            <given-names>Bassam</given-names>
            <surname>Hammo</surname>
          </string-name>
          . “
          <article-title>New Efficient Strategy to Accelerate k-Means Clustering Algorithm”</article-title>
          .
          <source>American Journal of Applied Sciences</source>
          <volume>5</volume>
          (
          <issue>9</issue>
          )
          <fpage>1247</fpage>
          -
          <lpage>1250</lpage>
          ,
          <string-name>
            <given-names>Science</given-names>
            <surname>Publications</surname>
          </string-name>
          .
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          5.
          <string-name><surname>Bradley</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><given-names>U.</given-names> <surname>Fayyad</surname></string-name>
          . “<article-title>Refining initial points for k-means clustering</article-title>”,
          <source>in Proc. 15th Int. Conf. Machine Learning</source>,
          <year>1998</year>, pp.
          <fpage>91</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          6.
          <string-name><surname>Deelers</surname>, <given-names>S.</given-names></string-name>
          and
          <string-name><given-names>S.</given-names> <surname>Auwatanamongkol</surname></string-name>
          . “<article-title>Enhancing K-Means Algorithm with Initial Cluster Centers Derived from Data Partitioning along the Data Axis with the Highest Variance</article-title>”.
          <source>Proceedings of World Academy of Science, Engineering and Technology</source>, Vol.
          <volume>26</volume>, Dec.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          7.
          <string-name><given-names>Zejin</given-names> <surname>Ding</surname></string-name>
          ,
          <string-name><given-names>Jian</given-names> <surname>Yu</surname></string-name>
          and
          <string-name><given-names>Yang-Qing</given-names> <surname>Zhang</surname></string-name>
          . “<article-title>A New Improved K-Means Algorithm with Penalized Term</article-title>”.
          <source>Granular Computing, 2007 (GRC 2007), IEEE International Conference on</source>, Nov.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          8.
          <string-name><surname>Duda</surname>, <given-names>R.O.</given-names></string-name>
          and
          <string-name><surname>Hart</surname>, <given-names>P.E.</given-names></string-name>
          :
          <article-title>Pattern Classification and Scene Analysis</article-title>
          . John Wiley &amp; Sons, New York, NY,
          <year>1973</year>.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          9.
          <string-name><surname>Estivill-Castro</surname>, <given-names>V.</given-names></string-name>
          and
          <string-name><given-names>J.</given-names> <surname>Yang</surname></string-name>
          , “<article-title>A fast and robust general purpose clustering algorithm</article-title>”.
          <source>In Proc. 6th Pacific Rim Int. Conf. Artificial Intelligence (PRICAI’00)</source>,
          <string-name><given-names>R.</given-names> <surname>Mizoguchi</surname></string-name>
          and J. Slaney, Eds., Melbourne, Australia,
          <year>2000</year>, pp.
          <fpage>208</fpage>-<lpage>218</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          10.
          <string-name><surname>Fayyad</surname>, <given-names>U.M.</given-names></string-name>
          ,
          <string-name><surname>Piatetsky-Shapiro</surname>, <given-names>G.</given-names></string-name>
          ,
          <string-name><surname>Smyth</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><surname>Uthurusamy</surname>, <given-names>R.</given-names></string-name>
          :
          <article-title>Advances in Knowledge Discovery and Data Mining</article-title>
          . AAAI/MIT Press,
          <year>1996</year>.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          11.
          <string-name><surname>Fisher</surname>, <given-names>D.</given-names></string-name>
          :
          <article-title>Knowledge Acquisition via Incremental Conceptual Clustering</article-title>
          .
          <source>Machine Learning</source>, Vol.
          <volume>2</volume>, No.
          <issue>2</issue>
          (<year>1987</year>)
          <fpage>139</fpage>-<lpage>172</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          12.
          <string-name><surname>Forgy</surname>, <given-names>E.</given-names></string-name>
          , “<article-title>Cluster analysis of multivariate data: Efficiency vs. interpretability of classification</article-title>”,
          <source>Biometrics</source>, vol.
          <volume>21</volume>, pp.
          <fpage>768</fpage>-<lpage>780</lpage>,
          <year>1965</year>.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          13.
          <string-name><surname>Frahling</surname>, <given-names>G.</given-names></string-name>
          and
          <string-name><given-names>Ch.</given-names> <surname>Sohler</surname></string-name>
          . “<article-title>A fast k-means implementation using coresets</article-title>”.
          <source>International Journal of Computational Geometry &amp; Applications</source>, Dec.
          <year>2008</year>, Vol.
          <volume>18</volume>, Issue
          <issue>6</issue>, pp.
          <fpage>605</fpage>-<lpage>625</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          14.
          <string-name><surname>Gupta</surname>, <given-names>S.</given-names></string-name>
          ,
          <string-name><given-names>K.</given-names> <surname>Rao</surname></string-name>
          and
          <string-name><given-names>V.</given-names> <surname>Bhatnagar</surname></string-name>
          , “<article-title>K-means clustering algorithm for categorical attributes</article-title>”,
          <source>in Proc. 1st Int. Conf. Data Warehousing and Knowledge Discovery (DaWaK’99)</source>, Florence, Italy,
          <year>1999</year>, pp.
          <fpage>203</fpage>-<lpage>208</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          15.
          <string-name><surname>Hansen</surname>, <given-names>P.</given-names></string-name>
          and
          <string-name><given-names>E.</given-names> <surname>Ngai</surname></string-name>
          . “<article-title>Analysis of Global k-means, an Incremental Heuristic for Minimum Sum of Squares Clustering</article-title>”.
          <source>Journal of Classification</source>
          <volume>22</volume>: <fpage>287</fpage>-<lpage>310</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          16.
          <string-name><surname>Huang</surname>, <given-names>Z.</given-names></string-name>
          , “<article-title>Extensions to the k-means algorithm for clustering large data sets with categorical values</article-title>”.
          <source>Data Mining and Knowledge Discovery</source>, vol.
          <volume>2</volume>, pp.
          <fpage>283</fpage>-<lpage>304</lpage>,
          <year>1998</year>.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          17.
          <string-name><surname>Kanungo</surname>, <given-names>T.</given-names></string-name>
          ,
          <string-name><surname>Mount</surname>, <given-names>D.M.</given-names></string-name>
          ,
          <string-name><surname>Netanyahu</surname>, <given-names>N.S.</given-names></string-name>
          ,
          <string-name><surname>Piatko</surname>, <given-names>C.D.</given-names></string-name>
          ,
          <string-name><surname>Silverman</surname>, <given-names>R.</given-names></string-name>
          and
          <string-name><surname>Wu</surname>, <given-names>A.Y.</given-names></string-name>
          :
          <article-title>An Efficient K-means Clustering Algorithm: Analysis and Implementation</article-title>
          .
          <source>IEEE Transactions on Pattern Analysis and Machine Intelligence</source>
          . Vol.
          <volume>24</volume>, No.
          <issue>7</issue>
          (<year>2002</year>)
          <fpage>881</fpage>-<lpage>892</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          18.
          <string-name><surname>Kashima</surname>, <given-names>H.</given-names></string-name>
          ,
          <string-name><surname>Hu</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><surname>Ray</surname>, <given-names>B.</given-names></string-name>
          and
          <string-name><surname>Singh</surname>, <given-names>M.</given-names></string-name>
          . “<article-title>K-means clustering of proportional data using L1 distance</article-title>”.
          <source>Pattern Recognition, 2008 (ICPR 2008), International Conference on</source>, Dec.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          19.
          <string-name><surname>Kao</surname>, <given-names>Yi-Tung</given-names></string-name>
          ,
          <string-name><surname>Zahara</surname>, <given-names>Erwie</given-names></string-name>
          and
          <string-name><surname>Kao</surname>, <given-names>I-Wei</given-names></string-name>
          . “<article-title>A hybridized approach to data clustering</article-title>”.
          <source>Expert Systems with Applications</source>. Vol.
          <volume>34</volume>, Issue
          <issue>3</issue>, pp.
          <fpage>1754</fpage>-<lpage>1762</lpage>. Apr.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          20.
          <string-name><surname>Kaufman</surname>, <given-names>L.</given-names></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Rousseeuw</surname></string-name>
          .
          <article-title>Finding Groups in Data: An Introduction to Cluster Analysis</article-title>
          : Wiley,
          <year>1990</year>.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          21.
          <string-name><surname>Krishna</surname>, <given-names>K.</given-names></string-name>
          and
          <string-name><given-names>M.</given-names> <surname>Murty</surname></string-name>
          , “<article-title>Genetic K-means algorithm</article-title>”.
          <source>IEEE Trans. Syst., Man, Cybern. B, Cybern.</source>, vol.
          <volume>29</volume>, no.
          <issue>3</issue>, pp.
          <fpage>433</fpage>-<lpage>439</lpage>, Jun.
          <year>1999</year>.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          22.
          <string-name><surname>Likas</surname>, <given-names>A.</given-names></string-name>
          ,
          <string-name><surname>Vlassis</surname>, <given-names>N.</given-names></string-name>
          and
          <string-name><surname>Verbeek</surname>, <given-names>J.J.</given-names></string-name>
          :
          <article-title>The Global K-means Clustering Algorithm</article-title>
          .
          <source>Pattern Recognition, The Journal of the Pattern Recognition Society</source>
          . Vol.
          <volume>36</volume>, No.
          <issue>2</issue>
          (<year>2003</year>)
          <fpage>451</fpage>-<lpage>461</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          23.
          <string-name><surname>Lloyd</surname>, <given-names>S.P.</given-names></string-name>
          , “<article-title>Least squares quantization in PCM</article-title>”. Unpublished Bell Lab.
          <source>Tech. Note, portions presented at the Institute of Mathematical Statistics Meeting, Atlantic City</source>, NJ, Sep.
          <year>1957</year>.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>IEEE Trans. Inform. Theory (Special Issue on Quantization)</source>, vol. IT-28, pp.
          <fpage>129</fpage>-<lpage>137</lpage>, Mar.
          <year>1982</year>.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          24.
          <string-name><surname>MacQueen</surname>, <given-names>J.</given-names></string-name>
          :
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability</source>
          . Vol.
          <volume>1</volume>
          . Berkeley, CA (<year>1967</year>)
          <fpage>281</fpage>-<lpage>297</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          25. MathWorks. http://www.mathworks.com
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          26.
          <string-name><surname>Kantardzic</surname>, <given-names>M.</given-names></string-name>
          :
          <article-title>Data Mining: Concepts, Models, Methods, and Algorithms</article-title>
          . John Wiley &amp; Sons.
          <year>2003</year>.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          27.
          <string-name><surname>Nguyen</surname>, <given-names>Cao D.</given-names></string-name>
          and
          <string-name><surname>Cios</surname>, <given-names>Krzysztof J.</given-names></string-name>
          . “<article-title>GAKREM: A novel hybrid clustering algorithm</article-title>”.
          <source>Information Sciences</source>. Vol.
          <volume>178</volume>, Issue
          <issue>22</issue>, pp.
          <fpage>4205</fpage>-<lpage>4227</lpage>, Nov.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          28.
          <string-name><surname>Peña</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Lozano</surname></string-name>
          and
          <string-name><given-names>P.</given-names> <surname>Larrañaga</surname></string-name>
          , “<article-title>An empirical comparison of four initialization methods for the k-means algorithm</article-title>”.
          <source>Pattern Recognition Letters</source>, vol.
          <volume>20</volume>, pp.
          <fpage>1027</fpage>-<lpage>1040</lpage>,
          <year>1999</year>.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          29.
          <string-name>
            <surname>Pérez</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodolfo Pazos</surname>
            <given-names>R</given-names>
          </string-name>
          , Laura Cruz R., Gerardo Reyes S., Rosy Basave T. and Héctor Fraire H. “
          <article-title>Improvement the Efficiency and Efficacy of the K-means Clustering Algorithm through a New Convergence Condition”</article-title>
          .
          <source>Computational Science and Its Applications - ICCSA 2007 - International Conference Proceedings</source>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          30.
          <string-name><surname>Pérez</surname>, <given-names>J.</given-names></string-name>
          ,
          <string-name><given-names>M.F.</given-names> <surname>Henriques</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Pazos</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Cruz</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>Reyes</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Salinas</surname></string-name>
          and
          <string-name><given-names>A.</given-names> <surname>Mexicano</surname></string-name>
          .
          <article-title>Mejora al Algoritmo de K-means mediante un Nuevo criterio de convergencia y su aplicación a bases de datos poblacionales de cáncer</article-title>
          . 2do Taller Latino Iberoamericano de Investigación de Operaciones, “La IO aplicada a la solución de problemas regionales”. México. (In Spanish.)
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          31.
          <string-name><surname>Pham</surname>, <given-names>D.T.</given-names></string-name>
          ,
          <string-name><surname>Dimov</surname>, <given-names>S.S.</given-names></string-name>
          and
          <string-name><surname>Nguyen</surname>, <given-names>C.D.</given-names></string-name>
          . “<article-title>Selection of K in K-means clustering</article-title>”.
          <source>Proceedings of the Institution of Mechanical Engineers, Part C: Journal of Mechanical Engineering Science</source>; Vol.
          <volume>219</volume>, Issue
          <issue>1</issue>, pp.
          <fpage>103</fpage>-<lpage>109</lpage>, Jan.
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          32.
          <string-name><surname>Proietti</surname>, <given-names>Guido</given-names></string-name>
          and
          <string-name><given-names>Christos</given-names> <surname>Faloutsos</surname></string-name>
          . “<article-title>Analysis of Range Queries on Real Region Datasets Stored Using an R-Tree</article-title>”.
          <source>IEEE Transactions on Knowledge and Data Engineering</source>, Vol.
          <volume>12</volume>, No.
          <issue>5</issue>, Sep./Oct.
          <year>2000</year>.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          33.
          <string-name><surname>Pun</surname>, <given-names>W.K.D.</given-names></string-name>
          and
          <string-name><surname>Ali</surname>, <given-names>A.S.</given-names></string-name>
          . “<article-title>Unique distance measure approach for K-means (UDMA-Km) clustering algorithm</article-title>”.
          <source>TENCON 2007 - 2007 IEEE Region 10 Conference</source>. Oct. 30,
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          34.
          <string-name><surname>Redmond</surname>, <given-names>Stephen J.</given-names></string-name>
          and
          <string-name><surname>Heneghan</surname>, <given-names>Conor</given-names></string-name>
          . “<article-title>A method for initialising the K-means clustering algorithm using kd-trees</article-title>”.
          <source>Pattern Recognition Letters</source>; Vol.
          <volume>28</volume>, Issue
          <issue>8</issue>, Jun.
          <year>2007</year>.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          35.
          <string-name><given-names>Taoying</given-names> <surname>Li</surname></string-name>
          and
          <string-name><given-names>Yan</given-names> <surname>Chen</surname></string-name>
          . “<article-title>An improved k-means algorithm for clustering using entropy weighting measures</article-title>”.
          <source>Intelligent Control and Automation, 2008 (WCICA 2008), 7th World Congress on</source>, June
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          36.
          <string-name><surname>Tsai</surname>, <given-names>Chieh-Yuan</given-names></string-name>
          and
          <string-name><surname>Chiu</surname>, <given-names>Chuang-Cheng</given-names></string-name>
          . “<article-title>Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm</article-title>”.
          <source>Computational Statistics &amp; Data Analysis</source>. Vol.
          <volume>52</volume>, Issue
          <issue>10</issue>, Jun.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          37.
          <string-name><surname>Asuncion</surname>, <given-names>A.</given-names></string-name>
          and
          <string-name><surname>Newman</surname>, <given-names>D.J.</given-names></string-name>
          (<year>2007</year>). UCI Machine Learning Repository [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, School of Information and Computer Science.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          38.
          <string-name><given-names>Wei</given-names> <surname>Song</surname></string-name>
          ,
          <string-name><given-names>Cheng Hua</given-names> <surname>Li</surname></string-name>
          and
          <string-name><given-names>Soon Cheol</given-names> <surname>Park</surname></string-name>
          . “<article-title>Genetic Algorithm for text clustering using ontology and evaluating the validity of various semantic similarity measures</article-title>”.
          <source>Expert Systems with Applications</source>. Vol.
          <volume>36</volume>, Issue
          <issue>5</issue>, Jul.
          <year>2009</year>.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          39.
          <string-name><given-names>Wesam</given-names> <surname>Barbakh</surname></string-name>
          and
          <string-name><given-names>Colin</given-names> <surname>Fyfe</surname></string-name>
          . “<article-title>Local vs global interactions in clustering algorithms: Advances over K-means</article-title>”.
          <source>International Journal of Knowledge-Based and Intelligent Engineering Systems</source>
          <volume>12</volume>
          (<year>2008</year>):
          <fpage>83</fpage>-<lpage>99</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          40.
          <string-name>
            <surname>Witten</surname>
            ,
            <given-names>I.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Frank</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations</article-title>
          . Morgan Kaufmann Publishers. San Diego, CA (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          41.
          <string-name><given-names>Xindong</given-names> <surname>Wu</surname></string-name>
          ,
          <string-name><given-names>V.</given-names> <surname>Kumar</surname></string-name>
          ,
          <string-name><given-names>J. Ross</given-names> <surname>Quinlan</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Ghosh</surname></string-name>
          ,
          <string-name><given-names>Q.</given-names> <surname>Yang</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Motoda</surname></string-name>
          ,
          <string-name><given-names>G.J.</given-names> <surname>McLachlan</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Ng</surname></string-name>
          ,
          <string-name><given-names>B.</given-names> <surname>Liu</surname></string-name>
          ,
          <string-name><given-names>P.S.</given-names> <surname>Yu</surname></string-name>
          ,
          <string-name><given-names>Z.</given-names> <surname>Zhou</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Steinbach</surname></string-name>
          ,
          <string-name><given-names>D.J.</given-names> <surname>Hand</surname></string-name>
          and
          <string-name><given-names>D.</given-names> <surname>Steinberg</surname></string-name>
          . “<article-title>Top 10 algorithms in data mining</article-title>”.
          <source>Knowl Inf Syst</source>
          (<year>2008</year>). Springer.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          42.
          <string-name><given-names>Rui</given-names> <surname>Xu</surname></string-name>
          and
          <string-name><given-names>Donald</given-names> <surname>Wunsch II</surname></string-name>
          .
          <article-title>Survey of Clustering Algorithms</article-title>
          .
          <source>IEEE Transactions on Neural Networks</source>, Vol.
          <volume>16</volume>, No.
          <issue>3</issue>, May
          <year>2005</year>.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          43.
          <string-name><surname>Zalik</surname>, <given-names>Krista Rizman</given-names></string-name>
          . “<article-title>An Efficient k’-means Clustering Algorithm</article-title>”.
          <source>Pattern Recognition Letters</source>, Vol.
          <volume>29</volume>, Issue
          <issue>9</issue>, pp.
          <fpage>1385</fpage>-<lpage>1391</lpage>. Elsevier, Jul.
          <year>2008</year>.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          44.
          <string-name><surname>Zhang</surname>, <given-names>Z.</given-names></string-name>
          ,
          <string-name><given-names>B. Tian</given-names> <surname>Dai</surname></string-name>
          and
          <string-name><surname>Tung</surname>, <given-names>A.K.H.</given-names></string-name>
          . “<article-title>On the Lower Bound of Local Optimums in K-means Algorithm</article-title>”.
          <source>Data Mining, 2006 (ICDM’06), Sixth International Conference on Data Mining</source>, Dec.
          <year>2006</year>.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          45.
          <string-name><surname>Zhang</surname>, <given-names>Chen</given-names></string-name>
          and
          <string-name><surname>Xia</surname>, <given-names>Shixiong</given-names></string-name>
          . “<article-title>K-means Clustering Algorithm with Improved Initial Center</article-title>”.
          <source>Knowledge Discovery and Data Mining, 2009, Second International Workshop on</source>, 23-25 Jan.
          <year>2009</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>