<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Advances on Semantic Web and New Technologies</article-title>
      </title-group>
      <pub-date>
        <year>2010</year>
      </pub-date>
      <fpage>21</fpage>
      <lpage>62</lpage>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Davide Buscaldi and Gerardo Sierra were the invited speakers at this Third Workshop on
Semantic Web.</p>
      <p>Davide Buscaldi is currently completing his Ph.D. in pattern recognition and artificial
intelligence at the UPV - Universidad Politécnica de Valencia (Spain), with a thesis titled
"Toponym Disambiguation in NLP Applications". His research interests are mainly focused
on question answering, word sense disambiguation and geographical information retrieval.
He obtained his DEA (Diploma de Estudios Avanzados) in 2008 with a dissertation on the
"integration of resources for QA and GIR". He is the author of over 40 papers in
international conferences, workshops and journals. He was awarded an FPI grant by the
Valencian local government, which allowed him to participate in the "LiveMemories"
project during a research stay at the FBK-IRST research institute in Trento, Italy, under the
direction of Bernardo Magnini. He was responsible, on the UPV side, for the organization of the
QAST (Question Answering on Speech Transcripts) track at CLEF 2009. Currently, he is a
member of the Natural Language Engineering (NLE) Lab of the Universidad Politécnica de
Valencia.</p>
      <p>Gerardo Sierra holds a Ph.D. in Computational Linguistics from UMIST, England. He is the
coordinator of the Linguistic Engineering Group at UNAM. He has promoted this area in
teaching as well as in research and development, in areas such as computational
lexicography, terminotics, information retrieval and extraction, text mining and corpus
linguistics. Currently, he is a level A researcher, National Researcher level II, a CONACYT project
evaluator, and a member of several scientific committees. He has taught courses at UNAM, for
the Faculties of Engineering and of Philosophy and Letters, in postgraduate programs in Linguistics,
Biotechnology and Computer Science.
Invited Paper
Ambiguous Place Names on the Web
Davide Buscaldi.</p>
      <p>SV: a Visualization Mechanism for Ontologies of Records
Based on SVG Graphics
Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodríguez, and Argelia B. Urbina.</p>
      <p>Modeling of CSCW system with Ontologies
Mario Anzures-García, Luz A. Sánchez-Gálvez, Miguel J. Hornos, Patricia
Paderewski-Rodríguez, and Antonio Cid.</p>
      <p>The Use of WAP Technology in Question Answering
Fernando Zacarías F., Alberto Tellez V., Marco Antonio Balderas, Guillermo De Ita L.,
and Barbara Sánchez R.</p>
      <p>Data Warehouse Development to Identify Regions with High
Rates of Cancer Incidence in México through a Spatial Data
Mining Clustering Task.</p>
      <p>Joaquin Pérez Ortega, María del Rocío Boone Rojas, María Josefa Somodevilla García,
and Mariam Viridiana Meléndez Hernández.</p>
      <p>An Approach of Crawlers for Semantic Web Application
(Short paper)
José Manuel Pérez Ramírez, and Luis Enrique Colmenares Guillen.</p>
      <p>Decryption Through the Likelihood of Frequency of Letters
(Short paper)
Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and Martínez
Cortés Marco Antonio.
</p>
      <p>Ambiguous Place Names on the Web*</p>
      <p>Davide Buscaldi
Natural Language Engineering Lab., ELiRF Research Group,</p>
      <p>Dpto. de Sistemas Informáticos y Computación (DSIC),</p>
      <p>Universidad Politécnica de Valencia, Spain,</p>
      <p>dbuscaldi@dsic.upv.es
Abstract. Geographical information is achieving increasing
importance in the World Wide Web. Every day, the number of users looking for
geographically constrained information grows. Map-based services,
such as Google Maps or Yahoo! Maps, provide users with a graphical interface,
visualizing results on maps. However, most of the geographical
information contained in web documents is represented by means of toponyms,
which in many cases are ambiguous. Therefore, it is important to
properly disambiguate toponyms in order to improve the accuracy of web
searches. The advent of the Semantic Web will make it possible to overcome this
issue by labelling documents with geographical IDs. In this paper we
discuss the problems of using toponyms in web documents instead of
identifying places using tools such as Geonames RDF, focusing on the
errors that affect a prototype geographical web search engine, Geooreka!,
currently under development.</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        The interest of users for geographically constrained information in the Web has
increased over the past years, boosted by the availability of services such as
Google Maps1. Sanderson and Kohler [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] showed that 18.6% of the queries
submitted to the Excite search engine contained at least one geographic term, while
Gan et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] estimated that 12.94% of queries submitted to the AOL search
engine expressed a geographically constrained information need. Most of the
geographical information contained in the Web and in unstructured text consists
of toponyms, or place names. There are two main problems that derive from
using toponyms to represent geographical information. The first one is the
polysemy of toponyms, or toponym ambiguity: a toponym may be used to represent
more than one place, such as "Puebla", which may indicate the city
at 19°03′N, 98°12′W, the state in which it is contained, a suburb of Mexicali in
the state of Baja California, or three more small towns in Mexico. The second
problem is that the mere inclusion of a toponym in a document does not always
mean that the document is geographically relevant with respect to the region or
* We would like to thank the TIN2009-13391-C04-03 research project for partially
supporting this work.
1 http://maps.google.com
area represented by the toponym. In the first case, the solution is constituted
by the Toponym Disambiguation (TD) task, also called toponym grounding
or resolution; in the second case, the solution is to carry out Geographic Scope
Resolution, which is also affected by the problem of toponym ambiguity [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        The Geonames ontology2 provides users with RDF descriptions of more than
6 million places. The use of this ontology would make it possible to include geospatial
semantic information in the Web, eliminating the need for toponym disambiguation.
Unfortunately, as noted by [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], in the Web "references to geographical locations
remain unstructured and typically implicit in nature", determining a "lack of
explicit spatial knowledge within the Web" which "makes it difficult to service
user needs for location-specific information". In this paper, with the help of the
Geooreka!3 system [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a prototype web search engine developed at the
Universidad Politécnica de Valencia in Spain, we will discuss the problems that users interested
in geographically constrained information may encounter because of the ambiguity
of toponyms on the Web.
      </p>
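      <p>As an illustration of what such explicit identifiers look like, the short sketch below retrieves the Geonames RDF description of a single place. It is a minimal sketch, not part of Geooreka!: the rdflib library is an assumed dependency, and the feature id in the URL is illustrative.</p>
      <preformat># Minimal sketch: fetch the Geonames RDF description of one place and
# list its properties (name, feature class, coordinates, ...).
# The feature id below is illustrative; rdflib is an assumed dependency.
from rdflib import Graph

g = Graph()
g.parse("http://sws.geonames.org/3521081/about.rdf", format="xml")

for subj, pred, obj in g:
    # each triple describes one property of this specific, unambiguous place
    print(pred, obj)</preformat>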
    </sec>
    <sec id="sec-3">
      <title>Geooreka!: a Geographical Web Search Engine</title>
      <p>
        Geooreka! is a search engine developed on the basis of our experiences at
GeoCLEF4 [
        <xref ref-type="bibr" rid="ref6 ref7">6,7</xref>
        ], which suggested to us that the use of term-based queries may not be
the optimal method to express a geographically constrained information need.
For instance, it is common for users to employ vernacular names that have a vague
spatial extent and do not correspond to the official administrative place
name terminology. Another issue is the use of vague geographical constraints that
are difficult to translate automatically from natural language into a precise
query. For instance, the query "Cultivos de tabaco al este de Puebla" ("Tobacco
plantations East of Puebla") presents a double problem because of the
ambiguity of the place name and the fact that the geographical constraint "East of" is
vague (for instance, it does not specify whether the search should be constrained within
Mexico or extend to other countries).
      </p>
      <p>
        These issues are addressed in Geooreka! by allowing the user to specify their
geographical information needs through a map-based interface. The user writes a
natural language query representing the query theme (e.g., "Cultivos
de tabaco") and selects a rectangular area on the map in a box (Figure 1), representing
the geographical footprint of the query. All toponyms in the box are retrieved using a
PostGIS database, and then the Web is queried in order to check the maximum
Mutual Information (MI) between the thematic part of the query and all the
places retrieved. The complete architecture of the system can be seen in
Figure 2. Web counts and MI are used to determine which theme–toponym combinations
are most relevant with respect to the information need expressed
by the user (Selection of Relevant Queries). In order to speed up the process,
2 http://www.geonames.org/ontology/
3 http://www.geooreka.eu
4 http://ir.shef.ac.uk/geoclef/
web counts are calculated using the static Google 1T Web database5, indexed
using the jWeb1T interface [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], whereas Yahoo! Search is used to retrieve the
results of the queries composed by the combination of a theme and a toponym.
The key issue in the selection of the relevant queries is to obtain a relevance
model that is able to select the theme–toponym pairs that are most likely to
satisfy the user's information need. On the basis of probability theory,
we assume that the two component parts of a query, a theme T and a place G,
are independent if their conditional probabilities are unaffected by each other, i.e.,
p(T|G) = p(T) and p(G|T) = p(G), or, equivalently, if their joint probability is the product
of their individual probabilities:
p̂(T ∩ G) = p(G) p(T)
(1)
      </p>
      <p>If probabilities are calculated using page counts, that is, as the number of
pages in which the term (or phrase) representing the theme or toponym appears,
divided by Fmax = 2,147,436,244, which is the maximum term frequency
contained in the Google Web 1T database, then p̂(T ∩ G) is the expected probability
of co-occurrence of T and G in the same web page. Clearly, this is only
a rough estimate of whether T occurred in G, since the mere inclusion
of G in a page where T is mentioned does not guarantee a semantic relation
between G and T.</p>
      <p>
        Considering this model for the independence of theme and place, we can
measure the divergence of the expected probability p̂(T ∩ G) from the observed
probability p(T ∩ G): the greater the divergence, the more informative the result
of the query. The Kullback-Leibler measure [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is commonly used to determine the divergence between two probability distributions.
5 http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13
      </p>
      <p>DKL(p(T ∩ G) || p̂(T ∩ G)) = p(T ∩ G) log [ p(T ∩ G) / (p(T) p(G)) ]
(2)
This formula is exactly one of the formulations of the Mutual Information (MI)
of T and G, usually denoted as I(T; G).</p>
    </sec>
    <sec id="sec-4">
      <title>Evaluation</title>
      <p>
        Geooreka! has been evaluated over the GeoCLEF 2005 test set, in order to
compare the results that could be obtained by specifying the geographic footprint by
means of keywords and those that could be obtained using a map-based interface
to de ne the geographic footprint of the query. With this setup, topic title only
was used as input for the Geooreka! thematic part, while the area
corresponding to the geographic scope of the topic was manually selected. Probabilities
were calculated using the number of occurrences in the GeoCLEF collection.
Occurrences for toponyms were calculated by taking into account only the geo
index. The results were calculated over the 25 topics of GeoCLEF-2005, minus
the queries in which the geographic footprint was composed of disjoint areas (for
instance, "Europe" and "USA" or "California" and "Australia"), which could
not be processed by Geooreka!. Mean Reciprocal Rank (MRR) was used as the
measure of accuracy. The GIR system GeoWorSE, where queries are specified
by text, was used as a baseline [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Table 1 displays the obtained results.
      </p>
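      <p>For reference, MRR is the mean over topics of the reciprocal rank of the first relevant result. A minimal sketch of the measure, with hypothetical input structures, follows.</p>
      <preformat># Mean Reciprocal Rank: average of 1/rank of the first relevant result.
# The input layout (one entry per topic) is illustrative.
def mean_reciprocal_rank(ranked_results, relevant):
    total = 0.0
    for results, rel in zip(ranked_results, relevant):
        for rank, doc in enumerate(results, start=1):
            if doc in rel:          # first relevant document found
                total += 1.0 / rank
                break
    return total / len(ranked_results)</preformat>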
      <p>The results show that the web-based results are considerably worse than those
obtained on the static collection. This is due primarily to two reasons. First,
the topics were tailored to the GeoCLEF collection; therefore, some
topics refer explicitly to events that are particularly salient in the collection
and are easier to retrieve. For instance, query GC-005 "Japanese Rice Imports"
targets documents regarding the opening of the Japanese rice market for the first
time to other countries; "Japan" and "Rice" appear together in the document collection
only in such documents, therefore it is easier to retrieve the relevant
documents when searching the GeoCLEF collection.</p>
      <p>The second factor affecting the results for the Web-based system is the
ambiguity of toponyms, which prevents a correct estimation of the probabilities
for places. For instance, in the results obtained for topic GC-008 ("Milk
Consumption in Europe"), the MI obtained for "Turkey" was abnormally high with
respect to the expected value for this country. The reason is that in most
documents, the name "turkey" referred to the animal and not to the country.
This kind of ambiguity represents one of the most important issues when
estimating the probability of occurrence of places. Ambiguity (or, better, the
polysemy of toponyms) grows together with the size and the scope of the
collection being searched. The GeoCLEF collection was also semantically tagged
using WordNet and Geonames IDs to identify the places referenced by toponyms,
while Web content is rarely tagged using precise IDs, thereby increasing the
chance of error in the estimation of probabilities for places which share the same
name.</p>
      <p>
        There are three kinds of toponym ambiguity that can be recognised (extending the
two main types identified by [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]):
- Geo / Non-Geo ambiguity: a toponym is ambiguous with respect
to another class of name (such as "Turkey", which may be the animal or the
country);
- Geo / Geo ambiguity of different class: for instance, "Puebla" the city or the
state;
- Same-class Geo / Geo ambiguity.
      </p>
      <p>The solution in all cases would be to use an ontology to precisely identify places
in documents; the only difference is the amount of information that the ontology
should include. For the first type of ambiguity, the only information needed is
whether the name represents a place or not. In the second case, we would also
need to know the class of the place. Finally, for Geo / Geo ambiguity, we may
differentiate places using their coordinates or by knowing the containing entity,
or both. The Geonames ontology contains all this information and represents
the best option for geographically tagging place names.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusions</title>
      <p>The results obtained with Geooreka! over a static, semantically labelled (at least
from a geographical viewpoint) collection, compared to the results obtained on
the Web, show that the imprecise identification of places is a problem for
search engines aimed at users who are interested in searching for geographically
constrained information. The use of precise semantic tagging schemes for
toponyms, such as Geonames RDF, would allow these search engines to produce
more reliable results. Spreading the use of geographical tagging in the Semantic
Web would also allow users to mine information using geographical constraints
in a more effective way. In this sense, we would like to encourage the use of
Geonames in order to produce accurately geographically tagged Web content.</p>
      <p>SV: a visualization mechanism for ontologies of
records based on SVG graphics
Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodríguez, Argelia B. Urbina
Universidad Politécnica de Puebla
Tercer Carril del Ejido Serrano S/N</p>
      <p>Juan C. Bonilla, Puebla, Mexico
{mmedina, mcruz, rrodriguez, aurbina}@uppuebla.edu.mx,</p>
      <p>WWW home page: http://informatica.uppuebla.edu.mx/</p>
      <p>~mmedina, ~rrodriguez, ~aurbina
Abstract. This paper describes SV, a visualization mechanism used to
explore digital collections represented as hierarchical structures called
ontologies of records. These ontologies are XML files constructed using
OAI-PMH records and a clustering algorithm. SV is composed of a web
interface and SVG graphics. Through the interface, users can recognize
the organization of the collection and access the metadata of documents.</p>
      <sec id="sec-5-1">
        <title>Introduction</title>
        <p>Digital libraries gather valuable information. Organizations such as the Open
Archives Initiative (OAI1) have proposed different alternatives for sharing data. The
Protocol for Metadata Harvesting (OAI-PMH), for example, supports
interoperability between federated digital libraries. Documents are described by
metadata records. Dublin Core Metadata (DC2) is the default metadata format
for this protocol.</p>
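        <p>A minimal sketch of harvesting Dublin Core records through OAI-PMH is shown below; the repository URL is a placeholder, and the requests library is an assumed dependency.</p>
        <preformat># Minimal OAI-PMH harvesting sketch: list the titles of the Dublin
# Core records exposed by a repository. The endpoint URL is hypothetical.
import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://example.org/oai"   # placeholder repository
DC = "{http://purl.org/dc/elements/1.1/}"

resp = requests.get(OAI_ENDPOINT,
                    params={"verb": "ListRecords", "metadataPrefix": "oai_dc"})
root = ET.fromstring(resp.content)
for title in root.iter(DC + "title"):
    print(title.text)</preformat>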
        <p>
          The services and the collections of digital libraries are enriched in the
Semantic Web. The use of XML, the Resource Description Framework (RDF), OWL,
conceptual maps and other metadata technologies is aimed at improving search
tasks [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Semantic Digital Libraries (SDLs) refer to systems built upon
digital libraries and social networking technologies (Web 2.0) [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Freely distributed
software exists to construct SDLs, such as Greenstone3 or JeromeDL4. In this
type of software, ontologies play a key role: they are explicit specifications of
shared conceptualizations [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Ontologies enable the representation of knowledge
that software and human agents can understand and use.
        </p>
        <p>This paper proposes the use of ontologies called "ontologies of records",
represented as XML documents, as the basis of a visualization mechanism
1 http://www.openarchives.org/
2 http://dublincore.org
3 http://www.greenstone.org/
4 http://www.jeromedl.org/
called semantic view (SV). The name also refers to the first two letters of
"Scalable Vector Graphics". SV offers an interactive view that allows users to explore
the content of a federated collection.</p>
        <p>The paper is organized as follows. Section 2 describes the features of an
ontology of records. Section 3 includes related work. Sections 4 and 5 explain
the design and implementation of SV, respectively. Experimental results are
described in Section 6. Finally, Section 7 includes conclusions and suggests future
directions for our work.</p>
      </sec>
      <sec id="sec-5-2">
        <title>What is an ontology of records</title>
        <p>
          An ontology of records is a hierarchical structure of clusters of OAI-PMH records
that provides an unambiguous interpretation of its elements. Its construction
is based on the Frequent Itemset Hierarchical Clustering algorithm [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This
structure organizes a collection of documents and has concept-term relationships
useful for keyword-based searches. An ontology of records is stored as a
well-formed XML file that is validated against an XML Schema. An ontology of
records has the following features [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]:
1. Documents are clustered by similarity
2. Clusters at the k-level have labels of k terms
3. All the records of a cluster share the terms of its label
      </p>
      </sec>
      <sec id="sec-5-3">
        <title>Related work</title>
        <p>
          This section describes some systems that have been used to visualize collections
of documents. Proat et al. [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] use 3D trees to visualize documents organized
according to the Library of Congress Classification (LCC). Documents are
clustered in seven subsets. The interface has controls to rotate or zoom the nodes of
trees. The leaf nodes contain metadata of documents.
        </p>
        <p>
          Geroimenko et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] have proposed the Generalized Document Object Model
tree Interface (G-DOM-Tree interface) to visualize metadata from XML DOM
(Document Object Model) documents. The model displays a hierarchy of labels,
very similar to the visualization that browsers offer for XML Schemas. The
interface is implemented as a Java applet or a Flash film.
        </p>
        <p>
          Fluit et al. [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] describe Spectacle, a mechanism that uses lightweight ontologies
to represent classes of similar objects and their relationships. Navigation
can be done by using hypertext or "cluster maps". A cluster map visualizes the
objects and their classes.
        </p>
        <p>
          Finally, Sánchez et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] use a starfield grid to visualize documents from
several collections. Documents are stored as OAI-PMH records. The axes of the
grid represent attributes of the collections that can be chosen by users. Small
polygons are associated with the type of document, and different colors are used
to distinguish the collections.
        </p>
      </sec>
      <sec id="sec-5-4">
        <title>Design of SV</title>
        <p>The design of SV is aimed at reaching the following objectives:
- Construct a visualization mechanism with semantic features that allows users
to explore a collection of documents
- Represent the organization of a collection of documents
- Retrieve the metadata and the content of a given document</p>
        <p>
          In order to reach these objectives, we have used the levels of knowledge
proposed by [
          <xref ref-type="bibr" rid="ref2">2</xref>
        ] in the design of SV. We use CORTUPP as a test bed;
this is a collection represented as an ontology of records5.
1. Level 1: Organization of the metadata. Metadata is organized in the
ontology of records. Content information is stored in the dc:title, dc:subject
and dc:description elements.
2. Level 2: Organization of the information in the documents.
Technical reports have a common structure formed by six mandatory chapters:
1) research purpose, 2) state of the art, 3) research design, 4) implementation,
5) results and 6) conclusions. This structure is defined in a LaTeX template.
The BibTeX file format is used to manage the bibliography. A technical
report is described as a @techreport entry.
3. Level 3: Organization of the information in databases. The technical
reports are stored as PDF files in a database that also includes data and
counts of users. Documents are accessible through a web interface.
4. Level 4: Organization of the topics treated in the documents. The
dc:subject element stores the topic of a document. Keywords of this
element belong to the labels of the clusters in the ontology of records.
5. Level 5: Organization of the concepts, terms and relations. This
level is also represented in the ontology of records.
      </p>
      </sec>
      <sec id="sec-5-5">
        <title>Implementation of SV</title>
        <p>SV is formed by a web interface and SVG graphics6. SVG is a format developed
and maintained by the W3C SVG Working Group. It is an XML application
used to describe animated or static two-dimensional vector graphics. The main
feature of these graphics is scalability.</p>
        <p>SV uses Xerces, a Java parser, to extract data from an ontology
of records. The classes of SV are built using the Java language. In the interface, each
document, that is, an OAI-PMH record, is represented by a yellow star on a blue
gradient background. The background is divided into five parts that correspond to
the first levels of the ontology. These levels are divided by lines that form angles
of 90 degrees. The distribution of the lines tries to reflect an estimate of the
number of documents that can be found in each level. The documents closer to
5 CORTUPP is available at http://server3.uppuebla.edu.mx/cortupp/
6 http://www.w3.org/svg/
the upper left corner belong to the first level of the ontology; these documents
share one term. The second level shows the documents that share two terms, and
so on. The stars have different sizes according to their level: they are bigger at
the first level and smaller at the last one.</p>
        <p>The interface of SV is an SVG graphic of 502 by 502 pixels. XML Parser is the
Java application used to construct the XML document that contains the
interface. XLink is used to create hyperlinks between documents and their metadata.
By clicking on a star, users can view the metadata on the right panel.
Figure 1 shows the SV interface where only six documents at the second and third
level were included; however, SV is designed to support up to 500 documents. The
colors can be modified without requiring compilation because they are stored in a
text file. The mechanism is accessible at http://informatica.uppuebla.edu.mx/
visualizacionPI/index.html.</p>
        <p>Different configurations of ontologies of records were constructed in order to check
SV; that is, unit tests and integration tests were performed successfully. After
the installation of the SVG Plugin Version 1.7, the visualization of SV was
successful using Internet Explorer 8, Google Chrome 7.0.517.41 and Opera 10.6;
however, there were some inconveniences using Firefox 1.5, Firefox 3.6 and
Firefox Beta, because these versions do not support the animation features of SVG
graphics.</p>
        <p>We have described SV, a visualization mechanism for federated collections based
on ontologies. SV has semantic features represented in the interface, such as the
location of documents in the ontology and the similarity between documents.
Additional semantic information is stored in the metadata attached to each
document and in the ontology of records. Through the SV interface, users can access
the metadata or download a document.</p>
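        <p>The following minimal sketch illustrates the kind of SVG output described above: one document star on the blue background, hyperlinked via XLink to its metadata. It is an illustration under assumed names and coordinates, not the real SV code (which is written in Java).</p>
        <preformat># Illustrative sketch of an SV-like SVG fragment: a yellow document
# "star" linked through XLink to a hypothetical metadata entry.
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)

svg = ET.Element("svg", xmlns="http://www.w3.org/2000/svg",
                 width="502", height="502")
ET.SubElement(svg, "rect", width="502", height="502", fill="#27408b")
link = ET.SubElement(svg, "a", {"{%s}href" % XLINK: "#metadata-doc42"})
ET.SubElement(link, "polygon", fill="yellow",
              points="50,35 54,47 66,47 56,54 60,66 50,58 40,66 44,54 34,47 46,47")
ET.ElementTree(svg).write("sv_sketch.svg")</preformat>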
        <p>CORTUPP was used as a test bed for SV; however, any collection of
OAI-PMH records represented as an ontology of records can be visualized. Although
the size of an ontology of records can impact the visualization of SV, its design is
flexible enough to support distinct collections. As future work, we plan to expand
SV to show the clusters and their labels. Then, we would like to incorporate
tagging and recommendation mechanisms.</p>
        <p>Modeling of CSCW system with Ontologies
Abstract. In recent years, there has been a growing interest in the development
and use of domain ontologies, strongly motivated by the Semantic Web
initiative. However, the application of ontologies in the CSCW domain has
been scarce. Therefore in this paper, it presents a novel architectural model to
CSCW systems described by means of an ontology. This ontology defines the
fundamental organization of a CSCW system, represented in its concepts,
relations, axioms and instances.</p>
        <sec id="sec-5-5-1">
          <title>1 Introduction</title>
          <p>
            In the last two decades, the enormous growth of the Internet and the Web has given rise
to an intercreative cyberspace, in which groups of people can communicate,
collaborate and coordinate to carry out common tasks. Therefore, a great number of
groupware applications have been developed using different approaches, including
object-oriented, component-oriented, and agent-oriented ones. However, the
development of this kind of application is very complex, because different elements
and aspects must be taken into account. Hence, these applications must be
simultaneously supported by models, methodologies, architectures and platforms to
be developed in keeping with current needs. In the groupware domain, one of the
most used models is the Unified Modelling Language (UML) [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ], although it has
no element to represent constraints, which are very important in applications as
complex as groupware ones.
          </p>
          <p>There has recently been an increase in the use of ontologies to
model applications in many domains. An ontology serves as a resource for organizing and
representing knowledge through an abstract model. This representation model
provides a common vocabulary for a domain and defines the meaning of the terms and
the relations among them. In the domain of groupware applications, an ontology
provides a well-defined common and shared vocabulary, which supplies a set of
concepts, relations and axioms to describe this domain in a formal way.</p>
          <p>In this paper, two ontologies for the groupware domain are proposed. The first
ontology determines who authorizes the registration of users, how interaction is carried
out among them, and how the turns for user participation are defined, among other
aspects. Moreover, it supports modifications at runtime, such as changing
the user role, the rights/obligations of a role, the current policy, etc. The second
ontology establishes the necessary SOA-based services to develop groupware
applications in accordance with the existing literature on the
development of this type of application. In addition, these services are clustered into
modules and layers with respect to the concern that they represent.</p>
          <p>This paper is organized as follows. Section 2 gives a brief introduction to the
ontologies. Section 3 describes the ontology-based modeling of the group
organizational structure. Section 4 presents an ontological model, which allows us to
specify an architectural model for the development of groupware applications.
Finally, Section 5 outlines some conclusions and future work.</p>
        </sec>
        <sec id="sec-5-5-2">
          <title>2 Introduction to the Ontologies</title>
          <p>
            There are several definitions of ontology, which have different connotations
depending on the specific domain. In this paper, we will refer to Gruber’s well-known
definition [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], where an ontology is an explicit specification of a conceptualization.
For Gruber, a conceptualization is an abstract and simplified view of the world that
we wish to represent for some purpose, consisting of the objects, concepts, and other entities
that are presumed to exist in some area of interest, and the relationships that hold
among them. Furthermore, an explicit specification means that concepts and relations need to
be couched in explicit names and definitions.
          </p>
          <p>
            Jasper and Uschold [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] identify four main categories of ontology applications: 1)
neutral authoring, 2) ontology-based specification, 3) common access to information,
and 4) ontology-based search. In the work presented here, the main idea is to use
ontologies to specify the modeling of both the group organizational structure and the
architectural model in the groupware domain, since an ontology is a high-level formal
specification of a certain knowledge domain, which provides a simplified and
well-defined view of such a domain.
          </p>
          <p>An ontology is specified using the following components:
- Classes: There is a set of classes, which represent concepts that belong to the
ontology. Each class may contain individuals (or instances), other classes or a
combination of both, with their corresponding attributes.
- Relations: These define interactions between two or several classes (object
properties) or between a concept and a data type (data type properties).
- Axioms: These are used to impose constraints on the values of classes or
instances. Axioms represent expressions (logical statements) in the ontology and
are always true inside the ontology.
- Instances: These represent the objects, elements or individuals of an ontology.</p>
          <p>These four components will be described for the two ontologies proposed in this
paper.</p>
          <p>In addition, ontologies require a logical and formal language in which to be expressed. In
Artificial Intelligence, different languages have been developed, such as those based on
First-Order Logic (which provide powerful primitives for modeling), on Frames
(with more expressive power but less inference capacity), and on Description
Logics (which are more robust in reasoning power).</p>
          <p>
            OWL (Web Ontology Language) [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] is a language based on Description Logics for
defining and instantiating Web ontologies based on XML (eXtensible Markup
Language) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] and RDF (Resource Description Framework) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]. OWL can be used to
explicitly represent the meaning of terms in vocabularies and the relationships among
those terms. This language makes it possible to infer new knowledge from a
conceptualization, by using specific software called a reasoner. We have used the tool
Protégé [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], which supports OWL, to define the ontology for the group organizational
structure.
          </p>
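          <p>As a small illustration of these components in OWL, the sketch below encodes, with rdflib, the works (Group, Session) relation that appears later in Section 3.2; the base URI is a placeholder, and the snippet is not taken from the authors' actual ontology files.</p>
          <preformat># Hedged OWL sketch of the four components: two classes, one relation
# (object property) with axiom-like domain/range constraints, and one
# instance. The base URI is hypothetical, not the real ontology's.
from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF, RDFS

CSCW = Namespace("http://example.org/cscw#")
g = Graph()

g.add((CSCW.Group, RDF.type, OWL.Class))            # class
g.add((CSCW.Session, RDF.type, OWL.Class))          # class
g.add((CSCW.works, RDF.type, OWL.ObjectProperty))   # relation
g.add((CSCW.works, RDFS.domain, CSCW.Group))        # axiom: domain
g.add((CSCW.works, RDFS.range, CSCW.Session))       # axiom: range
g.add((CSCW.designTeam, RDF.type, CSCW.Group))      # instance

print(g.serialize(format="turtle"))</preformat>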
          <p>
            In the groupware domain, ontologies have mainly been used to model task analysis
or sessions. Different concepts and terms, such as group, role, actor, task, etc. have
been used for the design of task analysis and sessions. Many of these terms are
considered in our conceptual model. Moreover, semiformal methods (e.g. UML class
diagrams, use cases, activity graphs, transition graphs, etc.) and formal ones (such as
algebraic expressions) have also been applied to model the sessions. There is also a
work [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] for modeling cross-enterprise business processes from the perspective of
cooperative system, which is a multi-level design scheme for the construction of
cooperative system ontologies. This last work is focused on business processes, and it
describes a general scheme for the construction of ontologies. However, in this paper,
we propose to model two specific aspects: the group organizational structure and the
architecture of a groupware application. Consequently, the application domain of both
ontologies is groupware, not business processes.
3 Ontology for specifying an architectural model
In order to specify the architectural model, five concerns are identified: Data, Group,
Cooperation, Application, and Adaptation. Consequently, five layers are considered.
Four layers are composed of modules and services, while the fifth one, the Data
Layer, contains repositories with the necessary information to carry out the group
work. The services of the architectural model are defined by the concepts' ontology.
3.1. Ontology Concepts
The architecture components are characterized through the concepts' ontology (shown
in Figure 1), which is briefly described below:
- Registration is the first action that a user must carry out to be able to participate in the
group work using the collaborative application.
- Authentication validates access to the group and depends on the
organizational style defined in it.
          </p>
          <p>Group is the set of users who work in the session to perform the group work.</p>
          <p>Organizational_Style defines the organizational style that a group will use to
carry out the group work.</p>
          <p>Stage restricts user’s access to the application in accordance with the
organizational style defined in it.</p>
          <p>Session defines a shared workspace where a group carries out common tasks.
Session_Management manages and controls one or more sessions.</p>
          <p>Concurrency manages shared resources to avoid inconsistencies when they are used.
Shared_Resource is used by users to carry out basic activities.</p>
          <p>Basic_Activity is an action that a user must perform to carry out a task (which
can be made up of one or more basic activities).</p>
          <p>Task is carried out by the group to achieve a common goal.</p>
          <p>Notification notifies one or more users of all events that happen in a session.
Group_Awareness gets the necessary information to supply group awareness to
users that take part in a group.</p>
          <p>Group_Memory is supplied by the application to facilitate a common context.
Application is used by the users to carry out group work in an established session.
Configuration configures the application the first time that it is used and when
it is necessary.</p>
          <p>User_Interface shows users all the information about the application execution.
Environment modifies the user interface to present the information in
accordance with the device used by each user.</p>
          <p>Adaptation is a process that allows adapting the collaborative application to the
new needs of the group.</p>
          <p>Detection monitors the execution environment to detect the events that
determine the adaptation process.</p>
          <p>Agreement decides whether an adaptation process must be carried out or not.
Vote_Tool is used by users to perform the agreement.</p>
          <p>Adaptation_Flow is a set of steps carried out to adapt the collaborative
application in accordance with the selected event.</p>
          <p>Repair is required when the adaptation process cannot be performed.</p>
          <p>[Figure 1: the concepts' ontology, showing the architecture components and the relations among them.]</p>
          <p>
3.2. Ontology Relations
The relationships of each architecture component with its environment are symbolized
by the ontology relations (see Figure 1) listed below:
- allows (Registration, Authentication): Only registered users are allowed to
authenticate in order to access the collaborative application.
- access (Authentication, Group): Authentication allows users to access the group.
- depends (Registration, Organizational_Style): User registration depends on
the organizational style defined at a given stage.
- organizes (Organizational_Style, Group): An organizational style specifies the
way in which the group is organized.
- defines (Stage, Organizational_Style): A stage defines an organizational style.
- works (Group, Session): A group needs to be connected to a session to work.
- governs (Session_Management, Session): The session management governs a
session.
- controls (Concurrency, Session): The concurrency service controls the
interaction existing in a session.
- manages (Concurrency, Shared_Resource): The concurrency service manages
the shared resources to guarantee their mutually exclusive usage.
- is_used (Shared_Resource, Basic_Activity): The shared resources are used by
basic activities.
- is_part_of (Basic_Activity, Task): A basic activity is part of a task.
- administers (Session, Notification): The session administers the notification.
- provides (Notification, Group_Awareness): The notification process provides
group awareness.
- obtains (Group, Group_Awareness): A group obtains group awareness to avoid
inconsistencies in the collaborative application.
- supplies (Notification, Group_Memory): The notification process supplies
group memory.
- gives (Application, Group_Memory): The application gives group memory.
- establishes (Application, Session): An application establishes a session.
- presents (Application, User_Interface): An application presents a user
interface so that users can use the collaborative application.
2.3 Data warehouse scheme: ROLAP (Relational OLAP) implementation of
population-based cancer incidence in Mexico.
3 Data Mining Application on Cancer Incidence
The implemented data warehouse has been used to develop a data mining
task based on the integration of additional technologies into the data warehouse,
such as clustering and Geographic Information Systems, which in this case are
very suitable for identifying and displaying areas with incidence of cancer in
Mexico. The following provides a general description of the integration of
technologies and tools (Fig. 3) made for this application.</p>
          <p>The data warehouse integrates the following information for our application:
the spatial component, which allows viewing the regions of municipalities;
population data, such as the death rate and incidence rate; and the time component,
which in this case is the census year.</p>
          <p>
            The INEGI IRIS GIS [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ], through its options, allows the recovery of
population data and the real locations of the municipalities, which are integrated
into the data warehouse.
          </p>
          <p>Since IRIS stores the geographical representation of municipalities in the
standardized "shape" vector format, by means of polygons, a process of
conversion of shapes and formats is needed in order to obtain a numerical
representation of each municipality; in this case, it corresponds to a point at the
municipality's center location. This is accomplished primarily through the tools
of ESRI's ArcInfo GIS.</p>
          <p>
            Given the numerical representation of each municipality as a point (x,
y), along with its rate of incidence of cancer, the Matlab programming
environment and its implementation of the k-means algorithm [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ] [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] are used
to generate patterns/groups of municipalities and the corresponding centroids.
          </p>
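          <p>A hedged Python equivalent of this clustering step is sketched below; the input file layout and column meanings are assumptions, and scikit-learn stands in for the Matlab implementation used by the authors.</p>
          <preformat># Illustrative Python counterpart of the Matlab k-means step: cluster
# municipalities by location (x, y) and incidence rate. File name and
# column layout are hypothetical; scikit-learn is an assumed dependency.
import numpy as np
from sklearn.cluster import KMeans

data = np.loadtxt("municipalities.csv", delimiter=",")  # x, y, rate
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(data)

centroids = km.cluster_centers_  # one centroid per region
labels = km.labels_              # group index of each municipality</preformat>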
          <p>Once the above results are available, it is again necessary to convert the numerical
data format into the shape format, a process similar to the one above using ArcInfo
tools, which allows viewing through the IRIS GIS.</p>
          <p>Finally, the groups of municipalities and their corresponding centroids are
passed as GIS layers to IRIS, for display on the geographic map of Mexico.
4 Results and visualization with IRIS
In this project we have carried out grouping tasks according to the affinity of location
and incidence rate of the municipalities. A series of experimental tests on the data
warehouse, for cities with more than 100,000 inhabitants, was carried out. Group
sizes of k = 5, 10, 15, 20 and 30 were considered. The best result was obtained for k =
20.</p>
          <p>As a case study, this paper presents the results obtained by the k-means algorithm
in Matlab for the cervical cancer data warehouse. Fig. 4 provides the visualization
of the 20 regions identified.</p>
          <p>From the results, we distinguish the groups led by the three
municipalities with the highest incidence rates: Atlixco, Apatzingán and Tapachula
(Chiapas). Fig. 5 shows the detail of the display of the group corresponding to the
region of Chiapas and the incidence of cervical cancer. Table 1 provides
data for this group, together with statistical measures for the mean and standard
deviation.</p>
          <p>
            The groups identified with high incidence rates, Tapachula and Apatzingán,
match municipalities identified in other studies [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ] and correspond to the
population characteristics identified in work from the medical field [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], [15],
such as poverty, lack of education, limited access to effective health
services, and the initiation of sexual activity at an early age. This allows us to
assert that the grouping is valid. On the other hand, the study allowed the
discovery of other municipalities that had not been identified in other research,
such as the group of Atlixco, which in particular shows the highest incidence rate in
the country (see Table 2).
          </p>
          <p>Table 1 Municipalities Incidence Rates of Cervical-Uterine Cancer</p>
          <p>[Table 1: incidence rates of cervical-uterine cancer for the municipalities of the group, located in Chiapas, Veracruz-Llave, Tabasco and Campeche, together with the group average and standard deviation.]</p>
          <p>In order to perform a global analysis of our results, Table 2 provides
information of the ten municipalities with the highest incidence rate in the
country.</p>
          <p>Table 2 Top Ten Municipalities Incidence Rates of Cervical-Uterine Cancer</p>
          <p>[Table 2: key, state, municipality, population and incidence rate for the ten municipalities with the highest incidence rates in the country: Atlixco (21019, Puebla), Apatzingán (16006, Michoacán), Tapachula (07089, Chiapas), Cuautla (17006, Morelos), El Mante (28021, Tamaulipas), Manzanillo (06007, Colima), Coatzacoalcos (30039, Veracruz-Llave, population 267212), Tepic (18017, Nayarit), Minatitlán (30108, Veracruz-Llave) and Orizaba (30118, Veracruz-Llave), followed by the general mean and standard deviation.]</p>
          <p>Figure 6 illustrates the location of the previous incidence rates compared to the
national average and the corresponding standard deviation.
5 Conclusions
The multidimensional model for the conceptual design of the data warehouse turned out
to be very appropriate, since this model is easily scalable and allows analysis of
the information from different perspectives. It is expected that future studies will
process other variables related to the municipalities included in this design, such
as socioeconomic status, type of region, gender and access to health services,
among others. Moreover, the implementation of the data warehouse based on the
ROLAP model has made it possible to take advantage of the facilities developed for
relational databases. In addition, it is expected that the design and implementation
carried out in the data warehouse can be used in other applications.</p>
          <p>The processing of the spatial component of our data warehouse, using the
INEGI IRIS GIS, has resulted in a high-quality visual representation of our
results, based on the actual physical location of the municipalities and on an INEGI
topographic map of the Mexican Republic. Experience and learning
have also been gained in techniques for transferring shapes (polygons, points) and
formats (number to shape) through ArcView GIS tools.</p>
          <p>Currently we are working to complete studies on other cancer types. Besides,
data mining tasks will be developed on the incidence of conditions such as
diabetes, influenza and cardiovascular diseases, among others.
Acknowledgement. R. Boone expresses her gratitude to Ms. Rocío Pérez Osorno
from INEGI, Puebla (graduate of the Faculty of Computer Science, BUAP) for
advice and support in plotting the results of this work through the IRIS GIS.
convergencia y su aplicación a bases de datos poblacionales de cáncer. 2do Taller
Latino Iberoamericano de Investigación de Operaciones, México, 2007.
14. Pérez-Ortega, J., Boone-Rojas, M.R., Somodevilla-García, M.J.: Research
issues on K-means Algorithm: An Experimental Trial Using Matlab. Advances
on Semantic Web and New Technologies, Vol. 534, http://ceur-ws.org/.
15. Rangel-Gómez, G., Lazcano-Ponce, E., Palacio-Mejía: Cáncer cervical, una
enfermedad de la pobreza: diferencias en la mortalidad por áreas urbanas y
rurales en México, http://www.insp.mx/salud/index.html.
16. Scotch, M., Parmanto, B., Monaco, V.: Evaluation of SOVAT: An
OLAP-GIS decision support system for community health assessment data analysis.
BMC Medical Informatics &amp; Decision Making, Vol. 8 (1-12), 2008.
17. Simonet, A., Landais, P., Guillon, D.: A multi-source Information System for
end-stage renal disease. Comptes Rendus Biologies, 2002, Vol. 325 I4, p. 515.
18. Thangavel, K., Jaganathan, P., Esmy, P.O.: Subgroup Discovery in Cervical
Cancer Analysis Using Data Mining Techniques. Department of Computer
Science, Periyar University; Department of Computer Science and Applications,
Gandhigram Rural Institute-Deemed University, Gandhigram; Radiation
Oncologist, Christian Fellowship Community Health Centre, Tamil Nadu, India:
AIML Journal, Vol. 6, Issue 1, January 2006.
An Approach of Crawlers for Semantic</p>
          <p>Web Application
José Manuel Pérez Ramírez1, Luis Enrique Colmenares Guillen1
1</p>
          <p>Benemérita Universidad Autónoma de Puebla,</p>
          <p>Facultad de Ciencias de la Computación,</p>
          <p>BUAP – FCC, Ciudad Universitaria,</p>
          <p>Apartado Postal J-32,</p>
          <p>Puebla, Pue. México.</p>
          <p>{ mankod, lecolme}@gmail.com
Abstract. This paper presents a proposal for a system capable of retrieval
information from the processes generated by the system Yacy. The information
retrieved will be used in the generation of a knowledge base. This knowledge
base may be used in the generation of semantic web applications.</p>
          <p>
            Keywords: Semantic Web, Crawler, Corpora, Knowledge base.
A knowledge base is a special type of database for managing knowledge. It provides
the means to collect, organize and retrieve knowledge in a computerized way. In general, a
knowledge base is not a static set of information; it is a dynamic resource that may
have the ability to learn. In the future, the Internet will be a complete and complex
knowledge base, already known as the Semantic Web [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ].
          </p>
          <p>Some examples of knowledge bases are: a public library, an information database
related to a specific subject, Whatis.com, Wikipedia.org, Google.com, Bing.com and
Recaptcha.net.</p>
          <p>
            Research related to the automatic generation of a specialized corpus from the Web
is presented in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]; this research reviews methods to process knowledge bases
that generate specialized corpora.
          </p>
          <p>In Section 2 we present work related to the Semantic Web in order to understand the
benefits that may be obtained by elaborating on it.</p>
          <p>In Section 3 we describe the challenges and explain the problems that would
arise if one tried to use Google Search for getting information or tried to retrieve
information from queries to Google.</p>
          <p>Section 4 presents the methodology used to solve the problem, and Section 5
the conclusions and ongoing work.</p>
          <p>
            We continue this paper by presenting an abstract description of query processing on
the Semantic Web [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], as follows (Fig. 1):
1. A query with a data type.
2. A server that sends queries to the decentralized indexing servers. The content
found on the servers is similar to a book index, which indicates which pages
contain the words that match the query.
3. The query travels to the servers where the documents are stored; the retrieved
documents are used to generate a description of each search result.
4. The user receives the results of the semantic search, which have already been
processed on the semantic web server.
          </p>
          <p>Fig. 1. Querying the Semantic Web.
2 Related Work
Nowadays, research related to information retrieval on the Web produces different
results, such as knowledge bases and web sites dedicated to information retrieval:
Wikipedia, Twine, Evri, Google, Vivísimo, Clusty, etc.</p>
          <p>
            An example of a company working with information retrieval is Google Inc.;
one of its products is Google Search. This web search engine is one of the
most-used search engines on the Web [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]; Google receives several hundred million queries
each day through its various services [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
          <p>This example motivates the following question: why does
Google not put the information in its knowledge base in the public domain?
The answer is very simple: because its information, its knowledge base, is
money.
In Section 3 we explain some ways of extracting information from Google Search; only a
small amount of information can be obtained, and it is impossible to retrieve enough
information from Google Search to generate a knowledge base, because Google
protects the information about its queries.</p>
          <p>
            Other kinds of knowledge bases are the following.
2.1 Wikipedia
A specific case is Wikipedia, a project to write a free community encyclopedia in
all languages. This project has 514,621 articles today. The quantity and quality of
the articles make it an excellent knowledge base for the creation of semantic webs.
We present some ways to obtain semantic information from Wikipedia: from its
structure, from the notes collected from the people who contribute, and from the
links existing in the entries.
2.2 Twine
Twine is a tool to store, organize and share information, all of it with
intelligence provided by the platform, which analyzes the semantics of the information
and classifies it automatically [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The main idea is to save users from labeling and
connecting related content and leave this work to Twine, bringing more value and
storing the contents next to the information about their meaning.
3 Challenges
The principal challenge is to develop a system with the capacity to work with YaCy
to retrieve information from its indexing process and generate information; this
information will be essential to produce the knowledge base.
          </p>
          <p>Figure 5 presents all the modules of YaCy; the module to be developed will
work with some of these modules.
The principal question is:</p>
          <p>What can we do to get information in the public domain?</p>
          <p>The answer is very simple: we use the very popular Wikipedia.</p>
          <p>Wikipedia is a project of the Wikimedia Foundation. More than 13.7 million of its
articles have been drafted in conjunction with volunteers from all over the world, and
practically every one of them may be edited by any person who has access to
Wikipedia. It is currently the most popular reference work on the Internet.</p>
          <p>A dynamic-content project like Wikipedia illustrates information that has
great potential to be exploited.</p>
          <p>On the other hand, Google Search, one of the most-used search engines, provides at least
22 special features beyond the original word-search capability. These include
synonyms, weather forecasts, time zones, stock quotes, maps, earthquake data, movie
showtimes, airports, home listings, and sports scores.</p>
          <p>And maybe you are thinking:</p>
          <p>Why do people not use Google Search to get a whole knowledge base about a
specific topic, export it to a plain text file, and then manage it to generate a
corpus?</p>
          <p>The answer is very simple: because Google's information is its own information,
and gold for the company.</p>
          <p>
            In the past, Google Inc. allowed information retrieval from any kind of query [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
Google allowed information retrieval through its own forms and methods, like the
University Research Program for Google Search [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] but any kind of answered we
get of this project when we make the inscription to this program.
          </p>
          <p>
            Another way to exploit Google Search knowledge is to use scripts, APIs [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ],
programming languages such as AWK, and development tools like SED or GREP, all of
them analyzed in [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ], but with few results, and a lot of information is needed to create a
knowledge base.
This section gives a description of the project, taking into consideration the design that
will be used to solve the problem of creating the module.
4.1 Project description
The results obtained from the module connected to YaCy will be used to create
semantic webs, corpora, and any other project that needs plain-text information
about web content (a minimal sketch of this plain-text reduction follows).
          </p>
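          <p>Since the module's output is plain text about web content, the following minimal
sketch shows one way the reduction step could look, using only the Python standard
library; the class and function names are our own illustration, not part of YaCy.</p>
          <preformat>
# Minimal sketch: reduce crawled HTML to the plain text that downstream
# corpus tools expect. Standard library only; names are illustrative.
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collect character data, skipping the content of script and style tags.
    def __init__(self):
        super().__init__()
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

# Example: print(html_to_text(open("crawled_page.html").read()))
          </preformat>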
          <p>Described below is the series of procedures that serves as the methodology to
implement the project.</p>
          <p>A) Review the modules of YaCy
B) Review the logistics and architecture of YaCy
C) Review the way in which YaCy creates its crawlers</p>
          <p>D) Design a module capable of managing the information from the
crawler and generating a knowledge base</p>
          <p>
            E) Some of the policies described above are implemented in YaCy [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]; the variant
to be used is the implementation of the JXTA [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] tool and the URI and RDF policies that
allow the results to be structured and outlined, to finally present them in a semantic way as a
knowledge base (a minimal sketch of this structuring step follows this list).
          </p>
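          <p>As a rough illustration of step E, the following sketch expresses one crawler result
as RDF triples using the rdflib Python library; the example.org namespace and the
property names are assumptions made for illustration only.</p>
          <preformat>
# Minimal sketch: expressing one crawler result as RDF triples with rdflib.
# The EX namespace and the property names are illustrative assumptions.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/crawl/")

g = Graph()
doc = URIRef("http://en.wikipedia.org/wiki/Semantic_Web")
g.add((doc, RDF.type, EX.CrawledPage))
g.add((doc, DC.title, Literal("Semantic Web")))
g.add((doc, EX.indexedBy, Literal("yacy-peer-01")))  # hypothetical peer name

print(g.serialize(format="turtle"))
          </preformat>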
          <p>
            4.2 Development platform
This work is done with YaCy, a free, distributed search engine based on
the principles of peer-to-peer (P2P) networking. Its core is a program written in Java,
called a YaCy-peer, that has been distributed across hundreds of computers since September 2006.
Each YaCy-peer is an independent crawler that navigates through the Internet and
analyzes and indexes the web pages it finds. It stores the indexing results in a
common database (called the index), which is shared with other YaCy-peers using the
principles of P2P networks [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
          </p>
          <p>Compared to semi-distributed search engines, the YaCy network has a
decentralized architecture. All YaCy-peers are equal and there is no central
server. A peer may be executed in crawling mode or as a local proxy server. Figure 2
shows a diagram that describes the distributed indexing process and the search in
the network for the YaCy crawler. (A sketch of querying a running peer follows.)</p>
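          <p>For reference, this is a minimal sketch of how an external module could ask a
running YaCy-peer for results through its JSON search interface. We assume a peer
listening on localhost:8090, and the response field names may differ between YaCy
versions.</p>
          <preformat>
# Minimal sketch: asking a local YaCy peer for results over its JSON
# search interface. Assumes a peer on localhost:8090; field names
# may differ between YaCy versions.
import json
import urllib.parse
import urllib.request

def yacy_search(terms, peer="http://localhost:8090"):
    # Query the peer's JSON search interface and yield (title, link) pairs.
    url = peer + "/yacysearch.json?" + urllib.parse.urlencode({"query": terms})
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    for channel in data.get("channels", []):   # RSS-like channel objects
        for item in channel.get("items", []):  # individual search hits
            yield item.get("title"), item.get("link")

for title, link in yacy_search("semantic web"):
    print(title, link)
          </preformat>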
          <p>Fig. 3. Distributed indexing process</p>
          <p>Figure 3 shows the main components of YaCy and the relations among the
web search, web crawler, indexing, and data-storage processes.
5 Conclusions and ongoing work
In this section we present some of the conclusions, the results expected from the project,
and the future work.</p>
          <p>1. Index all the content of Wikipedia.
2. Store this content.
3. Present the content of Wikipedia by topic on a web site.
4. Use a text tagger to share the information with tags.
5. Present the module and its code on a web site.
6. Share the knowledge base extracted from Wikipedia.</p>
          <p>References
1. Definition of knowledge base, http://searchcrm.techtarget.com/definition/knowledge-base.
2. Alarcón, R., Sierra, G., Bach, C. (2007). "Developing a Definitional Knowledge
Extraction System". In Vetulani, Z. (ed.), Proceedings of the 3rd Language &amp;
Technology Conference: Human Language Technologies as a Challenge for
Computer Science and Linguistics. Poznań, Adam Mickiewicz University, pp. 374-378.
3. Google Hacks, Second Edition, O'Reilly Media (2004).
4. Rhea, S., Godfrey, B., Karp, B., Kubiatowicz, J., Ratnasamy, S., Shenker, S.,
Stoica, I., and Yu, H. OpenDHT: a Public DHT Service and its Uses. SIGCOMM '05,
Philadelphia, Pennsylvania, USA, August 21-26 (2005).
5. http://www.jxta.org (2010).
6. http://yacy.net/ (2010).
7. http://www.twine.com/ (2010).
8. Stuckenschmidt, H. Query Processing on the Semantic Web. Vrije Universiteit Amsterdam.
9. http://www.alexa.com/siteinfo/google.com+yahoo.com+altavista.com (2009).
10. http://searchenginewatch.com/showPage.html?page=3630718 (2008).
11. http://research.google.com/university/search/ (2010).</p>
          <p>Decryption Through the Likelihood of Frequency of Letters
Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and
Martínez Cortés Marco Antonio
Benemérita Universidad Autónoma de Puebla, Computer Science
14 Sur y Av. San Claudio, Puebla, Pue., 72000 México
brinza@cs.buap.mx, fzflores@yahoo.com.mx</p>
          <p>Abstract. Decrypting information using probabilities is thorough work, because
one has to know the percentage of each of the letters of the language being analyzed,
here Spanish. One can consider not only the probabilities of single letters but also of
syllables, groups of three or four letters, and even whole words. What has to be done
then is to compare the frequencies of the ciphertext with the frequencies of the
language and begin substituting by correspondence. Finally, a text analyzer is passed
over the result to find the decrypted text.</p>
          <p>
            Keywords: Probability, Decryption.
1 Introduction
Cryptography is the science that alters the linguistic representations of a message
[
            <xref ref-type="bibr" rid="ref1">1</xref>
            ]. There are different methods for this, of which the most common is encryption.
This science masks the original references of the information through a conversion
method governed by an algorithm that also allows the reverse process, the decryption of the
information. The use of this and other techniques allows an exchange of messages
that can only be read by the intended recipients, called 'consistent' recipients. A consistent
recipient is the person to whom the sender intends to direct the message; the recipient
therefore knows the secret convention used to mask the message, and either has the
means to apply the reverse cryptographic process to the message, or can infer the
process that makes the message public. The original information to be protected is
called plaintext or cleartext. Encryption is the process of converting plaintext into
unreadable gibberish, called ciphertext or cryptogram. In general, the concrete
implementation of the encryption algorithm (also called a cipher) is based on the
existence of a key: secret information that adapts the encryption algorithm for each
different use [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>
            Decryption is the reverse process, which recovers the plaintext from the ciphertext
and the key. A cryptographic protocol specifies the details of how algorithms
and keys (and other primitive operations) are used to achieve the desired effect. The set
of protocols, encryption algorithms, key-management processes, and actions of
the users together constitutes a cryptosystem, which is what the end user
works and interacts with. In this work we must first have a ciphertext that
meets certain requirements: the encryption must be bijective, so that each element
of the domain is carried to a single element of the codomain. In addition, we must
also take into account Kerckhoffs' rules [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. (A minimal example of such a bijective cipher follows.)
          </p>
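          <p>To make the bijectivity requirement concrete, here is a minimal sketch of a
monoalphabetic substitution cipher whose key is a permutation of the alphabet, so
every ciphertext letter decrypts to exactly one plaintext letter. The key shown is
illustrative.</p>
          <preformat>
# Minimal sketch: a bijective monoalphabetic substitution over A-Z.
# Because the key maps each letter to exactly one other letter,
# decryption is just the inverse mapping. The key below is illustrative.
import string

ALPHABET = string.ascii_uppercase
KEY = "QWERTYUIOPASDFGHJKLZXCVBNM"   # a permutation of ALPHABET

ENC = str.maketrans(ALPHABET, KEY)   # plaintext letter to ciphertext letter
DEC = str.maketrans(KEY, ALPHABET)   # the inverse mapping

def encrypt(plaintext):
    return plaintext.upper().translate(ENC)

def decrypt(ciphertext):
    return ciphertext.upper().translate(DEC)

c = encrypt("EL MENSAJE SECRETO")
print(c)            # the ciphertext
print(decrypt(c))   # recovers EL MENSAJE SECRETO
          </preformat>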
          <p>2 Development work</p>
          <p>
            2.1 Frequencies in Spanish
To decrypt a text we use the odds of how frequently certain letters of the alphabet
are used; this work considers only the Spanish language [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ].
          </p>
          <p>The frequencies of Spanish used for this study were as follows.
Letter-frequency statistics may vary from one source to another depending on the
corpus the author has chosen to develop them; differences usually appear when the corpus
is literary or consists of texts of different origins. Table 1 shows the frequency of
each letter of the Spanish alphabet with its respective percentage. (A sketch of the
frequency-matching step follows the table.)</p>
          <p>Table 1. Frequencies of the letters of the Spanish alphabet.
High frequency        Medium frequency      Low frequency         Freq. below 0.5%
letter  freq.%        letter  freq.%        letter  freq.%
E       16.78         R       4.94          Y       1.54          G, F, V, W
A       11.96         U       4.80          Q       1.53
O        8.69         I       4.15          B       0.92
L        8.37         T       3.31          H       0.89
S        7.88         C       2.92                                J, Z, X, K, Ñ
N        7.01         P       2.76
D        6.87         M       2.12</p>
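          <p>A minimal sketch of the frequency-matching step: rank the ciphertext letters by
frequency and pair them with the Spanish ranking of Table 1 (abridged here). This
gives only a first guess, which is then refined with syllable and word frequencies.</p>
          <preformat>
# Minimal sketch: first-pass decryption guess by pairing the ciphertext's
# letter ranking with the Spanish ranking of Table 1 (abridged).
from collections import Counter

# Spanish letters ordered by descending frequency (Table 1, abridged).
SPANISH_RANK = "EAOLSNDRUITCPMYQBH"

def frequency_guess(ciphertext):
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    ranked = [letter for letter, _ in Counter(letters).most_common()]
    # Pair the i-th most frequent cipher letter with the i-th Spanish letter.
    mapping = dict(zip(ranked, SPANISH_RANK))
    return "".join(mapping.get(ch, ch) for ch in ciphertext.upper())

# The guess is rough on short texts; syllable and word frequencies
# (Tables 2-5) are then used to correct the remaining letters.
          </preformat>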
          <p>
            Most frequent words
The vowels make up about 46.38% of a text. The high-frequency letters account
for 67.56% of the text, and the medium-frequency letters account for 25% [
            <xref ref-type="bibr" rid="ref4">4</xref>
            ].
In the dictionary the most common vowel is A, but in written texts it is E,
because of prepositions, conjunctions, verbs, etc. The most common consonants
are L, S, N, D, with about 30%. The six least frequent letters are V, Ñ, J, Z, X and
K (just over 1%). The average length of a Spanish word is 5.9 letters. The
index of coincidence for Spanish is 0.0775. In addition, to help solve the encryption,
Table 2 lists the words most frequently used in a text of 10,000 words (a sketch of
the word-based refinement follows the table).
          </p>
          <p>Table 2. Most common words in a 10,000-word text.
Most common words     Two-letter words      Three-letter words
Word    Freq.         Freq.                 Word    Freq.
DE      778           778                   QUE     289
LA      460           460                   LOS     196
EL      339           339                   DEL     156
EN      302           302                   LAS     114
QUE     289           119                   POR     110
Y       226            98                   CON      82
A       213            74                   UNA      78
LOS     196            64                   MAS      36
DEL     156            63                   SUS      27
SE      119            47                   HAN      19
LAS     114</p>
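          <p>A minimal sketch of how the short-word table can refine a candidate letter mapping:
swaps in the mapping are kept when they make more of the frequent ciphertext tokens
decode to common Spanish words. The scoring scheme is our own illustration, not the
paper's exact method.</p>
          <preformat>
# Minimal sketch: refine a candidate letter mapping by checking whether
# the most frequent short tokens decode to common Spanish words (Table 2).
# The scoring scheme is an illustration, not the paper's exact method.
from collections import Counter

# Common short Spanish words taken from Table 2.
COMMON_WORDS = {"DE", "LA", "EL", "EN", "SE", "QUE", "LOS", "DEL", "LAS", "POR"}

def score_mapping(ciphertext, mapping):
    # Count how many frequent decoded tokens land in the common-word list.
    tokens = ciphertext.upper().split()
    decoded = ["".join(mapping.get(ch, ch) for ch in t) for t in tokens]
    frequent = [w for w, _ in Counter(decoded).most_common(20)]
    return sum(1 for w in frequent if w in COMMON_WORDS)

def refine(ciphertext, mapping, a, b):
    # Try swapping the images of cipher letters a and b; keep the better map.
    swapped = dict(mapping)
    swapped[a], swapped[b] = mapping.get(b, b), mapping.get(a, a)
    if score_mapping(ciphertext, swapped) > score_mapping(ciphertext, mapping):
        return swapped
    return mapping
          </preformat>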
          <p>Next, Table 3 shows the frequencies of the four-letter words.
The size of the corpus is 60,115 letters, and the frequencies are absolute. The
digraphs are read by row and column, in that order; Table 4 shows the
digraphs, which are formed by joining letters with letters.</p>
          <p>Most common initial letter
The letters that most frequently start a word in Spanish are listed in Table 5.</p>
          <p>3 Results
The ciphertext used, as stated, had to be bijective and to respect Kerckhoffs' rules;
the decrypted text is shown in Figure 1.</p>
          <p>Table 3. Four-letter words.
Word    Freq.
PARA    67
COMO    36
AYER    25
ESTE    23
PERO    18
ESTA    17
AÑOS    14
TODO    11
SIDO    11
SOLO    10</p>
          <p>Distribution of letters in literary texts:
E - 16.78%    R - 4.94%    Y - 1.54%    J - 0.30%
A - 11.96%    U - 4.80%    Q - 1.53%
O -  8.69%    I - 4.15%    B - 0.92%
L -  8.37%    T - 3.31%    H - 0.89%
S -  7.88%    C - 2.92%    G - 0.73%
N -  7.01%    P - 2.77%    F - 0.52%
D -  6.87%    M - 2.12%    V - 0.39%</p>
          <p>We conclude that this method of decryption is good, although it would have to be
tuned a little more, since it depends on the text at hand and on how much text there
is to decrypt; we also observed that it only decrypts bijective encryptions. In this
work, as seen in the results of Figure 1, several processes are applied: first the
probabilities of the most frequent letters in Spanish, then the most frequent
syllables in Spanish, then the most frequent words, and finally, for the information
still missing, the text analyzer. As shown in Figure 1, a large percentage of the
information is decoded, but, as mentioned above, this depends on how much
information there is to process.</p>
          <p>Table 5. Most frequent word-initial letters (absolute frequencies).
letter     P     C     D     E    S    A    L    R    M    N    T
frequency  1128  1081  1012  989  789  761  435  425  403  346  298
letter     Q    I    H    U    G    V    F    O    B    J   Y   W   Z  K
frequency  286  281  230  219  206  183  177  169  124  47  27  19  2  1</p>
          <p>References
1. Liddell and Scott's Greek-English Lexicon. Oxford University Press (1984).
2. Anaya Multimedia, Códigos y Claves Secretas: Programas en Basic, based in turn on a
lexicographic study of the newspaper "El País", Mexico (1986).
3. Friedman, William F. and Callimahos, Lambros D., Military Cryptanalytics,
Cryptographic Series (1962).
4. Part I - Volume 2, Aegean Park Press, Laguna Hills, CA (1985).
5. Barker, Wayne G., Cryptograms in Spanish, Aegean Park Press, Laguna Hills, CA.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Barrón Vivanco M. Arandine</surname>
            ,
            <given-names>Pérez O. J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Miranda</surname>
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Fátima</surname>
          </string-name>
          , Pazos R., XII Congreso de Investigación en Salud Pública, Aplicación de técnicas de minería de datos a bases de datos poblacionales de cáncer, CENIDET, México, Secretaría de Saúde do Estado de Pernambuco, Brasil, Abril (
          <year>2007</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Forgy</surname>
            <given-names>E.</given-names>
          </string-name>
          “
          <article-title>Cluster analysis of multivariate data: Efficiency vs</article-title>
          .
          <source>Interpretability of classification”</source>
          ,
          <source>Biometrics</source>
          , vol.
          <volume>21</volume>
          , pp.
          <fpage>768</fpage>
          -
          <lpage>780</lpage>
          .
          <year>1965</year>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Hernández-Orallo</surname>
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ramírez-Quintana M. J.</surname>
          </string-name>
          ,
          <string-name>
            <surname>Ferri-Ramírez</surname>
            <given-names>C.</given-names>
          </string-name>
          , Introducción a la Minería de Datos, Ed. Pearson Prentice Hall, Madrid (
          <year>2004</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Hidalgo-Martínez Ana C.</surname>
          </string-name>
          <article-title>El cáncer cérvico-uterino su impacto en México. Porqué no funciona el programa nacional de detección oportuna</article-title>
          .
          <source>Revista Biomédica</source>
          ,
          <string-name>
            <given-names>Centro</given-names>
            <surname>Nal. De Investigaciones Regionales Dr. Hideyo Noguchi</surname>
          </string-name>
          ,
          <string-name>
            <surname>UADY</surname>
          </string-name>
          ,
          <year>2006</year>
          , México.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>5. IRIS 4. http://mapserver.inegi.gob.mx. SNIEG Sistema Nacional de Información Estadística y Geográfica.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Jin</surname>
            <given-names>Chen</given-names>
          </string-name>
          , MacEachren,
          <string-name>
            <surname>Alan</surname>
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peuquet</surname>
          </string-name>
          , Donna. Constructing Overview+
          <article-title>Detail Dendogram Matrix Views</article-title>
          .
          <source>IEEE Transactions on Visualization &amp; Computer Graphics</source>
          ., Vol.
          <volume>15</volume>
          ,
          <string-name>
            <surname>Issue</surname>
            <given-names>6</given-names>
          </string-name>
          ,
          <fpage>p889</fpage>
          -
          <lpage>896</lpage>
          , Dec.
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>MacQueen</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Some Methods for Classification and Analysis of Multivariate Observations</article-title>
          .
          <source>In Proceedings Fifth Berkeley Symposium Mathematics Statistics and Probability</source>
          . Vol.
          <volume>1</volume>
          . Berkeley, CA (
          <year>1967</year>
          )
          <fpage>281</fpage>
          -
          <lpage>297</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Martínez</surname>
            <given-names>M. Francisco</given-names>
          </string-name>
          <string-name>
            <surname>Javier</surname>
          </string-name>
          .
          <article-title>Epidemiología del cáncer del cuello uterino</article-title>
          .
          <source>Medicina Universitaria</source>
          <year>2004</year>
          ,
          <fpage>39</fpage>
          -
          <lpage>46</lpage>
          . Vol.
          <volume>6</volume>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>22</year>
          ,
          <string-name>
            <surname>UANL</surname>
          </string-name>
          , México.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>NAIIS</given-names>
            <surname>Instituto Nacional de Salud Pública</surname>
          </string-name>
          ,
          <string-name>
            <surname>SCRIS</surname>
          </string-name>
          , Mortalidad, http://sigsalud.insp.mx/naais/, Cuernavaca, Morelos, México, (
          <year>2003</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Nevine M. Labib</surname>
            ,
            <given-names>Michael N.</given-names>
          </string-name>
          <article-title>Malek: Data Mining for Cancer Management in Egypt</article-title>
          .
          <source>Transactions on Engineering, Computing and Technology V8 October</source>
          <year>2005</year>
          :
          <article-title>(ISSN 1305-</article-title>
          5313).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Pérez-C. Nelson</surname>
          </string-name>
          ,
          <string-name>
            <surname>Abril-Frade D.O. Estado Actual de las Tecnologías de Bodegas de Datos Espaciales</surname>
          </string-name>
          .
          <source>Ing. E Investigación</source>
          . Vol.
          <volume>27</volume>
          , No. 1, Univ. Nal. De Colombia.
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Pérez-O. J.</surname>
            ,1,
            <given-names>R. Pazos R</given-names>
          </string-name>
          , L. Cruz R.,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Reyes S. “Improvement the Efficiency and Efficacy of the K-means Clustering Algorithm through a New Convergence Condition”</article-title>
          .
          <source>Computational Science and Its Applications - ICCSA 2007 - International Conference Proceedings</source>
          . Springer Verlag.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Pérez-O. J</surname>
            .2,
            <given-names>M.F.</given-names>
          </string-name>
          <string-name>
            <surname>Henriques</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Pazos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Cruz</surname>
            , G. Reyes,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Salinas</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Mexicano</surname>
          </string-name>
          . Mejora al Algoritmo de
          <article-title>K-means mediante un Nuevo criterio de</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>