Advances on Semantic Web
  and New Technologies
                        July, 2010




Editors:

Dra. María Josefa Somodevilla García

Dra. Darnes Vilariño Ayala

Dr. David Eduardo Pinto Avendaño
The Workshop on Semantic Web and New Technologies was held for the third time at the
Faculty of Computer Science of the Benemérita Universidad Autónoma de Puebla, Mexico,
in July 2010.


The Semantic Web provides a common framework that allows data to be shared and reused
across application, enterprise, and community boundaries. Semantic Web technologies are
beginning to play a significant role in many diverse areas, marking a turning point in the
evolution of the Web.

The goal of this workshop is to provide a forum for the Semantic Web community in
which participants can present and discuss approaches for adding semantics to the Web,
show innovative applications in this field, and identify upcoming research issues related
to the Semantic Web. To fulfill these objectives, the main workshop topics included
Semantic Search, Semantic Advertising and Marketing, Linked Data, Collaboration and
Social Networks, Foundational Topics, Semantic Web and Web 3.0, Ontologies, Semantic
Integration, Data Integration and Mashups, Unstructured Information, Semantic Query,
Semantic Rules, Developing Semantic Applications and Semantic SOA.


Davide Buscaldi and Gerardo Sierra were the invited speakers at this Third Workshop on
Semantic Web.

Davide Buscaldi is currently completing his Ph.D. in pattern recognition and artificial
intelligence at the UPV - Universidad Politécnica de Valencia (Spain), with a thesis titled
"Toponym Disambiguation in NLP Applications". His research interests are mainly focused
on question answering, word sense disambiguation and geographical information retrieval.
He obtained his DEA (Diploma de Estudios Avanzados) in 2008 with a dissertation on the
"integration of resources for QA and GIR". He is the author of over 40 papers in different
international conferences, workshops and journals. He was awarded an FPI grant by the
Valencian regional government, which allowed him to participate in the "LiveMemories"
project during a research stay at the FBK-IRST research institute in Trento, Italy, under
the direction of Bernardo Magnini. He was responsible at UPV for the organization of the
QAST (Question Answering on Speech Transcripts) track at CLEF 2009. Currently, he is a
member of the Natural Language Engineering (NLE) Lab of the Universidad Politécnica de
Valencia.

Gerardo Sierra holds a Ph.D. in Computational Linguistics from UMIST, England. He is the
coordinator of the Linguistic Engineering Group at UNAM. He has promoted this field both
in teaching and in research and development, in areas such as computational
lexicography, terminotics, information retrieval and extraction, text mining and corpus
linguistics. Currently, he is a level-A researcher, a National Researcher level II, a
CONACYT project evaluator, and a member of several scientific committees. He has taught
courses at UNAM for the Faculties of Engineering and of Philosophy and Letters, as well
as for the graduate programs in Linguistics, Biotechnology and Computer Science.
Contents

Invited Paper
Ambiguous Place Names on the Web                                                         1
Davide Buscaldi.




SV: a Visualization Mechanism for Ontologies of Records                                  8
Based on SVG Graphics
Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodríguez, and Argelia B. Urbina.




Modeling of CSCW system with Ontologies                                                 13
Mario Anzures-García, Luz A. Sánchez-Gálvez, Miguel J. Hornos, Patricia Paderewski-
Rodríguez, and Antonio Cid.




The Use of WAP Technology in Question Answering                                         24
Fernando Zacarías F., Alberto Tellez V., Marco Antonio Balderas, Guillermo De Ita L.,
and Barbara Sánchez R.




Data Warehouse Development to Identify Regions with High                                37
Rates of Cancer Incidence in México through a Spatial Data
Mining Clustering Task.
Joaquin Pérez Ortega, María del Rocío Boone Rojas, María Josefa Somodevilla García,
and Mariam Viridiana Meléndez Hernández.



An Approach of Crawlers for Semantic Web Application                                    48
(Short paper)
José Manuel Pérez Ramírez, and Luis Enrique Colmenares Guillen.


Decryption Through the Likelihood of Frequency of Letters                               57
(Short paper)
Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and Martínez
Cortés Marco Antonio.
 Ambiguous Place Names on the Web*

                                 Davide Buscaldi

           Natural Language Engineering Lab., ELiRF Research Group,
             Dpto. de Sistemas Informáticos y Computación (DSIC),
                   Universidad Politécnica de Valencia, Spain,
                            dbuscaldi@dsic.upv.es



      Abstract. Geographical information is acquiring increasing importance
      in the World Wide Web. Every day, the number of users looking for
      geographically constrained information grows. Map-based services, such
      as Google Maps or Yahoo! Maps, provide users with a graphical interface,
      visualizing results on maps. However, most of the geographical information
      contained in web documents is represented by means of toponyms, which in
      many cases are ambiguous. Therefore, it is important to properly
      disambiguate toponyms in order to improve the accuracy of web searches.
      The advent of the Semantic Web will allow this issue to be overcome by
      labelling documents with geographical IDs. In this paper we discuss the
      problems of using toponyms in web documents instead of identifying places
      with tools such as the Geonames RDF, focusing on the errors that affect
      Geooreka!, a prototype geographical web search engine currently under
      development.


1   Introduction
The interest of users for geographically constrained information in the Web has
increased over the past years, boosted by the availability of services such as
Google Maps1 . Sanderson and Kohler [1] showed that 18.6% of the queries sub-
mitted to the Excite search engine contained at least a geographic term, while
Gan et al. [2] estimated that 12.94% of queries submitted to the AOL search
engine expressed a geographically constrained information need. Most of the
geographical information contained in the Web and in unstructured text is composed
of toponyms, or place names. There are two main problems that derive from
using toponyms to represent geographical information. The first is the polysemy
of toponyms, or toponym ambiguity: a toponym may be used to represent
more than one place; for example, "Puebla" may indicate the city at
19°3′N, 98°12′W, the state in which it is contained, a suburb of Mexicali in
the state of Baja California, or three more small towns in Mexico. The second
problem is that the mere inclusion of a toponym in a document does not always
mean that the document is geographically relevant with respect to the region or
*
  We would like to thank the TIN2009-13391-C04-03 research project for partially
  supporting this work.
1
  http://maps.google.com




area represented by the toponym. In the first case, the solution is constituted
by the Toponym Disambiguation (TD) task, also named toponym grounding
or resolution; in the second case, the solution is to carry out Geographic Scope
Resolution, which is also affected by the problem of toponym ambiguity [3].
    The Geonames ontology2 provides users with RDF descriptions of more than
6 million places. The use of this ontology would allow geospatial semantic
information to be included in the Web, eliminating the need for toponym
disambiguation. Unfortunately, as noted by [4], in the Web "references to
geographical locations remain unstructured and typically implicit in nature",
determining a "lack of explicit spatial knowledge within the Web" which "makes
it difficult to service user needs for location-specific information". In this
paper, with the help of the Geooreka!3 system [5], a prototype web search engine
developed at the Universidad Politécnica de Valencia in Spain, we discuss the
problems that users interested in geographically constrained information may
encounter because of the ambiguity of toponyms on the Web.


2   Geooreka!: a Geographical Web Search Engine

Geooreka! is a search engine developed on the basis of our experience at Geo-
CLEF4 [6,7], which suggested to us that term-based queries may not be the
optimal way to express a geographically constrained information need.
For instance, it is common for users to employ vernacular names that have vague
spatial extent and which do not correspond to the official administrative place
name terminology. Another issue is the use of vague geographical constraints that
are difficult to automatically translate from the natural language to a precise
query. For instance, the query “Cultivos de tabaco al este de Puebla” (“Tobacco
plantations East of Puebla”) presents a double problem because of the ambigu-
ity of the place name and the fact that the geographical constraint “East of” is
vague (for instance, it does not specify if the search should be constrained within
Mexico or extend to other countries).
    These issues are addressed in Geooreka! by allowing users to specify their
geographical information needs through a map-based interface. The user writes a
natural language query to represent the query theme (e.g., "Cultivos
de tabaco") and selects a rectangular bounding box on a map (Figure 1), representing
the query's geographical footprint. All toponyms in the box are retrieved using a
PostGIS database, and then the Web is queried in order to check the maximum
Mutual Information (MI) between the thematic part of the query and all the
places retrieved. The complete architecture of the system can be observed in
Figure 2. Web counts and MI are used to determine which theme-toponym
combinations are most relevant with respect to the information need expressed
by the user (Selection of Relevant Queries). In order to speed up the process,
2
  http://www.geonames.org/ontology/
3
  http://www.geooreka.eu
4
  http://ir.shef.ac.uk/geoclef/




                          Fig. 1. Main page of Geooreka!


web counts are calculated using the static Google 1T Web database5 , indexed
using the jWeb1T interface [8], whereas Yahoo! Search is used to retrieve the
results of the queries composed by the combination of a theme and a toponym.


2.1    Model of Theme-Place Relevance

The key issue in the selection of relevant queries is to obtain a relevance
model able to select the theme-toponym pairs that are most likely to
satisfy the user’s information need. Following probability theory, we
consider the two component parts of a query, a theme T and a place G,
independent if their conditional probabilities satisfy p(T |G) = p(T ) and
p(G|T ) = p(G) or, equivalently, if their joint probability is the product
of their individual probabilities:

                              p̂(T ∩ G) = p(G)p(T )                            (1)

    If probabilities are calculated using page counts, that is, as the number of
pages in which the term (or phrase) representing the theme or toponym appears,
divided by Fmax = 2,147,436,244 (the maximum term frequency contained in the
Google Web 1T database), then p̂(T ∩ G) is the expected probability of
co-occurrence of T and G in the same web page. Clearly, this is a rough
estimate, since the mere inclusion of G in a page where T is mentioned does not
guarantee a semantic relation between G and T .
    Considering this model for the independence of theme and place, we can
measure the divergence of the expected probability p̂(T ∩ G) from the observed
probability p(T ∩ G): the more the divergence, the more informative is the result
5
    http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13




                         Fig. 2. Architecture of Geooreka!


of the query. The Kullback-Leibler measure [9] is commonly used in order to
determine the divergence of two probability distributions.
                                                             p(T ∩ G)
              DKL (p(T ∩ G)||p̂(T ∩ G)) = p(T ∩ G) log                        (2)
                                                             p(T )p(G)
This formula is exactly one of the formulations of the Mutual Information (MI)
of T and G, usually denoted I(T ; G).
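
As a sketch of Equations (1) and (2), the score can be computed directly from page counts; the counts in the example below are invented for illustration, not real Google Web 1T frequencies.

```python
# Theme-place relevance score from Eqs. (1)-(2): the KL divergence of the
# observed joint probability from the independence estimate p̂(T ∩ G) = p(T)p(G),
# which equals the (pointwise) mutual information I(T; G).
import math

F_MAX = 2_147_436_244  # maximum term frequency in the Google Web 1T corpus

def mutual_information(count_t: int, count_g: int, count_tg: int) -> float:
    """MI between a theme T and a toponym G, estimated from page counts."""
    p_t = count_t / F_MAX    # p(T)
    p_g = count_g / F_MAX    # p(G)
    p_tg = count_tg / F_MAX  # observed p(T ∩ G)
    if p_tg == 0.0:
        return 0.0           # no co-occurrence: no evidence of a relation
    return p_tg * math.log(p_tg / (p_t * p_g))

# Invented example counts: the more the joint count exceeds what independence
# predicts, the higher the score.
score = mutual_information(count_t=500_000, count_g=2_000_000, count_tg=40_000)
```

A pair whose joint count matches the independence prediction scores zero, so ranking by this value surfaces the theme-toponym combinations that co-occur more often than chance.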


3   Evaluation
Geooreka! has been evaluated over the GeoCLEF 2005 test set, in order to com-
pare the results that could be obtained by specifying the geographic footprint by
means of keywords and those that could be obtained using a map-based interface
to define the geographic footprint of the query. In this setup, only the topic
title was used as input for the thematic part of Geooreka!, while the area
corresponding to the geographic scope of the topic was manually selected. Probabilities
were calculated using the number of occurrences in the GeoCLEF collection.
Occurrences for toponyms were calculated by taking into account only the geo
index. The results were calculated over the 25 topics of GeoCLEF-2005, minus
the queries in which the geographic footprint was composed of disjoint areas (for
instance, “Europe” and “USA” or “California” and “Australia”), which could
not be processed by Geooreka!. Mean Reciprocal Rank (MRR) was used as a
measure of accuracy. The GIR system GeoWorSE, where queries are specified
by text, was used as a baseline [10]. Table 1 displays the obtained results.




Table 1. MRR obtained with Geooreka!, using GeoCLEF or the WWW as target
collection, compared to the MRR obtained using the GeoWorSE system, Topic Only
runs.

                                        Geooreka! Geooreka!
              topic   GeoWorSE (GeoCLEF collection)  (Web)
              GC-002      0.250               1.000   0.083
              GC-003      0.013               1.000   1.000
              GC-005      1.000               1.000   0.000
              GC-006      0.143               0.000   0.500
              GC-007      1.000               1.000   0.125
              GC-008      0.143               1.000   0.000
              GC-009      1.000               1.000   0.067
              GC-010      1.000               0.333   0.250
              GC-012      0.500               1.000   0.000
              GC-013      1.000               0.000   0.000
              GC-014      1.000               0.500   0.091
              GC-015      1.000               1.000   1.000
              GC-016      0.000               0.000   1.000
              GC-017      1.000               1.000   0.143
              GC-018      1.000               0.333   0.500
              GC-019      0.200               1.000   0.045
              GC-020      0.500               1.000   0.090
              GC-021      1.000               1.000   0.000
              GC-022      0.333               1.000   0.076
              GC-023      0.019               0.200   0.125
              GC-024      0.250               1.000   1.000
              GC-025      0.500               0.000   0.000
              average     0.584               0.698   0.280




    The results show that the web-based results are considerably worse than those
obtained on the static collection. This is due primarily to two reasons: in the first
place, because topics were tailored on the GeoCLEF collection. Therefore, some
topics refer explicitly to events that are particularly relevant in the collection
and are easier to retrieve. For instance, query GC-005 “Japanese Rice Imports”
targets documents regarding the opening of the Japanese rice market for the first
time to other countries; “Japan” and “Rice” in the document collection appear
together only in such documents, therefore it is easier to retrieve the relevant
documents when searching the GeoCLEF collection.
    The second factor affecting the results of the Web-based system is the ambiguity
of toponyms, which prevents the probabilities for places from being correctly
estimated. For instance, in the results obtained for topic GC-008 ("Milk
Consumption in Europe"), the MI obtained for "Turkey" was abnormally high with
respect to the expected value for this country. The reason is that in most
documents the name "turkey" referred to the animal and not to the country.
This kind of ambiguity is one of the most important issues when estimating
the probability of occurrence of places. Ambiguity (or, better, the polysemy
of toponyms) grows together with the size and the scope of the collection
being searched. The GeoCLEF collection was semantically tagged using WordNet
and Geonames IDs to identify the places referenced by toponyms, whereas Web
content is rarely tagged with precise IDs, which increases the chance of error
when estimating probabilities for places that share the same name.
    There are three kinds of toponym ambiguity that can be recognised (after the
two main types identified by [11]):
 – Geo / Non-Geo ambiguity: in this case, a toponym is ambiguous with respect
   to another class of name (such as “Turkey” which may be the animal or the
   country);
 – Geo / Geo ambiguity of different class: for instance, “Puebla” the city or the
   state;
 – Same class Geo / Geo ambiguity.
The solution in all cases would be to use an ontology to precisely identify places
in documents; the only difference is the amount of information that the ontology
should include. For the first type of ambiguity, the only information needed is
whether the name represents a place or not. In the second case, we also need to
know the class of the place. Finally, for same-class Geo / Geo ambiguity, we may
differentiate places using their coordinates, by knowing the containing entity,
or both. The Geonames ontology contains all this information and represents
the best option for geographically tagging place names.
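
The three cases above can be illustrated with a toy lookup. The tiny gazetteer below is invented for illustration and only mimics the fields a Geonames-style ontology provides (feature class, coordinates, containing entity); it is not the Geonames API.

```python
# Minimal sketch of ontology-backed toponym resolution. Returning None covers
# Geo / Non-Geo ambiguity (the name is not a place); the class filter handles
# Geo / Geo ambiguity across classes; coordinates and the parent entity would
# distinguish same-class senses.
from typing import Optional

GAZETTEER = {
    "turkey": [
        {"class": "country", "lat": 39.0, "lon": 35.0, "parent": "Asia/Europe"},
    ],
    "puebla": [
        {"class": "city", "lat": 19.05, "lon": -98.2, "parent": "Puebla (state)"},
        {"class": "state", "lat": 19.0, "lon": -97.9, "parent": "Mexico"},
        {"class": "suburb", "lat": 32.6, "lon": -115.4, "parent": "Mexicali"},
    ],
}

def resolve(name: str, preferred_class: Optional[str] = None) -> Optional[dict]:
    """Return one gazetteer sense of a name, optionally filtered by class."""
    senses = GAZETTEER.get(name.lower())
    if not senses:
        return None  # not a place at all (Geo / Non-Geo case)
    if preferred_class:
        senses = [s for s in senses if s["class"] == preferred_class] or senses
    return senses[0]
```

A real system would of course rank the candidate senses using context rather than take the first entry.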


4   Conclusions
The results obtained with Geooreka! over a static, semantically-labelled (at least
from a geographical viewpoint) collection compared to the results obtained in




the Web showed that imprecise identification of places is a problem for search
engines aimed at users interested in searching for geographically constrained
information. The use of precise semantic tagging schemes for toponyms, such as
the Geonames RDF, would allow these search engines to produce more reliable
results. Spreading the use of geographical tagging in the Semantic Web would
also allow users to mine information under geographical constraints in a more
effective way. In this sense, we would like to encourage the use of Geonames
in order to produce accurately geographically tagged Web content.


References
 1. Sanderson, M., Kohler, J.: Analyzing geographic queries. In: Proceedings of Work-
    shop on Geographic Information Retrieval (GIR04). (2004)
 2. Gan, Q., Attenberg, J., Markowetz, A., Suel, T.: Analysis of geographic queries
    in a search engine log. In: LOCWEB ’08: Proceedings of the first international
    workshop on Location and the web, New York, NY, USA, ACM (2008) 49–56
 3. Andogah, G.: Geographically Constrained Information Retrieval. PhD thesis,
    University of Groningen (2010)
 4. Boll, S., Jones, C., Kansa, E., Kishor, P., Naaman, M., Purves, R., Scharl, A.,
    Wilde, E.: Location and the web (locweb 2008). In: Proceeding of the 17th inter-
    national conference on World Wide Web. WWW ’08, New York, NY, USA, ACM
    (2008) 1261–1262
 5. Buscaldi, D., Rosso, P.: Geooreka: Enhancing Web Searches with Geographical
    Information. In: Proc. Italian Symposium on Advanced Database Systems SEBD-
    2009, Camogli, Italy (2009) 205–212
 6. Buscaldi, D., Rosso, P., Sanchis, E.: Using the WordNet Ontology in the GeoCLEF
    Geographical Information Retrieval Task. In Peters, C., Gey, F.C., Gonzalo, J.,
    Müller, H., Jones, G.J., Kluck, M., Magnini, B., de Rijke, M., Giampiccolo, D., eds.:
    Accessing Multilingual Information Repositories. Volume 4022 of Lecture Notes in
    Computer Science. Springer, Berlin (2006) 939–946
 7. Buscaldi, D., Rosso, P.: On the relative importance of toponyms in geoclef. In:
    Advances in Multilingual and Multimodal Information Retrieval, 8th Workshop of
    the Cross-Language Evaluation Forum, CLEF 2007, Budapest, Hungary, Septem-
    ber 19-21, 2007, Revised Selected Papers, Springer (2007) 815–822
 8. Giuliano, C.:       jWeb1T: a library for searching the Web 1T 5-gram
    corpus. (2007) Software available at http://tcc.itc.it/research/textec/tools-
    resources/jweb1t.html.
 9. Kullback, S., Leibler, R.A.: On Information and Sufficiency. Annals of Mathemat-
    ical Statistics 22(1) (1951) pp. 79–86
10. Buscaldi, D., Rosso, P.: Using GeoWordNet for Geographical Information Re-
    trieval. In: Evaluating Systems for Multilingual and Multimodal Information Ac-
    cess, 9th Workshop of the Cross-Language Evaluation Forum, CLEF 2008, Aarhus,
    Denmark, September 17-19, 2008, Revised Selected Papers. (2009) 863–866
11. Amitay, E., Harel, N., Sivan, R., Soffer, A.: Web-a-where: Geotagging web con-
    tent. In: Proceedings of the 27th Annual International ACM SIGIR Conference on
    Research and Development in Information Retrieval, Sheffield, UK (2004) 273–280




 SV: a visualization mechanism for ontologies of
         records based on SVG graphics

    Ma. Auxilio Medina, Miriam Cruz, Rebeca Rodrı́guez, Argelia B. Urbina

                        Universidad Politécnica de Puebla
                       Tercer Carril del Ejido Serrano S/N
                         Juan C. Bonilla, Puebla, México
           {mmedina, mcruz, rrodriguez, aurbina}@uppuebla.edu.mx,
            WWW home page: http://informatica.uppuebla.edu.mx/
                       ~mmedina, ~rrodriguez, ~aurbina




       Abstract. This paper describes SV, a visualization mechanism used to
       explore digital collections represented as hierarchical structures called
       ontologies of records. These ontologies are XML files constructed using
       OAI-PMH records and a clustering algorithm. SV is composed of a web
       interface and SVG graphics. Through the interface, users can see the
       organization of the collection and access the metadata of documents.



1    Introduction

Digital libraries gather valuable information. Organizations such as the Open
Archives Initiative (OAI1 ) have proposed different alternatives to share data. The
Protocol for Metadata Harvesting (OAI-PMH protocol), for example, supports
interoperability between federated digital libraries. Documents are described in
metadata records. Dublin Core Metadata (DC2 ) is the default metadata format
for this protocol.
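
As a sketch of what harvesting yields, the Dublin Core fields of an OAI-PMH record can be read with a standard XML parser. The record below is hand-written for illustration, not fetched from a live repository.

```python
# Extract Dublin Core elements from an OAI-PMH record fragment.
import xml.etree.ElementTree as ET

DC = "{http://purl.org/dc/elements/1.1/}"  # Dublin Core namespace

SAMPLE_RECORD = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>A sample technical report</dc:title>
  <dc:subject>semantic web</dc:subject>
  <dc:description>Example description.</dc:description>
</record>
"""

def dc_fields(xml_text: str) -> dict:
    """Map Dublin Core element names (title, subject, ...) to their text."""
    root = ET.fromstring(xml_text)
    out = {}
    for child in root:
        if child.tag.startswith(DC):
            out[child.tag[len(DC):]] = child.text
    return out

fields = dc_fields(SAMPLE_RECORD)
```

The dc:title, dc:subject and dc:description values extracted this way are exactly the content information SV later organizes (see Section 4).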
    The services and collections of digital libraries are enriched in the Semantic
Web. The use of XML, the Resource Description Framework (RDF), OWL, conceptual
maps and other metadata technologies is aimed at improving search tasks [1].
Semantic Digital Libraries (SDLs) are systems built upon digital libraries and
social networking technologies (Web 2.0) [2]. Freely distributed software exists
to construct SDLs, such as Greenstone3 or JeromeDL4 . In this type of software,
ontologies play a key role: they are explicit specifications of shared
conceptualizations [3]. Ontologies enable the representation of knowledge
that software and human agents can understand and use.
    This paper proposes the use of ontologies called “ontologies of records” that
are represented as XML documents as the basis of a visualization mechanism
1
  http://www.openarchives.org/
2
  http://dublincore.org
3
  http://www.greenstone.org/
4
  http://www.jeromedl.org/




called semantic view (SV); the name also matches the first two initials of
"Scalable Vector Graphics" (SVG). SV offers an interactive view that allows
users to explore the content of a federated collection.
    The paper is organized as follows. Section 2 describes the features of an
ontology of records. Section 3 reviews related work. Sections 4 and 5 explain
the design and implementation of SV, respectively. Experimental results are
described in Section 6. Finally, Section 7 presents conclusions and suggests
future directions of our work.


2   What is an ontology of records?

An ontology of records is a hierarchical structure of clusters of OAI-PMH records
that provides an unambiguous interpretation of its elements. Its construction
is based on the Frequent Itemset Hierarchical Clustering algorithm [8]. This
structure organizes a collection of documents and contains concept-term
relationships useful for keyword-based searches. An ontology of records is
stored as a well-formed XML file that is validated against an XML Schema. An
ontology of records has the following features [9]:

 1. Documents are clustered by similarity.
 2. Clusters at the k-th level have labels of k terms.
 3. All the records in a cluster share the terms of its label.
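
Feature 2 can be checked mechanically on the XML file. The element and attribute names in the fragment below are invented for illustration, since the paper does not show the actual XML Schema of an ontology of records.

```python
# Sketch of validating the k-level labelling rule (feature 2 above) on a
# hypothetical ontology-of-records file.
import xml.etree.ElementTree as ET

SAMPLE_ONTOLOGY = """
<ontology>
  <cluster level="1" label="web">
    <cluster level="2" label="web semantic">
      <record id="oai:example:1"/>
    </cluster>
  </cluster>
</ontology>
"""

def labels_match_level(xml_text: str) -> bool:
    """True iff every cluster at level k carries a label of exactly k terms."""
    root = ET.fromstring(xml_text)
    return all(
        len(c.get("label").split()) == int(c.get("level"))
        for c in root.iter("cluster")
    )
```

A check of this kind would complement the XML Schema validation mentioned above, which cannot express the dependency between a cluster's depth and the length of its label.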


3   Related work

This section describes some systems that have been used to visualize collections
of documents. Proat et al. [4] use 3D trees to visualize documents organized
according to the Library of Congress Classification (LCC). Documents are clus-
tered in seven subsets. The interface has controls to rotate or zoom the nodes of
trees. The leaf nodes contain metadata of documents.
    Geroimenko et al. [5] have proposed the Generalized Document Object Model
tree Interface (G-DOM-Tree interface) to visualize metadata from XML DOM
(Document Object Model) documents. The model displays a hierarchy of labels,
very similar to the visualization that browsers offer for an XML Schema. The
interface is implemented as a Java applet or a Flash movie.
    Fluit et al. [6] describe Spectacle, a mechanism that uses lightweight
ontologies to represent classes of similar objects and their relationships.
Navigation can be done through hypertext or through "cluster maps". A cluster
map visualizes the objects and their classes.
    Finally, Sánchez et al. [7] use a star field grid to visualize documents from
several collections. Documents are stored as OAI-PMH records. The axes of the
grid represent attributes of the collections that can be chosen by users. Small
polygons indicate the type of document, and different colors are used to
distinguish the collections.




4     Design of SV
The design of SV aims to achieve the following objectives:
 – Construct a visualization mechanism with semantic features that allows users
   to explore a collection of documents
 – Represent the organization of a collection of documents
 – Retrieve the metadata and the content of a given document
    In order to reach these objectives, we have used the levels of knowledge
proposed by [2] in the design of SV. We use CORTUPP as a test bed; this is a
collection represented as an ontology of records5 .

 1. Level 1: Organization of the metadata. Metadata is organized in the
    ontology of records. Content information is stored in the dc:title, dc:subject
    and dc:description elements.
 2. Level 2: Organization of the information in the documents. Tech-
    nical reports have a common structure formed by six mandatory chapters:
    1) research purpose, 2) state of the art, 3) research design, 4) implementation,
    5) results and 6) conclusions. This structure is defined in a LaTeX template.
    The BibTeX file format is used to manage the bibliography. A technical re-
    port is described as a @techreport entry.
 3. Level 3: Organization of the information in databases. The technical
    reports are stored as PDF files in a database that also includes user data and
    accounts. Documents are accessible through a web interface.
 4. Level 4: Organization of the topics treated in the documents. The
    dc:subject element stores the topic of a document. The keywords of this ele-
    ment belong to the labels of the clusters in the ontology of records.
 5. Level 5: Organization of the concepts, terms and relations. This
    level is also represented in the ontology of records.


5     Implementation of SV
SV is formed by a web interface and SVG graphics6 . SVG is a format developed
and maintained by the W3C SVG Working Group. It is an XML application used to
describe animated or static two-dimensional vector graphics. The main feature
of these graphics is scalability.
    SV uses Xerces, a Java parser used to extract data from an ontology of
records. The classes of SV are written in Java. In the interface, each document,
that is, an OAI-PMH record, is represented by a yellow star on a blue gradient
background. The background is divided into five parts that correspond to the
first levels of the ontology. These levels are separated by lines that form
angles of 90 degrees. The distribution of the lines tries to reflect an estimate
of the number of documents that can be found at each level. The documents closer to
5
    CORTUPP is available at http://server3.uppuebla.edu.mx/cortupp/
6
    http://www.w3.org/svg/




the upper left corner belong to the first level of the ontology; these documents
share one term. The second level shows the documents that share two terms, and
so on. The stars have different sizes according to their level: they are bigger at
the first level and smaller at the last one.
    The interface of SV is an SVG graphic of 502 by 502 pixels. XML Parser is the
Java application used to construct the XML document that contains the inter-
face. XLink is used to create hyperlinks between documents and their metadata.
By clicking on a star, users can view the metadata in the right panel. Fig-
ure 1 shows the SV interface where only six documents at the second and third
levels were included; however, SV is designed to support up to 500 documents. The
colors can be modified without recompilation because they are stored in a
text file. The mechanism is accessible at http://informatica.uppuebla.edu.mx/
visualizacionPI/index.html.
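
The star markers described above can be sketched as generated SVG markup. The coordinates, sizes and colors below are invented placeholders, not the actual SV layout; plain circles and a solid blue fill stand in for the star glyphs and the gradient background.

```python
# Emit a minimal SVG fragment with one marker per document; deeper ontology
# levels get smaller markers, mirroring the sizing rule described in the text.
def star_svg(docs):
    """docs: list of (x, y, level) tuples; returns an SVG document string."""
    parts = [
        '<svg xmlns="http://www.w3.org/2000/svg" width="502" height="502">',
        '<rect width="502" height="502" fill="#1a3a6b"/>',  # stand-in background
    ]
    for x, y, level in docs:
        r = max(12 - 2 * level, 4)  # marker size shrinks with ontology level
        parts.append(f'<circle cx="{x}" cy="{y}" r="{r}" fill="yellow"/>')
    parts.append("</svg>")
    return "\n".join(parts)

svg = star_svg([(40, 40, 1), (200, 220, 2)])
```

In the real implementation each marker would additionally carry an XLink hyperlink to the document's metadata, as the paper describes.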




                     Fig. 1. Using SV to visualize CORTUPP




6   Experimental results

Different configurations of ontologies of records were constructed in order to
test SV; that is, unit tests and integration tests were performed successfully.
After the installation of the SVG Plugin version 1.7, SV was visualized
successfully using Internet Explorer 8, Google Chrome 7.0.517.41 and Opera 10.6.
However, there were some problems with Firefox 1.5, Firefox 3.6 and Firefox
Beta, because these versions do not support the animation features of SVG
graphics.




7    Conclusions

We have described SV, a visualization mechanism for federated collections based
on ontologies. SV has semantic features represented in the interface, such as the
location of documents in the ontology and the similarity between documents.
Additional semantic information is stored in the metadata attached to each
document and in the ontology of records. Through the SV interface, users can
access the metadata or download a document.
    CORTUPP was used as a test bed for SV; however, any collection of OAI-
PMH records represented as an ontology of records can be visualized. Although
the size of an ontology of records can impact the visualization of SV, its design is
flexible enough to support distinct collections. As future work, we plan to expand
SV to show the clusters and their labels. Then, we would like to incorporate
tagging and recommendation mechanisms.


References
1. Geroimenko, V., Chen, C. (eds.): Visualizing the Semantic Web: XML-based
   Internet and Information Visualization. Springer, Wokingham, England (2003)
2. Kruk, S.R., McDaniel, B.: Semantic Digital Libraries. Springer-Verlag, Berlin,
   Heidelberg (2009)
3. Gruber, T.: A translation approach to portable ontology specification. Knowledge
   Acquisition 5(2) (1993) 199–220
4. Proal, C.: Sistema UVA: interfaces para visualización de grandes colecciones digitales.
   Master's thesis, Universidad de las Américas Puebla, Santa Catarina Mártir S/N,
   San Andrés Cholula, Puebla, México (2002)
5. Geroimenko, V., Geroimenko, L.: Interactive interfaces for mapping e-commerce
   ontologies. In: Visualizing the Semantic Web: XML-based Internet and Information
   Visualization. Springer, Wokingham, England (2003)
6. Fluit, C., Sabou, M., van Harmelen, F.: Ontology-based information visualization.
   In: Visualizing the Semantic Web: XML-based Internet and Information
   Visualization. Springer, Wokingham, England (2002)
7. Sánchez, J.A., Quintana, M.G., R.A.: Star-fish: Starfields + fisheye visualization and
   its application to federated digital libraries. In: Proceedings of the 3rd Latin American
   Conference on Human-Computer Interaction (CLIHC 2007) (2007)
8. Fung, B., Wang, K., Ester, M.: Hierarchical document clustering using frequent
   itemsets. In: Proceedings of the Third SIAM International Conference on Data
   Mining (SDM'03), San Francisco, CA, USA, SIAM (2003) 59–70
9. Medina, M.A., Sánchez, J.A.: OntOAIr: A method to construct lightweight ontologies
   from document collections. In: Mexican International Conference on Computer Science
   (2008) 115–125




              Modeling of CSCW system with Ontologies

          Mario Anzures-García1, 2, Luz A. Sánchez-Gálvez1,2, Miguel J. Hornos2,
                  Patricia Paderewski-Rodríguez2, and Antonio Cid1
  1
    Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla,
    14 sur y avenida San Claudio, Ciudad Universitaria, San Manuel, 72570 Puebla, Mexico
                                {anzures, luzsg}@correo.ugr.es
  2
    Departamento de Lenguajes y Sistemas Informáticos, E.T.S.I. Informática y de
    Telecomunicación, Universidad de Granada, C/ Periodista Saucedo Aranda, s/n,
    18071 Granada, Spain
                                 {mhornos, patricia}@ugr.es



        Abstract. In recent years, there has been a growing interest in the development
        and use of domain ontologies, strongly motivated by the Semantic Web
        initiative. However, the application of ontologies in the CSCW domain has
        been scarce. Therefore, this paper presents a novel architectural model for
        CSCW systems described by means of an ontology. This ontology defines the
        fundamental organization of a CSCW system, represented in its concepts,
        relations, axioms and instances.

        Keywords: Ontology, Groupware Application, SOA, Architectural Model,
        Services.




1 Introduction

In the last two decades, the enormous growth of the Internet and the Web has given
rise to an intercreativity cyberspace, in which groups of people can communicate,
collaborate and coordinate to carry out common tasks. Therefore, a great number of
groupware applications have been developed using different approaches, including
object-oriented, component-oriented, and agent-oriented ones. However, the
development of this kind of application is very complex, because different elements
and aspects must be taken into account. Hence, these applications must be
simultaneously supported by models, methodologies, architectures and platforms to
be developed in keeping with current needs. In the groupware domain, one of the
most used models is the Unified Modelling Language (UML) [1], although it has
no element to represent constraints, which are very important in applications as
complex as groupware ones.
   There has recently been an increase in the use of ontologies to model applications
in many domains. An ontology serves as a resource for organizing and representing
knowledge through an abstract model. This representation model provides a common
vocabulary for a domain and defines the meaning of the terms and
the relations amongst them. In the domain of groupware applications, the ontology




provides a well-defined common and shared vocabulary, which supplies a set of
concepts, relations and axioms to describe this domain in a formal way.
   In this paper, two ontologies for the groupware domain are proposed. The first
ontology determines who authorizes the registration of users, how interaction is carried
out among them, and how the turns for users' participation are defined, among other
aspects. Moreover, it supports modifications at runtime, such as changing
a user's role, the rights/obligations of a role, the current policy, etc. The second
ontology establishes the SOA-based services necessary to develop groupware
applications in accordance with the existing literature on the development of this
type of application. In addition, these services are clustered into modules and
layers according to the concern that they represent.
   This paper is organized as follows. Section 2 gives a brief introduction to
ontologies. Section 3 describes the ontology-based modeling of the group
organizational structure. Section 4 presents an ontological model, which allows us to
specify an architectural model for the development of groupware applications.
Finally, Section 5 outlines some conclusions and future work.


2 Introduction to the Ontologies

There are several definitions of ontology, with different connotations
depending on the specific domain. In this paper, we will refer to Gruber's well-known
definition [2], where an ontology is an explicit specification of a conceptualization.
For Gruber, a conceptualization is an abstract and simplified view of the world that
we wish to represent for some purpose, made up of the objects, concepts, and other
entities that are presumed to exist in some area of interest, and the relationships that
hold among them. Furthermore, an explicit specification means that concepts and
relations need to be couched by means of explicit names and definitions.
   Jasper and Uschold [3] identify four main categories of ontology applications: 1)
neutral authoring, 2) ontology-based specification, 3) common access to information,
and 4) ontology-based search. In the work presented here, the main idea is to use
ontologies to specify the modeling of both the group organizational structure and the
architectural model in the groupware domain, since an ontology is a high-level formal
specification of a certain knowledge domain, which provides a simplified and well-
defined view of such a domain.
   An ontology is specified using the following components:
•    Classes: There is a set of classes, which represent the concepts that belong to the
     ontology. Each class may contain individuals (or instances), other classes or a
     combination of both, with their corresponding attributes.
•    Relations: These define interactions between two or more classes (object
     properties) or between a concept and a data type (data type properties).
•    Axioms: These are used to impose constraints on the values of classes or
     instances. Axioms represent expressions (logical statements) in the ontology and
     are always true inside the ontology.
•    Instances: These represent the objects, elements or individuals of an ontology.




   These four components will be described for the two ontologies proposed in this
paper.
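The four components above can be illustrated with a minimal in-memory sketch. All class, relation, instance and axiom names below are illustrative assumptions chosen only to mirror the structure of the definitions, not content taken from the proposed ontologies.

```python
# Minimal sketch of the four ontology components (classes, relations,
# axioms, instances). All names are illustrative assumptions.
classes = {"Session", "Group", "User"}

# Relations: object properties between classes, and a data-type property.
object_properties = {"works": ("Group", "Session")}      # (domain, range)
datatype_properties = {"sessionName": ("Session", str)}

# Instances: the objects of the ontology, each typed by a class.
instances = {"G1": "Group", "SCM": "Session"}

# Assertions of the "works" relation between instances.
works = {("G1", "SCM")}

# Axiom (a constraint that must always hold in the ontology):
# every group works in at least one session.
def axiom_group_works(works_pairs, instances):
    groups = {i for i, c in instances.items() if c == "Group"}
    return all(any(g == s for s, _ in works_pairs) for g in groups)

print(axiom_group_works(works, instances))  # True: G1 works in SCM
```

The point of the sketch is the separation of roles: classes give the vocabulary, relations give typed links, instances populate the classes, and axioms are checks that must evaluate to true for any valid population.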
   In addition, ontologies require a formal logical language to be expressed. In
Artificial Intelligence, different families of languages have been developed: those based
on First-Order Logic (which provide powerful primitives for modeling), those based on
Frames (with more expressive power but less inference capacity), and those based on
Description Logics (which are more robust in their reasoning power).
   OWL (Web Ontology Language) [4] is a language based on Description Logics for
defining and instantiating Web ontologies, built on XML (eXtensible Markup
Language) [5] and RDF (Resource Description Framework) [6]. OWL can be used to
explicitly represent the meaning of terms in vocabularies and the relationships among
those terms. This language makes it possible to infer new knowledge from a
conceptualization, by using a specific piece of software called a reasoner. We have
used the tool Protégé [7], which supports OWL, to define the ontology for the group
organizational structure.
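A toy example can make the reasoner idea concrete: from explicitly asserted facts, new facts are derived mechanically. The sketch below closes subclass assertions under transitivity in plain Python; the class names are illustrative assumptions, and a real Description Logics reasoner handles far richer inferences than this.

```python
# Toy illustration of what a reasoner does: derive knowledge that was not
# explicitly asserted. Here, only subClassOf transitivity is computed.
def infer_subclasses(asserted):
    """Transitive closure of (subclass, superclass) assertions."""
    inferred = set(asserted)
    changed = True
    while changed:
        changed = False
        for a, b in list(inferred):
            for c, d in list(inferred):
                if b == c and (a, d) not in inferred:
                    inferred.add((a, d))
                    changed = True
    return inferred

asserted = {("VoteTool", "Tool"), ("Tool", "SharedResource")}
inferred = infer_subclasses(asserted)
# Derived fact, never asserted directly: a VoteTool is a SharedResource.
print(("VoteTool", "SharedResource") in inferred)  # True
```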
   In the groupware domain, ontologies have mainly been used to model task analysis
or sessions. Different concepts and terms, such as group, role, actor, task, etc., have
been used for the design of task analysis and sessions. Many of these terms are
considered in our conceptual model. Moreover, semiformal methods (e.g. UML class
diagrams, use cases, activity graphs, transition graphs, etc.) and formal ones (such as
algebraic expressions) have also been applied to model sessions. There is also a
work [8] on modeling cross-enterprise business processes from the perspective of
cooperative systems, which offers a multi-level design scheme for the construction of
cooperative system ontologies. That work is focused on business processes, and it
describes a general scheme for the construction of ontologies. However, in this paper,
we propose to model two specific aspects: the group organizational structure and the
architecture of a groupware application. Consequently, the application domain of both
ontologies is groupware, not business processes.


3 Ontology for specifying an architectural model

In order to specify the architectural model, five concerns are identified: Data, Group,
Cooperation, Application, and Adaptation. Consequently, five layers are considered.
Four layers are composed of modules and services, while the fifth one, the Data
Layer, contains repositories with the necessary information to carry out the group
work. The services of the architectural model are defined by the concept ontology.


3.1. Ontology Concepts

The architecture components are characterized through the concept ontology (shown
in Figure 1) and are briefly described below:
   • Registration is the first action that a user must carry out to be able to participate
     in the group work using the collaborative application.
   • Authentication validates the access to the group and depends on the
     organizational style defined in it.
   • Group is the set of users who work in the session to perform the group work.
   • Organizational_Style defines the organizational style that a group will use to
     carry out the group work.
   • Stage restricts users' access to the application in accordance with the
     organizational style defined in it.
   • Session defines a shared workspace where a group carries out common tasks.
   • Session_Management manages and controls one or more sessions.
   • Concurrency manages shared resources to avoid inconsistencies in their use.
   • Shared_Resource is used by users to carry out basic activities.
   • Basic_Activity is an action that a user must perform to carry out a task (which
     can be made up of one or more basic activities).
   • Task is carried out by the group to achieve a common goal.
   • Notification notifies one or more users of all events that happen in a session.
   • Group_Awareness gets the necessary information to supply group awareness to
     the users that take part in a group.
   • Group_Memory is supplied by the application to facilitate a common context.
   • Application is used by the users to carry out group work in an established session.
   • Configuration configures the application the first time that it is used and
     whenever it is necessary.
   • User_Interface shows users all the information about the application execution.
   • Environment modifies the user interface to present the information in
     accordance with the device used by each user.
   • Adaptation is a process that allows adapting the collaborative application to the
     new needs of the group.
   • Detection monitors the execution environment to detect the events that
     determine the adaptation process.
   • Agreement decides whether an adaptation process must be carried out or not.
   • Vote_Tool is used by users to reach the agreement.
   • Adaptation_Flow is a set of steps carried out to adapt the collaborative
     application in accordance with the selected event.
   • Repair is required when the adaptation process cannot be performed.
[Figure 1: graph of the ontology, in which the instance nodes (e.g. URA, UAG, G1,
SCM, SM, GN, CS, AE, AA, AAA, AFA, RA, TSP, ASI, AUP, SRP, IUP, COS,
SOS, AOS, CF, SU, AS) are connected by the relations described in Section 3.2,
such as allows, access, depends, organizes, defines, works, governs, controls,
manages, is_used, is_part_of, administers, provides, obtains, supplies, gives,
establishes, presents, modifies, has, is_adapted, is_determined, needs, uses,
performs and requires.]
Figure 1. Ontology for specifying an architecture to model collaborative applications.


3.2. Ontology Relations

The relationships between each architecture component and its environment are
symbolized by the ontology relations (see Figure 1) listed below:
   • allows (Registration, Authentication): Only registered users are allowed to
     authenticate in order to access the collaborative application.
   • access (Authentication, Group): Authentication allows users to access the group.
   • depends (Registration, Organizational_Style): User registration depends on
     the organizational style defined at a given stage.
   • organizes (Organizational_Style, Group): An organizational style specifies the
     way in which the group is organized.
   • defines (Stage, Organizational_Style): A stage defines an organizational style.
   • works (Group, Session): A group needs to be connected to a session to work.
   • governs (Session_Management, Session): The session management governs a
     session.
   • controls (Concurrency, Session): The concurrency service controls the existing
     interaction in a session.
   • manages (Concurrency, Shared_Resource): The concurrency service manages
     the shared resources to guarantee their mutually exclusive usage.
   • is_used (Shared_Resource, Basic_Activity): The shared resources are used by
     basic activities.
   • is_part_of (Basic_Activity, Task): A basic activity is part of a task.
   • administers (Session, Notification): The session administers the notification.
   • provides (Notification, Group_Awareness): The notification process provides
     group awareness.
   • obtains (Group, Group_Awareness): A group obtains group awareness to avoid
     inconsistencies in the collaborative application.
   • supplies (Notification, Group_Memory): The notification process supplies
     group memory.
   • gives (Application, Group_Memory): The application gives group memory.
   • establishes (Application, Session): An application establishes a session.
   • presents (Application, User_Interface): An application presents a user
     interface so that users can use the collaborative application.




   • modifies (Environment, User_Interface): The environment modifies the user
     interface according to the device used by each user.
   • has (Application, Configuration): Each application has a configuration process,
     which is carried out by users.
   • is_adapted (Application, Adaptation): An application is adapted by the
     adaptation process.
   • is_determined (Adaptation, Detection): The adaptation process is determined
     by the detection process.
   • needs (Adaptation, Agreement): The adaptation process needs an agreement
     process to decide whether the adaptation is carried out or not.
   • uses (Agreement, Vote_Tool): The agreement process uses a vote tool to carry
     out the agreement.
   • performs (Adaptation, Adaptation_Flow): The adaptation process performs an
     adaptation flow to appropriately adjust the application.
   • requires (Adaptation, Repair): When the adaptation process cannot be
     performed, it is necessary to repair the application to avoid inconsistencies in it.


3.3. Ontology Axioms

Finally, the principles governing the design and evolution of the architectural model
are represented by the ontology axioms (see Figure 1):
   • An authentication must have only one registration, i.e. a user is authenticated
     only if she/he is registered.
   • A registration depends on an organizational style, i.e. a user is registered in
     accordance with the organizational style established in the group work.
   • An organizational style organizes at least one group.
   • A group works in at least one session.
   • An application establishes at least one session.
   • A session administers at least one notification process.
   • A group obtains group awareness.
   • An application gives group memory.
   • The concurrency service controls at least one session.
   • The concurrency service manages at least one shared resource.
   • A shared resource is used by at least one basic activity.
   • A basic activity is part of at least one task.
   • An application has at least one possible configuration.
   • An application presents at least one user interface.
   • An environment modifies at least one user interface.
   • An application can be adapted by an adaptation process.
   • An adaptation process is determined by at least one detection process.
   • An agreement process is carried out only if there is an adaptation process in a
     non-hierarchical organizational style.
   • An agreement process uses at least one vote tool.
   • An adaptation process performs only one adaptation flow.
   • An adaptation flow must verify at least one pre-condition and post-condition to
     carry out the adaptation.




   • An adaptation process can require a repair process, if the adaptation has not
     finished.
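Most of the axioms above are minimum-cardinality constraints ("at least one ..."). A small sketch shows how such axioms could be checked against instance data; the relation and instance names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of checking "at least one" axioms against instance data.
# Relation and instance names are illustrative assumptions.
def check_min_cardinality(pairs, subjects, minimum=1):
    """Every subject must appear on the left of at least `minimum` pairs."""
    return all(sum(1 for s, _ in pairs if s == subj) >= minimum
               for subj in subjects)

establishes = {("CMS", "SCM")}   # an application establishes a session
administers = {("SCM", "GN")}    # a session administers a notification

# "An application establishes at least one session."
print(check_min_cardinality(establishes, {"CMS"}))   # True
# "A session administers at least one notification process."
print(check_min_cardinality(administers, {"SCM"}))   # True
```

In OWL, the same constraints would be stated declaratively as cardinality restrictions and verified by a reasoner rather than by hand-written checks.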


3.4. Ontology Instances

In order to show the architectural functionality, this section presents a set of instances
(see Figure 1), derived from the definition of the application instance, which is a
Conference Management System (CMS). A CMS is a web-based application that
supports the organization of scientific conferences. It can be regarded as a domain-
specific content management system. Nowadays, similar systems are used by editors
of scientific journals.
     This type of system generally has four stages: submission, assignment, review,
and acceptance/rejection of papers; this paper adds the stage of application
configuration. The CMS supports three user groups: Authors (A), Program Committee
Members (PCM) and Program Committee Chairs (PCC). The first user group (A)
corresponds to people who can submit papers (at the submission stage) through the
Internet, and who receive the review results and the final decision via email (at the
acceptance/rejection stage). It is the largest user group (its size is normally between
100 and 400 people for most conferences). The second user group (PCM) is made up
of people who must evaluate some of the submitted papers and send the results to the
PC Chairs via the Internet (at the review stage). It numbers about 20–50 people on
average. People in the last user group (PCC) are in charge of allocating papers to
reviewers (at the assignment stage) and making the final decision on papers, as well
as a number of other operations. This is the least numerous group, 1–3 PCCs per
conference being usual. Therefore:
     • Session instance is the Session of the Conference Management (SCM).
     • Session_Management instance is Session Management (SM).
     • Stage instances are configuration (CF), and submission (SU), assignment
       (AS), review (RE) and acceptance/rejection (AC) of papers.
     • Authentication instance is user authentication in the group (UAG).
     • Registration instance is user registration in the application (URA).
     • User instances are U1, U3 and U4 as A, U3 as PCM, and U2 as PCC.
     • User_Interface (UI) instances are Registration UI (RUI), Authentication UI
       (AUI), Submitting Paper UI (SPI), Configuration UI (CUI), etc.
     • Environment instance is the collaborative application environment (AE).
     • Organizational_Style (OS) instances are Configuration OS (COS),
       Submission OS (SOS), Assignment OS (AOS), Review OS (ROS), and
       Acceptance/rejection OS (POS). In the ontology shown in Figure 1, SOS is
       the only OS considered, for simplicity reasons.
     • Group instance is G1, which is made up of three users, U1, U3 and U4,
       because U2 does not participate at SOS.
     • Concurrency instance is the locks mechanism (LM).
     • Shared_Resource (SR) instances are paper (SRP), and uploading paper (IUP).
     • Basic_Activity (BA) instances are submitting information (ASI), and uploading
       paper (AUP).
     • Task instance is submitting paper (TSP).
     • Notification instance is group notification (GN).




     • Configuration instance is the configuration of the system (CS).
     • Adaptation instance is the adaptation of the application (AA).
     • Detection instance is the adaptation event (AE).
     • Agreement instance is the adaptation agreement of the application (AAA).
     • Vote_Tool instances are majority vote (MV), maximum value (MAV) and
       minimum value (MIV).
     • Adaptation_Flow instance is the adaptation flow of the application (AFA).
     • Repair instance is the reparation of the adaptation (RA).


4 BPM to Manage the Ontology-based Architectural Model

BPM [16] is a set of methods, tools and technologies used to design, perform, analyze
and manage operational business processes through different phases. It also
facilitates service composition. In this paper, BPM is based on an ontological approach
[17] and is composed of three phases: Process Modeling, Process
Implementation, and Process Execution. The ontology is used in order to simplify the
task of governing the behaviour of BPM; it enables BPM to use concepts to
describe the models and the entities being controlled, thus simplifying their
description and facilitating the analysis and careful reasoning over them; and it
allows dynamically calculating relations between business processes and the
environment, supporting modifications at runtime. BPM and SOA make
integration faster and easier than ever; it is not necessary to discard the
investments already made, since everything can be reused.


4.1. Process Modeling

This phase identifies the participating elements in a business process. Unlike other
existing models, we use an ontology (shown in Figure 1) to clearly specify the
semantics of the tasks and the decisions in the process flow. Therefore, four layers of
the architectural model (see Figure 2) are composed of the services that were
defined as concepts in the ontology. The Group Layer includes three modules:
Access, Group and Session. The Access Module has two services: Registration and
Authentication. The Group Module presents three services: Group, Organizational
Style, and Stage. The Session Module contains two services: Session Management,
and Session. The Cooperation Layer has the Context Module and the Interaction
Module. The former includes four services: Concurrency, Shared Resource, Basic
Activity, and Task. The latter encompasses the Group Awareness Service, the Group
Memory Service, and the Notification Service. The Application Layer comprises the
Application Service, the Configuration Service, the User Interface Service, and the
Environment Service. Finally, the Adaptation Layer involves the Pre-adaptation
Module and the Adaptation Module. The former encompasses the Detection Service,
the Agreement Service, and the Vote Tool Service. The latter comprises the Adaptation
Flow Service, and the Repair Service.
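The layer/module/service decomposition just described can be written out as a plain data structure, which makes it easy to query. The names below follow the text; since the text names no module for the Application Layer, the module label "Application" used for it is an assumption.

```python
# The layer -> module -> service decomposition of the architectural model,
# as described in the text (sketch of Figure 2). The "Application" module
# name is an assumption; the text lists only that layer's services.
architectural_model = {
    "Group Layer": {
        "Access": ["Registration", "Authentication"],
        "Group": ["Group", "Organizational Style", "Stage"],
        "Session": ["Session Management", "Session"],
    },
    "Cooperation Layer": {
        "Context": ["Concurrency", "Shared Resource", "Basic Activity", "Task"],
        "Interaction": ["Group Awareness", "Group Memory", "Notification"],
    },
    "Application Layer": {
        "Application": ["Application", "Configuration", "User Interface",
                        "Environment"],
    },
    "Adaptation Layer": {
        "Pre-adaptation": ["Detection", "Agreement", "Vote Tool"],
        "Adaptation": ["Adaptation Flow", "Repair"],
    },
}

def find_layer(service):
    """Locate the layer and module that provide a given service."""
    for layer, modules in architectural_model.items():
        for module, services in modules.items():
            if service in services:
                return layer, module
    return None

print(find_layer("Concurrency"))  # ('Cooperation Layer', 'Context')
```

A lookup like `find_layer` is the kind of query a process engine would need when it maps a task description to the service that implements it.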




  Figure 2. Architectural model for developing collaborative applications.



4.2. Process Execution

In this phase, the business process model is transformed into an executable process
model, which can be deployed to a process engine for its execution. Figure 3 shows a
sequence diagram (which represents an executable process model) for the case in
which an author submits papers to the CMS. In this figure, the CMS users are service
consumers invoking different services; only some services of the architectural
model are considered, for the sake of simplicity.




  Figure 3. Sequence diagram for paper submission.




4.3. Process Implementation

    In the process execution phase, the process engine executes a process model by
first creating a process instance and then navigating through the control flow of the
process model. In order to ensure seamless interaction when navigating through the
control flow of the process model, this phase provides mechanisms for the discovery,
selection and invocation of services. The module dynamically discovers and selects
the appropriate service of the architectural model based on the task description, and
invokes it on behalf of the process engine, which plays the role of a service requester
when it invokes the service to perform a task. Moreover, this module carries out
process monitoring, which provides relevant information about running process
instances. If a failure arises during process execution, such as network faults,
server crashes, or application-related errors (e.g. unavailability of a requested service,
errors in the composition or missing data, etc.), reconfiguration actions are carried
out, such as duplication (or replication) or substitution of a faulty service. The first
case involves the addition of services providing similar functionalities; this aims at
improving load balancing between services in order to achieve a better adaptation.
The second case encompasses redirection between two services; applying this action
means the first one is deactivated and replaced by the second one.
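The two reconfiguration actions just described can be sketched over a simple task-to-providers registry. The registry structure and service names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the two reconfiguration actions: duplication (add a replica for
# load balancing) and substitution (deactivate a faulty service and redirect
# invocations to a replacement). Names are illustrative assumptions.
def duplicate(registry, task, replica):
    """Add a service with similar functionality to improve load balancing."""
    registry.setdefault(task, []).append(replica)

def substitute(registry, task, faulty, replacement):
    """Deactivate the faulty service and redirect its invocations."""
    providers = registry.get(task, [])
    registry[task] = [replacement if p == faulty else p for p in providers]

registry = {"submit_paper": ["upload-svc-1"]}
duplicate(registry, "submit_paper", "upload-svc-2")          # replication
substitute(registry, "submit_paper", "upload-svc-1", "upload-svc-3")
print(registry["submit_paper"])  # ['upload-svc-3', 'upload-svc-2']
```

Keeping the mapping from task to providers in one place is what lets both actions happen at runtime without touching the process model itself.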


5 Conclusions and Future Work

This work has presented an ontology-based architectural model, which
facilitates the development of collaborative applications. The ontology describes the
components, their relationships to each other and to the environment, and the
principles governing architectural design and evolution. For that reason, we think that
an ontology is a proper model for describing architectures. BPM is used to manage and
control the interaction between the services that make up the architectural model. In
addition, BPM is also based on the proposed ontology. These services are designed
using SOA, which, together with BPM, facilitates the integration of applications.
Future work will consist of extending the existing reconfiguration actions of service-
based collaborative applications.


References

1. Garlan D., Shaw, M.: An introduction to software architecture. Advances in Software
   Engineering and Knowledge Engineering, 1--39, (1993)
2. Perry, D.E., Wolf, A.L.: Foundations for the study of software architecture. ACM
   SIGSOFT Software Engineering Notes 17(4), 40--52, (1992)
3. Architecture working group: Recommended practice for architectural description of
   software-intensive systems. IEEE Std 1471 (2000)
4. UML 2.0 Superstructure Specification (OMG). Ptc/03-08-02, 455—510, (2003)
5. Spivey, J.M.: The Z Notation: A Reference Manua., Prentice Hall, (1989)
6. Abrial, J.R.: The B-book: Assigning Programs to Meanings. Cambridge University Press,
   (1996)




                                                                                             22
7. Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge
    Sharing. I. J. Human Computer Studies 43-(5/6), 907--928 (1995)
8. Gómez-Pérez, A., Fernández-López, M, Corcho, O.: Ontological Engineering with
    Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web,
    Springer, (2004)
9. Uschold, M., Grüninger, M.: Ontologies: Principles, Methods and Applications. Knowledge
    Engineering Review 11(2), 93–155 (1996)
10. Farquhar, A., Fikes, R, Rice, J.: The Ontolingua Server: A Tool for Collaborative Ontology
    Construction. I. J. Human Computer Studies 46(6), 707–727, (1997)
11. Dean, M., Schreiber, G.: OWL Web Ontology Language Reference. W3C Working Draft.
    http://www.w3.org/TR/owl-ref/ (2003)
12. Protegé: http://protege.stanford.edu/
13. Noguera, M., Hurtado, V, Garrido, J.L.: An Ontology-Based Scheme Enabling the
    Modeling of Cooperation in Business Processes. In: Meersman, R., Tari, Z., Herrero, P.
    (eds.) OTM Workshops 2006. LNCS vol. 4277, pp. 863—872. Springer, Heidelberg (2006)
14. Erl, T.: SOA: Concepts, Technology and Design. Prentice-Hall, (2005)
15. Howard, S., Fingar, P.: Business Process Management: The Third Wave, Meghan-Kiffer,
    (2003)
16. May, M.: Business Process Management: Integration in a Web-Enabled Environment,
    Prentice Hall, (2003)
17. Hepp, M., Roman, D.: An Ontology Framework for Semantic Business Process
    Management. In: 8th International Conference on Wirtschaftsinformatik, Vol. 1, pp. 42--
    440, (2007)




      The use of WAP Technology in Question
                   Answering

    Fernando Zacarías F.1 , Alberto Tellez V.2 , Marco Antonio Balderas3 ,
              Guillermo De Ita L., and Barbara Sánchez R.4

                    Benemérita Universidad Autónoma de Puebla,
               1,3,4,5
                      Computer Science and 2 Collaborator - INAOE
                       14 Sur y Av. San Claudio, Puebla, Pue.
                                    72000 México
           1
             fzflores@yahoo.com.mx, 2 albertotellezv@ccc.inaoep.mx
               3
                 balderasespmarco@gmail.com, 4 brinza@hotmail.com


      Abstract. We present the experience of Puebla Autonomous University
      in using WAP technology to develop novel applications. The goal is to
      enhance question answering (QA) through innovative mobile applications
      that provide new services more efficiently. The proposed architecture,
      based on the WAP protocol, moves question answering into the context
      of mobility. This paradigm lets QA be seen as an activity that provides
      entertainment and excitement, which gives it added value. Furthermore,
      the method for answering definition questions is very precise: it could
      answer almost 95% of the questions and it never returned wrong or
      unsupported answers. Considering that the mobile phone has boomed in
      recent years and that many people already own one (approximately 3.5
      billion), we propose a new application based on Wikipedia that makes
      question answering natural and effective for work in all fields, since
      the new mobile technology can help us achieve our growth prospects. The
      system provides the user with a permanent service anytime, anywhere,
      and on any device (PDAs, cell phones, NDS, etc.). Furthermore, our
      application can be accessed via the Web through the iPhone and any
      device with Internet access.

      Keywords: Mobile devices, Question Answering, WAP, GPRS.


1   Introduction
Each generation of mobile communications has been based on a dominant tech-
nology, which has significantly improved spectrum capacity. Until the advent of
IMT-2000, cellular networks had been developed under a number of proprietary,
regional and national standards, creating a fragmented market.
 – First Generation was characterized by the Advanced Mobile Phone System
   (AMPS), an analog system based on FDMA (Frequency Division Multiple
   Access) technology. However, there were also a number of other proprietary
   systems, rarely sold outside their home countries.




 – Second Generation includes mainly five types of cellular systems:
     • Global System for Mobile Communications (GSM) was the first
       commercially operated digital cellular system; it uses TDMA (Time
       Division Multiple Access) technology.
     • TDMA IS-136 is the digital enhancement of the analog AMPS technology.
       It was called D-AMPS when first introduced in late 1991, and its main
       objective was to protect the substantial investment that service
       providers had made in AMPS technology.
     • CDMA IS-95 increases capacity by using the entire radio band, with
       each call using a unique code (CDMA, or Code Division Multiple Access).
     • Personal Digital Cellular (PDC) is the second largest digital mobile
       standard, although it is used exclusively in Japan, where it was
       introduced in 1994.
     • Personal Handyphone System (PHS) is a digital system used in Japan.
 – Third Generation, better known as 3G or 3rd Generation, is a family of
   standards for wireless communications defined by the International Telecom-
   munication Union, which includes GSM EDGE, UMTS, and CDMA2000 as
   well as DECT and WiMAX. Services include wide-area wireless voice tele-
   phone, video calls, and wireless data, all in a mobile environment. Thus, 3G
   networks enable network operators to offer users a wider range of more ad-
   vanced services while achieving greater network capacity through improved
   spectral efficiency.

    Currently, mobile devices are part of our everyday environment and
consequently part of our daily landscape [5]. Current mobile trends in several
application areas have demonstrated that training and learning no longer need
to be classroom-based. Current trends suggest that the following three areas
are likely to lead the mobile movement: m-applications, e-applications and
u-applications. There are an estimated 2.5 billion mobile phones in the world
today, more than four times the number of personal computers (PCs), and
today's most sophisticated phones have the processing power of a mid-1990s PC.
Many companies, organizations, people and educators are already using the
iPhone, iPod, NDS, etc., in their tasks and curricula with great results. They
integrate audio and video content, including speeches, interviews, artwork,
music, and photos, to bring lessons to life. Many current developments, such
as ours [5, 3, 6], incorporate multimedia applications.

    In the late 1980s, a researcher at Xerox PARC named Mark Weiser [4]
coined the term "Ubiquitous Computing", which refers to the process of
seamlessly integrating computers into the physical world. Ubiquitous computing
includes computer technology found in microprocessors, mobile phones, digital
cameras and other devices, all of which add new and exciting dimensions to
applications.

    As pragmatic uses grow for cellphones, mobile technology is also expanding
into creative territory. New public space art projects are using cellphones and




other mobile devices to explore new ways of communicating while giving every-
day people the chance to share some insights about real world locations.

    While your cellphone now allows you to play games, check your e-mail,
send text messages, take pictures, and, oh yes, make phone calls, it can
perhaps serve a more enriching purpose. We think that widespread Internet
access and collaboration technologies are allowing businesses of all sizes to
mobilise their workforce. Such innovations provide additional flexibility
without the need to invest in expensive and complex on-premise infrastructure.
Furthermore, it makes "eminent sense" to fully utilise the web commuting
options provided by mobile technology.

    The problem of answering questions has been recognized and partially
tackled since the 1970s for specific domains. However, with the advent of
browsers working over billions of documents on the Internet, the need has
newly emerged, leading to approaches for open-domain QA. Some examples are
emergent question answering engines such as answers.com and ask.com, or
additional services in traditional browsers, such as Yahoo.

    Recent research in QA has been mainly fostered by the TREC and CLEF
conferences. The first focuses on English QA, whereas the second evaluates QA
systems for most European languages except English. To date, both evaluation
conferences have considered only a very restricted version of the general QA
problem. They basically contemplate simple questions which assume a definite
answer typified by a named entity or noun phrase, such as factoid questions
(for instance, "How old is Cher?" or "Where is the Taj Mahal?") or definition
questions ("Who is Nelson Mandela?" or "What is quinoa?"), and exclude
complex questions such as procedural or speculative ones.

    Our paper is structured as follows: in section 2 we describe the state of
the art in QA and similar works. Next, in section 3, we present the method for
answering definition questions. Then, in section 4, we present the WAP
technology that supports our mobile application. Section 5 shows the two
variants of our application, over Wi-Fi and over the WAP protocol. Section 6
describes our perspectives and future work. Finally, conclusions are drawn in
section 7.


2    The state of the art

One of the oldest activities in human history is raising questions about the
issues and conflicts that torment our existence. Since childhood, this is the
mechanism we use to understand and adapt to our environment. The counterpart
of asking questions is answering them, an activity that also requires
intelligence and whose difficulty we have tried to delegate to computers
almost since their emergence. Question answering by computers has been
recognized and tackled since the 1970s for specific domains. In Mexico,
excellent results have been obtained in this context, and for this reason we
propose to bring these same results to mobile technologies.
    Recent research has focused on developing open-domain question answering
systems, i.e., systems that take as their source of information a collection
of texts on a variety of topics and solve questions whose answers can be
obtained from that collection. From the question answering systems developed
so far, we can identify three main phases:

1. Question analysis. This first phase identifies the type of answer expected
   for the given question; for instance, a "when" question expects a time,
   whereas a "where" question leads us to identify a place. The most commonly
   used answer types are person name, organization name, number, date and
   place.
2. Document retrieval. The second phase performs a retrieval process over the
   document collection using the question, identifying the documents that
   probably contain the expected type of answer. The result of this second
   phase is a reduced set of documents, and preferably of specific paragraphs.
3. Answer extraction. The last phase uses the set of documents obtained in
   the previous phase and the expected answer type identified in the first
   phase to locate the desired answer.
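The three phases above can be sketched as a toy pipeline. Everything below (the answer-type table, the stop-word list, the year-matching extractor) is an illustrative assumption, not part of any system described in this paper:

```python
import re

# Phase 1: question analysis -- map question words to expected answer types.
ANSWER_TYPES = {"when": "DATE", "where": "PLACE", "who": "PERSON"}

def analyze_question(question):
    """Return the expected answer type for a question (default OTHER)."""
    first_word = question.lower().split()[0]
    return ANSWER_TYPES.get(first_word, "OTHER")

def retrieve_documents(question, collection):
    """Phase 2: keep documents that share content terms with the question."""
    stop = {"when", "where", "who", "is", "was", "the", "did", "a"}
    terms = set(re.findall(r"\w+", question.lower())) - stop
    return [d for d in collection if terms & set(re.findall(r"\w+", d.lower()))]

def extract_answer(expected_type, documents):
    """Phase 3: locate a string of the expected type in the documents.
    Only DATE (a four-digit year) is handled in this toy extractor."""
    if expected_type == "DATE":
        for doc in documents:
            match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", doc)
            if match:
                return match.group()
    return None

docs = ["GSM was first commercially operated in 1991.",
        "PDC is used exclusively in Japan."]
question = "When was GSM introduced?"
print(extract_answer(analyze_question(question), retrieve_documents(question, docs)))
# -> 1991
```

Real systems replace each stub with far richer machinery (named-entity recognition, ranked retrieval, answer validation), but the three-phase flow is the same.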

    Definition questions require a more complex process in the third phase,
since the system must obtain additional information segments that are, at the
same time, not repetitive. Producing a good "definition" often requires
resorting to several documents [1].

    Currently, question answering on mobile devices for open domains is at a
development stage. QALL-ME is a 36-month project funded by the European Union
and conducted by a consortium of seven institutions: four academic and three
industrial. Its aim is to establish a shared infrastructure for developing QA
via mobile phone, so that any tourist or citizen can instantly access
information about the services sector, be it a movie at the cinema, a theater,
or a restaurant serving a certain type of food, all in a multilingual and
multimodal mode for mobile devices. The project will experiment with the
potential of open-domain QA and its evaluation in the context of seeking
information from mobile devices: a multimodal scenario which includes natural
speech as input, and the integration of textual answers, maps, pictures and
short videos as output.

    The architecture proposed in the QALL-ME project is a distributed
architecture in which all modules are implemented as Web services using a
standard service-definition language. Figure 1 shows the main modules of this
architecture, which is described as follows:




                     Fig. 1. Main QALL-ME Architecture [8]




    “The central planner is responsible for interpreting multilingual queries.
This module receives the query as input, processes the question in the
language in which it is posed and, according to the context parameters,
directs the search for the required information to a local answer extractor.
The extraction of the answer is made over different semantic representations
of the information, depending on the type of the original data source from
which the answer is obtained (if the source is plain text, the semantic
representation is an annotated XML document; if the source is a website, the
semantic representation is a database built by a wrapper). Finally, the
answers are returned to the central planner, which determines the best way to
present the requested information” [8].


3   Mobile Question Answering for Definition Questions

The method for answering definition questions uses Wikipedia [10] as the
target document collection. It takes advantage of two known facts: (i)
Wikipedia organizes information by topics, that is, each document concerns one
single subject; and (ii) the first paragraph of each document tends to contain
a short description of the topic at hand. This way, the method simply
retrieves the document(s) describing the target term of the question and then
returns some part of its initial paragraph as




answer. Figure 2 shows the general process for answering definition questions. It
consists of three main modules: target term extraction, document retrieval and
answer extraction.




                 Fig. 2. Process for answer definition questions [7]




3.1   Finding Relevant Documents

In order to search Wikipedia for the document most relevant to the given
question, it is first necessary to recognize the target term. For this purpose
the method uses a set of manually constructed regular expressions such as
"What|Which|Who|How" + "any form of the verb to be" + <TARGET> + "?",
"What is a <TARGET> used for?", "What is the purpose of <TARGET>?",
"What does <TARGET> do?", etc. Then, the extracted target term is compared
against all document names, and the document having the greatest similarity is
retrieved and delivered to the answer extraction module. It is important to
mention that, in order to favor retrieval recall, we decided to use the
document names instead of the document titles: they also indicate the subject,
but they are normally more general (i.e., titles tend to be a subset of
document names). In particular, the system uses the Lucene [11] information
retrieval system for both indexing and searching.
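As a rough illustration of this step, the quoted patterns can be approximated with a few regular expressions. The exact expressions used by the system are not published, so the ones below are assumptions:

```python
import re

# Regular expressions approximating the patterns quoted above; the named
# group "term" captures the target term of the definition question.
# Order matters: the more specific patterns must be tried first.
PATTERNS = [
    re.compile(r"^(?:What|Which|Who|How)\s+(?:is|are|was|were)\s+"
               r"(?:a\s+|an\s+|the\s+)?(?P<term>.+?)\s+used for\?$", re.I),
    re.compile(r"^What is the purpose of\s+(?P<term>.+?)\?$", re.I),
    re.compile(r"^What does\s+(?P<term>.+?)\s+do\?$", re.I),
    re.compile(r"^(?:What|Which|Who|How)\s+(?:is|are|was|were)\s+"
               r"(?P<term>.+?)\?$", re.I),
]

def extract_target_term(question):
    """Return the target term of a definition question, or None."""
    for pattern in PATTERNS:
        match = pattern.match(question.strip())
        if match:
            return match.group("term")
    return None

print(extract_target_term("Who is Nelson Mandela?"))  # -> Nelson Mandela
```

The extracted term would then be matched against the index of Wikipedia document names (e.g., via Lucene) rather than answered directly.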




3.2   Extracting the Target Definition
As previously mentioned, most Wikipedia documents tend to contain a brief
description of their topic in the first paragraph. Based on this fact, the
method for answer extraction is defined as follows:
 – Consider the first sentence of the retrieved document as the target
   definition (the answer).
 – Eliminate all text between parentheses (the goal is to eliminate comments
   and less important information).
 – If the constructed answer is shorter than a given threshold, then aggregate
   as many sentences of the first paragraph as necessary to obtain an answer
   of the desired size.
    For instance, the answer for the question “Who was Hermann Emil Fischer?”
(refer to Figure 2) was extracted from the first paragraph of the document “Her-
mann Emil Fischer”: “Hermann Emil Fischer (October 9, 1852 - July 15, 1919)
was a German chemist and recipient of the Nobel Prize for Chemistry in 1902.
Emil Fischer was born in Euskirchen, near Cologne, the son of a businessman.
After graduating he wished to study natural sciences, but his father compelled
him to work in the family business until determining that his son was unsuit-
able”.
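The three rules above can be sketched in a few lines; the naive sentence splitter and the threshold value here are our own simplifying assumptions, not the authors' implementation:

```python
import re

def extract_definition(first_paragraph, min_length=100):
    """Answer-extraction sketch for definition questions: take the first
    sentence, drop parenthesized text, and append further sentences until
    a minimum answer length is reached."""
    # Rule 2: remove text between parentheses (comments, less important info).
    cleaned = re.sub(r"\s*\([^)]*\)", "", first_paragraph)
    # Naive sentence split on a period followed by whitespace.
    sentences = re.split(r"(?<=\.)\s+", cleaned)
    # Rule 1: start from the first sentence.
    answer = sentences[0]
    # Rule 3: aggregate more sentences while the answer is too short.
    for sentence in sentences[1:]:
        if len(answer) >= min_length:
            break
        answer += " " + sentence
    return answer

paragraph = ("Hermann Emil Fischer (October 9, 1852 - July 15, 1919) was a "
             "German chemist and recipient of the Nobel Prize for Chemistry "
             "in 1902. Emil Fischer was born in Euskirchen, near Cologne.")
print(extract_definition(paragraph, min_length=80))
```

On the example paragraph, the parenthesized dates are stripped and the first sentence alone already exceeds the threshold, so only it is returned.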

3.3   Evaluation Results of our method
This section presents the experimental results about the participation [7] at the
monolingual Spanish QA track at CLEF 2007. This evaluation exercise considers
two basic types of questions, definition and factoid. However, this year there were
also included some groups of related questions. From the given set of 200 test
question, our QA system treated 34 as definition questions and 166 as factoid.
Table 3.3 details our general accuracy results.


                   Table 1. System’s general evaluation




    It is very interesting to notice that our method for answering definition
questions is very precise: it could answer almost 90% of the questions and,
moreover, it never returned wrong or unsupported answers. This result evidenced that




Wikipedia has some inherent structure, and that our method could effectively
take advantage of it [7].


4    WAP technology in Question Answering
Wireless Application Protocol (WAP) is a secure specification that allows users
to access information instantly via handheld wireless devices such as mobile
phones, pagers, two-way radios, Smart phone and communicators.

  WAP is designed to make user-friendly and innovative data applications for
mobile phones easy to build. Three types of terminal have been defined [12]:

 – Feature phones, which offer high voice quality with the capability of text
   messaging and Internet browsing.
 – Smart phones, with similar functionality but with larger display.
 – The communicator, which is an advanced terminal designed with the mobile
   professional in mind, similar in size to a palm-top with a large display.

   WAP devices that use displays and access the Internet run what are called
microbrowsers: browsers with small file sizes that can accommodate the low
memory constraints of handheld devices and the low-bandwidth constraints of a
wireless handheld network.

    WAP uses the Wireless Markup Language (WML), which incorporates the
Handheld Device Markup Language (HDML) developed by Phone.com. WML also traces
its roots to the eXtensible Markup Language (XML). A markup language is a way
of adding information to your content that tells the device receiving the
content what to do with it. The best-known markup language is the Hypertext
Markup Language (HTML). Unlike HTML, WML is considered a meta-language:
basically, this means that in addition to providing predefined tags, WML lets
you design your own markup language components. WAP also allows the use of
standard Internet protocols such as UDP, IP and XML.

   Although WAP supports HTML and XML, the WML language (an XML
application) is specifically devised for small screens and one-hand navigation
without a keyboard. WML is scalable from two-line text displays up through
graphic screens found on items such as smart phones and communicators.

    WAP also supports WMLScript. It is similar to JavaScript, but makes
minimal demands on memory and CPU power because it does not contain many of
the unnecessary functions found in other scripting languages. Because WAP is
fairly new, it is not a formal standard yet; it is still an initiative,
started by Unwired Planet, Motorola, Nokia, and Ericsson.

    There are three main reasons why wireless Internet needs the Wireless Ap-
plication Protocol:




                      Fig. 3. Migration of Markup language



 – Transfer speed: most cell phones and Web-enabled PDAs have data transfer
   rates of 14.4 Kbps or less. Compare this to a typical modem, a cable modem
   or a DSL connection. Most Web pages today are full of graphics that would
   take an unbearably long time to download at 14.4 Kbps. In order to minimize
   this problem, wireless Internet content is typically text-based in most cases.
 – Size and readability: the relatively small size of the LCD on a cell phone or
   PDA presents another challenge. Most Web pages are designed for a resolu-
   tion of 640x480 pixels, which is fine if you are reading on a desktop or a lap-
   top. The page simply does not fit on a wireless device’s display, which might
   be 150x150 pixels. Also, the majority of wireless devices use monochrome
   screens. Pages are harder to read when font and background colors become
   similar shades of gray.
 – Navigation: you make your way through a Web page with points and clicks
   using a mouse, but on a wireless device you often navigate with one hand
   using scroll keys.
   WAP takes each of these limitations into account and provides a way to work
with a typical wireless device.

    Here’s what happens when you access a Web site using a WAP-enabled de-
vice:
 – You turn on the device and open the mini-browser.




                      Fig. 4. WAP Technology Infrastructure



 – The device sends out a radio signal, searching for service.
 – A connection is made with your service provider.
 – You select a Web site that you wish to view.
 – A request is sent to a gateway server using WAP.
 – The gateway server retrieves the information via HTTP from the Web site.
 – The gateway server encodes the HTTP data as WML.
 – The WML-encoded data is sent to your device.
 – You see the wireless Internet version of the Web page you selected.
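The gateway's encoding step (HTTP data re-encoded as WML) can be illustrated with a small sketch. A real gateway performs binary WML encoding; this text-level conversion, and the helper names in it, are only an approximation we introduce for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the visible text of an HTML page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_wml(html, card_id="main", title="Result"):
    """Sketch of a gateway encoding step: strip an HTML page down to its
    text and wrap it in a one-card WML deck."""
    parser = TextExtractor()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return ('<?xml version="1.0"?>\n'
            '<wml><card id="%s" title="%s"><p>%s</p></card></wml>'
            % (card_id, title, text))

page = "<html><body><h1>Answer</h1><p>Quinoa is a grain crop.</p></body></html>"
print(html_to_wml(page))
```

The graphics-heavy parts of the page are simply discarded, which is consistent with the text-based content constraint discussed above.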
    Although WML is well suited to most mundane content delivery tasks, it
falls short of being useful for database integration or extremely dynamic
content. PHP fills this gap quite nicely, integrating with most databases and
other Web structures and languages. It is possible to "crossbreed" MIME types
in Apache to enable PHP to deliver WML content. WML pages are often called
"decks". A deck contains a set of cards. A card element can contain text,
markup, links, input fields, tasks, images and more. Cards can be related to
each other with links.

    When a WML page is accessed from a mobile phone, all the cards in the
page are downloaded from the WAP server. Navigation between the cards is then
done by the browser inside the phone, without any extra communication with
the server.
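A deck with two linked cards might look like the following (illustrative markup only, not taken from the system described; the DOCTYPE refers to the public WML 1.1 DTD):

```xml
<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
  "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
  <!-- First card: shown when the deck loads -->
  <card id="question" title="Ask">
    <p>Who is Nelson Mandela?
       <a href="#answer">See answer</a></p>
  </card>
  <!-- Second card: reached without contacting the server again -->
  <card id="answer" title="Answer">
    <p>Nelson Mandela is a former President of South Africa.</p>
  </card>
</wml>
```

Following the `#answer` link switches cards locally on the phone, since both cards arrived in the same deck.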




5   Mobile application

As we mentioned at the beginning, our proposal combines mobile technologies
and web technologies. First, we have developed a mobile application (shown in
figure 5) based on WAP technology. It allows users to search at any time and
in any place at a very low cost: 2 cents per search. Furthermore, the
application is available for most types of mobile phones. Figure 5 shows the
main interface, as well as the request and response of a user's search.




               Fig. 5. Mobile application through WAP Technology




    On the other hand, figure 6 shows how our application, mQAB, can be
accessed from the web on an iPhone through Wi-Fi. This is another access
channel to our application via wireless networks, which allows it to cover all
existing wireless and mobile devices.


6   Perspectives and Future work

People throughout the world increasingly rely on cell phones and mobile
devices to keep them plugged in. Obviously, search will play an ever
increasing role in the evolution of mobile computing. When will mobile search
surpass desktop search? We have been expecting better search capabilities from
mobile devices for some time, and we know that Asia is currently far ahead of
North America in this respect. Today, experts discuss their views about the
evolution of search in North America, and what we are sure of is that we must
continue working along this line. For this purpose, the next phase of
development is the implementation of the Mobile Question Answering System for
Spanish and English. Furthermore, we




                    Fig. 6. Mobile application through iPhone



seek to apply such search in promising niches such as education.

    To sum up, the results expected from the architecture presented in this
article are:

 – The architecture presented here, unlike other proposals based on short
   text messages [2], is cheaper, as discussed in section 4.
 – Our proposal gives better performance, because communication via WAP is
   much more reliable than communication based on SMS: SMS-based systems
   offer roughly 80 percent delivery certainty, while the WAP protocol
   provides 100 percent reliability.
 – Our proposal makes use of only a servlet on the server side and a simple
   midlet on the mobile device side.
 – Furthermore, our proposal will benefit from the availability of the
   Spanish WIKIPEDIA.
 – Finally, our proposal is based on Java Micro Edition, and thus it will be
   independent of the operating system (OS).


7   Conclusions
A consortium of companies is pushing for products and services based on open,
global standards, protocols and interfaces, not locked to proprietary
technologies. The architecture framework and service enablers will be
independent of the operating system (OS), and there will be support for
interoperability of applications and platforms and for seamless geographic and
intergenerational roaming. The mobile architecture proposed in this paper has
the advantage of being adaptable to any system and infrastructure, following
the current trend that mobile technologies demand.




    We believe the selection of topics covered in encyclopedias like
WIKIPEDIA is not universal across languages, but reflects the salience
attributed to themes by the particular culture that speaks each language. Our
approach would therefore benefit from the availability of both the Spanish
WIKIPEDIA and the English WIKIPEDIA.


8    Acknowledgments

We thank the Autonomous University of Puebla for its financial support. This
work was supported under VIEP project register number 15968. We also thank
the support of the academic body Sistemas de Información.


References
 1. A. López. La búsqueda de respuestas, un desafío computacional antiguo y vigente.
    La Jornada de Oriente, http://ccc.inaoep.mx/cm50-ci10/columna/080721.pdf,
    1(1):1-2, July 2008.
 2. L. Jochen, The Deployment of a mobile question answering system. Search Engine
    Meeting. Boston, Massachusetts, 1(1), April 2005.
 3. F. Zacarías Flores, F. Lozano Torralba, R. Cuapa Canto, A. Vázquez Flores. En-
    glish's Teaching Based On New Technologies. The International Journal of Tech-
    nology, Knowledge & Society, Northeastern University in Boston, Massachusetts,
    USA. ISSN: 1832-3669, Common Ground Publishing, USA, 2008.
 4. Weiser, M. (1991). The computer for the twenty-first century. Scientific American,
    September, 94-104.
 5. Zacarías F., Sánchez A., Zacarías D., Méndez A., Cuapa R. FINANCIAL MOBILE
    SYSTEM BASED ON INTELLIGENT AGENTS in the Austrian Computer Soci-
    ety book series, Austria, 2006.
 6. F. Zacarías Flores, R. Cuapa Canto, F. Lozano Torralba, A. Vázquez Flores, D.
    Zacarías Flores. u-Teacher: Ubiquitous learning approach, pp. 9--20, June 2008.
 7. Alberto Téllez, Antonio Juárez, Gustavo Hernández, Claudia Denicia, Esaú Villa-
    toro, Manuel Montes, Luis Villaseñor. INAOE's Participation at QA@CLEF 2007.
    Laboratorio de Tecnologías del Lenguaje, Instituto Nacional de Astrofísica,
    Óptica y Electrónica (INAOE), Mexico.
 8. Izquierdo R., Ferrández O., Ferrández S., Tomás D., Vicedo J.L., Martínez P. and
    Suárez A. QALL-ME: Question Answering Learning technologies in a multiLingual
    and multiModal Environment. Departamento de Lenguajes y Sistemas Informáticos,
    Universidad de Alicante.
 9. http://java.sun.com/developer/technicalArticles/javaserverpages/wap
10. http://ilps.science.uva.nl/WikiXML/database.php
11. http://lucene.apache.org/
12. J. AlSadi, B. AbuShawar, MLearning: The Usage of WAP Technology in E-
    Learning, International Journal of Interactive Mobile Technologies/Vol. 3, (2009)




    Data Warehouse Development to Identify Regions
      with High Rates of Cancer Incidence in México
     through a Spatial Data Mining Clustering Task.

                            Joaquin Pérez Ortega1,
                        María del Rocío Boone Rojas1,2,
                       María Josefa Somodevilla García2,
                     Mariam Viridiana Meléndez Hernández2

   1 Centro Nacional de Investigación y Desarrollo Tecnológico, Cuernavaca, Mor., México.
     2 Benemérita Universidad Autónoma de Puebla, Fac. Cs. de la Computación, México.
    jperez@cenidet.edu.mx, {rboone,mariasg}@cs.buap.mx, mvmh_099@hotmail.com


                                        Abstract

Data warehouses arise in many contexts, such as business, medicine and science, in which
the availability of a repository of heterogeneous data sources, integrated and organized
under a unified framework, facilitates analysis and supports the decision-making
process. These data repositories increase their scope and application when used for data
mining tasks, which can extract useful, new and valuable knowledge from large amounts
of data.
    This paper presents the design and implementation of a population-based data
warehouse on the incidence of cancer in Mexico, based on the multidimensional model
at the conceptual level and the ROLAP (Relational On-Line Analytical Processing)
model at the implementation level.
    The data warehouse is built to be used as input for clustering data mining tasks, in
particular the k-means algorithm, in order to identify regions in Mexico with high rates
of cancer incidence.
    The identified regions, as well as the dimension relating the geographic location of
the municipalities to their cancer incidence rates, are processed by IRIS, a Geographic
Information System developed at the National Institute of Statistics, Geography and
Informatics of Mexico.

1 Introduction
Data warehouses arise in many contexts, such as business, medicine and science,
in which the availability of a repository of heterogeneous data sources,
integrated and organized under a unified framework, facilitates analysis and
supports the decision-making process. These data repositories increase their
scope and application when used for data mining tasks, which can extract
useful, new and valuable knowledge from large amounts of data.
    Data warehouses have been applied mainly in the commercial and business
areas [3]; more recently there have been applications in the health field [16] [17],
as well as a trend towards their integration with various technologies [11] [16].
    Moreover, according to the literature, the use of data mining systems for the
analysis of massive population-based health databases has been limited; noteworthy
works include Constructing Overview+Detail Dendrogram Matrix Views [6],
Application of data mining techniques to population databases of cancer [1],
Subgroup discovery in cervical cancer using data mining techniques [18] and Data
mining for cancer management in Egypt [10]. In the case of Mexico, to the best of
our knowledge, the works developed at the Centro Nacional de Investigación y
Desarrollo Tecnológico (CENIDET) and at BUAP are the first in this field.
    This work has been preceded by others on the incidence of cancers such as
stomach and lung cancer [15]. It is part of a larger project devoted to proposing
improvements to the k-means algorithm in aspects such as effectiveness and
efficiency, reported in [12], [13] and [14], and to its application in the health
field.
    This article presents the data warehouse design and integration for the
development of a data mining task on cancer incidence by region in Mexico,
based on the integration of complementary technologies such as clustering and
geographic information systems. As a case study, the results for the incidence
of cervical cancer are presented; this cancer has been of special interest since,
in Mexico, it is the leading cause of cancer death in women [11].
    The paper is organized as follows. Section 2 describes the data sources and
the design and implementation of the data warehouse. Section 3 provides an
overview of the data mining application. Section 4 presents the results for the
case of cervical cancer and their visualization with the INEGI IRIS GIS [5].
Finally, Section 5 presents conclusions and perspectives of this work.

2 The Data Warehouse
Building the data warehouse on cancer incidence by region in Mexico required
collecting and integrating the data sources needed to accomplish the data mining
task. This section describes the data sources, the conceptual design based on the
multidimensional model and the implementation of the data warehouse under the
ROLAP approach.

2.1 The Data Sources

In this study, the processed databases were derived from official records of the
National Institute of Public Health (INSP) and the National Institute of
Statistics, Geography and Informatics (INEGI) of Mexico.
    Data on cancer incidence were obtained through the Remote Consultation
System for Health Information (SCRIS) subsystem of the INSP [9]. In particular,
the databases were queried for cancer mortality cases, and the results were
configured by considering aggregation levels such as national division (state,
jurisdiction, municipality), year, age range, gender and cause (including tumors).
    The information on the population and the actual geographical location of the
municipalities was obtained from official INEGI databases through its Geographic
Information System IRIS, which provides statistical information covering a wide
range of geographical, demographic, social and economic subjects; it also includes
aspects of the physical environment, natural resources and infrastructure. This
wealth of statistical and geographical data is obtained through activities such as
population, housing and economic censuses and the generation of basic cartography.
    The information in the databases of the above institutions was integrated into a
data warehouse (see Fig. 1). Following the conventions of the health area, only the
municipalities with more than one hundred thousand inhabitants were considered in
this study.





 Fig. 1 Multidimensional Model Data Warehouse on the incidence of cancer
                               in Mexico.




2.2 Multidimensional Model of the Data Warehouse on Population-Based
Cancer Incidence in Mexico

According to [4], the conceptual data model most widely used for data
warehouses is the multidimensional model. The data are organized around
facts, which have attributes or measures that can be detailed to a greater or
lesser extent according to certain dimensions. In our case, the data warehouse
design at the conceptual level is based on the multidimensional model, with the
dimensions CAUSE, TIME and PLACE. The basic fact is "deaths", which may
have associated attributes such as number of cases, incidence rate, mean,
variance, etc. A fact can be detailed along several dimensions, such as cause of
death, place of death and date of death. Fig. 1 shows the fact "deaths" and the
three dimensions with their aggregation levels; the arrows can be read as "is
aggregated into". As shown in Fig. 1, each dimension has a hierarchical, but
not necessarily linear, structure. When the number of dimensions does not
exceed three, each combination of aggregation levels can be represented as a
cube.
    The cube is made up of cells, one for each possible combination of values of
the dimensions at the corresponding aggregation level. In this "view", each cell
represents a fact. Fig. 2 shows a three-dimensional cube corresponding to the fact
"According to the 2000 census, in the municipality of Atlixco there were 15 deaths
from cervical cancer", in which the dimensions Cause, Place and Time have been
aggregated to type of disease (cancer), municipality and census. The representation
of a fact therefore corresponds to a cell of the cube, whose value is the observed
measure (in this case, the number of deaths).
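The cube view described above can be sketched as a sparse mapping from dimension coordinates to measures. The following minimal Python sketch (not part of the original system) stores only the Atlixco fact of Fig. 2:

```python
# Sparse fact cube: (cause, place, time) coordinates -> observed measure.
# Only the Atlixco example of Fig. 2 is loaded; the cell value is the
# number of deaths.
cube = {
    ("cervical cancer", "Atlixco", "2000 census"): 15,
}

def fact(cause, place, time):
    """Return the measure stored in one cell of the cube (0 if the cell is empty)."""
    return cube.get((cause, place, time), 0)

print(fact("cervical cancer", "Atlixco", "2000 census"))  # 15
```

Empty cells (combinations with no observed deaths) simply do not appear in the mapping, which is how ROLAP implementations also avoid storing the mostly empty cube.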




             Fig. 2 Display of a fact in a multidimensional model




2.3 ROLAP (Relational OLAP) Implementation of the Data Warehouse on
Population-Based Cancer Incidence in Mexico

One of the most efficient ways to implement a multidimensional model using
relational databases is the ROLAP model [4]. In our case, the tables of the
ROLAP model have the following schemas:
Snowflake Tables
Cause dimension
DISEASE (Clave_Enfermedad, name, IdGama, CategoryID)
GAMA (IdGama, CategoryID, Description)
CATEGORY (CategoryID, Description)
Place dimension
STATE (Clave_Estado, name, población_total)
MUNICIPALITY (Clave_Municipio, Clave_Estado, name, población_total,
Loc_x, Loc_y, extension, tipo_zona, nivel_socioeconómico)
Time dimension
YEAR (Idan)
CENSUS (IdCenso, Idan, number, name)
Fact Tables
DEATH (IdEnfermedad, IdCenso, IdMunicipio, no_casos, rate, mean, variance)
Star Tables
TIME (Idan, IdCenso)
CAUSE (IdEnfermedad, IdGama, CategoryID)
PLACE (IdCiudad, IdMunicipio)
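As an illustration of the schemas above, the following sketch materializes a trimmed version of the Place dimension and the fact table and performs a roll-up from municipality to state. The paper does not name the relational engine used, so SQLite is an assumption; the key values follow the INEGI key 07089 of Tapachula, Chiapas, in Table 2, while IdEnfermedad = 1 is hypothetical.

```python
import sqlite3

# Trimmed sketch of the ROLAP schema above (SQLite is an assumption; the
# paper does not name the DBMS). Keys 7/89 follow INEGI key 07089 for
# Tapachula, Chiapas (Table 2); IdEnfermedad = 1 is hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE STATE (
    Clave_Estado    INTEGER PRIMARY KEY,
    name            TEXT
);
CREATE TABLE MUNICIPALITY (
    Clave_Municipio INTEGER PRIMARY KEY,
    Clave_Estado    INTEGER REFERENCES STATE(Clave_Estado),
    name            TEXT,
    poblacion_total INTEGER
);
CREATE TABLE DEATH (
    IdEnfermedad INTEGER,
    IdCenso      INTEGER,
    IdMunicipio  INTEGER REFERENCES MUNICIPALITY(Clave_Municipio),
    no_casos     INTEGER,
    rate         REAL
);
""")
con.execute("INSERT INTO STATE VALUES (7, 'Chiapas')")
con.execute("INSERT INTO MUNICIPALITY VALUES (89, 7, 'Tapachula', 271674)")
con.execute("INSERT INTO DEATH VALUES (1, 2000, 89, 27, 9.93)")

# Roll-up along the Place dimension: deaths aggregated by state.
rows = con.execute("""
    SELECT s.name, SUM(d.no_casos)
    FROM DEATH d
    JOIN MUNICIPALITY m ON m.Clave_Municipio = d.IdMunicipio
    JOIN STATE s       ON s.Clave_Estado    = m.Clave_Estado
    GROUP BY s.name
""").fetchall()
print(rows)  # [('Chiapas', 27)]
```

The joins along the foreign keys are what make the aggregation hierarchy of Fig. 1 navigable with ordinary SQL GROUP BY queries, which is the practical advantage of the ROLAP approach.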


3 Data Mining Application on Cancer Incidence

The implemented data warehouse has been used to develop a spatial data mining
task based on the integration of technologies complementary to the data warehouse,
such as clustering and Geographic Information Systems, which in this case are very
suitable for identifying and displaying areas with high cancer incidence in Mexico.
The following provides a general description of the integration of technologies and
tools (Fig. 3) carried out for this application.
    For our application, the data warehouse integrates the following information:
the spatial component, which allows viewing the regions of municipalities;
population data, such as the number of deaths and the incidence rate; and the time
component, which in this case is the census year.
    The INEGI IRIS GIS [5], through its query options, allows the retrieval of the
population data and the actual location of the municipalities, which are integrated
into the data warehouse.




    Since IRIS stores the geographical representation of municipalities as polygons
in the standardized "shape" vector format, a conversion of shapes and formats is
needed in order to obtain a numerical representation of each municipality, which in
this case corresponds to a point at the location of the municipality center. This is
accomplished mainly through the tools of ESRI's ArcInfo GIS.
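The polygon-to-point step can be approximated with the standard area-weighted (shoelace) centroid. The sketch below is only an illustrative stand-in for the conversion the paper performs with ArcInfo:

```python
def polygon_centroid(vertices):
    """Area-weighted centroid of a simple polygon given as (x, y) vertices.

    Uses the standard shoelace formulas; illustrative stand-in for the
    shape-to-point conversion performed with ArcInfo.
    """
    a = cx = cy = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross                    # twice the signed area, accumulated
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return cx / (6.0 * a), cy / (6.0 * a)

# A unit square has its centroid at (0.5, 0.5).
print(polygon_centroid([(0, 0), (1, 0), (1, 1), (0, 1)]))  # (0.5, 0.5)
```

Because both sums share the sign of the signed area, the result is independent of whether the polygon is listed clockwise or counter-clockwise.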




              Fig. 3 Integration of Technology and Data Mining Tools

    Given the numerical representation of each municipality as a point (x, y),
together with its cancer incidence rate, the Matlab programming environment and
its implementation of the k-means algorithm [2] [7] are used to generate groups of
municipalities and the corresponding centroids.
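The clustering step can be sketched as follows. The paper uses Matlab's k-means implementation, so this plain-Python version (Lloyd/Forgy style [2], [7]) is only illustrative, with each municipality represented as a vector (x, y, incidence rate):

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its cluster. Illustrative stand-in for the
    Matlab implementation used in the paper."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # Forgy initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
        centroids = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters

# Toy data: (x, y, incidence rate) for two well-separated groups of municipalities.
pts = [(0, 0, 5.0), (0, 1, 5.2), (10, 10, 9.9), (10, 11, 9.8)]
cents, groups = kmeans(pts, k=2)
```

In practice the coordinate and rate components would need comparable scales (or an explicit weighting), since k-means uses plain Euclidean distance over all three components.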
    Once the above results are available, it is again necessary to convert the
numerical data back to the shape format, in a process similar to the one above
using ArcInfo tools, to allow viewing through the IRIS GIS.
    Finally, the groups of municipalities and their corresponding centroids are
passed as GIS layers to IRIS, for display on the geographic map of Mexico.

4 Results and visualization with IRIS
In this project we have carried out grouping tasks according to the affinity of
location and incidence rate of the municipalities. Series of experimental tests were
carried out on the data warehouse, restricted to municipalities with more than
100,000 inhabitants. Group numbers k = 5, 10, 15, 20 and 30 were considered; the
best result was obtained for k = 20.




    As a case study, this paper presents the results obtained by the k-means
algorithm in Matlab for the cervical cancer data warehouse. Fig. 4 provides the
visualization of the 20 regions identified.




 Fig. 4 Regions of the Municipalities with an incidence of Cervical Cancer.

    From the results, we distinguish the groups led by the three municipalities
with the highest incidence rates: Atlixco, Apatzingán and Tapachula (Chiapas).
Fig. 5 shows in detail the group corresponding to the region of Chiapas and its
incidence of cervical cancer. Table 1 provides the data for this group, together
with the mean and standard deviation.




                       Fig. 5 Tapachula Chiapas Group

    The groups identified with high incidence rates, Tapachula and Apatzingán,
match municipalities identified in other studies [4] and correspond to the
population characteristics identified in work from the medical field [8], [15],
such as poverty, lack of education, lack of access to effective health services
and the initiation of sexual activity at an early age. This allows us to assert
that the grouping is valid. On the other hand, the study allowed the discovery
of municipalities that had not been identified in other research, such as the
group of Atlixco, which in particular shows the highest incidence rate in the
country (see Table 2).

     Table 1 Municipalities Incidence Rates of Cervical-Uterine Cancer

               State             Municipality            Population    Deaths    Rate
          Chiapas          Tapachula                     271674        27        9.93
          Veracruz-Llave Coatzacoalcos                   267212        23        8.60
          Veracruz-Llave Minatitlán                      153001        13        8.49
          Chiapas          Comitán de Domínguez          105210        8         7.60
          Chiapas          San Cristóbal de las Casas    132421        9         6.79
          Tabasco          Comalcalco                    164637        11        6.68
          Tabasco          Cárdenas                      217261        11        5.06
          Tabasco          Huimanguillo                  158573        8         5.04
          Chiapas          Tuxtla Gutiérrez              434143        21        4.83
          Tabasco          Cunduacán                     104360        5         4.79
          Campeche         Carmen                        172076        8         4.64
          Tabasco          Macuspana                     133985        6         4.47
          Tabasco          Centro                        520308        23        4.42
          Chiapas          Ocosingo                      146696        2         1.36
          Average                                                                5.91
          Standard deviation                                                     2.23
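The rates in Table 1 are consistent with deaths per 100,000 inhabitants (e.g. Tapachula: 27 deaths over a population of 271,674), and the reported average and standard deviation can be checked in a few lines:

```python
from statistics import mean, stdev

def incidence_rate(deaths, population):
    """Deaths per 100,000 inhabitants, the unit Table 1 appears to use."""
    return deaths / population * 100_000

print(round(incidence_rate(27, 271674), 2))  # 9.94 (9.93 in Table 1, truncated)

# Rates column of Table 1; the sample standard deviation matches the 2.23 reported.
rates = [9.93, 8.60, 8.49, 7.60, 6.79, 6.68, 5.06, 5.04,
         4.83, 4.79, 4.64, 4.47, 4.42, 1.36]
print(round(mean(rates), 2))   # 5.91
print(round(stdev(rates), 2))  # 2.23
```

Note that the match requires the sample standard deviation (`stdev`), not the population one (`pstdev`), suggesting the table treats the group as a sample.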

    In order to perform a global analysis of our results, Table 2 provides
information on the ten municipalities with the highest incidence rates in the
country.

Table 2 Top Ten Municipalities Incidence Rates of Cervical-Uterine Cancer

            Key         State        Municipality Population Deaths Rate
            21019  Puebla           Atlixco         117111   15   12.80
            16006  Michoacán        Apatzingán      117949   13   11.02
            07089  Chiapas          Tapachula       271674   27    9.93
            17006  Morelos          Cuautla         153329   14    9.13
            28021  Tamaulipas       El Mante        112602   10    8.88
            06007  Colima           Manzanillo      125143   11    8.78
            30039  Veracruz-Llave   Coatzacoalcos   267212   23    8.60
            18017  Nayarit          Tepic           305176   26    8.51
            30108  Veracruz-Llave   Minatitlán      153001   13    8.49
            30118  Veracruz-Llave   Orizaba         118593   10    8.43
            General Mean                                                        4.70
            Standard Deviation                                                  1.95




                                                                                        44
    Figure 6 illustrates the location of the above incidence rates compared to the
national average and the corresponding standard deviation.




               Figure 6. Top Ten municipalities incidence rates.


5 Conclusions
The multidimensional model used for the conceptual design of the data warehouse
turned out to be very appropriate, since this model is easily scalable and allows
the information to be analyzed from different perspectives. It is expected that
future studies will process other variables related to the municipalities and
included in this design, such as socioeconomic status, type of region, gender and
access to health services, among others. Moreover, the implementation of the data
warehouse based on the ROLAP model made it possible to take advantage of the
facilities developed for relational databases. In addition, it is expected that the
design and implementation of the data warehouse can be reused in other
applications.
    The processing of the spatial component of our data warehouse with the INEGI
IRIS GIS resulted in a high-quality visual representation of our results, based on
the actual physical location of the municipalities and on an INEGI topographic map
of the Mexican Republic. Experience has also been gained with techniques for
converting shapes (polygons, points) and formats (number-shape) using ArcView
GIS tools.
    We are currently working to complete studies of other cancer types. In
addition, data mining tasks will be developed on the incidence of conditions such
as diabetes, influenza and cardiovascular diseases, among others.




                                                                                  45
Acknowledgement. R. Boone expresses her gratitude to Ms. Rocío Pérez Osorno
from INEGI, Puebla (a graduate of the Faculty of Computer Science, BUAP) for
her advice and support in plotting the results of this work with the IRIS GIS.

References

1. Barrón Vivanco M. Arandine, Pérez O. J., Miranda H. Fátima, Pazos R., XII
Congreso de Investigación en Salud Pública, Aplicación de técnicas de minería
de datos a bases de datos poblacionales de cáncer, CENIDET, México,
Secretaría de Saúde do Estado de Pernambuco, Brasil, Abril (2007).
2. Forgy E. "Cluster analysis of multivariate data: Efficiency vs. interpretability
of classification", Biometrics, vol. 21, pp. 768-780, 1965.
3. Hernández-Orallo J., Ramiréz-Quintana M. J., Ferri-Ramiréz C., Introducción
a la Minería de Datos, Ed. Pearson Prentice Hall, Madrid (2004).
4. Hidalgo-Martínez Ana C. El cáncer cérvico-uterino, su impacto en México.
¿Por qué no funciona el programa nacional de detección oportuna? Revista
Biomédica, Centro Nacional de Investigaciones Regionales Dr. Hideyo Noguchi,
UADY, 2006, México.
5. IRIS 4. http://mapserver.inegi.gob.mx. SNIEG Sistema Nacional de
Información Estadística y Geográfica.
6. Chen, Jin, MacEachren, Alan M., Peuquet, Donna. Constructing
Overview+Detail Dendrogram Matrix Views. IEEE Transactions on Visualization
& Computer Graphics, Vol. 15, Issue 6, pp. 889-896, Dec. 2009.
7. MacQueen, J.: Some Methods for Classification and Analysis of Multivariate
Observations. In Proceedings Fifth Berkeley Symposium Mathematics Statistics
and Probability. Vol. 1. Berkeley, CA (1967) 281-297.
8. Martínez M. Francisco Javier. Epidemiología del cáncer del cuello uterino.
Medicina Universitaria 2004, 39-46. Vol. 6, N. 22, UANL, México.
9. NAIIS Instituto Nacional de Salud Pública, SCRIS, Mortalidad,
http://sigsalud.insp.mx/naais/, Cuernavaca, Morelos, México, (2003).
10. Nevine M. Labib, Michael N. Malek: Data Mining for Cancer Management
in Egypt. Transactions on Engineering, Computing and Technology V8 October
2005: (ISSN 1305-5313).
11. Pérez-C. Nelson, Abril-Frade D.O. Estado Actual de las Tecnologías de
Bodegas de Datos Espaciales. Ing. E Investigación. Vol.27, No. 1, Univ. Nal. De
Colombia. 2007.
12. Pérez-O. J., R. Pazos R., L. Cruz R., G. Reyes S. "Improvement the
Efficiency and Efficacy of the K-means Clustering Algorithm through a New
Convergence Condition". Computational Science and Its Applications – ICCSA
2007 – International Conference Proceedings. Springer Verlag.
13. Pérez-O. J., M.F. Henriques, R. Pazos, L. Cruz, G. Reyes, J. Salinas, A.
Mexicano. Mejora al Algoritmo de K-means mediante un Nuevo criterio de
convergencia y su aplicación a bases de datos poblacionales de cáncer. 2do Taller
Latino Iberoamericano de Investigación de Operaciones, México, 2007.
14. Pérez-O. J., Rocío Boone Rojas, María J. Somodevilla García. Research
issues on K-means Algorithm: An Experimental Trial Using Matlab. Advances
on Semantic Web and New Technologies, Vol. 534, http://ceur-ws.org/.
15. Rangel-Gómez, G., Lazcano-Ponce, E., Palacio-Mejía. Cáncer cervical, una
enfermedad de la pobreza: diferencias en la mortalidad por áreas urbanas y
rurales en México, http://www.insp.mx/salud/index.html.
16. Scotch, Matthew, Parmanto, B., Monaco, V. Evaluation of SOVAT: An OLAP-
GIS decision support system for community health assessment data analysis.
BMC Medical Informatics & Decision Making, Vol. 8 (1-12), 2008.
17. Simonet, A., Landais, P., Guillon, D. A multi-source information system for
end-stage renal disease. Comptes Rendus Biologies, 2002, Vol. 325, I4, p. 515.
18. Thangavel, K., Jaganathan, P. and Esmy, P. O., Subgroup Discovery in Cervical
Cancer Analysis Using Data Mining Techniques. Department of Computer
Science, Periyar University; Department of Computer Science and Applications,
Gandhigram Rural Institute-Deemed University, Gandhigram; Radiation
Oncologist, Christian Fellowship Community Health Centre, Tamil Nadu, India.
AIML Journal, Vol. 6, Issue 1, January 2006.




       An Approach of Crawlers for Semantic
                Web Application

       José Manuel Pérez Ramírez1 , Luis Enrique Colmenares Guillen1
                1
                    Benemérita Universidad Autónoma de Puebla,
                      Facultad de Ciencias de la Computación,
                       BUAP – FCC, Ciudad Universitaria,
                               Apartado Postal J-32,
                               Puebla, Pue. México.
                         {mankod, lecolme}@gmail.com



Abstract. This paper presents a proposal for a system capable of retrieving
information from the processes generated by the YaCy system. The retrieved
information will be used to generate a knowledge base, which in turn may be
used to build Semantic Web applications.




Keywords: Semantic Web, Crawler, Corpora, Knowledge base.




1 Introduction


A knowledge base is a special type of database for managing knowledge. It provides
the means to collect, organize and retrieve knowledge in a computerized way. In
general, a knowledge base is not a static set of information; it is a dynamic resource
that may have the ability to learn. In the future, the Internet will be a complete and
complex knowledge base, already known as the Semantic Web [1].

   Some examples of knowledge bases are: a public library, an information database
related to a specific subject, Whatis.com, Wikipedia.org, Google.com, Bing.com and
Recaptcha.net.

   Research related to the automatic generation of a specialized corpus from the Web
is presented in [2]; that work reviews methods to process knowledge bases that
generate specialized corpora.

  In Section 2 we present work related to the Semantic Web, in order to understand
the benefits that may be obtained by building such applications.

   In Section 3 we describe the challenges and explain the problems that arise when
trying to use Google Search to obtain information, or to retrieve information from
queries to Google.

  Section 4 presents the methodology used to solve the problem, and Section 5 the
conclusions and ongoing work.


   We conclude this introduction with an abstract description of query processing on
the Semantic Web [8], as follows (Fig. 1):

  1.   A query is issued with a data type.
  2.   A server sends the query to decentralized indexing servers. The content
       found on these servers is similar to a book index, which indicates which
       pages contain the words that match the query.
  3.   The query travels to the servers where the documents are stored; the
       retrieved documents are used to generate a description of each search result.
  4.   The user receives the results of the semantic search, which have already
       been processed on the Semantic Web server.
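The four steps above can be sketched as a fan-out over several inverted indexes, each playing the role of one decentralized indexing server (a conceptual illustration only, not YaCy's actual protocol):

```python
# Each "server" holds an inverted index (word -> document ids), analogous to
# a book index. The query is fanned out to every server (step 2) and the
# partial results are merged before being returned to the user (step 4).
servers = [
    {"semantic": {"doc1", "doc3"}, "web": {"doc1"}},
    {"semantic": {"doc7"}, "crawler": {"doc7", "doc9"}},
]

def semantic_search(word):
    hits = set()
    for index in servers:
        hits |= index.get(word, set())
    return sorted(hits)

print(semantic_search("semantic"))  # ['doc1', 'doc3', 'doc7']
```

In a real deployment, step 3 would follow this lookup with a fetch of each matching document to build the result descriptions.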




                          Fig. 1. Querying the Semantic Web.




2 Related Work

 Nowadays, research related to information retrieval on the Web has produced
different results, such as knowledge bases and web sites dedicated to information
retrieval: Wikipedia, Twine, Evri, Google, Vivísimo, Clusty, etc.

An example of a company working with information retrieval is Google Inc.; one of
its products, Google Search, is one of the most used search engines on the Web [9].
Google receives several hundred million queries each day through its various
services [10].

This example motivates the following question: for what reason does Google not
place the information of its knowledge base in the public domain? The answer is
very simple: because its information, its knowledge base, is money.




In Section 3 we explain some ways to extract information from Google Search; only
a small, protected amount of information can be obtained, and it is impossible to
retrieve enough information from Google Search to generate a knowledge base,
because Google protects the information behind its queries.

Other kinds of knowledge bases are:



2.1 Wikipedia

A specific case is Wikipedia, a project to write a free community encyclopedia in
all languages. This project has 514,621 articles today. The quantity and quality of
the articles make it an excellent knowledge base for the creation of Semantic Web
applications. We present some ways to obtain semantic information from
Wikipedia: from its structure, from the notes collected from the people who
contribute, and from the links present in the entries.


2.2 Twine

Twine is a tool to store, organize and share information, with intelligence provided
by a platform that analyzes the semantics of the information and classifies it
automatically [7]. The main idea is to save users from labeling and connecting
related content by leaving this work to Twine, which brings more value by storing
the contents together with information about their meaning.


3 Challenges


The principal challenge is to develop a system capable of working with YaCy to
retrieve information from its indexing process; this information will be essential for
producing the knowledge base.

Figure 5 presents the modules of YaCy; the module to be developed will work with
some of these modules.



                             Figure 5. Components of Yacy

  The principal question is: what can we do to obtain information that is in the
public domain? The answer is very simple: we use the very popular Wikipedia.

   Wikipedia is a project of the Wikimedia Foundation. More than 13.7 million of its
articles have been drafted jointly by volunteers from all over the world, and
practically every one of them may be edited by any person with access to
Wikipedia. It is currently the most popular reference work on the Internet.

   A dynamic-content project like Wikipedia illustrates information with great
potential to be exploited.

   On the other hand, Google Search, one of the most used search engines, provides
at least 22 special features beyond the original word-search capability. These include
synonyms, weather forecasts, time zones, stock quotes, maps, earthquake data, movie
showtimes, airports, home listings, and sports scores.

  One might wonder: for what reason do people not use Google Search to obtain a
whole knowledge base about a specific topic, export it to a plain text file, and then
process it to generate a corpus?

  The answer is simple: Google's information belongs to Google, and it is gold for
the company.

  In the past, Google Inc. allowed information retrieval from any kind of query [3].




Google allowed information retrieval through its own forms and methods, such as
the University Research Program for Google Search [11], but we received no answer
from this program when we applied for enrollment.



   Another way to exploit the knowledge in Google Search is to use scripts, APIs [3],
programming languages such as AWK, or tools such as SED and GREP, all of them
analyzed in [2], but the results are scarce, and a large amount of information is
needed to create a knowledge base.




3.1     Considerations

      1. Create a module whose goal is to connect with YaCy and
         retrieve information from its crawlers.
      2. Export a set of information related to a topic in plain text.
      3. Manage information from web sites such as Wikipedia.org.
      4. Index the content of the retrieved information in local storage.
      5. Publish the module on the web and share the knowledge base.


4 Methodology

This section describes the project, taking into consideration the design that will be
used to solve the problem of creating the module.


4.1 Project description

The results obtained from the module connected with YaCy will be used to create
Semantic Web applications, corpora and any other project that needs plain-text
information about web content.

  Described below is the series of procedures that will be followed as the
methodology of the project.

  A) Review the modules of YaCy.

  B) Review the logic and architecture of YaCy.

  C) Review how YaCy creates its crawlers.

   D) Design the module capable of managing the information of the crawlers and
generating the knowledge base.

   E) Some of the policies described above are implemented in YaCy [6]; the variant
to be used is the implementation of the JXTA [5] tool and the URI and RDF policies,
which allow the results to be structured and outlined, to finally present them in a
semantic way as a knowledge base.


4.2 Development platform

   This work is done with YaCy, a freely distributed search engine based on the
principles of peer-to-peer (P2P) networks. Its core is a program written in Java,
called a YaCy-peer, which has been distributed on hundreds of computers since
September 2006. Each YaCy-peer is an independent crawler that navigates through
the Internet and analyzes and indexes the web pages it finds. It stores the indexing
results in a common database (called the index), which is shared with other
YaCy-peers using the principles of P2P networks [4].
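The crawl-and-index cycle of a YaCy-peer can be caricatured in a few lines of Python. YaCy itself is a Java program crawling over HTTP, so this standalone sketch only illustrates the cycle of fetching a page, indexing its words and following its links, over an in-memory "web":

```python
import re
from collections import deque

# Toy in-memory "web": page id -> page content with simple links.
# (A real YaCy-peer fetches over HTTP; this is purely illustrative.)
PAGES = {
    "p1": "semantic web <a href=p2>",
    "p2": "web crawler <a href=p1>",
}

def crawl(start):
    """Breadth-first crawl: fetch each page once, add its words to an
    inverted index (word -> set of page ids), and enqueue its links."""
    index, queue, seen = {}, deque([start]), set()
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        text = PAGES.get(url, "")
        for word in re.findall(r"[a-z]+", re.sub(r"<[^>]*>", " ", text)):
            index.setdefault(word, set()).add(url)
        queue.extend(re.findall(r"href=(\w+)", text))
    return index

idx = crawl("p1")
print(sorted(idx["web"]))  # ['p1', 'p2']
```

In the P2P setting, the resulting inverted index would then be shared with the other peers instead of remaining local.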




                            Fig. 2 Distributed indexing process

   Compared to semi-distributed search engines, the YaCy network has a
decentralized architecture: all YaCy-peers are equal and there is no central server. A
peer may be executed in crawling mode or as a local proxy server. Figure 2 shows a
diagram describing the distributed indexing process and the search in the network
for the YaCy crawler.




                          Fig. 3. Distributed indexing process

  Figure 3 shows the main components of YaCy and the process that connects the
web search, web crawler, indexing and data storage.


5 Conclusions and ongoing work

This section presents the conclusions and the results that are expected from the
project, as well as the future work:
     1. Index all the content of Wikipedia.
     2. Store this content.
     3. Present the content of Wikipedia by topic on a web site.
     4. Use text tagging to share the information with tags.
     5. Present the module and its code on a web site.
     6. Share the knowledge base extracted from Wikipedia.




References


1.   Definition of knowledgebase
     http://searchcrm.techtarget.com/definition/knowledge-base
2. Alarcón, R., Sierra, G., Bach, C. (2007). "Developing a Definitional Knowledge
     Extraction System". In Vetulani, Z. (ed.), Proceedings of the 3rd Language &
     Technology Conference: Human Language Technologies as a Challenge for
     Computer Science and Linguistics. Poznan, Adam Mickiewicz University: pp.
     374-378.
3. Google Hacks, Second Edition, 2004, O’Reilly Media.
4. S. Rhea, B. Godfrey, B. Karp, J. Kubiatowicz, S. Ratnasamy, S. Shenker, I.
    Stoica, and H. Yu. OpenDHT: a Public DHT Service and its Uses. SIGCOMM'05,
    Philadelphia, Pennsylvania, USA, August 21-26, (2005).
5. http://www.jxta.org (2010).
6. http://yacy.net/ (2010).
7. http://www.twine.com/ (2010).
8. Stuckenschmidt, Heiner. Query Processing on the Semantic Web. Vrije
    Universiteit Amsterdam.
9. http://www.alexa.com/siteinfo/google.com+yahoo.com+altavista.com (2009)
10. http://searchenginewatch.com/showPage.html?page=3630718 (2008)
11. http://research.google.com/university/search/ (2010)




         Decryption Through the Likelihood of
                 Frequency of Letters

 Barbara Sánchez Rinza, Fernando Zacarias Flores, Luna Pérez Mauricio, and
                      Martínez Cortés Marco Antonio

                  Benemérita Universidad Autónoma de Puebla,
                               Computer Science
                     14 Sur y Av. San Claudio, Puebla, Pue.
                                  72000 México
                  brinza@cs.buap.mx, fzflores@yahoo.com.mx



      Abstract. Decrypting information by means of probability requires
      thorough work, because the percentage frequency of each letter of the
      language under analysis, here Spanish, must be known. Not only letter
      probabilities can be considered, but also those of syllables, of groups
      of three or four letters, and even of whole words. The task is then to
      compare the frequencies of the ciphertext with the frequencies of the
      language and substitute letters according to the resulting
      correspondence. Finally, the output is passed through a text analyzer
      to obtain the decrypted text.

      Keywords: Probability, Decryption.


1   Introduction
Cryptography is the science of altering the linguistic representation of a message
[1]. There are different methods for this, the most common being encryption. This
science masks the original references of the information through a conversion
method governed by an algorithm that also allows the reverse process, decryption.
The use of this and other techniques permits an exchange of messages that can be
read only by the intended recipients. The intended recipient is the person to
whom the sender directs the message; this recipient knows the scheme used to mask
the message, and so either has the means to apply the reverse cryptographic
process to it or can infer that process. The original information to be protected
is called the plaintext or cleartext. Encryption is the process of converting the
plaintext into unreadable text called the ciphertext or cryptogram. In general,
the concrete implementation of an encryption algorithm (also called a cipher) is
based on secret key information that adapts the algorithm to each different
use [2].

   Decryption is the reverse process, which recovers the plaintext from the
ciphertext and the key. A cryptographic protocol specifies the details of how
algorithms and keys (and other primitive operations) are used to achieve the
desired effect. The set of protocols, encryption algorithms, key management
processes and user actions together constitute a cryptosystem, which is what the
end user works with. In this work we must first have a ciphertext that meets
certain requirements: the substitution must be bijective, so that each element of
the domain maps to a single element of the codomain. We must also take into
account Kerckhoffs's principle [3].
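The bijectivity requirement above can be checked mechanically. A minimal sketch, assuming a substitution key represented as a Python dictionary over the 26-letter Latin alphabet (the Ñ is omitted here for simplicity; the paper's actual alphabet may differ):

```python
import string

def is_bijective(key: dict) -> bool:
    """A substitution key is usable only if every plaintext letter has
    exactly one image and no two plaintext letters share an image,
    i.e. the key is a bijection on the alphabet."""
    return (set(key) == set(string.ascii_uppercase)
            and len(set(key.values())) == len(key))

# A Caesar-style shift is bijective; a key that sends two different
# letters to the same ciphertext letter is not.
shift3 = {c: string.ascii_uppercase[(i + 3) % 26]
          for i, c in enumerate(string.ascii_uppercase)}
# is_bijective(shift3) is True
```

Only keys that pass this check can be inverted unambiguously during decryption.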


2      Development work

2.1     Frequencies in Spanish

To decrypt a text using probabilities, we need to know how often certain letters
of the alphabet are used; this work considers only the Spanish language [5].

      The Spanish frequency statistics used for this study were:

 1. Frequency of trigraphs
 2. Frequency of digraphs
 3. Most common words
 4. Frequency of letters at the beginning of words
 5. Frequency of letters in Spanish
 6. Frequency of words
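All of these statistics reduce to counting n-grams over a reference corpus. A minimal sketch (the sample sentence is illustrative; in this simple version digraphs and trigraphs are counted across word boundaries, which a real implementation might avoid):

```python
from collections import Counter

def ngram_frequencies(text: str, n: int) -> Counter:
    """Count n-gram frequencies over the alphabetic characters of the
    text: n=1 gives letters, n=2 digraphs, n=3 trigraphs."""
    letters = [c for c in text.upper() if c.isalpha()]
    return Counter("".join(letters[i:i + n])
                   for i in range(len(letters) - n + 1))

sample = "EN UN LUGAR DE LA MANCHA"
letter_freq = ngram_frequencies(sample, 1)   # e.g. A occurs 4 times
digraph_freq = ngram_frequencies(sample, 2)
```

Running the same counter over the ciphertext yields the frequency profile that is compared against the language's profile in the following sections.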


2.2     Letter Frequencies

Letter frequency statistics may vary from one author to another depending on the
corpus chosen to compile them; differences usually appear when the corpus is
literary or consists of texts of different origins. Table 1 shows the frequency
of each letter of the Spanish alphabet with its respective percentage.


  High frequency letters   Medium frequency letters   Low frequency letters
    letter    freq.%          letter    freq.%          letter    freq.%
      E       16.78             R        4.94             Y        1.54
      A       11.96             U        4.80             Q        1.53
      O        8.69             I        4.15             B        0.92
      L        8.37             T        3.31             H        0.89
      S        7.88             C        2.92
      N        7.01             P        2.76
      D        6.87             M        2.12
  Frequencies below 0.5%: G, F, V, W and J, Z, X, K, Ñ

                        Table 1. Letter frequencies in Spanish
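The ranking in Table 1 can drive a first decryption pass: order the ciphertext letters by frequency and pair them with the Spanish letters in the same rank order. A minimal sketch, where the rank string below is read off Table 1 (a real pass would then refine this initial guess by hand, as described in the conclusions):

```python
from collections import Counter

# Spanish letters ordered by frequency, high to medium, per Table 1.
SPANISH_BY_RANK = "EAOLSNDRUITCPMYQBH"

def first_pass_key(ciphertext: str) -> dict:
    """Pair the most frequent ciphertext letters with the most
    frequent Spanish letters, in rank order."""
    counts = Counter(c for c in ciphertext.upper() if c.isalpha())
    ranked = [letter for letter, _ in counts.most_common()]
    return dict(zip(ranked, SPANISH_BY_RANK))

# Toy ciphertext: B is most frequent, so it is guessed to stand for E.
guess = first_pass_key("XSB XLB ZSB XSB")
```

On real ciphertexts this first guess is only approximate, which is why the method then falls back on syllable and word frequencies.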




2.3     Most Frequent words
The vowels make up about 46.38% of a text. The high-frequency letters account for
67.56% of a text, and the medium-frequency letters for about 25% [4]. In the
dictionary the most common vowel is A, but in written texts it is E, because of
prepositions, conjunctions, verbs, etc. The most common consonants are L, S, N
and D, with about 30% together. The six least frequent letters are V, Ñ, J, Z, X
and K (just over 1% together). The average length of a Spanish word is 5.9
letters. The coincidence index for Spanish is 0.0775. To help solve the
encryption, Table 2 lists the words used most frequently in a text of 10,000
words.
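The coincidence index quoted above can be computed directly from letter counts with the standard formula IC = Σ f_i(f_i − 1) / (N(N − 1)). A minimal sketch:

```python
from collections import Counter

def index_of_coincidence(text: str) -> float:
    """IC = sum f_i*(f_i - 1) / (N*(N - 1)) over the letter counts f_i.
    A monoalphabetic substitution permutes letters without changing
    their counts, so a ciphertext IC near 0.0775 is consistent with
    enciphered Spanish."""
    counts = Counter(c for c in text.upper() if c.isalpha())
    n = sum(counts.values())
    if n < 2:
        return 0.0
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
```

Because substitution preserves the IC, this statistic can confirm that a ciphertext is of the bijective kind this method can handle before any letters are guessed.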


               Most common words Two-letter words Three-letter words
               Word Frequency Frequency Word         Frequency
                DE      778        778      QUE          289
                LA      460        460      LOS          196
                EL      339        339      DEL          156
                EN      302        302      LAS          114
               QUE      289        119      POR          110
                 Y      226         98      CON           82
                 A      213         74      UNA           78
               LOS      196         64      MAS           36
               DEL      156         63      SUS           27
                SE      119         47      HAN           19
               LAS      114

               Table 2. Most frequent words of one, two and three letters




      Next, Table 3 shows the frequencies of the four-letter words.

2.4     Frequency of Digraphs
The size of the corpus is 60,115 letters, and the frequencies are absolute. The
digraphs are read by row and then by column, in that order. Table 4 shows the
digraph frequencies.
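A frequency matrix like Table 4, where the row is the first letter of the digraph and the column the second, can be accumulated in a single pass over the corpus. A minimal sketch that, unlike the counter shown earlier, restricts digraphs to within words:

```python
from collections import defaultdict

def digraph_matrix(text: str) -> dict:
    """Count digraphs within words; matrix[a][b] is how often letter a
    is immediately followed by letter b."""
    matrix = defaultdict(lambda: defaultdict(int))
    for word in text.upper().split():
        letters = [c for c in word if c.isalpha()]
        for a, b in zip(letters, letters[1:]):
            matrix[a][b] += 1
    return matrix

m = digraph_matrix("QUE QUESO")  # m["Q"]["U"] counts the QU digraph
```

The row in Table 4 for Q, for example, is almost empty because in Spanish Q is virtually always followed by U.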

2.5     Most common initial letter
The most frequent letters in Spanish that start a word are listed in Table 5.


3      Results
As stated, the ciphertext had to come from a bijective substitution and comply
with Kerckhoffs's principle; the decrypted text is shown in Figure 1.




           Four-letter words   Distribution of letters in literary texts
            Word Frequency E - 16.78% R - 4.94% Y - 1.54% J - 0.30%
            PARA        67    A - 11.96% U - 4.80% Q - 1.53%
            COMO        36    O - 8.69% I - 4.15% B - 0.92%
            AYER        25    L - 8.37% T - 3.31% H - 0.89%
            ESTE        23    S - 7.88% C - 2.92% G - 0.73%
            PERO        18    N - 7.01% P - 2.77% F - 0.52%
            ESTA        17    D - 6.87% M - 2.12% V - 0.39%
            AÑOS        14
            TODO        11
            SIDO        11
            SOLO        10

                       Table 3. Frequencies of four-letter words



4   Conclusions

We conclude that this decryption method is good, although it would need some more
tuning, since it depends on the text at hand and on how much text there is to
decrypt; we also observed that it only decrypts bijective substitution ciphers.
As seen in the results of Figure 1, several passes are applied: first the most
frequent letters of Spanish are substituted according to their probabilities,
then the most frequent syllables, then the most frequent words, and finally the
text is passed through a text analyzer. As Figure 1 shows, a large percentage of
the information is decoded, but as mentioned above, this depends on how much
information is available to process.


References
 1. Liddell and Scott’s Greek-English Lexicon. Oxford University Press. (1984)
 2. Anaya Multimedia, Codigos Y Claves Secretas: Programas En Basic, Basado A Su
    Vez En Un Estudio Lexicogrfico Del Diario ”El Pas”, Mexico 1986.
 3. Friedman, William F. And Callimahos, Lambros D., Military Cryptanalytics, Cryp-
    tographic Series, 1962
 4. Part I - Volume 2, Aegean Park Press, Laguna Hills, Ca, 1985
 5. Barker, Wayne G., Cryptograms In Spanish, Aegean Park Press, Laguna Hills, Ca.,




               A B C D E F G H I J K L M
             A 12 14 54 64 15 5 8 4 10 8 41 30
             B 11           5       14 1 12
             C 39     5    17     8 80    3
             D 32     1 2 84      1 30
             E 20 5 47 26 17 8 21 6 9 3 44 26
             F 2            9       12    1
             G 12          12        5    1
             H 15           3        5
             I 43 8 42 29 40 5 8       1 14 16
             J 4            5
             K              1
             L 44     5 5 35 1 3    28    9 5
             M 32 10       42       30
             N 41 2 33 37 41 10 6 2 28 1  5 4
             O 19 17 28 26 16 6 5 5 4 1 22 33
             P 30     1    16        5    8
             Q
             R 74 1 12 10 94 1 12 45 1 1 6 15
             S 32 2 18 15 57 3 2 4 41 1   5 7
             T 60     1    67       35
             U 13 6 11 5 52 1 3      9    9 6
             V 12        1 15       15
             W 1            1
             X        1     4
             Y 5 1 3 2 5 1 1              1 1

                Table 4. Frequency of digraphs




   letter     P     C     D    E    S    A    L    R    M    N    T
frequency  1128  1081  1012  989  789  761  435  425  403  346  298
   letter     Q     I     H    U    G    V    F    O    B    J    Y    W    Z    K
frequency   286   281   230   219  206  183  177  169  124   47   27   19    2    1

              Table 5. Frequency of initial letters




Fig. 1. Each of the texts worked on: 01 the encrypted text, 02 the text after the
first pass, 03 the text after the second pass, and finally the original text
decrypted



