<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>An Approach of Crawlers for Semantic Web Application</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>José Manuel Pérez Ramírez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luis Enrique Colmenares Guillen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Benemérita Universidad Autónoma de Puebla, Facultad de Ciencias de la Computación, BUAP - FCC</institution>
          ,
          <addr-line>Ciudad Universitaria, Apartado Postal J-32, Puebla, Pue.</addr-line>
          <country country="MX">México</country>
        </aff>
      </contrib-group>
      <fpage>48</fpage>
      <lpage>56</lpage>
      <abstract>
        <p>This paper presents a proposal for a system capable of retrieving information from the crawling processes generated by the YaCy system. The retrieved information will be used to generate a knowledge base, which may in turn be used to build semantic web applications.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Web</kwd>
        <kwd>Crawler</kwd>
        <kwd>Corpora</kwd>
        <kwd>Knowledgebase</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        A knowledgebase is a special type of database for managing knowledge. It provides
the means to collect, organize, and retrieve knowledge in a computational way. In general, a
knowledgebase is not a static set of information; it is a dynamic resource that may
have the ability to learn. In the future, the Internet will become a complete and complex
knowledgebase, already known as the semantic web [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
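      <p>As an illustration only (not part of the proposed system), a knowledgebase in this sense can be sketched as a structure that collects, organizes, and retrieves facts, and that grows dynamically; the class name and sample facts below are invented for the example:</p>
      <preformat>
```python
# Minimal sketch of a dynamic knowledgebase: facts are organized by topic,
# can be retrieved by keyword, and the base "learns" by accepting new facts.
class KnowledgeBase:
    def __init__(self):
        self.facts = {}  # topic -> list of fact strings

    def learn(self, topic, fact):
        """Collect and organize a new fact under a topic."""
        self.facts.setdefault(topic, []).append(fact)

    def recover(self, keyword):
        """Retrieve every fact whose topic or text mentions the keyword."""
        keyword = keyword.lower()
        hits = []
        for topic, facts in self.facts.items():
            for fact in facts:
                if keyword in topic.lower() or keyword in fact.lower():
                    hits.append(fact)
        return hits

kb = KnowledgeBase()
kb.learn("semantic web", "RDF describes resources as triples.")
kb.learn("crawlers", "YaCy peers crawl and index pages cooperatively.")
```
      </preformat>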
      <p>Some examples of knowledge bases are: a public library, an information database
related to a specific subject, Whatis.com, Wikipedia.org, Google.com, Bing.com, and
Recaptcha.net.</p>
      <p>
        Research on the automatic generation of a specialized corpus from the Web
is presented in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], which reviews methods for processing knowledgebases
that generate specialized corpora.
      </p>
      <p>In Section 2 we present work related to the semantic web, in order to show the
benefits that may be obtained by building such applications.</p>
      <p>In Section 3 we describe the challenges and explain the problems that arise
when trying to use Google Search to obtain information, or to retrieve
information from queries to Google.</p>
      <p>Section 4 presents the methodology used to solve the problem, and Section 5 the
conclusions and ongoing work.</p>
      <p>
        We continue by presenting an abstract description of query processing on
the Semantic Web [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], illustrated in Fig. 1:
1. A query is issued with a data type.
2. A server sends the query to decentralized indexing servers. The content
found on these servers is similar to a book index, which indicates which pages
contain the words that match the query.
3. The query travels to the servers where the documents are stored; the documents are
retrieved and descriptions are generated for each search result.
4. The user receives the results of the semantic search, which have already been
processed by the semantic web server.
      </p>
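      <p>The four steps above can be sketched in a few lines; the index layout, document identifiers, and page texts below are illustrative assumptions, not actual semantic web server interfaces:</p>
      <preformat>
```python
# Sketch of decentralized query processing (steps 1-4 above).
# index_servers: each peer holds a partial inverted index, word -> doc ids.
# doc_servers: maps doc ids to the stored documents.
index_servers = [
    {"semantic": {"d1", "d2"}, "web": {"d1"}},
    {"crawler": {"d3"}, "web": {"d3"}},
]
doc_servers = {
    "d1": "Intro to the semantic web",
    "d2": "Semantic search engines",
    "d3": "A web crawler tutorial",
}

def semantic_search(query):
    # Steps 1-2: fan the query word out to every decentralized index server.
    matching = set()
    for server in index_servers:
        matching |= server.get(query, set())
    # Step 3: retrieve the stored documents and build result descriptions.
    results = [{"id": d, "snippet": doc_servers[d]} for d in sorted(matching)]
    # Step 4: return the processed results to the user.
    return results
```
      </preformat>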
    </sec>
    <sec id="sec-2">
      <title>2 Related Work</title>
      <p>Nowadays, research on information retrieval on the web has produced results of
different kinds, such as knowledgebases and web sites dedicated to information retrieval:
Wikipedia, Twine, Evri, Google, Vivísimo, Clusty, etc.</p>
      <p>
        An example of a company working on information retrieval is Google Inc.;
one of its products is Google Search, the
most-used search engine on the Web [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Google receives several hundred million queries
each day through its various services [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>This example motivates the following question: why does Google not place the
information in its knowledgebase in the public domain?
The answer is very simple: because its information, its knowledgebase, is
money.
In Section 3 we explain some ways of extracting information from Google Search; only a
small, protected amount is accessible, and it is impossible to retrieve enough information
from Google Search to generate a knowledgebase, because Google protects the
information derived from its queries.</p>
      <sec id="sec-2-1">
        <title>Other kinds of knowledgebase are:</title>
        <sec id="sec-2-1-1">
          <title>2.1 Wikipedia</title>
          <p>A specific case is Wikipedia, a project to write a free community encyclopedia in
all languages. This project has 514,621 articles today. The quantity and quality of
the articles make it an excellent knowledgebase for the creation of semantic webs.
There are several ways to obtain semantic information from Wikipedia: from its
structure, from the notes collected from contributing users, and from the links in
its entries.</p>
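          <p>One of the sources just mentioned, the links in an entry, can be extracted directly from Wikipedia's wikitext markup, where internal links are written as [[Target]] or [[Target|label]]; a minimal sketch (the sample text is invented):</p>
          <preformat>
```python
import re

# Wikipedia wikitext marks internal links as [[Target]] or [[Target|label]].
# Extracting the targets yields link structure usable as semantic information.
LINK_PATTERN = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def extract_links(wikitext):
    """Return the internal link targets found in a fragment of wikitext."""
    return [m.strip() for m in LINK_PATTERN.findall(wikitext)]

sample = ("The [[Semantic Web]] extends the [[World Wide Web|Web]] "
          "with machine-readable [[metadata]].")
```
          </preformat>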
        </sec>
        <sec id="sec-2-1-2">
          <title>2.2 Twine</title>
          <p>
            Twine is a tool for storing, organizing, and sharing information, all with
intelligence provided by a platform that analyzes the semantics of the information
and classifies it automatically [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. The main idea is to save users from labeling and
connecting related content and leave this work to Twine, adding value by storing
the content together with information about its meaning.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3 Challenges</title>
      <p>The principal challenge is to develop a system capable of working with YaCy
to retrieve information from its indexing process; this
information will be essential for producing a knowledgebase.</p>
      <p>We present in figure 5 all the modules of YaCy; the module to be developed will
work with some of these modules.</p>
      <sec id="sec-3-1">
        <title>The principal question is:</title>
        <p>What can we do to get information in the public domain?</p>
        <p>The answer is very simple: we use the very popular Wikipedia.</p>
        <p>Wikipedia is a project of the Wikimedia Foundation. More than 13.7 million of its
articles have been drafted in conjunction with volunteers from all over the world, and
practically every one of them may be edited by any person with access to
Wikipedia. It is currently the most popular reference work on the Internet.</p>
        <p>A dynamic-content project like Wikipedia illustrates information with
great potential to be exploited.</p>
        <p>On the other hand, Google Search, one of the most-used search engines, provides at least
22 special features beyond the original word-search capability. These include
synonyms, weather forecasts, time zones, stock quotes, maps, earthquake data, movie
showtimes, airports, home listings, and sports scores.</p>
      </sec>
      <sec id="sec-3-2">
        <title>You might be wondering:</title>
        <p>Why do people not use Google Search to obtain a whole
knowledgebase about a specific topic, export it to a plain-text
file, and then manage it to generate a corpus?</p>
        <p>The answer is very simple: because Google's information is its own information,
and gold for the company.</p>
        <p>
          In the past, Google Inc. allowed information retrieval from any kind of query [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
Google allowed information retrieval through its own forms and methods, such as the
University Research Program for Google Search [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], but we received no answer of any kind from this program when we applied
for enrollment.
        </p>
        <p>
          Another way to exploit Google Search knowledge is using scripts, APIS [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ],
programming languages such as AWK, development tools like SED or GREP, all of
them analyzed in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] but with few results, and a large amount of information is needed to create a
knowledgebase.
        </p>
        <p>Considerations:
1. Create a module that connects to YaCy and retrieves information from its crawlers.
2. Export a set of information related to a topic in plain text.
3. Manage information from web sites such as Wikipedia.org.
4. Index the content of this retrieved information in local storage.
5. Publish the module on the web and share the knowledgebase.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4 Methodology</title>
      <p>This section gives a description of the project, taking into consideration the design that
will be used to solve the problem of creating the module.</p>
      <sec id="sec-4-1">
        <title>4.1 Project description</title>
        <p>The results obtained from the module connected to YaCy will be used to create
semantic webs, corpora, and any other project that needs plain-text information
about web content.</p>
        <p>Described below is a series of procedures that serve as the methodology to be
implemented within the project.</p>
        <sec id="sec-4-1-1">
          <title>A) Check the modules of YaCy</title>
        </sec>
        <sec id="sec-4-1-2">
          <title>B) Check the logistics and architecture of YaCy</title>
        </sec>
        <sec id="sec-4-1-3">
          <title>C) Check how YaCy creates its crawlers</title>
          <p>D) Design a module capable of managing the information from the
crawler and generating a knowledgebase</p>
          <p>
            E) Some of the policies described above are implemented in YaCy [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ]; the variant
to be used is the implementation of the JXTA [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] tool and the URI and RDF policies that
allow the results to be structured and outlined, to finally present them as a semantic web or
knowledgebase.
          </p>
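          <p>The RDF policy amounts to storing retrieved results as subject-predicate-object triples; a minimal in-memory sketch, where the class and the ex: vocabulary are invented for illustration:</p>
          <preformat>
```python
# RDF-style triple store: each result is kept as a (subject, predicate, object)
# triple, which gives the retrieved information a semantic structure.
class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        """Match triples; None acts as a wildcard, as in a SPARQL pattern."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )

store = TripleStore()
store.add("ex:Wikipedia", "ex:type", "ex:Knowledgebase")
store.add("ex:YaCy", "ex:type", "ex:SearchEngine")
store.add("ex:YaCy", "ex:uses", "ex:P2P")
```
          </preformat>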
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2 Development platform</title>
        <p>
          This work is done with YaCy, a free-distribution search engine based on
the principles of peer-to-peer (P2P) networking. Its core is a program written in Java that
has been distributed across hundreds of computers since September 2006; it is called the YaCy-peer.
Each YaCy-peer is an independent crawler that navigates through the Internet,
analyzing and indexing the web pages it finds. It stores the indexing results in a
common database (called the index), which is shared with other YaCy-peers using the
principles of P2P networks [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
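        <p>How a common index can be split among equal peers with no central server can be sketched with simple hashing; this illustrates the general P2P idea only and is an assumption, not YaCy's actual distributed hash table scheme:</p>
        <preformat>
```python
import hashlib

# Each indexed word is assigned to a peer by hashing the word, so every
# peer holds a deterministic share of the common index without any
# central server deciding the placement.
peers = ["peer-a", "peer-b", "peer-c"]

def peer_for_word(word):
    """Deterministically map an index word to the peer that stores it."""
    digest = hashlib.sha1(word.encode("utf-8")).hexdigest()
    return peers[int(digest, 16) % len(peers)]
```
        </preformat>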
        <p>Compared with semi-distributed search engines, the YaCy network has a
decentralized architecture. All YaCy-peers are equal and there is no central
server. It may be executed in crawling mode or as a local proxy server. Figure 2
shows a diagram describing the distributed indexing and search process in
the network for the YaCy crawler.</p>
        <p>Figure 3 shows the main components of YaCy and the process flow
among the web search, web crawler, indexing, and data storage processes.</p>
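        <p>The interaction among the crawler, indexer, and storage components can be sketched as a single pass over fetched pages; the page contents below are invented, and real crawling concerns (fetching, politeness, link extraction) are omitted:</p>
        <preformat>
```python
# Minimal crawl-and-index pipeline: "fetched" pages flow into an inverted
# index (word -> set of urls), the data structure behind the web search.
pages = {
    "http://example.org/a": "yacy is a distributed search engine",
    "http://example.org/b": "a crawler feeds the distributed index",
}

def build_index(fetched_pages):
    """The indexing component: map each word to the urls containing it."""
    index = {}
    for url, text in fetched_pages.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """The web-search component: look the word up in the stored index."""
    return sorted(index.get(word, set()))

inverted = build_index(pages)
```
        </preformat>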
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5 Conclusions and ongoing work</title>
      <p>This section presents some of the conclusions and results expected from the project,
as well as future work.</p>
      <p>1. Index all the content of Wikipedia.
2. Store this content.
3. Present the content of Wikipedia by topic on a web site.
4. Use text tagging to share the information with tags.
5. Present the module and its code on a web site.
6. Share the knowledgebase extracted from Wikipedia.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>1. Definition of knowledgebase http://searchcrm.techtarget.com/definition/knowledge-base</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Alarcón</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sierra</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bach</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          (
          <year>2007</year>
          ).
          <article-title>“Developing a Definitional Knowledge Extraction System”</article-title>
          . En
          <string-name>
            <surname>Vetulani</surname>
            ,
            <given-names>Z</given-names>
          </string-name>
          . (ed.),
          <source>Actas del 3er Language &amp; Technology Conference</source>
          .
          <article-title>Human Language Technologies as a Challenge for Computer Science</article-title>
          and Linguistics. Poznan, Universidad Adam Mickiewicza: pp.
          <fpage>374</fpage>
          -
          <lpage>378</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <source>Google Hacks</source>
          ,
          <year>2004</year>
          , Second Edition, O'Reilly Media
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>S.</given-names>
            <surname>Rhea</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Godfrey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Karp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kubiatowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ratnasamy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shenker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          .
          <article-title>OpenDHT: a Public DHT Service and its Uses</article-title>
          .
          <source>SIGCOMM '05</source>
          , Philadelphia, Pennsylvania, USA, August 21-26 (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. http://www.jxta.org (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6. http://yacy.net/ (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7. http://www.twine.com/ (
          <year>2010</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Stuckenschmidt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Query Processing on the Semantic Web</article-title>
          , Vrije Universiteit Amsterdam
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. http://www.alexa.com/siteinfo/google.com+yahoo.com+altavista.com (
          <year>2009</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10. http://searchenginewatch.com/showPage.html?page=3630718 (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. http://research.google.com/university/search/ (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>