<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Analysis of Semantic and Non-Semantic crawlers</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Shridevi S</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shashwat Sanket</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jayraj Thakor</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dhivya M</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Vellore Institute of Technology</institution>
          ,
          <addr-line>Chennai</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <fpage>360</fpage>
      <lpage>367</lpage>
      <abstract>
        <p>A focused crawler traverses the World Wide Web, selecting those pages that are relevant to a predefined topic and neglecting those that are not of interest. It collects domain-specific documents and is considered one of the most important ways to gather information. However, centralized crawlers are not adequate to spider meaningful and relevant portions of the Web. A crawler that is scalable and good at load balancing can improve overall performance. With the number of web pages on the internet increasing day by day, distributed web crawling is therefore of prime importance for downloading pages efficiently in terms of time and for increasing the coverage of crawlers. This paper describes different semantic and non-semantic web crawler architectures, broadly classifying them into non-semantic (serial, parallel and distributed) and semantic (distributed and focused). All the aforementioned types are implemented using libraries provided by Python 3, and a comparative analysis is carried out among them. The purpose of this paper is to outline how different processes can be run in parallel and on a distributed system, and how they interact with each other using shared variables and message-passing algorithms.</p>
      </abstract>
      <kwd-group>
        <kwd>Semantic Crawler</kwd>
        <kwd>Serial</kwd>
        <kwd>Parallel</kwd>
        <kwd>Distributed</kwd>
        <kwd>message passing</kwd>
        <kwd>shared variables</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A web crawler, also known as a spiderbot, is a program or automated script that downloads web pages on a large scale. Web crawlers are used in various applications and in diverse domains; in fact, web crawling is one of the driving factors behind the growth of the internet in domains like marketing and e-commerce. In e-commerce, crawlers can be used for price comparison and for monitoring recent market trends. Similarly, they can be used to predict stock market movements by analysing social media content, blogs and other data from different websites. Web crawlers are a primary component of web search engines, whose purpose is to collect web pages in bulk, index them and execute user-defined queries to find relevant pages.</p>
      <p>A similar use is web archiving, where web pages are collected and preserved for future use. Web crawlers are also used to create replicas of visited pages, which search engines process for faster search optimization, and to gather web data for statistical mining. In addition, crawlers are used to collect specific information, such as harvesting email addresses or supporting application testing. Because web pages are increasing rapidly and most data on the web is unstructured, semantic crawlers are used to retrieve context-relevant web pages. Semantic crawlers come in different architectures, such as distributed, parallel, focused and incremental crawlers.</p>
      <p>Today, web crawlers form an important part of various software services and have evolved into large-scale integrated distributed software, proving that they are more than just programs maintaining a list of pages to be crawled. The web crawler is the principal and most time-demanding element of a web search engine. It consumes a huge amount of CPU time, memory and storage space to crawl the ever-increasing and dynamic web. The time it takes to crawl the web should be as small as possible so that the search outputs remain up to date. Parallel and distributed processing, enabled by technological advancement and improvements in hardware architectures, is one way to increase the speed of the crawling process. This work consists of the implementation of and comparison between different web crawler architectures, namely serial, parallel and distributed. Its purpose is to outline how we can increase the processing capabilities of web crawlers and obtain query output in less time. The paper covers in detail how different processes can be executed in parallel and on a distributed system, and how they interact with each other using shared variables and message-passing algorithms.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Existing Work and Literature Survey</title>
      <p>
        In this section, recent work related to crawler processing is described. In “Speeding up the web crawling process on a multi-core processor using virtualization” [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] Hussein Al-Bahadili, Hamzah Qtishat, and Reyadh S. Naoum present and analyse a new approach for increasing crawler time efficiency through virtualization on a multi-core processor. In their work they divide the multi-core processor into several VMs (virtual machines), so that tasks can be executed concurrently on different data. They also describe the implementation and analysis of a VM-based distributed web crawler after rigorous testing.
      </p>
      <p>
        J. Cho, Hector G., and L. Page [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in their work describe the order in which URLs should be visited by a crawler so as to obtain the important pages first. This method of rapidly obtaining pages of prime importance helps save time when a crawler cannot traverse the entire growing and dynamically changing web. They created a dataset by downloading an image of the Stanford web pages and experimented with different large-scale and small-scale crawlers, such as a PageRank crawler, breadth-first and depth-first search crawlers, and backlink-based crawlers.
      </p>
      <p>
        “Google's Deep-Web Crawl” by J. Madhavan, D. Ko et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is another notable work, describing how the contents of the deep web are crawled for the Google search engine. They describe a system for extracting deep-web content that pre-computes submissions for each HTML form and adds the resulting HTML pages to a search-engine index. The system is built around three main goals. The first is to develop an approach that is time-saving, automatic and scalable for indexing hidden web content behind HTML forms spanning many domains and languages from all over the world. The second is to develop two kinds of algorithms: one that identifies inputs accepting only specific value types, and another that selects keyword values for text search inputs. The third is to develop an algorithm that explores possible input combinations to identify and generate URLs suitable for a web search index.
      </p>
      <p>
        Anirban Kundu, Ruma Dutta, Rana Dattagupta, and Debajyoti Mukhopadhyay in their paper “Mining the web with hierarchical crawlers – a resource sharing based crawling approach” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have proposed an extended web crawling method to crawl the internet on behalf of a search engine. The approach combines parallelism and focused crawling using multiple crawlers. The algorithm divides the structure of a website into levels based on its hyperlink structure in order to download its web pages, and the number of crawlers is dynamic at each level: the number required is determined on demand at run time by a thread-based program using the number of hyperlinks on the specific page.
      </p>
      <p>
        M. Sunil Kumar and P. Neelima in their work “Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine” [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have presented the Dcrawler, which is highly scalable and distributed. The core features of the presented crawler are decentralization of tasks, an assignment function that partitions the domain so the crawler can crawl effectively, the ability to cooperate with other web servers, and platform independence. Identifier-Seeded Consistent Hashing is used as the assignment function. From tests using distributed crawlers they concluded that the Dcrawler performs better than traditional centralized crawlers, and that performance improves further as more crawlers are added.
      </p>
      <p>
        T. Patidar and A. Ambasth in their paper “Improvised Architecture for Distributed Web Crawling” [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have proposed reliable and efficient methods for a scalable web crawler. They also discuss challenges and issues regarding web structure, job scheduling, spider traps and URL canonicalization. The components of their proposed system include a Child Manager, Cluster Manager, Bot Manager and an incremental batch analyser for re-crawling. Their results show a successfully implemented distributed crawler with politeness techniques and selection policies, although challenges such as resource utilization remain.
      </p>
      <p>
        The work “A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration” [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] by E. Dragut et al. describes an algorithm that extracts query interfaces and maps them into a hierarchical representation. The algorithm is divided into four steps, namely token extraction, tree of fields, tree of text tokens, and integration, thereby turning the extraction algorithm into an integration algorithm. They carried out experiments on three datasets (ICQ, Tel8 and WISE) and evaluated the algorithm using performance metrics such as leaf labelling, schema tree structure and a gold standard.
      </p>
      <p>
        D. H. P. Chau, S. Pandit, S. Wang, and C. Faloutsos have described parallel crawling by illustrating it on an online auction website in their work “Parallel Crawling for Online Social Networks” [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Their dynamic assignment architecture ensures that the failure of one crawler does not affect the others and that there is no redundant crawling. They visited about 11 million users, of which approximately 66,000 were completely crawled. J. Cho and H. Garcia-Molina [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] proposed different architectures for parallel web crawlers, metrics to evaluate their performance, and the issues related to parallel crawling. They described issues such as overlap, quality and communication bandwidth, and advantages of parallel crawling such as scalability, network-load dispersion and network-load reduction.
      </p>
      <p>
        C. C. Aggarwal, F. Al-Garawi, and P. S. Yu in their work “Intelligent crawling on the world wide web with arbitrary predicates” [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] have described intelligent crawling as a method that learns the properties and features of the linkage structure of the WWW while crawling. The technique they propose is more general than focused crawling, which is based on a pre-defined structure of the web, and applies to web pages that support arbitrary user-defined topical and keyword queries. It is also capable of reusing information gained from a previous crawl in order to crawl more efficiently the next time.
      </p>
      <p>Figure 1. Flow of a serial web crawler.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Architectures and Implementation</title>
    </sec>
    <sec id="sec-4">
      <title>3.1 Serial Web Crawler</title>
      <p>The crawler maintains a list of unvisited URLs, called the frontier, which acts as a queue. The list is initialized with seed URLs. Each crawling loop involves picking the next URL from the frontier, checking whether it has been visited before, and if not, fetching the corresponding page through HTTP, parsing the retrieved page to extract URLs and application-specific information, and finally adding the unvisited URLs to the frontier. The crawling process may be terminated once a certain number of pages have been crawled. If the crawler is ready to crawl another page and the frontier is empty, the situation signals a dead end: the crawler has no new page to fetch and hence stops. Figure 1 shows the flow of a basic serial web crawler. The complete implementation of the above model can be found in Implementation 1.</p>
      <p>Algorithm:
1. Initialise a crawler constructor with variables pageTable and revPageTable assigned to HashMaps.
2. Define a method get_seed() to get the seed URL.</p>
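      <p>The serial crawling loop described above can be sketched in Python. This is a minimal illustration rather than the paper's Implementation 1: fetch is a caller-supplied function (here stubbed with an in-memory page table) so the loop logic can be shown without network access, and link extraction uses the standard-library HTMLParser.</p>

```python
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href value of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def serial_crawl(seed_urls, fetch, max_pages=100):
    """Serial crawler: a frontier queue of unvisited URLs, a visited set,
    and one fetch-parse-append loop per URL."""
    frontier = deque(seed_urls)          # the frontier acts as a queue
    visited = set()
    page_table = {}                      # url -> html of crawled pages
    while frontier and len(page_table) < max_pages:
        url = frontier.popleft()         # pick the next URL to crawl
        if url in visited:
            continue                     # skip previously visited URLs
        visited.add(url)
        html = fetch(url)                # an HTTP GET in a real crawler
        if html is None:
            continue
        page_table[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:        # add unvisited URLs to the frontier
            if link not in visited:
                frontier.append(link)
    return page_table

# Tiny in-memory "web" standing in for HTTP fetches.
web = {
    "a": '<a href="b">b</a><a href="c">c</a>',
    "b": '<a href="a">a</a>',
    "c": '',
}
pages = serial_crawl(["a"], web.get)
```

      <p>Starting from seed "a", the loop crawls all three reachable pages and visits each exactly once.</p>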
    </sec>
    <sec id="sec-5">
      <title>3.2 Parallel Web Crawler</title>
      <p>
        Parallel crawlers can be understood as several modified serial crawlers running as separate processes; because these multiple processes run in parallel, the architecture is named the parallel web crawler. Figure 2 shows the flow chart of the working of the parallel web crawler. Here we created a process pool managed by the process manager, which is also responsible for spawning and scheduling new processes, and a shared memory region used as the frontier. Note that our crawler is a simple parallel web crawler; although there are many different ways of partitioning URLs, as mentioned in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], the main aim here is to create a baseline model.
      </p>
      <p>Algorithm: Crawler
1. Initialise a crawler class with a constructor.
2. Define a method test_seed_url with the seed URL as parameter.
3. If the hostname is the same as that of the seed URL, return false;
4. otherwise return true.
5. Define a method get_all_urls with a URL as parameter.
6. Fetch and parse the HTML page.
7. For each link in parsedPage.findAll('a', href=True):
8. if (!safeURL(link)) then raise "URL invalid";
9. otherwise append the URL: urls.append(url + link).
10. Initialise the daemon server.</p>
      <p>Algorithm: Frontier
1. Initialise a frontier_manager class with a constructor.
2. Initialise the process pool and seed_url.
3. Initialise the frontier with the global list and assign a token to each process via the lock acquire and release methods.
4. For each url in urls:
5. if url is not in the frontier, push url into the frontier and release the lock.
6. // end if
7. Define a method to write the url to the index table.
8. The index acquires the lock and, if the seed url does not equal a key in the index table, adds the url to the table;
9. otherwise indextable[local_seed_url].extend(url).
10. Define a method make_write and write the urls to the frontier and index table.
11. Define a static method crawl and initialise a crawler.
12. Return the seed url.
13. Define a method start and create a pool process.
14. Close the pool process.</p>
      <p>The above baseline code uses multiprocessing to create and manage multiple processes, and a lock-based system to access the shared memory space. The crawler code is very similar to that of the serial web crawler; the modification lies in the frontier. The frontier spawns two crawler workers to fetch the pages.</p>
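      <p>The lock-based shared frontier described above can be sketched as follows. This is not the paper's code: it uses the thread-backed Pool from multiprocessing.dummy so the example stays portable and deterministic to test, and the stub web dictionary stands in for HTTP fetches. Swapping in multiprocessing.Pool with a Manager-backed list and lock gives true separate processes with the same structure.</p>

```python
from multiprocessing.dummy import Pool   # thread-backed, same Pool API
from threading import Lock

# Shared frontier and visited set, guarded by one lock.
frontier = ["a", "b", "c", "d"]
visited = set()
lock = Lock()

# Stub link graph standing in for fetched and parsed pages.
web = {"a": ["b", "e"], "b": ["c"], "c": [], "d": ["e"], "e": []}

def crawl_worker(url):
    """One crawler worker: claim a URL, then push its unseen links
    onto the shared frontier under the lock."""
    with lock:
        if url in visited:
            return url, []               # another worker already took it
        visited.add(url)
    links = web.get(url, [])             # stands in for fetch + parse
    with lock:                           # acquire before shared state
        for link in links:
            if link not in visited and link not in frontier:
                frontier.append(link)
    return url, links

def run(pool_size=2):
    """Frontier manager: drain the frontier batch by batch
    with a pool of crawler workers."""
    with Pool(pool_size) as pool:
        while True:
            with lock:
                batch = frontier[:]      # take the current batch
                frontier[:] = []
            if not batch:
                break
            pool.map(crawl_worker, batch)
    return visited

crawled = run()
```

      <p>Every URL reachable from the initial frontier ends up crawled exactly once, because both the claim step and the frontier update happen under the lock.</p>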
    </sec>
    <sec id="sec-6">
      <title>3.3 Distributed Web Crawler</title>
      <p>
        Distributed web crawlers [
        <xref ref-type="bibr" rid="ref14 ref16 ref17">14,16,17</xref>
        ] are a technique in which many computers participate in the crawling process by contributing their computing bandwidth. The proposed architecture acts as a baseline for this technique: there is a central server acting as the nameserver and four other servers acting as worker crawlers. Here we use dynamic assignment as the policy, whereby the nameserver dynamically assigns the URLs and balances the load. Note that the nameserver itself is not responsible for crawling, in order to reduce its workload.
      </p>
      <p>Apart from its dynamic assignment job, the nameserver is also responsible for monitoring the heartbeat and other meta-information of the worker crawlers. The nameserver can be a single point of failure (SPOF) during the task; to avoid this, it saves all the meta-information as checkpoints in the global store. On failure of the nameserver, one of the crawlers is elected as the new nameserver, which fetches the latest checkpoint and continues the task. Note that the crawlers also receive the heartbeat signal of the nameserver, in order to identify when the nameserver is down.</p>
      <p>Before describing the flow: two frontiers are used here, a local frontier and a global frontier. The local frontier belongs to the worker instance, whereas the global frontier is part of the global store. The client triggers the nameserver by providing the seed URLs to crawl; the nameserver initializes the global frontier with the seed URLs and dynamically assigns URLs to the respective crawlers' local frontiers. Each crawler acts individually as a serial web crawler with its own DNS resolver, frontier queue and pagetable. To filter URLs, the crawlers also communicate with the global store to check whether a URL has been visited. Upon completing the crawling process, a crawler dumps its pagetable into the common storage and asks the nameserver to allocate new seed URLs. This process continues until termination is triggered by the nameserver.</p>
      <p>Algorithm: Crawler
1. Initialise a crawler class with a constructor.
2. Define a method test_seed_url with the seed URL as parameter.
3. If the hostname is the same as that of the seed URL, return false;
4. otherwise return true.
5. Define a method get_all_urls with a URL as parameter.
6. Fetch and parse the HTML page.
7. For each link in parsedPage.findAll('a', href=True):
8. if (!safeURL(link)) then raise "URL invalid";
9. otherwise append the URL: urls.append(url + link).
10. Initialise the daemon server.</p>
      <p>Algorithm: Frontier manager
1. Initialize the Frontier_Manager class and the seed_url variable.
2. Create an index table using a HashMap.
3. Define a method test_seed_url with the seed URL as parameter.
4. If the hostname is the same as that of the seed URL, return false;
5. otherwise return true.
6. Define a method write_to_frontier with seed_url and the host urls as parameters.
7. For each url in urls:
8. if (!(url in frontier and test_seed_url(url))) then self.frontier.append(url.strip()).
9. Define a method write_to_index_table.
10. If the local seed url is not present in the index table, add the local seed url;
11. otherwise index_table[local_seed_url].extend(urls).
12. Write the local_seed_url and urls to the frontier and index table.
13. Invoke the crawl method using the index and crawler variables.
14. end if
15. end for each.</p>
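      <p>The dynamic-assignment and checkpointing behaviour described above can be simulated with plain Python objects. This is an illustrative sketch, not the paper's implementation: the names GlobalStore, Nameserver and Worker are hypothetical, and each method call here would be a remote invocation (e.g. via Pyro4) in the real distributed system.</p>

```python
class GlobalStore:
    """Shared storage for page tables, visited URLs and checkpoints."""
    def __init__(self):
        self.page_tables = {}        # worker id -> dumped page table
        self.visited = set()
        self.checkpoints = []        # nameserver meta-information

class Nameserver:
    def __init__(self, seed_urls, store):
        self.global_frontier = list(seed_urls)
        self.store = store

    def assign(self, worker_id):
        """Dynamically hand the next unvisited URL to a worker,
        checkpointing state so a newly elected nameserver could resume."""
        while self.global_frontier:
            url = self.global_frontier.pop(0)
            if url not in self.store.visited:
                self.store.checkpoints.append(
                    {"assigned": url, "to": worker_id,
                     "remaining": list(self.global_frontier)})
                return url
        return None                  # termination signal

class Worker:
    def __init__(self, worker_id, store, fetch_links):
        self.worker_id = worker_id
        self.store = store
        self.fetch_links = fetch_links   # stub for fetch + parse
        self.page_table = {}             # the worker's local pagetable

    def run(self, nameserver):
        while (url := nameserver.assign(self.worker_id)) is not None:
            self.store.visited.add(url)
            links = self.fetch_links(url)
            self.page_table[url] = links
            nameserver.global_frontier.extend(
                l for l in links if l not in self.store.visited)
        # dump the local page table into the common storage
        self.store.page_tables[self.worker_id] = self.page_table

web = {"a": ["b"], "b": ["c"], "c": []}
store = GlobalStore()
ns = Nameserver(["a"], store)
Worker("w1", store, lambda u: web.get(u, [])).run(ns)
```

      <p>One checkpoint is written per assignment, so after a nameserver failure the latest checkpoint contains the last assigned URL and the remaining global frontier.</p>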
    </sec>
    <sec id="sec-7">
      <title>3.4. Semantic Distributed web crawler</title>
      <p>Distributed semantic web crawlers are used for crawling both semantic web pages in RDF/OWL format and HTML pages. The distributed semantic crawler uses a component called the page analyser to understand the page context. The ontology analyser creates models for fetched OWL/RDF pages and stores them; later these models are matched against the stored ontology to make the crawling decision.
Algorithm: Crawler controller
1. Initialise a string with the seed URL.
2. Check whether the string is present in the database.
3. If the seed URL exists, print "already exists";
4. otherwise insert the URL details.
5. Assign a variable for the statement and add the seed details to the database.
6. If (statement != empty) { execute statement };
7. otherwise print "statement not executed".</p>
      <p>Algorithm: Model extraction
1. Initialise variables id, url, html, langType.
2. Create an object and read the url.
3. Repeat while a statement is present:
4. define variables and get the subject, predicate, object and URI;
5. if the object is a URIResource, then get its URI and assign it to ob.
6. // end if
7. If langType is HTML then:
8. if the subject does not contain '#' and is not null, save the subject to the database;
9. if the predicate does not contain '#' and is not null, save the object to the database.</p>
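      <p>A minimal sketch of the model-extraction step above: iterating over RDF statements and keeping those whose subject passes the '#' filter. The statement format here is a deliberately simplified stand-in, one whitespace-separated subject-predicate-object triple per line; a real implementation would parse RDF/OWL with a library such as rdflib, and the example URIs are illustrative.</p>

```python
# Toy model extraction over simplified triple statements, one per line:
# "subject predicate object" (angle brackets, datatypes and escaping
# omitted; real RDF parsing is considerably more involved).
STATEMENTS = """\
http://example.org/page http://purl.org/dc/terms/title Crawlers
http://example.org/page#frag http://example.org/p http://example.org/o
http://example.org/other http://example.org/links http://example.org/page
"""

def parse_statements(data):
    """Yield one (subject, predicate, object) tuple per statement line."""
    for line in data.splitlines():
        if line.strip():
            s, p, o = line.split(None, 2)
            yield s, p, o

def extract_model(data):
    """Keep statements whose subject passes the '#' filter, mirroring
    the save-to-database steps of the algorithm above."""
    saved = []
    for s, p, o in parse_statements(data):
        if s and "#" not in s:      # step 8: skip fragment subjects
            saved.append((s, p, o))
    return saved

model = extract_model(STATEMENTS)
```

      <p>The second statement is dropped because its subject contains a '#' fragment; the other two are retained for the stored model.</p>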
    </sec>
    <sec id="sec-8">
      <title>3.5 Focused web crawlers</title>
      <p>
        Focused web crawlers [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] are used to collect web pages on a specific topic. These crawlers search the web for a predefined topic, which avoids presenting irrelevant information to the user, and so a focused crawler saves computational resources. The semantic focused crawler is multi-threaded, and each thread takes the web page with the highest dynamic semantic relevance from a priority queue. The main work of a thread is to parse the hyperlinks of that page and add them to the priority queue; thus the priority queue holds the details of the web pages to be parsed by the threads. The semantic focused crawler also has a temporary queue which maintains the visited web pages, and each thread checks this temporary queue for visited pages.
      </p>
      <p>Algorithm: Semantic focused crawler
Q: priority queue
DSR: dynamic semantic relevance
Links: queue of traversed URLs
1. Initialize priority queue Q with seed URLs.
2. Repeat while (!Q.empty() &amp;&amp; fetch_cnt &lt;= Limit) {
3. web_page.url = Q.top().getUrl(); // get the most relevant URL from the priority queue
4. Fetch and parse web_page.url;
5. web_page.urls = extract URLs (hyperlinks) from web_page.url; // list of URLs
6. For each web_page.urls {
7. already_exist = check web_page.urls[i] in Links; // check for duplicates
8. If (!already_exist) {
9. Enqueue web_page.urls[i] in Links;
10. Fetch and parse web_page.urls[i];
11. Compute DSR of web_page.urls[i];
12. Enqueue (web_page.urls[i], DSR) in Q;
13. Store (web_page.urls[i], DSR) in the local database;
14. } // end of If
15. } // end of For each
16. }</p>
      <p>The Pyro4 Python library is used here to simulate the described architectures, and Beautiful Soup is used to parse the HTML pages.</p>
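      <p>The priority-queue loop of the focused crawler can be sketched with the standard-library heapq (a min-heap, so the DSR score is negated to pop the most relevant page first). The link graph, relevance scores and dsr function below are illustrative stubs, not the paper's relevance measure:</p>

```python
import heapq

# Illustrative stubs: a tiny link graph and fake dynamic semantic
# relevance scores (a real crawler would score pages against an ontology).
web = {"seed": ["sports", "cooking"], "sports": ["football"],
       "cooking": [], "football": []}
relevance = {"seed": 1.0, "sports": 0.9, "cooking": 0.1, "football": 0.8}

def dsr(url):
    """Stub dynamic semantic relevance score for a URL."""
    return relevance.get(url, 0.0)

def focused_crawl(seed_urls, limit=10, threshold=0.5):
    """Pop the highest-DSR page, enqueue its unseen links with their
    DSR, and skip pages scoring below the relevance threshold."""
    q = [(-dsr(u), u) for u in seed_urls]   # negate DSR: heapq is a min-heap
    heapq.heapify(q)
    links = set(seed_urls)                  # queue of traversed URLs
    crawled = []
    while q and len(crawled) < limit:
        neg, url = heapq.heappop(q)         # most relevant URL first
        if -neg < threshold:
            continue                        # below threshold: not fetched
        crawled.append(url)
        for link in web.get(url, []):       # fetch + parse (stubbed)
            if link not in links:           # duplicate check
                links.add(link)
                heapq.heappush(q, (-dsr(link), link))
    return crawled

order = focused_crawl(["seed"])
```

      <p>With these stub scores, "football" (DSR 0.8) is crawled before the earlier-discovered "cooking" (DSR 0.1), which falls below the threshold and is never fetched.</p>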
    </sec>
    <sec id="sec-9">
      <title>4. Results</title>
      <p>Figure 6 shows the testing of all the semantic and non-semantic crawlers for a given website. From the table in Figure 6, out of 30 test cases, among the non-semantic crawlers the distributed crawler outperforms in 20 cases, the parallel crawler in 8, and the serial crawler in only 2, while among the semantic crawlers the semantic distributed crawler outperforms in 24 cases and the focused crawler in 26. Therefore, the distributed crawler achieves an accuracy of about 66.67%, the parallel crawler 26.67%, the serial crawler about 6.67%, the semantic distributed crawler 80%, and the focused crawler about 86.67%.</p>
      <p>Figure 6 also presents graphically the number of times each crawler outperforms the others.</p>
    </sec>
    <sec id="sec-10">
      <title>5. Conclusion</title>
      <p>It can be concluded that in the majority of cases the focused crawler and the semantic distributed crawler give the best results for crawling a specific website. The results also make clear that the focused crawler performs increasingly well as the number of crawled pages increases.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>Al-Bahadili</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          &amp;
          <string-name>
            <surname>Qtishat</surname>
            , Hamzah &amp; Naoum,
            <given-names>Reyadh.</given-names>
          </string-name>
          (
          <year>2013</year>
          ).
          <article-title>Speeding Up the Web Crawling Process on a Multi-Core Processor Using Virtualization</article-title>
          .
          <source>International Journal on Web Service Computing. 4</source>
          .
          <fpage>19</fpage>
          -
          <lpage>37</lpage>
          .
          <fpage>10</fpage>
          .5121/ijwsc.
          <year>2013</year>
          .
          <volume>4102</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Junghoo</given-names>
            <surname>Cho</surname>
          </string-name>
          , Hector Garcia-Molina, and
          <string-name>
            <given-names>Lawrence</given-names>
            <surname>Page</surname>
          </string-name>
          .
          <year>1998</year>
          .
          <article-title>Efficient crawling through URL ordering</article-title>
          .
          <source>In Proceedings of the seventh international conference on World Wide Web</source>
          <volume>7</volume>
          (
          <issue>WWW7</issue>
          ). Elsevier Science Publishers B. V.,
          <string-name>
            <surname>NLD</surname>
          </string-name>
          ,
          <fpage>161</fpage>
          -
          <lpage>172</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jayant</given-names>
            <surname>Madhavan</surname>
          </string-name>
          , David Ko,
          <string-name>
            <given-names>Łucja</given-names>
            <surname>Kot</surname>
          </string-name>
          , Vignesh Ganapathy, Alex Rasmussen, and
          <string-name>
            <given-names>Alon</given-names>
            <surname>Halevy</surname>
          </string-name>
          .
          <year>2008</year>
          .
          <article-title>Google's Deep Web crawl</article-title>
          .
          <source>Proc. VLDB Endow</source>
          .
          <volume>1</volume>
          ,
          <issue>2</issue>
          (
          <year>August 2008</year>
          ),
          <fpage>1241</fpage>
          -
          <lpage>1252</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Kundu</surname>
            ,
            <given-names>Anirban</given-names>
          </string-name>
          &amp; Dutta, Ruma &amp; Dattagupta, Rana &amp; Mukhopadhyay,
          <string-name>
            <surname>Debajyoti.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>Mining the web with hierarchical crawlers - A resource sharing based crawling approach</article-title>
          .
          <source>IJIIDS. 3</source>
          .
          <fpage>90</fpage>
          -
          <lpage>106</lpage>
          .
          <fpage>10</fpage>
          .1504/IJIIDS.
          <year>2009</year>
          .
          <volume>023040</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          &amp;
          <string-name>
            <given-names>P</given-names>
            ,
            <surname>Neelima.</surname>
          </string-name>
          (
          <year>2011</year>
          ).
          <article-title>"Design and Implementation of Scalable, Fully Distributed Web Crawler for a Web Search Engine"</article-title>
          .
          <source>International Journal of Computer Applications</source>
          .
          <volume>15</volume>
          . 10.5120/1963-2629.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>Patidar</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , &amp;
          <string-name>
            <surname>Ambasth</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          (
          <year>2016</year>
          ).
          <article-title>Improvised Architecture for Distributed Web Crawling</article-title>
          .
          <source>International Journal of Computer Applications</source>
          ,
          <volume>151</volume>
          ,
          <fpage>14</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Kabisch</surname>
            , Thomas &amp; Dragut,
            <given-names>Eduard</given-names>
          </string-name>
          &amp; Yu, Clement &amp; Leser,
          <string-name>
            <surname>Ulf.</surname>
          </string-name>
          (
          <year>2009</year>
          ).
          <article-title>A Hierarchical Approach to Model Web Query Interfaces for Web Source Integration</article-title>
          .
          <source>PVLDB.325-336. 10.14778/1687627</source>
          .1687665.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name><given-names>Duen Horng</given-names> <surname>Chau</surname></string-name>,
          <string-name><given-names>Shashank</given-names> <surname>Pandit</surname></string-name>,
          <string-name><given-names>Samuel</given-names> <surname>Wang</surname></string-name>, and
          <string-name><given-names>Christos</given-names> <surname>Faloutsos</surname></string-name>.
          <year>2007</year>.
          <article-title>Parallel crawling for online social networks</article-title>.
          In <source>Proceedings of the 16th International Conference on World Wide Web (WWW '07)</source>.
          Association for Computing Machinery, New York, NY, USA,
          <fpage>1283</fpage>-<lpage>1284</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name><surname>Cho</surname>, <given-names>Junghoo</given-names></string-name>, and
          <string-name><given-names>Hector</given-names> <surname>Garcia-Molina</surname></string-name>.
          <article-title>"Parallel crawlers."</article-title>
          <source>Proceedings of the 11th International Conference on World Wide Web</source>.
          <year>2002</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name><surname>Aggarwal</surname>, <given-names>Charu</given-names></string-name>
          &amp;
          <string-name><surname>Al-Garawi</surname>, <given-names>Fatima</given-names></string-name>
          &amp;
          <string-name><surname>Yu</surname>, <given-names>Philip</given-names></string-name>
          (<year>2001</year>).
          <article-title>Intelligent Crawling on the World Wide Web with Arbitrary Predicates</article-title>.
          <fpage>96</fpage>-<lpage>105</lpage>.
          doi: 10.1145/371920.371955.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <article-title>Parallel Crawlers</article-title>.
          <string-name><given-names>Junghoo</given-names> <surname>Cho</surname></string-name>,
          <string-name><given-names>Hector</given-names> <surname>Garcia-Molina</surname></string-name>,
          University of California, Los Angeles.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>Naresh</given-names> <surname>Kumar</surname></string-name>,
          <string-name><given-names>Manjeet</given-names> <surname>Singh</surname></string-name>
          (<year>2015</year>).
          <article-title>Framework for Distributed Semantic Web Crawler</article-title>.
          <source>IEEE International Conference on Computational Intelligence and Communication Networks</source>.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>K.</given-names> <surname>Lokeshwaran</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Rajesh</surname></string-name>
          (<year>2018</year>).
          <article-title>A Study of Various Semantic Web Crawlers and Semantic Web Mining</article-title>.
          <source>International Journal of Pure and Applied Mathematics</source>,
          <volume>120</volume>(<issue>5</issue>),
          <fpage>1163</fpage>-<lpage>1173</lpage>.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name><given-names>F.</given-names> <surname>Liu</surname></string-name>
          and
          <string-name><given-names>W.</given-names> <surname>Xin</surname></string-name>,
          <article-title>"Implementation of Distributed Crawler System Based on Spark for Massive Data Mining,"</article-title>
          <source>2020 5th International Conference on Computer and Communication Systems (ICCCS)</source>,
          Shanghai, China, <year>2020</year>,
          pp. <fpage>482</fpage>-<lpage>485</lpage>,
          doi: 10.1109/ICCCS49078.2020.9118442.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name><surname>Rajiv</surname>, <given-names>S.</given-names></string-name>
          and
          <string-name><surname>Navaneethan</surname>, <given-names>C.</given-names></string-name>,
          <year>2020</year>,
          <article-title>"Keyword Weight Optimization using Gradient Strategies in Event Focused Web Crawling"</article-title>,
          <source>Pattern Recognition Letters</source>,
          ISSN 0167-8655.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name><given-names>S. K.</given-names> <surname>Bal</surname></string-name>
          and
          <string-name><given-names>G.</given-names> <surname>Geetha</surname></string-name>,
          <article-title>"Smart distributed web crawler,"</article-title>
          <source>2016 International Conference on Information Communication and Embedded Systems (ICICES)</source>,
          Chennai, <year>2016</year>,
          pp. <fpage>1</fpage>-<lpage>5</lpage>,
          doi: 10.1109/ICICES.2016.7518893.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name><surname>Wang</surname>, <given-names>HongRu</given-names></string-name>, et al.
          <year>2018</year>.
          <article-title>"AntiCrawler strategy and distributed crawler based on Hadoop."</article-title>
          <source>2018 IEEE 3rd International Conference on Big Data Analysis (ICBDA)</source>.
          IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name><surname>Boukadi</surname>, <given-names>K.</given-names></string-name>,
          <string-name><surname>Rekik</surname>, <given-names>M.</given-names></string-name>,
          <string-name><surname>Rekik</surname>, <given-names>M.</given-names></string-name>,
          &amp;
          <string-name><surname>Ben-Abdallah</surname>, <given-names>H.</given-names></string-name>
          (<year>2018</year>).
          <article-title>FC4CD: a new SOA-based Focused Crawler for Cloud service Discovery</article-title>.
          <source>Computing</source>,
          <volume>100</volume>(<issue>10</issue>),
          <fpage>1081</fpage>-<lpage>1107</lpage>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>